## Archive for **May 2012**

## Information Criteria and Vuong Test

When it comes to model selection between two non-nest models, the information criteria, e.g. AIC or BIC, is often used and the model with a lower information criteria is preferred.

However, even with AIC or BIC, we are still unable to answer the question of whether the model A is significantly better than the model B probabilistically. Proposed by Quang Vuong (1989), Vuong test considers a better model with the individual log likelihoods significantly higher than the ones of its rival. A demonstration of Vuong test is given below.

First of all, two models for proportional outcomes, namely TOBIT regression and NLS (Nonlinear Least Squares) regression, are estimated below with information criteria, e.g. AIC and BIC, calculated and the likelihood of each individual record computed. As shown in the following output, NLS regression has a lower BIC and therefore might be considered a “better” model.

Next, with the likelihood of each individual record from both models, Vuong test is calculated with the formulation given below

Vuong statistic = [LR(model1, model2) – C] / sqrt(N * V) ~ N(0, 1)

LR(…) is the summation of individual log likelihood ratio between 2 models. “C” is a correction term for the difference of DF (Degrees of Freedom) between 2 models. “N” is the number of records. “V” is the variance of individual log likelihood ratio between 2 models. Vuong demonstrated that Vuong statistic is distributed as a standard normal N(0, 1). As a result, the model 1 is better with Vuong statistic > 1.96 and the model 2 is better with Vuong statistic < -1.96.

As shown in the output, although the model2, e.g. NLS regression, is preferred by a lower BIC, Vuong statistic doesn’t show the evidence that NLS regression is significantly better than TOBIT regression but indicates instead that both models are equally close to the true model.

## Decision Stump with the Implementation in SAS

A decision stump is a naively simple but effective rule-based supervised learning algorithm similar to CART (Classification & Regression Tree). However, the stump is a 1-level decision tree consisting of 2 terminal nodes.

Albeit simple, the decision stump has shown successful use cases in many aspects. For instance, as a weak classifier, the decision stump has been proven an excellent base learner in ensemble learning algorithms such as bagging and boosting. Moreover, a single decision stump can also be employed to do feature screening for predictive modeling and cut-point searching for continuous features. The following is an example showing the SAS implementation as well as the predictive power of a decision stump.

First of all, a testing data is simulated with 1 binary response variable Y and 3 continuous features X1 – X3, which X1 is the most related feature to Y with a single cut-point at 5, X2 is also related to Y but with 2 different cut-points at 1.5 and 7.5, and X3 is a pure noise.

The SAS macro below is showing how to program a single decision stump. And this macro would be used to search for the simulated cut-point in each continuous feature.

%macro stump(data = , w = , y = , xlist = ); %let i = 1; %local i; proc sql; create table _out ( variable char(32), gt_value num, gini num ); quit; %do %while (%scan(&xlist, &i) ne %str()); %let x = %scan(&xlist, &i); data _tmp1(keep = &w &y &x); set &data; where &y in (0, 1); run; proc sql; create table _tmp2 as select b.&x as gt_value, sum(case when a.&x <= b.&x then &w * &y else 0 end) / sum(case when a.&x <= b.&x then &w else 0 end) as p1_1, sum(case when a.&x > b.&x then &w * &y else 0 end) / sum(case when a.&x > b.&x then &w else 0 end) as p1_2, sum(case when a.&x <= b.&x then 1 else 0 end) / count(*) as ppn1, sum(case when a.&x > b.&x then 1 else 0 end) / count(*) as ppn2, 2 * calculated p1_1 * (1 - calculated p1_1) * calculated ppn1 + 2 * calculated p1_2 * (1 - calculated p1_2) * calculated ppn2 as gini from _tmp1 as a, (select distinct &x from _tmp1) as b group by b.&x; insert into _out select "&x", gt_value, gini from _tmp2 having gini = min(gini); drop table _tmp1; quit; %let i = %eval(&i + 1); %end; proc sort data = _out; by gini; run; proc report data = _out box spacing = 1 split = "*" nowd; column("DECISION STUMP SUMMARY" variable gt_value gini); define variable / "VARIABLE" width = 30 center; define gt_value / "CUTOFF VALUE*(GREATER THAN)" width = 15 center; define gini / "GINI" width = 10 center format = 9.4; run; %mend stump;

As shown in the table below, the decision stump did a fairly good job both in identifying predictive features and in locating cut-points. Both related features, X1 and X2, have been identified and correctly ranked in terms of associations with Y. For X1, the cut-point located is 4.97, extremely close to 5. For X2, the cut-point located is 7.46, close enough to 1 of 2 simulated cut-points.

## Modeling Rates and Proportions in SAS – 8

**7. FRACTIONAL LOGIT MODEL**

Different from all models introduced previously that assume specific distributional families for the proportional outcomes of interests, the fractional logit model proposed by Papke and Wooldridge (1996) is a quasi-likelihood method that does not specify the full distribution but only requires the conditional mean to be correctly specified for consistent parameter estimates. Under the assumption E(Y|X) = G(X`B) = 1 / (1 + EXP(-X`B)), the fractional logit has the identical likelihood function to the one for a Bernoulli distribution such that

F(Y) = (G(X`B) ** Y) * (1 – G(X`B)) ** (1 – Y) with 1 >= Y >= 0

Based upon the above formulation, parameter estimates are calculated in the same manner as in the binary logistic regression by maximizing the log likelihood.

In SAS, the most convenient way to implement the fractional logit model is with GLIMMIX procedure. In addition, we can also use NLMIXED procedure by explicitly specifying the likelihood function as shown above.

## Modeling Rates and Proportions in SAS – 7

**6. SIMPLEX MODEL**

Dispersion models proposed by Jorgensen (1997) can be considered a more general case of Generalized Linear Models by McCullagh and Nelder (1989) and include a dispersion parameter describing the distributional shape. The simplex model developed by Barndorff-Nielsen and Jorgensen (1991) is a special dispersion model and is useful to model proportional outcomes. A simplex model has the density function given by

F(Y) = (2 * pi * sigma ^ 2 * (Y * (1 – Y)) ^ 3) ^ (-0.5) * EXP((-1 / (2 * sigma ^ 2)) * d(Y; Mu))

where d(Y; Mu) = (Y – Mu) ^ 2 / (Y * (1 – Y) * Mu ^ 2 * (1 – Mu) ^ 2) is a unit deviance function.

Similar to the Beta model, a simplex model also consists of 2 components. The first is a sub-model to describe the expected mean Mu. Since 0 < Mu < 1, the logit link function can be used to specify the relationship between the expected mean and covariates X such that LOG(Mu / (1 – Mu)) = X`B. The second is a sub-model to describe the pattern of dispersion parameter sigma ^ 2 also by a set of covariates Z such that LOG(sigma ^ 2) = Z`G. Due to the similar nature of parameterization between Beta model and Simplex model, model performances of these 2 often have been compared with each other. However, it is still an open question which model is able to outperform its competitor.

Similar to the case of Beta model, there is no out-of-box procedure in SAS estimating the simplex model. However, following its density function, we are able to model the simplex model with NLMIXED procedure as given below.