Archive for February 2012
2. OLS REGRESSION WITH LOGIT TRANSFORMATION
Next to the naïve OLS regression, the most common modeling practice is another OLS regression, this time applied to the log ratio, also known as the LOGIT, transformation of Y in the interval (0, 1), such that
Ln(Y / (1 – Y)) = X`B + e, where the error term e ~ N(0, sigma ^ 2)
This class of models has long been used by econometricians (Atkinson, 1985) because of its simple functional form and easy implementation within the OLS framework, which is the framework most familiar to statistical practitioners in the industry. While Y is strictly bounded by (0, 1), the LOGIT transformation Ln(Y / (1 – Y)) is well defined on the whole real line. More importantly, most model development techniques and diagnostic statistics can be ported directly from the naïve OLS regression with little or no adjustment.
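As a minimal sketch of the transformation described above, the Python snippet below fits an OLS regression to LOGIT-transformed outcomes with a single regressor. It is an illustration only, not the SAS implementation: the synthetic data, the coefficient values (-1.0 and 0.8), and the helper names `logit`, `inv_logit`, and `fit_simple_ols` are all assumptions made for the example.

```python
import math
import random

def logit(y):
    """Log ratio ln(y / (1 - y)); defined only for 0 < y < 1."""
    return math.log(y / (1.0 - y))

def inv_logit(z):
    """Inverse of the logit; maps the real line back into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_simple_ols(x, z):
    """Closed-form OLS for z = b0 + b1 * x + e with one regressor."""
    n = len(x)
    x_bar = sum(x) / n
    z_bar = sum(z) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxz = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z))
    b1 = sxz / sxx
    b0 = z_bar - b1 * x_bar
    return b0, b1

# Synthetic data (assumption): logit(Y) = -1.0 + 0.8 * x + e, e ~ N(0, 0.5^2)
random.seed(2012)
x = [random.uniform(-2.0, 2.0) for _ in range(500)]
y = [inv_logit(-1.0 + 0.8 * xi + random.gauss(0.0, 0.5)) for xi in x]

# Step 1: transform Y from (0, 1) to the real line.
z = [logit(yi) for yi in y]

# Step 2: ordinary least squares on the transformed outcome.
b0, b1 = fit_simple_ols(x, z)

# Step 3: map fitted values back to (0, 1).
fitted = [inv_logit(b0 + b1 * xi) for xi in x]

print("intercept:", round(b0, 3), "slope:", round(b1, 3))
```

Because the back-transformation is monotone but nonlinear, `inv_logit(b0 + b1 * x)` estimates the conditional median of Y rather than its conditional mean under the normal-error assumption.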
Although straightforward, this modeling practice is not free of criticism. A key concern is that, in order to ensure Ln(Y / (1 – Y)) ~ N(X`B, sigma ^ 2) and therefore the error term e ~ N(0, sigma ^ 2), Y should theoretically follow an additive logistic normal distribution, which might be questionable and is subject to the result of the corresponding statistical diagnosis. As a result, with this modeling strategy, it is important to test in the post-model diagnosis whether the error term e follows a normal distribution with zero mean, for example with the Shapiro-Wilk or Jarque-Bera test. Moreover, the assumption Ln(Y / (1 – Y)) ~ N(X`B, sigma ^ 2) implies a constant variance on the transformed scale, which is at odds with the statistical nature of rate and proportion outcomes previously described in the study, namely heteroscedasticity.
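To make the post-model diagnosis concrete, here is a hedged Python sketch of the Jarque-Bera statistic computed from a list of residuals. It uses the textbook formula JB = n / 6 * (S ^ 2 + (K – 3) ^ 2 / 4) with a chi-square(2) reference distribution, whose survival function is exp(-JB / 2). The function name `jarque_bera` and both synthetic residual samples are assumptions for illustration; in SAS the equivalent tests would come from the procedures' own diagnostics.

```python
import math
import random

def jarque_bera(residuals):
    """Return (JB statistic, asymptotic p-value) for a list of residuals."""
    n = len(residuals)
    mean = sum(residuals) / n
    # Central moments (biased, divide-by-n versions, as in the usual JB formula).
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Chi-square with 2 degrees of freedom: P(X > jb) = exp(-jb / 2).
    p_value = math.exp(-jb / 2.0)
    return jb, p_value

random.seed(42)
# Hypothetical residuals drawn from a normal distribution.
normal_resid = [random.gauss(0.0, 1.0) for _ in range(1000)]
# Hypothetical residuals with strong right skew (centered exponential).
skewed_resid = [random.expovariate(1.0) - 1.0 for _ in range(1000)]

jb_norm, p_norm = jarque_bera(normal_resid)
jb_skew, p_skew = jarque_bera(skewed_resid)
print("normal residuals: JB =", round(jb_norm, 2), "p =", round(p_norm, 4))
print("skewed residuals: JB =", round(jb_skew, 2), "p =", round(p_skew, 4))
```

The chi-square approximation is asymptotic, so the p-value should be read with caution in small samples.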
In SAS, although an OLS regression can be conveniently estimated through the REG procedure, we still prefer the NLMIXED procedure because it better illustrates the mathematical scheme of the model discussed. However, for the purpose of comparison and verification, both procedures are used and demonstrated below.
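The reason a likelihood-based procedure can reproduce the OLS fit is that, under e ~ N(0, sigma ^ 2), maximizing the Gaussian likelihood gives the same coefficients as least squares, with sigma ^ 2 estimated by SSE / n. The Python sketch below (not SAS code) illustrates this on synthetic LOGIT-scale data; the function names `gaussian_nll` and `closed_form_fit`, the data, and the perturbation sizes are all assumptions for the example.

```python
import math
import random

def gaussian_nll(b0, b1, sigma, x, z):
    """Negative log-likelihood of z_i ~ N(b0 + b1 * x_i, sigma^2)."""
    n = len(x)
    sse = sum((zi - b0 - b1 * xi) ** 2 for xi, zi in zip(x, z))
    return 0.5 * n * math.log(2.0 * math.pi * sigma ** 2) + sse / (2.0 * sigma ** 2)

def closed_form_fit(x, z):
    """OLS coefficients plus the maximum likelihood estimate of sigma."""
    n = len(x)
    x_bar = sum(x) / n
    z_bar = sum(z) / n
    b1 = (sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = z_bar - b1 * x_bar
    sse = sum((zi - b0 - b1 * xi) ** 2 for xi, zi in zip(x, z))
    # ML divides by n; the usual OLS estimate divides by n - 2 instead.
    return b0, b1, math.sqrt(sse / n)

# Synthetic LOGIT-scale outcome (assumption): z = 0.5 - 1.5 * x + e
random.seed(7)
x = [random.uniform(0.0, 1.0) for _ in range(200)]
z = [0.5 - 1.5 * xi + random.gauss(0.0, 0.7) for xi in x]

b0, b1, sigma_ml = closed_form_fit(x, z)
best = gaussian_nll(b0, b1, sigma_ml, x, z)

# Moving any parameter away from the closed-form solution raises the NLL.
perturbed = []
for db0, db1, scale in [(0.1, 0.0, 1.0), (0.0, -0.1, 1.0), (0.0, 0.0, 1.2)]:
    nll = gaussian_nll(b0 + db0, b1 + db1, sigma_ml * scale, x, z)
    perturbed.append(nll)
    print("increase in NLL:", round(nll - best, 4))
```

A general-purpose optimizer started at these closed-form values should stay there, which is also why OLS estimates make natural starting values for likelihood-based fits.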
Although conceptually imperfect, an OLS regression with the LOGIT-transformed Y is not without value from a practical standpoint and should be considered an initial step in such modeling practice. As shown in the subsequent sections, the estimated coefficients from the OLS regression can provide a set of effective starting values for the coefficients in more complex models estimated through the NLMIXED procedure.
In the financial service industry, we commonly observe the need to model a continuous variable between 0 and 1. For instance, given a pool of accounts in a specific credit score band, the percentage of delinquencies and its underlying risk drivers are often of key interest to the portfolio manager. Another example is the Loss Given Default (LGD), which measures the proportion not recovered from a defaulted borrower in the collection process. In the scenarios described above, the outcome of interest is usually observed in either the closed interval [0, 1] or the open interval (0, 1). Without loss of generality, it is reasonable to assume that the closed interval [0, 1] is a composition of 3 different data generating schemes: the point mass at 0, the point mass at 1, and the open interval (0, 1). From the modeling standpoint, it is straightforward to separate these 3 components either by an ordinal logistic regression or by 2 binary logistic regression models. Therefore, the only remaining challenge is how to develop a statistical model predicting a response variable in the open interval (0, 1), which is also the purpose of my study. In addition, it is worth noting that another more convenient but less satisfactory remedy for the 0 / 1 point mass is to add 1 / (2 * N) to each Y = 0 observation and subtract 1 / (2 * N) from each Y = 1 observation, where N is the total number of observations. However, this heuristic approach might introduce bias when the number of observations at 0 / 1 is non-trivial.
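The 1 / (2 * N) adjustment mentioned above can be sketched in a few lines of Python. The function name `nudge_boundaries` and the toy input are assumptions for illustration.

```python
def nudge_boundaries(y):
    """Move exact 0s and 1s into the open interval (0, 1).

    Adds 1 / (2 * N) to each Y = 0 and subtracts 1 / (2 * N) from
    each Y = 1, where N is the total number of observations.
    Interior values are returned unchanged.
    """
    n = len(y)
    eps = 1.0 / (2.0 * n)
    adjusted = []
    for yi in y:
        if yi == 0.0:
            adjusted.append(eps)
        elif yi == 1.0:
            adjusted.append(1.0 - eps)
        else:
            adjusted.append(yi)
    return adjusted

# Toy example (assumption): N = 4, so the shift is 1 / 8.
rates = [0.0, 0.25, 1.0, 0.5]
adjusted_rates = nudge_boundaries(rates)
boundary_share = sum(1 for r in rates if r in (0.0, 1.0)) / len(rates)
print(adjusted_rates, boundary_share)
```

Checking `boundary_share` before applying the adjustment is a cheap way to judge whether the heuristic is likely to distort the fit; a large share suggests modeling the point masses separately instead.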
To the best of my knowledge, although research interest in statistical models for rates and proportions has remained strong in recent years, there is still no consensus on either the distributional assumption or the modeling framework. An interesting but somewhat ironic observation is that the naïve OLS (Ordinary Least Squares) regression with a Gaussian distributional assumption has been the most popular modeling technique for rates and proportions thus far due to its simplicity. Based upon the literature survey, this widely used approach is subject to a couple of conceptual flaws. First and most evident, a variable observed in the open interval (0, 1), e.g. a percentage, is not defined on the whole real line and therefore shouldn’t be assumed normally distributed. Secondly, a fundamental statistical property of rate and proportion outcomes is that the conditional variance is a function of the mean, which violates the constant variance assumption in OLS regression and is known as heteroscedasticity.
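One simple way to see the mean-variance link is the special case where the rate is the share of successes in a fixed number of independent Bernoulli trials, so that its variance is mu * (1 – mu) / n. This binomial setting is an assumption chosen for illustration, not a claim about the data used in this study, and the function name `proportion_variance` is hypothetical.

```python
def proportion_variance(mu, n_trials):
    """Variance of a binomial proportion with success probability mu."""
    return mu * (1.0 - mu) / n_trials

# With 100 trials (assumption), the variance peaks at mu = 0.5
# and shrinks toward zero as mu approaches either boundary.
variances = {mu: proportion_variance(mu, 100) for mu in (0.05, 0.25, 0.5, 0.75, 0.95)}
for mu, var in variances.items():
    print("mu =", mu, "variance =", round(var, 6))
```

A constant-variance model cannot reproduce this pattern, which is the heteroscedasticity concern raised above.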
In light of the drawbacks described above, we surveyed a wide range of alternative models to the naïve OLS regression and would like to illustrate six promising candidates to statistical practitioners in the SAS community through an analysis of the 401K participation data used by Papke and Wooldridge (1996). The purpose of the original study was to investigate the relationship between the 401K participation rate, a dependent variable in [0, 1], and a set of independent variables. However, in this exercise, I will only sample the portion of the data with the participation rate in (0, 1) to serve the purpose of my study.
In practice, OLS (Ordinary Least Squares) regression has been widely used to model rates and proportions bounded between 0 and 1 due to its simplicity. However, the conditional distribution of an OLS regression model is assumed to be Gaussian, N(X`B, sigma ^ 2), which is questionable for a variate in the open interval (0, 1). In this paper, we survey six alternative modeling methods for such outcomes, namely OLS regression with the LOGIT transformation, NLS (Nonlinear Least Squares) regression, the Tobit model, the Beta model, the Simplex model, and the Fractional LOGIT model, and demonstrate their implementations in SAS through a data analysis exercise. The purpose of my study is to provide the SAS user community with a comprehensive survey of how to model percentage and proportion outcomes in SAS.
Rate and Proportion outcomes, OLS, NLS, Tobit, Beta, Simplex, Fractional LOGIT, PROC NLMIXED