In the financial services industry, we commonly need to model a continuous variable between 0 and 1. For instance, given a pool of accounts in a specific credit score band, the percentage of delinquencies, together with its underlying risk drivers, is often of key interest to the portfolio manager. Another example is Loss Given Default (LGD), which measures the proportion not recovered from a defaulted borrower in the collection process. In these scenarios, the outcome of interest is usually observed in either the closed interval [0, 1] or the open interval (0, 1). Without loss of generality, it is reasonable to assume that the closed interval [0, 1] is a composition of 3 different data generating schemes: the point mass at 0, the point mass at 1, and the open interval (0, 1). From the modeling standpoint, it is straightforward to separate these 3 components either by an ordinal logistic regression or by 2 binary logistic regression models. Therefore, the only remaining challenge is how to develop a statistical model predicting a response variable in the open interval (0, 1), which is also the purpose of my study. It is also worth noting that a more convenient but less satisfactory remedy for the 0 / 1 point mass is to add 1 / (2 * N) to each Y = 0 observation and subtract 1 / (2 * N) from each Y = 1 observation, where N is the total number of observations. However, this heuristic approach might introduce bias when a non-trivial number of observations fall at 0 or 1.
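To make the two treatments of the boundary values concrete, the following is a minimal Python sketch rather than the implementation used in this study. The function names `split_components` and `squeeze_boundaries` and the sample `rates` are hypothetical and introduced only for illustration.

```python
def split_components(y):
    """Partition outcomes in [0, 1] into the three data generating schemes."""
    zeros = [v for v in y if v == 0]          # point mass at 0
    ones = [v for v in y if v == 1]           # point mass at 1
    interior = [v for v in y if 0 < v < 1]    # open interval (0, 1)
    return zeros, ones, interior


def squeeze_boundaries(y):
    """Heuristic remedy: move exact 0s and 1s into (0, 1) by 1 / (2 * N)."""
    n = len(y)                  # N = total number of observations
    eps = 1.0 / (2 * n)
    adjusted = []
    for v in y:
        if v == 0:
            adjusted.append(eps)        # add 1 / (2 * N) to Y = 0
        elif v == 1:
            adjusted.append(1 - eps)    # subtract 1 / (2 * N) from Y = 1
        else:
            adjusted.append(v)          # interior values are left unchanged
    return adjusted


# Hypothetical sample of observed rates, for illustration only.
rates = [0.0, 0.25, 0.5, 1.0]
zeros, ones, interior = split_components(rates)
adjusted = squeeze_boundaries(rates)
```

With N = 4 in this toy sample, the shift is 1 / 8, so the boundary values move to 0.125 and 0.875. The more observations sit exactly at 0 or 1, the more of the sample is shifted by this arbitrary amount, which is where the potential bias comes from.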
To the best of my knowledge, although research interest in statistical models for rates and proportions has remained strong in recent years, there is still no consensus on either the distributional assumption or the modeling framework. An interesting but somewhat ironic observation is that the naïve OLS (Ordinary Least Squares) regression with a Gaussian distributional assumption has been the most popular modeling technique for rates and proportions thus far due to its simplicity. Based upon the literature survey, this widely used approach is subject to a couple of conceptual flaws. First and most evidently, a variable observed in the open interval (0, 1), e.g. a percentage, is not defined on the whole real line and therefore should not be assumed to be normally distributed. Secondly, a fundamental statistical property of rate and proportion outcomes is that the conditional variance is a function of the mean, which violates the constant variance assumption of OLS regression and is known as heteroscedasticity.
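The dependence of the variance on the mean can be illustrated with a small Python sketch. It assumes, purely for illustration and not as a claim about the data in this study, that each observed proportion is the average of `n_trials` independent Bernoulli outcomes with success probability `mu`, in which case the variance is mu * (1 - mu) / n_trials. The function name `proportion_variance` is hypothetical.

```python
def proportion_variance(mu, n_trials):
    """Variance of a sample proportion under the Bernoulli-average assumption."""
    if not 0 < mu < 1:
        raise ValueError("mu must lie in the open interval (0, 1)")
    return mu * (1 - mu) / n_trials


# The variance changes with the mean: largest at mu = 0.5, shrinking
# toward zero as mu approaches either boundary.
means = [0.01, 0.1, 0.5, 0.9, 0.99]
variances = [proportion_variance(mu, n_trials=100) for mu in means]

for mu, var in zip(means, variances):
    print(f"mean = {mu:.2f}  variance = {var:.6f}")
```

Because the variance is not constant across values of the mean, a single error variance as assumed by OLS cannot hold over the whole range of the response.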
In light of the drawbacks described above, we surveyed a wide range of alternative models to the naïve OLS regression and would like to illustrate six promising candidates to statistical practitioners in the SAS community through an analysis of the 401K participation data used by Papke and Wooldridge (1996). The purpose of the original study was to investigate the relationship between the 401K participation rate, a dependent variable in [0, 1], and a set of independent variables. However, in this exercise, I will use only the portion of the data with a participation rate in (0, 1) to serve the purpose of my study.