Modeling Rates and Proportions in SAS – 3


Next to naïve OLS regression, the most common modeling practice is another OLS regression, this time applied to a log-ratio (LOGIT) transformation of Y in the interval (0, 1), such that

Ln(Y / (1 – Y)) = X`B + e, where the error term e ~ N(0, sigma ^ 2)

This class of models has long been used by econometricians (Atkinson, 1985) because of its simple statistical form and its easy implementation within the OLS framework, which is the one most familiar to statistical practitioners in the industry. While Y is strictly bounded by (0, 1), the LOGIT-transformed response Ln(Y / (1 – Y)) is well defined on the whole real line. More importantly, most model development techniques and diagnostic statistics can be ported directly from the naïve OLS regression with little or no adjustment.
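For reference, the mapping between the bounded response and the real line can be written out explicitly. This is standard algebra rather than anything specific to the cited literature, and the symbol z is introduced here only as shorthand for the transformed response.

```latex
z = \ln\!\left(\frac{Y}{1 - Y}\right) \in (-\infty, \infty)
\quad\Longleftrightarrow\quad
Y = \frac{1}{1 + e^{-z}} \in (0, 1)
```

As Y approaches 0 the transformed value z tends to minus infinity, and as Y approaches 1 it tends to plus infinity, which is why the transformation is undefined for observations exactly equal to 0 or 1.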

Although straightforward, this modeling practice is not free of criticism. A key concern is that, for Ln(Y / (1 – Y)) ~ N(X`B, sigma ^ 2) to hold, and therefore for the error term e ~ N(0, sigma ^ 2), Y must in theory follow an additive logistic normal distribution. That assumption may be questionable and should be checked with the corresponding statistical diagnostics. With this modeling strategy, it is therefore important to test in the post-model diagnosis whether the error term e follows a normal distribution with mean zero, for instance with the Shapiro-Wilk or Jarque-Bera test. Moreover, the model Ln(Y / (1 – Y)) ~ N(X`B, sigma ^ 2) assumes a constant variance, which contradicts the statistical nature of rate and proportion outcomes described earlier in the study, namely heteroscedasticity.
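The diagnostic checks described above might be run along the following lines. This is a minimal sketch on simulated data, not the data used in the study: the dataset names (work.rates_logit, work.resid) and variable names (x1, logit_y, e) are hypothetical, and the Jarque-Bera step assumes that SAS/ETS is licensed so that the AUTOREG procedure is available.

```sas
/* Simulated data for illustration only; all names are hypothetical */
data work.rates_logit;
  call streaminit(2024);
  do i = 1 to 500;
    x1 = rand('normal');
    /* response generated directly on the logit scale */
    logit_y = -1 + 0.5 * x1 + rand('normal', 0, 0.8);
    /* back-transformed rate in (0, 1) */
    y = 1 / (1 + exp(-logit_y));
    output;
  end;
run;

/* OLS on the transformed response; SPEC requests White's test */
/* for heteroscedasticity, and OUTPUT saves the residuals as e  */
proc reg data = work.rates_logit;
  model logit_y = x1 / spec;
  output out = work.resid r = e;
run;
quit;

/* NORMAL requests the Shapiro-Wilk test (among others) on the residuals */
proc univariate data = work.resid normal;
  var e;
  qqplot e / normal(mu = est sigma = est);
run;

/* Assumes SAS/ETS: NORMAL on the MODEL statement requests Jarque-Bera */
proc autoreg data = work.rates_logit;
  model logit_y = x1 / normal;
run;
```

A small p-value from either normality test, or from the SPEC test, would suggest that the distributional assumptions behind the LOGIT-transformed OLS model are not met for the data at hand.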

In SAS, although an OLS regression can be conveniently estimated with the REG procedure, we prefer the NLMIXED procedure because it better illustrates the mathematical structure of the model discussed. For the purpose of comparison and verification, however, both procedures are used and demonstrated below.
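As a rough illustration of how the two procedures line up, the sketch below fits the same model both ways on simulated data. The dataset names (work.rates, work.rates_logit), the predictors x1 and x2, and the parameter names b0, b1, b2 and s2 are all assumptions made for this example and do not come from the study.

```sas
/* Simulated data for illustration only; all names are hypothetical */
data work.rates;
  call streaminit(2024);
  do i = 1 to 500;
    x1 = rand('normal');
    x2 = rand('uniform');
    /* rate in (0, 1) generated from a linear predictor on the logit scale */
    y = 1 / (1 + exp(-(-1 + 0.5 * x1 - 0.3 * x2 + rand('normal', 0, 0.8))));
    output;
  end;
run;

/* LOGIT transformation; observations at or outside the bounds are dropped */
data work.rates_logit;
  set work.rates;
  if 0 < y < 1;
  logit_y = log(y / (1 - y));
run;

/* OLS regression on the transformed response */
proc reg data = work.rates_logit;
  model logit_y = x1 x2;
run;
quit;

/* Same model written out explicitly as a normal likelihood */
proc nlmixed data = work.rates_logit;
  parms b0 = 0 b1 = 0 b2 = 0 s2 = 1;
  bounds s2 > 0;
  mu = b0 + b1 * x1 + b2 * x2;
  /* normal(mean, variance) */
  model logit_y ~ normal(mu, s2);
run;
```

The coefficient estimates from the two procedures should agree up to optimizer tolerance. The variance estimates will differ slightly, because NLMIXED reports the maximum likelihood estimate (residual sum of squares divided by n) while the mean squared error in REG divides by the residual degrees of freedom.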

Although conceptually imperfect, the OLS regression with the LOGIT-transformed Y is not without practical value and should be considered an initial step in this kind of modeling. As shown in the subsequent sections, the estimated coefficients from the OLS regression provide a set of effective starting values for the coefficients in more complex models estimated with the NLMIXED procedure.