## Archive for the ‘**Econometrics**’ Category

## Modeling Dollar Amounts in Regression Setting

After switching the role from the credit risk to the operational risk in 2015, I spent countless weekend hours in the Starbucks researching on how to model operational losses in the regression setting in light of the heightened scrutiny. While I feel very comfortable with various frequency models, how to model severity and loss remain challenging both conceptually and empirically. The same challenge also holds true for modeling other financial measures in dollar amounts, such as balance, profit, or cost.

Most practitioners still prefer modeling severity and loss under the Gaussian distributional assumption explicitly or implicitly. In practice, there are 3 commonly used approaches, as elaborated below.

– First of all, the simple OLS regression to model severity and loss directly without any transformation remains the number one choice due to the simplicity. Given the inconsistency between the empirical data range and the conceptual domain for a Gaussian distribution, it is evidential that this approach is problematic.

– Secondly, the OLS regression to model LOG transformed severity and loss under the Lognormal distributional assumption is also a common approach. In this method, Log(Y) instead of Y is estimated. However, given E(Log(Y)|X) != Log(E(Y|X)), the estimation bias is introduced and therefore should be corrected by MSE / 2. In addition, the positive domain of a Lognormal might not work well in cases of losses with a lower bound that can be either zero or a known threshold value.

– At last, the Tobit regression under the censored Normal distribution seems a viable solution that supports the non-negative or any above-threshold values shown in severity or loss measures. Nonetheless, the censorship itself is questionable given that the unobservability of negative or below-threshold values is not due to the censorship but attributable to the data nature governed by the data collection process. Therefore, the argument for the data censorship is not well supported.

Considering the aforementioned challenge, I investigated and experimented various approaches given different data characteristics observed empirically.

– In cases of severity or loss observed in the range of (0, inf), GLM under Gamma or Inverse Gaussian distributional assumption can be considered (https://statcompute.wordpress.com/2015/08/16/some-considerations-of-modeling-severity-in-operational-losses). In addition, the mean-variance relationship can be employed to assess the appropriateness of the correct distribution by either the modified Park test (https://statcompute.wordpress.com/2016/11/20/modified-park-test-in-sas) or the value of power parameter in the Tweedie distribution (https://statcompute.wordpress.com/2017/06/24/using-tweedie-parameter-to-identify-distributions).

– In cases of severity or loss observed in the range of [alpha, inf) with alpha being positive, then a regression under the type-I Pareto distribution (https://statcompute.wordpress.com/2016/12/11/estimate-regression-with-type-i-pareto-response) can be considered. However, there is a caveat that the conditional mean only exists when the shape parameter is large than 1.

– In cases of severity or loss observed in the range of [0, inf) with a small number of zeros, then a regression under the Lomax distribution (https://statcompute.wordpress.com/2016/11/13/parameter-estimation-of-pareto-type-ii-distribution-with-nlmixed-in-sas) or the Tweedie distribution (https://statcompute.wordpress.com/2017/06/29/model-operational-loss-directly-with-tweedie-glm) can be considered. For the Lomax model, it is worth pointing out that the shape parameter alpha has to be large than 2 in order to to have both mean and variance defined.

– In cases of severity or loss observed in the range of [0, inf) with many zeros, then a ZAGA or ZAIG model (https://statcompute.wordpress.com/2017/09/17/model-non-negative-numeric-outcomes-with-zeros) can be considered by assuming the measure governed by a mixed distribution between the point-mass at zeros and the standard Gamma or Inverse Gaussian. As a result, a ZA model consists of 2 sub-models, a nu model separating zeros and positive values and a mu model estimating the conditional mean of positive values.

## Additional Thoughts on Estimating LGD with Proportional Odds Model

In my previous post (https://statcompute.wordpress.com/2018/01/28/modeling-lgd-with-proportional-odds-model), I’ve discussed how to use Proportional Odds Models in the LGD model development. In particular, I specifically mentioned that we would estimate a sub-model, which can be Gamma or Simplex regression, to project the conditional mean for LGD values in the (0, 1) range. However, it is worth pointing out that, if we would define a finer LGD segmentation, the necessity of this sub-model is completely optional. A standalone Proportional Odds Model without any sub-model is more than sufficient to serve the purpose of stress testing, e.g. CCAR.

In the example below, I will define 5 categories based upon LGD values in the [0, 1] range, estimate a Proportional Odds Model as usual, and then demonstrate how to apply the model outcome in the setting of stress testing with the stressed model input, e.g. LTV.

First of all, I defined 5 instead of 3 categories for LGD values, as shown below. Nonetheless, we could use a even finer category definition in practice to achieve a more accurate outcome.

df <- read.csv("lgd.csv") df$lgd <- round(1 - df$Recovery_rate, 4) l1 <- c(-Inf, 0, 0.0999, 0.4999, 0.9999, Inf) l2 <- c("A", "B", "C", "D", "E") df$lgd_cat <- cut(df$lgd, breaks = l1, labels = l2, ordered_result = T) summary(df$lgd_cat) m1 <- ordinal::clm(lgd_cat ~ LTV, data = df) #Coefficients: # Estimate Std. Error z value Pr(>|z|) #LTV 2.3841 0.1083 22.02 <2e-16 *** # #Threshold coefficients: # Estimate Std. Error z value #A|B 0.54082 0.07897 6.848 #B|C 2.12270 0.08894 23.866 #C|D 3.18098 0.10161 31.307 #D|E 4.80338 0.13174 36.460

After the model estimation, it is straightforward to calculate the probability of each LGD category. The only question remained is how to calculate the LGD projection for each individual account as well as for the whole portfolio. In order to calculate the LGD projection, we need two factors, namely the probability and the expected mean of each LGD category, such that

Estimated_LGD = SUM_i [Prob(category i) * LGD_Mean(category i)], where i = A, B, C, D, and E in this particular case.

The calculation is shown below with the estimated LGD = 0.23 that is consistent with the actual LGD = 0.23 for the whole portfolio.

prob_A <- exp(df$LTV * (-m1$beta) + m1$Theta[1]) / (1 + exp(df$LTV * (-m1$beta) + m1$Theta[1])) prob_B <- exp(df$LTV * (-m1$beta) + m1$Theta[2]) / (1 + exp(df$LTV * (-m1$beta) + m1$Theta[2])) - prob_A prob_C <- exp(df$LTV * (-m1$beta) + m1$Theta[3]) / (1 + exp(df$LTV * (-m1$beta) + m1$Theta[3])) - prob_A - prob_B prob_D <- exp(df$LTV * (-m1$beta) + m1$Theta[4]) / (1 + exp(df$LTV * (-m1$beta) + m1$Theta[4])) - prob_A - prob_B - prob_C prob_E <- 1 - exp(df$LTV * (-m1$beta) + m1$Theta[4]) / (1 + exp(df$LTV * (-m1$beta) + m1$Theta[4])) pred <- data.frame(prob_A, prob_B, prob_C, prob_D, prob_E) sum(apply(pred, 2, mean) * aggregate(df['lgd'], df['lgd_cat'], mean)[2]) #[1] 0.2262811

One might be wondering how to apply the model outcome with simple averages in stress testing that the model input is stressed, e.g. more severe, and might be also concerned about the lack of model sensitivity. In the demonstration below, let’s stress the model input LTV by 50% and then evaluate the stressed LGD.

df$LTV_ST <- df$LTV * 1.5 prob_A <- exp(df$LTV_ST * (-m1$beta) + m1$Theta[1]) / (1 + exp(df$LTV_ST * (-m1$beta) + m1$Theta[1])) prob_B <- exp(df$LTV_ST * (-m1$beta) + m1$Theta[2]) / (1 + exp(df$LTV_ST * (-m1$beta) + m1$Theta[2])) - prob_A prob_C <- exp(df$LTV_ST * (-m1$beta) + m1$Theta[3]) / (1 + exp(df$LTV_ST * (-m1$beta) + m1$Theta[3])) - prob_A - prob_B prob_D <- exp(df$LTV_ST * (-m1$beta) + m1$Theta[4]) / (1 + exp(df$LTV_ST * (-m1$beta) + m1$Theta[4])) - prob_A - prob_B - prob_C prob_E <- 1 - exp(df$LTV_ST * (-m1$beta) + m1$Theta[4]) / (1 + exp(df$LTV_ST * (-m1$beta) + m1$Theta[4])) pred_ST <- data.frame(prob_A, prob_B, prob_C, prob_D, prob_E) sum(apply(pred_ST, 2, mean) * aggregate(df['lgd'], df['lgd_cat'], mean)[2]) #[1] 0.3600153

As shown above, although we only use a simple averages as the expected mean for each LGD category, the overall LGD still increases by ~60%. The reason is that, with the more stressed model input, the Proportional Odds Model is able to push more accounts into categories with higher LGD. For instance, the output below shows that, if LTV is stressed by 50% overall, ~146% more accounts would roll into the most severe LGD category without any recovery.

apply(pred_ST, 2, mean) / apply(pred, 2, mean) # prob_A prob_B prob_C prob_D prob_E #0.6715374 0.7980619 1.0405573 1.4825803 2.4639293

## Estimating Parameters of A Hyper-Poisson Distribution in SAS

Similar to COM-Poisson, Double-Poisson, and Generalized Poisson distributions discussed in my previous post (https://statcompute.wordpress.com/2016/11/27/more-about-flexible-frequency-models/), the Hyper-Poisson distribution is another extension of the standard Poisson and is able to accommodate both under-dispersion and over-dispersion that are common in real-world problems. Given the complexity of parameterization and computation, the Hyper-Poisson is somewhat under-investigated. To the best of my knowledge, there is no off-shelf computing routine in SAS for the Hyper-Poisson distribution and only a R function available in http://www4.ujaen.es/~ajsaez/hp.fit.r written by A.J. Sáez-Castillo and A. Conde-Sánchez (2013).

The SAS code presented below is the starting point of my attempt on the Hyper-Poisson and its potential applications. The purpose is to replicate the calculation result shown in the Table 6 of “On the Hyper-Poisson Distribution and its Generalization with Applications” by Bayo H. Lawal (2017) (http://www.journalrepository.org/media/journals/BJMCS_6/2017/Mar/Lawal2132017BJMCS32184.pdf). As a result, the parameterization employed in my SAS code will closely follow Bayo H. Lawal (2017) instead of A.J. Sáez-Castillo and A. Conde-Sánchez (2013).

data d1; input y n @@; datalines; 0 121 1 85 2 19 3 1 4 0 5 0 6 1 ; run; data df; set d1; where n > 0; do i = 1 to n; output; end; run; proc nlmixed data = df; parms lambda = 1 beta = 1; theta = 1; do k = 1 to 100; theta = theta + gamma(beta) * (lambda ** k) / gamma(beta + k); end; prob = (gamma(beta) / gamma(beta + y)) * ((lambda ** y) / theta); ll = log(prob); model y ~ general(ll); run; /* Standard Parameter Estimate Error DF t Value Pr > |t| Alpha lambda 0.3752 0.1178 227 3.19 0.0016 0.05 beta 0.5552 0.2266 227 2.45 0.0150 0.05 */

As shown, the estimated Lambda = 0.3752 and the estimated Beta = 0.5552 are identical to what is presented in the paper. The next step is be to explore applications in the frequency modeling as well as its value in business cases.

## Model Non-Negative Numeric Outcomes with Zeros

As mentioned in the previous post (https://statcompute.wordpress.com/2017/06/29/model-operational-loss-directly-with-tweedie-glm/), we often need to model non-negative numeric outcomes with zeros in the operational loss model development. Tweedie GLM provides a convenient interface to model non-negative losses directly by assuming that aggregated losses are the Poisson sum of Gamma outcomes, which however might not be well supported empirically from the data generation standpoint.

In examples below, we demonstrated another flexible option, namely Zero-Adjusted (ZA) models, in both scenarios of modeling non-negative numeric outcomes, one with a small number of zeros and the other with a large number of zeros. The basic idea of ZA models is very intuitive and similar to the concept of Hurdle models for count outcomes. In a nutshell, non-negative numeric outcomes can be considered two data generation processes, one for point-mass at zeros and the other governed by a statistical distribution for positive outcomes. The latter could be either Gamma or Inverse Gaussian.

First of all, we sampled down an auto-claim data in a way that only 10 claims are zeros and the rest are all positive. While 10 is an arbitrary choice in the example, other small numbers should show similar results.

pkgs <- list("cplm", "gamlss", "MLmetrics") lapply(pkgs, require, character.only = T) data(AutoClaim, package = "cplm") df1 <- na.omit(AutoClaim) # SMALL NUMBER OF ZEROS set.seed(2017) smp <- sample(seq(nrow(df1[df1$CLM_AMT == 0, ])), size = 10, replace = FALSE) df2 <- rbind(df1[df1$CLM_AMT > 0, ], df1[df1$CLM_AMT == 0, ][smp, ])

Next, we applied both Tweedie and zero-adjusted Gamma (ZAGA) models to the data with only 10 zero outcomes. It is worth mentioning that ZAGA doesn’t have to be overly complex in this case. As shown below, while we estimated the Gamma Mu parameter with model attributes, the Nu parameter to separate zeros is just a constant with the intercept = -5.4. Both Tweedie and GAZA models gave very similar estimated parameters and predictive measures with MAPE = 0.61.

tw <- cpglm(CLM_AMT ~ BLUEBOOK + NPOLICY, data = df2) # Estimate Std. Error t value Pr(>|t|) # (Intercept) 8.194e+00 7.234e-02 113.277 < 2e-16 *** # BLUEBOOK 2.047e-05 3.068e-06 6.671 3.21e-11 *** # NPOLICY 7.274e-02 3.102e-02 2.345 0.0191 * MAPE(df2$CLM_AMT, fitted(tw)) # 0.6053669 zaga0 <- gamlss(CLM_AMT ~ BLUEBOOK + NPOLICY, data = df2, family = "ZAGA") # Mu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 8.203e+00 4.671e-02 175.629 < 2e-16 *** # BLUEBOOK 2.053e-05 2.090e-06 9.821 < 2e-16 *** # NPOLICY 6.948e-02 2.057e-02 3.377 0.000746 *** # Nu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) -5.3886 0.3169 -17 <2e-16 *** MAPE(df2$CLM_AMT, (1 - fitted(zaga0, what = "nu")) * fitted(zaga0, what = "mu")) # 0.6053314

In the next case, we used the full data with a large number of zeros in the response and then applied both Tweedie and ZAGA models again. However, in ZAGA model, we estimated two sub-models this time, one for the Nu parameter to separate zeros from non-zeros and the other for the Mu parameter to model non-zero outcomes. As shown below, ZAGA outperformed Tweedie in terms of MAPE due to the advantage that ZAGA is able to explain two data generation schemes separately with different model attributes, which is the capability beyond what Tweedie can provide.

# LARGE NUMBER OF ZEROS tw <- cpglm(CLM_AMT ~ BLUEBOOK + NPOLICY + CLM_FREQ5 + MVR_PTS + INCOME, data = df1) # Estimate Std. Error t value Pr(>|t|) # (Intercept) 6.854e+00 1.067e-01 64.241 < 2e-16 *** # BLUEBOOK 1.332e-05 4.495e-06 2.963 0.00305 ** # NPOLICY 4.380e-02 3.664e-02 1.195 0.23196 # CLM_FREQ5 2.064e-01 2.937e-02 7.026 2.29e-12 *** # MVR_PTS 1.066e-01 1.510e-02 7.063 1.76e-12 *** # INCOME -4.606e-06 8.612e-07 -5.348 9.12e-08 *** MAPE(df1$CLM_AMT, fitted(tw)) # 1.484484 zaga1 <- gamlss(CLM_AMT ~ BLUEBOOK + NPOLICY, nu.formula = ~(CLM_FREQ5 + MVR_PTS + INCOME), data = df1, family = "ZAGA") # Mu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 8.203e+00 4.682e-02 175.218 < 2e-16 *** # BLUEBOOK 2.053e-05 2.091e-06 9.816 < 2e-16 *** # NPOLICY 6.948e-02 2.067e-02 3.362 0.000778 *** # Nu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 1.153e+00 5.077e-02 22.72 <2e-16 *** # CLM_FREQ5 -3.028e-01 2.283e-02 -13.26 <2e-16 *** # MVR_PTS -1.509e-01 1.217e-02 -12.41 <2e-16 *** # INCOME 7.285e-06 6.269e-07 11.62 <2e-16 *** MAPE(df1$CLM_AMT, (1 - fitted(zaga1, what = "nu")) * fitted(zaga1, what = "mu")) # 1.470228

Given the great flexibility of ZA models, we also have the luxury to explore other candidates than ZAGA. For instance, if the positive part of non-negative outcomes demonstrates a high variance, we can also try a zero-inflated Inverse Gaussian (ZAIG) model, as shown below.

zaig1 <- gamlss(CLM_AMT ~ BLUEBOOK + NPOLICY, nu.formula = ~(CLM_FREQ5 + MVR_PTS + INCOME), data = df1, family = "ZAIG") # Mu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 8.205e+00 5.836e-02 140.591 < 2e-16 *** # BLUEBOOK 2.163e-05 2.976e-06 7.268 3.97e-13 *** # NPOLICY 5.898e-02 2.681e-02 2.200 0.0278 * # Nu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 1.153e+00 5.077e-02 22.72 <2e-16 *** # CLM_FREQ5 -3.028e-01 2.283e-02 -13.26 <2e-16 *** # MVR_PTS -1.509e-01 1.217e-02 -12.41 <2e-16 *** # INCOME 7.285e-06 6.269e-07 11.62 <2e-16 *** MAPE(df1$CLM_AMT, (1 - fitted(zaig1, what = "nu")) * fitted(zaig1, what = "mu")) # 1.469236

## Finer Monotonic Binning Based on Isotonic Regression

In my early post (https://statcompute.wordpress.com/2017/01/22/monotonic-binning-with-smbinning-package/), I wrote a monobin() function based on the smbinning package by Herman Jopia to improve the monotonic binning algorithm. The function works well and provides robust binning outcomes. However, there are a couple potential drawbacks due to the coarse binning. First of all, the derived Information Value for each binned variable might tend to be low. Secondly, the binned variable might not be granular enough to reflect the data nature.

In light of the aforementioned, I drafted an improved function isobin() based on the isotonic regression (https://en.wikipedia.org/wiki/Isotonic_regression), as shown below.

isobin <- function(data, y, x) { d1 <- data[c(y, x)] d2 <- d1[!is.na(d1[x]), ] c <- cor(d2[, 2], d2[, 1], method = "spearman", use = "complete.obs") reg <- isoreg(d2[, 2], c / abs(c) * d2[, 1]) k <- knots(as.stepfun(reg)) sm1 <-smbinning.custom(d1, y, x, k) c1 <- subset(sm1$ivtable, subset = CntGood * CntBad > 0, select = Cutpoint) c2 <- suppressWarnings(as.numeric(unlist(strsplit(c1$Cutpoint, " ")))) c3 <- c2[!is.na(c2)] return(smbinning.custom(d1, y, x, c3[-length(c3)])) }

Compared with the legacy monobin(), the isobin() function is able to significantly increase the binning granularity as well as moderately improve the Information Value.

**LTV Binning with isobin() Function**

Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate Odds LnOdds WoE IV 1 <= 46 81 78 3 81 78 3 0.0139 0.9630 0.0370 26.0000 3.2581 1.9021 0.0272 2 <= 71 312 284 28 393 362 31 0.0535 0.9103 0.0897 10.1429 2.3168 0.9608 0.0363 3 <= 72 22 20 2 415 382 33 0.0038 0.9091 0.0909 10.0000 2.3026 0.9466 0.0025 4 <= 73 27 24 3 442 406 36 0.0046 0.8889 0.1111 8.0000 2.0794 0.7235 0.0019 5 <= 81 303 268 35 745 674 71 0.0519 0.8845 0.1155 7.6571 2.0356 0.6797 0.0194 6 <= 83 139 122 17 884 796 88 0.0238 0.8777 0.1223 7.1765 1.9708 0.6149 0.0074 7 <= 90 631 546 85 1515 1342 173 0.1081 0.8653 0.1347 6.4235 1.8600 0.5040 0.0235 8 <= 94 529 440 89 2044 1782 262 0.0906 0.8318 0.1682 4.9438 1.5981 0.2422 0.0049 9 <= 95 145 119 26 2189 1901 288 0.0248 0.8207 0.1793 4.5769 1.5210 0.1651 0.0006 10 <= 100 907 709 198 3096 2610 486 0.1554 0.7817 0.2183 3.5808 1.2756 -0.0804 0.0010 11 <= 101 195 151 44 3291 2761 530 0.0334 0.7744 0.2256 3.4318 1.2331 -0.1229 0.0005 12 <= 110 1217 934 283 4508 3695 813 0.2085 0.7675 0.2325 3.3004 1.1940 -0.1619 0.0057 13 <= 112 208 158 50 4716 3853 863 0.0356 0.7596 0.2404 3.1600 1.1506 -0.2054 0.0016 14 <= 115 253 183 70 4969 4036 933 0.0433 0.7233 0.2767 2.6143 0.9610 -0.3950 0.0075 15 <= 136 774 548 226 5743 4584 1159 0.1326 0.7080 0.2920 2.4248 0.8857 -0.4702 0.0333 16 <= 138 27 18 9 5770 4602 1168 0.0046 0.6667 0.3333 2.0000 0.6931 -0.6628 0.0024 17 > 138 66 39 27 5836 4641 1195 0.0113 0.5909 0.4091 1.4444 0.3677 -0.9882 0.0140 18 Missing 1 0 1 5837 4641 1196 0.0002 0.0000 1.0000 0.0000 -Inf -Inf Inf 19 Total 5837 4641 1196 NA NA NA 1.0000 0.7951 0.2049 3.8804 1.3559 0.0000 0.1897

**LTV Binning with monobin() Function**

Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate Odds LnOdds WoE IV 1 <= 85 1025 916 109 1025 916 109 0.1756 0.8937 0.1063 8.4037 2.1287 0.7727 0.0821 2 <= 94 1019 866 153 2044 1782 262 0.1746 0.8499 0.1501 5.6601 1.7334 0.3775 0.0221 3 <= 100 1052 828 224 3096 2610 486 0.1802 0.7871 0.2129 3.6964 1.3074 -0.0486 0.0004 4 <= 105 808 618 190 3904 3228 676 0.1384 0.7649 0.2351 3.2526 1.1795 -0.1765 0.0045 5 <= 114 985 748 237 4889 3976 913 0.1688 0.7594 0.2406 3.1561 1.1493 -0.2066 0.0076 6 > 114 947 665 282 5836 4641 1195 0.1622 0.7022 0.2978 2.3582 0.8579 -0.4981 0.0461 7 Missing 1 0 1 5837 4641 1196 0.0002 0.0000 1.0000 0.0000 -Inf -Inf Inf 8 Total 5837 4641 1196 NA NA NA 1.0000 0.7951 0.2049 3.8804 1.3559 0.0000 0.1628

**Bureau_Score Binning with isobin() Function**

Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate Odds LnOdds WoE IV 1 <= 491 4 1 3 4 1 3 0.0007 0.2500 0.7500 0.3333 -1.0986 -2.4546 0.0056 2 <= 532 24 9 15 28 10 18 0.0041 0.3750 0.6250 0.6000 -0.5108 -1.8668 0.0198 3 <= 559 51 24 27 79 34 45 0.0087 0.4706 0.5294 0.8889 -0.1178 -1.4737 0.0256 4 <= 560 2 1 1 81 35 46 0.0003 0.5000 0.5000 1.0000 0.0000 -1.3559 0.0008 5 <= 572 34 17 17 115 52 63 0.0058 0.5000 0.5000 1.0000 0.0000 -1.3559 0.0143 6 <= 602 153 84 69 268 136 132 0.0262 0.5490 0.4510 1.2174 0.1967 -1.1592 0.0459 7 <= 605 56 31 25 324 167 157 0.0096 0.5536 0.4464 1.2400 0.2151 -1.1408 0.0162 8 <= 606 14 8 6 338 175 163 0.0024 0.5714 0.4286 1.3333 0.2877 -1.0683 0.0035 9 <= 607 17 10 7 355 185 170 0.0029 0.5882 0.4118 1.4286 0.3567 -0.9993 0.0037 10 <= 632 437 261 176 792 446 346 0.0749 0.5973 0.4027 1.4830 0.3940 -0.9619 0.0875 11 <= 639 150 95 55 942 541 401 0.0257 0.6333 0.3667 1.7273 0.5465 -0.8094 0.0207 12 <= 653 451 300 151 1393 841 552 0.0773 0.6652 0.3348 1.9868 0.6865 -0.6694 0.0412 13 <= 662 295 213 82 1688 1054 634 0.0505 0.7220 0.2780 2.5976 0.9546 -0.4014 0.0091 14 <= 665 100 77 23 1788 1131 657 0.0171 0.7700 0.2300 3.3478 1.2083 -0.1476 0.0004 15 <= 667 57 44 13 1845 1175 670 0.0098 0.7719 0.2281 3.3846 1.2192 -0.1367 0.0002 16 <= 677 381 300 81 2226 1475 751 0.0653 0.7874 0.2126 3.7037 1.3093 -0.0466 0.0001 17 <= 679 66 53 13 2292 1528 764 0.0113 0.8030 0.1970 4.0769 1.4053 0.0494 0.0000 18 <= 683 160 129 31 2452 1657 795 0.0274 0.8062 0.1938 4.1613 1.4258 0.0699 0.0001 19 <= 689 203 164 39 2655 1821 834 0.0348 0.8079 0.1921 4.2051 1.4363 0.0804 0.0002 20 <= 699 304 249 55 2959 2070 889 0.0521 0.8191 0.1809 4.5273 1.5101 0.1542 0.0012 21 <= 707 312 268 44 3271 2338 933 0.0535 0.8590 0.1410 6.0909 1.8068 0.4509 0.0094 22 <= 717 368 318 50 3639 2656 983 0.0630 0.8641 0.1359 6.3600 1.8500 0.4941 0.0132 23 <= 721 134 119 15 3773 2775 998 0.0230 0.8881 0.1119 7.9333 2.0711 0.7151 0.0094 24 <= 723 49 44 5 3822 2819 1003 0.0084 0.8980 0.1020 8.8000 2.1748 0.8188 0.0043 25 <= 739 425 394 31 4247 3213 1034 0.0728 0.9271 0.0729 12.7097 2.5424 1.1864 0.0700 26 <= 746 166 154 12 4413 3367 1046 0.0284 0.9277 0.0723 12.8333 2.5520 1.1961 0.0277 27 <= 756 234 218 16 4647 3585 1062 0.0401 0.9316 0.0684 13.6250 2.6119 1.2560 0.0422 28 <= 761 110 104 6 4757 3689 1068 0.0188 0.9455 0.0545 17.3333 2.8526 1.4967 0.0260 29 <= 763 46 44 2 4803 3733 1070 0.0079 0.9565 0.0435 22.0000 3.0910 1.7351 0.0135 30 <= 767 96 92 4 4899 3825 1074 0.0164 0.9583 0.0417 23.0000 3.1355 1.7795 0.0293 31 <= 772 77 74 3 4976 3899 1077 0.0132 0.9610 0.0390 24.6667 3.2055 1.8495 0.0249 32 <= 787 269 260 9 5245 4159 1086 0.0461 0.9665 0.0335 28.8889 3.3635 2.0075 0.0974 33 <= 794 95 93 2 5340 4252 1088 0.0163 0.9789 0.0211 46.5000 3.8395 2.4835 0.0456 34 > 794 182 179 3 5522 4431 1091 0.0312 0.9835 0.0165 59.6667 4.0888 2.7328 0.0985 35 Missing 315 210 105 5837 4641 1196 0.0540 0.6667 0.3333 2.0000 0.6931 -0.6628 0.0282 36 Total 5837 4641 1196 NA NA NA 1.0000 0.7951 0.2049 3.8804 1.3559 0.0000 0.8357

**Bureau_Score Binning with monobin() Function**

Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate Odds LnOdds WoE IV 1 <= 617 513 284 229 513 284 229 0.0879 0.5536 0.4464 1.2402 0.2153 -1.1407 0.1486 2 <= 642 515 317 198 1028 601 427 0.0882 0.6155 0.3845 1.6010 0.4706 -0.8853 0.0861 3 <= 657 512 349 163 1540 950 590 0.0877 0.6816 0.3184 2.1411 0.7613 -0.5946 0.0363 4 <= 672 487 371 116 2027 1321 706 0.0834 0.7618 0.2382 3.1983 1.1626 -0.1933 0.0033 5 <= 685 494 396 98 2521 1717 804 0.0846 0.8016 0.1984 4.0408 1.3964 0.0405 0.0001 6 <= 701 521 428 93 3042 2145 897 0.0893 0.8215 0.1785 4.6022 1.5265 0.1706 0.0025 7 <= 714 487 418 69 3529 2563 966 0.0834 0.8583 0.1417 6.0580 1.8014 0.4454 0.0144 8 <= 730 489 441 48 4018 3004 1014 0.0838 0.9018 0.0982 9.1875 2.2178 0.8619 0.0473 9 <= 751 513 476 37 4531 3480 1051 0.0879 0.9279 0.0721 12.8649 2.5545 1.1986 0.0859 10 <= 775 492 465 27 5023 3945 1078 0.0843 0.9451 0.0549 17.2222 2.8462 1.4903 0.1157 11 > 775 499 486 13 5522 4431 1091 0.0855 0.9739 0.0261 37.3846 3.6213 2.2653 0.2126 12 Missing 315 210 105 5837 4641 1196 0.0540 0.6667 0.3333 2.0000 0.6931 -0.6628 0.0282 13 Total 5837 4641 1196 NA NA NA 1.0000 0.7951 0.2049 3.8804 1.3559 0.0000 0.7810

## Estimating Conway-Maxwell-Poisson Regression in SAS

Conway-Maxwell-Poisson (CMP) regression is a flexible way to model frequency outcomes with both under-dispersion and over-dispersion. In SAS, CMP regression can be estimated with COUNTREG procedure directly or with NLMIXED procedure by specifying the likelihood function. However, the use of NLMIXED procedure is extremely cumbersome in that we need to estimate a standard Poisson regression and then use estimated parameters as initial values parameter estimates for the CMP regression.

In the example below, we will show how to employ GLIMMIX procedure to estimate a CMP regression by providing both the log likelihood function and the variance function in terms of the expected mean.

proc glimmix data = mylib.credit_count; model majordrg = age acadmos minordrg ownrent / link = log solution; _nu = 1 / exp(_phi_); _variance_ = (1 / _nu) / ((_mu_) ** (1 / _nu)); _z = 0; do i = 0 to 100; _z = _z + (_mu_ ** i) / fact(i) ** _nu; end; _prob = (_mu_ ** majordrg) / (fact(majordrg) ** _nu) * (_z ** (-1)); _logl_ = log(_prob); run;

Since the scale parameter **_phi_** is strictly above 0, the function **1 / exp(_phi_)** in the line #3 is to ensure the **Nu** parameter bounded between 0 and 1.

In addition, the DO loop is to calculate the normalization constant **Z** such that the PMF would sum up to 1. As there is no closed form for the calculation of **Z**, we need to calculate it numerically at the cost of a longer computing time.

Other implicit advantages of GLIMMIX procedure over NLMIXED procedure include the unnecessity to provide initiate values of parameter estimates and a shorter computing time.

## Modeling Generalized Poisson Regression in SAS

The Generalized Poisson (GP) regression is a very flexible statistical model for count outcomes in that it can accommodate both over-dispersion and under-dispersion, which makes it a very practical modeling approach in real-world problems and is considered a serious contender for the Quasi-Poisson regression.

Prob(Y) = Alpha / Y! * (Alpha + Xi * Y) ^ (Y – 1) * EXP(-Alpha – Xi * Y)

E(Y) = Mu = Alpha / (1 – Xi)

Var(Y) = Mu / (1 – Xi) ^ 2

While the GP regression can be conveniently estimated with HMM procedure in SAS, I’d always like to dive a little deeper into its model specification and likelihood function to have a better understanding. For instance, there is a slight difference in GP model outcomes between HMM procedure in SAS and VGAM package in R. After looking into the detail, I then realized that the difference is solely due to the different parameterization.

Basically, there are three steps for estimating a GP regression with NLMIXED procedure. Due to the complexity of GP likelihood function and its convergence process, it is always a good practice to estimate a baseline Standard Poisson regression as a starting point and then to output its parameter estimates into a table, e.g. _EST as shown below.

ods output ParameterEstimates = _est; proc genmod data = mylib.credit_count; model majordrg = age acadmos minordrg ownrent / dist = poisson link = log; run;

After acquiring parameter estimates from a Standard Poisson regression, we can use them to construct initiate values of parameter estimates for the Generalized Poisson regression. In the code snippet below, we used SQL procedure to create 2 macro variables that we are going to use in the final model estimation of GP regression.

proc sql noprint; select "_"||compress(upcase(parameter), ' ')||" = "||compress(put(estimate, 10.2), ' ') into :_parm separated by ' ' from _est; select case when upcase(parameter) = 'INTERCEPT' then "_"||compress(upcase(parameter), ' ') else "_"||compress(upcase(parameter), ' ')||" * "||compress(upcase(parameter), ' ') end into :_xb separated by ' + ' from _est where upcase(parameter) ~= 'SCALE'; quit; /* %put &_parm; _INTERCEPT = -1.38 _AGE = 0.01 _ACADMOS = 0.00 _MINORDRG = 0.46 _OWNRENT = -0.20 _SCALE = 1.00 %put &_xb; _INTERCEPT + _AGE * AGE + _ACADMOS * ACADMOS + _MINORDRG * MINORDRG + _OWNRENT * OWNRENT */

In the last step, we used the NLMIXED procedure to estimate the GP regression by specifying its log likelihood function that would generate identical model results as with HMM procedure. It is worth mentioning that the expected mean _mu = exp(x * beta) in SAS and the term exp(x * beta) refers to the _alpha parameter in R. Therefore, the intercept would be different between SAS and R, primarily due to different ways of parameterization with the identical statistical logic.

proc nlmixed data = mylib.credit_count; parms &_parm.; _xb = &_xb.; _xi = 1 - exp(-_scale); _mu = exp(_xb); _alpha = _mu * (1 - _xi); _prob = _alpha / fact(majordrg) * (_alpha + _xi * majordrg) ** (majordrg - 1) * exp(- _alpha - _xi * majordrg); ll = log(_prob); model majordrg ~ general(ll); run;

In addition to HMM and NLMIXED procedures, GLIMMIX can also be employed to estimate the GP regression, as shown below. In this case, we need to specify both the log likelihood function and the variance function in terms of the expected mean.

proc glimmix data = mylib.credit_count; model majordrg = age acadmos minordrg ownrent / link = log solution; _xi = 1 - 1 / exp(_phi_); _variance_ = _mu_ / (1 - _xi) ** 2; _alpha = _mu_ * (1 - _xi); _prob = _alpha / fact(majordrg) * (_alpha + _xi * majordrg) ** (majordrg - 1) * exp(- _alpha - _xi * majordrg); _logl_ = log(_prob); run;