## Archive for the ‘**Econometrics**’ Category

## Model Non-Negative Numeric Outcomes with Zeros

As mentioned in the previous post (https://statcompute.wordpress.com/2017/06/29/model-operational-loss-directly-with-tweedie-glm/), we often need to model non-negative numeric outcomes with zeros in the operational loss model development. Tweedie GLM provides a convenient interface to model non-negative losses directly by assuming that aggregated losses are the Poisson sum of Gamma outcomes, which however might not be well supported empirically from the data generation standpoint.

In examples below, we demonstrated another flexible option, namely Zero-Adjusted (ZA) models, in both scenarios of modeling non-negative numeric outcomes, one with a small number of zeros and the other with a large number of zeros. The basic idea of ZA models is very intuitive and similar to the concept of Hurdle models for count outcomes. In a nutshell, non-negative numeric outcomes can be considered two data generation processes, one for point-mass at zeros and the other governed by a statistical distribution for positive outcomes. The latter could be either Gamma or Inverse Gaussian.

First of all, we sampled down an auto-claim data in a way that only 10 claims are zeros and the rest are all positive. While 10 is an arbitrary choice in the example, other small numbers should show similar results.

pkgs <- list("cplm", "gamlss", "MLmetrics") lapply(pkgs, require, character.only = T) data(AutoClaim, package = "cplm") df1 <- na.omit(AutoClaim) # SMALL NUMBER OF ZEROS set.seed(2017) smp <- sample(seq(nrow(df1[df1$CLM_AMT == 0, ])), size = 10, replace = FALSE) df2 <- rbind(df1[df1$CLM_AMT > 0, ], df1[df1$CLM_AMT == 0, ][smp, ])

Next, we applied both Tweedie and zero-adjusted Gamma (ZAGA) models to the data with only 10 zero outcomes. It is worth mentioning that ZAGA doesnâ€™t have to be overly complex in this case. As shown below, while we estimated the Gamma Mu parameter with model attributes, the Nu parameter to separate zeros is just a constant with the intercept = -5.4. Both Tweedie and GAZA models gave very similar estimated parameters and predictive measures with MAPE = 0.61.

tw <- cpglm(CLM_AMT ~ BLUEBOOK + NPOLICY, data = df2) # Estimate Std. Error t value Pr(>|t|) # (Intercept) 8.194e+00 7.234e-02 113.277 < 2e-16 *** # BLUEBOOK 2.047e-05 3.068e-06 6.671 3.21e-11 *** # NPOLICY 7.274e-02 3.102e-02 2.345 0.0191 * MAPE(df2$CLM_AMT, fitted(tw)) # 0.6053669 zaga0 <- gamlss(CLM_AMT ~ BLUEBOOK + NPOLICY, data = df2, family = "ZAGA") # Mu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 8.203e+00 4.671e-02 175.629 < 2e-16 *** # BLUEBOOK 2.053e-05 2.090e-06 9.821 < 2e-16 *** # NPOLICY 6.948e-02 2.057e-02 3.377 0.000746 *** # Nu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) -5.3886 0.3169 -17 <2e-16 *** MAPE(df2$CLM_AMT, (1 - fitted(zaga0, what = "nu")) * fitted(zaga0, what = "mu")) # 0.6053314

In the next case, we used the full data with a large number of zeros in the response and then applied both Tweedie and ZAGA models again. However, in ZAGA model, we estimated two sub-models this time, one for the Nu parameter to separate zeros from non-zeros and the other for the Mu parameter to model non-zero outcomes. As shown below, ZAGA outperformed Tweedie in terms of MAPE due to the advantage that ZAGA is able to explain two data generation schemes separately with different model attributes, which is the capability beyond what Tweedie can provide.

# LARGE NUMBER OF ZEROS tw <- cpglm(CLM_AMT ~ BLUEBOOK + NPOLICY + CLM_FREQ5 + MVR_PTS + INCOME, data = df1) # Estimate Std. Error t value Pr(>|t|) # (Intercept) 6.854e+00 1.067e-01 64.241 < 2e-16 *** # BLUEBOOK 1.332e-05 4.495e-06 2.963 0.00305 ** # NPOLICY 4.380e-02 3.664e-02 1.195 0.23196 # CLM_FREQ5 2.064e-01 2.937e-02 7.026 2.29e-12 *** # MVR_PTS 1.066e-01 1.510e-02 7.063 1.76e-12 *** # INCOME -4.606e-06 8.612e-07 -5.348 9.12e-08 *** MAPE(df1$CLM_AMT, fitted(tw)) # 1.484484 zaga1 <- gamlss(CLM_AMT ~ BLUEBOOK + NPOLICY, nu.formula = ~(CLM_FREQ5 + MVR_PTS + INCOME), data = df1, family = "ZAGA") # Mu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 8.203e+00 4.682e-02 175.218 < 2e-16 *** # BLUEBOOK 2.053e-05 2.091e-06 9.816 < 2e-16 *** # NPOLICY 6.948e-02 2.067e-02 3.362 0.000778 *** # Nu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 1.153e+00 5.077e-02 22.72 <2e-16 *** # CLM_FREQ5 -3.028e-01 2.283e-02 -13.26 <2e-16 *** # MVR_PTS -1.509e-01 1.217e-02 -12.41 <2e-16 *** # INCOME 7.285e-06 6.269e-07 11.62 <2e-16 *** MAPE(df1$CLM_AMT, (1 - fitted(zaga1, what = "nu")) * fitted(zaga1, what = "mu")) # 1.470228

Given the great flexibility of ZA models, we also have the luxury to explore other candidates than ZAGA. For instance, if the positive part of non-negative outcomes demonstrates a high variance, we can also try a zero-inflated Inverse Gaussian (ZAIG) model, as shown below.

zaig1 <- gamlss(CLM_AMT ~ BLUEBOOK + NPOLICY, nu.formula = ~(CLM_FREQ5 + MVR_PTS + INCOME), data = df1, family = "ZAIG") # Mu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 8.205e+00 5.836e-02 140.591 < 2e-16 *** # BLUEBOOK 2.163e-05 2.976e-06 7.268 3.97e-13 *** # NPOLICY 5.898e-02 2.681e-02 2.200 0.0278 * # Nu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 1.153e+00 5.077e-02 22.72 <2e-16 *** # CLM_FREQ5 -3.028e-01 2.283e-02 -13.26 <2e-16 *** # MVR_PTS -1.509e-01 1.217e-02 -12.41 <2e-16 *** # INCOME 7.285e-06 6.269e-07 11.62 <2e-16 *** MAPE(df1$CLM_AMT, (1 - fitted(zaig1, what = "nu")) * fitted(zaig1, what = "mu")) # 1.469236

## Finer Monotonic Binning Based on Isotonic Regression

In my early post (https://statcompute.wordpress.com/2017/01/22/monotonic-binning-with-smbinning-package/), I wrote a monobin() function based on the smbinning package by Herman Jopia to improve the monotonic binning algorithm. The function works well and provides robust binning outcomes. However, there are a couple potential drawbacks due to the coarse binning. First of all, the derived Information Value for each binned variable might tend to be low. Secondly, the binned variable might not be granular enough to reflect the data nature.

In light of the aforementioned, I drafted an improved function isobin() based on the isotonic regression (https://en.wikipedia.org/wiki/Isotonic_regression), as shown below.

isobin <- function(data, y, x) { d1 <- data[c(y, x)] d2 <- d1[!is.na(d1[x]), ] c <- cor(d2[, 2], d2[, 1], method = "spearman", use = "complete.obs") reg <- isoreg(d2[, 2], c / abs(c) * d2[, 1]) k <- knots(as.stepfun(reg)) sm1 <-smbinning.custom(d1, y, x, k) c1 <- subset(sm1$ivtable, subset = CntGood * CntBad > 0, select = Cutpoint) c2 <- suppressWarnings(as.numeric(unlist(strsplit(c1$Cutpoint, " ")))) c3 <- c2[!is.na(c2)] return(smbinning.custom(d1, y, x, c3[-length(c3)])) }

Compared with the legacy monobin(), the isobin() function is able to significantly increase the binning granularity as well as moderately improve the Information Value.

**LTV Binning with isobin() Function**

Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate Odds LnOdds WoE IV 1 <= 46 81 78 3 81 78 3 0.0139 0.9630 0.0370 26.0000 3.2581 1.9021 0.0272 2 <= 71 312 284 28 393 362 31 0.0535 0.9103 0.0897 10.1429 2.3168 0.9608 0.0363 3 <= 72 22 20 2 415 382 33 0.0038 0.9091 0.0909 10.0000 2.3026 0.9466 0.0025 4 <= 73 27 24 3 442 406 36 0.0046 0.8889 0.1111 8.0000 2.0794 0.7235 0.0019 5 <= 81 303 268 35 745 674 71 0.0519 0.8845 0.1155 7.6571 2.0356 0.6797 0.0194 6 <= 83 139 122 17 884 796 88 0.0238 0.8777 0.1223 7.1765 1.9708 0.6149 0.0074 7 <= 90 631 546 85 1515 1342 173 0.1081 0.8653 0.1347 6.4235 1.8600 0.5040 0.0235 8 <= 94 529 440 89 2044 1782 262 0.0906 0.8318 0.1682 4.9438 1.5981 0.2422 0.0049 9 <= 95 145 119 26 2189 1901 288 0.0248 0.8207 0.1793 4.5769 1.5210 0.1651 0.0006 10 <= 100 907 709 198 3096 2610 486 0.1554 0.7817 0.2183 3.5808 1.2756 -0.0804 0.0010 11 <= 101 195 151 44 3291 2761 530 0.0334 0.7744 0.2256 3.4318 1.2331 -0.1229 0.0005 12 <= 110 1217 934 283 4508 3695 813 0.2085 0.7675 0.2325 3.3004 1.1940 -0.1619 0.0057 13 <= 112 208 158 50 4716 3853 863 0.0356 0.7596 0.2404 3.1600 1.1506 -0.2054 0.0016 14 <= 115 253 183 70 4969 4036 933 0.0433 0.7233 0.2767 2.6143 0.9610 -0.3950 0.0075 15 <= 136 774 548 226 5743 4584 1159 0.1326 0.7080 0.2920 2.4248 0.8857 -0.4702 0.0333 16 <= 138 27 18 9 5770 4602 1168 0.0046 0.6667 0.3333 2.0000 0.6931 -0.6628 0.0024 17 > 138 66 39 27 5836 4641 1195 0.0113 0.5909 0.4091 1.4444 0.3677 -0.9882 0.0140 18 Missing 1 0 1 5837 4641 1196 0.0002 0.0000 1.0000 0.0000 -Inf -Inf Inf 19 Total 5837 4641 1196 NA NA NA 1.0000 0.7951 0.2049 3.8804 1.3559 0.0000 0.1897

**LTV Binning with monobin() Function**

Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate Odds LnOdds WoE IV 1 <= 85 1025 916 109 1025 916 109 0.1756 0.8937 0.1063 8.4037 2.1287 0.7727 0.0821 2 <= 94 1019 866 153 2044 1782 262 0.1746 0.8499 0.1501 5.6601 1.7334 0.3775 0.0221 3 <= 100 1052 828 224 3096 2610 486 0.1802 0.7871 0.2129 3.6964 1.3074 -0.0486 0.0004 4 <= 105 808 618 190 3904 3228 676 0.1384 0.7649 0.2351 3.2526 1.1795 -0.1765 0.0045 5 <= 114 985 748 237 4889 3976 913 0.1688 0.7594 0.2406 3.1561 1.1493 -0.2066 0.0076 6 > 114 947 665 282 5836 4641 1195 0.1622 0.7022 0.2978 2.3582 0.8579 -0.4981 0.0461 7 Missing 1 0 1 5837 4641 1196 0.0002 0.0000 1.0000 0.0000 -Inf -Inf Inf 8 Total 5837 4641 1196 NA NA NA 1.0000 0.7951 0.2049 3.8804 1.3559 0.0000 0.1628

**Bureau_Score Binning with isobin() Function**

Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate Odds LnOdds WoE IV 1 <= 491 4 1 3 4 1 3 0.0007 0.2500 0.7500 0.3333 -1.0986 -2.4546 0.0056 2 <= 532 24 9 15 28 10 18 0.0041 0.3750 0.6250 0.6000 -0.5108 -1.8668 0.0198 3 <= 559 51 24 27 79 34 45 0.0087 0.4706 0.5294 0.8889 -0.1178 -1.4737 0.0256 4 <= 560 2 1 1 81 35 46 0.0003 0.5000 0.5000 1.0000 0.0000 -1.3559 0.0008 5 <= 572 34 17 17 115 52 63 0.0058 0.5000 0.5000 1.0000 0.0000 -1.3559 0.0143 6 <= 602 153 84 69 268 136 132 0.0262 0.5490 0.4510 1.2174 0.1967 -1.1592 0.0459 7 <= 605 56 31 25 324 167 157 0.0096 0.5536 0.4464 1.2400 0.2151 -1.1408 0.0162 8 <= 606 14 8 6 338 175 163 0.0024 0.5714 0.4286 1.3333 0.2877 -1.0683 0.0035 9 <= 607 17 10 7 355 185 170 0.0029 0.5882 0.4118 1.4286 0.3567 -0.9993 0.0037 10 <= 632 437 261 176 792 446 346 0.0749 0.5973 0.4027 1.4830 0.3940 -0.9619 0.0875 11 <= 639 150 95 55 942 541 401 0.0257 0.6333 0.3667 1.7273 0.5465 -0.8094 0.0207 12 <= 653 451 300 151 1393 841 552 0.0773 0.6652 0.3348 1.9868 0.6865 -0.6694 0.0412 13 <= 662 295 213 82 1688 1054 634 0.0505 0.7220 0.2780 2.5976 0.9546 -0.4014 0.0091 14 <= 665 100 77 23 1788 1131 657 0.0171 0.7700 0.2300 3.3478 1.2083 -0.1476 0.0004 15 <= 667 57 44 13 1845 1175 670 0.0098 0.7719 0.2281 3.3846 1.2192 -0.1367 0.0002 16 <= 677 381 300 81 2226 1475 751 0.0653 0.7874 0.2126 3.7037 1.3093 -0.0466 0.0001 17 <= 679 66 53 13 2292 1528 764 0.0113 0.8030 0.1970 4.0769 1.4053 0.0494 0.0000 18 <= 683 160 129 31 2452 1657 795 0.0274 0.8062 0.1938 4.1613 1.4258 0.0699 0.0001 19 <= 689 203 164 39 2655 1821 834 0.0348 0.8079 0.1921 4.2051 1.4363 0.0804 0.0002 20 <= 699 304 249 55 2959 2070 889 0.0521 0.8191 0.1809 4.5273 1.5101 0.1542 0.0012 21 <= 707 312 268 44 3271 2338 933 0.0535 0.8590 0.1410 6.0909 1.8068 0.4509 0.0094 22 <= 717 368 318 50 3639 2656 983 0.0630 0.8641 0.1359 6.3600 1.8500 0.4941 0.0132 23 <= 721 134 119 15 3773 2775 998 0.0230 0.8881 0.1119 7.9333 2.0711 0.7151 0.0094 24 <= 723 49 44 5 3822 2819 1003 0.0084 0.8980 0.1020 8.8000 2.1748 0.8188 0.0043 25 <= 739 425 394 31 4247 3213 1034 0.0728 0.9271 0.0729 12.7097 2.5424 1.1864 0.0700 26 <= 746 166 154 12 4413 3367 1046 0.0284 0.9277 0.0723 12.8333 2.5520 1.1961 0.0277 27 <= 756 234 218 16 4647 3585 1062 0.0401 0.9316 0.0684 13.6250 2.6119 1.2560 0.0422 28 <= 761 110 104 6 4757 3689 1068 0.0188 0.9455 0.0545 17.3333 2.8526 1.4967 0.0260 29 <= 763 46 44 2 4803 3733 1070 0.0079 0.9565 0.0435 22.0000 3.0910 1.7351 0.0135 30 <= 767 96 92 4 4899 3825 1074 0.0164 0.9583 0.0417 23.0000 3.1355 1.7795 0.0293 31 <= 772 77 74 3 4976 3899 1077 0.0132 0.9610 0.0390 24.6667 3.2055 1.8495 0.0249 32 <= 787 269 260 9 5245 4159 1086 0.0461 0.9665 0.0335 28.8889 3.3635 2.0075 0.0974 33 <= 794 95 93 2 5340 4252 1088 0.0163 0.9789 0.0211 46.5000 3.8395 2.4835 0.0456 34 > 794 182 179 3 5522 4431 1091 0.0312 0.9835 0.0165 59.6667 4.0888 2.7328 0.0985 35 Missing 315 210 105 5837 4641 1196 0.0540 0.6667 0.3333 2.0000 0.6931 -0.6628 0.0282 36 Total 5837 4641 1196 NA NA NA 1.0000 0.7951 0.2049 3.8804 1.3559 0.0000 0.8357

**Bureau_Score Binning with monobin() Function**

Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate Odds LnOdds WoE IV 1 <= 617 513 284 229 513 284 229 0.0879 0.5536 0.4464 1.2402 0.2153 -1.1407 0.1486 2 <= 642 515 317 198 1028 601 427 0.0882 0.6155 0.3845 1.6010 0.4706 -0.8853 0.0861 3 <= 657 512 349 163 1540 950 590 0.0877 0.6816 0.3184 2.1411 0.7613 -0.5946 0.0363 4 <= 672 487 371 116 2027 1321 706 0.0834 0.7618 0.2382 3.1983 1.1626 -0.1933 0.0033 5 <= 685 494 396 98 2521 1717 804 0.0846 0.8016 0.1984 4.0408 1.3964 0.0405 0.0001 6 <= 701 521 428 93 3042 2145 897 0.0893 0.8215 0.1785 4.6022 1.5265 0.1706 0.0025 7 <= 714 487 418 69 3529 2563 966 0.0834 0.8583 0.1417 6.0580 1.8014 0.4454 0.0144 8 <= 730 489 441 48 4018 3004 1014 0.0838 0.9018 0.0982 9.1875 2.2178 0.8619 0.0473 9 <= 751 513 476 37 4531 3480 1051 0.0879 0.9279 0.0721 12.8649 2.5545 1.1986 0.0859 10 <= 775 492 465 27 5023 3945 1078 0.0843 0.9451 0.0549 17.2222 2.8462 1.4903 0.1157 11 > 775 499 486 13 5522 4431 1091 0.0855 0.9739 0.0261 37.3846 3.6213 2.2653 0.2126 12 Missing 315 210 105 5837 4641 1196 0.0540 0.6667 0.3333 2.0000 0.6931 -0.6628 0.0282 13 Total 5837 4641 1196 NA NA NA 1.0000 0.7951 0.2049 3.8804 1.3559 0.0000 0.7810

## Estimating Conway-Maxwell-Poisson Regression in SAS

Conway-Maxwell-Poisson (CMP) regression is a flexible way to model frequency outcomes with both under-dispersion and over-dispersion. In SAS, CMP regression can be estimated with COUNTREG procedure directly or with NLMIXED procedure by specifying the likelihood function. However, the use of NLMIXED procedure is extremely cumbersome in that we need to estimate a standard Poisson regression and then use estimated parameters as initial values parameter estimates for the CMP regression.

In the example below, we will show how to employ GLIMMIX procedure to estimate a CMP regression by providing both the log likelihood function and the variance function in terms of the expected mean.

proc glimmix data = mylib.credit_count; model majordrg = age acadmos minordrg ownrent / link = log solution; _nu = 1 / exp(_phi_); _variance_ = (1 / _nu) / ((_mu_) ** (1 / _nu)); _z = 0; do i = 0 to 100; _z = _z + (_mu_ ** i) / fact(i) ** _nu; end; _prob = (_mu_ ** majordrg) / (fact(majordrg) ** _nu) * (_z ** (-1)); _logl_ = log(_prob); run;

Since the scale parameter **_phi_** is strictly above 0, the function **1 / exp(_phi_)** in the line #3 is to ensure the **Nu** parameter bounded between 0 and 1.

In addition, the DO loop is to calculate the normalization constant **Z** such that the PMF would sum up to 1. As there is no closed form for the calculation of **Z**, we need to calculate it numerically at the cost of a longer computing time.

Other implicit advantages of GLIMMIX procedure over NLMIXED procedure include the unnecessity to provide initiate values of parameter estimates and a shorter computing time.

## Modeling Generalized Poisson Regression in SAS

The Generalized Poisson (GP) regression is a very flexible statistical model for count outcomes in that it can accommodate both over-dispersion and under-dispersion, which makes it a very practical modeling approach in real-world problems and is considered a serious contender for the Quasi-Poisson regression.

Prob(Y) = Alpha / Y! * (Alpha + Xi * Y) ^ (Y – 1) * EXP(-Alpha – Xi * Y)

E(Y) = Mu = Alpha / (1 – Xi)

Var(Y) = Mu / (1 – Xi) ^ 2

While the GP regression can be conveniently estimated with HMM procedure in SAS, I’d always like to dive a little deeper into its model specification and likelihood function to have a better understanding. For instance, there is a slight difference in GP model outcomes between HMM procedure in SAS and VGAM package in R. After looking into the detail, I then realized that the difference is solely due to the different parameterization.

Basically, there are three steps for estimating a GP regression with NLMIXED procedure. Due to the complexity of GP likelihood function and its convergence process, it is always a good practice to estimate a baseline Standard Poisson regression as a starting point and then to output its parameter estimates into a table, e.g. _EST as shown below.

ods output ParameterEstimates = _est; proc genmod data = mylib.credit_count; model majordrg = age acadmos minordrg ownrent / dist = poisson link = log; run;

After acquiring parameter estimates from a Standard Poisson regression, we can use them to construct initiate values of parameter estimates for the Generalized Poisson regression. In the code snippet below, we used SQL procedure to create 2 macro variables that we are going to use in the final model estimation of GP regression.

proc sql noprint; select "_"||compress(upcase(parameter), ' ')||" = "||compress(put(estimate, 10.2), ' ') into :_parm separated by ' ' from _est; select case when upcase(parameter) = 'INTERCEPT' then "_"||compress(upcase(parameter), ' ') else "_"||compress(upcase(parameter), ' ')||" * "||compress(upcase(parameter), ' ') end into :_xb separated by ' + ' from _est where upcase(parameter) ~= 'SCALE'; quit; /* %put &_parm; _INTERCEPT = -1.38 _AGE = 0.01 _ACADMOS = 0.00 _MINORDRG = 0.46 _OWNRENT = -0.20 _SCALE = 1.00 %put &_xb; _INTERCEPT + _AGE * AGE + _ACADMOS * ACADMOS + _MINORDRG * MINORDRG + _OWNRENT * OWNRENT */

In the last step, we used the NLMIXED procedure to estimate the GP regression by specifying its log likelihood function that would generate identical model results as with HMM procedure. It is worth mentioning that the expected mean _mu = exp(x * beta) in SAS and the term exp(x * beta) refers to the _alpha parameter in R. Therefore, the intercept would be different between SAS and R, primarily due to different ways of parameterization with the identical statistical logic.

proc nlmixed data = mylib.credit_count; parms &_parm.; _xb = &_xb.; _xi = 1 - exp(-_scale); _mu = exp(_xb); _alpha = _mu * (1 - _xi); _prob = _alpha / fact(majordrg) * (_alpha + _xi * majordrg) ** (majordrg - 1) * exp(- _alpha - _xi * majordrg); ll = log(_prob); model majordrg ~ general(ll); run;

In addition to HMM and NLMIXED procedures, GLIMMIX can also be employed to estimate the GP regression, as shown below. In this case, we need to specify both the log likelihood function and the variance function in terms of the expected mean.

proc glimmix data = mylib.credit_count; model majordrg = age acadmos minordrg ownrent / link = log solution; _xi = 1 - 1 / exp(_phi_); _variance_ = _mu_ / (1 - _xi) ** 2; _alpha = _mu_ * (1 - _xi); _prob = _alpha / fact(majordrg) * (_alpha + _xi * majordrg) ** (majordrg - 1) * exp(- _alpha - _xi * majordrg); _logl_ = log(_prob); run;

## Estimate Regression with (Type-I) Pareto Response

The Type-I Pareto distribution has a probability function shown as below

f(y; a, k) = k * (a ^ k) / (y ^ (k + 1))

In the formulation, the scale parameter **0 < a < y** and the shape parameter **k > 1 **.

The positive lower bound of Type-I Pareto distribution is particularly appealing in modeling the severity measure in that there is usually a reporting threshold for operational loss events. For instance, the reporting threshold of ABA operational risk consortium data is $10,000 and any loss event below the threshold value would be not reported, which might add the complexity in the severity model estimation.

In practice, instead of modeling the severity measure directly, we might model the shifted response ** y` = severity – threshold ** to accommodate the threshold value such that the supporting domain of y` could start from 0 and that the Gamma, Inverse Gaussian, or Lognormal regression can still be applicable. However, under the distributional assumption of Type-I Pareto with a known lower end, we do not need to shift the severity measure anymore but model it directly based on the probability function.

Below is the R code snippet showing how to estimate a regression model for the Pareto response with the lower bound ** a = 2 ** by using the **VGAM** package.

library(VGAM) set.seed(2017) n <- 200 a <- 2 x <- runif(n) k <- exp(1 + 5 * x) pdata <- data.frame(y = rpareto(n = n, scale = a, shape = k), x = x) fit <- vglm(y ~ x, paretoff(scale = a), data = pdata, trace = TRUE) summary(fit) # Coefficients: # Estimate Std. Error z value Pr(>|z|) # (Intercept) 1.0322 0.1363 7.574 3.61e-14 *** # x 4.9815 0.2463 20.229 < 2e-16 *** AIC(fit) # -644.458 BIC(fit) # -637.8614

The SAS code below estimating the Type-I Pareto regression provides almost identical model estimation.

proc nlmixed data = pdata; parms b0 = 0.1 b1 = 0.1; k = exp(b0 + b1 * x); a = 2; lh = k * (a ** k) / (y ** (k + 1)); ll = log(lh); model y ~ general(ll); run; /* Fit Statistics -2 Log Likelihood -648.5 AIC (smaller is better) -644.5 AICC (smaller is better) -644.4 BIC (smaller is better) -637.9 Parameter Estimate Standard DF t Value Pr > |t| Error b0 1.0322 0.1385 200 7.45 <.0001 b1 4.9815 0.2518 200 19.78 <.0001 */

At last, it is worth pointing out that the conditional mean of Type-I Pareto response is not equal to ** exp(x * beta) ** but ** a * k / (k – 1) ** with ** k = exp(x * beta) **. Therefore, the conditional mean only exists when ** k > 1 **, which might cause numerical issues in the model estimation.

## Pregibon Test for Goodness of Link in SAS

When estimating generalized linear models for binary outcomes, we often choose the logit link function by default and seldom consider other alternatives such as probit or cloglog. The Pregibon test (Pregibon, 1980) provides a mean to check the goodness of link with a simple logic outlined below.

1. First of all, we can estimate the regression model with the hypothesized link function, e.g. logit;

2. After the model estimation, we calculate yhat and yhat ^ 2 and then estimate a secondary regression with the identical response variable Y and link function but with yhat and yhat ^ 2 as model predictors (with the intercept).

3. If the link function is correctly specified, then the t-value of yaht ^2 should be insignificant.

The SAS macro shown below is the implementation of Pregibon test in the context of logistic regressions. However, the same idea can be generalized to any GLM.

%macro pregibon(data = , y = , x = ); ***********************************************************; * SAS MACRO PERFORMING PREGIBON TEST FOR GOODNESS OF LINK *; * ======================================================= *; * INPUT PAREMETERS: *; * DATA : INPUT SAS DATA TABLE *; * Y : THE DEPENDENT VARIABLE WITH 0 / 1 VALUES *; * X : MODEL PREDICTORS *; * ======================================================= *; * AUTHOR: WENSUI.LIU@53.COM *; ***********************************************************; options mprint mlogic nocenter; %let links = logit probit cloglog; %let loop = 1; proc sql noprint; select n(&data) - 3 into :df from &data; quit; %do %while (%scan(&links, &loop) ne %str()); %let link = %scan(&links, &loop); proc logistic data = &data noprint desc; model &y = &x / link = &link; score data = &data out = _out1; run; data _out2; set _out1(rename = (p_1 = p1)); p2 = p1 * p1; run; ods listing close; ods output ParameterEstimates = _parm; proc logistic data = _out2 desc; model &y = p1 p2 / link = &link ; run; ods listing; %if &loop = 1 %then %do; data _parm1; format link $10.; set _parm(where = (variable = "p2")); link = upcase("&link"); run; %end; %else %do; data _parm1; set _parm1 _parm(where = (variable = "p2") in = new); if new then link = upcase("&link"); run; %end; data _parm2(drop = variable); set _parm1; _t = estimate / stderr; _df = &df; _p = (1 - probt(abs(_t), _df)) * 2; run; %let loop = %eval(&loop + 1); %end; title; proc report data = _last_ spacing = 1 headline nowindows split = "*"; column(" * PREGIBON TEST FOR GOODNESS OF LINK * H0: THE LINK FUNCTION IS SPECIFIED CORRECTLY * " link _t _df _p); define link / "LINK FUNCTION" width = 15 order order = data; define _t / "T-VALUE" width = 15 format = 12.4; define _df / "DF" width = 10; define _p / "P-VALUE" width = 15 format = 12.4; run; %mend;

After applying the macro to the kyphosis data (https://stat.ethz.ch/R-manual/R-devel/library/rpart/html/kyphosis.html), we can see that both logit and probit can be considered appropriate link functions in this specific case and cloglog might not be a good choice.

PREGIBON TEST FOR GOODNESS OF LINK H0: THE LINK FUNCTION IS SPECIFIED CORRECTLY LINK FUNCTION T-VALUE DF P-VALUE ----------------------------------------------------------- LOGIT -1.6825 78 0.0965 PROBIT -1.7940 78 0.0767 CLOGLOG -2.3632 78 0.0206

## Scorecard Development with Data from Multiple Sources

This week, one of my friends asked me a very interesting and practical question in the scorecard development. The model development data were collected from multiple independent sources with various data sizes, heterogeneous risk profiles and different bad rates. While the performance statistics seem satisfactory on the model training dataset, the model doesn’t generalize well with new accounts that might come from a unknown source. The situation is very common in a consulting company where a risk or marketing model is sometimes developed with the data from multiple organizations.

To better understand the issue, I simulated a dataset consisting of two groups. In the dataset, while x0 and x1 govern the group segmentation, x2 and x3 define the bad definition. It is important to point out that the group information “grp” is only known in the model development sample but is unknown in the production population. Therefore, the variable “grp”, albeit predictive, can not be explicitly used in the model estimation.

data one; do i = 1 to 100000; x0 = ranuni(0); x1 = ranuni(1); x2 = ranuni(2); x3 = ranuni(3); if 1 + x0 * 2 + x1 * 4 + rannor(1) > 5 then do; grp = 1; if x2 * 2 + x3 * 4 + rannor(2) > 5 then bad = 1; else bad = 0; end; else do; grp = 0; if x2 * 4 + x3 * 2 + rannor(3) > 4 then bad = 1; else bad = 0; end; output; end; run;

Our first approach is to use all variables x0 – x3 to build a logistic regression and then evaluate the model altogether and by groups.

proc logistic data = one desc noprint; model bad = x0 x1 x2 x3; score data = one out = mdl1 (rename = (p_1 = score1)); run; GOOD BAD SEPARATION REPORT FOR SCORE1 IN DATA MDL1 MAXIMUM KS = 59.5763 AT SCORE POINT 0.2281 ( AUC STATISTICS = 0.8800, GINI COEFFICIENT = 0.7599, DIVERGENCE = 2.6802 ) MIN MAX GOOD BAD TOTAL BAD CUMULATIVE BAD CUMU. BAD SCORE SCORE # # # RATE BAD RATE PERCENT PERCENT -------------------------------------------------------------------------------------------------------- BAD 0.6800 0.9699 2,057 7,943 10,000 79.43% 79.43% 33.81% 33.81% | 0.4679 0.6799 4,444 5,556 10,000 55.56% 67.50% 23.65% 57.46% | 0.3094 0.4679 6,133 3,867 10,000 38.67% 57.89% 16.46% 73.92% | 0.1947 0.3094 7,319 2,681 10,000 26.81% 50.12% 11.41% 85.33% | 0.1181 0.1946 8,364 1,636 10,000 16.36% 43.37% 6.96% 92.29% | 0.0690 0.1181 9,044 956 10,000 9.56% 37.73% 4.07% 96.36% | 0.0389 0.0690 9,477 523 10,000 5.23% 33.09% 2.23% 98.59% | 0.0201 0.0389 9,752 248 10,000 2.48% 29.26% 1.06% 99.64% V 0.0085 0.0201 9,925 75 10,000 0.75% 26.09% 0.32% 99.96% GOOD 0.0005 0.0085 9,991 9 10,000 0.09% 23.49% 0.04% 100.00% ========== ========== ========== ========== ========== 0.0005 0.9699 76,506 23,494 100,000 GOOD BAD SEPARATION REPORT FOR SCORE1 IN DATA MDL1(WHERE = (GRP = 0)) MAXIMUM KS = 61.0327 AT SCORE POINT 0.2457 ( AUC STATISTICS = 0.8872, GINI COEFFICIENT = 0.7744, DIVERGENCE = 2.8605 ) MIN MAX GOOD BAD TOTAL BAD CUMULATIVE BAD CUMU. BAD SCORE SCORE # # # RATE BAD RATE PERCENT PERCENT -------------------------------------------------------------------------------------------------------- BAD 0.7086 0.9699 1,051 6,162 7,213 85.43% 85.43% 30.51% 30.51% | 0.5019 0.7086 2,452 4,762 7,214 66.01% 75.72% 23.58% 54.10% | 0.3407 0.5019 3,710 3,504 7,214 48.57% 66.67% 17.35% 71.45% | 0.2195 0.3406 4,696 2,517 7,213 34.90% 58.73% 12.46% 83.91% | 0.1347 0.2195 5,650 1,564 7,214 21.68% 51.32% 7.74% 91.66% | 0.0792 0.1347 6,295 919 7,214 12.74% 44.89% 4.55% 96.21% | 0.0452 0.0792 6,737 476 7,213 6.60% 39.42% 2.36% 98.56% | 0.0234 0.0452 7,000 214 7,214 2.97% 34.86% 1.06% 99.62% V 0.0099 0.0234 7,150 64 7,214 0.89% 31.09% 0.32% 99.94% GOOD 0.0007 0.0099 7,201 12 7,213 0.17% 27.99% 0.06% 100.00% ========== ========== ========== ========== ========== 0.0007 0.9699 51,942 20,194 72,136 GOOD BAD SEPARATION REPORT FOR SCORE1 IN DATA MDL1(WHERE = (GRP = 1)) MAXIMUM KS = 53.0942 AT SCORE POINT 0.2290 ( AUC STATISTICS = 0.8486, GINI COEFFICIENT = 0.6973, DIVERGENCE = 2.0251 ) MIN MAX GOOD BAD TOTAL BAD CUMULATIVE BAD CUMU. BAD SCORE SCORE # # # RATE BAD RATE PERCENT PERCENT -------------------------------------------------------------------------------------------------------- BAD 0.5863 0.9413 1,351 1,435 2,786 51.51% 51.51% 43.48% 43.48% | 0.3713 0.5862 2,136 651 2,787 23.36% 37.43% 19.73% 63.21% | 0.2299 0.3712 2,340 446 2,786 16.01% 30.29% 13.52% 76.73% | 0.1419 0.2298 2,525 262 2,787 9.40% 25.07% 7.94% 84.67% | 0.0832 0.1419 2,584 202 2,786 7.25% 21.50% 6.12% 90.79% | 0.0480 0.0832 2,643 144 2,787 5.17% 18.78% 4.36% 95.15% | 0.0270 0.0480 2,682 104 2,786 3.73% 16.63% 3.15% 98.30% | 0.0140 0.0270 2,741 46 2,787 1.65% 14.76% 1.39% 99.70% V 0.0058 0.0140 2,776 10 2,786 0.36% 13.16% 0.30% 100.00% GOOD 0.0005 0.0058 2,786 0 2,786 0.00% 11.84% 0.00% 100.00% ========== ========== ========== ========== ========== 0.0005 0.9413 24,564 3,300 27,864

As shown in the above output, while the overall model performance looks ok, it doesn’t generalize well in the dataset from the 2nd group with a smaller size. While the overall KS could be as high as 60, the KS for the 2nd group is merely 53. The reason is that the overall model performance is heavily influenced by the dataset from the 1st group with the larger size. Therefore, the estimated model is biased toward the risk profile reflected in the 1st group.

To alleviate the bias in the first model, we could first introduce a look-alike model driven by x0 – x1 with the purpose to profile the group and then build two separate risk models with x2 – x3 only for 1st and 2nd groups respectively. As a result, the final predicted probability should be the composite of all three sub-models, as shown below. The model evaluation is also provided to compared with the first model.

proc logistic data = one desc noprint; where grp = 0; model bad = x2 x3; score data = one out = mdl20(rename = (p_1 = p_10)); run; proc logistic data = one desc noprint; where grp = 1; model bad = x2 x3; score data = one out = mdl21(rename = (p_1 = p_11)); run; proc logistic data = one desc noprint; model grp = x0 x1; score data = one out = seg; run; data mdl2; merge seg mdl20 mdl21; by i; score2 = p_10 * (1 - p_1) + p_11 * p_1; run; GOOD BAD SEPARATION REPORT FOR SCORE2 IN DATA MDL2 MAXIMUM KS = 60.6234 AT SCORE POINT 0.2469 ( AUC STATISTICS = 0.8858, GINI COEFFICIENT = 0.7715, DIVERGENCE = 2.8434 ) MIN MAX GOOD BAD TOTAL BAD CUMULATIVE BAD CUMU. BAD SCORE SCORE # # # RATE BAD RATE PERCENT PERCENT -------------------------------------------------------------------------------------------------------- BAD 0.6877 0.9677 2,011 7,989 10,000 79.89% 79.89% 34.00% 34.00% | 0.4749 0.6876 4,300 5,700 10,000 57.00% 68.45% 24.26% 58.27% | 0.3125 0.4748 6,036 3,964 10,000 39.64% 58.84% 16.87% 75.14% | 0.1932 0.3124 7,451 2,549 10,000 25.49% 50.51% 10.85% 85.99% | 0.1142 0.1932 8,379 1,621 10,000 16.21% 43.65% 6.90% 92.89% | 0.0646 0.1142 9,055 945 10,000 9.45% 37.95% 4.02% 96.91% | 0.0345 0.0646 9,533 467 10,000 4.67% 33.19% 1.99% 98.90% | 0.0166 0.0345 9,800 200 10,000 2.00% 29.29% 0.85% 99.75% V 0.0062 0.0166 9,946 54 10,000 0.54% 26.10% 0.23% 99.98% GOOD 0.0001 0.0062 9,995 5 10,000 0.05% 23.49% 0.02% 100.00% ========== ========== ========== ========== ========== 0.0001 0.9677 76,506 23,494 100,000 GOOD BAD SEPARATION REPORT FOR SCORE2 IN DATA MDL2(WHERE = (GRP = 0)) MAXIMUM KS = 61.1591 AT SCORE POINT 0.2458 ( AUC STATISTICS = 0.8880, GINI COEFFICIENT = 0.7759, DIVERGENCE = 2.9130 ) MIN MAX GOOD BAD TOTAL BAD CUMULATIVE BAD CUMU. BAD SCORE SCORE # # # RATE BAD RATE PERCENT PERCENT -------------------------------------------------------------------------------------------------------- BAD 0.7221 0.9677 1,075 6,138 7,213 85.10% 85.10% 30.40% 30.40% | 0.5208 0.7221 2,436 4,778 7,214 66.23% 75.66% 23.66% 54.06% | 0.3533 0.5208 3,670 3,544 7,214 49.13% 66.82% 17.55% 71.61% | 0.2219 0.3532 4,726 2,487 7,213 34.48% 58.73% 12.32% 83.92% | 0.1309 0.2219 5,617 1,597 7,214 22.14% 51.41% 7.91% 91.83% | 0.0731 0.1309 6,294 920 7,214 12.75% 44.97% 4.56% 96.39% | 0.0387 0.0731 6,762 451 7,213 6.25% 39.44% 2.23% 98.62% | 0.0189 0.0387 7,009 205 7,214 2.84% 34.86% 1.02% 99.63% V 0.0074 0.0189 7,152 62 7,214 0.86% 31.09% 0.31% 99.94% GOOD 0.0002 0.0073 7,201 12 7,213 0.17% 27.99% 0.06% 100.00% ========== ========== ========== ========== ========== 0.0002 0.9677 51,942 20,194 72,136 GOOD BAD SEPARATION REPORT FOR SCORE2 IN DATA MDL2(WHERE = (GRP = 1)) MAXIMUM KS = 57.6788 AT SCORE POINT 0.1979 ( AUC STATISTICS = 0.8717, GINI COEFFICIENT = 0.7434, DIVERGENCE = 2.4317 ) MIN MAX GOOD BAD TOTAL BAD CUMULATIVE BAD CUMU. BAD SCORE SCORE # # # RATE BAD RATE PERCENT PERCENT -------------------------------------------------------------------------------------------------------- BAD 0.5559 0.9553 1,343 1,443 2,786 51.79% 51.79% 43.73% 43.73% | 0.3528 0.5559 2,001 786 2,787 28.20% 40.00% 23.82% 67.55% | 0.2213 0.3528 2,364 422 2,786 15.15% 31.71% 12.79% 80.33% | 0.1372 0.2213 2,513 274 2,787 9.83% 26.24% 8.30% 88.64% | 0.0840 0.1372 2,588 198 2,786 7.11% 22.42% 6.00% 94.64% | 0.0484 0.0840 2,683 104 2,787 3.73% 19.30% 3.15% 97.79% | 0.0256 0.0483 2,729 57 2,786 2.05% 16.84% 1.73% 99.52% | 0.0118 0.0256 2,776 11 2,787 0.39% 14.78% 0.33% 99.85% V 0.0040 0.0118 2,781 5 2,786 0.18% 13.16% 0.15% 100.00% GOOD 0.0001 0.0040 2,786 0 2,786 0.00% 11.84% 0.00% 100.00% ========== ========== ========== ========== ========== 0.0001 0.9553 24,564 3,300 27,864

After comparing KS statistics from two modeling approaches, we can see that, while the performance from the 2nd approach on the overall sample is only slightly better than the one from the 1st approach, the KS on the 2nd group with a smaller size, e.g. grp = 1, increases from 53 upto 58 by 8.6%. While the example is just for two groups, it is trivial to generalize in cases with more than two groups.