## Archive for **September 2017**

## Monotonic WoE Binning for LGD Models

While the monotonic binning algorithm has been widely used in scorecard and PD model (Probability of Default) developments, the similar idea can be generalized to LGD (Loss Given Default) models. In the post below, two SAS macros performing the monotonic binning for LGD are demonstrated.

The first one tends to generate relatively coarse bins based on iterative grouping, which requires a longer computing time.

%macro lgd_bin1(data = , y = , x = ); %let maxbin = 20; data _tmp1 (keep = x y); set &data; y = min(1, max(0, &y)); x = &x; run; proc sql noprint; select count(distinct x) into :xflg from _last_; quit; %let nbin = %sysfunc(min(&maxbin, &xflg)); %if &nbin > 2 %then %do; %do j = &nbin %to 2 %by -1; proc rank data = _tmp1 groups = &j out = _data_ (keep = x rank y); var x; ranks rank; run; proc summary data = _last_ nway; class rank; output out = _tmp2 (drop = _type_ rename = (_freq_ = freq)) sum(y) = bads mean(y) = bad_rate min(x) = minx max(x) = maxx; run; proc sql noprint; select case when min(bad_rate) > 0 then 1 else 0 end into :minflg from _tmp2; select case when max(bad_rate) < 1 then 1 else 0 end into :maxflg from _tmp2; quit; %if &minflg = 1 & &maxflg = 1 %then %do; proc corr data = _tmp2 spearman noprint outs = _corr; var minx; with bad_rate; run; proc sql noprint; select case when abs(minx) = 1 then 1 else 0 end into :cor from _corr where _type_ = 'CORR'; quit; %if &cor = 1 %then %goto loopout; %end; %end; %end; %loopout: proc sql noprint; create table _tmp3 as select a.rank + 1 as bin, a.minx as minx, a.maxx as maxx, a.freq as freq, a.freq / b.freq as dist, a.bad_rate as avg_lgd, a.bads / b.bads as bpct, (a.freq - a.bads) / (b.freq - b.bads) as gpct, log(calculated bpct / calculated gpct) as woe, (calculated bpct - calculated gpct) / calculated woe as iv from _tmp2 as a, (select sum(freq) as freq, sum(bads) as bads from _tmp2) as b; quit; proc print data = _last_ noobs label; var minx maxx freq dist avg_lgd woe; format freq comma8. dist percent10.2; label minx = "Lower Limit" maxx = "Upper Limit" freq = "Freq" dist = "Dist" avg_lgd = "Average LGD" woe = "WoE"; sum freq dist; run; %mend lgd_bin1;

The second one can generate much finer bins based on the idea of isotonic regressions and is more computationally efficient.

%macro lgd_bin2(data = , y = , x = ); data _data_ (keep = x y); set &data; y = min(1, max(0, &y)); x = &x; run; proc transreg data = _last_ noprint; model identity(y) = monotone(x); output out = _tmp1 tip = _t; run; proc summary data = _last_ nway; class _tx; output out = _data_ (drop = _freq_ _type_) mean(y) = lgd; run; proc sort data = _last_; by lgd; run; data _tmp2; set _last_; by lgd; _idx = _n_; if lgd = 0 then _idx = _idx + 1; if lgd = 1 then _idx = _idx - 1; run; proc sql noprint; create table _tmp3 as select a.*, b._idx from _tmp1 as a inner join _tmp2 as b on a._tx = b._tx; create table _tmp4 as select min(a.x) as minx, max(a.x) as maxx, sum(a.y) as bads, count(a.y) as freq, count(a.y) / b.freq as dist, mean(a.y) as avg_lgd, sum(a.y) / b.bads as bpct, sum(1 - a.y) / (b.freq - b.bads) as gpct, log(calculated bpct / calculated gpct) as woe, (calculated bpct - calculated gpct) * calculated woe as iv from _tmp3 as a, (select count(*) as freq, sum(y) as bads from _tmp3) as b group by a._idx; quit; proc print data = _last_ noobs label; var minx maxx freq dist avg_lgd woe; format freq comma8. dist percent10.2; label minx = "Lower Limit" maxx = "Upper Limit" freq = "Freq" dist = "Dist" avg_lgd = "Average LGD" woe = "WoE"; sum freq dist; run; %mend lgd_bin2;

Below is the output comparison between two macros with the testing data downloaded from http://www.creditriskanalytics.net/datasets-private.html. Should you have any feedback, please feel free to leave me a message.

## Granular Monotonic Binning in SAS

In the post (https://statcompute.wordpress.com/2017/06/15/finer-monotonic-binning-based-on-isotonic-regression), it is shown how to do a finer monotonic binning with isotonic regression in R.

Below is a SAS macro implementing the monotonic binning with the same idea of isotonic regression. This macro is more efficient than the one shown in (https://statcompute.wordpress.com/2012/06/10/a-sas-macro-implementing-monotonic-woe-transformation-in-scorecard-development) without iterative binning and is also able to significantly increase the binning granularity.

%macro monobin(data = , y = , x = ); options mprint mlogic; data _data_ (keep = _x _y); set &data; where &y in (0, 1) and &x ~= .; _y = &y; _x = &x; run; proc transreg data = _last_ noprint; model identity(_y) = monotone(_x); output out = _tmp1 tip = _t; run; proc summary data = _last_ nway; class _t_x; output out = _data_ (drop = _freq_ _type_) mean(_y) = _rate; run; proc sort data = _last_; by _rate; run; data _tmp2; set _last_; by _rate; _idx = _n_; if _rate = 0 then _idx = _idx + 1; if _rate = 1 then _idx = _idx - 1; run; proc sql noprint; create table _tmp3 as select a.*, b._idx from _tmp1 as a inner join _tmp2 as b on a._t_x = b._t_x; create table _tmp4 as select a._idx, min(a._x) as _min_x, max(a._x) as _max_x, sum(a._y) as _bads, count(a._y) as _freq, mean(a._y) as _rate, sum(a._y) / b.bads as _bpct, sum(1 - a._y) / (b.freq - b.bads) as _gpct, log(calculated _bpct / calculated _gpct) as _woe, (calculated _bpct - calculated _gpct) * calculated _woe as _iv from _tmp3 as a, (select count(*) as freq, sum(_y) as bads from _tmp3) as b group by a._idx; quit; title "Monotonic WoE Binning for %upcase(%trim(&x))"; proc print data = _last_ label noobs; var _min_x _max_x _bads _freq _rate _woe _iv; label _min_x = "Lower" _max_x = "Upper" _bads = "#Bads" _freq = "#Freq" _rate = "BadRate" _woe = "WoE" _iv = "IV"; sum _bads _freq _iv; run; title; %mend monobin;

Below is the sample output for LTV, showing an identical binning scheme to the one generated by the R isobin() function.

## Model Non-Negative Numeric Outcomes with Zeros

As mentioned in the previous post (https://statcompute.wordpress.com/2017/06/29/model-operational-loss-directly-with-tweedie-glm/), we often need to model non-negative numeric outcomes with zeros in the operational loss model development. Tweedie GLM provides a convenient interface to model non-negative losses directly by assuming that aggregated losses are the Poisson sum of Gamma outcomes, which however might not be well supported empirically from the data generation standpoint.

In examples below, we demonstrated another flexible option, namely Zero-Adjusted (ZA) models, in both scenarios of modeling non-negative numeric outcomes, one with a small number of zeros and the other with a large number of zeros. The basic idea of ZA models is very intuitive and similar to the concept of Hurdle models for count outcomes. In a nutshell, non-negative numeric outcomes can be considered two data generation processes, one for point-mass at zeros and the other governed by a statistical distribution for positive outcomes. The latter could be either Gamma or Inverse Gaussian.

First of all, we sampled down an auto-claim data in a way that only 10 claims are zeros and the rest are all positive. While 10 is an arbitrary choice in the example, other small numbers should show similar results.

pkgs <- list("cplm", "gamlss", "MLmetrics") lapply(pkgs, require, character.only = T) data(AutoClaim, package = "cplm") df1 <- na.omit(AutoClaim) # SMALL NUMBER OF ZEROS set.seed(2017) smp <- sample(seq(nrow(df1[df1$CLM_AMT == 0, ])), size = 10, replace = FALSE) df2 <- rbind(df1[df1$CLM_AMT > 0, ], df1[df1$CLM_AMT == 0, ][smp, ])

Next, we applied both Tweedie and zero-adjusted Gamma (ZAGA) models to the data with only 10 zero outcomes. It is worth mentioning that ZAGA doesn’t have to be overly complex in this case. As shown below, while we estimated the Gamma Mu parameter with model attributes, the Nu parameter to separate zeros is just a constant with the intercept = -5.4. Both Tweedie and GAZA models gave very similar estimated parameters and predictive measures with MAPE = 0.61.

tw <- cpglm(CLM_AMT ~ BLUEBOOK + NPOLICY, data = df2) # Estimate Std. Error t value Pr(>|t|) # (Intercept) 8.194e+00 7.234e-02 113.277 < 2e-16 *** # BLUEBOOK 2.047e-05 3.068e-06 6.671 3.21e-11 *** # NPOLICY 7.274e-02 3.102e-02 2.345 0.0191 * MAPE(df2$CLM_AMT, fitted(tw)) # 0.6053669 zaga0 <- gamlss(CLM_AMT ~ BLUEBOOK + NPOLICY, data = df2, family = "ZAGA") # Mu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 8.203e+00 4.671e-02 175.629 < 2e-16 *** # BLUEBOOK 2.053e-05 2.090e-06 9.821 < 2e-16 *** # NPOLICY 6.948e-02 2.057e-02 3.377 0.000746 *** # Nu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) -5.3886 0.3169 -17 <2e-16 *** MAPE(df2$CLM_AMT, (1 - fitted(zaga0, what = "nu")) * fitted(zaga0, what = "mu")) # 0.6053314

In the next case, we used the full data with a large number of zeros in the response and then applied both Tweedie and ZAGA models again. However, in ZAGA model, we estimated two sub-models this time, one for the Nu parameter to separate zeros from non-zeros and the other for the Mu parameter to model non-zero outcomes. As shown below, ZAGA outperformed Tweedie in terms of MAPE due to the advantage that ZAGA is able to explain two data generation schemes separately with different model attributes, which is the capability beyond what Tweedie can provide.

# LARGE NUMBER OF ZEROS tw <- cpglm(CLM_AMT ~ BLUEBOOK + NPOLICY + CLM_FREQ5 + MVR_PTS + INCOME, data = df1) # Estimate Std. Error t value Pr(>|t|) # (Intercept) 6.854e+00 1.067e-01 64.241 < 2e-16 *** # BLUEBOOK 1.332e-05 4.495e-06 2.963 0.00305 ** # NPOLICY 4.380e-02 3.664e-02 1.195 0.23196 # CLM_FREQ5 2.064e-01 2.937e-02 7.026 2.29e-12 *** # MVR_PTS 1.066e-01 1.510e-02 7.063 1.76e-12 *** # INCOME -4.606e-06 8.612e-07 -5.348 9.12e-08 *** MAPE(df1$CLM_AMT, fitted(tw)) # 1.484484 zaga1 <- gamlss(CLM_AMT ~ BLUEBOOK + NPOLICY, nu.formula = ~(CLM_FREQ5 + MVR_PTS + INCOME), data = df1, family = "ZAGA") # Mu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 8.203e+00 4.682e-02 175.218 < 2e-16 *** # BLUEBOOK 2.053e-05 2.091e-06 9.816 < 2e-16 *** # NPOLICY 6.948e-02 2.067e-02 3.362 0.000778 *** # Nu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 1.153e+00 5.077e-02 22.72 <2e-16 *** # CLM_FREQ5 -3.028e-01 2.283e-02 -13.26 <2e-16 *** # MVR_PTS -1.509e-01 1.217e-02 -12.41 <2e-16 *** # INCOME 7.285e-06 6.269e-07 11.62 <2e-16 *** MAPE(df1$CLM_AMT, (1 - fitted(zaga1, what = "nu")) * fitted(zaga1, what = "mu")) # 1.470228

Given the great flexibility of ZA models, we also have the luxury to explore other candidates than ZAGA. For instance, if the positive part of non-negative outcomes demonstrates a high variance, we can also try a zero-inflated Inverse Gaussian (ZAIG) model, as shown below.

zaig1 <- gamlss(CLM_AMT ~ BLUEBOOK + NPOLICY, nu.formula = ~(CLM_FREQ5 + MVR_PTS + INCOME), data = df1, family = "ZAIG") # Mu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 8.205e+00 5.836e-02 140.591 < 2e-16 *** # BLUEBOOK 2.163e-05 2.976e-06 7.268 3.97e-13 *** # NPOLICY 5.898e-02 2.681e-02 2.200 0.0278 * # Nu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 1.153e+00 5.077e-02 22.72 <2e-16 *** # CLM_FREQ5 -3.028e-01 2.283e-02 -13.26 <2e-16 *** # MVR_PTS -1.509e-01 1.217e-02 -12.41 <2e-16 *** # INCOME 7.285e-06 6.269e-07 11.62 <2e-16 *** MAPE(df1$CLM_AMT, (1 - fitted(zaig1, what = "nu")) * fitted(zaig1, what = "mu")) # 1.469236

## Variable Selection with Elastic Net

LASSO has been a popular algorithm for the variable selection and extremely effective with high-dimension data. However, it often tends to “over-regularize” a model that might be overly compact and therefore under-predictive.

The Elastic Net addresses the aforementioned “over-regularization” by balancing between LASSO and ridge penalties. In particular, a hyper-parameter, namely Alpha, would be used to regularize the model such that the model would become a LASSO in case of Alpha = 1 and a ridge in case of Alpha = 0. In practice, Alpha can be tuned easily by the cross-validation. Below is a demonstration of Elastic Net with R glmnet package and its comparison with LASSO and ridge models.

pkgs <- list("glmnet", "doParallel", "foreach", "pROC") lapply(pkgs, require, character.only = T) registerDoParallel(cores = 4) df1 <- read.csv("Downloads/credit_count.txt") df2 <- df1[df1$CARDHLDR == 1, ] set.seed(2017) n <- nrow(df2) sample <- sample(seq(n), size = n * 0.5, replace = FALSE) train <- df2[sample, -1] test <- df2[-sample, -1] mdlY <- as.factor(as.matrix(train["DEFAULT"])) mdlX <- as.matrix(train[setdiff(colnames(df1), c("CARDHLDR", "DEFAULT"))]) newY <- as.factor(as.matrix(test["DEFAULT"])) newX <- as.matrix(test[setdiff(colnames(df1), c("CARDHLDR", "DEFAULT"))])

First of all, we estimates a LASSO model with Alpha = 1. The function cv.glmnet() is used to search for a regularization parameter, namely Lambda, that controls the penalty strength. As shown below, the model only identifies 2 attributes out of total 12.

# LASSO WITH ALPHA = 1 cv1 <- cv.glmnet(mdlX, mdlY, family = "binomial", nfold = 10, type.measure = "deviance", paralle = TRUE, alpha = 1) md1 <- glmnet(mdlX, mdlY, family = "binomial", lambda = cv1$lambda.1se, alpha = 1) coef(md1) #(Intercept) -1.963030e+00 #AGE . #ACADMOS . #ADEPCNT . #MAJORDRG . #MINORDRG . #OWNRENT . #INCOME -5.845981e-05 #SELFEMPL . #INCPER . #EXP_INC . #SPENDING . #LOGSPEND -4.015902e-02 roc(newY, as.numeric(predict(md1, newX, type = "response"))) #Area under the curve: 0.636

We next estimates a ridge model as below by setting Alpha = 0. Similarly, Lambda is searched by the cross-validation. Since the ridge penalty would only regularize the magnitude of each coefficient, we end up with a “full” model with all model attributes. The model performance is slightly better with 10 more variables, which is a debatable outcome.

# RIDGE WITH ALPHA = 0 cv2 <- cv.glmnet(mdlX, mdlY, family = "binomial", nfold = 10, type.measure = "deviance", paralle = TRUE, alpha = 0) md2 <- glmnet(mdlX, mdlY, family = "binomial", lambda = cv2$lambda.1se, alpha = 0) coef(md2) #(Intercept) -2.221016e+00 #AGE -4.184422e-04 #ACADMOS -3.085096e-05 #ADEPCNT 1.485114e-04 #MAJORDRG 6.684849e-03 #MINORDRG 1.006660e-03 #OWNRENT -9.082750e-03 #INCOME -6.960253e-06 #SELFEMPL 3.610381e-03 #INCPER -3.881890e-07 #EXP_INC -1.416971e-02 #SPENDING -1.638184e-05 #LOGSPEND -6.213884e-03 roc(newY, as.numeric(predict(md2, newX, type = "response"))) #Area under the curve: 0.6435

At last, we use the Elastic Net by tuning the value of Alpha through a line search with the parallelism. In this particular case, Alpha = 0.3 is chosen through the cross-validation. As shown below, 6 variables are used in the model that even performs better than the ridge model with all 12 attributes.

# ELASTIC NET WITH 0 < ALPHA < 1 a <- seq(0.1, 0.9, 0.05) search <- foreach(i = a, .combine = rbind) %dopar% { cv <- cv.glmnet(mdlX, mdlY, family = "binomial", nfold = 10, type.measure = "deviance", paralle = TRUE, alpha = i) data.frame(cvm = cv$cvm[cv$lambda == cv$lambda.1se], lambda.1se = cv$lambda.1se, alpha = i) } cv3 <- search[search$cvm == min(search$cvm), ] md3 <- glmnet(mdlX, mdlY, family = "binomial", lambda = cv3$lambda.1se, alpha = cv3$alpha) coef(md3) #(Intercept) -1.434700e+00 #AGE -8.426525e-04 #ACADMOS . #ADEPCNT . #MAJORDRG 6.276924e-02 #MINORDRG . #OWNRENT -2.780958e-02 #INCOME -1.305118e-04 #SELFEMPL . #INCPER -2.085349e-06 #EXP_INC . #SPENDING . #LOGSPEND -9.992808e-02 roc(newY, as.numeric(predict(md3, newX, type = "response"))) #Area under the curve: 0.6449