## Archive for the ‘**Credit Risk**’ Category

## Additional Thoughts on Estimating LGD with Proportional Odds Model

In my previous post (https://statcompute.wordpress.com/2018/01/28/modeling-lgd-with-proportional-odds-model), I’ve discussed how to use Proportional Odds Models in the LGD model development. In particular, I specifically mentioned that we would estimate a sub-model, which can be Gamma or Simplex regression, to project the conditional mean for LGD values in the (0, 1) range. However, it is worth pointing out that, if we would define a finer LGD segmentation, the necessity of this sub-model is completely optional. A standalone Proportional Odds Model without any sub-model is more than sufficient to serve the purpose of stress testing, e.g. CCAR.

In the example below, I will define 5 categories based upon LGD values in the [0, 1] range, estimate a Proportional Odds Model as usual, and then demonstrate how to apply the model outcome in the setting of stress testing with the stressed model input, e.g. LTV.

First of all, I defined 5 instead of 3 categories for LGD values, as shown below. Nonetheless, we could use a even finer category definition in practice to achieve a more accurate outcome.

df <- read.csv("lgd.csv") df$lgd <- round(1 - df$Recovery_rate, 4) l1 <- c(-Inf, 0, 0.0999, 0.4999, 0.9999, Inf) l2 <- c("A", "B", "C", "D", "E") df$lgd_cat <- cut(df$lgd, breaks = l1, labels = l2, ordered_result = T) summary(df$lgd_cat) m1 <- ordinal::clm(lgd_cat ~ LTV, data = df) #Coefficients: # Estimate Std. Error z value Pr(>|z|) #LTV 2.3841 0.1083 22.02 <2e-16 *** # #Threshold coefficients: # Estimate Std. Error z value #A|B 0.54082 0.07897 6.848 #B|C 2.12270 0.08894 23.866 #C|D 3.18098 0.10161 31.307 #D|E 4.80338 0.13174 36.460

After the model estimation, it is straightforward to calculate the probability of each LGD category. The only question remained is how to calculate the LGD projection for each individual account as well as for the whole portfolio. In order to calculate the LGD projection, we need two factors, namely the probability and the expected mean of each LGD category, such that

Estimated_LGD = SUM_i [Prob(category i) * LGD_Mean(category i)], where i = A, B, C, D, and E in this particular case.

The calculation is shown below with the estimated LGD = 0.23 that is consistent with the actual LGD = 0.23 for the whole portfolio.

prob_A <- exp(df$LTV * (-m1$beta) + m1$Theta[1]) / (1 + exp(df$LTV * (-m1$beta) + m1$Theta[1])) prob_B <- exp(df$LTV * (-m1$beta) + m1$Theta[2]) / (1 + exp(df$LTV * (-m1$beta) + m1$Theta[2])) - prob_A prob_C <- exp(df$LTV * (-m1$beta) + m1$Theta[3]) / (1 + exp(df$LTV * (-m1$beta) + m1$Theta[3])) - prob_A - prob_B prob_D <- exp(df$LTV * (-m1$beta) + m1$Theta[4]) / (1 + exp(df$LTV * (-m1$beta) + m1$Theta[4])) - prob_A - prob_B - prob_C prob_E <- 1 - exp(df$LTV * (-m1$beta) + m1$Theta[4]) / (1 + exp(df$LTV * (-m1$beta) + m1$Theta[4])) pred <- data.frame(prob_A, prob_B, prob_C, prob_D, prob_E) sum(apply(pred, 2, mean) * aggregate(df['lgd'], df['lgd_cat'], mean)[2]) #[1] 0.2262811

One might be wondering how to apply the model outcome with simple averages in stress testing that the model input is stressed, e.g. more severe, and might be also concerned about the lack of model sensitivity. In the demonstration below, let’s stress the model input LTV by 50% and then evaluate the stressed LGD.

df$LTV_ST <- df$LTV * 1.5 prob_A <- exp(df$LTV_ST * (-m1$beta) + m1$Theta[1]) / (1 + exp(df$LTV_ST * (-m1$beta) + m1$Theta[1])) prob_B <- exp(df$LTV_ST * (-m1$beta) + m1$Theta[2]) / (1 + exp(df$LTV_ST * (-m1$beta) + m1$Theta[2])) - prob_A prob_C <- exp(df$LTV_ST * (-m1$beta) + m1$Theta[3]) / (1 + exp(df$LTV_ST * (-m1$beta) + m1$Theta[3])) - prob_A - prob_B prob_D <- exp(df$LTV_ST * (-m1$beta) + m1$Theta[4]) / (1 + exp(df$LTV_ST * (-m1$beta) + m1$Theta[4])) - prob_A - prob_B - prob_C prob_E <- 1 - exp(df$LTV_ST * (-m1$beta) + m1$Theta[4]) / (1 + exp(df$LTV_ST * (-m1$beta) + m1$Theta[4])) pred_ST <- data.frame(prob_A, prob_B, prob_C, prob_D, prob_E) sum(apply(pred_ST, 2, mean) * aggregate(df['lgd'], df['lgd_cat'], mean)[2]) #[1] 0.3600153

As shown above, although we only use a simple averages as the expected mean for each LGD category, the overall LGD still increases by ~60%. The reason is that, with the more stressed model input, the Proportional Odds Model is able to push more accounts into categories with higher LGD. For instance, the output below shows that, if LTV is stressed by 50% overall, ~146% more accounts would roll into the most severe LGD category without any recovery.

apply(pred_ST, 2, mean) / apply(pred, 2, mean) # prob_A prob_B prob_C prob_D prob_E #0.6715374 0.7980619 1.0405573 1.4825803 2.4639293

## Estimating Parameters of A Hyper-Poisson Distribution in SAS

Similar to COM-Poisson, Double-Poisson, and Generalized Poisson distributions discussed in my previous post (https://statcompute.wordpress.com/2016/11/27/more-about-flexible-frequency-models/), the Hyper-Poisson distribution is another extension of the standard Poisson and is able to accommodate both under-dispersion and over-dispersion that are common in real-world problems. Given the complexity of parameterization and computation, the Hyper-Poisson is somewhat under-investigated. To the best of my knowledge, there is no off-shelf computing routine in SAS for the Hyper-Poisson distribution and only a R function available in http://www4.ujaen.es/~ajsaez/hp.fit.r written by A.J. Sáez-Castillo and A. Conde-Sánchez (2013).

The SAS code presented below is the starting point of my attempt on the Hyper-Poisson and its potential applications. The purpose is to replicate the calculation result shown in the Table 6 of “On the Hyper-Poisson Distribution and its Generalization with Applications” by Bayo H. Lawal (2017) (http://www.journalrepository.org/media/journals/BJMCS_6/2017/Mar/Lawal2132017BJMCS32184.pdf). As a result, the parameterization employed in my SAS code will closely follow Bayo H. Lawal (2017) instead of A.J. Sáez-Castillo and A. Conde-Sánchez (2013).

data d1; input y n @@; datalines; 0 121 1 85 2 19 3 1 4 0 5 0 6 1 ; run; data df; set d1; where n > 0; do i = 1 to n; output; end; run; proc nlmixed data = df; parms lambda = 1 beta = 1; theta = 1; do k = 1 to 100; theta = theta + gamma(beta) * (lambda ** k) / gamma(beta + k); end; prob = (gamma(beta) / gamma(beta + y)) * ((lambda ** y) / theta); ll = log(prob); model y ~ general(ll); run; /* Standard Parameter Estimate Error DF t Value Pr > |t| Alpha lambda 0.3752 0.1178 227 3.19 0.0016 0.05 beta 0.5552 0.2266 227 2.45 0.0150 0.05 */

As shown, the estimated Lambda = 0.3752 and the estimated Beta = 0.5552 are identical to what is presented in the paper. The next step is be to explore applications in the frequency modeling as well as its value in business cases.

## Modeling LGD with Proportional Odds Model

The LGD model is an important component in the expected loss calculation. In https://statcompute.wordpress.com/2015/11/01/quasi-binomial-model-in-sas, I discussed how to model LGD with the quasi-binomial regression that is simple and makes no distributional assumption.

In the real-world LGD data, we usually would observe 3 ordered categories of values, including 0, 1, and in-betweens. In cases with a nontrivial number of 0 and 1 values, the ordered logit model, which is also known as Proportional Odds model, can be applicable. In the demonstration below, I will show how we can potentially use the proportional odds model in the LGD model development.

First of all, we need to categorize all numeric LGD values into three ordinal categories. As shown below, there are more than 30% of 0 and 1 values.

df <- read.csv("lgd.csv") df$lgd <- round(1 - df$Recovery_rate, 4) df$lgd_cat <- cut(df$lgd, breaks = c(-Inf, 0, 0.9999, Inf), labels = c("L", "M", "H"), ordered_result = T) summary(df$lgd_cat) # L M H # 730 1672 143

The estimation of a proportional odds model is straightforward with clm() in the ordinal package or polr() in the MASS package. As demonstrated below, in addition to the coefficient for LTV, there are 2 intercepts to differentiate 3 categories.

m1 <- ordinal::clm(lgd_cat ~ LTV, data = df) summary(m1) #Coefficients: # Estimate Std. Error z value Pr(>|z|) #LTV 2.0777 0.1267 16.4 <2e-16 *** #--- #Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 # #Threshold coefficients: # Estimate Std. Error z value #L|M 0.38134 0.08676 4.396 #M|H 4.50145 0.14427 31.201

It is important to point out that, in a proportional odds model, it is the cumulative probability that is derived from the linear combination of model variables. For instance, the cumulative probability of LGD belonging to L or M is formulated as

Prob(LGD <= M) = Exp(4.50 – 2.08 * LTV) / (1 + Exp(4.50 – 2.08 * LTV))

Likewise, we would have

Prob(LGD <= L) = Exp(0.38 – 2.08 * LTV) / (1 + Exp(0.38 – 2.08 * LTV))

With above cumulative probabilities, then we can calculate the probability of each category as below.

Prob(LGD = L) = Prob(LGD <= L)

Prob(LGD = M) = Prob(LGD <= M) – Prob(LGD <= L)

Prob(LGD = H) = 1 – Prob(LGD <= M)

The R code is showing the detailed calculation how to convert cumulative probabilities to probabilities of interest.

cumprob_L <- exp(df$LTV * (-m1$beta) + m1$Theta[1]) / (1 + exp(df$LTV * (-m1$beta) + m1$Theta[1])) cumprob_M <- exp(df$LTV * (-m1$beta) + m1$Theta[2]) / (1 + exp(df$LTV * (-m1$beta) + m1$Theta[2])) prob_L <- cumprob_L prob_M <- cumprob_M - cumprob_L prob_H <- 1 - cumprob_M pred <- data.frame(prob_L, prob_M, prob_H) apply(pred, 2, mean) # prob_L prob_M prob_H #0.28751210 0.65679888 0.05568903

After predicting the probability of each category, we would need another sub-model to estimate the conditional LGD for lgd_cat = “M” with either Beta or Simplex regression. (See https://statcompute.wordpress.com/2014/10/27/flexible-beta-modeling and https://statcompute.wordpress.com/2014/02/02/simplex-model-in-r) The final LGD prediction can be formulated as

E(LGD|X)

= Prob(Y = 0|X) * E(Y|X, Y = 0) + Prob(Y = 1|X) * E(Y|X, Y = 1) + Prob(0 < Y < 1|X) * E(Y|X, 0 < Y < 1)

= Prob(Y = 1|X) + Prob(0 < Y < 1|X) * E(Y|X, 0 < Y < 1)

where E(Y|X, 0 < Y < 1) can be calculated from the sub-model.

## Monotonic WoE Binning for LGD Models

While the monotonic binning algorithm has been widely used in scorecard and PD model (Probability of Default) developments, the similar idea can be generalized to LGD (Loss Given Default) models. In the post below, two SAS macros performing the monotonic binning for LGD are demonstrated.

The first one tends to generate relatively coarse bins based on iterative grouping, which requires a longer computing time.

%macro lgd_bin1(data = , y = , x = ); %let maxbin = 20; data _tmp1 (keep = x y); set &data; y = min(1, max(0, &y)); x = &x; run; proc sql noprint; select count(distinct x) into :xflg from _last_; quit; %let nbin = %sysfunc(min(&maxbin, &xflg)); %if &nbin > 2 %then %do; %do j = &nbin %to 2 %by -1; proc rank data = _tmp1 groups = &j out = _data_ (keep = x rank y); var x; ranks rank; run; proc summary data = _last_ nway; class rank; output out = _tmp2 (drop = _type_ rename = (_freq_ = freq)) sum(y) = bads mean(y) = bad_rate min(x) = minx max(x) = maxx; run; proc sql noprint; select case when min(bad_rate) > 0 then 1 else 0 end into :minflg from _tmp2; select case when max(bad_rate) < 1 then 1 else 0 end into :maxflg from _tmp2; quit; %if &minflg = 1 & &maxflg = 1 %then %do; proc corr data = _tmp2 spearman noprint outs = _corr; var minx; with bad_rate; run; proc sql noprint; select case when abs(minx) = 1 then 1 else 0 end into :cor from _corr where _type_ = 'CORR'; quit; %if &cor = 1 %then %goto loopout; %end; %end; %end; %loopout: proc sql noprint; create table _tmp3 as select a.rank + 1 as bin, a.minx as minx, a.maxx as maxx, a.freq as freq, a.freq / b.freq as dist, a.bad_rate as avg_lgd, a.bads / b.bads as bpct, (a.freq - a.bads) / (b.freq - b.bads) as gpct, log(calculated bpct / calculated gpct) as woe, (calculated bpct - calculated gpct) / calculated woe as iv from _tmp2 as a, (select sum(freq) as freq, sum(bads) as bads from _tmp2) as b; quit; proc print data = _last_ noobs label; var minx maxx freq dist avg_lgd woe; format freq comma8. dist percent10.2; label minx = "Lower Limit" maxx = "Upper Limit" freq = "Freq" dist = "Dist" avg_lgd = "Average LGD" woe = "WoE"; sum freq dist; run; %mend lgd_bin1;

The second one can generate much finer bins based on the idea of isotonic regressions and is more computationally efficient.

%macro lgd_bin2(data = , y = , x = ); data _data_ (keep = x y); set &data; y = min(1, max(0, &y)); x = &x; run; proc transreg data = _last_ noprint; model identity(y) = monotone(x); output out = _tmp1 tip = _t; run; proc summary data = _last_ nway; class _tx; output out = _data_ (drop = _freq_ _type_) mean(y) = lgd; run; proc sort data = _last_; by lgd; run; data _tmp2; set _last_; by lgd; _idx = _n_; if lgd = 0 then _idx = _idx + 1; if lgd = 1 then _idx = _idx - 1; run; proc sql noprint; create table _tmp3 as select a.*, b._idx from _tmp1 as a inner join _tmp2 as b on a._tx = b._tx; create table _tmp4 as select min(a.x) as minx, max(a.x) as maxx, sum(a.y) as bads, count(a.y) as freq, count(a.y) / b.freq as dist, mean(a.y) as avg_lgd, sum(a.y) / b.bads as bpct, sum(1 - a.y) / (b.freq - b.bads) as gpct, log(calculated bpct / calculated gpct) as woe, (calculated bpct - calculated gpct) * calculated woe as iv from _tmp3 as a, (select count(*) as freq, sum(y) as bads from _tmp3) as b group by a._idx; quit; proc print data = _last_ noobs label; var minx maxx freq dist avg_lgd woe; format freq comma8. dist percent10.2; label minx = "Lower Limit" maxx = "Upper Limit" freq = "Freq" dist = "Dist" avg_lgd = "Average LGD" woe = "WoE"; sum freq dist; run; %mend lgd_bin2;

Below is the output comparison between two macros with the testing data downloaded from http://www.creditriskanalytics.net/datasets-private.html. Should you have any feedback, please feel free to leave me a message.

## Granular Monotonic Binning in SAS

In the post (https://statcompute.wordpress.com/2017/06/15/finer-monotonic-binning-based-on-isotonic-regression), it is shown how to do a finer monotonic binning with isotonic regression in R.

Below is a SAS macro implementing the monotonic binning with the same idea of isotonic regression. This macro is more efficient than the one shown in (https://statcompute.wordpress.com/2012/06/10/a-sas-macro-implementing-monotonic-woe-transformation-in-scorecard-development) without iterative binning and is also able to significantly increase the binning granularity.

%macro monobin(data = , y = , x = ); options mprint mlogic; data _data_ (keep = _x _y); set &data; where &y in (0, 1) and &x ~= .; _y = &y; _x = &x; run; proc transreg data = _last_ noprint; model identity(_y) = monotone(_x); output out = _tmp1 tip = _t; run; proc summary data = _last_ nway; class _t_x; output out = _data_ (drop = _freq_ _type_) mean(_y) = _rate; run; proc sort data = _last_; by _rate; run; data _tmp2; set _last_; by _rate; _idx = _n_; if _rate = 0 then _idx = _idx + 1; if _rate = 1 then _idx = _idx - 1; run; proc sql noprint; create table _tmp3 as select a.*, b._idx from _tmp1 as a inner join _tmp2 as b on a._t_x = b._t_x; create table _tmp4 as select a._idx, min(a._x) as _min_x, max(a._x) as _max_x, sum(a._y) as _bads, count(a._y) as _freq, mean(a._y) as _rate, sum(a._y) / b.bads as _bpct, sum(1 - a._y) / (b.freq - b.bads) as _gpct, log(calculated _bpct / calculated _gpct) as _woe, (calculated _bpct - calculated _gpct) * calculated _woe as _iv from _tmp3 as a, (select count(*) as freq, sum(_y) as bads from _tmp3) as b group by a._idx; quit; title "Monotonic WoE Binning for %upcase(%trim(&x))"; proc print data = _last_ label noobs; var _min_x _max_x _bads _freq _rate _woe _iv; label _min_x = "Lower" _max_x = "Upper" _bads = "#Bads" _freq = "#Freq" _rate = "BadRate" _woe = "WoE" _iv = "IV"; sum _bads _freq _iv; run; title; %mend monobin;

Below is the sample output for LTV, showing an identical binning scheme to the one generated by the R isobin() function.

## Finer Monotonic Binning Based on Isotonic Regression

In my early post (https://statcompute.wordpress.com/2017/01/22/monotonic-binning-with-smbinning-package/), I wrote a monobin() function based on the smbinning package by Herman Jopia to improve the monotonic binning algorithm. The function works well and provides robust binning outcomes. However, there are a couple potential drawbacks due to the coarse binning. First of all, the derived Information Value for each binned variable might tend to be low. Secondly, the binned variable might not be granular enough to reflect the data nature.

In light of the aforementioned, I drafted an improved function isobin() based on the isotonic regression (https://en.wikipedia.org/wiki/Isotonic_regression), as shown below.

isobin <- function(data, y, x) { d1 <- data[c(y, x)] d2 <- d1[!is.na(d1[x]), ] c <- cor(d2[, 2], d2[, 1], method = "spearman", use = "complete.obs") reg <- isoreg(d2[, 2], c / abs(c) * d2[, 1]) k <- knots(as.stepfun(reg)) sm1 <-smbinning.custom(d1, y, x, k) c1 <- subset(sm1$ivtable, subset = CntGood * CntBad > 0, select = Cutpoint) c2 <- suppressWarnings(as.numeric(unlist(strsplit(c1$Cutpoint, " ")))) c3 <- c2[!is.na(c2)] return(smbinning.custom(d1, y, x, c3[-length(c3)])) }

Compared with the legacy monobin(), the isobin() function is able to significantly increase the binning granularity as well as moderately improve the Information Value.

**LTV Binning with isobin() Function**

Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate Odds LnOdds WoE IV 1 <= 46 81 78 3 81 78 3 0.0139 0.9630 0.0370 26.0000 3.2581 1.9021 0.0272 2 <= 71 312 284 28 393 362 31 0.0535 0.9103 0.0897 10.1429 2.3168 0.9608 0.0363 3 <= 72 22 20 2 415 382 33 0.0038 0.9091 0.0909 10.0000 2.3026 0.9466 0.0025 4 <= 73 27 24 3 442 406 36 0.0046 0.8889 0.1111 8.0000 2.0794 0.7235 0.0019 5 <= 81 303 268 35 745 674 71 0.0519 0.8845 0.1155 7.6571 2.0356 0.6797 0.0194 6 <= 83 139 122 17 884 796 88 0.0238 0.8777 0.1223 7.1765 1.9708 0.6149 0.0074 7 <= 90 631 546 85 1515 1342 173 0.1081 0.8653 0.1347 6.4235 1.8600 0.5040 0.0235 8 <= 94 529 440 89 2044 1782 262 0.0906 0.8318 0.1682 4.9438 1.5981 0.2422 0.0049 9 <= 95 145 119 26 2189 1901 288 0.0248 0.8207 0.1793 4.5769 1.5210 0.1651 0.0006 10 <= 100 907 709 198 3096 2610 486 0.1554 0.7817 0.2183 3.5808 1.2756 -0.0804 0.0010 11 <= 101 195 151 44 3291 2761 530 0.0334 0.7744 0.2256 3.4318 1.2331 -0.1229 0.0005 12 <= 110 1217 934 283 4508 3695 813 0.2085 0.7675 0.2325 3.3004 1.1940 -0.1619 0.0057 13 <= 112 208 158 50 4716 3853 863 0.0356 0.7596 0.2404 3.1600 1.1506 -0.2054 0.0016 14 <= 115 253 183 70 4969 4036 933 0.0433 0.7233 0.2767 2.6143 0.9610 -0.3950 0.0075 15 <= 136 774 548 226 5743 4584 1159 0.1326 0.7080 0.2920 2.4248 0.8857 -0.4702 0.0333 16 <= 138 27 18 9 5770 4602 1168 0.0046 0.6667 0.3333 2.0000 0.6931 -0.6628 0.0024 17 > 138 66 39 27 5836 4641 1195 0.0113 0.5909 0.4091 1.4444 0.3677 -0.9882 0.0140 18 Missing 1 0 1 5837 4641 1196 0.0002 0.0000 1.0000 0.0000 -Inf -Inf Inf 19 Total 5837 4641 1196 NA NA NA 1.0000 0.7951 0.2049 3.8804 1.3559 0.0000 0.1897

**LTV Binning with monobin() Function**

Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate Odds LnOdds WoE IV 1 <= 85 1025 916 109 1025 916 109 0.1756 0.8937 0.1063 8.4037 2.1287 0.7727 0.0821 2 <= 94 1019 866 153 2044 1782 262 0.1746 0.8499 0.1501 5.6601 1.7334 0.3775 0.0221 3 <= 100 1052 828 224 3096 2610 486 0.1802 0.7871 0.2129 3.6964 1.3074 -0.0486 0.0004 4 <= 105 808 618 190 3904 3228 676 0.1384 0.7649 0.2351 3.2526 1.1795 -0.1765 0.0045 5 <= 114 985 748 237 4889 3976 913 0.1688 0.7594 0.2406 3.1561 1.1493 -0.2066 0.0076 6 > 114 947 665 282 5836 4641 1195 0.1622 0.7022 0.2978 2.3582 0.8579 -0.4981 0.0461 7 Missing 1 0 1 5837 4641 1196 0.0002 0.0000 1.0000 0.0000 -Inf -Inf Inf 8 Total 5837 4641 1196 NA NA NA 1.0000 0.7951 0.2049 3.8804 1.3559 0.0000 0.1628

**Bureau_Score Binning with isobin() Function**

Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate Odds LnOdds WoE IV 1 <= 491 4 1 3 4 1 3 0.0007 0.2500 0.7500 0.3333 -1.0986 -2.4546 0.0056 2 <= 532 24 9 15 28 10 18 0.0041 0.3750 0.6250 0.6000 -0.5108 -1.8668 0.0198 3 <= 559 51 24 27 79 34 45 0.0087 0.4706 0.5294 0.8889 -0.1178 -1.4737 0.0256 4 <= 560 2 1 1 81 35 46 0.0003 0.5000 0.5000 1.0000 0.0000 -1.3559 0.0008 5 <= 572 34 17 17 115 52 63 0.0058 0.5000 0.5000 1.0000 0.0000 -1.3559 0.0143 6 <= 602 153 84 69 268 136 132 0.0262 0.5490 0.4510 1.2174 0.1967 -1.1592 0.0459 7 <= 605 56 31 25 324 167 157 0.0096 0.5536 0.4464 1.2400 0.2151 -1.1408 0.0162 8 <= 606 14 8 6 338 175 163 0.0024 0.5714 0.4286 1.3333 0.2877 -1.0683 0.0035 9 <= 607 17 10 7 355 185 170 0.0029 0.5882 0.4118 1.4286 0.3567 -0.9993 0.0037 10 <= 632 437 261 176 792 446 346 0.0749 0.5973 0.4027 1.4830 0.3940 -0.9619 0.0875 11 <= 639 150 95 55 942 541 401 0.0257 0.6333 0.3667 1.7273 0.5465 -0.8094 0.0207 12 <= 653 451 300 151 1393 841 552 0.0773 0.6652 0.3348 1.9868 0.6865 -0.6694 0.0412 13 <= 662 295 213 82 1688 1054 634 0.0505 0.7220 0.2780 2.5976 0.9546 -0.4014 0.0091 14 <= 665 100 77 23 1788 1131 657 0.0171 0.7700 0.2300 3.3478 1.2083 -0.1476 0.0004 15 <= 667 57 44 13 1845 1175 670 0.0098 0.7719 0.2281 3.3846 1.2192 -0.1367 0.0002 16 <= 677 381 300 81 2226 1475 751 0.0653 0.7874 0.2126 3.7037 1.3093 -0.0466 0.0001 17 <= 679 66 53 13 2292 1528 764 0.0113 0.8030 0.1970 4.0769 1.4053 0.0494 0.0000 18 <= 683 160 129 31 2452 1657 795 0.0274 0.8062 0.1938 4.1613 1.4258 0.0699 0.0001 19 <= 689 203 164 39 2655 1821 834 0.0348 0.8079 0.1921 4.2051 1.4363 0.0804 0.0002 20 <= 699 304 249 55 2959 2070 889 0.0521 0.8191 0.1809 4.5273 1.5101 0.1542 0.0012 21 <= 707 312 268 44 3271 2338 933 0.0535 0.8590 0.1410 6.0909 1.8068 0.4509 0.0094 22 <= 717 368 318 50 3639 2656 983 0.0630 0.8641 0.1359 6.3600 1.8500 0.4941 0.0132 23 <= 721 134 119 15 3773 2775 998 0.0230 0.8881 0.1119 7.9333 2.0711 0.7151 0.0094 24 <= 723 49 44 5 3822 2819 1003 0.0084 0.8980 0.1020 8.8000 2.1748 0.8188 0.0043 25 <= 739 425 394 31 4247 3213 1034 0.0728 0.9271 0.0729 12.7097 2.5424 1.1864 0.0700 26 <= 746 166 154 12 4413 3367 1046 0.0284 0.9277 0.0723 12.8333 2.5520 1.1961 0.0277 27 <= 756 234 218 16 4647 3585 1062 0.0401 0.9316 0.0684 13.6250 2.6119 1.2560 0.0422 28 <= 761 110 104 6 4757 3689 1068 0.0188 0.9455 0.0545 17.3333 2.8526 1.4967 0.0260 29 <= 763 46 44 2 4803 3733 1070 0.0079 0.9565 0.0435 22.0000 3.0910 1.7351 0.0135 30 <= 767 96 92 4 4899 3825 1074 0.0164 0.9583 0.0417 23.0000 3.1355 1.7795 0.0293 31 <= 772 77 74 3 4976 3899 1077 0.0132 0.9610 0.0390 24.6667 3.2055 1.8495 0.0249 32 <= 787 269 260 9 5245 4159 1086 0.0461 0.9665 0.0335 28.8889 3.3635 2.0075 0.0974 33 <= 794 95 93 2 5340 4252 1088 0.0163 0.9789 0.0211 46.5000 3.8395 2.4835 0.0456 34 > 794 182 179 3 5522 4431 1091 0.0312 0.9835 0.0165 59.6667 4.0888 2.7328 0.0985 35 Missing 315 210 105 5837 4641 1196 0.0540 0.6667 0.3333 2.0000 0.6931 -0.6628 0.0282 36 Total 5837 4641 1196 NA NA NA 1.0000 0.7951 0.2049 3.8804 1.3559 0.0000 0.8357

**Bureau_Score Binning with monobin() Function**

Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate Odds LnOdds WoE IV 1 <= 617 513 284 229 513 284 229 0.0879 0.5536 0.4464 1.2402 0.2153 -1.1407 0.1486 2 <= 642 515 317 198 1028 601 427 0.0882 0.6155 0.3845 1.6010 0.4706 -0.8853 0.0861 3 <= 657 512 349 163 1540 950 590 0.0877 0.6816 0.3184 2.1411 0.7613 -0.5946 0.0363 4 <= 672 487 371 116 2027 1321 706 0.0834 0.7618 0.2382 3.1983 1.1626 -0.1933 0.0033 5 <= 685 494 396 98 2521 1717 804 0.0846 0.8016 0.1984 4.0408 1.3964 0.0405 0.0001 6 <= 701 521 428 93 3042 2145 897 0.0893 0.8215 0.1785 4.6022 1.5265 0.1706 0.0025 7 <= 714 487 418 69 3529 2563 966 0.0834 0.8583 0.1417 6.0580 1.8014 0.4454 0.0144 8 <= 730 489 441 48 4018 3004 1014 0.0838 0.9018 0.0982 9.1875 2.2178 0.8619 0.0473 9 <= 751 513 476 37 4531 3480 1051 0.0879 0.9279 0.0721 12.8649 2.5545 1.1986 0.0859 10 <= 775 492 465 27 5023 3945 1078 0.0843 0.9451 0.0549 17.2222 2.8462 1.4903 0.1157 11 > 775 499 486 13 5522 4431 1091 0.0855 0.9739 0.0261 37.3846 3.6213 2.2653 0.2126 12 Missing 315 210 105 5837 4641 1196 0.0540 0.6667 0.3333 2.0000 0.6931 -0.6628 0.0282 13 Total 5837 4641 1196 NA NA NA 1.0000 0.7951 0.2049 3.8804 1.3559 0.0000 0.7810

## Monotonic Binning with Smbinning Package

The R package smbinning (http://www.scoringmodeling.com/rpackage/smbinning) provides a very user-friendly interface for the WoE (Weight of Evidence) binning algorithm employed in the scorecard development. However, there are several improvement opportunities in my view:

1. First of all, the underlying algorithm in the smbinning() function utilizes the recursive partitioning, which does not necessarily guarantee the monotonicity.

2. Secondly, the density in each generated bin is not even. The frequency in some bins could be much higher than the one in others.

3. At last, the function might not provide the binning outcome for some variables due to the lack of statistical significance.

In light of the above, I wrote an enhanced version by utilizing the smbinning.custom() function, shown as below. The idea is very simple. Within the repeat loop, we would bin the variable iteratively until a certain criterion is met and then feed the list of cut points into the smbinning.custom() function. As a result, we are able to achieve a set of monotonic bins with similar frequencies regardless of the so-called “statistical significance”, which is a premature step for the variable transformation in my mind.

monobin <- function(data, y, x) { d1 <- data[c(y, x)] n <- min(20, nrow(unique(d1[x]))) repeat { d1$bin <- Hmisc::cut2(d1[, x], g = n) d2 <- aggregate(d1[-3], d1[3], mean) c <- cor(d2[-1], method = "spearman") if(abs(c[1, 2]) == 1 | n == 2) break n <- n - 1 } d3 <- aggregate(d1[-3], d1[3], max) cuts <- d3[-length(d3[, 3]), 3] return(smbinning::smbinning.custom(d1, y, x, cuts)) }

Below are a couple comparisons between the generic smbinning() and the home-brew monobin() functions with the use of a toy data.

In the first example, we applied the smbinning() function to a variable named “rev_util”. As shown in the highlighted rows in the column “BadRate”, the binning outcome is not monotonic.

Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate Odds LnOdds WoE IV 1 <= 0 965 716 249 965 716 249 0.1653 0.7420 0.2580 2.8755 1.0562 -0.2997 0.0162 2 <= 5 522 496 26 1487 1212 275 0.0894 0.9502 0.0498 19.0769 2.9485 1.5925 0.1356 3 <= 24 1166 1027 139 2653 2239 414 0.1998 0.8808 0.1192 7.3885 1.9999 0.6440 0.0677 4 <= 40 779 651 128 3432 2890 542 0.1335 0.8357 0.1643 5.0859 1.6265 0.2705 0.0090 5 <= 73 1188 932 256 4620 3822 798 0.2035 0.7845 0.2155 3.6406 1.2922 -0.0638 0.0008 6 <= 96 684 482 202 5304 4304 1000 0.1172 0.7047 0.2953 2.3861 0.8697 -0.4863 0.0316 7 > 96 533 337 196 5837 4641 1196 0.0913 0.6323 0.3677 1.7194 0.5420 -0.8140 0.0743 8 Missing 0 0 0 5837 4641 1196 0.0000 NaN NaN NaN NaN NaN NaN 9 Total 5837 4641 1196 NA NA NA 1.0000 0.7951 0.2049 3.8804 1.3559 0.0000 0.3352

Next, we did the same with the monobin() function. As shown below, the algorithm provided a monotonic binning at the cost of granularity. Albeit coarse, the result is directionally correct with no inversion.

Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate Odds LnOdds WoE IV 1 <= 30 2962 2495 467 2962 2495 467 0.5075 0.8423 0.1577 5.3426 1.6757 0.3198 0.0471 2 > 30 2875 2146 729 5837 4641 1196 0.4925 0.7464 0.2536 2.9438 1.0797 -0.2763 0.0407 3 Missing 0 0 0 5837 4641 1196 0.0000 NaN NaN NaN NaN NaN NaN 4 Total 5837 4641 1196 NA NA NA 1.0000 0.7951 0.2049 3.8804 1.3559 0.0000 0.0878

In the second example, we applied the smbinning() function to a variable named “bureau_score”. As shown in the highlighted rows, the frequencies in these two bins are much higher than the rest.

Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate Odds LnOdds WoE IV 1 <= 605 324 167 157 324 167 157 0.0555 0.5154 0.4846 1.0637 0.0617 -1.2942 0.1233 2 <= 632 468 279 189 792 446 346 0.0802 0.5962 0.4038 1.4762 0.3895 -0.9665 0.0946 3 <= 662 896 608 288 1688 1054 634 0.1535 0.6786 0.3214 2.1111 0.7472 -0.6087 0.0668 4 <= 699 1271 1016 255 2959 2070 889 0.2177 0.7994 0.2006 3.9843 1.3824 0.0264 0.0002 5 <= 717 680 586 94 3639 2656 983 0.1165 0.8618 0.1382 6.2340 1.8300 0.4741 0.0226 6 <= 761 1118 1033 85 4757 3689 1068 0.1915 0.9240 0.0760 12.1529 2.4976 1.1416 0.1730 7 > 761 765 742 23 5522 4431 1091 0.1311 0.9699 0.0301 32.2609 3.4739 2.1179 0.2979 8 Missing 315 210 105 5837 4641 1196 0.0540 0.6667 0.3333 2.0000 0.6931 -0.6628 0.0282 9 Total 5837 4641 1196 NA NA NA 1.0000 0.7951 0.2049 3.8804 1.3559 0.0000 0.8066

With the monobin() function applied to the same variable, we were able to get a set of more granular bins with similar frequencies.

Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate Odds LnOdds WoE IV 1 <= 617 513 284 229 513 284 229 0.0879 0.5536 0.4464 1.2402 0.2153 -1.1407 0.1486 2 <= 642 515 317 198 1028 601 427 0.0882 0.6155 0.3845 1.6010 0.4706 -0.8853 0.0861 3 <= 657 512 349 163 1540 950 590 0.0877 0.6816 0.3184 2.1411 0.7613 -0.5946 0.0363 4 <= 672 487 371 116 2027 1321 706 0.0834 0.7618 0.2382 3.1983 1.1626 -0.1933 0.0033 5 <= 685 494 396 98 2521 1717 804 0.0846 0.8016 0.1984 4.0408 1.3964 0.0405 0.0001 6 <= 701 521 428 93 3042 2145 897 0.0893 0.8215 0.1785 4.6022 1.5265 0.1706 0.0025 7 <= 714 487 418 69 3529 2563 966 0.0834 0.8583 0.1417 6.0580 1.8014 0.4454 0.0144 8 <= 730 489 441 48 4018 3004 1014 0.0838 0.9018 0.0982 9.1875 2.2178 0.8619 0.0473 9 <= 751 513 476 37 4531 3480 1051 0.0879 0.9279 0.0721 12.8649 2.5545 1.1986 0.0859 10 <= 775 492 465 27 5023 3945 1078 0.0843 0.9451 0.0549 17.2222 2.8462 1.4903 0.1157 11 > 775 499 486 13 5522 4431 1091 0.0855 0.9739 0.0261 37.3846 3.6213 2.2653 0.2126 12 Missing 315 210 105 5837 4641 1196 0.0540 0.6667 0.3333 2.0000 0.6931 -0.6628 0.0282 13 Total 5837 4641 1196 NA NA NA 1.0000 0.7951 0.2049 3.8804 1.3559 0.0000 0.7810