Improve General Regression Neural Network by Monotonic Binning

A major criticism of the binning algorithm, as well as of the WoE transformation, is that binned predictors decrease the model's predictive power due to the loss of data granularity after the WoE transformation. Since talk is cheap, I use the example below to show that pre-processing predictors with the monotonic binning algorithm in a GRNN actually alleviates over-fitting and improves the prediction accuracy for the hold-out sample.

First of all, the whole dataset was split in half, i.e. one half as the training sample and the other as the hold-out sample. The smoothing parameter, i.e. sigma, was chosen through a random search and happened to be 2.198381 for both GRNNs.

  1. For the first GRNN with untransformed raw predictors, the AUC for the training sample is 0.69 and the AUC for the hold-out sample is 0.66.
  2. For the second GRNN with WoE-transformed predictors, the AUC for the training sample is 0.72 and the AUC for the hold-out sample is 0.69.

In this particular example, it is clearly shown that there is roughly a 4% – 5% improvement in the AUC statistic for both training and hold-out samples through the use of monotonic binning and WoE transformations.

df1 <- read.table("credit_count.txt", header = T, sep = ",")
df2 <- df1[which(df1$CARDHLDR == 1), ]
Y <- df2$DEFAULT
X <- scale(df2[, 3:ncol(df2)])
i <- sample(seq(length(Y)), length(Y) / 2)
# WITHOUT BINNING
Y1 <- Y[i]
Y2 <- Y[-i]
X1 <- X[i, ]
X2 <- X[-i, ]
net11 <- grnn.fit(x = X1, y = Y1)
test1 <- grnn.search_auc(net11, gen_latin(1, 3, 10), nfolds = 4)
# $best
# sigma auc
# 2.198381 0.6297201
net12 <- grnn.fit(x = X1, y = Y1, sigma = test1$best$sigma)
MLmetrics::AUC(grnn.parpred(net12, X1), Y1)
# 0.6855638
MLmetrics::AUC(grnn.parpred(net12, X2), Y2)
# 0.6555798
# WITH BINNING
df3 <- data.frame(df2[, 3:ncol(df2)], Y)
bin_out <- batch_bin(df3, method = 3)
df_woe <- batch_woe(df3, bin_out$BinLst)
W <- scale(df_woe$df[, -1])
W1 <- W[i, ]
W2 <- W[-i, ]
net21 <- grnn.fit(x = W1, y = Y1)
test2 <- grnn.search_auc(net21, gen_latin(1, 3, 10), nfolds = 4)
# $best
# sigma auc
# 2.198381 0.6820317
net22 <- grnn.fit(x = W1, y = Y1, sigma = test2$best$sigma)
MLmetrics::AUC(grnn.parpred(net22, W1), Y1)
# 0.7150051
MLmetrics::AUC(grnn.parpred(net22, W2), Y2)
# 0.6884229


GRNN with Small Samples

After a bank launches a new product or acquires a new portfolio, the risk modeling team is often faced with the challenge of estimating the corresponding performance, e.g. risk or loss, conditional on business drivers or macro-economic indicators with only a limited number of data points. For instance, CCAR requires projecting the 9-quarter loss regardless of the portfolio age. In such cases, the prevalent practice based upon conventional regression models might not be applicable, given the requirement for a sufficient number of samples to draw statistical inference. As a result, we would have to rely on the input of SMEs (Subject Matter Experts), gauge the performance based on similar products and portfolios, or fall back on simple statistics, such as the average or median, that can't be intuitively related to predictors.

With the GRNN implemented in the YAGeR project (https://github.com/statcompute/yager), however, it is technically feasible to project the expected performance conditional on predictors, because the projected Y_i of a future case is determined by the distance between its predictor vector X_i and each X vector in the training sample, subject to a smoothing parameter, namely Sigma. While more samples in the training data are certainly helpful for estimating a generalizable model, a couple of data points, even only one or two in the extreme case, are conceptually sufficient to form a GRNN able to generate sensible projections without violating statistical assumptions.
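
For reference, below is a minimal sketch of the GRNN prediction rule as described by Specht, which is what makes projections from a tiny training sample conceptually possible; it is a plain R illustration rather than the YAGeR implementation, and the function name and sample inputs are made up for demonstration.

# A minimal sketch of the GRNN prediction rule (Specht, 1991), not the YAGeR code:
# the projection for a new case x0 is a kernel-weighted average of the training Y values,
# with weights driven by the distance between x0 and each training X vector and by sigma.
grnn_predict_sketch <- function(x_trn, y_trn, x0, sigma) {
  d2 <- rowSums(sweep(x_trn, 2, x0) ^ 2)   # squared Euclidean distance to each training case
  w <- exp(-d2 / (2 * sigma ^ 2))          # Gaussian kernel weights
  sum(w * y_trn) / sum(w)                  # weighted average as the projection
}
# hypothetical usage with only two training points:
# grnn_predict_sketch(x_trn = matrix(c(0.1, 0.2, 0.8, 0.9), nrow = 2, byrow = TRUE),
#                     y_trn = c(0.05, 0.40), x0 = c(0.5, 0.5), sigma = 1)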

Following are a couple of practical considerations.

  1. Although normalizing the input data, e.g. the X matrix, in a GRNN is usually necessary for numerical reasons, exact scaling is not required. Practically, a "rough" scaling can be employed, and the ranges or variances used in the normalization can be based upon historical data of X that might not be reflected in a training sample of such a small size.
  2. With limited data points in the training data, the Sigma value can be chosen by L-O-O (Leave-One-Out) cross-validation, as sketched after this list, or empirically based upon another GRNN with a similar data structure that might or might not be related to the training data. What's more, it is easy enough to dynamically fine-tune or refresh the Sigma value as more data samples become available over time.
  3. While there is no requirement for variable selection in a GRNN, the model developer does have the flexibility of judgmentally choosing predictors based upon prior information and eliminating variables that do not show correct marginal effects in the PDP (https://statcompute.wordpress.com/2019/10/19/partial-dependence-plot-pdp-of-grnn).
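
As mentioned in the second point, below is a minimal sketch of how the Sigma value might be chosen by Leave-One-Out with a small training sample, reusing grnn.fit() and grnn.predict() from YAGeR; the helper name, the candidate sigma grid, and the squared-error criterion are assumptions for illustration only.

# A minimal L-O-O sketch for choosing sigma with a small training sample; loo_sigma() and the
# sigma grid below are hypothetical, only grnn.fit() / grnn.predict() come from YAGeR.
loo_sigma <- function(x, y, sigmas = seq(0.5, 3, by = 0.5)) {
  sse <- sapply(sigmas, function(s)
    sum(sapply(seq(length(y)), function(i) {
      net <- grnn.fit(x = x[-i, , drop = FALSE], y = y[-i], sigma = s)
      (grnn.predict(net, x[i, , drop = FALSE]) - y[i]) ^ 2
    })))
  data.frame(sigma = sigmas, sse = sse)[which.min(sse), ]
}
# hypothetical usage: loo_sigma(x_small, y_small) for any small training matrix and response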

Below is an example of using 100 data points as the training sample to predict the LGD, which lies within the unity interval, for 1,000 hold-out cases with both a GLM and a GRNN. Out of 100 trials, the GLM outperformed the simple average only 32 times, while the GRNN did better 76 times.

source("yager.R")
df <- read.table("lgd", header = T)[, 1:8]
Y <- 1 - df$rr
X <- scale(df[, 2:8])
pre.N <- 1000
trn.N <- 100
try.N <- 100
seeds <- floor(with(set.seed(2020), runif(try.N) * 1e8))
test_glm <- function(seed) {
i1 <- with(set.seed(seed), sample(seq(length(Y)), pre.N))
Y1 <- Y[i1]
X1 <- X[i1, ]
Y2 <- Y[-i1]
X2 <- X[-i1, ]
i2 <- with(set.seed(seed), sample(seq(length(Y2)), trn.N))
gm <- glm(Y2 ~ ., data = data.frame(Y2, X2)[i2, ], family = quasibinomial)
round(MLmetrics::R2_Score(predict(gm, newdata = data.frame(X1), type = "response"), Y1), 4)
}
perf.glm <- Reduce(c, lapply(seeds, test_glm))
summary(perf.glm)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# -0.39300 -0.10483 -0.02280 -0.05135 0.01230 0.08920
sum(perf.glm > 0) / length(perf.glm)
# [1] 0.32
test_grnn <- function(seed) {
i1 <- with(set.seed(seed), sample(seq(length(Y)), pre.N))
Y1 <- Y[i1]
X1 <- X[i1, ]
Y2 <- Y[-i1]
X2 <- X[-i1, ]
i2 <- with(set.seed(seed), sample(seq(length(Y2)), trn.N))
gn <- grnn.fit(X2[i2, ], Y2[i2])
round(MLmetrics::R2_Score(grnn.predict(gn, X1), Y1), 4)
}
perf.grnn <- Reduce(c, lapply(seeds, test_grnn))
summary(perf.grnn)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# -0.06130 0.00075 0.03075 0.02739 0.05437 0.10000
sum(perf.grnn > 0) / length(perf.grnn)
# [1] 0.76


Assess Variable Importance In GRNN

Technically speaking, there is no need to evaluate variable importance or to perform variable selection in the training of a GRNN. It's also been a consensus that a neural network is a black-box model and that it is not an easy task to assess variable importance in one. However, from the practical perspective, it is helpful to understand the individual contribution of each predictor to the overall goodness-of-fit of a GRNN. For instance, the variable importance can help us make up a beautiful business story to decorate our model. In addition, dropping variables with trivial contributions also helps us come up with a more parsimonious model and improves the computational efficiency.

In the YAGeR project (https://github.com/statcompute/yager), two functions have been added for the purpose of assessing the variable importance in a GRNN. While the grnn.x_imp() function (https://github.com/statcompute/yager/blob/master/code/grnn.x_imp.R) provides the importance assessment of a single variable, the grnn.imp() function (https://github.com/statcompute/yager/blob/master/code/grnn.imp.R) gives a full picture of the variable importance for all variables in the GRNN. The returned value "imp1" is calculated as the decrease in AUC with all values of the variable of interest set equal to its mean, and "imp2" is calculated as the decrease in AUC with the variable of interest dropped completely. The variable with a larger decrease in AUC is deemed more important.
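
For intuition, below is a minimal sketch of the logic behind the two measures rather than the actual grnn.x_imp() code: "imp1" neutralizes the variable of interest by fixing it at its mean, "imp2" drops it entirely, and each importance is the resulting decrease in AUC against the full model; the helper name and the flooring at zero are assumptions for illustration.

# A minimal sketch of the idea behind imp1 / imp2; not the YAGeR implementation.
# net is a fitted GRNN (e.g. net2 below), x / y the training data, j the column index of interest.
x_imp_sketch <- function(net, x, y, j) {
  auc0 <- MLmetrics::AUC(grnn.parpred(net, x), y)
  x1 <- x
  x1[, j] <- mean(x[, j])                       # fix the variable of interest at its mean
  auc1 <- MLmetrics::AUC(grnn.parpred(grnn.fit(x = x1, y = y, sigma = net$sigma), x1), y)
  x2 <- x[, -j]                                 # drop the variable of interest completely
  auc2 <- MLmetrics::AUC(grnn.parpred(grnn.fit(x = x2, y = y, sigma = net$sigma), x2), y)
  data.frame(idx = j, imp1 = max(0, auc0 - auc1), imp2 = max(0, auc0 - auc2))
}
# hypothetical usage: x_imp_sketch(net2, X1, Y1, j = 9)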

Below is an example demonstrating how to assess the variable importance in a GRNN. As shown in the output, there are three variables making no contribution to the AUC statistic. It is also noted that dropping these three unimportant variables from the GRNN actually increases the AUC in the hold-out sample. What's more, the marginal effects of the variables remaining in the GRNN make more sense now, all showing nice monotonic relationships, in particular "tot_open_tr".

Y <- df$bad
X <- scale(df_woe$df[, -1])
set.seed(2019)
i <- sample(seq(length(Y)), length(Y) / 4)
Y1 <- Y[i]
Y2 <- Y[-i]
X1 <- X[i, ]
X2 <- X[-i, ]
net1 <- grnn.fit(x = X1, y = Y1)
rst <- grnn.optmiz_auc(net1, lower = 1, upper = 3)
net2 <- grnn.fit(x = X1, y = Y1, sigma = rst$sigma)
xrank <- grnn.imp(net2)
#idx var imp1 imp2
# 9 woe.bureau_score 0.03629427 0.03490435
# 8 woe.rev_util 0.01150345 0.01045408
# 1 woe.tot_derog 0.01033528 0.00925820
# 10 woe.ltv 0.01033330 0.00910178
# 11 woe.tot_income 0.00506666 0.00509438
# 3 woe.age_oldest_tr 0.00430835 0.00476373
# 4 woe.tot_open_tr 0.00392424 0.00523496
# 2 woe.tot_tr 0.00123152 0.00215021
# 5 woe.tot_rev_tr 0.00000000 0.00000000
# 6 woe.tot_rev_debt 0.00000000 0.00000000
# 7 woe.tot_rev_line 0.00000000 0.00000000
excol <- xrank[xrank$imp1 == 0, ]$idx
#[1] 5 6 7
MLmetrics::AUC(y_pred = grnn.parpred(net2, X2), y_true = Y2)
# [1] 0.7584476
MLmetrics::AUC(y_pred = grnn.parpred(grnn.fit(x = X1[, -excol], y = Y1, sigma = net2$sigma), X2[, -excol]), y_true = Y2)
# [1] 0.7626386
barplot(t(as.matrix(xrank[, 3:4])), beside = TRUE, col = c("lightcyan4", "lightcyan2"), border = NA, yaxt = "n",
names.arg = substring(xrank$var, 5), main = "Variable Importance Rank", cex.names = 1)


(Figure: variable importance rank)

(Figure: marginal effects of variables remaining in the GRNN)

Monotonic Binning Driven by Decision Tree

After the development of the MOB package (https://github.com/statcompute/MonotonicBinning), I was asked by a couple of users about the possibility of using a decision tree to drive the monotonic binning. Although I am not aware of any R package implementing a decision tree with the monotonic constraint, I did manage to find a solution based upon the decision tree.

The Rborist package is an implementation of the Random Forest that enforces monotonicity at the local level within each tree but not at the global level for the whole forest. However, with a few tweaks to the Rborist syntax, it is not difficult to convert a forest with many trees into a forest with a single tree. After all necessary adjustments, I finally ended up with a decision tree that can be used to drive the monotonic binning algorithm, as shown in the arb_bin() function below, and will consider adding it to the MOB package later.

arb_bin <- function(data, y, x) {
yname <- deparse(substitute(y))
xname <- deparse(substitute(x))
df1 <- subset(data, !is.na(data[[xname]]) & data[[yname]] %in% c(0, 1), select = c(xname, yname))
df2 <- data.frame(y = df1[[yname]], x = df1[[xname]])
spc <- cor(df2[, 2], df2[, 1], method = "spearman", use = "complete.obs")
mdl <- Rborist::Rborist(as.matrix(df2$x), df2$y, noValidate = T, nTree = 1, regMono = spc / abs(spc),
ctgCensus = "prob", minInfo = exp(-100), nSamp = nrow(df2), withRepl = F)
df3 <- data.frame(y = df2$y, x = df2$x, yhat = predict(mdl, newdata = as.matrix(df2$x), ctgCensus = "prob")$yPred)
df4 <- Reduce(rbind,
lapply(split(df3, df3$yhat),
function(x) data.frame(maxx = max(x$x), yavg = mean(x$y), yhat = round(mean(x$yhat), 8))))
df5 <- df4[order(df4$maxx), ]
h <- ifelse(df5[["yavg"]][1] %in% c(0, 1), 2, 1)
t <- ifelse(df5[["yavg"]][nrow(df5)] %in% c(0, 1), 2, 1)
cuts <- df5$maxx[h:max(h, (nrow(df5) - t))]
return(list(df = manual_bin(data, yname, xname, cuts = cuts),
cuts = cuts))
}
arb_bin(df, bad, rev_util)
# bin rule freq dist mv_cnt bad_freq bad_rate woe iv ks
# 01 $X <= 24 2653 0.4545 0 414 0.1560 -0.3320 0.0452 13.6285
# 02 $X > 24 & $X <= 36 597 0.1023 0 96 0.1608 -0.2963 0.0082 16.3969
# 03 $X > 36 & $X <= 40 182 0.0312 0 32 0.1758 -0.1890 0.0011 16.9533
# 04 $X > 40 & $X <= 58 669 0.1146 0 137 0.2048 -0.0007 0.0000 16.9615
# 05 $X > 58 & $X <= 60 77 0.0132 0 16 0.2078 0.0177 0.0000 16.9381
# 06 $X > 60 & $X <= 72 408 0.0699 0 95 0.2328 0.1636 0.0020 15.7392
# 07 $X > 72 & $X <= 73 34 0.0058 0 8 0.2353 0.1773 0.0002 15.6305
# 08 $X > 73 & $X <= 75 62 0.0106 0 16 0.2581 0.2999 0.0010 15.2839
# 09 $X > 75 & $X <= 83 246 0.0421 0 70 0.2846 0.4340 0.0089 13.2233
# 10 $X > 83 & $X <= 96 376 0.0644 0 116 0.3085 0.5489 0.0225 9.1266
# 11 $X > 96 & $X <= 98 50 0.0086 0 17 0.3400 0.6927 0.0049 8.4162
# 12 $X > 98 483 0.0827 0 179 0.3706 0.8263 0.0695 0.0000
arb_bin(df, bad, tot_derog)
# bin rule freq dist mv_cnt bad_freq bad_rate woe iv ks
# 00 is.na($X) 213 0.0365 213 70 0.3286 0.6416 0.0178 2.7716
# 01 $X <= 0 2850 0.4883 0 367 0.1288 -0.5559 0.1268 20.0442
# 02 $X > 0 & $X <= 1 891 0.1526 0 193 0.2166 0.0704 0.0008 18.9469
# 03 $X > 1 & $X <= 2 478 0.0819 0 121 0.2531 0.2740 0.0066 16.5222
# 04 $X > 2 & $X <= 3 332 0.0569 0 86 0.2590 0.3050 0.0058 14.6321
# 05 $X > 3 & $X <= 23 1064 0.1823 0 353 0.3318 0.6557 0.0931 0.4370
# 06 $X > 23 9 0.0015 0 6 0.6667 2.0491 0.0090 0.0000


WoE Transformation for Loss Given Default Models

In the intro section of my MOB package (https://github.com/statcompute/MonotonicBinning#introduction), the reasons for and benefits of using WoE transformations in the context of logistic regressions with binary outcomes were discussed. What's more, a similar idea can be easily generalized to other statistical models in the credit risk area, such as LGD (Loss Given Default) models with fractional outcomes.

Measured as the ratio between net and gross charge-offs, LGD can take any value within the unity interval [0, 1], with no unanimous consensus on the distributional assumption either academically or empirically. In the banking industry, a popular approach to modeling LGD is the use of Quasi-Binomial models, which make no assumption about any statistical distribution but merely specify the conditional mean through a Logit link function. With the Logit link specified, the idea of WoE transformations can be ported directly from logistic regressions to Quasi-Binomial models, as sketched below.
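
As a minimal sketch of that porting, the snippet below fits a Quasi-Binomial regression with the Logit link to a simulated fractional outcome; the simulated data frame and the column names are made up for illustration and are not part of the MOB package.

# A minimal sketch of a Quasi-Binomial model for a fractional outcome through the Logit link;
# the simulated data frame and column names below are illustrative only.
set.seed(1)
sim <- data.frame(woe.ltv = rnorm(500))
sim$lgd <- plogis(-1 + sim$woe.ltv) * runif(500)   # a made-up fractional outcome in [0, 1]
mdl <- glm(lgd ~ woe.ltv, data = sim, family = quasibinomial(link = "logit"))
summary(mdl)$coefficients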

The example below shows how to perform WoE transformations through monotonic binning based upon the fractional outcome, e.g. LGD, by using the function qtl_lgd() (https://github.com/statcompute/MonotonicBinning/blob/master/code/qtl_lgd.R).

qtl_lgd(df, lgd, ltv)
#$df
# bin rule freq dist mv_cnt mean_y woe iv ks
# 1 01 $X <= 0.2442486803 320 0.1257 0 0.0948 -1.0370 0.0987 9.5173
# 2 02 $X > 0.2442486803 & $X <= 0.3994659888 318 0.1250 0 0.0994 -0.9850 0.0900 18.6516
# 3 03 $X > 0.3994659888 & $X <= 0.5314432946 318 0.1250 0 0.1265 -0.7135 0.0515 25.8646
# 4 04 $X > 0.5314432946 & $X <= 0.6594855396 318 0.1250 0 0.1283 -0.6974 0.0494 32.9504
# 5 05 $X > 0.6594855396 & $X <= 0.7917383883 318 0.1250 0 0.1769 -0.3182 0.0116 36.5819
# 6 06 $X > 0.7917383883 & $X <= 0.9243704807 320 0.1257 0 0.2788 0.2683 0.0097 32.9670
# 7 07 $X > 0.9243704807 & $X <= 1.0800711662 317 0.1246 0 0.4028 0.8251 0.1020 20.6104
# 8 08 $X > 1.0800711662 316 0.1242 0 0.5204 1.3006 0.2681 0.0000
# $cuts
# [1] 0.2442487 0.3994660 0.5314433 0.6594855 0.7917384 0.9243705 1.0800712


As demonstrated in the output table, the average LGD increases along with the LTV (Loan-to-Value), and the WoE transformation of LTV is strictly linear with respect to the Logit of the average LGD.
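
That linearity can be checked directly from the $df component shown above, e.g. by correlating the WoE values with the Logit of the bin-level average LGD; the snippet below is a quick sketch based on the columns of the output table.

# A quick check of the linearity between WoE and the Logit of the average LGD by bin,
# based on the mean_y and woe columns of the qtl_lgd() output shown above.
bin_lgd <- qtl_lgd(df, lgd, ltv)$df
cor(bin_lgd$woe, log(bin_lgd$mean_y / (1 - bin_lgd$mean_y)))
# expected to be very close to 1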

Why Use Weight of Evidence?

I have been asked why I spent so much effort developing SAS macros and R functions to do monotonic binning for the WoE transformation, given the availability of other cutting-edge data mining algorithms that will automatically generate predictions with whatever predictors are fed into the model. Nonetheless, what really distinguishes a good modeler from the rest is how challenging data issues are handled before feeding data into the model, including missing values, outliers, linearity, and predictability, in a scalable way that can be rolled out to hundreds or even thousands of potential model drivers in the production environment.

The WoE transformation through monotonic binning provides a convenient way to address each of the aforementioned concerns.

1. Because WoE is a piecewise transformation based on data discretization, all missing values fall into a standalone category, either on its own or combined with the neighboring bin that shares a similar event probability. As a result, no special treatment for missing values is necessary.

2. After the monotonic binning of each variable, since the WoE value for each bin is a projection from the predictor into the response, defined by the log ratio between event and non-event distributions, the raw value of the predictor doesn't matter anymore and the issue related to outliers therefore disappears.

3. While many modelers like to use log or power transformations to achieve a good linear relationship between the predictor and the log odds of the response, which is heuristic at best with no guarantee of a good outcome, the WoE transformation is strictly linear with respect to the log odds of the response with unity correlation. It is also worth mentioning that a numeric variable and its strictly monotone functions should converge to the same monotonic WoE transformation.

4. Last of all, because the WoE is defined as the log ratio between event and non-event distributions, it is indicative of the separation between cases with Y = 0 and cases with Y = 1. As the weighted sum of WoE values, with the weights being the differences between event and non-event distributions, the IV (Information Value) is an important statistic commonly used to measure predictor importance.
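
To make the definitions in points 2 and 4 concrete, below is a minimal sketch of the WoE and IV calculations from bin-level counts; the bad and good counts are the bad / good frequencies of the tot_derog bins shown later in this post, so the results can be checked against that output.

# A minimal sketch of the WoE / IV calculation from bin-level counts; the counts below are
# the bad / good frequencies of the tot_derog bins shown later in this post.
bins <- data.frame(bad = c(70, 560, 121, 176, 269), good = c(143, 3181, 357, 411, 549))
bad_dist <- bins$bad / sum(bins$bad)      # event distribution by bin
good_dist <- bins$good / sum(bins$good)   # non-event distribution by bin
woe <- log(bad_dist / good_dist)          # log ratio between event and non-event distributions
iv <- sum((bad_dist - good_dist) * woe)   # weighted sum of WoE values
round(woe, 4)
# 0.6416 -0.3811 0.2740 0.5078 0.6426
round(iv, 4)
# 0.2055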

Below is a simple example showing how to use WoE transformations in the estimation of a logistic regression.

df <- readRDS("df.rds")
### SHOWING THE RESPONSE IN THE LAST COLUMN ###
head(df, 2)
#tot_derog tot_tr age_oldest_tr tot_open_tr tot_rev_tr tot_rev_debt tot_rev_line rev_util bureau_score ltv tot_income bad
# 6 7 46 NaN NaN NaN NaN 0 747 109 4800.00 0
# 0 21 153 6 1 97 4637 2 744 97 5833.33 0
source("mob.R")
bin_out <- batch_bin(df, 3)
bin_out$BinSum
# var nbin unique miss min median max ks iv
# tot_derog 7 29 213 0 0.0 32 20.0442 0.2599
# tot_tr 15 67 213 0 16.0 77 17.3002 0.1425
# ……
top <- paste(bin_out$BinSum[order(bin_out$BinSum[["iv"]], decreasing = T), ][1:6, "var"], sep = '')
par(mfrow = c(2, 3))
lapply(top, function(x) plot(bin_out$BinLst[[x]]$df[["woe"]],
log(bin_out$BinLst[[x]]$df[["bad_rate"]] / (1 - bin_out$BinLst[[x]]$df[["bad_rate"]])),
type = "b", main = x, cex.main = 2, xlab = paste("woe of", x), ylab = "logit(bad)", cex = 2, col = "red"))
df_woe <- batch_woe(df, bin_out$BinLst)
str(df_woe$df)
#'data.frame': 5837 obs. of 12 variables:
# $ idx_ : int 1 2 3 4 5 6 7 8 9 10 …
# $ woe.tot_derog : num 0.656 -0.556 -0.556 0.274 0.274 …
# $ woe.tot_tr : num 0.407 -0.322 -0.4 -0.322 0.303 …
# ……
### PARSE VARIABLES WITH IV > 0.1 ###
x1 <- paste("woe", bin_out$BinSum[bin_out$BinSum[["iv"]] > 0.1, ]$var, sep = ".")
# "woe.tot_derog" "woe.tot_tr" "woe.age_oldest_tr" "woe.tot_rev_line" "woe.rev_util" "woe.bureau_score" "woe.ltv"
fml1 <- as.formula(paste("bad", paste(x1, collapse = " + "), sep = " ~ "))
sum1 <- summary(glm(fml1, data = cbind(bad = df$bad, df_woe$df), family = "binomial"))
### PARSE SIGNIFICANT VARIABLES WITH P-VALUE < 0.05 ###
x2 <- paste(row.names(sum1$coefficients)[sum1$coefficients[, 4] < 0.05][-1])
# "woe.age_oldest_tr" "woe.tot_rev_line" "woe.rev_util" "woe.bureau_score" "woe.ltv"
fml2 <- as.formula(paste("bad", paste(x2, collapse = " + "), sep = " ~ "))
mdl2 <- glm(fml2, data = cbind(bad = df$bad, df_woe$df), family = "binomial")
# Estimate Std. Error z value Pr(>|z|)
#(Intercept) -1.38600 0.03801 -36.461 < 2e-16 ***
#woe.age_oldest_tr 0.30376 0.08176 3.715 0.000203 ***
#woe.tot_rev_line 0.42935 0.06793 6.321 2.61e-10 ***
#woe.rev_util 0.29150 0.08721 3.342 0.000831 ***
#woe.bureau_score 0.83568 0.04974 16.803 < 2e-16 ***
#woe.ltv 0.97789 0.09121 10.721 < 2e-16 ***
pROC::roc(response = df$bad, predictor = fitted(mdl2))
# Area under the curve: 0.7751


(Figure: WoE vs. logit of the bad rate for the top 6 variables by IV)

More General Weighted Binning

You might be wondering what motivates me to spend countless weekend hours on the MOB package. The answer is plain and simple. It is the users who drive the development work.

After I published the wts_bin() function last week showing the impact of two-value weights on the monotonic binning outcome (https://statcompute.wordpress.com/2019/04/21/binning-with-weights), a question was asked whether I could write a more general weighted binning function with weights being any positive value. The function wqtl_bin() is my answer (https://github.com/statcompute/MonotonicBinning/blob/master/code/wqtl_bin.R).

Below is an example demonstrating how to use the wqtl_bin() function. First of all, let’s apply the function to the case with two-value weights that was illustrated last week. As expected, statistics from both approaches are identical. In the second use case, let’s assume that weights can be any value under the Uniform distribution between 0 and 10. With positive random weights, all statistics have changed.

It is worth mentioning that, while binning rules can be the same with or without weights in some cases, it is not necessarily true in all situations, depending on the distribution of weights across the data sample. As shown in binning outcomes for “ltv” below, there are 7 bins without weights but only 5 with weights.

wqtl_bin(cbind(df, w = ifelse(df$bad == 1, 1, 5)), bad, tot_derog, w)
#$df
# bin rule cnt freq dist mv_wt bad_freq bad_rate woe iv ks
#1 00 is.na($X) 213 785 0.0322 785 70 0.0892 0.6416 0.0178 2.7716
#2 01 $X <= 1 3741 16465 0.6748 0 560 0.0340 -0.3811 0.0828 18.9469
#3 02 $X > 1 & $X <= 2 478 1906 0.0781 0 121 0.0635 0.2740 0.0066 16.5222
#4 03 $X > 2 & $X <= 4 587 2231 0.0914 0 176 0.0789 0.5078 0.0298 10.6623
#5 04 $X > 4 818 3014 0.1235 0 269 0.0893 0.6426 0.0685 0.0000
#$cuts
#[1] 1 2 4
wqtl_bin(cbind(df, w = runif(nrow(df), 0, 10)), bad, tot_derog, w)
#$df
# bin rule cnt freq dist mv_wt bad_freq bad_rate woe iv ks
#1 00 is.na($X) 213 952.32 0.0325 952.32 304.89 0.3202 0.5808 0.0128 2.1985
#2 01 $X <= 1 3741 18773.11 0.6408 0.00 2943.75 0.1568 -0.3484 0.0700 17.8830
#3 02 $X > 1 & $X <= 2 478 2425.26 0.0828 0.00 604.51 0.2493 0.2312 0.0047 15.8402
#4 03 $X > 2 & $X <= 4 587 2989.80 0.1021 0.00 882.83 0.2953 0.4639 0.0249 10.4761
#5 04 $X > 4 818 4156.29 0.1419 0.00 1373.26 0.3304 0.6275 0.0657 0.0000
#$cuts
#[1] 1 2 4
wqtl_bin(cbind(df, w = runif(nrow(df), 0, 10)), bad, ltv, w)
#$df
# bin rule cnt freq dist mv_wt bad_freq bad_rate woe iv ks
#1 01 $X <= 88 1289 6448.76 0.2202 0.00 759.93 0.1178 -0.6341 0.0724 11.4178
#2 02 $X > 88 & $X <= 98 1351 6695.88 0.2286 0.00 1211.98 0.1810 -0.1306 0.0037 14.2883
#3 03 $X > 98 & $X <= 104 1126 5662.21 0.1933 0.00 1212.52 0.2141 0.0788 0.0012 12.7295
#4 04 $X > 104 & $X <= 113 1044 5277.64 0.1802 0.00 1210.91 0.2294 0.1674 0.0053 9.5611
#5 05 $X > 113 | is.na($X) 1027 5205.38 0.1777 0.93 1497.29 0.2876 0.4721 0.0451 0.0000
qtl_bin(df, bad, ltv)
#$df
# bin rule freq dist mv_cnt bad_freq bad_rate woe iv ks
#1 01 $X <= 84 956 0.1638 0 102 0.1067 -0.7690 0.0759 9.8728
#2 02 $X > 84 & $X <= 93 960 0.1645 0 142 0.1479 -0.3951 0.0227 15.6254
#3 03 $X > 93 & $X <= 99 876 0.1501 0 187 0.2135 0.0518 0.0004 14.8359
#4 04 $X > 99 & $X <= 103 821 0.1407 0 179 0.2180 0.0787 0.0009 13.7025
#5 05 $X > 103 & $X <= 109 773 0.1324 0 178 0.2303 0.1492 0.0031 11.6401
#6 06 $X > 109 & $X <= 117 722 0.1237 0 190 0.2632 0.3263 0.0144 7.2169
#7 07 $X > 117 | is.na($X) 729 0.1249 1 218 0.2990 0.5041 0.0364 0.0000


Binning with Weights

After working on the MOB package, I received requests from multiple users asking whether I could write a binning function that takes a weighting scheme into consideration. It is a legitimate request from the practical standpoint. For instance, in the development of fraud detection models, we often down-sample non-fraud cases given the extremely low frequency of fraud instances. After the down-sampling, a weight value > 1 should be assigned to all non-fraud cases to reflect the fraud rate in the pre-sampling data.

While accommodating the request for weighting cases is trivial, I'd like to do a simple experiment showing what the impact of weighting might be.

– First of all, let’s apply the monotonic binning to a variable named “tot_derog”. In this unweighted binning output, KS = 18.94, IV = 0.21, and WoE values range from -0.38 to 0.64.

– In the first trial, a weight value = 5 is assigned to cases with Y = 0 and a weight value = 1 assigned to cases with Y = 1. As expected, frequency, distribution, bad_frequency, and bad_rate changed. However, KS, IV, and WoE remain identical.

– In the second trial, a weight value = 1 is assigned to cases with Y = 0 and a weight value = 5 assigned to cases with Y = 1. Once again, KS, IV, and WoE are still the same as the unweighted output.

The conclusion from this demonstration is very clear. In the case of two-value weights assigned to the binary Y, the variable importance reflected by IV / KS as well as the WoE values should remain identical with or without weights. This is expected: with one weight applied to all Y = 0 cases and another to all Y = 1 cases, both the event and non-event distributions in each bin are scaled by constants that cancel out inside the log ratio, leaving WoE, IV, and KS unchanged. However, if you are concerned about the binning distribution and the bad rate in each bin, the function wts_bin() does the correction and is available in the project repository (https://github.com/statcompute/MonotonicBinning).

derog_bin <- qtl_bin(df, bad, tot_derog)
derog_bin
#$df
# bin rule freq dist mv_cnt bad_freq bad_rate woe iv ks
# 00 is.na($X) 213 0.0365 213 70 0.3286 0.6416 0.0178 2.7716
# 01 $X <= 1 3741 0.6409 0 560 0.1497 -0.3811 0.0828 18.9469
# 02 $X > 1 & $X <= 2 478 0.0819 0 121 0.2531 0.2740 0.0066 16.5222
# 03 $X > 2 & $X <= 4 587 0.1006 0 176 0.2998 0.5078 0.0298 10.6623
# 04 $X > 4 818 0.1401 0 269 0.3289 0.6426 0.0685 0.0000
# $cuts
# [1] 1 2 4
wts_bin(derog_bin$df, c(1, 5))
# bin rule wt_freq wt_dist wt_bads wt_badrate wt_woe wt_iv wt_ks
# 00 is.na($X) 493 0.0464 350 0.7099 0.6416 0.0178 2.7716
# 01 $X <= 1 5981 0.5631 2800 0.4681 -0.3811 0.0828 18.9469
# 02 $X > 1 & $X <= 2 962 0.0906 605 0.6289 0.2740 0.0066 16.5222
# 03 $X > 2 & $X <= 4 1291 0.1216 880 0.6816 0.5078 0.0298 10.6623
# 04 $X > 4 1894 0.1783 1345 0.7101 0.6426 0.0685 0.0000
wts_bin(derog_bin$df, c(5, 1))
# bin rule wt_freq wt_dist wt_bads wt_badrate wt_woe wt_iv wt_ks
# 00 is.na($X) 785 0.0322 70 0.0892 0.6416 0.0178 2.7716
# 01 $X <= 1 16465 0.6748 560 0.0340 -0.3811 0.0828 18.9469
# 02 $X > 1 & $X <= 2 1906 0.0781 121 0.0635 0.2740 0.0066 16.5222
# 03 $X > 2 & $X <= 4 2231 0.0914 176 0.0789 0.5078 0.0298 10.6623
# 04 $X > 4 3014 0.1235 269 0.0893 0.6426 0.0685 0.0000


Batch Deployment of WoE Transformations

After wrapping up the batch_woe() function today, with the purpose of allowing users to apply WoE transformations to many independent variables simultaneously, I have completed the development of the major functions in the MOB package that are usable for model development in a production setting.

The batch_woe() function is basically a wrapper around cal_woe() and has two input parameters. The "data" parameter is the data frame to which we would deploy binning outcomes, and the "slst" parameter is a list of binning specification tables that is either the direct output of the batch_bin() function or created manually by combining outputs from multiple binning functions.

There are also two components in the output of batch_woe(): a list of PSI tables for transformed variables and a data frame with a row index and all transformed variables. The default printout is a PSI summary of all input variables to be transformed. As shown below, all PSI values are below 0.1 and therefore none is concerning.
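
For reference, the PSI reported here follows the usual population stability index idea, comparing the bin distribution in the development sample against the bin distribution in the data being scored; the sketch below uses the standard textbook formula and made-up distributions rather than quoting the MOB source code.

# A minimal sketch of the PSI calculation between the development distribution (dist) and the
# scoring / calibration distribution (cal_dist) over the same bins; standard definition, not MOB code.
psi_sketch <- function(dist, cal_dist) sum((cal_dist - dist) * log(cal_dist / dist))
# hypothetical usage with made-up bin distributions:
psi_sketch(dist = c(0.25, 0.25, 0.25, 0.25), cal_dist = c(0.24, 0.26, 0.27, 0.23))
# a value below 0.1 is generally considered not concerning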

binout <- batch_bin(df, 1)

woeout <- batch_woe(df[sample(seq(nrow(df)), 2000, replace = T), ], binout$BinLst)

woeout 
#     tot_derog tot_tr age_oldest_tr tot_open_tr tot_rev_tr tot_rev_debt ...
# psi    0.0027 0.0044        0.0144      0.0011      3e-04       0.0013 ...

str(woeout, max.level = 1)
# List of 2
#  $ psi:List of 11
#  $ df :'data.frame':	2000 obs. of  12 variables:
#  - attr(*, "class")= chr "psiSummary"

head(woeout$df, 1)
#  idx_ woe.tot_derog woe.tot_tr woe.age_oldest_tr woe.tot_open_tr woe.tot_rev_tr ...
#     1       -0.3811    -0.0215           -0.5356         -0.0722        -0.1012 ...

All source code of the MOB package is available at https://github.com/statcompute/MonotonicBinning and is free (as in free beer) to download and distribute.

Batch Processing of Monotonic Binning

In my GitHub repository (https://github.com/statcompute/MonotonicBinning), multiple R functions have been developed to implement the monotonic binning by using either iterative discretization or isotonic regression. With these functions, we can run the monotonic binning for one independent variable at a time. However, in a real-world production environment, we would often want to apply the binning algorithm to hundreds or thousands of variables at once. In addition, we might be interested in comparing different binning outcomes.

The function batch_bin() is designed to apply a monotonic binning function to all numeric variables in a data frame, with the last column as the dependent variable. Currently, four binning algorithms are supported, including qtl_bin() and bad_bin() based on iterative discretization, iso_bin() based on isotonic regression, and gbm_bin() based on the generalized boosted model. Before using these four functions, we need to save the related R files in the working folder, where they would be sourced by the batch_bin() function. Scripts for the R functions can be downloaded from https://github.com/statcompute/MonotonicBinning/tree/master/code.

Below is a demonstration showing how to use the batch_bin() function, which only requires two input parameters, a data frame and an integer indicating the binning method. With method = 1, the batch_bin() function implements the iterative discretization by quantiles. With method = 4, it implements the generalized boosted modeling. As shown below, both KS and IV with method = 4 are higher than with method = 1 due to more granular bins. For instance, for rev_util, method = 1 only generates 2 bins, while method = 4 generates 11 bins.

head(df, 2)
# tot_derog tot_tr age_oldest_tr tot_open_tr tot_rev_tr tot_rev_debt tot_rev_line rev_util bureau_score ltv tot_income bad
#1 6 7 46 NaN NaN NaN NaN 0 747 109 4800.00 0
#2 0 21 153 6 1 97 4637 2 744 97 5833.33 0
batch_bin(df, 1)
#|var | nbin| unique| miss| min| median| max| ks| iv|
#|:--------------|-----:|-------:|-----:|----:|--------:|--------:|--------:|-------:|
#|tot_derog | 5| 29| 213| 0| 0.0| 32| 18.9469| 0.2055|
#|tot_tr | 5| 67| 213| 0| 16.0| 77| 15.7052| 0.1302|
#|age_oldest_tr | 10| 460| 216| 1| 137.0| 588| 19.9821| 0.2539|
#|tot_open_tr | 3| 26| 1416| 0| 5.0| 26| 6.7157| 0.0240|
#|tot_rev_tr | 3| 21| 636| 0| 3.0| 24| 9.0104| 0.0717|
#|tot_rev_debt | 3| 3880| 477| 0| 3009.5| 96260| 8.5102| 0.0627|
#|tot_rev_line | 9| 3617| 477| 0| 10573.0| 205395| 26.4924| 0.4077|
#|rev_util | 2| 101| 0| 0| 30.0| 100| 15.1570| 0.0930|
#|bureau_score | 12| 315| 315| 443| 692.5| 848| 34.8028| 0.7785|
#|ltv | 7| 145| 1| 0| 100.0| 176| 15.6254| 0.1538|
#|tot_income | 4| 1639| 5| 0| 3400.0| 8147167| 9.1526| 0.0500|
batch_bin(df, 1)$BinLst[["rev_util"]]$df
# bin rule freq dist mv_cnt bad_freq bad_rate woe iv ks
# 01 $X <= 31 3007 0.5152 0 472 0.1570 -0.3250 0.0493 15.157
# 02 $X > 31 2830 0.4848 0 724 0.2558 0.2882 0.0437 0.000
batch_bin(df, 4)
#|var | nbin| unique| miss| min| median| max| ks| iv|
#|:--------------|-----:|-------:|-----:|----:|--------:|--------:|--------:|-------:|
#|tot_derog | 8| 29| 213| 0| 0.0| 32| 20.0442| 0.2556|
#|tot_tr | 13| 67| 213| 0| 16.0| 77| 17.3002| 0.1413|
#|age_oldest_tr | 22| 460| 216| 1| 137.0| 588| 20.3646| 0.2701|
#|tot_open_tr | 6| 26| 1416| 0| 5.0| 26| 6.8695| 0.0274|
#|tot_rev_tr | 4| 21| 636| 0| 3.0| 24| 9.0779| 0.0789|
#|tot_rev_debt | 9| 3880| 477| 0| 3009.5| 96260| 8.8722| 0.0848|
#|tot_rev_line | 21| 3617| 477| 0| 10573.0| 205395| 26.8943| 0.4445|
#|rev_util | 11| 101| 0| 0| 30.0| 100| 16.9615| 0.1635|
#|bureau_score | 30| 315| 315| 443| 692.5| 848| 35.2651| 0.8344|
#|ltv | 17| 145| 1| 0| 100.0| 176| 16.8807| 0.1911|
#|tot_income | 17| 1639| 5| 0| 3400.0| 8147167| 10.3386| 0.0775|
batch_bin(df, 4)$BinLst[["rev_util"]]$df
# bin rule freq dist mv_cnt bad_freq bad_rate woe iv ks
# 01 $X <= 24 2653 0.4545 0 414 0.1560 -0.3320 0.0452 13.6285
# 02 $X > 24 & $X <= 36 597 0.1023 0 96 0.1608 -0.2963 0.0082 16.3969
# 03 $X > 36 & $X <= 40 182 0.0312 0 32 0.1758 -0.1890 0.0011 16.9533
# 04 $X > 40 & $X <= 58 669 0.1146 0 137 0.2048 -0.0007 0.0000 16.9615
# 05 $X > 58 & $X <= 60 77 0.0132 0 16 0.2078 0.0177 0.0000 16.9381
# 06 $X > 60 & $X <= 73 442 0.0757 0 103 0.2330 0.1647 0.0022 15.6305
# 07 $X > 73 & $X <= 75 62 0.0106 0 16 0.2581 0.2999 0.0010 15.2839
# 08 $X > 75 & $X <= 83 246 0.0421 0 70 0.2846 0.4340 0.0089 13.2233
# 09 $X > 83 & $X <= 96 376 0.0644 0 116 0.3085 0.5489 0.0225 9.1266
# 10 $X > 96 & $X <= 98 50 0.0086 0 17 0.3400 0.6927 0.0049 8.4162
# 11 $X > 98 483 0.0827 0 179 0.3706 0.8263 0.0695 0.0000


Monotonic Binning with GBM

In addition to the monotonic binning algorithms introduced in my previous post (https://statcompute.wordpress.com/2019/03/10/a-summary-of-my-home-brew-binning-algorithms-for-scorecard-development), two more functions based on Generalized Boosted Regression Models have been added to my GitHub repository: gbm_bin() and gbmcv_bin().

The function gbm_bin() estimates a GBM model without cross-validation and tends to generate a more granular binning outcome.

gbm_bin(df, bad, tot_derog)
# $df
# bin rule freq dist mv_cnt bad_freq bad_rate woe iv ks
# 00 is.na($X) 213 0.0365 213 70 0.3286 0.6416 0.0178 2.7716
# 01 $X <= 1 3741 0.6409 0 560 0.1497 -0.3811 0.0828 18.9469
# 02 $X > 1 & $X <= 2 478 0.0819 0 121 0.2531 0.2740 0.0066 16.5222
# 03 $X > 2 & $X <= 3 332 0.0569 0 86 0.2590 0.3050 0.0058 14.6321
# 04 $X > 3 & $X <= 9 848 0.1453 0 282 0.3325 0.6593 0.0750 3.2492
# 05 $X > 9 225 0.0385 0 77 0.3422 0.7025 0.0228 0.0000
# $cuts
# [1] 1 2 3 9


The function gbmcv_bin() estimates a GBM model with cross-validation (CV) and therefore generates a more stable but coarser binning outcome. Nonetheless, the computation is more expensive due to the CV, especially for large datasets.

gbmcv_bin(df, bad, tot_derog)
### OUTPUT ###
# $df
# bin rule freq dist mv_cnt bad_freq bad_rate woe iv ks
# 00 is.na($X) 213 0.0365 213 70 0.3286 0.6416 0.0178 2.7716
# 01 $X <= 1 3741 0.6409 0 560 0.1497 -0.3811 0.0828 18.9469
# 02 $X > 1 & $X <= 2 478 0.0819 0 121 0.2531 0.2740 0.0066 16.5222
# 03 $X > 2 1405 0.2407 0 445 0.3167 0.5871 0.0970 0.0000
# $cuts
# [1] 1 2


Motivated by the idea of my friend Talbot (https://www.linkedin.com/in/talbot-katz-b76785), I also drafted a function pava_bin() based upon the Pool Adjacent Violators Algorithm (PAVA) and compared it with the iso_bin() function based on the isotonic regression. As shown in the comparison below, there is no difference in the binning outcome. However, the computing cost of the pava_bin() function is higher, given that PAVA is an iterative algorithm solving for monotonicity.

pava_bin(df, bad, tot_derog)$df
# bin rule freq dist mv_cnt bad_freq bad_rate woe iv ks
# 00 is.na($X) 213 0.0365 213 70 0.3286 0.6416 0.0178 2.7716
# 01 $X <= 1 3741 0.6409 0 560 0.1497 -0.3811 0.0828 18.9469
# 02 $X > 1 & $X <= 2 478 0.0819 0 121 0.2531 0.2740 0.0066 16.5222
# 03 $X > 2 & $X <= 3 332 0.0569 0 86 0.2590 0.3050 0.0058 14.6321
# 04 $X > 3 & $X <= 23 1064 0.1823 0 353 0.3318 0.6557 0.0931 0.4370
# 05 $X > 23 9 0.0015 0 6 0.6667 2.0491 0.0090 0.0000
iso_bin(df, bad, tot_derog)$df
# bin rule freq dist mv_cnt bad_freq bad_rate woe iv ks
# 00 is.na($X) 213 0.0365 213 70 0.3286 0.6416 0.0178 2.7716
# 01 $X <= 1 3741 0.6409 0 560 0.1497 -0.3811 0.0828 18.9469
# 02 $X > 1 & $X <= 2 478 0.0819 0 121 0.2531 0.2740 0.0066 16.5222
# 03 $X > 2 & $X <= 3 332 0.0569 0 86 0.2590 0.3050 0.0058 14.6321
# 04 $X > 3 & $X <= 23 1064 0.1823 0 353 0.3318 0.6557 0.0931 0.4370
# 05 $X > 23 9 0.0015 0 6 0.6667 2.0491 0.0090 0.0000


Deployment of Binning Outcomes in Production

In my previous post (https://statcompute.wordpress.com/2019/03/10/a-summary-of-my-home-brew-binning-algorithms-for-scorecard-development), I've shown different monotonic binning algorithms that I developed over time. However, these binning functions are all useless without a deployment vehicle in production. During the weekend, I finally had time to draft an R function
(https://github.com/statcompute/MonotonicBinning/blob/master/code/calc_woe.R) that can be used to deploy a binning outcome and apply the WoE transformation to an attribute in an input data frame.

Below is a complete example showing how to apply the binning function mono_bin() to an attribute named "ltv" in the data frame, generate the binning specification, and then deploy the binning logic to calculate the WoE transformation of "ltv". There are two objects returned from the calc_woe() function: the original data frame with a new column named "woe.ltv" and a summary table showing the population stability index (PSI) of the input attribute "ltv".

While all are welcome to use my R code and functions for their own purposes, I would greatly appreciate it if you could reference the work and acknowledge my efforts.

url <- "https://github.com/statcompute/MonotonicBinning/blob/master/data/accepts.rds?raw=true"
download.file(url, "df.rds", mode = "wb")
df <- readRDS("df.rds")
source("https://raw.githubusercontent.com/statcompute/MonotonicBinning/master/code/manual_bin.R")
source("https://raw.githubusercontent.com/statcompute/MonotonicBinning/master/code/mono_bin.R")
ltv_bin <- mono_bin(df, bad, ltv)
ltv_bin$df
# bin rule freq dist mv_cnt bad_freq bad_rate woe iv ks
# 1 01 $X <= 86 1108 0.1898 0 122 0.1101 -0.7337 0.0810 11.0448
# 2 02 $X > 86 & $X <= 95 1081 0.1852 0 166 0.1536 -0.3510 0.0205 16.8807
# 3 03 $X > 95 & $X <= 101 1102 0.1888 0 242 0.2196 0.0880 0.0015 15.1771
# 4 04 $X > 101 & $X <= 106 743 0.1273 0 177 0.2382 0.1935 0.0050 12.5734
# 5 05 $X > 106 & $X <= 115 935 0.1602 0 226 0.2417 0.2126 0.0077 8.9540
# 6 06 $X > 115 | is.na($X) 868 0.1487 1 263 0.3030 0.5229 0.0468 0.0000
source("https://raw.githubusercontent.com/statcompute/MonotonicBinning/master/code/calc_woe.R")
ltv_woe <- calc_woe(df[sample(seq(nrow(df)), 1000), ], ltv, ltv_bin$df)
ltv_woe$psi
# bin rule dist woe cal_freq cal_dist cal_woe psi
# 1 01 $X <= 86 0.1898 -0.7337 188 0.188 -0.7337 0e+00
# 2 02 $X > 86 & $X <= 95 0.1852 -0.3510 179 0.179 -0.3510 2e-04
# 3 03 $X > 95 & $X <= 101 0.1888 0.0880 192 0.192 0.0880 1e-04
# 4 04 $X > 101 & $X <= 106 0.1273 0.1935 129 0.129 0.1935 0e+00
# 5 05 $X > 106 & $X <= 115 0.1602 0.2126 167 0.167 0.2126 3e-04
# 6 06 $X > 115 | is.na($X) 0.1487 0.5229 145 0.145 0.5229 1e-04
head(ltv_woe$df[, c("ltv", "woe.ltv")])
# ltv woe.ltv
# 2378 74 -0.7337
# 1897 60 -0.7337
# 2551 80 -0.7337
# 2996 83 -0.7337
# 1174 85 -0.7337
# 2073 74 -0.7337


A Summary of My Home-Brew Binning Algorithms for Scorecard Development

Thus far, I have published four different monotonic binning algorithms for scorecard development and think that it might be the right time for a quick summary. R functions for these binning algorithms are also available at https://github.com/statcompute/MonotonicBinning.

The first one was posted back in 2017 (https://statcompute.wordpress.com/2017/01/22/monotonic-binning-with-smbinning-package) based on my SAS macro (https://statcompute.wordpress.com/2012/06/10/a-sas-macro-implementing-monotonic-woe-transformation-in-scorecard-development) that has been widely used by SAS users. This R function, mono_bin(), is designed to generate monotonic bins with roughly equal densities, i.e. a similar number of records in each bin. There are two potential issues with this binning algorithm. Albeit robust, the binning outcome is too coarse and therefore might not be granular enough to capture the nature of the data. In addition, although the algorithm is fully automatic and able to converge globally, it requires iterations that might be computationally expensive for big datasets.

In light of the aforementioned shortcomings, I developed the second one based on the isotonic regression (https://statcompute.wordpress.com/2017/06/15/finer-monotonic-binning-based-on-isotonic-regression and https://statcompute.wordpress.com/2018/11/23/more-robust-monotonic-binning-based-on-isotonic-regression), which successfully addresses both the coarse binning and the iterations.

The third one was developed last year just out of my own curiosity (https://statcompute.wordpress.com/2018/10/14/monotonic-binning-with-equal-sized-bads-for-scorecard-development) for the purpose of generating monotonic bins with roughly equal-sized counts of bads, i.e. Y = 1. From the performance standpoint, this one is not materially different from the first one. It is more of a brainteaser for myself.

The last one (https://statcompute.wordpress.com/2018/11/25/improving-binning-by-bootstrap-bumping) was mainly motivated by the idea of Bootstrap Bumping from Tibshirani and Knight (1997) and implements the bumping on top of the second one above based on the isotonic regression. The outcome of this one is satisfactory in two respects. First of all, since the bumping is based on bootstrap samples drawn from the original dataset, the concern about over-fitting due to the sample bias can be somewhat addressed. Secondly, through the bumping search across all bootstrap samples, chances are that a closer-to-optimal solution can be achieved.

The R functions for all 4 binning algorithms on GitHub are built on top of a utility function, manual_bin() (https://github.com/statcompute/MonotonicBinning/blob/master/code/manual_bin.R). In the R code, I tried my best to make it as generic as possible without importing additional packages. The only exception is that the parallel package is used in the bump_bin() function to speed up the computation. My next task might be writing a scoring function to make these binning algorithms useful in production.

Statistical Assessments of AUC

In scorecard development, the area under the ROC curve, also known as AUC, has been widely used to measure the performance of a risk scorecard. Everything else being equal, the scorecard with a higher AUC is considered more predictive than the one with a lower AUC. However, little attention has been paid to the statistical analysis of the AUC itself during scorecard development.

While it might be less of a concern to rely on a simple comparison of AUC for model selection in the development stage and then to pick the scorecard with the higher AUC, more attention should be paid to the AUC analysis in the post-development stage. For instance, senior management would need to decide whether it is worthwhile to retire a legacy scorecard that might still be performing and to launch the full-scale deployment of a new scorecard just for an increase in AUC that might not even be statistically significant. While the claim of certain business benefits can always be used as an argument in favor of the new scorecard, the justification becomes even more compelling with solid statistical evidence. What's more, the model validation analyst might also want to leverage the outcome of the AUC analysis to ensure the statistical soundness of new scorecards.

In the example below, two logistic regressions were estimated with AUC = 0.6554 and BIC = 6,402 for the model with 6 variables and AUC = 0.6429 and BIC = 6,421 for the model with 3 variables.

df1 <- read.csv("Downloads/credit_count.txt")
df2 <- df1[which(df1$CARDHLDR == 1), ]
y <- "DEFAULT"
x1 <- c("OWNRENT", "INCOME", "INCPER", "LOGSPEND", "AGE", "EXP_INC")
x2 <- c("MAJORDRG", "MINORDRG", "INCOME")

m1 <- glm(eval(paste(y, paste(x1, collapse = " + "), sep = " ~ ")), data = df2, family = binomial)
#              Estimate Std. Error z value Pr(>|z|)
#(Intercept) -1.749e-01  1.659e-01  -1.054 0.291683
#OWNRENT     -2.179e-01  7.686e-02  -2.835 0.004581 **
#INCOME      -2.424e-04  4.364e-05  -5.554 2.79e-08 ***
#INCPER      -1.291e-05  3.318e-06  -3.890 0.000100 ***
#LOGSPEND    -2.165e-01  2.848e-02  -7.601 2.95e-14 ***
#AGE         -8.330e-03  3.774e-03  -2.207 0.027312 *
#EXP_INC      1.340e+00  3.467e-01   3.865 0.000111 ***

BIC(m1)
# 6401.586

roc1 <- pROC::roc(response = df2$DEFAULT, predictor = fitted(m1))
# Area under the curve: 0.6554

m2 <- glm(eval(paste(y, paste(x2, collapse = " + "), sep = " ~ ")), data = df2, family = binomial)
#              Estimate Std. Error z value Pr(>|z|)
#(Intercept) -1.222e+00  9.076e-02 -13.459  < 2e-16 ***
#MAJORDRG     2.031e-01  6.921e-02   2.934  0.00335 **
#MINORDRG     1.920e-01  4.784e-02   4.013 5.99e-05 ***
#INCOME      -4.706e-04  3.919e-05 -12.007  < 2e-16 ***

BIC(m2)
# 6421.232

roc2 <- pROC::roc(response = df2$DEFAULT, predictor = fitted(m2))
# Area under the curve: 0.6429

Both the AUC and BIC statistics seemed to favor the first model. However, is a 2% difference in AUC significant enough to infer a better model? Under the Null Hypothesis of no difference in AUC, three statistical tests were employed to assess the difference in AUC / ROC between the two models.

set.seed(2019)
# REFERENCE:
# A METHOD OF COMPARING THE AREAS UNDER RECEIVER OPERATING CHARACTERISTIC CURVES DERIVED FROM THE SAME CASES
# HANLEY JA, MCNEIL BJ (1983)
pROC::roc.test(roc1, roc2, method = "bootstrap", boot.n = 500, progress = "none", paired = T)
# D = 1.7164, boot.n = 500, boot.stratified = 1, p-value = 0.0861

# REFERENCE:
# COMPARING THE AREAS UNDER TWO OR MORE CORRELATED RECEIVER OPERATING CHARACTERISTIC CURVES: A NONPARAMETRIC APPROACH
# DELONG ER, DELONG DM, CLARKE-PEARSON DL (1988)
pROC::roc.test(roc1, roc2, method = "delong", paired = T)
# Z = 1.7713, p-value = 0.0765

# REFERENCE
# A DISTRIBUTION-FREE PROCEDURE FOR COMPARING RECEIVER OPERATING CHARACTERISTIC CURVES FROM A PAIRED EXPERIMENT
# VENKATRAMAN ES, BEGG CB (1996)
pROC::roc.test(roc1, roc2, method = "venkatraman", boot.n = 500, progress = "none", paired = T)
# E = 277560, boot.n = 500, p-value = 0.074

Based upon the above output, there is no strong statistical evidence against the Null Hypothesis.

pscl::vuong(m1, m2)
#              Vuong z-statistic             H_A  p-value
#Raw                   2.0963830 model1 > model2 0.018024
#AIC-corrected         1.8311449 model1 > model2 0.033539
#BIC-corrected         0.8684585 model1 > model2 0.192572

In addition, a Vuong test is also performed, supporting no difference between the two models after correcting for the Schwarz (BIC) penalty.

A Utility Function for Monotonic Binning

In all monotonic binning algorithms that I posted before, I heavily relied on the smbinning::smbinning.custom() function contributed by Herman Jopia as the utility generating the binning output and therefore feel deeply indebted to his excellent work. However, the availability of the smbinning::smbinning.custom() function shouldn't become my excuse for being lazy. Over the weekend, I drafted a function, manual_bin(), serving a similar purpose.

Although it is not as flexible and elegant as Herman's work, the manual_bin() function does have certain advantages in handling missing values and therefore improves the calculation of WoE and Information Value for missing values.
1. For the missing-value category, if there are both good and bad records, then this category will be treated as a standalone bin.
2. For the missing-value category, if there are only good or only bad records but not both, then this category will be merged into the bin with the lowest or the highest bad rate, respectively. Therefore, WoE and IV for missing values won't be shown as "NaN" anymore.

In addition, the output of the manual_bin() function also includes a set of rules that can potentially be applied to an R data frame in order to generate WoE transformations, which I will show in the future.

manual_bin <- function(df, yname, xname, cuts) {
cuts <- sort(c(-Inf, cuts, Inf))
df1 <- df[which(df[[yname]] %in% c(0, 1)), c(yname, xname)]
all_cnt <- nrow(df1)
all_bcnt <- sum(df1[[yname]])
### IDENTIFY DIFFERENT CASES WITH MISSING VALUES ###
if (all(!is.na(df1[[xname]])) == TRUE) {
miss_flg <- 0
df2 <- df1
}
else {
miss_flg <- 1
df2 <- df1[!is.na(df1[, xname]), ]
mis <- df1[is.na(df1[, xname]), ]
mis_cnt <- nrow(mis)
mis_bcnt <- sum(mis[[yname]])
if (sum(mis[[yname]]) %in% c(nrow(mis), 0)) {
miss_flg <- 2
}
}
### SLICE DATAFRAME BY CUT POINTS ###
for (i in seq(length(cuts) - 1)) {
bin <- sprintf("%02d", i)
bin_cnt <- nrow(df2[which(df2[[xname]] > cuts[i] & df2[[xname]] <= cuts[i + 1]), ])
bin_bcnt <- nrow(df2[which(df2[[xname]] > cuts[i] & df2[[xname]] <= cuts[i + 1] & df2[[yname]] == 1), ])
if (i == 1) {
bin_summ <- data.frame(bin = bin, xmin = cuts[i], xmax = cuts[i + 1], cnt = bin_cnt, bcnt = bin_bcnt)
}
else {
bin_summ <- rbind(bin_summ,
data.frame(bin = bin, xmin = cuts[i], xmax = cuts[i + 1], cnt = bin_cnt, bcnt = bin_bcnt))
}
}
bin_summ$mis_cnt <- 0
### FIRST CASE FOR MISSING VALUES: BOTH GOODS AND BADS ###
if (miss_flg == 1) {
bin_summ <- rbind(data.frame(bin = sprintf("%02d", 0), xmin = NA, xmax = NA, cnt = mis_cnt, bcnt = mis_bcnt, mis_cnt = mis_cnt),
bin_summ)
}
### SECOND CASE FOR MISSING VALUES: ONLY GOODS OR BADS ###
if (miss_flg == 2) {
rate <- bin_summ$bcnt / bin_summ$cnt
if (mis_bcnt == 0) {
bin_summ[rate == min(rate), "cnt"] <- bin_summ[rate == min(rate), "cnt"] + mis_cnt
bin_summ[rate == min(rate), "mis_cnt"] <- mis_cnt
}
else {
bin_summ[rate == max(rate), "cnt"] <- bin_summ[rate == max(rate), "cnt"] + mis_cnt
bin_summ[rate == max(rate), "bcnt"] <- bin_summ[rate == max(rate), "bcnt"] + mis_bcnt
bin_summ[rate == max(rate), "mis_cnt"] <- mis_cnt
}
}
bin_summ$dist <- bin_summ$cnt / all_cnt
bin_summ$brate <- bin_summ$bcnt / bin_summ$cnt
bin_summ$woe <- log((bin_summ$bcnt / all_bcnt) / ((bin_summ$cnt - bin_summ$bcnt) / (all_cnt - all_bcnt)))
bin_summ$iv <- (bin_summ$bcnt / all_bcnt - (bin_summ$cnt - bin_summ$bcnt) / (all_cnt - all_bcnt)) * bin_summ$woe
bin_summ$ks <- abs(cumsum(bin_summ$bcnt) / all_bcnt - cumsum(bin_summ$cnt - bin_summ$bcnt) / (all_cnt - all_bcnt)) * 100
bin_summ$rule <- NA
for (i in seq(nrow(bin_summ))) {
if (bin_summ[i, ]$bin == '00') {
bin_summ[i, ]$rule <- paste("is.na($X)", sep = '')
}
else if (bin_summ[i, ]$bin == '01') {
if (bin_summ[i, ]$mis_cnt > 0) {
bin_summ[i, ]$rule <- paste("$X <= ", bin_summ[i, ]$xmax, " | is.na($X)", sep = '')
}
else {
bin_summ[i, ]$rule <- paste("$X <= ", bin_summ[i, ]$xmax, sep = '')
}
}
else if (i == nrow(bin_summ)) {
if (bin_summ[i, ]$mis_cnt > 0) {
bin_summ[i, ]$rule <- paste("$X > ", bin_summ[i, ]$xmin, " | is.na($X)", sep = '')
}
else {
bin_summ[i, ]$rule <- paste("$X > ", bin_summ[i, ]$xmin, sep = '')
}
}
else {
bin_summ[i, ]$rule <- paste("$X > ", bin_summ[i, ]$xmin, " & ", "$X <= ", bin_summ[i, ]$xmax, sep = '')
}
}
return(result <- data.frame(Bin = bin_summ$bin, Rule = format(bin_summ$rule, width = 30, justify = "right"),
Frequency = bin_summ$cnt, Percent = round(bin_summ$dist, 2),
MV_Cnt = bin_summ$mis_cnt, Bad_Freq = bin_summ$bcnt, Bad_Rate = round(bin_summ$brate, 2),
WoE = round(bin_summ$woe, 4), InfoValue = round(bin_summ$iv, 4), KS_Stat = round(bin_summ$ks, 2)))
}
# SAMPLE OUTPUT:
# Bin Rule Frequency Percent MV_Cnt Bad_Freq Bad_Rate WoE InfoValue KS_Stat
#1 01 $X <= 82 814 0.14 0 81 0.10 -0.8467 0.0764 9.02
#2 02 $X > 82 & $X <= 91 837 0.14 0 120 0.14 -0.4316 0.0234 14.44
#3 03 $X > 91 & $X <= 97 811 0.14 0 148 0.18 -0.1436 0.0027 16.35
#4 04 $X > 97 & $X <= 101 829 0.14 0 181 0.22 0.0806 0.0009 15.18
#5 05 $X > 101 & $X <= 107 870 0.15 0 206 0.24 0.1855 0.0054 12.26
#6 06 $X > 107 & $X <= 115 808 0.14 0 197 0.24 0.2241 0.0074 8.95
#7 07 $X > 115 | is.na($X) 868 0.15 1 263 0.30 0.5229 0.0468 0.00


Improving Binning by Bootstrap Bumping

In the post (https://statcompute.wordpress.com/2018/11/23/more-robust-monotonic-binning-based-on-isotonic-regression), a more robust version of monotonic binning based on the isotonic regression was introduced. Nonetheless, due to the loss of granularity, the predictability has been somewhat compromised, which is a typical dilemma in data science. On one hand, we don't want to use a learning algorithm that is too greedy and therefore over-fits the data at the cost of simplicity and generality. On the other hand, we'd also like to get the most predictive power out of our data for better business results.

It is worth mentioning that, although there is a consensus that advanced ensemble algorithms are able to significantly improve the prediction outcome, both bagging and boosting would also destroy the simple structure of binning outputs and therefore might not be directly applicable in this simple case.

In light of the above considerations, the bumping (Bootstrap Umbrella of Model Parameters) procedure, which was detailed in "Model Search and Inference by Bootstrap Bumping" by Tibshirani and Knight (1997), should serve our dual purposes. First of all, since the final binning structure is derived from an isotonic regression based on a bootstrap sample, the concern about over-fitting the original training data can be addressed. Secondly, through the bumping search across all bootstrap samples, chances are that a closer-to-optimal solution can be achieved. It is noted that, since the original sample is always included in the bumping procedure, a binning outcome with bumping is guaranteed to be at least as good as the one without.

The R function bump_bin() is my effort to implement the bumping procedure on top of the monotonic binning function based on the isotonic regression. Because the binning of each bootstrap sample is mutually independent, the bumping is a perfect use case for parallelism to achieve faster execution, as demonstrated in the function.

bump_bin <- function(data, y, x, n) {
n1 <- 50
n2 <- 10
set.seed(2019)
seeds <- c(0, round(runif(n) * as.numeric(paste('1e', ceiling(log10(n)) + 2, sep = '')), 0))
yname <- deparse(substitute(y))
xname <- deparse(substitute(x))
df1 <- data[, c(yname, xname)]
df2 <- df1[!is.na(df1[, xname]), c(xname, yname)]
cor <- cor(df2[, 2], df2[, 1], method = "spearman", use = "complete.obs")
### MONOTONIC BINNING WITH BOOTSTRAP SAMPLES ###
bin <- function(seed) {
if (seed == 0) {
smp <- df2
}
else {
set.seed(seed)
smp <- df2[sample(seq(nrow(df2)), nrow(df2), replace = T), ]
}
reg <- isoreg(smp[, 1], cor / abs(cor) * smp[, 2])
cut <- knots(as.stepfun(reg))
df2$cut <- cut(df2[[xname]], breaks = unique(cut), include.lowest = T)
df3 <- Reduce(rbind,
lapply(split(df2, df2$cut),
function(x) data.frame(n = nrow(x), b = sum(x[[yname]]), g = sum(1 - x[[yname]]),
maxx = max(x[[xname]]), minx = min(x[[xname]]))))
df4 <- df3[which(df3[["n"]] > n1 & df3[["b"]] > n2 & df3[["g"]] > n2), ]
df2$good <- 1 - df2[[yname]]
out <- smbinning::smbinning.custom(df2, "good", xname, cuts = df4$maxx[-nrow(df4)])$ivtable
return(data.frame(iv = out$IV[length(out$IV)], nbin = nrow(out) - 2,
cuts = I(list(df4$maxx[-nrow(df4)])),
abs_cor = abs(cor(as.numeric(row.names(out)[1:(nrow(out) - 2)]),
out$WoE[1:(nrow(out) - 2)], method = "spearman"))))
}
bump_out <- Reduce(rbind, parallel::mclapply(seeds, mc.cores = parallel::detectCores(), bin))
### FIND THE CUT MAXIMIZING THE INFORMATION VALUE ###
cut2 <- bump_out[order(-bump_out["abs_cor"], -bump_out["iv"], bump_out["nbin"]), ]$cuts[[1]]
df1$good <- 1 - df1[[yname]]
return(smbinning::smbinning.custom(df1, "good", xname, cuts = cut2)$ivtable)
}
df <- sas7bdat::read.sas7bdat("Downloads/accepts.sas7bdat")
bump_bin(df, bad, bureau_score, n = 200)


The output below shows the bumping result based on 20 bootstrap samples. There is a small improvement in the information value, i.e. 0.8055 vs. 0.8021 without bumping, achieved with a simpler binning structure, i.e. 12 bins vs. 20 bins.

   Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds  LnOdds     WoE     IV
1    <= 565     92      41     51        92         41        51 0.0158   0.4457  0.5543  0.8039 -0.2183 -1.5742 0.0532
2    <= 620    470     269    201       562        310       252 0.0805   0.5723  0.4277  1.3383  0.2914 -1.0645 0.1172
3    <= 653    831     531    300      1393        841       552 0.1424   0.6390  0.3610  1.7700  0.5710 -0.7850 0.1071
4    <= 662    295     213     82      1688       1054       634 0.0505   0.7220  0.2780  2.5976  0.9546 -0.4014 0.0091
5    <= 665    100      77     23      1788       1131       657 0.0171   0.7700  0.2300  3.3478  1.2083 -0.1476 0.0004
6    <= 675    366     290     76      2154       1421       733 0.0627   0.7923  0.2077  3.8158  1.3391 -0.0168 0.0000
7    <= 699    805     649    156      2959       2070       889 0.1379   0.8062  0.1938  4.1603  1.4256  0.0696 0.0007
8    <= 707    312     268     44      3271       2338       933 0.0535   0.8590  0.1410  6.0909  1.8068  0.4509 0.0094
9    <= 716    321     278     43      3592       2616       976 0.0550   0.8660  0.1340  6.4651  1.8664  0.5105 0.0122
10   <= 721    181     159     22      3773       2775       998 0.0310   0.8785  0.1215  7.2273  1.9779  0.6219 0.0099
11   <= 755    851     789     62      4624       3564      1060 0.1458   0.9271  0.0729 12.7258  2.5436  1.1877 0.1403
12    > 755    898     867     31      5522       4431      1091 0.1538   0.9655  0.0345 27.9677  3.3311  1.9751 0.3178
13  Missing    315     210    105      5837       4641      1196 0.0540   0.6667  0.3333  2.0000  0.6931 -0.6628 0.0282
14    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804  1.3559  0.0000 0.8055

The output below is based on bumping with 200 bootstrap samples. The information value has been improved by 2%, e.g. 0.8174 vs 0.8021, with a lower risk of over-fitting, e.g. 14 bins vs 20 bins.

   Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds  LnOdds     WoE     IV
1    <= 559     79      34     45        79         34        45 0.0135   0.4304  0.5696  0.7556 -0.2803 -1.6362 0.0496
2    <= 633    735     428    307       814        462       352 0.1259   0.5823  0.4177  1.3941  0.3323 -1.0237 0.1684
3    <= 637     86      53     33       900        515       385 0.0147   0.6163  0.3837  1.6061  0.4738 -0.8822 0.0143
4    <= 653    493     326    167      1393        841       552 0.0845   0.6613  0.3387  1.9521  0.6689 -0.6870 0.0477
5    <= 662    295     213     82      1688       1054       634 0.0505   0.7220  0.2780  2.5976  0.9546 -0.4014 0.0091
6    <= 665    100      77     23      1788       1131       657 0.0171   0.7700  0.2300  3.3478  1.2083 -0.1476 0.0004
7    <= 679    504     397    107      2292       1528       764 0.0863   0.7877  0.2123  3.7103  1.3111 -0.0448 0.0002
8    <= 683    160     129     31      2452       1657       795 0.0274   0.8062  0.1938  4.1613  1.4258  0.0699 0.0001
9    <= 699    507     413     94      2959       2070       889 0.0869   0.8146  0.1854  4.3936  1.4802  0.1242 0.0013
10   <= 716    633     546     87      3592       2616       976 0.1084   0.8626  0.1374  6.2759  1.8367  0.4808 0.0216
11   <= 722    202     178     24      3794       2794      1000 0.0346   0.8812  0.1188  7.4167  2.0037  0.6478 0.0118
12   <= 746    619     573     46      4413       3367      1046 0.1060   0.9257  0.0743 12.4565  2.5222  1.1663 0.0991
13   <= 761    344     322     22      4757       3689      1068 0.0589   0.9360  0.0640 14.6364  2.6835  1.3276 0.0677
14    > 761    765     742     23      5522       4431      1091 0.1311   0.9699  0.0301 32.2609  3.4739  2.1179 0.2979
15  Missing    315     210    105      5837       4641      1196 0.0540   0.6667  0.3333  2.0000  0.6931 -0.6628 0.0282
16    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804  1.3559  0.0000 0.8174

More Robust Monotonic Binning Based on Isotonic Regression

Since publishing the monotonic binning function based upon the isotonic regression (https://statcompute.wordpress.com/2017/06/15/finer-monotonic-binning-based-on-isotonic-regression), I’ve received some feedback from peers. A potential concern is that, albeit improving the granularity and predictability, such a fine binning might not generalize well in new data.

In light of the concern, I revised the function by imposing two thresholds, including a minimum sample size and a minimum number of bads for each bin. Both thresholds can be adjusted based on the specific use case. For instance, I set the minimum sample size equal to 50 and the minimum number of bads (and goods) equal to 10 in the example below.

isoreg_bin <- function(data, y, x) {
  n1 <- 50
  n2 <- 10
  yname <- deparse(substitute(y))
  xname <- deparse(substitute(x))
  df1 <- data[, c(yname, xname)]
  df2 <- df1[!is.na(df1[, xname]), c(xname, yname)]
  cor <- cor(df2[, 2], df2[, 1], method = "spearman", use = "complete.obs")
  reg <- isoreg(df2[, 1], cor / abs(cor) * df2[, 2])
  cut <- knots(as.stepfun(reg))
  df2$cut <- cut(df2[[xname]], breaks = unique(cut), include.lowest = T)
  df3 <- Reduce(rbind,
           lapply(split(df2, df2$cut),
             function(x) data.frame(n = nrow(x),
                                    b = sum(x[[yname]]),
                                    g = sum(1 - x[[yname]]),
                                    maxx = max(x[[xname]]),
                                    minx = min(x[[xname]]))))
  df4 <- df3[which(df3[["n"]] > n1 & df3[["b"]] > n2 & df3[["g"]] > n2), ]
  df1$good <- 1 - df1[[yname]]
  return(smbinning::smbinning.custom(df1, "good", xname, cuts = df4$maxx[-nrow(df4)])$ivtable)
}
df <- sas7bdat::read.sas7bdat("Downloads/accepts.sas7bdat")
isoreg_bin(df, bad, bureau_score)

As shown in the output below, the number of generated bins and the information value happened to be between the result in (https://statcompute.wordpress.com/2017/06/15/finer-monotonic-binning-based-on-isotonic-regression) and the result in (https://statcompute.wordpress.com/2017/01/22/monotonic-binning-with-smbinning-package). More importantly, given a larger sample size for each bin, the binning algorithm is more robust and generalizable.

   Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds  LnOdds     WoE     IV
1    <= 559     79      34     45        79         34        45 0.0135   0.4304  0.5696  0.7556 -0.2803 -1.6362 0.0496
2    <= 602    189     102     87       268        136       132 0.0324   0.5397  0.4603  1.1724  0.1591 -1.1969 0.0608
3    <= 605     56      31     25       324        167       157 0.0096   0.5536  0.4464  1.2400  0.2151 -1.1408 0.0162
4    <= 632    468     279    189       792        446       346 0.0802   0.5962  0.4038  1.4762  0.3895 -0.9665 0.0946
5    <= 639    150      95     55       942        541       401 0.0257   0.6333  0.3667  1.7273  0.5465 -0.8094 0.0207
6    <= 653    451     300    151      1393        841       552 0.0773   0.6652  0.3348  1.9868  0.6865 -0.6694 0.0412
7    <= 662    295     213     82      1688       1054       634 0.0505   0.7220  0.2780  2.5976  0.9546 -0.4014 0.0091
8    <= 665    100      77     23      1788       1131       657 0.0171   0.7700  0.2300  3.3478  1.2083 -0.1476 0.0004
9    <= 667     57      44     13      1845       1175       670 0.0098   0.7719  0.2281  3.3846  1.2192 -0.1367 0.0002
10   <= 677    381     300     81      2226       1475       751 0.0653   0.7874  0.2126  3.7037  1.3093 -0.0466 0.0001
11   <= 679     66      53     13      2292       1528       764 0.0113   0.8030  0.1970  4.0769  1.4053  0.0494 0.0000
12   <= 683    160     129     31      2452       1657       795 0.0274   0.8062  0.1938  4.1613  1.4258  0.0699 0.0001
13   <= 689    203     164     39      2655       1821       834 0.0348   0.8079  0.1921  4.2051  1.4363  0.0804 0.0002
14   <= 699    304     249     55      2959       2070       889 0.0521   0.8191  0.1809  4.5273  1.5101  0.1542 0.0012
15   <= 707    312     268     44      3271       2338       933 0.0535   0.8590  0.1410  6.0909  1.8068  0.4509 0.0094
16   <= 717    368     318     50      3639       2656       983 0.0630   0.8641  0.1359  6.3600  1.8500  0.4941 0.0132
17   <= 721    134     119     15      3773       2775       998 0.0230   0.8881  0.1119  7.9333  2.0711  0.7151 0.0094
18   <= 739    474     438     36      4247       3213      1034 0.0812   0.9241  0.0759 12.1667  2.4987  1.1428 0.0735
19   <= 746    166     154     12      4413       3367      1046 0.0284   0.9277  0.0723 12.8333  2.5520  1.1961 0.0277
20    > 746   1109    1064     45      5522       4431      1091 0.1900   0.9594  0.0406 23.6444  3.1631  1.8072 0.3463
21  Missing    315     210    105      5837       4641      1196 0.0540   0.6667  0.3333  2.0000  0.6931 -0.6628 0.0282
22    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804  1.3559  0.0000 0.8021

Monotonic Binning with Equal-Sized Bads for Scorecard Development

In previous posts (https://statcompute.wordpress.com/2017/01/22/monotonic-binning-with-smbinning-package) and (https://statcompute.wordpress.com/2017/06/15/finer-monotonic-binning-based-on-isotonic-regression), I’ve developed 2 different algorithms for monotonic binning. While the first tends to generate bins with equal densities, the second would define finer bins based on the isotonic regression.

In the code snippet below, a third approach is illustrated for the purpose of generating bins with roughly equal numbers of bads. Once again, for the reporting layer, I leveraged the flexible smbinning::smbinning.custom() function with a small tweak.


df <- sas7bdat::read.sas7bdat("Downloads/accepts.sas7bdat")

monobin <- function(df, x, y) {
  yname <- deparse(substitute(y))
  xname <- deparse(substitute(x))
  d1 <- df[c(yname, xname)]
  d2 <- d1[which(d1[[yname]] == 1), ]
  # start with the number of bins implied by the largest concentration of bads at a single x value
  nbin <- round(1 / max(table(d2[[xname]]) / sum(table(d2[[xname]]))))
  repeat {
    # cut x such that each bin holds a roughly equal number of bads
    cuts <- Hmisc::cut2(d2[[xname]], g = nbin, onlycuts = T)
    d1$cut <- cut(d1[[xname]], breaks = cuts, include.lowest = T)
    d3 <- Reduce(rbind, Map(function(x) data.frame(xmean = mean(x[[xname]], na.rm = T),
                                                   ymean = mean(x[[yname]])), split(d1, d1$cut)))
    # stop when the bad rate is perfectly monotonic or only 2 bins are left
    if(abs(cor(d3$xmean, d3$ymean, method = "spearman")) == 1 | nrow(d3) == 2) {
      break
    }
    nbin <- nbin - 1
  }
  df$good <- 1 - d1[[yname]]
  return(smbinning::smbinning.custom(df, "good", xname, cuts = cuts[c(-1, -length(cuts))]))
}

As shown in the output, the number of bads in each bin, with the exception of the missing group, is similar and varies within a small range. However, the number of records tends to increase across bins in order to ensure the monotonicity of bad rates.

monobin(df, bureau_score, bad)
#   Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds LnOdds     WoE     IV
#1    <= 602    268     136    132       268        136       132 0.0459   0.5075  0.4925  1.0303 0.0299 -1.3261 0.1075
#2    <= 621    311     185    126       579        321       258 0.0533   0.5949  0.4051  1.4683 0.3841 -0.9719 0.0636
#3    <= 636    302     186    116       881        507       374 0.0517   0.6159  0.3841  1.6034 0.4722 -0.8838 0.0503
#4    <= 649    392     259    133      1273        766       507 0.0672   0.6607  0.3393  1.9474 0.6665 -0.6895 0.0382
#5    <= 661    387     268    119      1660       1034       626 0.0663   0.6925  0.3075  2.2521 0.8119 -0.5441 0.0227
#6    <= 676    529     415    114      2189       1449       740 0.0906   0.7845  0.2155  3.6404 1.2921 -0.0639 0.0004
#7    <= 693    606     491    115      2795       1940       855 0.1038   0.8102  0.1898  4.2696 1.4515  0.0956 0.0009
#8    <= 717    844     716    128      3639       2656       983 0.1446   0.8483  0.1517  5.5938 1.7216  0.3657 0.0173
#9     > 717   1883    1775    108      5522       4431      1091 0.3226   0.9426  0.0574 16.4352 2.7994  1.4435 0.4217
#10  Missing    315     210    105      5837       4641      1196 0.0540   0.6667  0.3333  2.0000 0.6931 -0.6628 0.0282
#11    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804 1.3559  0.0000 0.7508

Additional Thoughts on Estimating LGD with Proportional Odds Model

In my previous post (https://statcompute.wordpress.com/2018/01/28/modeling-lgd-with-proportional-odds-model), I’ve discussed how to use Proportional Odds Models in the LGD model development. In particular, I specifically mentioned that we would estimate a sub-model, which can be Gamma or Simplex regression, to project the conditional mean for LGD values in the (0, 1) range. However, it is worth pointing out that, if we define a finer LGD segmentation, this sub-model becomes completely optional. A standalone Proportional Odds Model without any sub-model is more than sufficient to serve the purpose of stress testing, e.g. CCAR.

In the example below, I will define 5 categories based upon LGD values in the [0, 1] range, estimate a Proportional Odds Model as usual, and then demonstrate how to apply the model outcome in the setting of stress testing with the stressed model input, e.g. LTV.

First of all, I defined 5 instead of 3 categories for LGD values, as shown below. Nonetheless, we could use an even finer category definition in practice to achieve a more accurate outcome.


df <- read.csv("lgd.csv")
df$lgd <- round(1 - df$Recovery_rate, 4)
l1 <- c(-Inf, 0, 0.0999, 0.4999, 0.9999, Inf)
l2 <- c("A", "B", "C", "D", "E")
df$lgd_cat <- cut(df$lgd, breaks = l1, labels = l2, ordered_result = T)
summary(df$lgd_cat)
m1 <- ordinal::clm(lgd_cat ~ LTV, data = df)
#Coefficients:
#    Estimate Std. Error z value Pr(>|z|)    
#LTV   2.3841     0.1083   22.02   <2e-16 ***
#
#Threshold coefficients:
#    Estimate Std. Error z value
#A|B  0.54082    0.07897   6.848
#B|C  2.12270    0.08894  23.866
#C|D  3.18098    0.10161  31.307
#D|E  4.80338    0.13174  36.460

After the model estimation, it is straightforward to calculate the probability of each LGD category. The only remaining question is how to calculate the LGD projection for each individual account as well as for the whole portfolio. In order to calculate the LGD projection, we need two factors, namely the probability and the expected mean of each LGD category, such that

Estimated_LGD = SUM_i [Prob(category i) * LGD_Mean(category i)], where i = A, B, C, D, and E in this particular case.

The calculation is shown below with the estimated LGD = 0.23 that is consistent with the actual LGD = 0.23 for the whole portfolio.


prob_A <- exp(df$LTV * (-m1$beta) + m1$Theta[1]) / (1 + exp(df$LTV * (-m1$beta) + m1$Theta[1])) 
prob_B <- exp(df$LTV * (-m1$beta) + m1$Theta[2]) / (1 + exp(df$LTV * (-m1$beta) + m1$Theta[2])) - prob_A
prob_C <- exp(df$LTV * (-m1$beta) + m1$Theta[3]) / (1 + exp(df$LTV * (-m1$beta) + m1$Theta[3])) - prob_A - prob_B
prob_D <- exp(df$LTV * (-m1$beta) + m1$Theta[4]) / (1 + exp(df$LTV * (-m1$beta) + m1$Theta[4])) - prob_A - prob_B - prob_C
prob_E <- 1 - exp(df$LTV * (-m1$beta) + m1$Theta[4]) / (1 + exp(df$LTV * (-m1$beta) + m1$Theta[4]))
pred <- data.frame(prob_A, prob_B, prob_C, prob_D, prob_E)
sum(apply(pred, 2, mean) * aggregate(df['lgd'], df['lgd_cat'], mean)[2])
#[1] 0.2262811
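
As a sanity check of the manual calculation above, the same category probabilities could also be obtained directly from the fitted clm object; the snippet below is only a sketch, assuming that predict() is called with newdata excluding the response so that the probabilities of all categories are returned.

# cross-check of the manual calculation with predict.clm(); the response is
# excluded from newdata so that the probability of every category is returned
pred_chk <- predict(m1, newdata = df["LTV"], type = "prob")$fit
head(pred_chk)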

One might wonder how to apply the model outcome with simple category averages in a stress testing exercise where the model input is stressed, e.g. made more severe, and might also be concerned about the lack of model sensitivity. In the demonstration below, let’s stress the model input LTV by 50% and then evaluate the stressed LGD.


df$LTV_ST <- df$LTV * 1.5
prob_A <- exp(df$LTV_ST * (-m1$beta) + m1$Theta[1]) / (1 + exp(df$LTV_ST * (-m1$beta) + m1$Theta[1])) 
prob_B <- exp(df$LTV_ST * (-m1$beta) + m1$Theta[2]) / (1 + exp(df$LTV_ST * (-m1$beta) + m1$Theta[2])) - prob_A
prob_C <- exp(df$LTV_ST * (-m1$beta) + m1$Theta[3]) / (1 + exp(df$LTV_ST * (-m1$beta) + m1$Theta[3])) - prob_A - prob_B
prob_D <- exp(df$LTV_ST * (-m1$beta) + m1$Theta[4]) / (1 + exp(df$LTV_ST * (-m1$beta) + m1$Theta[4])) - prob_A - prob_B - prob_C
prob_E <- 1 - exp(df$LTV_ST * (-m1$beta) + m1$Theta[4]) / (1 + exp(df$LTV_ST * (-m1$beta) + m1$Theta[4]))
pred_ST <- data.frame(prob_A, prob_B, prob_C, prob_D, prob_E)
sum(apply(pred_ST, 2, mean) * aggregate(df['lgd'], df['lgd_cat'], mean)[2])
#[1] 0.3600153

As shown above, although we only use simple averages as the expected mean of each LGD category, the overall LGD still increases by ~60%. The reason is that, with the more stressed model input, the Proportional Odds Model is able to push more accounts into categories with higher LGD. For instance, the output below shows that, if LTV is stressed by 50% overall, ~146% more accounts would roll into the most severe LGD category without any recovery.


apply(pred_ST, 2, mean) / apply(pred, 2, mean)
#   prob_A    prob_B    prob_C    prob_D    prob_E 
#0.6715374 0.7980619 1.0405573 1.4825803 2.4639293

Estimating Parameters of A Hyper-Poisson Distribution in SAS

Similar to COM-Poisson, Double-Poisson, and Generalized Poisson distributions discussed in my previous post (https://statcompute.wordpress.com/2016/11/27/more-about-flexible-frequency-models/), the Hyper-Poisson distribution is another extension of the standard Poisson and is able to accommodate both under-dispersion and over-dispersion that are common in real-world problems. Given the complexity of parameterization and computation, the Hyper-Poisson is somewhat under-investigated. To the best of my knowledge, there is no off-the-shelf computing routine in SAS for the Hyper-Poisson distribution and only an R function available in http://www4.ujaen.es/~ajsaez/hp.fit.r written by A.J. Sáez-Castillo and A. Conde-Sánchez (2013).

The SAS code presented below is the starting point of my attempt on the Hyper-Poisson and its potential applications. The purpose is to replicate the calculation result shown in the Table 6 of “On the Hyper-Poisson Distribution and its Generalization with Applications” by Bayo H. Lawal (2017) (http://www.journalrepository.org/media/journals/BJMCS_6/2017/Mar/Lawal2132017BJMCS32184.pdf). As a result, the parameterization employed in my SAS code will closely follow Bayo H. Lawal (2017) instead of A.J. Sáez-Castillo and A. Conde-Sánchez (2013).


data d1;
  input y n @@;
datalines;
0 121 1 85 2 19 3 1 4 0 5 0 6 1
;
run;

data df;
  set d1;
  where n > 0;
  do i = 1 to n;
    output;
  end;
run;

proc nlmixed data = df;
  parms lambda = 1 beta = 1;
  theta = 1;
  do k = 1 to 100;
    theta = theta + gamma(beta) * (lambda ** k) / gamma(beta + k);
  end;
  prob = (gamma(beta) / gamma(beta + y)) * ((lambda ** y) / theta);
  ll = log(prob);
  model y ~ general(ll);
run;

/*
                     Standard
Parameter  Estimate     Error    DF  t Value  Pr > |t|   Alpha
lambda       0.3752    0.1178   227     3.19    0.0016    0.05
beta         0.5552    0.2266   227     2.45    0.0150    0.05 
*/

As shown, the estimated Lambda = 0.3752 and the estimated Beta = 0.5552 are identical to what is presented in the paper. The next step is to explore its applications in frequency modeling as well as its value in business cases.
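
As a rough cross-check outside of SAS, the same log-likelihood can also be maximized in R with a generic optimizer. The snippet below is only a sketch following the parameterization above, not the hp.fit.r routine referenced earlier, and optimizes the parameters on the log scale to keep them positive.

# a minimal R sketch maximizing the same Hyper-Poisson log-likelihood
y <- rep(c(0, 1, 2, 3, 6), c(121, 85, 19, 1, 1))
negll <- function(par) {
  lambda <- exp(par[1])
  beta <- exp(par[2])
  k <- 1:100
  # normalizing constant, truncated at k = 100 as in the SAS code
  theta <- 1 + sum(exp(k * log(lambda) + lgamma(beta) - lgamma(beta + k)))
  -sum(y * log(lambda) + lgamma(beta) - lgamma(beta + y) - log(theta))
}
exp(optim(c(0, 0), negll)$par)
# expected to be close to lambda = 0.3752 and beta = 0.5552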

Modeling LGD with Proportional Odds Model

The LGD model is an important component in the expected loss calculation. In https://statcompute.wordpress.com/2015/11/01/quasi-binomial-model-in-sas, I discussed how to model LGD with the quasi-binomial regression that is simple and makes no distributional assumption.

In real-world LGD data, we usually observe 3 ordered categories of values, namely 0, 1, and values in between. In cases with a nontrivial number of 0 and 1 values, the ordered logit model, which is also known as the Proportional Odds model, can be applicable. In the demonstration below, I will show how we can potentially use the proportional odds model in the LGD model development.

First of all, we need to categorize all numeric LGD values into three ordinal categories. As shown below, more than 30% of the values are either 0 or 1.

df <- read.csv("lgd.csv")
df$lgd <- round(1 - df$Recovery_rate, 4)
df$lgd_cat <- cut(df$lgd, breaks = c(-Inf, 0, 0.9999, Inf), labels = c("L", "M", "H"), ordered_result = T)
summary(df$lgd_cat)

#   L    M    H 
# 730 1672  143 

The estimation of a proportional odds model is straightforward with clm() in the ordinal package or polr() in the MASS package. As demonstrated below, in addition to the coefficient for LTV, there are 2 intercepts to differentiate 3 categories.

m1 <- ordinal::clm(lgd_cat ~ LTV, data = df)
summary(m1)

#Coefficients:
#    Estimate Std. Error z value Pr(>|z|)    
#LTV   2.0777     0.1267    16.4   <2e-16 ***
#---
#Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
#Threshold coefficients:
#    Estimate Std. Error z value
#L|M  0.38134    0.08676   4.396
#M|H  4.50145    0.14427  31.201
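
For reference, a roughly equivalent fit can also be obtained with polr() in the MASS package mentioned above; the snippet is only a sketch and assumes the ordered factor lgd_cat defined earlier.

# the same proportional odds model fitted with MASS::polr() for comparison;
# Hess = TRUE is needed so that summary() can report standard errors
m2 <- MASS::polr(lgd_cat ~ LTV, data = df, Hess = TRUE)
summary(m2)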

It is important to point out that, in a proportional odds model, it is the cumulative probability that is derived from the linear combination of model variables. For instance, the cumulative probability of LGD belonging to L or M is formulated as

Prob(LGD <= M) = Exp(4.50 – 2.08 * LTV) / (1 + Exp(4.50 – 2.08 * LTV))

Likewise, we would have

Prob(LGD <= L) = Exp(0.38 – 2.08 * LTV) / (1 + Exp(0.38 – 2.08 * LTV))

With the above cumulative probabilities, we can then calculate the probability of each category as below.

Prob(LGD = L) = Prob(LGD <= L)
Prob(LGD = M) = Prob(LGD <= M) – Prob(LGD <= L)
Prob(LGD = H) = 1 – Prob(LGD <= M)

The R code below shows the detailed calculation of how to convert cumulative probabilities into the probabilities of interest.

cumprob_L <- exp(df$LTV * (-m1$beta) + m1$Theta[1]) / (1 + exp(df$LTV * (-m1$beta) + m1$Theta[1])) 
cumprob_M <- exp(df$LTV * (-m1$beta) + m1$Theta[2]) / (1 + exp(df$LTV * (-m1$beta) + m1$Theta[2])) 
prob_L <- cumprob_L
prob_M <- cumprob_M - cumprob_L
prob_H <- 1 - cumprob_M
pred <- data.frame(prob_L, prob_M, prob_H)
apply(pred, 2, mean)

#    prob_L     prob_M     prob_H 
#0.28751210 0.65679888 0.05568903 

After predicting the probability of each category, we would need another sub-model to estimate the conditional LGD for lgd_cat = “M” with either Beta or Simplex regression. (See https://statcompute.wordpress.com/2014/10/27/flexible-beta-modeling and https://statcompute.wordpress.com/2014/02/02/simplex-model-in-r) The final LGD prediction can be formulated as

E(LGD|X)
= Prob(Y = 0|X) * E(Y|X, Y = 0) + Prob(Y = 1|X) * E(Y|X, Y = 1) + Prob(0 < Y < 1|X) * E(Y|X, 0 < Y < 1)
= Prob(Y = 1|X) + Prob(0 < Y < 1|X) * E(Y|X, 0 < Y < 1)

where E(Y|X, 0 < Y < 1) can be calculated from the sub-model.
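
To make the last step concrete, below is a minimal sketch of the final LGD prediction, assuming a Beta regression sub-model fitted with the betareg package on the records with LGD strictly between 0 and 1; both the package choice and the lgd ~ LTV specification are illustrative assumptions rather than the original setup.

# hypothetical sub-model: Beta regression for records with 0 < lgd < 1
d_mid <- df[df$lgd > 0 & df$lgd < 1, ]
m_mid <- betareg::betareg(lgd ~ LTV, data = d_mid)
e_mid <- predict(m_mid, newdata = df, type = "response")
# E(LGD|X) = Prob(Y = 1|X) + Prob(0 < Y < 1|X) * E(Y|X, 0 < Y < 1)
lgd_hat <- prob_H + prob_M * e_mid
summary(lgd_hat)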

Monotonic WoE Binning for LGD Models

While the monotonic binning algorithm has been widely used in scorecard and PD (Probability of Default) model developments, a similar idea can be generalized to LGD (Loss Given Default) models. In the post below, two SAS macros performing monotonic binning for LGD are demonstrated.

The first one tends to generate relatively coarse bins based on iterative grouping, which requires a longer computing time.


%macro lgd_bin1(data = , y = , x = );

%let maxbin = 20;

data _tmp1 (keep = x y);
  set &data;
  y = min(1, max(0, &y));
  x = &x;
run;

proc sql noprint;
  select
    count(distinct x) into :xflg
  from
    _last_;
quit;

%let nbin = %sysfunc(min(&maxbin, &xflg));

%if &nbin > 2 %then %do;
  %do j = &nbin %to 2 %by -1;
    proc rank data = _tmp1 groups = &j out = _data_ (keep = x rank y);
      var x;
      ranks rank;
    run;

    proc summary data = _last_ nway;
      class rank;
      output out = _tmp2 (drop = _type_ rename = (_freq_ = freq))
      sum(y) = bads  mean(y) = bad_rate 
      min(x) = minx  max(x)  = maxx;
    run;

    proc sql noprint;
      select
        case when min(bad_rate) > 0 then 1 else 0 end into :minflg
      from
        _tmp2;
 
      select
        case when max(bad_rate) < 1 then 1 else 0 end into :maxflg
      from
        _tmp2;              
    quit;

    %if &minflg = 1 & &maxflg = 1 %then %do;
      proc corr data = _tmp2 spearman noprint outs = _corr;
        var minx;
        with bad_rate;
      run;
      
      proc sql noprint;
        select
          case when abs(minx) = 1 then 1 else 0 end into :cor
        from
          _corr
        where
          _type_ = 'CORR';
      quit;
 
      %if &cor = 1 %then %goto loopout;
    %end;
  %end;
%end;

%loopout:

proc sql noprint;
create table
  _tmp3 as
select
  a.rank + 1                                           as bin,
  a.minx                                               as minx,
  a.maxx                                               as maxx,
  a.freq                                               as freq,
  a.freq / b.freq                                      as dist,
  a.bad_rate                                           as avg_lgd,
  a.bads / b.bads                                      as bpct,
  (a.freq - a.bads) / (b.freq - b.bads)                as gpct,
  log(calculated bpct / calculated gpct)               as woe,
  (calculated bpct - calculated gpct) * calculated woe as iv 
from
  _tmp2 as a, (select sum(freq) as freq, sum(bads) as bads from _tmp2) as b;
quit;

proc print data = _last_ noobs label;
  var minx maxx freq dist avg_lgd woe;
  format freq comma8. dist percent10.2;
  label
    minx    = "Lower Limit"
    maxx    = "Upper Limit"
    freq    = "Freq"
    dist    = "Dist"
    avg_lgd = "Average LGD"
    woe     = "WoE";
  sum freq dist;
run; 

%mend lgd_bin1;

The second one can generate much finer bins based on the idea of isotonic regressions and is more computationally efficient.


%macro lgd_bin2(data = , y = , x = );

data _data_ (keep = x y);
  set &data;
  y = min(1, max(0, &y));
  x = &x;
run;

proc transreg data = _last_ noprint;
  model identity(y) = monotone(x);
  output out = _tmp1 tip = _t;
run;
 
proc summary data = _last_ nway;
  class _tx;
  output out = _data_ (drop = _freq_ _type_) mean(y) = lgd;
run;

proc sort data = _last_;
  by lgd;
run;
 
data _tmp2;
  set _last_;
  by lgd;
  _idx = _n_;
  if lgd = 0 then _idx = _idx + 1;
  if lgd = 1 then _idx = _idx - 1;
run;

proc sql noprint;
create table
  _tmp3 as
select
  a.*,
  b._idx
from
  _tmp1 as a inner join _tmp2 as b
on
  a._tx = b._tx;

create table
  _tmp4 as
select
  min(a.x)                                             as minx,
  max(a.x)                                             as maxx,
  sum(a.y)                                             as bads,
  count(a.y)                                           as freq,
  count(a.y) / b.freq                                  as dist,
  mean(a.y)                                            as avg_lgd,
  sum(a.y) / b.bads                                    as bpct,
  sum(1 - a.y) / (b.freq - b.bads)                     as gpct,
  log(calculated bpct / calculated gpct)               as woe,
  (calculated bpct - calculated gpct) * calculated woe as iv
from
  _tmp3 as a, (select count(*) as freq, sum(y) as bads from _tmp3) as b
group by
  a._idx;
quit;

proc print data = _last_ noobs label;
  var minx maxx freq dist avg_lgd woe; 
  format freq comma8. dist percent10.2;
  label
    minx    = "Lower Limit"
    maxx    = "Upper Limit"
    freq    = "Freq"
    dist    = "Dist"
    avg_lgd = "Average LGD"
    woe     = "WoE";
  sum freq dist;
run; 

%mend lgd_bin2;

Below is the output comparison between two macros with the testing data downloaded from http://www.creditriskanalytics.net/datasets-private.html. Should you have any feedback, please feel free to leave me a message.

(Figure: output comparison between the lgd_bin1 and lgd_bin2 macros)

Granular Monotonic Binning in SAS

In the post (https://statcompute.wordpress.com/2017/06/15/finer-monotonic-binning-based-on-isotonic-regression), it is shown how to do a finer monotonic binning with isotonic regression in R.

Below is a SAS macro implementing the monotonic binning with the same idea of isotonic regression. Because it does not rely on iterative binning, this macro is more efficient than the one shown in (https://statcompute.wordpress.com/2012/06/10/a-sas-macro-implementing-monotonic-woe-transformation-in-scorecard-development) and is also able to significantly increase the binning granularity.

%macro monobin(data = , y = , x = );
options mprint mlogic;

data _data_ (keep = _x _y);
  set &data;
  where &y in (0, 1) and &x ~= .;
  _y = &y;
  _x = &x;
run;

proc transreg data = _last_ noprint;
  model identity(_y) = monotone(_x);
  output out = _tmp1 tip = _t;
run;

proc summary data = _last_ nway;
  class _t_x;
  output out = _data_ (drop = _freq_ _type_) mean(_y) = _rate;
run;

proc sort data = _last_;
  by _rate;
run;

data _tmp2;
  set _last_;
  by _rate;
  _idx = _n_;
  if _rate = 0 then _idx = _idx + 1;
  if _rate = 1 then _idx = _idx - 1;
run;
  
proc sql noprint;
create table
  _tmp3 as
select
  a.*,
  b._idx
from
  _tmp1 as a inner join _tmp2 as b
on
  a._t_x = b._t_x;
  
create table
  _tmp4 as
select
  a._idx,
  min(a._x)                                               as _min_x,
  max(a._x)                                               as _max_x,
  sum(a._y)                                               as _bads,
  count(a._y)                                             as _freq,
  mean(a._y)                                              as _rate,
  sum(a._y) / b.bads                                      as _bpct,
  sum(1 - a._y) / (b.freq - b.bads)                       as _gpct,
  log(calculated _bpct / calculated _gpct)                as _woe,
  (calculated _bpct - calculated _gpct) * calculated _woe as _iv
from 
  _tmp3 as a, (select count(*) as freq, sum(_y) as bads from _tmp3) as b
group by
  a._idx;
quit;

title "Monotonic WoE Binning for %upcase(%trim(&x))";
proc print data = _last_ label noobs;
  var _min_x _max_x _bads _freq _rate _woe _iv;
  label
    _min_x = "Lower"
    _max_x = "Upper"
    _bads  = "#Bads"
    _freq  = "#Freq"
    _rate  = "BadRate"
    _woe   = "WoE"
    _iv    = "IV";
  sum _bads _freq _iv;
run;
title;

%mend monobin;

Below is the sample output for LTV, showing an identical binning scheme to the one generated by the R isobin() function.

(Figure: sample monotonic WoE binning output for LTV)

Finer Monotonic Binning Based on Isotonic Regression

In my earlier post (https://statcompute.wordpress.com/2017/01/22/monotonic-binning-with-smbinning-package/), I wrote a monobin() function based on the smbinning package by Herman Jopia to improve the monotonic binning algorithm. The function works well and provides robust binning outcomes. However, there are a couple of potential drawbacks due to the coarse binning. First of all, the derived Information Value for each binned variable might tend to be low. Secondly, the binned variable might not be granular enough to reflect the data nature.

In light of the aforementioned, I drafted an improved function isobin() based on the isotonic regression (https://en.wikipedia.org/wiki/Isotonic_regression), as shown below.

isobin <- function(data, y, x) {
  d1 <- data[c(y, x)]
  d2 <- d1[!is.na(d1[x]), ]
  c <- cor(d2[, 2], d2[, 1], method = "spearman", use = "complete.obs")
  # flip the sign of y when the correlation is negative so that the isotonic regression is increasing
  reg <- isoreg(d2[, 2], c / abs(c) * d2[, 1])
  k <- knots(as.stepfun(reg))
  sm1 <- smbinning::smbinning.custom(d1, y, x, k)
  # keep only the cut points from bins with both goods and bads
  c1 <- subset(sm1$ivtable, subset = CntGood * CntBad > 0, select = Cutpoint)
  c2 <- suppressWarnings(as.numeric(unlist(strsplit(c1$Cutpoint, " "))))
  c3 <- c2[!is.na(c2)]
  return(smbinning::smbinning.custom(d1, y, x, c3[-length(c3)]))
}

Compared with the legacy monobin(), the isobin() function is able to significantly increase the binning granularity as well as moderately improve the Information Value.

LTV Binning with isobin() Function

   Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds LnOdds     WoE     IV
1     <= 46     81      78      3        81         78         3 0.0139   0.9630  0.0370 26.0000 3.2581  1.9021 0.0272
2     <= 71    312     284     28       393        362        31 0.0535   0.9103  0.0897 10.1429 2.3168  0.9608 0.0363
3     <= 72     22      20      2       415        382        33 0.0038   0.9091  0.0909 10.0000 2.3026  0.9466 0.0025
4     <= 73     27      24      3       442        406        36 0.0046   0.8889  0.1111  8.0000 2.0794  0.7235 0.0019
5     <= 81    303     268     35       745        674        71 0.0519   0.8845  0.1155  7.6571 2.0356  0.6797 0.0194
6     <= 83    139     122     17       884        796        88 0.0238   0.8777  0.1223  7.1765 1.9708  0.6149 0.0074
7     <= 90    631     546     85      1515       1342       173 0.1081   0.8653  0.1347  6.4235 1.8600  0.5040 0.0235
8     <= 94    529     440     89      2044       1782       262 0.0906   0.8318  0.1682  4.9438 1.5981  0.2422 0.0049
9     <= 95    145     119     26      2189       1901       288 0.0248   0.8207  0.1793  4.5769 1.5210  0.1651 0.0006
10   <= 100    907     709    198      3096       2610       486 0.1554   0.7817  0.2183  3.5808 1.2756 -0.0804 0.0010
11   <= 101    195     151     44      3291       2761       530 0.0334   0.7744  0.2256  3.4318 1.2331 -0.1229 0.0005
12   <= 110   1217     934    283      4508       3695       813 0.2085   0.7675  0.2325  3.3004 1.1940 -0.1619 0.0057
13   <= 112    208     158     50      4716       3853       863 0.0356   0.7596  0.2404  3.1600 1.1506 -0.2054 0.0016
14   <= 115    253     183     70      4969       4036       933 0.0433   0.7233  0.2767  2.6143 0.9610 -0.3950 0.0075
15   <= 136    774     548    226      5743       4584      1159 0.1326   0.7080  0.2920  2.4248 0.8857 -0.4702 0.0333
16   <= 138     27      18      9      5770       4602      1168 0.0046   0.6667  0.3333  2.0000 0.6931 -0.6628 0.0024
17    > 138     66      39     27      5836       4641      1195 0.0113   0.5909  0.4091  1.4444 0.3677 -0.9882 0.0140
18  Missing      1       0      1      5837       4641      1196 0.0002   0.0000  1.0000  0.0000   -Inf    -Inf    Inf
19    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804 1.3559  0.0000 0.1897

LTV Binning with monobin() Function

  Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate   Odds LnOdds     WoE     IV
1    <= 85   1025     916    109      1025        916       109 0.1756   0.8937  0.1063 8.4037 2.1287  0.7727 0.0821
2    <= 94   1019     866    153      2044       1782       262 0.1746   0.8499  0.1501 5.6601 1.7334  0.3775 0.0221
3   <= 100   1052     828    224      3096       2610       486 0.1802   0.7871  0.2129 3.6964 1.3074 -0.0486 0.0004
4   <= 105    808     618    190      3904       3228       676 0.1384   0.7649  0.2351 3.2526 1.1795 -0.1765 0.0045
5   <= 114    985     748    237      4889       3976       913 0.1688   0.7594  0.2406 3.1561 1.1493 -0.2066 0.0076
6    > 114    947     665    282      5836       4641      1195 0.1622   0.7022  0.2978 2.3582 0.8579 -0.4981 0.0461
7  Missing      1       0      1      5837       4641      1196 0.0002   0.0000  1.0000 0.0000   -Inf    -Inf    Inf
8    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049 3.8804 1.3559  0.0000 0.1628

Bureau_Score Binning with isobin() Function

   Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds  LnOdds     WoE     IV
1    <= 491      4       1      3         4          1         3 0.0007   0.2500  0.7500  0.3333 -1.0986 -2.4546 0.0056
2    <= 532     24       9     15        28         10        18 0.0041   0.3750  0.6250  0.6000 -0.5108 -1.8668 0.0198
3    <= 559     51      24     27        79         34        45 0.0087   0.4706  0.5294  0.8889 -0.1178 -1.4737 0.0256
4    <= 560      2       1      1        81         35        46 0.0003   0.5000  0.5000  1.0000  0.0000 -1.3559 0.0008
5    <= 572     34      17     17       115         52        63 0.0058   0.5000  0.5000  1.0000  0.0000 -1.3559 0.0143
6    <= 602    153      84     69       268        136       132 0.0262   0.5490  0.4510  1.2174  0.1967 -1.1592 0.0459
7    <= 605     56      31     25       324        167       157 0.0096   0.5536  0.4464  1.2400  0.2151 -1.1408 0.0162
8    <= 606     14       8      6       338        175       163 0.0024   0.5714  0.4286  1.3333  0.2877 -1.0683 0.0035
9    <= 607     17      10      7       355        185       170 0.0029   0.5882  0.4118  1.4286  0.3567 -0.9993 0.0037
10   <= 632    437     261    176       792        446       346 0.0749   0.5973  0.4027  1.4830  0.3940 -0.9619 0.0875
11   <= 639    150      95     55       942        541       401 0.0257   0.6333  0.3667  1.7273  0.5465 -0.8094 0.0207
12   <= 653    451     300    151      1393        841       552 0.0773   0.6652  0.3348  1.9868  0.6865 -0.6694 0.0412
13   <= 662    295     213     82      1688       1054       634 0.0505   0.7220  0.2780  2.5976  0.9546 -0.4014 0.0091
14   <= 665    100      77     23      1788       1131       657 0.0171   0.7700  0.2300  3.3478  1.2083 -0.1476 0.0004
15   <= 667     57      44     13      1845       1175       670 0.0098   0.7719  0.2281  3.3846  1.2192 -0.1367 0.0002
16   <= 677    381     300     81      2226       1475       751 0.0653   0.7874  0.2126  3.7037  1.3093 -0.0466 0.0001
17   <= 679     66      53     13      2292       1528       764 0.0113   0.8030  0.1970  4.0769  1.4053  0.0494 0.0000
18   <= 683    160     129     31      2452       1657       795 0.0274   0.8062  0.1938  4.1613  1.4258  0.0699 0.0001
19   <= 689    203     164     39      2655       1821       834 0.0348   0.8079  0.1921  4.2051  1.4363  0.0804 0.0002
20   <= 699    304     249     55      2959       2070       889 0.0521   0.8191  0.1809  4.5273  1.5101  0.1542 0.0012
21   <= 707    312     268     44      3271       2338       933 0.0535   0.8590  0.1410  6.0909  1.8068  0.4509 0.0094
22   <= 717    368     318     50      3639       2656       983 0.0630   0.8641  0.1359  6.3600  1.8500  0.4941 0.0132
23   <= 721    134     119     15      3773       2775       998 0.0230   0.8881  0.1119  7.9333  2.0711  0.7151 0.0094
24   <= 723     49      44      5      3822       2819      1003 0.0084   0.8980  0.1020  8.8000  2.1748  0.8188 0.0043
25   <= 739    425     394     31      4247       3213      1034 0.0728   0.9271  0.0729 12.7097  2.5424  1.1864 0.0700
26   <= 746    166     154     12      4413       3367      1046 0.0284   0.9277  0.0723 12.8333  2.5520  1.1961 0.0277
27   <= 756    234     218     16      4647       3585      1062 0.0401   0.9316  0.0684 13.6250  2.6119  1.2560 0.0422
28   <= 761    110     104      6      4757       3689      1068 0.0188   0.9455  0.0545 17.3333  2.8526  1.4967 0.0260
29   <= 763     46      44      2      4803       3733      1070 0.0079   0.9565  0.0435 22.0000  3.0910  1.7351 0.0135
30   <= 767     96      92      4      4899       3825      1074 0.0164   0.9583  0.0417 23.0000  3.1355  1.7795 0.0293
31   <= 772     77      74      3      4976       3899      1077 0.0132   0.9610  0.0390 24.6667  3.2055  1.8495 0.0249
32   <= 787    269     260      9      5245       4159      1086 0.0461   0.9665  0.0335 28.8889  3.3635  2.0075 0.0974
33   <= 794     95      93      2      5340       4252      1088 0.0163   0.9789  0.0211 46.5000  3.8395  2.4835 0.0456
34    > 794    182     179      3      5522       4431      1091 0.0312   0.9835  0.0165 59.6667  4.0888  2.7328 0.0985
35  Missing    315     210    105      5837       4641      1196 0.0540   0.6667  0.3333  2.0000  0.6931 -0.6628 0.0282
36    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804  1.3559  0.0000 0.8357

Bureau_Score Binning with monobin() Function

   Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds LnOdds     WoE     IV
1    <= 617    513     284    229       513        284       229 0.0879   0.5536  0.4464  1.2402 0.2153 -1.1407 0.1486
2    <= 642    515     317    198      1028        601       427 0.0882   0.6155  0.3845  1.6010 0.4706 -0.8853 0.0861
3    <= 657    512     349    163      1540        950       590 0.0877   0.6816  0.3184  2.1411 0.7613 -0.5946 0.0363
4    <= 672    487     371    116      2027       1321       706 0.0834   0.7618  0.2382  3.1983 1.1626 -0.1933 0.0033
5    <= 685    494     396     98      2521       1717       804 0.0846   0.8016  0.1984  4.0408 1.3964  0.0405 0.0001
6    <= 701    521     428     93      3042       2145       897 0.0893   0.8215  0.1785  4.6022 1.5265  0.1706 0.0025
7    <= 714    487     418     69      3529       2563       966 0.0834   0.8583  0.1417  6.0580 1.8014  0.4454 0.0144
8    <= 730    489     441     48      4018       3004      1014 0.0838   0.9018  0.0982  9.1875 2.2178  0.8619 0.0473
9    <= 751    513     476     37      4531       3480      1051 0.0879   0.9279  0.0721 12.8649 2.5545  1.1986 0.0859
10   <= 775    492     465     27      5023       3945      1078 0.0843   0.9451  0.0549 17.2222 2.8462  1.4903 0.1157
11    > 775    499     486     13      5522       4431      1091 0.0855   0.9739  0.0261 37.3846 3.6213  2.2653 0.2126
12  Missing    315     210    105      5837       4641      1196 0.0540   0.6667  0.3333  2.0000 0.6931 -0.6628 0.0282
13    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804 1.3559  0.0000 0.7810

Monotonic Binning with Smbinning Package

The R package smbinning (https://cran.r-project.org/web/packages/smbinning/index.html) provides a very user-friendly interface for the WoE (Weight of Evidence) binning algorithm employed in the scorecard development. However, there are several improvement opportunities in my view:

1. First of all, the underlying algorithm in the smbinning() function utilizes recursive partitioning, which does not necessarily guarantee the monotonicity.
2. Secondly, the density in each generated bin is not even. The frequency in some bins could be much higher than the one in others.
3. Lastly, the function might not provide a binning outcome for some variables due to the lack of statistical significance.

In light of the above, I wrote an enhanced version by utilizing the smbinning.custom() function, as shown below. The idea is very simple: within the repeat loop, we bin the variable iteratively until a certain criterion is met and then feed the list of cut points into the smbinning.custom() function. As a result, we are able to achieve a set of monotonic bins with similar frequencies regardless of the so-called “statistical significance”, which in my mind is a premature consideration at the variable transformation stage.

monobin <- function(data, y, x) {
  d1 <- data[c(y, x)]
  n <- min(20, nrow(unique(d1[x])))
  repeat {
    d1$bin <- Hmisc::cut2(d1[, x], g = n)
    d2 <- aggregate(d1[-3], d1[3], mean)
    c <- cor(d2[-1], method = "spearman")
    if(abs(c[1, 2]) == 1 | n == 2) break
    n <- n - 1
  }
  d3 <- aggregate(d1[-3], d1[3], max)
  cuts <- d3[-length(d3[, 3]), 3]
  return(smbinning::smbinning.custom(d1, y, x, cuts))
}

Below are a couple of comparisons between the generic smbinning() and the home-brew monobin() functions with the use of a toy dataset.

In the first example, we applied the smbinning() function to a variable named "rev_util". As shown in the column "BadRate" below, the binning outcome is not monotonic.

  Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds LnOdds     WoE     IV
1     <= 0    965     716    249       965        716       249 0.1653   0.7420  0.2580  2.8755 1.0562 -0.2997 0.0162
2     <= 5    522     496     26      1487       1212       275 0.0894   0.9502  0.0498 19.0769 2.9485  1.5925 0.1356
3    <= 24   1166    1027    139      2653       2239       414 0.1998   0.8808  0.1192  7.3885 1.9999  0.6440 0.0677
4    <= 40    779     651    128      3432       2890       542 0.1335   0.8357  0.1643  5.0859 1.6265  0.2705 0.0090
5    <= 73   1188     932    256      4620       3822       798 0.2035   0.7845  0.2155  3.6406 1.2922 -0.0638 0.0008
6    <= 96    684     482    202      5304       4304      1000 0.1172   0.7047  0.2953  2.3861 0.8697 -0.4863 0.0316
7     > 96    533     337    196      5837       4641      1196 0.0913   0.6323  0.3677  1.7194 0.5420 -0.8140 0.0743
8  Missing      0       0      0      5837       4641      1196 0.0000      NaN     NaN     NaN    NaN     NaN    NaN
9    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804 1.3559  0.0000 0.3352

Next, we did the same with the monobin() function. As shown below, the algorithm provided a monotonic binning at the cost of granularity. Albeit coarse, the result is directionally correct with no inversion.

  Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate   Odds LnOdds     WoE     IV
1    <= 30   2962    2495    467      2962       2495       467 0.5075   0.8423  0.1577 5.3426 1.6757  0.3198 0.0471
2     > 30   2875    2146    729      5837       4641      1196 0.4925   0.7464  0.2536 2.9438 1.0797 -0.2763 0.0407
3  Missing      0       0      0      5837       4641      1196 0.0000      NaN     NaN    NaN    NaN     NaN    NaN
4    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049 3.8804 1.3559  0.0000 0.0878

In the second example, we applied the smbinning() function to a variable named “bureau_score”. As shown below, the frequency in some bins, e.g. 1,271 records in the “<= 699” bin, is much higher than in the others.

  Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds LnOdds     WoE     IV
1   <= 605    324     167    157       324        167       157 0.0555   0.5154  0.4846  1.0637 0.0617 -1.2942 0.1233
2   <= 632    468     279    189       792        446       346 0.0802   0.5962  0.4038  1.4762 0.3895 -0.9665 0.0946
3   <= 662    896     608    288      1688       1054       634 0.1535   0.6786  0.3214  2.1111 0.7472 -0.6087 0.0668
4   <= 699   1271    1016    255      2959       2070       889 0.2177   0.7994  0.2006  3.9843 1.3824  0.0264 0.0002
5   <= 717    680     586     94      3639       2656       983 0.1165   0.8618  0.1382  6.2340 1.8300  0.4741 0.0226
6   <= 761   1118    1033     85      4757       3689      1068 0.1915   0.9240  0.0760 12.1529 2.4976  1.1416 0.1730
7    > 761    765     742     23      5522       4431      1091 0.1311   0.9699  0.0301 32.2609 3.4739  2.1179 0.2979
8  Missing    315     210    105      5837       4641      1196 0.0540   0.6667  0.3333  2.0000 0.6931 -0.6628 0.0282
9    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804 1.3559  0.0000 0.8066

With the monobin() function applied to the same variable, we were able to get a set of more granular bins with similar frequencies.

   Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds LnOdds     WoE     IV
1    <= 617    513     284    229       513        284       229 0.0879   0.5536  0.4464  1.2402 0.2153 -1.1407 0.1486
2    <= 642    515     317    198      1028        601       427 0.0882   0.6155  0.3845  1.6010 0.4706 -0.8853 0.0861
3    <= 657    512     349    163      1540        950       590 0.0877   0.6816  0.3184  2.1411 0.7613 -0.5946 0.0363
4    <= 672    487     371    116      2027       1321       706 0.0834   0.7618  0.2382  3.1983 1.1626 -0.1933 0.0033
5    <= 685    494     396     98      2521       1717       804 0.0846   0.8016  0.1984  4.0408 1.3964  0.0405 0.0001
6    <= 701    521     428     93      3042       2145       897 0.0893   0.8215  0.1785  4.6022 1.5265  0.1706 0.0025
7    <= 714    487     418     69      3529       2563       966 0.0834   0.8583  0.1417  6.0580 1.8014  0.4454 0.0144
8    <= 730    489     441     48      4018       3004      1014 0.0838   0.9018  0.0982  9.1875 2.2178  0.8619 0.0473
9    <= 751    513     476     37      4531       3480      1051 0.0879   0.9279  0.0721 12.8649 2.5545  1.1986 0.0859
10   <= 775    492     465     27      5023       3945      1078 0.0843   0.9451  0.0549 17.2222 2.8462  1.4903 0.1157
11    > 775    499     486     13      5522       4431      1091 0.0855   0.9739  0.0261 37.3846 3.6213  2.2653 0.2126
12  Missing    315     210    105      5837       4641      1196 0.0540   0.6667  0.3333  2.0000 0.6931 -0.6628 0.0282
13    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804 1.3559  0.0000 0.7810

Scorecard Development with Data from Multiple Sources

This week, one of my friends asked me a very interesting and practical question in the scorecard development. The model development data were collected from multiple independent sources with various data sizes, heterogeneous risk profiles and different bad rates. While the performance statistics seem satisfactory on the model training dataset, the model doesn’t generalize well with new accounts that might come from an unknown source. The situation is very common in a consulting company where a risk or marketing model is sometimes developed with the data from multiple organizations.

To better understand the issue, I simulated a dataset consisting of two groups. In the dataset, while x0 and x1 govern the group segmentation, x2 and x3 define the bad definition. It is important to point out that the group information “grp” is only known in the model development sample but is unknown in the production population. Therefore, the variable “grp”, albeit predictive, can not be explicitly used in the model estimation.

data one;
  do i = 1 to 100000;
    x0 = ranuni(0);
    x1 = ranuni(1);
    x2 = ranuni(2);
    x3 = ranuni(3);
    if 1 + x0 * 2 + x1 * 4 + rannor(1) > 5 then do;
      grp = 1;
      if x2 * 2 + x3 * 4 + rannor(2) > 5 then bad = 1;
    	else bad = 0;
    end;
    else do;
      grp = 0;
      if x2 * 4 + x3 * 2 + rannor(3) > 4 then bad = 1;
    	else bad = 0;
    end;
    output;
  end;
run;

Our first approach is to use all variables x0 – x3 to build a logistic regression and then evaluate the model altogether and by groups.

proc logistic data = one desc noprint;
  model bad = x0 x1 x2 x3;
  score data = one out = mdl1 (rename = (p_1 = score1));
run;

                            GOOD BAD SEPARATION REPORT FOR SCORE1 IN DATA MDL1
                                MAXIMUM KS = 59.5763 AT SCORE POINT 0.2281
               ( AUC STATISTICS = 0.8800, GINI COEFFICIENT = 0.7599, DIVERGENCE = 2.6802 )

          MIN        MAX           GOOD        BAD      TOTAL    BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #    RATE      BAD RATE  PERCENT      PERCENT
 --------------------------------------------------------------------------------------------------------
  BAD     0.6800     0.9699       2,057      7,943     10,000   79.43%      79.43%    33.81%      33.81%
   |      0.4679     0.6799       4,444      5,556     10,000   55.56%      67.50%    23.65%      57.46%
   |      0.3094     0.4679       6,133      3,867     10,000   38.67%      57.89%    16.46%      73.92%
   |      0.1947     0.3094       7,319      2,681     10,000   26.81%      50.12%    11.41%      85.33%
   |      0.1181     0.1946       8,364      1,636     10,000   16.36%      43.37%     6.96%      92.29%
   |      0.0690     0.1181       9,044        956     10,000    9.56%      37.73%     4.07%      96.36%
   |      0.0389     0.0690       9,477        523     10,000    5.23%      33.09%     2.23%      98.59%
   |      0.0201     0.0389       9,752        248     10,000    2.48%      29.26%     1.06%      99.64%
   V      0.0085     0.0201       9,925         75     10,000    0.75%      26.09%     0.32%      99.96%
 GOOD     0.0005     0.0085       9,991          9     10,000    0.09%      23.49%     0.04%     100.00%
       ========== ========== ========== ========== ==========
          0.0005     0.9699      76,506     23,494    100,000

                  GOOD BAD SEPARATION REPORT FOR SCORE1 IN DATA MDL1(WHERE = (GRP = 0))
                                MAXIMUM KS = 61.0327 AT SCORE POINT 0.2457
               ( AUC STATISTICS = 0.8872, GINI COEFFICIENT = 0.7744, DIVERGENCE = 2.8605 )

          MIN        MAX           GOOD        BAD      TOTAL    BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #    RATE      BAD RATE  PERCENT      PERCENT
 --------------------------------------------------------------------------------------------------------
  BAD     0.7086     0.9699       1,051      6,162      7,213   85.43%      85.43%    30.51%      30.51%
   |      0.5019     0.7086       2,452      4,762      7,214   66.01%      75.72%    23.58%      54.10%
   |      0.3407     0.5019       3,710      3,504      7,214   48.57%      66.67%    17.35%      71.45%
   |      0.2195     0.3406       4,696      2,517      7,213   34.90%      58.73%    12.46%      83.91%
   |      0.1347     0.2195       5,650      1,564      7,214   21.68%      51.32%     7.74%      91.66%
   |      0.0792     0.1347       6,295        919      7,214   12.74%      44.89%     4.55%      96.21%
   |      0.0452     0.0792       6,737        476      7,213    6.60%      39.42%     2.36%      98.56%
   |      0.0234     0.0452       7,000        214      7,214    2.97%      34.86%     1.06%      99.62%
   V      0.0099     0.0234       7,150         64      7,214    0.89%      31.09%     0.32%      99.94%
 GOOD     0.0007     0.0099       7,201         12      7,213    0.17%      27.99%     0.06%     100.00%
       ========== ========== ========== ========== ==========
          0.0007     0.9699      51,942     20,194     72,136

                  GOOD BAD SEPARATION REPORT FOR SCORE1 IN DATA MDL1(WHERE = (GRP = 1))
                                MAXIMUM KS = 53.0942 AT SCORE POINT 0.2290
               ( AUC STATISTICS = 0.8486, GINI COEFFICIENT = 0.6973, DIVERGENCE = 2.0251 )

          MIN        MAX           GOOD        BAD      TOTAL    BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #    RATE      BAD RATE  PERCENT      PERCENT
 --------------------------------------------------------------------------------------------------------
  BAD     0.5863     0.9413       1,351      1,435      2,786   51.51%      51.51%    43.48%      43.48%
   |      0.3713     0.5862       2,136        651      2,787   23.36%      37.43%    19.73%      63.21%
   |      0.2299     0.3712       2,340        446      2,786   16.01%      30.29%    13.52%      76.73%
   |      0.1419     0.2298       2,525        262      2,787    9.40%      25.07%     7.94%      84.67%
   |      0.0832     0.1419       2,584        202      2,786    7.25%      21.50%     6.12%      90.79%
   |      0.0480     0.0832       2,643        144      2,787    5.17%      18.78%     4.36%      95.15%
   |      0.0270     0.0480       2,682        104      2,786    3.73%      16.63%     3.15%      98.30%
   |      0.0140     0.0270       2,741         46      2,787    1.65%      14.76%     1.39%      99.70%
   V      0.0058     0.0140       2,776         10      2,786    0.36%      13.16%     0.30%     100.00%
 GOOD     0.0005     0.0058       2,786          0      2,786    0.00%      11.84%     0.00%     100.00%
       ========== ========== ========== ========== ==========
          0.0005     0.9413      24,564      3,300     27,864

As shown in the above output, while the overall model performance looks ok, it doesn’t generalize well in the dataset from the 2nd group with a smaller size. While the overall KS could be as high as 60, the KS for the 2nd group is merely 53. The reason is that the overall model performance is heavily influenced by the dataset from the 1st group with the larger size. Therefore, the estimated model is biased toward the risk profile reflected in the 1st group.

To alleviate the bias in the first model, we could first introduce a look-alike model driven by x0 – x1 with the purpose of profiling the group and then build two separate risk models with x2 – x3 only for the 1st and 2nd groups respectively. As a result, the final predicted probability should be the composite of all three sub-models, as shown below. The model evaluation is also provided to compare with the first model.

proc logistic data = one desc noprint;
  where grp = 0;
  model bad = x2 x3;
  score data = one out = mdl20(rename = (p_1 = p_10));
run;

proc logistic data = one desc noprint;
  where grp = 1;
  model bad = x2 x3;
  score data = one out = mdl21(rename = (p_1 = p_11));
run;

proc logistic data = one desc noprint;
  model grp = x0 x1;
  score data = one out = seg;
run;

data mdl2;
  merge seg mdl20 mdl21;
  by i;
  score2 = p_10 * (1 - p_1) + p_11 * p_1;
run;

                            GOOD BAD SEPARATION REPORT FOR SCORE2 IN DATA MDL2
                                MAXIMUM KS = 60.6234 AT SCORE POINT 0.2469
               ( AUC STATISTICS = 0.8858, GINI COEFFICIENT = 0.7715, DIVERGENCE = 2.8434 )

          MIN        MAX           GOOD        BAD      TOTAL    BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #    RATE      BAD RATE  PERCENT      PERCENT
 --------------------------------------------------------------------------------------------------------
  BAD     0.6877     0.9677       2,011      7,989     10,000   79.89%      79.89%    34.00%      34.00%
   |      0.4749     0.6876       4,300      5,700     10,000   57.00%      68.45%    24.26%      58.27%
   |      0.3125     0.4748       6,036      3,964     10,000   39.64%      58.84%    16.87%      75.14%
   |      0.1932     0.3124       7,451      2,549     10,000   25.49%      50.51%    10.85%      85.99%
   |      0.1142     0.1932       8,379      1,621     10,000   16.21%      43.65%     6.90%      92.89%
   |      0.0646     0.1142       9,055        945     10,000    9.45%      37.95%     4.02%      96.91%
   |      0.0345     0.0646       9,533        467     10,000    4.67%      33.19%     1.99%      98.90%
   |      0.0166     0.0345       9,800        200     10,000    2.00%      29.29%     0.85%      99.75%
   V      0.0062     0.0166       9,946         54     10,000    0.54%      26.10%     0.23%      99.98%
 GOOD     0.0001     0.0062       9,995          5     10,000    0.05%      23.49%     0.02%     100.00%
       ========== ========== ========== ========== ==========
          0.0001     0.9677      76,506     23,494    100,000

                  GOOD BAD SEPARATION REPORT FOR SCORE2 IN DATA MDL2(WHERE = (GRP = 0))
                                MAXIMUM KS = 61.1591 AT SCORE POINT 0.2458
               ( AUC STATISTICS = 0.8880, GINI COEFFICIENT = 0.7759, DIVERGENCE = 2.9130 )

          MIN        MAX           GOOD        BAD      TOTAL    BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #    RATE      BAD RATE  PERCENT      PERCENT
 --------------------------------------------------------------------------------------------------------
  BAD     0.7221     0.9677       1,075      6,138      7,213   85.10%      85.10%    30.40%      30.40%
   |      0.5208     0.7221       2,436      4,778      7,214   66.23%      75.66%    23.66%      54.06%
   |      0.3533     0.5208       3,670      3,544      7,214   49.13%      66.82%    17.55%      71.61%
   |      0.2219     0.3532       4,726      2,487      7,213   34.48%      58.73%    12.32%      83.92%
   |      0.1309     0.2219       5,617      1,597      7,214   22.14%      51.41%     7.91%      91.83%
   |      0.0731     0.1309       6,294        920      7,214   12.75%      44.97%     4.56%      96.39%
   |      0.0387     0.0731       6,762        451      7,213    6.25%      39.44%     2.23%      98.62%
   |      0.0189     0.0387       7,009        205      7,214    2.84%      34.86%     1.02%      99.63%
   V      0.0074     0.0189       7,152         62      7,214    0.86%      31.09%     0.31%      99.94%
 GOOD     0.0002     0.0073       7,201         12      7,213    0.17%      27.99%     0.06%     100.00%
       ========== ========== ========== ========== ==========
          0.0002     0.9677      51,942     20,194     72,136

                  GOOD BAD SEPARATION REPORT FOR SCORE2 IN DATA MDL2(WHERE = (GRP = 1))
                                MAXIMUM KS = 57.6788 AT SCORE POINT 0.1979
               ( AUC STATISTICS = 0.8717, GINI COEFFICIENT = 0.7434, DIVERGENCE = 2.4317 )

          MIN        MAX           GOOD        BAD      TOTAL    BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #    RATE      BAD RATE  PERCENT      PERCENT
 --------------------------------------------------------------------------------------------------------
  BAD     0.5559     0.9553       1,343      1,443      2,786   51.79%      51.79%    43.73%      43.73%
   |      0.3528     0.5559       2,001        786      2,787   28.20%      40.00%    23.82%      67.55%
   |      0.2213     0.3528       2,364        422      2,786   15.15%      31.71%    12.79%      80.33%
   |      0.1372     0.2213       2,513        274      2,787    9.83%      26.24%     8.30%      88.64%
   |      0.0840     0.1372       2,588        198      2,786    7.11%      22.42%     6.00%      94.64%
   |      0.0484     0.0840       2,683        104      2,787    3.73%      19.30%     3.15%      97.79%
   |      0.0256     0.0483       2,729         57      2,786    2.05%      16.84%     1.73%      99.52%
   |      0.0118     0.0256       2,776         11      2,787    0.39%      14.78%     0.33%      99.85%
   V      0.0040     0.0118       2,781          5      2,786    0.18%      13.16%     0.15%     100.00%
 GOOD     0.0001     0.0040       2,786          0      2,786    0.00%      11.84%     0.00%     100.00%
       ========== ========== ========== ========== ==========
          0.0001     0.9553      24,564      3,300     27,864

After comparing KS statistics from the two modeling approaches, we can see that, while the performance of the 2nd approach on the overall sample is only slightly better than that of the 1st approach, the KS on the 2nd group with the smaller size, e.g. grp = 1, increases from 53 up to 58, i.e. by 8.6%. While the example only covers two groups, it is straightforward to generalize to cases with more than two groups.

Risk Models with Generalized PLS

While developing risk models with hundreds of potential variables, we often run into the situation where risk characteristics or macro-economic indicators are highly correlated, namely multicollinearity. In such cases, we might have to drop variables with high VIFs or employ “variable shrinkage” methods, e.g. lasso or ridge, to suppress variables with collinearity.

Feature extraction approaches based on PCA and PLS have been widely discussed but are rarely used in real-world applications due to concerns around model interpretability and implementation. In the example below, it is shown that there shouldn’t be any hurdle in the model implementation, e.g. scoring, given that coefficients can be extracted from a GPLS model in a similar way as from a GLM model; a minimal scoring sketch is appended after the output. In addition, compared with the GLM with 8 variables, the GPLS with only 5 components is able to provide a comparable performance in the hold-out testing data.

R Code

library(gpls)
library(pROC)

df1 <- read.csv("credit_count.txt")
df2 <- df1[df1$CARDHLDR == 1, -c(1, 10, 11, 12, 13)]
set.seed(2016)
n <- nrow(df2)
sample <- sample(seq(n), size = n / 2, replace = FALSE)
train <- df2[sample, ]
test <- df2[-sample, ]

m1 <- glm(DEFAULT ~ ., data = train, family = "binomial")
cat("\n### ROC OF GLM PREDICTION WITH TRAINING DATA ###\n")
print(roc(train$DEFAULT, predict(m1, newdata = train, type = "response")))
cat("\n### ROC OF GLM PREDICTION WITH TESTING DATA ###\n")
print(roc(test$DEFAULT, predict(m1, newdata = test, type = "response")))

m2 <- gpls(DEFAULT ~ ., data = train, family = "binomial", K.prov = 5)
cat("\n### ROC OF GPLS PREDICTION WITH TRAINING DATA ###\n")
print(roc(train$DEFAULT, predict(m2, newdata = train)$predicted[, 1]))
cat("\n### ROC OF GPLS PREDICTION WITH TESTING DATA ###\n")
print(roc(test$DEFAULT, predict(m2, newdata = test)$predicted[, 1]))

cat("\n### COEFFICIENTS COMPARISON BETWEEN GLM AND GPLS ###\n")
print(data.frame(glm = m1$coefficients, gpls = m2$coefficients))

Output

### ROC OF GLM PREDICTION WITH TRAINING DATA ###

Call:
roc.default(response = train$DEFAULT, predictor = predict(m1,     newdata = train, type = "response"))

Data: predict(m1, newdata = train, type = "response") in 4753 controls (train$DEFAULT 0) < 496 cases (train$DEFAULT 1).
Area under the curve: 0.6641

### ROC OF GLM PREDICTION WITH TESTING DATA ###

Call:
roc.default(response = test$DEFAULT, predictor = predict(m1,     newdata = test, type = "response"))

Data: predict(m1, newdata = test, type = "response") in 4750 controls (test$DEFAULT 0) < 500 cases (test$DEFAULT 1).
Area under the curve: 0.6537

### ROC OF GPLS PREDICTION WITH TRAINING DATA ###

Call:
roc.default(response = train$DEFAULT, predictor = predict(m2,     newdata = train)$predicted[, 1])

Data: predict(m2, newdata = train)$predicted[, 1] in 4753 controls (train$DEFAULT 0) < 496 cases (train$DEFAULT 1).
Area under the curve: 0.6627

### ROC OF GPLS PREDICTION WITH TESTING DATA ###

Call:
roc.default(response = test$DEFAULT, predictor = predict(m2,     newdata = test)$predicted[, 1])

Data: predict(m2, newdata = test)$predicted[, 1] in 4750 controls (test$DEFAULT 0) < 500 cases (test$DEFAULT 1).
Area under the curve: 0.6542

### COEFFICIENTS COMPARISON BETWEEN GLM AND GPLS ###
                      glm          gpls
(Intercept) -0.1940785071 -0.1954618828
AGE         -0.0122709412 -0.0147883358
ACADMOS      0.0005302022  0.0003671781
ADEPCNT      0.1090667092  0.1352491711
MAJORDRG     0.0757313171  0.0813835741
MINORDRG     0.2621574192  0.2547176301
OWNRENT     -0.2803919685 -0.1032119571
INCOME      -0.0004222914 -0.0004531543
LOGSPEND    -0.1688395555 -0.1525963363
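
As a footnote on the implementation point above, below is a minimal scoring sketch, assuming that the extracted GPLS coefficients are on the original predictor scale and follow the same order as the GLM coefficients, which the comparison table suggests; it rebuilds the predicted probability with nothing more than a linear combination and a logistic link.

# MINIMAL SCORING SKETCH (ASSUMPTION: GPLS COEFFICIENTS ARE ON THE ORIGINAL
# PREDICTOR SCALE AND IN THE SAME ORDER AS THE GLM COEFFICIENTS ABOVE)
vars <- setdiff(names(coef(m1)), "(Intercept)")
b <- m2$coefficients
eta <- b[1] + as.matrix(test[, vars]) %*% b[-1]
p_gpls <- 1 / (1 + exp(-eta))
# sanity check against the package prediction
summary(as.numeric(p_gpls) - predict(m2, newdata = test)$predicted[, 1])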

Prediction Intervals for Poisson Regression

While the confidence interval addresses the uncertainty around the conditional mean, the prediction interval accommodates the additional uncertainty associated with prediction errors. As a result, the prediction interval is always wider than the confidence interval in a regression model. In the context of risk modeling, the prediction interval is often used to address the potential model risk due to the aforementioned uncertainties.

While calculating the prediction interval of an OLS regression based on the Gaussian distributional assumption is relatively straightforward with the off-the-shelf solution in R, it could be more complicated in a Generalized Linear Model, e.g. Poisson regression. In this post, I am going to show two empirical methods, one based on bootstrapping and the other based on simulation, to calculate the prediction interval of a Poisson regression. Because of the high computing cost, the parallelism with the foreach() function will be used to improve the efficiency.
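
As a point of reference for the off-the-shelf OLS solution mentioned above, the toy example below, which uses simulated Gaussian data rather than anything from this post, simply contrasts the confidence and prediction intervals returned by predict.lm(); the wider prediction interval reflects the extra variance of the error term.

# TOY EXAMPLE WITH SIMULATED DATA: CONFIDENCE VS. PREDICTION INTERVALS IN OLS
set.seed(2016)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
ols <- lm(y ~ x)
new <- data.frame(x = c(-1, 0, 1))
predict(ols, newdata = new, interval = "confidence", level = 0.95)
predict(ols, newdata = new, interval = "prediction", level = 0.95)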

First of all, let’s estimate a Poisson regression with glm() and generate a couple of fake new data points to calculate model predictions. Since the toy data is very small, with only 32 records and all categorical predictors, I doubled the sample size by rbind() to ensure the appropriate data coverage in the bootstrapping.

pkgs <- c('doParallel', 'foreach')
lapply(pkgs, require, character.only = T)
registerDoParallel(cores = 4)

data(AutoCollision, package = "insuranceData")
df <- rbind(AutoCollision, AutoCollision)
mdl <- glm(Claim_Count ~ Age + Vehicle_Use, data = df, family = poisson(link = "log"))
new_fake <- df[1:5, 1:2]

The first method shown below is based on bootstrapping with the following steps:

1. Bootstrapped the original model development sample by random sampling with replacement;

2. Repeated the above many times, e.g. 1000, to generate different bootstrapped samples;

3. Refitted models with bootstrapped samples;

4. Generated predictions for the new data points, e.g. “new_fake”, using the refitted models;

5. Generated random numbers based on the Poisson distribution with the mean, e.g. lambda, equal to the predicted values from the refitted models;

6. Collected all Poisson random numbers from the previous step and calculated the percentiles.

boot_pi <- function(model, pdata, n, p) {
  odata <- model$data
  lp <- (1 - p) / 2
  up <- 1 - lp
  set.seed(2016)
  seeds <- round(runif(n, 1, 1000), 0)
  boot_y <- foreach(i = 1:n, .combine = rbind) %dopar% {
    set.seed(seeds[i])
    bdata <- odata[sample(seq(nrow(odata)), size = nrow(odata), replace = TRUE), ]
    bpred <- predict(update(model, data = bdata), type = "response", newdata = pdata)
    rpois(length(bpred), lambda = bpred)
  }
  boot_ci <- t(apply(boot_y, 2, quantile, c(lp, up)))
  return(data.frame(pred = predict(model, newdata = pdata, type = "response"), lower = boot_ci[, 1], upper = boot_ci[, 2]))
}

boot_pi(mdl, new_fake, 1000, 0.95)
#      pred lower upper
#1 12.63040     6    21
#2 38.69738    25    55
#3 26.97271    16    39
#4 10.69951     4    18
#5 52.50839    35    70

The second method is based on simulation and outlined as below:

1. Reproduced the model response variable, e.g. Claim_Count, by simulating Poisson random numbers with lambda equal to the predicted values from the original model;

2. Repeated the above simulations many times, e.g. 1000, to generate many response series;

3. Generated 1000 updated model samples by replacing the original response with the new response generated from simulations;

4. Refitted models with these updated samples;

5. Generated predictions for the new data points, e.g. “new_fake”, using the refitted models;

6. Generated Poisson random numbers with lambda equal to the predicted values from the refitted models;

7. Collected all Poisson random numbers from the previous step and calculated the percentiles.

sim_pi <- function(model, pdata, n, p) {
  odata <- model$data
  yhat <- predict(model, type = "response")
  lp <- (1 - p) / 2
  up <- 1 - lp
  set.seed(2016)
  seeds <- round(runif(n, 1, 1000), 0)
  sim_y <- foreach(i = 1:n, .combine = rbind) %dopar% {
    set.seed(seeds[i])
    sim_y <- rpois(length(yhat), lambda = yhat)
    sdata <- data.frame(y = sim_y, odata[all.vars(formula(model))[-1]])  # keep the predictors named in the model formula
    refit <- glm(y ~ ., data = sdata, family = poisson)
    bpred <- predict(refit, type = "response", newdata = pdata)
    rpois(length(bpred),lambda = bpred)
  }
  sim_ci <- t(apply(sim_y, 2, quantile, c(lp, up)))
  return(data.frame(pred = predict(model, newdata = pdata, type = "response"), lower = sim_ci[, 1], upper = sim_ci[, 2]))
}

sim_pi(mdl, new_fake, 1000, 0.95)
#      pred lower upper
#1 12.63040     6    21
#2 38.69738    26    52
#3 26.97271    17    39
#4 10.69951     4    18
#5 52.50839    38    68

As demonstrated above, after a large number of replications, outcomes from both methods are highly consistent.

Calculate Leave-One-Out Prediction for GLM

In the model development, the “leave-one-out” prediction is a way of cross-validation, calculated as below:
1. First of all, after a model is developed, each observation used in the model development is removed in turn and then the model is refitted with the remaining observations;
2. The out-of-sample prediction for the refitted model is calculated with the removed observation one by one to assemble the LOO, i.e. leave-one-out, predicted values for the whole model development sample.
The loo_predict() function below is a general routine to calculate the LOO prediction for any GLM object, which can be further employed to investigate the model stability and predictability.

> pkgs <- c('doParallel', 'foreach')
> lapply(pkgs, require, character.only = T)
[[1]]
[1] TRUE

[[2]]
[1] TRUE

> registerDoParallel(cores = 8)
>
> data(AutoCollision, package = "insuranceData")
> # A GAMMA GLM #
> model1 <- glm(Severity ~ Age + Vehicle_Use, data = AutoCollision, family = Gamma(link = "log"))
> # A POISSON GLM #
> model2 <- glm(Claim_Count ~ Age + Vehicle_Use, data = AutoCollision, family = poisson(link = "log"))
>
> loo_predict <- function(obj) {
+   yhat <- foreach(i = 1:nrow(obj$data), .combine = rbind) %dopar% {
+     predict(update(obj, data = obj$data[-i, ]), obj$data[i,], type = "response")
+   }
+   return(data.frame(result = yhat[, 1], row.names = NULL))
+ }
> # TEST CASE 1
> test1 <- loo_predict(model1)
> test1$result
 [1] 303.7393 328.7292 422.6610 375.5023 240.9785 227.6365 288.4404 446.5589
 [9] 213.9368 244.7808 278.7786 443.2256 213.9262 243.2495 266.9166 409.2565
[17] 175.0334 172.0683 197.2911 326.5685 187.2529 215.9931 249.9765 349.3873
[25] 190.1174 218.6321 243.7073 359.9631 192.3655 215.5986 233.1570 348.2781
> # TEST CASE 2
> test2 <- loo_predict(model2)
> test2$result
 [1]  11.15897  37.67273  28.76127  11.54825  50.26364 152.35489 122.23782
 [8]  44.57048 129.58158 465.84173 260.48114 107.23832 167.40672 510.41127
[15] 316.50765 121.75804 172.56928 546.25390 341.03826 134.04303 359.30141
[22] 977.29107 641.69934 251.32547 248.79229 684.86851 574.13994 238.42209
[29] 148.77733 504.12221 422.75047 167.61203
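
As a quick illustration of how the LOO predictions could be used to gauge the predictability, the snippet below, which is not part of the original routine, compares a leave-one-out RMSE with the in-sample RMSE for each model.

# LEAVE-ONE-OUT RMSE VS. IN-SAMPLE RMSE (ILLUSTRATIVE ONLY)
sqrt(mean((AutoCollision$Severity - test1$result) ^ 2))
sqrt(mean((AutoCollision$Severity - fitted(model1)) ^ 2))
sqrt(mean((AutoCollision$Claim_Count - test2$result) ^ 2))
sqrt(mean((AutoCollision$Claim_Count - fitted(model2)) ^ 2))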

Download Federal Reserve Economic Data (FRED) with Python

In the operational loss calculation, it is important to use the CPI (Consumer Price Index) to adjust historical losses. Below is an example showing how to download CPI data directly from the Federal Reserve Bank of St. Louis and then calculate monthly and quarterly CPI adjustment factors with Python.

In [1]: import pandas_datareader.data as web

In [2]: import pandas as pd

In [3]: import numpy as np

In [4]: import datetime as dt

In [5]: # SET START AND END DATES OF THE SERIES

In [6]: sdt = dt.datetime(2000, 1, 1)

In [7]: edt = dt.datetime(2015, 9, 1)

In [8]: cpi = web.DataReader("CPIAUCNS", "fred", sdt, edt)

In [9]: cpi.head()
Out[9]:
            CPIAUCNS
DATE
2000-01-01     168.8
2000-02-01     169.8
2000-03-01     171.2
2000-04-01     171.3
2000-05-01     171.5

In [10]: df1 = pd.DataFrame({'month': [dt.datetime.strftime(i, "%Y-%m") for i in cpi.index]})

In [11]: df1['qtr'] = [str(x.year) + "-Q" + str(x.quarter) for x in cpi.index]

In [12]: df1['m_cpi'] = cpi.values

In [13]: df1.index = cpi.index

In [14]: grp = df1.groupby('qtr', as_index = False)

In [15]: df2 = grp['m_cpi'].agg({'q_cpi': np.mean})

In [16]: df3 = pd.merge(df1, df2, how = 'inner', left_on = 'qtr', right_on = 'qtr')

In [17]: maxm_cpi = np.array(df3.m_cpi)[-1]

In [18]: maxq_cpi = np.array(df3.q_cpi)[-1]

In [19]: df3['m_factor'] = maxm_cpi / df3.m_cpi

In [20]: df3['q_factor'] = maxq_cpi / df3.q_cpi

In [21]: df3.index = cpi.index

In [22]: final = df3.sort_index(ascending = False)

In [23]: final.head(12)
Out[23]:
              month      qtr    m_cpi       q_cpi  m_factor  q_factor
DATE
2015-09-01  2015-09  2015-Q3  237.945  238.305000  1.000000  1.000000
2015-08-01  2015-08  2015-Q3  238.316  238.305000  0.998443  1.000000
2015-07-01  2015-07  2015-Q3  238.654  238.305000  0.997029  1.000000
2015-06-01  2015-06  2015-Q2  238.638  237.680667  0.997096  1.002627
2015-05-01  2015-05  2015-Q2  237.805  237.680667  1.000589  1.002627
2015-04-01  2015-04  2015-Q2  236.599  237.680667  1.005689  1.002627
2015-03-01  2015-03  2015-Q1  236.119  234.849333  1.007733  1.014714
2015-02-01  2015-02  2015-Q1  234.722  234.849333  1.013731  1.014714
2015-01-01  2015-01  2015-Q1  233.707  234.849333  1.018134  1.014714
2014-12-01  2014-12  2014-Q4  234.812  236.132000  1.013343  1.009202
2014-11-01  2014-11  2014-Q4  236.151  236.132000  1.007597  1.009202
2014-10-01  2014-10  2014-Q4  237.433  236.132000  1.002156  1.009202

Quasi-Binomial Model in SAS

Similar to quasi-Poisson regressions, quasi-binomial regressions try to address the excessive variance by the inclusion of a dispersion parameter. In addition to addressing the over-dispersion, quasi-binomial regressions also demonstrate extra value in other areas, such as LGD model development in credit risk modeling, due to their flexible distributional assumption.

Measured as the ratio between NCO (net charge-off) and GCO (gross charge-off), LGD could take any value in the range [0, 1], with no unanimous consensus in the industry on the distributional assumption. An advantage of the quasi-binomial regression is that it makes no assumption of a specific distribution but merely specifies the conditional mean for a given model response. The trade-off, however, is the lack of likelihood-based measures such as AIC and BIC.
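
To make the LGD point concrete, below is a minimal sketch on simulated data, with purely hypothetical variable names, showing that a quasi-binomial regression in R happily takes a fractional response in [0, 1] and reports a dispersion parameter instead of likelihood-based measures.

# A MINIMAL SKETCH ON SIMULATED DATA (HYPOTHETICAL VARIABLE NAMES)
set.seed(2016)
n <- 1000
ltv <- runif(n, 0.2, 1.2)
lgd <- pmin(pmax(plogis(-1 + 2 * ltv) + rnorm(n, sd = 0.1), 0), 1)
m <- glm(lgd ~ ltv, family = quasibinomial)
summary(m)$dispersion
# only the conditional mean and the dispersion parameter are estimated; no AIC / BIC is reported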

Below is a demonstration of how to estimate a quasi-binomial model with the GLIMMIX procedure in SAS.

proc glimmix data = _last_;
  model y = age number start / link = logit solution;
  _variance_ = _mu_ * (1-_mu_);
  random _residual_;
run;  
/*
              Model Information
Data Set                     WORK.KYPHOSIS   
Response Variable            y               
Response Distribution        Unknown         
Link Function                Logit           
Variance Function            _mu_ * (1-_mu_) 
Variance Matrix              Diagonal        
Estimation Technique         Quasi-Likelihood
Degrees of Freedom Method    Residual        

                       Parameter Estimates 
                         Standard
Effect       Estimate       Error       DF    t Value    Pr > |t|
Intercept     -2.0369      1.3853       77      -1.47      0.1455
age           0.01093    0.006160       77       1.77      0.0800
number         0.4106      0.2149       77       1.91      0.0598
start         -0.2065     0.06470       77      -3.19      0.0020
Residual       0.9132           .        .        .         .    
*/

For comparison purposes, the same model is also estimated with the R glm() function, showing identical outputs.

summary(glm(data = kyphosis, Kyphosis ~ ., family = quasibinomial))
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)   
#(Intercept) -2.03693    1.38527  -1.470  0.14552   
#Age          0.01093    0.00616   1.774  0.07996 . 
#Number       0.41060    0.21489   1.911  0.05975 . 
#Start       -0.20651    0.06470  -3.192  0.00205 **
#---
#(Dispersion parameter for quasibinomial family taken to be 0.913249)

A Two-Stage Approach in Loan-Level Consumer PD Models

In the traditional methodology of loan-level consumer probability of default (PD) models, a general functional form is taken as below

Logit(PD) = A * X + B * Y

where A * X is the linear combination of loan-level risk characteristics and B * Y is the linear combination of macro-economic indicators. While B * Y is considered a dynamic factor and might vary according to macro-economic conditions, A * X would usually remain static for each loan regardless of macro-economic scenarios during the forecast horizon, e.g. 9 quarters for CCAR exercises. However, based upon our observation in the 2008 financial crisis, it is neither realistic nor conservative to assume that the loan-level consumer risk characteristics, e.g. loan-to-value, credit score, and delinquency status, are immune to the economic downturn.

In this article, a two-stage approach is proposed to address the above drawback in the legacy method by means of injecting macro-economic dynamics into the loan-level risk characteristics that were assumed static in the PD model formulation. With this innovative two-stage method, it is assumed that the risk profile of each consumer loan, e.g. A * X, is jointly determined by both the historical profile and macro-economic dynamics such that

A * X = Z = Function(macroeconomic Indicator) offset by the historical Z for each consumer loan
==> Delta(Z) = C * M

where C * M represents the linear combination of macro-economic indicators or variants and scores the health of a general economy such that C * M is positive when the economy is deteriorating and is negative when the economy is improving. As a result, the loan-level risk characteristics, e.g. A * X, would be updated dynamically upon macroeconomic inputs for the whole forecast horizon across various macroeconomic scenarios.
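
To make the two-stage mechanics concrete, below is a rough sketch on simulated data rather than code from any actual CCAR model; the variable names and coefficients are purely illustrative.

# A ROUGH SKETCH OF THE TWO-STAGE IDEA ON SIMULATED DATA (ILLUSTRATIVE ONLY)
set.seed(2016)
n <- 1000
m <- rnorm(n)                              # macro-economic indicator, positive = deterioration
z_hist <- rnorm(n)                         # historical loan-level risk profile, e.g. A * X
delta_z <- 0.5 * m + rnorm(n, sd = 0.2)    # observed change in the risk profile
default <- rbinom(n, 1, plogis(-2 + z_hist + delta_z))

# STAGE 1: Delta(Z) = C * M
stage1 <- lm(delta_z ~ m)
# STAGE 2: PD conditional on the dynamically updated risk profile
z_new <- z_hist + fitted(stage1)
stage2 <- glm(default ~ z_new, family = binomial)
summary(stage2)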

It is worth mentioning that Fair Isaac also announced the FICO score economic calibration service in 2013, which forecasts how the FICO score would change under different macroeconomic scenarios and which shares a similar idea with the two-stage approach, with the intent to address the same issue faced by CCAR practitioners in the banking industry.

Model Segmentation with Recursive Partitioning

library(party)

df1 <- read.csv("credit_count.csv")
df2 <- df1[df1$CARDHLDR == 1, ]

mdl <- mob(DEFAULT ~ MAJORDRG + MINORDRG + INCOME + OWNRENT | AGE + SELFEMPL, data = df2, family = binomial(), control = mob_control(minsplit = 1000), model = glinearModel)

print(mdl)
#1) AGE <= 22.91667; criterion = 1, statistic = 48.255
#  2)*  weights = 1116 
#Terminal node model
#Binomial GLM with coefficients:
#(Intercept)     MAJORDRG     MINORDRG       INCOME      OWNRENT  
# -0.6651905    0.0633978    0.5182472   -0.0006038    0.3071785  
#
#1) AGE > 22.91667
#  3)*  weights = 9383 
#Terminal node model
#Binomial GLM with coefficients:
#(Intercept)     MAJORDRG     MINORDRG       INCOME      OWNRENT  
# -1.4117010    0.2262091    0.2067880   -0.0003822   -0.2127193  

### TEST FOR STRUCTURAL CHANGE ###
sctest(mdl, node = 1)
#                   AGE    SELFEMPL
#statistic 4.825458e+01 20.88612025
#p.value   1.527484e-07  0.04273836

summary(mdl, node = 2)
#Coefficients:
#              Estimate Std. Error z value Pr(>|z|)    
#(Intercept) -0.6651905  0.2817480  -2.361 0.018229 *  
#MAJORDRG     0.0633978  0.3487305   0.182 0.855743    
#MINORDRG     0.5182472  0.2347656   2.208 0.027278 *  
#INCOME      -0.0006038  0.0001639  -3.685 0.000229 ***
#OWNRENT      0.3071785  0.2028491   1.514 0.129945    

summary(mdl, node = 3)
#Coefficients:
#              Estimate Std. Error z value Pr(>|z|)    
#(Intercept) -1.412e+00  1.002e-01 -14.093  < 2e-16 ***
#MAJORDRG     2.262e-01  7.067e-02   3.201  0.00137 ** 
#MINORDRG     2.068e-01  4.925e-02   4.199 2.68e-05 ***
#INCOME      -3.822e-04  4.186e-05  -9.131  < 2e-16 ***
#OWNRENT     -2.127e-01  7.755e-02  -2.743  0.00609 ** 

SAS Macro Aligning The Logit Variable to A Scorecard with Specific PDO and Base Point

%macro align_score(data = , y = , logit = , pdo = , base_point = , base_odds = , min_point = 100, max_point = 900); 
***********************************************************; 
* THE MACRO IS TO ALIGN A LOGIT VARIABLE TO A SCORE WITH  *; 
* SPECIFIC PDO, BASE POINT, AND BASE ODDS                 *; 
* ======================================================= *; 
* PARAMETERS:                                             *; 
*  DATA      : INPUT SAS DATA TABLE                       *; 
*  Y         : PERFORMANCE VARIABLE WITH 0/1 VALUE        *; 
*  LOGIT     : A LOGIT VARIABLE TO BE ALIGNED FROM        *; 
*  PDO       : PDO OF THE SCORE ALIGNED TO                *; 
*  BASE_POINT: BASE POINT OF THE SCORE ALIGNED TO         *; 
*  BASE_ODDS : ODDS AT BASE POINT OF THE SCORE ALIGNED TO *; 
*  MIN_POINT : LOWER END OF SCORE POINT, 100 BY DEFAULT   *; 
*  MAX_POINT : UPPER END OF SCORE POINT, 900 BY DEFAULT   *; 
* ======================================================= *; 
* OUTPUTS:                                                *; 
*  ALIGN_FORMULA.SAS                                      *; 
*  A SAS CODE WITH THE FORMULA TO ALIGN THE LOGIT FIELD   *; 
*  TO A SPECIFIC SCORE TOGETHER WITH THE STATISTICAL      *; 
*  SUMMARY OF ALIGN_SCORE                                 *;                                    
***********************************************************; 
 
options nocenter nonumber nodate mprint mlogic symbolgen 
        orientation = landscape ls = 150 formchar = "|----|+|---+=|-/\<>*"; 
 
%local b0 b1; 
 
data _tmp1 (keep = &y &logit); 
  set &data; 
  where &y in (0, 1) and &logit ~= .; 
run; 
 
ods listing close; 
ods output ParameterEstimates = _est1 (keep = variable estimate); 
proc logistic data = _last_ desc; 
  model &y = &logit; 
run; 
ods listing; 
 
data _null_; 
  set _last_; 
   
  if _n_ = 1 then do; 
    b = - (estimate + (log(&base_odds) - (log(2) / &pdo) * &base_point)) / (log(2) / &pdo); 
    call symput('b0', put(b, 15.8)); 
  end; 
  else do; 
    b = estimate / (log(2) / &pdo); 
    call symput('b1', put(b, 15.8)); 
  end; 
run; 
 
filename formula "ALIGN_FORMULA.SAS"; 
 
data _null_; 
  file formula; 
 
  put @3 3 * "*" " SCORE ALIGNMENT FORMULA OF %upcase(&logit) " 3 * "*" ";"; 
  put; 
  put @3 "ALIGN_SCORE = MAX(MIN(ROUND((%trim(&b0)) - (%trim(&b1)) * %upcase(&logit), 1), &max_point), &min_point);"; 
  put; 
run; 
 
data _tmp2; 
  set _tmp1; 
  %inc formula; 
run; 
 
proc summary data = _last_ nway; 
  class &y; 
  output out = _tmp3(drop = _type_ _freq_) 
  min(align_score) = min_scr max(align_score) = max_scr 
  median(align_score) = mdn_scr; 
run; 
 
data _null_; 
  set _last_; 
  file formula mod; 
 
  if _n_ = 1 then do; 
    put @3 3 * "*" " STATISTICAL SUMMARY OF ALIGN_SCORE BY INPUT DATA: " 3 * "*" ";"; 
  end; 
  put @3 "* WHEN %upcase(&amp;y) = " &amp;y ": MIN(SCORE) = " min_scr " MEDIAN(SCORE) = " mdn_scr " MAX(SCORE) = " max_scr "*;"; 
run; 
 
proc datasets library = work nolist; 
  delete _: (mt = data); 
run; 
quit; 
 
***********************************************************; 
*                     END OF THE MACRO                    *; 
***********************************************************;  
%mend align_score;

Thoughts on Modeling Practices in Exposure at Default

In the consumer credit risk arena, EAD (Exposure at Default) is a major component in the calculation of EL (Expected Loss), particularly in line of credit products such as Credit Card or HELOC. While it is common to gauge EAD at the portfolio level based upon the utilization rate, a leading practice implemented in more sophisticated banks is to estimate account-level EAD models with the inclusion of individual risk characteristics.

Empirically, as three risk measures, namely LEQ, CCF, and EADF, are employed to describe EAD at the account level, each of them carries a different business meaning and therefore should be applied to different types of accounts. For instance, in a Credit Card portfolio, while LEQ should be applicable to most accounts, it might not be suitable for accounts with close to 100% utilization rates. In addition, as newly originated or inactive accounts do not have any account activity, EADF might be more appropriate for them.

The table below is a summary of the three EAD measures together with their use cases in practice. I hope that it is useful for consumer banking practitioners and wish them the best in the coming CCAR (Comprehensive Capital Analysis and Review) 2015.
[Table: a summary of LEQ, CCF, and EADF together with their use cases]
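
For readers without access to the table, the sketch below recaps the account-level formulas commonly quoted in the EAD literature, e.g. LEQ applied to the undrawn amount, CCF to the drawn balance, and EADF to the credit limit; these definitions and the ead_measures() helper are my own recap rather than a reproduction of the original table.

# COMMONLY QUOTED ACCOUNT-LEVEL EAD FORMULAS (MY OWN RECAP, NOT THE ORIGINAL TABLE)
# balance and limit are observed at the reference date; ead is observed at default
ead_measures <- function(balance, limit, ead) {
  data.frame(
    leq  = (ead - balance) / (limit - balance),  # loan equivalent factor on the undrawn amount
    ccf  = ead / balance,                        # credit conversion factor on the drawn balance
    eadf = ead / limit                           # EAD factor on the credit limit
  )
}
# e.g. a card with a 10,000 limit, a 6,000 balance, and a 9,000 exposure at default
ead_measures(balance = 6000, limit = 10000, ead = 9000)
#    leq ccf eadf
# 1 0.75 1.5  0.9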

Estimating Composite Models for Count Outcomes with FMM Procedure

Once upon a time when I learned SAS, it was still version 6.X. As an old dog with 10+ years of experience in SAS, I’ve been trying my best to keep up with new tricks in each release of SAS. In SAS 9.3, my favorite novel feature in SAS/STAT is the FMM procedure to estimate finite mixture models.

In 2008 when I drafted “Count Data Models in SAS®” (www2.sas.com/proceedings/forum2008/371-2008.pdf), it was pretty cumbersome to specify the log likelihood function of a composite model for count outcomes, e.g. hurdle Poisson or zero-inflated Poisson model. However, with the availability of the FMM procedure, estimating composite models has never been easier.

In the demonstration below, I am going to show a side-by-side comparison of how to estimate three types of composite models for count outcomes, including hurdle Poisson, zero-inflated Poisson, and finite mixture Poisson models, with the FMM and NLMIXED procedures respectively. As shown, both procedures provide identical model estimates.

Hurdle Poisson Model

*** HURDLE POISSON MODEL WITH FMM PROCEDURE ***;
proc fmm data = tmp1 tech = trureg;
  model majordrg = age acadmos minordrg logspend / dist = truncpoisson;
  model majordrg = / dist = constant;
  probmodel age acadmos minordrg logspend;
run;
/*
Fit Statistics

-2 Log Likelihood             8201.0
AIC  (smaller is better)      8221.0
AICC (smaller is better)      8221.0
BIC  (smaller is better)      8293.5

Parameter Estimates for 'Truncated Poisson' Model
 
                                Standard
Component  Effect     Estimate     Error  z Value  Pr > |z|

        1  Intercept   -2.0706    0.3081    -6.72    <.0001
        1  AGE         0.01796  0.005482     3.28    0.0011
        1  ACADMOS    0.000852  0.000700     1.22    0.2240
        1  MINORDRG     0.1739   0.03441     5.05    <.0001
        1  LOGSPEND     0.1229   0.04219     2.91    0.0036

Parameter Estimates for Mixing Probabilities
 
                         Standard
Effect       Estimate       Error    z Value    Pr > |z|

Intercept     -4.2309      0.1808     -23.40      <.0001
AGE           0.01694    0.003323       5.10      <.0001
ACADMOS      0.002240    0.000492       4.55      <.0001
MINORDRG       0.7653     0.03842      19.92      <.0001
LOGSPEND       0.2301     0.02683       8.58      <.0001
*/

*** HURDLE POISSON MODEL WITH NLMIXED PROCEDURE ***;
proc nlmixed data = tmp1 tech = trureg maxit = 500;
  parms B1_intercept = -4 B1_age = 0 B1_acadmos = 0 B1_minordrg = 0 B1_logspend = 0
        B2_intercept = -2 B2_age = 0 B2_acadmos = 0 B2_minordrg = 0	B2_logspend = 0;

  eta1 = B1_intercept + B1_age * age + B1_acadmos * acadmos + B1_minordrg * minordrg + B1_logspend * logspend;
  exp_eta1 = exp(eta1);
  p0 = 1 / (1 + exp_eta1);
  eta2 = B2_intercept + B2_age * age + B2_acadmos * acadmos + B2_minordrg * minordrg + B2_logspend * logspend;
  exp_eta2 = exp(eta2);
  if majordrg = 0 then _prob_ = p0;
  else _prob_ = (1 - p0) * exp(-exp_eta2) * (exp_eta2 ** majordrg) / ((1 - exp(-exp_eta2)) * fact(majordrg));
  ll = log(_prob_);
  model majordrg ~ general(ll);
run;
/*
Fit Statistics

-2 Log Likelihood                 8201.0
AIC (smaller is better)           8221.0
AICC (smaller is better)          8221.0
BIC (smaller is better)           8293.5

Parameter Estimates
 
                          Standard
Parameter      Estimate      Error     DF   t Value   Pr > |t|

B1_intercept    -4.2309     0.1808    1E4    -23.40     <.0001
B1_age          0.01694   0.003323    1E4      5.10     <.0001
B1_acadmos     0.002240   0.000492    1E4      4.55     <.0001
B1_minordrg      0.7653    0.03842    1E4     19.92     <.0001
B1_logspend      0.2301    0.02683    1E4      8.58     <.0001
============
B2_intercept    -2.0706     0.3081    1E4     -6.72     <.0001
B2_age          0.01796   0.005482    1E4      3.28     0.0011
B2_acadmos     0.000852   0.000700    1E4      1.22     0.2240
B2_minordrg      0.1739    0.03441    1E4      5.05     <.0001
B2_logspend      0.1229    0.04219    1E4      2.91     0.0036
*/

Zero-Inflated Poisson Model

*** ZERO-INFLATED POISSON MODEL WITH FMM PROCEDURE ***;
proc fmm data = tmp1 tech = trureg;
  model majordrg = age acadmos minordrg logspend / dist = poisson;
  model majordrg = / dist = constant;
  probmodel age acadmos minordrg logspend;
run;
/*
Fit Statistics

-2 Log Likelihood             8147.9
AIC  (smaller is better)      8167.9
AICC (smaller is better)      8167.9
BIC  (smaller is better)      8240.5

Parameter Estimates for 'Poisson' Model
 
                                Standard
Component  Effect     Estimate     Error  z Value  Pr > |z|

        1  Intercept   -2.2780    0.3002    -7.59    <.0001
        1  AGE         0.01956  0.006019     3.25    0.0012
        1  ACADMOS    0.000249  0.000668     0.37    0.7093
        1  MINORDRG     0.1176   0.02711     4.34    <.0001
        1  LOGSPEND     0.1644   0.03531     4.66    <.0001

Parameter Estimates for Mixing Probabilities
 
                         Standard
Effect       Estimate       Error    z Value    Pr > |z|

Intercept     -1.9111      0.4170      -4.58      <.0001
AGE          -0.00082    0.008406      -0.10      0.9218
ACADMOS      0.002934    0.001085       2.70      0.0068
MINORDRG       1.4424      0.1361      10.59      <.0001
LOGSPEND      0.09562     0.05080       1.88      0.0598
*/

*** ZERO-INFLATED POISSON MODEL WITH NLMIXED PROCEDURE ***;
proc nlmixed data = tmp1 tech = trureg maxit = 500;
  parms B1_intercept = -2 B1_age = 0 B1_acadmos = 0 B1_minordrg = 0 B1_logspend = 0
        B2_intercept = -2 B2_age = 0 B2_acadmos = 0 B2_minordrg = 0	B2_logspend = 0;

  eta1 = B1_intercept + B1_age * age + B1_acadmos * acadmos + B1_minordrg * minordrg + B1_logspend * logspend;
  exp_eta1 = exp(eta1);
  p0 = 1 / (1 + exp_eta1);
  eta2 = B2_intercept + B2_age * age + B2_acadmos * acadmos + B2_minordrg * minordrg + B2_logspend * logspend;
  exp_eta2 = exp(eta2);
  if majordrg = 0 then _prob_ = p0 + (1 - p0) * exp(-exp_eta2);
  else _prob_ = (1 - p0) * exp(-exp_eta2) * (exp_eta2 ** majordrg) / fact(majordrg);
  ll = log(_prob_);
  model majordrg ~ general(ll);
run;
/*
Fit Statistics

-2 Log Likelihood                 8147.9
AIC (smaller is better)           8167.9
AICC (smaller is better)          8167.9
BIC (smaller is better)           8240.5

Parameter Estimates
 
                          Standard
Parameter      Estimate      Error     DF   t Value   Pr > |t|

B1_intercept    -1.9111     0.4170    1E4     -4.58     <.0001
B1_age         -0.00082   0.008406    1E4     -0.10     0.9219
B1_acadmos     0.002934   0.001085    1E4      2.70     0.0068
B1_minordrg      1.4424     0.1361    1E4     10.59     <.0001
B1_logspend     0.09562    0.05080    1E4      1.88     0.0598
============
B2_intercept    -2.2780     0.3002    1E4     -7.59     <.0001
B2_age          0.01956   0.006019    1E4      3.25     0.0012
B2_acadmos     0.000249   0.000668    1E4      0.37     0.7093
B2_minordrg      0.1176    0.02711    1E4      4.34     <.0001
B2_logspend      0.1644    0.03531    1E4      4.66     <.0001
*/

Two-Class Finite Mixture Poisson Model

*** TWO-CLASS FINITE MIXTURE POISSON MODEL WITH FMM PROCEDURE ***;
proc fmm data = tmp1 tech = trureg;
  model majordrg = age acadmos minordrg logspend / dist = poisson k = 2;
  probmodel age acadmos minordrg logspend;
run;
/*
Fit Statistics

-2 Log Likelihood             8136.8
AIC  (smaller is better)      8166.8
AICC (smaller is better)      8166.9
BIC  (smaller is better)      8275.7

Parameter Estimates for 'Poisson' Model
 
                                Standard
Component  Effect     Estimate     Error  z Value  Pr > |z|

        1  Intercept   -2.4449    0.3497    -6.99    <.0001
        1  AGE         0.02214  0.006628     3.34    0.0008
        1  ACADMOS    0.000529  0.000770     0.69    0.4920
        1  MINORDRG    0.05054   0.04015     1.26    0.2081
        1  LOGSPEND     0.2140   0.04127     5.18    <.0001
        2  Intercept   -8.0935    1.5915    -5.09    <.0001
        2  AGE         0.01150   0.01294     0.89    0.3742
        2  ACADMOS    0.004567  0.002055     2.22    0.0263
        2  MINORDRG     0.2638    0.6770     0.39    0.6968
        2  LOGSPEND     0.6826    0.2203     3.10    0.0019

Parameter Estimates for Mixing Probabilities
 
                         Standard
Effect       Estimate       Error    z Value    Pr > |z|

Intercept     -1.4275      0.5278      -2.70      0.0068
AGE          -0.00277     0.01011      -0.27      0.7844
ACADMOS      0.001614    0.001440       1.12      0.2623
MINORDRG       1.5865      0.1791       8.86      <.0001
LOGSPEND     -0.06949     0.07436      -0.93      0.3501
*/

*** TWO-CLASS FINITE MIXTURE POISSON MODEL WITH NLMIXED PROCEDURE ***;
proc nlmixed data = tmp1 tech = trureg maxit = 500;
  parms B1_intercept = -2 B1_age = 0 B1_acadmos = 0 B1_minordrg = 0 B1_logspend = 0
        B2_intercept = -8 B2_age = 0 B2_acadmos = 0 B2_minordrg = 0 B2_logspend = 0
		B3_intercept = -1 B3_age = 0 B3_acadmos = 0 B3_minordrg = 0 B3_logspend = 0;

  eta1 = B1_intercept + B1_age * age + B1_acadmos * acadmos + B1_minordrg * minordrg + B1_logspend * logspend;
  exp_eta1 = exp(eta1);
  prob1 = exp(-exp_eta1) * exp_eta1 ** majordrg / fact(majordrg);
  eta2 = B2_intercept + B2_age * age + B2_acadmos * acadmos + B2_minordrg * minordrg + B2_logspend * logspend;
  exp_eta2 = exp(eta2);
  prob2 = exp(-exp_eta2) * exp_eta2 ** majordrg / fact(majordrg);
  eta3 = B3_intercept + B3_age * age + B3_acadmos * acadmos + B3_minordrg * minordrg + B3_logspend * logspend;
  exp_eta3 = exp(eta3);
  p = exp_eta3 / (1 + exp_eta3);
  _prob_ = p * prob1 + (1 - p) * prob2;
  ll = log(_prob_);
  model majordrg ~ general(ll);
run;
/*
Fit Statistics

-2 Log Likelihood                 8136.8
AIC (smaller is better)           8166.8
AICC (smaller is better)          8166.9
BIC (smaller is better)           8275.7

Parameter Estimates
 
                          Standard
Parameter      Estimate      Error     DF   t Value   Pr > |t|

B1_intercept    -2.4449     0.3497    1E4     -6.99     <.0001
B1_age          0.02214   0.006628    1E4      3.34     0.0008
B1_acadmos     0.000529   0.000770    1E4      0.69     0.4920
B1_minordrg     0.05054    0.04015    1E4      1.26     0.2081
B1_logspend      0.2140    0.04127    1E4      5.18     <.0001
============
B2_intercept    -8.0935     1.5916    1E4     -5.09     <.0001
B2_age          0.01150    0.01294    1E4      0.89     0.3742
B2_acadmos     0.004567   0.002055    1E4      2.22     0.0263
B2_minordrg      0.2638     0.6770    1E4      0.39     0.6968
B2_logspend      0.6826     0.2203    1E4      3.10     0.0020
============
B3_intercept    -1.4275     0.5278    1E4     -2.70     0.0068
B3_age         -0.00277    0.01011    1E4     -0.27     0.7844
B3_acadmos     0.001614   0.001440    1E4      1.12     0.2623
B3_minordrg      1.5865     0.1791    1E4      8.86     <.0001
B3_logspend    -0.06949    0.07436    1E4     -0.93     0.3501
*/

A SAS Macro for Scorecard Evaluation with Weights

On 09/28/2012, I posted a SAS macro evaluating the scorecard performance, e.g. KS / AUC statistics (https://statcompute.wordpress.com/2012/09/28/a-sas-macro-for-scorecard-performance-evaluation). However, this macro is not generic enough to handle cases with a weighting variable. In a recent project that I am working on, there is a weight variable attached to each credit applicant due to the reject inference. Therefore, there is no excuse for me to continue neglecting the necessity of developing another SAS macro that can take care of a weight variable with any positive values in the scorecard evaluation. Below is a quick draft of the macro. You might have to tweak it a little to suit your needs in production.

%macro wt_auc(data = , score = , y = , weight = );
***********************************************************;
* THE MACRO IS TO EVALUATE THE SEPARATION POWER OF A      *;
* SCORECARD WITH WEIGHTS                                  *;
* ------------------------------------------------------- *;
* PARAMETERS:                                             *;
*  DATA  : INPUT DATASET                                  *;
*  SCORE : SCORECARD VARIABLE                             *;
*  Y     : RESPONSE VARIABLE IN (0, 1)                    *;
*  WEIGHT: WEIGHT VARIABLE WITH POSITIVE VALUES           *; 
* ------------------------------------------------------- *;
* OUTPUTS:                                                *;
*  A SUMMARY TABLE WITH KS AND AUC STATISTICS             *;
* ------------------------------------------------------- *;
* CONTACT:                                                *;
*  WENSUI.LIU@53.COM                                      *;
***********************************************************;
options nocenter nonumber nodate mprint mlogic symbolgen
        orientation = landscape ls = 125 formchar = "|----|+|---+=|-/\<>*";

data _tmp1 (keep = &score &y &weight);
  set &data;
  where &score ~= . and &y in (0, 1) and &weight >= 0;
run;

*** CAPTURE THE DIRECTION OF SCORE ***;
ods listing close;
ods output spearmancorr = _cor;
proc corr data = _tmp1 spearman;
  var &y;
  with &score;
run;
ods listing;

data _null_;
  set _cor;
  if &y >= 0 then do;
    call symput('desc', 'descending');
  end;
  else do;
    call symput('desc', ' ');
  end;
run;

proc sql noprint;
create table
  _tmp2 as    
select
  &score                                         as _scr,
  sum(&y)                                        as _bcnt,
  count(*)                                       as _cnt,
  sum(case when &y = 1 then &weight else 0 end)  as _bwt,
  sum(case when &y = 0 then &weight else 0 end)  as _gwt
from
  _tmp1
group by
  &score;

select
  sum(_bwt) into :bsum
from
  _tmp2;
  
select
  sum(_gwt) into :gsum
from
  _tmp2;

select
  sum(_cnt) into :cnt
from
  _tmp2;    
quit;
%put &cnt;

proc sort data = _tmp2;
  by &desc _scr;
run;

data _tmp3;
  set _tmp2;
  by &desc _scr;
  retain _gcum _bcum _cntcum;
  _gcum + _gwt;
  _bcum + _bwt;
  _cntcum + _cnt;
  _gpct = _gcum / &gsum;
  _bpct = _bcum / &bsum;
  _ks   = abs(_gpct - _bpct) * 100;
  _rank = int(_cntcum / ceil(&cnt / 10)) + 1;
run;

proc sort data = _tmp3 sortsize = max;
  by _gpct _bpct;
run;

data _tmp4;
  set _tmp3;
  by _gpct _bpct;
  if last._gpct then do;
    _idx + 1;
    output;
  end;
run;

proc sql noprint;
create table
  _tmp5 as
select
  a._gcum as gcum,
  (b._gpct - a._gpct) * (b._bpct + a._bpct) / 2 as dx
from
  _tmp4 as a, _tmp4 as b
where
  a._idx + 1 = b._idx;

select
  sum(dx) format = 15.8 into :AUC
from
  _tmp5;

select
  max(_ks) format = 15.8 into :KS_STAT
from
  _tmp3;

select
  _scr format = 6.2 into :KS_SCORE
from
  _tmp3
where
  _ks = (select max(_ks) from _tmp3);

create table
  _tmp6 as
select
  _rank                       as rank,
  min(_scr)                   as min_scr,
  max(_scr)                   as max_scr,
  sum(_cnt)                   as cnt,
  sum(_gwt + _bwt)            as wt,
  sum(_gwt)                   as gwt,
  sum(_bwt)                   as bwt,
  sum(_bwt) / calculated wt   as bad_rate
from
  _tmp3
group by
  _rank;    
quit;  

proc report data = _last_ spacing = 1 split = "/" headline nowd;
  column("GOOD BAD SEPARATION REPORT FOR %upcase(%trim(&score)) IN DATA %upcase(%trim(&data))/
          MAXIMUM KS = %trim(&ks_stat) AT SCORE POINT %trim(&ks_score) and AUC STATISTICS = %trim(&auc)/ /"
          rank min_scr max_scr cnt wt gwt bwt bad_rate);
  define rank       / noprint order order = data;
  define min_scr    / "MIN/SCORE"             width = 10 format = 9.2        analysis min center;
  define max_scr    / "MAX/SCORE"             width = 10 format = 9.2        analysis max center;
  define cnt        / "RAW/COUNT"             width = 10 format = comma9.    analysis sum;
  define wt         / "WEIGHTED/SUM"          width = 15 format = comma14.2  analysis sum;
  define gwt        / "WEIGHTED/GOODS"        width = 15 format = comma14.2  analysis sum;
  define bwt        / "WEIGHTED/BADS"         width = 15 format = comma14.2  analysis sum;
  define bad_rate   / "BAD/RATE"              width = 10 format = percent9.2 order center;
  rbreak after / summarize dol skip;
run;

proc datasets library = work nolist;
  delete _tmp: / memtype = data;
run;
quit;

***********************************************************;
*                     END OF THE MACRO                    *;
***********************************************************;
%mend wt_auc;

In the first demo below, a weight variable with fractional values is tested.

*** TEST CASE OF FRACTIONAL WEIGHTS ***;
data one;
  set data.accepts;
  weight = ranuni(1);
run;

%wt_auc(data = one, score = bureau_score, y = bad, weight = weight);
/*
                   GOOD BAD SEPARATION REPORT FOR BUREAU_SCORE IN DATA ONE
       MAXIMUM KS = 34.89711721 AT SCORE POINT 678.00 and AUC STATISTICS = 0.73521009

    MIN        MAX            RAW        WEIGHTED        WEIGHTED        WEIGHTED    BAD
   SCORE      SCORE         COUNT             SUM           GOODS            BADS    RATE
 -------------------------------------------------------------------------------------------
    443.00     619.00         539          276.29          153.16          123.13   44.56%
    620.00     644.00         551          273.89          175.00           98.89   36.11%
    645.00     660.00         544          263.06          176.88           86.18   32.76%
    661.00     676.00         555          277.26          219.88           57.38   20.70%
    677.00     692.00         572          287.45          230.41           57.04   19.84%
    693.00     707.00         510          251.51          208.25           43.26   17.20%
    708.00     724.00         576          276.31          243.89           32.42   11.73%
    725.00     746.00         566          285.53          262.73           22.80    7.98%
    747.00     772.00         563          285.58          268.95           16.62    5.82%
    773.00     848.00         546          272.40          264.34            8.06    2.96%
 ========== ========== ========== =============== =============== ===============
    443.00     848.00       5,522        2,749.28        2,203.49          545.79
*/

In the second demo, a weight variable with positive integers is also tested.

*** TEST CASE OF INTEGER WEIGHTS ***;
data two;
  set data.accepts;
  weight = rand("poisson", 20);
run;

%wt_auc(data = two, score = bureau_score, y = bad, weight = weight);
/*
                   GOOD BAD SEPARATION REPORT FOR BUREAU_SCORE IN DATA TWO
       MAXIMUM KS = 35.58884479 AT SCORE POINT 679.00 and AUC STATISTICS = 0.73725030

    MIN        MAX            RAW        WEIGHTED        WEIGHTED        WEIGHTED    BAD
   SCORE      SCORE         COUNT             SUM           GOODS            BADS    RATE
 -------------------------------------------------------------------------------------------
    443.00     619.00         539       10,753.00        6,023.00        4,730.00   43.99%
    620.00     644.00         551       11,019.00        6,897.00        4,122.00   37.41%
    645.00     660.00         544       10,917.00        7,479.00        3,438.00   31.49%
    661.00     676.00         555       11,168.00        8,664.00        2,504.00   22.42%
    677.00     692.00         572       11,525.00        9,283.00        2,242.00   19.45%
    693.00     707.00         510       10,226.00        8,594.00        1,632.00   15.96%
    708.00     724.00         576       11,497.00       10,117.00        1,380.00   12.00%
    725.00     746.00         566       11,331.00       10,453.00          878.00    7.75%
    747.00     772.00         563       11,282.00       10,636.00          646.00    5.73%
    773.00     848.00         546       10,893.00       10,598.00          295.00    2.71%
 ========== ========== ========== =============== =============== ===============
    443.00     848.00       5,522      110,611.00       88,744.00       21,867.00
*/

As shown, the macro works well in both cases. Please feel free to let me know if it helps in your cases.
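
For readers who prefer an independent sanity check outside of SAS, below is a minimal R sketch mirroring the macro's logic, e.g. cumulative weighted good / bad distributions for KS and the trapezoid rule for AUC; the wt_ks_auc() function is my own shorthand, and it assumes a higher score means lower risk, as with bureau_score.

# MINIMAL R CROSS-CHECK OF WEIGHTED KS / AUC (A SKETCH, NOT PART OF THE MACRO)
wt_ks_auc <- function(score, y, w) {
  # aggregate weighted goods and bads by distinct score value, ascending
  agg <- aggregate(cbind(g = w * (y == 0), b = w * (y == 1)), by = list(score = score), FUN = sum)
  agg <- agg[order(agg$score), ]
  cum_g <- cumsum(agg$g) / sum(agg$g)
  cum_b <- cumsum(agg$b) / sum(agg$b)
  ks <- max(abs(cum_g - cum_b)) * 100
  # trapezoid rule over the weighted cumulative good / bad distributions
  auc <- sum(diff(c(0, cum_g)) * (cum_b + c(0, head(cum_b, -1))) / 2)
  c(ks = ks, auc = auc)
}
# e.g. wt_ks_auc(one$bureau_score, one$bad, one$weight) should come close to the figures above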

A SAS Macro Calculating PDO

In the development of credit scorecards, the model developer will usually scale the predicted probability of default / delinquency into a range of discrete score points for the purpose of operational convenience. While there are multiple ways to perform the scaling, the most popular one in the credit risk arena is to scale the predicted probability logarithmically such that the odds, the ratio between goods and bads, will be doubled / halved after the increase / decrease of a certain number of score points, which is also called PDO (points to double the odds) in the industry.

In practice, PDO over time carries strong implications in credit policies and strategies and is often used as an effectiveness measure for a scorecard. Chances are that a scorecard could still maintain a satisfactory predictiveness, e.g. K-S statistic or AUC, but might lose its effectiveness, e.g. through an increase in PDO. For instance, if a credit strategy was originally designed to take effect at credit score = 700, it might require credit score = 720 later after the increase in PDO, which would significantly discount the effectiveness of such a strategy. As a result, it is critically important to monitor the PDO of a scorecard in the production environment.
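
Before jumping into the macro, it may help to recall the standard logarithmic scaling behind PDO; the small sketch below, which uses the same reference point, odds, and PDO as the example further down and a prob2score() helper of my own, maps a predicted bad rate to a score so that the good / bad odds double every PDO points.

# STANDARD LOGARITHMIC SCORE SCALING (ILLUSTRATIVE HELPER, SAME TARGETS AS THE EXAMPLE BELOW)
prob2score <- function(p, pdo = 45, base_point = 680, base_odds = 20) {
  factor <- pdo / log(2)
  offset <- base_point - factor * log(base_odds)
  # odds defined as goods / bads, i.e. (1 - p) / p
  round(offset + factor * log((1 - p) / p))
}
prob2score(c(1 / 21, 1 / 41))
# [1] 680 725
# doubling the odds from 20 to 40 raises the score by exactly one PDO, i.e. from 680 to 725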

Below is a SAS macro to calculate the PDO of a scorecard.

options nocenter nonumber nodate mprint mlogic symbolgen
        orientation = landscape ls = 125 formchar = "|----|+|---+=|-/\<>*";

%macro get_pdo(data = , score = , y = , wt = NONE, ref_score = , target_odds = , target_pdo = );
**********************************************************************;
* THIS MACRO IS TO CALCULATE OBSERVED ODDS AND PDO FOR ANY SCORECARD *;
* AND COMPUTE ALIGNMENT BETAS TO REACH TARGET ODDS AND PDO           *;
* ------------------------------------------------------------------ *;
* PARAMETERS:                                                        *;
*  DATA       : INPUT DATASET                                        *;
*  SCORE      : SCORECARD VARIABLE                                   *;
*  Y          : RESPONSE VARIABLE IN (0, 1)                          *;
*  WT         : WEIGHT VARIABLE IN POSITIVE INTEGER                  *;
*  REF_SCORE  : REFERENCE SCORE POINT FOR TARGET ODDS AND PDO        *;
*  TARGET_ODDS: TARGET ODDS AT REFERENCE SCORE OF SCORECARD          *;
*  TARGET_PDO : TARGET POINTS TO DOUBLE ODDS OF SCORECARD            *; 
* ------------------------------------------------------------------ *;
* OUTPUTS:                                                           *;
*  REPORT : PDO REPORT WITH THE CALIBRATION FORMULA IN HTML FORMAT   *;
* ------------------------------------------------------------------ *;
* AUTHOR: WENSUI.LIU@53.COM                                          *;
**********************************************************************;

options nonumber nodate orientation = landscape nocenter;  

*** CHECK IF THERE IS WEIGHT VARIABLE ***;
%if %upcase(&wt) = NONE %then %do;
  data _tmp1 (keep = &y &score _wt);
    set &data;
    where &y in (1, 0)  and
          &score ~= .;
    _wt = 1;
    &score = round(&score., 1);
  run;
%end;
%else %do;
  data _tmp1 (keep = &y &score _wt);
    set &data;
    where &y in (1, 0)        and
          &score ~= .         and
          round(&wt., 1) > 0;
    _wt = round(&wt., 1);
    &score = round(&score., 1);
  run;
%end;

proc logistic data = _tmp1 desc outest = _est1 noprint;
  model &y = &score;
  freq _wt;
run;

proc sql noprint;
  select round(min(&score), 0.01) into :min_score from _tmp1;

  select round(max(&score), 0.01) into :max_score from _tmp1;
quit;

data _est2;
  set _est1 (keep = intercept &score rename = (&score = slope));

  adjust_beta0 = &ref_score - (&target_pdo * log(&target_odds) / log(2)) - intercept * &target_pdo / log(2);
  adjust_beta1 = -1 * (&target_pdo * slope / log(2));

  do i = -5 to 5;
    old_pdo = round(-log(2) / slope, 0.01);
    old_ref = &ref_score + (i) * old_pdo;
    old_odd = exp(-(slope * old_ref + intercept)); 
    if old_ref >= &min_score and old_ref <= &max_score then output; 
  end;
run;

data _tmp2;
  set _tmp1;
  
  if _n_ = 1 then do;
    set _est2(obs = 1);
  end;

  adjusted = adjust_beta0 + adjust_beta1 * &score;
run;

proc logistic data = _tmp2 desc noprint outest = _est3;
  model &y = adjusted;
  freq _wt;
run;

data _est4;
  set _est3 (keep = intercept adjusted rename = (adjusted = slope));

  adjust_beta0 = &ref_score - (&target_pdo * log(&target_odds) / log(2)) - intercept * &target_pdo / log(2);
  adjust_beta1 = -1 * (&target_pdo * slope / log(2));

  do i = -5 to 5;
    new_pdo = round(-log(2) / slope, 0.01);
    new_ref = &ref_score + (i) * new_pdo;
    new_odd = exp(-(slope * new_ref + intercept)); 
    if new_ref >= &min_score and new_ref <= &max_score then output;
  end;
run;
 
proc sql noprint;
create table
  _final as
select
  &target_pdo            as target_pdo,
  &target_odds           as target_odds, 
  a.old_pdo              as pdo1,
  a.old_ref              as ref1,
  a.old_odd              as odd1,
  log(a.old_odd)         as ln_odd1,
  a.adjust_beta0         as adjust_beta0, 
  a.adjust_beta1         as adjust_beta1,
  b.new_pdo              as pdo2,
  b.new_ref              as ref2,
  b.new_odd              as odd2,
  log(b.new_odd)         as ln_odd2
from
  _est2 as a inner join _est4 as b
on
  a.i = b.i;

select round(pdo1, 1) into :pdo1 from _final;

select put(max(pdo1 / pdo2 - 1, 0), percent10.2) into :compare from _final;

select case when pdo1 > pdo2 then 1 else 0 end into :flag from _final;

select put(adjust_beta0, 12.8) into :beta0 from _final;

select put(adjust_beta1, 12.8) into :beta1 from _final;
quit;

%put &compare;
ods html file = "%upcase(%trim(&score))_PDO_SUMMARY.html" style = sasweb;
title;
proc report data  = _final box spacing = 1 split = "/" 
  style(header) = [font_face = "courier new"] style(column) = [font_face = "courier new"]
  style(lines) = [font_face = "courier new" font_size = 2] style(report) = [font_face = "courier new"];

  column("/SUMMARY OF POINTS TO DOUBLE ODDS FOR %upcase(&score) WEIGHTED BY %upcase(&wt) IN DATA %upcase(&data)
          /( TARGET PDO = &target_pdo, TARGET ODDS = &target_odds AT REFERENCE SCORE &ref_score ) / "
         pdo1 ref1 odd1 ln_odd1 pdo2 ref2 odd2 ln_odd2);

  define pdo1    / "OBSERVED/SCORE PDO"   width = 10 format = 4.   center;
  define ref1    / "OBSERVED/REF. SCORE"  width = 15 format = 5.   center order order = data;
  define odd1    / "OBSERVED/ODDS"        width = 15 format = 14.4 center;
  define ln_odd1 / "OBSERVED/LOG ODDS"    width = 15 format = 8.2  center;
  define pdo2    / "ADJUSTED/SCORE PDO"   width = 10 format = 4.   center;
  define ref2    / "ADJUSTED/REF. SCORE"  width = 15 format = 5.   center;
  define odd2    / "ADJUSTED/ODDS"        width = 15 format = 14.4 center;  
  define ln_odd2 / "ADJUSTED/LOG ODDS"    width = 15 format = 8.2  center;

  compute after;
  %if &flag = 1 %then %do;
    line @15 "THE SCORE ODDS IS DETERIORATED BY %trim(&compare).";
    line @15 "CALIBRATION FORMULA: ADJUSTED SCORE = %trim(&beta0) + %trim(&beta1) * %trim(%upcase(&score)).";
  %end;
  %else %do;
    line @25 "THERE IS NO DETERIORATION IN THE SCORE ODDS."; 
  %end;
  endcomp;
run;;
ods html close;

*************************************************;
*              END OF THE MACRO                 *;
*************************************************; 
%mend get_pdo;

In the following example, I called the macro to calculate the PDO of bureau_score in a test dataset. Parameter inputs of “ref_score”, “target_odds”, and “target_pdo” could be acquired from the development or the benchmark sample. As shown in the output, the observed PDO is higher than the target PDO, implying a 2.93% deterioration in the score effectiveness. In addition to the PDO calculation, the SAS macro also generates a formula to calibrate the bureau_score back to the target PDO.

data tmp1;
  set data.accepts;
  where bureau_score ~= .;
run;

%get_pdo(data = tmp1, score = bureau_score, y = bad, wt = weight, ref_score = 680, target_odds = 20, target_pdo = 45);

/*
 -----------------------------------------------------------------------------------------------------------------------
 |                  SUMMARY OF POINTS TO DOUBLE ODDS FOR BUREAU_SCORE WEIGHTED BY WEIGHT IN DATA TMP1                  |
 |                            ( TARGET PDO = 45, TARGET ODDS = 20 AT REFERENCE SCORE 680 )                             |
 |                                                                                                                     |
 | OBSERVED     OBSERVED        OBSERVED        OBSERVED      ADJUSTED     ADJUSTED        ADJUSTED        ADJUSTED    |
 |SCORE PDO    REF. SCORE         ODDS          LOG ODDS     SCORE PDO    REF. SCORE         ODDS          LOG ODDS    |
 |---------------------------------------------------------------------------------------------------------------------| 
 |     46   |       448     |        0.6404 |      -0.45    |     45   |       455     |        0.6250 |      -0.47    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     46   |       495     |        1.2809 |       0.25    |     45   |       500     |        1.2500 |       0.22    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     46   |       541     |        2.5618 |       0.94    |     45   |       545     |        2.5000 |       0.92    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     46   |       587     |        5.1239 |       1.63    |     45   |       590     |        5.0000 |       1.61    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     46   |       634     |       10.2483 |       2.33    |     45   |       635     |       10.0000 |       2.30    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     46   |       680     |       20.4976 |       3.02    |     45   |       680     |       20.0000 |       3.00    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     46   |       726     |       40.9971 |       3.71    |     45   |       725     |       40.0000 |       3.69    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     46   |       773     |       81.9981 |       4.41    |     45   |       770     |       80.0000 |       4.38    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     46   |       819     |      164.0038 |       5.10    |     45   |       815     |      160.0000 |       5.08    | 
 |---------------------------------------------------------------------------------------------------------------------| 
 |              THE SCORE ODDS IS DETERIORATED BY 2.93%.                                                               |            
 |              CALIBRATION FORMULA: ADJUSTED SCORE = 20.92914838 + 0.97156812 * BUREAU_SCORE.                         |            
 ----------------------------------------------------------------------------------------------------------------------- 
*/
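For readers who prefer to see the arithmetic outside of the macro, below is a minimal R sketch of the same PDO and calibration calculations. The intercept and slope are hypothetical placeholders, assumed to come from a regression of the log bad-to-good odds on the score, which is consistent with how the macro converts them to odds via exp(-(slope * score + intercept)); they are not the actual estimates behind the output above.

# a minimal R sketch of the pdo and calibration arithmetic used in the macro;
# intercept and slope are hypothetical placeholders for the fitted coefficients
intercept <- 7.1543
slope <- -0.0149664

ref_score <- 680
target_odds <- 20
target_pdo <- 45

# observed points-to-double-odds implied by the slope
obs_pdo <- round(-log(2) / slope, 2)

# calibration coefficients such that adjusted = beta0 + beta1 * score carries
# the target pdo and the target odds at the reference score
beta0 <- ref_score - target_pdo * log(target_odds) / log(2) - intercept * target_pdo / log(2)
beta1 <- -1 * target_pdo * slope / log(2)

# reference scores spaced one observed pdo apart with their implied odds;
# the macro additionally restricts these to the observed score range
i <- -5:5
data.frame(ref = ref_score + i * obs_pdo,
           odd = exp(-(slope * (ref_score + i * obs_pdo) + intercept)))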

In the second example, I used the calibration formula above to generate an adjust_score from the bureau_score and then calculated the PDO again. As shown in the output, after the calibration, the observed PDO of adjust_score becomes equivalent to the target PDO of bureau_score. For every 45-point increase in the score, e.g. from 680 to 725, the odds double as anticipated, e.g. from 20 to 40.

data tmp2;
  set tmp1;
  adjust_score = 20.92914838 + 0.97156812 * bureau_score;
run;

%get_pdo(data = tmp2, score = adjust_score, y = bad, wt = weight, ref_score = 680, target_odds = 20, target_pdo = 45);

/*
 -----------------------------------------------------------------------------------------------------------------------
 |                  SUMMARY OF POINTS TO DOUBLE ODDS FOR ADJUST_SCORE WEIGHTED BY WEIGHT IN DATA TMP2                  |
 |                            ( TARGET PDO = 45, TARGET ODDS = 20 AT REFERENCE SCORE 680 )                             |
 |                                                                                                                     |
 | OBSERVED     OBSERVED        OBSERVED        OBSERVED      ADJUSTED     ADJUSTED        ADJUSTED        ADJUSTED    |
 |SCORE PDO    REF. SCORE         ODDS          LOG ODDS     SCORE PDO    REF. SCORE         ODDS          LOG ODDS    |
 |---------------------------------------------------------------------------------------------------------------------| 
 |     45   |       455     |        0.6249 |      -0.47    |     45   |       455     |        0.6250 |      -0.47    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     45   |       500     |        1.2498 |       0.22    |     45   |       500     |        1.2500 |       0.22    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     45   |       545     |        2.4996 |       0.92    |     45   |       545     |        2.5000 |       0.92    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     45   |       590     |        4.9991 |       1.61    |     45   |       590     |        5.0000 |       1.61    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     45   |       635     |        9.9980 |       2.30    |     45   |       635     |       10.0000 |       2.30    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     45   |       680     |       19.9956 |       3.00    |     45   |       680     |       20.0000 |       3.00    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     45   |       725     |       39.9905 |       3.69    |     45   |       725     |       40.0000 |       3.69    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     45   |       770     |       79.9796 |       4.38    |     45   |       770     |       80.0000 |       4.38    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     45   |       815     |      159.9565 |       5.07    |     45   |       815     |      160.0000 |       5.08    | 
 |---------------------------------------------------------------------------------------------------------------------| 
 |                        THERE IS NO DETERIORATION IN THE SCORE ODDS.                                                 |
 ----------------------------------------------------------------------------------------------------------------------- 
*/
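It is also easy to see why the calibration works: with adjusted = beta0 + beta1 * score, the slope of the log odds with respect to the adjusted score becomes slope / beta1, so the adjusted PDO equals -log(2) * beta1 / slope, which is the target PDO by the construction of beta1. A short R check, reusing the hypothetical slope from the earlier sketch:

slope <- -0.0149664                 # hypothetical observed slope from the earlier sketch
beta1 <- -1 * 45 * slope / log(2)   # calibration slope for a 45-point target pdo
-log(2) / (slope / beta1)           # adjusted pdo, equal to 45 by construction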

Modeling Practices of Loss Forecasting for Consumer Banking Portfolio

Roll Rate Models

The roll rate model is the most commonly used modeling practice for loss forecasting in the consumer banking arena and is built at the portfolio level instead of at the individual account level.

In this modeling practice, the whole portfolio is segmented by various delinquency buckets, e.g. Current, 30-DPD, 60-DPD, 90-DPD, 120-DPD, 150-DPD, and Charge-off. The purpose is to evaluate the probability of an account in a specific delinquency bucket flowing into the next stage of delinquency status during the course of one month. The table below demonstrates the scheme of a basic roll rate model, with the projected roll rates shown in the bottom two rows highlighted in red. The projected rate for each delinquency bucket is simply the moving average of the previous 3 months.
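As a quick illustration of the projection step, the R sketch below computes next month's roll rates as the 3-month moving average of monthly roll rates by bucket; both the bucket names and the data are simulated assumptions for illustration only.

# project next-month roll rates as the moving average of the latest 3 months;
# the monthly roll rates below are simulated purely for illustration
set.seed(1)
buckets <- c("curr_to_30", "dpd30_to_60", "dpd60_to_90", "dpd90_to_120", "dpd120_to_150", "dpd150_to_co")
roll <- matrix(round(runif(12 * length(buckets), 0.01, 0.40), 4), nrow = 12,
               dimnames = list(paste0("m", 1:12), buckets))
proj <- colMeans(roll[(nrow(roll) - 2):nrow(roll), ])   # moving average of the last 3 months
round(proj, 4)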

Due to its simplicity, the roll rate model is able to fit into various business scenarios, e.g. delinquency tracking and loss forecasting, and across different consumer products, e.g. installment loans and revolving credits. However, since the roll rate for a specific delinquency bucket is estimated on a moving-average basis without being able to incorporate exogenous risk factors and economic drivers, the roll rate model is most applicable to short-term loss forecasting, e.g. usually within 3 months, and is not able to estimate the portfolio loss under a stressed economic scenario.

Vintage Loss Models

A vintage loss model is another widely used modeling technique for loss forecasting and is similar to the roll rate model in that both are portfolio-based instead of account-based. However, in the vintage loss model, the whole portfolio is segmented by various origination vintages instead of delinquency buckets.

In this modeling practice, once the vintage criterion is determined, the delinquency performance of each segment, e.g. from 30-DPD through Charge-off, can be tracked over time through the full life cycle. For instance, in the case of loss estimation, the loss rate of a specific vintage can be formulated as

Ln(Loss_Rate / (1 – Loss_Rate)) = A * Vintage_Quality + B * Economy_Driver + S(Maturation)

In the model specification above, while Vintage_Quality, e.g. the origination credit score or loan-to-value, and Economy_Driver, e.g. the unemployment rate or housing price index, enter as linear components, Maturation, e.g. months on book, is a non-parametric term reflecting the nonlinearity of the maturity curve.
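As a sketch of how such a specification might be estimated, the R code below fits a binomial GAM with the mgcv package on simulated vintage-level data, with the maturation effect entering through a smooth term. The data, the variable names, and the choice of mgcv are assumptions for illustration, not necessarily how the model would be built in practice.

# a sketch of the vintage loss model: logit of the loss rate as a linear function
# of vintage quality and an economic driver plus a smooth maturation curve;
# the vintage-level data below are simulated and mgcv is assumed to be installed
library(mgcv)

set.seed(2)
n <- 500
df <- data.frame(
  mob             = sample(1:60, n, replace = TRUE),     # months on book
  vintage_quality = rnorm(n),                            # e.g. origination score (standardized)
  economy_driver  = rnorm(n),                            # e.g. unemployment rate (standardized)
  n_loans         = sample(500:1000, n, replace = TRUE)  # loans in the vintage segment
)
# simulate a hump-shaped maturation effect on the logit scale
logit <- -4 + 0.5 * df$vintage_quality + 0.8 * df$economy_driver + sin(df$mob / 60 * pi)
df$n_loss <- rbinom(n, df$n_loans, plogis(logit))

# Ln(Loss_Rate / (1 - Loss_Rate)) = A * Vintage_Quality + B * Economy_Driver + S(Maturation)
fit <- gam(cbind(n_loss, n_loans - n_loss) ~ vintage_quality + economy_driver + s(mob),
           family = binomial, data = df)
summary(fit)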

Compared with the roll rate model, the vintage loss model demonstrates a twofold benefit. First of all, with the inclusion of the maturation information, the loss trend can be captured and utilized to improve the forecasting over a longer horizon. Secondly, by incorporating the economic information, the model can also be used to perform stress testing under various economic scenarios. However, a caveat is that, for loss forecasting, the vintage loss model is more suitable for installment loans than for revolving credits due to the impacts of various business practices such as balance transfers and teaser rates.

Expected Losses Models

Albeit easy to understand and simple to implement, the two methods introduced above are both subject to the criticism that they cannot incorporate loan-specific characteristics or a finer level of economic information. Significantly different from roll rate models and vintage loss curves, expected loss (EL) estimation is a modern modeling practice developed on the basis of three risk parameters, namely the probability of default (PD), the exposure at default (EAD), and the loss given default (LGD).

In this modeling practice, each risk parameter is modeled separately with account-level risk factors and economic indicators. For each individual account, the expected loss over the course of the next 12 months can be formulated as

EL = PD * EAD * LGD
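To make the account-level calculation concrete, below is a simplified R sketch in which PD is estimated with a logistic regression while EAD and LGD are kept as crude placeholders; all variable names, model forms, and data are illustrative assumptions rather than a recommended implementation.

# a sketch of combining the three risk parameters into account-level expected
# losses; the data and the component models below are simulated for illustration
set.seed(3)
n <- 1000
acct <- data.frame(
  fico = round(rnorm(n, 700, 50)),          # account-level risk factor
  util = runif(n),                          # utilization rate
  ur   = rnorm(n, 5, 1),                    # unemployment rate (economic indicator)
  bal  = round(runif(n, 1000, 20000), 2)    # current balance
)
acct$default <- rbinom(n, 1, plogis(-5 + 0.008 * (750 - acct$fico) + 2 * acct$util + 0.3 * acct$ur))

# component models: PD via logistic regression; EAD and LGD kept as simple
# placeholders (a fixed credit-conversion factor and a flat severity) here
pd_fit <- glm(default ~ fico + util + ur, family = binomial, data = acct)
acct$pd  <- predict(pd_fit, type = "response")
acct$ead <- acct$bal * 1.1     # placeholder EAD: balance times a conversion factor
acct$lgd <- 0.9                # placeholder LGD: flat severity assumption

# account-level 12-month expected loss and the portfolio total
acct$el <- acct$pd * acct$ead * acct$lgd
sum(acct$el)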

This modeling methodology is not only in line with the Basel framework but also has the following advantages over traditional methods for loss forecasting, including roll rate models and vintage loss models.

1. The account-level modeling practice provides a more granular risk profiling for each individual borrower.
2. Each risk parameter is driven by a separate set of economic factors independently, allowing a more dynamic view of economic impacts.
3. The modeling methodology is in line with statistical techniques prevailing in the consumer lending arena and is intuitively adoptable by most model developers.

Since this methodology heavily relies on statistical modeling concepts, the loss estimation is subject to specific statistical assumptions on distributions and functional forms.