Monotonic Binning Driven by Decision Tree

After the development of the MOB package (https://github.com/statcompute/MonotonicBinning), I was asked by a couple of users about the possibility of using a decision tree to drive the monotonic binning. Although I am not aware of any R package implementing a decision tree with a monotonic constraint, I did manage to find a solution based upon the decision tree.

The Rborist package is an implementation of the Random Forest that enforces monotonicity at the local level within each tree but not at the global level for the whole forest. However, with a few tweaks to the Rborist syntax, it is not difficult to convert a forest of many trees into a forest with a single tree. After all necessary adjustments, I finally ended up with a decision tree that can be used to drive the monotonic binning algorithm, as shown in the arb_bin() function below, and will consider adding it to the MOB package later.

arb_bin <- function(data, y, x) {
yname <- deparse(substitute(y))
xname <- deparse(substitute(x))
df1 <- subset(data, !is.na(data[[xname]]) & data[[yname]] %in% c(0, 1), select = c(xname, yname))
df2 <- data.frame(y = df1[[yname]], x = df1[[xname]])
spc <- cor(df2[, 2], df2[, 1], method = "spearman", use = "complete.obs")
mdl <- Rborist::Rborist(as.matrix(df2$x), df2$y, noValidate = T, nTree = 1, regMono = spc / abs(spc),
ctgCensus = "prob", minInfo = exp(100), nSamp = nrow(df2) , withRepl = F)
df3 <- data.frame(y = df2$y, x = df2$x, yhat = predict(mdl, newdata = as.matrix(df2$x), ctgCensus = "prob")$yPred)
df4 <- Reduce(rbind,
lapply(split(df3, df3$yhat),
function(x) data.frame(maxx = max(x$x), yavg = mean(x$y), yhat = round(mean(x$yhat), 8))))
df5 <- df4[order(df4$maxx), ]
h <- ifelse(df5[["yavg"]][1] %in% c(0, 1), 2, 1)
t <- ifelse(df5[["yavg"]][nrow(df5)] %in% c(0, 1), 2, 1)
cuts <- df5$maxx[h:max(h, (nrow(df5) - t))]
return(list(df = manual_bin(data, yname, xname, cuts = cuts),
cuts = cuts))
}
arb_bin(df, bad, rev_util)
# bin rule freq dist mv_cnt bad_freq bad_rate woe iv ks
# 01 $X <= 24 2653 0.4545 0 414 0.1560 -0.3320 0.0452 13.6285
# 02 $X > 24 & $X <= 36 597 0.1023 0 96 0.1608 -0.2963 0.0082 16.3969
# 03 $X > 36 & $X <= 40 182 0.0312 0 32 0.1758 -0.1890 0.0011 16.9533
# 04 $X > 40 & $X <= 58 669 0.1146 0 137 0.2048 -0.0007 0.0000 16.9615
# 05 $X > 58 & $X <= 60 77 0.0132 0 16 0.2078 0.0177 0.0000 16.9381
# 06 $X > 60 & $X <= 72 408 0.0699 0 95 0.2328 0.1636 0.0020 15.7392
# 07 $X > 72 & $X <= 73 34 0.0058 0 8 0.2353 0.1773 0.0002 15.6305
# 08 $X > 73 & $X <= 75 62 0.0106 0 16 0.2581 0.2999 0.0010 15.2839
# 09 $X > 75 & $X <= 83 246 0.0421 0 70 0.2846 0.4340 0.0089 13.2233
# 10 $X > 83 & $X <= 96 376 0.0644 0 116 0.3085 0.5489 0.0225 9.1266
# 11 $X > 96 & $X <= 98 50 0.0086 0 17 0.3400 0.6927 0.0049 8.4162
# 12 $X > 98 483 0.0827 0 179 0.3706 0.8263 0.0695 0.0000
arb_bin(df, bad, tot_derog)
# bin rule freq dist mv_cnt bad_freq bad_rate woe iv ks
# 00 is.na($X) 213 0.0365 213 70 0.3286 0.6416 0.0178 2.7716
# 01 $X <= 0 2850 0.4883 0 367 0.1288 -0.5559 0.1268 20.0442
# 02 $X > 0 & $X <= 1 891 0.1526 0 193 0.2166 0.0704 0.0008 18.9469
# 03 $X > 1 & $X <= 2 478 0.0819 0 121 0.2531 0.2740 0.0066 16.5222
# 04 $X > 2 & $X <= 3 332 0.0569 0 86 0.2590 0.3050 0.0058 14.6321
# 05 $X > 3 & $X <= 23 1064 0.1823 0 353 0.3318 0.6557 0.0931 0.4370
# 06 $X > 23 9 0.0015 0 6 0.6667 2.0491 0.0090 0.0000


Granular Weighted Binning by Generalized Boosted Model

In the post https://statcompute.wordpress.com/2019/04/27/more-general-weighted-binning, I've shown how to do the weighted binning with the wqtl_bin() function by iterative partitioning. However, the outcome from wqtl_bin() can sometimes be too coarse. The wgbm_bin() function (https://github.com/statcompute/MonotonicBinning/blob/master/code/wgbm_bin.R) leverages the idea of gbm(), which implements the Generalized Boosted Model, and generates a more granular weighted binning outcome.

Below is a demonstration showing the difference between wqtl_bin() and wgbm_bin() outcomes. Even with the same data, the wgbm_bin() function is able to generate a more granular binning result and a 14% higher Information Value.

df <- readRDS("archive/accepts.rds")
head(df, 1)
# bankruptcy bad app_id tot_derog tot_tr age_oldest_tr tot_open_tr tot_rev_tr tot_rev_debt tot_rev_line rev_util bureau_score purch_price msrp
# 0 0 1001 6 7 46 NaN NaN NaN NaN 0 747 19678 17160
# down_pyt purpose loan_term loan_amt ltv tot_income used_ind weight
# 947.15 LEASE 36 18730.85 109 4800 0 4.75
### BY ITERATIVE PARTITION ###
source("wqtl_bin.R")
wqtl_bin(df, bad, tot_open_tr, weight)
# bin rule cnt freq dist mv_wt bad_freq bad_rate woe iv ks
# 00 is.na($X) 1416 5398.50 0.2323 5398.5 354 0.0656 0.2573 0.0173 6.7157
# 01 $X <= 6 2994 12050.25 0.5185 0.0 579 0.0480 -0.0722 0.0026 3.0908
# 02 $X > 6 1427 5792.00 0.2492 0.0 263 0.0454 -0.1315 0.0041 0.0000
### BY GENERALIZED BOOSTED MODEL ###
source("wgbm_bin.R")
wgbm_bin(df, bad, tot_open_tr, weight)
# bin rule cnt freq dist mv_wt bad_freq bad_rate woe iv ks
# 00 is.na($X) 1416 5398.50 0.2323 5398.5 354 0.0656 0.2573 0.0173 6.7157
# 01 $X <= 2 525 2085.00 0.0897 0.0 109 0.0523 0.0166 0.0000 6.8658
# 02 $X > 2 & $X <= 3 605 2408.75 0.1036 0.0 124 0.0515 0.0004 0.0000 6.8695
# 03 $X > 3 & $X <= 5 1319 5342.75 0.2299 0.0 246 0.0460 -0.1169 0.0030 4.3181
# 04 $X > 5 & $X <= 14 1899 7696.50 0.3312 0.0 353 0.0459 -0.1210 0.0046 0.5213
# 05 $X > 14 73 309.25 0.0133 0.0 10 0.0323 -0.4846 0.0025 0.0000



Why Use Weight of Evidence?

I had been asked why I spent so much effort on developing SAS macros and R functions to do the monotonic binning for the WoE transformation, given the availability of other cutting-edge data mining algorithms that automatically generate predictions from whatever predictors are fed into the model. Nonetheless, what really distinguishes a good modeler from the rest is how to handle challenging data issues before feeding the data into the model, including missing values, outliers, linearity, and predictability, in a scalable way that can be rolled out to hundreds or even thousands of potential model drivers in the production environment.

The WoE transformation through monotonic binning provides a convenient way to address each of the aforementioned concerns.

1. Because WoE is a piecewise transformation based on data discretization, all missing values fall into a standalone category, either by itself or combined with a neighboring bin that shares a similar event probability. As a result, no special treatment for missing values is necessary.

2. After the monotonic binning of each variable, the WoE value for each bin is a projection from the predictor onto the response defined by the log ratio between event and non-event distributions. The raw value of the predictor no longer matters, and the issue of outliers therefore disappears.

3. While many modelers like to use log or power transformations to achieve a good linear relationship between the predictor and the log odds of the response, such transformations are heuristic at best with no guarantee of a good outcome. The WoE transformation, on the other hand, is strictly linear with respect to the log odds of the response with unity correlation. It is also worth mentioning that a numeric variable and its strictly monotone functions converge to the same monotonic WoE transformation.

4. Lastly, because WoE is defined as the log ratio between event and non-event distributions, it is indicative of the separation between cases with Y = 0 and cases with Y = 1. As the weighted sum of WoE values, with the weight being the difference between event and non-event distributions, the IV (Information Value) is an important statistic commonly used to measure predictor importance, as illustrated in the sketch below.
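
To make the above points concrete, below is a minimal sketch, not part of the MOB package, that computes WoE and IV directly from the bad and good counts of a binning table. The bin counts are made up purely for illustration.

### A MINIMAL WOE / IV SKETCH WITH MADE-UP BIN COUNTS ###
tbl <- data.frame(bin = c("01", "02", "03"), bads = c(100, 200, 300), goods = c(1900, 1800, 1200))
tbl$woe <- log((tbl$bads / sum(tbl$bads)) / (tbl$goods / sum(tbl$goods)))
tbl$iv <- (tbl$bads / sum(tbl$bads) - tbl$goods / sum(tbl$goods)) * tbl$woe
sum(tbl$iv)  # total information value of this hypothetical predictor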

Below is a simple example showing how to use WoE transformations in the estimation of a logistic regression.

df <- readRDS("df.rds")
### SHOWING THE RESPONSE IN THE LAST COLUMN ###
head(df, 2)
#tot_derog tot_tr age_oldest_tr tot_open_tr tot_rev_tr tot_rev_debt tot_rev_line rev_util bureau_score ltv tot_income bad
# 6 7 46 NaN NaN NaN NaN 0 747 109 4800.00 0
# 0 21 153 6 1 97 4637 2 744 97 5833.33 0
source("mob.R")
bin_out <- batch_bin(df, 3)
bin_out$BinSum
# var nbin unique miss min median max ks iv
# tot_derog 7 29 213 0 0.0 32 20.0442 0.2599
# tot_tr 15 67 213 0 16.0 77 17.3002 0.1425
# ……
top <- paste(bin_out$BinSum[order(bin_out$BinSum[["iv"]], decreasing = T), ][1:6, "var"], sep = '')
par(mfrow = c(2, 3))
lapply(top, function(x) plot(bin_out$BinLst[[x]]$df[["woe"]],
log(bin_out$BinLst[[x]]$df[["bad_rate"]] / (1 - bin_out$BinLst[[x]]$df[["bad_rate"]])),
type = "b", main = x, cex.main = 2, xlab = paste("woe of", x), ylab = "logit(bad)", cex = 2, col = "red"))
df_woe <- batch_woe(df, bin_out$BinLst)
str(df_woe$df)
#'data.frame': 5837 obs. of 12 variables:
# $ idx_ : int 1 2 3 4 5 6 7 8 9 10 …
# $ woe.tot_derog : num 0.656 -0.556 -0.556 0.274 0.274 …
# $ woe.tot_tr : num 0.407 -0.322 -0.4 -0.322 0.303 …
# ……
### PARSE VARIABLES WITH IV > 0.1 ###
x1 <- paste("woe", bin_out$BinSum[bin_out$BinSum[["iv"]] > 0.1, ]$var, sep = ".")
# "woe.tot_derog" "woe.tot_tr" "woe.age_oldest_tr" "woe.tot_rev_line" "woe.rev_util" "woe.bureau_score" "woe.ltv"
fml1 <- as.formula(paste("bad", paste(x1, collapse = " + "), sep = " ~ "))
sum1 <- summary(glm(fml1, data = cbind(bad = df$bad, df_woe$df), family = "binomial"))
### PARSE SIGNIFICANT VARIABLES WITH P-VALUE < 0.05 ###
x2 <- paste(row.names(sum1$coefficients)[sum1$coefficients[, 4] < 0.05][-1])
# "woe.age_oldest_tr" "woe.tot_rev_line" "woe.rev_util" "woe.bureau_score" "woe.ltv"
fml2 <- as.formula(paste("bad", paste(x2, collapse = " + "), sep = " ~ "))
mdl2 <- glm(fml2, data = cbind(bad = df$bad, df_woe$df), family = "binomial")
# Estimate Std. Error z value Pr(>|z|)
#(Intercept) -1.38600 0.03801 -36.461 < 2e-16 ***
#woe.age_oldest_tr 0.30376 0.08176 3.715 0.000203 ***
#woe.tot_rev_line 0.42935 0.06793 6.321 2.61e-10 ***
#woe.rev_util 0.29150 0.08721 3.342 0.000831 ***
#woe.bureau_score 0.83568 0.04974 16.803 < 2e-16 ***
#woe.ltv 0.97789 0.09121 10.721 < 2e-16 ***
pROC::roc(response = df$bad, predictor = fitted(mdl2))
# Area under the curve: 0.7751



More General Weighted Binning

You might be wondering what motivates me to spend countless weekend hours on the MOB package. The answer is plain and simple. It is users who are driving the development work.

After I published the wts_bin() function last week showing the impact of two-value weights on the monotonic binning outcome (https://statcompute.wordpress.com/2019/04/21/binning-with-weights), I was asked whether I could write a more general weighted binning function with weights taking any positive value. The wqtl_bin() function is my answer (https://github.com/statcompute/MonotonicBinning/blob/master/code/wqtl_bin.R).

Below is an example demonstrating how to use the wqtl_bin() function. First of all, let's apply the function to the case of two-value weights that was illustrated last week. As expected, statistics from both approaches are identical. In the second use case, let's assume that weights can be any value drawn from the Uniform distribution between 0 and 10. With positive random weights, all statistics have changed.

It is worth mentioning that, while binning rules can be the same with or without weights in some cases, this is not necessarily true in all situations, depending on the distribution of weights across the data sample. As shown in the binning outcomes for "ltv" below, there are 7 bins without weights but only 5 with weights.

wqtl_bin(cbind(df, w = ifelse(df$bad == 1, 1, 5)), bad, tot_derog, w)
#$df
# bin rule cnt freq dist mv_wt bad_freq bad_rate woe iv ks
#1 00 is.na($X) 213 785 0.0322 785 70 0.0892 0.6416 0.0178 2.7716
#2 01 $X <= 1 3741 16465 0.6748 0 560 0.0340 -0.3811 0.0828 18.9469
#3 02 $X > 1 & $X <= 2 478 1906 0.0781 0 121 0.0635 0.2740 0.0066 16.5222
#4 03 $X > 2 & $X <= 4 587 2231 0.0914 0 176 0.0789 0.5078 0.0298 10.6623
#5 04 $X > 4 818 3014 0.1235 0 269 0.0893 0.6426 0.0685 0.0000
#$cuts
#[1] 1 2 4
wqtl_bin(cbind(df, w = runif(nrow(df), 0, 10)), bad, tot_derog, w)
#$df
# bin rule cnt freq dist mv_wt bad_freq bad_rate woe iv ks
#1 00 is.na($X) 213 952.32 0.0325 952.32 304.89 0.3202 0.5808 0.0128 2.1985
#2 01 $X <= 1 3741 18773.11 0.6408 0.00 2943.75 0.1568 -0.3484 0.0700 17.8830
#3 02 $X > 1 & $X <= 2 478 2425.26 0.0828 0.00 604.51 0.2493 0.2312 0.0047 15.8402
#4 03 $X > 2 & $X <= 4 587 2989.80 0.1021 0.00 882.83 0.2953 0.4639 0.0249 10.4761
#5 04 $X > 4 818 4156.29 0.1419 0.00 1373.26 0.3304 0.6275 0.0657 0.0000
#$cuts
#[1] 1 2 4
wqtl_bin(cbind(df, w = runif(nrow(df), 0, 10)), bad, ltv, w)
#$df
# bin rule cnt freq dist mv_wt bad_freq bad_rate woe iv ks
#1 01 $X <= 88 1289 6448.76 0.2202 0.00 759.93 0.1178 -0.6341 0.0724 11.4178
#2 02 $X > 88 & $X <= 98 1351 6695.88 0.2286 0.00 1211.98 0.1810 -0.1306 0.0037 14.2883
#3 03 $X > 98 & $X <= 104 1126 5662.21 0.1933 0.00 1212.52 0.2141 0.0788 0.0012 12.7295
#4 04 $X > 104 & $X <= 113 1044 5277.64 0.1802 0.00 1210.91 0.2294 0.1674 0.0053 9.5611
#5 05 $X > 113 | is.na($X) 1027 5205.38 0.1777 0.93 1497.29 0.2876 0.4721 0.0451 0.0000
qtl_bin(df, bad, ltv)
#$df
# bin rule freq dist mv_cnt bad_freq bad_rate woe iv ks
#1 01 $X <= 84 956 0.1638 0 102 0.1067 -0.7690 0.0759 9.8728
#2 02 $X > 84 & $X <= 93 960 0.1645 0 142 0.1479 -0.3951 0.0227 15.6254
#3 03 $X > 93 & $X <= 99 876 0.1501 0 187 0.2135 0.0518 0.0004 14.8359
#4 04 $X > 99 & $X <= 103 821 0.1407 0 179 0.2180 0.0787 0.0009 13.7025
#5 05 $X > 103 & $X <= 109 773 0.1324 0 178 0.2303 0.1492 0.0031 11.6401
#6 06 $X > 109 & $X <= 117 722 0.1237 0 190 0.2632 0.3263 0.0144 7.2169
#7 07 $X > 117 | is.na($X) 729 0.1249 1 218 0.2990 0.5041 0.0364 0.0000


Binning with Weights

After working on the MOB package, I received requests from multiple users asking whether I could write a binning function that takes a weighting scheme into consideration. It is a legitimate request from the practical standpoint. For instance, in the development of fraud detection models, we often sample down non-fraud cases given the extremely low frequency of fraud instances. After the down-sampling, a weight value > 1 should be assigned to all non-fraud cases to reflect the fraud rate in the pre-sample data.
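
As a hypothetical illustration of this weighting scheme, with made-up numbers rather than the data used below, suppose 1,000 out of 100,000 records are frauds and only 10% of the non-fraud cases are kept after the down-sampling. Each sampled non-fraud case would then carry a weight of 1 / 0.1 = 10 so that the weighted fraud rate reproduces the pre-sample fraud rate.

### A HYPOTHETICAL EXAMPLE OF POST-SAMPLING WEIGHTS WITH MADE-UP NUMBERS ###
n_fraud <- 1000
n_nonfraud <- 99000
keep_rate <- 0.1                     # proportion of non-fraud cases kept after the down-sampling
w_nonfraud <- 1 / keep_rate          # weight assigned to each sampled non-fraud case
n_fraud / (n_fraud + keep_rate * n_nonfraud * w_nonfraud)   # weighted fraud rate after sampling
# [1] 0.01
n_fraud / (n_fraud + n_nonfraud)                            # fraud rate before sampling
# [1] 0.01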

While accommodating the request for weighting cases is trivial, I'd like to do a simple experiment showing what the impact of weighting might be.

– First of all, let’s apply the monotonic binning to a variable named “tot_derog”. In this unweighted binning output, KS = 18.94, IV = 0.21, and WoE values range from -0.38 to 0.64.

– In the first trial, a weight value = 5 is assigned to cases with Y = 0 and a weight value = 1 assigned to cases with Y = 1. As expected, frequency, distribution, bad_frequency, and bad_rate changed. However, KS, IV, and WoE remain identical.

– In the second trial, a weight value = 1 is assigned to cases with Y = 0 and a weight value = 5 assigned to cases with Y = 1. Once again, KS, IV, and WoE are still the same as the unweighted output.

The conclusion from this demonstration is very clear. In the case of two-value weights assigned to the binary Y, the variable importance reflected by IV / KS and the WoE values remain identical with or without weights. However, if you are concerned about the binning distribution and the bad rate in each bin, the wts_bin() function does the correction and is available in the project repository (https://github.com/statcompute/MonotonicBinning).
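
The invariance can also be verified with a quick sanity check outside the MOB package, again with made-up bin counts rather than the accepts data: scaling all bads by one constant and all goods by another cancels out inside the WoE log ratio, so WoE, and consequently IV and KS, stay the same.

### QUICK CHECK THAT TWO-VALUE WEIGHTS DO NOT CHANGE WOE (MADE-UP COUNTS) ###
woe <- function(bads, goods) log((bads / sum(bads)) / (goods / sum(goods)))
b <- c(100, 200, 300)
g <- c(1900, 1800, 1200)
all.equal(woe(b, g), woe(b * 1, g * 5))
# [1] TRUE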

derog_bin <- qtl_bin(df, bad, tot_derog)
derog_bin
#$df
# bin rule freq dist mv_cnt bad_freq bad_rate woe iv ks
# 00 is.na($X) 213 0.0365 213 70 0.3286 0.6416 0.0178 2.7716
# 01 $X <= 1 3741 0.6409 0 560 0.1497 -0.3811 0.0828 18.9469
# 02 $X > 1 & $X <= 2 478 0.0819 0 121 0.2531 0.2740 0.0066 16.5222
# 03 $X > 2 & $X <= 4 587 0.1006 0 176 0.2998 0.5078 0.0298 10.6623
# 04 $X > 4 818 0.1401 0 269 0.3289 0.6426 0.0685 0.0000
# $cuts
# [1] 1 2 4
wts_bin(derog_bin$df, c(1, 5))
# bin rule wt_freq wt_dist wt_bads wt_badrate wt_woe wt_iv wt_ks
# 00 is.na($X) 493 0.0464 350 0.7099 0.6416 0.0178 2.7716
# 01 $X <= 1 5981 0.5631 2800 0.4681 -0.3811 0.0828 18.9469
# 02 $X > 1 & $X <= 2 962 0.0906 605 0.6289 0.2740 0.0066 16.5222
# 03 $X > 2 & $X <= 4 1291 0.1216 880 0.6816 0.5078 0.0298 10.6623
# 04 $X > 4 1894 0.1783 1345 0.7101 0.6426 0.0685 0.0000
wts_bin(derog_bin$df, c(5, 1))
# bin rule wt_freq wt_dist wt_bads wt_badrate wt_woe wt_iv wt_ks
# 00 is.na($X) 785 0.0322 70 0.0892 0.6416 0.0178 2.7716
# 01 $X <= 1 16465 0.6748 560 0.0340 -0.3811 0.0828 18.9469
# 02 $X > 1 & $X <= 2 1906 0.0781 121 0.0635 0.2740 0.0066 16.5222
# 03 $X > 2 & $X <= 4 2231 0.0914 176 0.0789 0.5078 0.0298 10.6623
# 04 $X > 4 3014 0.1235 269 0.0893 0.6426 0.0685 0.0000


Batch Deployment of WoE Transformations

After wrapping up the batch_woe() function today, with the purpose of allowing users to apply WoE transformations to many independent variables simultaneously, I have completed the development of the major functions in the MOB package that are needed for model development in a production setting.

The batch_woe() function is basically a wrapper around cal_woe() and has two input parameters. The "data" parameter is the data frame to which binning outcomes would be deployed, and the "slst" parameter is a list of binning specification tables, either the direct output from the batch_bin() function or created manually by combining outputs from multiple binning functions.

There are also two components in the output of batch_woe(): a list of PSI tables for the transformed variables and a data frame with a row index and all transformed variables. The default printout is a PSI summary of all input variables to be transformed. As shown below, all PSI values are below 0.1 and therefore none is concerning.

binout <- batch_bin(df, 1)

woeout <- batch_woe(df[sample(seq(nrow(df)), 2000, replace = T), ], binout$BinLst)

woeout 
#     tot_derog tot_tr age_oldest_tr tot_open_tr tot_rev_tr tot_rev_debt ...
# psi    0.0027 0.0044        0.0144      0.0011      3e-04       0.0013 ...

str(woeout, max.level = 1)
# List of 2
#  $ psi:List of 11
#  $ df :'data.frame':	2000 obs. of  12 variables:
#  - attr(*, "class")= chr "psiSummary"

head(woeout$df, 1)
#  idx_ woe.tot_derog woe.tot_tr woe.age_oldest_tr woe.tot_open_tr woe.tot_rev_tr ...
#     1       -0.3811    -0.0215           -0.5356         -0.0722        -0.1012 ...
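
For readers unfamiliar with the PSI, below is a minimal sketch, with made-up bin distributions rather than the output above, showing how a PSI value can be calculated by comparing the bin distribution in the development sample against the bin distribution in the data being scored.

### A MINIMAL PSI SKETCH WITH MADE-UP BIN DISTRIBUTIONS ###
dev_dist <- c(0.20, 0.30, 0.50)   # bin distribution in the development sample
new_dist <- c(0.22, 0.28, 0.50)   # bin distribution in the data to be scored
sum((new_dist - dev_dist) * log(new_dist / dev_dist))
# a value below 0.1 is generally not considered concerning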

All source code of the MOB package is available on https://github.com/statcompute/MonotonicBinning and free (as in free beer) to download and distribute.

Batch Processing of Monotonic Binning

In my GitHub repository (https://github.com/statcompute/MonotonicBinning), multiple R functions have been developed to implement the monotonic binning by using either iterative discretization or isotonic regression. With these functions, we can run the monotonic binning for one independent variable at a time. However, in a real-world production environment, we often would want to apply the binning algorithm to hundreds or thousands of variables at once. In addition, we might be interested in comparing different binning outcomes.

The function batch_bin() is designed to apply a monotonic binning function to all numeric variables in a data frame with the last column as the dependent variable. Currently, four binning algorithms are supported: qtl_bin() and bad_bin() by iterative discretization, iso_bin() by isotonic regression, and gbm_bin() by the generalized boosted model. Before using these four functions, we need to save the related R files in the working folder, where they are sourced by the batch_bin() function. Scripts for the R functions can be downloaded from https://github.com/statcompute/MonotonicBinning/tree/master/code.

Below is a demonstration showing how to use the batch_bin() function, which only requires two input parameters, a data frame and an integer indicating the binning method. With method = 1, batch_bin() implements the iterative discretization by quantiles. With method = 4, batch_bin() implements the generalized boosted model. As shown below, both KS and IV with method = 4 are higher than with method = 1 due to more granular bins. For instance, for rev_util, method = 1 only generates 2 bins while method = 4 generates 11 bins.

head(df, 2)
# tot_derog tot_tr age_oldest_tr tot_open_tr tot_rev_tr tot_rev_debt tot_rev_line rev_util bureau_score ltv tot_income bad
#1 6 7 46 NaN NaN NaN NaN 0 747 109 4800.00 0
#2 0 21 153 6 1 97 4637 2 744 97 5833.33 0
batch_bin(df, 1)
#|var | nbin| unique| miss| min| median| max| ks| iv|
#|:--------------|-----:|-------:|-----:|----:|--------:|--------:|--------:|-------:|
#|tot_derog | 5| 29| 213| 0| 0.0| 32| 18.9469| 0.2055|
#|tot_tr | 5| 67| 213| 0| 16.0| 77| 15.7052| 0.1302|
#|age_oldest_tr | 10| 460| 216| 1| 137.0| 588| 19.9821| 0.2539|
#|tot_open_tr | 3| 26| 1416| 0| 5.0| 26| 6.7157| 0.0240|
#|tot_rev_tr | 3| 21| 636| 0| 3.0| 24| 9.0104| 0.0717|
#|tot_rev_debt | 3| 3880| 477| 0| 3009.5| 96260| 8.5102| 0.0627|
#|tot_rev_line | 9| 3617| 477| 0| 10573.0| 205395| 26.4924| 0.4077|
#|rev_util | 2| 101| 0| 0| 30.0| 100| 15.1570| 0.0930|
#|bureau_score | 12| 315| 315| 443| 692.5| 848| 34.8028| 0.7785|
#|ltv | 7| 145| 1| 0| 100.0| 176| 15.6254| 0.1538|
#|tot_income | 4| 1639| 5| 0| 3400.0| 8147167| 9.1526| 0.0500|
batch_bin(df, 1)$BinLst[["rev_util"]]$df
# bin rule freq dist mv_cnt bad_freq bad_rate woe iv ks
# 01 $X <= 31 3007 0.5152 0 472 0.1570 -0.3250 0.0493 15.157
# 02 $X > 31 2830 0.4848 0 724 0.2558 0.2882 0.0437 0.000
batch_bin(df, 4)
#|var | nbin| unique| miss| min| median| max| ks| iv|
#|:--------------|-----:|-------:|-----:|----:|--------:|--------:|--------:|-------:|
#|tot_derog | 8| 29| 213| 0| 0.0| 32| 20.0442| 0.2556|
#|tot_tr | 13| 67| 213| 0| 16.0| 77| 17.3002| 0.1413|
#|age_oldest_tr | 22| 460| 216| 1| 137.0| 588| 20.3646| 0.2701|
#|tot_open_tr | 6| 26| 1416| 0| 5.0| 26| 6.8695| 0.0274|
#|tot_rev_tr | 4| 21| 636| 0| 3.0| 24| 9.0779| 0.0789|
#|tot_rev_debt | 9| 3880| 477| 0| 3009.5| 96260| 8.8722| 0.0848|
#|tot_rev_line | 21| 3617| 477| 0| 10573.0| 205395| 26.8943| 0.4445|
#|rev_util | 11| 101| 0| 0| 30.0| 100| 16.9615| 0.1635|
#|bureau_score | 30| 315| 315| 443| 692.5| 848| 35.2651| 0.8344|
#|ltv | 17| 145| 1| 0| 100.0| 176| 16.8807| 0.1911|
#|tot_income | 17| 1639| 5| 0| 3400.0| 8147167| 10.3386| 0.0775|
batch_bin(df, 4)$BinLst[["rev_util"]]$df
# bin rule freq dist mv_cnt bad_freq bad_rate woe iv ks
# 01 $X <= 24 2653 0.4545 0 414 0.1560 -0.3320 0.0452 13.6285
# 02 $X > 24 & $X <= 36 597 0.1023 0 96 0.1608 -0.2963 0.0082 16.3969
# 03 $X > 36 & $X <= 40 182 0.0312 0 32 0.1758 -0.1890 0.0011 16.9533
# 04 $X > 40 & $X <= 58 669 0.1146 0 137 0.2048 -0.0007 0.0000 16.9615
# 05 $X > 58 & $X <= 60 77 0.0132 0 16 0.2078 0.0177 0.0000 16.9381
# 06 $X > 60 & $X <= 73 442 0.0757 0 103 0.2330 0.1647 0.0022 15.6305
# 07 $X > 73 & $X <= 75 62 0.0106 0 16 0.2581 0.2999 0.0010 15.2839
# 08 $X > 75 & $X <= 83 246 0.0421 0 70 0.2846 0.4340 0.0089 13.2233
# 09 $X > 83 & $X <= 96 376 0.0644 0 116 0.3085 0.5489 0.0225 9.1266
# 10 $X > 96 & $X <= 98 50 0.0086 0 17 0.3400 0.6927 0.0049 8.4162
# 11 $X > 98 483 0.0827 0 179 0.3706 0.8263 0.0695 0.0000


Monotonic Binning with GBM

In addition to the monotonic binning algorithms introduced in my previous post (https://statcompute.wordpress.com/2019/03/10/a-summary-of-my-home-brew-binning-algorithms-for-scorecard-development), two more functions based on Generalized Boosted Regression Models have been added to my GitHub repository: gbm_bin() and gbmcv_bin().

The gbm_bin() function estimates a GBM model without cross-validation and tends to generate a more granular binning outcome.

gbm_bin(df, bad, tot_derog)
# $df
# bin rule freq dist mv_cnt bad_freq bad_rate woe iv ks
# 00 is.na($X) 213 0.0365 213 70 0.3286 0.6416 0.0178 2.7716
# 01 $X <= 1 3741 0.6409 0 560 0.1497 -0.3811 0.0828 18.9469
# 02 $X > 1 & $X <= 2 478 0.0819 0 121 0.2531 0.2740 0.0066 16.5222
# 03 $X > 2 & $X <= 3 332 0.0569 0 86 0.2590 0.3050 0.0058 14.6321
# 04 $X > 3 & $X <= 9 848 0.1453 0 282 0.3325 0.6593 0.0750 3.2492
# 05 $X > 9 225 0.0385 0 77 0.3422 0.7025 0.0228 0.0000
# $cuts
# [1] 1 2 3 9


The gbmcv_bin() function estimates a GBM model with cross-validation (CV) and therefore generates a more stable but coarser binning outcome. Nonetheless, the computation is more expensive due to the CV, especially for large datasets.

gbmcv_bin(df, bad, tot_derog)
### OUTPUT ###
# $df
# bin rule freq dist mv_cnt bad_freq bad_rate woe iv ks
# 00 is.na($X) 213 0.0365 213 70 0.3286 0.6416 0.0178 2.7716
# 01 $X <= 1 3741 0.6409 0 560 0.1497 -0.3811 0.0828 18.9469
# 02 $X > 1 & $X <= 2 478 0.0819 0 121 0.2531 0.2740 0.0066 16.5222
# 03 $X > 2 1405 0.2407 0 445 0.3167 0.5871 0.0970 0.0000
# $cuts
# [1] 1 2


Motivated by an idea from my friend Talbot (https://www.linkedin.com/in/talbot-katz-b76785), I also drafted a pava_bin() function based upon the Pool Adjacent Violators Algorithm (PAVA) and compared it with the iso_bin() function based on the isotonic regression. As shown in the comparison below, there is no difference in the binning outcome. However, the computing cost of the pava_bin() function is higher, given that PAVA is an iterative algorithm solving for the monotonicity.

pava_bin(df, bad, tot_derog)$df
# bin rule freq dist mv_cnt bad_freq bad_rate woe iv ks
# 00 is.na($X) 213 0.0365 213 70 0.3286 0.6416 0.0178 2.7716
# 01 $X <= 1 3741 0.6409 0 560 0.1497 -0.3811 0.0828 18.9469
# 02 $X > 1 & $X <= 2 478 0.0819 0 121 0.2531 0.2740 0.0066 16.5222
# 03 $X > 2 & $X <= 3 332 0.0569 0 86 0.2590 0.3050 0.0058 14.6321
# 04 $X > 3 & $X <= 23 1064 0.1823 0 353 0.3318 0.6557 0.0931 0.4370
# 05 $X > 23 9 0.0015 0 6 0.6667 2.0491 0.0090 0.0000
iso_bin(df, bad, tot_derog)$df
# bin rule freq dist mv_cnt bad_freq bad_rate woe iv ks
# 00 is.na($X) 213 0.0365 213 70 0.3286 0.6416 0.0178 2.7716
# 01 $X <= 1 3741 0.6409 0 560 0.1497 -0.3811 0.0828 18.9469
# 02 $X > 1 & $X <= 2 478 0.0819 0 121 0.2531 0.2740 0.0066 16.5222
# 03 $X > 2 & $X <= 3 332 0.0569 0 86 0.2590 0.3050 0.0058 14.6321
# 04 $X > 3 & $X <= 23 1064 0.1823 0 353 0.3318 0.6557 0.0931 0.4370
# 05 $X > 23 9 0.0015 0 6 0.6667 2.0491 0.0090 0.0000
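
For readers curious about what PAVA actually does under the hood, below is a minimal standalone sketch of the algorithm. It is not the implementation used in pava_bin(), only an illustration of the idea: adjacent groups violating the monotonicity are repeatedly pooled and replaced by their average until the fitted sequence becomes monotone.

### A MINIMAL STANDALONE SKETCH OF THE POOL ADJACENT VIOLATORS ALGORITHM ###
pava <- function(y) {
  grp <- seq_along(y)                    # each observation starts as its own block
  repeat {
    avg <- tapply(y, grp, mean)          # current block means in ascending block order
    viol <- which(diff(avg) < 0)         # adjacent blocks violating the non-decreasing order
    if (length(viol) == 0) break
    ids <- as.numeric(names(avg))
    grp[grp == ids[viol[1] + 1]] <- ids[viol[1]]   # pool the violating block into its left neighbor
  }
  as.numeric(tapply(y, grp, mean)[as.character(grp)])
}
pava(c(0.10, 0.08, 0.15, 0.13, 0.20))
# [1] 0.09 0.09 0.14 0.14 0.20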


Deployment of Binning Outcomes in Production

In my previous post (https://statcompute.wordpress.com/2019/03/10/a-summary-of-my-home-brew-binning-algorithms-for-scorecard-development), I've shown the different monotonic binning algorithms that I developed over time. However, these binning functions are all useless without a deployment vehicle in production. During the weekend, I finally had time to draft an R function (https://github.com/statcompute/MonotonicBinning/blob/master/code/calc_woe.R) that can be used to deploy the binning outcome and to apply the WoE transformation to an attribute in an input data frame.

Below is a complete example showing how to apply the binning function mono_bin() to the attribute "ltv" in the data frame, generate the binning specification, and then deploy the binning logic to calculate the WoE transformation of "ltv". There are two objects returned from the calc_woe() function: the original data frame with a new column named "woe.ltv" and a summary table showing the population stability index (PSI) of the input attribute "ltv".

While all are welcome to use my R code and functions for your own purposes, I would greatly appreciate it if you could reference the work and acknowledge my efforts.

url <- "https://github.com/statcompute/MonotonicBinning/blob/master/data/accepts.rds?raw=true"
download.file(url, "df.rds", mode = "wb")
df <- readRDS("df.rds")
source("https://raw.githubusercontent.com/statcompute/MonotonicBinning/master/code/manual_bin.R")
source("https://raw.githubusercontent.com/statcompute/MonotonicBinning/master/code/mono_bin.R")
ltv_bin <- mono_bin(df, bad, ltv)
ltv_bin$df
# bin rule freq dist mv_cnt bad_freq bad_rate woe iv ks
# 1 01 $X <= 86 1108 0.1898 0 122 0.1101 -0.7337 0.0810 11.0448
# 2 02 $X > 86 & $X <= 95 1081 0.1852 0 166 0.1536 -0.3510 0.0205 16.8807
# 3 03 $X > 95 & $X <= 101 1102 0.1888 0 242 0.2196 0.0880 0.0015 15.1771
# 4 04 $X > 101 & $X <= 106 743 0.1273 0 177 0.2382 0.1935 0.0050 12.5734
# 5 05 $X > 106 & $X <= 115 935 0.1602 0 226 0.2417 0.2126 0.0077 8.9540
# 6 06 $X > 115 | is.na($X) 868 0.1487 1 263 0.3030 0.5229 0.0468 0.0000
source("https://raw.githubusercontent.com/statcompute/MonotonicBinning/master/code/calc_woe.R")
ltv_woe <- calc_woe(df[sample(seq(nrow(df)), 1000), ], ltv, ltv_bin$df)
ltv_woe$psi
# bin rule dist woe cal_freq cal_dist cal_woe psi
# 1 01 $X <= 86 0.1898 -0.7337 188 0.188 -0.7337 0e+00
# 2 02 $X > 86 & $X <= 95 0.1852 -0.3510 179 0.179 -0.3510 2e-04
# 3 03 $X > 95 & $X <= 101 0.1888 0.0880 192 0.192 0.0880 1e-04
# 4 04 $X > 101 & $X <= 106 0.1273 0.1935 129 0.129 0.1935 0e+00
# 5 05 $X > 106 & $X <= 115 0.1602 0.2126 167 0.167 0.2126 3e-04
# 6 06 $X > 115 | is.na($X) 0.1487 0.5229 145 0.145 0.5229 1e-04
head(ltv_woe$df[, c("ltv", "woe.ltv")])
# ltv woe.ltv
# 2378 74 -0.7337
# 1897 60 -0.7337
# 2551 80 -0.7337
# 2996 83 -0.7337
# 1174 85 -0.7337
# 2073 74 -0.7337


A Summary of My Home-Brew Binning Algorithms for Scorecard Development

Thus far, I have published four different monotonic binning algorithms for scorecard development and think that it might be the right time to do a quick summary. R functions for these binning algorithms are also available on https://github.com/statcompute/MonotonicBinning.

The first one was posted back in 2017 (https://statcompute.wordpress.com/2017/01/22/monotonic-binning-with-smbinning-package) based on my SAS macro (https://statcompute.wordpress.com/2012/06/10/a-sas-macro-implementing-monotonic-woe-transformation-in-scorecard-development) that has been widely used by SAS users. This R function, mono_bin(), is designed to generate monotonic bins with roughly equal densities, i.e. a similar number of records in each bin. There are two potential issues with this binning algorithm. Albeit robust, the binning outcome is too coarse and therefore might not be granular enough to capture the nature of the data. In addition, although the algorithm is fully automatic and able to converge globally, it requires iterations that might be computationally expensive for big datasets.

In light of the aforementioned shortcomings, I developed the second one based on the isotonic regression (https://statcompute.wordpress.com/2017/06/15/finer-monotonic-binning-based-on-isotonic-regression and https://statcompute.wordpress.com/2018/11/23/more-robust-monotonic-binning-based-on-isotonic-regression), which successfully addresses both the coarse binning and the iterations.

The third one was developed last year just out of my own curiosity (https://statcompute.wordpress.com/2018/10/14/monotonic-binning-with-equal-sized-bads-for-scorecard-development) for the purpose of generating monotonic bins with roughly equal-sized bads, i.e. Y = 1. From the performance standpoint, this one is not materially different from the first one. It is more of a brainteaser for myself.

The last one (https://statcompute.wordpress.com/2018/11/25/improving-binning-by-bootstrap-bumping) was mainly motivated by the idea of Bootstrap Bumping from Tibshirani and Knight (1997) and implements the bumping on top of the second one based on the isotonic regression. The outcome of this one is satisfactory in two respects. First of all, since the bumping is based on bootstrap samples drawn from the original dataset, the concern about over-fitting due to the sample bias can be somewhat addressed. Secondly, through the bumping search across all bootstrap samples, chances are that a closer-to-optimal solution can be achieved.

R functions for all 4 binning algorithms on GitHub are built on top of a utility function, manual_bin() (https://github.com/statcompute/MonotonicBinning/blob/master/code/manual_bin.R). In the R code, I tried my best to make it as generic as possible without importing additional packages. The only exception is that the parallel package is used in the bump_bin() function to speed up the computation. My next task might be writing a scoring function to make these binning algorithms useful in production.

Statistical Assessments of AUC

In scorecard development, the area under the ROC curve, also known as AUC, has been widely used to measure the performance of a risk scorecard. Everything else being equal, the scorecard with a higher AUC is considered more predictive than the one with a lower AUC. However, little attention has been paid to the statistical analysis of the AUC itself during scorecard development.

While it might be less of a concern to rely on a simple comparison of AUC for model selection in the development stage and to pick the scorecard with the higher AUC, more attention should be paid to the AUC analysis in the post-development stage. For instance, senior management would need to decide whether it is worth retiring a legacy scorecard that might still be performing and launching the full-scale deployment of a new scorecard just for an increase in AUC that might not even be statistically significant. While the claim of certain business benefits can always be used as an argument in favor of the new scorecard, the justification becomes even more compelling with solid statistical evidence. What's more, the model validation analyst might also want to leverage the outcome of the AUC analysis to ensure the statistical soundness of the new scorecard.

In the example below, two logistic regressions were estimated with AUC = 0.6554 and BIC = 6,402 for the model with 6 variables and AUC = 0.6429 and BIC = 6,421 for the model with 3 variables.

df1 <- read.csv("Downloads/credit_count.txt")
df2 <- df1[which(df1$CARDHLDR == 1), ]
y <- "DEFAULT"
x1 <- c("OWNRENT", "INCOME", "INCPER", "LOGSPEND", "AGE", "EXP_INC")
x2 <- c("MAJORDRG", "MINORDRG", "INCOME")

m1 <- glm(eval(paste(y, paste(x1, collapse = " + "), sep = " ~ ")), data = df2, family = binomial)
#              Estimate Std. Error z value Pr(>|z|)
#(Intercept) -1.749e-01  1.659e-01  -1.054 0.291683
#OWNRENT     -2.179e-01  7.686e-02  -2.835 0.004581 **
#INCOME      -2.424e-04  4.364e-05  -5.554 2.79e-08 ***
#INCPER      -1.291e-05  3.318e-06  -3.890 0.000100 ***
#LOGSPEND    -2.165e-01  2.848e-02  -7.601 2.95e-14 ***
#AGE         -8.330e-03  3.774e-03  -2.207 0.027312 *
#EXP_INC      1.340e+00  3.467e-01   3.865 0.000111 ***

BIC(m1)
# 6401.586

roc1 <- pROC::roc(response = df2$DEFAULT, predictor = fitted(m1))
# Area under the curve: 0.6554

m2 <- glm(eval(paste(y, paste(x2, collapse = " + "), sep = " ~ ")), data = df2, family = binomial)
#              Estimate Std. Error z value Pr(>|z|)
#(Intercept) -1.222e+00  9.076e-02 -13.459  < 2e-16 ***
#MAJORDRG     2.031e-01  6.921e-02   2.934  0.00335 **
#MINORDRG     1.920e-01  4.784e-02   4.013 5.99e-05 ***
#INCOME      -4.706e-04  3.919e-05 -12.007  < 2e-16 ***

BIC(m2)
# 6421.232

roc2 <- pROC::roc(response = df2$DEFAULT, predictor = fitted(m2))
# Area under the curve: 0.6429

Both AUC and BIC statistics seemed to favor the first model. However, is a 2% difference in AUC significant enough to infer a better model? Under the Null Hypothesis of no difference in AUC, three statistical tests were employed to assess the difference in AUC / ROC between two models.

set.seed(2019)
# REFERENCE:
# A METHOD OF COMPARING THE AREAS UNDER RECEIVER OPERATING CHARACTERISTIC CURVES DERIVED FROM THE SAME CASES
# HANLEY JA, MCNEIL BJ (1983)
pROC::roc.test(roc1, roc2, method = "bootstrap", boot.n = 500, progress = "none", paired = T)
# D = 1.7164, boot.n = 500, boot.stratified = 1, p-value = 0.0861

# REFERENCE:
# COMPARING THE AREAS UNDER TWO OR MORE CORRELATED RECEIVER OPERATING CHARACTERISTIC CURVES: A NONPARAMETRIC APPROACH
# DELONG ER, DELONG DM, CLARKE-PEARSON DL (1988)
pROC::roc.test(roc1, roc2, method = "delong", paired = T)
# Z = 1.7713, p-value = 0.0765

# REFERENCE
# A DISTRIBUTION-FREE PROCEDURE FOR COMPARING RECEIVER OPERATING CHARACTERISTIC CURVES FROM A PAIRED EXPERIMENT
# VENKATRAMAN ES, BEGG CB (1996)
pROC::roc.test(roc1, roc2, method = "venkatraman", boot.n = 500, progress = "none", paired = T)
# E = 277560, boot.n = 500, p-value = 0.074

Based upon the above output, there is no strong statistical evidence against the Null Hypothesis.

pscl::vuong(m1, m2)
#              Vuong z-statistic             H_A  p-value
#Raw                   2.0963830 model1 > model2 0.018024
#AIC-corrected         1.8311449 model1 > model2 0.033539
#BIC-corrected         0.8684585 model1 > model2 0.192572

In addition, a Vuong test was also performed, supporting no difference between the two models after correcting for the Schwarz penalty.
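
Beyond the paired tests above, it might also be informative to examine the confidence interval of each AUC on its own, for instance with the pROC::ci.auc() function, which returns a DeLong interval by default and a bootstrap interval when requested. The calls below are only a sketch based on the roc1 and roc2 objects created earlier; the resulting intervals are not shown here.

### OPTIONAL: CONFIDENCE INTERVALS OF INDIVIDUAL AUC VALUES ###
pROC::ci.auc(roc1)
pROC::ci.auc(roc2, method = "bootstrap", boot.n = 500, progress = "none")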

A Utility Function For Monotonic Binning

In all the monotonic binning algorithms that I posted before, I relied heavily on the smbinning::smbinning.custom() function contributed by Herman Jopia as the utility function generating the binning output and therefore feel deeply indebted to his excellent work. However, the availability of smbinning::smbinning.custom() shouldn't become my excuse for being lazy. Over the weekend, I drafted a function, manual_bin(), serving a similar purpose.

Although it is not as flexible and elegant as Herman's work, the manual_bin() function does have certain advantages in handling missing values and therefore improves the calculation of WoE and Information Value for missing values.
1. For the missing-value category, if there are both good and bad records, then this category will be considered a standalone bin.
2. For the missing-value category, if there are only good or only bad records but not both, then this category will be merged into the bin with the lowest or highest bad rate. As a result, the WoE and IV for missing values won't be shown as "NaN" anymore.

In addition, the output of the manual_bin() function also includes a set of rules that can be applied to an R data frame to generate WoE transformations, which I will show in the future.

manual_bin <- function(df, yname, xname, cuts) {
cuts <- sort(c(-Inf, cuts, Inf))
df1 <- df[which(df[[yname]] %in% c(0, 1)), c(yname, xname)]
all_cnt <- nrow(df1)
all_bcnt <- sum(df1[[yname]])
### IDENTIFY DIFFERENT CASES WITH MISSING VALUES ###
if (all(!is.na(df1[[xname]])) == TRUE) {
miss_flg <- 0
df2 <- df1
}
else {
miss_flg <- 1
df2 <- df1[!is.na(df1[, xname]), ]
mis <- df1[is.na(df1[, xname]), ]
mis_cnt <- nrow(mis)
mis_bcnt <- sum(mis[[yname]])
if (sum(mis[[yname]]) %in% c(nrow(mis), 0)) {
miss_flg <- 2
}
}
### SLICE DATAFRAME BY CUT POINTS ###
for (i in seq(length(cuts) - 1)) {
bin <- sprintf("%02d", i)
bin_cnt <- nrow(df2[which(df2[[xname]] > cuts[i] & df2[[xname]] <= cuts[i + 1]), ])
bin_bcnt <- nrow(df2[which(df2[[xname]] > cuts[i] & df2[[xname]] <= cuts[i + 1] & df2[[yname]] == 1), ])
if (i == 1) {
bin_summ <- data.frame(bin = bin, xmin = cuts[i], xmax = cuts[i + 1], cnt = bin_cnt, bcnt = bin_bcnt)
}
else {
bin_summ <- rbind(bin_summ,
data.frame(bin = bin, xmin = cuts[i], xmax = cuts[i + 1], cnt = bin_cnt, bcnt = bin_bcnt))
}
}
bin_summ$mis_cnt <- 0
### FIRST CASE FOR MISSING VALUES: BOTH GOODS AND BADS ###
if (miss_flg == 1) {
bin_summ <- rbind(data.frame(bin = sprintf("%02d", 0), xmin = NA, xmax = NA, cnt = mis_cnt, bcnt = mis_bcnt, mis_cnt = mis_cnt),
bin_summ)
}
### SECOND CASE FOR MISSING VALUES: ONLY GOODS OR BADS ###
if (miss_flg == 2) {
rate <- bin_summ$bcnt / bin_summ$cnt
if (mis_bcnt == 0) {
bin_summ[rate == min(rate), "cnt"] <- bin_summ[rate == min(rate), "cnt"] + mis_cnt
bin_summ[rate == min(rate), "mis_cnt"] <- mis_cnt
}
else {
bin_summ[rate == max(rate), "cnt"] <- bin_summ[rate == max(rate), "cnt"] + mis_cnt
bin_summ[rate == max(rate), "bcnt"] <- bin_summ[rate == max(rate), "bcnt"] + mis_bcnt
bin_summ[rate == max(rate), "mis_cnt"] <- mis_cnt
}
}
bin_summ$dist <- bin_summ$cnt / all_cnt
bin_summ$brate <- bin_summ$bcnt / bin_summ$cnt
bin_summ$woe <- log((bin_summ$bcnt / all_bcnt) / ((bin_summ$cnt - bin_summ$bcnt) / (all_cnt - all_bcnt)))
bin_summ$iv <- (bin_summ$bcnt / all_bcnt - (bin_summ$cnt - bin_summ$bcnt) / (all_cnt - all_bcnt)) * bin_summ$woe
bin_summ$ks <- abs(cumsum(bin_summ$bcnt) / all_bcnt - cumsum(bin_summ$cnt - bin_summ$bcnt) / (all_cnt - all_bcnt)) * 100
bin_summ$rule <- NA
for (i in seq(nrow(bin_summ))) {
if (bin_summ[i, ]$bin == '00') {
bin_summ[i, ]$rule <- paste("is.na($X)", sep = '')
}
else if (bin_summ[i, ]$bin == '01') {
if (bin_summ[i, ]$mis_cnt > 0) {
bin_summ[i, ]$rule <- paste("$X <= ", bin_summ[i, ]$xmax, " | is.na($X)", sep = '')
}
else {
bin_summ[i, ]$rule <- paste("$X <= ", bin_summ[i, ]$xmax, sep = '')
}
}
else if (i == nrow(bin_summ)) {
if (bin_summ[i, ]$mis_cnt > 0) {
bin_summ[i, ]$rule <- paste("$X > ", bin_summ[i, ]$xmin, " | is.na($X)", sep = '')
}
else {
bin_summ[i, ]$rule <- paste("$X > ", bin_summ[i, ]$xmin, sep = '')
}
}
else {
bin_summ[i, ]$rule <- paste("$X > ", bin_summ[i, ]$xmin, " & ", "$X <= ", bin_summ[i, ]$xmax, sep = '')
}
}
return(result <- data.frame(Bin = bin_summ$bin, Rule = format(bin_summ$rule, width = 30, justify = "right"),
Frequency = bin_summ$cnt, Percent = round(bin_summ$dist, 2),
MV_Cnt = bin_summ$mis_cnt, Bad_Freq = bin_summ$bcnt, Bad_Rate = round(bin_summ$brate, 2),
WoE = round(bin_summ$woe, 4), InfoValue = round(bin_summ$iv, 4), KS_Stat = round(bin_summ$ks, 2)))
}
# SAMPLE OUTPUT:
# Bin Rule Frequency Percent MV_Cnt Bad_Freq Bad_Rate WoE InfoValue KS_Stat
#1 01 $X <= 82 814 0.14 0 81 0.10 -0.8467 0.0764 9.02
#2 02 $X > 82 & $X <= 91 837 0.14 0 120 0.14 -0.4316 0.0234 14.44
#3 03 $X > 91 & $X <= 97 811 0.14 0 148 0.18 -0.1436 0.0027 16.35
#4 04 $X > 97 & $X <= 101 829 0.14 0 181 0.22 0.0806 0.0009 15.18
#5 05 $X > 101 & $X <= 107 870 0.15 0 206 0.24 0.1855 0.0054 12.26
#6 06 $X > 107 & $X <= 115 808 0.14 0 197 0.24 0.2241 0.0074 8.95
#7 07 $X > 115 | is.na($X) 868 0.15 1 263 0.30 0.5229 0.0468 0.00


Improving Binning by Bootstrap Bumping

In the post (https://statcompute.wordpress.com/2018/11/23/more-robust-monotonic-binning-based-on-isotonic-regression), a more robust version of monotonic binning based on the isotonic regression was introduced. Nonetheless, due to the loss of granularity, the predictability has been somewhat compromised, which is a typical dilemma in data science. On one hand, we don't want to use a learning algorithm that is too greedy and therefore over-fits the data at the cost of simplicity and generality. On the other hand, we'd also like to get the most predictive power out of our data for better business results.

It is worth mentioning that, although there is a consensus that advanced ensemble algorithms are able to significantly improve the prediction outcome, both bagging and boosting would also destroy the simple structure of binning outputs and therefore might not be directly applicable in this simple case.

In light of above considerations, the bumping (Bootstrap Umbrella of Model Parameters) procedure, which was detailed in Model Search And Inference By Bootstrap Bumping by Tibshirani and Knight (1997), should serve our dual purposes. First of all, since the final binning structure would be derived from an isotonic regression based on the bootstrap sample, the concern about over-fitting the original training data can be addressed. Secondly, through the bumping search across all bootstrap samples, chances are that a closer-to-optimal solution can be achieved. It is noted that, since the original sample is always included in the bumping procedure, a binning outcome with bumping that is at least as good as the one without is guaranteed.

The R function bump_bin() is my effort to implement the bumping procedure on top of the monotonic binning function based on the isotonic regression. Because of the mutual independence of binning across all bootstrap samples, the bumping is a perfect use case for parallelism to speed up the execution, as demonstrated in the function.

bump_bin <- function(data, y, x, n) {
n1 <- 50
n2 <- 10
set.seed(2019)
seeds <- c(0, round(runif(n) * as.numeric(paste('1e', ceiling(log10(n)) + 2, sep = '')), 0))
yname <- deparse(substitute(y))
xname <- deparse(substitute(x))
df1 <- data[, c(yname, xname)]
df2 <- df1[!is.na(df1[, xname]), c(xname, yname)]
cor <- cor(df2[, 2], df2[, 1], method = "spearman", use = "complete.obs")
### MONOTONIC BINNING WITH BOOTSTRAP SAMPLES ###
bin <- function(seed) {
if (seed == 0) {
smp <- df2
}
else {
set.seed(seed)
smp <- df2[sample(seq(nrow(df2)), nrow(df2), replace = T), ]
}
reg <- isoreg(smp[, 1], cor / abs(cor) * smp[, 2])
cut <- knots(as.stepfun(reg))
df2$cut <- cut(df2[[xname]], breaks = unique(cut), include.lowest = T)
df3 <- Reduce(rbind,
lapply(split(df2, df2$cut),
function(x) data.frame(n = nrow(x), b = sum(x[[yname]]), g = sum(1 - x[[yname]]),
maxx = max(x[[xname]]), minx = min(x[[xname]]))))
df4 <- df3[which(df3[["n"]] > n1 & df3[["b"]] > n2 & df3[["g"]] > n2), ]
df2$good <- 1 - df2[[yname]]
out <- smbinning::smbinning.custom(df2, "good", xname, cuts = df4$maxx[-nrow(df4)])$ivtable
return(data.frame(iv = out$IV[length(out$IV)], nbin = nrow(out) - 2,
cuts = I(list(df4$maxx[-nrow(df4)])),
abs_cor = abs(cor(as.numeric(row.names(out)[1:(nrow(out) - 2)]),
out$WoE[1:(nrow(out) - 2)], method = "spearman"))))
}
bump_out <- Reduce(rbind, parallel::mclapply(seeds, mc.cores = parallel::detectCores(), bin))
### FIND THE CUT MAXIMIZING THE INFORMATION VALUE ###
cut2 <- bump_out[order(-bump_out["abs_cor"], -bump_out["iv"], bump_out["nbin"]), ]$cuts[[1]]
df1$good <- 1 - df1[[yname]]
return(smbinning::smbinning.custom(df1, "good", xname, cuts = cut2)$ivtable)
}
df <- sas7bdat::read.sas7bdat("Downloads/accepts.sas7bdat")
bump_bin(df, bad, bureau_score, n = 200)


The output below shows the bumping result based on 20 bootstrap samples. There is a small improvement in the information value, e.g. 0.8055 vs 0.8021 without bumping, achieved with a simpler binning structure, e.g. 12 bins vs 20 bins.

   Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds  LnOdds     WoE     IV
1    <= 565     92      41     51        92         41        51 0.0158   0.4457  0.5543  0.8039 -0.2183 -1.5742 0.0532
2    <= 620    470     269    201       562        310       252 0.0805   0.5723  0.4277  1.3383  0.2914 -1.0645 0.1172
3    <= 653    831     531    300      1393        841       552 0.1424   0.6390  0.3610  1.7700  0.5710 -0.7850 0.1071
4    <= 662    295     213     82      1688       1054       634 0.0505   0.7220  0.2780  2.5976  0.9546 -0.4014 0.0091
5    <= 665    100      77     23      1788       1131       657 0.0171   0.7700  0.2300  3.3478  1.2083 -0.1476 0.0004
6    <= 675    366     290     76      2154       1421       733 0.0627   0.7923  0.2077  3.8158  1.3391 -0.0168 0.0000
7    <= 699    805     649    156      2959       2070       889 0.1379   0.8062  0.1938  4.1603  1.4256  0.0696 0.0007
8    <= 707    312     268     44      3271       2338       933 0.0535   0.8590  0.1410  6.0909  1.8068  0.4509 0.0094
9    <= 716    321     278     43      3592       2616       976 0.0550   0.8660  0.1340  6.4651  1.8664  0.5105 0.0122
10   <= 721    181     159     22      3773       2775       998 0.0310   0.8785  0.1215  7.2273  1.9779  0.6219 0.0099
11   <= 755    851     789     62      4624       3564      1060 0.1458   0.9271  0.0729 12.7258  2.5436  1.1877 0.1403
12    > 755    898     867     31      5522       4431      1091 0.1538   0.9655  0.0345 27.9677  3.3311  1.9751 0.3178
13  Missing    315     210    105      5837       4641      1196 0.0540   0.6667  0.3333  2.0000  0.6931 -0.6628 0.0282
14    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804  1.3559  0.0000 0.8055

The output below is based on bumping with 200 bootstrap samples. The information value has been improved by 2%, e.g. 0.8174 vs 0.8021, with a lower risk of over-fitting, e.g. 14 bins vs 20 bins.

   Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds  LnOdds     WoE     IV
1    <= 559     79      34     45        79         34        45 0.0135   0.4304  0.5696  0.7556 -0.2803 -1.6362 0.0496
2    <= 633    735     428    307       814        462       352 0.1259   0.5823  0.4177  1.3941  0.3323 -1.0237 0.1684
3    <= 637     86      53     33       900        515       385 0.0147   0.6163  0.3837  1.6061  0.4738 -0.8822 0.0143
4    <= 653    493     326    167      1393        841       552 0.0845   0.6613  0.3387  1.9521  0.6689 -0.6870 0.0477
5    <= 662    295     213     82      1688       1054       634 0.0505   0.7220  0.2780  2.5976  0.9546 -0.4014 0.0091
6    <= 665    100      77     23      1788       1131       657 0.0171   0.7700  0.2300  3.3478  1.2083 -0.1476 0.0004
7    <= 679    504     397    107      2292       1528       764 0.0863   0.7877  0.2123  3.7103  1.3111 -0.0448 0.0002
8    <= 683    160     129     31      2452       1657       795 0.0274   0.8062  0.1938  4.1613  1.4258  0.0699 0.0001
9    <= 699    507     413     94      2959       2070       889 0.0869   0.8146  0.1854  4.3936  1.4802  0.1242 0.0013
10   <= 716    633     546     87      3592       2616       976 0.1084   0.8626  0.1374  6.2759  1.8367  0.4808 0.0216
11   <= 722    202     178     24      3794       2794      1000 0.0346   0.8812  0.1188  7.4167  2.0037  0.6478 0.0118
12   <= 746    619     573     46      4413       3367      1046 0.1060   0.9257  0.0743 12.4565  2.5222  1.1663 0.0991
13   <= 761    344     322     22      4757       3689      1068 0.0589   0.9360  0.0640 14.6364  2.6835  1.3276 0.0677
14    > 761    765     742     23      5522       4431      1091 0.1311   0.9699  0.0301 32.2609  3.4739  2.1179 0.2979
15  Missing    315     210    105      5837       4641      1196 0.0540   0.6667  0.3333  2.0000  0.6931 -0.6628 0.0282
16    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804  1.3559  0.0000 0.8174

More Robust Monotonic Binning Based on Isotonic Regression

Since publishing the monotonic binning function based upon the isotonic regression (https://statcompute.wordpress.com/2017/06/15/finer-monotonic-binning-based-on-isotonic-regression), I've received some feedback from peers. A potential concern is that, albeit improving the granularity and predictability, the binning might be too fine and might not generalize well to new data.

In light of the concern, I revised the function by imposing two thresholds: a minimum sample size and a minimum number of bads for each bin. Both thresholds can be adjusted based on the specific use case. For instance, I set the minimum sample size to 50 and the minimum number of bads (and goods) to 10 in the example below.

isoreg_bin <- function(data, y, x) {
n1 <- 50
n2 <- 10
yname <- deparse(substitute(y))
xname <- deparse(substitute(x))
df1 <- data[, c(yname, xname)]
df2 <- df1[!is.na(df1[, xname]), c(xname, yname)]
cor <- cor(df2[, 2], df2[, 1], method = "spearman", use = "complete.obs")
reg <- isoreg(df2[, 1], cor / abs(cor) * df2[, 2])
cut <- knots(as.stepfun(reg))
df2$cut <- cut(df2[[xname]], breaks = unique(cut), include.lowest = T)
df3 <- Reduce(rbind,
lapply(split(df2, df2$cut),
function(x) data.frame(n = nrow(x),
b = sum(x[[yname]]),
g = sum(1 - x[[yname]]),
maxx = max(x[[xname]]),
minx = min(x[[xname]]))))
df4 <- df3[which(df3[["n"]] > n1 & df3[["b"]] > n2 & df3[["g"]] > n2), ]
df1$good <- 1 - df1[[yname]]
return(smbinning::smbinning.custom(df1, "good", xname, cuts = df4$maxx[-nrow(df4)])$ivtable)
}
df <- sas7bdat::read.sas7bdat("Downloads/accepts.sas7bdat")
isoreg_bin(df, bad, bureau_score)


As shown in the output below, the number of generated bins and the information value happened to be between the result in (https://statcompute.wordpress.com/2017/06/15/finer-monotonic-binning-based-on-isotonic-regression) and the result in (https://statcompute.wordpress.com/2017/01/22/monotonic-binning-with-smbinning-package). More importantly, given a larger sample size for each bin, the binning algorithm is more robust and generalizable.

   Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds  LnOdds     WoE     IV
1    <= 559     79      34     45        79         34        45 0.0135   0.4304  0.5696  0.7556 -0.2803 -1.6362 0.0496
2    <= 602    189     102     87       268        136       132 0.0324   0.5397  0.4603  1.1724  0.1591 -1.1969 0.0608
3    <= 605     56      31     25       324        167       157 0.0096   0.5536  0.4464  1.2400  0.2151 -1.1408 0.0162
4    <= 632    468     279    189       792        446       346 0.0802   0.5962  0.4038  1.4762  0.3895 -0.9665 0.0946
5    <= 639    150      95     55       942        541       401 0.0257   0.6333  0.3667  1.7273  0.5465 -0.8094 0.0207
6    <= 653    451     300    151      1393        841       552 0.0773   0.6652  0.3348  1.9868  0.6865 -0.6694 0.0412
7    <= 662    295     213     82      1688       1054       634 0.0505   0.7220  0.2780  2.5976  0.9546 -0.4014 0.0091
8    <= 665    100      77     23      1788       1131       657 0.0171   0.7700  0.2300  3.3478  1.2083 -0.1476 0.0004
9    <= 667     57      44     13      1845       1175       670 0.0098   0.7719  0.2281  3.3846  1.2192 -0.1367 0.0002
10   <= 677    381     300     81      2226       1475       751 0.0653   0.7874  0.2126  3.7037  1.3093 -0.0466 0.0001
11   <= 679     66      53     13      2292       1528       764 0.0113   0.8030  0.1970  4.0769  1.4053  0.0494 0.0000
12   <= 683    160     129     31      2452       1657       795 0.0274   0.8062  0.1938  4.1613  1.4258  0.0699 0.0001
13   <= 689    203     164     39      2655       1821       834 0.0348   0.8079  0.1921  4.2051  1.4363  0.0804 0.0002
14   <= 699    304     249     55      2959       2070       889 0.0521   0.8191  0.1809  4.5273  1.5101  0.1542 0.0012
15   <= 707    312     268     44      3271       2338       933 0.0535   0.8590  0.1410  6.0909  1.8068  0.4509 0.0094
16   <= 717    368     318     50      3639       2656       983 0.0630   0.8641  0.1359  6.3600  1.8500  0.4941 0.0132
17   <= 721    134     119     15      3773       2775       998 0.0230   0.8881  0.1119  7.9333  2.0711  0.7151 0.0094
18   <= 739    474     438     36      4247       3213      1034 0.0812   0.9241  0.0759 12.1667  2.4987  1.1428 0.0735
19   <= 746    166     154     12      4413       3367      1046 0.0284   0.9277  0.0723 12.8333  2.5520  1.1961 0.0277
20    > 746   1109    1064     45      5522       4431      1091 0.1900   0.9594  0.0406 23.6444  3.1631  1.8072 0.3463
21  Missing    315     210    105      5837       4641      1196 0.0540   0.6667  0.3333  2.0000  0.6931 -0.6628 0.0282
22    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804  1.3559  0.0000 0.8021

Monotonic Binning with Equal-Sized Bads for Scorecard Development

In previous posts (https://statcompute.wordpress.com/2017/01/22/monotonic-binning-with-smbinning-package) and (https://statcompute.wordpress.com/2017/06/15/finer-monotonic-binning-based-on-isotonic-regression), I’ve developed 2 different algorithms for monotonic binning. While the first tends to generate bins with equal densities, the second would define finer bins based on the isotonic regression.

In the code snippet below, a third approach is illustrated for the purpose of generating bins with roughly equal-sized bads. Once again, for the reporting layer, I leveraged the flexible smbinning::smbinning.custom() function with a small tweak.


df <- sas7bdat::read.sas7bdat("Downloads/accepts.sas7bdat")

monobin <- function(df, x, y) {
  yname <- deparse(substitute(y))
  xname <- deparse(substitute(x))
  d1 <- df[c(yname, xname)]
  # keep bads only so that quantile-based cuts yield bins with roughly equal-sized bads
  d2 <- d1[which(d1[[yname]] == 1), ]
  # starting bin count: the reciprocal of the largest share of bads at a single value of x
  nbin <- round(1 / max(table(d2[[xname]]) / sum(table(d2[[xname]]))))
  repeat {
    # derive cut points from the quantiles of x among bads only
    cuts <- Hmisc::cut2(d2[[xname]], g = nbin, onlycuts = T)
    d1$cut <- cut(d1[[xname]], breaks = cuts, include.lowest = T)
    d3 <- Reduce(rbind, Map(function(x) data.frame(xmean = mean(x[[xname]], na.rm = T), ymean = mean(x[[yname]])), split(d1, d1$cut)))
    # reduce the bin count until the bad rate is strictly monotonic across bins or only 2 bins remain
    if(abs(cor(d3$xmean, d3$ymean, method = "spearman")) == 1 | nrow(d3) == 2) {
      break
    }
    nbin <- nbin - 1
  }
  # feed the good indicator (1 - bad) into the smbinning reporting layer
  df$good <- 1 - d1[[yname]]
  return(smbinning::smbinning.custom(df, "good", xname, cuts = cuts[c(-1, -length(cuts))]))
}

As shown in the output, the number of bads in each bin, with the exception of the missing bin, is similar and varies within a small range. However, the number of records per bin tends to increase from the low to the high score bands in order to ensure the monotonicity of bad rates across all bins.

monobin(df, bureau_score, bad)
#   Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds LnOdds     WoE     IV
#1    <= 602    268     136    132       268        136       132 0.0459   0.5075  0.4925  1.0303 0.0299 -1.3261 0.1075
#2    <= 621    311     185    126       579        321       258 0.0533   0.5949  0.4051  1.4683 0.3841 -0.9719 0.0636
#3    <= 636    302     186    116       881        507       374 0.0517   0.6159  0.3841  1.6034 0.4722 -0.8838 0.0503
#4    <= 649    392     259    133      1273        766       507 0.0672   0.6607  0.3393  1.9474 0.6665 -0.6895 0.0382
#5    <= 661    387     268    119      1660       1034       626 0.0663   0.6925  0.3075  2.2521 0.8119 -0.5441 0.0227
#6    <= 676    529     415    114      2189       1449       740 0.0906   0.7845  0.2155  3.6404 1.2921 -0.0639 0.0004
#7    <= 693    606     491    115      2795       1940       855 0.1038   0.8102  0.1898  4.2696 1.4515  0.0956 0.0009
#9    > 717   1883    1775    108      5522       4431      1091 0.3226   0.9426  0.0574 16.4352 2.7994  1.4435 0.4217
#10  Missing    315     210    105      5837       4641      1196 0.0540   0.6667  0.3333  2.0000 0.6931 -0.6628 0.0282
#11    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804 1.3559  0.0000 0.7508

Granular Monotonic Binning in SAS

In the post (https://statcompute.wordpress.com/2017/06/15/finer-monotonic-binning-based-on-isotonic-regression), it is shown how to do a finer monotonic binning with isotonic regression in R.

Below is a SAS macro implementing the monotonic binning with the same idea of isotonic regression. Because it does not rely on iterative binning, this macro is more efficient than the one shown in (https://statcompute.wordpress.com/2012/06/10/a-sas-macro-implementing-monotonic-woe-transformation-in-scorecard-development) and is also able to significantly increase the binning granularity.

%macro monobin(data = , y = , x = );
options mprint mlogic;

data _data_ (keep = _x _y);
  set &data;
  where &y in (0, 1) and &x ~= .;
  _y = &y;
  _x = &x;
run;

proc transreg data = _last_ noprint;
  model identity(_y) = monotone(_x);
  output out = _tmp1 tip = _t;
run;

proc summary data = _last_ nway;
  class _t_x;
  output out = _data_ (drop = _freq_ _type_) mean(_y) = _rate;
run;

proc sort data = _last_;
  by _rate;
run;

data _tmp2;
  set _last_;
  by _rate;
  _idx = _n_;
  if _rate = 0 then _idx = _idx + 1;
  if _rate = 1 then _idx = _idx - 1;
run;
  
proc sql noprint;
create table
  _tmp3 as
select
  a.*,
  b._idx
from
  _tmp1 as a inner join _tmp2 as b
on
  a._t_x = b._t_x;
  
create table
  _tmp4 as
select
  a._idx,
  min(a._x)                                               as _min_x,
  max(a._x)                                               as _max_x,
  sum(a._y)                                               as _bads,
  count(a._y)                                             as _freq,
  mean(a._y)                                              as _rate,
  sum(a._y) / b.bads                                      as _bpct,
  sum(1 - a._y) / (b.freq - b.bads)                       as _gpct,
  log(calculated _bpct / calculated _gpct)                as _woe,
  (calculated _bpct - calculated _gpct) * calculated _woe as _iv
from 
  _tmp3 as a, (select count(*) as freq, sum(_y) as bads from _tmp3) as b
group by
  a._idx;
quit;

title "Monotonic WoE Binning for %upcase(%trim(&x))";
proc print data = _last_ label noobs;
  var _min_x _max_x _bads _freq _rate _woe _iv;
  label
    _min_x = "Lower"
    _max_x = "Upper"
    _bads  = "#Bads"
    _freq  = "#Freq"
    _rate  = "BadRate"
    _woe   = "WoE"
    _iv    = "IV";
  sum _bads _freq _iv;
run;
title;

%mend monobin;

Below is the sample output for LTV, showing an identical binning scheme to the one generated by the R isobin() function.

[Screenshot: sample %monobin output for LTV]

Finer Monotonic Binning Based on Isotonic Regression

In my early post (https://statcompute.wordpress.com/2017/01/22/monotonic-binning-with-smbinning-package/), I wrote a monobin() function based on the smbinning package by Herman Jopia to improve the monotonic binning algorithm. The function works well and provides robust binning outcomes. However, there are a couple of potential drawbacks due to the coarse binning. First of all, the derived Information Value for each binned variable might tend to be low. Secondly, the binned variable might not be granular enough to reflect the nature of the data.

In light of the aforementioned, I drafted an improved function isobin() based on the isotonic regression (https://en.wikipedia.org/wiki/Isotonic_regression), as shown below.

library(smbinning)

isobin <- function(data, y, x) {
  d1 <- data[c(y, x)]
  d2 <- d1[!is.na(d1[x]), ]
  # direction of the relationship between x and y
  c <- cor(d2[, 2], d2[, 1], method = "spearman", use = "complete.obs")
  # isotonic regression of the sign-adjusted y on x
  reg <- isoreg(d2[, 2], c / abs(c) * d2[, 1])
  # knots of the fitted step function, i.e. x values where the fitted rate changes
  k <- knots(as.stepfun(reg))
  sm1 <- smbinning.custom(d1, y, x, k)
  # drop degenerate bins with either zero goods or zero bads and recollect the cut points
  c1 <- subset(sm1$ivtable, subset = CntGood * CntBad > 0, select = Cutpoint)
  c2 <- suppressWarnings(as.numeric(unlist(strsplit(c1$Cutpoint, " "))))
  c3 <- c2[!is.na(c2)]
  return(smbinning.custom(d1, y, x, c3[-length(c3)]))
}
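
To make the mechanism more tangible, below is a tiny standalone sketch, with simulated data rather than the accepts sample, showing how the knots of an isotonic fit translate into candidate cut points:

set.seed(1)
x <- sort(runif(1000, 400, 800))              # a score-like predictor
y <- rbinom(1000, 1, plogis(8 - 0.015 * x))   # bad rate decreasing in x
reg <- isoreg(x, -1 * y)                      # flip y so the isotonic fit is non-decreasing
k <- knots(as.stepfun(reg))                   # x values where the fitted step function jumps
head(k)                                       # these knots become candidate cut points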

Compared with the legacy monobin(), the isobin() function is able to significantly increase the binning granularity as well as moderately improve the Information Value.

LTV Binning with isobin() Function

   Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds LnOdds     WoE     IV
1     <= 46     81      78      3        81         78         3 0.0139   0.9630  0.0370 26.0000 3.2581  1.9021 0.0272
2     <= 71    312     284     28       393        362        31 0.0535   0.9103  0.0897 10.1429 2.3168  0.9608 0.0363
3     <= 72     22      20      2       415        382        33 0.0038   0.9091  0.0909 10.0000 2.3026  0.9466 0.0025
4     <= 73     27      24      3       442        406        36 0.0046   0.8889  0.1111  8.0000 2.0794  0.7235 0.0019
5     <= 81    303     268     35       745        674        71 0.0519   0.8845  0.1155  7.6571 2.0356  0.6797 0.0194
6     <= 83    139     122     17       884        796        88 0.0238   0.8777  0.1223  7.1765 1.9708  0.6149 0.0074
7     <= 90    631     546     85      1515       1342       173 0.1081   0.8653  0.1347  6.4235 1.8600  0.5040 0.0235
8     <= 94    529     440     89      2044       1782       262 0.0906   0.8318  0.1682  4.9438 1.5981  0.2422 0.0049
9     <= 95    145     119     26      2189       1901       288 0.0248   0.8207  0.1793  4.5769 1.5210  0.1651 0.0006
10   <= 100    907     709    198      3096       2610       486 0.1554   0.7817  0.2183  3.5808 1.2756 -0.0804 0.0010
11   <= 101    195     151     44      3291       2761       530 0.0334   0.7744  0.2256  3.4318 1.2331 -0.1229 0.0005
12   <= 110   1217     934    283      4508       3695       813 0.2085   0.7675  0.2325  3.3004 1.1940 -0.1619 0.0057
13   <= 112    208     158     50      4716       3853       863 0.0356   0.7596  0.2404  3.1600 1.1506 -0.2054 0.0016
14   <= 115    253     183     70      4969       4036       933 0.0433   0.7233  0.2767  2.6143 0.9610 -0.3950 0.0075
15   <= 136    774     548    226      5743       4584      1159 0.1326   0.7080  0.2920  2.4248 0.8857 -0.4702 0.0333
16   <= 138     27      18      9      5770       4602      1168 0.0046   0.6667  0.3333  2.0000 0.6931 -0.6628 0.0024
17    > 138     66      39     27      5836       4641      1195 0.0113   0.5909  0.4091  1.4444 0.3677 -0.9882 0.0140
18  Missing      1       0      1      5837       4641      1196 0.0002   0.0000  1.0000  0.0000   -Inf    -Inf    Inf
19    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804 1.3559  0.0000 0.1897

LTV Binning with monobin() Function

  Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate   Odds LnOdds     WoE     IV
1    <= 85   1025     916    109      1025        916       109 0.1756   0.8937  0.1063 8.4037 2.1287  0.7727 0.0821
2    <= 94   1019     866    153      2044       1782       262 0.1746   0.8499  0.1501 5.6601 1.7334  0.3775 0.0221
3   <= 100   1052     828    224      3096       2610       486 0.1802   0.7871  0.2129 3.6964 1.3074 -0.0486 0.0004
4   <= 105    808     618    190      3904       3228       676 0.1384   0.7649  0.2351 3.2526 1.1795 -0.1765 0.0045
5   <= 114    985     748    237      4889       3976       913 0.1688   0.7594  0.2406 3.1561 1.1493 -0.2066 0.0076
6    > 114    947     665    282      5836       4641      1195 0.1622   0.7022  0.2978 2.3582 0.8579 -0.4981 0.0461
7  Missing      1       0      1      5837       4641      1196 0.0002   0.0000  1.0000 0.0000   -Inf    -Inf    Inf
8    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049 3.8804 1.3559  0.0000 0.1628

Bureau_Score Binning with isobin() Function

   Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds  LnOdds     WoE     IV
1    <= 491      4       1      3         4          1         3 0.0007   0.2500  0.7500  0.3333 -1.0986 -2.4546 0.0056
2    <= 532     24       9     15        28         10        18 0.0041   0.3750  0.6250  0.6000 -0.5108 -1.8668 0.0198
3    <= 559     51      24     27        79         34        45 0.0087   0.4706  0.5294  0.8889 -0.1178 -1.4737 0.0256
4    <= 560      2       1      1        81         35        46 0.0003   0.5000  0.5000  1.0000  0.0000 -1.3559 0.0008
5    <= 572     34      17     17       115         52        63 0.0058   0.5000  0.5000  1.0000  0.0000 -1.3559 0.0143
6    <= 602    153      84     69       268        136       132 0.0262   0.5490  0.4510  1.2174  0.1967 -1.1592 0.0459
7    <= 605     56      31     25       324        167       157 0.0096   0.5536  0.4464  1.2400  0.2151 -1.1408 0.0162
8    <= 606     14       8      6       338        175       163 0.0024   0.5714  0.4286  1.3333  0.2877 -1.0683 0.0035
9    <= 607     17      10      7       355        185       170 0.0029   0.5882  0.4118  1.4286  0.3567 -0.9993 0.0037
10   <= 632    437     261    176       792        446       346 0.0749   0.5973  0.4027  1.4830  0.3940 -0.9619 0.0875
11   <= 639    150      95     55       942        541       401 0.0257   0.6333  0.3667  1.7273  0.5465 -0.8094 0.0207
12   <= 653    451     300    151      1393        841       552 0.0773   0.6652  0.3348  1.9868  0.6865 -0.6694 0.0412
13   <= 662    295     213     82      1688       1054       634 0.0505   0.7220  0.2780  2.5976  0.9546 -0.4014 0.0091
14   <= 665    100      77     23      1788       1131       657 0.0171   0.7700  0.2300  3.3478  1.2083 -0.1476 0.0004
15   <= 667     57      44     13      1845       1175       670 0.0098   0.7719  0.2281  3.3846  1.2192 -0.1367 0.0002
16   <= 677    381     300     81      2226       1475       751 0.0653   0.7874  0.2126  3.7037  1.3093 -0.0466 0.0001
17   <= 679     66      53     13      2292       1528       764 0.0113   0.8030  0.1970  4.0769  1.4053  0.0494 0.0000
18   <= 683    160     129     31      2452       1657       795 0.0274   0.8062  0.1938  4.1613  1.4258  0.0699 0.0001
19   <= 689    203     164     39      2655       1821       834 0.0348   0.8079  0.1921  4.2051  1.4363  0.0804 0.0002
20   <= 699    304     249     55      2959       2070       889 0.0521   0.8191  0.1809  4.5273  1.5101  0.1542 0.0012
21   <= 707    312     268     44      3271       2338       933 0.0535   0.8590  0.1410  6.0909  1.8068  0.4509 0.0094
22   <= 717    368     318     50      3639       2656       983 0.0630   0.8641  0.1359  6.3600  1.8500  0.4941 0.0132
23   <= 721    134     119     15      3773       2775       998 0.0230   0.8881  0.1119  7.9333  2.0711  0.7151 0.0094
24   <= 723     49      44      5      3822       2819      1003 0.0084   0.8980  0.1020  8.8000  2.1748  0.8188 0.0043
25   <= 739    425     394     31      4247       3213      1034 0.0728   0.9271  0.0729 12.7097  2.5424  1.1864 0.0700
26   <= 746    166     154     12      4413       3367      1046 0.0284   0.9277  0.0723 12.8333  2.5520  1.1961 0.0277
27   <= 756    234     218     16      4647       3585      1062 0.0401   0.9316  0.0684 13.6250  2.6119  1.2560 0.0422
28   <= 761    110     104      6      4757       3689      1068 0.0188   0.9455  0.0545 17.3333  2.8526  1.4967 0.0260
29   <= 763     46      44      2      4803       3733      1070 0.0079   0.9565  0.0435 22.0000  3.0910  1.7351 0.0135
30   <= 767     96      92      4      4899       3825      1074 0.0164   0.9583  0.0417 23.0000  3.1355  1.7795 0.0293
31   <= 772     77      74      3      4976       3899      1077 0.0132   0.9610  0.0390 24.6667  3.2055  1.8495 0.0249
32   <= 787    269     260      9      5245       4159      1086 0.0461   0.9665  0.0335 28.8889  3.3635  2.0075 0.0974
33   <= 794     95      93      2      5340       4252      1088 0.0163   0.9789  0.0211 46.5000  3.8395  2.4835 0.0456
34    > 794    182     179      3      5522       4431      1091 0.0312   0.9835  0.0165 59.6667  4.0888  2.7328 0.0985
35  Missing    315     210    105      5837       4641      1196 0.0540   0.6667  0.3333  2.0000  0.6931 -0.6628 0.0282
36    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804  1.3559  0.0000 0.8357

Bureau_Score Binning with monobin() Function

   Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds LnOdds     WoE     IV
1    <= 617    513     284    229       513        284       229 0.0879   0.5536  0.4464  1.2402 0.2153 -1.1407 0.1486
2    <= 642    515     317    198      1028        601       427 0.0882   0.6155  0.3845  1.6010 0.4706 -0.8853 0.0861
3    <= 657    512     349    163      1540        950       590 0.0877   0.6816  0.3184  2.1411 0.7613 -0.5946 0.0363
4    <= 672    487     371    116      2027       1321       706 0.0834   0.7618  0.2382  3.1983 1.1626 -0.1933 0.0033
5    <= 685    494     396     98      2521       1717       804 0.0846   0.8016  0.1984  4.0408 1.3964  0.0405 0.0001
6    <= 701    521     428     93      3042       2145       897 0.0893   0.8215  0.1785  4.6022 1.5265  0.1706 0.0025
7    <= 714    487     418     69      3529       2563       966 0.0834   0.8583  0.1417  6.0580 1.8014  0.4454 0.0144
8    <= 730    489     441     48      4018       3004      1014 0.0838   0.9018  0.0982  9.1875 2.2178  0.8619 0.0473
9    <= 751    513     476     37      4531       3480      1051 0.0879   0.9279  0.0721 12.8649 2.5545  1.1986 0.0859
10   <= 775    492     465     27      5023       3945      1078 0.0843   0.9451  0.0549 17.2222 2.8462  1.4903 0.1157
11    > 775    499     486     13      5522       4431      1091 0.0855   0.9739  0.0261 37.3846 3.6213  2.2653 0.2126
12  Missing    315     210    105      5837       4641      1196 0.0540   0.6667  0.3333  2.0000 0.6931 -0.6628 0.0282
13    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804 1.3559  0.0000 0.7810

Monotonic Binning with Smbinning Package

The R package smbinning (https://cran.r-project.org/web/packages/smbinning/index.html) provides a very user-friendly interface for the WoE (Weight of Evidence) binning algorithm employed in the scorecard development. However, there are several improvement opportunities in my view:

1. First of all, the underlying algorithm in the smbinning() function utilizes recursive partitioning, which does not necessarily guarantee monotonicity.
2. Secondly, the density across the generated bins is not even. The frequency in some bins could be much higher than in others.
3. Lastly, the function might not provide a binning outcome for some variables due to the lack of statistical significance.

In light of the above, I wrote an enhanced version by utilizing the smbinning.custom() function, as shown below. The idea is very simple. Within the repeat loop, we bin the variable iteratively until a certain criterion is met and then feed the list of cut points into the smbinning.custom() function. As a result, we are able to achieve a set of monotonic bins with similar frequencies regardless of the so-called “statistical significance”, which, in my mind, is a premature consideration at the variable transformation stage.

monobin <- function(data, y, x) {
  d1 <- data[c(y, x)]
  # start with at most 20 bins, bounded by the number of distinct values of x
  n <- min(20, nrow(unique(d1[x])))
  repeat {
    d1$bin <- Hmisc::cut2(d1[, x], g = n)
    # average x and y within each bin
    d2 <- aggregate(d1[-3], d1[3], mean)
    c <- cor(d2[-1], method = "spearman")
    # stop once the bin-level bad rate is strictly monotonic in x or only 2 bins remain
    if(abs(c[1, 2]) == 1 | n == 2) break
    n <- n - 1
  }
  # upper boundary of each bin, dropping the last one, as the cut points
  d3 <- aggregate(d1[-3], d1[3], max)
  cuts <- d3[-length(d3[, 3]), 3]
  return(smbinning::smbinning.custom(d1, y, x, cuts))
}

Below are a couple of comparisons between the generic smbinning() and the home-brew monobin() functions with the use of a toy dataset.

In the first example, we applied the smbinning() function to a variable named "rev_util". As shown in the "BadRate" column, where the bad rate first drops from 25.80% to 4.98% and then climbs back up, the binning outcome is not monotonic.

  Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds LnOdds     WoE     IV
1     <= 0    965     716    249       965        716       249 0.1653   0.7420  0.2580  2.8755 1.0562 -0.2997 0.0162
2     <= 5    522     496     26      1487       1212       275 0.0894   0.9502  0.0498 19.0769 2.9485  1.5925 0.1356
3    <= 24   1166    1027    139      2653       2239       414 0.1998   0.8808  0.1192  7.3885 1.9999  0.6440 0.0677
4    <= 40    779     651    128      3432       2890       542 0.1335   0.8357  0.1643  5.0859 1.6265  0.2705 0.0090
5    <= 73   1188     932    256      4620       3822       798 0.2035   0.7845  0.2155  3.6406 1.2922 -0.0638 0.0008
7     > 96    533     337    196      5837       4641      1196 0.0913   0.6323  0.3677  1.7194 0.5420 -0.8140 0.0743
8  Missing      0       0      0      5837       4641      1196 0.0000      NaN     NaN     NaN    NaN     NaN    NaN
9    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804 1.3559  0.0000 0.3352

Next, we did the same with the monobin() function. As shown below, the algorithm provided a monotonic binning at the cost of granularity. Albeit coarse, the result is directionally correct with no inversion.

  Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate   Odds LnOdds     WoE     IV
2     > 30   2875    2146    729      5837       4641      1196 0.4925   0.7464  0.2536 2.9438 1.0797 -0.2763 0.0407
3  Missing      0       0      0      5837       4641      1196 0.0000      NaN     NaN    NaN    NaN     NaN    NaN
4    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049 3.8804 1.3559  0.0000 0.0878

In the second example, we applied the smbinning() function to a variable named “bureau_score”. As shown in the "CntRec" column, the frequencies in a couple of bins, e.g. 896 and 1,271 records, are much higher than in the rest.

  Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds LnOdds     WoE     IV
1   <= 605    324     167    157       324        167       157 0.0555   0.5154  0.4846  1.0637 0.0617 -1.2942 0.1233
2   <= 632    468     279    189       792        446       346 0.0802   0.5962  0.4038  1.4762 0.3895 -0.9665 0.0946
3   <= 662    896     608    288      1688       1054       634 0.1535   0.6786  0.3214  2.1111 0.7472 -0.6087 0.0668
4   <= 699   1271    1016    255      2959       2070       889 0.2177   0.7994  0.2006  3.9843 1.3824  0.0264 0.0002
5   <= 717    680     586     94      3639       2656       983 0.1165   0.8618  0.1382  6.2340 1.8300  0.4741 0.0226
7    > 761    765     742     23      5522       4431      1091 0.1311   0.9699  0.0301 32.2609 3.4739  2.1179 0.2979
8  Missing    315     210    105      5837       4641      1196 0.0540   0.6667  0.3333  2.0000 0.6931 -0.6628 0.0282
9    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804 1.3559  0.0000 0.8066

With the monobin() function applied to the same variable, we were able to get a set of more granular bins with similar frequencies.

   Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds LnOdds     WoE     IV
1    <= 617    513     284    229       513        284       229 0.0879   0.5536  0.4464  1.2402 0.2153 -1.1407 0.1486
2    <= 642    515     317    198      1028        601       427 0.0882   0.6155  0.3845  1.6010 0.4706 -0.8853 0.0861
3    <= 657    512     349    163      1540        950       590 0.0877   0.6816  0.3184  2.1411 0.7613 -0.5946 0.0363
4    <= 672    487     371    116      2027       1321       706 0.0834   0.7618  0.2382  3.1983 1.1626 -0.1933 0.0033
5    <= 685    494     396     98      2521       1717       804 0.0846   0.8016  0.1984  4.0408 1.3964  0.0405 0.0001
6    <= 701    521     428     93      3042       2145       897 0.0893   0.8215  0.1785  4.6022 1.5265  0.1706 0.0025
7    <= 714    487     418     69      3529       2563       966 0.0834   0.8583  0.1417  6.0580 1.8014  0.4454 0.0144
8    <= 730    489     441     48      4018       3004      1014 0.0838   0.9018  0.0982  9.1875 2.2178  0.8619 0.0473
9    <= 751    513     476     37      4531       3480      1051 0.0879   0.9279  0.0721 12.8649 2.5545  1.1986 0.0859
10   <= 775    492     465     27      5023       3945      1078 0.0843   0.9451  0.0549 17.2222 2.8462  1.4903 0.1157
11    > 775    499     486     13      5522       4431      1091 0.0855   0.9739  0.0261 37.3846 3.6213  2.2653 0.2126
12  Missing    315     210    105      5837       4641      1196 0.0540   0.6667  0.3333  2.0000 0.6931 -0.6628 0.0282
13    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804 1.3559  0.0000 0.7810

Scorecard Development with Data from Multiple Sources

This week, one of my friends asked me a very interesting and practical question in the scorecard development. The model development data were collected from multiple independent sources with various data sizes, heterogeneous risk profiles and different bad rates. While the performance statistics seem satisfactory on the model training dataset, the model doesn’t generalize well with new accounts that might come from an unknown source. The situation is very common in a consulting company, where a risk or marketing model is sometimes developed with data from multiple organizations.

To better understand the issue, I simulated a dataset consisting of two groups. In the dataset, while x0 and x1 govern the group segmentation, x2 and x3 define the bad definition. It is important to point out that the group information “grp” is only known in the model development sample but is unknown in the production population. Therefore, the variable “grp”, albeit predictive, can not be explicitly used in the model estimation.

data one;
  do i = 1 to 100000;
    x0 = ranuni(0);
    x1 = ranuni(1);
    x2 = ranuni(2);
    x3 = ranuni(3);
    if 1 + x0 * 2 + x1 * 4 + rannor(1) > 5 then do;
      grp = 1;
      if x2 * 2 + x3 * 4 + rannor(2) > 5 then bad = 1;
    	else bad = 0;
    end;
    else do;
      grp = 0;
      if x2 * 4 + x3 * 2 + rannor(3) > 4 then bad = 1;
    	else bad = 0;
    end;
    output;
  end;
run;

Our first approach is to use all variables x0 – x3 to build a logistic regression and then evaluate the model both on the overall sample and by group.

proc logistic data = one desc noprint;
  model bad = x0 x1 x2 x3;
  score data = one out = mdl1 (rename = (p_1 = score1));
run;

                            GOOD BAD SEPARATION REPORT FOR SCORE1 IN DATA MDL1
                                MAXIMUM KS = 59.5763 AT SCORE POINT 0.2281
               ( AUC STATISTICS = 0.8800, GINI COEFFICIENT = 0.7599, DIVERGENCE = 2.6802 )

          MIN        MAX           GOOD        BAD      TOTAL    BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #    RATE      BAD RATE  PERCENT      PERCENT
 --------------------------------------------------------------------------------------------------------
  BAD     0.6800     0.9699       2,057      7,943     10,000   79.43%      79.43%    33.81%      33.81%
   |      0.4679     0.6799       4,444      5,556     10,000   55.56%      67.50%    23.65%      57.46%
   |      0.3094     0.4679       6,133      3,867     10,000   38.67%      57.89%    16.46%      73.92%
   |      0.1947     0.3094       7,319      2,681     10,000   26.81%      50.12%    11.41%      85.33%
   |      0.1181     0.1946       8,364      1,636     10,000   16.36%      43.37%     6.96%      92.29%
   |      0.0690     0.1181       9,044        956     10,000    9.56%      37.73%     4.07%      96.36%
   |      0.0389     0.0690       9,477        523     10,000    5.23%      33.09%     2.23%      98.59%
   |      0.0201     0.0389       9,752        248     10,000    2.48%      29.26%     1.06%      99.64%
   V      0.0085     0.0201       9,925         75     10,000    0.75%      26.09%     0.32%      99.96%
 GOOD     0.0005     0.0085       9,991          9     10,000    0.09%      23.49%     0.04%     100.00%
       ========== ========== ========== ========== ==========
          0.0005     0.9699      76,506     23,494    100,000

                  GOOD BAD SEPARATION REPORT FOR SCORE1 IN DATA MDL1(WHERE = (GRP = 0))
                                MAXIMUM KS = 61.0327 AT SCORE POINT 0.2457
               ( AUC STATISTICS = 0.8872, GINI COEFFICIENT = 0.7744, DIVERGENCE = 2.8605 )

          MIN        MAX           GOOD        BAD      TOTAL    BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #    RATE      BAD RATE  PERCENT      PERCENT
 --------------------------------------------------------------------------------------------------------
  BAD     0.7086     0.9699       1,051      6,162      7,213   85.43%      85.43%    30.51%      30.51%
   |      0.5019     0.7086       2,452      4,762      7,214   66.01%      75.72%    23.58%      54.10%
   |      0.3407     0.5019       3,710      3,504      7,214   48.57%      66.67%    17.35%      71.45%
   |      0.2195     0.3406       4,696      2,517      7,213   34.90%      58.73%    12.46%      83.91%
   |      0.1347     0.2195       5,650      1,564      7,214   21.68%      51.32%     7.74%      91.66%
   |      0.0792     0.1347       6,295        919      7,214   12.74%      44.89%     4.55%      96.21%
   |      0.0452     0.0792       6,737        476      7,213    6.60%      39.42%     2.36%      98.56%
   |      0.0234     0.0452       7,000        214      7,214    2.97%      34.86%     1.06%      99.62%
   V      0.0099     0.0234       7,150         64      7,214    0.89%      31.09%     0.32%      99.94%
 GOOD     0.0007     0.0099       7,201         12      7,213    0.17%      27.99%     0.06%     100.00%
       ========== ========== ========== ========== ==========
          0.0007     0.9699      51,942     20,194     72,136

                  GOOD BAD SEPARATION REPORT FOR SCORE1 IN DATA MDL1(WHERE = (GRP = 1))
                                MAXIMUM KS = 53.0942 AT SCORE POINT 0.2290
               ( AUC STATISTICS = 0.8486, GINI COEFFICIENT = 0.6973, DIVERGENCE = 2.0251 )

          MIN        MAX           GOOD        BAD      TOTAL    BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #    RATE      BAD RATE  PERCENT      PERCENT
 --------------------------------------------------------------------------------------------------------
  BAD     0.5863     0.9413       1,351      1,435      2,786   51.51%      51.51%    43.48%      43.48%
   |      0.3713     0.5862       2,136        651      2,787   23.36%      37.43%    19.73%      63.21%
   |      0.2299     0.3712       2,340        446      2,786   16.01%      30.29%    13.52%      76.73%
   |      0.1419     0.2298       2,525        262      2,787    9.40%      25.07%     7.94%      84.67%
   |      0.0832     0.1419       2,584        202      2,786    7.25%      21.50%     6.12%      90.79%
   |      0.0480     0.0832       2,643        144      2,787    5.17%      18.78%     4.36%      95.15%
   |      0.0270     0.0480       2,682        104      2,786    3.73%      16.63%     3.15%      98.30%
   |      0.0140     0.0270       2,741         46      2,787    1.65%      14.76%     1.39%      99.70%
   V      0.0058     0.0140       2,776         10      2,786    0.36%      13.16%     0.30%     100.00%
 GOOD     0.0005     0.0058       2,786          0      2,786    0.00%      11.84%     0.00%     100.00%
       ========== ========== ========== ========== ==========
          0.0005     0.9413      24,564      3,300     27,864

As shown in the above output, while the overall model performance looks acceptable, the model doesn’t generalize well to the dataset from the 2nd group with the smaller size. While the overall KS could be as high as 60, the KS for the 2nd group is merely 53. The reason is that the overall model performance is heavily influenced by the dataset from the 1st group with the larger size. Therefore, the estimated model is biased toward the risk profile reflected in the 1st group.

To alleviate the bias in the first model, we could first introduce a look-alike model driven by x0 – x1 to profile the group and then build two separate risk models with x2 – x3 only, one for each group. As a result, the final predicted probability should be the composite of all three sub-models, i.e. P(bad) = P(grp = 0) * P(bad | grp = 0) + P(grp = 1) * P(bad | grp = 1), as shown below. The model evaluation is also provided to compare with the first model.

proc logistic data = one desc noprint;
  where grp = 0;
  model bad = x2 x3;
  score data = one out = mdl20(rename = (p_1 = p_10));
run;

proc logistic data = one desc noprint;
  where grp = 1;
  model bad = x2 x3;
  score data = one out = mdl21(rename = (p_1 = p_11));
run;

proc logistic data = one desc noprint;
  model grp = x0 x1;
  score data = one out = seg;
run;

data mdl2;
  merge seg mdl20 mdl21;
  by i;
  score2 = p_10 * (1 - p_1) + p_11 * p_1;
run;

                            GOOD BAD SEPARATION REPORT FOR SCORE2 IN DATA MDL2
                                MAXIMUM KS = 60.6234 AT SCORE POINT 0.2469
               ( AUC STATISTICS = 0.8858, GINI COEFFICIENT = 0.7715, DIVERGENCE = 2.8434 )

          MIN        MAX           GOOD        BAD      TOTAL    BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #    RATE      BAD RATE  PERCENT      PERCENT
 --------------------------------------------------------------------------------------------------------
  BAD     0.6877     0.9677       2,011      7,989     10,000   79.89%      79.89%    34.00%      34.00%
   |      0.4749     0.6876       4,300      5,700     10,000   57.00%      68.45%    24.26%      58.27%
   |      0.3125     0.4748       6,036      3,964     10,000   39.64%      58.84%    16.87%      75.14%
   |      0.1932     0.3124       7,451      2,549     10,000   25.49%      50.51%    10.85%      85.99%
   |      0.1142     0.1932       8,379      1,621     10,000   16.21%      43.65%     6.90%      92.89%
   |      0.0646     0.1142       9,055        945     10,000    9.45%      37.95%     4.02%      96.91%
   |      0.0345     0.0646       9,533        467     10,000    4.67%      33.19%     1.99%      98.90%
   |      0.0166     0.0345       9,800        200     10,000    2.00%      29.29%     0.85%      99.75%
   V      0.0062     0.0166       9,946         54     10,000    0.54%      26.10%     0.23%      99.98%
 GOOD     0.0001     0.0062       9,995          5     10,000    0.05%      23.49%     0.02%     100.00%
       ========== ========== ========== ========== ==========
          0.0001     0.9677      76,506     23,494    100,000

                  GOOD BAD SEPARATION REPORT FOR SCORE2 IN DATA MDL2(WHERE = (GRP = 0))
                                MAXIMUM KS = 61.1591 AT SCORE POINT 0.2458
               ( AUC STATISTICS = 0.8880, GINI COEFFICIENT = 0.7759, DIVERGENCE = 2.9130 )

          MIN        MAX           GOOD        BAD      TOTAL    BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #    RATE      BAD RATE  PERCENT      PERCENT
 --------------------------------------------------------------------------------------------------------
  BAD     0.7221     0.9677       1,075      6,138      7,213   85.10%      85.10%    30.40%      30.40%
   |      0.5208     0.7221       2,436      4,778      7,214   66.23%      75.66%    23.66%      54.06%
   |      0.3533     0.5208       3,670      3,544      7,214   49.13%      66.82%    17.55%      71.61%
   |      0.2219     0.3532       4,726      2,487      7,213   34.48%      58.73%    12.32%      83.92%
   |      0.1309     0.2219       5,617      1,597      7,214   22.14%      51.41%     7.91%      91.83%
   |      0.0731     0.1309       6,294        920      7,214   12.75%      44.97%     4.56%      96.39%
   |      0.0387     0.0731       6,762        451      7,213    6.25%      39.44%     2.23%      98.62%
   |      0.0189     0.0387       7,009        205      7,214    2.84%      34.86%     1.02%      99.63%
   V      0.0074     0.0189       7,152         62      7,214    0.86%      31.09%     0.31%      99.94%
 GOOD     0.0002     0.0073       7,201         12      7,213    0.17%      27.99%     0.06%     100.00%
       ========== ========== ========== ========== ==========
          0.0002     0.9677      51,942     20,194     72,136

                  GOOD BAD SEPARATION REPORT FOR SCORE2 IN DATA MDL2(WHERE = (GRP = 1))
                                MAXIMUM KS = 57.6788 AT SCORE POINT 0.1979
               ( AUC STATISTICS = 0.8717, GINI COEFFICIENT = 0.7434, DIVERGENCE = 2.4317 )

          MIN        MAX           GOOD        BAD      TOTAL    BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #    RATE      BAD RATE  PERCENT      PERCENT
 --------------------------------------------------------------------------------------------------------
  BAD     0.5559     0.9553       1,343      1,443      2,786   51.79%      51.79%    43.73%      43.73%
   |      0.3528     0.5559       2,001        786      2,787   28.20%      40.00%    23.82%      67.55%
   |      0.2213     0.3528       2,364        422      2,786   15.15%      31.71%    12.79%      80.33%
   |      0.1372     0.2213       2,513        274      2,787    9.83%      26.24%     8.30%      88.64%
   |      0.0840     0.1372       2,588        198      2,786    7.11%      22.42%     6.00%      94.64%
   |      0.0484     0.0840       2,683        104      2,787    3.73%      19.30%     3.15%      97.79%
   |      0.0256     0.0483       2,729         57      2,786    2.05%      16.84%     1.73%      99.52%
   |      0.0118     0.0256       2,776         11      2,787    0.39%      14.78%     0.33%      99.85%
   V      0.0040     0.0118       2,781          5      2,786    0.18%      13.16%     0.15%     100.00%
 GOOD     0.0001     0.0040       2,786          0      2,786    0.00%      11.84%     0.00%     100.00%
       ========== ========== ========== ========== ==========
          0.0001     0.9553      24,564      3,300     27,864

After comparing KS statistics from the two modeling approaches, we can see that, while the performance of the 2nd approach on the overall sample is only slightly better than that of the 1st approach, the KS on the 2nd group with the smaller size, e.g. grp = 1, increases from 53 up to 58, or by 8.6%. While the example only covers two groups, it is trivial to generalize to cases with more than two groups, as sketched below.
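
The generalization is just the law of total probability, where the composite P(bad) is the sum over groups of P(grp = g) * P(bad | grp = g). Below is a minimal sketch of the composite step in R with hypothetical inputs; with more than two groups, the group-profiling model would need to be multinomial rather than binary.

# composite probability over K groups, given an n x K matrix of group membership
# probabilities and an n x K matrix of group-specific bad probabilities
composite_score <- function(p_grp, p_bad_by_grp) {
  rowSums(p_grp * p_bad_by_grp)
}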

Risk Models with Generalized PLS

While developing risk models with hundreds of potential variables, we often run into the situation that risk characteristics or macro-economic indicators are highly correlated, namely multicollinearity. In such cases, we might have to drop variables with high VIFs or employ “variable shrinkage” methods, e.g. lasso or ridge, to suppress variables with collinearity.

Feature extraction approaches based on PCA and PLS have been widely discussed but are rarely used in real-world applications due to concerns around model interpretability and implementation. In the example below, it is shown that there shouldn’t be any hurdle in the model implementation, e.g. scoring, given that coefficients can be extracted from a GPLS model in a similar way as from a GLM model. In addition, compared with the GLM with 8 variables, the GPLS with only 5 components is able to provide a comparable performance on the hold-out testing data.

R Code

library(gpls)
library(pROC)

df1 <- read.csv("credit_count.txt")
df2 <- df1[df1$CARDHLDR == 1, -c(1, 10, 11, 12, 13)]
set.seed(2016)
n <- nrow(df2)
sample <- sample(seq(n), size = n / 2, replace = FALSE)
train <- df2[sample, ]
test <- df2[-sample, ]

m1 <- glm(DEFAULT ~ ., data = train, family = "binomial")
cat("\n### ROC OF GLM PREDICTION WITH TRAINING DATA ###\n")
print(roc(train$DEFAULT, predict(m1, newdata = train, type = "response")))
cat("\n### ROC OF GLM PREDICTION WITH TESTING DATA ###\n")
print(roc(test$DEFAULT, predict(m1, newdata = test, type = "response")))

m2 <- gpls(DEFAULT ~ ., data = train, family = "binomial", K.prov = 5)
cat("\n### ROC OF GPLS PREDICTION WITH TRAINING DATA ###\n")
print(roc(train$DEFAULT, predict(m2, newdata = train)$predicted[, 1]))
cat("\n### ROC OF GPLS PREDICTION WITH TESTING DATA ###\n")
print(roc(test$DEFAULT, predict(m2, newdata = test)$predicted[, 1]))

cat("\n### COEFFICIENTS COMPARISON BETWEEN GLM AND GPLS ###\n")
print(data.frame(glm = m1$coefficients, gpls = m2$coefficients))

Output

### ROC OF GLM PREDICTION WITH TRAINING DATA ###

Call:
roc.default(response = train$DEFAULT, predictor = predict(m1,     newdata = train, type = "response"))

Data: predict(m1, newdata = train, type = "response") in 4753 controls (train$DEFAULT 0) < 496 cases (train$DEFAULT 1).
Area under the curve: 0.6641

### ROC OF GLM PREDICTION WITH TESTING DATA ###

Call:
roc.default(response = test$DEFAULT, predictor = predict(m1,     newdata = test, type = "response"))

Data: predict(m1, newdata = test, type = "response") in 4750 controls (test$DEFAULT 0) < 500 cases (test$DEFAULT 1).
Area under the curve: 0.6537

### ROC OF GPLS PREDICTION WITH TRAINING DATA ###

Call:
roc.default(response = train$DEFAULT, predictor = predict(m2,     newdata = train)$predicted[, 1])

Data: predict(m2, newdata = train)$predicted[, 1] in 4753 controls (train$DEFAULT 0) < 496 cases (train$DEFAULT 1).
Area under the curve: 0.6627

### ROC OF GPLS PREDICTION WITH TESTING DATA ###

Call:
roc.default(response = test$DEFAULT, predictor = predict(m2,     newdata = test)$predicted[, 1])

Data: predict(m2, newdata = test)$predicted[, 1] in 4750 controls (test$DEFAULT 0) < 500 cases (test$DEFAULT 1).
Area under the curve: 0.6542

### COEFFICIENTS COMPARISON BETWEEN GLM AND GPLS ###
                      glm          gpls
(Intercept) -0.1940785071 -0.1954618828
AGE         -0.0122709412 -0.0147883358
ACADMOS      0.0005302022  0.0003671781
ADEPCNT      0.1090667092  0.1352491711
MAJORDRG     0.0757313171  0.0813835741
MINORDRG     0.2621574192  0.2547176301
OWNRENT     -0.2803919685 -0.1032119571
INCOME      -0.0004222914 -0.0004531543
LOGSPEND    -0.1688395555 -0.1525963363
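
As a sanity check on the implementation claim, the GPLS score can be reproduced by hand from the extracted coefficients. The sketch below assumes, as the printed comparison suggests, that the coefficient vector lines up with the GLM design matrix columns:

# form the linear predictor with the extracted GPLS coefficients and apply the
# logistic link, the same way a GLM score would be implemented
b <- m2$coefficients
X <- model.matrix(DEFAULT ~ ., data = test)
p_manual <- 1 / (1 + exp(-(X %*% b)))
head(cbind(p_manual, predict(m2, newdata = test)$predicted[, 1]))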

A SAS Macro Aligning the Logit Variable to a Scorecard with Specific PDO and Base Point

%macro align_score(data = , y = , logit = , pdo = , base_point = , base_odds = , min_point = 100, max_point = 900); 
***********************************************************; 
* THE MACRO IS TO ALIGN A LOGIT VARIABLE TO A SCORE WITH  *; 
* SPECIFIC PDO, BASE POINT, AND BASE ODDS                 *; 
* ======================================================= *; 
* PARAMETERS:                                             *; 
*  DATA      : INPUT SAS DATA TABLE                       *; 
*  Y         : PERFORMANCE VARIABLE WITH 0/1 VALUE        *; 
*  LOGIT     : A LOGIT VARIABLE TO BE ALIGNED FROM        *; 
*  PDO       : PDO OF THE SCORE ALIGNED TO                *; 
*  BASE_POINT: BASE POINT OF THE SCORE ALIGNED TO         *; 
*  BASE_ODDS : ODDS AT BASE POINT OF THE SCORE ALIGNED TO *; 
*  MIN_POINT : LOWER END OF SCORE POINT, 100 BY DEFAULT   *; 
*  MAX_POINT : UPPER END OF SCORE POINT, 900 BY DEFAULT   *; 
* ======================================================= *; 
* OUTPUTS:                                                *; 
*  ALIGN_FORMULA.SAS                                      *; 
*  A SAS CODE WITH THE FORMULA TO ALIGN THE LOGIT FIELD   *; 
*  TO A SPECIFIC SCORE TOGETHER WITH THE STATISTICAL      *; 
*  SUMMARY OF ALIGN_SCORE                                 *;                                    
***********************************************************; 
 
options nocenter nonumber nodate mprint mlogic symbolgen 
        orientation = landscape ls = 150 formchar = "|----|+|---+=|-/\<>*"; 
 
%local b0 b1; 
 
data _tmp1 (keep = &y &logit); 
  set &data; 
  where &y in (0, 1) and &logit ~= .; 
run; 
 
ods listing close; 
ods output ParameterEstimates = _est1 (keep = variable estimate); 
proc logistic data = _last_ desc; 
  model &y = &logit; 
run; 
ods listing; 
 
data _null_; 
  set _last_; 
   
  if _n_ = 1 then do; 
    b = - (estimate + (log(&base_odds) - (log(2) / &pdo) * &base_point)) / (log(2) / &pdo); 
    call symput('b0', put(b, 15.8)); 
  end; 
  else do; 
    b = estimate / (log(2) / &pdo); 
    call symput('b1', put(b, 15.8)); 
  end; 
run; 
 
filename formula "ALIGN_FORMULA.SAS"; 
 
data _null_; 
  file formula; 
 
  put @3 3 * "*" " SCORE ALIGNMENT FORMULA OF %upcase(&logit) " 3 * "*" ";"; 
  put; 
  put @3 "ALIGN_SCORE = MAX(MIN(ROUND((%trim(&b0)) - (%trim(&b1)) * %upcase(&logit), 1), &max_point), &min_point);"; 
  put; 
run; 
 
data _tmp2; 
  set _tmp1; 
  %inc formula; 
run; 
 
proc summary data = _last_ nway; 
  class &y; 
  output out = _tmp3(drop = _type_ _freq_) 
  min(align_score) = min_scr max(align_score) = max_scr 
  median(align_score) = mdn_scr; 
run; 
 
data _null_; 
  set _last_; 
  file formula mod; 
 
  if _n_ = 1 then do; 
    put @3 3 * "*" " STATISTICAL SUMMARY OF ALIGN_SCORE BY INPUT DATA: " 3 * "*" ";"; 
  end; 
  put @3 "* WHEN %upcase(&amp;y) = " &amp;y ": MIN(SCORE) = " min_scr " MEDIAN(SCORE) = " mdn_scr " MAX(SCORE) = " max_scr "*;"; 
run; 
 
proc datasets library = work nolist; 
  delete _: (mt = data); 
run; 
quit; 
 
***********************************************************; 
*                     END OF THE MACRO                    *; 
***********************************************************;  
%mend align_score;
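
For readers who prefer R, the alignment inside the macro boils down to a linear transformation of the logit. Below is a minimal sketch with made-up target values and hypothetical calibration estimates, not output from the macro:

# align a logit to a score scale with pdo = 50 and base point = 600 at 20:1 odds;
# intercept and slope would come from regressing the 0/1 outcome on the logit
pdo <- 50; base_point <- 600; base_odds <- 20
intercept <- 0; slope <- 1
b0 <- base_point - (pdo / log(2)) * (log(base_odds) + intercept)
b1 <- (pdo / log(2)) * slope
align_score <- function(logit) pmax(pmin(round(b0 - b1 * logit), 900), 100)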

A Prototype of Monotonic Binning Algorithm with R

I’ve been asked many times if I have a piece of R code implementing the monotonic binning algorithm, similar to the ones that I developed with SAS (https://statcompute.wordpress.com/2012/06/10/a-sas-macro-implementing-monotonic-woe-transformation-in-scorecard-development) and with Python (https://statcompute.wordpress.com/2012/12/08/monotonic-binning-with-python). Today, I finally had time to draft a quick prototype with 20 lines of R code, which is, however, barely usable without further polish. But it is still a little surprising to me how efficient it can be to use R in algorithm prototyping, much sleeker than the SAS macro language.

library(sas7bdat)
library(Hmisc)

bin <- function(x, y){
  # start with at most 50 bins, bounded by the number of distinct values of x
  n <- min(50, length(unique(x)))
  repeat {
    n   <- n - 1
    d1  <- data.frame(x, y, bin = cut2(x, g = n)) 
    d2  <- aggregate(d1[-3], d1[3], mean)
    # reduce the number of bins until the bin-level y rate is strictly monotonic in x
    cor <- cor(d2[-1], method = "spearman")
    if(abs(cor[1, 2]) == 1) break
  }
  d2[2] <- NULL
  colnames(d2) <- c('LEVEL', 'RATE')
  head <- paste(toupper(substitute(y)), " RATE by ", toupper(substitute(x)), sep = '')
  cat("+-", rep("-", nchar(head)), "-+\n", sep = '')
  cat("| ", head, ' |\n', sep = '')
  cat("+-", rep("-", nchar(head)), "-+\n", sep = '')
  print(d2)
  cat("\n")
}

data <- read.sas7bdat("C:\\Users\\liuwensui\\Downloads\\accepts.sas7bdat")
attach(data)

bin(bureau_score, bad)
bin(age_oldest_tr, bad)
bin(tot_income, bad)
bin(tot_tr, bad)

R output:

+--------------------------+
| BAD RATE by BUREAU_SCORE |
+--------------------------+
       LEVEL       RATE
1  [443,618) 0.44639376
2  [618,643) 0.38446602
3  [643,658) 0.31835938
4  [658,673) 0.23819302
5  [673,686) 0.19838057
6  [686,702) 0.17850288
7  [702,715) 0.14168378
8  [715,731) 0.09815951
9  [731,752) 0.07212476
10 [752,776) 0.05487805
11 [776,848] 0.02605210

+---------------------------+
| BAD RATE by AGE_OLDEST_TR |
+---------------------------+
       LEVEL       RATE
1  [  1, 34) 0.33333333
2  [ 34, 62) 0.30560928
3  [ 62, 87) 0.25145068
4  [ 87,113) 0.23346304
5  [113,130) 0.21616162
6  [130,149) 0.20036101
7  [149,168) 0.19361702
8  [168,198) 0.15530303
9  [198,245) 0.11111111
10 [245,308) 0.10700389
11 [308,588] 0.08730159

+------------------------+
| BAD RATE by TOT_INCOME |
+------------------------+
           LEVEL      RATE
1 [   0,   2570) 0.2498715
2 [2570,   4510) 0.2034068
3 [4510,8147167] 0.1602327

+--------------------+
| BAD RATE by TOT_TR |
+--------------------+
    LEVEL      RATE
1 [ 0,12) 0.2672370
2 [12,22) 0.1827676
3 [22,77] 0.1422764

A SAS Macro for Scorecard Evaluation with Weights

On 09/28/2012, I posted a SAS macro evaluating the scorecard performance, e.g. KS / AUC statistics (https://statcompute.wordpress.com/2012/09/28/a-sas-macro-for-scorecard-performance-evaluation). However, this macro is not generic enough to handle cases with a weighting variable. In a recent project that I am working on, there is a weight variable attached to each credit applicant due to the reject inference. Therefore, there is no excuse for me to continue neglecting the necessity of developing another SAS macro that can take care of a weight variable with any positive values in the scorecard evaluation. Below is a quick draft of the macro. You might have to tweak it a little to suit your needs in production.

%macro wt_auc(data = , score = , y = , weight = );
***********************************************************;
* THE MACRO IS TO EVALUATE THE SEPARATION POWER OF A      *;
* SCORECARD WITH WEIGHTS                                  *;
* ------------------------------------------------------- *;
* PARAMETERS:                                             *;
*  DATA  : INPUT DATASET                                  *;
*  SCORE : SCORECARD VARIABLE                             *;
*  Y     : RESPONSE VARIABLE IN (0, 1)                    *;
*  WEIGHT: WEIGHT VARIABLE WITH POSITIVE VALUES           *; 
* ------------------------------------------------------- *;
* OUTPUTS:                                                *;
*  A SUMMARY TABLE WITH KS AND AUC STATISTICS             *;
* ------------------------------------------------------- *;
* CONTACT:                                                *;
*  WENSUI.LIU@53.COM                                      *;
***********************************************************;
options nocenter nonumber nodate mprint mlogic symbolgen
        orientation = landscape ls = 125 formchar = "|----|+|---+=|-/\<>*";

data _tmp1 (keep = &score &y &weight);
  set &data;
  where &score ~= . and &y in (0, 1) and &weight >= 0;
run;

*** CAPTURE THE DIRECTION OF SCORE ***;
ods listing close;
ods output spearmancorr = _cor;
proc corr data = _tmp1 spearman;
  var &y;
  with &score;
run;
ods listing;

data _null_;
  set _cor;
  if &y >= 0 then do;
    call symput('desc', 'descending');
  end;
  else do;
    call symput('desc', ' ');
  end;
run;

proc sql noprint;
create table
  _tmp2 as    
select
  &score                                         as _scr,
  sum(&y)                                        as _bcnt,
  count(*)                                       as _cnt,
  sum(case when &y = 1 then &weight else 0 end)  as _bwt,
  sum(case when &y = 0 then &weight else 0 end)  as _gwt
from
  _tmp1
group by
  &score;

select
  sum(_bwt) into :bsum
from
  _tmp2;
  
select
  sum(_gwt) into :gsum
from
  _tmp2;

select
  sum(_cnt) into :cnt
from
  _tmp2;    
quit;
%put &cnt;

proc sort data = _tmp2;
  by &desc _scr;
run;

data _tmp3;
  set _tmp2;
  by &desc _scr;
  retain _gcum _bcum _cntcum;
  _gcum + _gwt;
  _bcum + _bwt;
  _cntcum + _cnt;
  _gpct = _gcum / &gsum;
  _bpct = _bcum / &bsum;
  _ks   = abs(_gpct - _bpct) * 100;
  _rank = int(_cntcum / ceil(&cnt / 10)) + 1;
run;

proc sort data = _tmp3 sortsize = max;
  by _gpct _bpct;
run;

data _tmp4;
  set _tmp3;
  by _gpct _bpct;
  if last._gpct then do;
    _idx + 1;
    output;
  end;
run;

proc sql noprint;
create table
  _tmp5 as
select
  a._gcum as gcum,
  (b._gpct - a._gpct) * (b._bpct + a._bpct) / 2 as dx
from
  _tmp4 as a, _tmp4 as b
where
  a._idx + 1 = b._idx;

select
  sum(dx) format = 15.8 into :AUC
from
  _tmp5;

select
  max(_ks) format = 15.8 into :KS_STAT
from
  _tmp3;

select
  _scr format = 6.2 into :KS_SCORE
from
  _tmp3
where
  _ks = (select max(_ks) from _tmp3);

create table
  _tmp6 as
select
  _rank                       as rank,
  min(_scr)                   as min_scr,
  max(_scr)                   as max_scr,
  sum(_cnt)                   as cnt,
  sum(_gwt + _bwt)            as wt,
  sum(_gwt)                   as gwt,
  sum(_bwt)                   as bwt,
  sum(_bwt) / calculated wt   as bad_rate
from
  _tmp3
group by
  _rank;    
quit;  

proc report data = _last_ spacing = 1 split = "/" headline nowd;
  column("GOOD BAD SEPARATION REPORT FOR %upcase(%trim(&score)) IN DATA %upcase(%trim(&data))/
          MAXIMUM KS = %trim(&ks_stat) AT SCORE POINT %trim(&ks_score) and AUC STATISTICS = %trim(&auc)/ /"
          rank min_scr max_scr cnt wt gwt bwt bad_rate);
  define rank       / noprint order order = data;
  define min_scr    / "MIN/SCORE"             width = 10 format = 9.2        analysis min center;
  define max_scr    / "MAX/SCORE"             width = 10 format = 9.2        analysis max center;
  define cnt        / "RAW/COUNT"             width = 10 format = comma9.    analysis sum;
  define wt         / "WEIGHTED/SUM"          width = 15 format = comma14.2  analysis sum;
  define gwt        / "WEIGHTED/GOODS"        width = 15 format = comma14.2  analysis sum;
  define bwt        / "WEIGHTED/BADS"         width = 15 format = comma14.2  analysis sum;
  define bad_rate   / "BAD/RATE"              width = 10 format = percent9.2 order center;
  rbreak after / summarize dol skip;
run;

proc datasets library = work nolist;
  delete _tmp: / memtype = data;
run;
quit;

***********************************************************;
*                     END OF THE MACRO                    *;
***********************************************************;
%mend wt_auc;

In the first demo below, a weight variable with fractional values is tested.

*** TEST CASE OF FRACTIONAL WEIGHTS ***;
data one;
  set data.accepts;
  weight = ranuni(1);
run;

%wt_auc(data = one, score = bureau_score, y = bad, weight = weight);
/*
                   GOOD BAD SEPARATION REPORT FOR BUREAU_SCORE IN DATA ONE
       MAXIMUM KS = 34.89711721 AT SCORE POINT 678.00 and AUC STATISTICS = 0.73521009

    MIN        MAX            RAW        WEIGHTED        WEIGHTED        WEIGHTED    BAD
   SCORE      SCORE         COUNT             SUM           GOODS            BADS    RATE
 -------------------------------------------------------------------------------------------
    443.00     619.00         539          276.29          153.16          123.13   44.56%
    620.00     644.00         551          273.89          175.00           98.89   36.11%
    645.00     660.00         544          263.06          176.88           86.18   32.76%
    661.00     676.00         555          277.26          219.88           57.38   20.70%
    677.00     692.00         572          287.45          230.41           57.04   19.84%
    693.00     707.00         510          251.51          208.25           43.26   17.20%
    708.00     724.00         576          276.31          243.89           32.42   11.73%
    725.00     746.00         566          285.53          262.73           22.80    7.98%
    747.00     772.00         563          285.58          268.95           16.62    5.82%
    773.00     848.00         546          272.40          264.34            8.06    2.96%
 ========== ========== ========== =============== =============== ===============
    443.00     848.00       5,522        2,749.28        2,203.49          545.79
*/

In the second demo, a weight variable with positive integers is also tested.

*** TEST CASE OF INTEGER WEIGHTS ***;
data two;
  set data.accepts;
  weight = rand("poisson", 20);
run;

%wt_auc(data = two, score = bureau_score, y = bad, weight = weight);
/*
                   GOOD BAD SEPARATION REPORT FOR BUREAU_SCORE IN DATA TWO
       MAXIMUM KS = 35.58884479 AT SCORE POINT 679.00 and AUC STATISTICS = 0.73725030

    MIN        MAX            RAW        WEIGHTED        WEIGHTED        WEIGHTED    BAD
   SCORE      SCORE         COUNT             SUM           GOODS            BADS    RATE
 -------------------------------------------------------------------------------------------
    443.00     619.00         539       10,753.00        6,023.00        4,730.00   43.99%
    620.00     644.00         551       11,019.00        6,897.00        4,122.00   37.41%
    645.00     660.00         544       10,917.00        7,479.00        3,438.00   31.49%
    661.00     676.00         555       11,168.00        8,664.00        2,504.00   22.42%
    677.00     692.00         572       11,525.00        9,283.00        2,242.00   19.45%
    693.00     707.00         510       10,226.00        8,594.00        1,632.00   15.96%
    708.00     724.00         576       11,497.00       10,117.00        1,380.00   12.00%
    725.00     746.00         566       11,331.00       10,453.00          878.00    7.75%
    747.00     772.00         563       11,282.00       10,636.00          646.00    5.73%
    773.00     848.00         546       10,893.00       10,598.00          295.00    2.71%
 ========== ========== ========== =============== =============== ===============
    443.00     848.00       5,522      110,611.00       88,744.00       21,867.00
*/

As shown, the macro works well in both cases. Please feel free to let me know if it helps in your cases.
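
For a quick cross-check outside of SAS, the weighted KS and AUC statistics can also be approximated in a few lines of R. This is only a sketch, assuming a numeric score, a 0/1 response and positive weights, and without the tie handling and decile reporting of the macro:

wt_ks_auc <- function(score, y, w) {
  # flip the score if necessary so that a higher value implies a higher bad rate
  if (cor(score, y, method = "spearman") < 0) score <- -score
  # accumulate weighted bads and goods from the riskiest to the safest accounts
  o <- order(score, decreasing = TRUE)
  cum_b <- cumsum(w[o] * (y[o] == 1)) / sum(w * (y == 1))
  cum_g <- cumsum(w[o] * (y[o] == 0)) / sum(w * (y == 0))
  ks <- max(abs(cum_b - cum_g)) * 100
  # trapezoid rule over the weighted ROC curve, i.e. cum_b (TPR) against cum_g (FPR)
  auc <- sum(diff(c(0, cum_g)) * (c(0, head(cum_b, -1)) + cum_b) / 2)
  list(ks = ks, auc = auc)
}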

A SAS Macro Calculating PDO

In the development of credit scorecards, the model developer will usually scale the predicted probability of default / delinquency into a range of discrete score points for the purpose of operational convenience. While there are multiple ways to perform the scaling, the most popular one in the credit risk arena is to scale the predicted probability logarithmically such that the odds, the ratio between goods and bads, will be doubled / halved after an increase / decrease of a certain number of score points, which is called PDO (points to double the odds) in the industry.
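
As a concrete illustration of the log-odds scaling, below is a small sketch with made-up parameters, e.g. 20:1 odds at a base point of 600 with PDO = 50, which are not tied to any particular scorecard:

pdo <- 50; base_point <- 600; base_odds <- 20
factor <- pdo / log(2)
offset <- base_point - factor * log(base_odds)
score <- function(odds) offset + factor * log(odds)   # odds = goods : bads
score(20)   # 600 at the base odds
score(40)   # 650, doubling the odds adds exactly one PDO
score(10)   # 550, halving the odds subtracts exactly one PDO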

In practice, the PDO of a scorecard over time carries strong implications for credit policies and strategies and is often used as an effectiveness measure of a scorecard. Chances are that a scorecard could still maintain satisfactory predictiveness, e.g. in terms of the K-S statistic or AUC, but might lose its effectiveness, e.g. through an increase in PDO. For instance, if a credit strategy was originally designed to take effect at credit score = 700, it might require credit score = 720 later after an increase in PDO, which would significantly discount the effectiveness of such a strategy. As a result, it is critically important to monitor the PDO of a scorecard in the production environment.
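
To monitor it, the observed PDO can be backed out of the data by regressing the log-odds of bad on the score, which is essentially what the macro below does. A minimal R sketch, assuming a data frame dat with a 0/1 bad flag and a numeric score:

# fit log-odds(bad) = intercept + slope * score; since doubling the good-to-bad odds
# lowers the bad log-odds by log(2), the observed PDO equals -log(2) / slope
m <- glm(bad ~ score, family = binomial, data = dat)
pdo_observed <- -log(2) / coef(m)[["score"]]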

Below is a SAS macro to calculate the PDO of a scorecard.

options nocenter nonumber nodate mprint mlogic symbolgen
        orientation = landscape ls = 125 formchar = "|----|+|---+=|-/\<>*";

%macro get_pdo(data = , score = , y = , wt = NONE, ref_score = , target_odds = , target_pdo = );
**********************************************************************;
* THIS MACRO IS TO CALCULATE OBSERVED ODDS AND PDO FOR ANY SCORECARD *;
* AND COMPUTE ALIGNMENT BETAS TO REACH TARGET ODDS AND PDO           *;
* ------------------------------------------------------------------ *;
* PARAMETERS:                                                        *;
*  DATA       : INPUT DATASET                                        *;
*  SCORE      : SCORECARD VARIABLE                                   *;
*  Y          : RESPONSE VARIABLE IN (0, 1)                          *;
*  WT         : WEIGHT VARIABLE IN POSITIVE INTEGER                  *;
*  REF_SCORE  : REFERENCE SCORE POINT FOR TARGET ODDS AND PDO        *;
*  TARGET_ODDS: TARGET ODDS AT REFERENCE SCORE OF SCORECARD          *;
*  TARGET_PDO : TARGET POINTS TO DOUBLE ODDS OF SCORECARD            *; 
* ------------------------------------------------------------------ *;
* OUTPUTS:                                                           *;
*  REPORT : PDO REPORT WITH THE CALIBRATION FORMULA IN HTML FORMAT   *;
* ------------------------------------------------------------------ *;
* AUTHOR: WENSUI.LIU@53.COM                                          *;
**********************************************************************;

options nonumber nodate orientation = landscape nocenter;  

*** CHECK IF THERE IS WEIGHT VARIABLE ***;
%if %upcase(&wt) = NONE %then %do;
  data _tmp1 (keep = &y &score _wt);
    set &data;
    where &y in (1, 0)  and
          &score ~= .;
    _wt = 1;
    &score = round(&score., 1);
  run;
%end;
%else %do;
  data _tmp1 (keep = &y &score _wt);
    set &data;
    where &y in (1, 0)        and
          &score ~= .         and
          round(&wt., 1) > 0;
    _wt = round(&wt., 1);
    &score = round(&score., 1);
  run;
%end;

proc logistic data = _tmp1 desc outest = _est1 noprint;
  model &y = &score;
  freq _wt;
run;

proc sql noprint;
  select round(min(&score), 0.01) into :min_score from _tmp1;

  select round(max(&score), 0.01) into :max_score from _tmp1;
quit;

data _est2;
  set _est1 (keep = intercept &score rename = (&score = slope));

  adjust_beta0 = &ref_score - (&target_pdo * log(&target_odds) / log(2)) - intercept * &target_pdo / log(2);
  adjust_beta1 = -1 * (&target_pdo * slope / log(2));

  do i = -5 to 5;
    old_pdo = round(-log(2) / slope, 0.01);
    old_ref = &ref_score + (i) * old_pdo;
    old_odd = exp(-(slope * old_ref + intercept)); 
    if old_ref >= &min_score and old_ref <= &max_score then output; 
  end;
run;

data _tmp2;
  set _tmp1;
  
  if _n_ = 1 then do;
    set _est2(obs = 1);
  end;

  adjusted = adjust_beta0 + adjust_beta1 * &score;
run;

proc logistic data = _tmp2 desc noprint outest = _est3;
  model &y = adjusted;
  freq _wt;
run;

data _est4;
  set _est3 (keep = intercept adjusted rename = (adjusted = slope));

  adjust_beta0 = &ref_score - (&target_pdo * log(&target_odds) / log(2)) - intercept * &target_pdo / log(2);
  adjust_beta1 = -1 * (&target_pdo * slope / log(2));

  do i = -5 to 5;
    new_pdo = round(-log(2) / slope, 0.01);
    new_ref = &ref_score + (i) * new_pdo;
    new_odd = exp(-(slope * new_ref + intercept)); 
    if new_ref >= &min_score and new_ref <= &max_score then output;
  end;
run;
 
proc sql noprint;
create table
  _final as
select
  &target_pdo            as target_pdo,
  &target_odds           as target_odds, 
  a.old_pdo              as pdo1,
  a.old_ref              as ref1,
  a.old_odd              as odd1,
  log(a.old_odd)         as ln_odd1,
  a.adjust_beta0         as adjust_beta0, 
  a.adjust_beta1         as adjust_beta1,
  b.new_pdo              as pdo2,
  b.new_ref              as ref2,
  b.new_odd              as odd2,
  log(b.new_odd)         as ln_odd2
from
  _est2 as a inner join _est4 as b
on
  a.i = b.i;

select round(pdo1, 1) into :pdo1 from _final;

select put(max(pdo1 / pdo2 - 1, 0), percent10.2) into :compare from _final;

select case when pdo1 > pdo2 then 1 else 0 end into :flag from _final;

select put(adjust_beta0, 12.8) into :beta0 from _final;

select put(adjust_beta1, 12.8) into :beta1 from _final;
quit;

%put &compare;
ods html file = "%upcase(%trim(&score))_PDO_SUMMARY.html" style = sasweb;
title;
proc report data  = _final box spacing = 1 split = "/" 
  style(header) = [font_face = "courier new"] style(column) = [font_face = "courier new"]
  style(lines) = [font_face = "courier new" font_size = 2] style(report) = [font_face = "courier new"];

  column("/SUMMARY OF POINTS TO DOUBLE ODDS FOR %upcase(&score) WEIGHTED BY %upcase(&wt) IN DATA %upcase(&data)
          /( TARGET PDO = &target_pdo, TARGET ODDS = &target_odds AT REFERENCE SCORE &ref_score ) / "
         pdo1 ref1 odd1 ln_odd1 pdo2 ref2 odd2 ln_odd2);

  define pdo1    / "OBSERVED/SCORE PDO"   width = 10 format = 4.   center;
  define ref1    / "OBSERVED/REF. SCORE"  width = 15 format = 5.   center order order = data;
  define odd1    / "OBSERVED/ODDS"        width = 15 format = 14.4 center;
  define ln_odd1 / "OBSERVED/LOG ODDS"    width = 15 format = 8.2  center;
  define pdo2    / "ADJUSTED/SCORE PDO"   width = 10 format = 4.   center;
  define ref2    / "ADJUSTED/REF. SCORE"  width = 15 format = 5.   center;
  define odd2    / "ADJUSTED/ODDS"        width = 15 format = 14.4 center;  
  define ln_odd2 / "ADJUSTED/LOG ODDS"    width = 15 format = 8.2  center;

  compute after;
  %if &flag = 1 %then %do;
    line @15 "THE SCORE ODDS IS DETERIORATED BY %trim(&compare).";
    line @15 "CALIBRATION FORMULA: ADJUSTED SCORE = %trim(&beta0) + %trim(&beta1) * %trim(%upcase(&score)).";
  %end;
  %else %do;
    line @25 "THERE IS NO DETERIORATION IN THE SCORE ODDS."; 
  %end;
  endcomp;
run;
ods html close;

*************************************************;
*              END OF THE MACRO                 *;
*************************************************; 
%mend get_pdo;

In the following example, I called the macro to calculate the PDO of bureau_score in a test data set. The parameter inputs “ref_score”, “target_odds”, and “target_pdo” could be acquired from the development or benchmark sample. As shown in the output, the observed PDO is higher than the target PDO, implying a 2.93% deterioration in the score effectiveness. In addition to the PDO calculation, the SAS macro also generated a formula to calibrate the bureau_score back to the target PDO.

data tmp1;
  set data.accepts;
  where bureau_score ~= .;
run;

%get_pdo(data = tmp1, score = bureau_score, y = bad, wt = weight, ref_score = 680, target_odds = 20, target_pdo = 45);

/*
 -----------------------------------------------------------------------------------------------------------------------
 |                  SUMMARY OF POINTS TO DOUBLE ODDS FOR BUREAU_SCORE WEIGHTED BY WEIGHT IN DATA TMP1                  |
 |                            ( TARGET PDO = 45, TARGET ODDS = 20 AT REFERENCE SCORE 680 )                             |
 |                                                                                                                     |
 | OBSERVED     OBSERVED        OBSERVED        OBSERVED      ADJUSTED     ADJUSTED        ADJUSTED        ADJUSTED    |
 |SCORE PDO    REF. SCORE         ODDS          LOG ODDS     SCORE PDO    REF. SCORE         ODDS          LOG ODDS    |
 |---------------------------------------------------------------------------------------------------------------------| 
 |     46   |       448     |        0.6404 |      -0.45    |     45   |       455     |        0.6250 |      -0.47    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     46   |       495     |        1.2809 |       0.25    |     45   |       500     |        1.2500 |       0.22    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     46   |       541     |        2.5618 |       0.94    |     45   |       545     |        2.5000 |       0.92    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     46   |       587     |        5.1239 |       1.63    |     45   |       590     |        5.0000 |       1.61    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     46   |       634     |       10.2483 |       2.33    |     45   |       635     |       10.0000 |       2.30    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     46   |       680     |       20.4976 |       3.02    |     45   |       680     |       20.0000 |       3.00    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     46   |       726     |       40.9971 |       3.71    |     45   |       725     |       40.0000 |       3.69    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     46   |       773     |       81.9981 |       4.41    |     45   |       770     |       80.0000 |       4.38    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     46   |       819     |      164.0038 |       5.10    |     45   |       815     |      160.0000 |       5.08    | 
 |---------------------------------------------------------------------------------------------------------------------| 
 |              THE SCORE ODDS IS DETERIORATED BY 2.93%.                                                               |            
 |              CALIBRATION FORMULA: ADJUSTED SCORE = 20.92914838 + 0.97156812 * BUREAU_SCORE.                         |            
 ----------------------------------------------------------------------------------------------------------------------- 
*/

In the second example, I used the formula from above to generate an adjust_score from the bureau_score and then calculated the PDO again. As shown in the output, after the calibration, the observed PDO of adjust_score becomes equivalent to the target PDO. For every 45-point increase in the score, e.g. from 680 to 725, the odds are doubled as anticipated, e.g. from 20 to 40.

data tmp2;
  set tmp1;
  adjust_score = 20.92914838 + 0.97156812 * bureau_score;
run;

%get_pdo(data = tmp2, score = adjust_score, y = bad, wt = weight, ref_score = 680, target_odds = 20, target_pdo = 45);

/*
 -----------------------------------------------------------------------------------------------------------------------
 |                  SUMMARY OF POINTS TO DOUBLE ODDS FOR ADJUST_SCORE WEIGHTED BY WEIGHT IN DATA TMP2                  |
 |                            ( TARGET PDO = 45, TARGET ODDS = 20 AT REFERENCE SCORE 680 )                             |
 |                                                                                                                     |
 | OBSERVED     OBSERVED        OBSERVED        OBSERVED      ADJUSTED     ADJUSTED        ADJUSTED        ADJUSTED    |
 |SCORE PDO    REF. SCORE         ODDS          LOG ODDS     SCORE PDO    REF. SCORE         ODDS          LOG ODDS    |
 |---------------------------------------------------------------------------------------------------------------------| 
 |     45   |       455     |        0.6249 |      -0.47    |     45   |       455     |        0.6250 |      -0.47    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     45   |       500     |        1.2498 |       0.22    |     45   |       500     |        1.2500 |       0.22    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     45   |       545     |        2.4996 |       0.92    |     45   |       545     |        2.5000 |       0.92    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     45   |       590     |        4.9991 |       1.61    |     45   |       590     |        5.0000 |       1.61    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     45   |       635     |        9.9980 |       2.30    |     45   |       635     |       10.0000 |       2.30    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     45   |       680     |       19.9956 |       3.00    |     45   |       680     |       20.0000 |       3.00    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     45   |       725     |       39.9905 |       3.69    |     45   |       725     |       40.0000 |       3.69    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     45   |       770     |       79.9796 |       4.38    |     45   |       770     |       80.0000 |       4.38    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     45   |       815     |      159.9565 |       5.07    |     45   |       815     |      160.0000 |       5.08    | 
 |---------------------------------------------------------------------------------------------------------------------| 
 |                        THERE IS NO DETERIORATION IN THE SCORE ODDS.                                                 |
 ----------------------------------------------------------------------------------------------------------------------- 
*/
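
As a quick sanity check of the doubling relationship, the odds implied by the target parameters at any score point can be written as odds = target_odds * 2 ** ((score - ref_score) / pdo). Below is a minimal Python sketch with the reference values from this example hard-coded.

# odds implied by a 45-point PDO anchored at 20-to-1 odds at score 680
implied_odds = lambda score: 20 * 2 ** ((score - 680) / 45.0)

print(implied_odds(680))  # 20.0
print(implied_odds(725))  # 40.0, doubled after adding 45 points
print(implied_odds(635))  # 10.0, halved after subtracting 45 points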

Monotonic Binning with Python

Monotonic binning is a data preparation technique widely used in scorecard development and is usually implemented with SAS. Below is an attempt to do the monotonic binning with Python.

Python Code:

# import packages
import pandas as pd
import numpy as np
import scipy.stats.stats as stats

# import data
data = pd.read_csv("/home/liuwensui/Documents/data/accepts.csv", sep = ",", header = 0)

# define a binning function
def mono_bin(Y, X, n = 20):
  # fill missings with median
  X2 = X.fillna(np.median(X))
  r = 0
  while np.abs(r) < 1:
    d1 = pd.DataFrame({"X": X2, "Y": Y, "Bucket": pd.qcut(X2, n)})
    d2 = d1.groupby('Bucket', as_index = True)
    r, p = stats.spearmanr(d2.mean().X, d2.mean().Y)
    n = n - 1
  d3 = pd.DataFrame(d2.min().X, columns = ['min_' + X.name])
  d3['max_' + X.name] = d2.max().X
  d3[Y.name] = d2.sum().Y
  d3['total'] = d2.count().Y
  d3[Y.name + '_rate'] = d2.mean().Y
  d4 = (d3.sort_index(by = 'min_' + X.name)).reset_index(drop = True)
  print "=" * 60
  print d4

mono_bin(data.bad, data.ltv)
mono_bin(data.bad, data.bureau_score)
mono_bin(data.bad, data.age_oldest_tr)
mono_bin(data.bad, data.tot_tr)
mono_bin(data.bad, data.tot_income)

Output:

============================================================
   min_ltv  max_ltv  bad  total  bad_rate
0        0       83   88    884  0.099548
1       84       92  137    905  0.151381
2       93       98  175    851  0.205640
3       99      102  173    814  0.212531
4      103      108  194    821  0.236297
5      109      116  194    769  0.252276
6      117      176  235    793  0.296343
============================================================
   min_bureau_score  max_bureau_score  bad  total  bad_rate
0               443               630  325    747  0.435074
1               631               655  242    721  0.335645
2               656               676  173    721  0.239945
3               677               698  245   1059  0.231350
4               699               709   64    427  0.149883
5               710               732   73    712  0.102528
6               733               763   53    731  0.072503
7               764               848   21    719  0.029207
============================================================
   min_age_oldest_tr  max_age_oldest_tr  bad  total  bad_rate
0                  1                 59  319    987  0.323202
1                 60                108  235    975  0.241026
2                109                142  282   1199  0.235196
3                143                171  142    730  0.194521
4                172                250  125    976  0.128074
5                251                588   93    970  0.095876
============================================================
   min_tot_tr  max_tot_tr  bad  total  bad_rate
0           0           8  378   1351  0.279793
1           9          13  247   1025  0.240976
2          14          18  240   1185  0.202532
3          19          25  165   1126  0.146536
4          26          77  166   1150  0.144348
============================================================
   min_tot_income  max_tot_income  bad  total  bad_rate
0            0.00         2000.00  323   1217  0.265407
1         2002.00         2916.67  259   1153  0.224631
2         2919.00         4000.00  226   1150  0.196522
3         4001.00         5833.33  231   1186  0.194772
4         5833.34      8147166.66  157   1131  0.138815

Calculating K-S Statistic with Python

The K-S (Kolmogorov-Smirnov) statistic is a measure to evaluate the predictiveness of a statistical model for binary outcomes and has been widely used in direct marketing and risk modeling.

Below is a demonstration of how to calculate the K-S statistic with fewer than 20 lines of Python code. In this code snippet, I am also trying to show how to do data munging effectively with the pandas and numpy packages.

As Wes McKinney pointed out in his book “Python for Data Analysis”, it is not necessary to become a proficient Python software developer in order to be a proficient Python data analyst.

In [1]: # IMPORT PACKAGES

In [2]: import pandas as pd

In [3]: import numpy as np

In [4]: # LOAD DATA FROM CSV FILE

In [5]: data = pd.read_csv('c:\\projects\\data.csv')

In [6]: data.describe()
Out[6]:
               bad        score
count  5522.000000  5522.000000
mean      0.197573   693.466135
std       0.398205    57.829769
min       0.000000   443.000000
25%       0.000000   653.000000
50%       0.000000   692.500000
75%       0.000000   735.000000
max       1.000000   848.000000

In [7]: data['good'] = 1 - data.bad

In [8]: # DEFINE 10 BUCKETS WITH EQUAL SIZE

In [9]: data['bucket'] = pd.qcut(data.score, 10)

In [10]: # GROUP THE DATA FRAME BY BUCKETS

In [11]: grouped = data.groupby('bucket', as_index = False)

In [12]: # CREATE A SUMMARY DATA FRAME

In [13]: agg1 = grouped.min().score

In [14]: agg1 = pd.DataFrame(grouped.min().score, columns = ['min_scr'])

In [15]: agg1['max_scr'] = grouped.max().score

In [16]: agg1['bads'] = grouped.sum().bad

In [17]: agg1['goods'] = grouped.sum().good

In [18]: agg1['total'] = agg1.bads + agg1.goods

In [19]: agg1
Out[19]:
   min_scr  max_scr  bads  goods  total
0      621      645   201    365    566
1      646      661   173    359    532
2      662      677   125    441    566
3      678      692    99    436    535
4      693      708    89    469    558
5      709      725    66    492    558
6      726      747    42    520    562
7      748      772    30    507    537
8      773      848    14    532    546
9      443      620   252    310    562

In [20]: # SORT THE DATA FRAME BY SCORE

In [21]: agg2 = (agg1.sort_index(by = 'min_scr')).reset_index(drop = True)

In [22]: agg2['odds'] = (agg2.goods / agg2.bads).apply('{0:.2f}'.format)

In [23]: agg2['bad_rate'] = (agg2.bads / agg2.total).apply('{0:.2%}'.format)

In [24]: # CALCULATE KS STATISTIC

In [25]: agg2['ks'] = np.round(((agg2.bads / data.bad.sum()).cumsum() - (agg2.goods / data.good.sum()).cumsum()), 4) * 100

In [26]: # DEFINE A FUNCTION TO FLAG MAX KS

In [27]: flag = lambda x: '<----' if x == agg2.ks.max() else ''

In [28]: # FLAG OUT MAX KS

In [29]: agg2['max_ks'] = agg2.ks.apply(flag)

In [30]: agg2
Out[30]:
   min_scr  max_scr  bads  goods  total   odds bad_rate     ks max_ks
0      443      620   252    310    562   1.23   44.84%  16.10
1      621      645   201    365    566   1.82   35.51%  26.29
2      646      661   173    359    532   2.08   32.52%  34.04
3      662      677   125    441    566   3.53   22.08%  35.55  <----
4      678      692    99    436    535   4.40   18.50%  34.78
5      693      708    89    469    558   5.27   15.95%  32.36
6      709      725    66    492    558   7.45   11.83%  27.30
7      726      747    42    520    562  12.38    7.47%  19.42
8      748      772    30    507    537  16.90    5.59%  10.72
9      773      848    14    532    546  38.00    2.56%  -0.00
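
For a quick cross-check of the decile-based approximation above, the two-sample K-S statistic can also be computed directly from the empirical score distributions of goods and bads, e.g. with scipy.stats.ks_2samp. Below is a minimal sketch, assuming the same data frame and that scipy is available; the result should be close to, and no smaller than, the maximum K-S found over the 10 score buckets.

from scipy.stats import ks_2samp

# exact two-sample KS between the score distributions of goods and bads
d, p = ks_2samp(data.score[data.bad == 0], data.score[data.bad == 1])

# express on the same percentage scale as the bucketed table above
ks = d * 100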

A SAS Macro for Scorecard Performance Evaluation

%macro separation(data = , score = , y = );
***********************************************************;
* THE MACRO IS TO EVALUATE THE SEPARATION POWER OF A      *;
* SCORECARD                                               *; 
* ------------------------------------------------------- *;
* PARAMETERS:                                             *;
*  DATA : INPUT DATASET                                   *;
*  SCORE: SCORECARD VARIABLE                              *;
*  Y    : RESPONSE VARIABLE IN (0, 1)                     *;
* ------------------------------------------------------- *;
* OUTPUTS:                                                *;
*  SEPARATION_REPORT.TXT                                  *;
*  A SEPARATION SUMMARY REPORT IN TXT FORMAT              *;
*  NAMED AS THE ABOVE WITH PREDICTIVE MEASURES INCLUDING  *;
*  KS, AUC, GINI, AND DIVERGENCE                          *;
* ------------------------------------------------------- *;
* CONTACT:                                                *;
*  WENSUI.LIU@53.COM                                      *;
***********************************************************;

options nonumber nodate orientation = landscape linesize = 160 nocenter 
        formchar = "|----|+|---+=|-/\<>*" formdlim=' ' ls = 150; 

*** DEFAULT GROUP NUMBER FOR REPORT ***;
%let grp = 10;

data _tmp1 (keep = &y &score);
  set &data;
  where &y in (1, 0)  and &score ~= .;
run;

filename lst_out temp;

proc printto new print = lst_out;
run;

*** CONDUCT NON-PARAMETRIC TESTS ***; 
ods output wilcoxonscores = _wx;
ods output kolsmir2stats = _ks;
proc npar1way wilcoxon edf data = _tmp1;
  class &y;
  var &score;
run;

proc printto;
run;

proc sort data = _wx;
  by class;
run;

*** CALCULATE ROC AND GINI ***;
data _null_;
  set _wx end = eof;
  by class;

  array a{2, 3} _temporary_;
  if _n_ = 1 then do;
    a[1, 1] = n;
    a[1, 2] = sumofscores;
    a[1, 3] = expectedsum;
  end;
  else do;
    a[2, 1] = n;
  end;
  if eof then do;
    auc  = (a[1, 2] - a[1, 3]) / (a[1, 1] * a[2, 1])  + 0.5;
    if auc <= 0.5 then auc = 1 - auc;
    gini = 2 * (auc - 0.5);  
    call execute('%let auc = '||put(auc, 10.4)||';');
    call execute('%let gini = '||put(gini, 10.4)||';');
  end;
run;

*** CALCULATE KS ***;
data _null_;
  set _ks;

  if _n_ = 1 then do;
    ks = nvalue2 * 100;
    call execute('%let ks = '||put(ks, 10.4)||';');
  end;
run;

*** CAPTURE SCORE POINT FOR MAX KS ***;
data _null_;
  infile lst_out;
  input @" at Maximum = " ks_score;
  output;
  call execute('%let ks_score = '||put(ks_score, 10.4)||';');
  stop;
run;

proc summary data = _tmp1 nway;
  class &y;
  output out = _data_ (drop = _type_ _freq_)
  mean(&score.) = mean var(&score.) = variance;
run;

*** CALCULATE DIVERGENCE ***;
data _null_;
  set _last_ end = eof;
  array a{2, 2} _temporary_;
  if _n_ = 1 then do;
    a[1, 1] = mean;
    a[1, 2] = variance;
  end;
  else do;
    a[2, 1] = mean;
    a[2, 2] = variance;
  end;
  if eof then do;
    divergence = (a[1, 1] - a[2, 1]) ** 2 / ((a[1, 2] + a[2, 2]) / 2);
    call execute('%let divergence = '||put(divergence, 10.4)||';'); 
  end; 
run;

*** CAPTURE THE DIRECTION OF SCORE ***;
ods listing close;
ods output spearmancorr = _cor;
proc corr data = _tmp1 spearman;
  var &y;
  with &score;
run;
ods listing;
 
data _null_;
  set _cor;
  if &y >= 0 then do;
    call symput('desc', 'descending');
  end;
  else do;
    call symput('desc', ' ');
  end;
run;
%put &desc;

proc rank data = _tmp1 out = _tmp2 groups = &grp ties = low;
  var &score;
  ranks rank;
run;

proc summary data = _last_ nway;
  class rank;
  output out = _data_ (drop = _type_ rename = (_freq_ = freq))
  min(&score) = min_score max(&score) = max_score
  sum(&y) = bads;
run;

proc sql noprint;
  select sum(bads) into :bcnt from _last_;
  select sum(freq) - sum(bads) into :gcnt from _last_;
quit;

proc sort data = _last_ (drop = rank);
  by &desc min_score;
run;

data _data_;
  set _last_;
  by &desc min_score;

  i + 1; 
  percent = i / 100; 
  good  = freq - bads;
  odds  = good / bads;

  hit_rate = bads / freq;
  retain cum_bads cum_freq;
  cum_bads + bads;
  cum_freq + freq;
  cum_hit_rate = cum_bads / cum_freq;

  cat_rate = bads / &bcnt;
  retain cum_cat_rate;
  cum_cat_rate + cat_rate; 

  format symbol $4.;
  if i = 1 then symbol = 'BAD';
  else if i = &grp - 1 then symbol = 'V';
  else if i = &grp then symbol = 'GOOD';
  else symbol = '|';
run;

proc printto print = "%upcase(%trim(&score))_SEPARATION_REPORT.TXT" new;
run;

proc report data = _last_ spacing = 1 split = "/" headline nowd;
  column("GOOD BAD SEPARATION REPORT FOR %upcase(%trim(&score)) IN DATA %upcase(%trim(&data))/
          MAXIMUM KS = %trim(&ks) AT SCORE POINT %trim(&ks_score)/   
          ( AUC STATISTICS = %trim(&auc), GINI COEFFICIENT = %trim(&gini), DIVERGENCE = %trim(&divergence) )/ /"
         percent symbol min_score max_score good bads freq odds hit_rate cum_hit_rate cat_rate cum_cat_rate);

  define percent      / noprint order order = data;
  define symbol       / "" center               width = 5 center;
  define min_score    / "MIN/SCORE"             width = 10 format = 9.4        analysis min center;
  define max_score    / "MAX/SCORE"             width = 10 format = 9.4        analysis max center;
  define good         / "GOOD/#"                width = 10 format = comma9.    analysis sum;
  define bads         / "BAD/#"                 width = 10 format = comma9.    analysis sum;
  define freq         / "TOTAL/#"               width = 10 format = comma9.    analysis sum;
  define odds         / "ODDS"                  width = 10 format = 8.2        order;
  define hit_rate     / "BAD/RATE"              width = 10 format = percent9.2 order center;
  define cum_hit_rate / "CUMULATIVE/BAD RATE"   width = 10 format = percent9.2 order;
  define cat_rate     / "BAD/PERCENT"           width = 10 format = percent9.2 order center;
  define cum_cat_rate / "CUMU. BAD/PERCENT"     width = 10 format = percent9.2 order; 

  rbreak after / summarize dol skip;
run; 

proc printto;
run;

***********************************************************;
*                     END OF THE MACRO                    *;
***********************************************************; 
%mend separation;

libname data 'C:\Documents and Settings\liuwensui\Desktop\fraction_models\test';

%separation(data = data.accepts, score = bureau_score, y = bad);

Sample Output:

                          GOOD BAD SEPARATION REPORT FOR BUREAU_SCORE IN DATA DATA.ACCEPTS                          
                                    MAXIMUM KS = 35.5477 AT SCORE POINT 677.0000                                    
                     ( AUC STATISTICS = 0.7389, GINI COEFFICIENT = 0.4778, DIVERGENCE = 0.8027 )                    
                                                                                                                    
          MIN        MAX           GOOD        BAD      TOTAL               BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #       ODDS    RATE      BAD RATE  PERCENT      PERCENT
 -------------------------------------------------------------------------------------------------------------------
  BAD   443.0000   620.0000         310        252        562       1.23   44.84%      44.84%    23.10%      23.10% 
   |    621.0000   645.0000         365        201        566       1.82   35.51%      40.16%    18.42%      41.52% 
   |    646.0000   661.0000         359        173        532       2.08   32.52%      37.71%    15.86%      57.38% 
   |    662.0000   677.0000         441        125        566       3.53   22.08%      33.74%    11.46%      68.84% 
   |    678.0000   692.0000         436         99        535       4.40   18.50%      30.79%     9.07%      77.91% 
   |    693.0000   708.0000         469         89        558       5.27   15.95%      28.29%     8.16%      86.07% 
   |    709.0000   725.0000         492         66        558       7.45   11.83%      25.92%     6.05%      92.12% 
   |    726.0000   747.0000         520         42        562      12.38    7.47%      23.59%     3.85%      95.97% 
   V    748.0000   772.0000         507         30        537      16.90    5.59%      21.64%     2.75%      98.72% 
 GOOD   773.0000   848.0000         532         14        546      38.00    2.56%      19.76%     1.28%     100.00% 
       ========== ========== ========== ========== ==========                                                       
        443.0000   848.0000       4,431      1,091      5,522 

WoE Macro for LGD Model

WoE (Weight of Evidence) transformation has been widely used in scorecard or PD (Probability of Default) model development with a binary response variable in retail banking. In SAS Global Forum 2012, Anthony and Naeem proposed an innovative method to employ the WoE transformation in LGD (Loss Given Default) model development, where the response lies in the range [0, 1]. Specifically, if an account has LGD = 0.25, then this account will be duplicated 100 times, of which 25 instances will be assigned Y = 1 and the remaining 75 instances will be assigned Y = 0. As a result, 1 case with the unity-interval response variable is converted into 100 cases with a binary response variable. After this conversion, the WoE transformation algorithm designed for the binary response can be directly applied to the new dataset with the converted LGD.
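
A minimal Python sketch of this case-duplication scheme is given below, assuming a pandas data frame df with the LGD stored in a column named lgd; the names df, lgd, and expand_lgd are hypothetical.

import numpy as np
import pandas as pd

def expand_lgd(df, lgd = "lgd"):
  # duplicate every account 100 times
  out = df.loc[np.repeat(df.index.values, 100)].copy()
  # the first round(lgd * 100) copies of each account get y = 1, the rest y = 0
  n_ones = np.repeat(np.round(df[lgd].values * 100).astype(int), 100)
  out["y"] = (np.tile(np.arange(100), len(df)) < n_ones).astype(int)
  return out.reset_index(drop = True)

For an account with LGD = 0.25, the function emits 100 rows, of which 25 carry y = 1, so a binary WoE routine can be applied to the expanded data as is.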

Following the logic described by Anthony and Naeem, I did a test run with a WoE SAS macro that I drafted myself and got the following output. In the test data set, there are 36,543 cases in total with the unity-interval response.

                         MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR X

   BIN           LOWER           UPPER                        #BADS      BAD                 INFO.           
 LEVEL           LIMIT           LIMIT      #FREQ  PERCENT    (Y=1)     RATE        WOE      VALUE         KS
 ------------------------------------------------------------------------------------------------------------
   001          9.5300         84.7200     406000  11.11%    158036  38.93%     -0.1726     0.0033     1.8904
   002         84.7300         91.9100     406800  11.13%    160530  39.46%     -0.1501     0.0025     3.5410
   003         91.9132         97.3100     405300  11.09%    168589  41.60%     -0.0615     0.0004     4.2202
   004         97.3146        101.8100     406000  11.11%    170773  42.06%     -0.0424     0.0002     4.6894
   005        101.8162        105.2221     406100  11.11%    172397  42.45%     -0.0264     0.0001     4.9821
   006        105.2223        110.1642     406000  11.11%    174480  42.98%     -0.0050     0.0000     5.0376
   007        110.1700        115.7000     406600  11.13%    183253  45.07%      0.0800     0.0007     4.1431
   008        115.7029        123.0886     405500  11.10%    188026  46.37%      0.1324     0.0020     2.6630
   009        123.1100        150.0000     406000  11.11%    198842  48.98%      0.2369     0.0063     0.0000
--------------------------------------------------------------------------------------------------------------                                        
    # TOTAL = 3654300, # BADs(Y=1) = 1574926, OVERALL BAD RATE = 43.0979%, MAX. KS = 5.0376, INFO. VALUE = 0.0154.                                    
--------------------------------------------------------------------------------------------------------------                                        

As shown in the above output, the monotonic WoE binning, information value, and KS statistic come out nicely, with the only exception that the frequency has been inflated 100 times. From a practical view, the increase in the data size will inevitably lead to an increase in the computing cost, making this binary conversion potentially inapplicable to large data.

After a few tweaks, I drafted a modified version of the WoE SAS macro suitable for LGD models with the unity-interval response, as given below.

%macro lgd_nwoe(data = , y = , x = );

***********************************************************;
* THE SAS MACRO IS TO PERFORM UNIVARIATE IMPORTANCE RANK  *;
* ORDER AND MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION   *;
* FOR NUMERIC ATTRIBUTES IN PRE-MODELING DATA PROCESSING  *;
* FOR LGD MODELS. OUTPUTS ARE SUPPOSED TO BE USED IN A    *;
* FRACTIONAL MODEL WITH LOGIT LINK FUNCTION               *;
* ======================================================= *;
* PARAMETERS:                                             *;
*  DATA: INPUT SAS DATA TABLE                             *;
*  Y   : CONTINUOUS RESPONSE VARIABLE IN [0, 1] RANGE     *;
*  X   : A LIST OF NUMERIC ATTRIBUTES                     *;
* ======================================================= *;
* OUTPUTS:                                                *;
*  NUM_WOE.WOE: A FILE OF WOE TRANSFORMATION RECODING     *;
*  NUM_WOE.FMT: A FILE OF BINNING FORMAT                  *;
*  NUM_WOE.PUT: A FILE OF PUT STATEMENTS FOR *.FMT FILE   *;
*  NUM_WOE.SUM: A FILE WITH PREDICTABILITY SUMMARY        *;
*  NUM_WOE.OUT: A FILE WITH STATISTICAL DETAILS           *;
*  NUM_WOE.IMP: A FILE OF MISSING IMPUTATION RECODING     *;
* ======================================================= *;
* CONTACT:                                                *;
*  WENSUI.LIU@53.COM                                      *;
***********************************************************;

options nocenter nonumber nodate mprint mlogic symbolgen
        orientation = landscape ls = 125 formchar = "|----|+|---+=|-/\<>*";

*** DEFAULT PARAMETERS ***;

%local maxbin miniv bignum mincor;

%let maxbin = 20;

%let miniv  = 0;

%let mincor = 1;

%let bignum = 1e300;

***********************************************************;
***         DO NOT CHANGE CODES BELOW THIS LINE         ***;
***********************************************************;

*** DEFAULT OUTPUT FILES ***;

* WOE RECODING FILE                     *;
filename woefile "LGD_NWOE.WOE";

* FORMAT FOR BINNING                    *;
filename fmtfile "LGD_NWOE.FMT";

* PUT STATEMENT TO USE FORMAT           *;
filename binfile "LGD_NWOE.PUT";

* KS SUMMARY                            *;
filename sumfile "LGD_NWOE.SUM";
 
* STATISTICAL SUMMARY FOR EACH VARIABLE *;
filename outfile "LGD_NWOE.OUT";

* IMPUTE RECODING FILE                  *;
filename impfile "LGD_NWOE.IMP";

*** A MACRO TO DELETE FILE ***;
%macro dfile(file = );
  data _null_;
    rc = fdelete("&file");
    if rc = 0 then do;
      put @1 50 * "+";
      put "THE EXISTED OUTPUT FILE HAS BEEN DELETED.";
      put @1 50 * "+";
    end;
  run;
%mend dfile;

*** CLEAN UP FILES ***;
%dfile(file = woefile);

%dfile(file = fmtfile);

%dfile(file = binfile);

%dfile(file = sumfile);

%dfile(file = outfile);

%dfile(file = impfile);

*** PARSING THE STRING OF NUMERIC PREDICTORS ***;
ods listing close;
ods output position = _pos1;
proc contents data = &data varnum;
run;

proc sql noprint;
  select
    upcase(variable) into :x2 separated by ' '
  from
    _pos1
  where
    compress(upcase(type), ' ') = 'NUM' and
    index("%upcase(%sysfunc(compbl(&x)))", compress(upcase(variable), ' ')) > 0;


  select
    count(variable) into :xcnt
  from
    _pos1
  where
    compress(upcase(type), ' ') = 'NUM' and
    index("%upcase(%sysfunc(compbl(&x)))", compress(upcase(variable), ' ')) > 0;
quit;

data _tmp1;
  retain &x2 &y &y.2;
  set &data;
  where &Y >= 0 and &y <= 1;
  &y.2 = 1 - &y;
  keep &x2 &y &y.2;
run;

ods output position = _pos2;
proc contents data = _tmp1 varnum;
run;

*** LOOP THROUGH EACH PREDICTOR ***;
%do i = 1 %to &xcnt;
    
  proc sql noprint;
    select
      upcase(variable) into :var
    from
      _pos2
    where
      num= &i;

    select
      count(distinct &var) into :xflg
    from
      _tmp1
    where
      &var ~= .;
  quit;

  proc summary data = _tmp1 nway;
    output out  = _med(drop = _type_ _freq_)
    median(&var) = med nmiss(&var) = mis;
  run;
  
  proc sql;
    select
      med into :median
    from
      _med;

    select
      mis into :nmiss
    from
      _med;

    select 
      case when count(&y) = sum(&y) then 1 else 0 end into :mis_flg1
    from
      _tmp1
    where
      &var = .;

    select
      case when sum(&y) = 0 then 1 else 0 end into :mis_flg2
    from
      _tmp1
    where
      &var = .;
  quit;

  %let nbin = %sysfunc(min(&maxbin, &xflg));

  *** CHECK IF # OF DISTINCT VALUES > 1 ***;
  %if &xflg > 1 %then %do;

    *** IMPUTE MISS VALUE WHEN WOE CANNOT BE CALCULATED ***;
    %if &mis_flg1 = 1 | &mis_flg2 = 1 %then %do;
      data _null_;
        file impfile mod;
        put " ";
        put @3 "*** MEDIAN IMPUTATION OF %TRIM(%UPCASE(&VAR)) (NMISS = %trim(&nmiss)) ***;";
        put @3 "IF %TRIM(%UPCASE(&VAR)) = . THEN %TRIM(%UPCASE(&VAR)) = &MEDIAN;";
      run;

      data _tmp1;
        set _tmp1;
        if &var = . then &var = &median;
      run; 
    %end;      
      
    *** LOOP THROUGH ALL # OF BINS ***;
    %do j = &nbin %to 2 %by -1;
      proc rank data = _tmp1 groups = &j out = _tmp2(keep = &y &var rank &y.2);
        var &var;
        ranks rank;
      run;

      proc summary data = _tmp2 nway missing;
        class rank;
        output out = _tmp3(drop = _type_ rename = (_freq_ = freq))
        sum(&y)   = bad    mean(&y)  = bad_rate mean(&y.2) = good_rate
        min(&var) = minx   max(&var) = maxx;
      run;

      *** CREATE FLAGS FOR MULTIPLE CRITERION ***;
      proc sql noprint;
        select
          case when min(bad_rate) > 0 then 1 else 0 end into :minflg
        from
          _tmp3;

        select
          case when max(bad_rate) < 1 then 1 else 0 end into :maxflg
        from
          _tmp3;              
      quit;

      *** CHECK IF SPEARMAN CORRELATION = 1 ***;
      %if &minflg = 1 & &maxflg = 1 %then %do;
        ods output spearmancorr = _corr(rename = (minx = cor));
        proc corr data = _tmp3 spearman;
          var minx;
          with bad_rate;
        run;

        proc sql noprint;
          select
            case when abs(cor) >= &mincor then 1 else 0 end into :cor
          from
            _corr;
        quit;

        *** IF SPEARMAN CORR = 1 THEN BREAK THE LOOP ***;
        %if &cor = 1 %then %goto loopout;
      %end;
      %else %if &nbin = 2 %then %goto exit;
    %end;

    %loopout:

    *** CALCULATE STATISTICAL SUMMARY ***;
    
    proc sql noprint;
      select 
        sum(freq) into :freq
      from
        _tmp3;
      
      select
        sum(bad) into :bad
      from
        _tmp3;

      select
        sum(bad) / sum(freq) into :lgd
      from
        _tmp3;
    quit;

    proc sort data = _tmp3 sortsize = max;
      by rank;
    run;

    data _tmp4;
      retain bin minx maxx bad freq pct bad_rate good_rate;
      set _tmp3 end = eof;
      by rank;

      if rank = . then bin = 0;
      else do;
        retain b 0;
        bin + 1;
      end;
  
      pct  = freq / &freq;
      bpct = bad / &bad;
      gpct = (freq - bad) / (&freq - &bad);
      woe  = log(bpct / gpct);
      iv   = (bpct - gpct) * woe;
      
      retain cum_bpct cum_gpct;
      cum_bpct + bpct;
      cum_gpct + gpct;
      ks = abs(cum_gpct - cum_bpct) * 100;
      
      retain iv_sum ks_max;
      iv_sum + iv;
      ks_max = max(ks_max, ks);
      if eof then do;
        call symput("bin", put(bin, 4.));
        call symput("ks", put(ks_max, 10.4));
        call symput("iv", put(iv_sum, 10.4));
      end;

      keep bin minx maxx bad freq pct bad_rate
           gpct bpct woe iv cum_gpct cum_bpct ks;
    run;

    *** REPORT STATISTICAL SUMMARY ***;
    proc printto print = outfile;
    run;

    title;
    ods listing;
    proc report data = _tmp4 spacing = 1 split = "*" headline nowindows;
      column(" * MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR %upcase(%trim(&var))"
             bin minx maxx freq pct bad_rate woe iv ks);

      define bin      /"BIN#"         width = 5  format = z3. order order = data;
      define minx     /"LOWER LIMIT"  width = 15 format = 14.4;
      define maxx     /"UPPER LIMIT"  width = 15 format = 14.4;
      define freq     /"#FREQ"        width = 10 format = 9.;
      define pct      /"DISTRIBUTION" width = 15 format = percent14.4;
      define bad_rate /"MEAN LGD"     width = 15 format = percent14.4;
      define woe      /"WOE"          width = 15 format = 14.8;
      define iv       /"INFO. VALUE"  width = 15 format = 14.8;
      define ks       /"KS"           width = 10 format = 9.4;
      compute after;
        line @1 125 * "-";
        line @10 "# TOTAL = %trim(&freq), AVERAGE LGD = %trim(&lgd), "
                 "MAX. KS = %trim(&ks), INFO. VALUE = %trim(&iv).";
        line @1 125 * "-";    
      endcomp;
    run;
    ods listing close;

    proc printto;
    run;

    proc sql noprint;
      select
        case when sum(iv) >= &miniv then 1 else 0 end into :ivflg
      from
        _tmp4;
    quit;

    *** OUTPUT RECODING FILES IF IV >= &miniv BY DEFAULT ***;
    %if &ivflg = 1 %then %do;
      data _tmp5;
        length upper $20 lower $20;
        lower = compress(put(maxx, 20.4), ' ');

        set _tmp4 end = eof;
        upper = compress(put(maxx, 20.4), ' ');
        if bin = 1 then lower = "-%trim(&bignum)";
        if eof then upper = "%trim(&bignum)";
        w%trim(&var) = compress(put(woe, 12.8), ' ');
      run;

      *** OUTPUT WOE RECODE FILE ***;
      data _null_;
        set _tmp5 end = eof;
        file woefile mod;

        if bin = 0 and _n_ = 1 then do;
          put " ";
          put @3 3 * "*"
                 " WOE RECODE OF %upcase(%trim(&var)) (KS = %trim(&ks), IV = %trim(&iv))"
                 + 1 3 * "*" ";";
          put @3  "if %trim(&var) = . then w%trim(&var) = " w%trim(&var) ";";
        end;
        if bin = 1 and _n_ = 1 then do;
          put " ";
          put @3 3 * "*"
                 " WOE RECODE OF %upcase(%trim(&var)) (KS = %trim(&ks), IV = %trim(&iv))"
                 + 1 3 * "*" ";";
          put @3 "if " lower "< %trim(&var) <= " upper
                 "then w%trim(&var) = " w%trim(&var) ";";
        end;
        if _n_ > 1 then do;
          put @5 "else if " lower "< %trim(&var) <= " upper
                 "then w%trim(&var) = " w%trim(&var) ";";
        end;
        if eof then do;
          put @5 "else w%trim(&var) = 0;";
        end;
      run;

      *** OUTPUT BINNING FORMAT FILE ***;
      data _null_;
        set _tmp5 end = eof;
        file fmtfile mod;

        if bin = 1 then lower = "LOW";
        if eof then upper = "HIGH";

        if bin = 0 and _n_ = 1 then do;
          put " ";
          put @3 3 * "*"
                 " BINNING FORMAT OF %trim(&var) (KS = %trim(&ks), IV = %trim(&IV))"
              + 1 3 * "*" ";";
          put @3 "value %trim(&var)_fmt";
          put @5 ". " @40 " = '" bin: z3.
              @47 ". MISSINGS'";
        end;

            
        if bin = 1 and _n_ = 1 then do;
          put " ";
          put @3 3 * "*"
              @5 "BINNING FORMAT OF %trim(&var) (KS = %trim(&ks), IV = %trim(&IV))"
              + 1 3 * "*" ";";
          put @3 "value %trim(&var)_fmt";
          put @5 lower @15 " - " upper  @40 " = '" bin: z3.
              @47 "." + 1 lower "- " upper "'";
        end;

        if _n_ > 1 then do;
          put @5 lower @15 "<- " upper @40 " = '" bin: z3.
              @47 "." + 1 lower "<- " upper "'";
        end;
        if eof then do;
          put @5 "OTHER" @40 " = '999. OTHERS';";
        end;
      run;

      *** OUTPUT BINNING RECODE FILE ***;
      data _null_;
        file binfile mod;
        put " ";
        put @3 "*** BINNING RECODE of %trim(&var) ***;";
        put @3 "c%trim(&var) = put(%trim(&var), %trim(&var)_fmt.);";
      run;

      *** SAVE SUMMARY OF EACH VARIABLE INTO A TABLE ***;
      %if %sysfunc(exist(work._result)) %then %do;
        data _result;
          format variable $32. bin 3. ks 10.4 iv 10.4;
          if _n_ = 1 then do;
            variable = "%trim(&var)";
            bin      = &bin;
            ks       = &ks;
            iv       = &iv;
            output;
          end;
          set _result;
          output;
        run;
      %end;
      %else %do;
        data _result;
          format variable $32. bin 3. ks 10.4 iv 10.4;
          variable = "%trim(&var)";
          bin      = &bin;
          ks       = &ks;
          iv       = &iv;
        run;        
      %end;
    %end;

    %exit:

    *** CLEAN UP TEMPORARY TABLES ***;
    proc datasets library = work nolist;
      delete _tmp2 - _tmp5 _corr / memtype = data;
    run;
    quit;
  %end;    
%end;

*** SORT VARIABLES BY KS AND OUTPUT RESULTS ***;
proc sort data = _result sortsize = max;
  by descending iv descending ks;
run;

data _null_;
  set _result end = eof;
  file sumfile;

  if _n_ = 1 then do;
    put @1 80 * "-";
    put @1  "| RANK" @10 "| VARIABLE RANKED BY IV" @45 "| # BINS"
        @55 "|  KS"  @66 "| INFO. VALUE" @80 "|";
    put @1 80 * "-";
  end;
  put @1  "| " @4  _n_ z3. @10 "| " @12 variable @45 "| " @50 bin
      @55 "| " @57 ks      @66 "| " @69 iv       @80 "|";
  if eof then do;
    put @1 80 * "-";
  end;
run;

proc datasets library = work nolist;
  delete _result (mt = data);
run;
quit;

*********************************************************;
*                   END OF MACRO                        *;
*********************************************************;

%mend lgd_nwoe;

With this macro, I tested the same data again but without the binary conversion proposed by Anthony and Naeem. The output is also pasted below for comparison.

                                                                                                                            
                                 MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR X

  BIN#     LOWER LIMIT     UPPER LIMIT      #FREQ    DISTRIBUTION        MEAN LGD             WOE     INFO. VALUE         KS
 ---------------------------------------------------------------------------------------------------------------------------
   001          9.5300         84.7200       4060       11.1102%        38.9246%      -0.17266201      0.00326518     1.8911
   002         84.7300         91.9100       4068       11.1321%        39.4665%      -0.14992613      0.00247205     3.5399
   003         91.9132         97.3100       4053       11.0910%        41.5962%      -0.06155066      0.00041827     4.2195
   004         97.3146        101.8100       4060       11.1102%        42.0504%      -0.04288377      0.00020368     4.6945
   005        101.8162        105.2221       4061       11.1129%        42.4538%      -0.02634960      0.00007701     4.9867
   006        105.2223        110.1642       4060       11.1102%        42.9896%      -0.00445666      0.00000221     5.0362
   007        110.1700        115.7000       4066       11.1266%        45.0653%       0.07978905      0.00071190     4.1440
   008        115.7029        123.0886       4055       11.0965%        46.3687%       0.13231366      0.00195768     2.6644
   009        123.1100        150.0000       4060       11.1102%        48.9800%       0.23701619      0.00631510     0.0000
-----------------------------------------------------------------------------------------------------------------------------
         # TOTAL = 36543, AVERAGE LGD = 0.430988, MAX. KS = 5.0362, INFO. VALUE = 0.0154.                                    
-----------------------------------------------------------------------------------------------------------------------------

The comparison shows that the information from the two implementations is almost identical. For instance, the information value = 0.0154 in both cases. However, the frequency in the second output reflects the correct sample size, e.g. 36,543. As a result, the computation is more efficient and the run time is much shorter. Considering that usually hundreds of variables need to be looped through this WoE transformation one by one, the approach without duplicating cases and converting to binary might make more practical sense.
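
The same no-duplication logic is easy to reproduce outside of SAS by treating the sum of the fractional response within each bin as the "bad" count and the sum of its complement as the "good" count, which is what the lgd_nwoe macro does. Below is a minimal Python sketch under that assumption, with hypothetical column names bin and lgd for pre-assigned bins and the unity-interval response.

import numpy as np
import pandas as pd

def lgd_woe(df, bin_col = "bin", y = "lgd"):
  # within each bin, sum(lgd) plays the role of bads and sum(1 - lgd) of goods,
  # so no case duplication or binary conversion is needed
  grp = df.groupby(bin_col)[y].agg(["sum", "count"])
  bads = grp["sum"]
  goods = grp["count"] - grp["sum"]
  bpct = bads / bads.sum()
  gpct = goods / goods.sum()
  woe = np.log(bpct / gpct)
  iv = (bpct - gpct) * woe
  return pd.DataFrame({"woe": woe, "iv": iv}), iv.sum()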

A SAS Macro Implementing Monotonic WOE Transformation in Scorecard Development

This SAS macro was specifically designed for model developers to do univariate variable importance ranking and monotonic weight of evidence (WOE) transformation for potentially hundreds of predictors in scorecard development. Please feel free to use or distribute it at your own risk. I would really appreciate it if you could share with me your success stories of using this macro in your model development.

%macro num_woe(data = , y = , x = );
***********************************************************;
* THE SAS MACRO IS TO PERFORM UNIVARIATE IMPORTANCE RANK  *;
* ORDER AND MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION   *;
* FOR NUMERIC ATTRIBUTES IN PRE-MODELING DATA PROCESSING  *;
* (IT IS RECOMMENDED TO RUN THIS MACRO IN THE BATCH MODE) *;
* ======================================================= *;
* PARAMETERS:                                             *;
*  DATA: INPUT SAS DATA TABLE                             *;
*  Y   : RESPONSE VARIABLE WITH 0/1 VALUE                 *;
*  X   : A LIST OF NUMERIC ATTRIBUTES                     *;
* ======================================================= *;
* OUTPUTS:                                                *;
*  MONO_WOE.WOE: A FILE OF WOE TRANSFORMATION RECODING    *;
*  MONO_WOE.FMT: A FILE OF BINNING FORMAT                 *;
*  MONO_WOE.PUT: A FILE OF PUT STATEMENTS FOR *.FMT FILE  *;
*  MONO_WOE.SUM: A FILE WITH PREDICTABILITY SUMMARY       *;
*  MONO_WOE.OUT: A FILE WITH STATISTICAL DETAILS          *;
*  MONO_WOE.IMP: A FILE OF MISSING IMPUTATION RECODING    *;
* ======================================================= *;
* CONTACT:                                                *;
*  WENSUI.LIU@53.COM                                      *;
***********************************************************;

options nocenter nonumber nodate mprint mlogic symbolgen
        orientation = landscape ls = 150;

*** DEFAULT PARAMETERS ***;

%local maxbin minbad miniv bignum;

%let maxbin = 100;

%let minbad = 50;

%let miniv  = 0.03;

%let bignum = 1e300;

***********************************************************;
***         DO NOT CHANGE CODES BELOW THIS LINE         ***;
***********************************************************;

*** DEFAULT OUTPUT FILES ***;

* WOE RECODING FILE                     *;
filename woefile "MONO_WOE.WOE";

* FORMAT FOR BINNING                    *;
filename fmtfile "MONO_WOE.FMT";

* PUT STATEMENT TO USE FORMAT           *;
filename binfile "MONO_WOE.PUT";

* KS SUMMARY                            *;
filename sumfile "MONO_WOE.SUM";
 
* STATISTICAL SUMMARY FOR EACH VARIABLE *;
filename outfile "MONO_WOE.OUT";

* IMPUTE RECODING FILE                  *;
filename impfile "MONO_WOE.IMP";

*** A MACRO TO DELETE FILE ***;
%macro dfile(file = );
  data _null_;
    rc = fdelete("&file");
    if rc = 0 then do;
      put @1 50 * "+";
      put "THE EXISTED OUTPUT FILE HAS BEEN DELETED.";
      put @1 50 * "+";
    end;
  run;
%mend dfile;

*** CLEAN UP FILES ***;
%dfile(file = woefile);

%dfile(file = fmtfile);

%dfile(file = binfile);

%dfile(file = sumfile);

%dfile(file = outfile);

%dfile(file = impfile);

*** PARSING THE STRING OF NUMERIC PREDICTORS ***;
ods listing close;
ods output position = _pos1;
proc contents data = &data varnum;
run;

proc sql noprint;
  select
    upcase(variable) into :x2 separated by ' '
  from
    _pos1
  where
    compress(upcase(type), ' ') = 'NUM' and
    index("%upcase(%sysfunc(compbl(&x)))", compress(upcase(variable), ' ')) > 0;


  select
    count(variable) into :xcnt
  from
    _pos1
  where
    compress(upcase(type), ' ') = 'NUM' and
    index("%upcase(%sysfunc(compbl(&x)))", compress(upcase(variable), ' ')) > 0;
quit;

data _tmp1;
  retain &x2 &y;
  set &data;
  where &Y in (1, 0);
  keep &x2 &y;
run;

ods output position = _pos2;
proc contents data = _tmp1 varnum;
run;

*** LOOP THROUGH EACH PREDICTOR ***;
%do i = 1 %to &xcnt;
    
  proc sql noprint;
    select
      upcase(variable) into :var
    from
      _pos2
    where
      num= &i;

    select
      count(distinct &var) into :xflg
    from
      _tmp1
    where
      &var ~= .;
  quit;

  proc summary data = _tmp1 nway;
    output out  = _med(drop = _type_ _freq_)
    median(&var) = med nmiss(&var) = mis;
  run;
  
  proc sql;
    select
      med into :median
    from
      _med;

    select
      mis into :nmiss
    from
      _med;

    select 
      case when count(&y) = sum(&y) then 1 else 0 end into :mis_flg1
    from
      _tmp1
    where
      &var = .;

    select
      case when sum(&y) = 0 then 1 else 0 end into :mis_flg2
    from
      _tmp1
    where
      &var = .;
  quit;

  %let nbin = %sysfunc(min(&maxbin, &xflg));

  *** CHECK IF THE NUMBER OF DISTINCT VALUES > 1 ***;
  %if &xflg > 1 %then %do;

    *** IMPUTE MISS VALUE WHEN WOE CANNOT BE CALCULATED ***;
    %if &mis_flg1 = 1 | &mis_flg2 = 1 %then %do;
      data _null_;
        file impfile mod;
        put " ";
        put @3 "*** MEDIAN IMPUTATION OF %TRIM(%UPCASE(&VAR)) (NMISS = %trim(&nmiss)) ***;";
        put @3 "IF %TRIM(%UPCASE(&VAR)) = . THEN %TRIM(%UPCASE(&VAR)) = &MEDIAN;";
      run;

      data _tmp1;
        set _tmp1;
        if &var = . then &var = &median;
      run; 
    %end;      
      
    *** LOOP THROUGH THE NUMBER OF BINS ***;
    %do j = &nbin %to 2 %by -1;
      proc rank data = _tmp1 groups = &j out = _tmp2(keep = &y &var rank);
        var &var;
        ranks rank;
      run;

      proc summary data = _tmp2 nway missing;
        class rank;
        output out = _tmp3(drop = _type_ rename = (_freq_ = freq))
        sum(&y)   = bad    mean(&y)  = bad_rate
        min(&var) = minx   max(&var) = maxx;
      run;

      *** CREATE FLAGS FOR MULTIPLE CRITERION ***;
      proc sql noprint;
        select
          case when min(bad) >= &minbad then 1 else 0 end into :badflg
        from
          _tmp3;

        select
          case when min(bad_rate) > 0 then 1 else 0 end into :minflg
        from
          _tmp3;

        select
          case when max(bad_rate) < 1 then 1 else 0 end into :maxflg
        from
          _tmp3;              
      quit;

      *** CHECK IF SPEARMAN CORRELATION = 1 ***;
      %if &badflg = 1 & &minflg = 1 & &maxflg = 1 %then %do;
        ods output spearmancorr = _corr(rename = (minx = cor));
        proc corr data = _tmp3 spearman;
          var minx;
          with bad_rate;
        run;

        proc sql noprint;
          select
            case when abs(cor) = 1 then 1 else 0 end into :cor
          from
            _corr;
        quit;

        *** IF SPEARMAN CORR = 1 THEN BREAK THE LOOP ***;
        %if &cor = 1 %then %goto loopout;
      %end;
      %else %if &nbin = 2 %then %goto exit;
    %end;

    %loopout:
    
    *** CALCULATE STATISTICAL SUMMARY ***;
    proc sql noprint;
      select 
        sum(freq) into :freq
      from
        _tmp3;

      select
        sum(bad) into :bad
      from
        _tmp3;
    quit;

    proc sort data = _tmp3 sortsize = max;
      by rank;
    run;

    data _tmp4;
      retain bin minx maxx bad freq pct bad_rate;
      set _tmp3 end = eof;
      by rank;

      if rank = . then bin = 0;
      else do;
        retain b 0;
        bin + 1;
      end;
  
      pct  = freq / &freq;
      bpct = bad / &bad;
      gpct = (freq - bad) / (&freq - &bad);
      woe  = log(bpct / gpct);
      iv   = (bpct - gpct) * woe;

      retain cum_bpct cum_gpct;
      cum_bpct + bpct;
      cum_gpct + gpct;
      ks = abs(cum_gpct - cum_bpct) * 100;

      retain iv_sum ks_max;
      iv_sum + iv;
      ks_max = max(ks_max, ks);
      if eof then do;
        call symput("bin", put(bin, 4.));
        call symput("ks", put(ks_max, 10.4));
        call symput("iv", put(iv_sum, 10.4));
      end;

      keep bin minx maxx bad freq pct bad_rate
           gpct bpct woe iv cum_gpct cum_bpct ks;
    run;

    *** REPORT STATISTICAL SUMMARY ***;
    proc printto print = outfile;
    run;

    title;
    ods listing;
    proc report data = _tmp4 spacing = 1 split = "*" headline nowindows;
      column(" * MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR %upcase(%trim(&var))"
             bin minx maxx freq pct bad bad_rate woe iv ks);

      define bin      /"BIN*LEVEL"   width = 5  format = z3. order order = data;
      define minx     /"LOWER*LIMIT" width = 15 format = 14.4;
      define maxx     /"UPPER*LIMIT" width = 15 format = 14.4;
      define bad      /"#BADS*(Y=1)" width = 8  format = 7.;
      define freq     /"#FREQ"       width = 10 format = 9.;
      define pct      /"PERCENT"     width = 8  format = percent8.2;
      define bad_rate /"BAD*RATE"    width = 8  format = percent8.2;
      define woe      /"WOE"         width = 10 format = 9.4;
      define iv       /"INFO.*VALUE" width = 10 format = 9.4;
      define ks       /"KS"          width = 10 format = 9.4;
      compute after;
        line @1 110 * "-";
        line @5 "# TOTAL = %trim(&freq), # BADs(Y=1) = %trim(&bad), "
                "OVERALL BAD RATE = %trim(%sysfunc(round(&bad / &freq * 100, 0.0001)))%, "
                "MAX. KS = %trim(&ks), INFO. VALUE = %trim(&iv).";
        line @1 110 * "-";    
      endcomp;
    run;
    ods listing close;

    proc printto;
    run;

    proc sql noprint;
      select
        case when sum(iv) >= &miniv then 1 else 0 end into :ivflg
      from
        _tmp4;
    quit;

    *** OUTPUT RECODING FILES IF IV >= &miniv BY DEFAULT ***;
    %if &ivflg = 1 %then %do;
      data _tmp5;
        length upper $20 lower $20;
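        *** THE LOWER LIMIT IS TAKEN FROM MAXX ON THE PRIOR RECORD, AS THIS ASSIGNMENT EXECUTES BEFORE THE SET STATEMENT REPLACES IT ***;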
        lower = compress(put(maxx, 20.4), ' ');

        set _tmp4 end = eof;
        upper = compress(put(maxx, 20.4), ' ');
        if bin = 1 then lower = "-%trim(&bignum)";
        if eof then upper = "%trim(&bignum)";
        w%trim(&var) = compress(put(woe, 12.8), ' ');
      run;

      *** OUTPUT WOE RECODE FILE ***;
      data _null_;
        set _tmp5 end = eof;
        file woefile mod;
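
        *** WRITE THE HEADER WITH THE FIRST RECORD, WHICH IS EITHER THE MISSING BIN (BIN = 0) OR, IF THERE ARE NO MISSING VALUES, THE FIRST NUMERIC BIN (BIN = 1) ***;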

        if bin = 0 and _n_ = 1 then do;
          put " ";
          put @3 3 * "*"
                 " WOE RECODE OF %upcase(%trim(&var)) (KS = %trim(&ks), IV = %trim(&iv))"
                 + 1 3 * "*" ";";
          put @3  "if %trim(&var) = . then w%trim(&var) = " + 1 w%trim(&var) ";";
        end;
        if bin = 1 and _n_ = 1 then do;
          put " ";
          put @3 3 * "*"
                 " WOE RECODE OF %upcase(%trim(&var)) (KS = %trim(&ks), IV = %trim(&iv))"
                 + 1 3 * "*" ";";
          put @3 "if " + 1 lower " < %trim(&var) <= " upper
                 " then w%trim(&var) = " + 1 w%trim(&var) ";";
        end;
        if _n_ > 1 then do;
          put @5 "else if " + 1 lower " < %trim(&var) <= " upper
                 " then w%trim(&var) = " + 1 w%trim(&var) ";";
        end;
        if eof then do;
          put @5 "else w%trim(&var) = 0;";
        end;
      run;

      *** OUTPUT BINNING FORMAT FILE ***;
      data _null_;
        set _tmp5 end = eof;
        file fmtfile mod;

        if bin = 1 then lower = "LOW";
        if eof then upper = "HIGH";

        if bin = 0 and _n_ = 1 then do;
          put " ";
          put @3 3 * "*"
                 " BINNING FORMAT OF %trim(&var) (KS = %trim(&ks), IV = %trim(&IV))"
              + 1 3 * "*" ";";
          put @3 "value %trim(&var)_fmt";
          put @5 ". " @40 " = '" bin: z3.
                 ". MISSINGS'";
        end;

            
        if bin = 1 and _n_ = 1 then do;
          put " ";
          put @3 3 * "*"
                 " BINNING FORMAT OF %trim(&var) (KS = %trim(&ks), IV = %trim(&IV))"
              + 1 3 * "*" ";";
          put @3 "value %trim(&var)_fmt";
          put @5 lower @15 " - " + 1 upper  @40 " = '" bin: z3.
                 ". " + 1 lower " - " + 1 upper "'";
        end;

        if _n_ > 1 then do;
          put @5 lower @15 "<- " + 1 upper @40 " = '" bin: z3.
                 ". " + 1 lower "<- " + 1 upper "'";
        end;
        if eof then do;
          put @5 "OTHER" @40 " = '999 .  OTHERS';";
        end;
      run;

      *** OUTPUT BINNING RECODE FILE ***;
      data _null_;
        file binfile mod;
        put " ";
        put @3 "*** BINNING RECODE of %trim(&var) ***;";
        put @3 "c%trim(&var) = put(%trim(&var), %trim(&var)_fmt.);";
      run;

      *** SAVE SUMMARY OF EACH VARIABLE INTO A TABLE ***;
      %if %sysfunc(exist(work._result)) %then %do;
        data _result;
          format variable $32. bin 3. ks 10.4 iv 10.4;
          if _n_ = 1 then do;
            variable = "%trim(&var)";
            bin      = &bin;
            ks       = &ks;
            iv       = &iv;
            output;
          end;
          set _result;
          output;
        run;
      %end;
      %else %do;
        data _result;
          format variable $32. bin 3. ks 10.4 iv 10.4;
          variable = "%trim(&var)";
          bin      = &bin;
          ks       = &ks;
          iv       = &iv;
        run;        
      %end;
    %end;

    %exit:

    *** CLEAN UP TEMPORARY TABLES ***;
    proc datasets library = work nolist;
      delete _tmp2 - _tmp5 _corr / memtype = data;
    run;
    quit;
  %end;    
%end;

*** SORT VARIABLES BY KS AND OUTPUT RESULTS ***;
proc sort data = _result sortsize = max;
  by descending ks descending iv;
run;

data _null_;
  set _result end = eof;
  file sumfile;

  if _n_ = 1 then do;
    put @1 80 * "-";
    put @1  "| RANK" @10 "| VARIABLE RANKED BY KS" @45 "| # BINS"
        @55 "|  KS"  @66 "| INFO. VALUE" @80 "|";
    put @1 80 * "-";
  end;
  put @1  "| " @4  _n_ z3. @10 "| " @12 variable @45 "| " @50 bin
      @55 "| " @57 ks      @66 "| " @69 iv       @80 "|";
  if eof then do;
    put @1 80 * "-";
  end;
run;

proc datasets library = work nolist;
  delete _result (mt = data);
run;
quit;

*********************************************************;
*           END OF NUM_WOE MACRO                        *;
*********************************************************;
%mend num_woe;

libname data 'D:\SAS_CODE\woe';

%let x = 
tot_derog
tot_tr
age_oldest_tr
tot_open_tr
tot_rev_tr
tot_rev_debt
tot_rev_line
rev_util
bureau_score
ltv
tot_income
;

%num_woe(data = data.accepts, y = bad, x = &x);

The macro above automatically generates 6 standard output files, each serving a different purpose throughout the scorecard development process. A sketch of how these files could be stitched together in a downstream data step is given at the end of this section.

1) “MONO_WOE.WOE” is a file of WOE transformation recode statements.

 
  *** WOE RECODE OF TOT_DEROG (KS = 20.0442, IV = 0.2480) ***;
  if TOT_DEROG = . then wTOT_DEROG =  0.64159782 ;
    else if  -1e300  < TOT_DEROG <= 0.0000  then wTOT_DEROG =  -0.55591373 ;
    else if  0.0000  < TOT_DEROG <= 2.0000  then wTOT_DEROG =  0.14404414 ;
    else if  2.0000  < TOT_DEROG <= 4.0000  then wTOT_DEROG =  0.50783799 ;
    else if  4.0000  < TOT_DEROG <= 1e300  then wTOT_DEROG =  0.64256014 ;
    else wTOT_DEROG = 0;
 
  *** WOE RECODE OF TOT_TR (KS = 16.8344, IV = 0.1307) ***;
  if TOT_TR = . then wTOT_TR =  0.64159782 ;
    else if  -1e300  < TOT_TR <= 7.0000  then wTOT_TR =  0.40925900 ;
    else if  7.0000  < TOT_TR <= 12.0000  then wTOT_TR =  0.26386662 ;
    else if  12.0000  < TOT_TR <= 18.0000  then wTOT_TR =  -0.13512611 ;
    else if  18.0000  < TOT_TR <= 25.0000  then wTOT_TR =  -0.40608173 ;
    else if  25.0000  < TOT_TR <= 1e300  then wTOT_TR =  -0.42369090 ;
    else wTOT_TR = 0;
 
  *** WOE RECODE OF AGE_OLDEST_TR (KS = 19.6163, IV = 0.2495) ***;
  if AGE_OLDEST_TR = . then wAGE_OLDEST_TR =  0.66280002 ;
    else if  -1e300  < AGE_OLDEST_TR <= 46.0000  then wAGE_OLDEST_TR =  0.66914925 ;
    else if  46.0000  < AGE_OLDEST_TR <= 77.0000  then wAGE_OLDEST_TR =  0.36328349 ;
    else if  77.0000  < AGE_OLDEST_TR <= 114.0000  then wAGE_OLDEST_TR =  0.15812827 ;
    else if  114.0000  < AGE_OLDEST_TR <= 137.0000  then wAGE_OLDEST_TR =  0.01844301 ;
    else if  137.0000  < AGE_OLDEST_TR <= 164.0000  then wAGE_OLDEST_TR =  -0.04100445 ;
    else if  164.0000  < AGE_OLDEST_TR <= 204.0000  then wAGE_OLDEST_TR =  -0.32667232 ;
    else if  204.0000  < AGE_OLDEST_TR <= 275.0000  then wAGE_OLDEST_TR =  -0.79931317 ;
    else if  275.0000  < AGE_OLDEST_TR <= 1e300  then wAGE_OLDEST_TR =  -0.89926463 ;
    else wAGE_OLDEST_TR = 0;
 
  *** WOE RECODE OF TOT_REV_TR (KS = 9.0779, IV = 0.0757) ***;
  if TOT_REV_TR = . then wTOT_REV_TR =  0.69097090 ;
    else if  -1e300  < TOT_REV_TR <= 1.0000  then wTOT_REV_TR =  0.00269270 ;
    else if  1.0000  < TOT_REV_TR <= 3.0000  then wTOT_REV_TR =  -0.14477602 ;
    else if  3.0000  < TOT_REV_TR <= 1e300  then wTOT_REV_TR =  -0.15200275 ;
    else wTOT_REV_TR = 0;
 
  *** WOE RECODE OF TOT_REV_DEBT (KS = 8.5317, IV = 0.0629) ***;
  if TOT_REV_DEBT = . then wTOT_REV_DEBT =  0.68160936 ;
    else if  -1e300  < TOT_REV_DEBT <= 3009.0000  then wTOT_REV_DEBT =  0.04044249 ;
    else if  3009.0000  < TOT_REV_DEBT <= 1e300  then wTOT_REV_DEBT =  -0.19723686 ;
    else wTOT_REV_DEBT = 0;
 
  *** WOE RECODE OF TOT_REV_LINE (KS = 25.5174, IV = 0.3970) ***;
  if TOT_REV_LINE = . then wTOT_REV_LINE =  0.68160936 ;
    else if  -1e300  < TOT_REV_LINE <= 1477.0000  then wTOT_REV_LINE =  0.73834416 ;
    else if  1477.0000  < TOT_REV_LINE <= 4042.0000  then wTOT_REV_LINE =  0.34923628 ;
    else if  4042.0000  < TOT_REV_LINE <= 8350.0000  then wTOT_REV_LINE =  0.11656236 ;
    else if  8350.0000  < TOT_REV_LINE <= 14095.0000  then wTOT_REV_LINE =  0.03996934 ;
    else if  14095.0000  < TOT_REV_LINE <= 23419.0000  then wTOT_REV_LINE =  -0.49492745 ;
    else if  23419.0000  < TOT_REV_LINE <= 38259.0000  then wTOT_REV_LINE =  -0.94090721 ;
    else if  38259.0000  < TOT_REV_LINE <= 1e300  then wTOT_REV_LINE =  -1.22174118 ;
    else wTOT_REV_LINE = 0;
 
  *** WOE RECODE OF REV_UTIL (KS = 14.3262, IV = 0.0834) ***;
  if  -1e300  < REV_UTIL <= 29.0000  then wREV_UTIL =  -0.31721190 ;
    else if  29.0000  < REV_UTIL <= 1e300  then wREV_UTIL =  0.26459777 ;
    else wREV_UTIL = 0;
 
  *** WOE RECODE OF BUREAU_SCORE (KS = 34.1481, IV = 0.7251) ***;
  if BUREAU_SCORE = . then wBUREAU_SCORE =  0.66280002 ;
    else if  -1e300  < BUREAU_SCORE <= 653.0000  then wBUREAU_SCORE =  0.93490359 ;
    else if  653.0000  < BUREAU_SCORE <= 692.0000  then wBUREAU_SCORE =  0.07762676 ;
    else if  692.0000  < BUREAU_SCORE <= 735.0000  then wBUREAU_SCORE =  -0.58254635 ;
    else if  735.0000  < BUREAU_SCORE <= 1e300  then wBUREAU_SCORE =  -1.61790566 ;
    else wBUREAU_SCORE = 0;
 
  *** WOE RECODE OF LTV (KS = 16.3484, IV = 0.1625) ***;
  if  -1e300  < LTV <= 82.0000  then wLTV =  -0.84674934 ;
    else if  82.0000  < LTV <= 91.0000  then wLTV =  -0.43163689 ;
    else if  91.0000  < LTV <= 97.0000  then wLTV =  -0.14361551 ;
    else if  97.0000  < LTV <= 101.0000  then wLTV =  0.08606320 ;
    else if  101.0000  < LTV <= 107.0000  then wLTV =  0.18554122 ;
    else if  107.0000  < LTV <= 115.0000  then wLTV =  0.22405397 ;
    else if  115.0000  < LTV <= 1e300  then wLTV =  0.51906325 ;
    else wLTV = 0;

2) “MONO_WOE.FMT” is a file of binning formats.

 
  *** BINNING FORMAT OF TOT_DEROG (KS = 20.0442, IV = 0.2480) ***;
  value TOT_DEROG_fmt
    .                                   = '000 . MISSINGS'
    LOW       <-  0.0000                = '001 .  LOW <-  0.0000 '
    0.0000    <-  2.0000                = '002 .  0.0000 <-  2.0000 '
    2.0000    <-  4.0000                = '003 .  2.0000 <-  4.0000 '
    4.0000    <-  HIGH                  = '004 .  4.0000 <-  HIGH '
    OTHER                               = '999 .  OTHERS';
 
  *** BINNING FORMAT OF TOT_TR (KS = 16.8344, IV = 0.1307) ***;
  value TOT_TR_fmt
    .                                   = '000 . MISSINGS'
    LOW       <-  7.0000                = '001 .  LOW <-  7.0000 '
    7.0000    <-  12.0000               = '002 .  7.0000 <-  12.0000 '
    12.0000   <-  18.0000               = '003 .  12.0000 <-  18.0000 '
    18.0000   <-  25.0000               = '004 .  18.0000 <-  25.0000 '
    25.0000   <-  HIGH                  = '005 .  25.0000 <-  HIGH '
    OTHER                               = '999 .  OTHERS';
 
  *** BINNING FORMAT OF AGE_OLDEST_TR (KS = 19.6163, IV = 0.2495) ***;
  value AGE_OLDEST_TR_fmt
    .                                   = '000 . MISSINGS'
    LOW       <-  46.0000               = '001 .  LOW <-  46.0000 '
    46.0000   <-  77.0000               = '002 .  46.0000 <-  77.0000 '
    77.0000   <-  114.0000              = '003 .  77.0000 <-  114.0000 '
    114.0000  <-  137.0000              = '004 .  114.0000 <-  137.0000 '
    137.0000  <-  164.0000              = '005 .  137.0000 <-  164.0000 '
    164.0000  <-  204.0000              = '006 .  164.0000 <-  204.0000 '
    204.0000  <-  275.0000              = '007 .  204.0000 <-  275.0000 '
    275.0000  <-  HIGH                  = '008 .  275.0000 <-  HIGH '
    OTHER                               = '999 .  OTHERS';
 
  *** BINNING FORMAT OF TOT_REV_TR (KS = 9.0779, IV = 0.0757) ***;
  value TOT_REV_TR_fmt
    .                                   = '000 . MISSINGS'
    LOW       <-  1.0000                = '001 .  LOW <-  1.0000 '
    1.0000    <-  3.0000                = '002 .  1.0000 <-  3.0000 '
    3.0000    <-  HIGH                  = '003 .  3.0000 <-  HIGH '
    OTHER                               = '999 .  OTHERS';
 
  *** BINNING FORMAT OF TOT_REV_DEBT (KS = 8.5317, IV = 0.0629) ***;
  value TOT_REV_DEBT_fmt
    .                                   = '000 . MISSINGS'
    LOW       <-  3009.0000             = '001 .  LOW <-  3009.0000 '
    3009.0000 <-  HIGH                  = '002 .  3009.0000 <-  HIGH '
    OTHER                               = '999 .  OTHERS';
 
  *** BINNING FORMAT OF TOT_REV_LINE (KS = 25.5174, IV = 0.3970) ***;
  value TOT_REV_LINE_fmt
    .                                   = '000 . MISSINGS'
    LOW       <-  1477.0000             = '001 .  LOW <-  1477.0000 '
    1477.0000 <-  4042.0000             = '002 .  1477.0000 <-  4042.0000 '
    4042.0000 <-  8350.0000             = '003 .  4042.0000 <-  8350.0000 '
    8350.0000 <-  14095.0000            = '004 .  8350.0000 <-  14095.0000 '
    14095.0000<-  23419.0000            = '005 .  14095.0000 <-  23419.0000 '
    23419.0000<-  38259.0000            = '006 .  23419.0000 <-  38259.0000 '
    38259.0000<-  HIGH                  = '007 .  38259.0000 <-  HIGH '
    OTHER                               = '999 .  OTHERS';
 
  *** BINNING FORMAT OF REV_UTIL (KS = 14.3262, IV = 0.0834) ***;
  value REV_UTIL_fmt
    LOW        -  29.0000               = '001 .  LOW  -  29.0000 '
    29.0000   <-  HIGH                  = '002 .  29.0000 <-  HIGH '
    OTHER                               = '999 .  OTHERS';
 
  *** BINNING FORMAT OF BUREAU_SCORE (KS = 34.1481, IV = 0.7251) ***;
  value BUREAU_SCORE_fmt
    .                                   = '000 . MISSINGS'
    LOW       <-  653.0000              = '001 .  LOW <-  653.0000 '
    653.0000  <-  692.0000              = '002 .  653.0000 <-  692.0000 '
    692.0000  <-  735.0000              = '003 .  692.0000 <-  735.0000 '
    735.0000  <-  HIGH                  = '004 .  735.0000 <-  HIGH '
    OTHER                               = '999 .  OTHERS';
 
  *** BINNING FORMAT OF LTV (KS = 16.3484, IV = 0.1625) ***;
  value LTV_fmt
    LOW        -  82.0000               = '001 .  LOW  -  82.0000 '
    82.0000   <-  91.0000               = '002 .  82.0000 <-  91.0000 '
    91.0000   <-  97.0000               = '003 .  91.0000 <-  97.0000 '
    97.0000   <-  101.0000              = '004 .  97.0000 <-  101.0000 '
    101.0000  <-  107.0000              = '005 .  101.0000 <-  107.0000 '
    107.0000  <-  115.0000              = '006 .  107.0000 <-  115.0000 '
    115.0000  <-  HIGH                  = '007 .  115.0000 <-  HIGH '
    OTHER                               = '999 .  OTHERS';

3) “MONO_WOE.PUT” is a file of PUT statements applying the binning formats defined in the *.FMT file above (a quick sanity check using these bins is sketched after the listing below).

  *** BINNING RECODE of TOT_DEROG ***;
  cTOT_DEROG = put(TOT_DEROG, TOT_DEROG_fmt.);

  *** BINNING RECODE of TOT_TR ***;
  cTOT_TR = put(TOT_TR, TOT_TR_fmt.);

  *** BINNING RECODE of AGE_OLDEST_TR ***;
  cAGE_OLDEST_TR = put(AGE_OLDEST_TR, AGE_OLDEST_TR_fmt.);

  *** BINNING RECODE of TOT_REV_TR ***;
  cTOT_REV_TR = put(TOT_REV_TR, TOT_REV_TR_fmt.);

  *** BINNING RECODE of TOT_REV_DEBT ***;
  cTOT_REV_DEBT = put(TOT_REV_DEBT, TOT_REV_DEBT_fmt.);

  *** BINNING RECODE of TOT_REV_LINE ***;
  cTOT_REV_LINE = put(TOT_REV_LINE, TOT_REV_LINE_fmt.);

  *** BINNING RECODE of REV_UTIL ***;
  cREV_UTIL = put(REV_UTIL, REV_UTIL_fmt.);

  *** BINNING RECODE of BUREAU_SCORE ***;
  cBUREAU_SCORE = put(BUREAU_SCORE, BUREAU_SCORE_fmt.);

  *** BINNING RECODE of LTV ***;
  cLTV = put(LTV, LTV_fmt.);
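
As mentioned above, one quick sanity check is to tabulate the character bins created by these PUT statements and compare the counts with the bin frequencies reported in the *.OUT file. Below is a minimal sketch; the file paths are assumptions mirroring the libname used later, and the data set name _chk is arbitrary.

proc format;
  %include "D:\SAS_CODE\woe\mono_woe.fmt";
run;

data _chk;
  set data.accepts;
  %include "D:\SAS_CODE\woe\mono_woe.put";
run;

proc freq data = _chk;
  tables cTOT_DEROG / missing;
run;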

4) “MONO_WOE.SUM” is a file summarizing the predictive power of all numeric variables in terms of the KS statistic and Information Value.

--------------------------------------------------------------------------------
| RANK   | VARIABLE RANKED BY KS            | # BINS  |  KS      | INFO. VALUE |
--------------------------------------------------------------------------------
|  001   | BUREAU_SCORE                     |    4    | 34.1481  |  0.7251     |
|  002   | TOT_REV_LINE                     |    7    | 25.5174  |  0.3970     |
|  003   | TOT_DEROG                        |    4    | 20.0442  |  0.2480     |
|  004   | AGE_OLDEST_TR                    |    8    | 19.6163  |  0.2495     |
|  005   | TOT_TR                           |    5    | 16.8344  |  0.1307     |
|  006   | LTV                              |    7    | 16.3484  |  0.1625     |
|  007   | REV_UTIL                         |    2    | 14.3262  |  0.0834     |
|  008   | TOT_REV_TR                       |    3    | 9.0779   |  0.0757     |
|  009   | TOT_REV_DEBT                     |    2    | 8.5317   |  0.0629     |
--------------------------------------------------------------------------------

5) “MONO_WOE.OUT” is a file providing statistical summaries of all numeric variables.

                          MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR TOT_DEROG                          
   BIN           LOWER           UPPER                        #BADS      BAD                 INFO.           
 LEVEL           LIMIT           LIMIT      #FREQ  PERCENT    (Y=1)     RATE        WOE      VALUE         KS
 --------------------------------------------------------------------------------------------------------------
   000           .               .            213   3.65%        70  32.86%      0.6416     0.0178     2.7716
   001          0.0000          0.0000       2850  48.83%       367  12.88%     -0.5559     0.1268    20.0442
   002          1.0000          2.0000       1369  23.45%       314  22.94%      0.1440     0.0051    16.5222
   003          3.0000          4.0000        587  10.06%       176  29.98%      0.5078     0.0298    10.6623
   004          5.0000         32.0000        818  14.01%       269  32.89%      0.6426     0.0685     0.0000
--------------------------------------------------------------------------------------------------------------                                        
    # TOTAL = 5837, # BADs(Y=1) = 1196, OVERALL BAD RATE = 20.49%, MAX. KS = 20.0442, INFO. VALUE = 0.2480.                                           
--------------------------------------------------------------------------------------------------------------                                        
                                                                                                             
                            MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR TOT_TR                           
   BIN           LOWER           UPPER                        #BADS      BAD                 INFO.           
 LEVEL           LIMIT           LIMIT      #FREQ  PERCENT    (Y=1)     RATE        WOE      VALUE         KS
 --------------------------------------------------------------------------------------------------------------
   000           .               .            213   3.65%        70  32.86%      0.6416     0.0178     2.7716
   001          0.0000          7.0000       1159  19.86%       324  27.96%      0.4093     0.0372    11.8701
   002          8.0000         12.0000       1019  17.46%       256  25.12%      0.2639     0.0131    16.8344
   003         13.0000         18.0000       1170  20.04%       215  18.38%     -0.1351     0.0035    14.2335
   004         19.0000         25.0000       1126  19.29%       165  14.65%     -0.4061     0.0281     7.3227
   005         26.0000         77.0000       1150  19.70%       166  14.43%     -0.4237     0.0310     0.0000
--------------------------------------------------------------------------------------------------------------                                        
    # TOTAL = 5837, # BADs(Y=1) = 1196, OVERALL BAD RATE = 20.49%, MAX. KS = 16.8344, INFO. VALUE = 0.1307.                                           
--------------------------------------------------------------------------------------------------------------                                        
                                                                                                             
                        MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR AGE_OLDEST_TR                        
   BIN           LOWER           UPPER                        #BADS      BAD                 INFO.           
 LEVEL           LIMIT           LIMIT      #FREQ  PERCENT    (Y=1)     RATE        WOE      VALUE         KS
 --------------------------------------------------------------------------------------------------------------
   000           .               .            216   3.70%        72  33.33%      0.6628     0.0193     2.9173
   001          1.0000         46.0000        708  12.13%       237  33.47%      0.6691     0.0647    12.5847
   002         47.0000         77.0000        699  11.98%       189  27.04%      0.3633     0.0175    17.3983
   003         78.0000        114.0000        703  12.04%       163  23.19%      0.1581     0.0032    19.3917
   004        115.0000        137.0000        707  12.11%       147  20.79%      0.0184     0.0000    19.6163
   005        138.0000        164.0000        706  12.10%       140  19.83%     -0.0410     0.0002    19.1263
   006        165.0000        204.0000        689  11.80%       108  15.67%     -0.3267     0.0114    15.6376
   007        205.0000        275.0000        703  12.04%        73  10.38%     -0.7993     0.0597     8.1666
   008        276.0000        588.0000        706  12.10%        67   9.49%     -0.8993     0.0734     0.0000
--------------------------------------------------------------------------------------------------------------                                        
    # TOTAL = 5837, # BADs(Y=1) = 1196, OVERALL BAD RATE = 20.49%, MAX. KS = 19.6163, INFO. VALUE = 0.2495.                                           
--------------------------------------------------------------------------------------------------------------                                        
                                                                                                             
                         MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR TOT_OPEN_TR                         
   BIN           LOWER           UPPER                        #BADS      BAD                 INFO.           
 LEVEL           LIMIT           LIMIT      #FREQ  PERCENT    (Y=1)     RATE        WOE      VALUE         KS
 --------------------------------------------------------------------------------------------------------------
   000           .               .           1416  24.26%       354  25.00%      0.2573     0.0173     6.7157
   001          0.0000          4.0000       1815  31.09%       353  19.45%     -0.0651     0.0013     4.7289
   002          5.0000          6.0000       1179  20.20%       226  19.17%     -0.0831     0.0014     3.0908
   003          7.0000         26.0000       1427  24.45%       263  18.43%     -0.1315     0.0041     0.0000
--------------------------------------------------------------------------------------------------------------                                        
    # TOTAL = 5837, # BADs(Y=1) = 1196, OVERALL BAD RATE = 20.49%, MAX. KS = 6.7157, INFO. VALUE = 0.0240.                                            
--------------------------------------------------------------------------------------------------------------                                        
                                                                                                             
                          MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR TOT_REV_TR                         
   BIN           LOWER           UPPER                        #BADS      BAD                 INFO.           
 LEVEL           LIMIT           LIMIT      #FREQ  PERCENT    (Y=1)     RATE        WOE      VALUE         KS
 --------------------------------------------------------------------------------------------------------------
   000           .               .            636  10.90%       216  33.96%      0.6910     0.0623     9.0104
   001          0.0000          1.0000       1461  25.03%       300  20.53%      0.0027     0.0000     9.0779
   002          2.0000          3.0000       2002  34.30%       365  18.23%     -0.1448     0.0069     4.3237
   003          4.0000         24.0000       1738  29.78%       315  18.12%     -0.1520     0.0066     0.0000
--------------------------------------------------------------------------------------------------------------                                        
    # TOTAL = 5837, # BADs(Y=1) = 1196, OVERALL BAD RATE = 20.49%, MAX. KS = 9.0779, INFO. VALUE = 0.0757.                                            
--------------------------------------------------------------------------------------------------------------                                        
                                                                                                             
                         MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR TOT_REV_DEBT                        
   BIN           LOWER           UPPER                        #BADS      BAD                 INFO.           
 LEVEL           LIMIT           LIMIT      #FREQ  PERCENT    (Y=1)     RATE        WOE      VALUE         KS
 --------------------------------------------------------------------------------------------------------------
   000           .               .            477   8.17%       161  33.75%      0.6816     0.0453     6.6527
   001          0.0000       3009.0000       2680  45.91%       567  21.16%      0.0404     0.0008     8.5317
   002       3010.0000      96260.0000       2680  45.91%       468  17.46%     -0.1972     0.0168     0.0000
--------------------------------------------------------------------------------------------------------------                                        
    # TOTAL = 5837, # BADs(Y=1) = 1196, OVERALL BAD RATE = 20.49%, MAX. KS = 8.5317, INFO. VALUE = 0.0629.                                            
--------------------------------------------------------------------------------------------------------------                                        
                                                                                                             
                         MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR TOT_REV_LINE                        
   BIN           LOWER           UPPER                        #BADS      BAD                 INFO.           
 LEVEL           LIMIT           LIMIT      #FREQ  PERCENT    (Y=1)     RATE        WOE      VALUE         KS
 --------------------------------------------------------------------------------------------------------------
   000           .               .            477   8.17%       161  33.75%      0.6816     0.0453     6.6527
   001          0.0000       1477.0000        765  13.11%       268  35.03%      0.7383     0.0864    18.3518
   002       1481.0000       4042.0000        766  13.12%       205  26.76%      0.3492     0.0176    23.4043
   003       4044.0000       8350.0000        766  13.12%       172  22.45%      0.1166     0.0018    24.9867
   004       8360.0000      14095.0000        766  13.12%       162  21.15%      0.0400     0.0002    25.5174
   005      14100.0000      23419.0000        766  13.12%       104  13.58%     -0.4949     0.0276    19.9488
   006      23427.0000      38259.0000        766  13.12%        70   9.14%     -0.9409     0.0860    10.8049
   007      38300.0000     205395.0000        765  13.11%        54   7.06%     -1.2217     0.1320     0.0000
--------------------------------------------------------------------------------------------------------------                                        
    # TOTAL = 5837, # BADs(Y=1) = 1196, OVERALL BAD RATE = 20.49%, MAX. KS = 25.5174, INFO. VALUE = 0.3970.                                           
--------------------------------------------------------------------------------------------------------------                                        
                                                                                                             
                           MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR REV_UTIL                          
   BIN           LOWER           UPPER                        #BADS      BAD                 INFO.           
 LEVEL           LIMIT           LIMIT      #FREQ  PERCENT    (Y=1)     RATE        WOE      VALUE         KS
 --------------------------------------------------------------------------------------------------------------
   001          0.0000         29.0000       2905  49.77%       459  15.80%     -0.3172     0.0454    14.3262
   002         30.0000        100.0000       2932  50.23%       737  25.14%      0.2646     0.0379     0.0000
--------------------------------------------------------------------------------------------------------------                                        
    # TOTAL = 5837, # BADs(Y=1) = 1196, OVERALL BAD RATE = 20.49%, MAX. KS = 14.3262, INFO. VALUE = 0.0834.                                           
--------------------------------------------------------------------------------------------------------------                                        
                                                                                                             
                         MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR BUREAU_SCORE                        
   BIN           LOWER           UPPER                        #BADS      BAD                 INFO.           
 LEVEL           LIMIT           LIMIT      #FREQ  PERCENT    (Y=1)     RATE        WOE      VALUE         KS
 --------------------------------------------------------------------------------------------------------------
   000           .               .            315   5.40%       105  33.33%      0.6628     0.0282     4.2544
   001        443.0000        653.0000       1393  23.86%       552  39.63%      0.9349     0.2621    32.2871
   002        654.0000        692.0000       1368  23.44%       298  21.78%      0.0776     0.0014    34.1481
   003        693.0000        735.0000       1383  23.69%       174  12.58%     -0.5825     0.0670    22.6462
   004        736.0000        848.0000       1378  23.61%        67   4.86%     -1.6179     0.3664     0.0000
--------------------------------------------------------------------------------------------------------------                                        
    # TOTAL = 5837, # BADs(Y=1) = 1196, OVERALL BAD RATE = 20.49%, MAX. KS = 34.1481, INFO. VALUE = 0.7251.                                           
--------------------------------------------------------------------------------------------------------------                                        
                                                                                                             
                             MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR LTV                             
   BIN           LOWER           UPPER                        #BADS      BAD                 INFO.           
 LEVEL           LIMIT           LIMIT      #FREQ  PERCENT    (Y=1)     RATE        WOE      VALUE         KS
 --------------------------------------------------------------------------------------------------------------
   001          0.0000         82.0000        814  13.95%        81   9.95%     -0.8467     0.0764     9.0214
   002         83.0000         91.0000        837  14.34%       120  14.34%     -0.4316     0.0234    14.4372
   003         92.0000         97.0000        811  13.89%       148  18.25%     -0.1436     0.0027    16.3484
   004         98.0000        101.0000        830  14.22%       182  21.93%      0.0861     0.0011    15.0935
   005        102.0000        107.0000        870  14.90%       206  23.68%      0.1855     0.0054    12.1767
   006        108.0000        115.0000        808  13.84%       197  24.38%      0.2241     0.0074     8.8704
   007        116.0000        176.0000        867  14.85%       262  30.22%      0.5191     0.0460     0.0000
--------------------------------------------------------------------------------------------------------------                                        
    # TOTAL = 5837, # BADs(Y=1) = 1196, OVERALL BAD RATE = 20.49%, MAX. KS = 16.3484, INFO. VALUE = 0.1625.                                           
--------------------------------------------------------------------------------------------------------------                                        
                                                                                                             
                          MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR TOT_INCOME                         
   BIN           LOWER           UPPER                        #BADS      BAD                 INFO.           
 LEVEL           LIMIT           LIMIT      #FREQ  PERCENT    (Y=1)     RATE        WOE      VALUE         KS
 --------------------------------------------------------------------------------------------------------------
   000           .               .              5   0.09%         1  20.00%     -0.0303     0.0000     0.0026
   001          0.0000       3397.0000       2913  49.91%       669  22.97%      0.1457     0.0111     7.5822
   002       3400.0000    8147166.6600       2919  50.01%       526  18.02%     -0.1591     0.0121     0.0000
--------------------------------------------------------------------------------------------------------------                                        
    # TOTAL = 5837, # BADs(Y=1) = 1196, OVERALL BAD RATE = 20.49%, MAX. KS = 7.5822, INFO. VALUE = 0.0231.                                            
--------------------------------------------------------------------------------------------------------------                                        

6) “MONO_WOE.IMP” is a file of median-imputation statements used when the number of bads or goods among the missing values is not sufficient to calculate the WOE.

  *** MEDIAN IMPUTATION OF LTV (NMISS = 1) ***;
  IF LTV = . THEN LTV =      100;
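
With all six files in place, they can be pulled into downstream data steps via %include, e.g. to score a data set with the WOE transformations and bin labels. The sketch below is illustrative only: the physical paths are assumptions and should be aligned with the filerefs (impfile, woefile, fmtfile, binfile) assigned at the top of the macro, and the output data set name accepts_woe is arbitrary.

*** A MINIMAL SKETCH OF CONSUMING THE GENERATED FILES (PATHS ARE ASSUMPTIONS) ***;

proc format;
  %include "D:\SAS_CODE\woe\mono_woe.fmt";
run;

data data.accepts_woe;
  set data.accepts;
  %include "D:\SAS_CODE\woe\mono_woe.imp";   * MEDIAN IMPUTATION FIRST, E.G. LTV;
  %include "D:\SAS_CODE\woe\mono_woe.woe";   * NUMERIC WOE RECODES, E.G. wBUREAU_SCORE;
  %include "D:\SAS_CODE\woe\mono_woe.put";   * CHARACTER BIN LABELS, E.G. cBUREAU_SCORE;
run;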