Batch Processing of Monotonic Binning

In my GitHub repository (https://github.com/statcompute/MonotonicBinning), multiple R functions have been developed to implement monotonic binning by using either iterative discretization or isotonic regression. With these functions, we can run the monotonic binning for one independent variable at a time. However, in a real-world production environment, we often want to apply the binning algorithm to hundreds or thousands of variables at once. In addition, we might be interested in comparing different binning outcomes.

The function batch_bin() is designed to apply a monotonic binning function to all numeric variables in a data frame, with the last column as the dependent variable. Four binning algorithms are currently supported: qtl_bin() and bad_bin() by iterative discretization, iso_bin() by isotonic regression, and gbm_bin() by generalized boosted model. Before using batch_bin(), we need to save the related R files in the working folder, because batch_bin() sources them from there. Scripts for the R functions can be downloaded from https://github.com/statcompute/MonotonicBinning/tree/master/code.
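
To make the "all numeric variables, last column as the dependent variable" convention concrete, here is a minimal sketch of that batch pattern. It is not the code of batch_bin(): the helper apply_to_numeric(), the toy function median_split(), and the toy data frame are hypothetical names made up for illustration, and median_split() is only a stand-in for a real binning function.

```r
# Hypothetical helper, not part of the MonotonicBinning repository.
# Applies fn(x, y) to every numeric column x, treating the LAST column as y.
apply_to_numeric <- function(data, fn) {
  y_name  <- names(data)[ncol(data)]
  x_data  <- data[, -ncol(data), drop = FALSE]
  x_names <- names(x_data)[vapply(x_data, is.numeric, logical(1))]
  out <- lapply(x_names, function(x) fn(data[[x]], data[[y_name]]))
  names(out) <- x_names
  out
}

# Toy stand-in for a binning function: bad rate at or below vs. above the median.
median_split <- function(x, y) {
  cut_pt <- median(x, na.rm = TRUE)
  grp <- ifelse(x <= cut_pt, "low", "high")
  tapply(y, grp, mean)
}

# Made-up data: two numeric predictors, one character column, and the outcome last.
toy <- data.frame(
  x1  = c(1, 2, 3, 4, 5, 6),
  grp = c("a", "b", "a", "b", "a", "b"),
  x2  = c(10, 20, 30, 40, 50, 60),
  bad = c(0, 0, 0, 1, 1, 1)
)

res <- apply_to_numeric(toy, median_split)
names(res)
```

The character column is skipped and the outcome column is never treated as a predictor, which is the same contract the paragraph above describes for batch_bin().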

Below is a demonstration of the batch_bin() function, which requires only two input parameters: a data frame and an integer indicating the binning method. With method = 1, batch_bin() implements the iterative discretization by quantiles. With method = 4, it implements the generalized boosted model. As shown below, both KS and IV are higher for every variable with method = 4 than with method = 1 because the bins are more granular. For instance, for rev_util, method = 1 generates only 2 bins, while method = 4 generates 11 bins.


head(df, 2)
# tot_derog tot_tr age_oldest_tr tot_open_tr tot_rev_tr tot_rev_debt tot_rev_line rev_util bureau_score ltv tot_income bad
#1 6 7 46 NaN NaN NaN NaN 0 747 109 4800.00 0
#2 0 21 153 6 1 97 4637 2 744 97 5833.33 0
batch_bin(df, 1)
#|var | nbin| unique| miss| min| median| max| ks| iv|
#|:-------------|----:|------:|----:|---:|-------:|-------:|-------:|------:|
#|tot_derog | 5| 29| 213| 0| 0.0| 32| 18.9469| 0.2055|
#|tot_tr | 5| 67| 213| 0| 16.0| 77| 15.7052| 0.1302|
#|age_oldest_tr | 10| 460| 216| 1| 137.0| 588| 19.9821| 0.2539|
#|tot_open_tr | 3| 26| 1416| 0| 5.0| 26| 6.7157| 0.0240|
#|tot_rev_tr | 3| 21| 636| 0| 3.0| 24| 9.0104| 0.0717|
#|tot_rev_debt | 3| 3880| 477| 0| 3009.5| 96260| 8.5102| 0.0627|
#|tot_rev_line | 9| 3617| 477| 0| 10573.0| 205395| 26.4924| 0.4077|
#|rev_util | 2| 101| 0| 0| 30.0| 100| 15.1570| 0.0930|
#|bureau_score | 12| 315| 315| 443| 692.5| 848| 34.8028| 0.7785|
#|ltv | 7| 145| 1| 0| 100.0| 176| 15.6254| 0.1538|
#|tot_income | 4| 1639| 5| 0| 3400.0| 8147167| 9.1526| 0.0500|
batch_bin(df, 1)$BinLst[["rev_util"]]$df
# bin rule freq dist mv_cnt bad_freq bad_rate woe iv ks
# 01 $X <= 31 3007 0.5152 0 472 0.1570 -0.3250 0.0493 15.157
# 02 $X > 31 2830 0.4848 0 724 0.2558 0.2882 0.0437 0.000
batch_bin(df, 4)
#|var | nbin| unique| miss| min| median| max| ks| iv|
#|:-------------|----:|------:|----:|---:|-------:|-------:|-------:|------:|
#|tot_derog | 8| 29| 213| 0| 0.0| 32| 20.0442| 0.2556|
#|tot_tr | 13| 67| 213| 0| 16.0| 77| 17.3002| 0.1413|
#|age_oldest_tr | 22| 460| 216| 1| 137.0| 588| 20.3646| 0.2701|
#|tot_open_tr | 6| 26| 1416| 0| 5.0| 26| 6.8695| 0.0274|
#|tot_rev_tr | 4| 21| 636| 0| 3.0| 24| 9.0779| 0.0789|
#|tot_rev_debt | 9| 3880| 477| 0| 3009.5| 96260| 8.8722| 0.0848|
#|tot_rev_line | 21| 3617| 477| 0| 10573.0| 205395| 26.8943| 0.4445|
#|rev_util | 11| 101| 0| 0| 30.0| 100| 16.9615| 0.1635|
#|bureau_score | 30| 315| 315| 443| 692.5| 848| 35.2651| 0.8344|
#|ltv | 17| 145| 1| 0| 100.0| 176| 16.8807| 0.1911|
#|tot_income | 17| 1639| 5| 0| 3400.0| 8147167| 10.3386| 0.0775|
batch_bin(df, 4)$BinLst[["rev_util"]]$df
# bin rule freq dist mv_cnt bad_freq bad_rate woe iv ks
# 01 $X <= 24 2653 0.4545 0 414 0.1560 -0.3320 0.0452 13.6285
# 02 $X > 24 & $X <= 36 597 0.1023 0 96 0.1608 -0.2963 0.0082 16.3969
# 03 $X > 36 & $X <= 40 182 0.0312 0 32 0.1758 -0.1890 0.0011 16.9533
# 04 $X > 40 & $X <= 58 669 0.1146 0 137 0.2048 -0.0007 0.0000 16.9615
# 05 $X > 58 & $X <= 60 77 0.0132 0 16 0.2078 0.0177 0.0000 16.9381
# 06 $X > 60 & $X <= 73 442 0.0757 0 103 0.2330 0.1647 0.0022 15.6305
# 07 $X > 73 & $X <= 75 62 0.0106 0 16 0.2581 0.2999 0.0010 15.2839
# 08 $X > 75 & $X <= 83 246 0.0421 0 70 0.2846 0.4340 0.0089 13.2233
# 09 $X > 83 & $X <= 96 376 0.0644 0 116 0.3085 0.5489 0.0225 9.1266
# 10 $X > 96 & $X <= 98 50 0.0086 0 17 0.3400 0.6927 0.0049 8.4162
# 11 $X > 98 483 0.0827 0 179 0.3706 0.8263 0.0695 0.0000
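
For readers who want to see where the woe, iv, and ks columns come from, the sketch below recomputes them from the freq and bad_freq counts of the two-bin rev_util table produced with method = 1. The formulas are an assumption inferred from the printed numbers (WoE as the log ratio of the bad to the good distribution, KS as the cumulative gap scaled by 100), not taken from the repository code.

```r
# Counts copied from the rev_util output of batch_bin(df, 1) shown above.
bins <- data.frame(
  rule     = c("$X <= 31", "$X > 31"),
  freq     = c(3007, 2830),
  bad_freq = c(472, 724)
)

good_freq <- bins$freq - bins$bad_freq
bad_dist  <- bins$bad_freq / sum(bins$bad_freq)  # share of all bads in each bin
good_dist <- good_freq / sum(good_freq)          # share of all goods in each bin

# Assumed convention, inferred from the printed values:
# WoE = log(bad share / good share), IV = (bad share - good share) * WoE,
# KS = |cumulative bad share - cumulative good share| * 100.
woe <- log(bad_dist / good_dist)
bins$woe <- round(woe, 4)
bins$iv  <- round((bad_dist - good_dist) * woe, 4)
bins$ks  <- round(abs(cumsum(bad_dist) - cumsum(good_dist)) * 100, 4)

print(bins)
sum(bins$iv)  # total of the rounded per-bin IV values
```

Under these assumed formulas the recomputed columns agree with the printed table, and the sum of the rounded per-bin IV values matches the 0.0930 reported for rev_util in the method = 1 summary.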

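Since comparing binning outcomes is one of the motivations for batch processing, the two summary tables can also be placed side by side. The sketch below types in the nbin, ks, and iv columns by hand from the printed output rather than reading them from the object returned by batch_bin(), because the name of the summary element in that object is not shown in this post. The data frame cmp and its column names are hypothetical.

```r
# Values copied by hand from the two summary tables above
# (suffix _1 = method 1, suffix _4 = method 4).
cmp <- data.frame(
  var = c("tot_derog", "tot_tr", "age_oldest_tr", "tot_open_tr", "tot_rev_tr",
          "tot_rev_debt", "tot_rev_line", "rev_util", "bureau_score", "ltv",
          "tot_income"),
  nbin_1 = c(5, 5, 10, 3, 3, 3, 9, 2, 12, 7, 4),
  nbin_4 = c(8, 13, 22, 6, 4, 9, 21, 11, 30, 17, 17),
  ks_1 = c(18.9469, 15.7052, 19.9821, 6.7157, 9.0104, 8.5102,
           26.4924, 15.1570, 34.8028, 15.6254, 9.1526),
  ks_4 = c(20.0442, 17.3002, 20.3646, 6.8695, 9.0779, 8.8722,
           26.8943, 16.9615, 35.2651, 16.8807, 10.3386),
  iv_1 = c(0.2055, 0.1302, 0.2539, 0.0240, 0.0717, 0.0627,
           0.4077, 0.0930, 0.7785, 0.1538, 0.0500),
  iv_4 = c(0.2556, 0.1413, 0.2701, 0.0274, 0.0789, 0.0848,
           0.4445, 0.1635, 0.8344, 0.1911, 0.0775)
)

# Gain from method 1 to method 4 for each variable.
cmp$ks_gain <- cmp$ks_4 - cmp$ks_1
cmp$iv_gain <- cmp$iv_4 - cmp$iv_1

# Variables ordered by IV gain, largest first.
cmp[order(-cmp$iv_gain), c("var", "nbin_1", "nbin_4", "ks_gain", "iv_gain")]
```

In this data set every variable gains in both KS and IV under method = 4, and rev_util, which goes from 2 bins to 11, shows the largest IV gain.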