Binning with Weights

After working on the MOB package, I received requests from multiple users asking whether I could write a binning function that takes a weighting scheme into consideration. It is a legitimate request from a practical standpoint. For instance, in the development of fraud detection models, we often down-sample non-fraud cases given the extremely low frequency of fraud instances. After the down-sampling, a weight value > 1 should be assigned to all non-fraud cases to reflect the fraud rate in the pre-sample data.
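As a rough illustration of that weighting, here is a minimal sketch, assuming the non-fraud cases are kept with a known sampling fraction. The counts and the names n_fraud, n_nonfraud, and keep_frac are made up for illustration and do not come from any real data. Under that assumption, the weight for a kept non-fraud case is the inverse of the sampling fraction.

```r
# Hypothetical pre-sample counts, for illustration only
n_fraud    <- 1000
n_nonfraud <- 999000
keep_frac  <- 0.01                     # assumed fraction of non-fraud cases kept

n_kept     <- n_nonfraud * keep_frac   # non-fraud cases left after down-sampling
w_nonfraud <- 1 / keep_frac            # weight for each kept non-fraud case
w_fraud    <- 1                        # all fraud cases are kept, so weight 1

# Fraud rate in the sample, in the weighted sample, and in the original data
rate_sample   <- n_fraud / (n_fraud + n_kept)
rate_weighted <- (w_fraud * n_fraud) / (w_fraud * n_fraud + w_nonfraud * n_kept)
rate_original <- n_fraud / (n_fraud + n_nonfraud)
```

With these weights, rate_weighted should agree with rate_original, while rate_sample is inflated by the down-sampling.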

While accommodating the request for weighting cases is trivial, I’d like to do a simple experiment showing what the impact of weighting might be.

– First of all, let’s apply the monotonic binning to a variable named “tot_derog”. In this unweighted binning output, KS = 18.95, IV = 0.21, and WoE values range from -0.38 to 0.64.

– In the first trial, a weight value = 5 is assigned to cases with Y = 0 and a weight value = 1 to cases with Y = 1, which corresponds to the call wts_bin(derog_bin$df, c(5, 1)) in the output below. As expected, the frequency, distribution, bad frequency, and bad rate of each bin changed. However, KS, IV, and WoE remain identical.

– In the second trial, a weight value = 1 is assigned to cases with Y = 0 and a weight value = 5 to cases with Y = 1, which corresponds to the call wts_bin(derog_bin$df, c(1, 5)) in the output below. Once again, KS, IV, and WoE are the same as in the unweighted output.

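The conclusion from this demonstration is very clear. When the weight depends only on the value of the binary Y, i.e. one weight for all cases with Y = 0 and another for all cases with Y = 1, the variable importance reflected by IV / KS and the WoE values remain identical with or without weights. This is because WoE, IV, and KS are computed from the distribution of cases across bins within each class, and a constant weight per class cancels out of those distributions. However, if you are concerned about the binning distribution and the bad rate in each bin, the function wts_bin() makes the correction and is available in the project repository (https://github.com/statcompute/MonotonicBinning).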

# Unweighted binning of tot_derog
derog_bin <- qtl_bin(df, bad, tot_derog)
derog_bin
# $df
# bin rule              freq   dist mv_cnt bad_freq bad_rate     woe     iv      ks
# 00  is.na($X)          213 0.0365    213       70   0.3286  0.6416 0.0178  2.7716
# 01  $X <= 1           3741 0.6409      0      560   0.1497 -0.3811 0.0828 18.9469
# 02  $X > 1 & $X <= 2   478 0.0819      0      121   0.2531  0.2740 0.0066 16.5222
# 03  $X > 2 & $X <= 4   587 0.1006      0      176   0.2998  0.5078 0.0298 10.6623
# 04  $X > 4             818 0.1401      0      269   0.3289  0.6426 0.0685  0.0000
# $cuts
# [1] 1 2 4

# Second trial: weight 1 for Y = 0, weight 5 for Y = 1
wts_bin(derog_bin$df, c(1, 5))
# bin rule              wt_freq wt_dist wt_bads wt_badrate  wt_woe  wt_iv   wt_ks
# 00  is.na($X)             493  0.0464     350     0.7099  0.6416 0.0178  2.7716
# 01  $X <= 1              5981  0.5631    2800     0.4681 -0.3811 0.0828 18.9469
# 02  $X > 1 & $X <= 2      962  0.0906     605     0.6289  0.2740 0.0066 16.5222
# 03  $X > 2 & $X <= 4     1291  0.1216     880     0.6816  0.5078 0.0298 10.6623
# 04  $X > 4               1894  0.1783    1345     0.7101  0.6426 0.0685  0.0000

# First trial: weight 5 for Y = 0, weight 1 for Y = 1
wts_bin(derog_bin$df, c(5, 1))
# bin rule              wt_freq wt_dist wt_bads wt_badrate  wt_woe  wt_iv   wt_ks
# 00  is.na($X)             785  0.0322      70     0.0892  0.6416 0.0178  2.7716
# 01  $X <= 1             16465  0.6748     560     0.0340 -0.3811 0.0828 18.9469
# 02  $X > 1 & $X <= 2     1906  0.0781     121     0.0635  0.2740 0.0066 16.5222
# 03  $X > 2 & $X <= 4     2231  0.0914     176     0.0789  0.5078 0.0298 10.6623
# 04  $X > 4               3014  0.1235     269     0.0893  0.6426 0.0685  0.0000
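To see why KS, IV, and WoE do not move, the weighted summary can be recomputed by hand from the bin counts. The sketch below is not the package's wts_bin(): wt_summary() is a hypothetical helper written only for this illustration, and it assumes the weight vector is ordered as c(weight for Y = 0, weight for Y = 1), which is what the wt_bads columns above suggest. The bin counts are copied from the unweighted output.

```r
# Bin counts copied from the unweighted qtl_bin() output above
bins <- data.frame(
  bin      = c("00", "01", "02", "03", "04"),
  freq     = c(213, 3741, 478, 587, 818),
  bad_freq = c(70, 560, 121, 176, 269)
)

# Hypothetical helper, not the package's wts_bin(): applies
# w = c(weight for Y = 0, weight for Y = 1) and recomputes the bin summary.
wt_summary <- function(bins, w) {
  good <- w[1] * (bins$freq - bins$bad_freq)  # weighted goods per bin
  bad  <- w[2] * bins$bad_freq                # weighted bads per bin
  pg   <- good / sum(good)  # share of goods in each bin; w[1] cancels here
  pb   <- bad / sum(bad)    # share of bads in each bin; w[2] cancels here
  woe  <- log(pb / pg)
  data.frame(
    bin        = bins$bin,
    wt_freq    = good + bad,
    wt_dist    = round((good + bad) / sum(good + bad), 4),
    wt_bads    = bad,
    wt_badrate = round(bad / (good + bad), 4),
    wt_woe     = round(woe, 4),
    wt_iv      = round((pb - pg) * woe, 4),
    wt_ks      = round(abs(cumsum(pb) - cumsum(pg)) * 100, 4)
  )
}

unweighted <- wt_summary(bins, c(1, 1))
trial_1    <- wt_summary(bins, c(5, 1))  # weight 5 for Y = 0, 1 for Y = 1
trial_2    <- wt_summary(bins, c(1, 5))  # weight 1 for Y = 0, 5 for Y = 1
```

Because pg and pb are shares within each class, the class weight appears in both the numerator and the denominator and drops out, so wt_woe, wt_iv, and wt_ks should match across the three calls. Only the columns that mix goods and bads (wt_freq, wt_dist, wt_badrate) depend on the weights.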
