After working on the MOB package, I received requests from multiple users asking whether I could write a binning function that takes a weighting scheme into consideration. It is a legitimate request from a practical standpoint. For instance, in the development of fraud detection models, we often down-sample non-fraud cases because fraud instances are extremely rare. After down-sampling, a weight value > 1 should be assigned to all non-fraud cases so that the weighted data reflect the fraud rate in the pre-sample data.
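As a minimal sketch of that weighting arithmetic, the snippet below uses made-up counts and a made-up sampling rate (none of these numbers come from a real portfolio). If a fraction `keep_rate` of the non-fraud cases is kept, a weight of `1 / keep_rate` on those cases restores the original fraud rate.

```r
# Hypothetical counts and sampling rate, for illustration only.
n_fraud    <- 1000
n_nonfraud <- 999000
keep_rate  <- 0.01                      # fraction of non-fraud cases kept

n_nonfraud_kept <- n_nonfraud * keep_rate
w_nonfraud      <- 1 / keep_rate        # weight assigned to each kept non-fraud case

# Fraud rate in the full data, in the raw sample, and in the weighted sample.
rate_full     <- n_fraud / (n_fraud + n_nonfraud)
rate_sample   <- n_fraud / (n_fraud + n_nonfraud_kept)
rate_weighted <- n_fraud / (n_fraud + w_nonfraud * n_nonfraud_kept)

# The weighted rate matches the pre-sample rate; the raw sample rate is inflated.
isTRUE(all.equal(rate_full, rate_weighted))
rate_sample > rate_full
```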
While accommodating the request for weighting cases is trivial, I’d like to run a simple experiment showing what the impact of weighting might be.
– First of all, let’s apply the monotonic binning to a variable named “tot_derog”. In this unweighted binning output, KS = 18.95, IV = 0.21, and WoE values range from -0.38 to 0.64.
– In the first trial, a weight value = 1 is assigned to cases with Y = 0 and a weight value = 5 is assigned to cases with Y = 1. As expected, frequency, distribution, bad_frequency, and bad_rate changed. However, KS, IV, and WoE remain identical.
– In the second trial, a weight value = 5 is assigned to cases with Y = 0 and a weight value = 1 is assigned to cases with Y = 1. Once again, KS, IV, and WoE are still the same as in the unweighted output.
The conclusion from this demonstration is clear. When a two-value weight is assigned according to the binary Y, the variable importance reflected by IV and KS, as well as the WoE values, remains identical with or without weights. However, if you are concerned about the bin distribution and the bad rate in each bin, the function wts_bin() makes the correction and is available in the project repository (https://github.com/statcompute/MonotonicBinning).
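A short derivation may help explain why this happens. The notation here is my own rather than the package's: $b_i$ and $g_i$ are the counts of bad and good cases in bin $i$, $B$ and $G$ are their totals, and $w_1$, $w_0$ are the weights for Y = 1 and Y = 0. The sign convention for WoE is the one consistent with the output shown in this post.

```latex
\mathrm{WoE}_i = \ln\!\left(\frac{b_i / B}{g_i / G}\right),
\qquad
\mathrm{IV} = \sum_i \left(\frac{b_i}{B} - \frac{g_i}{G}\right)\mathrm{WoE}_i,
\qquad
\mathrm{KS} = 100 \cdot \max_i \left|\sum_{j \le i}\left(\frac{b_j}{B} - \frac{g_j}{G}\right)\right|

% Class-level weights cancel inside each class distribution:
\frac{w_1 b_i}{\sum_j w_1 b_j} = \frac{b_i}{B},
\qquad
\frac{w_0 g_i}{\sum_j w_0 g_j} = \frac{g_i}{G}

% The bad rate mixes the two classes, so it does change unless w_1 = w_0:
\text{bad rate}_i = \frac{w_1 b_i}{w_1 b_i + w_0 g_i}
```

Since WoE, IV, and KS depend only on the within-class distributions $b_i / B$ and $g_i / G$, a constant weight per class drops out. This cancellation would not hold if weights varied across cases within the same class.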
derog_bin <- qtl_bin(df, bad, tot_derog)
derog_bin
# $df
# bin             rule freq   dist mv_cnt bad_freq bad_rate     woe     iv      ks
#  00        is.na($X)  213 0.0365    213       70   0.3286  0.6416 0.0178  2.7716
#  01          $X <= 1 3741 0.6409      0      560   0.1497 -0.3811 0.0828 18.9469
#  02 $X > 1 & $X <= 2  478 0.0819      0      121   0.2531  0.2740 0.0066 16.5222
#  03 $X > 2 & $X <= 4  587 0.1006      0      176   0.2998  0.5078 0.0298 10.6623
#  04           $X > 4  818 0.1401      0      269   0.3289  0.6426 0.0685  0.0000
# $cuts
# [1] 1 2 4
wts_bin(derog_bin$df, c(1, 5))
# bin             rule wt_freq wt_dist wt_bads wt_badrate  wt_woe  wt_iv   wt_ks
#  00        is.na($X)     493  0.0464     350     0.7099  0.6416 0.0178  2.7716
#  01          $X <= 1    5981  0.5631    2800     0.4681 -0.3811 0.0828 18.9469
#  02 $X > 1 & $X <= 2     962  0.0906     605     0.6289  0.2740 0.0066 16.5222
#  03 $X > 2 & $X <= 4    1291  0.1216     880     0.6816  0.5078 0.0298 10.6623
#  04           $X > 4    1894  0.1783    1345     0.7101  0.6426 0.0685  0.0000
wts_bin(derog_bin$df, c(5, 1))
# bin             rule wt_freq wt_dist wt_bads wt_badrate  wt_woe  wt_iv   wt_ks
#  00        is.na($X)     785  0.0322      70     0.0892  0.6416 0.0178  2.7716
#  01          $X <= 1   16465  0.6748     560     0.0340 -0.3811 0.0828 18.9469
#  02 $X > 1 & $X <= 2    1906  0.0781     121     0.0635  0.2740 0.0066 16.5222
#  03 $X > 2 & $X <= 4    2231  0.0914     176     0.0789  0.5078 0.0298 10.6623
#  04           $X > 4    3014  0.1235     269     0.0893  0.6426 0.0685  0.0000
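
For readers who want to check the result without the package, here is a minimal base-R sketch that recomputes the statistics from the freq and bad_freq columns of the unweighted output above. The helper `bin_stats()` is hypothetical and is not part of the MOB package or the repository; it only assumes the usual WoE, IV, and KS definitions that are consistent with the numbers shown.

```r
# Bin-level counts copied from the unweighted qtl_bin() output above.
freq  <- c(213, 3741, 478, 587, 818)
bads  <- c(70, 560, 121, 176, 269)
goods <- freq - bads

# Hypothetical helper (not from the MOB package): applies one weight per class
# and recomputes the bin-level statistics.
bin_stats <- function(goods, bads, w_good = 1, w_bad = 1) {
  g <- goods * w_good
  b <- bads * w_bad
  g_dist <- g / sum(g)                  # weighted distribution of goods across bins
  b_dist <- b / sum(b)                  # weighted distribution of bads across bins
  woe <- log(b_dist / g_dist)
  data.frame(
    wt_freq    = g + b,
    wt_badrate = b / (g + b),
    woe        = woe,
    iv         = (b_dist - g_dist) * woe,
    ks         = abs(cumsum(b_dist) - cumsum(g_dist)) * 100
  )
}

unweighted <- bin_stats(goods, bads)
weighted   <- bin_stats(goods, bads, w_good = 5, w_bad = 1)

# WoE, IV, and KS agree; weighted frequencies and bad rates do not.
cols <- c("woe", "iv", "ks")
isTRUE(all.equal(unweighted[cols], weighted[cols]))
weighted$wt_freq
```

With `w_good = 5` and `w_bad = 1`, the `wt_freq` column reproduces the weighted frequencies of the second wts_bin() call, while the woe, iv, and ks columns match the unweighted run up to floating-point tolerance.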