In the post https://statcompute.wordpress.com/2019/04/27/more-general-weighted-binning, I’ve shown how to do the weighted binning with the function wqtl_bin() by the iterative partitioning. However, the outcome from wqtl_bin() sometimes can be too coarse. The function wgbm_bin() (https://github.com/statcompute/MonotonicBinning/blob/master/code/wgbm_bin.R) leverages the idea of gbm() that implements the Generalized Boosted Model and generates more granular weighted binning outcomes.
Below is the demonstration showing the difference between wqtl_bin() and wgbm_bin() outcomes. Even with the same data, the wgbm_bin() function is able to generate a more granular binning result and 14% higher Information Value.
I had been asked why I spent so much effort on developing SAS macros and R functions to do monotonic binning for the WoE transformation, given the availability of other cutting-edge data mining algorithms that will automatically generate the prediction with whatever predictors fed in the model. Nonetheless, what really distinguishes a good modeler from the rest is how to handle challenging data issues before feeding data in the model, including missing values, outliers, linearity, and predictability, in a scalable way that can be rolled out to hundreds or even thousands of potential model drivers in the production environment.
The WoE transformation through monotonic binning provides a convenient way to address each of aforementioned concerns.
1. Because WoE is a piecewise transformation based on the data discretization, all missing values would fall into a standalone category either by itself or to be combined with the neighbor that shares a similar event probability. As a result, the special treatment for missing values is not necessary.
2. After the monotonic binning of each variable, since the WoE value for each bin is a projection from the predictor into the response that is defined by the log ratio between event and non-event distributions, any raw value of the predictor doesn’t matter anymore and therefore the issue related to outliers would disappear.
3. While many modelers would like to use log or power transformations to achieve a good linear relationship between the predictor and log odds of the response, which is heuristic at best with no guarantee for the good outcome, the WoE transformation is strictly linear with respect to log odds of the response with the unity correlation. It is also worth mentioning that a numeric variable and its strictly monotone functions should converge to the same monotonic WoE transformation.
4. At last, because the WoE is defined as the log ratio between event and non-event distributions, it is indicative of the separation between cases with Y = 0 and cases with Y = 1. As the weighted sum of WoE values with the weight being the difference in event and non-event distributions, the IV (Information Value) is an important statistic commonly used to measure the predictor importance.
Below is a simple example showing how to use WoE transformations in the estimation of a logistic regression.
You might be wondering what motivates me spending countless weekend hours on the MOB package. The answer is plain and simple. It is users that are driving the development work.
After I published the wts_bin() function last week showing the impact of two-value weights on the monotonic binning outcome (https://statcompute.wordpress.com/2019/04/21/binning-with-weights), a question was asked if I can write a more general weighted binning function with weights being any positive value. The function wqtl_bin() is my answer (https://github.com/statcompute/MonotonicBinning/blob/master/code/wqtl_bin.R).
Below is an example demonstrating how to use the wqtl_bin() function. First of all, let’s apply the function to the case with two-value weights that was illustrated last week. As expected, statistics from both approaches are identical. In the second use case, let’s assume that weights can be any value under the Uniform distribution between 0 and 10. With positive random weights, all statistics have changed.
It is worth mentioning that, while binning rules can be the same with or without weights in some cases, it is not necessarily true in all situations, depending on the distribution of weights across the data sample. As shown in binning outcomes for “ltv” below, there are 7 bins without weights but only 5 with weights.
After working on the MOB package, I received requests from multiple users if I can write a binning function that takes the weighting scheme into consideration. It is a legitimate request from the practical standpoint. For instance, in the development of fraud detection models, we often would sample down non-fraud cases given an extremely low frequency of fraud instances. After the sample down, a weight value > 1 should be assigned to all non-fraud cases to reflect the fraud rate in the pre-sample data.
While accommodating the request for weighting cases is trivial, I’d like to do a simple experitment showing what the impact might be with the consideration of weighting.
– First of all, let’s apply the monotonic binning to a variable named “tot_derog”. In this unweighted binning output, KS = 18.94, IV = 0.21, and WoE values range from -0.38 to 0.64.
– In the first trial, a weight value = 5 is assigned to cases with Y = 0 and a weight value = 1 assigned to cases with Y = 1. As expected, frequency, distribution, bad_frequency, and bad_rate changed. However, KS, IV, and WoE remain identical.
– In the second trial, a weight value = 1 is assigned to cases with Y = 0 and a weight value = 5 assigned to cases with Y = 1. Once again, KS, IV, and WoE are still the same as the unweighted output.
The conclusion from this demonstrate is very clear. In cases of two-value weights assigned to the binary Y, the variable importance reflected by IV / KS and WoE values should remain identical with or without weights. However, if you are concerned about the binning distribution and the bad rate in each bin, the function wts_bin() should do the correction and is available in the project repository (https://github.com/statcompute/MonotonicBinning).
After wrapping up the function batch_woe() today with the purpose to allow users to apply WoE transformations to many independent variables simultaneously, I have completed the development of major functions in the MOB package that can be usable for the model development in a production setting.
The function batch_woe() basically is the wrapper around cal_woe() and has two input parameters. The “data” parameter is the data frame that we would deploy binning outcomes and the “slst” parameter is the list of multiple binning specification tables that is either the direct output from the function batch_bin or created manually by combining outputs from multiple binning functions.
There are also two components in the output of batch_woe(), a list of PSI tables for transformed variables and a data frame with a row index and all transformed variables. The default printout is a PSI summation of all input variables to be transformed. As shown below, all PSI values are below 0.1 and therefore none is concerning.
binout <- batch_bin(df, 1)
woeout <- batch_woe(df[sample(seq(nrow(df)), 2000, replace = T), ], binout$BinLst)
# tot_derog tot_tr age_oldest_tr tot_open_tr tot_rev_tr tot_rev_debt ...
# psi 0.0027 0.0044 0.0144 0.0011 3e-04 0.0013 ...
str(woeout, max.level = 1)
# List of 2
# $ psi:List of 11
# $ df :'data.frame': 2000 obs. of 12 variables:
# - attr(*, "class")= chr "psiSummary"
# idx_ woe.tot_derog woe.tot_tr woe.age_oldest_tr woe.tot_open_tr woe.tot_rev_tr ...
# 1 -0.3811 -0.0215 -0.5356 -0.0722 -0.1012 ...
All source codes of the MOB package are available on https://github.com/statcompute/MonotonicBinning and free (as free beer) to download and distribute.
In my GitHub repository (https://github.com/statcompute/MonotonicBinning), multiple R functions have been developed to implement the monotonic binning by using either iterative discretization or isotonic regression. With these functions, we can run the monotonic binning for one independent variable at a time. However, in a real-world production environment, we often would want to apply the binning algorithm to hundreds or thousands of variables at once. In addition, we might be interested in comparing different binning outcomes.
The function batch_bin() is designed to apply a monotonic binning function to all numeric variables in a data frame with the last column as the dependent variable. Currently, four binning algorithms are supported, including qtl_bin() and bad_bin() by iterative discretizations, iso_bin() by isotonic regression, and gbm_bin() by generalized boosted model. Before using these four functions, we need to save related R files in the working folder, which would be sourced by the batch_bin() function. Scripts for R functions can be downloaded from https://github.com/statcompute/MonotonicBinning/tree/master/code.
Below is the demonstrating showing how to use the batch_bin() function, which only requires two input parameters, a data frame and an integer number indicating the binning method. With method = 1, the batch_bin() function implements the iterative discretization by quantiles. With method = 4, the batch_bin() function implements the generalized boosted modelling. As shown below, both KS and IV with method = 4 are higher than with method = 1 due to more granular bins. For instance, while the method = 1 only generates 2 bins, the method = 4 can generate 11 bins.
In addition to monotonic binning algorithms introduced in my previous post (https://statcompute.wordpress.com/2019/03/10/a-summary-of-my-home-brew-binning-algorithms-for-scorecard-development), two more functions based on Generalized Boosted Regression Models have been added to my GitHub repository, gbm_bin() and gbmcv_bin().
The function gbm_bin() estimates a GBM model without the cross validation and tends to generate a more granular binning outcome.
The function gbmcv_bin() estimates a GBM model with the cross validation (CV). Therefore, it would generate a more stable but coarse binning outcome. Nonetheless, the computation is more expensive due to CV, especially for large datasets.
Motivated by the idea of my friend Talbot (https://www.linkedin.com/in/talbot-katz-b76785), I also drafted a function pava_bin() based upon the Pool Adjacent Violators Algorithm (PAVA) and compared it with the iso_bin() function based on the isotonic regression. As shown in the comparison below, there is no difference in the binning outcome. However, the computing cost of pava_bin() function is higher given that PAVA is an iterative algorithm solving for the monotonicity.