## Archive for the ‘**S+/R**’ Category

## DART: Dropout Regularization in Boosting Ensembles

The dropout approach developed by Hinton has been widely employed in deep learnings to prevent the deep neural network from overfitting, as shown in https://statcompute.wordpress.com/2017/01/02/dropout-regularization-in-deep-neural-networks.

In the paper http://proceedings.mlr.press/v38/korlakaivinayak15.pdf, the dropout can also be used to address the overfitting in boosting tree ensembles, e.g. MART, caused by the so-called “over-specialization”. In particular, while first few trees added at the beginning of ensembles would dominate the model performance, the rest added later can only improve the prediction for a small subset, which increases the risk of overfitting. The idea of DART is to build an ensemble by randomly dropping boosting tree members. The percentage of dropouts can determine the degree of regularization for boosting tree ensembles.

Below is a demonstration showing the implementation of DART with the R xgboost package. First of all, after importing the data, we divided it into two pieces, one for training and the other for testing.

pkgs <- c('pROC', 'xgboost') lapply(pkgs, require, character.only = T) df1 <- read.csv("Downloads/credit_count.txt") df2 <- df1[df1$CARDHLDR == 1, ] set.seed(2017) n <- nrow(df2) sample <- sample(seq(n), size = n / 2, replace = FALSE) train <- df2[sample, -1] test <- df2[-sample, -1]

For the comparison purpose, we first developed a boosting tree ensemble without dropouts, as shown below. For the simplicity, all parameters were chosen heuristically. The max_depth is set to 3 due to the fact that the boosting tends to work well with so-called “weak” learners, e.g. simple trees. While ROC for the training set can be as high as 0.95, ROC for the testing set is only 0.60 in our case, implying the overfitting issue.

mart.parm <- list(booster = "gbtree", nthread = 4, eta = 0.1, max_depth = 3, subsample = 1, eval_metric = "auc") mart <- xgboost(data = as.matrix(train[, -1]), label = train[, 1], params = mart.parm, nrounds = 500, verbose = 0, seed = 2017) pred1 <- predict(mart, as.matrix(train[, -1])) pred2 <- predict(mart, as.matrix(test[, -1])) roc(as.factor(train$DEFAULT), pred1) # Area under the curve: 0.9459 roc(as.factor(test$DEFAULT), pred2) # Area under the curve: 0.6046

With the same set of parameters, we refitted the ensemble with dropouts, e.g. DART. As shown below, by dropping 10% tree members, ROC for the testing set can increase from 0.60 to 0.65. In addition, the performance disparity between training and testing sets with DART decreases significantly.

dart.parm <- list(booster = "dart", rate_drop = 0.1, nthread = 4, eta = 0.1, max_depth = 3, subsample = 1, eval_metric = "auc") dart <- xgboost(data = as.matrix(train[, -1]), label = train[, 1], params = dart.parm, nrounds = 500, verbose = 0, seed = 2017) pred1 <- predict(dart, as.matrix(train[, -1])) pred2 <- predict(dart, as.matrix(test[, -1])) roc(as.factor(train$DEFAULT), pred1) # Area under the curve: 0.7734 roc(as.factor(test$DEFAULT), pred2) # Area under the curve: 0.6517

Besides rate_drop = 0.1, a wide range of dropout rates have also been tested. In most cases, DART outperforms its counterpart without the dropout regularization.

## Model Operational Losses with Copula Regression

In the previous post (https://statcompute.wordpress.com/2017/06/29/model-operational-loss-directly-with-tweedie-glm), it has been explained why we should consider modeling operational losses for non-material UoMs directly with Tweedie models. However, for material UoMs with significant losses, it is still beneficial to model the frequency and the severity separately.

In the prevailing modeling practice for operational losses, it is often convenient to assume a functional independence between frequency and severity models, which might not be the case empirically. For instance, in the economic downturn, both the frequency and the severity of consumer frauds might tend to increase simultaneously. With the independence assumption, while we can argue that same variables could be included in both frequency and severity models and therefore induce a certain correlation, the frequency-severity dependence and the its contribution to the loss distribution might be overlooked.

In the context of Copula, the distribution of operational losses can be considered a joint distribution determined by both marginal distributions and a parameter measuring the dependence between marginals, of which marginal distributions can be Poisson for the frequency and Gamma for the severity. Depending on the dependence structure in the data, various copula functions might be considered. For instance, a product copula can be used to describe the independence. In the example shown below, a Gumbel copula is considered given that it is often used to describe the positive dependence on the right tail, e.g. high severity and high frequency. For details, the book “Copula Modeling” by Trivedi and Zimmer is a good reference to start with.

In the demonstration, we simulated both frequency and severity measures driven by the same set of co-variates. Both are positively correlated with the Kendall’s tau = 0.5 under the assumption of Gumbel copula.

library(CopulaRegression) # number of observations to simulate n <- 100 # seed value for the simulation set.seed(2017) # design matrices with a constant column X <- cbind(rep(1, n), runif(n), runif(n)) # define coefficients for both Poisson and Gamma regressions p_beta <- g_beta <- c(3, -2, 1) # define the Gamma dispersion delta <- 1 # define the Kendall's tau tau <- 0.5 # copula parameter based on tau theta <- 1 / (1 - tau) # define the Gumbel Copula family <- 4 # simulate outcomes out <- simulate_regression_data(n, g_beta, p_beta, X, X, delta, tau, family, zt = FALSE) G <- out[, 1] P <- out[, 2]

After the simulation, a Copula regression is estimated with Poisson and Gamma marginals for the frequency and the severity respectively. As shown in the model estimation, estimated parameters with related inferences are different between independent and dependent assumptions.

m <- copreg(G, P, X, family = 4, sd.error = TRUE, joint = TRUE, zt = FALSE) coef <- c("_CONST", "X1", "X2") cols <- c("ESTIMATE", "STD. ERR", "Z-VALUE") g_est <- cbind(m$alpha, m$sd.alpha, m$alpha / m$sd.alpha) p_est <- cbind(m$beta, m$sd.beta, m$beta / m$sd.beta) g_est0 <- cbind(m$alpha0, m$sd.alpha0, m$alpha0 / m$sd.alpha0) p_est0 <- cbind(m$beta0, m$sd.beta0, m$beta0 / m$sd.beta0) rownames(g_est) <- rownames(g_est0) <- rownames(p_est) <- rownames(p_est0) <- coef colnames(g_est) <- colnames(g_est0) <- colnames(p_est) <- colnames(p_est0) <- cols # estimated coefficients for the Gamma regression assumed dependence print(g_est) # ESTIMATE STD. ERR Z-VALUE # _CONST 2.9710512 0.2303651 12.897141 # X1 -1.8047627 0.2944627 -6.129003 # X2 0.9071093 0.2995218 3.028526 # estimated coefficients for the Gamma regression assumed dependence print(p_est) # ESTIMATE STD. ERR Z-VALUE # _CONST 2.954519 0.06023353 49.05107 # X1 -1.967023 0.09233056 -21.30414 # X2 1.025863 0.08254870 12.42736 # estimated coefficients for the Gamma regression assumed independence # should be identical to GLM() outcome print(g_est0) # ESTIMATE STD. ERR Z-VALUE # _CONST 3.020771 0.2499246 12.086727 # X1 -1.777570 0.3480328 -5.107478 # X2 0.905527 0.3619011 2.502140 # estimated coefficients for the Gamma regression assumed independence # should be identical to GLM() outcome print(p_est0) # ESTIMATE STD. ERR Z-VALUE # _CONST 2.939787 0.06507502 45.17536 # X1 -2.010535 0.10297887 -19.52376 # X2 1.088269 0.09334663 11.65837

If we compare conditional loss distributions under different dependence assumptions, it shows that the predicted loss with Copula regression tends to have a fatter right tail and therefore should be considered more conservative.

df <- data.frame(g = G, p = P, x1 = X[, 2], x2 = X[, 3]) glm_p <- glm(p ~ x1 + x2, data = df, family = poisson(log)) glm_g <- glm(g ~ x1 + x2, data = df, family = Gamma(log)) loss_dep <- predict(m, X, X, independence = FALSE)[3][[1]][[1]] loss_ind <- fitted(glm_p) * fitted(glm_g) den <- data.frame(loss = c(loss_dep, loss_ind), lines = rep(c("DEPENDENCE", "INDEPENDENCE"), each = n)) ggplot(den, aes(x = loss, fill = lines)) + geom_density(alpha = 0.5)

## Model Operational Loss Directly with Tweedie GLM

In the development of operational loss forecasting models, the Frequency-Severity modeling approach, which the frequency and the severity of a Unit of Measure (UoM) are modeled separately, has been widely employed in the banking industry. However, sometimes it also makes sense to model the operational loss directly, especially for UoMs with non-material losses. First of all, given the low loss amount, the effort of developing two models, e.g. frequency and severity, might not be justified. Secondly, for UoMs with low losses due to low frequencies, modeling the frequency and the severity separately might overlook the internal connection between the low frequency and the subsequent low loss amount. For instance, when the frequency N = 0, then the loss L = $0 inevitably.

The Tweedie distribution is defined as a Poisson sum of Gamma random variables. In particular, if the frequency of loss events N is assumed a Poisson distribution and the loss amount L_i of an event i, where i = 0, 1 … N, is assumed a Gamma distribution, then the total loss amount L = SUM[L_i] would have a Tweedie distribution. When there is no loss event, e.g. N = 0, then Prob(L = $0) = Prob(N = 0) = Exp(-Lambda). However, when N > 0, then L = L_1 + … + L_N > $0 is governed by a Gamma distribution, e.g. sum of I.I.D. Gamma also being Gamma.

For the Tweedie loss, E(L) = Mu and VAR(L) = Phi * (Mu ** P), where P is called the index parameter and Phi is the dispersion parameter. When P approaches 1 and therefore VAR(L) approaches Phi * E(L), the Tweedie would be similar to a Poisson-like distribution. When P approaches 2 and therefore VAR(L) approaches Phi * (E(L) ** 2), the Tweedie would be similar to a Gamma distribution. When P is between 1 and 2, then the Tweedie would be a compound mixture of Poisson and Gamma, where P and Phi can be estimated.

To estimate a regression with the Tweedie distributional assumption, there are two implementation approaches in R with cplm and statmod packages respectively. With the cplm package, the Tweedie regression can be estimated directly as long as P is in the range of (1, 2), as shown below. In the example, the estimated index parameter P is 1.42.

> library(cplm) > data(FineRoot) > m1 <- cpglm(RLD ~ Zone + Stock, data = FineRoot) > summary(m1) Deviance Residuals: Min 1Q Median 3Q Max -1.0611 -0.6475 -0.3928 0.1380 1.9627 Estimate Std. Error t value Pr(>|t|) (Intercept) -1.95141 0.14643 -13.327 < 2e-16 *** ZoneOuter -0.85693 0.13292 -6.447 2.66e-10 *** StockMM106 0.01177 0.17535 0.067 0.947 StockMark -0.83933 0.17476 -4.803 2.06e-06 *** --- Estimated dispersion parameter: 0.35092 Estimated index parameter: 1.4216 Residual deviance: 203.91 on 507 degrees of freedom AIC: -157.33

The statmod package provides a more general and flexible solution with the two-stage estimation, which will estimate the P parameter first and then estimate regression parameters. In the real-world practice, we could do a coarse search to narrow down a reasonable range of P and then do a fine search to identify the optimal P value. As shown below, all estimated parameters are fairly consistent with ones in the previous example.

> library(tweedie) > library(statmod) > prof <- tweedie.profile(RLD ~ Zone + Stock, data = FineRoot, p.vec = seq(1.1, 1.9, 0.01), method = "series") 1.1 1.11 1.12 1.13 1.14 1.15 1.16 1.17 1.18 1.19 1.2 1.21 1.22 1.23 1.24 1.25 1.26 1.27 1.28 1.29 1.3 1.31 1.32 1.33 1.34 1.35 1.36 1.37 1.38 1.39 1.4 1.41 1.42 1.43 1.44 1.45 1.46 1.47 1.48 1.49 1.5 1.51 1.52 1.53 1.54 1.55 1.56 1.57 1.58 1.59 1.6 1.61 1.62 1.63 1.64 1.65 1.66 1.67 1.68 1.69 1.7 1.71 1.72 1.73 1.74 1.75 1.76 1.77 1.78 1.79 1.8 1.81 1.82 1.83 1.84 1.85 1.86 1.87 1.88 1.89 1.9 .................................................................................Done. > prof$p.max [1] 1.426531 > m2 <- glm(RLD ~ Zone + Stock, data = FineRoot, family = tweedie(var.power = prof$p.max, link.power = 0)) > summary(m2) Deviance Residuals: Min 1Q Median 3Q Max -1.0712 -0.6559 -0.3954 0.1380 1.9728 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -1.95056 0.14667 -13.299 < 2e-16 *** ZoneOuter -0.85823 0.13297 -6.454 2.55e-10 *** StockMM106 0.01204 0.17561 0.069 0.945 StockMark -0.84044 0.17492 -4.805 2.04e-06 *** --- (Dispersion parameter for Tweedie family taken to be 0.4496605) Null deviance: 241.48 on 510 degrees of freedom Residual deviance: 207.68 on 507 degrees of freedom AIC: NA

## GLM with H2O in R

Below is an example showing how to fit a Generalized Linear Model with H2O in R. The output is much more comprehensive than the one generated by the generic R glm().

> library(h2o) > h2o.init(max_mem_size = "12g") > df1 <- h2o.uploadFile("Documents/credit_count.txt", header = TRUE, sep = ",", parse_type = "CSV") > df2 <- h2o.assign(df1[df1$CARDHLDR == 1, ], "glm_df") > h2o.colnames(df2) [1] "CARDHLDR" "DEFAULT" "AGE" "ACADMOS" "ADEPCNT" "MAJORDRG" [7] "MINORDRG" "OWNRENT" "INCOME" "SELFEMPL" "INCPER" "EXP_INC" [13] "SPENDING" "LOGSPEND" > Y <- "DEFAULT" > X <- c("MAJORDRG", "MINORDRG", "INCOME", "OWNRENT") > dist <- "binomial" > link <- "logit" > id <- "h2o_mdl01" > mdl <- h2o.glm(X, Y, training_frame = h2o.getFrame("glm_df"), model_id = id, family = dist, link = link, lambda = 0, compute_p_values = TRUE, standardize = FALSE) > show(h2o.getModel(id)@model$coefficients_table) Coefficients: glm coefficients names coefficients std_error z_value p_value 1 Intercept -1.204439 0.090811 -13.263121 0.000000 2 MAJORDRG 0.203135 0.069250 2.933370 0.003353 3 MINORDRG 0.202727 0.047971 4.226014 0.000024 4 OWNRENT -0.201223 0.071619 -2.809636 0.004960 5 INCOME -0.000442 0.000040 -10.942350 0.000000 > h2o.performance(h2o.getModel(id)) H2OBinomialMetrics: glm ** Reported on training data. ** MSE: 0.08414496 RMSE: 0.2900775 LogLoss: 0.3036585 Mean Per-Class Error: 0.410972 AUC: 0.6432189 Gini: 0.2864378 R^2: 0.02005004 Residual Deviance: 6376.221 AIC: 6386.221 Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold: 0 1 Error Rate 0 7703 1800 0.189414 =1800/9503 1 630 366 0.632530 =630/996 Totals 8333 2166 0.231451 =2430/10499 Maximum Metrics: Maximum metrics at their respective thresholds metric threshold value idx 1 max f1 0.126755 0.231499 142 2 max f2 0.075073 0.376556 272 3 max f0point5 0.138125 0.191828 115 4 max accuracy 0.368431 0.905039 0 5 max precision 0.314224 0.250000 3 6 max recall 0.006115 1.000000 399 7 max specificity 0.368431 0.999895 0 8 max absolute_mcc 0.126755 0.128940 142 9 max min_per_class_accuracy 0.106204 0.604546 196 10 max mean_per_class_accuracy 0.103730 0.605663 202

## H2O Benchmark for CSV Import

The importFile() function in H2O is extremely efficient due to the parallel reading. The benchmark comparison below shows that it is comparable to the read.df() in SparkR and significantly faster than the generic read.csv().

library(SparkR, lib.loc = paste(Sys.getenv("SPARK_HOME"), "/R/lib", sep = "")) sc <- sparkR.session(master = "local", sparkConfig = list(spark.driver.memory = "10g", spark.driver.cores = "4")) library(h2o) h2o.init(max_mem_size = "10g") library(rbenchmark) benchmark(replications = 5, order = "elapsed", relative = "elapsed", csv = { df <- read.csv("Documents/nycflights13.csv") print(nrow(df)) rm(df) }, spk = { df <- read.df("Documents/nycflights13.csv", source = "csv", header = "true", inferSchema = "true") print(nrow(df)) rm(df) }, h2o = { df <- h2o.importFile(path = "Documents/nycflights13.csv", header = TRUE, sep = ",") print(nrow(df)) rm(df) } ) # test replications elapsed relative user.self sys.self user.child sys.child # 3 h2o 5 8.221 1.000 0.508 0.032 0 0 # 2 spk 5 9.822 1.195 0.008 0.004 0 0 # 1 csv 5 16.595 2.019 16.420 0.176 0 0

## Using Tweedie Parameter to Identify Distributions

In the development of operational loss models, it is important to identify which distribution should be used to model operational risk measures, e.g. frequency and severity. For instance, why should we use the Gamma distribution instead of the Inverse Gaussian distribution to model the severity?

In my previous post https://statcompute.wordpress.com/2016/11/20/modified-park-test-in-sas, it is shown how to use the Modified Park test to identify the mean-variance relationship and then decide the corresponding distribution of operational risk measures. Following the similar logic, we can also leverage the flexibility of the Tweedie distribution to accomplish the same goal. Based upon the parameterization of a Tweedie distribution, the variance = Phi * (Mu ** P), where Mu is the mean and P is the power parameter. Depending on the specific value of P, the Tweedie distribution can accommodate several important distributions commonly used in the operational risk modeling, including Poisson, Gamma, Inverse Gaussian. For instance,

- With P = 0, the variance would be independent of the mean, indicating a Normal distribution.
- With P = 1, the variance would be in a linear form of the mean, indicating a Poisson-like distribution
- With P = 2, the variance would be in a quadratic form of the mean, indicating a Gamma distribution.
- With P = 3, the variance would be in a cubic form of the mean, indicating an Inverse Gaussian distribution.

In the example below, it is shown that the value of P is in the neighborhood of 1 for the frequency measure and is near 3 for the severity measure and that, given P closer to 3, the Inverse Gaussian regression would fit the severity better than the Gamma regression.

library(statmod) library(tweedie) profile1 <- tweedie.profile(Claim_Count ~ Age + Vehicle_Use, data = AutoCollision, p.vec = seq(1.1, 3.0, 0.1), fit.glm = TRUE) print(profile1$p.max) # [1] 1.216327 # The P parameter close to 1 indicates that the claim_count might follow a Poisson-like distribution profile2 <- tweedie.profile(Severity ~ Age + Vehicle_Use, data = AutoCollision, p.vec = seq(1.1, 3.0, 0.1), fit.glm = TRUE) print(profile2$p.max) # [1] 2.844898 # The P parameter close to 3 indicates that the severity might follow an Inverse Gaussian distribution BIC(glm(Severity ~ Age + Vehicle_Use, data = AutoCollision, family = Gamma(link = log))) # [1] 360.8064 BIC(glm(Severity ~ Age + Vehicle_Use, data = AutoCollision, family = inverse.gaussian(link = log))) # [1] 350.2504

Together with the Modified Park test, the estimation of P in a Tweedie distribution is able to help us identify the correct distribution employed in operational loss models in the context of GLM.

## Finer Monotonic Binning Based on Isotonic Regression

In my early post (https://statcompute.wordpress.com/2017/01/22/monotonic-binning-with-smbinning-package/), I wrote a monobin() function based on the smbinning package by Herman Jopia to improve the monotonic binning algorithm. The function works well and provides robust binning outcomes. However, there are a couple potential drawbacks due to the coarse binning. First of all, the derived Information Value for each binned variable might tend to be low. Secondly, the binned variable might not be granular enough to reflect the data nature.

In light of the aforementioned, I drafted an improved function isobin() based on the isotonic regression (https://en.wikipedia.org/wiki/Isotonic_regression), as shown below.

isobin <- function(data, y, x) { d1 <- data[c(y, x)] d2 <- d1[!is.na(d1[x]), ] c <- cor(d2[, 2], d2[, 1], method = "spearman", use = "complete.obs") reg <- isoreg(d2[, 2], c / abs(c) * d2[, 1]) k <- knots(as.stepfun(reg)) sm1 <-smbinning.custom(d1, y, x, k) c1 <- subset(sm1$ivtable, subset = CntGood * CntBad > 0, select = Cutpoint) c2 <- suppressWarnings(as.numeric(unlist(strsplit(c1$Cutpoint, " ")))) c3 <- c2[!is.na(c2)] return(smbinning.custom(d1, y, x, c3[-length(c3)])) }

Compared with the legacy monobin(), the isobin() function is able to significantly increase the binning granularity as well as moderately improve the Information Value.

**LTV Binning with isobin() Function**

Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate Odds LnOdds WoE IV 1 <= 46 81 78 3 81 78 3 0.0139 0.9630 0.0370 26.0000 3.2581 1.9021 0.0272 2 <= 71 312 284 28 393 362 31 0.0535 0.9103 0.0897 10.1429 2.3168 0.9608 0.0363 3 <= 72 22 20 2 415 382 33 0.0038 0.9091 0.0909 10.0000 2.3026 0.9466 0.0025 4 <= 73 27 24 3 442 406 36 0.0046 0.8889 0.1111 8.0000 2.0794 0.7235 0.0019 5 <= 81 303 268 35 745 674 71 0.0519 0.8845 0.1155 7.6571 2.0356 0.6797 0.0194 6 <= 83 139 122 17 884 796 88 0.0238 0.8777 0.1223 7.1765 1.9708 0.6149 0.0074 7 <= 90 631 546 85 1515 1342 173 0.1081 0.8653 0.1347 6.4235 1.8600 0.5040 0.0235 8 <= 94 529 440 89 2044 1782 262 0.0906 0.8318 0.1682 4.9438 1.5981 0.2422 0.0049 9 <= 95 145 119 26 2189 1901 288 0.0248 0.8207 0.1793 4.5769 1.5210 0.1651 0.0006 10 <= 100 907 709 198 3096 2610 486 0.1554 0.7817 0.2183 3.5808 1.2756 -0.0804 0.0010 11 <= 101 195 151 44 3291 2761 530 0.0334 0.7744 0.2256 3.4318 1.2331 -0.1229 0.0005 12 <= 110 1217 934 283 4508 3695 813 0.2085 0.7675 0.2325 3.3004 1.1940 -0.1619 0.0057 13 <= 112 208 158 50 4716 3853 863 0.0356 0.7596 0.2404 3.1600 1.1506 -0.2054 0.0016 14 <= 115 253 183 70 4969 4036 933 0.0433 0.7233 0.2767 2.6143 0.9610 -0.3950 0.0075 15 <= 136 774 548 226 5743 4584 1159 0.1326 0.7080 0.2920 2.4248 0.8857 -0.4702 0.0333 16 <= 138 27 18 9 5770 4602 1168 0.0046 0.6667 0.3333 2.0000 0.6931 -0.6628 0.0024 17 > 138 66 39 27 5836 4641 1195 0.0113 0.5909 0.4091 1.4444 0.3677 -0.9882 0.0140 18 Missing 1 0 1 5837 4641 1196 0.0002 0.0000 1.0000 0.0000 -Inf -Inf Inf 19 Total 5837 4641 1196 NA NA NA 1.0000 0.7951 0.2049 3.8804 1.3559 0.0000 0.1897

**LTV Binning with monobin() Function**

Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate Odds LnOdds WoE IV 1 <= 85 1025 916 109 1025 916 109 0.1756 0.8937 0.1063 8.4037 2.1287 0.7727 0.0821 2 <= 94 1019 866 153 2044 1782 262 0.1746 0.8499 0.1501 5.6601 1.7334 0.3775 0.0221 3 <= 100 1052 828 224 3096 2610 486 0.1802 0.7871 0.2129 3.6964 1.3074 -0.0486 0.0004 4 <= 105 808 618 190 3904 3228 676 0.1384 0.7649 0.2351 3.2526 1.1795 -0.1765 0.0045 5 <= 114 985 748 237 4889 3976 913 0.1688 0.7594 0.2406 3.1561 1.1493 -0.2066 0.0076 6 > 114 947 665 282 5836 4641 1195 0.1622 0.7022 0.2978 2.3582 0.8579 -0.4981 0.0461 7 Missing 1 0 1 5837 4641 1196 0.0002 0.0000 1.0000 0.0000 -Inf -Inf Inf 8 Total 5837 4641 1196 NA NA NA 1.0000 0.7951 0.2049 3.8804 1.3559 0.0000 0.1628

**Bureau_Score Binning with isobin() Function**

Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate Odds LnOdds WoE IV 1 <= 491 4 1 3 4 1 3 0.0007 0.2500 0.7500 0.3333 -1.0986 -2.4546 0.0056 2 <= 532 24 9 15 28 10 18 0.0041 0.3750 0.6250 0.6000 -0.5108 -1.8668 0.0198 3 <= 559 51 24 27 79 34 45 0.0087 0.4706 0.5294 0.8889 -0.1178 -1.4737 0.0256 4 <= 560 2 1 1 81 35 46 0.0003 0.5000 0.5000 1.0000 0.0000 -1.3559 0.0008 5 <= 572 34 17 17 115 52 63 0.0058 0.5000 0.5000 1.0000 0.0000 -1.3559 0.0143 6 <= 602 153 84 69 268 136 132 0.0262 0.5490 0.4510 1.2174 0.1967 -1.1592 0.0459 7 <= 605 56 31 25 324 167 157 0.0096 0.5536 0.4464 1.2400 0.2151 -1.1408 0.0162 8 <= 606 14 8 6 338 175 163 0.0024 0.5714 0.4286 1.3333 0.2877 -1.0683 0.0035 9 <= 607 17 10 7 355 185 170 0.0029 0.5882 0.4118 1.4286 0.3567 -0.9993 0.0037 10 <= 632 437 261 176 792 446 346 0.0749 0.5973 0.4027 1.4830 0.3940 -0.9619 0.0875 11 <= 639 150 95 55 942 541 401 0.0257 0.6333 0.3667 1.7273 0.5465 -0.8094 0.0207 12 <= 653 451 300 151 1393 841 552 0.0773 0.6652 0.3348 1.9868 0.6865 -0.6694 0.0412 13 <= 662 295 213 82 1688 1054 634 0.0505 0.7220 0.2780 2.5976 0.9546 -0.4014 0.0091 14 <= 665 100 77 23 1788 1131 657 0.0171 0.7700 0.2300 3.3478 1.2083 -0.1476 0.0004 15 <= 667 57 44 13 1845 1175 670 0.0098 0.7719 0.2281 3.3846 1.2192 -0.1367 0.0002 16 <= 677 381 300 81 2226 1475 751 0.0653 0.7874 0.2126 3.7037 1.3093 -0.0466 0.0001 17 <= 679 66 53 13 2292 1528 764 0.0113 0.8030 0.1970 4.0769 1.4053 0.0494 0.0000 18 <= 683 160 129 31 2452 1657 795 0.0274 0.8062 0.1938 4.1613 1.4258 0.0699 0.0001 19 <= 689 203 164 39 2655 1821 834 0.0348 0.8079 0.1921 4.2051 1.4363 0.0804 0.0002 20 <= 699 304 249 55 2959 2070 889 0.0521 0.8191 0.1809 4.5273 1.5101 0.1542 0.0012 21 <= 707 312 268 44 3271 2338 933 0.0535 0.8590 0.1410 6.0909 1.8068 0.4509 0.0094 22 <= 717 368 318 50 3639 2656 983 0.0630 0.8641 0.1359 6.3600 1.8500 0.4941 0.0132 23 <= 721 134 119 15 3773 2775 998 0.0230 0.8881 0.1119 7.9333 2.0711 0.7151 0.0094 24 <= 723 49 44 5 3822 2819 1003 0.0084 0.8980 0.1020 8.8000 2.1748 0.8188 0.0043 25 <= 739 425 394 31 4247 3213 1034 0.0728 0.9271 0.0729 12.7097 2.5424 1.1864 0.0700 26 <= 746 166 154 12 4413 3367 1046 0.0284 0.9277 0.0723 12.8333 2.5520 1.1961 0.0277 27 <= 756 234 218 16 4647 3585 1062 0.0401 0.9316 0.0684 13.6250 2.6119 1.2560 0.0422 28 <= 761 110 104 6 4757 3689 1068 0.0188 0.9455 0.0545 17.3333 2.8526 1.4967 0.0260 29 <= 763 46 44 2 4803 3733 1070 0.0079 0.9565 0.0435 22.0000 3.0910 1.7351 0.0135 30 <= 767 96 92 4 4899 3825 1074 0.0164 0.9583 0.0417 23.0000 3.1355 1.7795 0.0293 31 <= 772 77 74 3 4976 3899 1077 0.0132 0.9610 0.0390 24.6667 3.2055 1.8495 0.0249 32 <= 787 269 260 9 5245 4159 1086 0.0461 0.9665 0.0335 28.8889 3.3635 2.0075 0.0974 33 <= 794 95 93 2 5340 4252 1088 0.0163 0.9789 0.0211 46.5000 3.8395 2.4835 0.0456 34 > 794 182 179 3 5522 4431 1091 0.0312 0.9835 0.0165 59.6667 4.0888 2.7328 0.0985 35 Missing 315 210 105 5837 4641 1196 0.0540 0.6667 0.3333 2.0000 0.6931 -0.6628 0.0282 36 Total 5837 4641 1196 NA NA NA 1.0000 0.7951 0.2049 3.8804 1.3559 0.0000 0.8357

**Bureau_Score Binning with monobin() Function**

Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate Odds LnOdds WoE IV 1 <= 617 513 284 229 513 284 229 0.0879 0.5536 0.4464 1.2402 0.2153 -1.1407 0.1486 2 <= 642 515 317 198 1028 601 427 0.0882 0.6155 0.3845 1.6010 0.4706 -0.8853 0.0861 3 <= 657 512 349 163 1540 950 590 0.0877 0.6816 0.3184 2.1411 0.7613 -0.5946 0.0363 4 <= 672 487 371 116 2027 1321 706 0.0834 0.7618 0.2382 3.1983 1.1626 -0.1933 0.0033 5 <= 685 494 396 98 2521 1717 804 0.0846 0.8016 0.1984 4.0408 1.3964 0.0405 0.0001 6 <= 701 521 428 93 3042 2145 897 0.0893 0.8215 0.1785 4.6022 1.5265 0.1706 0.0025 7 <= 714 487 418 69 3529 2563 966 0.0834 0.8583 0.1417 6.0580 1.8014 0.4454 0.0144 8 <= 730 489 441 48 4018 3004 1014 0.0838 0.9018 0.0982 9.1875 2.2178 0.8619 0.0473 9 <= 751 513 476 37 4531 3480 1051 0.0879 0.9279 0.0721 12.8649 2.5545 1.1986 0.0859 10 <= 775 492 465 27 5023 3945 1078 0.0843 0.9451 0.0549 17.2222 2.8462 1.4903 0.1157 11 > 775 499 486 13 5522 4431 1091 0.0855 0.9739 0.0261 37.3846 3.6213 2.2653 0.2126 12 Missing 315 210 105 5837 4641 1196 0.0540 0.6667 0.3333 2.0000 0.6931 -0.6628 0.0282 13 Total 5837 4641 1196 NA NA NA 1.0000 0.7951 0.2049 3.8804 1.3559 0.0000 0.7810