## Archive for the ‘**S+/R**’ Category

## R Interfaces to Python Keras Package

Keras is a popular Python package to do the prototyping for deep neural networks with multiple backends, including TensorFlow, CNTK, and Theano. Currently, there are two R interfaces that allow us to use Keras from R through the reticulate package. While the keras R package is able to provide a flexible and feature-rich API, the kerasR R package is more convenient and computationally efficient. For instance, in the below example mimicking the Python code shown in https://statcompute.wordpress.com/2017/01/02/dropout-regularization-in-deep-neural-networks, the kerasR package is at least 10% faster than the keras package in terms of the computing time.

df <- read.csv("credit_count.txt") Y <- matrix(df[df$CARDHLDR == 1, ]$DEFAULT) X <- scale(df[df$CARDHLDR == 1, ][3:14]) set.seed(2018) rows <- sample(1:nrow(Y), nrow(Y) - 2000) Y1 <- Y[rows, ] Y2 <- Y[-rows, ] X1 <- X[rows, ] X2 <- X[-rows, ] ### USE KERAS PACKAGE (https://keras.rstudio.com) ### library(keras) dnn1 % ### DEFINE THE INPUT LAYER ### layer_dense(units = 50, activation = 'relu', input_shape = ncol(X), kernel_constraint = constraint_maxnorm(4)) %>% layer_dropout(rate = 0.2, seed = 1) %>% ### DEFINE THE 1ST HIDDEN LAYER ### layer_dense(units = 20, activation = 'relu', kernel_constraint = constraint_maxnorm(4)) %>% layer_dropout(rate = 0.2, seed = 1) %>% ### DEFINE THE 2ND HIDDEN LAYER ### layer_dense(units = 20, activation = 'relu', kernel_constraint = constraint_maxnorm(4)) %>% layer_dropout(rate = 0.2, seed = 1) %>% layer_dense(units = 1, activation = 'sigmoid') %>% compile(loss = 'binary_crossentropy', optimizer = 'sgd', metrics = c('accuracy')) dnn1 %>% fit(X1, Y1, batch_size = 50, epochs = 20, verbose = 0, validation_split = 0.3) pROC::roc(as.numeric(Y2), as.numeric(predict_proba(dnn1, X2))) ### USE KERAS PACKAGE (https://github.com/statsmaths/kerasR) ### library(kerasR) dnn2 <- Sequential() ### DEFINE THE INPUT LAYER ### dnn2$add(Dense(units = 50, input_shape = ncol(X), activation = 'relu', kernel_constraint = max_norm(4))) dnn2$add(Dropout(rate = 0.2, seed = 1)) ### DEFINE THE 1ST HIDDEN LAYER ### dnn2$add(Dense(units = 20, activation = 'relu', kernel_constraint = max_norm(4))) dnn2$add(Dropout(rate = 0.2, seed = 1)) ### DEFINE THE 2ND HIDDEN LAYER ### dnn2$add(Dense(units = 20, activation = 'relu', kernel_constraint = max_norm(4))) dnn2$add(Dropout(rate = 0.2, seed = 1)) dnn2$add(Dense(units = 1, activation = 'sigmoid')) keras_compile(dnn2, loss = 'binary_crossentropy', optimizer = 'sgd', metrics = 'accuracy') keras_fit(dnn2, X1, Y1, batch_size = 50, epochs = 20, verbose = 0, validation_split = 0.3) pROC::roc(as.numeric(Y2), as.numeric(keras_predict_proba(dnn2, X2)))

## Additional Thoughts on Estimating LGD with Proportional Odds Model

In my previous post (https://statcompute.wordpress.com/2018/01/28/modeling-lgd-with-proportional-odds-model), I’ve discussed how to use Proportional Odds Models in the LGD model development. In particular, I specifically mentioned that we would estimate a sub-model, which can be Gamma or Simplex regression, to project the conditional mean for LGD values in the (0, 1) range. However, it is worth pointing out that, if we would define a finer LGD segmentation, the necessity of this sub-model is completely optional. A standalone Proportional Odds Model without any sub-model is more than sufficient to serve the purpose of stress testing, e.g. CCAR.

In the example below, I will define 5 categories based upon LGD values in the [0, 1] range, estimate a Proportional Odds Model as usual, and then demonstrate how to apply the model outcome in the setting of stress testing with the stressed model input, e.g. LTV.

First of all, I defined 5 instead of 3 categories for LGD values, as shown below. Nonetheless, we could use a even finer category definition in practice to achieve a more accurate outcome.

df <- read.csv("lgd.csv") df$lgd <- round(1 - df$Recovery_rate, 4) l1 <- c(-Inf, 0, 0.0999, 0.4999, 0.9999, Inf) l2 <- c("A", "B", "C", "D", "E") df$lgd_cat <- cut(df$lgd, breaks = l1, labels = l2, ordered_result = T) summary(df$lgd_cat) m1 <- ordinal::clm(lgd_cat ~ LTV, data = df) #Coefficients: # Estimate Std. Error z value Pr(>|z|) #LTV 2.3841 0.1083 22.02 <2e-16 *** # #Threshold coefficients: # Estimate Std. Error z value #A|B 0.54082 0.07897 6.848 #B|C 2.12270 0.08894 23.866 #C|D 3.18098 0.10161 31.307 #D|E 4.80338 0.13174 36.460

After the model estimation, it is straightforward to calculate the probability of each LGD category. The only question remained is how to calculate the LGD projection for each individual account as well as for the whole portfolio. In order to calculate the LGD projection, we need two factors, namely the probability and the expected mean of each LGD category, such that

Estimated_LGD = SUM_i [Prob(category i) * LGD_Mean(category i)], where i = A, B, C, D, and E in this particular case.

The calculation is shown below with the estimated LGD = 0.23 that is consistent with the actual LGD = 0.23 for the whole portfolio.

prob_A <- exp(df$LTV * (-m1$beta) + m1$Theta[1]) / (1 + exp(df$LTV * (-m1$beta) + m1$Theta[1])) prob_B <- exp(df$LTV * (-m1$beta) + m1$Theta[2]) / (1 + exp(df$LTV * (-m1$beta) + m1$Theta[2])) - prob_A prob_C <- exp(df$LTV * (-m1$beta) + m1$Theta[3]) / (1 + exp(df$LTV * (-m1$beta) + m1$Theta[3])) - prob_A - prob_B prob_D <- exp(df$LTV * (-m1$beta) + m1$Theta[4]) / (1 + exp(df$LTV * (-m1$beta) + m1$Theta[4])) - prob_A - prob_B - prob_C prob_E <- 1 - exp(df$LTV * (-m1$beta) + m1$Theta[4]) / (1 + exp(df$LTV * (-m1$beta) + m1$Theta[4])) pred <- data.frame(prob_A, prob_B, prob_C, prob_D, prob_E) sum(apply(pred, 2, mean) * aggregate(df['lgd'], df['lgd_cat'], mean)[2]) #[1] 0.2262811

One might be wondering how to apply the model outcome with simple averages in stress testing that the model input is stressed, e.g. more severe, and might be also concerned about the lack of model sensitivity. In the demonstration below, let’s stress the model input LTV by 50% and then evaluate the stressed LGD.

df$LTV_ST <- df$LTV * 1.5 prob_A <- exp(df$LTV_ST * (-m1$beta) + m1$Theta[1]) / (1 + exp(df$LTV_ST * (-m1$beta) + m1$Theta[1])) prob_B <- exp(df$LTV_ST * (-m1$beta) + m1$Theta[2]) / (1 + exp(df$LTV_ST * (-m1$beta) + m1$Theta[2])) - prob_A prob_C <- exp(df$LTV_ST * (-m1$beta) + m1$Theta[3]) / (1 + exp(df$LTV_ST * (-m1$beta) + m1$Theta[3])) - prob_A - prob_B prob_D <- exp(df$LTV_ST * (-m1$beta) + m1$Theta[4]) / (1 + exp(df$LTV_ST * (-m1$beta) + m1$Theta[4])) - prob_A - prob_B - prob_C prob_E <- 1 - exp(df$LTV_ST * (-m1$beta) + m1$Theta[4]) / (1 + exp(df$LTV_ST * (-m1$beta) + m1$Theta[4])) pred_ST <- data.frame(prob_A, prob_B, prob_C, prob_D, prob_E) sum(apply(pred_ST, 2, mean) * aggregate(df['lgd'], df['lgd_cat'], mean)[2]) #[1] 0.3600153

As shown above, although we only use a simple averages as the expected mean for each LGD category, the overall LGD still increases by ~60%. The reason is that, with the more stressed model input, the Proportional Odds Model is able to push more accounts into categories with higher LGD. For instance, the output below shows that, if LTV is stressed by 50% overall, ~146% more accounts would roll into the most severe LGD category without any recovery.

apply(pred_ST, 2, mean) / apply(pred, 2, mean) # prob_A prob_B prob_C prob_D prob_E #0.6715374 0.7980619 1.0405573 1.4825803 2.4639293

## Modeling LGD with Proportional Odds Model

The LGD model is an important component in the expected loss calculation. In https://statcompute.wordpress.com/2015/11/01/quasi-binomial-model-in-sas, I discussed how to model LGD with the quasi-binomial regression that is simple and makes no distributional assumption.

In the real-world LGD data, we usually would observe 3 ordered categories of values, including 0, 1, and in-betweens. In cases with a nontrivial number of 0 and 1 values, the ordered logit model, which is also known as Proportional Odds model, can be applicable. In the demonstration below, I will show how we can potentially use the proportional odds model in the LGD model development.

First of all, we need to categorize all numeric LGD values into three ordinal categories. As shown below, there are more than 30% of 0 and 1 values.

df <- read.csv("lgd.csv") df$lgd <- round(1 - df$Recovery_rate, 4) df$lgd_cat <- cut(df$lgd, breaks = c(-Inf, 0, 0.9999, Inf), labels = c("L", "M", "H"), ordered_result = T) summary(df$lgd_cat) # L M H # 730 1672 143

The estimation of a proportional odds model is straightforward with clm() in the ordinal package or polr() in the MASS package. As demonstrated below, in addition to the coefficient for LTV, there are 2 intercepts to differentiate 3 categories.

m1 <- ordinal::clm(lgd_cat ~ LTV, data = df) summary(m1) #Coefficients: # Estimate Std. Error z value Pr(>|z|) #LTV 2.0777 0.1267 16.4 <2e-16 *** #--- #Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 # #Threshold coefficients: # Estimate Std. Error z value #L|M 0.38134 0.08676 4.396 #M|H 4.50145 0.14427 31.201

It is important to point out that, in a proportional odds model, it is the cumulative probability that is derived from the linear combination of model variables. For instance, the cumulative probability of LGD belonging to L or M is formulated as

Prob(LGD <= M) = Exp(4.50 – 2.08 * LTV) / (1 + Exp(4.50 – 2.08 * LTV))

Likewise, we would have

Prob(LGD <= L) = Exp(0.38 – 2.08 * LTV) / (1 + Exp(0.38 – 2.08 * LTV))

With above cumulative probabilities, then we can calculate the probability of each category as below.

Prob(LGD = L) = Prob(LGD <= L)

Prob(LGD = M) = Prob(LGD <= M) – Prob(LGD <= L)

Prob(LGD = H) = 1 – Prob(LGD <= M)

The R code is showing the detailed calculation how to convert cumulative probabilities to probabilities of interest.

cumprob_L <- exp(df$LTV * (-m1$beta) + m1$Theta[1]) / (1 + exp(df$LTV * (-m1$beta) + m1$Theta[1])) cumprob_M <- exp(df$LTV * (-m1$beta) + m1$Theta[2]) / (1 + exp(df$LTV * (-m1$beta) + m1$Theta[2])) prob_L <- cumprob_L prob_M <- cumprob_M - cumprob_L prob_H <- 1 - cumprob_M pred <- data.frame(prob_L, prob_M, prob_H) apply(pred, 2, mean) # prob_L prob_M prob_H #0.28751210 0.65679888 0.05568903

After predicting the probability of each category, we would need another sub-model to estimate the conditional LGD for lgd_cat = “M” with either Beta or Simplex regression. (See https://statcompute.wordpress.com/2014/10/27/flexible-beta-modeling and https://statcompute.wordpress.com/2014/02/02/simplex-model-in-r) The final LGD prediction can be formulated as

E(LGD|X)

= Prob(Y = 0|X) * E(Y|X, Y = 0) + Prob(Y = 1|X) * E(Y|X, Y = 1) + Prob(0 < Y < 1|X) * E(Y|X, 0 < Y < 1)

= Prob(Y = 1|X) + Prob(0 < Y < 1|X) * E(Y|X, 0 < Y < 1)

where E(Y|X, 0 < Y < 1) can be calculated from the sub-model.

## Model Non-Negative Numeric Outcomes with Zeros

As mentioned in the previous post (https://statcompute.wordpress.com/2017/06/29/model-operational-loss-directly-with-tweedie-glm/), we often need to model non-negative numeric outcomes with zeros in the operational loss model development. Tweedie GLM provides a convenient interface to model non-negative losses directly by assuming that aggregated losses are the Poisson sum of Gamma outcomes, which however might not be well supported empirically from the data generation standpoint.

In examples below, we demonstrated another flexible option, namely Zero-Adjusted (ZA) models, in both scenarios of modeling non-negative numeric outcomes, one with a small number of zeros and the other with a large number of zeros. The basic idea of ZA models is very intuitive and similar to the concept of Hurdle models for count outcomes. In a nutshell, non-negative numeric outcomes can be considered two data generation processes, one for point-mass at zeros and the other governed by a statistical distribution for positive outcomes. The latter could be either Gamma or Inverse Gaussian.

First of all, we sampled down an auto-claim data in a way that only 10 claims are zeros and the rest are all positive. While 10 is an arbitrary choice in the example, other small numbers should show similar results.

pkgs <- list("cplm", "gamlss", "MLmetrics") lapply(pkgs, require, character.only = T) data(AutoClaim, package = "cplm") df1 <- na.omit(AutoClaim) # SMALL NUMBER OF ZEROS set.seed(2017) smp <- sample(seq(nrow(df1[df1$CLM_AMT == 0, ])), size = 10, replace = FALSE) df2 <- rbind(df1[df1$CLM_AMT > 0, ], df1[df1$CLM_AMT == 0, ][smp, ])

Next, we applied both Tweedie and zero-adjusted Gamma (ZAGA) models to the data with only 10 zero outcomes. It is worth mentioning that ZAGA doesn’t have to be overly complex in this case. As shown below, while we estimated the Gamma Mu parameter with model attributes, the Nu parameter to separate zeros is just a constant with the intercept = -5.4. Both Tweedie and GAZA models gave very similar estimated parameters and predictive measures with MAPE = 0.61.

tw <- cpglm(CLM_AMT ~ BLUEBOOK + NPOLICY, data = df2) # Estimate Std. Error t value Pr(>|t|) # (Intercept) 8.194e+00 7.234e-02 113.277 < 2e-16 *** # BLUEBOOK 2.047e-05 3.068e-06 6.671 3.21e-11 *** # NPOLICY 7.274e-02 3.102e-02 2.345 0.0191 * MAPE(df2$CLM_AMT, fitted(tw)) # 0.6053669 zaga0 <- gamlss(CLM_AMT ~ BLUEBOOK + NPOLICY, data = df2, family = "ZAGA") # Mu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 8.203e+00 4.671e-02 175.629 < 2e-16 *** # BLUEBOOK 2.053e-05 2.090e-06 9.821 < 2e-16 *** # NPOLICY 6.948e-02 2.057e-02 3.377 0.000746 *** # Nu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) -5.3886 0.3169 -17 <2e-16 *** MAPE(df2$CLM_AMT, (1 - fitted(zaga0, what = "nu")) * fitted(zaga0, what = "mu")) # 0.6053314

In the next case, we used the full data with a large number of zeros in the response and then applied both Tweedie and ZAGA models again. However, in ZAGA model, we estimated two sub-models this time, one for the Nu parameter to separate zeros from non-zeros and the other for the Mu parameter to model non-zero outcomes. As shown below, ZAGA outperformed Tweedie in terms of MAPE due to the advantage that ZAGA is able to explain two data generation schemes separately with different model attributes, which is the capability beyond what Tweedie can provide.

# LARGE NUMBER OF ZEROS tw <- cpglm(CLM_AMT ~ BLUEBOOK + NPOLICY + CLM_FREQ5 + MVR_PTS + INCOME, data = df1) # Estimate Std. Error t value Pr(>|t|) # (Intercept) 6.854e+00 1.067e-01 64.241 < 2e-16 *** # BLUEBOOK 1.332e-05 4.495e-06 2.963 0.00305 ** # NPOLICY 4.380e-02 3.664e-02 1.195 0.23196 # CLM_FREQ5 2.064e-01 2.937e-02 7.026 2.29e-12 *** # MVR_PTS 1.066e-01 1.510e-02 7.063 1.76e-12 *** # INCOME -4.606e-06 8.612e-07 -5.348 9.12e-08 *** MAPE(df1$CLM_AMT, fitted(tw)) # 1.484484 zaga1 <- gamlss(CLM_AMT ~ BLUEBOOK + NPOLICY, nu.formula = ~(CLM_FREQ5 + MVR_PTS + INCOME), data = df1, family = "ZAGA") # Mu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 8.203e+00 4.682e-02 175.218 < 2e-16 *** # BLUEBOOK 2.053e-05 2.091e-06 9.816 < 2e-16 *** # NPOLICY 6.948e-02 2.067e-02 3.362 0.000778 *** # Nu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 1.153e+00 5.077e-02 22.72 <2e-16 *** # CLM_FREQ5 -3.028e-01 2.283e-02 -13.26 <2e-16 *** # MVR_PTS -1.509e-01 1.217e-02 -12.41 <2e-16 *** # INCOME 7.285e-06 6.269e-07 11.62 <2e-16 *** MAPE(df1$CLM_AMT, (1 - fitted(zaga1, what = "nu")) * fitted(zaga1, what = "mu")) # 1.470228

Given the great flexibility of ZA models, we also have the luxury to explore other candidates than ZAGA. For instance, if the positive part of non-negative outcomes demonstrates a high variance, we can also try a zero-inflated Inverse Gaussian (ZAIG) model, as shown below.

zaig1 <- gamlss(CLM_AMT ~ BLUEBOOK + NPOLICY, nu.formula = ~(CLM_FREQ5 + MVR_PTS + INCOME), data = df1, family = "ZAIG") # Mu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 8.205e+00 5.836e-02 140.591 < 2e-16 *** # BLUEBOOK 2.163e-05 2.976e-06 7.268 3.97e-13 *** # NPOLICY 5.898e-02 2.681e-02 2.200 0.0278 * # Nu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 1.153e+00 5.077e-02 22.72 <2e-16 *** # CLM_FREQ5 -3.028e-01 2.283e-02 -13.26 <2e-16 *** # MVR_PTS -1.509e-01 1.217e-02 -12.41 <2e-16 *** # INCOME 7.285e-06 6.269e-07 11.62 <2e-16 *** MAPE(df1$CLM_AMT, (1 - fitted(zaig1, what = "nu")) * fitted(zaig1, what = "mu")) # 1.469236

## Variable Selection with Elastic Net

LASSO has been a popular algorithm for the variable selection and extremely effective with high-dimension data. However, it often tends to “over-regularize” a model that might be overly compact and therefore under-predictive.

The Elastic Net addresses the aforementioned “over-regularization” by balancing between LASSO and ridge penalties. In particular, a hyper-parameter, namely Alpha, would be used to regularize the model such that the model would become a LASSO in case of Alpha = 1 and a ridge in case of Alpha = 0. In practice, Alpha can be tuned easily by the cross-validation. Below is a demonstration of Elastic Net with R glmnet package and its comparison with LASSO and ridge models.

pkgs <- list("glmnet", "doParallel", "foreach", "pROC") lapply(pkgs, require, character.only = T) registerDoParallel(cores = 4) df1 <- read.csv("Downloads/credit_count.txt") df2 <- df1[df1$CARDHLDR == 1, ] set.seed(2017) n <- nrow(df2) sample <- sample(seq(n), size = n * 0.5, replace = FALSE) train <- df2[sample, -1] test <- df2[-sample, -1] mdlY <- as.factor(as.matrix(train["DEFAULT"])) mdlX <- as.matrix(train[setdiff(colnames(df1), c("CARDHLDR", "DEFAULT"))]) newY <- as.factor(as.matrix(test["DEFAULT"])) newX <- as.matrix(test[setdiff(colnames(df1), c("CARDHLDR", "DEFAULT"))])

First of all, we estimates a LASSO model with Alpha = 1. The function cv.glmnet() is used to search for a regularization parameter, namely Lambda, that controls the penalty strength. As shown below, the model only identifies 2 attributes out of total 12.

# LASSO WITH ALPHA = 1 cv1 <- cv.glmnet(mdlX, mdlY, family = "binomial", nfold = 10, type.measure = "deviance", paralle = TRUE, alpha = 1) md1 <- glmnet(mdlX, mdlY, family = "binomial", lambda = cv1$lambda.1se, alpha = 1) coef(md1) #(Intercept) -1.963030e+00 #AGE . #ACADMOS . #ADEPCNT . #MAJORDRG . #MINORDRG . #OWNRENT . #INCOME -5.845981e-05 #SELFEMPL . #INCPER . #EXP_INC . #SPENDING . #LOGSPEND -4.015902e-02 roc(newY, as.numeric(predict(md1, newX, type = "response"))) #Area under the curve: 0.636

We next estimates a ridge model as below by setting Alpha = 0. Similarly, Lambda is searched by the cross-validation. Since the ridge penalty would only regularize the magnitude of each coefficient, we end up with a “full” model with all model attributes. The model performance is slightly better with 10 more variables, which is a debatable outcome.

# RIDGE WITH ALPHA = 0 cv2 <- cv.glmnet(mdlX, mdlY, family = "binomial", nfold = 10, type.measure = "deviance", paralle = TRUE, alpha = 0) md2 <- glmnet(mdlX, mdlY, family = "binomial", lambda = cv2$lambda.1se, alpha = 0) coef(md2) #(Intercept) -2.221016e+00 #AGE -4.184422e-04 #ACADMOS -3.085096e-05 #ADEPCNT 1.485114e-04 #MAJORDRG 6.684849e-03 #MINORDRG 1.006660e-03 #OWNRENT -9.082750e-03 #INCOME -6.960253e-06 #SELFEMPL 3.610381e-03 #INCPER -3.881890e-07 #EXP_INC -1.416971e-02 #SPENDING -1.638184e-05 #LOGSPEND -6.213884e-03 roc(newY, as.numeric(predict(md2, newX, type = "response"))) #Area under the curve: 0.6435

At last, we use the Elastic Net by tuning the value of Alpha through a line search with the parallelism. In this particular case, Alpha = 0.3 is chosen through the cross-validation. As shown below, 6 variables are used in the model that even performs better than the ridge model with all 12 attributes.

# ELASTIC NET WITH 0 < ALPHA < 1 a <- seq(0.1, 0.9, 0.05) search <- foreach(i = a, .combine = rbind) %dopar% { cv <- cv.glmnet(mdlX, mdlY, family = "binomial", nfold = 10, type.measure = "deviance", paralle = TRUE, alpha = i) data.frame(cvm = cv$cvm[cv$lambda == cv$lambda.1se], lambda.1se = cv$lambda.1se, alpha = i) } cv3 <- search[search$cvm == min(search$cvm), ] md3 <- glmnet(mdlX, mdlY, family = "binomial", lambda = cv3$lambda.1se, alpha = cv3$alpha) coef(md3) #(Intercept) -1.434700e+00 #AGE -8.426525e-04 #ACADMOS . #ADEPCNT . #MAJORDRG 6.276924e-02 #MINORDRG . #OWNRENT -2.780958e-02 #INCOME -1.305118e-04 #SELFEMPL . #INCPER -2.085349e-06 #EXP_INC . #SPENDING . #LOGSPEND -9.992808e-02 roc(newY, as.numeric(predict(md3, newX, type = "response"))) #Area under the curve: 0.6449

## DART: Dropout Regularization in Boosting Ensembles

The dropout approach developed by Hinton has been widely employed in deep learnings to prevent the deep neural network from overfitting, as shown in https://statcompute.wordpress.com/2017/01/02/dropout-regularization-in-deep-neural-networks.

In the paper http://proceedings.mlr.press/v38/korlakaivinayak15.pdf, the dropout can also be used to address the overfitting in boosting tree ensembles, e.g. MART, caused by the so-called “over-specialization”. In particular, while first few trees added at the beginning of ensembles would dominate the model performance, the rest added later can only improve the prediction for a small subset, which increases the risk of overfitting. The idea of DART is to build an ensemble by randomly dropping boosting tree members. The percentage of dropouts can determine the degree of regularization for boosting tree ensembles.

Below is a demonstration showing the implementation of DART with the R xgboost package. First of all, after importing the data, we divided it into two pieces, one for training and the other for testing.

pkgs <- c('pROC', 'xgboost') lapply(pkgs, require, character.only = T) df1 <- read.csv("Downloads/credit_count.txt") df2 <- df1[df1$CARDHLDR == 1, ] set.seed(2017) n <- nrow(df2) sample <- sample(seq(n), size = n / 2, replace = FALSE) train <- df2[sample, -1] test <- df2[-sample, -1]

For the comparison purpose, we first developed a boosting tree ensemble without dropouts, as shown below. For the simplicity, all parameters were chosen heuristically. The max_depth is set to 3 due to the fact that the boosting tends to work well with so-called “weak” learners, e.g. simple trees. While ROC for the training set can be as high as 0.95, ROC for the testing set is only 0.60 in our case, implying the overfitting issue.

mart.parm <- list(booster = "gbtree", nthread = 4, eta = 0.1, max_depth = 3, subsample = 1, eval_metric = "auc") mart <- xgboost(data = as.matrix(train[, -1]), label = train[, 1], params = mart.parm, nrounds = 500, verbose = 0, seed = 2017) pred1 <- predict(mart, as.matrix(train[, -1])) pred2 <- predict(mart, as.matrix(test[, -1])) roc(as.factor(train$DEFAULT), pred1) # Area under the curve: 0.9459 roc(as.factor(test$DEFAULT), pred2) # Area under the curve: 0.6046

With the same set of parameters, we refitted the ensemble with dropouts, e.g. DART. As shown below, by dropping 10% tree members, ROC for the testing set can increase from 0.60 to 0.65. In addition, the performance disparity between training and testing sets with DART decreases significantly.

dart.parm <- list(booster = "dart", rate_drop = 0.1, nthread = 4, eta = 0.1, max_depth = 3, subsample = 1, eval_metric = "auc") dart <- xgboost(data = as.matrix(train[, -1]), label = train[, 1], params = dart.parm, nrounds = 500, verbose = 0, seed = 2017) pred1 <- predict(dart, as.matrix(train[, -1])) pred2 <- predict(dart, as.matrix(test[, -1])) roc(as.factor(train$DEFAULT), pred1) # Area under the curve: 0.7734 roc(as.factor(test$DEFAULT), pred2) # Area under the curve: 0.6517

Besides rate_drop = 0.1, a wide range of dropout rates have also been tested. In most cases, DART outperforms its counterpart without the dropout regularization.

## Model Operational Losses with Copula Regression

In the previous post (https://statcompute.wordpress.com/2017/06/29/model-operational-loss-directly-with-tweedie-glm), it has been explained why we should consider modeling operational losses for non-material UoMs directly with Tweedie models. However, for material UoMs with significant losses, it is still beneficial to model the frequency and the severity separately.

In the prevailing modeling practice for operational losses, it is often convenient to assume a functional independence between frequency and severity models, which might not be the case empirically. For instance, in the economic downturn, both the frequency and the severity of consumer frauds might tend to increase simultaneously. With the independence assumption, while we can argue that same variables could be included in both frequency and severity models and therefore induce a certain correlation, the frequency-severity dependence and the its contribution to the loss distribution might be overlooked.

In the context of Copula, the distribution of operational losses can be considered a joint distribution determined by both marginal distributions and a parameter measuring the dependence between marginals, of which marginal distributions can be Poisson for the frequency and Gamma for the severity. Depending on the dependence structure in the data, various copula functions might be considered. For instance, a product copula can be used to describe the independence. In the example shown below, a Gumbel copula is considered given that it is often used to describe the positive dependence on the right tail, e.g. high severity and high frequency. For details, the book “Copula Modeling” by Trivedi and Zimmer is a good reference to start with.

In the demonstration, we simulated both frequency and severity measures driven by the same set of co-variates. Both are positively correlated with the Kendall’s tau = 0.5 under the assumption of Gumbel copula.

library(CopulaRegression) # number of observations to simulate n <- 100 # seed value for the simulation set.seed(2017) # design matrices with a constant column X <- cbind(rep(1, n), runif(n), runif(n)) # define coefficients for both Poisson and Gamma regressions p_beta <- g_beta <- c(3, -2, 1) # define the Gamma dispersion delta <- 1 # define the Kendall's tau tau <- 0.5 # copula parameter based on tau theta <- 1 / (1 - tau) # define the Gumbel Copula family <- 4 # simulate outcomes out <- simulate_regression_data(n, g_beta, p_beta, X, X, delta, tau, family, zt = FALSE) G <- out[, 1] P <- out[, 2]

After the simulation, a Copula regression is estimated with Poisson and Gamma marginals for the frequency and the severity respectively. As shown in the model estimation, estimated parameters with related inferences are different between independent and dependent assumptions.

m <- copreg(G, P, X, family = 4, sd.error = TRUE, joint = TRUE, zt = FALSE) coef <- c("_CONST", "X1", "X2") cols <- c("ESTIMATE", "STD. ERR", "Z-VALUE") g_est <- cbind(m$alpha, m$sd.alpha, m$alpha / m$sd.alpha) p_est <- cbind(m$beta, m$sd.beta, m$beta / m$sd.beta) g_est0 <- cbind(m$alpha0, m$sd.alpha0, m$alpha0 / m$sd.alpha0) p_est0 <- cbind(m$beta0, m$sd.beta0, m$beta0 / m$sd.beta0) rownames(g_est) <- rownames(g_est0) <- rownames(p_est) <- rownames(p_est0) <- coef colnames(g_est) <- colnames(g_est0) <- colnames(p_est) <- colnames(p_est0) <- cols # estimated coefficients for the Gamma regression assumed dependence print(g_est) # ESTIMATE STD. ERR Z-VALUE # _CONST 2.9710512 0.2303651 12.897141 # X1 -1.8047627 0.2944627 -6.129003 # X2 0.9071093 0.2995218 3.028526 # estimated coefficients for the Gamma regression assumed dependence print(p_est) # ESTIMATE STD. ERR Z-VALUE # _CONST 2.954519 0.06023353 49.05107 # X1 -1.967023 0.09233056 -21.30414 # X2 1.025863 0.08254870 12.42736 # estimated coefficients for the Gamma regression assumed independence # should be identical to GLM() outcome print(g_est0) # ESTIMATE STD. ERR Z-VALUE # _CONST 3.020771 0.2499246 12.086727 # X1 -1.777570 0.3480328 -5.107478 # X2 0.905527 0.3619011 2.502140 # estimated coefficients for the Gamma regression assumed independence # should be identical to GLM() outcome print(p_est0) # ESTIMATE STD. ERR Z-VALUE # _CONST 2.939787 0.06507502 45.17536 # X1 -2.010535 0.10297887 -19.52376 # X2 1.088269 0.09334663 11.65837

If we compare conditional loss distributions under different dependence assumptions, it shows that the predicted loss with Copula regression tends to have a fatter right tail and therefore should be considered more conservative.

df <- data.frame(g = G, p = P, x1 = X[, 2], x2 = X[, 3]) glm_p <- glm(p ~ x1 + x2, data = df, family = poisson(log)) glm_g <- glm(g ~ x1 + x2, data = df, family = Gamma(log)) loss_dep <- predict(m, X, X, independence = FALSE)[3][[1]][[1]] loss_ind <- fitted(glm_p) * fitted(glm_g) den <- data.frame(loss = c(loss_dep, loss_ind), lines = rep(c("DEPENDENCE", "INDEPENDENCE"), each = n)) ggplot(den, aes(x = loss, fill = lines)) + geom_density(alpha = 0.5)