## Archive for **August 2017**

## DART: Dropout Regularization in Boosting Ensembles

The dropout approach developed by Hinton has been widely employed in deep learnings to prevent the deep neural network from overfitting, as shown in https://statcompute.wordpress.com/2017/01/02/dropout-regularization-in-deep-neural-networks.

In the paper http://proceedings.mlr.press/v38/korlakaivinayak15.pdf, the dropout can also be used to address the overfitting in boosting tree ensembles, e.g. MART, caused by the so-called “over-specialization”. In particular, while first few trees added at the beginning of ensembles would dominate the model performance, the rest added later can only improve the prediction for a small subset, which increases the risk of overfitting. The idea of DART is to build an ensemble by randomly dropping boosting tree members. The percentage of dropouts can determine the degree of regularization for boosting tree ensembles.

Below is a demonstration showing the implementation of DART with the R xgboost package. First of all, after importing the data, we divided it into two pieces, one for training and the other for testing.

pkgs <- c('pROC', 'xgboost') lapply(pkgs, require, character.only = T) df1 <- read.csv("Downloads/credit_count.txt") df2 <- df1[df1$CARDHLDR == 1, ] set.seed(2017) n <- nrow(df2) sample <- sample(seq(n), size = n / 2, replace = FALSE) train <- df2[sample, -1] test <- df2[-sample, -1]

For the comparison purpose, we first developed a boosting tree ensemble without dropouts, as shown below. For the simplicity, all parameters were chosen heuristically. The max_depth is set to 3 due to the fact that the boosting tends to work well with so-called “weak” learners, e.g. simple trees. While ROC for the training set can be as high as 0.95, ROC for the testing set is only 0.60 in our case, implying the overfitting issue.

mart.parm <- list(booster = "gbtree", nthread = 4, eta = 0.1, max_depth = 3, subsample = 1, eval_metric = "auc") mart <- xgboost(data = as.matrix(train[, -1]), label = train[, 1], params = mart.parm, nrounds = 500, verbose = 0, seed = 2017) pred1 <- predict(mart, as.matrix(train[, -1])) pred2 <- predict(mart, as.matrix(test[, -1])) roc(as.factor(train$DEFAULT), pred1) # Area under the curve: 0.9459 roc(as.factor(test$DEFAULT), pred2) # Area under the curve: 0.6046

With the same set of parameters, we refitted the ensemble with dropouts, e.g. DART. As shown below, by dropping 10% tree members, ROC for the testing set can increase from 0.60 to 0.65. In addition, the performance disparity between training and testing sets with DART decreases significantly.

dart.parm <- list(booster = "dart", rate_drop = 0.1, nthread = 4, eta = 0.1, max_depth = 3, subsample = 1, eval_metric = "auc") dart <- xgboost(data = as.matrix(train[, -1]), label = train[, 1], params = dart.parm, nrounds = 500, verbose = 0, seed = 2017) pred1 <- predict(dart, as.matrix(train[, -1])) pred2 <- predict(dart, as.matrix(test[, -1])) roc(as.factor(train$DEFAULT), pred1) # Area under the curve: 0.7734 roc(as.factor(test$DEFAULT), pred2) # Area under the curve: 0.6517

Besides rate_drop = 0.1, a wide range of dropout rates have also been tested. In most cases, DART outperforms its counterpart without the dropout regularization.

## Model Operational Losses with Copula Regression

In the previous post (https://statcompute.wordpress.com/2017/06/29/model-operational-loss-directly-with-tweedie-glm), it has been explained why we should consider modeling operational losses for non-material UoMs directly with Tweedie models. However, for material UoMs with significant losses, it is still beneficial to model the frequency and the severity separately.

In the prevailing modeling practice for operational losses, it is often convenient to assume a functional independence between frequency and severity models, which might not be the case empirically. For instance, in the economic downturn, both the frequency and the severity of consumer frauds might tend to increase simultaneously. With the independence assumption, while we can argue that same variables could be included in both frequency and severity models and therefore induce a certain correlation, the frequency-severity dependence and the its contribution to the loss distribution might be overlooked.

In the context of Copula, the distribution of operational losses can be considered a joint distribution determined by both marginal distributions and a parameter measuring the dependence between marginals, of which marginal distributions can be Poisson for the frequency and Gamma for the severity. Depending on the dependence structure in the data, various copula functions might be considered. For instance, a product copula can be used to describe the independence. In the example shown below, a Gumbel copula is considered given that it is often used to describe the positive dependence on the right tail, e.g. high severity and high frequency. For details, the book “Copula Modeling” by Trivedi and Zimmer is a good reference to start with.

In the demonstration, we simulated both frequency and severity measures driven by the same set of co-variates. Both are positively correlated with the Kendall’s tau = 0.5 under the assumption of Gumbel copula.

library(CopulaRegression) # number of observations to simulate n <- 100 # seed value for the simulation set.seed(2017) # design matrices with a constant column X <- cbind(rep(1, n), runif(n), runif(n)) # define coefficients for both Poisson and Gamma regressions p_beta <- g_beta <- c(3, -2, 1) # define the Gamma dispersion delta <- 1 # define the Kendall's tau tau <- 0.5 # copula parameter based on tau theta <- 1 / (1 - tau) # define the Gumbel Copula family <- 4 # simulate outcomes out <- simulate_regression_data(n, g_beta, p_beta, X, X, delta, tau, family, zt = FALSE) G <- out[, 1] P <- out[, 2]

After the simulation, a Copula regression is estimated with Poisson and Gamma marginals for the frequency and the severity respectively. As shown in the model estimation, estimated parameters with related inferences are different between independent and dependent assumptions.

m <- copreg(G, P, X, family = 4, sd.error = TRUE, joint = TRUE, zt = FALSE) coef <- c("_CONST", "X1", "X2") cols <- c("ESTIMATE", "STD. ERR", "Z-VALUE") g_est <- cbind(m$alpha, m$sd.alpha, m$alpha / m$sd.alpha) p_est <- cbind(m$beta, m$sd.beta, m$beta / m$sd.beta) g_est0 <- cbind(m$alpha0, m$sd.alpha0, m$alpha0 / m$sd.alpha0) p_est0 <- cbind(m$beta0, m$sd.beta0, m$beta0 / m$sd.beta0) rownames(g_est) <- rownames(g_est0) <- rownames(p_est) <- rownames(p_est0) <- coef colnames(g_est) <- colnames(g_est0) <- colnames(p_est) <- colnames(p_est0) <- cols # estimated coefficients for the Gamma regression assumed dependence print(g_est) # ESTIMATE STD. ERR Z-VALUE # _CONST 2.9710512 0.2303651 12.897141 # X1 -1.8047627 0.2944627 -6.129003 # X2 0.9071093 0.2995218 3.028526 # estimated coefficients for the Gamma regression assumed dependence print(p_est) # ESTIMATE STD. ERR Z-VALUE # _CONST 2.954519 0.06023353 49.05107 # X1 -1.967023 0.09233056 -21.30414 # X2 1.025863 0.08254870 12.42736 # estimated coefficients for the Gamma regression assumed independence # should be identical to GLM() outcome print(g_est0) # ESTIMATE STD. ERR Z-VALUE # _CONST 3.020771 0.2499246 12.086727 # X1 -1.777570 0.3480328 -5.107478 # X2 0.905527 0.3619011 2.502140 # estimated coefficients for the Gamma regression assumed independence # should be identical to GLM() outcome print(p_est0) # ESTIMATE STD. ERR Z-VALUE # _CONST 2.939787 0.06507502 45.17536 # X1 -2.010535 0.10297887 -19.52376 # X2 1.088269 0.09334663 11.65837

If we compare conditional loss distributions under different dependence assumptions, it shows that the predicted loss with Copula regression tends to have a fatter right tail and therefore should be considered more conservative.

df <- data.frame(g = G, p = P, x1 = X[, 2], x2 = X[, 3]) glm_p <- glm(p ~ x1 + x2, data = df, family = poisson(log)) glm_g <- glm(g ~ x1 + x2, data = df, family = Gamma(log)) loss_dep <- predict(m, X, X, independence = FALSE)[3][[1]][[1]] loss_ind <- fitted(glm_p) * fitted(glm_g) den <- data.frame(loss = c(loss_dep, loss_ind), lines = rep(c("DEPENDENCE", "INDEPENDENCE"), each = n)) ggplot(den, aes(x = loss, fill = lines)) + geom_density(alpha = 0.5)