Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

DART: Dropout Regularization in Boosting Ensembles

The dropout approach developed by Hinton has been widely employed in deep learning to prevent deep neural networks from overfitting, as shown in https://statcompute.wordpress.com/2017/01/02/dropout-regularization-in-deep-neural-networks.

As shown in the paper http://proceedings.mlr.press/v38/korlakaivinayak15.pdf, dropout can also be used to address overfitting in boosting tree ensembles, e.g. MART, caused by the so-called “over-specialization”. In particular, the first few trees added at the beginning of an ensemble tend to dominate the model performance, while trees added later can only improve the prediction for a small subset of observations, which increases the risk of overfitting. The idea of DART is to build the ensemble by randomly dropping tree members during each boosting iteration. The dropout rate determines the degree of regularization for the boosting tree ensemble.

Below is a demonstration showing the implementation of DART with the R xgboost package. First of all, after importing the data, we divided it into two pieces, one for training and the other for testing.

pkgs <- c('pROC', 'xgboost')
lapply(pkgs, require, character.only = T)
df1 <- read.csv("Downloads/credit_count.txt")
df2 <- df1[df1$CARDHLDR == 1, ]
set.seed(2017)
n <- nrow(df2)
sample <- sample(seq(n), size = n / 2, replace = FALSE)
train <- df2[sample, -1]
test <- df2[-sample, -1]

For comparison purposes, we first developed a boosting tree ensemble without dropouts, as shown below. For simplicity, all parameters were chosen heuristically. The max_depth is set to 3 because boosting tends to work well with so-called “weak” learners, e.g. simple trees. While the AUC for the training set can be as high as 0.95, the AUC for the testing set is only 0.60 in our case, implying an overfitting issue.

mart.parm <- list(booster = "gbtree", nthread = 4, eta = 0.1, max_depth = 3, subsample = 1, eval_metric = "auc")
mart <- xgboost(data = as.matrix(train[, -1]), label = train[, 1], params = mart.parm, nrounds = 500, verbose = 0, seed = 2017)
pred1 <- predict(mart, as.matrix(train[, -1]))
pred2 <- predict(mart, as.matrix(test[, -1]))
roc(as.factor(train$DEFAULT), pred1)
# Area under the curve: 0.9459
roc(as.factor(test$DEFAULT), pred2)
# Area under the curve: 0.6046

With the same set of parameters, we refitted the ensemble with dropouts, i.e. DART. As shown below, by dropping 10% of tree members in each iteration, the AUC for the testing set increases from 0.60 to 0.65. In addition, the performance disparity between the training and testing sets decreases significantly with DART.

dart.parm <- list(booster = "dart", rate_drop = 0.1, nthread = 4, eta = 0.1, max_depth = 3, subsample = 1, eval_metric = "auc")
dart <- xgboost(data = as.matrix(train[, -1]), label = train[, 1], params = dart.parm, nrounds = 500, verbose = 0, seed = 2017)
pred1 <- predict(dart, as.matrix(train[, -1]))
pred2 <- predict(dart, as.matrix(test[, -1]))
roc(as.factor(train$DEFAULT), pred1)
# Area under the curve: 0.7734
roc(as.factor(test$DEFAULT), pred2)
# Area under the curve: 0.6517

Besides rate_drop = 0.1, a wide range of dropout rates has also been tested. In most cases, DART outperforms its counterpart without dropout regularization.
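
Below is a minimal sketch of such a grid search over dropout rates, assuming the train/test split and the pROC and xgboost packages loaded in the code above; the candidate rates are illustrative rather than exhaustive.

# grid of candidate dropout rates (illustrative values only)
rates <- c(0.05, 0.1, 0.2, 0.3, 0.5)
auc_by_rate <- sapply(rates, function(r) {
  parm <- list(booster = "dart", rate_drop = r, nthread = 4, eta = 0.1,
               max_depth = 3, subsample = 1, eval_metric = "auc")
  fit <- xgboost(data = as.matrix(train[, -1]), label = train[, 1],
                 params = parm, nrounds = 500, verbose = 0, seed = 2017)
  # AUC on the testing set for this dropout rate
  as.numeric(auc(as.factor(test$DEFAULT), predict(fit, as.matrix(test[, -1]))))
})
data.frame(rate_drop = rates, test_auc = auc_by_rate)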

Written by statcompute

August 20, 2017 at 5:50 pm

Model Operational Losses with Copula Regression

In the previous post (https://statcompute.wordpress.com/2017/06/29/model-operational-loss-directly-with-tweedie-glm), it has been explained why we should consider modeling operational losses for non-material UoMs directly with Tweedie models. However, for material UoMs with significant losses, it is still beneficial to model the frequency and the severity separately.

In the prevailing modeling practice for operational losses, it is often convenient to assume functional independence between the frequency and severity models, which might not hold empirically. For instance, in an economic downturn, both the frequency and the severity of consumer frauds might tend to increase simultaneously. Under the independence assumption, while we can argue that the same variables could be included in both the frequency and severity models and therefore induce a certain correlation, the frequency-severity dependence and its contribution to the loss distribution might be overlooked.

In the context of copulas, the distribution of operational losses can be considered a joint distribution determined by the marginal distributions together with a parameter measuring the dependence between the marginals, where the marginals can be Poisson for the frequency and Gamma for the severity. Depending on the dependence structure in the data, various copula functions might be considered. For instance, a product copula can be used to describe independence. In the example shown below, a Gumbel copula is chosen because it is often used to describe positive dependence in the right tail, e.g. high severity together with high frequency. For details, the book “Copula Modeling” by Trivedi and Zimmer is a good reference to start with.
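
As a quick sanity check on the relationship between Kendall's tau and the Gumbel copula parameter used in the simulation below, here is a minimal sketch based on the copula package, which is not otherwise used in this post and is assumed to be installed.

library(copula)
# for a Gumbel copula, Kendall's tau = 1 - 1 / theta, i.e. theta = 1 / (1 - tau)
tau(gumbelCopula(param = 2))
# expected to return 0.5, the tau implied by theta = 2
iTau(gumbelCopula(), 0.5)
# expected to return 2, the theta implied by tau = 0.5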

In the demonstration, we simulated both frequency and severity measures driven by the same set of covariates. The two are positively correlated with Kendall’s tau = 0.5 under the assumption of a Gumbel copula.

library(CopulaRegression)
# number of observations to simulate
n <- 100
# seed value for the simulation
set.seed(2017)
# design matrices with a constant column
X <- cbind(rep(1, n), runif(n), runif(n))
# define coefficients for both Poisson and Gamma regressions
p_beta <- g_beta <- c(3, -2, 1)
# define the Gamma dispersion
delta <- 1
# define the Kendall's tau
tau <- 0.5
# copula parameter based on tau
theta <- 1 / (1 - tau)
# define the Gumbel Copula 
family <- 4
# simulate outcomes
out <- simulate_regression_data(n, g_beta, p_beta, X, X, delta, tau, family, zt = FALSE)
G <- out[, 1]
P <- out[, 2]

After the simulation, a copula regression is estimated with Poisson and Gamma marginals for the frequency and the severity respectively. As shown in the model output, the estimated parameters and related inferences differ between the independence and dependence assumptions.

m <- copreg(G, P, X, family = 4, sd.error = TRUE, joint = TRUE, zt = FALSE)
coef <- c("_CONST", "X1", "X2")
cols <- c("ESTIMATE", "STD. ERR", "Z-VALUE")
g_est <- cbind(m$alpha, m$sd.alpha, m$alpha / m$sd.alpha)
p_est <- cbind(m$beta, m$sd.beta, m$beta / m$sd.beta)
g_est0 <- cbind(m$alpha0, m$sd.alpha0, m$alpha0 / m$sd.alpha0)
p_est0 <- cbind(m$beta0, m$sd.beta0, m$beta0 / m$sd.beta0)
rownames(g_est) <- rownames(g_est0) <- rownames(p_est) <- rownames(p_est0) <- coef
colnames(g_est) <- colnames(g_est0) <- colnames(p_est) <- colnames(p_est0) <- cols

# estimated coefficients for the Gamma regression assumed dependence 
print(g_est)
#          ESTIMATE  STD. ERR   Z-VALUE
# _CONST  2.9710512 0.2303651 12.897141
# X1     -1.8047627 0.2944627 -6.129003
# X2      0.9071093 0.2995218  3.028526

# estimated coefficients for the Poisson regression assumed dependence 
print(p_est)
#         ESTIMATE   STD. ERR   Z-VALUE
# _CONST  2.954519 0.06023353  49.05107
# X1     -1.967023 0.09233056 -21.30414
# X2      1.025863 0.08254870  12.42736

# estimated coefficients for the Gamma regression assumed independence 
# should be identical to GLM() outcome
print(g_est0)
#         ESTIMATE  STD. ERR   Z-VALUE
# _CONST  3.020771 0.2499246 12.086727
# X1     -1.777570 0.3480328 -5.107478
# X2      0.905527 0.3619011  2.502140

# estimated coefficients for the Poisson regression assumed independence 
# should be identical to GLM() outcome
print(p_est0)
#         ESTIMATE   STD. ERR   Z-VALUE
# _CONST  2.939787 0.06507502  45.17536
# X1     -2.010535 0.10297887 -19.52376
# X2      1.088269 0.09334663  11.65837

If we compare the conditional loss distributions under the two dependence assumptions, the predicted loss from the copula regression tends to have a fatter right tail and therefore should be considered more conservative.

library(ggplot2)
df <- data.frame(g = G, p = P, x1 = X[, 2], x2 = X[, 3])
glm_p <- glm(p ~ x1 + x2, data = df, family = poisson(log))
glm_g <- glm(g ~ x1 + x2, data = df, family = Gamma(log))
loss_dep <- predict(m, X, X, independence = FALSE)[3][[1]][[1]]
loss_ind <- fitted(glm_p) * fitted(glm_g)
den <- data.frame(loss = c(loss_dep, loss_ind), lines = rep(c("DEPENDENCE", "INDEPENDENCE"), each = n))
ggplot(den, aes(x = loss, fill = lines)) + geom_density(alpha = 0.5)

[Figure: density curves of the predicted losses under the dependence and independence assumptions]

Written by statcompute

August 20, 2017 at 5:22 pm

Sparkling Water and Moving Data Around

Sparkling Water is an application that integrates H2O with Spark. Below is an example showing how to move data around among a Pandas DataFrame, an H2OFrame, and a Spark DataFrame.

1. Define Context

In [1]: from pandas import read_csv, DataFrame

In [2]: from pyspark import sql

In [3]: from pysparkling import H2OContext

In [4]: from h2o import import_file, H2OFrame

In [5]: ss = sql.SparkSession.builder.getOrCreate()

In [6]: hc = H2OContext.getOrCreate(ss)

2. Convert Pandas DataFrame to H2OFrame and Spark DataFrame

In [7]: p_df = read_csv("Documents/credit_count.txt")

In [8]: type(p_df)
Out[8]: pandas.core.frame.DataFrame

In [9]: p2s_df = ss.createDataFrame(p_df)

In [10]: type(p2s_df)
Out[10]: pyspark.sql.dataframe.DataFrame

In [11]: p2h_df = H2OFrame(p_df)

In [12]: type(p2h_df)
Out[12]: h2o.frame.H2OFrame

3. Convert Spark DataFrame to H2OFrame and Pandas DataFrame

In [13]: s_df = ss.read.csv("Documents/credit_count.txt", header = True, inferSchema = True)

In [14]: type(s_df)
Out[14]: pyspark.sql.dataframe.DataFrame

In [15]: s2p_df = s_df.toPandas()

In [16]: type(s2p_df)
Out[16]: pandas.core.frame.DataFrame

In [17]: s2h_df = hc.as_h2o_frame(s_df)

In [18]: type(s2h_df)
Out[18]: h2o.frame.H2OFrame

4. Convert H2OFrame to Pandas DataFrame and Spark DataFrame

In [19]: h_df = import_file("Documents/credit_count.txt", header = 1, sep = ",")

In [20]: type(h_df)
Out[20]: h2o.frame.H2OFrame

In [21]: h2p_df = h_df.as_data_frame()

In [22]: type(h2p_df)
Out[22]: pandas.core.frame.DataFrame

In [23]: h2s_df = hc.as_spark_frame(h_df)

In [24]: type(h2s_df)
Out[24]: pyspark.sql.dataframe.DataFrame

Written by statcompute

July 3, 2017 at 12:36 am

Model Operational Loss Directly with Tweedie GLM

In the development of operational loss forecasting models, the Frequency-Severity modeling approach, in which the frequency and the severity of a Unit of Measure (UoM) are modeled separately, has been widely employed in the banking industry. However, sometimes it also makes sense to model the operational loss directly, especially for UoMs with non-material losses. First of all, given the low loss amount, the effort of developing two models, i.e. frequency and severity, might not be justified. Secondly, for UoMs with low losses due to low frequencies, modeling the frequency and the severity separately might overlook the internal connection between the low frequency and the subsequent low loss amount. For instance, when the frequency N = 0, the loss L = $0 inevitably.

The Tweedie distribution can be defined as a Poisson sum of Gamma random variables. In particular, if the frequency of loss events N is assumed to follow a Poisson distribution and the loss amount L_i of an event i, where i = 1, …, N, is assumed to follow a Gamma distribution, then the total loss amount L = SUM[L_i] would have a Tweedie distribution. When there is no loss event, i.e. N = 0, then Prob(L = $0) = Prob(N = 0) = Exp(-Lambda). However, when N > 0, then L = L_1 + … + L_N > $0 is governed by a Gamma distribution, since the sum of i.i.d. Gamma random variables is also Gamma.
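
Below is a minimal simulation sketch of this compound Poisson-Gamma construction; the lambda, shape, and scale values are assumed for illustration only.

set.seed(2017)
nsim <- 100000
lambda <- 0.5
# number of loss events per period, Poisson distributed
N <- rpois(nsim, lambda)
# total loss per period as a Poisson sum of Gamma-distributed severities
L <- sapply(N, function(k) if (k == 0) 0 else sum(rgamma(k, shape = 2, scale = 100)))
# the empirical mass at zero should be close to Exp(-Lambda)
c(mean(L == 0), exp(-lambda))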

For the Tweedie loss, E(L) = Mu and VAR(L) = Phi * (Mu ** P), where P is called the index parameter and Phi is the dispersion parameter. When P approaches 1, so that VAR(L) approaches Phi * E(L), the Tweedie behaves like an over-dispersed Poisson distribution. When P approaches 2, so that VAR(L) approaches Phi * (E(L) ** 2), the Tweedie behaves like a Gamma distribution. When P is strictly between 1 and 2, the Tweedie is a compound Poisson-Gamma distribution, and both P and Phi can be estimated from the data.
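
As a quick illustration of this mean-variance relationship, below is a small sketch based on rtweedie() in the tweedie package; the parameter values are assumptions for demonstration only.

library(tweedie)
set.seed(2017)
mu <- 10; phi <- 2; p <- 1.5
x <- rtweedie(200000, mu = mu, phi = phi, power = p)
# the sample mean should be close to Mu and the sample variance close to Phi * (Mu ** P)
c(mean(x), mu)
c(var(x), phi * mu ^ p)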

To estimate a regression with the Tweedie distributional assumption, there are two implementation approaches in R, based on the cplm and statmod packages respectively. With the cplm package, the Tweedie regression can be estimated directly as long as P is in the range of (1, 2), as shown below. In this example, the estimated index parameter P is 1.42.

> library(cplm)
> data(FineRoot)
> m1 <- cpglm(RLD ~ Zone + Stock, data = FineRoot)
> summary(m1)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.0611  -0.6475  -0.3928   0.1380   1.9627  

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.95141    0.14643 -13.327  < 2e-16 ***
ZoneOuter   -0.85693    0.13292  -6.447 2.66e-10 ***
StockMM106   0.01177    0.17535   0.067    0.947    
StockMark   -0.83933    0.17476  -4.803 2.06e-06 ***
---
Estimated dispersion parameter: 0.35092
Estimated index parameter: 1.4216 

Residual deviance: 203.91  on 507  degrees of freedom
AIC:  -157.33 

The statmod package provides a more general and flexible solution through a two-stage estimation, which estimates the P parameter first and then estimates the regression parameters. In real-world practice, we could do a coarse search to narrow down a reasonable range of P and then do a fine search to identify the optimal P value. As shown below, all estimated parameters are fairly consistent with the ones in the previous example.
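
A minimal sketch of such a coarse-then-fine search is shown below, assuming the tweedie package and the FineRoot data from the cplm package are loaded as in the surrounding code; the grid widths are illustrative.

# coarse search over a wide grid of P values
coarse <- tweedie.profile(RLD ~ Zone + Stock, data = FineRoot, p.vec = seq(1.1, 1.9, 0.1), method = "series")
# fine search over a narrow grid around the coarse optimum, kept inside (1, 2)
p_lo <- max(coarse$p.max - 0.1, 1.05)
p_hi <- min(coarse$p.max + 0.1, 1.95)
fine <- tweedie.profile(RLD ~ Zone + Stock, data = FineRoot, p.vec = seq(p_lo, p_hi, 0.01), method = "series")
fine$p.max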

> library(tweedie)
> library(statmod)
> prof <- tweedie.profile(RLD ~ Zone + Stock, data = FineRoot, p.vec = seq(1.1, 1.9, 0.01), method = "series")
1.1 1.11 1.12 1.13 1.14 1.15 1.16 1.17 1.18 1.19 1.2 1.21 1.22 1.23 1.24 1.25 1.26 1.27 1.28 1.29 1.3 1.31 1.32 1.33 1.34 1.35 1.36 1.37 1.38 1.39 1.4 1.41 1.42 1.43 1.44 1.45 1.46 1.47 1.48 1.49 1.5 1.51 1.52 1.53 1.54 1.55 1.56 1.57 1.58 1.59 1.6 1.61 1.62 1.63 1.64 1.65 1.66 1.67 1.68 1.69 1.7 1.71 1.72 1.73 1.74 1.75 1.76 1.77 1.78 1.79 1.8 1.81 1.82 1.83 1.84 1.85 1.86 1.87 1.88 1.89 1.9 
.................................................................................Done.
> prof$p.max
[1] 1.426531
> m2 <- glm(RLD ~ Zone + Stock, data = FineRoot, family = tweedie(var.power = prof$p.max, link.power = 0))
> summary(m2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.0712  -0.6559  -0.3954   0.1380   1.9728  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.95056    0.14667 -13.299  < 2e-16 ***
ZoneOuter   -0.85823    0.13297  -6.454 2.55e-10 ***
StockMM106   0.01204    0.17561   0.069    0.945    
StockMark   -0.84044    0.17492  -4.805 2.04e-06 ***
---
(Dispersion parameter for Tweedie family taken to be 0.4496605)

    Null deviance: 241.48  on 510  degrees of freedom
Residual deviance: 207.68  on 507  degrees of freedom
AIC: NA

Written by statcompute

June 29, 2017 at 10:46 pm

GLM with H2O in R

Below is an example showing how to fit a Generalized Linear Model with H2O in R. The output is much more comprehensive than the one generated by the generic R glm().

> library(h2o)

> h2o.init(max_mem_size = "12g")

> df1 <- h2o.uploadFile("Documents/credit_count.txt", header = TRUE, sep = ",", parse_type = "CSV")

> df2 <- h2o.assign(df1[df1$CARDHLDR == 1, ], "glm_df")

> h2o.colnames(df2)
 [1] "CARDHLDR" "DEFAULT"  "AGE"      "ACADMOS"  "ADEPCNT"  "MAJORDRG"
 [7] "MINORDRG" "OWNRENT"  "INCOME"   "SELFEMPL" "INCPER"   "EXP_INC"
[13] "SPENDING" "LOGSPEND"

> Y <- "DEFAULT"

> X <- c("MAJORDRG", "MINORDRG", "INCOME", "OWNRENT")

> dist <- "binomial"

> link <- "logit"

> id <- "h2o_mdl01"

> mdl <- h2o.glm(X, Y, training_frame = h2o.getFrame("glm_df"), model_id = id, family = dist, link = link, lambda = 0, compute_p_values = TRUE, standardize = FALSE)

> show(h2o.getModel(id)@model$coefficients_table)
Coefficients: glm coefficients
      names coefficients std_error    z_value  p_value
1 Intercept    -1.204439  0.090811 -13.263121 0.000000
2  MAJORDRG     0.203135  0.069250   2.933370 0.003353
3  MINORDRG     0.202727  0.047971   4.226014 0.000024
4   OWNRENT    -0.201223  0.071619  -2.809636 0.004960
5    INCOME    -0.000442  0.000040 -10.942350 0.000000

> h2o.performance(h2o.getModel(id))
H2OBinomialMetrics: glm
** Reported on training data. **

MSE:  0.08414496
RMSE:  0.2900775
LogLoss:  0.3036585
Mean Per-Class Error:  0.410972
AUC:  0.6432189
Gini:  0.2864378
R^2:  0.02005004
Residual Deviance:  6376.221
AIC:  6386.221

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
          0    1    Error         Rate
0      7703 1800 0.189414   =1800/9503
1       630  366 0.632530     =630/996
Totals 8333 2166 0.231451  =2430/10499

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold    value idx
1                       max f1  0.126755 0.231499 142
2                       max f2  0.075073 0.376556 272
3                 max f0point5  0.138125 0.191828 115
4                 max accuracy  0.368431 0.905039   0
5                max precision  0.314224 0.250000   3
6                   max recall  0.006115 1.000000 399
7              max specificity  0.368431 0.999895   0
8             max absolute_mcc  0.126755 0.128940 142
9   max min_per_class_accuracy  0.106204 0.604546 196
10 max mean_per_class_accuracy  0.103730 0.605663 202

Written by statcompute

June 28, 2017 at 12:25 am

H2O Benchmark for CSV Import

The importFile() function in H2O is extremely efficient due to its parallel reading. The benchmark comparison below shows that it is comparable to read.df() in SparkR and significantly faster than the generic read.csv().

library(SparkR, lib.loc = paste(Sys.getenv("SPARK_HOME"), "/R/lib", sep = ""))
sc <- sparkR.session(master = "local", sparkConfig = list(spark.driver.memory = "10g", spark.driver.cores = "4"))

library(h2o)
h2o.init(max_mem_size = "10g")

library(rbenchmark)

benchmark(replications = 5, order = "elapsed", relative = "elapsed",
   csv = {
          df <- read.csv("Documents/nycflights13.csv")
          print(nrow(df))
          rm(df)
         },
   spk = {
          df <- read.df("Documents/nycflights13.csv", source = "csv", header = "true", inferSchema = "true")
          print(nrow(df))
          rm(df)
         },
   h2o = {
          df <- h2o.importFile(path = "Documents/nycflights13.csv", header = TRUE, sep = ",")
          print(nrow(df))
          rm(df)
         }
 )

#   test replications elapsed relative user.self sys.self user.child sys.child
# 3  h2o            5   8.221    1.000     0.508    0.032          0         0
# 2  spk            5   9.822    1.195     0.008    0.004          0         0
# 1  csv            5  16.595    2.019    16.420    0.176          0         0

Written by statcompute

June 26, 2017 at 1:24 am

Using Tweedie Parameter to Identify Distributions

In the development of operational loss models, it is important to identify which distribution should be used to model operational risk measures, e.g. frequency and severity. For instance, why should we use the Gamma distribution instead of the Inverse Gaussian distribution to model the severity?

In my previous post https://statcompute.wordpress.com/2016/11/20/modified-park-test-in-sas, it is shown how to use the Modified Park test to identify the mean-variance relationship and then decide the corresponding distribution of operational risk measures. Following similar logic, we can also leverage the flexibility of the Tweedie distribution to accomplish the same goal. Based upon the parameterization of a Tweedie distribution, the variance = Phi * (Mu ** P), where Mu is the mean and P is the power parameter. Depending on the specific value of P, the Tweedie distribution can accommodate several important distributions commonly used in operational risk modeling, including the Poisson, Gamma, and Inverse Gaussian (see the simulation sketch after this list). For instance:

  • With P = 0, the variance would be independent of the mean, indicating a Normal distribution.
  • With P = 1, the variance would be in a linear form of the mean, indicating a Poisson-like distribution.
  • With P = 2, the variance would be in a quadratic form of the mean, indicating a Gamma distribution.
  • With P = 3, the variance would be in a cubic form of the mean, indicating an Inverse Gaussian distribution.
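
As a quick illustration of these mean-variance relationships, below is a small simulation sketch with assumed parameter values; it is independent of the operational loss example that follows.

set.seed(2017)
mu <- c(2, 5, 10, 20)
# Poisson: the variance grows linearly with the mean, i.e. P = 1
sapply(mu, function(m) var(rpois(100000, m)) / m)
# Gamma with shape = 4: the variance grows with the square of the mean, i.e. P = 2
sapply(mu, function(m) var(rgamma(100000, shape = 4, scale = m / 4)) / m ^ 2)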

The example below shows that the value of P is in the neighborhood of 1 for the frequency measure and near 3 for the severity measure, and that, with P closer to 3, the Inverse Gaussian regression fits the severity better than the Gamma regression.

library(statmod)
library(tweedie)
# the AutoCollision data used below is assumed to come from the insuranceData package
library(insuranceData)
data(AutoCollision)

profile1 <- tweedie.profile(Claim_Count ~ Age + Vehicle_Use, data = AutoCollision, p.vec = seq(1.1, 3.0, 0.1), fit.glm = TRUE)
print(profile1$p.max)
# [1] 1.216327
# The P parameter close to 1 indicates that the claim_count might follow a Poisson-like distribution

profile2 <- tweedie.profile(Severity ~ Age + Vehicle_Use, data = AutoCollision, p.vec = seq(1.1, 3.0, 0.1), fit.glm = TRUE)
print(profile2$p.max)
# [1] 2.844898
# The P parameter close to 3 indicates that the severity might follow an Inverse Gaussian distribution

BIC(glm(Severity ~ Age + Vehicle_Use, data = AutoCollision, family = Gamma(link = log)))
# [1] 360.8064

BIC(glm(Severity ~ Age + Vehicle_Use, data = AutoCollision, family = inverse.gaussian(link = log)))
# [1] 350.2504

Together with the Modified Park test, the estimation of P in a Tweedie distribution can help us identify the appropriate distribution to employ in operational loss models in the context of GLM.

Written by statcompute

June 24, 2017 at 10:55 pm