## Archive for the ‘**S+/R**’ Category

## Model Non-Negative Numeric Outcomes with Zeros

As mentioned in the previous post (https://statcompute.wordpress.com/2017/06/29/model-operational-loss-directly-with-tweedie-glm/), we often need to model non-negative numeric outcomes with zeros in the operational loss model development. Tweedie GLM provides a convenient interface to model non-negative losses directly by assuming that aggregated losses are the Poisson sum of Gamma outcomes, which however might not be well supported empirically from the data generation standpoint.

In examples below, we demonstrated another flexible option, namely Zero-Adjusted (ZA) models, in both scenarios of modeling non-negative numeric outcomes, one with a small number of zeros and the other with a large number of zeros. The basic idea of ZA models is very intuitive and similar to the concept of Hurdle models for count outcomes. In a nutshell, non-negative numeric outcomes can be considered two data generation processes, one for point-mass at zeros and the other governed by a statistical distribution for positive outcomes. The latter could be either Gamma or Inverse Gaussian.

First of all, we sampled down an auto-claim data in a way that only 10 claims are zeros and the rest are all positive. While 10 is an arbitrary choice in the example, other small numbers should show similar results.

pkgs <- list("cplm", "gamlss", "MLmetrics") lapply(pkgs, require, character.only = T) data(AutoClaim, package = "cplm") df1 <- na.omit(AutoClaim) # SMALL NUMBER OF ZEROS set.seed(2017) smp <- sample(seq(nrow(df1[df1$CLM_AMT == 0, ])), size = 10, replace = FALSE) df2 <- rbind(df1[df1$CLM_AMT > 0, ], df1[df1$CLM_AMT == 0, ][smp, ])

Next, we applied both Tweedie and zero-adjusted Gamma (ZAGA) models to the data with only 10 zero outcomes. It is worth mentioning that ZAGA doesn’t have to be overly complex in this case. As shown below, while we estimated the Gamma Mu parameter with model attributes, the Nu parameter to separate zeros is just a constant with the intercept = -5.4. Both Tweedie and GAZA models gave very similar estimated parameters and predictive measures with MAPE = 0.61.

tw <- cpglm(CLM_AMT ~ BLUEBOOK + NPOLICY, data = df2) # Estimate Std. Error t value Pr(>|t|) # (Intercept) 8.194e+00 7.234e-02 113.277 < 2e-16 *** # BLUEBOOK 2.047e-05 3.068e-06 6.671 3.21e-11 *** # NPOLICY 7.274e-02 3.102e-02 2.345 0.0191 * MAPE(df2$CLM_AMT, fitted(tw)) # 0.6053669 zaga0 <- gamlss(CLM_AMT ~ BLUEBOOK + NPOLICY, data = df2, family = "ZAGA") # Mu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 8.203e+00 4.671e-02 175.629 < 2e-16 *** # BLUEBOOK 2.053e-05 2.090e-06 9.821 < 2e-16 *** # NPOLICY 6.948e-02 2.057e-02 3.377 0.000746 *** # Nu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) -5.3886 0.3169 -17 <2e-16 *** MAPE(df2$CLM_AMT, (1 - fitted(zaga0, what = "nu")) * fitted(zaga0, what = "mu")) # 0.6053314

In the next case, we used the full data with a large number of zeros in the response and then applied both Tweedie and ZAGA models again. However, in ZAGA model, we estimated two sub-models this time, one for the Nu parameter to separate zeros from non-zeros and the other for the Mu parameter to model non-zero outcomes. As shown below, ZAGA outperformed Tweedie in terms of MAPE due to the advantage that ZAGA is able to explain two data generation schemes separately with different model attributes, which is the capability beyond what Tweedie can provide.

# LARGE NUMBER OF ZEROS tw <- cpglm(CLM_AMT ~ BLUEBOOK + NPOLICY + CLM_FREQ5 + MVR_PTS + INCOME, data = df1) # Estimate Std. Error t value Pr(>|t|) # (Intercept) 6.854e+00 1.067e-01 64.241 < 2e-16 *** # BLUEBOOK 1.332e-05 4.495e-06 2.963 0.00305 ** # NPOLICY 4.380e-02 3.664e-02 1.195 0.23196 # CLM_FREQ5 2.064e-01 2.937e-02 7.026 2.29e-12 *** # MVR_PTS 1.066e-01 1.510e-02 7.063 1.76e-12 *** # INCOME -4.606e-06 8.612e-07 -5.348 9.12e-08 *** MAPE(df1$CLM_AMT, fitted(tw)) # 1.484484 zaga1 <- gamlss(CLM_AMT ~ BLUEBOOK + NPOLICY, nu.formula = ~(CLM_FREQ5 + MVR_PTS + INCOME), data = df1, family = "ZAGA") # Mu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 8.203e+00 4.682e-02 175.218 < 2e-16 *** # BLUEBOOK 2.053e-05 2.091e-06 9.816 < 2e-16 *** # NPOLICY 6.948e-02 2.067e-02 3.362 0.000778 *** # Nu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 1.153e+00 5.077e-02 22.72 <2e-16 *** # CLM_FREQ5 -3.028e-01 2.283e-02 -13.26 <2e-16 *** # MVR_PTS -1.509e-01 1.217e-02 -12.41 <2e-16 *** # INCOME 7.285e-06 6.269e-07 11.62 <2e-16 *** MAPE(df1$CLM_AMT, (1 - fitted(zaga1, what = "nu")) * fitted(zaga1, what = "mu")) # 1.470228

Given the great flexibility of ZA models, we also have the luxury to explore other candidates than ZAGA. For instance, if the positive part of non-negative outcomes demonstrates a high variance, we can also try a zero-inflated Inverse Gaussian (ZAIG) model, as shown below.

zaig1 <- gamlss(CLM_AMT ~ BLUEBOOK + NPOLICY, nu.formula = ~(CLM_FREQ5 + MVR_PTS + INCOME), data = df1, family = "ZAIG") # Mu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 8.205e+00 5.836e-02 140.591 < 2e-16 *** # BLUEBOOK 2.163e-05 2.976e-06 7.268 3.97e-13 *** # NPOLICY 5.898e-02 2.681e-02 2.200 0.0278 * # Nu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 1.153e+00 5.077e-02 22.72 <2e-16 *** # CLM_FREQ5 -3.028e-01 2.283e-02 -13.26 <2e-16 *** # MVR_PTS -1.509e-01 1.217e-02 -12.41 <2e-16 *** # INCOME 7.285e-06 6.269e-07 11.62 <2e-16 *** MAPE(df1$CLM_AMT, (1 - fitted(zaig1, what = "nu")) * fitted(zaig1, what = "mu")) # 1.469236

## Variable Selection with Elastic Net

LASSO has been a popular algorithm for the variable selection and extremely effective with high-dimension data. However, it often tends to “over-regularize” a model that might be overly compact and therefore under-predictive.

The Elastic Net addresses the aforementioned “over-regularization” by balancing between LASSO and ridge penalties. In particular, a hyper-parameter, namely Alpha, would be used to regularize the model such that the model would become a LASSO in case of Alpha = 1 and a ridge in case of Alpha = 0. In practice, Alpha can be tuned easily by the cross-validation. Below is a demonstration of Elastic Net with R glmnet package and its comparison with LASSO and ridge models.

pkgs <- list("glmnet", "doParallel", "foreach", "pROC") lapply(pkgs, require, character.only = T) registerDoParallel(cores = 4) df1 <- read.csv("Downloads/credit_count.txt") df2 <- df1[df1$CARDHLDR == 1, ] set.seed(2017) n <- nrow(df2) sample <- sample(seq(n), size = n * 0.5, replace = FALSE) train <- df2[sample, -1] test <- df2[-sample, -1] mdlY <- as.factor(as.matrix(train["DEFAULT"])) mdlX <- as.matrix(train[setdiff(colnames(df1), c("CARDHLDR", "DEFAULT"))]) newY <- as.factor(as.matrix(test["DEFAULT"])) newX <- as.matrix(test[setdiff(colnames(df1), c("CARDHLDR", "DEFAULT"))])

First of all, we estimates a LASSO model with Alpha = 1. The function cv.glmnet() is used to search for a regularization parameter, namely Lambda, that controls the penalty strength. As shown below, the model only identifies 2 attributes out of total 12.

# LASSO WITH ALPHA = 1 cv1 <- cv.glmnet(mdlX, mdlY, family = "binomial", nfold = 10, type.measure = "deviance", paralle = TRUE, alpha = 1) md1 <- glmnet(mdlX, mdlY, family = "binomial", lambda = cv1$lambda.1se, alpha = 1) coef(md1) #(Intercept) -1.963030e+00 #AGE . #ACADMOS . #ADEPCNT . #MAJORDRG . #MINORDRG . #OWNRENT . #INCOME -5.845981e-05 #SELFEMPL . #INCPER . #EXP_INC . #SPENDING . #LOGSPEND -4.015902e-02 roc(newY, as.numeric(predict(md1, newX, type = "response"))) #Area under the curve: 0.636

We next estimates a ridge model as below by setting Alpha = 0. Similarly, Lambda is searched by the cross-validation. Since the ridge penalty would only regularize the magnitude of each coefficient, we end up with a “full” model with all model attributes. The model performance is slightly better with 10 more variables, which is a debatable outcome.

# RIDGE WITH ALPHA = 0 cv2 <- cv.glmnet(mdlX, mdlY, family = "binomial", nfold = 10, type.measure = "deviance", paralle = TRUE, alpha = 0) md2 <- glmnet(mdlX, mdlY, family = "binomial", lambda = cv2$lambda.1se, alpha = 0) coef(md2) #(Intercept) -2.221016e+00 #AGE -4.184422e-04 #ACADMOS -3.085096e-05 #ADEPCNT 1.485114e-04 #MAJORDRG 6.684849e-03 #MINORDRG 1.006660e-03 #OWNRENT -9.082750e-03 #INCOME -6.960253e-06 #SELFEMPL 3.610381e-03 #INCPER -3.881890e-07 #EXP_INC -1.416971e-02 #SPENDING -1.638184e-05 #LOGSPEND -6.213884e-03 roc(newY, as.numeric(predict(md2, newX, type = "response"))) #Area under the curve: 0.6435

At last, we use the Elastic Net by tuning the value of Alpha through a line search with the parallelism. In this particular case, Alpha = 0.3 is chosen through the cross-validation. As shown below, 6 variables are used in the model that even performs better than the ridge model with all 12 attributes.

# ELASTIC NET WITH 0 < ALPHA < 1 a <- seq(0.1, 0.9, 0.05) search <- foreach(i = a, .combine = rbind) %dopar% { cv <- cv.glmnet(mdlX, mdlY, family = "binomial", nfold = 10, type.measure = "deviance", paralle = TRUE, alpha = i) data.frame(cvm = cv$cvm[cv$lambda == cv$lambda.1se], lambda.1se = cv$lambda.1se, alpha = i) } cv3 <- search[search$cvm == min(search$cvm), ] md3 <- glmnet(mdlX, mdlY, family = "binomial", lambda = cv3$lambda.1se, alpha = cv3$alpha) coef(md3) #(Intercept) -1.434700e+00 #AGE -8.426525e-04 #ACADMOS . #ADEPCNT . #MAJORDRG 6.276924e-02 #MINORDRG . #OWNRENT -2.780958e-02 #INCOME -1.305118e-04 #SELFEMPL . #INCPER -2.085349e-06 #EXP_INC . #SPENDING . #LOGSPEND -9.992808e-02 roc(newY, as.numeric(predict(md3, newX, type = "response"))) #Area under the curve: 0.6449

## DART: Dropout Regularization in Boosting Ensembles

The dropout approach developed by Hinton has been widely employed in deep learnings to prevent the deep neural network from overfitting, as shown in https://statcompute.wordpress.com/2017/01/02/dropout-regularization-in-deep-neural-networks.

In the paper http://proceedings.mlr.press/v38/korlakaivinayak15.pdf, the dropout can also be used to address the overfitting in boosting tree ensembles, e.g. MART, caused by the so-called “over-specialization”. In particular, while first few trees added at the beginning of ensembles would dominate the model performance, the rest added later can only improve the prediction for a small subset, which increases the risk of overfitting. The idea of DART is to build an ensemble by randomly dropping boosting tree members. The percentage of dropouts can determine the degree of regularization for boosting tree ensembles.

Below is a demonstration showing the implementation of DART with the R xgboost package. First of all, after importing the data, we divided it into two pieces, one for training and the other for testing.

pkgs <- c('pROC', 'xgboost') lapply(pkgs, require, character.only = T) df1 <- read.csv("Downloads/credit_count.txt") df2 <- df1[df1$CARDHLDR == 1, ] set.seed(2017) n <- nrow(df2) sample <- sample(seq(n), size = n / 2, replace = FALSE) train <- df2[sample, -1] test <- df2[-sample, -1]

For the comparison purpose, we first developed a boosting tree ensemble without dropouts, as shown below. For the simplicity, all parameters were chosen heuristically. The max_depth is set to 3 due to the fact that the boosting tends to work well with so-called “weak” learners, e.g. simple trees. While ROC for the training set can be as high as 0.95, ROC for the testing set is only 0.60 in our case, implying the overfitting issue.

mart.parm <- list(booster = "gbtree", nthread = 4, eta = 0.1, max_depth = 3, subsample = 1, eval_metric = "auc") mart <- xgboost(data = as.matrix(train[, -1]), label = train[, 1], params = mart.parm, nrounds = 500, verbose = 0, seed = 2017) pred1 <- predict(mart, as.matrix(train[, -1])) pred2 <- predict(mart, as.matrix(test[, -1])) roc(as.factor(train$DEFAULT), pred1) # Area under the curve: 0.9459 roc(as.factor(test$DEFAULT), pred2) # Area under the curve: 0.6046

With the same set of parameters, we refitted the ensemble with dropouts, e.g. DART. As shown below, by dropping 10% tree members, ROC for the testing set can increase from 0.60 to 0.65. In addition, the performance disparity between training and testing sets with DART decreases significantly.

dart.parm <- list(booster = "dart", rate_drop = 0.1, nthread = 4, eta = 0.1, max_depth = 3, subsample = 1, eval_metric = "auc") dart <- xgboost(data = as.matrix(train[, -1]), label = train[, 1], params = dart.parm, nrounds = 500, verbose = 0, seed = 2017) pred1 <- predict(dart, as.matrix(train[, -1])) pred2 <- predict(dart, as.matrix(test[, -1])) roc(as.factor(train$DEFAULT), pred1) # Area under the curve: 0.7734 roc(as.factor(test$DEFAULT), pred2) # Area under the curve: 0.6517

Besides rate_drop = 0.1, a wide range of dropout rates have also been tested. In most cases, DART outperforms its counterpart without the dropout regularization.

## Model Operational Losses with Copula Regression

In the previous post (https://statcompute.wordpress.com/2017/06/29/model-operational-loss-directly-with-tweedie-glm), it has been explained why we should consider modeling operational losses for non-material UoMs directly with Tweedie models. However, for material UoMs with significant losses, it is still beneficial to model the frequency and the severity separately.

In the prevailing modeling practice for operational losses, it is often convenient to assume a functional independence between frequency and severity models, which might not be the case empirically. For instance, in the economic downturn, both the frequency and the severity of consumer frauds might tend to increase simultaneously. With the independence assumption, while we can argue that same variables could be included in both frequency and severity models and therefore induce a certain correlation, the frequency-severity dependence and the its contribution to the loss distribution might be overlooked.

In the context of Copula, the distribution of operational losses can be considered a joint distribution determined by both marginal distributions and a parameter measuring the dependence between marginals, of which marginal distributions can be Poisson for the frequency and Gamma for the severity. Depending on the dependence structure in the data, various copula functions might be considered. For instance, a product copula can be used to describe the independence. In the example shown below, a Gumbel copula is considered given that it is often used to describe the positive dependence on the right tail, e.g. high severity and high frequency. For details, the book “Copula Modeling” by Trivedi and Zimmer is a good reference to start with.

In the demonstration, we simulated both frequency and severity measures driven by the same set of co-variates. Both are positively correlated with the Kendall’s tau = 0.5 under the assumption of Gumbel copula.

library(CopulaRegression) # number of observations to simulate n <- 100 # seed value for the simulation set.seed(2017) # design matrices with a constant column X <- cbind(rep(1, n), runif(n), runif(n)) # define coefficients for both Poisson and Gamma regressions p_beta <- g_beta <- c(3, -2, 1) # define the Gamma dispersion delta <- 1 # define the Kendall's tau tau <- 0.5 # copula parameter based on tau theta <- 1 / (1 - tau) # define the Gumbel Copula family <- 4 # simulate outcomes out <- simulate_regression_data(n, g_beta, p_beta, X, X, delta, tau, family, zt = FALSE) G <- out[, 1] P <- out[, 2]

After the simulation, a Copula regression is estimated with Poisson and Gamma marginals for the frequency and the severity respectively. As shown in the model estimation, estimated parameters with related inferences are different between independent and dependent assumptions.

m <- copreg(G, P, X, family = 4, sd.error = TRUE, joint = TRUE, zt = FALSE) coef <- c("_CONST", "X1", "X2") cols <- c("ESTIMATE", "STD. ERR", "Z-VALUE") g_est <- cbind(m$alpha, m$sd.alpha, m$alpha / m$sd.alpha) p_est <- cbind(m$beta, m$sd.beta, m$beta / m$sd.beta) g_est0 <- cbind(m$alpha0, m$sd.alpha0, m$alpha0 / m$sd.alpha0) p_est0 <- cbind(m$beta0, m$sd.beta0, m$beta0 / m$sd.beta0) rownames(g_est) <- rownames(g_est0) <- rownames(p_est) <- rownames(p_est0) <- coef colnames(g_est) <- colnames(g_est0) <- colnames(p_est) <- colnames(p_est0) <- cols # estimated coefficients for the Gamma regression assumed dependence print(g_est) # ESTIMATE STD. ERR Z-VALUE # _CONST 2.9710512 0.2303651 12.897141 # X1 -1.8047627 0.2944627 -6.129003 # X2 0.9071093 0.2995218 3.028526 # estimated coefficients for the Gamma regression assumed dependence print(p_est) # ESTIMATE STD. ERR Z-VALUE # _CONST 2.954519 0.06023353 49.05107 # X1 -1.967023 0.09233056 -21.30414 # X2 1.025863 0.08254870 12.42736 # estimated coefficients for the Gamma regression assumed independence # should be identical to GLM() outcome print(g_est0) # ESTIMATE STD. ERR Z-VALUE # _CONST 3.020771 0.2499246 12.086727 # X1 -1.777570 0.3480328 -5.107478 # X2 0.905527 0.3619011 2.502140 # estimated coefficients for the Gamma regression assumed independence # should be identical to GLM() outcome print(p_est0) # ESTIMATE STD. ERR Z-VALUE # _CONST 2.939787 0.06507502 45.17536 # X1 -2.010535 0.10297887 -19.52376 # X2 1.088269 0.09334663 11.65837

If we compare conditional loss distributions under different dependence assumptions, it shows that the predicted loss with Copula regression tends to have a fatter right tail and therefore should be considered more conservative.

df <- data.frame(g = G, p = P, x1 = X[, 2], x2 = X[, 3]) glm_p <- glm(p ~ x1 + x2, data = df, family = poisson(log)) glm_g <- glm(g ~ x1 + x2, data = df, family = Gamma(log)) loss_dep <- predict(m, X, X, independence = FALSE)[3][[1]][[1]] loss_ind <- fitted(glm_p) * fitted(glm_g) den <- data.frame(loss = c(loss_dep, loss_ind), lines = rep(c("DEPENDENCE", "INDEPENDENCE"), each = n)) ggplot(den, aes(x = loss, fill = lines)) + geom_density(alpha = 0.5)

## Model Operational Loss Directly with Tweedie GLM

In the development of operational loss forecasting models, the Frequency-Severity modeling approach, which the frequency and the severity of a Unit of Measure (UoM) are modeled separately, has been widely employed in the banking industry. However, sometimes it also makes sense to model the operational loss directly, especially for UoMs with non-material losses. First of all, given the low loss amount, the effort of developing two models, e.g. frequency and severity, might not be justified. Secondly, for UoMs with low losses due to low frequencies, modeling the frequency and the severity separately might overlook the internal connection between the low frequency and the subsequent low loss amount. For instance, when the frequency N = 0, then the loss L = $0 inevitably.

The Tweedie distribution is defined as a Poisson sum of Gamma random variables. In particular, if the frequency of loss events N is assumed a Poisson distribution and the loss amount L_i of an event i, where i = 0, 1 … N, is assumed a Gamma distribution, then the total loss amount L = SUM[L_i] would have a Tweedie distribution. When there is no loss event, e.g. N = 0, then Prob(L = $0) = Prob(N = 0) = Exp(-Lambda). However, when N > 0, then L = L_1 + … + L_N > $0 is governed by a Gamma distribution, e.g. sum of I.I.D. Gamma also being Gamma.

For the Tweedie loss, E(L) = Mu and VAR(L) = Phi * (Mu ** P), where P is called the index parameter and Phi is the dispersion parameter. When P approaches 1 and therefore VAR(L) approaches Phi * E(L), the Tweedie would be similar to a Poisson-like distribution. When P approaches 2 and therefore VAR(L) approaches Phi * (E(L) ** 2), the Tweedie would be similar to a Gamma distribution. When P is between 1 and 2, then the Tweedie would be a compound mixture of Poisson and Gamma, where P and Phi can be estimated.

To estimate a regression with the Tweedie distributional assumption, there are two implementation approaches in R with cplm and statmod packages respectively. With the cplm package, the Tweedie regression can be estimated directly as long as P is in the range of (1, 2), as shown below. In the example, the estimated index parameter P is 1.42.

> library(cplm) > data(FineRoot) > m1 <- cpglm(RLD ~ Zone + Stock, data = FineRoot) > summary(m1) Deviance Residuals: Min 1Q Median 3Q Max -1.0611 -0.6475 -0.3928 0.1380 1.9627 Estimate Std. Error t value Pr(>|t|) (Intercept) -1.95141 0.14643 -13.327 < 2e-16 *** ZoneOuter -0.85693 0.13292 -6.447 2.66e-10 *** StockMM106 0.01177 0.17535 0.067 0.947 StockMark -0.83933 0.17476 -4.803 2.06e-06 *** --- Estimated dispersion parameter: 0.35092 Estimated index parameter: 1.4216 Residual deviance: 203.91 on 507 degrees of freedom AIC: -157.33

The statmod package provides a more general and flexible solution with the two-stage estimation, which will estimate the P parameter first and then estimate regression parameters. In the real-world practice, we could do a coarse search to narrow down a reasonable range of P and then do a fine search to identify the optimal P value. As shown below, all estimated parameters are fairly consistent with ones in the previous example.

> library(tweedie) > library(statmod) > prof <- tweedie.profile(RLD ~ Zone + Stock, data = FineRoot, p.vec = seq(1.1, 1.9, 0.01), method = "series") 1.1 1.11 1.12 1.13 1.14 1.15 1.16 1.17 1.18 1.19 1.2 1.21 1.22 1.23 1.24 1.25 1.26 1.27 1.28 1.29 1.3 1.31 1.32 1.33 1.34 1.35 1.36 1.37 1.38 1.39 1.4 1.41 1.42 1.43 1.44 1.45 1.46 1.47 1.48 1.49 1.5 1.51 1.52 1.53 1.54 1.55 1.56 1.57 1.58 1.59 1.6 1.61 1.62 1.63 1.64 1.65 1.66 1.67 1.68 1.69 1.7 1.71 1.72 1.73 1.74 1.75 1.76 1.77 1.78 1.79 1.8 1.81 1.82 1.83 1.84 1.85 1.86 1.87 1.88 1.89 1.9 .................................................................................Done. > prof$p.max [1] 1.426531 > m2 <- glm(RLD ~ Zone + Stock, data = FineRoot, family = tweedie(var.power = prof$p.max, link.power = 0)) > summary(m2) Deviance Residuals: Min 1Q Median 3Q Max -1.0712 -0.6559 -0.3954 0.1380 1.9728 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -1.95056 0.14667 -13.299 < 2e-16 *** ZoneOuter -0.85823 0.13297 -6.454 2.55e-10 *** StockMM106 0.01204 0.17561 0.069 0.945 StockMark -0.84044 0.17492 -4.805 2.04e-06 *** --- (Dispersion parameter for Tweedie family taken to be 0.4496605) Null deviance: 241.48 on 510 degrees of freedom Residual deviance: 207.68 on 507 degrees of freedom AIC: NA

## GLM with H2O in R

Below is an example showing how to fit a Generalized Linear Model with H2O in R. The output is much more comprehensive than the one generated by the generic R glm().

> library(h2o) > h2o.init(max_mem_size = "12g") > df1 <- h2o.uploadFile("Documents/credit_count.txt", header = TRUE, sep = ",", parse_type = "CSV") > df2 <- h2o.assign(df1[df1$CARDHLDR == 1, ], "glm_df") > h2o.colnames(df2) [1] "CARDHLDR" "DEFAULT" "AGE" "ACADMOS" "ADEPCNT" "MAJORDRG" [7] "MINORDRG" "OWNRENT" "INCOME" "SELFEMPL" "INCPER" "EXP_INC" [13] "SPENDING" "LOGSPEND" > Y <- "DEFAULT" > X <- c("MAJORDRG", "MINORDRG", "INCOME", "OWNRENT") > dist <- "binomial" > link <- "logit" > id <- "h2o_mdl01" > mdl <- h2o.glm(X, Y, training_frame = h2o.getFrame("glm_df"), model_id = id, family = dist, link = link, lambda = 0, compute_p_values = TRUE, standardize = FALSE) > show(h2o.getModel(id)@model$coefficients_table) Coefficients: glm coefficients names coefficients std_error z_value p_value 1 Intercept -1.204439 0.090811 -13.263121 0.000000 2 MAJORDRG 0.203135 0.069250 2.933370 0.003353 3 MINORDRG 0.202727 0.047971 4.226014 0.000024 4 OWNRENT -0.201223 0.071619 -2.809636 0.004960 5 INCOME -0.000442 0.000040 -10.942350 0.000000 > h2o.performance(h2o.getModel(id)) H2OBinomialMetrics: glm ** Reported on training data. ** MSE: 0.08414496 RMSE: 0.2900775 LogLoss: 0.3036585 Mean Per-Class Error: 0.410972 AUC: 0.6432189 Gini: 0.2864378 R^2: 0.02005004 Residual Deviance: 6376.221 AIC: 6386.221 Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold: 0 1 Error Rate 0 7703 1800 0.189414 =1800/9503 1 630 366 0.632530 =630/996 Totals 8333 2166 0.231451 =2430/10499 Maximum Metrics: Maximum metrics at their respective thresholds metric threshold value idx 1 max f1 0.126755 0.231499 142 2 max f2 0.075073 0.376556 272 3 max f0point5 0.138125 0.191828 115 4 max accuracy 0.368431 0.905039 0 5 max precision 0.314224 0.250000 3 6 max recall 0.006115 1.000000 399 7 max specificity 0.368431 0.999895 0 8 max absolute_mcc 0.126755 0.128940 142 9 max min_per_class_accuracy 0.106204 0.604546 196 10 max mean_per_class_accuracy 0.103730 0.605663 202

## H2O Benchmark for CSV Import

The importFile() function in H2O is extremely efficient due to the parallel reading. The benchmark comparison below shows that it is comparable to the read.df() in SparkR and significantly faster than the generic read.csv().

library(SparkR, lib.loc = paste(Sys.getenv("SPARK_HOME"), "/R/lib", sep = "")) sc <- sparkR.session(master = "local", sparkConfig = list(spark.driver.memory = "10g", spark.driver.cores = "4")) library(h2o) h2o.init(max_mem_size = "10g") library(rbenchmark) benchmark(replications = 5, order = "elapsed", relative = "elapsed", csv = { df <- read.csv("Documents/nycflights13.csv") print(nrow(df)) rm(df) }, spk = { df <- read.df("Documents/nycflights13.csv", source = "csv", header = "true", inferSchema = "true") print(nrow(df)) rm(df) }, h2o = { df <- h2o.importFile(path = "Documents/nycflights13.csv", header = TRUE, sep = ",") print(nrow(df)) rm(df) } ) # test replications elapsed relative user.self sys.self user.child sys.child # 3 h2o 5 8.221 1.000 0.508 0.032 0 0 # 2 spk 5 9.822 1.195 0.008 0.004 0 0 # 1 csv 5 16.595 2.019 16.420 0.176 0 0