## Where Bagging Might Work Better Than Boosting

In the previous post (https://statcompute.wordpress.com/2016/01/01/the-power-of-decision-stumps), it was shown that the boosting algorithm performs extremely well even with a simple 1-level stump as the base learner and provides a better performance lift than the bagging algorithm does. However, this observation shouldn’t be generalized, which would be demonstrated in the following example.

First of all, we developed a rule-based PART model as below. Albeit pruned, this model will still tend to over-fit the data, as shown in the highlighted.

# R = TRUE AND N = 10 FOR 10-FOLD CV PRUNING # M = 5 SPECIFYING MINIMUM NUMBER OF CASES PER LEAF part_control <- Weka_control(R = TRUE, N = 10, M = 5, Q = 2016) part <- PART(fml, data = df, control = part_control) roc(as.factor(train$DEFAULT), predict(part, newdata = train, type = "probability")[, 2]) # Area under the curve: 0.6839 roc(as.factor(test$DEFAULT), predict(part, newdata = test, type = "probability")[, 2]) # Area under the curve: 0.6082

Next, we applied the boosting to the PART model. As shown in the highlighted result below, AUC of the boosting on the testing data is even lower than AUC of the base model.

wlist <- list(PART, R = TRUE, N = 10, M = 5, Q = 2016) # I = 100 SPECIFYING NUMBER OF ITERATIONS # Q = TRUE SPECIFYING RESAMPLING USED IN THE BOOSTING boost_control <- Weka_control(I = 100, S = 2016, Q = TRUE, P = 100, W = wlist) boosting <- AdaBoostM1(fml, data = train, control = boost_control) roc(as.factor(test$DEFAULT), predict(boosting, newdata = test, type = "probability")[, 2]) # Area under the curve: 0.592

However, if employing the bagging, we are able to achieve more than 11% performance lift in terms of AUC.

# NUM-SLOTS = 0 AND I = 100 FOR PARALLELISM # P = 50 SPECIFYING THE SIZE OF EACH BAG bag_control <- Weka_control("num-slots" = 0, I = 100, S = 2016, P = 50, W = wlist) bagging <- Bagging(fml, data = train, control = bag_control) roc(as.factor(test$DEFAULT), predict(bagging, newdata = test, type = "probability")[, 2]) # Area under the curve: 0.6778

From examples demonstrated today and yesterday, an important lesson to learn is that ensemble methods are powerful machine learning tools only when they are used appropriately. Empirically speaking, while the boosting works well to improve the performance of a under-fitted base model such as the decision stump, the bagging might be able to perform better in the case of an over-fitted base model with high variance and low bias.

## The Power of Decision Stumps

A decision stump is the weak classification model with the simple tree structure consisting of one split, which can also be considered a one-level decision tree. Due to its simplicity, the stump often demonstrates a low predictive performance. As shown in the example below, the AUC measure of a stump is even lower than the one of a single attribute in a separate testing dataset.

pkgs <- c('pROC', 'RWeka') lapply(pkgs, require, character.only = T) df1 <- read.csv("credit_count.txt") df2 <- df1[df1$CARDHLDR == 1, ] set.seed(2016) n <- nrow(df2) sample <- sample(seq(n), size = n / 2, replace = FALSE) train <- df2[sample, ] test <- df2[-sample, ] x <- paste("AGE + ACADMOS + ADEPCNT + MAJORDRG + MINORDRG + OWNRENT + INCOME + SELFEMPL + INCPER + EXP_INC") fml <- as.formula(paste("as.factor(DEFAULT) ~ ", x)) ### IDENTIFY THE MOST PREDICTIVE ATTRIBUTE ### imp <- InfoGainAttributeEval(fml, data = train) imp_x <- test[, names(imp[imp == max(imp)])] roc(as.factor(test$DEFAULT), imp_x) # Area under the curve: 0.6243 ### CONSTRUCT A WEAK CLASSIFIER OF DECISION STUMP ### stump <- DecisionStump(fml, data = train) print(stump) roc(as.factor(test$DEFAULT), predict(stump, newdata = test, type = "probability")[, 2]) # Area under the curve: 0.5953

Albeit weak by itself, the decision stump can be used as a base model in many machine learning ensemble methods, such as bagging and boosting. For instance, the bagging classifier with 1,000 stumps combined outperforms the single stump by ~7% in terms of AUC (0.6346 vs. 0.5953). Moreover, AdaBoost with stumps can further improve the predictive performance over the single stump by ~11% (0.6585 vs. 0.5953) and also over the logistic regression benchmark by ~2% (0.6585 vs. 0.6473).

### BUILD A BAGGING CLASSIFIER WITH 1,000 STUMPS IN PARALLEL ### bagging <- Bagging(fml, data = train, control = Weka_control("num-slots" = 0, I = 1000, W = "DecisionStump", S = 2016, P = 50)) roc(as.factor(test$DEFAULT), predict(bagging, newdata = test, type = "probability")[, 2]) # Area under the curve: 0.6346 ### BUILD A BOOSTING CLASSIFIER WITH STUMPS ### boosting <- AdaBoostM1(fml, data = train, control = Weka_control(I = 100, W = "DecisionStump", S = 2016)) roc(as.factor(test$DEFAULT), predict(boosting, newdata = test, type = "probability")[, 2]) # Area under the curve: 0.6585 ### DEVELOP A LOGIT MODEL FOR THE BENCHMARK ### logit <- Logistic(fml, data = train) roc(as.factor(test$DEFAULT), predict(logit, newdata = test, type = "probability")[, 2]) # Area under the curve: 0.6473

## Parallelize Map()

Map() is a convenient routine in Python to apply a function to all items from one or more lists, as shown below. This specific nature also makes map() a perfect candidate for the parallelism.

In [1]: a = (1, 2, 3) In [2]: b = (10, 20, 30) In [3]: def func(a, b): ...: print "a -->", a, "b -->", b ...: In [4]: ### SERIAL CALL ### In [5]: map(func, a, b) a --> 1 b --> 10 a --> 2 b --> 20 a --> 3 b --> 30

Pool.map() function in Multiprocessing Package is the parallel implementation of map(). However, a drawback is that Pool.map() doesn’t support more than one arguments in the function call. Therefore, in case of a functional call with multiple arguments, a wrapper function is necessary to make it working, which however should be defined before importing Multiprocessing package.

In [6]: ### PARALLEL CALL ### In [7]: ### SINCE POOL.MAP() DOESN'T TAKE MULTIPLE ARGUMENTS, A WRAPPER IS NEEDED In [8]: def f2(ab): ...: a, b = ab ...: return func(a, b) ...: In [9]: from multiprocessing import Pool, cpu_count In [10]: pool = Pool(processes = cpu_count()) In [11]: ### PARALLEL MAP() ON ALL CPUS In [12]: pool.map(f2, zip(a, b)) a --> 1 b --> 10 a --> 2 b --> 20 a --> 3 b --> 30

In addition, Pool.apply() function, with some tweaks, can also be employed to mimic the parallel version of map(). The advantage of this approach is that, different from Pool.map(), Pool.apply() is able to handle multiple arguments by using the list comprehension.

In [13]: ### POOL.APPLY() CAN ALSO MIMIC MAP() In [14]: [pool.apply(func, args = (i, j)) for i, j in zip(a, b)] a --> 1 b --> 10 a --> 2 b --> 20 a --> 3 b --> 30

Alternatively, starmap() function in the parmap package (https://github.com/zeehio/parmap), which is specifically designed to overcome limitations in Pool.map(), provides a more friendly and elegant interface to implement the parallelized map() with multiple arguments at the cost of a slight computing overhead.

In [15]: ### ALTERNATIVELY, PARMAP PACKAGE IS USED In [16]: from parmap import starmap In [17]: starmap(func, zip(a, b), pool = Pool(processes = cpu_count())) a --> 1 b --> 10 a --> 2 b --> 20 a --> 3 b --> 30

## Parallelism with Joblib Package in Python

In the previous post (https://statcompute.wordpress.com/2015/12/27/import-csv-by-chunk-simultaneously-with-ipython-parallel), we’ve shown how to implement the parallelism with IPython parallel package. However, in that specific case, we were not able to observe the efficiency gain of parallelism.

In the example below, we tested Joblib package to implement the parallelism with Multiprocessing package as the back-end and searched for the optimal free parameter in a GRNN that has been demonstrated in (https://statcompute.wordpress.com/2015/12/09/fitting-generalized-regression-neural-network-with-python). As shown in the comparison below, CPU time of the parallel implementation with Joblib package is significantly shorter than CPU time of the serial implementation with map() function.

In [1]: import pandas as pd In [2]: import numpy as np In [3]: from sklearn import preprocessing, cross_validation In [4]: from neupy.algorithms import GRNN as grnn In [5]: from neupy.functions import mse In [6]: from joblib import Parallel, delayed In [7]: df = pd.read_table("csdata.txt") In [8]: y = df.ix[:, 0] In [9]: x = df.ix[:, 1:df.shape[1]] In [10]: st_x = preprocessing.scale(x) In [11]: x_train, x_test, y_train, y_test = cross_validation.train_test_split(st_x, y, train_size = 0.6, random_state = 2016) In [12]: def try_std(x): ....: nn = grnn(std = x, verbose = False) ....: nn.train(x_train, y_train) ....: y_pred = nn.predict(x_test) ....: print x, "-->", "{:10.8f}".format(mse(y_pred, y_test)) ....: In [13]: ### SERIAL IMPLEMENTATION ### In [14]: %time map(try_std, np.linspace(0.5, 2.0, 16)) 0.5 --> 0.03598864 0.6 --> 0.03387313 0.7 --> 0.03260287 0.8 --> 0.03188978 0.9 --> 0.03151914 1.0 --> 0.03134342 1.1 --> 0.03128110 1.2 --> 0.03129023 1.3 --> 0.03134819 1.4 --> 0.03143958 1.5 --> 0.03155242 1.6 --> 0.03167701 1.7 --> 0.03180485 1.8 --> 0.03192895 1.9 --> 0.03204561 2.0 --> 0.03215511 CPU times: user 7.15 s, sys: 11.8 s, total: 18.9 s Wall time: 5.94 s In [15]: ### PARALLEL IMPLEMENTATION ### In [16]: %time Parallel(n_jobs = 8)(delayed(try_std)(i) for i in np.linspace(0.5, 2.0, 16)) 0.5 --> 0.03598864 0.9 --> 0.03151914 0.6 --> 0.03387313 0.7 --> 0.03260287 1.2 --> 0.03129023 1.1 --> 0.03128110 0.8 --> 0.03188978 1.0 --> 0.03134342 1.3 --> 0.03134819 1.6 --> 0.03167701 1.8 --> 0.03192895 1.5 --> 0.03155242 1.9 --> 0.03204561 1.4 --> 0.03143958 1.7 --> 0.03180485 2.0 --> 0.03215511 CPU times: user 60.9 ms, sys: 270 ms, total: 331 ms Wall time: 2.87 s

## Import CSV by Chunk Simultaneously with IPython Parallel

IPython Parallel is a friendly interface to implement the parallelism in Python. Below is an example showing how to import a csv file into the Pandas DataFrame by chunk simultaneously through the interface of IPython Parallel.

In [1]: ### NEED TO RUN "IPCLUSTER START -N 2 &" FIRST ### In [2]: import pandas as pd In [3]: import ipyparallel as ip In [4]: str = "AutoCollision.csv" In [5]: rc = ip.Client() In [6]: ### showing 2 engines working In [7]: rc.ids Out[7]: [0, 1] In [8]: def para_read(file): ...: dview = rc[:] ...: # PARTITION THE IMPORT BY SCATTER() # ...: dview.scatter("df", pd.read_csv(file)) ...: return pd.concat([i for i in dview["df"]]) ...: In [9]: ### PARALLEL IMPORT ### In [10]: df1 = para_read(str) In [11]: ### SERIAL IMPORT ### In [12]: df2 = pd.read_csv(str) In [13]: df1.equals(df2) Out[13]: True

## Prediction Intervals for Poisson Regression

Different from the confidence interval that is to address the uncertainty related to the conditional mean, the prediction interval is to accommodate the additional uncertainty associated with prediction errors. As a result, the prediction interval is always wider than the confidence interval in a regression model. In the context of risk modeling, the prediction interval is often used to address the potential model risk due to aforementioned uncertainties.

While calculating prediction interval of OLS regression based on the Gaussian distributional assumption is relatively straightforward with the off-shelf solution in R, it could be more complicated in a Generalized Linear Model, e.g. Poisson regression. In this post, I am going to show two empirical methods, one based on bootstrapping and the other based on simulation, calculating the prediction interval of a Poisson regression. Because of the high computing cost, the parallelism with foreach() function will be used to improve the efficiency.

First of all, let’s estimate a Poisson regression with glm() and generate a couple fake new data points to calculate model predictions. Since the toy data is very small with only 32 records with all categorical predictors, I doubled the sample size by rbind() to ensure the appropriate data coverage in the bootstrapping.

pkgs <- c('doParallel', 'foreach') lapply(pkgs, require, character.only = T) registerDoParallel(cores = 4) data(AutoCollision, package = "insuranceData") df <- rbind(AutoCollision, AutoCollision) mdl <- glm(Claim_Count ~ Age + Vehicle_Use, data = df, family = poisson(link = "log")) new_fake <- df[1:5, 1:2]

The first method shown below is based on the bootstrapping with following steps:

1. Bootstrapped the original model development sample by the random sample with replacements;

2. Repeated the above many times, e.g. 1000, to generate different bootstrapped samples;

3. Refitted models with bootstrapped samples;

4. Generated predictions with new data points, e.g. “new_fake”, but with refitted models;

5. Generated random numbers based on Poisson distribution with the mean, e.g. lambda, equal to the predicted values from refitted models

6. Collected all Poisson random numbers from the previous step and calculated the percentiles.

boot_pi <- function(model, pdata, n, p) { odata <- model$data lp <- (1 - p) / 2 up <- 1 - lp set.seed(2016) seeds <- round(runif(n, 1, 1000), 0) boot_y <- foreach(i = 1:n, .combine = rbind) %dopar% { set.seed(seeds[i]) bdata <- odata[sample(seq(nrow(odata)), size = nrow(odata), replace = TRUE), ] bpred <- predict(update(model, data = bdata), type = "response", newdata = pdata) rpois(length(bpred), lambda = bpred) } boot_ci <- t(apply(boot_y, 2, quantile, c(lp, up))) return(data.frame(pred = predict(model, newdata = pdata, type = "response"), lower = boot_ci[, 1], upper = boot_ci[, 2])) } boot_pi(mdl, new_fake, 1000, 0.95) # pred lower upper #1 12.63040 6 21 #2 38.69738 25 55 #3 26.97271 16 39 #4 10.69951 4 18 #5 52.50839 35 70

The second method is based on the simulation and outlined as below:

1. Re-produced the model response variable, e.g. Claim_Count, by simulating Poisson random numbers with lambda equal to predicted values from the original model;

2. Repeated the above simulations many times, e.g. 1000, to generate many response series;

3. Generated 1000 updated model samples by replacing the original response with the new response generated from simulations;

4. Refitted models with these updated samples

5. Generated predictions with new data points, e.g. “new_fake”, but with refitted models;

6. Generated Poisson random numbers with lambda equal to the predicted values from refitted models

7. Collected all Poisson random numbers from the previous step and calculated the percentiles.

sim_pi <- function(model, pdata, n, p) { odata <- model$data yhat <- predict(model, type = "response") lp <- (1 - p) / 2 up <- 1 - lp set.seed(2016) seeds <- round(runif(n, 1, 1000), 0) sim_y <- foreach(i = 1:n, .combine = rbind) %dopar% { set.seed(seeds[i]) sim_y <- rpois(length(yhat), lambda = yhat) sdata <- data.frame(y = sim_y, odata[names(model$x)]) refit <- glm(y ~ ., data = sdata, family = poisson) bpred <- predict(refit, type = "response", newdata = pdata) rpois(length(bpred),lambda = bpred) } sim_ci <- t(apply(sim_y, 2, quantile, c(lp, up))) return(data.frame(pred = predict(model, newdata = pdata, type = "response"), lower = sim_ci[, 1], upper = sim_ci[, 2])) } sim_pi(mdl, new_fake, 1000, 0.95) # pred lower upper #1 12.63040 6 21 #2 38.69738 26 52 #3 26.97271 17 39 #4 10.69951 4 18 #5 52.50839 38 68

As demonstrated above, after a large number of replications, outcomes from both methods are highly consistent.

## Calculate Leave-One-Out Prediction for GLM

In the model development, the “leave-one-out” prediction is a way of cross-validation, calculated as below:

1. First of all, after a model is developed, each observation used in the model development is removed in turn and then the model is refitted with the remaining observations

2. The out-of-sample prediction for the refitted model is calculated with the removed observation one by one to assemble the LOO, e.g. leave-one-out predicted values for the whole model development sample.

The loo_predict() function below is a general routine to calculate the LOO prediction for any GLM object, which can be further employed to investigate the model stability and predictability.

> pkgs <- c('doParallel', 'foreach') > lapply(pkgs, require, character.only = T) [[1]] [1] TRUE [[2]] [1] TRUE > registerDoParallel(cores = 8) > > data(AutoCollision, package = "insuranceData") > # A GAMMA GLM # > model1 <- glm(Severity ~ Age + Vehicle_Use, data = AutoCollision, family = Gamma(link = "log")) > # A POISSON GLM # > model2 <- glm(Claim_Count ~ Age + Vehicle_Use, data = AutoCollision, family = poisson(link = "log")) > > loo_predict <- function(obj) { + yhat <- foreach(i = 1:nrow(obj$data), .combine = rbind) %dopar% { + predict(update(obj, data = obj$data[-i, ]), obj$data[i,], type = "response") + } + return(data.frame(result = yhat[, 1], row.names = NULL)) + } > # TEST CASE 1 > test1 <- loo_predict(model1) > test1$result [1] 303.7393 328.7292 422.6610 375.5023 240.9785 227.6365 288.4404 446.5589 [9] 213.9368 244.7808 278.7786 443.2256 213.9262 243.2495 266.9166 409.2565 [17] 175.0334 172.0683 197.2911 326.5685 187.2529 215.9931 249.9765 349.3873 [25] 190.1174 218.6321 243.7073 359.9631 192.3655 215.5986 233.1570 348.2781 > # TEST CASE 2 > test2 <- loo_predict(model2) > test2$result [1] 11.15897 37.67273 28.76127 11.54825 50.26364 152.35489 122.23782 [8] 44.57048 129.58158 465.84173 260.48114 107.23832 167.40672 510.41127 [15] 316.50765 121.75804 172.56928 546.25390 341.03826 134.04303 359.30141 [22] 977.29107 641.69934 251.32547 248.79229 684.86851 574.13994 238.42209 [29] 148.77733 504.12221 422.75047 167.61203