## Archive for **December 2015**

## Parallelize Map()

map() is a convenient routine in Python for applying a function to every item of one or more lists, as shown below. This nature also makes map() a natural candidate for parallelism.

```python
In [1]: a = (1, 2, 3)

In [2]: b = (10, 20, 30)

In [3]: def func(a, b):
   ...:     print "a -->", a, "b -->", b
   ...:

In [4]: ### SERIAL CALL ###

In [5]: map(func, a, b)
a --> 1 b --> 10
a --> 2 b --> 20
a --> 3 b --> 30
```

The Pool.map() function in the multiprocessing package is the parallel implementation of map(). However, a drawback is that Pool.map() doesn't support functions with more than one argument. Therefore, for a function call with multiple arguments, a wrapper function is necessary to make it work; the wrapper should be defined at the top level of the module so that it can be pickled and shipped to the worker processes.

```python
In [6]: ### PARALLEL CALL ###

In [7]: ### SINCE POOL.MAP() DOESN'T TAKE MULTIPLE ARGUMENTS, A WRAPPER IS NEEDED

In [8]: def f2(ab):
   ...:     a, b = ab
   ...:     return func(a, b)
   ...:

In [9]: from multiprocessing import Pool, cpu_count

In [10]: pool = Pool(processes = cpu_count())

In [11]: ### PARALLEL MAP() ON ALL CPUS

In [12]: pool.map(f2, zip(a, b))
a --> 1 b --> 10
a --> 2 b --> 20
a --> 3 b --> 30
```

In addition, the Pool.apply() function can also be employed to mimic map(). The advantage of this approach is that, unlike Pool.map(), Pool.apply() is able to handle multiple arguments through a list comprehension. Note, however, that Pool.apply() blocks until each call returns, so the calls below are dispatched to the pool one at a time rather than run concurrently.

```python
In [13]: ### POOL.APPLY() CAN ALSO MIMIC MAP()

In [14]: [pool.apply(func, args = (i, j)) for i, j in zip(a, b)]
a --> 1 b --> 10
a --> 2 b --> 20
a --> 3 b --> 30
```

Alternatively, the starmap() function in the parmap package (https://github.com/zeehio/parmap), which is specifically designed to overcome the limitations of Pool.map(), provides a friendlier and more elegant interface for a parallelized map() with multiple arguments, at the cost of a slight computing overhead.

```python
In [15]: ### ALTERNATIVELY, PARMAP PACKAGE IS USED

In [16]: from parmap import starmap

In [17]: starmap(func, zip(a, b), pool = Pool(processes = cpu_count()))
a --> 1 b --> 10
a --> 2 b --> 20
a --> 3 b --> 30
```

## Parallelism with Joblib Package in Python

In the previous post (https://statcompute.wordpress.com/2015/12/27/import-csv-by-chunk-simultaneously-with-ipython-parallel), we showed how to implement parallelism with the IPython Parallel package. However, in that specific case, we were not able to observe any efficiency gain from the parallelism.

In the example below, we tested the joblib package, with the multiprocessing package as the back-end, by searching for the optimal free parameter in the GRNN demonstrated in a previous post (https://statcompute.wordpress.com/2015/12/09/fitting-generalized-regression-neural-network-with-python). As shown in the comparison below, the wall time of the parallel implementation with joblib is significantly shorter than that of the serial implementation with the map() function.

```python
In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: from sklearn import preprocessing, cross_validation

In [4]: from neupy.algorithms import GRNN as grnn

In [5]: from neupy.functions import mse

In [6]: from joblib import Parallel, delayed

In [7]: df = pd.read_table("csdata.txt")

In [8]: y = df.ix[:, 0]

In [9]: x = df.ix[:, 1:df.shape[1]]

In [10]: st_x = preprocessing.scale(x)

In [11]: x_train, x_test, y_train, y_test = cross_validation.train_test_split(st_x, y, train_size = 0.6, random_state = 2016)

In [12]: def try_std(x):
   ....:     nn = grnn(std = x, verbose = False)
   ....:     nn.train(x_train, y_train)
   ....:     y_pred = nn.predict(x_test)
   ....:     print x, "-->", "{:10.8f}".format(mse(y_pred, y_test))
   ....:

In [13]: ### SERIAL IMPLEMENTATION ###

In [14]: %time map(try_std, np.linspace(0.5, 2.0, 16))
0.5 --> 0.03598864
0.6 --> 0.03387313
0.7 --> 0.03260287
0.8 --> 0.03188978
0.9 --> 0.03151914
1.0 --> 0.03134342
1.1 --> 0.03128110
1.2 --> 0.03129023
1.3 --> 0.03134819
1.4 --> 0.03143958
1.5 --> 0.03155242
1.6 --> 0.03167701
1.7 --> 0.03180485
1.8 --> 0.03192895
1.9 --> 0.03204561
2.0 --> 0.03215511
CPU times: user 7.15 s, sys: 11.8 s, total: 18.9 s
Wall time: 5.94 s

In [15]: ### PARALLEL IMPLEMENTATION ###

In [16]: %time Parallel(n_jobs = 8)(delayed(try_std)(i) for i in np.linspace(0.5, 2.0, 16))
0.5 --> 0.03598864
0.9 --> 0.03151914
0.6 --> 0.03387313
0.7 --> 0.03260287
1.2 --> 0.03129023
1.1 --> 0.03128110
0.8 --> 0.03188978
1.0 --> 0.03134342
1.3 --> 0.03134819
1.6 --> 0.03167701
1.8 --> 0.03192895
1.5 --> 0.03155242
1.9 --> 0.03204561
1.4 --> 0.03143958
1.7 --> 0.03180485
2.0 --> 0.03215511
CPU times: user 60.9 ms, sys: 270 ms, total: 331 ms
Wall time: 2.87 s
```

## Import CSV by Chunk Simultaneously with IPython Parallel

IPython Parallel provides a friendly interface for parallelism in Python. Below is an example showing how to import a csv file into a pandas DataFrame by chunk across multiple engines through the IPython Parallel interface.

```python
In [1]: ### NEED TO RUN "IPCLUSTER START -N 2 &" FIRST ###

In [2]: import pandas as pd

In [3]: import ipyparallel as ip

In [4]: str = "AutoCollision.csv"

In [5]: rc = ip.Client()

In [6]: ### showing 2 engines working

In [7]: rc.ids
Out[7]: [0, 1]

In [8]: def para_read(file):
   ...:     dview = rc[:]
   ...:     # PARTITION THE IMPORT BY SCATTER() #
   ...:     dview.scatter("df", pd.read_csv(file))
   ...:     return pd.concat([i for i in dview["df"]])
   ...:

In [9]: ### PARALLEL IMPORT ###

In [10]: df1 = para_read(str)

In [11]: ### SERIAL IMPORT ###

In [12]: df2 = pd.read_csv(str)

In [13]: df1.equals(df2)
Out[13]: True
```
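As a side note, para_read() above reads the entire file in the client before scattering it across the engines, so the file I/O itself remains serial. For reading a file chunk by chunk, pandas also offers the chunksize option of read_csv(); a minimal sketch, with an in-memory buffer standing in for the csv file:

```python
import io

import pandas as pd

# an in-memory buffer standing in for a csv file on disk
buf = io.StringIO("a,b\n" + "\n".join("%d,%d" % (i, 2 * i) for i in range(10)))

reader = pd.read_csv(buf, chunksize=4)     # yields DataFrames of at most 4 rows
df = pd.concat(reader, ignore_index=True)  # reassemble the full frame
print(df.shape)  # (10, 2)
```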

## Prediction Intervals for Poisson Regression

Unlike the confidence interval, which addresses the uncertainty of the conditional mean, the prediction interval accommodates the additional uncertainty associated with prediction errors. As a result, the prediction interval is always wider than the confidence interval in a regression model. In the context of risk modeling, the prediction interval is often used to address the potential model risk due to the aforementioned uncertainties.

While calculating the prediction interval of an OLS regression under the Gaussian distributional assumption is relatively straightforward with off-the-shelf solutions in R, it can be more complicated in a Generalized Linear Model, e.g. Poisson regression. In this post, I am going to show two empirical methods for calculating the prediction interval of a Poisson regression, one based on bootstrapping and the other based on simulation. Because of the high computing cost, parallelism with the foreach() function will be used to improve efficiency.

First of all, let's estimate a Poisson regression with glm() and generate a couple of fake new data points for model predictions. Since the toy data is very small, with only 32 records and all categorical predictors, I doubled the sample size with rbind() to ensure appropriate data coverage in the bootstrapping.

```r
pkgs <- c('doParallel', 'foreach')
lapply(pkgs, require, character.only = T)
registerDoParallel(cores = 4)

data(AutoCollision, package = "insuranceData")
df <- rbind(AutoCollision, AutoCollision)
mdl <- glm(Claim_Count ~ Age + Vehicle_Use, data = df, family = poisson(link = "log"))
new_fake <- df[1:5, 1:2]
```

The first method shown below is based on bootstrapping, with the following steps:

1. Draw a bootstrapped sample from the original model development data by random sampling with replacement;

2. Repeat the above many times, e.g. 1,000, to generate many bootstrapped samples;

3. Refit the model with each bootstrapped sample;

4. Generate predictions for the new data points, e.g. "new_fake", with each refitted model;

5. Generate Poisson random numbers with the mean, e.g. lambda, equal to the predicted values from the refitted models;

6. Collect all Poisson random numbers from the previous step and calculate the percentiles.

```r
boot_pi <- function(model, pdata, n, p) {
  odata <- model$data
  lp <- (1 - p) / 2
  up <- 1 - lp
  set.seed(2016)
  seeds <- round(runif(n, 1, 1000), 0)
  boot_y <- foreach(i = 1:n, .combine = rbind) %dopar% {
    set.seed(seeds[i])
    bdata <- odata[sample(seq(nrow(odata)), size = nrow(odata), replace = TRUE), ]
    bpred <- predict(update(model, data = bdata), type = "response", newdata = pdata)
    rpois(length(bpred), lambda = bpred)
  }
  boot_ci <- t(apply(boot_y, 2, quantile, c(lp, up)))
  return(data.frame(pred = predict(model, newdata = pdata, type = "response"),
                    lower = boot_ci[, 1], upper = boot_ci[, 2]))
}

boot_pi(mdl, new_fake, 1000, 0.95)
#       pred lower upper
# 1 12.63040     6    21
# 2 38.69738    25    55
# 3 26.97271    16    39
# 4 10.69951     4    18
# 5 52.50839    35    70
```
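For readers more comfortable with Python, the six steps above can also be sketched with numpy alone; in this simplified sketch the sample mean plays the role of an intercept-only Poisson model, so the refit in step 3 is simply recomputing the mean:

```python
import numpy as np

rng = np.random.default_rng(2016)
data = rng.poisson(lam=20, size=200)  # a toy model development sample

n, p = 1000, 0.95
lp, up = (1 - p) / 2, 1 - (1 - p) / 2
draws = np.empty(n)
for i in range(n):
    # steps 1-2: resample with replacement
    boot = rng.choice(data, size=data.size, replace=True)
    # step 3: "refit" the intercept-only model
    lam_hat = boot.mean()
    # steps 4-5: draw a Poisson random number at the refitted mean
    draws[i] = rng.poisson(lam=lam_hat)
# step 6: percentiles of the collected draws
pi = np.quantile(draws, [lp, up])
print(pi)
```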

The second method is based on simulation and outlined below:

1. Reproduce the model response variable, e.g. Claim_Count, by simulating Poisson random numbers with lambda equal to the predicted values from the original model;

2. Repeat the simulation many times, e.g. 1,000, to generate many response series;

3. Generate 1,000 updated model samples by replacing the original response with the new responses from the simulations;

4. Refit the model with each updated sample;

5. Generate predictions for the new data points, e.g. "new_fake", with each refitted model;

6. Generate Poisson random numbers with lambda equal to the predicted values from the refitted models;

7. Collect all Poisson random numbers from the previous step and calculate the percentiles.

```r
sim_pi <- function(model, pdata, n, p) {
  odata <- model$data
  yhat <- predict(model, type = "response")
  lp <- (1 - p) / 2
  up <- 1 - lp
  set.seed(2016)
  seeds <- round(runif(n, 1, 1000), 0)
  sim_y <- foreach(i = 1:n, .combine = rbind) %dopar% {
    set.seed(seeds[i])
    sim_y <- rpois(length(yhat), lambda = yhat)
    # KEEP THE PREDICTORS NAMED IN THE MODEL FORMULA
    sdata <- data.frame(y = sim_y, odata[all.vars(formula(model))[-1]])
    refit <- glm(y ~ ., data = sdata, family = poisson)
    bpred <- predict(refit, type = "response", newdata = pdata)
    rpois(length(bpred), lambda = bpred)
  }
  sim_ci <- t(apply(sim_y, 2, quantile, c(lp, up)))
  return(data.frame(pred = predict(model, newdata = pdata, type = "response"),
                    lower = sim_ci[, 1], upper = sim_ci[, 2]))
}

sim_pi(mdl, new_fake, 1000, 0.95)
#       pred lower upper
# 1 12.63040     6    21
# 2 38.69738    26    52
# 3 26.97271    17    39
# 4 10.69951     4    18
# 5 52.50839    38    68
```

As demonstrated above, after a large number of replications, outcomes from both methods are highly consistent.

## Calculate Leave-One-Out Prediction for GLM

In model development, the "leave-one-out" prediction is a form of cross-validation, calculated as follows:

1. After a model is developed, each observation used in the model development is removed in turn and the model is refitted with the remaining observations;

2. The out-of-sample prediction for each removed observation is then calculated with the corresponding refitted model, and these predictions are assembled into the LOO, i.e. leave-one-out, predicted values for the whole model development sample.

The loo_predict() function below is a general routine to calculate the LOO prediction for any GLM object and can be further employed to investigate model stability and predictability.

```r
> pkgs <- c('doParallel', 'foreach')
> lapply(pkgs, require, character.only = T)
[[1]]
[1] TRUE

[[2]]
[1] TRUE

> registerDoParallel(cores = 8)
>
> data(AutoCollision, package = "insuranceData")
> # A GAMMA GLM #
> model1 <- glm(Severity ~ Age + Vehicle_Use, data = AutoCollision, family = Gamma(link = "log"))
> # A POISSON GLM #
> model2 <- glm(Claim_Count ~ Age + Vehicle_Use, data = AutoCollision, family = poisson(link = "log"))
>
> loo_predict <- function(obj) {
+   yhat <- foreach(i = 1:nrow(obj$data), .combine = rbind) %dopar% {
+     predict(update(obj, data = obj$data[-i, ]), obj$data[i, ], type = "response")
+   }
+   return(data.frame(result = yhat[, 1], row.names = NULL))
+ }
> # TEST CASE 1
> test1 <- loo_predict(model1)
> test1$result
 [1] 303.7393 328.7292 422.6610 375.5023 240.9785 227.6365 288.4404 446.5589
 [9] 213.9368 244.7808 278.7786 443.2256 213.9262 243.2495 266.9166 409.2565
[17] 175.0334 172.0683 197.2911 326.5685 187.2529 215.9931 249.9765 349.3873
[25] 190.1174 218.6321 243.7073 359.9631 192.3655 215.5986 233.1570 348.2781
> # TEST CASE 2
> test2 <- loo_predict(model2)
> test2$result
 [1]  11.15897  37.67273  28.76127  11.54825  50.26364 152.35489 122.23782
 [8]  44.57048 129.58158 465.84173 260.48114 107.23832 167.40672 510.41127
[15] 316.50765 121.75804 172.56928 546.25390 341.03826 134.04303 359.30141
[22] 977.29107 641.69934 251.32547 248.79229 684.86851 574.13994 238.42209
[29] 148.77733 504.12221 422.75047 167.61203
```
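The same leave-one-out loop can be sketched in Python as well; below is a minimal version on simulated data, with scikit-learn's PoissonRegressor (assumed available in sklearn >= 0.23) standing in for glm():

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(2016)
X = rng.normal(size=(40, 2))
y = rng.poisson(lam=np.exp(0.5 + X @ np.array([0.3, -0.2])))

def loo_predict(X, y):
    """Refit on all-but-one observation and predict the held-out one."""
    yhat = np.empty(len(y))
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        fit = PoissonRegressor(alpha=0).fit(X[mask], y[mask])
        yhat[i] = fit.predict(X[i:i + 1])[0]
    return yhat

loo = loo_predict(X, y)
print(loo.shape)  # (40,)
```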

## Multivariate Adaptive Regression Splines with Python

```python
In [1]: import statsmodels.datasets as datasets

In [2]: import sklearn.metrics as metrics

In [3]: from numpy import log

In [4]: from pyearth import Earth as earth

In [5]: boston = datasets.get_rdataset("Boston", "MASS").data

In [6]: x = boston.ix[:, 0:boston.shape[1] - 1]

In [7]: xlabel = list(x.columns)

In [8]: y = boston.ix[:, boston.shape[1] - 1]

In [9]: model = earth(enable_pruning = True, penalty = 3, minspan_alpha = 0.05, endspan_alpha = 0.05)

In [10]: model.fit(x, log(y), xlabels = xlabel)
Out[10]:
Earth(allow_linear=None, check_every=None, enable_pruning=True, endspan=None,
      endspan_alpha=0.05, max_degree=None, max_terms=None,
      min_search_points=None, minspan=None, minspan_alpha=0.05, penalty=3,
      smooth=None, thresh=None)

In [11]: print model.summary()
Earth Model
--------------------------------------
Basis Function   Pruned  Coefficient
--------------------------------------
(Intercept)      No      2.93569
h(lstat-5.9)     No      -0.0249319
h(5.9-lstat)     No      0.067697
h(rm-6.402)      No      0.230063
h(6.402-rm)      Yes     None
crim             Yes     None
h(dis-1.5004)    No      -0.0369272
h(1.5004-dis)    No      1.70517
h(ptratio-18.6)  No      -0.0393493
h(18.6-ptratio)  No      0.0278253
nox              No      -0.767146
h(indus-25.65)   No      -0.147471
h(25.65-indus)   Yes     None
h(black-395.24)  Yes     None
h(395.24-black)  Yes     None
h(crim-24.8017)  Yes     None
h(24.8017-crim)  No      0.0292657
rad              No      0.00807783
h(rm-7.853)      Yes     None
h(7.853-rm)      Yes     None
chas             Yes     None
--------------------------------------
MSE: 0.0265, GCV: 0.0320, RSQ: 0.8409, GRSQ: 0.8090

In [12]: metrics.r2_score(log(y), model.predict(x))
Out[12]: 0.840861083407211
```

## Download Federal Reserve Economic Data (FRED) with Python

In the operational loss calculation, it is important to use the CPI (Consumer Price Index) to adjust historical losses. Below is an example showing how to download the CPI data directly from the Federal Reserve Bank of St. Louis and then calculate monthly and quarterly CPI adjustment factors with Python.

```python
In [1]: import pandas_datareader.data as web

In [2]: import pandas as pd

In [3]: import numpy as np

In [4]: import datetime as dt

In [5]: # SET START AND END DATES OF THE SERIES

In [6]: sdt = dt.datetime(2000, 1, 1)

In [7]: edt = dt.datetime(2015, 9, 1)

In [8]: cpi = web.DataReader("CPIAUCNS", "fred", sdt, edt)

In [9]: cpi.head()
Out[9]:
            CPIAUCNS
DATE
2000-01-01     168.8
2000-02-01     169.8
2000-03-01     171.2
2000-04-01     171.3
2000-05-01     171.5

In [10]: df1 = pd.DataFrame({'month': [dt.datetime.strftime(i, "%Y-%m") for i in cpi.index]})

In [11]: df1['qtr'] = [str(x.year) + "-Q" + str(x.quarter) for x in cpi.index]

In [12]: df1['m_cpi'] = cpi.values

In [13]: df1.index = cpi.index

In [14]: grp = df1.groupby('qtr', as_index = False)

In [15]: df2 = grp['m_cpi'].agg({'q_cpi': np.mean})

In [16]: df3 = pd.merge(df1, df2, how = 'inner', left_on = 'qtr', right_on = 'qtr')

In [17]: maxm_cpi = np.array(df3.m_cpi)[-1]

In [18]: maxq_cpi = np.array(df3.q_cpi)[-1]

In [19]: df3['m_factor'] = maxm_cpi / df3.m_cpi

In [20]: df3['q_factor'] = maxq_cpi / df3.q_cpi

In [21]: df3.index = cpi.index

In [22]: final = df3.sort_index(ascending = False)

In [23]: final.head(12)
Out[23]:
              month      qtr    m_cpi       q_cpi  m_factor  q_factor
DATE
2015-09-01  2015-09  2015-Q3  237.945  238.305000  1.000000  1.000000
2015-08-01  2015-08  2015-Q3  238.316  238.305000  0.998443  1.000000
2015-07-01  2015-07  2015-Q3  238.654  238.305000  0.997029  1.000000
2015-06-01  2015-06  2015-Q2  238.638  237.680667  0.997096  1.002627
2015-05-01  2015-05  2015-Q2  237.805  237.680667  1.000589  1.002627
2015-04-01  2015-04  2015-Q2  236.599  237.680667  1.005689  1.002627
2015-03-01  2015-03  2015-Q1  236.119  234.849333  1.007733  1.014714
2015-02-01  2015-02  2015-Q1  234.722  234.849333  1.013731  1.014714
2015-01-01  2015-01  2015-Q1  233.707  234.849333  1.018134  1.014714
2014-12-01  2014-12  2014-Q4  234.812  236.132000  1.013343  1.009202
2014-11-01  2014-11  2014-Q4  236.151  236.132000  1.007597  1.009202
2014-10-01  2014-10  2014-Q4  237.433  236.132000  1.002156  1.009202
```