Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

Archive for December 2015

Parallelize Map()

Map() is a convenient routine in Python to apply a function to all items from one or more lists, as shown below. This specific nature also makes map() a perfect candidate for the parallelism.

In [1]: a = (1, 2, 3)

In [2]: b = (10, 20, 30)

In [3]: def func(a, b):
...:      print "a -->", a, "b -->", b
...:

In [4]: ### SERIAL CALL ###

In [5]: map(func, a, b)
a --> 1 b --> 10
a --> 2 b --> 20
a --> 3 b --> 30

Pool.map() function in Multiprocessing Package is the parallel implementation of map(). However, a drawback is that Pool.map() doesn’t support more than one arguments in the function call. Therefore, in case of a functional call with multiple arguments, a wrapper function is necessary to make it working, which however should be defined before importing Multiprocessing package.

In [6]: ### PARALLEL CALL ###

In [7]: ### SINCE POOL.MAP() DOESN'T TAKE MULTIPLE ARGUMENTS, A WRAPPER IS NEEDED

In [8]: def f2(ab):
...:      a, b = ab
...:      return func(a, b)
...:

In [9]: from multiprocessing import Pool, cpu_count

In [10]: pool = Pool(processes = cpu_count())

In [11]: ### PARALLEL MAP() ON ALL CPUS

In [12]: pool.map(f2, zip(a, b))
a --> 1 b --> 10
a --> 2 b --> 20
a --> 3 b --> 30

In addition, Pool.apply() function, with some tweaks, can also be employed to mimic the parallel version of map(). The advantage of this approach is that, different from Pool.map(), Pool.apply() is able to handle multiple arguments by using the list comprehension.

In [13]: ### POOL.APPLY() CAN ALSO MIMIC MAP()

In [14]: [pool.apply(func, args = (i, j)) for i, j in zip(a, b)]
a --> 1 b --> 10
a --> 2 b --> 20
a --> 3 b --> 30

Alternatively, starmap() function in the parmap package (https://github.com/zeehio/parmap), which is specifically designed to overcome limitations in Pool.map(), provides a more friendly and elegant interface to implement the parallelized map() with multiple arguments at the cost of a slight computing overhead.

In [15]: ### ALTERNATIVELY, PARMAP PACKAGE IS USED

In [16]: from parmap import starmap

In [17]: starmap(func, zip(a, b), pool = Pool(processes = cpu_count()))
a --> 1 b --> 10
a --> 2 b --> 20
a --> 3 b --> 30
Advertisements

Written by statcompute

December 31, 2015 at 4:00 pm

Posted in Big Data, Parallelism, PYTHON

Tagged with ,

Parallelism with Joblib Package in Python

In the previous post (https://statcompute.wordpress.com/2015/12/27/import-csv-by-chunk-simultaneously-with-ipython-parallel), we’ve shown how to implement the parallelism with IPython parallel package. However, in that specific case, we were not able to observe the efficiency gain of parallelism.

In the example below, we tested Joblib package to implement the parallelism with Multiprocessing package as the back-end and searched for the optimal free parameter in a GRNN that has been demonstrated in (https://statcompute.wordpress.com/2015/12/09/fitting-generalized-regression-neural-network-with-python). As shown in the comparison below, CPU time of the parallel implementation with Joblib package is significantly shorter than CPU time of the serial implementation with map() function.

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: from sklearn import preprocessing, cross_validation

In [4]: from neupy.algorithms import GRNN as grnn

In [5]: from neupy.functions import mse

In [6]: from joblib import Parallel, delayed

In [7]: df = pd.read_table("csdata.txt")

In [8]: y = df.ix[:, 0]

In [9]: x = df.ix[:, 1:df.shape[1]]

In [10]: st_x = preprocessing.scale(x)

In [11]: x_train, x_test, y_train, y_test = cross_validation.train_test_split(st_x, y, train_size = 0.6, random_state = 2016)

In [12]: def try_std(x):
   ....:       nn = grnn(std = x, verbose = False)
   ....:       nn.train(x_train, y_train)
   ....:       y_pred = nn.predict(x_test)
   ....:       print x, "-->", "{:10.8f}".format(mse(y_pred, y_test))
   ....:     

In [13]: ### SERIAL IMPLEMENTATION ###

In [14]: %time map(try_std, np.linspace(0.5, 2.0, 16))
0.5 --> 0.03598864
0.6 --> 0.03387313
0.7 --> 0.03260287
0.8 --> 0.03188978
0.9 --> 0.03151914
1.0 --> 0.03134342
1.1 --> 0.03128110
1.2 --> 0.03129023
1.3 --> 0.03134819
1.4 --> 0.03143958
1.5 --> 0.03155242
1.6 --> 0.03167701
1.7 --> 0.03180485
1.8 --> 0.03192895
1.9 --> 0.03204561
2.0 --> 0.03215511
CPU times: user 7.15 s, sys: 11.8 s, total: 18.9 s
Wall time: 5.94 s

In [15]: ### PARALLEL IMPLEMENTATION ###

In [16]: %time Parallel(n_jobs = 8)(delayed(try_std)(i) for i in np.linspace(0.5, 2.0, 16))
0.5 --> 0.03598864
0.9 --> 0.03151914
0.6 --> 0.03387313
0.7 --> 0.03260287
1.2 --> 0.03129023
1.1 --> 0.03128110
0.8 --> 0.03188978
1.0 --> 0.03134342
1.3 --> 0.03134819
1.6 --> 0.03167701
1.8 --> 0.03192895
1.5 --> 0.03155242
1.9 --> 0.03204561
1.4 --> 0.03143958
1.7 --> 0.03180485
2.0 --> 0.03215511
CPU times: user 60.9 ms, sys: 270 ms, total: 331 ms
Wall time: 2.87 s

Written by statcompute

December 30, 2015 at 1:45 am

Import CSV by Chunk Simultaneously with IPython Parallel

IPython Parallel is a friendly interface to implement the parallelism in Python. Below is an example showing how to import a csv file into the Pandas DataFrame by chunk simultaneously through the interface of IPython Parallel.

In [1]: ### NEED TO RUN "IPCLUSTER START -N 2 &" FIRST ###

In [2]: import pandas as pd

In [3]: import ipyparallel as ip

In [4]: str = "AutoCollision.csv"

In [5]: rc = ip.Client()

In [6]: ### showing 2 engines working

In [7]: rc.ids
Out[7]: [0, 1]

In [8]: def para_read(file):
   ...:       dview = rc[:]
   ...:       # PARTITION THE IMPORT BY SCATTER() #
   ...:       dview.scatter("df", pd.read_csv(file))
   ...:       return pd.concat([i for i in dview["df"]])
   ...:

In [9]: ### PARALLEL IMPORT ###

In [10]: df1 = para_read(str)

In [11]: ### SERIAL IMPORT ###

In [12]: df2 = pd.read_csv(str)

In [13]: df1.equals(df2)
Out[13]: True

Written by statcompute

December 27, 2015 at 4:03 pm

Posted in Big Data, pandas, PYTHON

Tagged with ,

Prediction Intervals for Poisson Regression

Different from the confidence interval that is to address the uncertainty related to the conditional mean, the prediction interval is to accommodate the additional uncertainty associated with prediction errors. As a result, the prediction interval is always wider than the confidence interval in a regression model. In the context of risk modeling, the prediction interval is often used to address the potential model risk due to aforementioned uncertainties.

While calculating prediction interval of OLS regression based on the Gaussian distributional assumption is relatively straightforward with the off-shelf solution in R, it could be more complicated in a Generalized Linear Model, e.g. Poisson regression. In this post, I am going to show two empirical methods, one based on bootstrapping and the other based on simulation, calculating the prediction interval of a Poisson regression. Because of the high computing cost, the parallelism with foreach() function will be used to improve the efficiency.

First of all, let’s estimate a Poisson regression with glm() and generate a couple fake new data points to calculate model predictions. Since the toy data is very small with only 32 records with all categorical predictors, I doubled the sample size by rbind() to ensure the appropriate data coverage in the bootstrapping.

pkgs <- c('doParallel', 'foreach')
lapply(pkgs, require, character.only = T)
registerDoParallel(cores = 4)

data(AutoCollision, package = "insuranceData")
df <- rbind(AutoCollision, AutoCollision)
mdl <- glm(Claim_Count ~ Age + Vehicle_Use, data = df, family = poisson(link = "log"))
new_fake <- df[1:5, 1:2]

The first method shown below is based on the bootstrapping with following steps:

1. Bootstrapped the original model development sample by the random sample with replacements;

2. Repeated the above many times, e.g. 1000, to generate different bootstrapped samples;

3. Refitted models with bootstrapped samples;

4. Generated predictions with new data points, e.g. “new_fake”, but with refitted models;

5. Generated random numbers based on Poisson distribution with the mean, e.g. lambda, equal to the predicted values from refitted models

6. Collected all Poisson random numbers from the previous step and calculated the percentiles.

boot_pi <- function(model, pdata, n, p) {
  odata <- model$data
  lp <- (1 - p) / 2
  up <- 1 - lp
  set.seed(2016)
  seeds <- round(runif(n, 1, 1000), 0)
  boot_y <- foreach(i = 1:n, .combine = rbind) %dopar% {
    set.seed(seeds[i])
    bdata <- odata[sample(seq(nrow(odata)), size = nrow(odata), replace = TRUE), ]
    bpred <- predict(update(model, data = bdata), type = "response", newdata = pdata)
    rpois(length(bpred), lambda = bpred)
  }
  boot_ci <- t(apply(boot_y, 2, quantile, c(lp, up)))
  return(data.frame(pred = predict(model, newdata = pdata, type = "response"), lower = boot_ci[, 1], upper = boot_ci[, 2]))
}

boot_pi(mdl, new_fake, 1000, 0.95)
#      pred lower upper
#1 12.63040     6    21
#2 38.69738    25    55
#3 26.97271    16    39
#4 10.69951     4    18
#5 52.50839    35    70

The second method is based on the simulation and outlined as below:

1. Re-produced the model response variable, e.g. Claim_Count, by simulating Poisson random numbers with lambda equal to predicted values from the original model;

2. Repeated the above simulations many times, e.g. 1000, to generate many response series;

3. Generated 1000 updated model samples by replacing the original response with the new response generated from simulations;

4. Refitted models with these updated samples

5. Generated predictions with new data points, e.g. “new_fake”, but with refitted models;

6. Generated Poisson random numbers with lambda equal to the predicted values from refitted models

7. Collected all Poisson random numbers from the previous step and calculated the percentiles.

sim_pi <- function(model, pdata, n, p) {
  odata <- model$data
  yhat <- predict(model, type = "response")
  lp <- (1 - p) / 2
  up <- 1 - lp
  set.seed(2016)
  seeds <- round(runif(n, 1, 1000), 0)
  sim_y <- foreach(i = 1:n, .combine = rbind) %dopar% {
    set.seed(seeds[i])
    sim_y <- rpois(length(yhat), lambda = yhat)
    sdata <- data.frame(y = sim_y, odata[names(model$x)])
    refit <- glm(y ~ ., data = sdata, family = poisson)
    bpred <- predict(refit, type = "response", newdata = pdata)
    rpois(length(bpred),lambda = bpred)
  }
  sim_ci <- t(apply(sim_y, 2, quantile, c(lp, up)))
  return(data.frame(pred = predict(model, newdata = pdata, type = "response"), lower = sim_ci[, 1], upper = sim_ci[, 2]))
}

sim_pi(mdl, new_fake, 1000, 0.95)
#      pred lower upper
#1 12.63040     6    21
#2 38.69738    26    52
#3 26.97271    17    39
#4 10.69951     4    18
#5 52.50839    38    68

As demonstrated above, after a large number of replications, outcomes from both methods are highly consistent.

Written by statcompute

December 20, 2015 at 2:54 pm

Calculate Leave-One-Out Prediction for GLM

In the model development, the “leave-one-out” prediction is a way of cross-validation, calculated as below:
1. First of all, after a model is developed, each observation used in the model development is removed in turn and then the model is refitted with the remaining observations
2. The out-of-sample prediction for the refitted model is calculated with the removed observation one by one to assemble the LOO, e.g. leave-one-out predicted values for the whole model development sample.
The loo_predict() function below is a general routine to calculate the LOO prediction for any GLM object, which can be further employed to investigate the model stability and predictability.

> pkgs <- c('doParallel', 'foreach')
> lapply(pkgs, require, character.only = T)
[[1]]
[1] TRUE

[[2]]
[1] TRUE

> registerDoParallel(cores = 8)
>
> data(AutoCollision, package = "insuranceData")
> # A GAMMA GLM #
> model1 <- glm(Severity ~ Age + Vehicle_Use, data = AutoCollision, family = Gamma(link = "log"))
> # A POISSON GLM #
> model2 <- glm(Claim_Count ~ Age + Vehicle_Use, data = AutoCollision, family = poisson(link = "log"))
>
> loo_predict <- function(obj) {
+   yhat <- foreach(i = 1:nrow(obj$data), .combine = rbind) %dopar% {
+     predict(update(obj, data = obj$data[-i, ]), obj$data[i,], type = "response")
+   }
+   return(data.frame(result = yhat[, 1], row.names = NULL))
+ }
> # TEST CASE 1
> test1 <- loo_predict(model1)
> test1$result
 [1] 303.7393 328.7292 422.6610 375.5023 240.9785 227.6365 288.4404 446.5589
 [9] 213.9368 244.7808 278.7786 443.2256 213.9262 243.2495 266.9166 409.2565
[17] 175.0334 172.0683 197.2911 326.5685 187.2529 215.9931 249.9765 349.3873
[25] 190.1174 218.6321 243.7073 359.9631 192.3655 215.5986 233.1570 348.2781
> # TEST CASE 2
> test2 <- loo_predict(model2)
> test2$result
 [1]  11.15897  37.67273  28.76127  11.54825  50.26364 152.35489 122.23782
 [8]  44.57048 129.58158 465.84173 260.48114 107.23832 167.40672 510.41127
[15] 316.50765 121.75804 172.56928 546.25390 341.03826 134.04303 359.30141
[22] 977.29107 641.69934 251.32547 248.79229 684.86851 574.13994 238.42209
[29] 148.77733 504.12221 422.75047 167.61203

Written by statcompute

December 13, 2015 at 1:36 am

Multivariate Adaptive Regression Splines with Python

In [1]: import statsmodels.datasets as datasets

In [2]: import sklearn.metrics as metrics

In [3]: from numpy import log

In [4]: from pyearth import Earth as earth

In [5]: boston = datasets.get_rdataset("Boston", "MASS").data

In [6]: x = boston.ix[:, 0:boston.shape[1] - 1]

In [7]: xlabel = list(x.columns)

In [8]: y = boston.ix[:, boston.shape[1] - 1]

In [9]: model = earth(enable_pruning = True, penalty = 3, minspan_alpha = 0.05, endspan_alpha = 0.05)

In [10]: model.fit(x, log(y), xlabels = xlabel)
Out[10]:
Earth(allow_linear=None, check_every=None, enable_pruning=True, endspan=None,
   endspan_alpha=0.05, max_degree=None, max_terms=None,
   min_search_points=None, minspan=None, minspan_alpha=0.05, penalty=3,
   smooth=None, thresh=None)

In [11]: print model.summary()
Earth Model
--------------------------------------
Basis Function   Pruned  Coefficient
--------------------------------------
(Intercept)      No      2.93569
h(lstat-5.9)     No      -0.0249319
h(5.9-lstat)     No      0.067697
h(rm-6.402)      No      0.230063
h(6.402-rm)      Yes     None
crim             Yes     None
h(dis-1.5004)    No      -0.0369272
h(1.5004-dis)    No      1.70517
h(ptratio-18.6)  No      -0.0393493
h(18.6-ptratio)  No      0.0278253
nox              No      -0.767146
h(indus-25.65)   No      -0.147471
h(25.65-indus)   Yes     None
h(black-395.24)  Yes     None
h(395.24-black)  Yes     None
h(crim-24.8017)  Yes     None
h(24.8017-crim)  No      0.0292657
rad              No      0.00807783
h(rm-7.853)      Yes     None
h(7.853-rm)      Yes     None
chas             Yes     None
--------------------------------------
MSE: 0.0265, GCV: 0.0320, RSQ: 0.8409, GRSQ: 0.8090

In [12]: metrics.r2_score(log(y), model.predict(x))
Out[12]: 0.840861083407211

Written by statcompute

December 11, 2015 at 9:45 pm

Download Federal Reserve Economic Data (FRED) with Python

In the operational loss calculation, it is important to use CPI (Consumer Price Index) adjusting historical losses. Below is an example showing how to download CPI data online directly from Federal Reserve Bank of St. Louis and then to calculate monthly and quarterly CPI adjustment factors with Python.

In [1]: import pandas_datareader.data as web

In [2]: import pandas as pd

In [3]: import numpy as np

In [4]: import datetime as dt

In [5]: # SET START AND END DATES OF THE SERIES

In [6]: sdt = dt.datetime(2000, 1, 1)

In [7]: edt = dt.datetime(2015, 9, 1)

In [8]: cpi = web.DataReader("CPIAUCNS", "fred", sdt, edt)

In [9]: cpi.head()
Out[9]:
            CPIAUCNS
DATE
2000-01-01     168.8
2000-02-01     169.8
2000-03-01     171.2
2000-04-01     171.3
2000-05-01     171.5

In [10]: df1 = pd.DataFrame({'month': [dt.datetime.strftime(i, "%Y-%m") for i in cpi.index]})

In [11]: df1['qtr'] = [str(x.year) + "-Q" + str(x.quarter) for x in cpi.index]

In [12]: df1['m_cpi'] = cpi.values

In [13]: df1.index = cpi.index

In [14]: grp = df1.groupby('qtr', as_index = False)

In [15]: df2 = grp['m_cpi'].agg({'q_cpi': np.mean})

In [16]: df3 = pd.merge(df1, df2, how = 'inner', left_on = 'qtr', right_on = 'qtr')

In [17]: maxm_cpi = np.array(df3.m_cpi)[-1]

In [18]: maxq_cpi = np.array(df3.q_cpi)[-1]

In [19]: df3['m_factor'] = maxm_cpi / df3.m_cpi

In [20]: df3['q_factor'] = maxq_cpi / df3.q_cpi

In [21]: df3.index = cpi.index

In [22]: final = df3.sort_index(ascending = False)

In [23]: final.head(12)
Out[23]:
              month      qtr    m_cpi       q_cpi  m_factor  q_factor
DATE
2015-09-01  2015-09  2015-Q3  237.945  238.305000  1.000000  1.000000
2015-08-01  2015-08  2015-Q3  238.316  238.305000  0.998443  1.000000
2015-07-01  2015-07  2015-Q3  238.654  238.305000  0.997029  1.000000
2015-06-01  2015-06  2015-Q2  238.638  237.680667  0.997096  1.002627
2015-05-01  2015-05  2015-Q2  237.805  237.680667  1.000589  1.002627
2015-04-01  2015-04  2015-Q2  236.599  237.680667  1.005689  1.002627
2015-03-01  2015-03  2015-Q1  236.119  234.849333  1.007733  1.014714
2015-02-01  2015-02  2015-Q1  234.722  234.849333  1.013731  1.014714
2015-01-01  2015-01  2015-Q1  233.707  234.849333  1.018134  1.014714
2014-12-01  2014-12  2014-Q4  234.812  236.132000  1.013343  1.009202
2014-11-01  2014-11  2014-Q4  236.151  236.132000  1.007597  1.009202
2014-10-01  2014-10  2014-Q4  237.433  236.132000  1.002156  1.009202

Written by statcompute

December 10, 2015 at 9:28 pm