Sparkling Water and Moving Data Around

Sparkling Water is an application to integrate H2O with Spark. Below is an example showing how to move the data around among Pandas DataFrame, H2OFrame, and Spark Dataframe.

1. Define Context

In [1]: from pandas import read_csv, DataFrame

In [2]: from pyspark import sql

In [3]: from pysparkling import H2OContext

In [4]: from h2o import import_file, H2OFrame

In [5]: ss = sql.SparkSession.builder.getOrCreate()

In [6]: hc = H2OContext.getOrCreate(ss)

2. Convert Pandas Dataframe to H2OFrame and Spark DataFrame

In [7]: p_df = read_csv("Documents/credit_count.txt")

In [8]: type(p_df)
Out[8]: pandas.core.frame.DataFrame

In [9]: p2s_df = ss.createDataFrame(p_df)

In [10]: type(p2s_df)
Out[10]: pyspark.sql.dataframe.DataFrame

In [11]: p2h_df = H2OFrame(p_df)

In [12]: type(p2h_df)
Out[12]: h2o.frame.H2OFrame

3. Convert Spark Dataframe to H2OFrame and Pandas DataFrame

In [13]: s_df = ss.read.csv("Documents/credit_count.txt", header = True, inferSchema = True)

In [14]: type(s_df)
Out[14]: pyspark.sql.dataframe.DataFrame

In [15]: s2p_df = s_df.toPandas()

In [16]: type(s2p_df)
Out[16]: pandas.core.frame.DataFrame

In [17]: s2h_df = hc.as_h2o_frame(s_df)

In [18]: type(s2h_df)
Out[18]: h2o.frame.H2OFrame

4. Convert H2OFrame to Pandas Dataframe and Spark DataFrame

In [19]: h_df = import_file("Documents/credit_count.txt", header = 1, sep = ",")

In [20]: type(h_df)
Out[20]: h2o.frame.H2OFrame

In [21]: h2p_df = h_df.as_data_frame()

In [22]: type(h2p_df)
Out[22]: pandas.core.frame.DataFrame

In [23]: h2s_df = hc.as_spark_frame(h_df)

In [24]: type(h2s_df)
Out[24]: pyspark.sql.dataframe.DataFrame

Flavors of SQL on Pandas DataFrame

In R, sqldf() provides a convenient interface of running SQL statement on data frames. Similarly, Python also offers multiple ways to interact between SQL and Pandas DataFrames by leveraging the lightweight SQLite engine. While pandasql (https://github.com/yhat/pandasql) works similarly to sqldf() in R, pysqldf (https://github.com/airtoxin/pysqldf) is even more powerful. In my experiments shown below, advantages of pysqldf over pandasql are two-fold. First of all, pysqldf is 2 – 3 times faster than pandasql. Secondly, pysqldf supports new function definitions, which is not available in pandasql. However, it is worth mentioning that the generic python interface to an in-memory SQLite database can be more efficient and flexible than both pysqldf and pandasql, as demonstrated below, as long as we are able to get the DataFrame into the SQLite and let it stay in-memory.

from sqlite3 import connect
from pandas import read_sql_query
import pandasql
import pysqldf
import numpy

# CREATE AN IN-MEMORY SQLITE DB
con = connect(":memory:")
cur = con.cursor()
cur.execute("attach 'my.db' as filedb")
cur.execute("create table df as select * from filedb.hflights")
cur.execute("detach filedb")

# IMPORT SQLITE TABLE INTO PANDAS DF
df = read_sql_query("select * from df", con)

# WRITE QUERIES
sql01 = "select * from df where DayofWeek = 1 and Dest = 'CVG';"
sql02 = "select DayofWeek, AVG(ArrTime) from df group by DayofWeek;"
sql03 = "select DayofWeek, median(ArrTime) from df group by DayofWeek;"

# SELECTION:
# 1. PANDASQL
%time t11 = pandasql.sqldf(sql01, globals())
# 2. PYSQLDF
%time t12 = pysqldf.SQLDF(globals()).execute(sql01)
# 3. GENERIC SQLITE CONNECTION
%time t13 = read_sql_query(sql01, con)

# AGGREGATION:
# 1. PANDASQL
%time t21 = pandasql.sqldf(sql02, globals())
# 2. PYSQLDF
%time t22 = pysqldf.SQLDF(globals()).execute(sql02)
# 3. GENERIC SQLITE CONNECTION
%time t23 = read_sql_query(sql02, con)

# DEFINING A NEW FUNCTION:
# DEFINE A FUNCTION NOT SUPPORTED IN SQLITE
class median(object):
  def __init__(self):
    self.a = []
  def step(self, x):
    self.a.append(x)
  def finalize(self):
    return numpy.median(self.a)

# 1. PYSQLDF
udafs = {"median": median}
%time t31 = pysqldf.SQLDF(globals(), udafs = udafs).execute(sql03)
# 2 GENERIC SQLITE CONNECTION
con.create_aggregate("median", 1, median)
%time t32 = read_sql_query(sql03, con)

Python Prototype of Grid Search for SVM Parameters

from itertools import product
from pandas import read_table, DataFrame
from sklearn.cross_validation import KFold as kfold
from sklearn.svm import SVC as svc
from sklearn.metrics import roc_auc_score as auc

df = read_table('credit_count.txt', sep = ',')
Y = df[df.CARDHLDR == 1].DEFAULT
X = df[df.CARDHLDR == 1][['AGE', 'ADEPCNT', 'MAJORDRG', 'MINORDRG', 'INCOME', 'OWNRENT', 'SELFEMPL']]

c = [1, 10]
g = [0.01, 0.001]
parms = [i for i in product(c, g)]
kf = [i for i in kfold(Y.count(), n_folds = 3, shuffle = True, random_state = 0)]
final = DataFrame()

for i in parms:
  result = DataFrame()	
  mdl = svc(C = i[0], gamma = i[1], probability = True, random_state = 0)
  for j in kf:
    X1 = X.iloc[j[0]]
    Y1 = Y.iloc[j[0]]
    X2 = X.iloc[j[1]]
    Y2 = Y.iloc[j[1]]
    mdl.fit(X1, Y1)
    pred = mdl.predict_proba(X2)[:, 1]
    out = DataFrame({'pred': pred, 'y': Y2})
    result = result.append(out)
  perf = DataFrame({'Cost': i[0], 'Gamma': i[1], 'AUC': [auc(result.y, result.pred)]}) 
  final = final.append(perf)

Import CSV by Chunk Simultaneously with IPython Parallel

IPython Parallel is a friendly interface to implement the parallelism in Python. Below is an example showing how to import a csv file into the Pandas DataFrame by chunk simultaneously through the interface of IPython Parallel.

In [1]: ### NEED TO RUN "IPCLUSTER START -N 2 &" FIRST ###

In [2]: import pandas as pd

In [3]: import ipyparallel as ip

In [4]: str = "AutoCollision.csv"

In [5]: rc = ip.Client()

In [6]: ### showing 2 engines working

In [7]: rc.ids
Out[7]: [0, 1]

In [8]: def para_read(file):
   ...:       dview = rc[:]
   ...:       # PARTITION THE IMPORT BY SCATTER() #
   ...:       dview.scatter("df", pd.read_csv(file))
   ...:       return pd.concat([i for i in dview["df"]])
   ...:

In [9]: ### PARALLEL IMPORT ###

In [10]: df1 = para_read(str)

In [11]: ### SERIAL IMPORT ###

In [12]: df2 = pd.read_csv(str)

In [13]: df1.equals(df2)
Out[13]: True

Download Federal Reserve Economic Data (FRED) with Python

In the operational loss calculation, it is important to use CPI (Consumer Price Index) adjusting historical losses. Below is an example showing how to download CPI data online directly from Federal Reserve Bank of St. Louis and then to calculate monthly and quarterly CPI adjustment factors with Python.

In [1]: import pandas_datareader.data as web

In [2]: import pandas as pd

In [3]: import numpy as np

In [4]: import datetime as dt

In [5]: # SET START AND END DATES OF THE SERIES

In [6]: sdt = dt.datetime(2000, 1, 1)

In [7]: edt = dt.datetime(2015, 9, 1)

In [8]: cpi = web.DataReader("CPIAUCNS", "fred", sdt, edt)

In [9]: cpi.head()
Out[9]:
            CPIAUCNS
DATE
2000-01-01     168.8
2000-02-01     169.8
2000-03-01     171.2
2000-04-01     171.3
2000-05-01     171.5

In [10]: df1 = pd.DataFrame({'month': [dt.datetime.strftime(i, "%Y-%m") for i in cpi.index]})

In [11]: df1['qtr'] = [str(x.year) + "-Q" + str(x.quarter) for x in cpi.index]

In [12]: df1['m_cpi'] = cpi.values

In [13]: df1.index = cpi.index

In [14]: grp = df1.groupby('qtr', as_index = False)

In [15]: df2 = grp['m_cpi'].agg({'q_cpi': np.mean})

In [16]: df3 = pd.merge(df1, df2, how = 'inner', left_on = 'qtr', right_on = 'qtr')

In [17]: maxm_cpi = np.array(df3.m_cpi)[-1]

In [18]: maxq_cpi = np.array(df3.q_cpi)[-1]

In [19]: df3['m_factor'] = maxm_cpi / df3.m_cpi

In [20]: df3['q_factor'] = maxq_cpi / df3.q_cpi

In [21]: df3.index = cpi.index

In [22]: final = df3.sort_index(ascending = False)

In [23]: final.head(12)
Out[23]:
              month      qtr    m_cpi       q_cpi  m_factor  q_factor
DATE
2015-09-01  2015-09  2015-Q3  237.945  238.305000  1.000000  1.000000
2015-08-01  2015-08  2015-Q3  238.316  238.305000  0.998443  1.000000
2015-07-01  2015-07  2015-Q3  238.654  238.305000  0.997029  1.000000
2015-06-01  2015-06  2015-Q2  238.638  237.680667  0.997096  1.002627
2015-05-01  2015-05  2015-Q2  237.805  237.680667  1.000589  1.002627
2015-04-01  2015-04  2015-Q2  236.599  237.680667  1.005689  1.002627
2015-03-01  2015-03  2015-Q1  236.119  234.849333  1.007733  1.014714
2015-02-01  2015-02  2015-Q1  234.722  234.849333  1.013731  1.014714
2015-01-01  2015-01  2015-Q1  233.707  234.849333  1.018134  1.014714
2014-12-01  2014-12  2014-Q4  234.812  236.132000  1.013343  1.009202
2014-11-01  2014-11  2014-Q4  236.151  236.132000  1.007597  1.009202
2014-10-01  2014-10  2014-Q4  237.433  236.132000  1.002156  1.009202

Fitting Generalized Regression Neural Network with Python

In [1]: # LOAD PACKAGES

In [2]: import pandas as pd

In [3]: import numpy as np

In [4]: from sklearn import preprocessing as pp

In [5]: from sklearn import cross_validation as cv

In [6]: from neupy.algorithms import GRNN as grnn

In [7]: from neupy.functions import mse

In [8]: # DATA PROCESSING

In [9]: df = pd.read_table("csdata.txt")

In [10]: y = df.ix[:, 0]

In [11]: y.describe()
Out[11]:
count    4421.000000
mean        0.090832
std         0.193872
min         0.000000
25%         0.000000
50%         0.000000
75%         0.011689
max         0.998372
Name: LEV_LT3, dtype: float64

In [12]: x = df.ix[:, 1:df.shape[1]]

In [13]: st_x = pp.scale(x)

In [14]: st_x.mean(axis = 0)
Out[14]:
array([  1.88343648e-17,   5.76080438e-17,  -1.76540780e-16,
        -7.71455583e-17,  -3.80705294e-17,   3.79409490e-15,
         4.99487355e-17,  -2.97100804e-15,   3.93261537e-15,
        -8.70310886e-16,  -1.30728071e-15])

In [15]: st_x.std(axis = 0)
Out[15]: array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])

In [16]: x_train, x_test, y_train, y_test = cv.train_test_split(st_x, y, train_size = 0.7, random_state = 2015)

In [17]: # TRAIN THE NEURAL NETWORK

In [18]: def try_std(x):
   ....:       nn = grnn(std = x, verbose = False)
   ....:       nn.train(x_train, y_train)
   ....:       y_pred = nn.predict(x_test)
   ....:       print mse(y_pred, y_test)
   ....:

In [19]: # TEST A LIST OF VALUES FOR THE TUNING PARAMETER

In [20]: for x in np.linspace(0.5, 1.5, 11):
   ....:       print x
   ....:       try_std(x)
   ....:
0.5
0.034597892756
0.6
0.0331189699098
0.7
0.0323384657283
0.8
0.0319580849146
0.9
0.0318001764256
1.0
0.031751821704
1.1
0.031766356369
1.2
0.03183082142
1.3
0.0319348198865
1.4
0.0320623872248
1.5
0.03219800235

Modeling Frequency in Operational Losses with Python

Poisson and Negative Binomial regressions are two popular approaches to model frequency measures in the operational loss and can be implemented in Python with the statsmodels package as below:

In [1]: import pandas as pd

In [2]: import statsmodels.api as sm

In [3]: import statsmodels.formula.api as smf

In [4]: df = pd.read_csv(&quot;AutoCollision.csv&quot;)

In [5]: # FITTING A POISSON REGRESSION

In [6]: poisson = smf.glm(formula = &quot;Claim_Count ~ Age + Vehicle_Use&quot;, data = df, family = sm.families.Poisson(sm.families.links.log))

In [7]: poisson.fit().summary()
Out[7]:
&lt;class 'statsmodels.iolib.summary.Summary'&gt;
&quot;&quot;&quot;
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:            Claim_Count   No. Observations:                   32
Model:                            GLM   Df Residuals:                       21
Model Family:                 Poisson   Df Model:                           10
Link Function:                    log   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -204.40
Date:                Tue, 08 Dec 2015   Deviance:                       184.72
Time:                        20:31:27   Pearson chi2:                     184.
No. Iterations:                     9
=============================================================================================
                                coef    std err          z      P&gt;|z|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------------------
Intercept                     2.3702      0.110     21.588      0.000         2.155     2.585
Age[T.21-24]                  1.4249      0.118     12.069      0.000         1.193     1.656
Age[T.25-29]                  2.3465      0.111     21.148      0.000         2.129     2.564
Age[T.30-34]                  2.5153      0.110     22.825      0.000         2.299     2.731
Age[T.35-39]                  2.5821      0.110     23.488      0.000         2.367     2.798
Age[T.40-49]                  3.2247      0.108     29.834      0.000         3.013     3.437
Age[T.50-59]                  3.0019      0.109     27.641      0.000         2.789     3.215
Age[T.60+]                    2.6391      0.110     24.053      0.000         2.424     2.854
Vehicle_Use[T.DriveLong]      0.9246      0.036     25.652      0.000         0.854     0.995
Vehicle_Use[T.DriveShort]     1.2856      0.034     37.307      0.000         1.218     1.353
Vehicle_Use[T.Pleasure]       0.1659      0.041      4.002      0.000         0.085     0.247
=============================================================================================
&quot;&quot;&quot;

In [8]: # FITTING A NEGATIVE BINOMIAL REGRESSION

In [9]: nbinom = smf.glm(formula = &quot;Claim_Count ~ Age + Vehicle_Use&quot;, data = df, family = sm.families.NegativeBinomial(sm.families.links.log))

In [10]: nbinom.fit().summary()
Out[10]:
&lt;class 'statsmodels.iolib.summary.Summary'&gt;
&quot;&quot;&quot;
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:            Claim_Count   No. Observations:                   32
Model:                            GLM   Df Residuals:                       21
Model Family:        NegativeBinomial   Df Model:                           10
Link Function:                    log   Scale:                 0.0646089484752
Method:                          IRLS   Log-Likelihood:                -198.15
Date:                Tue, 08 Dec 2015   Deviance:                       1.4436
Time:                        20:31:27   Pearson chi2:                     1.36
No. Iterations:                    11
=============================================================================================
                                coef    std err          z      P&gt;|z|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------------------
Intercept                     2.2939      0.153     14.988      0.000         1.994     2.594
Age[T.21-24]                  1.4546      0.183      7.950      0.000         1.096     1.813
Age[T.25-29]                  2.4133      0.183     13.216      0.000         2.055     2.771
Age[T.30-34]                  2.5636      0.183     14.042      0.000         2.206     2.921
Age[T.35-39]                  2.6259      0.183     14.384      0.000         2.268     2.984
Age[T.40-49]                  3.2408      0.182     17.760      0.000         2.883     3.598
Age[T.50-59]                  2.9717      0.183     16.283      0.000         2.614     3.329
Age[T.60+]                    2.6404      0.183     14.463      0.000         2.283     2.998
Vehicle_Use[T.DriveLong]      0.9480      0.128      7.408      0.000         0.697     1.199
Vehicle_Use[T.DriveShort]     1.3402      0.128     10.480      0.000         1.090     1.591
Vehicle_Use[T.Pleasure]       0.3265      0.128      2.548      0.011         0.075     0.578
=============================================================================================
&quot;&quot;&quot;

Although Quasi-Poisson regressions is not currently supported by the statsmodels package, we are still able to estimate the model with the rpy2 package by using R in the back-end. As shown in the output below, parameter estimates in Quasi-Poisson model are identical to the ones in standard Poisson model. In case that we want a flexible model approach for frequency measures in the operational loss forecast without pursuing more complex Negative Binomial model, Quasi-Poisson regression can be considered a serious contender.

In [11]: # FITTING A QUASI-POISSON REGRESSION

In [12]: import rpy2.robjects as ro

In [13]: from rpy2.robjects import pandas2ri

In [14]: pandas2ri.activate()

In [15]: rdf = pandas2ri.py2ri_pandasdataframe(df)

In [16]: qpoisson = ro.r.glm('Claim_Count ~ Age + Vehicle_Use', data = rdf, family = ro.r('quasipoisson(link = &quot;log&quot;)'))

In [17]: print ro.r.summary(qpoisson)

Coefficients:
                      Estimate Std. Error t value Pr(&gt;|t|)
(Intercept)             2.3702     0.3252   7.288 3.55e-07 ***
Age21-24                1.4249     0.3497   4.074 0.000544 ***
Age25-29                2.3465     0.3287   7.140 4.85e-07 ***
Age30-34                2.5153     0.3264   7.705 1.49e-07 ***
Age35-39                2.5821     0.3256   7.929 9.49e-08 ***
Age40-49                3.2247     0.3202  10.072 1.71e-09 ***
Age50-59                3.0019     0.3217   9.331 6.42e-09 ***
Age60+                  2.6391     0.3250   8.120 6.48e-08 ***
Vehicle_UseDriveLong    0.9246     0.1068   8.660 2.26e-08 ***
Vehicle_UseDriveShort   1.2856     0.1021  12.595 2.97e-11 ***
Vehicle_UsePleasure     0.1659     0.1228   1.351 0.191016
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 8.774501)

    Null deviance: 6064.97  on 31  degrees of freedom
Residual deviance:  184.72  on 21  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 4

Modeling Severity in Operational Losses with Python

When modeling severity measurements in the operational loss with Generalized Linear Models, we might have a couple choices based on different distributional assumptions, including Gamma, Inverse Gaussian, and Lognormal. However, based on my observations from the empirical work, the differences in parameter estimates among these three popular candidates are rather immaterial from the practical standpoint.

Below is a demonstration showing how to model the severity with the insurance data under aforementioned three distributions. As shown, albeit with inferential differences, three models show similar coefficients.

In [1]: # LOAD PACKAGES

In [2]: import pandas as pd

In [3]: import numpy as np

In [4]: import statsmodels.api as sm

In [5]: import statsmodels.formula.api as smf

In [6]: df = pd.read_csv(&quot;AutoCollision.csv&quot;)

In [7]: df.head()
Out[7]:
     Age Vehicle_Use  Severity  Claim_Count
0  17-20    Pleasure    250.48           21
1  17-20  DriveShort    274.78           40
2  17-20   DriveLong    244.52           23
3  17-20    Business    797.80            5
4  21-24    Pleasure    213.71           63

In [8]: # FIT A GAMMA REGRESSION

In [9]: gamma = smf.glm(formula = &quot;Severity ~ Age + Vehicle_Use&quot;, data = df, family = sm.families.Gamma(sm.families.links.log))

In [10]: gamma.fit().summary()
Out[10]:
&lt;class 'statsmodels.iolib.summary.Summary'&gt;
&quot;&quot;&quot;
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:               Severity   No. Observations:                   32
Model:                            GLM   Df Residuals:                       21
Model Family:                   Gamma   Df Model:                           10
Link Function:                    log   Scale:                 0.0299607547345
Method:                          IRLS   Log-Likelihood:                -161.35
Date:                Sun, 06 Dec 2015   Deviance:                      0.58114
Time:                        12:59:17   Pearson chi2:                    0.629
No. Iterations:                     8
=============================================================================================
                                coef    std err          z      P&gt;|z|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------------------
Intercept                     6.2413      0.101     61.500      0.000         6.042     6.440
Age[T.21-24]                 -0.2080      0.122     -1.699      0.089        -0.448     0.032
Age[T.25-29]                 -0.2303      0.122     -1.881      0.060        -0.470     0.010
Age[T.30-34]                 -0.2630      0.122     -2.149      0.032        -0.503    -0.023
Age[T.35-39]                 -0.5311      0.122     -4.339      0.000        -0.771    -0.291
Age[T.40-49]                 -0.3820      0.122     -3.121      0.002        -0.622    -0.142
Age[T.50-59]                 -0.3741      0.122     -3.057      0.002        -0.614    -0.134
Age[T.60+]                   -0.3939      0.122     -3.218      0.001        -0.634    -0.154
Vehicle_Use[T.DriveLong]     -0.3573      0.087     -4.128      0.000        -0.527    -0.188
Vehicle_Use[T.DriveShort]    -0.5053      0.087     -5.839      0.000        -0.675    -0.336
Vehicle_Use[T.Pleasure]      -0.5886      0.087     -6.801      0.000        -0.758    -0.419
=============================================================================================
&quot;&quot;&quot;

In [11]: # FIT A INVERSE GAUSSIAN REGRESSION

In [12]: igauss = smf.glm(formula = &quot;Severity ~ Age + Vehicle_Use&quot;, data = df, family = sm.families.InverseGaussian(sm.families.links.log))

In [13]: igauss.fit().summary()
Out[13]:
&lt;class 'statsmodels.iolib.summary.Summary'&gt;
&quot;&quot;&quot;
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:               Severity   No. Observations:                   32
Model:                            GLM   Df Residuals:                       21
Model Family:         InverseGaussian   Df Model:                           10
Link Function:                    log   Scale:               8.73581523073e-05
Method:                          IRLS   Log-Likelihood:                -156.44
Date:                Sun, 06 Dec 2015   Deviance:                    0.0015945
Time:                        13:01:14   Pearson chi2:                  0.00183
No. Iterations:                     7
=============================================================================================
                                coef    std err          z      P&gt;|z|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------------------
Intercept                     6.1776      0.103     59.957      0.000         5.976     6.379
Age[T.21-24]                 -0.1475      0.116     -1.269      0.204        -0.375     0.080
Age[T.25-29]                 -0.1632      0.116     -1.409      0.159        -0.390     0.064
Age[T.30-34]                 -0.2079      0.115     -1.814      0.070        -0.433     0.017
Age[T.35-39]                 -0.4732      0.108     -4.361      0.000        -0.686    -0.261
Age[T.40-49]                 -0.3299      0.112     -2.954      0.003        -0.549    -0.111
Age[T.50-59]                 -0.3206      0.112     -2.866      0.004        -0.540    -0.101
Age[T.60+]                   -0.3465      0.111     -3.115      0.002        -0.565    -0.128
Vehicle_Use[T.DriveLong]     -0.3334      0.084     -3.992      0.000        -0.497    -0.170
Vehicle_Use[T.DriveShort]    -0.4902      0.081     -6.055      0.000        -0.649    -0.332
Vehicle_Use[T.Pleasure]      -0.5743      0.080     -7.206      0.000        -0.731    -0.418
=============================================================================================
&quot;&quot;&quot;

In [14]: # FIT A LOGNORMAL REGRESSION

In [15]: df['Log_Severity'] = np.log(df.Severity)

In [16]: lognormal = smf.glm(formula = &quot;Log_Severity ~ Age + Vehicle_Use&quot;, data = df, family = sm.families.Gaussian())

In [17]: lognormal.fit().summary()
Out[17]:
&lt;class 'statsmodels.iolib.summary.Summary'&gt;
&quot;&quot;&quot;
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:           Log_Severity   No. Observations:                   32
Model:                            GLM   Df Residuals:                       21
Model Family:                Gaussian   Df Model:                           10
Link Function:               identity   Scale:                 0.0265610360381
Method:                          IRLS   Log-Likelihood:                 19.386
Date:                Sun, 06 Dec 2015   Deviance:                      0.55778
Time:                        13:02:12   Pearson chi2:                    0.558
No. Iterations:                     4
=============================================================================================
                                coef    std err          z      P&gt;|z|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------------------
Intercept                     6.1829      0.096     64.706      0.000         5.996     6.370
Age[T.21-24]                 -0.1667      0.115     -1.447      0.148        -0.393     0.059
Age[T.25-29]                 -0.1872      0.115     -1.624      0.104        -0.413     0.039
Age[T.30-34]                 -0.2163      0.115     -1.877      0.061        -0.442     0.010
Age[T.35-39]                 -0.4901      0.115     -4.252      0.000        -0.716    -0.264
Age[T.40-49]                 -0.3347      0.115     -2.904      0.004        -0.561    -0.109
Age[T.50-59]                 -0.3267      0.115     -2.835      0.005        -0.553    -0.101
Age[T.60+]                   -0.3467      0.115     -3.009      0.003        -0.573    -0.121
Vehicle_Use[T.DriveLong]     -0.3481      0.081     -4.272      0.000        -0.508    -0.188
Vehicle_Use[T.DriveShort]    -0.4903      0.081     -6.016      0.000        -0.650    -0.331
Vehicle_Use[T.Pleasure]      -0.5726      0.081     -7.027      0.000        -0.732    -0.413
=============================================================================================
&quot;&quot;&quot;

Ibis – A New Kid in Town

Developed by Wes McKinney, pandas is a very efficient and powerful data analysis tool in python language for data scientists. Same as R, pandas reads the data into memory. As a result, we might often face the problem of running out of memory while analyzing large-size data with pandas.

Similar to Blaze, ibis is a new data analysis framework in python built on top of other back-end data engines, such as sqlite and impala. Even better, ibis provides a higher compatibility to pandas and better performance than Blaze.

In a previous blog (https://statcompute.wordpress.com/2015/03/27/a-comparison-between-blaze-and-pandas), I’ve shown the efficiency of Blaze through a simple example. However, in the demonstration below, it is shown that, while applied to the same data with sqlite engine, ibis is 50% more efficient than Blaze in terms of the “real time”.

import ibis as ibis

tbl = ibis.sqlite.connect('//home/liuwensui/Documents/data/flights.db').table('tbl2008')
exp = tbl[tbl.DayOfWeek > 1].group_by("DayOfWeek").aggregate(avg_AirTime = tbl.AirTime.mean())
pd = exp.execute()
print(pd)

#i   DayOfWeek  avg_AirTime
#0          2   103.214930
#1          3   103.058508
#2          4   103.467138
#3          5   103.557539
#4          6   107.400631
#5          7   104.864885
#
#real   0m10.346s
#user   0m9.585s
#sys    0m1.181s

A Comparison between Blaze and Pandas

Blaze (https://github.com/ContinuumIO/blaze) is a lightweight interface on top of other data or computing engines such as numpy or sqlite. Blaze itself doesn’t do any computation but provides a Pandas-like syntax to interact with the back-end data.

Below is an example showing how Blaze leverages the computing power of SQLite (https://www.sqlite.org) and outperforms Pandas when performing some simple tasks on a SQLite table with ~7MM rows. Since Blaze doesn’t have to load the data table into the memory as Pandas does, the cpu time is significantly shorter.

Pandas

import pandas as pda
import pandas.io.sql as psql
import sqlite3 as sql
import numpy as npy

con = sql.connect('/home/liuwensui/Documents/data/flights.db')
ds1 = psql.read_sql('select * from tbl2008', con)
ds2 = ds1[ds1.DayOfWeek > 1]
ds3 = ds2.groupby('DayOfWeek', as_index = False)
ds4 = ds3['AirTime'].agg({'avg_AirTime' : npy.mean})
print(ds4)

#   DayOfWeek  avg_AirTime
#0          2   103.214930
#1          3   103.058508
#2          4   103.467138
#3          5   103.557539
#4          6   107.400631
#5          7   104.864885
#
#real 1m7.241s
#user 1m0.403s
#sys  0m5.278s

Blaze

import blaze as blz
import pandas as pda

ds1 = blz.Data('sqlite:////home/liuwensui/Documents/data/flights.db::tbl2008')
ds2 = ds1[ds1.DayOfWeek > 1]
ds3 = blz.by(ds2.DayOfWeek, avg_AirTime = ds2.AirTime.mean())
ds4 = blz.into(pda.DataFrame, ds3)
print(ds4)

#   DayOfWeek  avg_AirTime
#0          2   103.214930
#1          3   103.058508
#2          4   103.467138
#3          5   103.557539
#4          6   107.400631
#5          7   104.864885
#
#real 0m21.658s
#user 0m10.727s
#sys  0m1.167s