## Faster SQL on Pandas DataFrame with Sandals

Similar to Pandasql, Sandals (https://github.com/jbochi/sandals) provides a convenient interface for running SQL queries against a Pandas DataFrame. However, while Pandasql leverages the power of SQLite in the back-end, Sandals employs a SQL parser to manipulate the DataFrame directly and hence delivers superior performance by avoiding the data exchange between Python and SQLite.
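
For intuition, a parser-based engine effectively rewrites the SQL into native DataFrame operations instead of shipping the data to SQLite and back. Below is an illustrative sketch of what such a translation looks like for the two queries benchmarked in this post; it is only a hand-written approximation, not sandals' actual output.

```python
import pandas as pd

# a toy stand-in for the flights data used in the benchmarks below
df = pd.DataFrame({"DayOfWeek": [1, 2, 1], "ArrTime": [2032, 1500, 1610]})

# select * from df where DayOfWeek = 1 limit 1
first_row = df[df["DayOfWeek"] == 1].head(1)

# select DayOfWeek, AVG(ArrTime) from df group by DayOfWeek
avg_by_dow = df.groupby("DayOfWeek")["ArrTime"].mean()
```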

```python
In [1]: import sqlite3 as sqlite

In [2]: import pandas as pd

In [3]: import pandasql as psql

In [4]: import sandals as ssql

In [5]: con = sqlite.connect("/home/liuwensui/Documents/data/flights.db")

In [6]: df = pd.io.sql.read_sql("select * from tbl2008 limit 100000", con)

In [7]: ### SELECT ###

In [8]: %time psql.sqldf("select * from df where DayOfWeek = 1 limit 1", locals())
CPU times: user 1.69 s, sys: 11 ms, total: 1.7 s
Wall time: 1.7 s
Out[8]:
   Year  Month  DayofMonth  DayOfWeek  DepTime  CRSDepTime  ArrTime  \
0  2008      1           7          1     1842        1845     2032

   CRSArrTime UniqueCarrier  FlightNum ...  TaxiIn  TaxiOut  \
0        2040            WN        746 ...       4        9

   Cancelled CancellationCode  Diverted  CarrierDelay  WeatherDelay  NASDelay  \
0          0              NaN         0           NaN           NaN       NaN

   SecurityDelay  LateAircraftDelay
0            NaN                NaN

[1 rows x 29 columns]

In [9]: %time ssql.sql("select * from df where DayOfWeek = 1 limit 1", locals())
CPU times: user 16.3 ms, sys: 3.79 ms, total: 20.1 ms
Wall time: 18 ms
Out[9]:
       Year  Month  DayofMonth  DayOfWeek  DepTime  CRSDepTime  ArrTime  \
11843  2008      1           7          1     1842        1845     2032

       CRSArrTime UniqueCarrier  FlightNum ...  TaxiIn  TaxiOut  \
11843        2040            WN        746 ...       4        9

       Cancelled CancellationCode  Diverted  CarrierDelay  WeatherDelay  \
11843          0              NaN         0           NaN           NaN

       NASDelay  SecurityDelay  LateAircraftDelay
11843       NaN            NaN                NaN

[1 rows x 29 columns]

In [10]: ### AGGREGATE ###

In [11]: %time psql.sqldf("select DayOfWeek, AVG(ArrTime) from df group by DayOfWeek", locals())
CPU times: user 1.74 s, sys: 15.6 ms, total: 1.75 s
Wall time: 1.75 s
Out[11]:
   DayOfWeek  AVG(ArrTime)
0          1   1484.413548
1          2   1489.169235
2          3   1490.011514
3          4   1478.614701
4          5   1486.728223
5          6   1485.118600
6          7   1540.071341

In [12]: %time ssql.sql("select DayOfWeek, AVG(ArrTime) from df group by DayOfWeek", locals())
CPU times: user 5.72 ms, sys: 0 ns, total: 5.72 ms
Wall time: 4.92 ms
Out[12]:
               ArrTime
DayOfWeek
1          1484.413548
2          1489.169235
3          1490.011514
4          1478.614701
5          1486.728223
6          1485.118600
7          1540.071341
```

## Estimating Time Series Models for Count Outcomes with SAS

In SAS, there is no out-of-the-box procedure to estimate time series models for count outcomes, such as the one shown here (https://statcompute.wordpress.com/2015/03/31/modeling-count-time-series-with-tscount-package). However, as long as we understand the likelihood function of the Poisson distribution, it is straightforward to estimate such a model with **PROC MODEL** in the SAS/ETS module.

Below is a demonstration of how to estimate a Poisson time series model with the identity link function. As shown, the parameter estimates and related inferences are extremely close to the ones estimated with the tscount package in R.
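
For reference, the model estimated below is a Poisson autoregression with the identity link, where the conditional mean follows the recursion

**lambda_[t] = b0 + b1 * y_[t - 1] + b2 * lambda_[t - 1]**

and each observation contributes the Poisson negative log-likelihood

**-loglik_[t] = lambda_[t] - y_[t] * log(lambda_[t]) + log(y_[t]!)**

which is exactly what the lk and ll assignments in the code compute before being passed to the general() error model.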

```sas
data polio;
  idx + 1;
  input y @@;
datalines;
0 1 0 0 1 3 9 2 3 5 3 5 2 2 0 1 0 1 3 3 2 1 1 5 0 3 1 0
1 4 0 0 1 6 14 1 1 0 0 1 1 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0
1 0 0 2 0 1 0 1 0 0 1 2 0 0 1 2 0 3 1 1 0 2 0 4 0 2 1 1
1 1 0 1 1 0 2 1 3 1 2 4 0 0 0 1 0 1 0 2 2 4 2 3 3 0 0 2
7 8 2 4 1 1 2 4 0 1 1 1 3 0 0 0 0 1 0 1 1 0 0 0 0 0 1 2
0 2 0 0 0 1 0 1 0 1 0 2 0 0 1 2 0 1 0 0 0 1 2 1 0 1 3 6
;
run;

proc model data = polio;
  parms b0 = 0.5 b1 = 0.1 b2 = 0.1;
  yhat = b0 + b1 * zlag1(y) + b2 * zlag1(yhat);
  y = yhat;
  lk = exp(-yhat) * (yhat ** y) / fact(y);
  ll = -log(lk);
  errormodel y ~ general(ll);
  fit y / fiml converge = 1e-8;
run;

/* OUTPUT:
            Nonlinear Liklhood Summary of Residual Errors

                DF      DF                                    Adj
Equation     Model   Error      SSE      MSE   R-Square     R-Sq
y                3     165    532.6   3.2277     0.0901   0.0791

            Nonlinear Liklhood Parameter Estimates

                            Approx               Approx
Parameter    Estimate      Std Err   t Value   Pr > |t|
b0           0.606313       0.1680      3.61     0.0004
b1           0.349495       0.0690      5.06     <.0001
b2           0.206877       0.1397      1.48     0.1405

Number of Observations     Statistics for System

Used             168       Log Likelihood   -278.6615
Missing            0
*/
```
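
As a sanity check on the likelihood above, the same model can be fitted outside of SAS. Below is a minimal Python sketch with scipy; the recursion is initialized with zero lagged values to mimic zlag1(), and since the start-up convention and optimizer differ from PROC MODEL, the estimates should be close but not identical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

# the 168 polio counts from the datalines above
y = np.array([
    0, 1, 0, 0, 1, 3, 9, 2, 3, 5, 3, 5, 2, 2, 0, 1, 0, 1, 3, 3, 2, 1, 1, 5, 0, 3, 1, 0,
    1, 4, 0, 0, 1, 6, 14, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,
    1, 0, 0, 2, 0, 1, 0, 1, 0, 0, 1, 2, 0, 0, 1, 2, 0, 3, 1, 1, 0, 2, 0, 4, 0, 2, 1, 1,
    1, 1, 0, 1, 1, 0, 2, 1, 3, 1, 2, 4, 0, 0, 0, 1, 0, 1, 0, 2, 2, 4, 2, 3, 3, 0, 0, 2,
    7, 8, 2, 4, 1, 1, 2, 4, 0, 1, 1, 1, 3, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 2,
    0, 2, 0, 0, 0, 1, 0, 1, 0, 1, 0, 2, 0, 0, 1, 2, 0, 1, 0, 0, 0, 1, 2, 1, 0, 1, 3, 6])

def neg_loglik(par, y):
    b0, b1, b2 = par
    lam = np.empty(len(y))
    lam[0] = b0  # zlag1() treats the missing first lags as zero
    for t in range(1, len(y)):
        lam[t] = b0 + b1 * y[t - 1] + b2 * lam[t - 1]
    if np.any(lam <= 0):  # keep the Poisson intensity positive
        return np.inf
    return np.sum(lam - y * np.log(lam) + gammaln(y + 1))

res = minimize(neg_loglik, x0=[0.5, 0.1, 0.1], args=(y,), method="Nelder-Mead")
print(res.x)  # expected to be in the neighborhood of (0.61, 0.35, 0.21)
```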

## Easy DB Interaction with db.py

The db.py package (https://github.com/yhat/db.py) provides a convenient interface for interacting with databases, e.g. exploring the metadata and executing queries. Below is an example showing how to create an index in a SQLite database with db.py to improve query efficiency.
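
Before the timing comparison, the metadata interface is worth a quick look. Below is a minimal sketch based on db.py's documented db.tables attribute, assuming the same test.db file used in the benchmarks:

```python
In [1]: import db as db

In [2]: db = db.DB(filename = "Documents/data/test.db", dbtype = "sqlite")

In [3]: db.tables                  # summary of all tables and their columns

In [4]: db.tables.tbl2008.head()   # peek at the first few rows of tbl2008
```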

**Without Index**

```python
In [1]: import db as db

In [2]: db = db.DB(filename = "Documents/data/test.db", dbtype = "sqlite")

In [3]: %timeit db.query("select Month, count(*) as cnt from tbl2008 group by Month;")
1 loops, best of 3: 6.28 s per loop

In [4]: %timeit db.query("select * from tbl2008 where Month = 1;")
1 loops, best of 3: 5.61 s per loop
```

**With Index**

```python
In [1]: import db as db

In [2]: db = db.DB(filename = "Documents/data/test.db", dbtype = "sqlite")

In [3]: db.query("create INDEX idx on tbl2008(Month);")

In [4]: db.query("select * from sqlite_master where type = 'index';")
Out[4]:
    type name tbl_name  rootpage                                 sql
0  index  idx  tbl2008    544510  CREATE INDEX idx on tbl2008(Month)

In [5]: %timeit db.query("select Month, count(*) as cnt from tbl2008 group by Month;")
1 loops, best of 3: 787 ms per loop

In [6]: %timeit db.query("select * from tbl2008 where Month = 1;")
1 loops, best of 3: 5.16 s per loop
```
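
As a quick sanity check that SQLite actually exercises the new index, the query plan can be inspected through the same interface; a minimal sketch, with the exact plan wording depending on the SQLite version:

```python
In [7]: db.query("explain query plan select Month, count(*) from tbl2008 group by Month;")
# the plan should reference the index, e.g. something like:
# SCAN TABLE tbl2008 USING COVERING INDEX idx
```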

## Modeling Count Time Series with tscount Package

The example below shows how to estimate a simple univariate Poisson time series model with the tscount package. While the model estimation is straightforward and yields parameter estimates very similar to the ones generated with the acp package (https://statcompute.wordpress.com/2015/03/29/autoregressive-conditional-poisson-model-i), the prediction mechanism is a bit tricky.

1) For the in-sample and the 1-step-ahead predictions:

**yhat_[t] = beta0 + beta1 * y_[t - 1] + beta2 * yhat_[t - 1]**

2) For the out-of-sample predictions with the observed Y unavailable:

**yhat_[t] = beta0 + beta1 * yhat_[t - 1] + beta2 * yhat_[t - 1]**
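
To make the two rules above concrete, below is a small Python sketch that reproduces the out-of-sample forecasts reported in the snippet that follows; the coefficients are the rounded values from summary(mdl), so the numbers match the predict() output only approximately.

```python
# rounded estimates from summary(mdl): intercept, beta_1, alpha_1
b0, b1, b2 = 0.632, 0.350, 0.184

def one_step(y_prev, yhat_prev):
    # yhat[t] = b0 + b1 * y[t - 1] + b2 * yhat[t - 1]
    return b0 + b1 * y_prev + b2 * yhat_prev

# 1-step-ahead: the last observed y (6) and fitted yhat (1.8918651) are available
ov1 = one_step(6, 1.8918651)   # ~3.08
# beyond one step, the forecast replaces the unobserved y in both terms
ov2 = one_step(ov1, ov1)       # ~2.28
ov3 = one_step(ov2, ov2)       # ~1.85
```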

```r
library(tscount)

mdl <- tsglm(cnt$y, model = list(past_obs = 1, past_mean = 1), distr = "poisson")
summary(mdl)
# tsglm(ts = cnt$y, model = list(past_obs = 1, past_mean = 1),
#   distr = "poisson")
#
# Coefficients:
#             Estimate  Std. Error
# (Intercept)    0.632      0.1774
# beta_1         0.350      0.0687
# alpha_1        0.184      0.1455
# Standard errors obtained by normal approximation.
#
# Link function: identity
# Distribution family: poisson
# Number of coefficients: 3
# Log-likelihood: -279.2738
# AIC: 564.5476
# BIC: 573.9195

### in-sample prediction ###
cnt$yhat <- mdl$fitted.values
tail(cnt, 3)
#     y      yhat
# 166 1 0.8637023
# 167 3 1.1404714
# 168 6 1.8918651

### manually check ###
beta <- mdl$coefficients
pv167 <- beta[1] + beta[2] * cnt$y[166] + beta[3] * cnt$yhat[166]
# 1.140471
pv168 <- beta[1] + beta[2] * cnt$y[167] + beta[3] * cnt$yhat[167]
# 1.891865

### out-of-sample prediction ###
oot <- predict(mdl, n.ahead = 3)
# [1] 3.080667 2.276211 1.846767

### manually check ###
ov2 <- beta[1] + beta[2] * oot[[1]][1] + beta[3] * oot[[1]][1]
# 2.276211
ov3 <- beta[1] + beta[2] * oot[[1]][2] + beta[3] * oot[[1]][2]
# 1.846767
```

## rPithon vs. rPython

Similar to rPython, the rPithon package (http://rpithon.r-forge.r-project.org) allows users to execute Python code from R and exchange data between Python and R. However, the underlying mechanisms of these two packages are fundamentally different. While rPithon communicates with Python from R through pipes, rPython accomplishes the same task with JSON. A major advantage of rPithon over rPython is that multiple Python processes can be started within an R session. However, rPithon is not very robust when exchanging large data objects between R and Python.

**rPython Session**

```r
library(sqldf)
df_in <- sqldf('select Year, Month, DayofMonth from tbl2008 limit 5000',
               dbname = '/home/liuwensui/Documents/data/flights.db')

library(rPython)
### R DATA.FRAME TO PYTHON DICTIONARY ###
python.assign('py_dict', df_in)
### PASS PYTHON DICTIONARY BACK TO R LIST ###
r_list <- python.get('py_dict')
### CONVERT R LIST TO DATA.FRAME ###
df_out <- data.frame(r_list)
dim(df_out)
# [1] 5000    3
#
# real    0m0.973s
# user    0m0.797s
# sys     0m0.186s
```

**rPithon Session**

```r
library(sqldf)
df_in <- sqldf('select Year, Month, DayofMonth from tbl2008 limit 5000',
               dbname = '/home/liuwensui/Documents/data/flights.db')

library(rPithon)
### R DATA.FRAME TO PYTHON DICTIONARY ###
pithon.assign('py_dict', df_in)
### PASS PYTHON DICTIONARY BACK TO R LIST ###
r_list <- pithon.get('py_dict')
### CONVERT R LIST TO DATA.FRAME ###
df_out <- data.frame(r_list)
dim(df_out)
# [1] 5000    3
#
# real    0m0.984s
# user    0m0.771s
# sys     0m0.187s
```

## Autoregressive Conditional Poisson Model – I

Modeling time series of count outcomes is of interest in operational risk when forecasting the frequency of losses. Below is an example showing how to estimate a simple ACP(1, 1) model, i.e. an Autoregressive Conditional Poisson model, without covariates using the **acp** package.
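
In the notation of the output below, the ACP(1, 1) conditional mean evolves as

**lambda_[t] = a + b * y_[t - 1] + c * lambda_[t - 1]**

with y_[t] conditionally Poisson with mean lambda_[t]; this is the same recursion applied in the manual prediction checks at the end of the snippet.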

```r
library(acp)

### acp(1, 1) without covariates ###
mdl <- acp(y ~ -1, data = cnt)
summary(mdl)
# acp.formula(formula = y ~ -1, data = cnt)
#
#   Estimate   StdErr t.value   p.value
# a 0.632670 0.169027  3.7430 0.0002507 ***
# b 0.349642 0.067414  5.1865 6.213e-07 ***
# c 0.184509 0.134154  1.3753 0.1708881

### generate predictions ###
f <- predict(mdl)
pred <- data.frame(yhat = f, cnt)
tail(pred, 5)
#          yhat y
# 164 1.5396921 1
# 165 1.2663993 0
# 166 0.8663321 1
# 167 1.1421586 3
# 168 1.8923355 6

### calculate predictions manually ###
pv167 <- mdl$coef[1] + mdl$coef[2] * pred$y[166] + mdl$coef[3] * pred$yhat[166]
# [1] 1.142159
pv168 <- mdl$coef[1] + mdl$coef[2] * pred$y[167] + mdl$coef[3] * pred$yhat[167]
# [1] 1.892336

plot.ts(pred, main = "Predictions")
```

## A Comparison between Blaze and Pandas

Blaze (https://github.com/ContinuumIO/blaze) is a lightweight interface on top of other data or computing engines such as NumPy or SQLite. Blaze itself doesn’t do any computation but provides a Pandas-like syntax to interact with the back-end data.

Below is an example showing how Blaze leverages the computing power of SQLite (https://www.sqlite.org) and outperforms Pandas when performing some simple tasks on a SQLite table with ~7MM rows. Since Blaze doesn’t have to load the data table into memory as Pandas does, the CPU time is significantly shorter.

**Pandas**

```python
import pandas as pda
import pandas.io.sql as psql
import sqlite3 as sql
import numpy as npy

con = sql.connect('/home/liuwensui/Documents/data/flights.db')
ds1 = psql.read_sql('select * from tbl2008', con)
ds2 = ds1[ds1.DayOfWeek > 1]
ds3 = ds2.groupby('DayOfWeek', as_index = False)
ds4 = ds3['AirTime'].agg({'avg_AirTime' : npy.mean})
print(ds4)
#   DayOfWeek  avg_AirTime
# 0         2   103.214930
# 1         3   103.058508
# 2         4   103.467138
# 3         5   103.557539
# 4         6   107.400631
# 5         7   104.864885
#
# real    1m7.241s
# user    1m0.403s
# sys     0m5.278s
```

**Blaze**

```python
import blaze as blz
import pandas as pda

ds1 = blz.Data('sqlite:////home/liuwensui/Documents/data/flights.db::tbl2008')
ds2 = ds1[ds1.DayOfWeek > 1]
ds3 = blz.by(ds2.DayOfWeek, avg_AirTime = ds2.AirTime.mean())
ds4 = blz.into(pda.DataFrame, ds3)
print(ds4)
#   DayOfWeek  avg_AirTime
# 0         2   103.214930
# 1         3   103.058508
# 2         4   103.467138
# 3         5   103.557539
# 4         6   107.400631
# 5         7   104.864885
#
# real    0m21.658s
# user    0m10.727s
# sys     0m1.167s
```
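
The gap makes sense: rather than pulling all ~7MM rows into memory, Blaze pushes the whole pipeline down to SQLite as a single aggregation query. Below is a rough sketch of that query issued directly with sqlite3; the SQL that Blaze actually generates may differ.

```python
import sqlite3

con = sqlite3.connect('/home/liuwensui/Documents/data/flights.db')
# approximately the query the Blaze pipeline above compiles to
sql = """
    select DayOfWeek, avg(AirTime) as avg_AirTime
    from tbl2008
    where DayOfWeek > 1
    group by DayOfWeek
"""
for row in con.execute(sql):
    print(row)
```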