Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

Faster SQL on Pandas DataFrame with Sandals


Similar to Pandasql, Sandals (https://github.com/jbochi/sandals) provides a convenient interface for running SQL queries against a Pandas DataFrame. However, while Pandasql leverages the power of SQLite in the back-end, Sandals parses the SQL and manipulates the DataFrame directly, and hence delivers superior performance by avoiding the data exchange between Python and SQLite.
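
To see what querying the DataFrame "directly" looks like, below is a minimal sketch of the kind of translation Sandals performs (toy data standing in for the flights table; the actual parser may generate a different expression):

import pandas as pd

# toy stand-in for the flights table queried below
df = pd.DataFrame({"DayOfWeek": [3, 1, 1, 5],
                   "ArrTime": [915, 1030, 1200, 845]})

# "select * from df where DayOfWeek = 1 limit 1" becomes, roughly,
# a boolean filter plus head() -- all inside the Python process
print(df[df["DayOfWeek"] == 1].head(1))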

In [1]: import sqlite3  as sqlite

In [2]: import pandas   as pd

In [3]: import pandasql as psql

In [4]: import sandals  as ssql

In [5]: con = sqlite.connect("/home/liuwensui/Documents/data/flights.db")

In [6]: df = pd.io.sql.read_sql("select * from tbl2008 limit 100000", con) 

In [7]: ### SELECT ###

In [8]: %time psql.sqldf("select * from df where DayOfWeek = 1 limit 1", locals())
CPU times: user 1.69 s, sys: 11 ms, total: 1.7 s
Wall time: 1.7 s
Out[8]: 
   Year  Month  DayofMonth  DayOfWeek  DepTime  CRSDepTime  ArrTime  \
0  2008      1           7          1     1842        1845     2032   

   CRSArrTime UniqueCarrier  FlightNum        ...         TaxiIn  TaxiOut  \
0        2040            WN        746        ...              4        9   

   Cancelled  CancellationCode  Diverted  CarrierDelay WeatherDelay NASDelay  \
0          0               NaN         0           NaN          NaN      NaN   

   SecurityDelay  LateAircraftDelay  
0            NaN                NaN  

[1 rows x 29 columns]

In [9]: %time ssql.sql("select * from df where DayOfWeek = 1 limit 1", locals())
CPU times: user 16.3 ms, sys: 3.79 ms, total: 20.1 ms
Wall time: 18 ms
Out[9]: 
       Year  Month  DayofMonth  DayOfWeek  DepTime  CRSDepTime  ArrTime  \
11843  2008      1           7          1     1842        1845     2032   

       CRSArrTime UniqueCarrier  FlightNum        ...         TaxiIn  TaxiOut  \
11843        2040            WN        746        ...              4        9   

       Cancelled  CancellationCode  Diverted  CarrierDelay WeatherDelay  \
11843          0               NaN         0           NaN          NaN   

      NASDelay  SecurityDelay  LateAircraftDelay  
11843      NaN            NaN                NaN  

[1 rows x 29 columns]
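
Note the difference in the row index: pandasql returns a frame rebuilt from the SQLite result set, so the index is reset to 0, while Sandals slices the original DataFrame and therefore preserves the original row label (11843 above).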

In [10]: ### AGGREGATE ###

In [11]: %time psql.sqldf("select DayOfWeek, AVG(ArrTime) from df group by DayOfWeek", locals())
CPU times: user 1.74 s, sys: 15.6 ms, total: 1.75 s
Wall time: 1.75 s
Out[11]: 
   DayOfWeek  AVG(ArrTime)
0          1   1484.413548
1          2   1489.169235
2          3   1490.011514
3          4   1478.614701
4          5   1486.728223
5          6   1485.118600
6          7   1540.071341

In [12]: %time ssql.sql("select DayOfWeek, AVG(ArrTime) from df group by DayOfWeek", locals())
CPU times: user 5.72 ms, sys: 0 ns, total: 5.72 ms
Wall time: 4.92 ms
Out[12]: 
               ArrTime
DayOfWeek             
1          1484.413548
2          1489.169235
3          1490.011514
4          1478.614701
5          1486.728223
6          1485.118600
7          1540.071341
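
The aggregate query maps naturally onto a pandas groupby; a minimal sketch of the equivalent operation (toy data, not the flights table) is shown below, which also explains why the Sandals result comes back indexed by DayOfWeek rather than as a flat column.

import pandas as pd

# toy stand-in for the flights table
df = pd.DataFrame({"DayOfWeek": [1, 1, 2, 2],
                   "ArrTime": [1480, 1500, 1490, 1486]})

# "select DayOfWeek, AVG(ArrTime) from df group by DayOfWeek" is, roughly:
print(df.groupby("DayOfWeek")["ArrTime"].mean())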

Written by statcompute

April 19, 2015 at 1:57 am

Posted in PYTHON, SQL, SQLite


Estimating Time Series Models for Count Outcomes with SAS

In SAS, there is no out-of-box procedure to estimate time series models for count outcomes similar to the one shown here (https://statcompute.wordpress.com/2015/03/31/modeling-count-time-series-with-tscount-package). However, as long as we understand the likelihood function of the Poisson distribution, it is straightforward to estimate such a model with PROC MODEL in the SAS/ETS module.
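
For reference, the likelihood of a single count $y_t$ with the conditional mean $\hat{y}_t$ under the Poisson distribution is

$$L_t = \frac{e^{-\hat{y}_t} \, \hat{y}_t^{\,y_t}}{y_t!}, \qquad -\log L_t = \hat{y}_t - y_t \log \hat{y}_t + \log(y_t!),$$

and PROC MODEL simply minimizes the sum of the negative log-likelihood terms, exactly as coded in the lk and ll statements below.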

Below is a demonstration of how to estimate a Poisson time series model with the identity link function. As shown, the parameter estimates and their related inferences are extremely close to the ones estimated with tscount() in R.

data polio;
  idx + 1;
  input y @@;
datalines;
0  1  0  0  1  3  9  2  3  5  3  5  2  2  0  1  0  1  3  3  2  1  1  5  0
3  1  0  1  4  0  0  1  6 14  1  1  0  0  1  1  1  1  0  1  0  1  0  1  0
1  0  1  0  1  0  1  0  0  2  0  1  0  1  0  0  1  2  0  0  1  2  0  3  1
1  0  2  0  4  0  2  1  1  1  1  0  1  1  0  2  1  3  1  2  4  0  0  0  1
0  1  0  2  2  4  2  3  3  0  0  2  7  8  2  4  1  1  2  4  0  1  1  1  3
0  0  0  0  1  0  1  1  0  0  0  0  0  1  2  0  2  0  0  0  1  0  1  0  1
0  2  0  0  1  2  0  1  0  0  0  1  2  1  0  1  3  6
;
run;

proc model data = polio;
  /* starting values for the three parameters */
  parms b0 = 0.5 b1 = 0.1 b2 = 0.1;
  /* conditional mean: recursion on the lagged count and the lagged mean */
  yhat = b0 + b1 * zlag1(y) + b2 * zlag1(yhat);
  y = yhat;
  /* Poisson likelihood and the negative log-likelihood to be minimized */
  lk = exp(-yhat) * (yhat ** y) / fact(y);
  ll = -log(lk);
  errormodel y ~ general(ll);
  fit y / fiml converge = 1e-8;
run;

/* OUTPUT:
            Nonlinear Liklhood Summary of Residual Errors

                  DF      DF                                        Adj
Equation       Model   Error        SSE        MSE   R-Square      R-Sq
y                  3     165      532.6     3.2277     0.0901    0.0791

           Nonlinear Liklhood Parameter Estimates

                              Approx                  Approx
Parameter       Estimate     Std Err    t Value     Pr > |t|
b0              0.606313      0.1680       3.61       0.0004
b1              0.349495      0.0690       5.06       <.0001
b2              0.206877      0.1397       1.48       0.1405

Number of Observations       Statistics for System

Used               168    Log Likelihood    -278.6615
Missing              0
*/

Written by statcompute

April 17, 2015 at 9:43 pm

Easy DB Interaction with db.py

The db.py package (https://github.com/yhat/db.py) provides a convenient interface to interact with databases, e.g. exploring the meta-data and executing queries. Below is an example showing how creating an index in a SQLite database improves the query efficiency with the db.py package.

Without Index

In [1]: import db as db

In [2]: db = db.DB(filename = "Documents/data/test.db", dbtype = "sqlite")

In [3]: %timeit db.query("select Month, count(*) as cnt from tbl2008 group by Month;")
1 loops, best of 3: 6.28 s per loop

In [4]: %timeit db.query("select * from tbl2008 where Month = 1;")
1 loops, best of 3: 5.61 s per loop

With Index

In [1]: import db as db

In [2]: db = db.DB(filename = "Documents/data/test.db", dbtype = "sqlite")

In [3]: db.query("create INDEX idx on tbl2008(Month);")

In [4]: db.query("select * from sqlite_master where type = 'index';")
Out[4]: 
    type name tbl_name  rootpage                                 sql
0  index  idx  tbl2008    544510  CREATE INDEX idx on tbl2008(Month)

In [5]: %timeit db.query("select Month, count(*) as cnt from tbl2008 group by Month;")
1 loops, best of 3: 787 ms per loop

In [6]: %timeit db.query("select * from tbl2008 where Month = 1;")
1 loops, best of 3: 5.16 s per loop
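
Note that the index pays off mainly for the aggregation: the group-by over Month can be answered from the index alone, while the filter query still has to fetch all columns of the matching rows from the table, which is presumably why its run time barely changes.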

Written by statcompute

April 5, 2015 at 1:21 am

Posted in PYTHON, SQLite


Modeling Count Time Series with tscount Package

The example below shows how to estimate a simple univariate Poisson time series model with the tscount package. While the model estimation is straightforward and yields very similar parameter estimates to the ones generated with the acp package (https://statcompute.wordpress.com/2015/03/29/autoregressive-conditional-poisson-model-i), the prediction mechanism is a bit tricky.

1) For the in-sample and the 1-step-ahead predictions, the observed y is available:

yhat_[t] = beta0 + beta1 * y_[t - 1] + beta2 * yhat_[t - 1]

2) For the out-of-sample predictions, the observed y is unavailable and is replaced by its own prediction:

yhat_[t] = beta0 + beta1 * yhat_[t - 1] + beta2 * yhat_[t - 1]
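
To make the two recursions concrete, below is a minimal sketch in Python using the (rounded) coefficients from the tsglm() fit further down; tscount performs these calculations internally.

# rounded coefficients from the tsglm() fit shown below
beta0, beta1, beta2 = 0.632, 0.350, 0.184

def one_step(y_prev, yhat_prev):
    # in-sample / 1-step-ahead: the observed y is available
    return beta0 + beta1 * y_prev + beta2 * yhat_prev

def multi_step(yhat_prev):
    # out-of-sample: the unobserved y is replaced by its own prediction
    return beta0 + beta1 * yhat_prev + beta2 * yhat_prev

h1 = one_step(6, 1.892)   # last observed y and last fitted yhat in the series
h2 = multi_step(h1)
h3 = multi_step(h2)
print(h1, h2, h3)         # roughly 3.08, 2.28, 1.85 -- see oot below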

library(tscount)

mdl <- tsglm(cnt$y, model = list(past_obs = 1, past_mean = 1), distr = "poisson")
summary(mdl)
# tsglm(ts = cnt$y, model = list(past_obs = 1, past_mean = 1), 
#     distr = "poisson")
#
# Coefficients:
#              Estimate  Std. Error
# (Intercept)     0.632      0.1774
# beta_1          0.350      0.0687
# alpha_1         0.184      0.1455
# Standard errors obtained by normal approximation.
#
# Link function: identity 
# Distribution family: poisson 
# Number of coefficients: 3 
# Log-likelihood: -279.2738 
# AIC: 564.5476 
# BIC: 573.9195 

### in-sample prediction ###
cnt$yhat <- mdl$fitted.values
tail(cnt, 3)
#     y      yhat
# 166 1 0.8637023
# 167 3 1.1404714
# 168 6 1.8918651

### manually check ###
beta <- mdl$coefficients
pv167 <- beta[1] + beta[2] * cnt$y[166] + beta[3] * cnt$yhat[166] 
#  1.140471
pv168 <- beta[1] + beta[2] * cnt$y[167] + beta[3] * cnt$yhat[167] 
#  1.891865 

### out-of-sample prediction ###
oot <- predict(mdl, n.ahead = 3)
# [1] 3.080667 2.276211 1.846767

### manually check ###
ov2 <- beta[1] + beta[2] * oot[[1]][1] + beta[3] * oot[[1]][1] 
#  2.276211
ov3 <- beta[1] + beta[2] * oot[[1]][2] + beta[3] * oot[[1]][2] 
#  1.846767

Written by statcompute

March 31, 2015 at 11:35 pm

rPithon vs. rPython

Similar to rPython, the rPithon package (http://rpithon.r-forge.r-project.org) allows users to execute Python code from R and exchange data between Python and R. However, the underlying mechanisms of these two packages are fundamentally different. While rPithon communicates with Python from R through pipes, rPython accomplishes the same task with JSON. A major advantage of rPithon over rPython is that multiple Python processes can be started within an R session. However, rPithon is not very robust when exchanging large data objects between R and Python.

rPython Session

library(sqldf)
df_in <- sqldf('select Year, Month, DayofMonth from tbl2008 limit 5000', dbname = '/home/liuwensui/Documents/data/flights.db')
library(rPython)
### R DATA.FRAME TO PYTHON DICTIONARY ###
python.assign('py_dict', df_in)
### PASS PYTHON DICTIONARY BACK TO R LIST
r_list <- python.get('py_dict')
### CONVERT R LIST TO DATA.FRAME
df_out <- data.frame(r_list)
dim(df_out)
# [1] 5000    3
#
# real	0m0.973s
# user	0m0.797s
# sys	0m0.186s

rPithon Session

library(sqldf)
df_in <- sqldf('select Year, Month, DayofMonth from tbl2008 limit 5000', dbname = '/home/liuwensui/Documents/data/flights.db')
library(rPithon)
### R DATA.FRAME TO PYTHON DICTIONARY ###
pithon.assign('py_dict', df_in)
### PASS PYTHON DICTIONARY BACK TO R LIST
r_list <- pithon.get('py_dict')
### CONVERT R LIST TO DATA.FRAME
df_out <- data.frame(r_list)
dim(df_out)
# [1] 5000    3
#
# real	0m0.984s
# user	0m0.771s
# sys	0m0.187s

Written by statcompute

March 30, 2015 at 11:24 pm

Posted in PYTHON, S+/R


Autoregressive Conditional Poisson Model – I

Modeling time series of count outcomes is of interest in operational risk when forecasting the frequency of losses. Below is an example showing how to estimate a simple ACP(1, 1) model, i.e. Autoregressive Conditional Poisson, without covariates using the acp package.
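
In the ACP(1, 1) specification, the observed count follows a Poisson distribution whose conditional mean obeys the recursion

$$y_t \mid \mathcal{F}_{t-1} \sim \mathrm{Poisson}(\lambda_t), \qquad \lambda_t = a + b \, y_{t-1} + c \, \lambda_{t-1},$$

where a, b, and c correspond to the three parameter estimates reported below.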

library(acp)

### acp(1, 1) without covariates ###
mdl <- acp(y ~ -1, data = cnt)
summary(mdl)
# acp.formula(formula = y ~ -1, data = cnt)
#
#   Estimate   StdErr t.value   p.value    
# a 0.632670 0.169027  3.7430 0.0002507 ***
# b 0.349642 0.067414  5.1865 6.213e-07 ***
# c 0.184509 0.134154  1.3753 0.1708881    

### generate predictions ###
f <- predict(mdl)
pred <- data.frame(yhat = f, cnt)
tail(pred, 5)
#          yhat y
# 164 1.5396921 1
# 165 1.2663993 0
# 166 0.8663321 1
# 167 1.1421586 3
# 168 1.8923355 6

### calculate predictions manually ###
pv167 <- mdl$coef[1] + mdl$coef[2] * pred$y[166] + mdl$coef[3] * pred$yhat[166] 
# [1] 1.142159

pv168 <- mdl$coef[1] + mdl$coef[2] * pred$y[167] + mdl$coef[3] * pred$yhat[167] 
# [1] 1.892336

plot.ts(pred, main = "Predictions")

[Figure: plot.ts output titled "Predictions", showing the observed counts and the fitted values]

Written by statcompute

March 29, 2015 at 11:06 pm

A Comparison between Blaze and Pandas

Blaze (https://github.com/ContinuumIO/blaze) is a lightweight interface on top of other data or computing engines such as numpy or sqlite. Blaze itself doesn’t do any computation but provides a Pandas-like syntax to interact with the back-end data.

Below is an example showing how Blaze leverages the computing power of SQLite (https://www.sqlite.org) and outperforms Pandas when performing some simple tasks on a SQLite table with ~7MM rows. Since Blaze doesn't have to load the data table into memory as Pandas does, the CPU time is significantly shorter.
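
As a point of reference, Blaze effectively compiles the expression below into SQL along these lines, which could also be run against SQLite directly (the exact SQL Blaze generates may differ):

import sqlite3

con = sqlite3.connect('/home/liuwensui/Documents/data/flights.db')
# roughly what Blaze hands off to the SQLite engine
sql = """select DayOfWeek, avg(AirTime) as avg_AirTime
         from tbl2008 where DayOfWeek > 1 group by DayOfWeek"""
print(con.execute(sql).fetchall())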

Pandas

import pandas as pda
import pandas.io.sql as psql
import sqlite3 as sql
import numpy as npy

con = sql.connect('/home/liuwensui/Documents/data/flights.db')
# load the entire table into memory first
ds1 = psql.read_sql('select * from tbl2008', con)
# filter, group, and aggregate within pandas
ds2 = ds1[ds1.DayOfWeek > 1]
ds3 = ds2.groupby('DayOfWeek', as_index = False)
ds4 = ds3['AirTime'].agg({'avg_AirTime' : npy.mean})
print(ds4)

#   DayOfWeek  avg_AirTime
#0          2   103.214930
#1          3   103.058508
#2          4   103.467138
#3          5   103.557539
#4          6   107.400631
#5          7   104.864885
#
#real 1m7.241s
#user 1m0.403s
#sys  0m5.278s

Blaze

import blaze as blz
import pandas as pda

# point Blaze at the SQLite table; nothing is loaded into memory yet
ds1 = blz.Data('sqlite:////home/liuwensui/Documents/data/flights.db::tbl2008')
# build the expression; the computation is pushed down to SQLite
ds2 = ds1[ds1.DayOfWeek > 1]
ds3 = blz.by(ds2.DayOfWeek, avg_AirTime = ds2.AirTime.mean())
# materialize the result as a pandas DataFrame
ds4 = blz.into(pda.DataFrame, ds3)
print(ds4)

#   DayOfWeek  avg_AirTime
#0          2   103.214930
#1          3   103.058508
#2          4   103.467138
#3          5   103.557539
#4          6   107.400631
#5          7   104.864885
#
#real 0m21.658s
#user 0m10.727s
#sys  0m1.167s

Written by statcompute

March 27, 2015 at 1:13 am

Posted in blaze, pandas, PYTHON

