Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

Archive for April 2015

Faster SQL on Pandas DataFrame with Sandals

logo

Similar to Pandasql, Sandals (https://github.com/jbochi/sandals) provides a convenient interface to query on Pandas DataFrame. However, while Pandasql leverages the power of SQLite in the back-end, Sandals employs the SQL parser to manipulate the DataFrame directly and hence delivery a superior performance by avoiding the data exchange between python and SQLite.

In [1]: import sqlite3  as sqlite

In [2]: import pandas   as pd

In [3]: import pandasql as psql

In [4]: import sandals  as ssql

In [5]: con = sqlite.connect("/home/liuwensui/Documents/data/flights.db")

In [6]: df = pd.io.sql.read_sql("select * from tbl2008 limit 100000", con) 

In [7]: ### SELECT ###

In [8]: %time psql.sqldf("select * from df where DayOfWeek = 1 limit 1", locals())
CPU times: user 1.69 s, sys: 11 ms, total: 1.7 s
Wall time: 1.7 s
Out[8]: 
   Year  Month  DayofMonth  DayOfWeek  DepTime  CRSDepTime  ArrTime  \
0  2008      1           7          1     1842        1845     2032   

   CRSArrTime UniqueCarrier  FlightNum        ...         TaxiIn  TaxiOut  \
0        2040            WN        746        ...              4        9   

   Cancelled  CancellationCode  Diverted  CarrierDelay WeatherDelay NASDelay  \
0          0               NaN         0           NaN          NaN      NaN   

   SecurityDelay  LateAircraftDelay  
0            NaN                NaN  

[1 rows x 29 columns]

In [9]: %time ssql.sql("select * from df where DayOfWeek = 1 limit 1", locals())
CPU times: user 16.3 ms, sys: 3.79 ms, total: 20.1 ms
Wall time: 18 ms
Out[9]: 
       Year  Month  DayofMonth  DayOfWeek  DepTime  CRSDepTime  ArrTime  \
11843  2008      1           7          1     1842        1845     2032   

       CRSArrTime UniqueCarrier  FlightNum        ...         TaxiIn  TaxiOut  \
11843        2040            WN        746        ...              4        9   

       Cancelled  CancellationCode  Diverted  CarrierDelay WeatherDelay  \
11843          0               NaN         0           NaN          NaN   

      NASDelay  SecurityDelay  LateAircraftDelay  
11843      NaN            NaN                NaN  

[1 rows x 29 columns]

In [10]: ### AGGREGATE ###

In [11]: %time psql.sqldf("select DayOfWeek, AVG(ArrTime) from df group by DayOfWeek", locals())
CPU times: user 1.74 s, sys: 15.6 ms, total: 1.75 s
Wall time: 1.75 s
Out[11]: 
   DayOfWeek  AVG(ArrTime)
0          1   1484.413548
1          2   1489.169235
2          3   1490.011514
3          4   1478.614701
4          5   1486.728223
5          6   1485.118600
6          7   1540.071341

In [12]: %time ssql.sql("select DayOfWeek, AVG(ArrTime) from df group by DayOfWeek", locals())
CPU times: user 5.72 ms, sys: 0 ns, total: 5.72 ms
Wall time: 4.92 ms
Out[12]: 
               ArrTime
DayOfWeek             
1          1484.413548
2          1489.169235
3          1490.011514
4          1478.614701
5          1486.728223
6          1485.118600
7          1540.071341
Advertisements

Written by statcompute

April 19, 2015 at 1:57 am

Posted in PYTHON, SQL, SQLite

Tagged with , ,

Estimating Time Series Models for Count Outcomes with SAS

In SAS, there is no out-of-box procedure to estimate time series models for count outcomes, which is similar to the one shown here (https://statcompute.wordpress.com/2015/03/31/modeling-count-time-series-with-tscount-package). However, as long as we understand the likelihood function of Poisson distribution, it is straightforward to estimate a time series model with PROC MODEL in the ETS module.

Below is a demonstration of how to estimate a Poisson time series model with the identity link function. As shown, the parameter estimates with related inferences are extremely close to the ones estimated with tscount() in R.

data polio;
  idx + 1;
  input y @@;
datalines;
0  1  0  0  1  3  9  2  3  5  3  5  2  2  0  1  0  1  3  3  2  1  1  5  0
3  1  0  1  4  0  0  1  6 14  1  1  0  0  1  1  1  1  0  1  0  1  0  1  0
1  0  1  0  1  0  1  0  0  2  0  1  0  1  0  0  1  2  0  0  1  2  0  3  1
1  0  2  0  4  0  2  1  1  1  1  0  1  1  0  2  1  3  1  2  4  0  0  0  1
0  1  0  2  2  4  2  3  3  0  0  2  7  8  2  4  1  1  2  4  0  1  1  1  3
0  0  0  0  1  0  1  1  0  0  0  0  0  1  2  0  2  0  0  0  1  0  1  0  1
0  2  0  0  1  2  0  1  0  0  0  1  2  1  0  1  3  6
;
run;

proc model data = polio;
  parms b0 = 0.5 b1 = 0.1 b2 = 0.1;
  yhat = b0 + b1 * zlag1(y) + b2 * zlag1(yhat);
  y = yhat;
  lk = exp(-yhat) * (yhat ** y) / fact(y);
  ll = -log(lk);
  errormodel y ~ general(ll);
  fit y / fiml converge = 1e-8;
run;

/* OUTPUT:
            Nonlinear Liklhood Summary of Residual Errors

                  DF      DF                                        Adj
Equation       Model   Error        SSE        MSE   R-Square      R-Sq
y                  3     165      532.6     3.2277     0.0901    0.0791

           Nonlinear Liklhood Parameter Estimates

                              Approx                  Approx
Parameter       Estimate     Std Err    t Value     Pr > |t|
b0              0.606313      0.1680       3.61       0.0004
b1              0.349495      0.0690       5.06       <.0001
b2              0.206877      0.1397       1.48       0.1405

Number of Observations       Statistics for System

Used               168    Log Likelihood    -278.6615
Missing              0
*/

Written by statcompute

April 17, 2015 at 9:43 pm

Easy DB Interaction with db.py

The db.py package (https://github.com/yhat/db.py) provides a convenient interface to interact with databases, e.g. allowing users to explore the meta-data and execute queries. Below is an example showing how to create an index in a SQLite database to improve the query efficiency with the db.py package.

Without Index

In [1]: import db as db

In [2]: db = db.DB(filename = "Documents/data/test.db";, dbtype = "sqlite")

In [3]: %timeit db.query("select Month, count(*) as cnt from tbl2008 group by Month;")
1 loops, best of 3: 6.28 s per loop

In [4]: %timeit db.query("select * from tbl2008 where Month = 1;")
1 loops, best of 3: 5.61 s per loop

With Index

In [1]: import db as db

In [2]: db = db.DB(filename = "Documents/data/test.db", dbtype = "sqlite")

In [3]: db.query("create INDEX idx on tbl2008(Month);")

In [4]: db.query("select * from sqlite_master where type = 'index';")
Out[4]: 
    type name tbl_name  rootpage                                 sql
0  index  idx  tbl2008    544510  CREATE INDEX idx on tbl2008(Month)

In [5]: %timeit db.query("select Month, count(*) as cnt from tbl2008 group by Month;")
1 loops, best of 3: 787 ms per loop

In [6]: %timeit db.query("select * from tbl2008 where Month = 1;")
1 loops, best of 3: 5.16 s per loop

Written by statcompute

April 5, 2015 at 1:21 am

Posted in PYTHON, SQLite

Tagged with , ,