Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

Archive for the ‘SQLite’ Category

Flavors of SQL on Pandas DataFrame

In R, sqldf() provides a convenient interface of running SQL statement on data frames. Similarly, Python also offers multiple ways to interact between SQL and Pandas DataFrames by leveraging the lightweight SQLite engine. While pandasql (https://github.com/yhat/pandasql) works similarly to sqldf() in R, pysqldf (https://github.com/airtoxin/pysqldf) is even more powerful. In my experiments shown below, advantages of pysqldf over pandasql are two-fold. First of all, pysqldf is 2 – 3 times faster than pandasql. Secondly, pysqldf supports new function definitions, which is not available in pandasql. However, it is worth mentioning that the generic python interface to an in-memory SQLite database can be more efficient and flexible than both pysqldf and pandasql, as demonstrated below, as long as we are able to get the DataFrame into the SQLite and let it stay in-memory.

from sqlite3 import connect
from pandas import read_sql_query
import pandasql
import pysqldf
import numpy

# CREATE AN IN-MEMORY SQLITE DB
con = connect(":memory:")
cur = con.cursor()
cur.execute("attach 'my.db' as filedb")
cur.execute("create table df as select * from filedb.hflights")
cur.execute("detach filedb")

# IMPORT SQLITE TABLE INTO PANDAS DF
df = read_sql_query("select * from df", con)

# WRITE QUERIES
sql01 = "select * from df where DayofWeek = 1 and Dest = 'CVG';"
sql02 = "select DayofWeek, AVG(ArrTime) from df group by DayofWeek;"
sql03 = "select DayofWeek, median(ArrTime) from df group by DayofWeek;"

# SELECTION:
# 1. PANDASQL
%time t11 = pandasql.sqldf(sql01, globals())
# 2. PYSQLDF
%time t12 = pysqldf.SQLDF(globals()).execute(sql01)
# 3. GENERIC SQLITE CONNECTION
%time t13 = read_sql_query(sql01, con)

# AGGREGATION:
# 1. PANDASQL
%time t21 = pandasql.sqldf(sql02, globals())
# 2. PYSQLDF
%time t22 = pysqldf.SQLDF(globals()).execute(sql02)
# 3. GENERIC SQLITE CONNECTION
%time t23 = read_sql_query(sql02, con)

# DEFINING A NEW FUNCTION:
# DEFINE A FUNCTION NOT SUPPORTED IN SQLITE
class median(object):
  def __init__(self):
    self.a = []
  def step(self, x):
    self.a.append(x)
  def finalize(self):
    return numpy.median(self.a)

# 1. PYSQLDF
udafs = {"median": median}
%time t31 = pysqldf.SQLDF(globals(), udafs = udafs).execute(sql03)
# 2 GENERIC SQLITE CONNECTION
con.create_aggregate("median", 1, median)
%time t32 = read_sql_query(sql03, con)
Advertisements

Written by statcompute

October 17, 2016 at 10:34 pm

Posted in Big Data, pandas, PYTHON, SQL, SQLite

Tagged with , , , ,

Ibis – A New Kid in Town

Developed by Wes McKinney, pandas is a very efficient and powerful data analysis tool in python language for data scientists. Same as R, pandas reads the data into memory. As a result, we might often face the problem of running out of memory while analyzing large-size data with pandas.

Similar to Blaze, ibis is a new data analysis framework in python built on top of other back-end data engines, such as sqlite and impala. Even better, ibis provides a higher compatibility to pandas and better performance than Blaze.

In a previous blog (https://statcompute.wordpress.com/2015/03/27/a-comparison-between-blaze-and-pandas), I’ve shown the efficiency of Blaze through a simple example. However, in the demonstration below, it is shown that, while applied to the same data with sqlite engine, ibis is 50% more efficient than Blaze in terms of the “real time”.

import ibis as ibis

tbl = ibis.sqlite.connect('//home/liuwensui/Documents/data/flights.db').table('tbl2008')
exp = tbl[tbl.DayOfWeek > 1].group_by("DayOfWeek").aggregate(avg_AirTime = tbl.AirTime.mean())
pd = exp.execute()
print(pd)

#i   DayOfWeek  avg_AirTime
#0          2   103.214930
#1          3   103.058508
#2          4   103.467138
#3          5   103.557539
#4          6   107.400631
#5          7   104.864885
#
#real   0m10.346s
#user   0m9.585s
#sys    0m1.181s

Written by statcompute

November 15, 2015 at 12:56 am

Faster SQL on Pandas DataFrame with Sandals

logo

Similar to Pandasql, Sandals (https://github.com/jbochi/sandals) provides a convenient interface to query on Pandas DataFrame. However, while Pandasql leverages the power of SQLite in the back-end, Sandals employs the SQL parser to manipulate the DataFrame directly and hence delivery a superior performance by avoiding the data exchange between python and SQLite.

In [1]: import sqlite3  as sqlite

In [2]: import pandas   as pd

In [3]: import pandasql as psql

In [4]: import sandals  as ssql

In [5]: con = sqlite.connect("/home/liuwensui/Documents/data/flights.db")

In [6]: df = pd.io.sql.read_sql("select * from tbl2008 limit 100000", con) 

In [7]: ### SELECT ###

In [8]: %time psql.sqldf("select * from df where DayOfWeek = 1 limit 1", locals())
CPU times: user 1.69 s, sys: 11 ms, total: 1.7 s
Wall time: 1.7 s
Out[8]: 
   Year  Month  DayofMonth  DayOfWeek  DepTime  CRSDepTime  ArrTime  \
0  2008      1           7          1     1842        1845     2032   

   CRSArrTime UniqueCarrier  FlightNum        ...         TaxiIn  TaxiOut  \
0        2040            WN        746        ...              4        9   

   Cancelled  CancellationCode  Diverted  CarrierDelay WeatherDelay NASDelay  \
0          0               NaN         0           NaN          NaN      NaN   

   SecurityDelay  LateAircraftDelay  
0            NaN                NaN  

[1 rows x 29 columns]

In [9]: %time ssql.sql("select * from df where DayOfWeek = 1 limit 1", locals())
CPU times: user 16.3 ms, sys: 3.79 ms, total: 20.1 ms
Wall time: 18 ms
Out[9]: 
       Year  Month  DayofMonth  DayOfWeek  DepTime  CRSDepTime  ArrTime  \
11843  2008      1           7          1     1842        1845     2032   

       CRSArrTime UniqueCarrier  FlightNum        ...         TaxiIn  TaxiOut  \
11843        2040            WN        746        ...              4        9   

       Cancelled  CancellationCode  Diverted  CarrierDelay WeatherDelay  \
11843          0               NaN         0           NaN          NaN   

      NASDelay  SecurityDelay  LateAircraftDelay  
11843      NaN            NaN                NaN  

[1 rows x 29 columns]

In [10]: ### AGGREGATE ###

In [11]: %time psql.sqldf("select DayOfWeek, AVG(ArrTime) from df group by DayOfWeek", locals())
CPU times: user 1.74 s, sys: 15.6 ms, total: 1.75 s
Wall time: 1.75 s
Out[11]: 
   DayOfWeek  AVG(ArrTime)
0          1   1484.413548
1          2   1489.169235
2          3   1490.011514
3          4   1478.614701
4          5   1486.728223
5          6   1485.118600
6          7   1540.071341

In [12]: %time ssql.sql("select DayOfWeek, AVG(ArrTime) from df group by DayOfWeek", locals())
CPU times: user 5.72 ms, sys: 0 ns, total: 5.72 ms
Wall time: 4.92 ms
Out[12]: 
               ArrTime
DayOfWeek             
1          1484.413548
2          1489.169235
3          1490.011514
4          1478.614701
5          1486.728223
6          1485.118600
7          1540.071341

Written by statcompute

April 19, 2015 at 1:57 am

Posted in PYTHON, SQL, SQLite

Tagged with , ,

Easy DB Interaction with db.py

The db.py package (https://github.com/yhat/db.py) provides a convenient interface to interact with databases, e.g. allowing users to explore the meta-data and execute queries. Below is an example showing how to create an index in a SQLite database to improve the query efficiency with the db.py package.

Without Index

In [1]: import db as db

In [2]: db = db.DB(filename = "Documents/data/test.db";, dbtype = "sqlite")

In [3]: %timeit db.query("select Month, count(*) as cnt from tbl2008 group by Month;")
1 loops, best of 3: 6.28 s per loop

In [4]: %timeit db.query("select * from tbl2008 where Month = 1;")
1 loops, best of 3: 5.61 s per loop

With Index

In [1]: import db as db

In [2]: db = db.DB(filename = "Documents/data/test.db", dbtype = "sqlite")

In [3]: db.query("create INDEX idx on tbl2008(Month);")

In [4]: db.query("select * from sqlite_master where type = 'index';")
Out[4]: 
    type name tbl_name  rootpage                                 sql
0  index  idx  tbl2008    544510  CREATE INDEX idx on tbl2008(Month)

In [5]: %timeit db.query("select Month, count(*) as cnt from tbl2008 group by Month;")
1 loops, best of 3: 787 ms per loop

In [6]: %timeit db.query("select * from tbl2008 where Month = 1;")
1 loops, best of 3: 5.16 s per loop

Written by statcompute

April 5, 2015 at 1:21 am

Posted in PYTHON, SQLite

Tagged with , ,