pandasql

In R, sqldf() provides a convenient interface of running SQL statement on data frames. Similarly, Python also offers multiple ways to interact between SQL and Pandas DataFrames by leveraging the lightweight SQLite engine. While pandasql (https://github.com/yhat/pandasql) works similarly to sqldf() in R, pysqldf (https://github.com/airtoxin/pysqldf) is even more powerful. In my experiments shown below, advantages of pysqldf over pandasql are two-fold. First of all, pysqldf is 2 – 3 times faster than pandasql. Secondly, pysqldf supports new function definitions, which is not available in pandasql. However, it is worth mentioning that the generic python interface to an in-memory SQLite database can be more efficient and flexible than both pysqldf and pandasql, as demonstrated below, as long as we are able to get the DataFrame into the SQLite and let it stay in-memory.

from sqlite3 import connect
from pandas import read_sql_query
import pandasql
import pysqldf
import numpy

# CREATE AN IN-MEMORY SQLITE DB
con = connect(":memory:")
cur = con.cursor()
cur.execute("attach 'my.db' as filedb")
cur.execute("create table df as select * from filedb.hflights")
cur.execute("detach filedb")

# IMPORT SQLITE TABLE INTO PANDAS DF
df = read_sql_query("select * from df", con)

# WRITE QUERIES
sql01 = "select * from df where DayofWeek = 1 and Dest = 'CVG';"
sql02 = "select DayofWeek, AVG(ArrTime) from df group by DayofWeek;"
sql03 = "select DayofWeek, median(ArrTime) from df group by DayofWeek;"

# SELECTION:
# 1. PANDASQL
%time t11 = pandasql.sqldf(sql01, globals())
# 2. PYSQLDF
%time t12 = pysqldf.SQLDF(globals()).execute(sql01)
# 3. GENERIC SQLITE CONNECTION
%time t13 = read_sql_query(sql01, con)

# AGGREGATION:
# 1. PANDASQL
%time t21 = pandasql.sqldf(sql02, globals())
# 2. PYSQLDF
%time t22 = pysqldf.SQLDF(globals()).execute(sql02)
# 3. GENERIC SQLITE CONNECTION
%time t23 = read_sql_query(sql02, con)

# DEFINING A NEW FUNCTION:
# DEFINE A FUNCTION NOT SUPPORTED IN SQLITE
class median(object):
  def __init__(self):
    self.a = []
  def step(self, x):
    self.a.append(x)
  def finalize(self):
    return numpy.median(self.a)

# 1. PYSQLDF
udafs = {"median": median}
%time t31 = pysqldf.SQLDF(globals(), udafs = udafs).execute(sql03)
# 2 GENERIC SQLITE CONNECTION
con.create_aggregate("median", 1, median)
%time t32 = read_sql_query(sql03, con)

In [1]: import sas7bdat as sas In [2]: import pandas as pd In [3]: import pandasql as pdsql In [4]: data = sas.SAS7BDAT("accepts.sas7bdat") In [5]: df = data.toDataFrame() In [6]: pysql = lambda q: pdsql.sqldf(q, globals()) In [7]: ### SELECT ### In [8]: str1 = "select bureau_score, ltv from df where bureau_score < 600 and ltv > 100 limit 3;" In [9]: df1 = pysql(str1) In [10]: df1 Out[10]: bureau_score ltv 0 590 103 1 575 120 2 538 113 In [11]: ### AGGREGATE ### In [12]: str2 = "select ltv, min(bureau_score) as min_score, max(bureau_score) as max_score from df group by ltv order by ltv DESC;" In [13]: df2 = pysql(str2); In [14]: df2.head(3) Out[14]: ltv min_score max_score 0 176 709 709 1 168 723 723 2 167 688 688 In [15]: ### JOIN ### In [16]: str3 = "select b.*, a.bureau_score from df a inner join df2 b on a.ltv = b.ltv order by ltv DESC;" In [17]: df3 = pysql(str3) In [18]: df3.head(3) Out[18]: ltv min_score max_score bureau_score 0 176 709 709 709 1 168 723 723 723 2 167 688 688 688

Yet Another Blog in Statistical Computing

Flavors of SQL on Pandas DataFrame

Query Pandas DataFrame with SQL