Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

Posts Tagged ‘pandas’

Sparkling Water and Moving Data Around

Sparkling Water is an application that integrates H2O with Spark. Below is an example showing how to move data among a Pandas DataFrame, an H2OFrame, and a Spark DataFrame.

1. Define Context

In [1]: from pandas import read_csv, DataFrame

In [2]: from pyspark import sql

In [3]: from pysparkling import H2OContext

In [4]: from h2o import import_file, H2OFrame

In [5]: ss = sql.SparkSession.builder.getOrCreate()

In [6]: hc = H2OContext.getOrCreate(ss)

2. Convert Pandas DataFrame to Spark DataFrame and H2OFrame

In [7]: p_df = read_csv("Documents/credit_count.txt")

In [8]: type(p_df)
Out[8]: pandas.core.frame.DataFrame

In [9]: p2s_df = ss.createDataFrame(p_df)

In [10]: type(p2s_df)
Out[10]: pyspark.sql.dataframe.DataFrame

In [11]: p2h_df = H2OFrame(p_df)

In [12]: type(p2h_df)
Out[12]: h2o.frame.H2OFrame

3. Convert Spark DataFrame to Pandas DataFrame and H2OFrame

In [13]: s_df = ss.read.csv("Documents/credit_count.txt", header = True, inferSchema = True)

In [14]: type(s_df)
Out[14]: pyspark.sql.dataframe.DataFrame

In [15]: s2p_df = s_df.toPandas()

In [16]: type(s2p_df)
Out[16]: pandas.core.frame.DataFrame

In [17]: s2h_df = hc.as_h2o_frame(s_df)

In [18]: type(s2h_df)
Out[18]: h2o.frame.H2OFrame

4. Convert H2OFrame to Pandas DataFrame and Spark DataFrame

In [19]: h_df = import_file("Documents/credit_count.txt", header = 1, sep = ",")

In [20]: type(h_df)
Out[20]: h2o.frame.H2OFrame

In [21]: h2p_df = h_df.as_data_frame()

In [22]: type(h2p_df)
Out[22]: pandas.core.frame.DataFrame

In [23]: h2s_df = hc.as_spark_frame(h_df)

In [24]: type(h2s_df)
Out[24]: pyspark.sql.dataframe.DataFrame
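
Putting the pieces together, the full round trip looks like the sketch below, reusing the SparkSession ss and H2OContext hc defined in step 1.

# FULL ROUND TRIP: PANDAS -> SPARK -> H2O -> PANDAS (A SKETCH BASED ON THE STEPS ABOVE)
p_df = read_csv("Documents/credit_count.txt")   # PANDAS DATAFRAME
s_df = ss.createDataFrame(p_df)                 # TO SPARK DATAFRAME
h_df = hc.as_h2o_frame(s_df)                    # TO H2OFRAME VIA THE H2OCONTEXT
p_df2 = h_df.as_data_frame()                    # BACK TO PANDAS DATAFRAME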

Written by statcompute

July 3, 2017 at 12:36 am

Flavors of SQL on Pandas DataFrame

In R, sqldf() provides a convenient interface for running SQL statements on data frames. Similarly, Python offers multiple ways to interact between SQL and Pandas DataFrames by leveraging the lightweight SQLite engine. While pandasql (https://github.com/yhat/pandasql) works similarly to sqldf() in R, pysqldf (https://github.com/airtoxin/pysqldf) is even more powerful. In my experiments shown below, the advantages of pysqldf over pandasql are twofold. First, pysqldf is 2 – 3 times faster than pandasql. Second, pysqldf supports user-defined functions, which are not available in pandasql. However, it is worth mentioning that the generic Python interface to an in-memory SQLite database can be more efficient and flexible than both pysqldf and pandasql, as demonstrated below, as long as we are able to get the DataFrame into SQLite and keep it in memory.

from sqlite3 import connect
from pandas import read_sql_query
import pandasql
import pysqldf
import numpy

# CREATE AN IN-MEMORY SQLITE DB
con = connect(":memory:")
cur = con.cursor()
cur.execute("attach 'my.db' as filedb")
cur.execute("create table df as select * from filedb.hflights")
cur.execute("detach filedb")

# IMPORT SQLITE TABLE INTO PANDAS DF
df = read_sql_query("select * from df", con)

# WRITE QUERIES
sql01 = "select * from df where DayofWeek = 1 and Dest = 'CVG';"
sql02 = "select DayofWeek, AVG(ArrTime) from df group by DayofWeek;"
sql03 = "select DayofWeek, median(ArrTime) from df group by DayofWeek;"

# SELECTION:
# 1. PANDASQL
%time t11 = pandasql.sqldf(sql01, globals())
# 2. PYSQLDF
%time t12 = pysqldf.SQLDF(globals()).execute(sql01)
# 3. GENERIC SQLITE CONNECTION
%time t13 = read_sql_query(sql01, con)

# AGGREGATION:
# 1. PANDASQL
%time t21 = pandasql.sqldf(sql02, globals())
# 2. PYSQLDF
%time t22 = pysqldf.SQLDF(globals()).execute(sql02)
# 3. GENERIC SQLITE CONNECTION
%time t23 = read_sql_query(sql02, con)

# DEFINING A NEW FUNCTION:
# DEFINE A FUNCTION NOT SUPPORTED IN SQLITE
class median(object):
  def __init__(self):
    self.a = []
  def step(self, x):
    self.a.append(x)
  def finalize(self):
    return numpy.median(self.a)

# 1. PYSQLDF
udafs = {"median": median}
%time t31 = pysqldf.SQLDF(globals(), udafs = udafs).execute(sql03)
# 2 GENERIC SQLITE CONNECTION
con.create_aggregate("median", 1, median)
%time t32 = read_sql_query(sql03, con)
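
Scalar functions can be registered in a similar fashion. Below is a sketch with a hypothetical log10() transform; it assumes the installed pysqldf version accepts a udfs argument mirroring the udafs argument above, while the generic connection uses sqlite3's create_function().

# DEFINING A SCALAR FUNCTION (A SKETCH, NOT PART OF THE ORIGINAL BENCHMARK)
def log10(x):
  return None if x is None else float(numpy.log10(x))

sql04 = "select DayofWeek, log10(ArrTime) from df limit 5;"

# 1. PYSQLDF (ASSUMING A udfs ARGUMENT MIRRORING udafs)
udfs = {"log10": log10}
t41 = pysqldf.SQLDF(globals(), udfs = udfs).execute(sql04)
# 2. GENERIC SQLITE CONNECTION
con.create_function("log10", 1, log10)
t42 = read_sql_query(sql04, con)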

Written by statcompute

October 17, 2016 at 10:34 pm

Posted in Big Data, pandas, PYTHON, SQL, SQLite


Faster SQL on Pandas DataFrame with Sandals


Similar to pandasql, Sandals (https://github.com/jbochi/sandals) provides a convenient interface for querying a Pandas DataFrame. However, while pandasql leverages the power of SQLite in the back end, Sandals parses the SQL and manipulates the DataFrame directly, and hence delivers superior performance by avoiding the data exchange between Python and SQLite.

In [1]: import sqlite3  as sqlite

In [2]: import pandas   as pd

In [3]: import pandasql as psql

In [4]: import sandals  as ssql

In [5]: con = sqlite.connect("/home/liuwensui/Documents/data/flights.db")

In [6]: df = pd.io.sql.read_sql("select * from tbl2008 limit 100000", con) 

In [7]: ### SELECT ###

In [8]: %time psql.sqldf("select * from df where DayOfWeek = 1 limit 1", locals())
CPU times: user 1.69 s, sys: 11 ms, total: 1.7 s
Wall time: 1.7 s
Out[8]: 
   Year  Month  DayofMonth  DayOfWeek  DepTime  CRSDepTime  ArrTime  \
0  2008      1           7          1     1842        1845     2032   

   CRSArrTime UniqueCarrier  FlightNum        ...         TaxiIn  TaxiOut  \
0        2040            WN        746        ...              4        9   

   Cancelled  CancellationCode  Diverted  CarrierDelay WeatherDelay NASDelay  \
0          0               NaN         0           NaN          NaN      NaN   

   SecurityDelay  LateAircraftDelay  
0            NaN                NaN  

[1 rows x 29 columns]

In [9]: %time ssql.sql("select * from df where DayOfWeek = 1 limit 1", locals())
CPU times: user 16.3 ms, sys: 3.79 ms, total: 20.1 ms
Wall time: 18 ms
Out[9]: 
       Year  Month  DayofMonth  DayOfWeek  DepTime  CRSDepTime  ArrTime  \
11843  2008      1           7          1     1842        1845     2032   

       CRSArrTime UniqueCarrier  FlightNum        ...         TaxiIn  TaxiOut  \
11843        2040            WN        746        ...              4        9   

       Cancelled  CancellationCode  Diverted  CarrierDelay WeatherDelay  \
11843          0               NaN         0           NaN          NaN   

      NASDelay  SecurityDelay  LateAircraftDelay  
11843      NaN            NaN                NaN  

[1 rows x 29 columns]

In [10]: ### AGGREGATE ###

In [11]: %time psql.sqldf("select DayOfWeek, AVG(ArrTime) from df group by DayOfWeek", locals())
CPU times: user 1.74 s, sys: 15.6 ms, total: 1.75 s
Wall time: 1.75 s
Out[11]: 
   DayOfWeek  AVG(ArrTime)
0          1   1484.413548
1          2   1489.169235
2          3   1490.011514
3          4   1478.614701
4          5   1486.728223
5          6   1485.118600
6          7   1540.071341

In [12]: %time ssql.sql("select DayOfWeek, AVG(ArrTime) from df group by DayOfWeek", locals())
CPU times: user 5.72 ms, sys: 0 ns, total: 5.72 ms
Wall time: 4.92 ms
Out[12]: 
               ArrTime
DayOfWeek             
1          1484.413548
2          1489.169235
3          1490.011514
4          1478.614701
5          1486.728223
6          1485.118600
7          1540.071341
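
For reference, the two queries above correspond roughly to the plain pandas operations below; this is a sketch of the kind of code Sandals generates from the SQL, not its actual internals.

# ROUGH PANDAS EQUIVALENTS OF THE TWO QUERIES ABOVE (A SKETCH, NOT SANDALS INTERNALS)
t1 = df[df.DayOfWeek == 1].head(1)                # SELECT ... WHERE DayOfWeek = 1 LIMIT 1
t2 = df.groupby("DayOfWeek")[["ArrTime"]].mean()  # SELECT DayOfWeek, AVG(ArrTime) ... GROUP BY DayOfWeek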

Written by statcompute

April 19, 2015 at 1:57 am

Posted in PYTHON, SQL, SQLite


A Comparison between Blaze and Pandas

Blaze (https://github.com/ContinuumIO/blaze) is a lightweight interface on top of other data or computing engines such as numpy or SQLite. Blaze itself doesn’t do any computation but provides a Pandas-like syntax for interacting with the back-end data.

Below is an example showing how Blaze leverages the computing power of SQLite (https://www.sqlite.org) and outperforms Pandas when performing some simple tasks on a SQLite table with ~7MM rows. Since Blaze doesn’t have to load the data table into memory as Pandas does, the CPU time is significantly shorter.

Pandas

import pandas as pda
import pandas.io.sql as psql
import sqlite3 as sql
import numpy as npy

con = sql.connect('/home/liuwensui/Documents/data/flights.db')
ds1 = psql.read_sql('select * from tbl2008', con)
ds2 = ds1[ds1.DayOfWeek > 1]
ds3 = ds2.groupby('DayOfWeek', as_index = False)
ds4 = ds3['AirTime'].agg({'avg_AirTime' : npy.mean})
print(ds4)

#   DayOfWeek  avg_AirTime
#0          2   103.214930
#1          3   103.058508
#2          4   103.467138
#3          5   103.557539
#4          6   107.400631
#5          7   104.864885
#
#real 1m7.241s
#user 1m0.403s
#sys  0m5.278s

Blaze

import blaze as blz
import pandas as pda

ds1 = blz.Data('sqlite:////home/liuwensui/Documents/data/flights.db::tbl2008')
ds2 = ds1[ds1.DayOfWeek > 1]
ds3 = blz.by(ds2.DayOfWeek, avg_AirTime = ds2.AirTime.mean())
ds4 = blz.into(pda.DataFrame, ds3)
print(ds4)

#   DayOfWeek  avg_AirTime
#0          2   103.214930
#1          3   103.058508
#2          4   103.467138
#3          5   103.557539
#4          6   107.400631
#5          7   104.864885
#
#real 0m21.658s
#user 0m10.727s
#sys  0m1.167s
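
Since Blaze pushes the aggregation down to SQLite, a natural sanity check is to run the equivalent SQL directly against the connection defined in the Pandas block above; a sketch, not part of the original timing:

# THE SAME AGGREGATION RUN DIRECTLY IN SQLITE (A SKETCH, NOT PART OF THE ORIGINAL TIMING)
sql = """
select DayOfWeek, avg(AirTime) as avg_AirTime
from tbl2008 where DayOfWeek > 1 group by DayOfWeek
"""
ds5 = psql.read_sql(sql, con)
print(ds5)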

Written by statcompute

March 27, 2015 at 1:13 am

Posted in blaze, pandas, PYTHON


Chain Operations of Pandas DataFrame

Chain operations are an innovative feature of the dplyr package in R (https://statcompute.wordpress.com/2014/07/28/chain-operations-an-interesting-feature-in-dplyr-package). This functionality makes data manipulation more compact and expressive.

In Python, the pandas-ply package brings the same capability to pandas DataFrame objects, as shown below.

In [1]: import pandas as pd

In [2]: df = pd.read_csv('/home/liuwensui/Downloads/2008.csv')

In [3]: ### STANDARD PANDAS OPERATIONS ###

In [4]: grp = df[df.Month < 6].groupby(["Month", "DayOfWeek"])

In [5]: df1 = pd.DataFrame()

In [6]: df1["ArrMedian"] = grp.ArrTime.median()

In [7]: df1["DepMean"] = grp.DepTime.mean()

In [8]: df2 = df1[(df1.ArrMedian > 1515) & (df1.DepMean > 1350)]

In [9]: print(df2)
                 ArrMedian      DepMean
Month DayOfWeek                        
1     7               1545  1375.156487
2     5               1522  1352.657670
      7               1538  1375.605698
3     7               1542  1377.128506
4     7               1536  1366.719829
5     7               1535  1361.323864

In [10]: ### PLY CHAIN OPERATIONS ###

In [11]: from ply import install_ply, X

In [12]: install_ply(pd)

In [13]: df1 = (df
   ....:        .ply_where(X.Month < 6)
   ....:        .groupby(['Month', 'DayOfWeek'])
   ....:        .ply_select(DepMean = X.DepTime.mean(), ArrMedian = X.ArrTime.median())
   ....:        .ply_where(X.ArrMedian > 1515, X.DepMean > 1350))

In [14]: print(df1)
                  ArrMedian      DepMean
Month DayOfWeek                         
1     7                1545  1375.156487
2     5                1522  1352.657670
      7                1538  1375.605698
3     7                1542  1377.128506
4     7                1536  1366.719829
5     7                1535  1361.323864
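
For comparison, pandas itself can express a similar chain without pandas-ply; the sketch below is an assumed equivalent using groupby() and agg(), not part of the original post.

# AN ASSUMED EQUIVALENT WITH PLAIN PANDAS METHODS (A SKETCH)
df3 = (df[df.Month < 6]
       .groupby(["Month", "DayOfWeek"])
       .agg({"ArrTime": "median", "DepTime": "mean"})
       .rename(columns = {"ArrTime": "ArrMedian", "DepTime": "DepMean"}))
df3 = df3[(df3.ArrMedian > 1515) & (df3.DepMean > 1350)]
print(df3)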

Written by statcompute

January 31, 2015 at 4:26 pm

Posted in PYTHON


Performance Improvement by Expression Evaluation in Pandas

In [1]: import pandas as pd

In [2]: df1 = pd.read_csv('/home/liuwensui/Downloads/2008.csv')

In [3]: df1.shape
Out[3]: (7009728, 29)

In [4]: # STANDARD PANDAS SYNTAX

In [5]: %timeit -n 10 df2 = df1[(df1['Month'] == 1) & (df1['DayofMonth'] == 1)]
10 loops, best of 3: 85.1 ms per loop

In [6]: # EXPRESSION EVALUATION

In [7]: %timeit -n 10 df3 = df1.query('Month == 1 and DayofMonth == 1')
10 loops, best of 3: 44 ms per loop
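
The gain comes from query() evaluating the whole expression in one pass with the numexpr engine when it is available, instead of materializing the intermediate boolean masks that the standard syntax creates. The same engine is also reachable through pandas.eval(); a minimal sketch, not timed in the original post:

# THE SAME FILTER THROUGH pandas.eval(), WHICH ALSO USES THE numexpr ENGINE (A SKETCH)
df4 = df1[pd.eval("(df1.Month == 1) & (df1.DayofMonth == 1)")]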

Written by statcompute

September 14, 2014 at 11:04 pm

Posted in PYTHON


Joining DataFrame with pandas

Driven by my own curiosity, I also did a test joining two DataFrames with the Python pandas package and tried to compare it with R in terms of efficiency.

In the pandas package, there are two ways to join DataFrames: the pandas.merge() function and the pandas.DataFrame.join() method. Based upon the results shown below, both methods seem very comparable in terms of speed, taking roughly twice as long as the R data.table package but running a lot faster than the other three methods, i.e. merge(), sqldf(), and the ff package.

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: import random as rd

In [4]: rd.seed(2013)

In [5]: n = 1000000

In [6]: ldf = pd.DataFrame({"id1": rd.sample(range(n), n), 
   ...:                     "id2": np.random.randint(0, n / 100, n),
   ...:                     "x1" : np.random.normal(size = n),
   ...:                     "x2" : np.random.uniform(size = n)})

In [7]: rdf = pd.DataFrame({"id1": rd.sample(range(n), n), 
   ...:                     "id2": np.random.randint(0, n / 100, n),
   ...:                     "y1" : np.random.normal(size = n),
   ...:                     "y2" : np.random.uniform(size = n)})

In [8]: ldf.head(3)
Out[8]: 
      id1   id2        x1        x2
0  425177  5902  0.445512  0.393837
1   51906  6066  0.336391  0.911894
2  401789  6609  0.347967  0.719724

In [9]: rdf.head(3)
Out[9]: 
      id1   id2        y1        y2
0  724470  4559 -0.674539  0.767925
1  196104  8637  0.001924  0.246559
2  988955  2348 -0.254853  0.258329

In [10]: # METHOD 1: PANDAS.MERGE() FUNCTION

In [11]: %timeit -n10 -r3 pd.merge(ldf, rdf, how = 'inner', left_on = ['id1', 'id2'], right_on = ['id1', 'id2'])
10 loops, best of 3: 3.44 s per loop

In [12]: # METHOD 2: PANDAS.DATAFRAME.JOIN() METHOD

In [13]: %timeit -n10 -r3 ldf.join(rdf.set_index(['id1', 'id2']), on = ['id1', 'id2'], how = 'inner')
10 loops, best of 3: 3.71 s per loop
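
A third variant worth noting, not timed above, is to build both indices up front so the join itself does less work; a sketch:

# A THIRD VARIANT: BUILD BOTH INDICES UP FRONT, THEN JOIN (A SKETCH, NOT TIMED ABOVE)
lidx = ldf.set_index(['id1', 'id2'])
ridx = rdf.set_index(['id1', 'id2'])
jdf = lidx.join(ridx, how = 'inner')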

Written by statcompute

January 31, 2013 at 12:16 am

Posted in PYTHON
