Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

A Comparison between Blaze and Pandas

Blaze (https://github.com/ContinuumIO/blaze) is a lightweight interface on top of other data or computing engines such as numpy or sqlite. Blaze itself doesn’t do any computation but provides a Pandas-like syntax to interact with the back-end data.

Below is an example showing how Blaze leverages the computing power of SQLite (https://www.sqlite.org) and outperforms Pandas when performing some simple tasks on a SQLite table with ~7MM rows. Since Blaze doesn’t have to load the data table into the memory as Pandas does, the cpu time is significantly shorter.

Pandas

import pandas as pda
import pandas.io.sql as psql
import sqlite3 as sql
import numpy as npy

con = sql.connect('/home/liuwensui/Documents/data/flights.db')
ds1 = psql.read_sql('select * from tbl2008', con)
ds2 = ds1[ds1.DayOfWeek > 1]
ds3 = ds2.groupby('DayOfWeek', as_index = False)
ds4 = ds3['AirTime'].agg({'avg_AirTime' : npy.mean})
print(ds4)

#   DayOfWeek  avg_AirTime
#0          2   103.214930
#1          3   103.058508
#2          4   103.467138
#3          5   103.557539
#4          6   107.400631
#5          7   104.864885
#
#real 1m7.241s
#user 1m0.403s
#sys  0m5.278s

Blaze

import blaze as blz
import pandas as pda

ds1 = blz.Data('sqlite:////home/liuwensui/Documents/data/flights.db::tbl2008')
ds2 = ds1[ds1.DayOfWeek > 1]
ds3 = blz.by(ds2.DayOfWeek, avg_AirTime = ds2.AirTime.mean())
ds4 = blz.into(pda.DataFrame, ds3)
print(ds4)

#   DayOfWeek  avg_AirTime
#0          2   103.214930
#1          3   103.058508
#2          4   103.467138
#3          5   103.557539
#4          6   107.400631
#5          7   104.864885
#
#real 0m21.658s
#user 0m10.727s
#sys  0m1.167s
Advertisements

Written by statcompute

March 27, 2015 at 1:13 am

Posted in blaze, pandas, PYTHON

Tagged with , ,

%d bloggers like this: