Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

Data Import Efficiency

HDF5 stands for hierarchical data format and is a popular data model designed to store and organize large amounts of data with interfaces in multiple computing languages such as R, PYTHON, and MATLAB.

Below is a demonstration showing the efficiency of data import from HDF5 and its comparison with CSV and SQLITE. As shown in the result, the time of data import from HDF5 is the shortest, only ~50% of import time from CSV and ~25% of import time from SQLITE.

In [1]: import pandas as pd

In [2]: import pandas.io.sql as pd_sql

In [3]: import sqlite3 as sql

In [4]: DF = pd.read_csv('credit_count.csv')

In [5]: DF.head(3)
Out[5]: 
   CARDHLDR  BAD        AGE  ACADMOS  ADEPCNT  MAJORDRG  MINORDRG  OWNRENT       INCOME  \
0         0    0  27.250000        4        0         0         0        0  1200.000000   
1         0    0  40.833332      111        3         0         0        1  4000.000000   
2         1    0  37.666668       54        3         0         0        1  3666.666667   

   SELFEMPL  INCPER   EXP_INC     SPENDING   LOGSPEND   
0         0   18000  0.000667                           
1         0   13500  0.000222                           
2         0   11300  0.033270  121.9896773  4.8039364   

In [6]: con = sql.connect('data.db') #WRITE DF INTO SQLITE DB

In [7]: pd_sql.write_frame(DF, "tbl", con)

In [8]: con.commit() 

In [9]: h5 = pd.HDFStore('data.h5', 'w') #WRITE DF INTO HDF5

In [10]: h5['tbl'] = DF

In [11]: %timeit -n100 -r10 pd.read_csv('credit_count.csv')
100 loops, best of 10: 64.3 ms per loop

In [12]: %timeit -n100 -r10 pd_sql.read_frame("select * from tbl", con)
100 loops, best of 10: 114 ms per loop

In [13]: %timeit -n100 -r10 output = h5['tbl']
100 loops, best of 10: 26.3 ms per loop
Advertisements

Written by statcompute

December 22, 2012 at 11:21 pm

Posted in PYTHON

Tagged with ,

%d bloggers like this: