Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

Data Aggregation with Python

In the code below, I tried to do the data aggregation by groups with three different methods with python and compared their efficiency. It is also my first time to use ipython, which is an awesome testing environment.

It turns out that pandas package is the fastest and sqlite3 through in-memory database is the slowest.

In [1]: from random import *

In [2]: from pandas import *

In [3]: from sqlite3 import *

In [4]: from datetime import datetime, timedelta

In [5]: # SIMULATE A LIST WITH 1M RECORDS

In [6]: list = [[randint(1, 10), x] for x in xrange(1, 1000001)]

In [7]: # CALCULATE THE SUM BY GENERIC CODING

In [8]: start1 = datetime.now()

In [9]: summ1 = []

In [10]: for i in set(x[0] for x in list):
   ....:   total = sum([x[1] for x in list if x[0] == i])
   ....:   summ1.append([i, total])
   ....:

In [11]: end1 = datetime.now()

In [12]: print str(end1 - start1)
0:00:04.703000

In [13]: summ1
Out[13]:
[[1, 49899161104L],
 [2, 50089982753L],
 [3, 50289189719L],
 [4, 50152366638L],
 [5, 49876070739L],
 [6, 50024578837L],
 [7, 50024658776L],
 [8, 50082058403L],
 [9, 49665851019L],
 [10, 49896582012L]]

In [14]: # CALCULATE THE SUM BY DATAFRAME

In [15]: start2 = datetime.now()

In [16]: df = DataFrame(list, columns = ["x", "y"])

In [17]: summ2 = df.groupby('x', as_index = False).sum()

In [18]: end2 = datetime.now()

In [19]: print str(end2 - start2)
0:00:01.500000

In [20]: summ2
Out[20]:
    x            y
0   1  49899161104
1   2  50089982753
2   3  50289189719
3   4  50152366638
4   5  49876070739
5   6  50024578837
6   7  50024658776
7   8  50082058403
8   9  49665851019
9  10  49896582012

In [21]: # CALCULATE THE SUM BY SQLITE

In [22]: start3 = datetime.now()

In [23]: con = connect(":memory:")

In [24]: with con:
   ....:   cur = con.cursor()
   ....:   cur.execute("create table tbl (x real, y real)")
   ....:   cur.executemany("insert into tbl values(?, ?)", list)
   ....:   cur.execute("select x, sum(y) from tbl group by x")
   ....:   summ3 = cur.fetchall()
   ....:   cur.close()
   ....:

In [25]: con.close
Out[25]: <function close>

In [26]: end3 = datetime.now()

In [27]: print str(end3 - start3)
0:00:22.016000

In [28]: summ3
Out[28]:
[(1.0, 49899161104.0),
 (2.0, 50089982753.0),
 (3.0, 50289189719.0),
 (4.0, 50152366638.0),
 (5.0, 49876070739.0),
 (6.0, 50024578837.0),
 (7.0, 50024658776.0),
 (8.0, 50082058403.0),
 (9.0, 49665851019.0),
 (10.0, 49896582012.0)]
Advertisements

Written by statcompute

October 24, 2012 at 12:14 am

%d bloggers like this: