I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

## Joining DataFrame with pandas

Driven by my own curiosity, I also did a test on joining two DataFrames with python pandas package and tried to compare it with R in terms of efficiency.

In pandas package, there are 2 ways to join two DataFrames, pandas.merge() function and pandas.DataFrame.join() method. Based upon the result shown below, both methods seem very comparable in terms of speed, which doubles the time with R data.table package but is a lot faster then the other three methods, e.g. merge(), sqldf(), and ff package.

```In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: import random as rd

In [4]: rd.seed(2013)

In [5]: n = 1000000

In [6]: ldf = pd.DataFrame({"id1": rd.sample(range(n), n),
...:                     "id2": np.random.randint(0, n / 100, n),
...:                     "x1" : np.random.normal(size = n),
...:                     "x2" : np.random.uniform(size = n)})

In [7]: rdf = pd.DataFrame({"id1": rd.sample(range(n), n),
...:                     "id2": np.random.randint(0, n / 100, n),
...:                     "y1" : np.random.normal(size = n),
...:                     "y2" : np.random.uniform(size = n)})

Out[8]:
id1   id2        x1        x2
0  425177  5902  0.445512  0.393837
1   51906  6066  0.336391  0.911894
2  401789  6609  0.347967  0.719724

Out[9]:
id1   id2        y1        y2
0  724470  4559 -0.674539  0.767925
1  196104  8637  0.001924  0.246559
2  988955  2348 -0.254853  0.258329

In [10]: # METHOD 1: PANDAS.MERGE() FUNCTION

In [11]: %timeit -n10 -r3 pd.merge(ldf, rdf, how = 'inner', left_on = ['id1', 'id2'], right_on = ['id1', 'id2'])
10 loops, best of 3: 3.44 s per loop

In [12]: # METHOD 2: PANDAS.DATAFRAME.JOIN() METHOD

In [13]: %timeit -n10 -r3 ldf.join(rdf.set_index(['id1', 'id2']), on = ['id1', 'id2'], how = 'inner')
10 loops, best of 3: 3.71 s per loop
```

Written by statcompute

January 31, 2013 at 12:16 am

Posted in PYTHON

Tagged with ,