## Joining DataFrame with pandas

Driven by my own curiosity, I also did a test on joining two DataFrames with python pandas package and tried to compare it with R in terms of efficiency.

In pandas package, there are 2 ways to join two DataFrames, pandas.merge() function and pandas.DataFrame.join() method. Based upon the result shown below, both methods seem very comparable in terms of speed, which doubles the time with R data.table package but is a lot faster then the other three methods, e.g. merge(), sqldf(), and ff package.

In [1]: import pandas as pd In [2]: import numpy as np In [3]: import random as rd In [4]: rd.seed(2013) In [5]: n = 1000000 In [6]: ldf = pd.DataFrame({"id1": rd.sample(range(n), n), ...: "id2": np.random.randint(0, n / 100, n), ...: "x1" : np.random.normal(size = n), ...: "x2" : np.random.uniform(size = n)}) In [7]: rdf = pd.DataFrame({"id1": rd.sample(range(n), n), ...: "id2": np.random.randint(0, n / 100, n), ...: "y1" : np.random.normal(size = n), ...: "y2" : np.random.uniform(size = n)}) In [8]: ldf.head(3) Out[8]: id1 id2 x1 x2 0 425177 5902 0.445512 0.393837 1 51906 6066 0.336391 0.911894 2 401789 6609 0.347967 0.719724 In [9]: rdf.head(3) Out[9]: id1 id2 y1 y2 0 724470 4559 -0.674539 0.767925 1 196104 8637 0.001924 0.246559 2 988955 2348 -0.254853 0.258329 In [10]: # METHOD 1: PANDAS.MERGE() FUNCTION In [11]: %timeit -n10 -r3 pd.merge(ldf, rdf, how = 'inner', left_on = ['id1', 'id2'], right_on = ['id1', 'id2']) 10 loops, best of 3: 3.44 s per loop In [12]: # METHOD 2: PANDAS.DATAFRAME.JOIN() METHOD In [13]: %timeit -n10 -r3 ldf.join(rdf.set_index(['id1', 'id2']), on = ['id1', 'id2'], how = 'inner') 10 loops, best of 3: 3.71 s per loop

Advertisements