Data Munging with Pandas

Being a long-time user of SAS and R, I feel very comfortable with rectangle-shaped data tables and database-style operations in data munging. However, when I started diving into python and tried to pick it up as a data analysis tool, it took me a while to get used to the “python style”.

For instance, there are two data tables, one account table with 1 record per id and one transaction history table with multiple records per id.

acct = [['id1', '123-456-7890'], ['id2', '234-567-8901'],
        ['id3', '999-999-9999']]

hist = [['id1', 'deposited', 100], ['id1', 'withdraw', 40],
        ['id2', 'withdraw', 15], ['id1', 'withdraw', 25],
        ['id2',  'deposited', 30], ['id4',  'deposited', 30]]

If I’d like to do an inner join between these two tables, the generic coding logic might look like below, which is not be very intuitive (to me at least).

join = []
for h in hist:
  for a in acct:
    if h[0] == a[0]:
      list = [x for x in h]

for line in join:
  print line

# output:
#['id1', 'deposited', 100, '123-456-7890']
#['id1', 'withdraw', 40, '123-456-7890']
#['id2', 'withdraw', 15, '234-567-8901']
#['id1', 'withdraw', 25, '123-456-7890']
#['id2', 'deposited', 30, '234-567-8901']

Pandas package provides a very flexible column-oriented data structure DataFrame, which makes data munging operations extremely easy.

from pandas import *

df_acct = DataFrame(acct, columns = ['id', 'phone'])
df_hist = DataFrame(hist, columns = ['id', 'trans', 'amount'])
join2 = merge(df_acct, df_hist, on = 'id', how = 'inner')
print join2

# output
#    id         phone      trans  amount
#0  id1  123-456-7890  deposited     100
#1  id1  123-456-7890   withdraw      40
#2  id1  123-456-7890   withdraw      25
#3  id2  234-567-8901   withdraw      15
#4  id2  234-567-8901  deposited      30