Data Munging with Pandas
Being a long-time user of SAS and R, I feel very comfortable with rectangle-shaped data tables and database-style operations in data munging. However, when I started diving into python and tried to pick it up as a data analysis tool, it took me a while to get used to the “python style”.
For instance, there are two data tables, one account table with 1 record per id and one transaction history table with multiple records per id.
acct = [['id1', '123-456-7890'], ['id2', '234-567-8901'], ['id3', '999-999-9999']] hist = [['id1', 'deposited', 100], ['id1', 'withdraw', 40], ['id2', 'withdraw', 15], ['id1', 'withdraw', 25], ['id2', 'deposited', 30], ['id4', 'deposited', 30]]
If I’d like to do an inner join between these two tables, the generic coding logic might look like below, which is not be very intuitive (to me at least).
join =  for h in hist: for a in acct: if h == a: list = [x for x in h] list.append(a) join.append(list) for line in join: print line # output: #['id1', 'deposited', 100, '123-456-7890'] #['id1', 'withdraw', 40, '123-456-7890'] #['id2', 'withdraw', 15, '234-567-8901'] #['id1', 'withdraw', 25, '123-456-7890'] #['id2', 'deposited', 30, '234-567-8901']
Pandas package provides a very flexible column-oriented data structure DataFrame, which makes data munging operations extremely easy.
from pandas import * df_acct = DataFrame(acct, columns = ['id', 'phone']) df_hist = DataFrame(hist, columns = ['id', 'trans', 'amount']) join2 = merge(df_acct, df_hist, on = 'id', how = 'inner') print join2 # output # id phone trans amount #0 id1 123-456-7890 deposited 100 #1 id1 123-456-7890 withdraw 40 #2 id1 123-456-7890 withdraw 25 #3 id2 234-567-8901 withdraw 15 #4 id2 234-567-8901 deposited 30