Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

Import CSV by Chunk Simultaneously with IPython Parallel

IPython Parallel is a friendly interface to implement the parallelism in Python. Below is an example showing how to import a csv file into the Pandas DataFrame by chunk simultaneously through the interface of IPython Parallel.

In [1]: ### NEED TO RUN "IPCLUSTER START -N 2 &" FIRST ###

In [2]: import pandas as pd

In [3]: import ipyparallel as ip

In [4]: str = "AutoCollision.csv"

In [5]: rc = ip.Client()

In [6]: ### showing 2 engines working

In [7]: rc.ids
Out[7]: [0, 1]

In [8]: def para_read(file):
   ...:       dview = rc[:]
   ...:       # PARTITION THE IMPORT BY SCATTER() #
   ...:       dview.scatter("df", pd.read_csv(file))
   ...:       return pd.concat([i for i in dview["df"]])
   ...:

In [9]: ### PARALLEL IMPORT ###

In [10]: df1 = para_read(str)

In [11]: ### SERIAL IMPORT ###

In [12]: df2 = pd.read_csv(str)

In [13]: df1.equals(df2)
Out[13]: True
Advertisements

Written by statcompute

December 27, 2015 at 4:03 pm

Posted in Big Data, pandas, PYTHON

Tagged with ,

%d bloggers like this: