RPY2 – Yet Another Blog in Statistical Computing

Two Python packages, RPY2 and PYPER, let Python users call R functions from a Python program. Below is a comparison of the two. The design of this study is very simple:
1) read a data matrix with 100,000 rows and 100 columns into a pandas DataFrame
2) convert the pandas DataFrame to an R data frame
3) calculate column means of the R data frame by calling the R colMeans() function
4) convert the vector of column means back to a pandas DataFrame
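
Steps 1 and 3 can be cross-checked without R. Below is a minimal sketch for Python 3 that uses only the standard library: it builds a small synthetic CSV with columns named x1, x2, ... and computes the column means directly. The helper names make_csv and column_means are hypothetical, and the generated values are random stand-ins, not the original test.csv.

```python
import csv
import io
import random


def make_csv(n_rows, n_cols, seed=0):
    """Return CSV text with header x1..xN and synthetic standard-normal values."""
    rng = random.Random(seed)
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["x%d" % (i + 1) for i in range(n_cols)])
    for _ in range(n_rows):
        writer.writerow([repr(rng.gauss(0.0, 1.0)) for _ in range(n_cols)])
    return buf.getvalue()


def column_means(csv_text):
    """Compute the mean of every column, keyed by column name."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    sums = [0.0] * len(header)
    count = 0
    for row in reader:
        for i, cell in enumerate(row):
            sums[i] += float(cell)
        count += 1
    return {name: total / count for name, total in zip(header, sums)}


if __name__ == "__main__":
    # Small stand-in for the 100,000 x 100 matrix used in the study.
    means = column_means(make_csv(n_rows=1000, n_cols=5))
    for name, value in means.items():
        print("%-4s %9.6f" % (name, value))
```

A result computed this way gives a baseline to compare against whatever comes back from R in step 4.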

Based on the output, my impression is that
A. PYPER is extremely easy to use and seems more efficient in terms of “user time” (about 21 seconds versus about 43 seconds for RPY2).
B. RPY2 is not intuitive to use and shows a longer “user time”. However, the turn-around of the whole process, i.e. “real time”, is a lot shorter (about 58 seconds versus about 5.5 minutes for PYPER).
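
The real / user / sys figures quoted in this post look like the output of the shell's time command. For readers who would rather measure the same quantities from inside Python, here is a minimal sketch for Python 3 using only the standard library. The helper names measure and wait_then_sum are made up for illustration and are not part of RPY2 or PYPER.

```python
import os
import time


def measure(func, *args, **kwargs):
    """Run func and report wall-clock ("real") and CPU ("user"/"sys") seconds."""
    wall_start = time.perf_counter()
    cpu_start = os.times()
    result = func(*args, **kwargs)
    cpu_end = os.times()
    wall_end = time.perf_counter()
    timings = {
        "real": wall_end - wall_start,
        # CPU time of this process plus child processes. The children_*
        # fields only include children that have already been waited for.
        "user": (cpu_end.user - cpu_start.user)
                + (cpu_end.children_user - cpu_start.children_user),
        "sys": (cpu_end.system - cpu_start.system)
               + (cpu_end.children_system - cpu_start.children_system),
    }
    return result, timings


def wait_then_sum(n, pause):
    """Hypothetical workload: sleeping adds real time only, summing adds CPU time."""
    time.sleep(pause)
    return sum(range(n))


if __name__ == "__main__":
    value, t = measure(wait_then_sum, 1000000, 0.2)
    print("real %.3fs  user %.3fs  sys %.3fs" % (t["real"], t["user"], t["sys"]))
```

Time spent waiting (for example on a pipe) shows up in “real” but not in “user”, which is how a run can have the smaller user time and still the longer turn-around.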

This result seems to make sense. As pointed out by Dr Xia (the developer of PYPER), PYPER is more memory-efficient because it avoids frequent data exchanges between Python and R. However, it passes data between Python and R through pipes, which are not fast, so speed is sacrificed for large datasets.
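
To make the cost of a pipe concrete, here is a minimal sketch for Python 3 of the general idea: the data is serialized to text, written to a child process through a pipe, and the reply is parsed back. This is not PYPER's actual implementation, and the child here is a second Python interpreter standing in for R. The names CHILD and column_means_via_pipe are hypothetical.

```python
import subprocess
import sys

# Child program: reads comma-separated rows from stdin and prints one
# column mean per line. It stands in for the R process in this sketch.
CHILD = r"""
import sys
sums, rows = None, 0
for line in sys.stdin:
    values = [float(v) for v in line.split(",")]
    sums = values if sums is None else [s + v for s, v in zip(sums, values)]
    rows += 1
for s in sums:
    print(repr(s / rows))
"""


def column_means_via_pipe(rows):
    """Serialize rows to text, send them through a pipe, parse the reply."""
    # Every number is converted to a string here and parsed again in the
    # child; for large inputs this text round trip is the expensive part.
    payload = "\n".join(",".join(repr(v) for v in row) for row in rows)
    done = subprocess.run(
        [sys.executable, "-c", CHILD],
        input=payload, capture_output=True, text=True, check=True,
    )
    return [float(line) for line in done.stdout.split()]


if __name__ == "__main__":
    data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
    print(column_means_via_pipe(data))
```

The sketch assumes at least one row of input. An in-process bridge skips the text round trip, at the price of holding the data on both the Python and the R side.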

Which package should you use? It all depends on your preference and your hardware. If the RAM in your machine is not a concern, then RPY2 seems preferable. However, if memory efficiency, code expressiveness, or process transparency matters to you, then PYPER should be considered.

RPY2 Code and Output

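import pandas as pd
from rpy2.robjects.packages import importr
# pandas.rpy was later deprecated and removed from pandas, so this import
# needs an older pandas release.
import pandas.rpy.common as py2r
# 1) read the data into a pandas DataFrame
py_data = pd.read_csv('/home/liuwensui/Documents/data/test.csv', sep = ',', nrows = 100000)
# 2) convert the pandas DataFrame to an R data frame
r_data = py2r.convert_to_r_dataframe(py_data)
# 3) call colMeans() from the R base package
r_mean = importr("base").colMeans(r_data)
# 4) convert the R vector of means back to a pandas DataFrame
py_mean = pd.DataFrame(py2r.convert_robj(r_mean), index = py_data.columns, columns = ['mean'])
print(py_mean.head(n = 10))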

#         mean
#x1   0.003442
#x2  -0.001742
#x3  -0.003448
#x4   0.001206
#x5   0.001721
#x6  -0.002381
#x7   0.002598
#x8  -0.005351
#x9   0.007363
#x10 -0.006564

#real	0m57.718s
#user	0m43.431s
#sys	0m6.024s

PYPER Code and Output

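import pandas as pd
import pyper as pr
# 1) read the data into a pandas DataFrame
py_data = pd.read_csv('/home/liuwensui/Documents/data/test.csv', sep = ',', nrows = 100000)
# start an R process that PYPER talks to through pipes
r = pr.R(use_pandas = True)
# 2) assigning to an attribute sends the DataFrame to R as r_data
r.r_data = py_data
# 3) run the R code inside the R process
r('r_mean <- colMeans(r_data)')
# 4) reading the attribute pulls r_mean back into Python
py_mean = pd.DataFrame(r.r_mean, index = py_data.columns, columns = ['mean'])
print(py_mean.head(n = 10))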

#         mean
#x1   0.003442
#x2  -0.001742
#x3  -0.003448
#x4   0.001206
#x5   0.001721
#x6  -0.002381
#x7   0.002598
#x8  -0.005351
#x9   0.007363
#x10 -0.006564

#real	5m31.532s
#user	0m21.289s
#sys	0m5.496s