Within Python, there are two packages that let Python users call R functions from a Python program. Below is a comparison between RPY2 and PYPER. The design of this study is very simple:
1) read a data matrix of 100,000 rows by 100 columns into a pandas DataFrame
2) convert the pandas DataFrame to an R data frame
3) calculate column means of the R data frame by calling R's colMeans() function
4) convert the vector of column means back to a pandas DataFrame
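The test file used in this study is not included, so here is a minimal sketch of a stand-in, assuming synthetic standard-normal data with columns named x1, x2, and so on (the names match the output printed in this post; the helper names, seed, temporary path, and reduced row count are assumptions for illustration). It writes a small CSV, reads it back with nrows, and computes the column means in pandas alone, which gives a pure-Python baseline to check the R results against.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def make_test_csv(path, n_rows=1000, n_cols=100, seed=0):
    """Write a synthetic n_rows x n_cols matrix with columns x1..xN (hypothetical data)."""
    rng = np.random.default_rng(seed)
    columns = ["x%d" % (i + 1) for i in range(n_cols)]
    data = pd.DataFrame(rng.standard_normal((n_rows, n_cols)), columns=columns)
    data.to_csv(path, index=False)


def baseline_col_means(path, nrows):
    """Read the CSV and return the column means as a one-column DataFrame named 'mean'."""
    py_data = pd.read_csv(path, sep=",", nrows=nrows)
    return py_data.mean(axis=0).to_frame(name="mean")


# Small demo: 1,000 rows instead of 100,000 so it finishes quickly.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "test.csv")
make_test_csv(csv_path)
py_mean = baseline_col_means(csv_path, nrows=1000)
print(py_mean.head(n=10))
```

Raising n_rows to 100000 should approximate the size used in the study, although the values will differ from the original file.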
Based on the output, my impression is as follows:
A. PYPER is extremely easy to use and seems more efficient in terms of “user time” (about 21 seconds versus 43 seconds for RPY2).
B. RPY2 is not intuitive to use and shows a longer “user time”. However, the turnaround of the whole process, i.e. “real time”, is a lot shorter (under 1 minute versus about 5.5 minutes for PYPER).
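The real/user/sys figures quoted with each listing look like output from the shell's time command. To make the distinction concrete, here is a rough sketch (the helper names measure, cpu_bound, and waiting are made up for illustration) that records wall-clock time alongside CPU time for a task that computes and a task that only waits. A process that spends most of its time waiting, for example on data moving between two programs, can show a long “real time” with a short “user time”.

```python
import os
import time


def measure(func, *args):
    """Run func(*args) and return (real, user, sys) seconds, similar to the shell's `time`."""
    cpu_before = os.times()
    wall_before = time.perf_counter()
    func(*args)
    wall_after = time.perf_counter()
    cpu_after = os.times()
    # Include CPU time of any child processes that have already been waited for.
    user = (cpu_after.user - cpu_before.user) + (
        cpu_after.children_user - cpu_before.children_user)
    system = (cpu_after.system - cpu_before.system) + (
        cpu_after.children_system - cpu_before.children_system)
    return wall_after - wall_before, user, system


def cpu_bound(n):
    """Burn CPU: real time and user time should be close to each other."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def waiting(seconds):
    """Only wait: real time grows while user time stays near zero."""
    time.sleep(seconds)


for label, func, arg in [("cpu_bound", cpu_bound, 2_000_000), ("waiting", waiting, 0.5)]:
    real, user, system = measure(func, arg)
    print("%-10s real %.3fs  user %.3fs  sys %.3fs" % (label, real, user, system))
```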
This result seems to make sense. As pointed out by Dr. Xia (the developer of PYPER), PYPER uses memory more efficiently by avoiding frequent data exchanges between Python and R. However, since pipes are not a fast way to pass data between Python and R, speed is sacrificed for large datasets.
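To illustrate why a pipe can become the bottleneck, here is a simplified sketch of the general pattern: the data is serialized to text, written to a child process over a pipe, and the reply is parsed back. This is not PYPER's actual implementation or wire format. The child here is a second Python interpreter standing in for R, and col_means_via_pipe and CHILD_CODE are hypothetical names chosen for this example.

```python
import subprocess
import sys
import time

# Child program: a stand-in for the R process (assumption: PYPER talks to R, not Python).
CHILD_CODE = r"""
import sys
rows = [[float(v) for v in line.split(",")] for line in sys.stdin if line.strip()]
n = len(rows)
means = [sum(col) / n for col in zip(*rows)]
print(",".join(repr(m) for m in means))
"""


def col_means_via_pipe(matrix):
    """Send a matrix as text through a pipe and read the column means back.

    Returns (means, seconds_to_serialize, seconds_for_pipe_round_trip).
    """
    t0 = time.perf_counter()
    payload = "\n".join(",".join(repr(v) for v in row) for row in matrix)
    t1 = time.perf_counter()
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        input=payload, capture_output=True, text=True, check=True,
    )
    t2 = time.perf_counter()
    means = [float(v) for v in result.stdout.strip().split(",")]
    return means, t1 - t0, t2 - t1


# Small demo matrix: 2,000 rows by 5 columns.
matrix = [[float(i + j) for j in range(5)] for i in range(2000)]
means, serialize_s, round_trip_s = col_means_via_pipe(matrix)
print("column means:", means)
print("serialize %.4fs, pipe round trip %.4fs" % (serialize_s, round_trip_s))
```

Every value crosses the pipe as text and has to be formatted on one side and parsed on the other, so the cost grows with the size of the dataset. The round-trip figure here also includes starting the child interpreter.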
Which package should you use? It depends on your preference and your hardware. If the RAM in your machine is not a concern, then RPY2 seems preferable. However, if memory efficiency, code expressiveness, or process transparency matters more to you, then PYPER should be considered.
RPY2 Code and Output
import pandas as pd
import rpy2.robjects as ro
import pandas.rpy.common as py2r

py_data = pd.read_csv('/home/liuwensui//Documents/data/test.csv', sep = ',', nrows = 100000)
r_data = py2r.convert_to_r_dataframe(py_data)
r_mean = ro.packages.importr("base").colMeans(r_data)
py_mean = pd.DataFrame(py2r.convert_robj(r_mean), index = py_data.columns, columns = ['mean'])
print py_mean.head(n = 10)

#         mean
#x1   0.003442
#x2  -0.001742
#x3  -0.003448
#x4   0.001206
#x5   0.001721
#x6  -0.002381
#x7   0.002598
#x8  -0.005351
#x9   0.007363
#x10 -0.006564

#real 0m57.718s
#user 0m43.431s
#sys  0m6.024s
PYPER Code and Output
import pandas as pd
import pyper as pr

py_data = pd.read_csv('/home/liuwensui//Documents/data/test.csv', sep = ',', nrows = 100000)
r = pr.R(use_pandas = True)
r.r_data = py_data
r('r_mean <- colMeans(r_data)')
py_mean = pd.DataFrame(r.r_mean, index = py_data.columns, columns = ['mean'])
print py_mean.head(n = 10)

#         mean
#x1   0.003442
#x2  -0.001742
#x3  -0.003448
#x4   0.001206
#x5   0.001721
#x6  -0.002381
#x7   0.002598
#x8  -0.005351
#x9   0.007363
#x10 -0.006564

#real 5m31.532s
#user 0m21.289s
#sys  0m5.496s