Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

Exchange Data between Python and R with SQLite

SQLite is a light-weight database with zero-configuration. Being fast, reliable, and simple, SQLite is a good choice to store / query large data, e.g. terabytes, and is well supported by both Python and R.

In [1]: # LOAD PYTHON PACKAGES

In [2]: import pandas as pd

In [3]: import pandas.io.sql as pd_sql

In [4]: import sqlite3 as sql

In [5]: import pyper as pr

In [6]: # READ DATA

In [7]: py_data = pd.read_table("/home/liuwensui/Documents/data/csdata.txt")

In [8]: print py_data
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4421 entries, 0 to 4420
Data columns:
LEV_LT3     4421  non-null values
TAX_NDEB    4421  non-null values
COLLAT1     4421  non-null values
SIZE1       4421  non-null values
PROF2       4421  non-null values
GROWTH2     4421  non-null values
AGE         4421  non-null values
LIQ         4421  non-null values
IND2A       4421  non-null values
IND3A       4421  non-null values
IND4A       4421  non-null values
IND5A       4421  non-null values
dtypes: float64(7), int64(5)

In [9]: # CREATE A CONNECTION TO SQLITE DB

In [10]: con = sql.connect("/home/liuwensui/Documents/data/tmp.db")

In [11]: # WRITE THE DATAFRAME INTO SQLITE DB

In [12]: con.execute("drop table if exists tbldata")
Out[12]: <sqlite3.Cursor at 0xa00d820>

In [13]: pd_sql.write_frame(py_data, "tbldata", con)

In [14]: con.commit()

In [15]: # TEST THE DATA WRITTEN INTO SQLITE DB

In [16]: test_data = pd_sql.read_frame("select * from tbldata limit 5", con)

In [17]: print test_data
   LEV_LT3  TAX_NDEB   COLLAT1      SIZE1     PROF2    GROWTH2  AGE       LIQ  IND2A  IND3A  IND4A  IND5A
0        0  0.530298  0.079172  13.131993  0.082016   1.166493   53  0.385779      0      0      1      0
1        0  0.370025  0.040745  12.132626  0.082615  11.092048   54  0.224123      1      0      0      0
2        0  0.636884  0.307242  13.322921  0.245129  -6.316099   43  0.055441      1      0      0      0
3        0  0.815549  0.295864  16.274536  0.164052   1.394809   24  0.016731      1      0      0      0
4        0  0.097690  0.033567  13.491299  0.160505  10.204010   49  0.387136      1      0      0      0

In [18]: # CREATE A R INSTANCE

In [19]: r = pr.R()

In [20]: # LOAD R LIBRARY

In [21]: print r("library(sqldf)")
try({library(sqldf)})
Loading required package: DBI
Loading required package: gsubfn
Loading required package: proto
Loading required namespace: tcltk
Loading Tcl/Tk interface ... done
Loading required package: chron
Loading required package: RSQLite
Loading required package: RSQLite.extfuns


In [22]: # READ DATA FROM SQLITE DB

In [23]: print r("r_data <- sqldf('select * from tbldata', dbname = '/home/liuwensui/Documents/data/tmp.db')")
try({r_data <- sqldf('select * from tbldata', dbname = '/home/liuwensui/Documents/data/tmp.db')})
Loading required package: tcltk


In [24]: print r("str(r_data)")
try({str(r_data)})
'data.frame':	4421 obs. of  12 variables:
 $ LEV_LT3 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ TAX_NDEB: num  0.5303 0.37 0.6369 0.8155 0.0977 ...
 $ COLLAT1 : num  0.0792 0.0407 0.3072 0.2959 0.0336 ...
 $ SIZE1   : num  13.1 12.1 13.3 16.3 13.5 ...
 $ PROF2   : num  0.082 0.0826 0.2451 0.1641 0.1605 ...
 $ GROWTH2 : num  1.17 11.09 -6.32 1.39 10.2 ...
 $ AGE     : int  53 54 43 24 49 24 35 77 33 81 ...
 $ LIQ     : num  0.3858 0.2241 0.0554 0.0167 0.3871 ...
 $ IND2A   : int  0 1 1 1 1 1 1 1 1 0 ...
 $ IND3A   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ IND4A   : int  1 0 0 0 0 0 0 0 0 0 ...
 $ IND5A   : int  0 0 0 0 0 0 0 0 0 1 ...
Advertisements

Written by statcompute

December 3, 2012 at 12:02 am

Posted in Machine Learning, PYTHON, S+/R

Tagged with , ,

%d bloggers like this: