Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

R and MongoDB

MongoDB is a document-based noSQL database. Different from the relational database storing data in tables with rigid schemas, MongoDB stores data in documents with dynamic schemas. In the demonstration below, I am going to show how to extract data from a MongoDB with R.

Before starting the R session, we need to install the MongoDB in the local machine and then load the data into the database with the Python code below.

import pandas as pandas
import pymongo as pymongo

df = pandas.read_table('../data/csdata.txt')
lst = [dict([(colname, row[i]) for i, colname in enumerate(df.columns)]) for row in df.values]
for i in range(3):
  print lst[i]

con = pymongo.Connection('localhost', port = 27017)
test = con.db.test
test.drop()
for i in lst:
  test.save(i)

To the best of my knowledge, there are two R packages providing the interface with MongoDB, namely RMongo and rmongodb. While RMongo package is very straight-forward and user-friendly, it did take me a while to figure out how to specify a query with rmongodb package.

RMongo Example

library(RMongo)
mg1 <- mongoDbConnect('db')
print(dbShowCollections(mg1))
query <- dbGetQuery(mg1, 'test', "{'AGE': {'$lt': 10}, 'LIQ': {'$gte': 0.1}, 'IND5A': {'$ne': 1}}")
data1 <- query[c('AGE', 'LIQ', 'IND5A')]
summary(data1)

RMongo Output

Loading required package: rJava
Loading required package: methods
Loading required package: RUnit
[1] "system.indexes" "test"          
      AGE             LIQ             IND5A  
 Min.   :6.000   Min.   :0.1000   Min.   :0  
 1st Qu.:7.000   1st Qu.:0.1831   1st Qu.:0  
 Median :8.000   Median :0.2970   Median :0  
 Mean   :7.963   Mean   :0.3745   Mean   :0  
 3rd Qu.:9.000   3rd Qu.:0.4900   3rd Qu.:0  
 Max.   :9.000   Max.   :1.0000   Max.   :0  

rmongodb Example

library(rmongodb)
mg2 <- mongo.create()
print(mongo.get.databases(mg2))
print(mongo.get.database.collections(mg2, 'db'))
buf <- mongo.bson.buffer.create()
mongo.bson.buffer.start.object(buf, 'AGE')
mongo.bson.buffer.append(buf, '$lt', 10)
mongo.bson.buffer.finish.object(buf)
mongo.bson.buffer.start.object(buf, 'LIQ')
mongo.bson.buffer.append(buf, '$gte', 0.1)
mongo.bson.buffer.finish.object(buf)
mongo.bson.buffer.start.object(buf, 'IND5A')
mongo.bson.buffer.append(buf, '$ne', 1)
mongo.bson.buffer.finish.object(buf)
query <- mongo.bson.from.buffer(buf)
cur <- mongo.find(mg2, 'db.test', query = query)
age <- liq <- ind5a <- NULL
while (mongo.cursor.next(cur)) {
  value <- mongo.cursor.value(cur)
  age   <- rbind(age, mongo.bson.value(value, 'AGE'))
  liq   <- rbind(liq, mongo.bson.value(value, 'LIQ'))
  ind5a <- rbind(ind5a, mongo.bson.value(value, 'IND5A'))
  }
mongo.destroy(mg2)
data2 <- data.frame(AGE = age, LIQ = liq, IND5A = ind5a)
summary(data2)

rmongo Output

rmongodb package (mongo-r-driver) loaded
Use 'help("mongo")' to get started.

[1] "db"
[1] "db.test"
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
NULL
      AGE             LIQ             IND5A  
 Min.   :6.000   Min.   :0.1000   Min.   :0  
 1st Qu.:7.000   1st Qu.:0.1831   1st Qu.:0  
 Median :8.000   Median :0.2970   Median :0  
 Mean   :7.963   Mean   :0.3745   Mean   :0  
 3rd Qu.:9.000   3rd Qu.:0.4900   3rd Qu.:0  
 Max.   :9.000   Max.   :1.0000   Max.   :0  
About these ads

Written by statcompute

June 8, 2013 at 12:20 am

Follow

Get every new post delivered to your Inbox.

Join 76 other followers

%d bloggers like this: