I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

## Fastest Way to Add New Variables to A Large Data.Frame

```pkgs <- list("hflights", "doParallel", "foreach", "dplyr", "rbenchmark", "data.table")
lapply(pkgs, require, character.only = T)

data(hflights)

benchmark(replications = 10, order = "user.self", relative = "user.self",
transform = {
### THE GENERIC FUNCTION MODIFYING THE DATA.FRAME, SIMILAR TO DATA.FRAME() ###
transform(hflights, wday = ifelse(DayOfWeek %in% c(6, 7), 'weekend', 'weekday'), delay = ArrDelay + DepDelay)
},
within    = {
### EVALUATE THE EXPRESSION WITHIN THE LOCAL ENVIRONMENT ###
within(hflights, {wday = ifelse(DayOfWeek %in% c(6, 7), 'weekend', 'weekday'); delay = ArrDelay + DepDelay})
},
mutate   = {
### THE SPECIFIC FUNCTION IN DPLYR PACKAGE TO ADD VARIABLES ###
mutate(hflights, wday = ifelse(DayOfWeek %in% c(6, 7), 'weekend', 'weekday'), delay = ArrDelay + DepDelay)
},
foreach = {
### SPLIT AND THEN COMBINE IN PARALLEL ###
registerDoParallel(cores = 2)
v <- c(names(hflights), 'wday', 'delay')
f <- expression(ifelse(hflights\$DayOfWeek %in% c(6, 7), 'weekend', 'weekday'),
hflights\$ArrDelay + hflights\$DepDelay)
df <- foreach(fn = iter(f), .combine = mutate, .init = hflights) %dopar% {
eval(fn)
}
names(df) <- v
},
data.table = {
### DATA.TABLE ###
data.table(hflights)[, c("wday", "delay") := list(ifelse(hflights\$DayOfWeek %in% c(6, 7), 'weekend', 'weekday'), hflights\$ArrDelay + hflights\$DepDelay)]
}
)

#         test replications elapsed relative user.self sys.self user.child
# 4    foreach           10   1.442    1.000     0.240    0.144      0.848
# 2     within           10   0.667    2.783     0.668    0.000      0.000
# 3     mutate           10   0.679    2.833     0.680    0.000      0.000
# 5 data.table           10   0.955    3.983     0.956    0.000      0.000
# 1  transform           10   1.732    7.200     1.728    0.000      0.000
```

Written by statcompute

October 31, 2016 at 12:14 am

Posted in Big Data, S+/R

Tagged with ,

## Flavors of SQL on Pandas DataFrame

In R, sqldf() provides a convenient interface of running SQL statement on data frames. Similarly, Python also offers multiple ways to interact between SQL and Pandas DataFrames by leveraging the lightweight SQLite engine. While pandasql (https://github.com/yhat/pandasql) works similarly to sqldf() in R, pysqldf (https://github.com/airtoxin/pysqldf) is even more powerful. In my experiments shown below, advantages of pysqldf over pandasql are two-fold. First of all, pysqldf is 2 – 3 times faster than pandasql. Secondly, pysqldf supports new function definitions, which is not available in pandasql. However, it is worth mentioning that the generic python interface to an in-memory SQLite database can be more efficient and flexible than both pysqldf and pandasql, as demonstrated below, as long as we are able to get the DataFrame into the SQLite and let it stay in-memory.

```from sqlite3 import connect
import pandasql
import pysqldf
import numpy

# CREATE AN IN-MEMORY SQLITE DB
con = connect(":memory:")
cur = con.cursor()
cur.execute("attach 'my.db' as filedb")
cur.execute("create table df as select * from filedb.hflights")
cur.execute("detach filedb")

# IMPORT SQLITE TABLE INTO PANDAS DF
df = read_sql_query("select * from df", con)

# WRITE QUERIES
sql01 = "select * from df where DayofWeek = 1 and Dest = 'CVG';"
sql02 = "select DayofWeek, AVG(ArrTime) from df group by DayofWeek;"
sql03 = "select DayofWeek, median(ArrTime) from df group by DayofWeek;"

# SELECTION:
# 1. PANDASQL
%time t11 = pandasql.sqldf(sql01, globals())
# 2. PYSQLDF
%time t12 = pysqldf.SQLDF(globals()).execute(sql01)
# 3. GENERIC SQLITE CONNECTION

# AGGREGATION:
# 1. PANDASQL
%time t21 = pandasql.sqldf(sql02, globals())
# 2. PYSQLDF
%time t22 = pysqldf.SQLDF(globals()).execute(sql02)
# 3. GENERIC SQLITE CONNECTION

# DEFINING A NEW FUNCTION:
# DEFINE A FUNCTION NOT SUPPORTED IN SQLITE
class median(object):
def __init__(self):
self.a = []
def step(self, x):
self.a.append(x)
def finalize(self):
return numpy.median(self.a)

# 1. PYSQLDF
udafs = {"median": median}
%time t31 = pysqldf.SQLDF(globals(), udafs = udafs).execute(sql03)
# 2 GENERIC SQLITE CONNECTION
con.create_aggregate("median", 1, median)