I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

## Efficiency in Joining Two Data Frames

In R, there are multiple ways to merge 2 data frames. However, there could be a huge disparity in terms of efficiency. Therefore, it is worthwhile to test the performance among different methods and choose the correct approach in the real-world work.

For smaller data frames with 1,000 rows, all six methods shown below seem to work pretty well except that the approach with sql.select() is significantly slower than the rest. The generic merge() function in the base package is a very natural choice without much overhead of loading additional libraries and converting data frame. sqldf() is also attractive in that it might be the most user-friendly function with a very intuitive syntax.

```> n <- 1000
> set.seed(2013)
> ldf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE), x1 = rnorm(n), x2 = runif(n))
> rdf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE), y1 = rnorm(n), y2 = runif(n))
>
> # METHOD 1: MERGE
> system.time(join1 <- merge(ldf, rdf, by = c("id1", "id2")))
user  system elapsed
0.032   0.012   0.064
>
> # METHOD 2: DATA.TABLE
> ldt <- data.table::data.table(ldf, key = c("id1", "id2"))
> rdt <- data.table::data.table(rdf, key = c("id1", "id2"))
> system.time(join2 <- merge(ldt, rdt, by = c("id1", "id2")))
user  system elapsed
0.028   0.000   0.044
>
> # METHOD 3: FF
> lff <- ff::as.ffdf(ldf)
> rff <- ff::as.ffdf(rdf)
> system.time(join3 <- merge(lff, rff, by = c("id1", "id2")))
user  system elapsed
0.044   0.004   0.096
>
> # METHOD 4: SQLDF
> system.time(join4 <- sqldf::sqldf(c("create index ldx on ldf(id1, id2)",
+                                     "select * from main.ldf inner join rdf on ldf.id1 = rdf.id1 and ldf.id2 = rdf.id2")))
user  system elapsed
0.168   0.008   0.332
>
> # METHOD 5: PLYR
> system.time(join5 <- plyr::join(ldf, rdf, by = c("id1", "id2"), type = "inner"))
user  system elapsed
0.088   0.020   0.152
>
> # METHOD 6: SQL.SELECT
> system.time(join6 <- sql.select("select * from ldf inner join rdf on (`ldf\$id1 == rdf\$id1 & ldf\$id2 == rdf\$id2`)"))
user  system elapsed
53.775  19.725  73.813
```

However, when it comes to mid-size data frames with 1,000,000 rows, the story has changed. First of all, out of six methods shown above, the last two fails directly due to the insufficient memory size in my 32-bit ubuntu virtual machine. In this case, data.table package shows a significant advantage after converting 2 data.frames to data.tables. In additional, it is interesting that although ff and sqldf packages are slower than merge() function for the small-size data with 1,000 rows, both of them seem slightly faster for the data with 1,000,000 rows.

```> n <- 1000 ^ 2
> set.seed(2013)
> ldf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE), x1 = rnorm(n), x2 = runif(n))
> rdf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE), y1 = rnorm(n), y2 = runif(n))
>
> # METHOD 1: MERGE
> system.time(join1 <- merge(ldf, rdf, by = c("id1", "id2")))
user  system elapsed
55.223  12.437  68.054
>
> # METHOD 2: DATA.TABLE
> ldt <- data.table::data.table(ldf, key = c("id1", "id2"))
> rdt <- data.table::data.table(rdf, key = c("id1", "id2"))
> system.time(join2 <- merge(ldt, rdt, by = c("id1", "id2")))
user  system elapsed
0.484   0.008   0.492
>
> # METHOD 3: FF
> lff <- ff::as.ffdf(ldf)
> rff <- ff::as.ffdf(rdf)
> system.time(join3 <- merge(lff, rff, by = c("id1", "id2")))
user  system elapsed
49.811  13.821  64.004
>
> # METHOD 4: SQLDF
> system.time(join4 <- sqldf::sqldf(c("create index ldx on ldf(id1, id2)",
+                                     "select * from main.ldf inner join rdf on ldf.id1 = rdf.id1 and ldf.id2 = rdf.id2")))
user  system elapsed
40.418   1.268  42.076
```