Yet Another Blog in Statistical Computing

"Did you always know?" "No, I didn't. But I believed."

Import All Text Files in A Folder with Parallel Execution

Sometimes we need to import all files with the same data layout, e.g. *.txt, from a folder without knowing each file name, and then combine all the pieces together. The traditional approach is to use the lapply() and do.call() functions to accomplish the task. However, when there are many such files and each file is large, this approach can be computationally expensive.

With the foreach package, we are able to split the data import task into pieces and distribute them to multiple CPUs for parallel execution, as shown in the code below.

setwd('../data/csdata')
files <- list.files(pattern = "[.]txt$")

# load packages and register the parallel backend up front,
# so that the one-time setup cost is not counted in the benchmark
library(foreach)
library(doParallel)
registerDoParallel(cores = 4)

library(rbenchmark)
benchmark(replications = 10, order = "user.self",
  # sequential: read each file into a list, then stack the pieces
  LAPPLY = {
    read.all <- lapply(files, read.table, header = TRUE)
    data1 <- do.call(rbind, read.all)
  },
  # parallel: each worker reads one file; .combine = rbind stacks the results
  FOREACH = {
    data2 <- foreach(i = files, .combine = rbind) %dopar% read.table(i, header = TRUE)
  }
)
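As a side note, registerDoParallel(cores = 4) assumes that four cores are available and leaves an implicit cluster behind. A minimal sketch below shows the more explicit pattern of creating a cluster, registering it, and shutting it down after the work is done; makeCluster() and stopCluster() come from the parallel package, which doParallel loads, and the worker count of 4 is an assumption that should be adjusted to the machine at hand.

library(foreach)
library(doParallel)

cl <- makeCluster(4)    # launch 4 local worker processes (an assumed count)
registerDoParallel(cl)  # point %dopar% at this cluster
getDoParWorkers()       # sanity check: should return 4

data3 <- foreach(i = files, .combine = rbind) %dopar% read.table(i, header = TRUE)

stopCluster(cl)         # release the workers when finished

Since the workers of a socket cluster run in separate R processes, building files with list.files(full.names = TRUE) is the safer choice.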

# all.equal() from base R confirms that both methods produce the same data.frame
all.equal(data1, data2)

From the output below, both methods generate identical data.frames, but the foreach() method is roughly 1.6 times faster in elapsed time than the lapply() method due to the parallelism.

     test replications elapsed relative user.self sys.self user.child sys.child
2 FOREACH           10   0.689    1.000     0.156    0.076      1.088     0.308
1  LAPPLY           10   1.078    1.565     1.076    0.004      0.000     0.000

[1] TRUE
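
On Linux or Mac, a similar speed-up can also be had without the foreach machinery by using mclapply() from the parallel package shipped with base R, which forks the current session. Below is a minimal sketch under the assumption of 4 usable cores; note that mc.cores must be 1 on Windows, where mclapply() degenerates to lapply().

library(parallel)
# fork 4 worker processes to read the files, then stack the pieces as before
read.all <- mclapply(files, read.table, header = TRUE, mc.cores = 4)
data4 <- do.call(rbind, read.all)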

Written by statcompute

May 26, 2013 at 11:55 pm

Posted in S+/R
