In the previous post (https://statcompute.wordpress.com/2018/09/03/playing-map-and-reduce-in-r-by-group-calculation), I’ve shown how to employ the MapReduce when calculating by-group statistics. Actually, the same Divide-n-Conquer strategy can be applicable to other use cases, one of which is the subsetting operation.

In the example below, let’s still use the same iris data for the demonstration purpose. In R, the most convenient way to perform the subsetting might be the subset() function, which would search for rows meeting the condition described in the “expr” expression below throughout the entire data.frame.

data(iris) expr = expression(Sepal.Length > 7 & Sepal.Width > 3) subset(iris, eval(expr))

With the whole data.frame partitioned into multiple pieces, the row searching operation can perfectly fit into the MapReduce paradigm, as described in the logic flow below.

1. First of all, the iris data is divided into chunks with equal number of rows, e.g. two chunks in the example.

2. Next, a Map() function is used to perform the row searching operation within each chunk.

3. Upon the return of rows meeting the criteria from each chunk, a Reduce() function is employed to combine all outcomes together.

n <- 2 lst <- split(iris, sort((1:nrow(iris)) %% n)) Reduce(rbind, Map(function(x) x[with(x, which(eval(expr))), ], lst))

It is noted that the above map operation is still performed sequentially without leveraging the computing power of multiple CPUs. In the CPU usage, we can see that only one CPU is used and the rest are idle.

Similar to the by-group summary, the by-chunk operation of row searching doesn’t have to be in the sequential order and can be distributed simultaneously across multiple CPUs by using the mcMap() function, as outlined below.

1. Again, it starts with the data partition. However, there are two caveats in the example. Firstly, the data is split based upon the number of CPUs captured by the detectCores() function. Secondly, the partitioned data is NOT stored physically in the memory but reflected logically by a list of future abstractions, e.g. “flst” in the code snippet.

2. In the second step, the mcMap() function is used to evaluate each future abstraction, return the partitioned data, and then perform the row searching within each chunk.

3. At last, the Reduce() function collects and combines all outcomes.

pkgs <- c("parallel", "future") mapply(function(x) require(x, character.only = T), pkgs) n <- detectCores() flst <- Map(function(x) future({x}), split(iris, sort((1:nrow(iris)) %% n))) Reduce(rbind, mcMap(function(x) value(x)[with(value(x), which(eval(expr))), ], flst, mc.cores = n))

If we take a look at the CPU usage again, it is now shown that all CPUs are utilized.