Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

H2O Benchmark for CSV Import

The importFile() function in H2O is extremely efficient due to the parallel reading. The benchmark comparison below shows that it is comparable to the read.df() in SparkR and significantly faster than the generic read.csv().

library(SparkR, lib.loc = paste(Sys.getenv("SPARK_HOME"), "/R/lib", sep = ""))
sc <- sparkR.session(master = "local", sparkConfig = list(spark.driver.memory = "10g", spark.driver.cores = "4"))

library(h2o)
h2o.init(max_mem_size = "10g")

library(rbenchmark)

benchmark(replications = 5, order = "elapsed", relative = "elapsed",
   csv = {
          df <- read.csv("Documents/nycflights13.csv")
          print(nrow(df))
          rm(df)
         },
   spk = {
          df <- read.df("Documents/nycflights13.csv", source = "csv", header = "true", inferSchema = "true")
          print(nrow(df))
          rm(df)
         },
   h2o = {
          df <- h2o.importFile(path = "Documents/nycflights13.csv", header = TRUE, sep = ",")
          print(nrow(df))
          rm(df)
         }
 )

#   test replications elapsed relative user.self sys.self user.child sys.child
# 3  h2o            5   8.221    1.000     0.508    0.032          0         0
# 2  spk            5   9.822    1.195     0.008    0.004          0         0
# 1  csv            5  16.595    2.019    16.420    0.176          0         0
Advertisements

Written by statcompute

June 26, 2017 at 1:24 am

Posted in Big Data, H2O, S+/R, Spark

Tagged with , , ,

%d bloggers like this: