Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

Data Aggregation with PySpark

Import CSV File into Spark Dataframe

import pyspark as spark

sc = spark.SQLContext(spark.SparkContext())

sdf1 = sc.read.csv("Documents/nycflights13.csv", header = True, inferSchema = True)

Data Aggregation with Spark Dataframe

import pyspark.sql.functions as fn

sdf1.cache() \
    .filter("month in (1, 3, 5)") \
    .groupby("month") \
    .agg(fn.mean("dep_time").alias("avg_dep"), fn.mean("arr_time").alias("avg_arr")) \
    .show()

+-----+------------------+------------------+
|month|           avg_dep|           avg_arr|
+-----+------------------+------------------+
|    1| 1347.209530642299|1523.1545262203415|
|    3|1359.4997676330747|1509.7429767741473|
|    5|1351.1682074168525|1502.6846604007803|
+-----+------------------+------------------+

Data Aggregation with Spark SQL

sc.registerDataFrameAsTable(sdf1, "tbl1")

sc.sql("select month, avg(dep_time) as avg_dep, avg(arr_time) as avg_arr from tbl1 where month in (1, 3, 5) group by month").show()

sc.dropTempTable(sc.tableNames()[0])

+-----+------------------+------------------+
|month|           avg_dep|           avg_arr|
+-----+------------------+------------------+
|    1| 1347.209530642299|1523.1545262203415|
|    3|1359.4997676330747|1509.7429767741473|
|    5|1351.1682074168525|1502.6846604007803|
+-----+------------------+------------------+
Advertisements

Written by statcompute

June 1, 2017 at 1:02 am

Posted in Pyspark, PYTHON, Spark

Tagged with ,

%d bloggers like this: