Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

Posts Tagged ‘Pyspark

Data Aggregation with PySpark

Import CSV File into Spark Dataframe

import pyspark as spark

sc = spark.SQLContext(spark.SparkContext())

sdf1 = sc.read.csv("Documents/nycflights13.csv", header = True, inferSchema = True)

Data Aggregation with Spark Dataframe

import pyspark.sql.functions as fn

sdf1.cache() \
    .filter("month in (1, 3, 5)") \
    .groupby("month") \
    .agg(fn.mean("dep_time").alias("avg_dep"), fn.mean("arr_time").alias("avg_arr")) \
    .show()

+-----+------------------+------------------+
|month|           avg_dep|           avg_arr|
+-----+------------------+------------------+
|    1| 1347.209530642299|1523.1545262203415|
|    3|1359.4997676330747|1509.7429767741473|
|    5|1351.1682074168525|1502.6846604007803|
+-----+------------------+------------------+

Data Aggregation with Spark SQL

sc.registerDataFrameAsTable(sdf1, "tbl1")

sc.sql("select month, avg(dep_time) as avg_dep, avg(arr_time) as avg_arr from tbl1 where month in (1, 3, 5) group by month").show()

sc.dropTempTable(sc.tableNames()[0])

+-----+------------------+------------------+
|month|           avg_dep|           avg_arr|
+-----+------------------+------------------+
|    1| 1347.209530642299|1523.1545262203415|
|    3|1359.4997676330747|1509.7429767741473|
|    5|1351.1682074168525|1502.6846604007803|
+-----+------------------+------------------+
Advertisements

Written by statcompute

June 1, 2017 at 1:02 am

Posted in Pyspark, PYTHON, Spark

Tagged with ,

Kick Off Spark

My first Spark section:


scala> import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SQLContext

scala> val sdf = spark.read.option("header", true).csv("Documents/spark/credit_count.txt")
sdf: org.apache.spark.sql.DataFrame = [CARDHLDR: string, DEFAULT: string ... 12 more fields]

scala> sdf.printSchema()
root
 |-- CARDHLDR: string (nullable = true)
 |-- DEFAULT: string (nullable = true)
 |-- AGE: string (nullable = true)
 |-- ACADMOS: string (nullable = true)
 |-- ADEPCNT: string (nullable = true)
 |-- MAJORDRG: string (nullable = true)
 |-- MINORDRG: string (nullable = true)
 |-- OWNRENT: string (nullable = true)
 |-- INCOME: string (nullable = true)
 |-- SELFEMPL: string (nullable = true)
 |-- INCPER: string (nullable = true)
 |-- EXP_INC: string (nullable = true)
 |-- SPENDING: string (nullable = true)
 |-- LOGSPEND : string (nullable = true)

scala> sdf.createOrReplaceTempView("tmp1")

scala> spark.sql("select count(*) as obs from tmp1").show()
+-----+
|  obs|
+-----+
|13444|
+-----+

Pyspark section doing the same thing:


In [1]: import pyspark as spark

In [2]: sc = spark.SQLContext(spark.SparkContext())

In [3]: sdf = sc.read.csv("Documents/spark/credit_count.txt", header = True)

In [4]: sdf.printSchema()
root
 |-- CARDHLDR: string (nullable = true)
 |-- DEFAULT: string (nullable = true)
 |-- AGE: string (nullable = true)
 |-- ACADMOS: string (nullable = true)
 |-- ADEPCNT: string (nullable = true)
 |-- MAJORDRG: string (nullable = true)
 |-- MINORDRG: string (nullable = true)
 |-- OWNRENT: string (nullable = true)
 |-- INCOME: string (nullable = true)
 |-- SELFEMPL: string (nullable = true)
 |-- INCPER: string (nullable = true)
 |-- EXP_INC: string (nullable = true)
 |-- SPENDING: string (nullable = true)
 |-- LOGSPEND : string (nullable = true)

In [5]: sdf.createOrReplaceTempView("tmp1")

In [6]: sc.sql("select count(*) as obs from tmp1").show()
+-----+
|  obs|
+-----+
|13444|
+-----+

Written by statcompute

May 29, 2017 at 1:34 am

Posted in Pyspark, PYTHON, Spark

Tagged with ,