Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton


Kick Off Spark

My first Spark session:


scala> import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SQLContext

scala> val sdf = spark.read.option("header", true).csv("Documents/spark/credit_count.txt")
sdf: org.apache.spark.sql.DataFrame = [CARDHLDR: string, DEFAULT: string ... 12 more fields]

scala> sdf.printSchema()
root
 |-- CARDHLDR: string (nullable = true)
 |-- DEFAULT: string (nullable = true)
 |-- AGE: string (nullable = true)
 |-- ACADMOS: string (nullable = true)
 |-- ADEPCNT: string (nullable = true)
 |-- MAJORDRG: string (nullable = true)
 |-- MINORDRG: string (nullable = true)
 |-- OWNRENT: string (nullable = true)
 |-- INCOME: string (nullable = true)
 |-- SELFEMPL: string (nullable = true)
 |-- INCPER: string (nullable = true)
 |-- EXP_INC: string (nullable = true)
 |-- SPENDING: string (nullable = true)
 |-- LOGSPEND : string (nullable = true)

scala> sdf.createOrReplaceTempView("tmp1")

scala> spark.sql("select count(*) as obs from tmp1").show()
+-----+
|  obs|
+-----+
|13444|
+-----+

A PySpark session doing the same thing:


In [1]: import pyspark as spark

In [2]: sc = spark.SQLContext(spark.SparkContext())

In [3]: sdf = sc.read.csv("Documents/spark/credit_count.txt", header = True)

In [4]: sdf.printSchema()
root
 |-- CARDHLDR: string (nullable = true)
 |-- DEFAULT: string (nullable = true)
 |-- AGE: string (nullable = true)
 |-- ACADMOS: string (nullable = true)
 |-- ADEPCNT: string (nullable = true)
 |-- MAJORDRG: string (nullable = true)
 |-- MINORDRG: string (nullable = true)
 |-- OWNRENT: string (nullable = true)
 |-- INCOME: string (nullable = true)
 |-- SELFEMPL: string (nullable = true)
 |-- INCPER: string (nullable = true)
 |-- EXP_INC: string (nullable = true)
 |-- SPENDING: string (nullable = true)
 |-- LOGSPEND : string (nullable = true)

In [5]: sdf.createOrReplaceTempView("tmp1")

In [6]: sc.sql("select count(*) as obs from tmp1").show()
+-----+
|  obs|
+-----+
|13444|
+-----+
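
Both schemas above list every column as a string, and the last column name, "LOGSPEND ", carries a trailing space. Below is a minimal PySpark sketch of one way to handle both, assuming Spark 2.x or later. The helper names strip_names and load_credit_data and the application name are made up for illustration and are not part of the sessions above.

```python
def strip_names(columns):
    """Return the column names with surrounding whitespace removed."""
    return [name.strip() for name in columns]


def load_credit_data(path):
    """Read a CSV file with inferred column types and tidied column names."""
    # Imported inside the function so strip_names stays usable without Spark.
    from pyspark.sql import SparkSession

    # Hypothetical application name; getOrCreate() reuses a running session.
    spark = SparkSession.builder.appName("kick-off-spark").getOrCreate()

    # inferSchema=True asks Spark to scan the data and guess each column's
    # type instead of reading everything as a string.
    df = spark.read.csv(path, header=True, inferSchema=True)

    # toDF(*names) returns a new DataFrame with the columns renamed in order.
    return df.toDF(*strip_names(df.columns))
```

Calling load_credit_data("Documents/spark/credit_count.txt").printSchema() should then report numeric types for the columns that hold only numbers, and the last column should appear as LOGSPEND without the trailing space.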


Written by statcompute

May 29, 2017 at 1:34 am

Posted in Pyspark, PYTHON, Spark
