Do We Really Need Dataframe in Clojure?

During my learning of Clojure, I was deeply impressed by the richness of data structures and the built-in support for concurrency, which makes it a perfect programming language for the data science. If you simply compare between R base and Clojure core functions, Clojure wins hands down!

At the beginning, a question that I kept asking myself was why there isn’t something in Clojure similar to R data frame or Python DataFrame. I had experimented Incanter API that is currently dying and was disappointed by its performance. However, after diving deeper, I started wondering whether we really need any additional data structure, such as something similar to the dataframe, just for the purpose of data munging.

Below is an example showing how to aggregate and query data with generic Clojure data structures, e.g. lazyseq and map, and core functions. For the time being, I am somewhat convinced that the life is still good, even without dataframe.


(require '[clojure.pprint :as p]
         '[clojure.java.jdbc :as j])

(def db
  {:classname   "org.sqlite.JDBC"
   :subprotocol "sqlite"
   :subname     "/home/liuwensui/Downloads/chinook.db"})

(pprint (j/query db "select * from invoices limit 1;"))

; ({:invoiceid 1,
;  :customerid 2,
;  :invoicedate "2009-01-01 00:00:00",
;  :billingaddress "Theodor-Heuss-Straße 34",
;  :billingcity "Stuttgart",
;  :billingstate nil,
;  :billingcountry "Germany",
;  :billingpostalcode "70174",
;  :total 1.98})

(def inv (j/query db "select * from invoices;"))

;; AGGREGATE INVOICE TOTAL BY COUNTRIES
(def country_sum 
  (map (fn [[billingcountry total]]
    {:billiingcountry billingcountry :total (reduce + (map :total total))})
    (group-by :billingcountry inv)))

;; TOP 5 COUNTRIES BY INVOICE AMOUNTS 
(p/print-table (take 5 (reverse (sort-by :total country_sum))))

; | :billiingcountry |             :total |
; |------------------+--------------------|
; |              USA |  523.0600000000003 |
; |           Canada |  303.9599999999999 |
; |           France | 195.09999999999994 |
; |           Brazil | 190.09999999999997 |
; |          Germany |             156.48 |

;; SELECT ROWS BY CRITERIA, E.G. US ORDERS BETWEEN $10 AND $12
(def us_inv (filter #(and (= (:billingcountry %) "USA") (< 10 (:total %) 12)) inv))

;; LIST ORDERS MEETING CRITERIA
(pprint us_inv)

;({:invoiceid 298,
;  :customerid 17,
;  :invoicedate "2012-07-31 00:00:00",
;  :billingaddress "1 Microsoft Way",
;  :billingcity "Redmond",
;  :billingstate "WA",
;  :billingcountry "USA",
;  :billingpostalcode "98052-8300",
;  :total 10.91}
; {:invoiceid 311,
;  :customerid 28,
;  :invoicedate "2012-09-28 00:00:00",
;  :billingaddress "302 S 700 E",
;  :billingcity "Salt Lake City",
;  :billingstate "UT",
;  :billingcountry "USA",
;  :billingpostalcode "84102",
;  :total 11.94})

;; SELECT COLUMNS, E.G. STATES AND CITIES
(p/print-table (map #(select-keys % [:invoiceid :billingcountry :billingstate :billingcity]) us_inv))

; | :invoiceid | :billingcountry | :billingstate |   :billingcity |
; |------------+-----------------+---------------+----------------|
; |        298 |             USA |            WA |        Redmond |
; |        311 |             USA |            UT | Salt Lake City |