Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

Updating Column Values in Clojure Map


(require '[huri.core :as h]
         '[clojure.core.matrix.dataset :as d]
         '[incanter.core :as i])

(def ds [{:id 1.0 :name "name1"}
         {:id 2.0 :name "name2"}
         {:id 3.0 :name "name3"}])

;; UPDATE THE :NAME COLUMN IN THE DATASET
;; - IF THE VALUE IS NOT "name2", THEN CHANGE TO "not 2"
;;
;; EXPECTED OUTPUT:
;; | :id | :name |
;; |-----+-------|
;; | 1.0 | not 2 |
;; | 2.0 | name2 |
;; | 3.0 | not 2 |

;; WITH CLOJURE.CORE/UPDATE
(def d1 (map (fn [x] (update x :name #(if (= "name2" %) % "not 2"))) ds))

;; WITH CLOJURE.CORE/UPDATE-IN
(def d2 (map (fn [x] (update-in x [:name] #(if (= "name2" %) % "not 2"))) ds))

;; WITH HURI/UPDATE-COLS
(def d3 (h/update-cols {:name #(if (= "name2" %) % "not 2")} ds))

;; WITH MATRIX.DATASET/EMAP-COLUMN
(def d4 (-> ds
            (d/dataset)
            (d/emap-column :name #(if (= "name2" %) % "not 2"))
            ((comp #(map into %) d/row-maps))))
   
;; WITH INCANTER/TRANSFORM-COL
(def d5 (-> ds
            (i/to-dataset)
            (i/transform-col :name #(if (= "name2" %) % "not 2"))
            ((comp #(map into %) second vals))))


Written by statcompute

April 22, 2018 at 5:23 pm

Adding New Columns to Clojure Map


(require '[huri.core :as h]
         '[clojure.core.matrix.dataset :as d]
         '[incanter.core :as i])

(def ds [{:id 1.0 :name "name1"}
         {:id 2.0 :name "name2"}
         {:id 3.0 :name "name3"}])

;; ADD 2 COLUMNS TO THE DATASET
;; - ADD 2 TO :id AND NAME THE RESULT :add2
;; - FLAG WHETHER :name = "name2" ("Y" / "N") AND NAME THE RESULT :name2
;;
;; EXPECTED OUTPUT:
;;| :id | :name | :add2 | :name2 |
;;|-----+-------+-------+--------|
;;| 1.0 | name1 |   3.0 |      N |
;;| 2.0 | name2 |   4.0 |      Y |
;;| 3.0 | name3 |   5.0 |      N |

;; WITH PLAIN CLOJURE
;; #1 - MERGE
(def d1 (map #(merge % {:add2 (+ (:id %) 2) 
                        :name2 (if (= "name2" (:name %)) "Y" "N")}) ds))

;; #2 - MERGE-WITH
(def d2 (map #(merge-with into % {:add2 (+ (:id %) 2)
                                  :name2 (if (= "name2" (:name %)) "Y" "N")}) ds))

;; #3 - ASSOC
(def d3 (map #(assoc % :add2 (+ (:id %) 2) 
                       :name2 (if (= "name2" (:name %)) "Y" "N")) ds))

;; #4 - CONJ
(def d4 (map #(conj % {:add2 (+ (:id %) 2)
                       :name2 (if (= "name2" (:name %)) "Y" "N")}) ds))

;; #5 - CONCAT 
(def d5 (map #(into {} (concat % {:add2 (+ (:id %) 2)
                                  :name2 (if (= "name2" (:name %)) "Y" "N")})) ds))

;; WITH HURI 
(def d6 (h/derive-cols {:name2 [#(if (= "name2" %) "Y" "N") :name] 
                        :add2 [#(+ 2  %) :id]} ds))

;; WITH CORE.MATRIX API
(def d7 (-> ds
            (d/dataset)
            (d/add-column :add2 (map #(+ 2 %) (map :id ds)))
            (d/add-column :name2 (map #(if (= "name2" %) "Y" "N") (map :name ds)))
            (d/row-maps)))

;; WITH INCANTER API
(def d8 (->> ds
             (i/to-dataset)
             (i/add-derived-column :add2 [:id] #(+ 2 %))
             (i/add-derived-column :name2 [:name] #(if (= "name2" %) "Y" "N"))
             ((comp second vals))))

;; CHECK THE DATA EQUALITY
(= d1 d2 d3 d4 d5 d6 d7 d8)
;; true

Written by statcompute

April 21, 2018 at 4:02 pm

LogRatio Regression – A Simple Way to Model Compositional Data

Compositional data are proportions of mutually exclusive groups that sum to one. Statistical models for compositional data apply in a number of areas, e.g. the product or channel mix in marketing research and the asset allocation of an investment portfolio.

In the example below, I will show how to model compositional outcomes with a simple LogRatio regression. The underlying idea is very simple. With the D-dimensional outcome [p_1, p_2, …, p_D], we can take p_1 as the reference group, derive a (D-1)-dimensional outcome [log(p_2 / p_1), …, log(p_D / p_1)], and then estimate a multivariate regression based on the new outcome.
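
Written out, the transformation described above might look as follows. This is only a sketch: the symbols y_j, x, \beta_{0j}, \beta_{1j} and \varepsilon_j are my own notation and are not used elsewhere in the post, and x stands for a single covariate such as depth.

```latex
% Log-ratio transform with p_1 as the reference component
y_j = \log\frac{p_j}{p_1}, \qquad j = 2, \dots, D

% One linear equation per transformed component (assumed notation)
y_j = \beta_{0j} + \beta_{1j}\, x + \varepsilon_j, \qquad j = 2, \dots, D
```

With D = 3 components (sand, silt, clay) and sand as the reference, this gives two equations, one for log(silt / sand) and one for log(clay / sand).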

df = get("ArcticLake", envir = asNamespace('DirichletReg'))

#   sand  silt  clay depth
#1 0.775 0.195 0.030  10.4
#2 0.719 0.249 0.032  11.7
#3 0.507 0.361 0.132  12.8

# summary() prints the per-response coefficient tables shown below
summary(lm(cbind(log(silt / sand), log(clay / sand)) ~ depth, data = df))

#Response log(silt/sand):
#Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
#(Intercept) -0.649656   0.236733  -2.744   0.0093 **
#depth        0.037522   0.004269   8.790 1.36e-10 ***
#
#Response log(clay/sand) :
#Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
#(Intercept) -2.614897   0.421383  -6.206 3.31e-07 ***
#depth        0.062181   0.007598   8.184 8.00e-10 ***

Since log(x / y) = log(x) - log(y), we can also estimate the model with log(sand) as an offset term.


# summary() prints the per-response coefficient tables shown below
summary(lm(cbind(log(silt), log(clay)) ~ depth + offset(log(sand)), data = df))

#Response log(silt) :
#Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
#(Intercept) -0.649656   0.236733  -2.744   0.0093 **
#depth        0.037522   0.004269   8.790 1.36e-10 ***
#
#Response log(clay) :
#Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
#(Intercept) -2.614897   0.421383  -6.206 3.31e-07 ***
#depth        0.062181   0.007598   8.184 8.00e-10 ***
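
The reason the offset version reproduces the same estimates can be sketched in one line. The notation here is assumed for illustration (p_j is a non-reference component, p_1 the reference, x the covariate): an offset enters the linear predictor with its coefficient fixed at 1, so moving it to the left-hand side recovers the log-ratio model.

```latex
% Offset model: log p_1 enters with a coefficient fixed at 1
\log p_j = \beta_{0j} + \beta_{1j}\, x + 1 \cdot \log p_1 + \varepsilon_j

% Subtracting log p_1 from both sides gives the log-ratio model
\log p_j - \log p_1 = \log\frac{p_j}{p_1} = \beta_{0j} + \beta_{1j}\, x + \varepsilon_j
```

Both forms share the same residuals, which is consistent with the identical coefficients and standard errors in the two outputs.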

Alternatively, we can use the comp.reg function in the Compositional package.


Compositional::comp.reg(as.matrix(df[, 1:3]), df[, 4])

#$be
#                   [,1]        [,2]
#(Intercept) -0.64965598 -2.61489731
#x            0.03752186  0.06218069
#
#$seb
#                   [,1]        [,2]
#(Intercept) 0.236733203 0.421382652
#x           0.004268588 0.007598043
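
To read fitted values on the proportion scale, the log-ratios can be mapped back to a composition. The sketch below uses assumed notation (\hat{y}_j for a fitted log-ratio against the reference p_1) and is not part of the output of any function used in this post.

```latex
% Inverse of the log-ratio transform: fitted proportions sum to one
\hat{p}_1 = \frac{1}{1 + \sum_{k=2}^{D} e^{\hat{y}_k}},
\qquad
\hat{p}_j = \frac{e^{\hat{y}_j}}{1 + \sum_{k=2}^{D} e^{\hat{y}_k}}, \quad j = 2, \dots, D

% Worked arithmetic with the coefficients reported above, at depth = 10.4
\hat{y}_{\text{silt}} = -0.649656 + 0.037522 \times 10.4 \approx -0.2594,
\qquad
\hat{y}_{\text{clay}} = -2.614897 + 0.062181 \times 10.4 \approx -1.9682

(\hat{p}_{\text{sand}},\ \hat{p}_{\text{silt}},\ \hat{p}_{\text{clay}}) \approx (0.523,\ 0.404,\ 0.073)
```

The numbers in the last two lines are my own hand arithmetic from the printed coefficients, rounded to three decimals, so treat them as approximate.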

Written by statcompute

April 15, 2018 at 9:04 pm

Transpose in Clojure


(require '[huri.core :as h]
         '[clojure.core.matrix.dataset :as d]
         '[incanter.core :as i])

;; FROM MAP OF ROWS TO MAP OF COLUMNS

(def byRow [{:x 1 :y "a"}
            {:x 2 :y "b"}
            {:x 3 :y "c"}])

;; APPROACH #1 - PLAIN CLOJURE
(zipmap (keys (first byRow)) (apply map list (map vals byRow)))

; {:x (1 2 3), :y ("a" "b" "c")}

;; APPROACH #2 - HURI LIBRARY
(h/col-oriented byRow)

; {:x (1 2 3), :y ("a" "b" "c")}

;; APPROACH #3 - CORE.MATRIX LIBRARY
(d/to-map (d/dataset (keys (first byRow)) byRow))

; {:x [1 2 3], :y ["a" "b" "c"]}

;; APPROACH #4 - INCANTER LIBRARY
(i/to-map (i/to-dataset byRow))

; {:x (1 2 3), :y ("a" "b" "c")}

;; FROM MAP OF COLUMNS TO MAP OF ROWS

(def byCol {:x '(1 2 3)
            :y '("a" "b" "c")})

;; APPROACH #1 - PLAIN CLOJURE
(map #(zipmap (keys byCol) %) (apply map list (vals byCol)))

; ({:x 1, :y "a"} {:x 2, :y "b"} {:x 3, :y "c"})

;; APPROACH #2 - HURI LIBRARY
(h/row-oriented byCol)

; ({:x 1, :y "a"} {:x 2, :y "b"} {:x 3, :y "c"})

;; APPROACH #3 - CORE.MATRIX LIBRARY
(d/row-maps (d/dataset (keys byCol) byCol))

; [{:x 1, :y "a"} {:x 2, :y "b"} {:x 3, :y "c"}]

;; APPROACH #4 - INCANTER LIBRARY
(second (vals (i/dataset (keys byCol) (apply map list (vals byCol)))))

; ({:x 1, :y "a"} {:x 2, :y "b"} {:x 3, :y "c"})

Written by statcompute

April 13, 2018 at 10:28 pm

Clojure Integration with R


(require '[tnoda.rashinban :as rr]
         '[tnoda.rashinban.core :as rc]
         '[clojure.core.matrix.dataset :as dt]
         '[clojure.core.matrix.impl.dataset :as id])

;; CREATE A TOY DATA
(def ds [{:id 1.0 :name "name1"}
         {:id 2.0 :name "name2"}
         {:id 3.0 :name "name3"}])

;; RUN THE FOLLOWING R CODE IN ADVANCE TO START THE RSERVE SERVER:
;;   R -e 'library(Rserve)' -e 'Rserve(args = "--vanilla")'
;; IF YOU HAVE LITTLER INSTALLED, BELOW ALSO WORKS:
;;   r -e 'library(Rserve); Rserve(args = "--vanilla")'  
(rr/init)

;; PASS THE DATA FROM CLOJURE INTO R
;; run! FORCES THE SIDE EFFECTS EAGERLY; A LAZY map ONLY RUNS WHEN ITS RESULT IS CONSUMED
(run! (fn [x] (rr/<- (name (key x)) (val x)))
  (let [ks ((comp keys first) ds)] (zipmap ks (map #(map % ds) ks))))

(rr/<- 'header (map name ((comp keys first) ds)))
         
;; CREATE THE R DATA.FRAME         
(rc/eval "df = data.frame(lapply(header, as.name))")

;; TEST THE R DATA.FRAME
(rc/eval "df$id")
; [1.0 2.0 3.0]

(rc/eval "df$name")
; ["name1" "name2" "name3"]

;; CONVERT THE R DATA.FRAME BACK TO THE CLOJURE MAP
(def mp (into [] (map #(zipmap (map keyword (rr/colnames 'df)) %) 
                   (partition (count (rr/colnames 'df)) (apply interleave (rr/matrix 'df))))))

; [{:id 1.0, :name "name1"} {:id 2.0, :name "name2"} {:id 3.0, :name "name3"}]

;; TEST THE EQUALITY BETWEEN INPUT AND OUTPUT DATA
(= mp ds)
; true

;; ALTERNATIVELY, WE CAN ALSO CONVERT THE R DATA.FRAME TO A CLOJURE DATASET
(def dt (id/dataset-from-columns (map keyword (rr/colnames 'df)) (rr/matrix 'df)))

; #dataset/dataset {:column-names [:id :name], :columns [[1.0 2.0 3.0] ["name1" "name2" "name3"]], :shape [3 2]}

;; NEXT, CONVERT THE DATASET TO THE MAP
(def mp2 (dt/row-maps dt))

; [{:id 1.0, :name "name1"} {:id 2.0, :name "name2"} {:id 3.0, :name "name3"}]

(= ds mp2)
; true

Written by statcompute

April 11, 2018 at 7:47 pm

Posted in Big Data, clojure, S+/R, Statistics


Aggregation by Multiple Keys in Clojure


(require '[ultra-csv.core :refer [read-csv]]
         '[criterium.core :refer [quick-bench]]
         '[clojure.set :refer [index]])

(def ds (read-csv "/home/liuwensui/Downloads/nycflights.csv"))

;; FASTEST
(quick-bench
  (map
    (fn [x] {:year (first (key x))
             :month (last (key x))
             :flights (count (val x))})
      (group-by (juxt :year :month) ds)))      

;Evaluation count : 6 in 6 samples of 1 calls.
;             Execution time mean : 712.329182 ms
;    Execution time std-deviation : 3.832950 ms
;   Execution time lower quantile : 709.135737 ms ( 2.5%)
;   Execution time upper quantile : 718.651856 ms (97.5%)
;                   Overhead used : 11.694357 ns

;; WORKS FINE
(quick-bench
  (map
    (fn [x] {:year (:year (key x))
             :month (:month (key x))
             :flights (count (val x))})
      (group-by #(select-keys % [:year :month]) ds)))
      
;Evaluation count : 6 in 6 samples of 1 calls.
;             Execution time mean : 1.485215 sec
;    Execution time std-deviation : 9.832209 ms
;   Execution time lower quantile : 1.476116 sec ( 2.5%)
;   Execution time upper quantile : 1.500560 sec (97.5%)
;                   Overhead used : 11.694357 ns

;; SLOWEST
(quick-bench
  (map
    (fn [x] {:year (:year (key x))
             :month (:month (key x))
             :flights (count (val x))})
      (index ds [:year :month])))
      
;Evaluation count : 6 in 6 samples of 1 calls.
;             Execution time mean : 2.158245 sec
;    Execution time std-deviation : 11.208489 ms
;   Execution time lower quantile : 2.149538 sec ( 2.5%)
;   Execution time upper quantile : 2.175743 sec (97.5%)
;                   Overhead used : 11.694357 ns

Written by statcompute

April 8, 2018 at 4:21 pm

Posted in Big Data, clojure, Statistics


Inner and Outer Joins in Clojure


(require '[clojure.pprint :refer [print-table] :rename {print-table p}]
         '[clojure.set :as s]
         '[clojure.core.reducers :as r])

;; CREATE TOY DATASETS                 
(def ds1 [{:id 1 :name "name1"}
          {:id 2 :name "name2"}
          {:id 3 :name "name3"}])
          
(def ds2 [{:id 2 :address "addr2"}
          {:id 3 :address "addr3"}
          {:id 4 :address "addr4"}])

;; GET THE HEADER
(def ks ((comp distinct flatten) (map #((comp keys first) %) [ds1 ds2])))

;; INNER JOIN WITH SET/JOIN
(p ks
  (s/join ds1 ds2))

;| :id | :name | :address |
;|-----+-------+----------|
;|   3 | name3 |    addr3 |
;|   2 | name2 |    addr2 |

;; OUTER JOIN #1
(p ks (map #(apply merge %) (vals (group-by :id (concat ds1 ds2)))))

;| :id | :name | :address |
;|-----+-------+----------|
;|   1 | name1 |          |
;|   2 | name2 |    addr2 |
;|   3 | name3 |    addr3 |
;|   4 |       |    addr4 |

;; OUTER JOIN #2 -- AN EXAMPLE OF USING REDUCERS
(p ks (into () (r/map #(r/reduce merge %) (vals (s/index (s/union ds2 ds1) [:id])))))

;| :id | :name | :address |
;|-----+-------+----------|
;|   1 | name1 |          |
;|   4 |       |    addr4 |
;|   3 | name3 |    addr3 |
;|   2 | name2 |    addr2 |
 
;; OUTER JOIN #3 -- USE LET TO CREATE LOCAL VARIABLES
(p ks (let [z1 (zipmap (map :id ds1) ds1) 
            z2 (zipmap (map :id ds2) ds2)]
        (vals (merge-with merge z1 z2))))

;| :id | :name | :address |
;|-----+-------+----------|
;|   1 | name1 |          |
;|   2 | name2 |    addr2 |
;|   3 | name3 |    addr3 |
;|   4 |       |    addr4 |

Written by statcompute

April 8, 2018 at 2:22 pm

Posted in Big Data, clojure, Statistics
