## Archive for **March 2015**

## Modeling Count Time Series with tscount Package

The example below shows how to estimate a simple univariate Poisson time series model with the tscount package. While the model estimation is straightforward and yields parameter estimates very similar to the ones generated with the acp package (https://statcompute.wordpress.com/2015/03/29/autoregressive-conditional-poisson-model-i), the prediction mechanism is a bit tricky.

1) For the in-sample and the 1-step-ahead predictions:

**yhat_[t] = beta0 + beta1 * y_[t – 1] + beta2 * yhat_[t – 1]**

2) For the out-of-sample predictions, where the observed Y is unavailable and the previous prediction is used in its place:

**yhat_[t] = beta0 + beta1 * yhat_[t – 1] + beta2 * yhat_[t – 1]**

```r
library(tscount)

mdl <- tsglm(cnt$y, model = list(past_obs = 1, past_mean = 1), distr = "poisson")
summary(mdl)
# tsglm(ts = cnt$y, model = list(past_obs = 1, past_mean = 1),
#   distr = "poisson")
#
# Coefficients:
#             Estimate  Std. Error
# (Intercept)    0.632      0.1774
# beta_1         0.350      0.0687
# alpha_1        0.184      0.1455
# Standard errors obtained by normal approximation.
#
# Link function: identity
# Distribution family: poisson
# Number of coefficients: 3
# Log-likelihood: -279.2738
# AIC: 564.5476
# BIC: 573.9195

### in-sample prediction ###
cnt$yhat <- mdl$fitted.values
tail(cnt, 3)
#     y      yhat
# 166 1 0.8637023
# 167 3 1.1404714
# 168 6 1.8918651

### manually check ###
beta <- mdl$coefficients
pv167 <- beta[1] + beta[2] * cnt$y[166] + beta[3] * cnt$yhat[166]
# 1.140471
pv168 <- beta[1] + beta[2] * cnt$y[167] + beta[3] * cnt$yhat[167]
# 1.891865

### out-of-sample prediction ###
oot <- predict(mdl, n.ahead = 3)
# [1] 3.080667 2.276211 1.846767

### manually check ###
ov2 <- beta[1] + beta[2] * oot[[1]][1] + beta[3] * oot[[1]][1]
# 2.276211
ov3 <- beta[1] + beta[2] * oot[[1]][2] + beta[3] * oot[[1]][2]
# 1.846767
```
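The two prediction rules can be sketched in a few lines of Python, using the rounded coefficients reported by summary(mdl) (0.632, 0.350, 0.184); the results agree with the R session up to rounding:

```python
def one_step(beta0, beta1, beta2, y_prev, yhat_prev):
    """Rule 1: in-sample / 1-step-ahead, using the observed y[t-1]."""
    return beta0 + beta1 * y_prev + beta2 * yhat_prev

def multi_step(beta0, beta1, beta2, yhat_first, n_ahead):
    """Rule 2: further out-of-sample steps, where yhat[t-1] stands in
    for the unobserved y[t-1], so the two slopes collapse into one."""
    preds = [yhat_first]
    for _ in range(n_ahead - 1):
        preds.append(beta0 + (beta1 + beta2) * preds[-1])
    return preds
```

For instance, one_step(0.632, 0.350, 0.184, 1, 0.8637023) reproduces the fitted value at t = 167 up to rounding.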

## rPithon vs. rPython

Similar to rPython, the rPithon package (http://rpithon.r-forge.r-project.org) allows users to execute Python code from R and exchange data between Python and R. However, the underlying mechanisms of the two packages are fundamentally different. While rPithon communicates with Python from R through pipes, rPython accomplishes the same task with JSON. A major advantage of rPithon over rPython is that multiple Python processes can be started within an R session. However, rPithon is not very robust when exchanging large data objects between R and Python.
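The pipe-based idea can be sketched conceptually in pure Python (this is not the actual rPithon protocol, just an illustration of exchanging data with a worker process over stdin/stdout):

```python
import json
import subprocess
import sys

# A worker process that reads JSON from stdin, computes, and writes
# JSON back to stdout -- the same pattern a pipe-based bridge uses.
worker_code = (
    "import json, sys\n"
    "data = json.loads(sys.stdin.readline())\n"
    "result = {'n': len(data['y']), 'total': sum(data['y'])}\n"
    "print(json.dumps(result))\n"
)

proc = subprocess.run(
    [sys.executable, "-c", worker_code],
    input=json.dumps({"y": [1, 2, 3]}) + "\n",
    capture_output=True, text=True,
)
result = json.loads(proc.stdout)
```

Because each bridge owns its own subprocess, nothing prevents several such workers from running side by side, which is the advantage noted above.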

**rPython Session**

```r
library(sqldf)
df_in <- sqldf('select Year, Month, DayofMonth from tbl2008 limit 5000',
               dbname = '/home/liuwensui/Documents/data/flights.db')

library(rPython)
### R DATA.FRAME TO PYTHON DICTIONARY ###
python.assign('py_dict', df_in)
### PASS PYTHON DICTIONARY BACK TO R LIST ###
r_list <- python.get('py_dict')
### CONVERT R LIST TO DATA.FRAME ###
df_out <- data.frame(r_list)
dim(df_out)
# [1] 5000    3
#
# real  0m0.973s
# user  0m0.797s
# sys   0m0.186s
```

**rPithon Session**

```r
library(sqldf)
df_in <- sqldf('select Year, Month, DayofMonth from tbl2008 limit 5000',
               dbname = '/home/liuwensui/Documents/data/flights.db')

library(rPithon)
### R DATA.FRAME TO PYTHON DICTIONARY ###
pithon.assign('py_dict', df_in)
### PASS PYTHON DICTIONARY BACK TO R LIST ###
r_list <- pithon.get('py_dict')
### CONVERT R LIST TO DATA.FRAME ###
df_out <- data.frame(r_list)
dim(df_out)
# [1] 5000    3
#
# real  0m0.984s
# user  0m0.771s
# sys   0m0.187s
```

## Autoregressive Conditional Poisson Model – I

Modeling time series of count outcomes is of interest in operational risk when forecasting the frequency of losses. Below is an example showing how to estimate a simple ACP(1, 1), i.e. Autoregressive Conditional Poisson, model without covariates with the **acp** package.

```r
library(acp)

### acp(1, 1) without covariates ###
mdl <- acp(y ~ -1, data = cnt)
summary(mdl)
# acp.formula(formula = y ~ -1, data = cnt)
#
#   Estimate   StdErr t.value   p.value
# a 0.632670 0.169027  3.7430 0.0002507 ***
# b 0.349642 0.067414  5.1865 6.213e-07 ***
# c 0.184509 0.134154  1.3753 0.1708881

### generate predictions ###
f <- predict(mdl)
pred <- data.frame(yhat = f, cnt)
tail(pred, 5)
#          yhat y
# 164 1.5396921 1
# 165 1.2663993 0
# 166 0.8663321 1
# 167 1.1421586 3
# 168 1.8923355 6

### calculate predictions manually ###
pv167 <- mdl$coef[1] + mdl$coef[2] * pred$y[166] + mdl$coef[3] * pred$yhat[166]
# [1] 1.142159
pv168 <- mdl$coef[1] + mdl$coef[2] * pred$y[167] + mdl$coef[3] * pred$yhat[167]
# [1] 1.892336

plot.ts(pred, main = "Predictions")
```
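The same recursion also makes an ACP(1, 1) series easy to simulate, which is handy for sanity-checking the estimator. A minimal Python sketch, where the coefficient values are round-offs of the estimates above and purely illustrative:

```python
import math
import random

def simulate_acp11(a, b, c, n, seed=2015):
    """Simulate an ACP(1, 1) series: mu[t] = a + b * y[t-1] + c * mu[t-1],
    with y[t] ~ Poisson(mu[t]); stationary when b + c < 1."""
    rng = random.Random(seed)

    def rpois(mu):
        # Knuth's method; adequate for the small means simulated here
        L, k, p = math.exp(-mu), 0, 1.0
        while True:
            p *= rng.random()
            if p <= L:
                return k
            k += 1

    mu = a / (1 - b - c)  # start from the unconditional mean
    ys = [rpois(mu)]
    for _ in range(n - 1):
        mu = a + b * ys[-1] + c * mu
        ys.append(rpois(mu))
    return ys
```

With a = 0.63, b = 0.35, c = 0.18, the unconditional mean is a / (1 - b - c) ≈ 1.34, and a long simulated series should average close to that value.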

## A Comparison between Blaze and Pandas

Blaze (https://github.com/ContinuumIO/blaze) is a lightweight interface on top of other data or computing engines such as numpy or sqlite. Blaze itself doesn’t do any computation but provides a Pandas-like syntax to interact with the back-end data.

Below is an example showing how Blaze leverages the computing power of SQLite (https://www.sqlite.org) and outperforms Pandas when performing some simple tasks on a SQLite table with ~7MM rows. Since Blaze doesn't have to load the data table into memory as Pandas does, the CPU time is significantly shorter.

**Pandas**

```python
import pandas as pda
import pandas.io.sql as psql
import sqlite3 as sql
import numpy as npy

con = sql.connect('/home/liuwensui/Documents/data/flights.db')
ds1 = psql.read_sql('select * from tbl2008', con)
ds2 = ds1[ds1.DayOfWeek > 1]
ds3 = ds2.groupby('DayOfWeek', as_index = False)
ds4 = ds3['AirTime'].agg({'avg_AirTime' : npy.mean})
print(ds4)
#    DayOfWeek  avg_AirTime
# 0          2   103.214930
# 1          3   103.058508
# 2          4   103.467138
# 3          5   103.557539
# 4          6   107.400631
# 5          7   104.864885
#
# real  1m7.241s
# user  1m0.403s
# sys   0m5.278s
```

**Blaze**

```python
import blaze as blz
import pandas as pda

ds1 = blz.Data('sqlite:////home/liuwensui/Documents/data/flights.db::tbl2008')
ds2 = ds1[ds1.DayOfWeek > 1]
ds3 = blz.by(ds2.DayOfWeek, avg_AirTime = ds2.AirTime.mean())
ds4 = blz.into(pda.DataFrame, ds3)
print(ds4)
#    DayOfWeek  avg_AirTime
# 0          2   103.214930
# 1          3   103.058508
# 2          4   103.467138
# 3          5   103.557539
# 4          6   107.400631
# 5          7   104.864885
#
# real  0m21.658s
# user  0m10.727s
# sys   0m1.167s
```
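The difference between the two sessions comes down to where the work happens. The contrast can be sketched with the standard library alone, using a tiny in-memory SQLite table (the table and values below are made up for illustration):

```python
import sqlite3
import statistics

con = sqlite3.connect(":memory:")
con.execute("create table flights (DayOfWeek integer, AirTime real)")
con.executemany("insert into flights values (?, ?)",
                [(1, 90.0), (2, 100.0), (2, 110.0), (3, 120.0)])

# Pandas-style: pull every row into memory, then filter and aggregate locally
rows = con.execute("select DayOfWeek, AirTime from flights").fetchall()
local = {}
for dow, airtime in rows:
    if dow > 1:
        local.setdefault(dow, []).append(airtime)
local_avg = {dow: statistics.mean(v) for dow, v in local.items()}

# Blaze-style: push the filter and aggregation down to the SQL engine,
# so only the small grouped result ever crosses into Python
pushed = dict(con.execute(
    "select DayOfWeek, avg(AirTime) from flights "
    "where DayOfWeek > 1 group by DayOfWeek"))
```

Both routes give identical answers; on a ~7MM-row table, shipping only the grouped result is what saves the minute of CPU time.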

## Ensemble Learning with Cubist Model

The tree-based Cubist model can easily be used to develop an ensemble classifier with a scheme called “committees”. The concept of “committees” is similar to “boosting” in that a series of trees is developed sequentially with adjusted weights. However, the final prediction is the simple average of the predictions from all “committee” members, an idea closer to “bagging”.

Below is a demonstration showing how to use the train() function in the caret package to select the optimal number of “committees” in the ensemble Cubist model, e.g. 100 in the example. As shown, the ensemble model is able to outperform the standalone model by ~4% in a separate testing dataset.

```r
data(Boston, package = "MASS")
X <- Boston[, 1:13]
Y <- log(Boston[, 14])

# SAMPLE THE DATA
set.seed(2015)
rows <- sample(1:nrow(Boston), nrow(Boston) - 100)
X1 <- X[rows, ]
X2 <- X[-rows, ]
Y1 <- Y[rows]
Y2 <- Y[-rows]

pkgs <- c('doMC', 'Cubist', 'caret')
lapply(pkgs, require, character.only = T)
registerDoMC(cores = 7)

# TRAIN A STANDALONE MODEL FOR COMPARISON
mdl1 <- cubist(x = X1, y = Y1,
               control = cubistControl(unbiased = TRUE, label = "log_medv", seed = 2015))
print(cor(Y2, predict(mdl1, newdata = X2)) ^ 2)
# [1] 0.923393

# SEARCH FOR THE OPTIMAL NUMBER OF COMMITTEES
test <- train(x = X1, y = Y1, "cubist",
              tuneGrid = expand.grid(.committees = seq(10, 100, 10), .neighbors = 0),
              trControl = trainControl(method = 'cv'))
print(test)
# OUTPUT SHOWING THE HIGHEST R^2 WHEN # OF COMMITTEES = 100
#   committees  RMSE       Rsquared   RMSE SD     Rsquared SD
#    10         0.1607422  0.8548458  0.04166821  0.07783100
#    20         0.1564213  0.8617020  0.04223616  0.07858360
#    30         0.1560715  0.8619450  0.04015586  0.07534421
#    40         0.1562329  0.8621699  0.03904749  0.07301656
#    50         0.1563900  0.8612108  0.03904703  0.07342892
#    60         0.1558986  0.8620672  0.03819357  0.07138955
#    70         0.1553652  0.8631393  0.03849417  0.07173025
#    80         0.1552432  0.8629853  0.03887986  0.07254633
#    90         0.1548292  0.8637903  0.03880407  0.07182265
#   100         0.1547612  0.8638320  0.03953242  0.07354575

mdl2 <- cubist(x = X1, y = Y1, committees = 100,
               control = cubistControl(unbiased = TRUE, label = "log_medv", seed = 2015))
print(cor(Y2, predict(mdl2, newdata = X2)) ^ 2)
# [1] 0.9589031
```
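The “committee” prediction rule itself, a simple average over members, can be sketched in a few lines (the lambda “members” below are toy stand-ins for the sequentially grown Cubist trees, not the actual model):

```python
def committee_predict(members, x):
    """Final prediction is the simple average of the committee members'
    individual predictions, as in bagging (members are any callables)."""
    preds = [m(x) for m in members]
    return sum(preds) / len(preds)

# three toy "members" standing in for sequentially grown Cubist trees
members = [lambda x: 2 * x, lambda x: 2 * x + 1, lambda x: 2 * x - 1]
```

Here committee_predict(members, 3) averages 6, 7, and 5 into 6.0; individual members' errors cancel in the average, which is why the ensemble beats a standalone model.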

## Model Segmentation with Cubist

Cubist is a tree-based model with an OLS regression attached to each terminal node and is somewhat similar to the mob() function in the party package (https://statcompute.wordpress.com/2014/10/26/model-segmentation-with-recursive-partitioning). Below is a demonstration of the cubist() model with the classic Boston housing data.

```r
pkgs <- c('MASS', 'Cubist', 'caret')
lapply(pkgs, require, character.only = T)
data(Boston)
X <- Boston[, 1:13]
Y <- log(Boston[, 14])

### TRAIN THE MODEL ###
mdl <- cubist(x = X, y = Y,
              control = cubistControl(unbiased = TRUE, label = "log_medv", seed = 2015, rules = 5))
summary(mdl)
# Rule 1: [94 cases, mean 2.568824, range 1.609438 to 3.314186, est err 0.180985]
#
#   if
#     nox > 0.671
#   then
#     log_medv = 1.107315 + 0.588 dis + 2.92 nox - 0.0287 lstat - 0.2 rm
#                - 0.0065 crim
#
# Rule 2: [39 cases, mean 2.701933, range 1.94591 to 3.314186, est err 0.202473]
#
#   if
#     nox <= 0.671
#     lstat > 19.01
#   then
#     log_medv = 3.935974 - 1.68 nox - 0.0076 lstat + 0.0035 rad - 0.00017 tax
#                - 0.013 dis - 0.0029 crim + 0.034 rm - 0.011 ptratio
#                + 0.00015 black + 0.0003 zn
#
# Rule 3: [200 cases, mean 2.951007, range 2.116256 to 3.589059, est err 0.100825]
#
#   if
#     rm <= 6.232
#     dis > 1.8773
#   then
#     log_medv = 2.791381 + 0.152 rm - 0.0147 lstat + 0.00085 black
#                - 0.031 dis - 0.027 ptratio - 0.0017 age + 0.0031 rad
#                - 0.00013 tax - 0.0025 crim - 0.12 nox + 0.0002 zn
#
# Rule 4: [37 cases, mean 3.038195, range 2.341806 to 3.912023, est err 0.184200]
#
#   if
#     dis <= 1.8773
#     lstat <= 19.01
#   then
#     log_medv = 5.668421 - 1.187 dis - 0.0469 lstat - 0.0122 crim
#
# Rule 5: [220 cases, mean 3.292121, range 2.261763 to 3.912023, est err 0.093716]
#
#   if
#     rm > 6.232
#     lstat <= 19.01
#   then
#     log_medv = 2.419507 - 0.033 lstat + 0.238 rm - 0.0089 crim + 0.0082 rad
#                - 0.029 dis - 0.00035 tax + 0.0006 black - 0.024 ptratio
#                - 0.0006 age - 0.12 nox + 0.0002 zn
#
# Evaluation on training data (506 cases):
#
#     Average  |error|           0.100444
#     Relative |error|               0.33
#     Correlation coefficient        0.94
#
# Attribute usage:
#   Conds  Model
#
#    71%    94%    rm
#    50%   100%    lstat
#    40%   100%    dis
#    23%    94%    nox
#          100%    crim
#           78%    zn
#           78%    rad
#           78%    tax
#           78%    ptratio
#           78%    black
#           71%    age

### VARIABLE IMPORTANCE ###
varImp(mdl)
#         Overall
# rm         82.5
# lstat      75.0
# dis        70.0
# nox        58.5
# crim       50.0
# zn         39.0
# rad        39.0
# tax        39.0
# ptratio    39.0
# black      39.0
# age        35.5
# indus       0.0
# chas        0.0
```
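The rule-plus-regression idea can be sketched as a single-split toy version in Python (the split point and data below are made up for illustration; Cubist itself learns the rules from the data):

```python
def fit_ols(xs, ys):
    """Simple least squares fit y = a + b * x via the normal equations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def segmented_model(xs, ys, split):
    """One rule per side of `split`, each with its own OLS fit,
    mimicking Cubist's terminal-node regressions."""
    lo = [(x, y) for x, y in zip(xs, ys) if x <= split]
    hi = [(x, y) for x, y in zip(xs, ys) if x > split]
    fits = {False: fit_ols(*zip(*lo)), True: fit_ols(*zip(*hi))}

    def predict(x):
        a, b = fits[x > split]  # route x to its segment's regression
        return a + b * x
    return predict
```

A piecewise-linear dataset with different slopes on each side of the split is recovered exactly by this two-segment model, while a single global OLS line would not fit either piece.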

## Two Ways to Select Rows in Incanter

```clojure
user=> (use '(incanter core io))
nil
user=> (def iris (read-dataset "../data/iris.dat" :header true :delim \space))
#'user/iris
user=> (sel iris :rows (range 3))
[:Sepal.Length :Sepal.Width :Petal.Length :Petal.Width :Species]
[5.1 3.5 1.4 0.2 "setosa"]
[4.9 3.0 1.4 0.2 "setosa"]
[4.7 3.2 1.3 0.2 "setosa"]

;; METHOD 1 - USING $WHERE
user=> ($where {:Species {:in #{"virginica" "setosa"}}
                :Sepal.Length {:gt 5.5, :lt 6.0}} iris)
[:Sepal.Length :Sepal.Width :Petal.Length :Petal.Width :Species]
[5.8 4.0 1.2 0.2 "setosa"]
[5.7 4.4 1.5 0.4 "setosa"]
[5.7 3.8 1.7 0.3 "setosa"]
[5.8 2.7 5.1 1.9 "virginica"]
[5.7 2.5 5.0 2.0 "virginica"]
[5.8 2.8 5.1 2.4 "virginica"]
[5.6 2.8 4.9 2.0 "virginica"]
[5.8 2.7 5.1 1.9 "virginica"]
[5.9 3.0 5.1 1.8 "virginica"]

;; METHOD 2 - USING QUERY-DATASET
user=> (query-dataset iris {:Species {:in #{"virginica" "setosa"}}
                            :Sepal.Length {:gt 5.5, :lt 6.0}})
[:Sepal.Length :Sepal.Width :Petal.Length :Petal.Width :Species]
[5.8 4.0 1.2 0.2 "setosa"]
[5.7 4.4 1.5 0.4 "setosa"]
[5.7 3.8 1.7 0.3 "setosa"]
[5.8 2.7 5.1 1.9 "virginica"]
[5.7 2.5 5.0 2.0 "virginica"]
[5.8 2.8 5.1 2.4 "virginica"]
[5.6 2.8 4.9 2.0 "virginica"]
[5.8 2.7 5.1 1.9 "virginica"]
[5.9 3.0 5.1 1.8 "virginica"]
```