Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

Archive for June 2013

GRNN and PNN

From the technical perspective, people usually choose GRNN (general regression neural network) for function approximation with a continuous response variable and PNN (probabilistic neural network) for pattern recognition / classification problems with categorical outcomes. However, from the practical standpoint, it is often not necessary to draw a fine line between GRNN and PNN, given that most classification problems in the real world are binary. After reading the PNN paper (http://courses.cs.tamu.edu/rgutier/cpsc636_s10/specht1990pnn.pdf) and the GRNN paper (http://research.vuse.vanderbilt.edu/vuwal/Paul/Paper/References/grnn.pdf), both by Specht, it is not difficult to see the similarity between the two. In particular, for a 2-class classification problem, GRNN can serve the same purpose after the 2-class categorical outcome is converted to a numeric response with 0-1 values. In the demonstration below, I am going to show that GRNN and PNN generate identical predictions with the same smoothing parameter.
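
To make the equivalence concrete, the GRNN estimator can be sketched (in LaTeX notation, following the form in Specht's papers) as

\hat{y}(x) = \frac{\sum_{i=1}^{n} y_i \exp(-\|x - x_i\|^2 / 2\sigma^2)}{\sum_{i=1}^{n} \exp(-\|x - x_i\|^2 / 2\sigma^2)}

With y_i = 1 for class A and y_i = 0 otherwise, the numerator sums the Gaussian kernels over the class-A training points and the denominator sums them over all training points, which is exactly the class-A posterior probability reported by a PNN with the same sigma and with class priors set to the sample proportions.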

First of all, let’s train a PNN for a 2-class outcome with the pnn package from CRAN.

pkgs <- c('doParallel', 'foreach', 'pnn')
lapply(pkgs, require, character.only = T)
registerDoParallel(cores = 8)

data(norms)
nn1 <- smooth(learn(norms), sigma = 0.5)

pred_pnn <- function(x, nn){
  xlst <- split(x, 1:nrow(x))
  pred <- foreach(i = xlst, .combine = rbind) %dopar% {
    data.frame(prob = guess(nn, as.matrix(i))$probabilities[1], row.names = NULL)
  }
}

print(pred_pnn(norms[1:10, -1], nn1))
#         prob
# 1  0.6794262
# 2  0.5336774
# 3  0.7632387
# 4  0.8103197
# 5  0.6496806
# 6  0.7752137
# 7  0.4138325
# 8  0.7320472
# 9  0.6599813
# 10 0.8015706

Secondly, I also trained a GRNN after converting the categorical outcome above to a dummy response with 0-1 values.

pkgs <- c('pnn', 'doParallel', 'foreach', 'grnn')
lapply(pkgs, require, character.only = T)
registerDoParallel(cores = 8)

data(norms)
norm2 <- data.frame(n = ifelse(norms$c == 'A', 1, 0), x = norms$x, y = norms$y)
detach('package:pnn')

nn2 <- smooth(learn(norm2), sigma = 0.5)

pred_grnn <- function(x, nn){
  xlst <- split(x, 1:nrow(x))
  pred <- foreach(i = xlst, .combine = rbind) %dopar% {
    data.frame(pred = guess(nn, as.matrix(i)), row.names = NULL)
  }
}

print(pred_grnn(norm2[1:10, -1], nn2))
#         pred
# 1  0.6794262
# 2  0.5336774
# 3  0.7632387
# 4  0.8103197
# 5  0.6496806
# 6  0.7752137
# 7  0.4138325
# 8  0.7320472
# 9  0.6599813
# 10 0.8015706

As clearly shown in the outputs, both PNN and GRNN generated identical predicted values for the 2-class classification problem.

Written by statcompute

June 23, 2013 at 12:21 pm

Prototyping A General Regression Neural Network with SAS

The last time I read the paper “A General Regression Neural Network” by Donald Specht was exactly 10 years ago, when I was in graduate school. After reading it again this week, I decided to code it out with SAS macros and make this excellent idea available to the SAS community.

The prototype of GRNN consists of 2 SAS macros, %grnn_learn() for training a GRNN and %grnn_pred() for generating predictions with a GRNN. The famous Boston Housing dataset is used to test these two macros, with the results compared against the outcome from the R implementation below. In this exercise, the smoothing parameter SIGMA is assumed to be known and equal to 0.55 in order to simplify the case.

pkgs <- c('MASS', 'doParallel', 'foreach', 'grnn')
lapply(pkgs, require, character.only = T)
registerDoParallel(cores = 8)

data(Boston)
X <- Boston[-14]
st.X <- scale(X)
Y <- Boston[14]
boston <- data.frame(st.X, Y)

pred_grnn <- function(x, nn){
  xlst <- split(x, 1:nrow(x))
  pred <- foreach(i = xlst, .combine = rbind) %dopar% {
    data.frame(pred = guess(nn, as.matrix(i)), i, row.names = NULL)
  }
}

grnn <- smooth(learn(boston, variable.column = ncol(boston)), sigma = 0.55)
pred_grnn <- pred_grnn(boston[, -ncol(boston)], grnn)
head(pred_grnn$pred, n = 10)
# [1] 24.61559 23.22232 32.29610 32.57700 33.29552 26.73482 21.46017 20.96827
# [9] 16.55537 20.25247

The first SAS macro, %grnn_learn() to train a GRNN, is shown below. The purpose of this macro is to store the whole specification of a GRNN in a SAS dataset after the simple 1-pass training with the development data. Please note that, motivated by the idea of MongoDB, I use a key-value pair scheme to store the information of a GRNN: rows with a positive _neuron_ hold the response (keyed as _Y_) and the standardized inputs of each training case, the row with _neuron_ = 0 holds SIGMA, and rows with _neuron_ = -1 and -2 hold the variable means and standard deviations used for standardization.

libname data '';

data data.boston;
  infile 'housing.data';
  input x1 - x13 y;
run;

%macro grnn_learn(data = , x = , y = , sigma = , nn_out = );
options mprint mlogic nocenter;
********************************************************;
* THIS MACRO IS TO TRAIN A GENERAL REGRESSION NEURAL   *;
* NETWORK (SPECHT, 1991) AND STORE THE SPECIFICATION   *;
*------------------------------------------------------*;
* INPUT PARAMETERS:                                    *;
*  DATA  : INPUT SAS DATASET                           *;
*  X     : A LIST OF PREDICTORS IN THE NUMERIC FORMAT  *;
*  Y     : A RESPONSE VARIABLE IN THE NUMERIC FORMAT   *;
*  SIGMA : THE SMOOTH PARAMETER FOR GRNN               *;
*  NN_OUT: OUTPUT SAS DATASET CONTAINING THE GRNN      *;
*          SPECIFICATION                               *;
*------------------------------------------------------*;
* AUTHOR:                                              *;
*  WENSUI.LIU@53.COM                                   *;
********************************************************;

data _tmp1;
  set &data (keep = &x &y);
  where &y ~= .;
  array _x_ &x;
  _miss_ = 0;
  do _i_ = 1 to dim(_x_);
    if _x_[_i_] = . then _miss_ = 1; 
  end;
  if _miss_ = 0 then output;
run;

proc summary data = _tmp1;
  output out = _avg_ (drop = _type_ _freq_)
  mean(&x) = ;
run;

proc summary data = _tmp1;
  output out = _std_ (drop = _type_ _freq_)
  std(&x) = ;
run;

proc standard data = _tmp1 mean = 0 std = 1 out = _data_;
  var &x;
run;

data &nn_out (keep = _neuron_ _key_ _value_);
  set _last_ end = eof;
  _neuron_ + 1;
  length _key_ $32;
  array _a_ &y &x;
  do _i_ = 1 to dim(_a_);
    if _i_ = 1 then _key_ = '_Y_';
    else _key_ = upcase(vname(_a_[_i_]));
    _value_ = _a_[_i_];
    output;
  end; 
  if eof then do;
    _neuron_ = 0;
    _key_  = "_SIGMA_";
    _value_  = &sigma;
    output;
    set _avg_;
    array _b_ &x;
    do _i_ = 1 to dim(_b_);
      _neuron_ = -1;
      _key_ = upcase(vname(_b_[_i_]));
      _value_ = _b_[_i_];
      output;
    end;
    set _std_;
    array _c_ &x;
    do _i_ = 1 to dim(_c_);
      _neuron_ = -2;
      _key_ = upcase(vname(_c_[_i_]));
      _value_ = _c_[_i_];
      output;
    end;
  end;
run;

proc datasets library = work;
  delete _: / memtype = data;
run;
quit;

********************************************************;
*              END OF THE MACRO                        *;
********************************************************;
%mend grnn_learn;

%grnn_learn(data = data.boston, x = x1 - x13, y = y, sigma = 0.55, nn_out = data.grnn);

proc print data = data.grnn (obs = 10) noobs;
run;
/* SAS PRINTOUT OF GRNN DATA:
_neuron_    _key_     _value_
    1        _Y_      24.0000
    1        X1       -0.4194
    1        X2        0.2845
    1        X3       -1.2866
    1        X4       -0.2723
    1        X5       -0.1441
    1        X6        0.4133
    1        X7       -0.1199
    1        X8        0.1401
    1        X9       -0.9819
*/

After a GRNN is trained, the macro %grnn_pred() is used to generate predicted values from a test dataset with all predictors. As shown in the printout, the first 10 predicted values are identical to those generated with R.

libname data '';

%macro grnn_pred(data = , x = , id = NA, nn_in = , out = grnn_pred);
options mprint mlogic nocenter;
********************************************************;
* THIS MACRO IS TO GENERATE PREDICTED VALUES BASED ON  *;
* THE SPECIFICATION OF GRNN CREATED BY THE %GRNN_LEARN *;
* MACRO                                                *;
*------------------------------------------------------*;
* INPUT PARAMETERS:                                    *;
*  DATA : INPUT SAS DATASET                            *;
*  X    : A LIST OF PREDICTORS IN THE NUMERIC FORMAT   *;
*  ID   : AN ID VARIABLE (OPTIONAL)                    *;
*  NN_IN: INPUT SAS DATASET CONTAINING THE GRNN        *;
*         SPECIFICATION GENERATED FROM %GRNN_LEARN     *;
*  OUT  : OUTPUT SAS DATASET WITH GRNN PREDICTIONS     *;
*------------------------------------------------------*;
* AUTHOR:                                              *;
*  WENSUI.LIU@53.COM                                   *;
********************************************************;

data _data_;
  set &data;
  array _x_ &x;
  _miss_ = 0;
  do _i_ = 1 to dim(_x_);
    if _x_[_i_] = . then _miss_ = 1;
  end;
  if _miss_ = 0 then output;
run;

data _data_;
  set _last_ (drop = _miss_);
  %if &id = NA %then %do;
  _id_ + 1;
  %end;
  %else %do;
  _id_ = &id;
  %end;
run;

proc sort data = _last_ sortsize = max nodupkey;
  by _id_;
run;

data _data_ (keep = _id_ _key_ _value_);
  set _last_;
  array _x_ &x;
  length _key_ $32;
  do _i_ = 1 to dim(_x_);
    _key_ = upcase(vname(_x_[_i_]));
    _value_ = _x_[_i_];
    output;
  end;
run;

proc sql noprint;
select _value_ ** 2 into :s2 from &nn_in where _neuron_ = 0;

create table
  _last_ as 
select
  a._id_,
  a._key_,
  (a._value_ - b._value_) / c._value_ as _value_
from
  _last_ as a,
  &nn_in as b,
  &nn_in as c
where
  compress(a._key_, ' ') = compress(b._key_, ' ') and
  compress(a._key_, ' ') = compress(c._key_, ' ') and
  b._neuron_ = -1                                 and
  c._neuron_ = -2;

create table
  _last_ as
select
  a._id_,
  b._neuron_,
  sum((a._value_ - b._value_) ** 2) as d2,
  mean(c._value_)                   as y,
  exp(-(calculated d2) / (2 * &s2)) as exp
from
  _last_  as a,
  &nn_in as b,
  &nn_in as c
where
  compress(a._key_, ' ') = compress(b._key_, ' ') and
  b._neuron_ = c._neuron_                         and
  b._neuron_ > 0                                  and
  c._key_ = '_Y_'
group by
  a._id_, b._neuron_;

create table
  _last_ as
select
  a._id_,
  sum(a.y * a.exp / b.sum_exp) as _pred_
from
  _last_ as a inner join (select _id_, sum(exp) as sum_exp from _last_ group by _id_) as b
on
  a._id_ = b._id_
group by
  a._id_;
quit;

proc sort data = _last_ out = &out sortsize = max;
  by _id_;
run;

********************************************************;
*              END OF THE MACRO                        *;
********************************************************;
%mend grnn_pred;

%grnn_pred(data = data.boston, x = x1 - x13, nn_in = data.grnn);

proc print data = grnn_pred (obs = 10) noobs;
run;
/* SAS PRINTOUT:
_id_     _pred_
  1     24.6156
  2     23.2223
  3     32.2961
  4     32.5770
  5     33.2955
  6     26.7348
  7     21.4602
  8     20.9683
  9     16.5554
 10     20.2525
*/

After developing these two macros, I also compared the predictive performance of GRNN against OLS regression. It turns out that GRNN consistently outperforms OLS regression across a wide range of SIGMA values. With a reasonable choice of SIGMA, even a GRNN developed with 10% of the whole Boston Housing dataset is able to generalize well and yield an R^2 > 0.8 on the remaining 90% of the data.
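
For reference, below is a minimal sketch of that comparison in R. It assumes the boston data frame and the pred_grnn() function defined in the R code earlier in this post; the 10% / 90% split, the seed, and the SIGMA value here are illustrative choices only.

# A ROUGH SKETCH OF THE GRNN VS. OLS COMPARISON (ILLUSTRATIVE ONLY)
set.seed(2013)
idx <- sample(1:nrow(boston), round(0.1 * nrow(boston)))
train <- boston[idx, ]
test <- boston[-idx, ]

# GRNN TRAINED ON THE 10% SAMPLE WITH AN ASSUMED SIGMA
g <- smooth(learn(train, variable.column = ncol(train)), sigma = 0.55)
grnn_hat <- pred_grnn(test[, -ncol(test)], g)$pred

# OLS REGRESSION TRAINED ON THE SAME 10% SAMPLE
ols_hat <- predict(lm(medv ~ ., data = train), newdata = test)

# OUT-OF-SAMPLE R-SQUARED FOR BOTH MODELS ON THE REMAINING 90%
r2 <- function(y, yhat) 1 - sum((y - yhat) ^ 2) / sum((y - mean(y)) ^ 2)
print(c(grnn = r2(test$medv, grnn_hat), ols = r2(test$medv, ols_hat)))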

Written by statcompute

June 23, 2013 at 12:02 am

General Regression Neural Network with R

Similar to the back-propagation neural network, the general regression neural network (GRNN) is a good tool for function approximation in the modeling toolbox. Proposed by Specht in 1991, GRNN has the advantages of instant training and easy tuning. A GRNN is formed instantly with just a 1-pass training on the development data. In the network development phase, the only hurdle is tuning the hyper-parameter known as sigma, which governs the smoothness of a GRNN.
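
As a quick illustration of the idea, a GRNN prediction at a new point is nothing more than a Gaussian-kernel weighted average of the stored training responses, with sigma controlling how fast the weights decay with distance. Below is a minimal hand-rolled sketch of that estimator, for illustration only and not the grnn package API.

# A MINIMAL HAND-ROLLED GRNN ESTIMATOR (FOR ILLUSTRATION ONLY)
grnn_by_hand <- function(x_tr, y_tr, x0, sigma) {
  d2 <- colSums((t(x_tr) - x0) ^ 2)   # SQUARED DISTANCES FROM X0 TO EACH TRAINING CASE
  w <- exp(-d2 / (2 * sigma ^ 2))     # GAUSSIAN KERNEL WEIGHTS GOVERNED BY SIGMA
  sum(w * y_tr) / sum(w)              # KERNEL-WEIGHTED AVERAGE OF STORED RESPONSES
}

# TOY USAGE: THE "TRAINING" STEP IS JUST KEEPING X_TR AND Y_TR AROUND
set.seed(1)
x_tr <- matrix(rnorm(200), ncol = 2)
y_tr <- x_tr[, 1] ^ 2 + rnorm(100, sd = 0.1)
grnn_by_hand(x_tr, y_tr, x0 = c(0.5, 0), sigma = 0.5)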

The grnn package (http://flow.chasset.net/r-grnn/) is an implementation of GRNN in R and was just published on CRAN last month. Although the grnn package is still in an early phase, e.g. version 0.1, it is very easy to use and has great potential for future improvements. For instance, the guess() function used to predict new cases only takes one record at a time, so the user needs to write his / her own function to generate predicted values from a data frame. In addition, there is no automatic scheme to find the optimal value of the smoothing parameter sigma; the user has to come up with his / her own solution.

Below is my test drive of the grnn package over the weekend. By leveraging the power of the foreach package, I wrote a simple function that lets guess() score a whole matrix instead of a single row. Additionally, I used a hold-out sample to search for the optimal value of sigma, which works out pretty well and identifies the lowest SSE on the hold-out sample at sigma = 0.55.

pkgs <- c('MASS', 'doParallel', 'foreach', 'grnn')
lapply(pkgs, require, character.only = T)
registerDoParallel(cores = 8)

data(Boston)
# PRE-PROCESSING DATA 
X <- Boston[-14]
st.X <- scale(X)
Y <- Boston[14]
boston <- data.frame(st.X, Y)

# SPLIT DATA SAMPLES
set.seed(2013)
rows <- sample(1:nrow(boston), nrow(boston) - 200)
set1 <- boston[rows, ]
set2 <- boston[-rows, ]

# DEFINE A FUNCTION TO SCORE GRNN
pred_grnn <- function(x, nn){
  xlst <- split(x, 1:nrow(x))
  pred <- foreach(i = xlst, .combine = rbind) %dopar% {
    data.frame(pred = guess(nn, as.matrix(i)), i, row.names = NULL)
  }
}

# SEARCH FOR THE OPTIMAL VALUE OF SIGMA BY THE VALIDATION SAMPLE
cv <- foreach(s = seq(0.2, 1, 0.05), .combine = rbind) %dopar% {
  grnn <- smooth(learn(set1, variable.column = ncol(set1)), sigma = s)
  pred <- pred_grnn(set2[, -ncol(set2)], grnn)
  test.sse <- sum((set2[, ncol(set2)] - pred$pred)^2)
  data.frame(s, sse = test.sse)
}

cat("\n### SSE FROM VALIDATIONS ###\n")
print(cv)
jpeg('grnn_cv.jpeg', width = 800, height = 400, quality = 100)
with(cv, plot(s, sse, type = 'b'))
dev.off()

cat("\n### BEST SIGMA WITH THE LOWEST SSE ###\n")
print(best.s <- cv[cv$sse == min(cv$sse), 1])

# SCORE THE WHOLE DATASET WITH GRNN
final_grnn <- smooth(learn(set1, variable.column = ncol(set1)), sigma = best.s)
pred_all <- pred_grnn(boston[, -ncol(boston)], final_grnn)
jpeg('grnn_fit.jpeg', width = 800, height = 400, quality = 100)
plot(pred_all$pred, boston$medv) 
dev.off()

(Plots: grnn_cv shows SSE versus sigma on the hold-out sample; grnn_fit shows GRNN predictions versus observed medv.)

Written by statcompute

June 16, 2013 at 5:57 pm

Posted in Machine Learning, S+/R

Improve The Efficiency in Joining Data with Index

When managing big data with R, many people like to use the sqldf package for its friendly interface or the data.table package for its lightning speed. However, very few pay attention to small details that might significantly boost the efficiency of these packages, such as adding an index to the data.frame or data.table.

In my post on 01/29/2013 (https://statcompute.wordpress.com/2013/01/29/another-benchmark-for-joining-two-data-frames), I’ve shown how to effectively join two data.frames / data.tables. However, that example does not make the benefit of adding an index intuitive. In the demonstration below, I will compare 2 scenarios, one with an index and the other without, to show the extra efficiency gained from a simple index.

It is also important to note that creating the index on “ldf” has the side effect of copying the data.frame “ldf” from the R workspace into the SQLite database. Therefore, in the 2nd “select…” statement, we need to prefix “ldf” with “main.” in order to use the indexed table “ldf” in SQLite instead of the unindexed data.frame “ldf” in the R environment.

As shown in the benchmark output, simply adding an index significantly reduces the user time with the sqldf package and yields a modest improvement with the data.table package.

libs <- c('sqldf', 'data.table', 'rbenchmark')
lapply(libs, require, character.only = T)

n <- 1000000
set.seed(1)
ldf <- data.frame(id1 = sample(n, n), id2 = sample(n / 1000, n, replace = TRUE), x1 = rnorm(n), x2 = runif(n))
rdf <- data.frame(id1 = sample(n, n), id2 = sample(n / 1000, n, replace = TRUE), y1 = rnorm(n), y2 = runif(n))

benchmark(replications = 5, order = "user.self", 
  noindex.sqldf = (sqldf('select * from ldf as l inner join rdf as r on l.id1 = r.id1 and l.id2 = r.id2')),
  indexed.sqldf = (sqldf(c('create index ldx on ldf(id1, id2)', 
                           'select * from main.ldf as l inner join rdf as r on l.id1 = r.id1 and l.id2 = r.id2')))
)

benchmark(replications = 5, order = "user.self", 
  noindex.table = {
    ldt <- data.table(ldf)
    rdt <- data.table(rdf)
    merge(ldt, rdt, by = c('id1', 'id2'))
  },
  indexed.table = {
    ldt <- data.table(ldf, key = 'id1,id2')
    rdt <- data.table(rdf, key = 'id1,id2')
    merge(ldt, rdt, by = c('id1', 'id2'))
  }
)

SQLDF OUTCOMES

           test replications elapsed relative user.self sys.self user.child
2 indexed.sqldf            5  34.774    1.000    34.511    0.244          0
1 noindex.sqldf            5  61.873    1.779    44.918   16.941          0

DATA.TABLE OUTCOMES

           test replications elapsed relative user.self sys.self user.child
2 indexed.table            5   6.719    1.000     6.609    0.104          0
1 noindex.table            5   6.777    1.009     6.696    0.076          0

Written by statcompute

June 9, 2013 at 4:58 pm

Posted in Big Data, S+/R

Estimating Finite Mixture Models with Flexmix Package

In my post on 06/05/2013 (https://statcompute.wordpress.com/2013/06/05/estimating-composite-models-for-count-outcomes-with-fmm-procedure), I’ve shown how to estimate finite mixture models, e.g. zero-inflated Poisson and 2-class finite mixture Poisson models, with the FMM and NLMIXED procedures in SAS. Today, I am going to demonstrate how to achieve the same results with the flexmix package in R.

R Code

library(flexmix)

data <- read.table('../data/credit_count.txt', header = TRUE, sep = ',')

set.seed(2013)
class <- FLXPmultinom(~ AGE + ACADMOS + MINORDRG + LOGSPEND)
formula <- "MAJORDRG ~ AGE + ACADMOS + MINORDRG + LOGSPEND"
control <- list(verbose = 10, iter.max = 500, minprior = 0.1, tol = 0.01)

cat("\n### TWO-CLASS FINITE MIXTURE POSSION MODEL ###\n")
mdl1 <- FLXMRglm(family = "poisson")
fit1 <- flexmix(as.formula(formula), data = data[data$CARDHLDR == 1, ], k = 2, model = mdl1, concomitant = class, control = control)
refit1 <- refit(fit1, method = 'optim')
cat("\n=== MODEL THE RESPONSE ===\n")
summary(refit1, which = 'model')
cat("\n=== MODEL THE MIXTURE DISTRIBUTION ===\n")
summary(refit1, which = 'concomitant') 

cat("\n### ZERO-INFLATED POSSION MODEL ###\n")
mdl2 <- FLXMRziglm(family = "poisson")
fit <- flexmix(as.formula(formula), data = data[data$CARDHLDR == 1, ], k = 2 , model = mdl2, concomitant = class, control = control)
refit2 <- refit(fit, method = 'optim')
cat("\n=== MODEL THE RESPONSE ===\n")
summary(refit2, which = 'model')
cat("\n=== MODEL THE MIXTURE DISTRIBUTION ===\n")
summary(refit2, which = 'concomitant')

R Output for 2-Class Finite Mixture Poisson

### TWO-CLASS FINITE MIXTURE POISSON MODEL ###
Classification: weighted 
   2 Log-likelihood :   -4303.5967 
converged

=== MODEL THE RESPONSE ===
$Comp.1
              Estimate Std. Error z value  Pr(>|z|)    
(Intercept) -8.0940843  1.6102947 -5.0265 4.996e-07 ***
AGE          0.0114988  0.0129009  0.8913  0.372759    
ACADMOS      0.0045677  0.0020299  2.2502  0.024438 *  
MINORDRG     0.2641256  0.6769000  0.3902  0.696390    
LOGSPEND     0.6826690  0.2224763  3.0685  0.002151 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

$Comp.2
               Estimate  Std. Error z value  Pr(>|z|)    
(Intercept) -2.44490331  0.34951344 -6.9952 2.650e-12 ***
AGE          0.02214164  0.00662479  3.3422 0.0008311 ***
ACADMOS      0.00052922  0.00077757  0.6806 0.4961209    
MINORDRG     0.05054178  0.04048630  1.2484 0.2118965    
LOGSPEND     0.21398000  0.04127917  5.1837 2.175e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


=== MODEL THE MIXTURE DISTRIBUTION ===
$Comp.2
              Estimate Std. Error z value  Pr(>|z|)    
(Intercept) -1.4274523  0.5275366 -2.7059  0.006812 ** 
AGE         -0.0027648  0.0100981 -0.2738  0.784246    
ACADMOS      0.0016143  0.0014455  1.1168  0.264096    
MINORDRG     1.5865202  0.1791074  8.8579 < 2.2e-16 ***
LOGSPEND    -0.0695020  0.0745171 -0.9327  0.350976    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R Output for Zero-Inflated Poisson

### ZERO-INFLATED POISSON MODEL ###
Classification: weighted 
   3 Log-likelihood :   -4175.7431 
converged

=== MODEL THE RESPONSE ===
$Comp.2
               Estimate  Std. Error z value  Pr(>|z|)    
(Intercept) -2.27782892  0.30030188 -7.5851 3.322e-14 ***
AGE          0.01955236  0.00602142  3.2471  0.001166 ** 
ACADMOS      0.00024907  0.00067394  0.3696  0.711698    
MINORDRG     0.11764062  0.02711666  4.3383 1.436e-05 ***
LOGSPEND     0.16441222  0.03531969  4.6550 3.240e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


=== MODEL THE MIXTURE DISTRIBUTION ===
$Comp.2
               Estimate  Std. Error z value  Pr(>|z|)    
(Intercept) -1.91132158  0.41706481 -4.5828 4.588e-06 ***
AGE         -0.00081716  0.00841240 -0.0971  0.922617    
ACADMOS      0.00293407  0.00109997  2.6674  0.007644 ** 
MINORDRG     1.44243620  0.13613625 10.5955 < 2.2e-16 ***
LOGSPEND     0.09561048  0.05080464  1.8819  0.059846 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Written by statcompute

June 9, 2013 at 2:24 am

R and MongoDB

MongoDB is a document-based NoSQL database. Unlike a relational database, which stores data in tables with rigid schemas, MongoDB stores data in documents with dynamic schemas. In the demonstration below, I am going to show how to extract data from MongoDB with R.

Before starting the R session, we need to install MongoDB on the local machine and then load the data into the database with the Python code below.

import pandas
import pymongo

df = pandas.read_table('../data/csdata.txt')
lst = [dict([(colname, row[i]) for i, colname in enumerate(df.columns)]) for row in df.values]
for i in range(3):
  print lst[i]

con = pymongo.Connection('localhost', port = 27017)
test = con.db.test
test.drop()
for i in lst:
  test.save(i)

To the best of my knowledge, there are two R packages providing an interface to MongoDB, namely RMongo and rmongodb. While the RMongo package is very straightforward and user-friendly, it did take me a while to figure out how to specify a query with the rmongodb package.

RMongo Example

library(RMongo)
mg1 <- mongoDbConnect('db')
print(dbShowCollections(mg1))
query <- dbGetQuery(mg1, 'test', "{'AGE': {'$lt': 10}, 'LIQ': {'$gte': 0.1}, 'IND5A': {'$ne': 1}}")
data1 <- query[c('AGE', 'LIQ', 'IND5A')]
summary(data1)

RMongo Output

Loading required package: rJava
Loading required package: methods
Loading required package: RUnit
[1] "system.indexes" "test"          
      AGE             LIQ             IND5A  
 Min.   :6.000   Min.   :0.1000   Min.   :0  
 1st Qu.:7.000   1st Qu.:0.1831   1st Qu.:0  
 Median :8.000   Median :0.2970   Median :0  
 Mean   :7.963   Mean   :0.3745   Mean   :0  
 3rd Qu.:9.000   3rd Qu.:0.4900   3rd Qu.:0  
 Max.   :9.000   Max.   :1.0000   Max.   :0  

rmongodb Example

library(rmongodb)
mg2 <- mongo.create()
print(mongo.get.databases(mg2))
print(mongo.get.database.collections(mg2, 'db'))
buf <- mongo.bson.buffer.create()
mongo.bson.buffer.start.object(buf, 'AGE')
mongo.bson.buffer.append(buf, '$lt', 10)
mongo.bson.buffer.finish.object(buf)
mongo.bson.buffer.start.object(buf, 'LIQ')
mongo.bson.buffer.append(buf, '$gte', 0.1)
mongo.bson.buffer.finish.object(buf)
mongo.bson.buffer.start.object(buf, 'IND5A')
mongo.bson.buffer.append(buf, '$ne', 1)
mongo.bson.buffer.finish.object(buf)
query <- mongo.bson.from.buffer(buf)
cur <- mongo.find(mg2, 'db.test', query = query)
age <- liq <- ind5a <- NULL
while (mongo.cursor.next(cur)) {
  value <- mongo.cursor.value(cur)
  age   <- rbind(age, mongo.bson.value(value, 'AGE'))
  liq   <- rbind(liq, mongo.bson.value(value, 'LIQ'))
  ind5a <- rbind(ind5a, mongo.bson.value(value, 'IND5A'))
  }
mongo.destroy(mg2)
data2 <- data.frame(AGE = age, LIQ = liq, IND5A = ind5a)
summary(data2)

rmongodb Output

rmongodb package (mongo-r-driver) loaded
Use 'help("mongo")' to get started.

[1] "db"
[1] "db.test"
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
NULL
      AGE             LIQ             IND5A  
 Min.   :6.000   Min.   :0.1000   Min.   :0  
 1st Qu.:7.000   1st Qu.:0.1831   1st Qu.:0  
 Median :8.000   Median :0.2970   Median :0  
 Mean   :7.963   Mean   :0.3745   Mean   :0  
 3rd Qu.:9.000   3rd Qu.:0.4900   3rd Qu.:0  
 Max.   :9.000   Max.   :1.0000   Max.   :0  

Written by statcompute

June 8, 2013 at 12:20 am

Estimating Composite Models for Count Outcomes with FMM Procedure

Once upon a time when I learned SAS, it was still version 6.X. As an old dog with 10+ years of experience in SAS, I’ve been trying my best to keep up with the new tricks in each release of SAS. In SAS 9.3, my favorite new feature in SAS/STAT is the FMM procedure for estimating finite mixture models.

In 2008 when I drafted “Count Data Models in SAS®” (www2.sas.com/proceedings/forum2008/371-2008.pdf), it was pretty cumbersome to specify the log likelihood function of a composite model for count outcomes, e.g. a hurdle Poisson or zero-inflated Poisson model. However, with the availability of the FMM procedure, estimating composite models has never been easier.

In the demonstration below, I am going to show a side-by-side comparison of how to estimate three types of composite models for count outcomes, namely hurdle Poisson, zero-inflated Poisson, and finite mixture Poisson models, with the FMM and NLMIXED procedures respectively. As shown, both procedures produce identical model estimates.
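
For reference, the likelihoods coded explicitly in the NLMIXED steps below can be sketched as follows, where p denotes the mixing probability (a logistic function of the covariates in the PROBMODEL statement) and \lambda = \exp(x'\beta) denotes a Poisson mean:

Hurdle Poisson: P(Y = 0) = p, \quad P(Y = k) = (1 - p) \frac{e^{-\lambda} \lambda^{k}}{k! \, (1 - e^{-\lambda})}, \; k \ge 1

Zero-Inflated Poisson: P(Y = 0) = p + (1 - p) e^{-\lambda}, \quad P(Y = k) = (1 - p) \frac{e^{-\lambda} \lambda^{k}}{k!}, \; k \ge 1

Two-Class Finite Mixture Poisson: P(Y = k) = p \frac{e^{-\lambda_{1}} \lambda_{1}^{k}}{k!} + (1 - p) \frac{e^{-\lambda_{2}} \lambda_{2}^{k}}{k!}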

Hurdle Poisson Model

*** HURDLE POISSON MODEL WITH FMM PROCEDURE ***;
proc fmm data = tmp1 tech = trureg;
  model majordrg = age acadmos minordrg logspend / dist = truncpoisson;
  model majordrg = / dist = constant;
  probmodel age acadmos minordrg logspend;
run;
/*
Fit Statistics

-2 Log Likelihood             8201.0
AIC  (smaller is better)      8221.0
AICC (smaller is better)      8221.0
BIC  (smaller is better)      8293.5

Parameter Estimates for 'Truncated Poisson' Model
 
                                Standard
Component  Effect     Estimate     Error  z Value  Pr > |z|

        1  Intercept   -2.0706    0.3081    -6.72    <.0001
        1  AGE         0.01796  0.005482     3.28    0.0011
        1  ACADMOS    0.000852  0.000700     1.22    0.2240
        1  MINORDRG     0.1739   0.03441     5.05    <.0001
        1  LOGSPEND     0.1229   0.04219     2.91    0.0036

Parameter Estimates for Mixing Probabilities
 
                         Standard
Effect       Estimate       Error    z Value    Pr > |z|

Intercept     -4.2309      0.1808     -23.40      <.0001
AGE           0.01694    0.003323       5.10      <.0001
ACADMOS      0.002240    0.000492       4.55      <.0001
MINORDRG       0.7653     0.03842      19.92      <.0001
LOGSPEND       0.2301     0.02683       8.58      <.0001
*/

*** HURDLE POISSON MODEL WITH NLMIXED PROCEDURE ***;
proc nlmixed data = tmp1 tech = trureg maxit = 500;
  parms B1_intercept = -4 B1_age = 0 B1_acadmos = 0 B1_minordrg = 0 B1_logspend = 0
        B2_intercept = -2 B2_age = 0 B2_acadmos = 0 B2_minordrg = 0	B2_logspend = 0;

  eta1 = B1_intercept + B1_age * age + B1_acadmos * acadmos + B1_minordrg * minordrg + B1_logspend * logspend;
  exp_eta1 = exp(eta1);
  p0 = 1 / (1 + exp_eta1);
  eta2 = B2_intercept + B2_age * age + B2_acadmos * acadmos + B2_minordrg * minordrg + B2_logspend * logspend;
  exp_eta2 = exp(eta2);
  if majordrg = 0 then _prob_ = p0;
  else _prob_ = (1 - p0) * exp(-exp_eta2) * (exp_eta2 ** majordrg) / ((1 - exp(-exp_eta2)) * fact(majordrg));
  ll = log(_prob_);
  model majordrg ~ general(ll);
run;
/*
Fit Statistics

-2 Log Likelihood                 8201.0
AIC (smaller is better)           8221.0
AICC (smaller is better)          8221.0
BIC (smaller is better)           8293.5

Parameter Estimates
 
                          Standard
Parameter      Estimate      Error     DF   t Value   Pr > |t|

B1_intercept    -4.2309     0.1808    1E4    -23.40     <.0001
B1_age          0.01694   0.003323    1E4      5.10     <.0001
B1_acadmos     0.002240   0.000492    1E4      4.55     <.0001
B1_minordrg      0.7653    0.03842    1E4     19.92     <.0001
B1_logspend      0.2301    0.02683    1E4      8.58     <.0001
============
B2_intercept    -2.0706     0.3081    1E4     -6.72     <.0001
B2_age          0.01796   0.005482    1E4      3.28     0.0011
B2_acadmos     0.000852   0.000700    1E4      1.22     0.2240
B2_minordrg      0.1739    0.03441    1E4      5.05     <.0001
B2_logspend      0.1229    0.04219    1E4      2.91     0.0036
*/

Zero-Inflated Poisson Model

*** ZERO-INFLATED POISSON MODEL WITH FMM PROCEDURE ***;
proc fmm data = tmp1 tech = trureg;
  model majordrg = age acadmos minordrg logspend / dist = poisson;
  model majordrg = / dist = constant;
  probmodel age acadmos minordrg logspend;
run;
/*
Fit Statistics

-2 Log Likelihood             8147.9
AIC  (smaller is better)      8167.9
AICC (smaller is better)      8167.9
BIC  (smaller is better)      8240.5

Parameter Estimates for 'Poisson' Model
 
                                Standard
Component  Effect     Estimate     Error  z Value  Pr > |z|

        1  Intercept   -2.2780    0.3002    -7.59    <.0001
        1  AGE         0.01956  0.006019     3.25    0.0012
        1  ACADMOS    0.000249  0.000668     0.37    0.7093
        1  MINORDRG     0.1176   0.02711     4.34    <.0001
        1  LOGSPEND     0.1644   0.03531     4.66    <.0001

Parameter Estimates for Mixing Probabilities
 
                         Standard
Effect       Estimate       Error    z Value    Pr > |z|

Intercept     -1.9111      0.4170      -4.58      <.0001
AGE          -0.00082    0.008406      -0.10      0.9218
ACADMOS      0.002934    0.001085       2.70      0.0068
MINORDRG       1.4424      0.1361      10.59      <.0001
LOGSPEND      0.09562     0.05080       1.88      0.0598
*/

*** ZERO-INFLATED POISSON MODEL WITH NLMIXED PROCEDURE ***;
proc nlmixed data = tmp1 tech = trureg maxit = 500;
  parms B1_intercept = -2 B1_age = 0 B1_acadmos = 0 B1_minordrg = 0 B1_logspend = 0
        B2_intercept = -2 B2_age = 0 B2_acadmos = 0 B2_minordrg = 0	B2_logspend = 0;

  eta1 = B1_intercept + B1_age * age + B1_acadmos * acadmos + B1_minordrg * minordrg + B1_logspend * logspend;
  exp_eta1 = exp(eta1);
  p0 = 1 / (1 + exp_eta1);
  eta2 = B2_intercept + B2_age * age + B2_acadmos * acadmos + B2_minordrg * minordrg + B2_logspend * logspend;
  exp_eta2 = exp(eta2);
  if majordrg = 0 then _prob_ = p0 + (1 - p0) * exp(-exp_eta2);
  else _prob_ = (1 - p0) * exp(-exp_eta2) * (exp_eta2 ** majordrg) / fact(majordrg);
  ll = log(_prob_);
  model majordrg ~ general(ll);
run;
/*
Fit Statistics

-2 Log Likelihood                 8147.9
AIC (smaller is better)           8167.9
AICC (smaller is better)          8167.9
BIC (smaller is better)           8240.5

Parameter Estimates
 
                          Standard
Parameter      Estimate      Error     DF   t Value   Pr > |t|

B1_intercept    -1.9111     0.4170    1E4     -4.58     <.0001
B1_age         -0.00082   0.008406    1E4     -0.10     0.9219
B1_acadmos     0.002934   0.001085    1E4      2.70     0.0068
B1_minordrg      1.4424     0.1361    1E4     10.59     <.0001
B1_logspend     0.09562    0.05080    1E4      1.88     0.0598
============
B2_intercept    -2.2780     0.3002    1E4     -7.59     <.0001
B2_age          0.01956   0.006019    1E4      3.25     0.0012
B2_acadmos     0.000249   0.000668    1E4      0.37     0.7093
B2_minordrg      0.1176    0.02711    1E4      4.34     <.0001
B2_logspend      0.1644    0.03531    1E4      4.66     <.0001
*/

Two-Class Finite Mixture Poisson Model

*** TWO-CLASS FINITE MIXTURE POISSON MODEL WITH FMM PROCEDURE ***;
proc fmm data = tmp1 tech = trureg;
  model majordrg = age acadmos minordrg logspend / dist = poisson k = 2;
  probmodel age acadmos minordrg logspend;
run;
/*
Fit Statistics

-2 Log Likelihood             8136.8
AIC  (smaller is better)      8166.8
AICC (smaller is better)      8166.9
BIC  (smaller is better)      8275.7

Parameter Estimates for 'Poisson' Model
 
                                Standard
Component  Effect     Estimate     Error  z Value  Pr > |z|

        1  Intercept   -2.4449    0.3497    -6.99    <.0001
        1  AGE         0.02214  0.006628     3.34    0.0008
        1  ACADMOS    0.000529  0.000770     0.69    0.4920
        1  MINORDRG    0.05054   0.04015     1.26    0.2081
        1  LOGSPEND     0.2140   0.04127     5.18    <.0001
        2  Intercept   -8.0935    1.5915    -5.09    <.0001
        2  AGE         0.01150   0.01294     0.89    0.3742
        2  ACADMOS    0.004567  0.002055     2.22    0.0263
        2  MINORDRG     0.2638    0.6770     0.39    0.6968
        2  LOGSPEND     0.6826    0.2203     3.10    0.0019

Parameter Estimates for Mixing Probabilities
 
                         Standard
Effect       Estimate       Error    z Value    Pr > |z|

Intercept     -1.4275      0.5278      -2.70      0.0068
AGE          -0.00277     0.01011      -0.27      0.7844
ACADMOS      0.001614    0.001440       1.12      0.2623
MINORDRG       1.5865      0.1791       8.86      <.0001
LOGSPEND     -0.06949     0.07436      -0.93      0.3501
*/

*** TWO-CLASS FINITE MIXTURE POISSON MODEL WITH NLMIXED PROCEDURE ***;
proc nlmixed data = tmp1 tech = trureg maxit = 500;
  parms B1_intercept = -2 B1_age = 0 B1_acadmos = 0 B1_minordrg = 0 B1_logspend = 0
        B2_intercept = -8 B2_age = 0 B2_acadmos = 0 B2_minordrg = 0 B2_logspend = 0
		B3_intercept = -1 B3_age = 0 B3_acadmos = 0 B3_minordrg = 0 B3_logspend = 0;

  eta1 = B1_intercept + B1_age * age + B1_acadmos * acadmos + B1_minordrg * minordrg + B1_logspend * logspend;
  exp_eta1 = exp(eta1);
  prob1 = exp(-exp_eta1) * exp_eta1 ** majordrg / fact(majordrg);
  eta2 = B2_intercept + B2_age * age + B2_acadmos * acadmos + B2_minordrg * minordrg + B2_logspend * logspend;
  exp_eta2 = exp(eta2);
  prob2 = exp(-exp_eta2) * exp_eta2 ** majordrg / fact(majordrg);
  eta3 = B3_intercept + B3_age * age + B3_acadmos * acadmos + B3_minordrg * minordrg + B3_logspend * logspend;
  exp_eta3 = exp(eta3);
  p = exp_eta3 / (1 + exp_eta3);
  _prob_ = p * prob1 + (1 - p) * prob2;
  ll = log(_prob_);
  model majordrg ~ general(ll);
run;
/*
Fit Statistics

-2 Log Likelihood                 8136.8
AIC (smaller is better)           8166.8
AICC (smaller is better)          8166.9
BIC (smaller is better)           8275.7

Parameter Estimates
 
                          Standard
Parameter      Estimate      Error     DF   t Value   Pr > |t|

B1_intercept    -2.4449     0.3497    1E4     -6.99     <.0001
B1_age          0.02214   0.006628    1E4      3.34     0.0008
B1_acadmos     0.000529   0.000770    1E4      0.69     0.4920
B1_minordrg     0.05054    0.04015    1E4      1.26     0.2081
B1_logspend      0.2140    0.04127    1E4      5.18     <.0001
============
B2_intercept    -8.0935     1.5916    1E4     -5.09     <.0001
B2_age          0.01150    0.01294    1E4      0.89     0.3742
B2_acadmos     0.004567   0.002055    1E4      2.22     0.0263
B2_minordrg      0.2638     0.6770    1E4      0.39     0.6968
B2_logspend      0.6826     0.2203    1E4      3.10     0.0020
============
B3_intercept    -1.4275     0.5278    1E4     -2.70     0.0068
B3_age         -0.00277    0.01011    1E4     -0.27     0.7844
B3_acadmos     0.001614   0.001440    1E4      1.12     0.2623
B3_minordrg      1.5865     0.1791    1E4      8.86     <.0001
B3_logspend    -0.06949    0.07436    1E4     -0.93     0.3501
*/

Written by statcompute

June 5, 2013 at 10:44 pm