Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

Archive for September 2013

Test Drive of PyPy

PyPy (www.pypy.org) is an alternative Python interpreter to CPython. Today, I took PyPy for a test drive. The Python code below generates 1,000,000 rows of data and then calculates the aggregated sum within each category.

import random
import pprint

random.seed(2013)
# generate 1,000,000 rows of [category 1, category 2, value]
lst = [[random.randint(1, 2), random.randint(1, 3), random.uniform(-1, 1)] for i in xrange(1, 1000001)]

summ = []

# sum the value column within each combination of the two categories
for i in set(x[0] for x in lst):
  for j in set(x[1] for x in lst):
    total = sum([x[2] for x in lst if x[0] == i and x[1] == j])
    summ.append([i, j, round(total, 2)])

pprint.pprint(summ)

First of all, I run it with CPython and get the following outcome.

liuwensui@ubuntu64:~/Documents/code$ time python test_pypy.py 
[[1, 1, 258.53],
 [1, 2, -209.49],
 [1, 3, -225.78],
 [2, 1, 202.2],
 [2, 2, 58.63],
 [2, 3, 188.99]]

real    0m4.491s
user    0m4.448s
sys     0m0.040s

However, if I run it with PyPy, the run time is only about one third of that with the CPython interpreter.

liuwensui@ubuntu64:~/Documents/code$ time pypy test_pypy.py 
[[1, 1, 258.53],
 [1, 2, -209.49],
 [1, 3, -225.78],
 [2, 1, 202.2],
 [2, 2, 58.63],
 [2, 3, 188.99]]

real    0m1.351s
user    0m1.228s
sys     0m0.120s

Written by statcompute

September 15, 2013 at 4:57 pm

Posted in Big Data, PYTHON

Generate and Retrieve Many Objects with Sequential Names

While coding ensemble methods in data mining with R, e.g. bagging, we often need to generate many data and model objects with sequential names. Below is a quick example of how to use the assign() function to generate many prediction objects on the fly and then retrieve these predictions with mget() to do the model averaging.

data(Boston, package = "MASS")

for (i in 1:10) {
  set.seed(i)
  smp <- Boston[sample(1:nrow(Boston), nrow(Boston), replace = TRUE), ]
  fit <- glm(medv ~ ., data = smp)
  prd <- predict(fit, Boston)
  ### ASSIGN A LIST OF SEQUENTIAL NAMES TO PREDICTIONS ###
  assign(paste("p", i, sep = ""), prd)
}

### RETURN NAMED OBJECTS TO A LIST ###
plist <- mget(paste('p', 1:i, sep = ''))
### AGGREGATE ALL PREDICTIONS ###
pcols <- do.call('cbind', plist)
pred_medv <- rowSums(pcols) / i

### A SIMPLE FUNCTION CALCULATING R-SQUARE ###
r2 <- function(y, yhat) {
  ybar <- mean(y)
  r2 <- sum((yhat - ybar) ^ 2) / sum((y - ybar) ^ 2)
  return(r2)
}
print(r2(Boston$medv, pred_medv))
# OUTPUT:
# [1] 0.7454225

Written by statcompute

September 8, 2013 at 2:09 pm

Calculate Predicted Values for Composite Models with NLMIXED

After estimating a statistical model, we often need to use the estimated model specification to calculate predicted values on a separate hold-out dataset in order to evaluate the model performance. In R or in Python's StatsModels, it is trivial to calculate predicted values by passing the new data through the fitted model object. In SAS, however, a similar task can become a bit complicated, especially when we want to calculate predicted values of a composite model, e.g. a Zero-Inflated Beta model, estimated through the NLMIXED procedure. Most of the time, we would need to translate the model specification into open SAS code or into a SAS dataset that can be used by the SCORE procedure, neither of which is easy.
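For instance, in Python's StatsModels, scoring a hold-out sample takes a single predict() call on the fitted result. Below is a minimal sketch with simulated data; the variable names and the 150/50 split are purely illustrative and not part of the SAS example later in this post.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# simulate a small dataset purely for illustration
np.random.seed(2013)
df = pd.DataFrame({'x1': np.random.randn(200), 'x2': np.random.randn(200)})
df['y'] = 1 + 2 * df['x1'] - df['x2'] + np.random.randn(200)
train, test = df[:150], df[150:]

# fit on the development sample, then score the hold-out sample directly
fit = smf.ols('y ~ x1 + x2', data = train).fit()
yhat = fit.predict(test)
print(yhat.head())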

Today, I’d like to introduce a small trick that allows us to calculate predicted values with the NLMIXED procedure on the fly, which can be extremely handy when we are only interested in the predictions. Below is the high-level procedure.

1) First, we combine, e.g. union, the development and testing datasets into one table and use a variable to flag all observations from the development dataset, e.g. deve_flg = 1.
2) Second, we feed the whole dataset into the NLMIXED procedure but specify the likelihood function only for observations from the development dataset. For observations from the testing dataset, we force the log likelihood contribution to equal 0 (zero).
3) Finally, we use the PREDICT statement in the NLMIXED procedure to calculate the predicted values of interest.

Please note that since we artificially inflate the sample size used in the model estimation, some statistical measures, e.g. BIC, might not be accurate. However, since the log likelihood is still correct, it is trivial to recalculate BIC manually, as sketched below.
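For example, BIC can be recomputed by hand as -2 log L + k * log(n), using the -2 log likelihood reported by NLMIXED together with the number of parameters and the size of the development sample only. Below is a minimal sketch in Python; the -2 log likelihood value is just a placeholder, while the 24 parameters and the 2,641 development observations match the NLMIXED example that follows.

from math import log

neg2_loglik = 1234.5   # placeholder: the "-2 Log Likelihood" reported by NLMIXED
k = 24                 # number of estimated parameters (a0-a7, b0-b7, c0-c7)
n = 2641               # observations in the development sample only
bic = neg2_loglik + k * log(n)
print(round(bic, 2))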

Below is an example of how to calculate predicted values of a Zero-Inflated Beta model. In this composite model, there are 3 sets of parameters, 2 for the means and 1 for the variance. Therefore, it could be extremely cumbersome to compute predicted values without a scheme to calculate them automatically.

libname data 'c:\projects\data';

*** COMBINE DEVELOPMENT AND TEST DATASETS ***;
data full;
  set data.deve (in = a) data.test (in = b);
  *** FLAG OUT THE DEVELOPMENT SAMPLE ***;
  if a then deve_flg = 1;
  if b then deve_flg = 0;
run;

proc nlmixed data = full tech = trureg;
  parms a0 = 0  a1 = 0  a2 = 0  a3 = 0  a4 = 0  a5 = 0  a6 = 0  a7 = 0
        b0 = 0  b1 = 0  b2 = 0  b3 = 0  b4 = 0  b5 = 0  b6 = 0  b7 = 0
        c0 = 1  c1 = 0  c2 = 0  c3 = 0  c4 = 0  c5 = 0  c6 = 0  c7 = 0;
  xa = a0 + a1 * x1 + a2 * x2 + a3 * x3 + a4 * x4 + a5 * x5 + a6 * x6 + a7 * x7;
  xb = b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 + b5 * x5 + b6 * x6 + b7 * x7;
  xc = c0 + c1 * x1 + c2 * x2 + c3 * x3 + c4 * x4 + c5 * x5 + c6 * x6 + c7 * x7;
  mu_xa = 1 / (1 + exp(-xa));
  mu_xb = 1 / (1 + exp(-xb));
  phi = exp(xc);
  w = mu_xb * phi;
  t = (1 - mu_xb) * phi;
  *** SPECIFY LIKELIHOOD FUNCTION FOR ALL CASES WITH DEVE. FLAG ***;
  if deve_flg = 1 then do;
    if y = 0 then lh = 1 - mu_xa;
    else lh = mu_xa * (gamma(w + t) / (gamma(w) * gamma(t)) * (y ** (w - 1)) * ((1 - y) ** (t - 1)));
    ll = log(lh);
  end;
  *** FORCE LOG LIKELIHOOD = 0 FOR ALL CASES WITHOUT DEVE. FLAG ***;
  else ll = 0;
  model y ~ general(ll);
  mu = mu_xa * mu_xb;
  predict mu out = pred (rename = (pred = mu));
run;

proc means data = pred mean n;
  class deve_flg;
  var y mu;
run;
/* output:
                   N
    deve_flg     Obs    Variable    Label                      Mean       N
---------------------------------------------------------------------------
           0    1780    y                                 0.0892984    1780
                        mu          Predicted Value       0.0932779    1780

           1    2641    y                                 0.0918661    2641
                        mu          Predicted Value       0.0919383    2641
---------------------------------------------------------------------------
*/

Written by statcompute

September 2, 2013 at 11:27 am