## Archive for **September 2013**

## Test Drive of PyPy

PyPy (www.pypy.org) is an alternative Python interpreter to CPython. Today, I did a test drive with PyPy. The Python code below generates 1,000,000 rows of data and then calculates the aggregated sum within each category combination.

```python
import random
import pprint

random.seed(2013)
rows = [[random.randint(1, 2), random.randint(1, 3), random.uniform(-1, 1)]
        for i in xrange(1, 1000001)]
summ = []
for i in set(x[0] for x in rows):
    for j in set(x[1] for x in rows):
        total = sum([x[2] for x in rows if x[0] == i and x[1] == j])
        summ.append([i, j, round(total, 2)])
pprint.pprint(summ)
```

First of all, I ran it with CPython and got the following outcome.

```
liuwensui@ubuntu64:~/Documents/code$ time python test_pypy.py
[[1, 1, 258.53],
 [1, 2, -209.49],
 [1, 3, -225.78],
 [2, 1, 202.2],
 [2, 2, 58.63],
 [2, 3, 188.99]]

real    0m4.491s
user    0m4.448s
sys     0m0.040s
```

However, if I run it with PyPy, the run time is only about one third of that with the CPython interpreter.

```
liuwensui@ubuntu64:~/Documents/code$ time pypy test_pypy.py
[[1, 1, 258.53],
 [1, 2, -209.49],
 [1, 3, -225.78],
 [2, 1, 202.2],
 [2, 2, 58.63],
 [2, 3, 188.99]]

real    0m1.351s
user    0m1.228s
sys     0m0.120s
```
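As an aside, the nested loops above rescan the full list once per category pair; a single pass over the data with a dictionary is algorithmically cheaper under either interpreter. A minimal sketch (written with `range` so it also runs on Python 3; the variable names are mine):

```python
import random
import pprint

random.seed(2013)
rows = [[random.randint(1, 2), random.randint(1, 3), random.uniform(-1, 1)]
        for i in range(1000000)]

# accumulate the sum for each (category1, category2) pair in one pass
totals = {}
for c1, c2, value in rows:
    totals[(c1, c2)] = totals.get((c1, c2), 0.0) + value

summ = [[c1, c2, round(total, 2)] for (c1, c2), total in sorted(totals.items())]
pprint.pprint(summ)
```

Note that the random stream differs between Python 2 and Python 3 for the same seed, so the exact sums will not match the output above.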

## Generate and Retrieve Many Objects with Sequential Names

While coding ensemble methods for data mining in R, e.g. bagging, we often need to generate many data and model objects with sequential names. Below is a quick example of how to use the **assign()** function to generate many prediction objects on the fly and then retrieve these predictions with **mget()** to do the model averaging.

```r
data(Boston, package = "MASS")
for (i in 1:10) {
  set.seed(i)
  smp <- Boston[sample(1:nrow(Boston), nrow(Boston), replace = TRUE), ]
  glm <- glm(medv ~ ., data = smp)
  prd <- predict(glm, Boston)
  ### ASSIGN A LIST OF SEQUENTIAL NAMES TO PREDICTIONS ###
  assign(paste("p", i, sep = ""), prd)
}
### RETURN NAMED OBJECTS TO A LIST ###
plist <- mget(paste('p', 1:i, sep = ''))
### AGGREGATE ALL PREDICTIONS ###
pcols <- do.call('cbind', plist)
pred_medv <- rowSums(pcols) / i
### A SIMPLE FUNCTION CALCULATING R-SQUARE ###
r2 <- function(y, yhat) {
  ybar <- mean(y)
  r2 <- sum((yhat - ybar) ^ 2) / sum((y - ybar) ^ 2)
  return(r2)
}
print(r2(Boston$medv, pred_medv))
# OUTPUT:
# [1] 0.7454225
```
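The **assign()** / **mget()** pair is essentially emulating a dictionary keyed by sequential names. For comparison, the same generate-and-retrieve pattern falls out of a plain dict in Python; the toy predictions below are made up for illustration and are not the Boston fits:

```python
import random

random.seed(2013)

# toy stand-in for the bootstrapped predictions: 10 "models",
# each producing a prediction for 5 observations
preds = {}
for i in range(1, 11):
    preds["p" + str(i)] = [random.uniform(20.0, 25.0) for _ in range(5)]

# retrieve by sequential name and average across models, mirroring
# mget() + do.call('cbind', ...) + rowSums() / i in the R code above
names = ["p" + str(i) for i in range(1, 11)]
avg = [sum(preds[n][j] for n in names) / len(names) for j in range(5)]
print(avg)
```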

## Calculate Predicted Values for Composite Models with NLMIXED

After estimating a statistical model, we often need to use the estimated model specification to calculate predicted values on a separate hold-out dataset to evaluate model performance. In R or Python StatsModels, it is trivial to calculate predicted values by passing the data through the model object. In SAS, however, a similar task can become complicated, especially when we want to calculate predicted values of a composite model, e.g. a Zero-Inflated Beta model, estimated with the NLMIXED procedure. Most of the time, we would need to parse the model specification into open SAS code or into a SAS dataset usable by the SCORE procedure, neither of which is easy.

Today, I’d like to introduce a small trick that allows us to calculate predicted values with the NLMIXED procedure on the fly, which can be extremely handy when we are interested in the prediction only. Below is the high-level procedure.

1) First of all, we combine, e.g. union, both the development and the testing datasets into one table and use a variable to flag all observations from the development dataset, e.g. deve_flg = 1.

2) Secondly, we feed the whole dataset into the NLMIXED procedure but specify the likelihood function only for observations from the development dataset. For observations from the testing dataset, we force the likelihood function to equal 0 (zero).

3) At last, we use the PREDICT statement in the NLMIXED procedure to calculate the predicted values of interest.

Please note that since we artificially increase the size of the estimation sample, some statistical measures, e.g. BIC, might not be accurate. However, since the log likelihood function is still correct, it is trivial to calculate BIC manually.
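For reference, the manual calculation is BIC = -2·logL + k·log(n), with logL taken from the NLMIXED output, k the number of free parameters, and n the size of the development sample rather than the inflated union. A quick sketch with hypothetical numbers (the logL value below is a placeholder, not an actual fit):

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """BIC = -2 * logL + k * log(n); use the development sample size for n."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)

# hypothetical values: logL from the NLMIXED output, 24 parameters
# (a0-a7, b0-b7, c0-c7), and 2,641 development observations
print(bic(-850.0, 24, 2641))
```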

Below is an example of how to calculate predicted values of a Zero-Inflated Beta model. In this composite model, there are 3 sets of parameters, 2 for the mean and 1 for the variance, so it would be extremely cumbersome without a scheme to calculate predicted values automatically.

```sas
libname data 'c:\projects\data';

*** COMBINE DEVELOPMENT AND TEST DATASETS ***;
data full;
  set data.deve (in = a) data.test (in = b);
  *** FLAG OUT THE DEVELOPMENT SAMPLE ***;
  if a then deve_flg = 1;
  if b then deve_flg = 0;
run;

proc nlmixed data = full tech = trureg;
  parms a0 = 0 a1 = 0 a2 = 0 a3 = 0 a4 = 0 a5 = 0 a6 = 0 a7 = 0
        b0 = 0 b1 = 0 b2 = 0 b3 = 0 b4 = 0 b5 = 0 b6 = 0 b7 = 0
        c0 = 1 c1 = 0 c2 = 0 c3 = 0 c4 = 0 c5 = 0 c6 = 0 c7 = 0;

  xa = a0 + a1 * x1 + a2 * x2 + a3 * x3 + a4 * x4 + a5 * x5 + a6 * x6 + a7 * x7;
  xb = b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 + b5 * x5 + b6 * x6 + b7 * x7;
  xc = c0 + c1 * x1 + c2 * x2 + c3 * x3 + c4 * x4 + c5 * x5 + c6 * x6 + c7 * x7;

  mu_xa = 1 / (1 + exp(-xa));
  mu_xb = 1 / (1 + exp(-xb));
  phi = exp(xc);
  w = mu_xb * phi;
  t = (1 - mu_xb) * phi;

  *** SPECIFY LIKELIHOOD FUNCTION FOR ALL CASES WITH DEVE. FLAG ***;
  if deve_flg = 1 then do;
    if y = 0 then lh = 1 - mu_xa;
    else lh = mu_xa * (gamma(w + t) / (gamma(w) * gamma(t)) *
                       (y ** (w - 1)) * ((1 - y) ** (t - 1)));
    ll = log(lh);
  end;
  *** FORCE LIKELIHOOD FUNCTION = 0 FOR ALL CASES WITHOUT DEVE. FLAG ***;
  else ll = 0;

  model y ~ general(ll);
  mu = mu_xa * mu_xb;
  predict mu out = pred (rename = (pred = mu));
run;

proc means data = pred mean n;
  class deve_flg;
  var y mu;
run;

/* output:
                N
 deve_flg     Obs    Variable    Label                   Mean       N
 ---------------------------------------------------------------------
        0    1780    y                              0.0892984    1780
                     mu          Predicted Value    0.0932779    1780

        1    2641    y                              0.0918661    2641
                     mu          Predicted Value    0.0919383    2641
 ---------------------------------------------------------------------
*/
```
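For reference, the zero-inflated beta likelihood coded inside the NLMIXED step can also be written down directly. A hedged Python sketch using log-gamma for numerical stability; the linear predictors passed in below are arbitrary placeholders, not the fitted estimates:

```python
import math

def zib_loglik(y, xa, xb, xc):
    """Log likelihood of one observation under the zero-inflated beta model:
    xa -> P(y > 0) via logit, xb -> beta mean via logit, xc -> log(phi)."""
    mu_xa = 1.0 / (1.0 + math.exp(-xa))   # probability of a nonzero outcome
    mu_xb = 1.0 / (1.0 + math.exp(-xb))   # mean of the beta component
    phi = math.exp(xc)                    # precision parameter
    w = mu_xb * phi
    t = (1.0 - mu_xb) * phi
    if y == 0:
        return math.log(1.0 - mu_xa)
    # log of: mu_xa * Gamma(w+t) / (Gamma(w) * Gamma(t)) * y^(w-1) * (1-y)^(t-1)
    return (math.log(mu_xa)
            + math.lgamma(w + t) - math.lgamma(w) - math.lgamma(t)
            + (w - 1.0) * math.log(y)
            + (t - 1.0) * math.log(1.0 - y))

def zib_mean(xa, xb):
    """Predicted value mu = mu_xa * mu_xb, as in the PREDICT statement."""
    return (1.0 / (1.0 + math.exp(-xa))) * (1.0 / (1.0 + math.exp(-xb)))

print(zib_mean(-2.0, -2.3))               # placeholder linear predictors
print(zib_loglik(0.1, -2.0, -2.3, 1.0))
```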