Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

Calculate Predicted Values for Composite Models with NLMIXED

After estimating a statistical model, we often need to use the estimated model specification to calculate predicted values from a separate hold-out dataset to evaluate the model performance. In R or StatsModels of Python, it is trivial to calculate the predicted values by passing the data through the model object. However, in SAS, the similar task might become a bit complicated especially when we want to calculate predicted values of a composite model, e.g. Zero-Inflated Beta Model, estimated through NLMIXED procedure. Most of the time, we might need to parse the model specification into an open sas code or into a sas dataset that can be used by SCORE procedure, which is not easy in either case.

Today, I’d like to introduce a small trick that can allow us to calculate predicted values with NLMIXED procedure on the fly and that can be extremely handy in case of our interests in the prediction only. Below is a high-level procedure.

1) first of all, we combine, e.g. union, both the development and the testing datasets into 1 table and use a variable to flag out all observations from the develoopment dataset, e.g. deve_flg = 1.
2) secondly, we feed the whole dataset into NLMIXED procedure. However, we should only specify the likelihood function for all observations from the development dataset. For observations from the testing dataset, we force the likelihood function equal to 0 (zero).
3) at last, we might use the PREDICT statement in NLMIXED procedure to calculate predicted values of interest.

Please note that since we artificially increase the sample size of the model estimation sample, some statistical measures, e.g. BIC, might not be accurate. However, since the log likelihood function is still correct, it is trivial to calculate BIC manually.

Below is an example how to calculate predicted values of a Zero-Inflated Beta Model. In this composite model, there are 3 sets of parameters, 2 for mean and 1 for variance. Therefore, it could be extremely cumbersome if there is no scheme to calculate predicted values automatically.

libname data 'c:\projects\data';

*** COMBINE DEVELOPMENT AND TEST DATASETS ***;
data full;
  set data.deve (in = a) data.test (in = b);
  *** FLAG OUT THE DEVELOPMENT SAMPLE ***;
  if a then deve_flg = 1;
  if b then deve_flg = 0;
run;

proc nlmixed data = full tech = trureg;
  parms a0 = 0  a1 = 0  a2 = 0  a3 = 0  a4 = 0  a5 = 0  a6 = 0  a7 = 0
        b0 = 0  b1 = 0  b2 = 0  b3 = 0  b4 = 0  b5 = 0  b6 = 0  b7 = 0
        c0 = 1  c1 = 0  c2 = 0  c3 = 0  c4 = 0  c5 = 0  c6 = 0  c7 = 0;
  xa = a0 + a1 * x1 + a2 * x2 + a3 * x3 + a4 * x4 + a5 * x5 + a6 * x6 + a7 * x7;
  xb = b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 + b5 * x5 + b6 * x6 + b7 * x7;
  xc = c0 + c1 * x1 + c2 * x2 + c3 * x3 + c4 * x4 + c5 * x5 + c6 * x6 + c7 * x7;
  mu_xa = 1 / (1 + exp(-xa));
  mu_xb = 1 / (1 + exp(-xb));
  phi = exp(xc);
  w = mu_xb * phi;
  t = (1 - mu_xb) * phi;
  *** SPECIFY LIKELIHOOD FUNCTION FOR ALL CASES WITH DEVE. FLAG ***;
  if deve_flg = 1 then do;
    if y = 0 then lh = 1 - mu_xa;
    else lh = mu_xa * (gamma(w + t) / (gamma(w) * gamma(t)) * (y ** (w - 1)) * ((1 - y) ** (t - 1)));
    ll = log(lh);
  end;
  *** FORCE LIKELIHOOD FUNCTION = 0 FOR ALL CASES WITHOUT DEVE. FLAG ***;
  else ll = 0;
  model y ~ general(ll);
  mu = mu_xa * mu_xb;
  predict mu out = pred (rename = (pred = mu));
run;

proc means data = pred mean n;
  class deve_flg;
  var y mu;
run;
/* output:
                   N
    deve_flg     Obs    Variable    Label                      Mean       N
---------------------------------------------------------------------------
           0    1780    y                                 0.0892984    1780
                        mu          Predicted Value       0.0932779    1780

           1    2641    y                                 0.0918661    2641
                        mu          Predicted Value       0.0919383    2641
---------------------------------------------------------------------------
*/
Advertisements

Written by statcompute

September 2, 2013 at 11:27 am

%d bloggers like this: