Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

Archive for July 2013

Dispersion Models

Last week, I read an interesting article, "Dispersion Models in Regression Analysis" by Peter Song (http://www.pakjs.com/journals/25%284%29/25%284%299.pdf), which describes a class of models more general than the classic generalized linear models based on error distributions.

A dispersion model is defined by two parameters, a location parameter mu and a dispersion parameter sigma ^ 2, and has a very general probability function of the form:
p(y; mu, sigma ^ 2) = {2 * pi * sigma ^ 2 * V(y)} ^ -0.5 * exp{-1 / (2 * sigma ^ 2) * D(y; mu)}
where the unit variance function V(.) and the unit deviance function D(.) vary by distribution. Note that in the density, V(.) is evaluated at the observed y rather than at mu. For instance, in a Poisson model,
D(y; mu) = 2 * (y * log(y / mu) - y + mu)
V(mu) = mu, so that V(y) = y in the density
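As a quick numerical check (in Python, which is outside the original SAS workflow), the dispersion density above with sigma ^ 2 = 1 should recover the exact Poisson probability up to the Stirling-approximation error, which shrinks as y grows:

```python
import math

def pois_pmf(y, mu):
    # exact Poisson probability mass
    return mu ** y * math.exp(-mu) / math.factorial(y)

def pois_dispersion(y, mu, s2=1.0):
    # dispersion (saddlepoint) form: {2*pi*s2*V(y)}^-0.5 * exp{-D/(2*s2)}
    d = 2 * (y * math.log(y / mu) - y + mu)   # unit deviance D(y; mu)
    v = y                                     # variance function V evaluated at y
    return (2 * math.pi * s2 * v) ** -0.5 * math.exp(-d / (2 * s2))

# with sigma^2 = 1, the ratio to the exact pmf approaches 1 as y grows
for y in (5, 20, 50):
    print(y, round(pois_dispersion(y, mu=20.0) / pois_pmf(y, 20.0), 4))
```

The ratio stays slightly above 1 (the Stirling error in y!), which is why the dispersion-model likelihood is an approximation rather than an exact reparameterization.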

Below is a piece of SAS code estimating a Poisson model both with the classic error-distribution assumption and with the dispersion assumption.

data one;
  do i = 1 to 1000;
    x = ranuni(i);
    y = ranpoi(i, exp(2 + x * 2 + rannor(1) * 0.1));
    output;
  end;
run;

*** fit a poisson model with classic GLM ***;
proc nlmixed data = one tech = trureg;
  parms b0 = 0 b1 = 0;
  mu = exp(b0 + b1 * x);
  ll = -mu + y * log(mu) - log(fact(y));
  model y ~ general(ll);
run;
/*
             Fit Statistics
-2 Log Likelihood                 6118.0
AIC (smaller is better)           6122.0
AICC (smaller is better)          6122.0
BIC (smaller is better)           6131.8

                                             Parameter Estimates
                         Standard
Parameter    Estimate       Error      DF    t Value    Pr > |t|     Alpha       Lower       Upper    Gradient
b0             2.0024     0.01757    1000     113.95      <.0001      0.05      1.9679      2.0369    5.746E-9
b1             1.9883     0.02518    1000      78.96      <.0001      0.05      1.9388      2.0377    1.773E-9
*/

*** fit a poisson model with dispersion probability ***;
*** proposed by Jorgensen in 1987                   ***;
proc nlmixed data = one tech = trureg;
  parms b0 = 0 b1 = 0 s2 = 1;
  mu = exp(b0 + b1 * x);
  *** unit deviance and variance function V(.), both evaluated at y ***;
  *** note: this saddlepoint form requires y > 0                    ***;
  d  = 2 * (y * log(y / mu) - y + mu);
  v  = y;
  lh = (2 * constant('pi') * s2 * v) ** (-0.5) * exp(-(2 * s2) ** (-1) * d);
  ll = log(lh);
  model y ~ general(ll);
run;
/*
             Fit Statistics
-2 Log Likelihood                 6066.2
AIC (smaller is better)           6072.2
AICC (smaller is better)          6072.2
BIC (smaller is better)           6086.9

                                             Parameter Estimates
                         Standard
Parameter    Estimate       Error      DF    t Value    Pr > |t|     Alpha       Lower       Upper    Gradient
b0             2.0024     0.02015    1000      99.37      <.0001      0.05      1.9629      2.0420    2.675E-6
b1             1.9883     0.02888    1000      68.86      <.0001      0.05      1.9316      2.0449    1.903E-6
s2             1.3150     0.05881    1000      22.36      <.0001      0.05      1.1996      1.4304    -0.00002
*/

Please note that, although both methods yield identical parameter estimates, the standard errors and therefore the t-values differ slightly. This is expected: since V(.) is evaluated at y, the dispersion parameter sigma ^ 2 factors out of the estimating equations for b0 and b1, so the point estimates coincide with the Poisson MLE, while sigma ^ 2 rescales the standard errors. In addition, despite one more parameter estimated in the model, AIC / BIC are even lower in the dispersion model.
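The coincidence of point estimates can be checked outside SAS. Below is a minimal Python sketch (numpy only, with freshly simulated data, so the numbers will differ slightly from the SAS output above): because V(.) is evaluated at y, the b estimates reduce to the ordinary Poisson MLE, and the profile MLE of sigma ^ 2 is simply the mean unit deviance.

```python
import numpy as np

# simulate data analogous to the SAS example
rng = np.random.default_rng(1)
n = 1000
x = rng.random(n)
y = rng.poisson(np.exp(2.0 + 2.0 * x + 0.1 * rng.normal(size=n)))

# Newton-Raphson for the Poisson MLE of (b0, b1); since V(.) is
# evaluated at y, these are also the dispersion-model estimates
X = np.column_stack([np.ones(n), x])
b = np.array([np.log(y.mean()), 0.0])   # start at intercept-only fit
for _ in range(50):
    mu = np.exp(X @ b)
    grad = X.T @ (y - mu)               # score vector
    hess = X.T @ (X * mu[:, None])      # observed information
    step = np.linalg.solve(hess, grad)
    b = b + step
    if np.abs(step).max() < 1e-10:
        break

# profile MLE of sigma^2 is the mean unit deviance
mu = np.exp(X @ b)
yy = np.where(y == 0, 1.0, y)           # 0 * log(0) convention -> 0
d = 2.0 * (y * np.log(yy / mu) - y + mu)
s2 = d.mean()
print(b, s2)
```

With this setup, b should land near (2, 2) and s2 noticeably above 1, mirroring the overdispersion the SAS run picked up.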


Written by statcompute

July 14, 2013 at 11:42 am

A Schematic Diagram of Statistical Models for Fractional Outcomes

Below is a schematic diagram of statistical models for fractional outcomes based on my studies done in early 2012. For details, please refer to my blog series in “Modeling Rates and Proportions in SAS”.

[Figure: schematic diagram of statistical models for fractional outcomes]

Written by statcompute

July 6, 2013 at 10:23 am

V-Fold Cross Validation to Pick GRNN Smoothing Parameter

On 06/23, I posted two SAS macros implementing GRNN (https://statcompute.wordpress.com/2013/06/23/prototyping-a-general-regression-neural-network-with-sas). However, in order to use these macros in a production environment, we still need a scheme to automatically choose the optimal value of the smoothing parameter. In practice, v-fold or holdout cross validation is often used to accomplish this task. Below is a SAS macro implementing v-fold cross validation to automatically select the optimal value of the smoothing parameter based on the highest K-S statistic in a binary classification case.
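To make the scheme concrete before the macro itself, here is a minimal Python illustration (not the macro; the GRNN is reduced to its Gaussian-kernel Nadaraya-Watson form and the K-S statistic is computed directly, assuming a numeric x matrix and a 0/1 response):

```python
import numpy as np

def grnn_predict(x_train, y_train, x_test, sigma):
    # GRNN output: Gaussian-kernel weighted average of training responses
    d2 = ((x_test[:, None, :] - x_train[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    return w @ y_train / w.sum(axis=1)

def ks_stat(score, y):
    # two-sample KS distance between score distributions of y=0 and y=1
    s0, s1 = np.sort(score[y == 0]), np.sort(score[y == 1])
    grid = np.unique(score)
    cdf0 = np.searchsorted(s0, grid, side="right") / len(s0)
    cdf1 = np.searchsorted(s1, grid, side="right") / len(s1)
    return np.abs(cdf0 - cdf1).max()

def cv_ks(x, y, sigmas, v=5, seed=1):
    # v-fold cross validation: score each sigma by out-of-fold KS
    rng = np.random.default_rng(seed)
    fold = rng.integers(0, v, size=len(y))
    out = {}
    for s in sigmas:
        pred = np.empty(len(y))
        for j in range(v):
            hold = fold == j
            pred[hold] = grnn_predict(x[~hold], y[~hold], x[hold], s)
        out[s] = ks_stat(pred, y)
    return out

# toy binary data where larger x[:, 0] means a higher event rate
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 2))
y = (rng.random(500) < 1 / (1 + np.exp(-2 * x[:, 0]))).astype(float)
result = cv_ks(x, y, sigmas=[0.1, 0.5, 1.0, 2.0])
best = max(result, key=result.get)
print(result, best)
```

The SAS macro below follows the same loop structure: split the data into v folds, train on v - 1 folds, score the held-out fold, pool the out-of-fold predictions, and rank the candidate sigmas by K-S.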

%macro grnn_cv(data = , y = , x = , v = , sigmas = );
********************************************************;
* THIS MACRO IS TO DO THE V-FOLD CROSS VALIDATION TO   *;
* PICK THE OPTIMAL VALUE OF SMOOTHING PARAMETER IN A   *;
* BINARY CLASSIFICATION PROBLEM                        *;
*------------------------------------------------------*;
* INPUT PARAMETERS:                                    *;
*  DATA  : INPUT SAS DATASET                           *;
*  X     : A LIST OF PREDICTORS IN THE NUMERIC FORMAT  *;
*  Y     : A RESPONSE VARIABLE IN THE NUMERIC FORMAT   *;
*  V     : NUMBER OF FOLDS FOR CROSS-VALIDATION        *; 
*  SIGMAS: A LIST OF SIGMA VALUES TO TEST              *;
*------------------------------------------------------*;
* OUTPUT:                                              *;
*  SAS PRINT-OUT OF CROSS VALIDATION RESULT IN KS      *;
*  STATISTICS                                          *;
*------------------------------------------------------*;
* AUTHOR:                                              *;
*  WENSUI.LIU@53.COM                                   *;
********************************************************;

options nocenter nonumber nodate mprint mlogic symbolgen         
        orientation = landscape ls = 125 formchar = "|----|+|---+=|-/\<>*"; 

data _data_;
  set &data (keep = &x &y);
  where &y ~= .;
  array _x_ &x;
  _miss_ = 0;
  do _i_ = 1 to dim(_x_);
    if _x_[_i_] = . then _miss_ = 1; 
  end;
  _rand_ = ranuni(1);
  if _miss_ = 0 then output;
run;

proc rank data = _last_ out = _cv1 groups = &v;
  var _rand_;
  ranks _rank_;
run;

%local i;
%let i = 1;

%inc "grnn_learn.sas";
%inc "grnn_pred.sas";

%do %while (%scan(&sigmas, &i, " ") ne %str());
%let sigma = %scan(&sigmas, &i, " ");

  %do j = 0 %to %eval(&v - 1);
  %put &sigma | &i | &j;
   
  data _cv2 _cv3;
    set _cv1;
    if _rank_ ~= &j then output _cv2;
    else output _cv3;
  run;
  
  %grnn_learn(data = _cv2, x = &x, y = &y, sigma = &sigma, nn_out = _grnn);
 
  %grnn_pred(data = _cv3, x = &x, nn_in = _grnn, id = _rand_, out = _pred);

  proc sql;
  create table
    _cv4 as
  select
    a.&y as _y_,
    b._pred_  
  from
    _cv3 as a inner join _pred as b on a._rand_ = b._id_;
  quit;

  %if &j = 0 %then %do;
  data _cv5;
    set _cv4;
  run;
  %end;
  %else %do;
  data _cv5;
    set _cv5 _cv4;
  run;
  %end;

  %end;

ods listing close;
ods output kolsmir2stats = _ks1;
proc npar1way wilcoxon edf data = _cv5;
  class _y_;
  var _pred_;
run;
ods listing;

data _ks2 (keep = sigma ks);
  set _ks1;
  if _n_ = 1 then do;
    sigma = &sigma;
    ks = nvalue2 * 100;
    output;
  end;
run;

%if &i = 1 %then %do;
data _ks3;
  set _ks2;
run;
%end;
%else %do;
data _ks3;
  set _ks3 _ks2;
run;
%end;

%let i = %eval(&i + 1); 
%end;

title "&v._fold cross validation outcomes";
proc print data = _ks3 noobs;
run;

********************************************************;
*              END OF THE MACRO                        *;
********************************************************;
%mend grnn_cv;

Written by statcompute

July 4, 2013 at 11:53 pm