Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

Archive for the ‘Econometrics’ Category

Estimating Conway-Maxwell-Poisson Regression in SAS

Conway-Maxwell-Poisson (CMP) regression is a flexible way to model frequency outcomes with either under-dispersion or over-dispersion. In SAS, CMP regression can be estimated with the COUNTREG procedure directly or with the NLMIXED procedure by specifying the likelihood function. However, the use of the NLMIXED procedure is rather cumbersome in that we need to estimate a standard Poisson regression first and then use its estimated parameters as initial values of the parameter estimates for the CMP regression.

In the example below, we will show how to employ the GLIMMIX procedure to estimate a CMP regression by providing both the log likelihood function and the variance function in terms of the expected mean.

proc glimmix data = mylib.credit_count;
  model majordrg = age acadmos minordrg ownrent / link = log solution;
  _nu =  1 / exp(_phi_);
  _variance_ = (1 / _nu) / ((_mu_) ** (1 / _nu));
  _z = 0;
  do i = 0 to 100;
    _z = _z + (_mu_ ** i) / fact(i) ** _nu;
  end;
  _prob = (_mu_ ** majordrg) / (fact(majordrg) ** _nu) * (_z ** (-1));
  _logl_ = log(_prob);
run;

Since the scale parameter _phi_ is strictly above 0, the function 1 / exp(_phi_) in line #3 ensures that the Nu parameter is bounded between 0 and 1.

In addition, the DO loop calculates the normalization constant Z so that the PMF sums up to 1. As there is no closed form for Z, we need to calculate it numerically at the cost of a longer computing time.
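
To see what the DO loop is doing, here is a small R sketch of mine (an illustration, not part of the original post) that approximates the normalization constant by truncating the infinite series and then evaluates the CMP probability of an observed count; _mu_ in the SAS code plays the role of lambda below, and the parameter values are arbitrary.

cmp_z <- function(lambda, nu, terms = 100) {
  # truncated series for the normalization constant Z(lambda, nu)
  i <- 0:terms
  sum(lambda ^ i / factorial(i) ^ nu)
}

cmp_prob <- function(y, lambda, nu, terms = 100) {
  # CMP probability of the count y
  lambda ^ y / factorial(y) ^ nu / cmp_z(lambda, nu, terms)
}

# sanity check: the probabilities should sum up to ~1
sum(sapply(0:100, cmp_prob, lambda = 1.5, nu = 0.7))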

Other implicit advantages of the GLIMMIX procedure over the NLMIXED procedure include that there is no need to provide initial values of parameter estimates and that the computing time is shorter.

Written by statcompute

March 26, 2017 at 5:09 pm

Modeling Generalized Poisson Regression in SAS

The Generalized Poisson (GP) regression is a very flexible statistical model for count outcomes in that it can accommodate both over-dispersion and under-dispersion, which makes it a very practical modeling approach for real-world problems and a serious contender to the Quasi-Poisson regression.

Prob(Y) = Alpha / Y! * (Alpha + Xi * Y) ^ (Y - 1) * EXP(-Alpha - Xi * Y)
E(Y) = Mu = Alpha / (1 - Xi)
Var(Y) = Mu / (1 - Xi) ^ 2
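
As a quick numerical sanity check of these formulas (my own illustration, not part of the original post), the R sketch below evaluates the PMF on the log scale and verifies that the probabilities sum to one with the stated mean and variance; the values of Alpha and Xi are arbitrary.

gp_logprob <- function(y, alpha, xi) {
  # log of Prob(Y) = Alpha / Y! * (Alpha + Xi * Y) ^ (Y - 1) * EXP(-Alpha - Xi * Y)
  log(alpha) + (y - 1) * log(alpha + xi * y) - alpha - xi * y - lgamma(y + 1)
}

alpha <- 2
xi <- 0.3
y <- 0:200
p <- exp(gp_logprob(y, alpha, xi))

sum(p)                              # ~ 1
sum(y * p)                          # ~ alpha / (1 - xi)
sum(y ^ 2 * p) - sum(y * p) ^ 2     # ~ (alpha / (1 - xi)) / (1 - xi) ^ 2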

While the GP regression can be conveniently estimated with the FMM procedure in SAS, I always like to dive a little deeper into the model specification and likelihood function to gain a better understanding. For instance, there is a slight difference in GP model outcomes between the FMM procedure in SAS and the VGAM package in R. After looking into the details, I realized that the difference is solely due to the different parameterizations.

Basically, there are three steps for estimating a GP regression with the NLMIXED procedure. Due to the complexity of the GP likelihood function and its convergence process, it is always a good practice to estimate a baseline standard Poisson regression as a starting point and then output its parameter estimates into a table, e.g. _EST, as shown below.

ods output ParameterEstimates = _est;
proc genmod data = mylib.credit_count;
  model majordrg = age acadmos minordrg ownrent / dist = poisson link = log;
run;

After acquiring parameter estimates from the standard Poisson regression, we can use them to construct initial values of parameter estimates for the Generalized Poisson regression. In the code snippet below, the SQL procedure is used to create two macro variables that we are going to use in the final estimation of the GP regression.

proc sql noprint;
select
  "_"||compress(upcase(parameter), ' ')||" = "||compress(put(estimate, 10.2), ' ')
into
  :_parm separated by ' '
from  
  _est;
  
select
  case 
    when upcase(parameter) = 'INTERCEPT' then "_"||compress(upcase(parameter), ' ')
    else "_"||compress(upcase(parameter), ' ')||" * "||compress(upcase(parameter), ' ')
  end
into
  :_xb separated by ' + '    
from  
  _est
where
  upcase(parameter) ~= 'SCALE';  
quit;

/*
%put &_parm;
_INTERCEPT = -1.38 _AGE = 0.01 _ACADMOS = 0.00 _MINORDRG = 0.46 _OWNRENT = -0.20 _SCALE = 1.00

%put &_xb;
 _INTERCEPT + _AGE * AGE + _ACADMOS * ACADMOS + _MINORDRG * MINORDRG + _OWNRENT * OWNRENT
*/

In the last step, we use the NLMIXED procedure to estimate the GP regression by specifying its log likelihood function, which generates model results identical to the FMM procedure. It is worth mentioning that the expected mean is _mu = exp(x * beta) in the SAS parameterization, whereas the term exp(x * beta) refers to the _alpha parameter in R. Therefore, the intercept differs between SAS and R, primarily due to the different parameterizations of the identical statistical logic.

proc nlmixed data = mylib.credit_count;
  parms &_parm.;
  _xb = &_xb.;
  _xi = 1 - exp(-_scale);
  _mu = exp(_xb);  
  _alpha = _mu * (1 - _xi);
  _prob = _alpha / fact(majordrg) * (_alpha + _xi * majordrg) ** (majordrg - 1) * exp(- _alpha - _xi * majordrg);
  ll = log(_prob);
  model majordrg ~ general(ll);
run;
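
Given E(Y) = Mu = Alpha / (1 - Xi), the two parameterizations differ only in the intercept, since log(Alpha) = x * beta + log(1 - Xi). The short R snippet below is my own illustration with purely hypothetical numbers, showing how an intercept on the Mu scale (SAS) could be mapped to the Alpha scale (R).

xi <- 0.25                        # hypothetical dispersion parameter
b0_mu <- -1.38                    # hypothetical intercept on the Mu (SAS) scale
b0_alpha <- b0_mu + log(1 - xi)   # corresponding intercept on the Alpha (R) scale
c(b0_mu = b0_mu, b0_alpha = b0_alpha)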

In addition to the FMM and NLMIXED procedures, the GLIMMIX procedure can also be employed to estimate the GP regression, as shown below. In this case, we need to specify both the log likelihood function and the variance function in terms of the expected mean.

proc glimmix data = mylib.credit_count;
  model majordrg = age acadmos minordrg ownrent / link = log solution;
  _xi = 1 - 1 / exp(_phi_);
  _variance_ = _mu_ / (1 - _xi) ** 2;
  _alpha = _mu_ * (1 - _xi);
  _prob = _alpha / fact(majordrg) * (_alpha + _xi * majordrg) ** (majordrg - 1) * exp(- _alpha - _xi * majordrg);  
  _logl_ = log(_prob);
run;

Written by statcompute

March 11, 2017 at 3:01 pm

Estimate Regression with (Type-I) Pareto Response

The Type-I Pareto distribution has the probability density function shown below

f(y; a, k) = k * (a ^ k) / (y ^ (k + 1))

In the formulation, the scale parameter satisfies 0 < a < y and the shape parameter satisfies k > 1, which is required for the mean to exist.

The positive lower bound of the Type-I Pareto distribution is particularly appealing in modeling the severity measure, in that there is usually a reporting threshold for operational loss events. For instance, the reporting threshold of the ABA operational risk consortium data is $10,000, and any loss event below the threshold would not be reported, which adds complexity to the severity model estimation.

In practice, instead of modeling the severity measure directly, we might model the shifted response y' = severity - threshold to accommodate the threshold value, such that the support of y' starts from 0 and the Gamma, Inverse Gaussian, or Lognormal regression is still applicable. However, under the distributional assumption of a Type-I Pareto with a known lower bound, we do not need to shift the severity measure but can model it directly based on the probability density function.

Below is the R code snippet showing how to estimate a regression model for the Pareto response with the lower bound a = 2 by using the VGAM package.

library(VGAM)
set.seed(2017)
n <- 200
a <- 2
x <- runif(n)
k <- exp(1 + 5 * x)
pdata <- data.frame(y = rpareto(n = n, scale = a, shape = k), x = x)
fit <- vglm(y ~ x, paretoff(scale = a), data = pdata, trace = TRUE)
summary(fit)
# Coefficients:
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept)   1.0322     0.1363   7.574 3.61e-14 ***
# x             4.9815     0.2463  20.229  < 2e-16 ***
AIC(fit)
#  -644.458
BIC(fit)
#  -637.8614

The SAS code below, which estimates the same Type-I Pareto regression, provides almost identical model estimates.

proc nlmixed data = pdata;
  parms b0 = 0.1 b1 = 0.1;
  k = exp(b0 + b1 * x);
  a = 2;
  lh = k * (a ** k) / (y ** (k + 1));
  ll = log(lh);
  model y ~ general(ll);
run;
/*
Fit Statistics
-2 Log Likelihood               -648.5
AIC (smaller is better)         -644.5
AICC (smaller is better)        -644.4
BIC (smaller is better)         -637.9

Parameter   Estimate   Standard Error   DF   t Value   Pr > |t|
b0            1.0322           0.1385  200      7.45     <.0001
b1            4.9815           0.2518  200     19.78     <.0001
*/

Lastly, it is worth pointing out that the conditional mean of the Type-I Pareto response is not exp(x * beta) but a * k / (k - 1) with k = exp(x * beta). Therefore, the conditional mean only exists when k > 1, which might cause numerical issues in the model estimation.
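
As a quick simulation check of the conditional mean (my own illustration, reusing the rpareto() function from the VGAM package loaded above), the empirical mean of simulated Pareto draws should be close to a * k / (k - 1):

library(VGAM)
set.seed(2017)
a <- 2
k <- 3
mean(rpareto(n = 1e6, scale = a, shape = k))   # empirical mean
a * k / (k - 1)                                # theoretical mean = 3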

Written by statcompute

December 11, 2016 at 5:12 pm

Pregibon Test for Goodness of Link in SAS

When estimating generalized linear models for binary outcomes, we often choose the logit link function by default and seldom consider alternatives such as probit or cloglog. The Pregibon test (Pregibon, 1980) provides a means to check the goodness of link with the simple logic outlined below.

1. First of all, we estimate the regression model with the hypothesized link function, e.g. logit;
2. After the model estimation, we calculate yhat and yhat ^ 2 and then estimate a secondary regression with the identical response variable Y and link function but with yhat and yhat ^ 2 as model predictors (with the intercept);
3. If the link function is correctly specified, then the t-value of yhat ^ 2 should be insignificant; a quick R illustration follows.
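
This R sketch is my own illustration (not part of the SAS macro that follows), using the kyphosis data from the rpart package and the logit link:

library(rpart)
data(kyphosis)

y <- as.numeric(kyphosis$Kyphosis == "present")

# step 1: the model with the hypothesized link, e.g. logit
fit1 <- glm(y ~ Age + Number + Start, data = kyphosis, family = binomial("logit"))

# step 2: secondary regression on yhat and yhat ^ 2 with the same response and link
yhat <- predict(fit1, type = "response")
fit2 <- glm(y ~ yhat + I(yhat^2), data = kyphosis, family = binomial("logit"))

# step 3: an insignificant yhat ^ 2 term suggests the link is adequate
summary(fit2)$coefficients["I(yhat^2)", ]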

The SAS macro shown below is an implementation of the Pregibon test in the context of logistic regressions. However, the same idea can be generalized to any GLM.

%macro pregibon(data = , y = , x = );
***********************************************************;
* SAS MACRO PERFORMING PREGIBON TEST FOR GOODNESS OF LINK *;
* ======================================================= *;
* INPUT PARAMETERS:                                       *;
*  DATA : INPUT SAS DATA TABLE                            *;
*  Y    : THE DEPENDENT VARIABLE WITH 0 / 1 VALUES        *;
*  X    : MODEL PREDICTORS                                *;
* ======================================================= *;
* AUTHOR: WENSUI.LIU@53.COM                               *;
***********************************************************;
options mprint mlogic nocenter;

%let links = logit probit cloglog;
%let loop = 1;

proc sql noprint;
  select count(*) - 3 into :df from &data; /* df = n - 3 for the intercept, p1 and p2 */
quit; 

%do %while (%scan(&links, &loop) ne %str());

  %let link = %scan(&links, &loop);
  
  proc logistic data = &data noprint desc;
    model &y = &x / link = &link;
    score data = &data out = _out1;
  run;
  
  data _out2;
    set _out1(rename = (p_1 = p1));
    p2 = p1 * p1;
  run;
  
  ods listing close;
  ods output ParameterEstimates = _parm;  
  proc logistic data = _out2 desc;
    model &y = p1 p2 /  link = &link ;
  run;
  ods listing;
    
  %if &loop = 1 %then %do;
    data _parm1;
      format link $10.;
      set _parm(where = (variable = "p2"));
      link = upcase("&link");
    run;
  %end;
  %else %do;
    data _parm1;
      set _parm1 _parm(where = (variable = "p2") in = new);
      if new then link = upcase("&link");
    run;
  %end;
  
  data _parm2(drop = variable);
    set _parm1;
    _t = estimate / stderr;
    _df = &df;
    _p = (1 - probt(abs(_t), _df)) * 2;
  run;
  
  %let loop = %eval(&loop + 1);

%end;

title;
proc report data = _last_ spacing = 1 headline nowindows split = "*";
  column(" * PREGIBON TEST FOR GOODNESS OF LINK
           * H0: THE LINK FUNCTION IS SPECIFIED CORRECTLY * "
         link _t _df _p);
  define link / "LINK FUNCTION" width = 15 order order = data;          
  define _t   / "T-VALUE"       width = 15 format = 12.4;
  define _df  / "DF"            width = 10;
  define _p   / "P-VALUE"       width = 15 format = 12.4;
run;

%mend;

After applying the macro to the kyphosis data (https://stat.ethz.ch/R-manual/R-devel/library/rpart/html/kyphosis.html), we can see that both logit and probit can be considered appropriate link functions in this specific case, while cloglog might not be a good choice.

             PREGIBON TEST FOR GOODNESS OF LINK
        H0: THE LINK FUNCTION IS SPECIFIED CORRECTLY

 LINK FUNCTION           T-VALUE         DF         P-VALUE
-----------------------------------------------------------
 LOGIT                   -1.6825         78          0.0965
 PROBIT                  -1.7940         78          0.0767
 CLOGLOG                 -2.3632         78          0.0206

Written by statcompute

December 4, 2016 at 6:44 pm

Scorecard Development with Data from Multiple Sources

This week, one of my friends asked me a very interesting and practical question about scorecard development. The model development data were collected from multiple independent sources with various data sizes, heterogeneous risk profiles, and different bad rates. While the performance statistics seem satisfactory on the model training dataset, the model doesn't generalize well to new accounts that might come from an unknown source. The situation is very common in a consulting company, where a risk or marketing model is sometimes developed with data from multiple organizations.

To better understand the issue, I simulated a dataset consisting of two groups. In the dataset, while x0 and x1 govern the group segmentation, x2 and x3 define the bad definition. It is important to point out that the group information "grp" is only known in the model development sample but is unknown in the production population. Therefore, the variable "grp", albeit predictive, cannot be explicitly used in the model estimation.

data one;
  do i = 1 to 100000;
    x0 = ranuni(0);
    x1 = ranuni(1);
    x2 = ranuni(2);
    x3 = ranuni(3);
    if 1 + x0 * 2 + x1 * 4 + rannor(1) > 5 then do;
      grp = 1;
      if x2 * 2 + x3 * 4 + rannor(2) > 5 then bad = 1;
      else bad = 0;
    end;
    else do;
      grp = 0;
      if x2 * 4 + x3 * 2 + rannor(3) > 4 then bad = 1;
      else bad = 0;
    end;
    output;
  end;
run;

Our first approach is to use all variables x0 - x3 to build a logistic regression and then evaluate the model both on the overall sample and by group.

proc logistic data = one desc noprint;
  model bad = x0 x1 x2 x3;
  score data = one out = mdl1 (rename = (p_1 = score1));
run;

                            GOOD BAD SEPARATION REPORT FOR SCORE1 IN DATA MDL1
                                MAXIMUM KS = 59.5763 AT SCORE POINT 0.2281
               ( AUC STATISTICS = 0.8800, GINI COEFFICIENT = 0.7599, DIVERGENCE = 2.6802 )

          MIN        MAX           GOOD        BAD      TOTAL    BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #    RATE      BAD RATE  PERCENT      PERCENT
 --------------------------------------------------------------------------------------------------------
  BAD     0.6800     0.9699       2,057      7,943     10,000   79.43%      79.43%    33.81%      33.81%
   |      0.4679     0.6799       4,444      5,556     10,000   55.56%      67.50%    23.65%      57.46%
   |      0.3094     0.4679       6,133      3,867     10,000   38.67%      57.89%    16.46%      73.92%
   |      0.1947     0.3094       7,319      2,681     10,000   26.81%      50.12%    11.41%      85.33%
   |      0.1181     0.1946       8,364      1,636     10,000   16.36%      43.37%     6.96%      92.29%
   |      0.0690     0.1181       9,044        956     10,000    9.56%      37.73%     4.07%      96.36%
   |      0.0389     0.0690       9,477        523     10,000    5.23%      33.09%     2.23%      98.59%
   |      0.0201     0.0389       9,752        248     10,000    2.48%      29.26%     1.06%      99.64%
   V      0.0085     0.0201       9,925         75     10,000    0.75%      26.09%     0.32%      99.96%
 GOOD     0.0005     0.0085       9,991          9     10,000    0.09%      23.49%     0.04%     100.00%
       ========== ========== ========== ========== ==========
          0.0005     0.9699      76,506     23,494    100,000

                  GOOD BAD SEPARATION REPORT FOR SCORE1 IN DATA MDL1(WHERE = (GRP = 0))
                                MAXIMUM KS = 61.0327 AT SCORE POINT 0.2457
               ( AUC STATISTICS = 0.8872, GINI COEFFICIENT = 0.7744, DIVERGENCE = 2.8605 )

          MIN        MAX           GOOD        BAD      TOTAL    BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #    RATE      BAD RATE  PERCENT      PERCENT
 --------------------------------------------------------------------------------------------------------
  BAD     0.7086     0.9699       1,051      6,162      7,213   85.43%      85.43%    30.51%      30.51%
   |      0.5019     0.7086       2,452      4,762      7,214   66.01%      75.72%    23.58%      54.10%
   |      0.3407     0.5019       3,710      3,504      7,214   48.57%      66.67%    17.35%      71.45%
   |      0.2195     0.3406       4,696      2,517      7,213   34.90%      58.73%    12.46%      83.91%
   |      0.1347     0.2195       5,650      1,564      7,214   21.68%      51.32%     7.74%      91.66%
   |      0.0792     0.1347       6,295        919      7,214   12.74%      44.89%     4.55%      96.21%
   |      0.0452     0.0792       6,737        476      7,213    6.60%      39.42%     2.36%      98.56%
   |      0.0234     0.0452       7,000        214      7,214    2.97%      34.86%     1.06%      99.62%
   V      0.0099     0.0234       7,150         64      7,214    0.89%      31.09%     0.32%      99.94%
 GOOD     0.0007     0.0099       7,201         12      7,213    0.17%      27.99%     0.06%     100.00%
       ========== ========== ========== ========== ==========
          0.0007     0.9699      51,942     20,194     72,136

                  GOOD BAD SEPARATION REPORT FOR SCORE1 IN DATA MDL1(WHERE = (GRP = 1))
                                MAXIMUM KS = 53.0942 AT SCORE POINT 0.2290
               ( AUC STATISTICS = 0.8486, GINI COEFFICIENT = 0.6973, DIVERGENCE = 2.0251 )

          MIN        MAX           GOOD        BAD      TOTAL    BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #    RATE      BAD RATE  PERCENT      PERCENT
 --------------------------------------------------------------------------------------------------------
  BAD     0.5863     0.9413       1,351      1,435      2,786   51.51%      51.51%    43.48%      43.48%
   |      0.3713     0.5862       2,136        651      2,787   23.36%      37.43%    19.73%      63.21%
   |      0.2299     0.3712       2,340        446      2,786   16.01%      30.29%    13.52%      76.73%
   |      0.1419     0.2298       2,525        262      2,787    9.40%      25.07%     7.94%      84.67%
   |      0.0832     0.1419       2,584        202      2,786    7.25%      21.50%     6.12%      90.79%
   |      0.0480     0.0832       2,643        144      2,787    5.17%      18.78%     4.36%      95.15%
   |      0.0270     0.0480       2,682        104      2,786    3.73%      16.63%     3.15%      98.30%
   |      0.0140     0.0270       2,741         46      2,787    1.65%      14.76%     1.39%      99.70%
   V      0.0058     0.0140       2,776         10      2,786    0.36%      13.16%     0.30%     100.00%
 GOOD     0.0005     0.0058       2,786          0      2,786    0.00%      11.84%     0.00%     100.00%
       ========== ========== ========== ========== ==========
          0.0005     0.9413      24,564      3,300     27,864

As shown in the above output, while the overall model performance looks OK, the model doesn't generalize well to the 2nd group with the smaller size. While the overall KS could be as high as 60, the KS for the 2nd group is merely 53. The reason is that the overall model performance is heavily influenced by the 1st group with the larger size. Therefore, the estimated model is biased toward the risk profile reflected in the 1st group.

To alleviate the bias in the first model, we could first introduce a look-alike model driven by x0 - x1 to profile the group membership and then build two separate risk models with x2 - x3 only, one for each group. As a result, the final predicted probability is the composite of all three sub-models, i.e. P(bad) = P(bad | grp = 0) * P(grp = 0) + P(bad | grp = 1) * P(grp = 1), as shown below. The model evaluation is also provided to compare with the first model.

proc logistic data = one desc noprint;
  where grp = 0;
  model bad = x2 x3;
  score data = one out = mdl20(rename = (p_1 = p_10));
run;

proc logistic data = one desc noprint;
  where grp = 1;
  model bad = x2 x3;
  score data = one out = mdl21(rename = (p_1 = p_11));
run;

proc logistic data = one desc noprint;
  model grp = x0 x1;
  score data = one out = seg;
run;

data mdl2;
  merge seg mdl20 mdl21;
  by i;
  score2 = p_10 * (1 - p_1) + p_11 * p_1;
run;

                            GOOD BAD SEPARATION REPORT FOR SCORE2 IN DATA MDL2
                                MAXIMUM KS = 60.6234 AT SCORE POINT 0.2469
               ( AUC STATISTICS = 0.8858, GINI COEFFICIENT = 0.7715, DIVERGENCE = 2.8434 )

          MIN        MAX           GOOD        BAD      TOTAL    BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #    RATE      BAD RATE  PERCENT      PERCENT
 --------------------------------------------------------------------------------------------------------
  BAD     0.6877     0.9677       2,011      7,989     10,000   79.89%      79.89%    34.00%      34.00%
   |      0.4749     0.6876       4,300      5,700     10,000   57.00%      68.45%    24.26%      58.27%
   |      0.3125     0.4748       6,036      3,964     10,000   39.64%      58.84%    16.87%      75.14%
   |      0.1932     0.3124       7,451      2,549     10,000   25.49%      50.51%    10.85%      85.99%
   |      0.1142     0.1932       8,379      1,621     10,000   16.21%      43.65%     6.90%      92.89%
   |      0.0646     0.1142       9,055        945     10,000    9.45%      37.95%     4.02%      96.91%
   |      0.0345     0.0646       9,533        467     10,000    4.67%      33.19%     1.99%      98.90%
   |      0.0166     0.0345       9,800        200     10,000    2.00%      29.29%     0.85%      99.75%
   V      0.0062     0.0166       9,946         54     10,000    0.54%      26.10%     0.23%      99.98%
 GOOD     0.0001     0.0062       9,995          5     10,000    0.05%      23.49%     0.02%     100.00%
       ========== ========== ========== ========== ==========
          0.0001     0.9677      76,506     23,494    100,000

                  GOOD BAD SEPARATION REPORT FOR SCORE2 IN DATA MDL2(WHERE = (GRP = 0))
                                MAXIMUM KS = 61.1591 AT SCORE POINT 0.2458
               ( AUC STATISTICS = 0.8880, GINI COEFFICIENT = 0.7759, DIVERGENCE = 2.9130 )

          MIN        MAX           GOOD        BAD      TOTAL    BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #    RATE      BAD RATE  PERCENT      PERCENT
 --------------------------------------------------------------------------------------------------------
  BAD     0.7221     0.9677       1,075      6,138      7,213   85.10%      85.10%    30.40%      30.40%
   |      0.5208     0.7221       2,436      4,778      7,214   66.23%      75.66%    23.66%      54.06%
   |      0.3533     0.5208       3,670      3,544      7,214   49.13%      66.82%    17.55%      71.61%
   |      0.2219     0.3532       4,726      2,487      7,213   34.48%      58.73%    12.32%      83.92%
   |      0.1309     0.2219       5,617      1,597      7,214   22.14%      51.41%     7.91%      91.83%
   |      0.0731     0.1309       6,294        920      7,214   12.75%      44.97%     4.56%      96.39%
   |      0.0387     0.0731       6,762        451      7,213    6.25%      39.44%     2.23%      98.62%
   |      0.0189     0.0387       7,009        205      7,214    2.84%      34.86%     1.02%      99.63%
   V      0.0074     0.0189       7,152         62      7,214    0.86%      31.09%     0.31%      99.94%
 GOOD     0.0002     0.0073       7,201         12      7,213    0.17%      27.99%     0.06%     100.00%
       ========== ========== ========== ========== ==========
          0.0002     0.9677      51,942     20,194     72,136

                  GOOD BAD SEPARATION REPORT FOR SCORE2 IN DATA MDL2(WHERE = (GRP = 1))
                                MAXIMUM KS = 57.6788 AT SCORE POINT 0.1979
               ( AUC STATISTICS = 0.8717, GINI COEFFICIENT = 0.7434, DIVERGENCE = 2.4317 )

          MIN        MAX           GOOD        BAD      TOTAL    BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #    RATE      BAD RATE  PERCENT      PERCENT
 --------------------------------------------------------------------------------------------------------
  BAD     0.5559     0.9553       1,343      1,443      2,786   51.79%      51.79%    43.73%      43.73%
   |      0.3528     0.5559       2,001        786      2,787   28.20%      40.00%    23.82%      67.55%
   |      0.2213     0.3528       2,364        422      2,786   15.15%      31.71%    12.79%      80.33%
   |      0.1372     0.2213       2,513        274      2,787    9.83%      26.24%     8.30%      88.64%
   |      0.0840     0.1372       2,588        198      2,786    7.11%      22.42%     6.00%      94.64%
   |      0.0484     0.0840       2,683        104      2,787    3.73%      19.30%     3.15%      97.79%
   |      0.0256     0.0483       2,729         57      2,786    2.05%      16.84%     1.73%      99.52%
   |      0.0118     0.0256       2,776         11      2,787    0.39%      14.78%     0.33%      99.85%
   V      0.0040     0.0118       2,781          5      2,786    0.18%      13.16%     0.15%     100.00%
 GOOD     0.0001     0.0040       2,786          0      2,786    0.00%      11.84%     0.00%     100.00%
       ========== ========== ========== ========== ==========
          0.0001     0.9553      24,564      3,300     27,864

After comparing the KS statistics from the two modeling approaches, we can see that, while the performance of the 2nd approach on the overall sample is only slightly better than that of the 1st approach, the KS on the 2nd group with the smaller size, e.g. grp = 1, increases from 53 to 58, or by 8.6%. While the example only considers two groups, it is straightforward to generalize to cases with more than two groups, as sketched below.
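
Below is a rough R sketch of mine (not part of the original SAS workflow, and assuming the nnet package for the segment model) that extends the same composite-scoring idea to K segments, where the final score is the sum over g of P(bad | grp = g) * P(grp = g):

library(nnet)

set.seed(2017)
n <- 10000
x0 <- runif(n); x1 <- runif(n); x2 <- runif(n); x3 <- runif(n)
grp <- cut(x0 + 2 * x1 + rnorm(n, sd = 0.5), c(-Inf, 1, 2, Inf), labels = c("g0", "g1", "g2"))
bad <- rbinom(n, 1, plogis(-3 + 2 * x2 + 3 * x3 + as.numeric(grp)))
df <- data.frame(bad, grp, x0, x1, x2, x3)

# 1. look-alike model profiling the segment with x0 - x1
seg_fit <- multinom(grp ~ x0 + x1, data = df, trace = FALSE)
p_grp <- predict(seg_fit, newdata = df, type = "probs")   # n x K matrix of P(grp = g)

# 2. one risk model per segment with x2 - x3
risk_fits <- lapply(levels(df$grp), function(g)
  glm(bad ~ x2 + x3, family = binomial, data = df[df$grp == g, ]))
p_bad <- sapply(risk_fits, function(f) predict(f, newdata = df, type = "response"))

# 3. composite score as the probability-weighted mixture of the segment-level scores
score <- rowSums(p_grp * p_bad)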

Written by statcompute

June 18, 2016 at 2:43 pm

Duplicate Breusch-Godfrey Test Logic in SAS Autoreg Procedure

Since it appears that SAS and R might give slightly different B-G test results, I spent a couple of hours duplicating the logic of the B-G test implemented in the SAS AUTOREG procedure. The resulting SAS macro should give my team more flexibility to perform the B-G test in CCAR 2017 model developments in cases where models will not be estimated with the AUTOREG procedure.

B-G Test with Proc Autoreg

data one;
  do i = 1 to 100;
    x1 = uniform(1);
    x2 = uniform(2);
    r  = normal(3) * 0.5;
    y = 10 + 8 * x1 + 6 * x2 + r;
    output;
  end;
run;

proc autoreg data = one;
  model y = x1 x2 / godfrey = 4;
run;
quit;

/*
Godfrey's Serial Correlation Test

Alternative            LM    Pr > LM
AR(1)              0.2976     0.5854
AR(2)              1.5919     0.4512
AR(3)              1.7168     0.6332
AR(4)              1.7839     0.7754
*/

Home-brew SAS Macro

%macro bgtest(data = , r = , x = , order = 4);
options nocenter nonumber nodate mprint mlogic symbolgen
        formchar = "|----|+|---+=|-/\<>*";

proc sql noprint;
select
  mean(&r) format = 12.8 into :m
from
  &data;
quit;

data _1 (drop = _i);
  set &data (keep = &r &x);
  %do i = 1 %to &order;
    _lag&i._&r = lag&i.(&r);
  %end;
  _i + 1;
  _index = _i - &order;
  array _l _lag:;
  do over _l;
    if _l = . then _l = &m;
  end;
run;
 
proc reg data = _last_ noprint;
  model &r =  &x _lag:;
  output out = _2 p = rhat;
run;
 
proc sql noprint;  
create table
  _result as
select
  sum((rhat - &m) ** 2) / sum((&r - &m) ** 2)  as _r2,
  (select count(*) from _2) * calculated _r2   as _chisq,
  1 - probchi(calculated _chisq, &order.)      as _p_chisq,
  &order                                       as _df
from
  _2;
quit;

title;
proc report data = _last_ spacing = 1 headline nowindows split = "*";
  column(" * BREUSCH-GODFREY TEST FOR SERIAL CORRELATION
           * H0: THERE IS NO SERIAL CORRELATION OF ANY ORDER UP TO &order * "
          _chisq _df _p_chisq);
  define _chisq   / "CHI-SQUARE" width = 20 format = 15.10;
  define _df      / "DF"         width = 10;
  define _p_chisq / "P-VALUE"    width = 20 format = 15.10;
run;
%mend bgtest;

proc reg data = one noprint;
  model y = x1 x2;
  output out = two r = r2;
run;
quit;

data _null_;
  do i = 1 to 4;
    call execute('%bgtest(data = two, r = r2, x = x1 x2, order = '||put(i, 2.)||');');
  end;
run;

/*
       BREUSCH-GODFREY TEST FOR SERIAL CORRELATION
 H0: THERE IS NO SERIAL CORRELATION OF ANY ORDER UP TO 1
           CHI-SQUARE         DF              P-VALUE
 -------------------------------------------------------
         0.2976458421          1         0.5853620441

       BREUSCH-GODFREY TEST FOR SERIAL CORRELATION
 H0: THERE IS NO SERIAL CORRELATION OF ANY ORDER UP TO 2
           CHI-SQUARE         DF              P-VALUE
 -------------------------------------------------------
         1.5918785412          2         0.4511572771

       BREUSCH-GODFREY TEST FOR SERIAL CORRELATION
 H0: THERE IS NO SERIAL CORRELATION OF ANY ORDER UP TO 3
           CHI-SQUARE         DF              P-VALUE
 -------------------------------------------------------
         1.7167785901          3         0.6332099963

       BREUSCH-GODFREY TEST FOR SERIAL CORRELATION
 H0: THERE IS NO SERIAL CORRELATION OF ANY ORDER UP TO 4
           CHI-SQUARE         DF              P-VALUE
 -------------------------------------------------------
         1.7839349922          4         0.7754201982
*/
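
For a cross-check on the R side (a sketch of mine, assuming the lmtest package), lmtest::bgtest() runs the same auxiliary regression; its fill argument controls how the presample lagged residuals are padded (zeros by default, versus the residual mean used in the macro above), which is one plausible source of the small SAS/R discrepancies mentioned earlier.

library(lmtest)

set.seed(1)
x1 <- runif(100)
x2 <- runif(100)
y  <- 10 + 8 * x1 + 6 * x2 + rnorm(100) * 0.5
fit <- lm(y ~ x1 + x2)

for (p in 1:4) {
  print(bgtest(fit, order = p, type = "Chisq"))             # presample lags filled with 0
  print(bgtest(fit, order = p, type = "Chisq", fill = NA))  # presample observations dropped
}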

Written by statcompute

May 21, 2016 at 4:16 pm

More Flexible Approaches to Model Frequency

(The post below is motivated by my friend Matt Flynn https://www.linkedin.com/in/matthew-flynn-1b443b11)

In the context of operational loss forecast models, the standard Poisson regression is the most popular way to model frequency measures. Conceptually speaking, there is a restrictive assumption for the standard Poisson regression, namely Equi-Dispersion, which requires the equality between the conditional mean and the variance such that E(Y) = var(Y). However, in real-world frequency outcomes, the assumption of Equi-Dispersion is often problematic. On the contrary, the empirical data often present either an excessive variance, namely Over-Dispersion, or an insufficient variance, namely Under-Dispersion. Applying a standard Poisson regression to over-dispersed data will lead to deflated standard errors of parameter estimates and therefore inflated t-statistics, as illustrated below.
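
As a small simulated illustration of this point (my own sketch, not from the original post), fitting a plain Poisson model to over-dispersed counts understates the standard errors relative to a quasi-Poisson fit on the same data:

set.seed(1)
n <- 5000
x <- runif(n)
y <- rnbinom(n, mu = exp(0.5 + x), size = 1)   # over-dispersed counts

m_pois <- glm(y ~ x, family = poisson)
m_qp   <- glm(y ~ x, family = quasipoisson)

cbind(poisson_se = summary(m_pois)$coefficients[, "Std. Error"],
      quasi_se   = summary(m_qp)$coefficients[, "Std. Error"])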

In cases of Over-Dispersion, the Negative Binomial (NB) regression has been the most common alternative to the standard Poisson regression by including a dispersion parameter to accommodate the excessive variance in the data. In the formulation of NB regression, the variance is expressed as a quadratic function of the conditional mean such that the variance is guaranteed to be higher than the conditional mean. However, it is not flexible enough to allow for both Over-Dispersion and Under-Dispersion. Therefore, more generalizable approaches are called for.

Two additional frequency modeling methods, the Quasi-Poisson (QP) regression and the Conway-Maxwell Poisson (CMP) regression, are discussed below. In the case of Quasi-Poisson, E(Y) = λ and var(Y) = θ • λ. While θ > 1 addresses Over-Dispersion, θ < 1 governs Under-Dispersion. Since the QP regression is estimated with QMLE, likelihood-based statistics, such as AIC and BIC, won't be available. Instead, quasi-AIC and quasi-BIC are provided. In the case of Conway-Maxwell Poisson, approximately E(Y) = λ ** (1 / v) - (v - 1) / (2 • v) and var(Y) = (1 / v) • λ ** (1 / v), where λ doesn't represent the conditional mean anymore but a location parameter. While v < 1 enables us to model the long-tailed distribution reflected as Over-Dispersion, v > 1 takes care of the short-tailed distribution reflected as Under-Dispersion. Since the CMP regression is estimated with MLE, likelihood-based statistics, such as AIC and BIC, are available, albeit at a higher computing cost.
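
The R sketch below (my own illustration, not from the original post) computes the exact CMP mean and variance from a truncated series and compares them with the approximations quoted above, for an arbitrary choice of λ and v:

cmp_moments <- function(lambda, nu, terms = 200) {
  # exact moments from a truncated CMP series
  y <- 0:terms
  w <- exp(y * log(lambda) - nu * lgamma(y + 1))
  p <- w / sum(w)
  m <- sum(y * p)
  c(mean = m, var = sum(y ^ 2 * p) - m ^ 2)
}

lambda <- 4
nu <- 1.5
cmp_moments(lambda, nu)
c(mean_approx = lambda ^ (1 / nu) - (nu - 1) / (2 * nu),
  var_approx  = (1 / nu) * lambda ^ (1 / nu))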

Below demonstrates how to estimate QP and CMP regressions with R and compares their computing times. If the modeling purpose is mainly prediction without a focus on statistical inference, the QP regression would be an excellent choice for most practitioners. Otherwise, the CMP regression is an elegant model to address various levels of dispersion parsimoniously.

# data source: www.jstatsoft.org/article/view/v027i08
load("../Downloads/DebTrivedi.rda")

library(rbenchmark)
library(CompGLM)

benchmark(replications = 3, order = "user.self",
  quasi.poisson = {
    m1 <- glm(ofp ~ health + hosp + numchron + privins + school + gender + medicaid, data = DebTrivedi, family = "quasipoisson")
  },
  conway.maxwell = {
    m2 <- glm.comp(ofp ~ health + hosp + numchron + privins + school + gender + medicaid, data = DebTrivedi, lamStart = m1$coefficients)
  }
)
#             test replications elapsed relative user.self sys.self user.child
# 1  quasi.poisson            3   0.084    1.000     0.084    0.000          0
# 2 conway.maxwell            3  42.466  505.548    42.316    0.048          0

summary(m1)
summary(m2) 

Quasi-Poisson Regression

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)      0.886462   0.069644  12.729  < 2e-16 ***
healthpoor       0.235673   0.046284   5.092 3.69e-07 ***
healthexcellent -0.360188   0.078441  -4.592 4.52e-06 ***
hosp             0.163246   0.015594  10.468  < 2e-16 ***
numchron         0.144652   0.011894  12.162  < 2e-16 ***
privinsyes       0.304691   0.049879   6.109 1.09e-09 ***
school           0.028953   0.004812   6.016 1.93e-09 ***
gendermale      -0.092460   0.033830  -2.733   0.0063 **
medicaidyes      0.297689   0.063787   4.667 3.15e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 6.697556)

    Null deviance: 26943  on 4405  degrees of freedom
Residual deviance: 23027  on 4397  degrees of freedom
AIC: NA

Conway-Maxwell Poisson Regression

Beta:
                   Estimate   Std.Error  t.value p.value
(Intercept)     -0.23385559  0.16398319  -1.4261 0.15391
healthpoor       0.03226830  0.01325437   2.4345 0.01495 *
healthexcellent -0.08361733  0.00687228 -12.1673 < 2e-16 ***
hosp             0.01743416  0.01500555   1.1618 0.24536
numchron         0.02186788  0.00209274  10.4494 < 2e-16 ***
privinsyes       0.05193645  0.00184446  28.1581 < 2e-16 ***
school           0.00490214  0.00805940   0.6083 0.54305
gendermale      -0.01485663  0.00076861 -19.3292 < 2e-16 ***
medicaidyes      0.04861617  0.00535814   9.0733 < 2e-16 ***

Zeta:
              Estimate  Std.Error t.value   p.value
(Intercept) -3.4642316  0.0093853 -369.11 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

AIC: 24467.13
Log-Likelihood: -12223.56

Written by statcompute

May 13, 2016 at 12:27 am