## Duplicate Breusch-Godfrey Test Logic in SAS Autoreg Procedure

Since it appears that SAS and R might give slightly different B-G test results, I spent a couple hours on duplicating the logic of B-G test implemented in SAS Autoreg Procedure. The written SAS macro should give my team more flexibility to perform B-G test in CCAR 2017 model developments in cases that models will not be estimated with Autoreg Procedure.

**B-G Test with Proc Autoreg**

data one; do i = 1 to 100; x1 = uniform(1); x2 = uniform(2); r = normal(3) * 0.5; y = 10 + 8 * x1 + 6 * x2 + r; output; end; run; proc autoreg data = one; model y = x1 x2 / godfrey = 4; run; quit; /* Godfrey's Serial Correlation Test Alternative LM Pr > LM AR(1) 0.2976 0.5854 AR(2) 1.5919 0.4512 AR(3) 1.7168 0.6332 AR(4) 1.7839 0.7754 */

**Home-brew SAS Macro**

%macro bgtest(data = , r = , x = , order = 4); options nocenter nonumber nodate mprint mlogic symbolgen formchar = "|----|+|---+=|-/\<>*"; proc sql noprint; select mean(&r) format = 12.8 into :m from &data; quit; data _1 (drop = _i); set &data (keep = &r &x); %do i = 1 %to ℴ _lag&i._&r = lag&i.(&r); %end; _i + 1; _index = _i - ℴ array _l _lag:; do over _l; if _l = . then _l = &m; end; run; proc reg data = _last_ noprint; model &r = &x _lag:; output out = _2 p = rhat; run; proc sql noprint; create table _result as select sum((rhat - &m) ** 2) / sum((&r - &m) ** 2) as _r2, (select count(*) from _2) * calculated _r2 as _chisq, 1 - probchi(calculated _chisq, &order.) as _p_chisq, &order as _df from _2; quit; title; proc report data = _last_ spacing = 1 headline nowindows split = "*"; column(" * BREUSCH-GODFREY TEST FOR SERIAL CORRELATION * H0: THERE IS NO SERIAL CORRELATION OF ANY ORDER UP TO &order * " _chisq _df _p_chisq); define _chisq / "CHI-SQUARE" width = 20 format = 15.10; define _df / "DF" width = 10; define _p_chisq / "P-VALUE" width = 20 format = 15.10; run; %mend bgtest; proc reg data = one noprint; model y = x1 x2; output out = two r = r2; run; quit; data _null_; do i = 1 to 4; call execute('%bgtest(data = two, r = r2, x = x1 x2, order = '||put(i, 2.)||');'); end; run; /* BREUSCH-GODFREY TEST FOR SERIAL CORRELATION H0: THERE IS NO SERIAL CORRELATION OF ANY ORDER UP TO 1 CHI-SQUARE DF P-VALUE ------------------------------------------------------- 0.2976458421 1 0.5853620441 BREUSCH-GODFREY TEST FOR SERIAL CORRELATION H0: THERE IS NO SERIAL CORRELATION OF ANY ORDER UP TO 2 CHI-SQUARE DF P-VALUE ------------------------------------------------------- 1.5918785412 2 0.4511572771 BREUSCH-GODFREY TEST FOR SERIAL CORRELATION H0: THERE IS NO SERIAL CORRELATION OF ANY ORDER UP TO 3 CHI-SQUARE DF P-VALUE ------------------------------------------------------- 1.7167785901 3 0.6332099963 BREUSCH-GODFREY TEST FOR SERIAL CORRELATION H0: THERE IS NO SERIAL CORRELATION OF ANY ORDER UP TO 4 CHI-SQUARE DF P-VALUE ------------------------------------------------------- 1.7839349922 4 0.7754201982 */

## More Flexible Approaches to Model Frequency

(The post below is motivated by my friend Matt Flynn https://www.linkedin.com/in/matthew-flynn-1b443b11)

In the context of operational loss forecast models, the standard Poisson regression is the most popular way to model frequency measures. Conceptually speaking, there is a restrictive assumption for the standard Poisson regression, namely Equi-Dispersion, which requires the equality between the conditional mean and the variance such that E(Y) = var(Y). However, in real-world frequency outcomes, the assumption of Equi-Dispersion is always problematic. On the contrary, the empirical data often presents either an excessive variance, namely Over-Dispersion, or an insufficient variance, namely Under-Dispersion. The application of a standard Poisson regression to the over-dispersed data will lead to deflated standard errors of parameter estimates and therefore inflated t-statistics.

In cases of Over-Dispersion, the Negative Binomial (NB) regression has been the most common alternative to the standard Poisson regression by including a dispersion parameter to accommodate the excessive variance in the data. In the formulation of NB regression, the variance is expressed as a quadratic function of the conditional mean such that the variance is guaranteed to be higher than the conditional mean. However, it is not flexible enough to allow for both Over-Dispersion and Under-Dispersion. Therefore, more generalizable approaches are called for.

Two additional frequency modeling methods, including Quasi-Poisson (QP) regression and Conway-Maxwell Poisson (CMP) regression, are discussed. In the case of Quasi-Poisson, E(Y) = λ and var(Y) = θ • λ. While θ > 1 addresses Over-Dispersion, θ < 1 governs Under-Dispersion. Since QP regression is estimated with QMLE, likelihood-based statistics, such as AIC and BIC, won’t be available. Instead, quasi-AIC and quasi-BIC are provided. In the case of Conway-Maxwell Poisson, E(Y) = λ ** (1 / v) – (v – 1) / (2 • v) and var(Y) = (1 / v) • λ ** (1 / v), where λ doesn’t represent the conditional mean anymore but a location parameter. While v < 1 enables us to model the long-tailed distribution reflected as Over-Dispersion, v > 1 takes care of the short-tailed distribution reflected as Under-Dispersion. Since CMP regression is estimated with MLE, likelihood-based statistics, such as AIC and BIC, are available at a high computing cost.

Below demonstrates how to estimate QP and CMP regressions with R and a comparison of their computing times. If the modeling purpose is mainly for the prediction without focusing on the statistical reference, QP regression would be an excellent choice for most practitioners. Otherwise, CMP regression is an elegant model to address various levels of dispersion parsimoniously.

# data source: www.jstatsoft.org/article/view/v027i08 load("../Downloads/DebTrivedi.rda") library(rbenchmark) library(CompGLM) benchmark(replications = 3, order = "user.self", quasi.poisson = { m1 <- glm(ofp ~ health + hosp + numchron + privins + school + gender + medicaid, data = DebTrivedi, family = "quasipoisson") }, conway.maxwell = { m2 <- glm.comp(ofp ~ health + hosp + numchron + privins + school + gender + medicaid, data = DebTrivedi, lamStart = m1$coefficient s) } ) # test replications elapsed relative user.self sys.self user.child # 1 quasi.poisson 3 0.084 1.000 0.084 0.000 0 # 2 conway.maxwell 3 42.466 505.548 42.316 0.048 0 summary(m1) summary(m2)

**Quasi-Poisson Regression**

Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.886462 0.069644 12.729 < 2e-16 *** healthpoor 0.235673 0.046284 5.092 3.69e-07 *** healthexcellent -0.360188 0.078441 -4.592 4.52e-06 *** hosp 0.163246 0.015594 10.468 < 2e-16 *** numchron 0.144652 0.011894 12.162 < 2e-16 *** privinsyes 0.304691 0.049879 6.109 1.09e-09 *** school 0.028953 0.004812 6.016 1.93e-09 *** gendermale -0.092460 0.033830 -2.733 0.0063 ** medicaidyes 0.297689 0.063787 4.667 3.15e-06 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for quasipoisson family taken to be 6.697556) Null deviance: 26943 on 4405 degrees of freedom Residual deviance: 23027 on 4397 degrees of freedom AIC: NA

**Conway-Maxwell Poisson Regression**

Beta: Estimate Std.Error t.value p.value (Intercept) -0.23385559 0.16398319 -1.4261 0.15391 healthpoor 0.03226830 0.01325437 2.4345 0.01495 * healthexcellent -0.08361733 0.00687228 -12.1673 < 2e-16 *** hosp 0.01743416 0.01500555 1.1618 0.24536 numchron 0.02186788 0.00209274 10.4494 < 2e-16 *** privinsyes 0.05193645 0.00184446 28.1581 < 2e-16 *** school 0.00490214 0.00805940 0.6083 0.54305 gendermale -0.01485663 0.00076861 -19.3292 < 2e-16 *** medicaidyes 0.04861617 0.00535814 9.0733 < 2e-16 *** Zeta: Estimate Std.Error t.value p.value (Intercept) -3.4642316 0.0093853 -369.11 < 2.2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 AIC: 24467.13 Log-Likelihood: -12223.56

## Calculating ACF with Data Step Only

In SAS/ETS, it is trivial to calculate ACF of a time series with ARIMA procedure. However, the downside is that, in addition to ACF, you will get more outputs than necessary without knowing the underlying mechanism. The SAS macro below is a clean routine written with simple data steps showing each step how to calculate ACF and generating nothing but a table with ACF and the related lag without using SAS/ETS module at all. It is easy to write a wrapper around this macro for any further analysis.

%macro acf(data = , var = , out = acf); ***********************************************************; * SAS MACRO CALCULATING AUTOCORRELATION FUNCTION WITH *; * DATA STEP ONLY *; * ======================================================= *; * INPUT PAREMETERS: *; * DATA : INPUT SAS DATA TABLE *; * VAR : THE TIME SERIES TO TEST FOR INDEPENDENCE *; * ======================================================= *; * OUTPUT: *; * OUT : A OUTPUT SAS DATA TABLE WITH ACF AND LAG *; * ======================================================= *; * AUTHOR: WENSUI.LIU@53.COM *; ***********************************************************; %local nobs; data _1 (keep = &var); set &data end = eof; if eof then do; call execute('%let nobs = '||put(_n_, 8.)||';'); end; run; proc sql noprint; select mean(&var) into :mean_x from _last_; quit; %do i = 1 %to %eval(&nobs - 1); data _2(keep = _:); set _1; _x = &var; _lag = lag&i.(_x); run; proc sql ; create table _3 as select (_x - &mean_x) ** 2 as _den, (_x - &mean_x) * (_lag - &mean_x) as _num from _last_; create table _4 as select &i as lag, sum(_num) / sum(_den) as acf from _last_; %if &i = 1 %then %do; create table &out as select * from _4; %end; %else %do; insert into &out select * from _4; %end; drop table _2, _3, _4; quit; %end; %mend acf;

## Estimate Quasi-Binomial Model with GENMOD Procedure in SAS

In my previous post (https://statcompute.wordpress.com/2015/11/01/quasi-binomial-model-in-sas/), I’ve shown why there is an interest in estimating Quasi-Binomial models for financial practitioners and how to estimate with GLIMMIX procedure in SAS.

In the demonstration below, I will show an example how to estimate a Quasi-Binomial model with GENMOD procedure by specifying **VARIANCE** and **DEVIANCE**. While the CPU time for model estimation is a lot faster with GENMOD than with GLIMMIX, additional steps are necessary to ensure the correct statistical inference.

ods listing close; ods output modelfit = fit; ods output parameterestimates = parm1; proc genmod data = kyphosis; model y = age number start / link = logit noscale; variance v = _mean_ * (1 - _mean_); deviance d = (-2) * log((_mean_ ** _resp_) * ((1 - _mean_) ** (1 - _resp_))); run; ods listing; proc sql noprint; select distinct valuedf format = 12.8, df format = 8.0 into :disp, :df from fit where index(criterion, "Pearson") > 0; select value format = 12.4 into :ll from fit where criterion = "Log Likelihood"; select sum(df) into :k from parm1; quit; %let aic = %sysevalf((-&ll + &k) * 2); %let bic = %sysevalf(-&ll * 2 + &k * %sysfunc(log(&df + &k))); data parm2 (keep = parameter estimate stderr2 df t_value p_value); set parm1; where df > 0; stderr2 = stderr * (&scale ** 0.5); df = &df; t_value = estimate / stderr2; p_value = (1 - probt(abs(t_value), &df)) * 2; run; title; proc report data = parm2 spacing = 1 headline nowindows split = "*"; column(" Parameter Estimate of Quasi-Binomial Model " parameter estimate stderr2 t_value df p_value); compute before _page_; line @5 "Fit Statistics"; line " "; line @3 "Quasi Log Likelihood: %sysfunc(round(&ll, 0.01))"; line @3 "Quasi-AIC (smaller is better): %sysfunc(round(&aic, 0.01))"; line @3 "Quasi-BIC (smaller is better): %sysfunc(round(&bic, 0.01))"; line @3 "(Dispersion Parameter for Quasi-Binomial is %sysfunc(round(&disp, 0.0001)))"; line " "; endcomp; define parameter / "Parmeter" width = 10 order order = data; define estimate / "Estimate" width = 10 format = 10.4; define stderr2 / "Std Err" width = 10 format = 10.4; define t_value / "T Value" width = 10 format = 10.2; define df / "DF" width = 5 format = 4.0; define p_value / "Pr > |t|" width = 10 format = 10.4; run; quit; /* Fit Statistics Quasi Log Likelihood: -30.69 Quasi-AIC (smaller is better): 69.38 Quasi-BIC (smaller is better): 78.96 (Dispersion Parameter for Quasi-Binomial is 0.9132) Parameter Estimate of Quasi-Binomial Model Parmeter Estimate Std Err T Value DF Pr > |t| ------------------------------------------------------------ Intercept -2.0369 1.3853 -1.47 77 0.1455 Age 0.0109 0.0062 1.77 77 0.0800 Number 0.4106 0.2149 1.91 77 0.0598 Start -0.2065 0.0647 -3.19 77 0.0020 */

## A More Flexible Ljung-Box Test in SAS

Ljung-Box test is an important diagnostic to check if residuals from the time series model are independently distributed. In SAS / ETS module, it is easy to perform Ljung-Box with ARIMA procedure. However, test outputs are only provided for Lag 6, 12, 18, and so on, which cannot be changed by any option.

data one; do i = 1 to 100; x = uniform(1); output; end; run; proc arima data = one; identify var = x whitenoise = ignoremiss; run; quit; /* Autocorrelation Check for White Noise To Chi- Pr > Lag Square DF ChiSq --------------------Autocorrelations-------------------- 6 5.49 6 0.4832 0.051 -0.132 0.076 -0.024 -0.146 0.064 12 6.78 12 0.8719 0.050 0.076 -0.046 -0.025 -0.016 -0.018 18 10.43 18 0.9169 0.104 -0.053 0.063 0.038 -0.085 -0.065 24 21.51 24 0.6083 0.007 0.178 0.113 -0.046 0.180 0.079 */

The SAS macro below is a more flexible way to perform Ljung-Box test for any number of lags. As shown in the output, test results for Lag 6 and 12 are identical to the one directly from ARIMA procedure.

%macro LBtest(data = , var = , lags = 4); ***********************************************************; * SAS MACRO PERFORMING LJUNG-BOX TEST FOR INDEPENDENCE *; * ======================================================= *; * INPUT PAREMETERS: *; * DATA : INPUT SAS DATA TABLE *; * VAR : THE TIME SERIES TO TEST FOR INDEPENDENCE *; * LAGS : THE NUMBER OF LAGS BEING TESTED *; * ======================================================= *; * AUTHOR: WENSUI.LIU@53.COM *; ***********************************************************; %local nlag; data _1 (keep = &var); set &data end = eof; if eof then do; call execute('%let nlag = '||put(_n_ - 1, 8.)||';'); end; run; proc arima data = _last_; identify var = &var nlag = &nlag outcov = _2 noprint; run; quit; %do i = 1 %to &lags; data _3; set _2; where lag > 0 and lag <= &i; run; proc sql noprint; create table _4 as select sum(corr * corr / n) * (&nlag + 1) * (&nlag + 3) as _chisq, 1 - probchi(calculated _chisq, &i.) as _p_chisq, &i as _df from _last_; quit; %if &i = 1 %then %do; data _5; set _4; run; %end; %else %do; data _5; set _5 _4; run; %end; %end; title; proc report data = _5 spacing = 1 headline nowindows split = "*"; column(" * LJUNG-BOX TEST FOR WHITE NOISE * * H0: RESIDUALS ARE INDEPENDENTLY DISTRIBUTED UPTO LAG &lags * " _chisq _df _p_chisq); define _chisq / "CHI-SQUARE" width = 20 format = 15.10; define _df / "DF" width = 10 order; define _p_chisq / "P-VALUE" width = 20 format = 15.10; run; %mend LBtest; %LBtest(data = one, var = x, lags = 12); /* LJUNG-BOX TEST FOR WHITE NOISE H0: RESIDUALS ARE INDEPENDENTLY DISTRIBUTED UPTO LAG 12 CHI-SQUARE DF P-VALUE ------------------------------------------------------ 0.2644425904 1 0.6070843322 2.0812769288 2 0.3532290858 2.6839655476 3 0.4429590625 2.7428168168 4 0.6017432831 5.0425834917 5 0.4107053939 5.4851972398 6 0.4832476224 5.7586229652 7 0.5681994829 6.4067856029 8 0.6017645131 6.6410385135 9 0.6744356312 6.7142471241 10 0.7521182318 6.7427585395 11 0.8195164211 6.7783018413 12 0.8719097622 */

## SAS Macro Performing Breusch–Godfrey Test for Serial Correlation

%macro bgtest(data = , r = , x = , order = 1); ********************************************************************; * SAS MACRO PERFORMING BREUSCH-GODFREY TEST FOR SERIAL CORRELATION *; * BY FOLLOWING THE LOGIC OF BGTEST() IN R LMTEST PACKAGE *; * ================================================================ *; * INPUT PAREMETERS: *; * DATA : INPUT SAS DATA TABLE *; * R : RESIDUALS TO TEST SERIAL CORRELATION *; * X : INDEPENDENT VARIABLES IN THE ORIGINAL REGRESSION MODEL *; * ORDER : THE ORDER OF SERIAL CORRELATION *; * ================================================================ *; * AUTHOR: WENSUI.LIU@53.COM *; ********************************************************************; data _1 (drop = _i); set &data (keep = &r &x); %do i = 1 %to ℴ _lag&i._&r = lag&i.(&r); %end; _i + 1; _index = _i - ℴ if _index > 0 then output; run; ods listing close; proc reg data = _last_; model &r = &x _lag:; output out = _2 p = yhat; run; ods listing; proc sql noprint; create table _result as select (select count(*) from _2) * sum(yhat ** 2) / sum(&r ** 2) as _chisq, 1 - probchi(calculated _chisq, &order.) as _p_chisq, &order as _df from _2; quit; title; proc report data = _last_ spacing = 1 headline nowindows split = "*"; column(" * BREUSCH-GODFREY TEST FOR SERIAL CORRELATION * H0: THERE IS NO SERIAL CORRELATION OF ANY ORDER UP TO &order * " _chisq _df _p_chisq); define _chisq / "CHI-SQUARE" width = 20 format = 15.10; define _df / "DF" width = 10; define _p_chisq / "P-VALUE" width = 20 format = 15.10; run; %mend bgtest;

## Python Prototype of Grid Search for SVM Parameters

from itertools import product from pandas import read_table, DataFrame from sklearn.cross_validation import KFold as kfold from sklearn.svm import SVC as svc from sklearn.metrics import roc_auc_score as auc df = read_table('credit_count.txt', sep = ',') Y = df[df.CARDHLDR == 1].DEFAULT X = df[df.CARDHLDR == 1][['AGE', 'ADEPCNT', 'MAJORDRG', 'MINORDRG', 'INCOME', 'OWNRENT', 'SELFEMPL']] c = [1, 10] g = [0.01, 0.001] parms = [i for i in product(c, g)] kf = [i for i in kfold(Y.count(), n_folds = 3, shuffle = True, random_state = 0)] final = DataFrame() for i in parms: result = DataFrame() mdl = svc(C = i[0], gamma = i[1], probability = True, random_state = 0) for j in kf: X1 = X.iloc[j[0]] Y1 = Y.iloc[j[0]] X2 = X.iloc[j[1]] Y2 = Y.iloc[j[1]] mdl.fit(X1, Y1) pred = mdl.predict_proba(X2)[:, 1] out = DataFrame({'pred': pred, 'y': Y2}) result = result.append(out) perf = DataFrame({'Cost': i[0], 'Gamma': i[1], 'AUC': [auc(result.y, result.pred)]}) final = final.append(perf)