## SAS Macro for Engle-Granger Co-integration Test

In the coursework of time series analysis, we’ve been taught that a time series regression of Y on X could be valid only when both X and Y are stationary due to the so-call “spurious regression problem”. However, one exception holds that if X and Y, albeit non-stationary, share a common trend such that their trends can be cancelled each other out, then X and Y are co-integrated and the regression of Y on X is valid. As a result, it is important to test the co-integration between X and Y.

Following the definition of co-integration, it is straightforward to formulate a procedure of the co-integration test. First of all, construct a linear combination between Y and X such that e = Y – (a + b * X). Secondly, test if e is stationary with ADF test. If e is stationary, then X and Y are co-integrated. This two-stage procedure is also called Engle-Granger co-integration test.

Below is a SAS macro implementing Engle-Granger co-integration test to show the long-term relationship between GDP and other macro-economic variables, e.g. Personal Consumption and Personal Disposable Income.

**SAS Macro**

%macro eg_coint(data = , y = , xs = ); *********************************************************************; * THIS SAS MACRO IMPLEMENTATION ENGLE-GRANGER COINTEGRATION TEST IN *; * A BATCH MODE TO PROCESS MANY TIME SERIES *; *********************************************************************; * INPUT PARAMETERS: *; * DATA: A INPUT SAS DATASET *; * Y : A DEPENDENT VARIABLE IN THE COINTEGRATION REGRESSION *; * X : A LIST OF INDEPENDENT VARIABLE IN THE COINTEGRATION *; * REGRESSION *; *********************************************************************; * AUTHOR: WENSUI.LIU@53.COM *; *********************************************************************; options nocenter nonumber nodate mprint mlogic symbolgen orientation = landscape ls = 150 formchar = "|----|+|---+=|-/\<>*"; %local sig loop; %let sig = 0.1; %let loop = 1; %do %while (%scan(&xs, &loop) ne %str()); %let x = %scan(&xs, &loop); ods listing close; ods output FitStatistics = _fit; proc reg data = &data; model &y = &x; output out = _1 residual = r; run; quit; proc sql noprint; select cvalue2 into :r2 from _fit where upcase(label2) = "R-SQUARE"; quit; proc arima data = _1; ods output stationaritytests = _adf1 (where = (upcase(type) = "ZERO MEAN" and lags = 1) drop = rho probrho fvalue probf); identify var = r stationarity = (adf = 1); run; quit; ods listing; %if &loop = 1 %then %do; data _adf; format vars $32. lterm_r2 best12. flg_coint $3.; set _adf1 (drop = type lags); vars = upcase("&x"); lterm_r2 = &r2; if probtau < &sig then flg_coint = "YES"; else flg_coint = "NO"; run; %end; %else %do; data _adf; set _adf _adf1 (in = new drop = type lags); if new then do; vars = upcase("&x"); lterm_r2 = &r2; if probtau < &sig then flg_coint = "YES"; else flg_coint = "NO"; end; run; %end; %let loop = %eval(&loop + 1); %end; proc sort data = _last_; by descending flg_coint probtau; run; proc report data = _last_ box spacing = 1 split = "/" nowd; COLUMN("ENGLE-GRANGER COINTEGRATION TEST BETWEEN %UPCASE(&Y) AND EACH VARIABLE BELOW/ " vars lterm_r2 flg_coint tau probtau); define vars / "VARIABLES" width = 35; define lterm_r2 / "LONG-RUN/R-SQUARED" width = 15 format = 9.4 center; define flg_coint / "COINTEGRATION/FLAG" width = 15 center; define tau / "TAU STATISTIC/FOR ADF TEST" width = 20 format = 15.4; define probtau / "P-VALUE FOR/ADF TEST" width = 15 format = 9.4 center; run; %mend eg_coint; %eg_coint(data = sashelp.citiqtr, y = gdp, xs = gyd gc);

**SAS Output**

---------------------------------------------------------------------------------------------------------- | ENGLE-GRANGER COINTEGRATION TEST BETWEEN GDP AND EACH VARIABLE BELOW | | | | LONG-RUN COINTEGRATION TAU STATISTIC P-VALUE FOR | |VARIABLES R-SQUARED FLAG FOR ADF TEST ADF TEST | |--------------------------------------------------------------------------------------------------------| |GC | 0.9985 | YES | -2.8651| 0.0051 | |-----------------------------------+---------------+---------------+--------------------+---------------| |GYD | 0.9976 | YES | -1.7793| 0.0715 | ----------------------------------------------------------------------------------------------------------

From the output, it is interesting to see that GDP in U.S. is driven more by Personal Consumption than by Personal Disposable Income.

## SAS Macro to Test Stationarity in Batch

To determine if a time series is stationary or has the unit root, three methods can be used:

A. The most intuitive way, which is also sufficient in most cases, is to eyeball the ACF (Autocorrelation Function) plot of the time series. The ACF pattern with a fast decay might imply a stationary series.

B. Statistical tests for Unit Roots, e.g. ADF (Augmented Dickey–Fuller) or PP (Phillips–Perron) test, could be employed as well. With the Null Hypothesis of Unit Root, a statistically significant outcome might suggest a stationary series.

C. In addition to the aforementioned tests for Unit Roots, statistical tests for stationarity, e.g. KPSS (Kwiatkowski–Phillips–Schmidt–Shin) test, might be an useful complement as well. With the Null Hypothesis of stationarity, a statistically insignificant outcome might suggest a stationary series.

By testing both the unit root and stationarity, the analyst should be able to have a better understanding about the data nature of a specific time series.

The SAS macro below is a convenient wrapper of stationarity tests for many time series in the production environment. **(Please note that this macro only works for SAS 9.2 or above.)**

%macro stationary(data = , vars =); ***********************************************************; * THIS SAS MACRO IS TO DO STATIONARITY TESTS FOR MANY *; * TIME SERIES *; * ------------------------------------------------------- *; * INPUT PARAMETERS: *; * DATA: A INPUT SAS DATASET *; * VARS: A LIST OF TIME SERIES *; * ------------------------------------------------------- *; * AUTHOR: WENSUI.LIU@53.COM *; ***********************************************************; options nocenter nonumber nodate mprint mlogic symbolgen orientation = landscape ls = 150 formchar = "|----|+|---+=|-/\<>*"; %local sig loop; %let sig = 0.1; %let loop = 1; %do %while (%scan(&vars, &loop) ne %str()); %let x = %scan(&vars, &loop); proc sql noprint; select int(12 * ((count(&x) / 100) ** 0.25)) into :nlag1 from &data; select int(max(1, (count(&x) ** 0.5) / 5)) into :nlag2 from &data; quit; ods listing close; ods output kpss = _kpss (drop = model lags rename = (prob = probeta)) adf = _adf (drop = model lags rho probrho fstat probf rename = (tau = adf_tau probtau = adf_probtau)) philperron = _pp (drop = model lags rho probrho rename = (tau = pp_tau probtau = pp_probtau)); proc autoreg data = &data; model &x = / noint stationarity = (adf = &nlag1, phillips = &nlag2, kpss = (kernel = nw lag = &nlag1)); run; quit; ods listing; proc sql noprint; create table _1 as select upcase("&x") as vars length = 32, upcase(_adf.type) as type, _adf.adf_tau, _adf.adf_probtau, _pp.pp_tau, _pp.pp_probtau, _kpss.eta, _kpss.probeta, case when _adf.adf_probtau < &sig or _pp.pp_probtau < &sig or _kpss.probeta > &sig then "*" else " " end as _flg, &loop as _i, monotonic() as _j from _adf inner join _pp on _adf.type = _pp.type inner join _kpss on _adf.type = _kpss.type; quit; %if &loop = 1 %then %do; data _result; set _1; run; %end; %else %do; proc append base = _result data = _1; run; %end; proc datasets library = work nolist; delete _1 _adf _pp _kpss / memtype = data; quit; %let loop = %eval(&loop + 1); %end; proc sort data = _result; by _i _j; run; proc report data = _result box spacing = 1 split = "/" nowd; column("STATISTICAL TESTS FOR STATIONARITY/ " vars type adf_tau adf_probtau pp_tau pp_probtau eta probeta _flg); define vars / "VARIABLES/ " width = 20 group order order = data; define type / "TYPE/ " width = 15 order order = data; define adf_tau / "ADF TEST/FOR/UNIT ROOT" width = 10 format = 8.2; define adf_probtau / "P-VALUE/FOR/ADF TEST" width = 10 format = 8.4 center; define pp_tau / "PP TEST/FOR/UNIT ROOT" width = 10 format = 8.2; define pp_probtau / "P-VALUE/FOR/PP TEST" width = 10 format = 8.4 center; define eta / "KPSS TEST/FOR/STATIONARY" width = 10 format = 8.2; define probeta / "P-VALUE/FOR/KPSS TEST" width = 10 format = 8.4 center; define _flg / "STATIONARY/FLAG" width = 10 center; run; %mend stationary;

## Are These Losses from The Same Distribution?

In Advanced Measurement Approaches (AMA) for Operational Risk models, the bank needs to segment operational losses into homogeneous segments known as “Unit of Measures (UoM)”, which are often defined by the combination of lines of business (LOB) and Basel II event types. However, how do we support whether the losses in one UoM are statistically different from the ones in another UoM? The answer is to test if the losses from various UoMs are distributionally different or equivalent.

Empirically, Kolmogorov-Smirnov (K-S) test is often used to test if two samples are from the same distribution. In the example below, although x and y share the same mean, K-S test still shows a significant result due to different variances.

n <- 300 set.seed(2015) x <- rnorm(n, 0, 1) y <- rnorm(n, 0, 1.5) ### 2-SAMPLE DISTRIBUTIONAL COMPARISON ### ks.test(x, y, alternative = "two.sided") # Two-sample Kolmogorov-Smirnov test # # data: x and y # D = 0.1567, p-value = 0.001268 # alternative hypothesis: two-sided

However, K-S test cannot be generalized to K-sample cases, where K > 2. In such scenario, the univariate coverage test or the more general multivariate MRPP test might be more appropriate. The Blossom package developed by Talbert and Cade (https://www.fort.usgs.gov/products/23735) provides convenient functions implementing both tests, as shown below.

z <- rnorm(n, 0, 2) df <- data.frame(x = c(x, y, z), g = unlist(lapply(c(1, 2, 3), rep, n))) ### K-SAMPLE DISTRIBUTIONAL COMPARISON ### # COVERAGE TEST FOR THE UNIVARIATE RESPONSES library(Blossom) ctest <- coverage(df$x, df$g) summary(ctest) # Results: # Observed coverage statistic : 1.870273 # Mean of coverage statistic : 1.774817 # Estimated variance of coverage statistic: 0.002275862 # Standard deviation of the variance # of the coverage statistic : 5.108031e-05 # # Observed standardized coverage statistic: 2.00093 # Skewness of observed coverage statistic : 0.08127759 # Probability (Pearson Type III) # of a larger or equal coverage statistic : 0.02484709 # Probability (Resampled) # of a largeror equal coverage statistic : 0.02475* # MULTIRESPONSE PERMUTATION PROCEDURE FOR MULTIVARIATE RESPONSES mtest <- mrpp(x, g, df) summary(mtest) # Results: # Delta Observed : 1.676303 # Delta Expected : 1.708194 # Delta Variance : 4.262303e-06 # Delta Skewness : -1.671773 # # Standardized test statistic : -15.44685 # Probability (Pearson Type III) # of a smaller or equal delta : 9.433116e-09***

## Efficiency of Concatenating Two Large Tables in SAS

In SAS, there are 4 approaches to concatenate two datasets together:

1. SET statement,

2. APPEND procedure (or APPEND statement in DATASETS procedure)

3. UNION ALL in SQL procedure,

4. INSERT INTO in SQL procedure.

Below is an efficiency comparison among these 4 approaches, showing that APPEND procedure, followed by SET statement, is the most efficient in terms of CPU time.

data one two; do i = 1 to 2000000; x = put(i, best12.); if i le 1000 then output one; else output two; end; run; data test1; set one two; run; data test2; set one; run; proc append base = test2 data = two; run; proc sql; create table test3 as select * from one union all select * from two; quit; data test4; set one; run; proc sql; insert into test4 select * from two; quit;

## Granger Causality Test

# READ QUARTERLY DATA FROM CSV library(zoo) ts1 <- read.zoo('Documents/data/macros.csv', header = T, sep = ",", FUN = as.yearqtr) # CONVERT THE DATA TO STATIONARY TIME SERIES ts1$hpi_rate <- log(ts1$hpi / lag(ts1$hpi)) ts1$unemp_rate <- log(ts1$unemp / lag(ts1$unemp)) ts2 <- ts1[1:nrow(ts1) - 1, c(3, 4)] # METHOD 1: LMTEST PACKAGE library(lmtest) grangertest(unemp_rate ~ hpi_rate, order = 1, data = ts2) # Granger causality test # # Model 1: unemp_rate ~ Lags(unemp_rate, 1:1) + Lags(hpi_rate, 1:1) # Model 2: unemp_rate ~ Lags(unemp_rate, 1:1) # Res.Df Df F Pr(>F) # 1 55 # 2 56 -1 4.5419 0.03756 * # --- # Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 # METHOD 2: VARS PACKAGE library(vars) var <- VAR(ts2, p = 1, type = "const") causality(var, cause = "hpi_rate")$Granger # Granger causality H0: hpi_rate do not Granger-cause unemp_rate # # data: VAR object var # F-Test = 4.5419, df1 = 1, df2 = 110, p-value = 0.0353 # AUTOMATICALLY SEARCH FOR THE MOST SIGNIFICANT RESULT for (i in 1:4) { cat("LAG =", i) print(causality(VAR(ts2, p = i, type = "const"), cause = "hpi_rate")$Granger) }

## SAS Macro for Jarque-Bera Normality Test

%macro jarque_bera(data = , var = ); **********************************************; * SAS macro to do Jarque-Bera Normality Test *; * ------------------------------------------ *; * wensui.liu@53.com *; **********************************************; options mprint mlogic symbolgen nodate; ods listing close; ods output moments = m1; proc univariate data = &data normal; var &var.; run; proc sql noprint; select nvalue1 into :n from m1 where upcase(compress(label1, ' ')) = 'N'; select put(nvalue1, best32.) into :s from m1 where upcase(compress(label1, ' ')) = 'SKEWNESS'; select put(nvalue2, best32.) into :k from m1 where upcase(compress(label2, ' ')) = 'KURTOSIS'; quit; data _temp_; jb = ((&s) ** 2 + (&k) ** 2 / 4) / 6 * &n; pvalue = 1 - probchi(jb, 2); put jb pvalue; run; ods listing; proc print data = _last_ noobs; run; %mend jarque_bera;

## Read A Block of Spreadsheet with R

In R, there are two ways to read a block of the spreadsheet, e.g. xlsx file, as the one shown below.

The xlsx package provides the most intuitive interface with readColumns() function by explicitly defining the starting and the ending columns and rows.

library(xlsx) file <- loadWorkbook("C:\\Documents and Settings\\Administrator\\Desktop\\test.xlsx") df1 <- readColumns(getSheets(file)[[1]], startColumn = 3, endColumn = 5, startRow = 5, endRow = 8, header = T) df1 # X Y Z # 1 1 A 2015-01-01 # 2 2 B 2015-02-01 # 3 3 C 2015-03-01

However, if we can define a named range for the block in the excel, the XLConnect package might be more convenient. In the example below, we first defined a range named as “block” within the spreadsheet and then called this named range with readNamedRegionFromFile() function without the necessity of specifying rows and columns.

library(XLConnect) df2 <- readNamedRegionFromFile("C:\\Documents and Settings\\Administrator\\Desktop\\test.xlsx", "block") df2 # X Y Z # 1 1 A 2015-01-01 # 2 2 B 2015-02-01 # 3 3 C 2015-03-01