Yet Another Blog in Statistical Computing


Prediction Intervals for Poisson Regression

Unlike the confidence interval, which addresses the uncertainty in the estimated conditional mean, the prediction interval also accommodates the additional uncertainty associated with prediction errors. As a result, the prediction interval is always wider than the confidence interval in a regression model. In the context of risk modeling, the prediction interval is often used to address the potential model risk arising from these uncertainties.

While calculating the prediction interval of an OLS regression under the Gaussian distributional assumption is relatively straightforward with the off-the-shelf solution in R, it can be more complicated in a Generalized Linear Model, e.g. Poisson regression. In this post, I am going to show two empirical methods, one based on bootstrapping and the other based on simulation, for calculating the prediction interval of a Poisson regression. Because of the high computing cost, parallelism with the foreach() function will be used to improve efficiency.
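For reference, the off-the-shelf OLS case can be sketched with predict() on an lm fit. The built-in cars data set, the new speed values, and the object names below are assumptions chosen only for illustration; they are not part of the analysis in this post.

```r
# Illustrative only: built-in `cars` data, not the data used in this post
fit <- lm(dist ~ speed, data = cars)
new_speed <- data.frame(speed = c(10, 15, 20))

# Interval for the conditional mean
cint <- predict(fit, newdata = new_speed, interval = "confidence", level = 0.95)
# Interval for a new observation, which also carries the residual variance
pint <- predict(fit, newdata = new_speed, interval = "prediction", level = 0.95)

# Widths of the two intervals side by side
widths <- cbind(ci_width = cint[, "upr"] - cint[, "lwr"],
                pi_width = pint[, "upr"] - pint[, "lwr"])
widths
```

predict() offers no comparable interval option for a glm fit, which is why the empirical methods are needed.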

First of all, let’s estimate a Poisson regression with glm() and generate a few fake new data points for calculating model predictions. Since the toy data is very small, with only 32 records and all categorical predictors, I doubled the sample size with rbind() to ensure appropriate data coverage in the bootstrapping.

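# Load the parallel backend and the foreach looping construct
pkgs <- c('doParallel', 'foreach')
invisible(lapply(pkgs, require, character.only = TRUE))
registerDoParallel(cores = 4)

# AutoCollision has 32 records; stack it twice to double the sample size
data(AutoCollision, package = "insuranceData")
df <- rbind(AutoCollision, AutoCollision)
mdl <- glm(Claim_Count ~ Age + Vehicle_Use, data = df, family = poisson(link = "log"))
# Five fake new data points: the first two columns, used here as the predictors
new_fake <- df[1:5, 1:2]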
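Both functions in this post rely on the same foreach() pattern: each iteration returns a vector and .combine = rbind stacks the vectors into a matrix. A minimal sketch of that pattern is given here, with an assumed core count, made-up lambda values, and a hypothetical object name res.

```r
library(foreach)
library(doParallel)
registerDoParallel(cores = 2)  # assumed core count; adjust to the machine

# Each iteration returns a vector of three Poisson draws; rbind stacks the
# vectors so that rows index iterations and columns index positions
res <- foreach(i = 1:4, .combine = rbind) %dopar% {
  set.seed(i)  # per-iteration seed keeps the result reproducible
  rpois(3, lambda = c(5, 10, 20))
}
dim(res)  # 4 rows (iterations) by 3 columns

# Column-wise percentiles across iterations
t(apply(res, 2, quantile, probs = c(0.025, 0.975)))

stopImplicitCluster()
```

Setting the seed inside the loop body makes each iteration reproducible regardless of which worker runs it.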

The first method shown below is based on bootstrapping, with the following steps:

1. Bootstrap the original model development sample by random sampling with replacement;

2. Repeat the above many times, e.g. 1000, to generate different bootstrapped samples;

3. Refit the model with each bootstrapped sample;

4. Generate predictions for the new data points, e.g. “new_fake”, with the refitted models;

5. Generate random numbers from the Poisson distribution with the mean, i.e. lambda, equal to the predicted values from the refitted models;

6. Collect all Poisson random numbers from the previous step and calculate the percentiles.
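The steps above can be sketched serially in base R before turning to the parallel version. The toy data, the 200 replications, and all object names in this sketch are assumptions made for illustration, not the data or settings used in this post.

```r
set.seed(2016)
# Hypothetical toy data standing in for the model development sample
toy <- data.frame(g = factor(rep(c("a", "b", "c"), each = 20)))
toy$y <- rpois(nrow(toy), lambda = c(a = 5, b = 10, c = 20)[as.character(toy$g)])
fit <- glm(y ~ g, data = toy, family = poisson(link = "log"))
newd <- data.frame(g = factor(c("a", "b", "c"), levels = levels(toy$g)))

# Steps 1-5, repeated 200 times: resample rows, refit, predict, draw counts
draws <- replicate(200, {
  idx <- sample.int(nrow(toy), replace = TRUE)
  refit <- glm(y ~ g, data = toy[idx, ], family = poisson(link = "log"))
  rpois(nrow(newd), lambda = predict(refit, newdata = newd, type = "response"))
})

# Step 6: percentiles per new data point (rows of `draws` are new data points)
boot_int <- t(apply(draws, 1, quantile, probs = c(0.025, 0.975)))
boot_int
```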

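boot_pi <- function(model, pdata, n, p) {
  odata <- model$data
  # Tail probabilities for a central interval with coverage p
  lp <- (1 - p) / 2
  up <- 1 - lp
  # One seed per replicate so that parallel runs are reproducible
  set.seed(2016)
  seeds <- round(runif(n, 1, 1000), 0)
  boot_y <- foreach(i = 1:n, .combine = rbind) %dopar% {
    set.seed(seeds[i])
    # Resample rows with replacement, refit, and predict for the new data
    bdata <- odata[sample(seq(nrow(odata)), size = nrow(odata), replace = TRUE), ]
    bpred <- predict(update(model, data = bdata), type = "response", newdata = pdata)
    # Add the Poisson sampling noise around each predicted mean
    rpois(length(bpred), lambda = bpred)
  }
  # Each column of boot_y holds the n draws for one new data point
  boot_ci <- t(apply(boot_y, 2, quantile, c(lp, up)))
  return(data.frame(pred = predict(model, newdata = pdata, type = "response"), lower = boot_ci[, 1], upper = boot_ci[, 2]))
}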

boot_pi(mdl, new_fake, 1000, 0.95)
#      pred lower upper
#1 12.63040     6    21
#2 38.69738    25    55
#3 26.97271    16    39
#4 10.69951     4    18
#5 52.50839    35    70

The second method is based on simulation and is outlined below:

1. Reproduce the model response variable, e.g. Claim_Count, by simulating Poisson random numbers with lambda equal to the predicted values from the original model;

2. Repeat the above simulation many times, e.g. 1000, to generate many response series;

3. Generate 1000 updated model samples by replacing the original response with the new responses generated from the simulations;

4. Refit the model with each of these updated samples;

5. Generate predictions for the new data points, e.g. “new_fake”, with the refitted models;

6. Generate Poisson random numbers with lambda equal to the predicted values from the refitted models;

7. Collect all Poisson random numbers from the previous step and calculate the percentiles.
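Steps 1 and 2 above can also be sketched with base R's simulate(), which draws new responses from a fitted glm at its fitted means. This is an alternative sketch rather than the implementation used in this post; the toy data, the 200 replications, and the object names are assumptions for illustration.

```r
set.seed(2016)
# Hypothetical toy data with one categorical predictor
toy <- data.frame(g = factor(rep(c("a", "b"), each = 25)))
toy$y <- rpois(nrow(toy), lambda = ifelse(toy$g == "a", 4, 12))
fit <- glm(y ~ g, data = toy, family = poisson(link = "log"))

# Steps 1-2: one column per simulated response series
sims <- simulate(fit, nsim = 200, seed = 1)

newd <- data.frame(g = factor(c("a", "b"), levels = levels(toy$g)))
# Steps 3-6: swap in each simulated response, refit, predict, draw counts
draws <- sapply(sims, function(ysim) {
  sdata <- data.frame(y = ysim, g = toy$g)
  refit <- glm(y ~ g, data = sdata, family = poisson(link = "log"))
  rpois(nrow(newd), lambda = predict(refit, newdata = newd, type = "response"))
})

# Step 7: percentiles per new data point (rows of `draws` are new data points)
sim_int <- t(apply(draws, 1, quantile, probs = c(0.025, 0.975)))
sim_int
```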

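sim_pi <- function(model, pdata, n, p) {
  odata <- model$data
  # Fitted means of the original model, used as lambda in the simulation
  yhat <- predict(model, type = "response")
  # Predictor columns named in the model formula
  pvars <- all.vars(delete.response(terms(model)))
  lp <- (1 - p) / 2
  up <- 1 - lp
  set.seed(2016)
  seeds <- round(runif(n, 1, 1000), 0)
  sim_y <- foreach(i = 1:n, .combine = rbind) %dopar% {
    set.seed(seeds[i])
    # Simulate a new response series and pair it with the original predictors
    ysim <- rpois(length(yhat), lambda = yhat)
    sdata <- data.frame(y = ysim, odata[pvars])
    refit <- glm(y ~ ., data = sdata, family = poisson)
    bpred <- predict(refit, type = "response", newdata = pdata)
    # Add the Poisson sampling noise around each predicted mean
    rpois(length(bpred), lambda = bpred)
  }
  # Each column of sim_y holds the n draws for one new data point
  sim_ci <- t(apply(sim_y, 2, quantile, c(lp, up)))
  return(data.frame(pred = predict(model, newdata = pdata, type = "response"), lower = sim_ci[, 1], upper = sim_ci[, 2]))
}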

sim_pi(mdl, new_fake, 1000, 0.95)
#      pred lower upper
#1 12.63040     6    21
#2 38.69738    26    52
#3 26.97271    17    39
#4 10.69951     4    18
#5 52.50839    38    68

As demonstrated above, after a large number of replications, outcomes from both methods are highly consistent: across the five new data points, the lower and upper bounds differ by at most three claims.
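The agreement can be checked directly from the two printed outputs. The bounds below are copied from those results; the object names are hypothetical.

```r
# Bounds copied from the printed outputs of boot_pi() and sim_pi()
boot <- data.frame(lower = c(6, 25, 16, 4, 35), upper = c(21, 55, 39, 18, 70))
sim  <- data.frame(lower = c(6, 26, 17, 4, 38), upper = c(21, 52, 39, 18, 68))

# Absolute differences between the two methods at each bound
abs_diff <- abs(boot - sim)
max_gap <- max(unlist(abs_diff))
max_gap  # 3

# Interval widths under each method
widths <- cbind(boot_width = boot$upper - boot$lower,
                sim_width = sim$upper - sim$lower)
widths
```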


Written by statcompute

December 20, 2015 at 2:54 pm
