In retail banking, predicting the probability of adverse account behavior, such as delinquency or default, is of key interest. A widely accepted industry practice is to classify accounts into two groups, the good and the bad, based upon the presence of certain adverse behaviors, and then to model this binary outcome with a discriminant model, e.g. logistic regression. However, an obvious limitation of discriminant models built on a binary outcome is that the two-state classification over-simplifies the adverse behavior of accounts. What financially impacts a financial institution is not only the presence of a certain adverse behavior but also the frequency of such behavior.

In the definition of the binary outcome, it is important to notice that delinquencies can also be measured directly as the frequency of overdue payments. Therefore, instead of modeling the binary outcome, a more sensible alternative might be to model the frequency of delinquencies within a given valuation horizon. In the statistical context, the standard model for a count outcome, e.g. frequency, is the Poisson regression model with probability function

f(Y_i | X_i) = exp(-λ_i) * (λ_i ^ Y_i) / Y_i!, where λ_i = exp(X_i`B)

It is assumed that each observed outcome Y_i is drawn from a Poisson distribution with conditional mean λ_i given the covariate vector X_i for case i. A strong assumption of the Poisson model is that the mean is equal to the variance, such that E(Y_i | X_i) = Var(Y_i | X_i) = λ_i, which is also known as Equi-Dispersion. In practice, however, this Equi-Dispersion assumption is too restrictive for many empirical applications. In real-world count outcomes, the variance often exceeds the mean, which is known as Over-Dispersion, for various reasons such as excess zeroes or a long right tail. For instance, in a credit card portfolio, the majority of cardholders should have zero delinquencies at any point in time, while a few might have more than three. Similar to the consequence of heteroskedasticity in a linear regression, Over-Dispersion in a Poisson model leads to underestimated standard errors of parameter estimates and therefore inflated t-statistics. Hence, the Poisson model is often inadequate in practice.
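As a quick illustration of the Equi-Dispersion check described above, the sketch below evaluates the Poisson probability function and compares the sample variance with the sample mean. The function names, the coefficient values, and the delinquency counts are all made up for illustration and do not come from any real portfolio.

```python
import math

def conditional_mean(x, beta):
    """lambda_i = exp(X_i`B) for one covariate vector x and coefficients beta."""
    return math.exp(sum(xj * bj for xj, bj in zip(x, beta)))

def poisson_pmf(y, lam):
    """f(y | lam) = exp(-lam) * lam ** y / y!, evaluated on the log scale."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def dispersion_ratio(counts):
    """Sample variance over sample mean; close to 1 under Equi-Dispersion."""
    n = len(counts)
    mean = sum(counts) / n
    variance = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return variance / mean

# Hypothetical coefficients (intercept, one risk driver) and one account's covariates.
lam = conditional_mean(x=[1.0, 0.8], beta=[-1.0, 1.2])
print("P(Y = 0):", poisson_pmf(0, lam))

# Hypothetical delinquency counts for 20 accounts: mostly zero, a few large.
counts = [0] * 14 + [1, 1, 2, 3, 5, 7]
print("variance / mean:", dispersion_ratio(counts))
```

For these made-up counts the ratio is well above 1, which is the pattern the text calls Over-Dispersion.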

Considered a generalization of the basic Poisson model, the Negative Binomial (NB) model accommodates Over-Dispersion in the data by including a dispersion parameter. In an NB model, it is assumed that the conditional mean λ_i for case i is determined not only by the observed heterogeneity explained by the covariate vector X_i but also by unobserved heterogeneity, denoted ε_i, that is independent of X_i such that

λ_i = exp(X_i`B + ε_i) = exp(X_i`B) * exp(ε_i), where exp(ε_i) ~ Gamma(1/α, 1/α)
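The mixing idea can be checked by simulation. The sketch below assumes that Gamma(1/α, 1/α) is read as shape 1/α and rate 1/α, so that exp(ε_i) has mean 1 and variance α. The sampler names are hypothetical, and the Poisson draw uses Knuth's multiplication method only because the Python standard library has no built-in Poisson sampler.

```python
import math
import random

def draw_poisson(lam, rng):
    """Draw one Poisson count with Knuth's multiplication method (small lam only)."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k

def simulate_gamma_poisson(mu, alpha, n, seed=0):
    """Simulate counts whose Poisson mean is mu * exp(eps_i)."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n):
        # random.gammavariate takes (shape, scale); rate 1/alpha means scale alpha.
        multiplier = rng.gammavariate(1.0 / alpha, alpha)
        counts.append(draw_poisson(mu * multiplier, rng))
    return counts

# Hypothetical values: exp(X_i`B) = 2.0 and dispersion alpha = 0.5.
counts = simulate_gamma_poisson(mu=2.0, alpha=0.5, n=20000, seed=1)
mean = sum(counts) / len(counts)
variance = sum((c - mean) ** 2 for c in counts) / (len(counts) - 1)
print("sample mean:", mean)
print("sample variance:", variance)
```

Under these assumptions the sample mean should land near 2.0 while the sample variance should land near 2.0 + 0.5 * 2.0 ^ 2 = 4.0, i.e. clearly above the mean.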

While there are many variants of the NB model, the most common one is the NB2 model proposed by Cameron and Trivedi (1986) with probability function

f(Y_i | X_i) = Gamma(Y_i + 1/α) / [Gamma(Y_i + 1) * Gamma(1/α)] * [(1/α) / (1/α + λ_i)] ^ (1/α) * [λ_i / (1/α + λ_i)] ^ Y_i, where α is the dispersion parameter

For the NB2 model, the conditional mean E(Y_i | X_i) is still λ_i, while the variance Var(Y_i | X_i) becomes λ_i + α * λ_i ^ 2. Since both λ_i > 0 and α > 0, the variance must exceed the mean, and therefore the issue of Over-Dispersion is addressed.
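The mean and variance stated above can be verified numerically. The sketch below is a minimal check that assumes the NB2 probability function with the last factor raised to the power Y_i, and sums it over a long but finite range of counts. The values of λ and α are arbitrary.

```python
import math

def nb2_pmf(y, lam, alpha):
    """NB2 probability of count y with mean lam and dispersion alpha (log scale)."""
    r = 1.0 / alpha
    log_p = (math.lgamma(y + r) - math.lgamma(y + 1) - math.lgamma(r)
             + r * math.log(r / (r + lam))
             + y * math.log(lam / (r + lam)))
    return math.exp(log_p)

def nb2_moments(lam, alpha, upper=500):
    """Total probability, mean, and variance by summing over 0..upper-1."""
    probs = [nb2_pmf(y, lam, alpha) for y in range(upper)]
    total = sum(probs)
    mean = sum(y * p for y, p in enumerate(probs))
    variance = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return total, mean, variance

lam, alpha = 1.5, 0.8  # arbitrary illustrative values
total, mean, variance = nb2_moments(lam, alpha)
print("total probability:", total)
print("mean:", mean, "expected:", lam)
print("variance:", variance, "expected:", lam + alpha * lam ** 2)
```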

A major limitation of standard count data models, such as Poisson and Negative Binomial, is that the data are assumed to be generated by a single process. However, in many cases, it might be more appropriate to assume that the data are governed by two or more processes. For instance, the risk drivers of a first-time delinquent account might be very different from those of an account that has been delinquent multiple times. From the business standpoint, the assumption of multiple processes is particularly attractive in that it offers the potential to segment the portfolio into two or more sub-groups based upon their delinquency patterns and loan characteristics.

Also known as a two-part model, the Hurdle Poisson model assumes that count outcomes come from two systematically different processes: a Binomial distribution determining the probability of zero counts and a Truncated-at-Zero Poisson distribution governing positive outcomes. The probability function can be expressed as

for Y_i = 0, f(Y_i | X_i) = θ_i, where θ_i = Prob(Y_i = 0)

for Y_i > 0, f(Y_i | X_i) = (1 - θ_i) * exp(-λ_i) * λ_i ^ Y_i / {[1 - exp(-λ_i)] * Y_i!}, where λ_i = exp(X_i`B)

In the modeling framework, the first process can be analyzed with a logistic regression and the second with a Truncated-at-Zero Poisson model. An advantage of the Hurdle model is that it is flexible enough to model both Over-Dispersed data with too many zeroes and Under-Dispersed data with too few zeroes.
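The flexibility claim can be illustrated with a small sketch of the Hurdle Poisson probability function. Here θ and λ are passed in as plain numbers rather than estimated from a logistic regression and a truncated Poisson fit, and the two θ values are arbitrary choices meant to represent "too many" and "too few" zeroes.

```python
import math

def hurdle_poisson_pmf(y, theta, lam):
    """Hurdle Poisson: theta at zero, rescaled truncated Poisson above zero."""
    if y == 0:
        return theta
    poisson = math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))
    return (1.0 - theta) * poisson / (1.0 - math.exp(-lam))

def dispersion_ratio(theta, lam, upper=200):
    """Variance / mean of the hurdle distribution by direct summation."""
    probs = [hurdle_poisson_pmf(y, theta, lam) for y in range(upper)]
    mean = sum(y * p for y, p in enumerate(probs))
    variance = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return variance / mean

lam = 1.2  # a plain Poisson(1.2) has P(Y = 0) = exp(-1.2), about 0.30
print("many zeroes (theta = 0.70):", dispersion_ratio(theta=0.70, lam=lam))
print("few zeroes  (theta = 0.05):", dispersion_ratio(theta=0.05, lam=lam))
```

With more zeroes than the plain Poisson implies, the ratio comes out above 1 (Over-Dispersion), and with fewer zeroes it comes out below 1 (Under-Dispersion).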

Similar to the Hurdle model, the Zero-Inflated Poisson (ZIP) model is another way to model count outcomes with excess zeroes under the assumption of two components. However, it differs from the Hurdle model in that zero outcomes are assumed to come from two different sources, one generating only zero outcomes and the other generating both zero and nonzero outcomes. Specifically, a Binomial distribution decides whether an individual belongs to the Always-Zero or the Not-Always-Zero group, and then a standard Poisson distribution describes counts in the Not-Always-Zero group. The probability function of the ZIP model is given as

for Y_i = 0, f(Y_i | X_i) = ω_i + (1 - ω_i) * exp(-λ_i), where ω_i = Prob(case i belongs to the Always-Zero group)

for Y_i > 0, f(Y_i | X_i) = (1 - ω_i) * exp(-λ_i) * λ_i ^ Y_i / Y_i!

Similar in idea to the Hurdle model, the ZIP model can also be represented jointly by two sub-models. A logistic regression separates the Always-Zero group from the Not-Always-Zero group, and a basic Poisson model is applied to individuals in the Not-Always-Zero group. From a business perspective, the ZIP model captures an important fact: some not-at-risk accounts are so well established that they will never have financial problems, while the at-risk ones might get into trouble during tough times. Therefore, risk exposures and underlying metrics for accounts with the same outcome of zero counts might still be distinguishable.
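To make the two sources of zeroes concrete, the sketch below codes the ZIP probability function under the assumption that ω is the probability of belonging to the Always-Zero group, so that a zero has probability ω + (1 - ω) * exp(-λ). It also applies Bayes' rule to ask how likely it is that an account with a zero count is in the Always-Zero group. The helper names and the values of ω and λ are illustrative only.

```python
import math

def zip_pmf(y, omega, lam):
    """Zero-Inflated Poisson probability of count y."""
    poisson = math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))
    if y == 0:
        return omega + (1.0 - omega) * poisson
    return (1.0 - omega) * poisson

def prob_always_zero_given_zero(omega, lam):
    """P(Always-Zero group | Y = 0) by Bayes' rule."""
    return omega / zip_pmf(0, omega, lam)

omega, lam = 0.6, 1.5  # illustrative values, not estimates
zip_mean = sum(y * zip_pmf(y, omega, lam) for y in range(200))
print("P(Y = 0):", zip_pmf(0, omega, lam))
print("mean:", zip_mean, "expected:", (1.0 - omega) * lam)
print("P(Always-Zero | Y = 0):", prob_always_zero_given_zero(omega, lam))
```

Two accounts with the same zero count can therefore carry different risk: the posterior probability depends on each account's own ω_i and λ_i.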

In practice, a sharp dichotomy between an at-risk group and a not-at-risk group might not be realistic. Even a customer in good financial condition might be exposed to risk in certain situations. Therefore, it might make sense to split the whole portfolio into a few segments with different levels of risk exposure. A Latent Class Poisson model provides such a mechanism by assuming that the population of interest is actually a mixture of S > 1 latent (unobservable) components and that each individual is a draw from one of these latent groups. The probability function of a Latent Class Poisson model with S = 2 classes is

f(Y_i | X_i) = P1_i * exp(-λ1_i) * λ1_i ^ Y_i / Y_i! + P2_i * exp(-λ2_i) * λ2_i ^ Y_i / Y_i!, where P1_i + P2_i = 1

Each latent component in the mixture is assumed to have its own parameter λ_i, which accounts for the unobserved heterogeneity in the population. For instance, in the case of S = 2, a portfolio is assumed to be a mixture of a high-risk group and a low-risk group. The impacts of predictors are allowed to differ across latent groups, which makes more informative and flexible interpretations possible.
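A two-class mixture of this kind can be sketched as follows. The class share and the two Poisson means are fixed, made-up numbers rather than fitted values, and the posterior membership function is an illustrative add-on (Bayes' rule applied to the mixture), not something the model description above prescribes.

```python
import math

def poisson_pmf(y, lam):
    """Poisson probability of count y, evaluated on the log scale."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def latent_class_pmf(y, p1, lam1, lam2):
    """Two-class Latent Class Poisson probability with P2 = 1 - P1."""
    return p1 * poisson_pmf(y, lam1) + (1.0 - p1) * poisson_pmf(y, lam2)

def posterior_class1(y, p1, lam1, lam2):
    """P(class 1 | Y = y) by Bayes' rule."""
    return p1 * poisson_pmf(y, lam1) / latent_class_pmf(y, p1, lam1, lam2)

# Hypothetical portfolio: 80% low-risk (mean 0.2) and 20% high-risk (mean 3.0).
p_low, lam_low, lam_high = 0.8, 0.2, 3.0
for y in range(5):
    print(y,
          round(latent_class_pmf(y, p_low, lam_low, lam_high), 4),
          round(posterior_class1(y, p_low, lam_low, lam_high), 4))
```

In this made-up example the probability of belonging to the low-risk class falls quickly as the observed count rises, which is how the mixture separates the two segments.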

Besides the models discussed above, it is also worth pointing out that discrete choice models, such as Logit or Probit, have been widely used to model count outcomes as well. However, such a discrete choice model needs to be based upon a sequential or ordered response rather than a multinomial one, e.g. the ordered Logit.
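For readers unfamiliar with the ordered Logit, a minimal sketch is given below. It assumes the common cumulative-logit form P(Y <= j) = 1 / (1 + exp(-(c_j - X`B))) with increasing cutpoints c_j and delinquency counts grouped into 0, 1, 2, and 3 or more. The cutpoints, the linear predictor, and the grouping are hypothetical choices for illustration.

```python
import math

def logistic(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def ordered_logit_probs(xb, cutpoints):
    """Category probabilities from P(Y <= j) = logistic(cutpoints[j] - xb)."""
    cumulative = [logistic(c - xb) for c in cutpoints] + [1.0]
    probs, previous = [], 0.0
    for cum in cumulative:
        probs.append(cum - previous)
        previous = cum
    return probs

# Hypothetical cutpoints separating the categories 0, 1, 2, and 3+.
cutpoints = [1.0, 2.0, 3.0]
low_risk = ordered_logit_probs(xb=0.5, cutpoints=cutpoints)
high_risk = ordered_logit_probs(xb=1.5, cutpoints=cutpoints)
print("low risk: ", [round(p, 4) for p in low_risk])
print("high risk:", [round(p, 4) for p in high_risk])
```

A larger linear predictor shifts probability away from the zero category and toward the higher delinquency categories, which preserves the ordering that a multinomial response would ignore.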