In the insurance industry, Poisson regression (PR) has been widely used to model insurance claims. In real-world auto insurance data, more than 90% of the insured report zero claims, which can cause difficulties in two respects if PR is employed. First, the observed number of zero counts exceeds the number predicted by PR. Second, the variance of the observed outcome might exceed its mean, a violation of the equi-dispersion assumption of PR. To improve model performance, alternatives to PR should be considered.
As the most common alternative to PR, Negative Binomial (NB) regression addresses the issue of over-dispersion by including a Gamma-distributed random component. In NB, the variance can be expressed as the mean plus a non-negative term such that
V(Y|X) = mu + mu^2 / theta >= mu = E(Y|X)
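Here mu is the conditional mean and theta > 0 is the dispersion parameter; the extra term vanishes as theta grows large, in which case NB approaches PR. Therefore, NB has a more flexible variance structure than PR does. While NB is able to capture the heterogeneity in the count outcome, it has been criticized for failing to provide a satisfactory explanation of the excess zeros.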
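The variance relationship can be checked numerically. The following is a minimal sketch, not a fitted model: it uses only the Python standard library, the values of mu and theta are illustrative assumptions rather than estimates from any insurance data, and the function names are hypothetical helpers introduced for this example.

```python
import math

def poisson_pmf(y, mu):
    """Poisson probability of observing count y when the mean is mu."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

def nb_pmf(y, mu, theta):
    """Negative Binomial probability of count y with mean mu and dispersion theta."""
    log_p = (math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
             + theta * math.log(theta / (theta + mu))
             + y * math.log(mu / (theta + mu)))
    return math.exp(log_p)

def mean_and_variance(pmf, upper=200):
    """Moments by direct summation over 0..upper-1 (the tail is negligible here)."""
    mean = sum(y * pmf(y) for y in range(upper))
    var = sum((y - mean) ** 2 * pmf(y) for y in range(upper))
    return mean, var

mu, theta = 0.1, 0.5  # illustrative values (assumption), not real estimates

pr_mean, pr_var = mean_and_variance(lambda y: poisson_pmf(y, mu))
nb_mean, nb_var = mean_and_variance(lambda y: nb_pmf(y, mu, theta))

print(f"PR: mean={pr_mean:.4f} variance={pr_var:.4f} P(Y=0)={poisson_pmf(0, mu):.4f}")
print(f"NB: mean={nb_mean:.4f} variance={nb_var:.4f} P(Y=0)={nb_pmf(0, mu, theta):.4f}")
```

With the same mean, the NB variance should come out as mu + mu^2 / theta, and NB also assigns somewhat more probability to zero than PR does, although this alone does not explain where the extra zeros come from.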
Known as the two-part model, Hurdle regression (Mullahy 1986) provides a more appealing interpretation for a count outcome with excess zeros. While the first part separates the potential insurance claimants (Y ≥ 1) from the non-claimants (Y = 0) with a Binomial model, the second part models the number of claims made by those potential claimants (Y ≥ 1) with a truncated-at-zero Poisson model. Thus, the density of the Hurdle model can be expressed as
F(Y|X) = θ                              for Y = 0
       = (1 – θ) * G(Y) / (1 – G(0))    for Y > 0
Here θ is the probability of being a non-claimant and G(.) is a count density, e.g. Poisson or Negative Binomial. The major motivation for Hurdle regression in the auto insurance context is that the insured individual tends to behave differently once he/she makes the first claim, which is in line with what we observe in most human behavior.
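A small sketch of the Hurdle density with a Poisson count part may help make the two parts concrete. It assumes fixed, illustrative values for θ and the Poisson rate (in a regression these would depend on X), and it normalizes the positive counts by 1 – G(0), the probability that the untruncated count density yields a positive value, so that the probabilities sum to one. The function names are hypothetical.

```python
import math

def poisson_pmf(y, lam):
    """Poisson probability of observing count y when the rate is lam."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def hurdle_pmf(y, theta, lam):
    """Hurdle density: theta at zero, truncated-at-zero Poisson for y > 0."""
    if y == 0:
        return theta
    # Rescale the Poisson probabilities so that the positive counts sum to 1 - theta.
    return (1 - theta) * poisson_pmf(y, lam) / (1 - poisson_pmf(0, lam))

theta, lam = 0.90, 0.5  # illustrative values (assumption), not real estimates

total = sum(hurdle_pmf(y, theta, lam) for y in range(100))
mean = sum(y * hurdle_pmf(y, theta, lam) for y in range(100))

print(f"P(Y=0) = {hurdle_pmf(0, theta, lam):.4f}")
print(f"sum of probabilities = {total:.6f}")
print(f"mean number of claims = {mean:.4f}")
```

Because the zero probability is set directly by θ, the share of non-claimants and the claim frequency among claimants can be modeled separately.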
Falling into the class of finite mixture models, Zero-Inflated Poisson (ZIP) regression (Lambert 1992) tries to accommodate excess zeros by assuming that zero counts come from two different sources, one that always produces zero and another that generates both zero and nonzero outcomes. For instance, in the insured population, some never make an insurance claim (Y = 0) despite the occurrence of accidents, while the others make claims (Y ≥ 0) whenever accidents happen. The density of the ZIP model can be formulated as
F(Y|X) = ω + (1 – ω) * H(0)    for Y = 0
       = (1 – ω) * H(Y)        for Y > 0
Here ω is the probability of an individual belonging to the always-zero group and H(.) is a count density, e.g. Poisson or Negative Binomial. Note that, unlike θ in the Hurdle model, ω in the ZIP model is not directly observable but is determined by the claim pattern of the insured. From the marketing perspective, the ZIP model is attractive in that the insurance company is able to differentiate the insured and therefore to charge different premiums based upon their claim behaviors.
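The ZIP density can be sketched in the same way. The example below assumes a Poisson count density for H and fixed, illustrative values for ω and the Poisson rate; the function names are hypothetical. It shows that zeros arise from both sources, so the total zero probability is larger than ω itself.

```python
import math

def poisson_pmf(y, lam):
    """Poisson probability of observing count y when the rate is lam."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def zip_pmf(y, omega, lam):
    """ZIP density: a point mass at zero mixed with an ordinary Poisson."""
    from_count_part = (1 - omega) * poisson_pmf(y, lam)
    # Zeros come from the always-zero group and from the Poisson group.
    return omega + from_count_part if y == 0 else from_count_part

omega, lam = 0.85, 0.6  # illustrative values (assumption), not real estimates

total = sum(zip_pmf(y, omega, lam) for y in range(100))
mean = sum(y * zip_pmf(y, omega, lam) for y in range(100))

print(f"P(Y=0) under ZIP = {zip_pmf(0, omega, lam):.4f}")
print(f"P(Y=0) under plain Poisson with the same mean = {poisson_pmf(0, mean):.4f}")
print(f"sum of probabilities = {total:.6f}")
```

Since an observed zero cannot be assigned to either source with certainty, ω has to be inferred from the overall claim pattern rather than read off the share of zero counts.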
Considered a generalization of the ZIP model, Latent Class Poisson (LCP) regression (Wedel 1993) assumes that all individuals, instead of just the ones with zero counts, are drawn from a finite mixture of Poisson distributions. Given that even a careful driver might file a claim, it is more desirable to avoid the somewhat unrealistic dichotomization based solely upon the claim pattern and instead to distinguish the insured by latent risk factors such as lifestyle, attitude to risk, financial ability, and so on. For a population consisting of K mixing components with proportions π1, ..., πK, the density of LCP is given as
F(Y|X) = SUM(i = 1 to K) πi * Fi(Y|X)
Here πi is the probability of an individual coming from component i, with SUM(i = 1 to K) πi = 1, and Fi(.) is the count density for component i. In the mixture distribution, the impact of each explanatory variable is allowed to differ across components to account for the heterogeneity in the population. Therefore, LCP provides the ability to differentiate the insured in a more flexible manner and a better framework for market segmentation.
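As a rough illustration of the mixture density, the sketch below evaluates a two-component LCP with a single explanatory variable x. The log link for each component's Poisson rate, the mixing proportions, and the component-specific coefficients are all assumptions made for this example, and the function names are hypothetical.

```python
import math

def poisson_pmf(y, lam):
    """Poisson probability of observing count y when the rate is lam."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def lcp_pmf(y, x, weights, coefs):
    """Mixture density: sum over components of pi_i * Poisson(y; lam_i(x))."""
    total = 0.0
    for pi_i, (b0, b1) in zip(weights, coefs):
        lam_i = math.exp(b0 + b1 * x)  # component-specific log-linear rate (assumption)
        total += pi_i * poisson_pmf(y, lam_i)
    return total

weights = [0.7, 0.3]                # mixing proportions, must sum to 1
coefs = [(-3.0, 0.2), (-1.0, 0.8)]  # hypothetical (intercept, slope) per component
x = 1.0                             # value of the explanatory variable

total = sum(lcp_pmf(y, x, weights, coefs) for y in range(100))
mean = sum(y * lcp_pmf(y, x, weights, coefs) for y in range(100))
var = sum((y - mean) ** 2 * lcp_pmf(y, x, weights, coefs) for y in range(100))

print(f"sum of probabilities = {total:.6f}")
print(f"mean = {mean:.4f}, variance = {var:.4f}")
```

Because the slope on x differs between the two components, the same change in x shifts the expected claim count by a different amount in each latent class, and the mixture as a whole has a variance larger than its mean.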