## Archive for **November 2012**

## Another Way to Access R from Python – PypeR

Different from RPy2, PypeR provides another simple way to access R from Python, through pipes (http://www.jstatsoft.org/v35/c02/paper). This handy feature enables data analysts to do the data munging with Python and the statistical analysis with R by passing objects interactively between the two computing environments.

Below is a simple demonstration of how to call R within Python through PypeR, estimate a Beta regression, and then return the model prediction from R back to Python.

```
In [1]: # LOAD PYTHON PACKAGES

In [2]: import pandas as pd

In [3]: import pyper as pr

In [4]: # READ DATA

In [5]: data = pd.read_table("/home/liuwensui/Documents/data/csdata.txt", header = 0)

In [6]: # CREATE A R INSTANCE WITH PYPER

In [7]: r = pr.R(use_pandas = True)

In [8]: # PASS DATA FROM PYTHON TO R

In [9]: r.assign("rdata", data)

In [10]: # SHOW DATA SUMMARY

In [11]: print r("summary(rdata)")
try({summary(rdata)})
    LEV_LT3           TAX_NDEB           COLLAT1           SIZE1
 Min.   :0.00000   Min.   :  0.0000   Min.   :0.0000   Min.   : 7.738
 1st Qu.:0.00000   1st Qu.:  0.3494   1st Qu.:0.1241   1st Qu.:12.317
 Median :0.00000   Median :  0.5666   Median :0.2876   Median :13.540
 Mean   :0.09083   Mean   :  0.8245   Mean   :0.3174   Mean   :13.511
 3rd Qu.:0.01169   3rd Qu.:  0.7891   3rd Qu.:0.4724   3rd Qu.:14.751
 Max.   :0.99837   Max.   :102.1495   Max.   :0.9953   Max.   :18.587
     PROF2              GROWTH2             AGE              LIQ
 Min.   :0.0000158   Min.   :-81.248   Min.   :  6.00   Min.   :0.00000
 1st Qu.:0.0721233   1st Qu.: -3.563   1st Qu.: 11.00   1st Qu.:0.03483
 Median :0.1203435   Median :  6.164   Median : 17.00   Median :0.10854
 Mean   :0.1445929   Mean   : 13.620   Mean   : 20.37   Mean   :0.20281
 3rd Qu.:0.1875148   3rd Qu.: 21.952   3rd Qu.: 25.00   3rd Qu.:0.29137
 Max.   :1.5902009   Max.   :681.354   Max.   :210.00   Max.   :1.00018
     IND2A            IND3A            IND4A             IND5A
 Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000
 Median :1.0000   Median :0.0000   Median :0.00000   Median :0.00000
 Mean   :0.6116   Mean   :0.1902   Mean   :0.02692   Mean   :0.09907
 3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.00000
 Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :1.00000

In [12]: # LOAD R PACKAGE

In [13]: r("library(betareg)")
Out[13]: 'try({library(betareg)})\nLoading required package: Formula\n'

In [14]: # ESTIMATE A BETA REGRESSION

In [15]: r("m <- betareg(LEV_LT3 ~ SIZE1 + PROF2 + GROWTH2 + AGE + IND3A, data = rdata, subset = LEV_LT3 > 0)")
Out[15]: 'try({m <- betareg(LEV_LT3 ~ SIZE1 + PROF2 + GROWTH2 + AGE + IND3A, data = rdata, subset = LEV_LT3 > 0)})\n'

In [16]: # OUTPUT MODEL SUMMARY

In [17]: print r("summary(m)")
try({summary(m)})

Call:
betareg(formula = LEV_LT3 ~ SIZE1 + PROF2 + GROWTH2 + AGE + IND3A,
    data = rdata, subset = LEV_LT3 > 0)

Standardized weighted residuals 2:
    Min      1Q  Median      3Q     Max
-7.2802 -0.5194  0.0777  0.6037  5.8777

Coefficients (mean model with logit link):
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.229773   0.312990   3.929 8.53e-05 ***
SIZE1       -0.105009   0.021211  -4.951 7.39e-07 ***
PROF2       -2.414794   0.377271  -6.401 1.55e-10 ***
GROWTH2      0.003306   0.001043   3.169  0.00153 **
AGE         -0.004999   0.001795  -2.786  0.00534 **
IND3A        0.688314   0.074069   9.293  < 2e-16 ***

Phi coefficients (precision model with identity link):
      Estimate Std. Error z value Pr(>|z|)
(phi)   3.9362     0.1528   25.77   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Type of estimator: ML (maximum likelihood)
Log-likelihood: 266.7 on 7 Df
Pseudo R-squared: 0.1468
Number of iterations: 25 (BFGS) + 2 (Fisher scoring)

In [18]: # CALCULATE MODEL PREDICTION

In [19]: r("beta_fit <- predict(m, link = 'response')")
Out[19]: "try({beta_fit <- predict(m, link = 'response')})\n"

In [20]: # SHOW PREDICTION SUMMARY IN R

In [21]: print r("summary(beta_fit)")
try({summary(beta_fit)})
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.1634  0.3069  0.3465  0.3657  0.4007  0.6695

In [22]: # PASS DATA FROM R TO PYTHON

In [23]: pydata = pd.DataFrame(r.get("beta_fit"), columns = ["y_hat"])

In [24]: # SHOW PREDICTION SUMMARY IN PYTHON

In [25]: pydata.y_hat.describe()
Out[25]:
count    1116.000000
mean        0.365675
std         0.089804
min         0.163388
25%         0.306897
50%         0.346483
75%         0.400656
max         0.669489
```

## Run R Code Within Python On The Fly

Below is an example showing how to run R code within Python on the fly, an extremely attractive feature for hardcore R programmers.

```
In [1]: import rpy2.robjects as ro

In [2]: _null_ = ro.r('data <- read.table("/home/liuwensui/data/credit_count.txt", header = TRUE, sep = ",")')

In [3]: print ro.r('str(data)')
'data.frame':   13444 obs. of  14 variables:
 $ CARDHLDR: int  0 0 1 1 1 1 1 1 1 1 ...
 $ DEFAULT : int  0 0 0 0 0 0 0 0 0 0 ...
 $ AGE     : num  27.2 40.8 37.7 42.5 21.3 ...
 $ ACADMOS : int  4 111 54 60 8 78 25 6 20 162 ...
 $ ADEPCNT : int  0 3 3 3 0 1 1 0 3 7 ...
 $ MAJORDRG: int  0 0 0 0 0 0 0 0 0 0 ...
 $ MINORDRG: int  0 0 0 0 0 0 0 0 0 0 ...
 $ OWNRENT : int  0 1 1 1 0 0 1 0 0 1 ...
 $ INCOME  : num  1200 4000 3667 2000 2917 ...
 $ SELFEMPL: int  0 0 0 0 0 0 0 0 0 0 ...
 $ INCPER  : num  18000 13500 11300 17250 35000 ...
 $ EXP_INC : num  0.000667 0.000222 0.03327 0.048427 0.016523 ...
 $ SPENDING: num  NA NA 122 96.9 48.2 ...
 $ LOGSPEND: num  NA NA 4.8 4.57 3.88 ...
NULL

In [4]: _null_ = ro.r('sample <- data[data$CARDHLDR == 1,]')

In [5]: print ro.r('summary(sample)')
    CARDHLDR    DEFAULT            AGE           ACADMOS         ADEPCNT
 Min.   :1   Min.   :0.00000   Min.   : 0.00   Min.   :  0.0   Min.   :0.0000
 1st Qu.:1   1st Qu.:0.00000   1st Qu.:25.75   1st Qu.: 12.0   1st Qu.:0.0000
 Median :1   Median :0.00000   Median :31.67   Median : 30.0   Median :0.0000
 Mean   :1   Mean   :0.09487   Mean   :33.67   Mean   : 55.9   Mean   :0.9904
 3rd Qu.:1   3rd Qu.:0.00000   3rd Qu.:39.75   3rd Qu.: 72.0   3rd Qu.:2.0000
 Max.   :1   Max.   :1.00000   Max.   :88.67   Max.   :564.0   Max.   :9.0000
    MAJORDRG         MINORDRG         OWNRENT           INCOME
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :  50
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1750
 Median :0.0000   Median :0.0000   Median :0.0000   Median :2292
 Mean   :0.1433   Mean   :0.2207   Mean   :0.4791   Mean   :2606
 3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:3042
 Max.   :6.0000   Max.   :7.0000   Max.   :1.0000   Max.   :8333
    SELFEMPL           INCPER           EXP_INC            SPENDING
 Min.   :0.00000   Min.   :   700   Min.   :0.000096   Min.   :   0.111
 1st Qu.:0.00000   1st Qu.: 12900   1st Qu.:0.025998   1st Qu.:  58.753
 Median :0.00000   Median : 20000   Median :0.058957   Median : 139.992
 Mean   :0.05362   Mean   : 22581   Mean   :0.090744   Mean   : 226.983
 3rd Qu.:0.00000   3rd Qu.: 28337   3rd Qu.:0.116123   3rd Qu.: 284.440
 Max.   :1.00000   Max.   :150000   Max.   :2.037728   Max.   :4810.309
    LOGSPEND
 Min.   :-2.197
 1st Qu.: 4.073
 Median : 4.942
 Mean   : 4.729
 3rd Qu.: 5.651
 Max.   : 8.479

In [6]: print ro.r('summary(glm(DEFAULT ~ MAJORDRG + MINORDRG + OWNRENT + INCOME, data = sample, family = binomial))')

Call:
glm(formula = DEFAULT ~ MAJORDRG + MINORDRG + OWNRENT + INCOME,
    family = binomial, data = sample)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.9587  -0.5003  -0.4351  -0.3305   3.1928

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.204e+00  9.084e-02 -13.259  < 2e-16 ***
MAJORDRG     2.031e-01  6.926e-02   2.933  0.00336 **
MINORDRG     2.027e-01  4.798e-02   4.225 2.38e-05 ***
OWNRENT     -2.012e-01  7.163e-02  -2.809  0.00496 **
INCOME      -4.422e-04  4.044e-05 -10.937  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6586.1  on 10498  degrees of freedom
Residual deviance: 6376.2  on 10494  degrees of freedom
AIC: 6386.2

Number of Fisher Scoring iterations: 6
```

## A Light Touch on RPy2

For a statistical analyst, the first step in a data analysis project is to import the data into the program and then to screen its descriptive statistics. In Python, we can easily do so with the pandas package.

```
In [1]: import pandas as pd

In [2]: data = pd.read_table("/home/liuwensui/Documents/data/csdata.txt", header = 0)

In [3]: pd.set_printoptions(precision = 5)

In [4]: print data.describe().to_string()
         LEV_LT3   TAX_NDEB    COLLAT1      SIZE1      PROF2    GROWTH2        AGE        LIQ      IND2A      IND3A      IND4A      IND5A
count  4421.0000  4421.0000  4421.0000  4421.0000  4421.0000  4421.0000  4421.0000  4421.0000  4421.0000  4421.0000  4421.0000  4421.0000
mean      0.0908     0.8245     0.3174    13.5109     0.1446    13.6196    20.3664     0.2028     0.6116     0.1902     0.0269     0.0991
std       0.1939     2.8841     0.2272     1.6925     0.1109    36.5177    14.5390     0.2333     0.4874     0.3925     0.1619     0.2988
min       0.0000     0.0000     0.0000     7.7381     0.0000   -81.2476     6.0000     0.0000     0.0000     0.0000     0.0000     0.0000
25%       0.0000     0.3494     0.1241    12.3170     0.0721    -3.5632    11.0000     0.0348     0.0000     0.0000     0.0000     0.0000
50%       0.0000     0.5666     0.2876    13.5396     0.1203     6.1643    17.0000     0.1085     1.0000     0.0000     0.0000     0.0000
75%       0.0117     0.7891     0.4724    14.7511     0.1875    21.9516    25.0000     0.2914     1.0000     0.0000     0.0000     0.0000
max       0.9984   102.1495     0.9953    18.5866     1.5902   681.3542   210.0000     1.0002     1.0000     1.0000     1.0000     1.0000
```

Tonight, I’d like to add some spice to my Python learning experience and do the work in a different flavor with the rpy2 package, which allows me to call R functions from Python.

```
In [5]: import rpy2.robjects as ro

In [6]: rdata = ro.packages.importr('utils').read_table("/home/liuwensui/Documents/data/csdata.txt", header = True)

In [7]: print ro.r.summary(rdata)
    LEV_LT3           TAX_NDEB           COLLAT1           SIZE1
 Min.   :0.00000   Min.   :  0.0000   Min.   :0.0000   Min.   : 7.738
 1st Qu.:0.00000   1st Qu.:  0.3494   1st Qu.:0.1241   1st Qu.:12.317
 Median :0.00000   Median :  0.5666   Median :0.2876   Median :13.540
 Mean   :0.09083   Mean   :  0.8245   Mean   :0.3174   Mean   :13.511
 3rd Qu.:0.01169   3rd Qu.:  0.7891   3rd Qu.:0.4724   3rd Qu.:14.751
 Max.   :0.99837   Max.   :102.1495   Max.   :0.9953   Max.   :18.587
     PROF2              GROWTH2             AGE              LIQ
 Min.   :0.0000158   Min.   :-81.248   Min.   :  6.00   Min.   :0.00000
 1st Qu.:0.0721233   1st Qu.: -3.563   1st Qu.: 11.00   1st Qu.:0.03483
 Median :0.1203435   Median :  6.164   Median : 17.00   Median :0.10854
 Mean   :0.1445929   Mean   : 13.620   Mean   : 20.37   Mean   :0.20281
 3rd Qu.:0.1875148   3rd Qu.: 21.952   3rd Qu.: 25.00   3rd Qu.:0.29137
 Max.   :1.5902009   Max.   :681.354   Max.   :210.00   Max.   :1.00018
     IND2A            IND3A            IND4A             IND5A
 Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000
 Median :1.0000   Median :0.0000   Median :0.00000   Median :0.00000
 Mean   :0.6116   Mean   :0.1902   Mean   :0.02692   Mean   :0.09907
 3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.00000
 Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :1.00000
```

As shown above, a similar analysis can be conducted by calling R functions from Python. This feature enables us to extract and process the data efficiently with Python without losing the graphical and statistical functionality of R.

## Calculating K-S Statistic with Python

The K-S (Kolmogorov-Smirnov) statistic, namely the maximum separation between the cumulative distributions of goods and bads across score values, is a measure to evaluate the predictiveness of a statistical model for binary outcomes and has been widely used in direct marketing and risk modeling.

Below is a demonstration of how to calculate the K-S statistic with fewer than 20 lines of Python code. In this code snippet, I am also trying to show how to do data munging effectively with the pandas and numpy packages.

As Wes McKinney pointed out in his book “Python for Data Analysis”, it is not necessary to become a proficient Python software developer in order to be a proficient Python data analyst.

```
In [1]: # IMPORT PACKAGES

In [2]: import pandas as pd

In [3]: import numpy as np

In [4]: # LOAD DATA FROM CSV FILE

In [5]: data = pd.read_csv('c:\\projects\\data.csv')

In [6]: data.describe()
Out[6]:
               bad        score
count  5522.000000  5522.000000
mean      0.197573   693.466135
std       0.398205    57.829769
min       0.000000   443.000000
25%       0.000000   653.000000
50%       0.000000   692.500000
75%       0.000000   735.000000
max       1.000000   848.000000

In [7]: data['good'] = 1 - data.bad

In [8]: # DEFINE 10 BUCKETS WITH EQUAL SIZE

In [9]: data['bucket'] = pd.qcut(data.score, 10)

In [10]: # GROUP THE DATA FRAME BY BUCKETS

In [11]: grouped = data.groupby('bucket', as_index = False)

In [12]: # CREATE A SUMMARY DATA FRAME

In [13]: agg1 = grouped.min().score

In [14]: agg1 = pd.DataFrame(grouped.min().score, columns = ['min_scr'])

In [15]: agg1['max_scr'] = grouped.max().score

In [16]: agg1['bads'] = grouped.sum().bad

In [17]: agg1['goods'] = grouped.sum().good

In [18]: agg1['total'] = agg1.bads + agg1.goods

In [19]: agg1
Out[19]:
   min_scr  max_scr  bads  goods  total
0      621      645   201    365    566
1      646      661   173    359    532
2      662      677   125    441    566
3      678      692    99    436    535
4      693      708    89    469    558
5      709      725    66    492    558
6      726      747    42    520    562
7      748      772    30    507    537
8      773      848    14    532    546
9      443      620   252    310    562

In [20]: # SORT THE DATA FRAME BY SCORE

In [21]: agg2 = (agg1.sort_index(by = 'min_scr')).reset_index(drop = True)

In [22]: agg2['odds'] = (agg2.goods / agg2.bads).apply('{0:.2f}'.format)

In [23]: agg2['bad_rate'] = (agg2.bads / agg2.total).apply('{0:.2%}'.format)

In [24]: # CALCULATE KS STATISTIC

In [25]: agg2['ks'] = np.round(((agg2.bads / data.bad.sum()).cumsum() - (agg2.goods / data.good.sum()).cumsum()), 4) * 100

In [26]: # DEFINE A FUNCTION TO FLAG MAX KS

In [27]: flag = lambda x: '<----' if x == agg2.ks.max() else ''

In [28]: # FLAG OUT MAX KS

In [29]: agg2['max_ks'] = agg2.ks.apply(flag)

In [30]: agg2
Out[30]:
   min_scr  max_scr  bads  goods  total   odds bad_rate     ks max_ks
0      443      620   252    310    562   1.23   44.84%  16.10
1      621      645   201    365    566   1.82   35.51%  26.29
2      646      661   173    359    532   2.08   32.52%  34.04
3      662      677   125    441    566   3.53   22.08%  35.55  <----
4      678      692    99    436    535   4.40   18.50%  34.78
5      693      708    89    469    558   5.27   15.95%  32.36
6      709      725    66    492    558   7.45   11.83%  27.30
7      726      747    42    520    562  12.38    7.47%  19.42
8      748      772    30    507    537  16.90    5.59%  10.72
9      773      848    14    532    546  38.00    2.56%  -0.00
```

## Fitting A Logistic Regression with Python

```
In [1]: from pandas import *

In [2]: import statsmodels.api as sm

In [3]: # LOAD EXTERNAL DATA

In [4]: data = read_table('C:\\data\\credit_count.txt', sep = ',')

In [5]: data
Out[5]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 13444 entries, 0 to 13443
Data columns:
CARDHLDR    13444  non-null values
DEFAULT     13444  non-null values
AGE         13444  non-null values
ACADMOS     13444  non-null values
ADEPCNT     13444  non-null values
MAJORDRG    13444  non-null values
MINORDRG    13444  non-null values
OWNRENT     13444  non-null values
INCOME      13444  non-null values
SELFEMPL    13444  non-null values
INCPER      13444  non-null values
EXP_INC     13444  non-null values
SPENDING    13444  non-null values
LOGSPEND    13444  non-null values
dtypes: float64(4), int64(8), object(2)

In [6]: # DEFINE RESPONSE

In [7]: Y = data[data.CARDHLDR == 1].DEFAULT

In [8]: # SUMMARIZE RESPONSE VARIABLE

In [9]: Y.describe()
Out[9]:
count    10499.000000
mean         0.094866
std          0.293044
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000

In [10]: # DEFINE PREDICTORS

In [11]: X = sm.add_constant(data[data.CARDHLDR == 1][['AGE', 'ADEPCNT', 'MAJORDRG', 'MINORDRG', 'INCOME', 'OWNRENT']])

In [12]: # SUMMARIZE PREDICTORS

In [13]: X.describe()
Out[13]:
                AGE       ADEPCNT      MAJORDRG      MINORDRG        INCOME       OWNRENT  const
count  10499.000000  10499.000000  10499.000000  10499.000000  10499.000000  10499.000000  10499
mean      33.674945      0.990380      0.143252      0.220688   2606.125933      0.479093      1
std       10.290998      1.273887      0.461568      0.637142   1287.983386      0.499587      0
min        0.000000      0.000000      0.000000      0.000000     50.000000      0.000000      1
25%       25.750000      0.000000      0.000000      0.000000   1750.000000      0.000000      1
50%       31.666666      0.000000      0.000000      0.000000   2291.666667      0.000000      1
75%       39.750000      2.000000      0.000000      0.000000   3041.666667      1.000000      1
max       88.666664      9.000000      6.000000      7.000000   8333.250000      1.000000      1

In [14]: # DEFINE A MODEL

In [15]: model = sm.GLM(Y, X, family = sm.families.Binomial())

In [16]: # FIT A MODEL

In [17]: result = model.fit()

In [18]: # PRINT RESULTS

In [19]: print result.summary()
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                DEFAULT   No. Observations:                10499
Model:                            GLM   Df Residuals:                    10492
Model Family:                Binomial   Df Model:                            6
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -3175.8
Date:                Thu, 08 Nov 2012   Deviance:                       6351.7
Time:                        23:24:02   Pearson chi2:                 1.11e+04
No. Iterations:                     7
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
AGE           -0.0095      0.004     -2.450      0.014        -0.017    -0.002
ADEPCNT        0.1338      0.029      4.655      0.000         0.077     0.190
MAJORDRG       0.2103      0.070      3.016      0.003         0.074     0.347
MINORDRG       0.2007      0.048      4.178      0.000         0.107     0.295
INCOME        -0.0005   4.19e-05    -11.057      0.000        -0.001    -0.000
OWNRENT       -0.2263      0.077     -2.924      0.003        -0.378    -0.075
const         -0.9648      0.133     -7.245      0.000        -1.226    -0.704
==============================================================================
```

## Another Class of Risk Models

In retail banking, a key interest is to predict the probability of accounts’ adverse behaviors, such as delinquencies or defaults. A widely accepted practice in the industry is to classify accounts into two groups, the good and the bad, based upon the presence of certain adverse behaviors, and then to model this binary outcome with discriminant models such as logistic regression. However, an obvious limitation of discriminant models built on a binary outcome is that the two-state classification over-simplifies accounts’ adverse behaviors. What financially impacts a financial institution is not only the presence of a certain adverse behavior but also its frequency.

In the definition of the binary outcome, it is important to notice that delinquencies can also be measured directly as the frequency of overdue payments. Therefore, instead of modeling the binary outcome, a more sensible alternative might be to model the frequency of delinquencies within a given valuation horizon. In the statistical context, the natural model for a count outcome, e.g. a frequency, is the Poisson regression model with probability function

f(Y_i | X_i) = exp(-λ_i) * (λ_i ^ Y_i) / Y_i!, where λ_i = exp(X_i'B)

It is assumed that each observed outcome Y_i is drawn from a Poisson distribution with conditional mean λ_i given the covariate vector X_i for case i. A strong assumption of the Poisson model is that the mean is equal to the variance such that E(Y_i | X_i) = Var(Y_i | X_i) = λ_i, which is also known as Equi-Dispersion. However, in practice, this Equi-Dispersion assumption is too restrictive for many empirical applications. In real-world count outcomes, the variance often exceeds the mean, namely Over-Dispersion, due to various reasons such as excess zeroes or a long right tail. For instance, in a credit card portfolio, the majority of cardholders should have zero delinquency at any point in time, while a few might have more than three. Similar to the consequence of heteroskedasticity in a linear regression, Over-Dispersion in a Poisson model will lead to deflated standard errors of parameter estimates and therefore inflated t-statistics. Hence, the Poisson model is often inadequate and practically unusable.
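To make the point concrete, below is a minimal sketch of fitting a Poisson regression and eyeballing the Equi-Dispersion assumption with statsmodels. The data file counts.csv, the count outcome cnt, and the covariates x1 and x2 are hypothetical placeholders, not data from the posts above.

```python
import pandas as pd
import statsmodels.api as sm

# hypothetical data: a count outcome "cnt" and two covariates "x1" and "x2"
df = pd.read_csv("counts.csv")
X = sm.add_constant(df[["x1", "x2"]])

# fit a Poisson GLM with the log link
poisson = sm.GLM(df.cnt, X, family = sm.families.Poisson()).fit()
print poisson.summary()

# a crude check of Equi-Dispersion: the Pearson chi-square over the
# residual degrees of freedom should be near 1; a ratio well above 1
# signals Over-Dispersion
print poisson.pearson_chi2 / poisson.df_resid
```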

Considered a generalization of the basic Poisson model, the Negative Binomial (NB) model accommodates Over-Dispersion in the data by including a dispersion parameter. In an NB model, it is assumed that the conditional mean λ_i for case i is determined not only by the observed heterogeneity explained by the covariate vector X_i but also by an unobserved heterogeneity ε_i that is independent of X_i such that

λ_i = exp(X_i'B + ε_i) = exp(X_i'B) * exp(ε_i), where exp(ε_i) ~ Gamma(1/α, 1/α)

While there are many variants of the NB model, the most common one is the NB2 model proposed by Cameron and Trivedi (1986) with probability function

f(Y_i | X_i) = Gamma(Y_i + 1/α) / [Gamma(Y_i + 1) * Gamma(1/α)] * [(1/α) / (1/α + λ_i)] ^ (1/α) * [λ_i / (1/α + λ_i)] ^ Y_i, where α is the dispersion parameter

For the NB2 model, the conditional mean E(Y_i | X_i) is still λ_i, while the variance Var(Y_i | X_i) becomes λ_i + α * λ_i ^ 2. Since both λ_i > 0 and α > 0, the variance must exceed the mean, and the issue of Over-Dispersion is thereby addressed.
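As a sketch only, the same regression can be refitted as an NB2 within the GLM framework of statsmodels, which takes the dispersion parameter α as given rather than estimating it; the data and variable names are still the hypothetical ones from the Poisson sketch above.

```python
import statsmodels.api as sm

# reuse the hypothetical df and X from the Poisson sketch; alpha is
# supplied rather than estimated in the GLM framework, so in practice one
# would profile over a grid of alpha values or estimate alpha by maximum
# likelihood elsewhere
nb2 = sm.GLM(df.cnt, X, family = sm.families.NegativeBinomial(alpha = 1.0)).fit()
print nb2.summary()

# with any alpha > 0, the implied variance lambda + alpha * lambda^2
# always exceeds the mean lambda
```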

A major limitation of standard count data models, such as Poisson and Negative Binomial, is that the data is assumed to be generated by a single process. However, in many cases, it might be more appropriate to assume that the data is governed by two or more processes. For instance, it is believed that the risk drivers of a first-time delinquent account might be very different from those of an account that has been delinquent multiple times. From a business standpoint, the assumption of multiple processes is particularly attractive in that it provides the potential to segment the portfolio into two or more sub-groups based upon their delinquency patterns and loan characteristics.

Known as a two-part model, the Hurdle Poisson model assumes that count outcomes come from two systematically different processes: a Binomial distribution determines the probability of zero counts, and a Truncated-at-Zero Poisson governs the positive outcomes. The probability function can be expressed as

for Y_i = 0, f(Y_i | X_i) = θ_i, where θ_i = Prob(Y_i = 0)

for Y_i > 0, f(Y_i | X_i) = (1 – θ_i) * exp(-λ_i) * λ_i ^ Y_i / {[1 – exp(-λ_i)] * Y_i!}, where λ_i = exp(X_i'B)

In this modeling framework, the first process can be analyzed with a logistic regression and the second with a Truncated-at-Zero Poisson model. An advantage of the Hurdle model is that it is flexible enough to model both Over-Dispersed data with too many zeroes and Under-Dispersed data with too few zeroes.
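In the spirit of the PypeR post above, a Hurdle Poisson model can be sketched by calling the hurdle() function in R's pscl package from Python; the data frame and variable names remain the hypothetical ones assumed in the earlier sketches.

```python
import pyper as pr

# spawn an R instance and pass the hypothetical pandas DataFrame to it
r = pr.R(use_pandas = True)
r.assign("rdata", df)
r("library(pscl)")

# hurdle(): a Binomial model for zero vs. positive counts combined with
# a Truncated-at-Zero Poisson for the positive part
r("hm <- hurdle(cnt ~ x1 + x2, data = rdata, dist = 'poisson')")
print r("summary(hm)")
```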

Similar to the Hurdle model, the Zero-Inflated Poisson (ZIP) model is another way to model count outcomes with excess zeroes under the assumption of two components. However, it differs slightly from the Hurdle model in that zero outcomes are assumed to come from two different sources, one generating only zero outcomes and the other generating both zero and nonzero outcomes. Specifically, a Binomial distribution decides if an individual is from the Always-Zero group or the Not-Always-Zero group, and then a standard Poisson distribution describes counts in the Not-Always-Zero group. The probability function of the ZIP model is given as

for Y_i = 0, f(Y_i | X_i) = ω_i + (1 – ω_i) * exp(-λ_i), where ω_i is the probability of Y_i coming from the Always-Zero group

for Y_i > 0, f(Y_i | X_i) = (1 – ω_i) * exp(-λ_i) * λ_i ^ Y_i / Y_i!

With an idea similar to the Hurdle model, the ZIP model can also be represented jointly by two different sub-models. A logistic regression is used to separate the Always-Zero group from the Not-Always-Zero group, and a basic Poisson model is applied to individuals in the Not-Always-Zero group. From a business perspective, the ZIP model reflects an important fact: some not-at-risk accounts are so well established that they will never have financial problems, while the at-risk ones might get into trouble during tough times. Therefore, risk exposures and underlying metrics for accounts with the same zero count might still be differentiable.
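A parallel sketch uses the zeroinfl() function in the same pscl package, again through PypeR and under the same hypothetical data assumptions; in the formula, the part before the bar specifies the Poisson component and the part after the bar specifies the logit for the Always-Zero group.

```python
import pyper as pr

# same hypothetical setup as in the Hurdle sketch
r = pr.R(use_pandas = True)
r.assign("rdata", df)
r("library(pscl)")

# zeroinfl(): the logit separates the Always-Zero group from the
# Not-Always-Zero group, and a Poisson governs the latter
r("zip <- zeroinfl(cnt ~ x1 + x2 | x1 + x2, data = rdata, dist = 'poisson')")
print r("summary(zip)")
```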

In practice, a sharp dichotomization between the at-risk group and the not-at-risk group might not be realistic. Even a customer in good financial condition might be exposed to risks in certain situations. Therefore, it might make sense to split the whole portfolio into a couple of segments with different levels of risk exposure. A Latent Class Poisson model provides such a mechanism by assuming that the population of interest is actually a mixture of S > 1 latent (unobservable) components and that each individual is considered a draw from one of these latent groups. The probability function of a Latent Class Poisson model with S = 2 classes can be written as

f(Y_i | X_i) = P1_i * exp(-λ1_i) * λ1_i ^ Y_i / Y_i! + P2_i * exp(-λ2_i) * λ2_i ^ Y_i / Y_i!, where P1_i + P2_i = 1

Each latent component in the mixture is assumed to have its own parameter λ, which accounts for the unobserved heterogeneity in the population. For instance, in the case of S = 2, the portfolio is assumed to be a mixture of a high-risk group and a low-risk one. Impacts of predictors are allowed to differ across the latent groups, allowing for more informative and flexible interpretations.
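For the Latent Class Poisson model, one option is the flexmix package in R, reached through PypeR as before; the data remain the hypothetical placeholders, and k = 2 requests the two-class mixture discussed above.

```python
import pyper as pr

# same hypothetical setup as in the previous sketches
r = pr.R(use_pandas = True)
r.assign("rdata", df)
r("library(flexmix)")

# a two-class Poisson mixture: each latent class carries its own lambda
# and its own coefficient vector
r("lc <- flexmix(cnt ~ x1 + x2, data = rdata, k = 2, model = FLXMRglm(family = 'poisson'))")
print r("summary(lc)")
print r("parameters(lc)")
```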

Besides the models discussed above, it is worth pointing out that discrete choice models, such as Logit or Probit, have also been widely used to model count outcomes. However, such a discrete choice model needs to be based upon a sequential or ordered response rather than a multinomial one, namely the ordered Logit.