## Monotonic Binning with Smbinning Package

The R package smbinning (http://www.scoringmodeling.com/rpackage/smbinning) provides a very user-friendly interface for the WoE (Weight of Evidence) binning algorithm employed in the scorecard development. However, there are several improvement opportunities in my view:

1. First of all, the underlying algorithm in the smbinning() function utilizes the recursive partitioning, which does not necessarily guarantee the monotonicity.

2. Secondly, the density in each generated bin is not even. The frequency in some bins could be much higher than the one in others.

3. At last, the function might not provide the binning outcome for some variables due to the lack of statistical significance.

In light of the above, I wrote an enhanced version by utilizing the smbinning.custom() function, shown as below. The idea is very simple. Within the repeat loop, we would bin the variable iteratively until a certain criterion is met and then feed the list of cut points into the smbinning.custom() function. As a result, we are able to achieve a set of monotonic bins with similar frequencies regardless of the so-called “statistical significance”, which is a premature step for the variable transformation in my mind.

monobin <- function(data, y, x) { d1 <- data[c(y, x)] n <- min(20, nrow(unique(d1[x]))) repeat { d1$bin <- Hmisc::cut2(d1[, x], g = n) d2 <- aggregate(d1[-3], d1[3], mean) c <- cor(d2[-1], method = "spearman") if(abs(c[1, 2]) == 1 | n == 2) break n <- n - 1 } d3 <- aggregate(d1[-3], d1[3], max) cuts <- d3[-length(d3[, 3]), 3] return(smbinning::smbinning.custom(d1, y, x, cuts)) }

Below are a couple comparisons between the generic smbinning() and the home-brew monobin() functions with the use of a toy data.

In the first example, we applied the smbinning() function to a variable named “rev_util”. As shown in the highlighted rows in the column “BadRate”, the binning outcome is not monotonic.

Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate Odds LnOdds WoE IV 1 <= 0 965 716 249 965 716 249 0.1653 0.7420 0.2580 2.8755 1.0562 -0.2997 0.0162 2 <= 5 522 496 26 1487 1212 275 0.0894 0.9502 0.0498 19.0769 2.9485 1.5925 0.1356 3 <= 24 1166 1027 139 2653 2239 414 0.1998 0.8808 0.1192 7.3885 1.9999 0.6440 0.0677 4 <= 40 779 651 128 3432 2890 542 0.1335 0.8357 0.1643 5.0859 1.6265 0.2705 0.0090 5 <= 73 1188 932 256 4620 3822 798 0.2035 0.7845 0.2155 3.6406 1.2922 -0.0638 0.0008 6 <= 96 684 482 202 5304 4304 1000 0.1172 0.7047 0.2953 2.3861 0.8697 -0.4863 0.0316 7 > 96 533 337 196 5837 4641 1196 0.0913 0.6323 0.3677 1.7194 0.5420 -0.8140 0.0743 8 Missing 0 0 0 5837 4641 1196 0.0000 NaN NaN NaN NaN NaN NaN 9 Total 5837 4641 1196 NA NA NA 1.0000 0.7951 0.2049 3.8804 1.3559 0.0000 0.3352

Next, we did the same with the monobin() function. As shown below, the algorithm provided a monotonic binning at the cost of granularity. Albeit coarse, the result is directionally correct with no inversion.

Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate Odds LnOdds WoE IV 1 <= 30 2962 2495 467 2962 2495 467 0.5075 0.8423 0.1577 5.3426 1.6757 0.3198 0.0471 2 > 30 2875 2146 729 5837 4641 1196 0.4925 0.7464 0.2536 2.9438 1.0797 -0.2763 0.0407 3 Missing 0 0 0 5837 4641 1196 0.0000 NaN NaN NaN NaN NaN NaN 4 Total 5837 4641 1196 NA NA NA 1.0000 0.7951 0.2049 3.8804 1.3559 0.0000 0.0878

In the second example, we applied the smbinning() function to a variable named “bureau_score”. As shown in the highlighted rows, the frequencies in these two bins are much higher than the rest.

Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate Odds LnOdds WoE IV 1 <= 605 324 167 157 324 167 157 0.0555 0.5154 0.4846 1.0637 0.0617 -1.2942 0.1233 2 <= 632 468 279 189 792 446 346 0.0802 0.5962 0.4038 1.4762 0.3895 -0.9665 0.0946 3 <= 662 896 608 288 1688 1054 634 0.1535 0.6786 0.3214 2.1111 0.7472 -0.6087 0.0668 4 <= 699 1271 1016 255 2959 2070 889 0.2177 0.7994 0.2006 3.9843 1.3824 0.0264 0.0002 5 <= 717 680 586 94 3639 2656 983 0.1165 0.8618 0.1382 6.2340 1.8300 0.4741 0.0226 6 <= 761 1118 1033 85 4757 3689 1068 0.1915 0.9240 0.0760 12.1529 2.4976 1.1416 0.1730 7 > 761 765 742 23 5522 4431 1091 0.1311 0.9699 0.0301 32.2609 3.4739 2.1179 0.2979 8 Missing 315 210 105 5837 4641 1196 0.0540 0.6667 0.3333 2.0000 0.6931 -0.6628 0.0282 9 Total 5837 4641 1196 NA NA NA 1.0000 0.7951 0.2049 3.8804 1.3559 0.0000 0.8066

With the monobin() function applied to the same variable, we were able to get a set of more granular bins with similar frequencies.

Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate Odds LnOdds WoE IV 1 <= 617 513 284 229 513 284 229 0.0879 0.5536 0.4464 1.2402 0.2153 -1.1407 0.1486 2 <= 642 515 317 198 1028 601 427 0.0882 0.6155 0.3845 1.6010 0.4706 -0.8853 0.0861 3 <= 657 512 349 163 1540 950 590 0.0877 0.6816 0.3184 2.1411 0.7613 -0.5946 0.0363 4 <= 672 487 371 116 2027 1321 706 0.0834 0.7618 0.2382 3.1983 1.1626 -0.1933 0.0033 5 <= 685 494 396 98 2521 1717 804 0.0846 0.8016 0.1984 4.0408 1.3964 0.0405 0.0001 6 <= 701 521 428 93 3042 2145 897 0.0893 0.8215 0.1785 4.6022 1.5265 0.1706 0.0025 7 <= 714 487 418 69 3529 2563 966 0.0834 0.8583 0.1417 6.0580 1.8014 0.4454 0.0144 8 <= 730 489 441 48 4018 3004 1014 0.0838 0.9018 0.0982 9.1875 2.2178 0.8619 0.0473 9 <= 751 513 476 37 4531 3480 1051 0.0879 0.9279 0.0721 12.8649 2.5545 1.1986 0.0859 10 <= 775 492 465 27 5023 3945 1078 0.0843 0.9451 0.0549 17.2222 2.8462 1.4903 0.1157 11 > 775 499 486 13 5522 4431 1091 0.0855 0.9739 0.0261 37.3846 3.6213 2.2653 0.2126 12 Missing 315 210 105 5837 4641 1196 0.0540 0.6667 0.3333 2.0000 0.6931 -0.6628 0.0282 13 Total 5837 4641 1196 NA NA NA 1.0000 0.7951 0.2049 3.8804 1.3559 0.0000 0.7810

## Autoencoder for Dimensionality Reduction

We often use ICA or PCA to extract features from the high-dimensional data. The autoencoder is another interesting algorithm to achieve the same purpose in the context of Deep Learning.

with the purpose of learning a function to approximate the input data itself such that F(X) = X, an autoencoder consists of two parts, namely encoder and decoder. While the encoder aims to compress the original input data into a low-dimensional representation, the decoder tries to reconstruct the original input data based on the low-dimension representation generated by the encoder. As a result, the autoencoder has been widely used to remove the data noise as well to reduce the data dimension.

First of all, we will show the basic structure of an autoencoder with 1-layer encoder and 1-layer decoder, as below. In the example, we will compress the input data with 10 columns into a compressed on with 3 columns.

from pandas import read_csv, DataFrame from numpy.random import seed from sklearn.preprocessing import minmax_scale from sklearn.model_selection import train_test_split from keras.layers import Input, Dense from keras.models import Model df = read_csv("credit_count.txt") Y = df[df.CARDHLDR == 1].DEFAULTS X = df[df.CARDHLDR == 1].ix[:, 2:12] # SCALE EACH FEATURE INTO [0, 1] RANGE sX = minmax_scale(X, axis = 0) ncol = sX.shape[1] X_train, X_test, Y_train, Y_test = train_test_split(sX, Y, train_size = 0.5, random_state = seed(2017)) ### AN EXAMPLE OF SIMPLE AUTOENCODER ### # InputLayer (None, 10) # Dense (None, 5) # Dense (None, 10) input_dim = Input(shape = (ncol, )) # DEFINE THE DIMENSION OF ENCODER ASSUMED 3 encoding_dim = 3 # DEFINE THE ENCODER LAYER encoded = Dense(encoding_dim, activation = 'relu')(input_dim) # DEFINE THE DECODER LAYER decoded = Dense(ncol, activation = 'sigmoid')(encoded) # COMBINE ENCODER AND DECODER INTO AN AUTOENCODER MODEL autoencoder = Model(input = input_dim, output = decoded) # CONFIGURE AND TRAIN THE AUTOENCODER autoencoder.compile(optimizer = 'adadelta', loss = 'binary_crossentropy') autoencoder.fit(X_train, X_train, nb_epoch = 50, batch_size = 100, shuffle = True, validation_data = (X_test, X_test)) # THE ENCODER TO EXTRACT THE REDUCED DIMENSION FROM THE ABOVE AUTOENCODER encoder = Model(input = input_dim, output = encoded) encoded_input = Input(shape = (encoding_dim, )) encoded_out = encoder.predict(X_test) encoded_out[0:2] #array([[ 0. , 1.26510417, 1.62803197], # [ 2.32508397, 0.99735016, 2.06461048]], dtype=float32)

In the next example, we will relax the constraint of layers and employ a stack of layers to achievement the same purpose as above.

### AN EXAMPLE OF DEEP AUTOENCODER WITH MULTIPLE LAYERS # InputLayer (None, 10) # Dense (None, 20) # Dense (None, 10) # Dense (None, 5) # Dense (None, 3) # Dense (None, 5) # Dense (None, 10) # Dense (None, 20) # Dense (None, 10) input_dim = Input(shape = (ncol, )) # DEFINE THE DIMENSION OF ENCODER ASSUMED 3 encoding_dim = 3 # DEFINE THE ENCODER LAYERS encoded1 = Dense(20, activation = 'relu')(input_dim) encoded2 = Dense(10, activation = 'relu')(encoded1) encoded3 = Dense(5, activation = 'relu')(encoded2) encoded4 = Dense(encoding_dim, activation = 'relu')(encoded3) # DEFINE THE DECODER LAYERS decoded1 = Dense(5, activation = 'relu')(encoded4) decoded2 = Dense(10, activation = 'relu')(decoded1) decoded3 = Dense(20, activation = 'relu')(decoded2) decoded4 = Dense(ncol, activation = 'sigmoid')(decoded3) # COMBINE ENCODER AND DECODER INTO AN AUTOENCODER MODEL autoencoder = Model(input = input_dim, output = decoded4) # CONFIGURE AND TRAIN THE AUTOENCODER autoencoder.compile(optimizer = 'adadelta', loss = 'binary_crossentropy') autoencoder.fit(X_train, X_train, nb_epoch = 100, batch_size = 100, shuffle = True, validation_data = (X_test, X_test)) # THE ENCODER TO EXTRACT THE REDUCED DIMENSION FROM THE ABOVE AUTOENCODER encoder = Model(input = input_dim, output = encoded4) encoded_input = Input(shape = (encoding_dim, )) encoded_out = encoder.predict(X_test) encoded_out[0:2] #array([[ 3.74947715, 0. , 3.22947764], # [ 3.93903661, 0.17448257, 1.86618853]], dtype=float32)

## An Example of Merge Layer in Keras

The power of a DNN does not only come from its depth but also come from its flexibility of accommodating complex network structures. For instance, the DNN shown below consists of two branches, the left with 4 inputs and the right with 6 inputs. In addition, the right branch shows a more complicated structure than the left.

InputLayer (None, 6) Dense (None, 6) BatchNormalization (None, 6) Dense (None, 6) InputLayer (None, 4) BatchNormalization (None, 6) Dense (None, 4) Dense (None, 6) BatchNormalization (None, 4) BatchNormalization (None, 6) \____________________________________/ | Merge (None, 10) Dense (None, 1)

To create a DNN as the above, both left and right branches are defined separately with corresponding inputs and layers. In the line 29, both branches would be combined with a MERGE layer. There are multiple benefits of such merged DNNs. For instance, the DNN has the flexibility to handle various inputs differently. In addition, new features can be added conveniently without messing around with the existing network structure.

from pandas import read_csv, DataFrame from numpy.random import seed from sklearn.preprocessing import scale from keras.models import Sequential from keras.constraints import maxnorm from keras.optimizers import SGD from keras.layers import Dense, Merge from keras.layers.normalization import BatchNormalization from keras_diagram import ascii df = read_csv("credit_count.txt") Y = df[df.CARDHLDR == 1].DEFAULTS X1 = scale(df[df.CARDHLDR == 1][["MAJORDRG", "MINORDRG", "OWNRENT", "SELFEMPL"]]) X2 = scale(df[df.CARDHLDR == 1][["AGE", "ACADMOS", "ADEPCNT", "INCPER", "EXP_INC", "INCOME"]]) branch1 = Sequential() branch1.add(Dense(X1.shape[1], input_shape = (X1.shape[1],), init = 'normal', activation = 'relu')) branch1.add(BatchNormalization()) branch2 = Sequential() branch2.add(Dense(X2.shape[1], input_shape = (X2.shape[1],), init = 'normal', activation = 'relu')) branch2.add(BatchNormalization()) branch2.add(Dense(X2.shape[1], init = 'normal', activation = 'relu', W_constraint = maxnorm(5))) branch2.add(BatchNormalization()) branch2.add(Dense(X2.shape[1], init = 'normal', activation = 'relu', W_constraint = maxnorm(5))) branch2.add(BatchNormalization()) model = Sequential() model.add(Merge([branch1, branch2], mode = 'concat')) model.add(Dense(1, init = 'normal', activation = 'sigmoid')) sgd = SGD(lr = 0.1, momentum = 0.9, decay = 0, nesterov = False) model.compile(loss = 'binary_crossentropy', optimizer = sgd, metrics = ['accuracy']) seed(2017) model.fit([X1, X2], Y.values, batch_size = 2000, nb_epoch = 100, verbose = 1)

## Dropout Regularization in Deep Neural Networks

The deep neural network (DNN) is a very powerful neural work with multiple hidden layers and is able to capture the highly complex relationship between the response and predictors. However, it is prone to the over-fitting due to a large number of parameters that makes the regularization crucial for DNNs. In the paper (https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf), an interesting regularization approach, e.g. dropout, was proposed with a simple and elegant idea. Basically, it suppresses the complexity of DNNs by randomly dropping units in both input and hidden layers.

Below is an example showing how to tune the hyper-parameter of dropout rates with Keras library in Python. Because of the long computing time required by the dropout, the parallelism is used to speed up the process.

from pandas import read_csv, DataFrame from numpy.random import seed from sklearn.preprocessing import scale from sklearn.model_selection import train_test_split from sklearn.metrics import roc_auc_score from keras.models import Sequential from keras.constraints import maxnorm from keras.optimizers import SGD from keras.layers import Dense, Dropout from multiprocessing import Pool, cpu_count from itertools import product from parmap import starmap df = read_csv("credit_count.txt") Y = df[df.CARDHLDR == 1].DEFAULT X = df[df.CARDHLDR == 1][['AGE', 'ADEPCNT', 'MAJORDRG', 'MINORDRG', 'INCOME', 'OWNRENT', 'SELFEMPL']] sX = scale(X) ncol = sX.shape[1] x_train, x_test, y_train, y_test = train_test_split(sX, Y, train_size = 0.5, random_state = seed(2017)) def tune_dropout(rate1, rate2): net = Sequential() ## DROPOUT AT THE INPUT LAYER net.add(Dropout(rate1, input_shape = (ncol,))) ## DROPOUT AT THE 1ST HIDDEN LAYER net.add(Dense(ncol, init = 'normal', activation = 'relu', W_constraint = maxnorm(4))) net.add(Dropout(rate2)) ## DROPOUT AT THE 2ND HIDDER LAYER net.add(Dense(ncol, init = 'normal', activation = 'relu', W_constraint = maxnorm(4))) net.add(Dropout(rate2)) net.add(Dense(1, init = 'normal', activation = 'sigmoid')) sgd = SGD(lr = 0.1, momentum = 0.9, decay = 0, nesterov = False) net.compile(loss='binary_crossentropy', optimizer = sgd, metrics = ['accuracy']) net.fit(x_train, y_train, batch_size = 200, nb_epoch = 50, verbose = 0) print rate1, rate2, "{:6.4f}".format(roc_auc_score(y_test, net.predict(x_test))) input_dp = [0.1, 0.2, 0.3] hidden_dp = [0.2, 0.3, 0.4, 0.5] parms = [i for i in product(input_dp, hidden_dp)] seed(2017) starmap(tune_dropout, parms, pool = Pool(processes = cpu_count()))

As shown in the output below, the optimal dropout rate appears to be 0.2 incidentally for both input and hidden layers.

0.1 0.2 0.6354 0.1 0.4 0.6336 0.1 0.3 0.6389 0.1 0.5 0.6378 0.2 0.2 0.6419 0.2 0.4 0.6385 0.2 0.3 0.6366 0.2 0.5 0.6359 0.3 0.4 0.6313 0.3 0.2 0.6350 0.3 0.3 0.6346 0.3 0.5 0.6343

## Estimate Regression with (Type-I) Pareto Response

The Type-I Pareto distribution has a probability function shown as below

f(y; a, k) = k * (a ^ k) / (y ^ (k + 1))

In the formulation, the scale parameter **0 < a < y** and the shape parameter **k > 1 **.

The positive lower bound of Type-I Pareto distribution is particularly appealing in modeling the severity measure in that there is usually a reporting threshold for operational loss events. For instance, the reporting threshold of ABA operational risk consortium data is $10,000 and any loss event below the threshold value would be not reported, which might add the complexity in the severity model estimation.

In practice, instead of modeling the severity measure directly, we might model the shifted response ** y` = severity – threshold ** to accommodate the threshold value such that the supporting domain of y` could start from 0 and that the Gamma, Inverse Gaussian, or Lognormal regression can still be applicable. However, under the distributional assumption of Type-I Pareto with a known lower end, we do not need to shift the severity measure anymore but model it directly based on the probability function.

Below is the R code snippet showing how to estimate a regression model for the Pareto response with the lower bound ** a = 2 ** by using the **VGAM** package.

library(VGAM) set.seed(2017) n <- 200 a <- 2 x <- runif(n) k <- exp(1 + 5 * x) pdata <- data.frame(y = rpareto(n = n, scale = a, shape = k), x = x) fit <- vglm(y ~ x, paretoff(scale = a), data = pdata, trace = TRUE) summary(fit) # Coefficients: # Estimate Std. Error z value Pr(>|z|) # (Intercept) 1.0322 0.1363 7.574 3.61e-14 *** # x 4.9815 0.2463 20.229 < 2e-16 *** AIC(fit) # -644.458 BIC(fit) # -637.8614

The SAS code below estimating the Type-I Pareto regression provides almost identical model estimation.

proc nlmixed data = pdata; parms b0 = 0.1 b1 = 0.1; k = exp(b0 + b1 * x); a = 2; lh = k * (a ** k) / (y ** (k + 1)); ll = log(lh); model y ~ general(ll); run; /* Fit Statistics -2 Log Likelihood -648.5 AIC (smaller is better) -644.5 AICC (smaller is better) -644.4 BIC (smaller is better) -637.9 Parameter Estimate Standard DF t Value Pr > |t| Error b0 1.0322 0.1385 200 7.45 <.0001 b1 4.9815 0.2518 200 19.78 <.0001 */

At last, it is worth pointing out that the conditional mean of Type-I Pareto response is not equal to ** exp(x * beta) ** but ** a * k / (k – 1) ** with ** k = exp(x * beta) **. Therefore, the conditional mean only exists when ** k > 1 **, which might cause numerical issues in the model estimation.

## Pregibon Test for Goodness of Link in SAS

When estimating generalized linear models for binary outcomes, we often choose the logit link function by default and seldom consider other alternatives such as probit or cloglog. The Pregibon test (Pregibon, 1980) provides a mean to check the goodness of link with a simple logic outlined below.

1. First of all, we can estimate the regression model with the hypothesized link function, e.g. logit;

2. After the model estimation, we calculate yhat and yhat ^ 2 and then estimate a secondary regression with the identical response variable Y and link function but with yhat and yhat ^ 2 as model predictors (with the intercept).

3. If the link function is correctly specified, then the t-value of yaht ^2 should be insignificant.

The SAS macro shown below is the implementation of Pregibon test in the context of logistic regressions. However, the same idea can be generalized to any GLM.

%macro pregibon(data = , y = , x = ); ***********************************************************; * SAS MACRO PERFORMING PREGIBON TEST FOR GOODNESS OF LINK *; * ======================================================= *; * INPUT PAREMETERS: *; * DATA : INPUT SAS DATA TABLE *; * Y : THE DEPENDENT VARIABLE WITH 0 / 1 VALUES *; * X : MODEL PREDICTORS *; * ======================================================= *; * AUTHOR: WENSUI.LIU@53.COM *; ***********************************************************; options mprint mlogic nocenter; %let links = logit probit cloglog; %let loop = 1; proc sql noprint; select n(&data) - 3 into :df from &data; quit; %do %while (%scan(&links, &loop) ne %str()); %let link = %scan(&links, &loop); proc logistic data = &data noprint desc; model &y = &x / link = &link; score data = &data out = _out1; run; data _out2; set _out1(rename = (p_1 = p1)); p2 = p1 * p1; run; ods listing close; ods output ParameterEstimates = _parm; proc logistic data = _out2 desc; model &y = p1 p2 / link = &link ; run; ods listing; %if &loop = 1 %then %do; data _parm1; format link $10.; set _parm(where = (variable = "p2")); link = upcase("&link"); run; %end; %else %do; data _parm1; set _parm1 _parm(where = (variable = "p2") in = new); if new then link = upcase("&link"); run; %end; data _parm2(drop = variable); set _parm1; _t = estimate / stderr; _df = &df; _p = (1 - probt(abs(_t), _df)) * 2; run; %let loop = %eval(&loop + 1); %end; title; proc report data = _last_ spacing = 1 headline nowindows split = "*"; column(" * PREGIBON TEST FOR GOODNESS OF LINK * H0: THE LINK FUNCTION IS SPECIFIED CORRECTLY * " link _t _df _p); define link / "LINK FUNCTION" width = 15 order order = data; define _t / "T-VALUE" width = 15 format = 12.4; define _df / "DF" width = 10; define _p / "P-VALUE" width = 15 format = 12.4; run; %mend;

After applying the macro to the kyphosis data (https://stat.ethz.ch/R-manual/R-devel/library/rpart/html/kyphosis.html), we can see that both logit and probit can be considered appropriate link functions in this specific case and cloglog might not be a good choice.

PREGIBON TEST FOR GOODNESS OF LINK H0: THE LINK FUNCTION IS SPECIFIED CORRECTLY LINK FUNCTION T-VALUE DF P-VALUE ----------------------------------------------------------- LOGIT -1.6825 78 0.0965 PROBIT -1.7940 78 0.0767 CLOGLOG -2.3632 78 0.0206

## More about Flexible Frequency Models

Modeling the frequency is one of the most important aspects in operational risk models. In the previous post (https://statcompute.wordpress.com/2016/05/13/more-flexible-approaches-to-model-frequency), the importance of flexible modeling approaches for both under-dispersion and over-dispersion has been discussed.

In addition to the quasi-poisson regression, three flexible frequency modeling techniques, including generalized poisson, double poisson, and Conway-Maxwell poisson, with their implementations in R should also be demonstrated below. While the example is specifically related to the over-dispersed data simulated with the negative binomial distributional assumption, these approaches can be generalized to the under-dispersed data as well given their flexibility. However, as demonstrated below, the calculation of parameters for these modeling approaches is not straight-forward.

**Over-Dispersed Data Simulation**

> set.seed(1) > ### SIMULATE NEG. BINOMIAL WITH MEAN(X) = MU AND VAR(X) = MU + MU ^ 2 / THETA > df <- data.frame(y = MASS::rnegbin(1000, mu = 10, theta = 5)) > ### DATA MEAN > mean(df$y) [1] 9.77 > ### DATA VARIANCE > var(df$y) [1] 30.93003003

**Generalized Poisson Regression**

> library(VGAM) > gpois <- vglm(y ~ 1, data = df, family = genpoisson) > gpois.theta <- exp(coef(gpois)[2]) > gpois.lambda <- (exp(coef(gpois)[1]) - 1) / (exp(coef(gpois)[1]) + 1) > ### ESTIMATE MEAN = THETA / (1 - LAMBDA) > gpois.theta / (1 - gpois.lambda) (Intercept):2 9.77 > ### ESTIMATE VARIANCE = THETA / ((1 - LAMBDA) ^ 3) > gpois.theta / ((1 - gpois.lambda) ^ 3) (Intercept):2 31.45359991

**Double Poisson Regression**

> ### DOUBLE POISSON > library(gamlss) > dpois <- gamlss(y ~ 1, data = df, family = DPO, control = gamlss.control(n.cyc = 100)) > ### ESTIMATE MEAN > dpois.mu <- exp(dpois$mu.coefficients) > dpois.mu (Intercept) 9.848457877 > ### ESTIMATE VARIANCE = MU * SIGMA > dpois.sigma <- exp(dpois$sigma.coefficients) > dpois.mu * dpois.sigma (Intercept) 28.29229702

**Conway-Maxwell Poisson Regression**

> ### CONWAY-MAXWELL POISSON > library(CompGLM) > cpois <- glm.comp(y ~ 1, data = df) > cpois.lambda <- exp(cpois$beta) > cpois.nu <- exp(cpois$zeta) > ### ESTIMATE MEAN = LAMBDA ^ (1 / NU) - (NU - 1) / (2 * NU) > cpois.lambda ^ (1 / cpois.nu) - (cpois.nu - 1) / (2 * cpois.nu) (Intercept) 9.66575376 > ### ESTIMATE VARIANCE = LAMBDA ** (1 / NU) / NU > cpois.lambda ^ (1 / cpois.nu) / cpois.nu (Intercept) 29.69861239