# Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

## Monotonic Binning with Smbinning Package

The R package smbinning (http://www.scoringmodeling.com/rpackage/smbinning) provides a very user-friendly interface for the WoE (Weight of Evidence) binning algorithm employed in the scorecard development. However, there are several improvement opportunities in my view:

1. First of all, the underlying algorithm in the smbinning() function utilizes the recursive partitioning, which does not necessarily guarantee the monotonicity.
2. Secondly, the density in each generated bin is not even. The frequency in some bins could be much higher than the one in others.
3. At last, the function might not provide the binning outcome for some variables due to the lack of statistical significance.

In light of the above, I wrote an enhanced version by utilizing the smbinning.custom() function, shown as below. The idea is very simple. Within the repeat loop, we would bin the variable iteratively until a certain criterion is met and then feed the list of cut points into the smbinning.custom() function. As a result, we are able to achieve a set of monotonic bins with similar frequencies regardless of the so-called “statistical significance”, which is a premature step for the variable transformation in my mind.

```monobin <- function(data, y, x) {
d1 <- data[c(y, x)]
n <- min(20, nrow(unique(d1[x])))
repeat {
d1\$bin <- Hmisc::cut2(d1[, x], g = n)
d2 <- aggregate(d1[-3], d1[3], mean)
c <- cor(d2[-1], method = "spearman")
if(abs(c[1, 2]) == 1 | n == 2) break
n <- n - 1
}
d3 <- aggregate(d1[-3], d1[3], max)
cuts <- d3[-length(d3[, 3]), 3]
return(smbinning::smbinning.custom(d1, y, x, cuts))
}
```

Below are a couple comparisons between the generic smbinning() and the home-brew monobin() functions with the use of a toy data.

In the first example, we applied the smbinning() function to a variable named “rev_util”. As shown in the highlighted rows in the column “BadRate”, the binning outcome is not monotonic.

```  Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds LnOdds     WoE     IV
1     <= 0    965     716    249       965        716       249 0.1653   0.7420  0.2580  2.8755 1.0562 -0.2997 0.0162
2     <= 5    522     496     26      1487       1212       275 0.0894   0.9502  0.0498 19.0769 2.9485  1.5925 0.1356
3    <= 24   1166    1027    139      2653       2239       414 0.1998   0.8808  0.1192  7.3885 1.9999  0.6440 0.0677
4    <= 40    779     651    128      3432       2890       542 0.1335   0.8357  0.1643  5.0859 1.6265  0.2705 0.0090
5    <= 73   1188     932    256      4620       3822       798 0.2035   0.7845  0.2155  3.6406 1.2922 -0.0638 0.0008
6    <= 96    684     482    202      5304       4304      1000 0.1172   0.7047  0.2953  2.3861 0.8697 -0.4863 0.0316
7     > 96    533     337    196      5837       4641      1196 0.0913   0.6323  0.3677  1.7194 0.5420 -0.8140 0.0743
8  Missing      0       0      0      5837       4641      1196 0.0000      NaN     NaN     NaN    NaN     NaN    NaN
9    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804 1.3559  0.0000 0.3352
```

Next, we did the same with the monobin() function. As shown below, the algorithm provided a monotonic binning at the cost of granularity. Albeit coarse, the result is directionally correct with no inversion.

```  Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate   Odds LnOdds     WoE     IV
1    <= 30   2962    2495    467      2962       2495       467 0.5075   0.8423  0.1577 5.3426 1.6757  0.3198 0.0471
2     > 30   2875    2146    729      5837       4641      1196 0.4925   0.7464  0.2536 2.9438 1.0797 -0.2763 0.0407
3  Missing      0       0      0      5837       4641      1196 0.0000      NaN     NaN    NaN    NaN     NaN    NaN
4    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049 3.8804 1.3559  0.0000 0.0878
```

In the second example, we applied the smbinning() function to a variable named “bureau_score”. As shown in the highlighted rows, the frequencies in these two bins are much higher than the rest.

```  Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds LnOdds     WoE     IV
1   <= 605    324     167    157       324        167       157 0.0555   0.5154  0.4846  1.0637 0.0617 -1.2942 0.1233
2   <= 632    468     279    189       792        446       346 0.0802   0.5962  0.4038  1.4762 0.3895 -0.9665 0.0946
3   <= 662    896     608    288      1688       1054       634 0.1535   0.6786  0.3214  2.1111 0.7472 -0.6087 0.0668
4   <= 699   1271    1016    255      2959       2070       889 0.2177   0.7994  0.2006  3.9843 1.3824  0.0264 0.0002
5   <= 717    680     586     94      3639       2656       983 0.1165   0.8618  0.1382  6.2340 1.8300  0.4741 0.0226
6   <= 761   1118    1033     85      4757       3689      1068 0.1915   0.9240  0.0760 12.1529 2.4976  1.1416 0.1730
7    > 761    765     742     23      5522       4431      1091 0.1311   0.9699  0.0301 32.2609 3.4739  2.1179 0.2979
8  Missing    315     210    105      5837       4641      1196 0.0540   0.6667  0.3333  2.0000 0.6931 -0.6628 0.0282
9    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804 1.3559  0.0000 0.8066
```

With the monobin() function applied to the same variable, we were able to get a set of more granular bins with similar frequencies.

```   Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds LnOdds     WoE     IV
1    <= 617    513     284    229       513        284       229 0.0879   0.5536  0.4464  1.2402 0.2153 -1.1407 0.1486
2    <= 642    515     317    198      1028        601       427 0.0882   0.6155  0.3845  1.6010 0.4706 -0.8853 0.0861
3    <= 657    512     349    163      1540        950       590 0.0877   0.6816  0.3184  2.1411 0.7613 -0.5946 0.0363
4    <= 672    487     371    116      2027       1321       706 0.0834   0.7618  0.2382  3.1983 1.1626 -0.1933 0.0033
5    <= 685    494     396     98      2521       1717       804 0.0846   0.8016  0.1984  4.0408 1.3964  0.0405 0.0001
6    <= 701    521     428     93      3042       2145       897 0.0893   0.8215  0.1785  4.6022 1.5265  0.1706 0.0025
7    <= 714    487     418     69      3529       2563       966 0.0834   0.8583  0.1417  6.0580 1.8014  0.4454 0.0144
8    <= 730    489     441     48      4018       3004      1014 0.0838   0.9018  0.0982  9.1875 2.2178  0.8619 0.0473
9    <= 751    513     476     37      4531       3480      1051 0.0879   0.9279  0.0721 12.8649 2.5545  1.1986 0.0859
10   <= 775    492     465     27      5023       3945      1078 0.0843   0.9451  0.0549 17.2222 2.8462  1.4903 0.1157
11    > 775    499     486     13      5522       4431      1091 0.0855   0.9739  0.0261 37.3846 3.6213  2.2653 0.2126
12  Missing    315     210    105      5837       4641      1196 0.0540   0.6667  0.3333  2.0000 0.6931 -0.6628 0.0282
13    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804 1.3559  0.0000 0.7810
```

Written by statcompute

January 22, 2017 at 11:05 pm

## Autoencoder for Dimensionality Reduction

We often use ICA or PCA to extract features from the high-dimensional data. The autoencoder is another interesting algorithm to achieve the same purpose in the context of Deep Learning.

with the purpose of learning a function to approximate the input data itself such that F(X) = X, an autoencoder consists of two parts, namely encoder and decoder. While the encoder aims to compress the original input data into a low-dimensional representation, the decoder tries to reconstruct the original input data based on the low-dimension representation generated by the encoder. As a result, the autoencoder has been widely used to remove the data noise as well to reduce the data dimension.

First of all, we will show the basic structure of an autoencoder with 1-layer encoder and 1-layer decoder, as below. In the example, we will compress the input data with 10 columns into a compressed on with 3 columns.

```from pandas import read_csv, DataFrame
from numpy.random import seed
from sklearn.preprocessing import minmax_scale
from sklearn.model_selection import train_test_split
from keras.layers import Input, Dense
from keras.models import Model

Y = df[df.CARDHLDR == 1].DEFAULTS
X = df[df.CARDHLDR == 1].ix[:, 2:12]
# SCALE EACH FEATURE INTO [0, 1] RANGE
sX = minmax_scale(X, axis = 0)
ncol = sX.shape[1]
X_train, X_test, Y_train, Y_test = train_test_split(sX, Y, train_size = 0.5, random_state = seed(2017))

### AN EXAMPLE OF SIMPLE AUTOENCODER ###
# InputLayer (None, 10)
#      Dense (None, 5)
#      Dense (None, 10)

input_dim = Input(shape = (ncol, ))
# DEFINE THE DIMENSION OF ENCODER ASSUMED 3
encoding_dim = 3
# DEFINE THE ENCODER LAYER
encoded = Dense(encoding_dim, activation = 'relu')(input_dim)
# DEFINE THE DECODER LAYER
decoded = Dense(ncol, activation = 'sigmoid')(encoded)
# COMBINE ENCODER AND DECODER INTO AN AUTOENCODER MODEL
autoencoder = Model(input = input_dim, output = decoded)
# CONFIGURE AND TRAIN THE AUTOENCODER
autoencoder.fit(X_train, X_train, nb_epoch = 50, batch_size = 100, shuffle = True, validation_data = (X_test, X_test))
# THE ENCODER TO EXTRACT THE REDUCED DIMENSION FROM THE ABOVE AUTOENCODER
encoder = Model(input = input_dim, output = encoded)
encoded_input = Input(shape = (encoding_dim, ))
encoded_out = encoder.predict(X_test)
encoded_out[0:2]
#array([[ 0.        ,  1.26510417,  1.62803197],
#       [ 2.32508397,  0.99735016,  2.06461048]], dtype=float32)
```

In the next example, we will relax the constraint of layers and employ a stack of layers to achievement the same purpose as above.

```### AN EXAMPLE OF DEEP AUTOENCODER WITH MULTIPLE LAYERS
# InputLayer (None, 10)
#      Dense (None, 20)
#      Dense (None, 10)
#      Dense (None, 5)
#      Dense (None, 3)
#      Dense (None, 5)
#      Dense (None, 10)
#      Dense (None, 20)
#      Dense (None, 10)

input_dim = Input(shape = (ncol, ))
# DEFINE THE DIMENSION OF ENCODER ASSUMED 3
encoding_dim = 3
# DEFINE THE ENCODER LAYERS
encoded1 = Dense(20, activation = 'relu')(input_dim)
encoded2 = Dense(10, activation = 'relu')(encoded1)
encoded3 = Dense(5, activation = 'relu')(encoded2)
encoded4 = Dense(encoding_dim, activation = 'relu')(encoded3)
# DEFINE THE DECODER LAYERS
decoded1 = Dense(5, activation = 'relu')(encoded4)
decoded2 = Dense(10, activation = 'relu')(decoded1)
decoded3 = Dense(20, activation = 'relu')(decoded2)
decoded4 = Dense(ncol, activation = 'sigmoid')(decoded3)
# COMBINE ENCODER AND DECODER INTO AN AUTOENCODER MODEL
autoencoder = Model(input = input_dim, output = decoded4)
# CONFIGURE AND TRAIN THE AUTOENCODER
autoencoder.fit(X_train, X_train, nb_epoch = 100, batch_size = 100, shuffle = True, validation_data = (X_test, X_test))
# THE ENCODER TO EXTRACT THE REDUCED DIMENSION FROM THE ABOVE AUTOENCODER
encoder = Model(input = input_dim, output = encoded4)
encoded_input = Input(shape = (encoding_dim, ))
encoded_out = encoder.predict(X_test)
encoded_out[0:2]
#array([[ 3.74947715,  0.        ,  3.22947764],
#       [ 3.93903661,  0.17448257,  1.86618853]], dtype=float32)
```

Written by statcompute

January 15, 2017 at 6:19 pm

## An Example of Merge Layer in Keras

The power of a DNN does not only come from its depth but also come from its flexibility of accommodating complex network structures. For instance, the DNN shown below consists of two branches, the left with 4 inputs and the right with 6 inputs. In addition, the right branch shows a more complicated structure than the left.

```                                                InputLayer (None, 6)
Dense (None, 6)
BatchNormalization (None, 6)
Dense (None, 6)
InputLayer (None, 4)           BatchNormalization (None, 6)
Dense (None, 4)                        Dense (None, 6)
BatchNormalization (None, 4)           BatchNormalization (None, 6)
\____________________________________/
|
Merge (None, 10)
Dense (None, 1)
```

To create a DNN as the above, both left and right branches are defined separately with corresponding inputs and layers. In the line 29, both branches would be combined with a MERGE layer. There are multiple benefits of such merged DNNs. For instance, the DNN has the flexibility to handle various inputs differently. In addition, new features can be added conveniently without messing around with the existing network structure.

```from pandas import read_csv, DataFrame
from numpy.random import seed
from sklearn.preprocessing import scale
from keras.models import Sequential
from keras.constraints import maxnorm
from keras.optimizers import SGD
from keras.layers import Dense, Merge
from keras.layers.normalization import BatchNormalization
from keras_diagram import ascii

Y = df[df.CARDHLDR == 1].DEFAULTS
X1 = scale(df[df.CARDHLDR == 1][["MAJORDRG", "MINORDRG", "OWNRENT", "SELFEMPL"]])
X2 = scale(df[df.CARDHLDR == 1][["AGE", "ACADMOS", "ADEPCNT", "INCPER", "EXP_INC", "INCOME"]])

branch1 = Sequential()
branch1.add(Dense(X1.shape[1], input_shape = (X1.shape[1],), init = 'normal', activation = 'relu'))

branch2 = Sequential()
branch2.add(Dense(X2.shape[1], input_shape =  (X2.shape[1],), init = 'normal', activation = 'relu'))
branch2.add(Dense(X2.shape[1], init = 'normal', activation = 'relu', W_constraint = maxnorm(5)))
branch2.add(Dense(X2.shape[1], init = 'normal', activation = 'relu', W_constraint = maxnorm(5)))

model = Sequential()
model.add(Merge([branch1, branch2], mode = 'concat'))
model.add(Dense(1, init = 'normal', activation = 'sigmoid'))
sgd = SGD(lr = 0.1, momentum = 0.9, decay = 0, nesterov = False)
model.compile(loss = 'binary_crossentropy', optimizer = sgd, metrics = ['accuracy'])
seed(2017)
model.fit([X1, X2], Y.values, batch_size = 2000, nb_epoch = 100, verbose = 1)
```

Written by statcompute

January 8, 2017 at 4:42 pm

Tagged with , ,

## Dropout Regularization in Deep Neural Networks

The deep neural network (DNN) is a very powerful neural work with multiple hidden layers and is able to capture the highly complex relationship between the response and predictors. However, it is prone to the over-fitting due to a large number of parameters that makes the regularization crucial for DNNs. In the paper (https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf), an interesting regularization approach, e.g. dropout, was proposed with a simple and elegant idea. Basically, it suppresses the complexity of DNNs by randomly dropping units in both input and hidden layers.

Below is an example showing how to tune the hyper-parameter of dropout rates with Keras library in Python. Because of the long computing time required by the dropout, the parallelism is used to speed up the process.

```from pandas import read_csv, DataFrame
from numpy.random import seed
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from keras.models import Sequential
from keras.constraints import maxnorm
from keras.optimizers import SGD
from keras.layers import Dense, Dropout
from multiprocessing import Pool, cpu_count
from itertools import product
from parmap import starmap

Y = df[df.CARDHLDR == 1].DEFAULT
X = df[df.CARDHLDR == 1][['AGE', 'ADEPCNT', 'MAJORDRG', 'MINORDRG', 'INCOME', 'OWNRENT', 'SELFEMPL']]
sX = scale(X)
ncol = sX.shape[1]
x_train, x_test, y_train, y_test = train_test_split(sX, Y, train_size = 0.5, random_state = seed(2017))

def tune_dropout(rate1, rate2):
net = Sequential()
## DROPOUT AT THE INPUT LAYER
net.add(Dropout(rate1, input_shape = (ncol,)))
## DROPOUT AT THE 1ST HIDDEN LAYER
net.add(Dense(ncol, init = 'normal', activation = 'relu', W_constraint = maxnorm(4)))
## DROPOUT AT THE 2ND HIDDER LAYER
net.add(Dense(ncol, init = 'normal', activation = 'relu', W_constraint = maxnorm(4)))
net.add(Dense(1, init = 'normal', activation = 'sigmoid'))
sgd = SGD(lr = 0.1, momentum = 0.9, decay = 0, nesterov = False)
net.compile(loss='binary_crossentropy', optimizer = sgd, metrics = ['accuracy'])
net.fit(x_train, y_train, batch_size = 200, nb_epoch = 50, verbose = 0)
print rate1, rate2, "{:6.4f}".format(roc_auc_score(y_test, net.predict(x_test)))

input_dp = [0.1, 0.2, 0.3]
hidden_dp = [0.2, 0.3, 0.4, 0.5]
parms = [i for i in product(input_dp, hidden_dp)]

seed(2017)
starmap(tune_dropout, parms, pool = Pool(processes = cpu_count()))

```

As shown in the output below, the optimal dropout rate appears to be 0.2 incidentally for both input and hidden layers.

```0.1 0.2 0.6354
0.1 0.4 0.6336
0.1 0.3 0.6389
0.1 0.5 0.6378
0.2 0.2 0.6419
0.2 0.4 0.6385
0.2 0.3 0.6366
0.2 0.5 0.6359
0.3 0.4 0.6313
0.3 0.2 0.6350
0.3 0.3 0.6346
0.3 0.5 0.6343
```

Written by statcompute

January 2, 2017 at 1:09 am