Improve GRNN Efficiency by Weighting

In the post (https://statcompute.wordpress.com/2019/07/14/yet-another-r-package-for-general-regression-neural-network), several advantages of the General Regression Neural Network (GRNN) were discussed. However, as pointed out by Specht, a major weakness of GRNN is the high computational cost required to generate predicted values for a new input matrix due to its unique network structure, i.e. the number of neurons equals the number of training samples.

For practical purposes, however, there is no need to assign a neuron to each training sample, given the data duplication in real-world model development samples. Instead, a weighting scheme can be employed to reflect the frequency count of each unique training sample. A major benefit of the weight assignment is the improved efficiency of calculating predicted values, with the gain depending on the extent of data duplication. More attractively, the weighting scheme opens up the possibility of using clustering or binning techniques to preprocess the training data so as to overcome the aforementioned weakness to a large degree.

Below is a demonstration showing the efficiency gain by using the weighting scheme in GRNN.

  1. First of all, I constructed a sample dataset with duplicates by doubling the size of the original Boston dataset. Based on the constructed data, a GRNN named “N1” was trained.
  2. Secondly, I generated another sample dataset by aggregating the above constructed data into unique samples and calculating the weight of each unique data point based on its frequency. Based on the aggregated data, another GRNN named “N2” was trained, as sketched below.
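
Below is a minimal sketch of how the two training samples could be constructed, assuming only base R and the Boston data from the MASS package; the exact weighting interface of grnn.fit() in the linked grnnet.R is not shown here and may differ.

# double the original Boston data to create duplicates (assumed setup)
data(Boston, package = "MASS")
d1 <- rbind(Boston, Boston)
# collapse duplicates into unique samples with a frequency count for each
d2 <- aggregate(cnt ~ ., data = cbind(d1, cnt = 1), FUN = sum)
# d1 would feed the unweighted "N1"; d2, with d2$cnt as weights, the weighted "N2"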

As shown in the output, the predicted vectors from “N1” and “N2” are identical. However, the computing time can be reduced by half by applying the weighting. All R functions used in the example can be found at https://github.com/statcompute/GRnnet/blob/master/code/grnnet.R.

For people interested in the SAS implementation of GRNN, two SAS macros are also available in https://github.com/statcompute/GRnnet/blob/master/code/grnn_learn.SAS and https://github.com/statcompute/GRnnet/blob/master/code/grnn_pred.SAS.


Bayesian Optimization for Hyper-Parameter

In the past several weeks, I spent a tremendous amount of time reading the literature on automatic parameter tuning in the context of Machine Learning (ML), most of which can be classified into two major categories, namely search and optimization. Search mechanisms, such as grid search, random search, and the Sobol sequence, can be somewhat computationally expensive. However, they are extremely easy to implement and parallelize on a multi-core PC, as shown in https://statcompute.wordpress.com/2019/02/03/sobol-sequence-vs-uniform-random-in-hyper-parameter-optimization. On the other hand, optimization algorithms, especially gradient-free optimizers such as the Nelder–Mead simplex and particle swarm, are often able to quickly locate close-to-optimal solutions in cases where the global optimum is neither feasible nor necessary, as shown in https://statcompute.wordpress.com/2019/02/10/direct-optimization-of-hyper-parameter and https://statcompute.wordpress.com/2019/02/23/gradient-free-optimization-for-glmnet-parameters.

In the example below, another interesting approach, namely Bayesian optimization (https://arxiv.org/abs/1206.2944), is demonstrated and compared with CMA-ES (https://www.researchgate.net/publication/227050324_The_CMA_Evolution_Strategy_A_Comparing_Review), another popular gradient-free optimizer based on the evolution strategy. As shown in the result, the output from Bayesian optimization is closer to the ones from the Nelder–Mead simplex and particle swarm. What’s more, Bayesian optimization is more consistent than CMA-ES across multiple trials in the experiment.
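
As a minimal sketch (not the original experiment), the Bayesian side of such a comparison could look like the snippet below, reusing the net object and the grnn.cv() function from the Direct Optimization post that follows as an assumed objective; the CMA-ES counterpart could be run analogously with an evolution-strategy package.

library(rBayesianOptimization)
# assumed objective: cross-validation R^2 of the GRNN for a given sigma
bo_obj <- function(sigma) list(Score = grnn.cv(net, sigma, 4, 2019)$R2, Pred = 0)
bo <- BayesianOptimization(bo_obj, bounds = list(sigma = c(0.1, 1.0)),
                           init_points = 5, n_iter = 10, acq = "ucb", verbose = FALSE)
bo$Best_Par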

Direct Optimization of Hyper-Parameter

In the previous post (https://statcompute.wordpress.com/2019/02/03/sobol-sequence-vs-uniform-random-in-hyper-parameter-optimization), it was shown how to identify the optimal hyper-parameter of a General Regression Neural Network by using a Sobol sequence and a uniform random generator, respectively, together with N-fold cross-validation. While the Sobol sequence yields a slightly better performance, outcomes from both approaches are very similar, as shown below based upon five trials with 20 samples in each. Both approaches can be generalized from one-dimensional to multi-dimensional domains, e.g. for boosting or deep learning models.

net <- grnn.fit(scale(Boston[, -14]), Boston[, 14], sigma = 1)
                        
sb_out <- Reduce(rbind, Map(function(x) grnn.cv(net, gen_sobol(0.1, 1.0, 20, x), 4, 2019), seq(1, 5)))

uf_out <- Reduce(rbind, Map(function(x) grnn.cv(net, gen_unifm(0.1, 1.0, 20, x), 4, 2019), seq(1, 5)))

Map(function(x) x[x$R2 == max(x$R2), ], list(sobol = sb_out, uniform = uf_out))
# $sobol
#  sigma        R2
# 0.5568 0.8019342
# $uniform
#  sigma        R2
# 0.5608 0.8019327

Other than the random search, another way to locate the optimal hyper-parameter is to apply general optimization routines. As shown in the demonstration below, we first need to define an objective function, e.g. grnn.optim(), to maximize the cross-validation R^2. In addition, depending on the optimization algorithm, upper and lower bounds of the parameter to be optimized should also be provided. Three optimization algorithms are employed in the example, including unconstrained non-linear optimization, particle swarm optimization, and Nelder–Mead simplex optimization, all showing outcomes comparable to the ones achieved by the random search.
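
Since the grnn.optim() wrapper itself is not listed here, below is a minimal sketch of the particle swarm variant with the pso package, using grnn.cv() from the code above as an assumed objective; psoptim() minimizes, so the cross-validation R^2 is negated.

library(pso)
# assumed objective: negative cross-validation R^2 for a single sigma
obj <- function(s) -grnn.cv(net, s, 4, 2019)$R2
# particle swarm search of sigma within the same bounds used by the random search
ps <- psoptim(par = 0.5, fn = obj, lower = 0.1, upper = 1.0, control = list(maxit = 50))
ps$par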

R Interfaces to Python Keras Package

Keras is a popular Python package for prototyping deep neural networks with multiple backends, including TensorFlow, CNTK, and Theano. Currently, there are two R interfaces that allow us to use Keras from R through the reticulate package. While the keras R package provides a flexible and feature-rich API, the kerasR package is more convenient and computationally efficient. For instance, in the example below mimicking the Python code shown in https://statcompute.wordpress.com/2017/01/02/dropout-regularization-in-deep-neural-networks, the kerasR package is at least 10% faster than the keras package in terms of computing time.


df <- read.csv("credit_count.txt")
Y <- matrix(df[df$CARDHLDR == 1, ]$DEFAULT)
X <- scale(df[df$CARDHLDR == 1, ][3:14])
set.seed(2018)
rows <- sample(1:nrow(Y), nrow(Y) - 2000)
Y1 <- Y[rows, ]
Y2 <- Y[-rows, ]
X1 <- X[rows, ]
X2 <- X[-rows, ]

### USE KERAS PACKAGE (https://keras.rstudio.com) ###

library(keras)
dnn1 <- keras_model_sequential() %>% 
  ### DEFINE THE INPUT LAYER ###
  layer_dense(units = 50, activation = 'relu', input_shape = ncol(X), kernel_constraint = constraint_maxnorm(4)) %>% 
  layer_dropout(rate = 0.2, seed = 1) %>% 
  ### DEFINE THE 1ST HIDDEN LAYER ###
  layer_dense(units = 20, activation = 'relu', kernel_constraint = constraint_maxnorm(4)) %>% 
  layer_dropout(rate = 0.2, seed = 1) %>% 
  ### DEFINE THE 2ND HIDDEN LAYER ###
  layer_dense(units = 20, activation = 'relu', kernel_constraint = constraint_maxnorm(4)) %>% 
  layer_dropout(rate = 0.2, seed = 1) %>% 
  layer_dense(units = 1, activation = 'sigmoid') %>% 
  compile(loss = 'binary_crossentropy', optimizer = 'sgd', metrics = c('accuracy'))

dnn1 %>% fit(X1, Y1, batch_size = 50, epochs = 20, verbose = 0, validation_split = 0.3)
pROC::roc(as.numeric(Y2), as.numeric(predict_proba(dnn1, X2)))

### USE KERASR PACKAGE (https://github.com/statsmaths/kerasR) ###

library(kerasR)
dnn2 <- Sequential()
### DEFINE THE INPUT LAYER ###
dnn2$add(Dense(units = 50, input_shape = ncol(X), activation = 'relu', kernel_constraint = max_norm(4)))
dnn2$add(Dropout(rate = 0.2, seed = 1))
### DEFINE THE 1ST HIDDEN LAYER ###
dnn2$add(Dense(units = 20, activation = 'relu', kernel_constraint = max_norm(4)))
dnn2$add(Dropout(rate = 0.2, seed = 1))
### DEFINE THE 2ND HIDDEN LAYER ###
dnn2$add(Dense(units = 20, activation = 'relu', kernel_constraint = max_norm(4)))
dnn2$add(Dropout(rate = 0.2, seed = 1))
dnn2$add(Dense(units = 1, activation = 'sigmoid'))
keras_compile(dnn2,  loss = 'binary_crossentropy', optimizer = 'sgd', metrics = 'accuracy')

keras_fit(dnn2, X1, Y1, batch_size = 50, epochs = 20, verbose = 0, validation_split = 0.3)
pROC::roc(as.numeric(Y2), as.numeric(keras_predict_proba(dnn2, X2)))
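
The timing comparison itself is not shown above; a simple way to reproduce it would be to wrap both fitting calls in system.time(), as sketched below (an assumed setup, not the original benchmark).

### TIME THE keras FIT AND THE kerasR FIT UNDER THE SAME SETTINGS ###
system.time(dnn1 %>% fit(X1, Y1, batch_size = 50, epochs = 20, verbose = 0, validation_split = 0.3))
system.time(keras_fit(dnn2, X1, Y1, batch_size = 50, epochs = 20, verbose = 0, validation_split = 0.3))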

DART: Dropout Regularization in Boosting Ensembles

The dropout approach developed by Hinton has been widely employed in deep learning to prevent deep neural networks from overfitting, as shown in https://statcompute.wordpress.com/2017/01/02/dropout-regularization-in-deep-neural-networks.

In the paper http://proceedings.mlr.press/v38/korlakaivinayak15.pdf, dropout is also used to address overfitting in boosting tree ensembles, e.g. MART, caused by the so-called “over-specialization”. In particular, while the first few trees added at the beginning of an ensemble dominate the model performance, trees added later tend to improve the prediction only for a small subset of samples, which increases the risk of overfitting. The idea of DART is to build an ensemble by randomly dropping boosting tree members. The percentage of dropouts determines the degree of regularization for the boosting tree ensemble.

Below is a demonstration showing the implementation of DART with the R xgboost package. First of all, after importing the data, we divided it into two pieces, one for training and the other for testing.

pkgs <- c('pROC', 'xgboost')
lapply(pkgs, require, character.only = T)
df1 <- read.csv("Downloads/credit_count.txt")
df2 <- df1[df1$CARDHLDR == 1, ]
set.seed(2017)
n <- nrow(df2)
sample <- sample(seq(n), size = n / 2, replace = FALSE)
train <- df2[sample, -1]
test <- df2[-sample, -1]

For comparison purposes, we first developed a boosting tree ensemble without dropouts, as shown below. For simplicity, all parameters were chosen heuristically. The max_depth is set to 3 because boosting tends to work well with so-called “weak” learners, e.g. shallow trees. While the ROC for the training set can be as high as 0.95, the ROC for the testing set is only 0.60 in our case, implying an overfitting issue.

mart.parm <- list(booster = "gbtree", nthread = 4, eta = 0.1, max_depth = 3, subsample = 1, eval_metric = "auc")
mart <- xgboost(data = as.matrix(train[, -1]), label = train[, 1], params = mart.parm, nrounds = 500, verbose = 0, seed = 2017)
pred1 <- predict(mart, as.matrix(train[, -1]))
pred2 <- predict(mart, as.matrix(test[, -1]))
roc(as.factor(train$DEFAULT), pred1)
# Area under the curve: 0.9459
roc(as.factor(test$DEFAULT), pred2)
# Area under the curve: 0.6046

With the same set of parameters, we refitted the ensemble with dropouts, i.e. DART. As shown below, by dropping 10% of tree members, the ROC for the testing set increases from 0.60 to 0.65. In addition, the performance disparity between the training and testing sets with DART decreases significantly.

dart.parm <- list(booster = "dart", rate_drop = 0.1, nthread = 4, eta = 0.1, max_depth = 3, subsample = 1, eval_metric = "auc")
dart <- xgboost(data = as.matrix(train[, -1]), label = train[, 1], params = dart.parm, nrounds = 500, verbose = 0, seed = 2017)
pred1 <- predict(dart, as.matrix(train[, -1]))
pred2 <- predict(dart, as.matrix(test[, -1]))
roc(as.factor(train$DEFAULT), pred1)
# Area under the curve: 0.7734
roc(as.factor(test$DEFAULT), pred2)
# Area under the curve: 0.6517

Besides rate_drop = 0.1, a wide range of dropout rates has also been tested, as sketched below. In most cases, DART outperforms its counterpart without the dropout regularization.
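
A minimal sketch of such a scan is given below, reusing the data split and parameters above; the specific grid of rates is an assumption rather than the exact set tested in the original experiment.

rates <- c(0.05, 0.1, 0.2, 0.3, 0.5)
aucs <- sapply(rates, function(r) {
  # same DART settings as above, varying only the dropout rate
  parm <- list(booster = "dart", rate_drop = r, nthread = 4, eta = 0.1, max_depth = 3, subsample = 1, eval_metric = "auc")
  fit <- xgboost(data = as.matrix(train[, -1]), label = train[, 1], params = parm, nrounds = 500, verbose = 0, seed = 2017)
  # out-of-sample ROC on the testing set
  as.numeric(roc(as.factor(test$DEFAULT), predict(fit, as.matrix(test[, -1])))$auc)
})
data.frame(rate_drop = rates, test_auc = aucs)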

Random Search for Optimal Parameters

Practices of manual search, grid search, or a combination of both have been successfully employed in machine learning to optimize hyper-parameters. However, in the arena of deep learning, both approaches might become impractical. For instance, the computing cost of a grid search for hyper-parameters in a multi-layer deep neural network (DNN) could be prohibitively high.

In light of the aforementioned hurdles, Bergstra and Bengio proposed the novel idea of random search in the paper http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf. In their study, it was found that random search is more efficient than grid search for hyper-parameter optimization in terms of computing cost.

In the example below, it is shown that both grid search and random search reach similar results in the SVM parameter optimization based on cross-validation.

import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV 
from sklearn.svm import SVC as svc 
from sklearn.metrics import make_scorer, roc_auc_score
from scipy import stats

# DATA PREPARATION
df = pd.read_csv("credit_count.txt")
y = df[df.CARDHLDR == 1].DEFAULT.values 
x = preprocessing.scale(df[df.CARDHLDR == 1].iloc[:, 2:12], axis = 0)

# DEFINE MODEL AND PERFORMANCE MEASURE
mdl = svc(probability = True, random_state = 1)
auc = make_scorer(roc_auc_score)

# GRID SEARCH FOR 20 COMBINATIONS OF PARAMETERS
grid_list = {"C": np.arange(2, 10, 2),
             "gamma": np.arange(0.1, 1, 0.2)}

grid_search = GridSearchCV(mdl, param_grid = grid_list, n_jobs = 4, cv = 3, scoring = auc) 
grid_search.fit(x, y) 
grid_search.cv_results_

# RANDOM SEARCH FOR 20 COMBINATIONS OF PARAMETERS
rand_list = {"C": stats.uniform(2, 10),
             "gamma": stats.uniform(0.1, 1)}
             
rand_search = RandomizedSearchCV(mdl, param_distributions = rand_list, n_iter = 20, n_jobs = 4, cv = 3, random_state = 2017, scoring = auc) 
rand_search.fit(x, y) 
rand_search.cv_results_

A Simple Convolutional Neural Network for The Binary Outcome

Since CNNs (Convolutional Neural Networks) have achieved tremendous success in various challenging applications, e.g. image or digit recognition, one might wonder how to employ CNNs in classification problems with binary outcomes.

Below is an example showing how to use a simple 1D convolutional neural network to predict credit card defaults.

### LOAD PACKAGES 
from numpy.random import seed
from pandas import read_csv, DataFrame
from sklearn.preprocessing import minmax_scale
from keras.layers.convolutional import Conv1D, MaxPooling1D
from keras.optimizers import SGD
from keras.models import Sequential
from keras.layers import Dense, Flatten

### PREPARE THE DATA 
df = read_csv("credit_count.txt")
Y = df[df.CARDHLDR == 1].DEFAULT
X = minmax_scale(df[df.CARDHLDR == 1].iloc[:, 2:12], axis = 0)
y_train = Y.values
x_train = X.reshape(X.shape[0], X.shape[1], 1)

### FIT A 1D CONVOLUTIONAL NEURAL NETWORK
seed(2017)
conv = Sequential()
conv.add(Conv1D(20, 4, input_shape = x_train.shape[1:3], activation = 'relu'))
conv.add(MaxPooling1D(2))
conv.add(Flatten())
conv.add(Dense(1, activation = 'sigmoid'))
sgd = SGD(lr = 0.1, momentum = 0.9, decay = 0, nesterov = False)
conv.compile(loss = 'binary_crossentropy', optimizer = sgd, metrics = ['accuracy'])
conv.fit(x_train, y_train, batch_size = 500, epochs = 100, verbose = 0)

Considering that the 1D case is a special case of 2D, we can also solve the same problem with a 2D convolutional neural network by changing the input shape, as shown below.

from numpy.random import seed
from pandas import read_csv, DataFrame
from sklearn.preprocessing import minmax_scale
from keras_diagram import ascii
from keras.layers.convolutional import Conv2D, MaxPooling2D
from keras.optimizers import SGD
from keras.models import Sequential
from keras.layers import Dense, Flatten

df = read_csv("credit_count.txt")
Y = df[df.CARDHLDR == 1].DEFAULT
X = minmax_scale(df[df.CARDHLDR == 1].iloc[:, 2:12], axis = 0)
y_train = Y.values
x_train = X.reshape(X.shape[0], 1, X.shape[1], 1)

seed(2017)
conv = Sequential()
conv.add(Conv2D(20, (1, 4), input_shape = x_train.shape[1:4], activation = 'relu'))
conv.add(MaxPooling2D((1, 2)))
conv.add(Flatten())
conv.add(Dense(1, activation = 'sigmoid'))
sgd = SGD(lr = 0.1, momentum = 0.9, decay = 0, nesterov = False)
conv.compile(loss = 'binary_crossentropy', optimizer = sgd, metrics = ['accuracy'])
conv.fit(x_train, y_train, batch_size = 500, epochs = 100, verbose = 0)