Improve GRNN Efficiency by Weighting

In the post (, several advantages of General Regression Neural Network (GRNN) have been discussed. However, as pointed out by Specht, a major weakness of GRNN is the high computational cost required for a GRNN to generate predicted values based on a new input matrix due to its unique network structure, e.g. the number of neurons equal to the number of training samples.

For practical purposes, there is however no need to assign a neuron to each training sample, given the data duplication in real-world model development samples. Instead, a weighting scheme can be employed to reflect the frequency count of each unique training sample. A major benefit of the weight assignment is the ability to improve the efficiency of calculating predicted values, which depends on the extent of data duplicates. More attractively, the weighting application can bring up the possibility of using clustering or binning techniques to preprocess the training data so as to overcome the aforementioned weakness to a large degree.

Below is a demonstration showing the efficiency gain by using the weighting scheme in GRNN.

  1. First of all, I constructed a sample data with duplicates to double the size of the original Boston dataset. Based on the constructed data, a GRNN named “N1” was trained.
  2. Secondly, I generated another sample data by aggregating the above constructed data based on unique samples and calculating the weight of each unique data point based on its frequency. Based on the aggregated data, another GRNN named “N2” was also trained.

As shown in the output, predicted vectors from both “N1” and “N2” are identical. However, the computing time can be reduced to half by applying the weighting. All R functions used in the example can be found in

For people interested in the SAS implementation of GRNN, two SAS macros are also available in and


Yet Another R Package for General Regression Neural Network

Compared with other types of neural networks, General Regression Neural Network (Specht, 1991) is advantageous in several aspects.

  1. Being an universal approximation function, GRNN has only one tuning parameter to control the overall generalization
  2. The network structure of GRNN is surprisingly simple, with only one hidden layer and the number of neurons equal to the number of training samples.
  3. GRNN is always able to converge globally and won’t be trapped by local solutions.
  4. The training of GRNN is a simple 1-pass, regardless of the sample size, and doesn’t require time-consuming iterations.
  5. Since any projected value of GRNN is the weighted average of training samples, predictions are bounded by the observed range.

The grnn package (, which has not been updated since 2013, is the only implementation of GRNN on CRAN and was designed elegantly with a parsimonious set of functions and lots of opportunities for potential improvements.

The GRnnet project ( is my attempt to provide a R implementation of GRNN, with several enhancements.

  1. While the training function is very similar to learn() and smooth() in the grnn package. three functions were designed to provide GRNN projections. The grnn.predone() function generates one projected value based on an input vector. Both grnn.predict() and grnn.parpred() functions generate a vector of projected values based on an input matrix. The only difference is that grnn.parpred() runs in parallel and therefore can be 3 times faster than grnn.predict() on my 4-core workstation.
  2. While tuning the only hyper-parameter is the key in GRNN training, there are two functions in the GRnnet project to search for the optimal parameter through the n-fold cross validation, including grnn.cv_r2() for numeric outcomes and grnn.cv_auc() for binary outcomes.
  3. In grnn.predone() function, while the default projection is based on the Euclidean distance, there is an option to calculate the GRNN projection based on the Manhattan distance as well for the sake of computational simplicity (Specht, 1991).

In the banking industry, GRNN can be useful in several areas. First of all, it can be employed as the replacement of splines to approximate the term structure of interest rates. Secondly, like other neural networks, it can be used in Fraud Detection and Anti-Money Laundering given its flexibility. At last, in the credit risk modeling, it can also be used to develop performance benchmarks and rapid prototypes for scorecards or Expected Loss models due to the simplicity.

Monotonic Binning Driven by Decision Tree

After the development of MOB package (, I was asked by a couple users about the possibility of using the decision tree to drive the monotonic binning. Although I am not aware of any R package implementing the decision tree with the monotonic constraint, I did manage to find a solution based upon the decision tree.

The Rborist package is an implementation of the Random Forest that would enforce the monotonicity at the local level within each tree but not at the global level for the whole forest. However, with a few tweaks on the Rborist syntax, it is not difficult to convert the forest with many trees into the forest with a single tree. After all necessary adjustments, I finally ended up with a decision tree that can be used to drive the monotonic binning algorithm, as shown in the arb_bin() function below, and will consider adding it into the MOB package later.

Chunk Averaging of GLM

Chunk Average (CA) is an interesting concept proposed by Matloff in the chapter 13 of his book “Parallel Computing for Data Science”. The basic idea is to partition the entire model estimation sample into chunks and then to estimate a glm for each chunk. Under the i.i.d assumption, the CA estimator with the chunked data is asymptotically equivalent to the estimator with the full data. The possibility of converting the full model estimation with an excessively large dataset to the chunked estimation with small pieces is particularly attractive in real-world model developments where the model convergence could be challenging given the data size.

The ca_glm() function below is my attempt to implement the Chunk Averaging of GLM. As shown, CA estimations by various chunks are consistent with the estimation with the full data.

Latin Hypercube Sampling in Hyper-Parameter Optimization

In my previous post, I’ve shown the difference between the uniform pseudo random and the quasi random number generators in the hyper-parameter optimization of machine learning.

Latin Hypercube Sampling (LHS) is another interesting way to generate near-random sequences with a very simple idea. Let’s assume that we’d like to perform LHS for 10 data points in the 1-dimension data space. We first partition the whole data space into 10 equal intervals and then randomly select a data point from each interval. For the N-dimension LHS with N > 1, we just need to independently repeat the 1-dimension LHS for N times and then randomly combine these sequences into a list of N-tuples.

LHS is similar to the Uniform Random in the sense that the Uniform Random number is drawn within each equal-space interval. On the other hand, LHS covers the data space more evenly in a way similar to the Quasi Random, such as Sobol Sequence. A comparison below shows how each of three looks like in the 2-dimension data space.

unifm_2d <- function(n, seed) {
  return(replicate(2, runif(n)))

sobol_2d <- function(n, seed) {
  return(randtoolbox::sobol(n, dim = 2, scrambling = 3, seed = seed))

latin_2d <- function(n, seed) {
  return(lhs::randomLHS(n, k = 2))

par(mfrow = c(1, 3))
plot(latin_2d(100, 2019), main = "LATIN HYPERCUBE", xlab = '', ylab = '', cex = 2, col = "blue")
plot(sobol_2d(100, 2019), main = " SOBOL SEQUENCE", xlab = '', ylab = '', cex = 2, col = "red")
plot(unifm_2d(100, 2019), main = " UNIFORM RANDOM", xlab = '', ylab = '', cex = 2, col = "black")


In the example below, three types of random numbers are applied to the hyper-parameter optimization of General Regression Neural Network (GRNN) in the 1-dimension case. While both Latin Hypercube and Sobol Sequence generate similar averages of CV R-squares, the variance of CV R-squares for Latin Hypercube is much lower. With no surprise, the performance of simple Uniform Random remains the lowest, e.g. lower mean and higher variance.

data(Boston, package = "MASS")
df <- data.frame(y = Boston[, 14], scale(Boston[, -14]))
gn <- grnn::smooth(grnn::learn(df), sigma = 1)

grnn.predict <- function(nn, dt) {
  Reduce(c, lapply(seq(nrow(dt)), function(i) grnn::guess(nn, as.matrix(dt[i, ]))))

r2 <- function(act, pre) {
  return(1 - sum((pre - act) ^ 2) / sum((act - mean(act)) ^ 2))
} <- function(nn, sigmas, nfolds, seed = 2019) {
  dt <- nn$set
  fd <- caret::createFolds(seq(nrow(dt)), k = nfolds)
  cv <- function(s) {
    rs <- Reduce(rbind,
                        function(f) data.frame(Ya = nn$Ya[unlist(f)],
                                               Yp = grnn.predict(grnn::smooth(grnn::learn(nn$set[unlist(-f), ]), s),
                                                                 nn$set[unlist(f), -1]))))
    return(data.frame(sigma = s, R2 = r2(rs$Ya, rs$Yp)))
  cl <- parallel::makeCluster(min(nfolds, parallel::detectCores() - 1), type = "PSOCK")
  parallel::clusterExport(cl, c("fd", "nn", "grnn.predict", "r2"),  envir = environment())
  rq <- Reduce(rbind, parallel::parLapply(cl, sigmas, cv))
  return(rq[rq$R2 == max(rq$R2), ])

gen_unifm <- function(min = 0, max = 1, n, seed) {
  return(round(min + (max - min) * runif(n), 8))

gen_sobol <- function(min = 0, max = 1, n, seed) {
  return(round(min + (max - min) * randtoolbox::sobol(n, dim = 1, scrambling = 3, seed = seed), 8))

gen_latin <- function(min = 0, max = 1, n, seed) {
  return(round(min + (max - min) * c(lhs::randomLHS(n, k = 1)), 8))

nfold <- 10
nseed <- 10

sobol_out <- Reduce(rbind, lapply(seq(nseed), function(x), gen_sobol(0.1, 1, 10, x), nfold)))
latin_out <- Reduce(rbind, lapply(seq(nseed), function(x), gen_latin(0.1, 1, 10, x), nfold)))
unifm_out <- Reduce(rbind, lapply(seq(nseed), function(x), gen_unifm(0.1, 1, 10, x), nfold)))

out <- rbind(cbind(type = rep("LH", nseed), latin_out),
             cbind(type = rep("SS", nseed), sobol_out),
             cbind(type = rep("UR", nseed), unifm_out))

title <- "Latin Hypercube vs. Sobol Sequence vs. Uniform Random"
boxplot(R2 ~ type, data = out, main = title, ylab = "CV RSquare", xlab = "Sequence Type")

aggregate(R2 ~ type, data = out, function(x) round(c(avg = mean(x), var = var(x)), 8))
#type     R2.avg     R2.var
#  LH 0.82645414 0.00000033
#  SS 0.82632171 0.00000075
#  UR 0.82536693 0.00000432


Parallel R: Socket or Fork

In the R parallel package, there are two implementations of parallelism, e.g. fork and socket, with pros and cons.

For the fork, each parallel thread is a complete duplication of the master process with the shared environment, including objects or variables defined prior to the kickoff of parallel threads. Therefore, it runs fast. However, the major limitation is that the fork doesn’t work on the Windows system.

On the other hand, the socket works on all operating systems. Each thread runs separately without sharing objects or variables, which can only be passed from the master process explicitly. As a result, it runs slower due to the communication overhead.

Below is an example showing the performance difference between the fork and the socket. A self-defined filter function runs in parallel and exacts three rows out of 336,776 that are meeting criteria. As shown, the fork runs 40% faster than the socket.

df <- read.csv("data/nycflights")

ex <- expression(carrier == "UA" & origin == "EWR" & day == 1 &
# SELECT 3 ROWS OUT OF 336,776
#        year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time ...
# 56866  2013    11   1       NA        NA       NA        NA      UA            252    EWR  IAH       NA ...
# 84148  2013    12   1       NA        NA       NA        NA      UA            643    EWR  ORD       NA ...
# 251405 2013     7   1       NA        NA       NA        NA      UA            394    EWR  ORD       NA ...

parFilter <- function(df, ex, type) {
  cn <- parallel::detectCores() - 1
  cl <- parallel::makeCluster(cn, type = type)
  sp <- parallel::parLapply(cl, parallel::clusterSplit(cl, seq(nrow(df))),
                            function(c_) df[c_,])
  parallel::clusterExport(cl, "ex")
  id <- Reduce(c, parallel::parLapply(cl, sp,
                                      function(s_) with(s_, eval(ex))))

rbenchmark::benchmark(replications = 10, order = "elapsed", relative = "elapsed",
                        columns = c("test", "replications", "elapsed", "relative"),
  "  FORK" = parFilter(df, ex, "FORK"),
  "SOCKET" = parFilter(df, ex, "PSOCK")
#     test replications elapsed relative
# 1   FORK           10  59.396    1.000
# 2 SOCKET           10  83.856    1.412

WoE Transformation for Loss Given Default Models

In the intro section of my MOB package (, reasons and benefits of using WoE transformations in the context of logistic regressions with binary outcomes had been discussed. What’s more, the similar idea can be easily generalized to other statistical models in the credit risk area, such as LGD (Loss Given Default) models with fractional outcomes.

Measuring the ratio between net and gross charge-offs, LGD can take any value within the unity interval of [0, 1] with no unanimous consensus on the distributional assumption either academically or empirically. In the banking industry, a popular approach to model LGD is the use of Quasi-Binomial models that makes no assumption of any statistical distribution but merely specifies the conditional mean by a Logit link function. With the specification of Logit link, the idea of WoE transformations can be ported directly from logistic regressions to Quasi-Binomial models.

The example below shows how to perform WoE transformations through monotonic binning based upon the fractional outcome, e.g. LGD, by using the function qtl_lgd() (

As demonstrated in the outcome table, the average LGD increases along with the LTV (Loan-to-Value) and the WoE transformation of LTV is strictly linear with respect to the Logit of average LGD.