Oct2Py and GNU Octave

GNU Octave (www.gnu.org/software/octave) is a matrix-based computing language largely compatible with Matlab and has been gaining popularity in machine learning since Dr. Ng's open course (www.coursera.org/course/ml). It is not a general-purpose programming language and is primarily designed for numerical computation. As a result, it is usually not convenient to do dirty work such as data munging within octave. In this case, python becomes a good complement.

Oct2Py is a python interface to octave, which allows users to run octave commands and m-files dynamically within python and to exchange data between the two computing systems seamlessly.
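As a quick illustration of the data exchange, the minimal sketch below pushes a numpy array into the octave workspace, evaluates an expression there, and pulls the result back into python. The method names put() / get() / run() match the oct2py release used in this post; newer releases rename them to push() / pull() / eval().

import numpy as np
import oct2py as op

octave = op.Oct2Py()
a = np.array([[1.0, 2.0], [3.0, 4.0]])
# copy the numpy array into the octave workspace as variable 'a'
octave.put('a', a)
# evaluate an octave command against it
octave.run('b = a * 2;')
# bring the result back into python as a numpy array
b = octave.get('b')
print b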

In the example below, I will show how to do simple data wrangling with the oct2py package in python, convert a numpy array to an octave binary *.mat file, load the data from the *.mat file into octave, and then estimate a logistic regression through the IRLS (iteratively reweighted least squares) method. As shown in the octave session below, the vectorized implementation makes octave very efficient for prototyping machine learning algorithms, logistic regression in this case.
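For reference, each IRLS iteration is simply a weighted least squares step: with p = 1 ./ (1 + exp(-x * b)) and w = diag(p .* (1 - p)), the working response is z = x * b + w \ (y - p) and the coefficient update is b = (x' * w * x) \ (x' * w) * z, which is exactly the loop implemented in the octave session below.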

PYTHON SESSION

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: import oct2py as op

In [4]: # READ DATA FROM CSV

In [5]: df = pd.read_csv('credit_count.csv')

In [6]: # SCRUB RAW DATA

In [7]: df2 = df.ix[df.CARDHLDR == 1, ['AGE', 'ADEPCNT', 'MAJORDRG', 'MINORDRG', 'INCOME', 'OWNRENT', 'BAD']]

In [8]: # CONVERT DATAFRAME TO NUMPY ARRAY

In [9]: arr = np.array(df2, dtype = np.double)

In [10]: # INVOKE AN OCTAVE SESSION

In [11]: octave = op.Oct2Py()

In [12]: # PUT NUMPY ARRAY INTO OCTAVE WORKSPACE

In [13]: octave.put('m', arr)

In [14]: print octave.run('whos')
Variables in the current scope:

   Attr Name        Size                     Bytes  Class
   ==== ====        ====                     =====  =====
        ans         1x70                       763  cell
        m       10499x7                     587944  double

Total is 73563 elements using 588707 bytes

In [15]: # SAVE MAT DATA TO BE USED IN OCTAVE

In [16]: octave.run('save data.mat m')
Out[16]: ''

OCTAVE SESSION

octave:1> % LOAD DATA
octave:1> load data.mat
octave:2> whos
Variables in the current scope:

   Attr Name        Size                     Bytes  Class
   ==== ====        ====                     =====  ===== 
        ans         1x70                       763  cell
        m       10499x7                     587944  double

Total is 73563 elements using 588707 bytes

octave:3> x = [ones(size(m, 1), 1) m(:, 1:6)];
octave:4> y = m(:, 7);
octave:5> % SET INITIAL VALUES OF PARAMETERS
octave:5> b = zeros(size(x, 2), 1);
octave:6> % MAX ITERATIONS
octave:6> maxiter = 100;
octave:7> % SOLVE FOR BETA THROUGH IRLS
octave:7> for i = 1:maxiter
>   xb = x * b;
>   p = 1 ./ (1 + exp(-xb));
>   w = diag(p .* (1 - p));
>   z = x * b + w \ (y - p);
>   b = (x' * w * x) \ (x' * w) * z;
>   new_ll = y' * log(p) + (1 - y)' * log(1 - p);
>   disp(sprintf('log likelihood for iteration %0.0f: %0.8f', i, new_ll));
>   if i == 1
>     old_ll = new_ll;
>   elseif new_ll > old_ll 
>     old_ll = new_ll;
>   else
>     break
>   endif
> end
log likelihood for iteration 1: -7277.35224870
log likelihood for iteration 2: -3460.32494800
log likelihood for iteration 3: -3202.91405878
log likelihood for iteration 4: -3176.95671906
log likelihood for iteration 5: -3175.85136112
log likelihood for iteration 6: -3175.84836320
log likelihood for iteration 7: -3175.84836317
log likelihood for iteration 8: -3175.84836317
log likelihood for iteration 9: -3175.84836317
octave:8> % CALCULATE STANDARD ERRORS AND Z-VALUES
octave:8> stder = sqrt(diag(inv(x' * w * x)));
octave:9> z = b ./ stder;
octave:10> result = [b, stder, z];
octave:11> format short g;
octave:12> disp(result);
    -0.96483     0.13317     -7.2453
  -0.0094622   0.0038629     -2.4495
     0.13384     0.02875      4.6554
     0.21028    0.069724      3.0159
     0.20073    0.048048      4.1776
  -0.0004632  4.1893e-05     -11.057
    -0.22627    0.077392     -2.9237

Generalized Regression Neural Networks and the Implementation with Matlab

Generalized Regression Neural Networks (GRNN) is a special case of Radial Basis Networks (RBN). Compared with its competitors, e.g. standard feedforward neural networks, GRNN has several advantages. First of all, the structure of a GRNN is relatively simple and static, with 2 layers, namely the pattern and summation layers. Once the input goes through each unit in the pattern layer, the relationship between the input and the response is “memorized” and stored in the unit. As a result, the number of units in the pattern layer is equal to the number of observations in the training sample. In each pattern unit, a Gaussian PDF is applied to the network input such that

Theta = exp[-0.5 * (X - u)' * (X - u) / Sigma^2]

where Theta is the output from the pattern units, X is the input, u is the training vector stored in the unit, and Sigma is a positive constant known as the “spread” or “smooth parameter”. Once Theta is computed, it is passed to the summation layer to calculate Y|X = SUM(Y * Theta) / SUM(Theta), where Y|X is the prediction conditional on X and Y is the response in the training sample (a short numeric sketch of these two formulas follows the list below). In addition to the above, other benefits of GRNN claimed by Specht (1991) include:

1) The network is able to learn from the training data by “1-pass” training in a fraction of the time it takes to train standard feedforward networks.

2) The spread, Sigma, is the only free parameter in the network, which can often be identified by V-fold or split-sample cross validation.

3) Unlike standard feedforward networks, GRNN estimation is always able to converge to a global solution and won’t be trapped in a local minimum.
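To make the two formulas above concrete, here is a minimal numeric sketch of a GRNN prediction for a single input vector, written in python with numpy purely to illustrate the arithmetic. The function name grnn_predict and the toy data are made up for this example, and it is not Matlab's newgrnn(), which parameterizes the spread differently.

import numpy as np

def grnn_predict(x, train_X, train_Y, sigma):
    # squared distance between the input and each stored training vector u
    d2 = np.sum((train_X - x) ** 2, axis = 1)
    # pattern layer: Theta = exp[-0.5 * (X - u)' * (X - u) / Sigma^2]
    theta = np.exp(-0.5 * d2 / sigma ** 2)
    # summation layer: Y|X = SUM(Y * Theta) / SUM(Theta)
    return np.sum(train_Y * theta) / np.sum(theta)

# toy training sample: 5 observations with 2 inputs each
train_X = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [.5, .5]])
train_Y = np.array([0., 1., 1., 0., .5])
print grnn_predict(np.array([.9, .1]), train_X, train_Y, 0.5)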

With respect to the implementation of GRNN, Matlab might be considered the best computing engine, based on my limited experience, in terms of ease of use and speed. The demo below shows how to use matlab to develop a GRNN and to identify an optimal value of Sigma using split-sample cross validation.

load credit

Y = transpose(data(:, 2));
[n, m] = size(Y);
train_index = 2:2:m;

% SPLIT THE RESPONSE VECTOR INTO TRAINING AND TESTING
train_Y = Y(train_index);
test_Y = Y;
test_Y(train_index) = [];

% SPLIT X MATRIX INTO TRAINING AND TESTING
X = transpose(data(:, 3:10));
train_X = X(:, train_index);
test_X = X;
test_X(:, train_index) = [];

% STANDARDIZE X MATRIX IN TRAINING SET
[train_X2, map] = mapstd(train_X);

% STANDARDIZE X MATRIX IN TESTING SET
test_X2 = mapstd('apply', test_X, map);

% CHECK IF VARIANCE == 1
var(transpose(train_X2))
var(transpose(test_X2))

% TESTING DIFFERENT SPREAD OF RADIAL BASIS FUNCTION
j = 0;
for i = 1:0.02:2
  % TRAIN A GRNN    
  grnn = newgrnn(train_X2, train_Y, i);
  
  % CALCULATE THE PREDICTION FOR TESTING SET
  test_P = sim(grnn, test_X2);
  
  % COLLECT THE PERFORMANCE
  if j == 0
    spread = i;
    perf = sse(test_Y - test_P);    
  else
    spread = [spread, i];
    perf = [perf, sse(test_Y - test_P)];
  end;
  j = j + 1;
end;    

plot(spread, perf, '-ro');

As shown in the plot generated by the matlab program, the SSE reaches its minimum when Sigma is between 1.3 and 1.4, indicating a reasonable range for the optimal spread value.