Oct2Py and GNU Octave

GNU Octave (www.gnu.org/software/octave) is a matrix-based computing language largely compatible with Matlab and has been gaining popularity in machine learning since Dr. Ng's open course (www.coursera.org/course/ml). It is not a general-purpose programming language and is primarily designed for numerical computation. As a result, it is usually not convenient to do dirty work such as data munging within octave. In this case, python becomes a good complement.

Oct2Py is a python interface to octave, which allows users to run octave commands and m-files dynamically within python and to exchange data between the two computing systems seamlessly.
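As a quick illustration of the data exchange, the minimal sketch below pushes a numpy array into the octave workspace, evaluates an expression there, and pulls the result back into python. The method names put() / get() / run() match the oct2py release used in this post; newer releases rename them to push() / pull() / eval().

import numpy as np
import oct2py as op

octave = op.Oct2Py()
a = np.array([[1.0, 2.0], [3.0, 4.0]])
# copy the numpy array into the octave workspace as variable 'a'
octave.put('a', a)
# evaluate an octave command against it
octave.run('b = a * 2;')
# bring the result back into python as a numpy array
b = octave.get('b')
print b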

In the example below, I will show how to do simple data wrangling with the oct2py package in python, convert a numpy array to an octave binary *.mat file, load the data from the *.mat file into octave, and then estimate a logistic regression through the IRLS (iteratively reweighted least squares) method. As shown in the octave session below, the vectorized implementation makes octave very efficient for prototyping machine learning algorithms, logistic regression in this case.
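For reference, each IRLS iteration is simply a weighted least squares step: with p = 1 ./ (1 + exp(-x * b)) and w = diag(p .* (1 - p)), the working response is z = x * b + w \ (y - p) and the coefficient update is b = (x' * w * x) \ (x' * w) * z, which is exactly the loop implemented in the octave session below.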

PYTHON SESSION

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: import oct2py as op

In [4]: # READ DATA FROM CSV

In [5]: df = pd.read_csv('credit_count.csv')

In [6]: # SCRUB RAW DATA

In [7]: df2 = df.ix[df.CARDHLDR == 1, ['AGE', 'ADEPCNT', 'MAJORDRG', 'MINORDRG', 'INCOME', 'OWNRENT', 'BAD']]

In [8]: # CONVERT DATAFRAME TO NUMPY ARRAY

In [9]: arr = np.array(df2, dtype = np.double)

In [10]: # INVOKE AN OCTAVE SESSION

In [11]: octave = op.Oct2Py()

In [12]: # PUT NUMPY ARRAY INTO OCTAVE WORKSPACE

In [13]: octave.put('m', arr)

In [14]: print octave.run('whos')
Variables in the current scope:

   Attr Name        Size                     Bytes  Class
   ==== ====        ====                     =====  =====
        ans         1x70                       763  cell
        m       10499x7                     587944  double

Total is 73563 elements using 588707 bytes

In [15]: # SAVE MAT DATA TO BE USED IN OCTAVE

In [16]: octave.run('save data.mat m')
Out[16]: ''

OCTAVE SESSION

octave:1> % LOAD DATA
octave:1> load data.mat
octave:2> whos
Variables in the current scope:

   Attr Name        Size                     Bytes  Class
   ==== ====        ====                     =====  ===== 
        ans         1x70                       763  cell
        m       10499x7                     587944  double

Total is 73563 elements using 588707 bytes

octave:3> x = [ones(size(m, 1), 1) m(:, 1:6)];
octave:4> y = m(:, 7);
octave:5> % SET INITIAL VALUES OF PARAMETERS
octave:5> b = zeros(size(x, 2), 1);
octave:6> % MAX ITERATIONS
octave:6> maxiter = 100;
octave:7> % SOLVE FOR BETA THROUGH IRLS
octave:7> for i = 1:maxiter
>   xb = x * b;
>   p = 1 ./ (1 + exp(-xb));
>   w = diag(p .* (1 - p));
>   z = x * b + w \ (y - p);
>   b = (x' * w * x) \ (x' * w) * z;
>   new_ll = y' * log(p) + (1 - y)' * log(1 - p);
>   disp(sprintf('log likelihood for iteration %0.0f: %0.8f', i, new_ll));
>   if i == 1
>     old_ll = new_ll;
>   elseif new_ll > old_ll 
>     old_ll = new_ll;
>   else
>     break
>   endif
> end
log likelihood for iteration 1: -7277.35224870
log likelihood for iteration 2: -3460.32494800
log likelihood for iteration 3: -3202.91405878
log likelihood for iteration 4: -3176.95671906
log likelihood for iteration 5: -3175.85136112
log likelihood for iteration 6: -3175.84836320
log likelihood for iteration 7: -3175.84836317
log likelihood for iteration 8: -3175.84836317
log likelihood for iteration 9: -3175.84836317
octave:8> % CALCULATE STANDARD ERRORS AND Z-VALUES
octave:8> stder = sqrt(diag(inv(x' * w * x)));
octave:9> z = b ./ stder;
octave:10> result = [b, stder, z];
octave:11> format short g;
octave:12> disp(result);
    -0.96483     0.13317     -7.2453
  -0.0094622   0.0038629     -2.4495
     0.13384     0.02875      4.6554
     0.21028    0.069724      3.0159
     0.20073    0.048048      4.1776
  -0.0004632  4.1893e-05     -11.057
    -0.22627    0.077392     -2.9237

Generalized Regression Neural Networks and the Implementation with Matlab

Generalized Regression Neural Networks (GRNN) is a special case of Radial Basis Networks (RBN). Compared with its competitors, e.g. standard feedforward neural networks, GRNN has several advantages. First of all, the structure of a GRNN is relatively simple and static, with 2 layers, namely the pattern and summation layers. Once the input goes through each unit in the pattern layer, the relationship between the input and the response is “memorized” and stored in the unit. As a result, the number of units in the pattern layer is equal to the number of observations in the training sample. In each pattern unit, a Gaussian PDF is applied to the network input such that

Theta = exp[-0.5 * (X - u)' * (X - u) / Sigma^2]

where Theta is the output from the pattern units, X is the input, u is the training vector stored in the unit, and Sigma is a positive constant known as the “spread” or “smooth parameter”. Once Theta is computed, it is passed to the summation layer to calculate Y|X = SUM(Y * Theta) / SUM(Theta), where Y|X is the prediction conditional on X and Y is the response in the training sample (a short numeric sketch of these two formulas follows the list below). In addition to the above, other benefits of GRNN claimed by Specht (1991) include:

1) The network is able to learn from the training data by “1-pass” training in a fraction of the time it takes to train standard feedforward networks.

2) The spread, Sigma, is the only free parameter in the network, which can often be identified by V-fold or split-sample cross validation.

3) Unlike standard feedforward networks, GRNN estimation is always able to converge to a global solution and won’t be trapped in a local minimum.
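To make the two formulas above concrete, here is a minimal numeric sketch of a GRNN prediction for a single input vector, written in python with numpy purely to illustrate the arithmetic. The function name grnn_predict and the toy data are made up for this example, and it is not Matlab's newgrnn(), which parameterizes the spread differently.

import numpy as np

def grnn_predict(x, train_X, train_Y, sigma):
    # squared distance between the input and each stored training vector u
    d2 = np.sum((train_X - x) ** 2, axis = 1)
    # pattern layer: Theta = exp[-0.5 * (X - u)' * (X - u) / Sigma^2]
    theta = np.exp(-0.5 * d2 / sigma ** 2)
    # summation layer: Y|X = SUM(Y * Theta) / SUM(Theta)
    return np.sum(train_Y * theta) / np.sum(theta)

# toy training sample: 5 observations with 2 inputs each
train_X = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [.5, .5]])
train_Y = np.array([0., 1., 1., 0., .5])
print grnn_predict(np.array([.9, .1]), train_X, train_Y, 0.5)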

With respect to the implementation of GRNN, Matlab might be considered the best computing engine, based on my limited experience, in terms of ease of use and speed. The demo below shows how to use matlab to develop a GRNN and to identify an optimal value of Sigma using split-sample cross validation.

load credit

Y = transpose(data(:, 2));
[n, m] = size(Y);
train_index = 2:2:m;

% SPLIT THE RESPONSE VECTOR INTO TRAINING AND TESTING
train_Y = Y(train_index);
test_Y = Y;
test_Y(train_index) = [];

% SPLIT X MATRIX INTO TRAINING AND TESTING
X = transpose(data(:, 3:10));
train_X = X(:, train_index);
test_X = X;
test_X(:, train_index) = [];

% STANDARDIZE X MATRIX IN TRAINING SET
[train_X2, map] = mapstd(train_X);

% STANDARDIZE X MATRIX IN TESTING SET
test_X2 = mapstd('apply', test_X, map);

% CHECK IF VARIANCE == 1
var(transpose(train_X2))
var(transpose(test_X2))

% TESTING DIFFERENT SPREAD OF RADIAL BASIS FUNCTION
j = 0;
for i = 1:0.02:2
  % TRAIN A GRNN    
  grnn = newgrnn(train_X2, train_Y, i);
  
  % CALCULATE THE PREDICTION FOR TESTING SET
  test_P = sim(grnn, test_X2);
  
  % COLLECT THE PERFORMANCE
  if j == 0
    spread = i;
    perf = sse(test_Y - test_P);    
  else
    spread = [spread, i];
    perf = [perf, sse(test_Y - test_P)];
  end;
  j = j + 1;
end;    

plot(spread, perf, '-ro');

As shown in the plot generated by the matlab program, the SSE reaches its minimum when Sigma is between 1.3 and 1.4, indicating a reasonable range for the optimal spread value.