Octave Error: out of memory or dimension too large for Octave's index type


I am trying to run the following code in Octave. The variable "data" consists of 864 rows and 25333 columns.

clc; clear all; close all;

pkg load statistics

GEO = load("GSE59739.mat");
GEOT = tabulate(GEO.class)    % frequency table of the class labels
data = GEO.data;              % 864x25333 double matrix
clear GEO

idx = kmeans(data,3,'Distance','cosine');
test1 = silhouette(data, idx, 'cosine');
xlabel('Silhouette Value')
ylabel('Cluster')

This is the error I get when I try to run the silhouette function: "error: out of memory or dimension too large for Octave's index type". Any idea how I can fix it?


Best Answer

It appears the problem is not necessarily with your data but with the way Octave's statistics package has implemented pdist. It uses an expansion that creates intermediate arrays whose dimensions exceed the system limits, just as the error message says.

Running through your example with some dummy data of the same size, on Octave 6.4.0 and statistics 1.4.3, I get:

pkg load statistics
data = rand(864,25333);
idx = kmeans(data,3,'Distance','cosine');
test1 = silhouette(data, idx, 'cosine');

error: out of memory or dimension too large for Octave's index type
error: called from
    pdist at line 164 column 14
    silhouette at line 125 column 16

pdist is a function that calculates the "distance" between every pair of rows in a matrix, using one of several metrics.
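A quick toy illustration (my own example, not taken from the package's code):

pkg load statistics
pdist([0 0; 3 4; 0 8])    % Euclidean by default
% ans = 5   8   5         distances for row pairs (1,2), (1,3), (2,3)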

silhouette calls pdist with the cosine metric, and the error occurs in that calculation section, pdist lines 163-166, the cosine block:

case "cosine"
        prod = X(:,Xi) .* X(:,Yi);
        weights = sumsq (X(:,Xi), 1) .* sumsq (X(:,Yi), 1);
        y = 1 - sum (prod, 1) ./ sqrt (weights);

The first line, calculating prod, causes the error: X = data' is 25333x864, and Xi and Yi are each 372816x1 index vectors formed by nchoosek(1:rows(data), 2), which produces all 372816 two-element combinations of 1:864.
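On a toy matrix the copy-expansion this indexing performs is easy to see (my own illustration):

X = [1 2 3; 4 5 6];             % 2x3 stand-in for X = data'
P = nchoosek(1:columns(X), 2)   % all column pairs: [1 2; 1 3; 2 3]
Xi = P(:,1);  Yi = P(:,2);
X(:,Xi)                         % 2x3 array repeating columns 1, 1, 2 of X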

X(:,Xi) and X(:,Yi) each request creation of a rows(X) x rows(Xi) array, that is 25333x372816, or 9,444,547,728 elements, which for double-precision data requires 75,556,381,824 bytes, about 75.6 GB. Odds are your machine can't handle this.
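You can check that arithmetic directly in Octave:

n = 864;                  % rows of data, i.e. columns of X
pairs = nchoosek(n, 2)    % pairs = 372816
elems = 25333 * pairs     % elems = 9444547728 per temporary array
bytes = elems * 8         % bytes = 75556381824, about 75.6 GB each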

Just checking with Matlab 2022a: it runs those lines without any out-of-memory errors in a few seconds, and the test1 output is only 864x1. So this excessive memory overhead appears to be specific to Octave's implementation and not inherent to the technique.

I've filed a bug report regarding this behavior at https://savannah.gnu.org/bugs/index.php?62495, but for now the answer appears to be that the 'cosine' metric, and perhaps others as well, simply cannot be used with input data of this size.
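Until then, if all you need is the cosine-distance vector itself, it can be computed without the huge temporaries by unit-normalizing the rows and taking a Gram matrix. A rough sketch of the idea (my own workaround, not the package's code; it assumes data has no all-zero rows), whose largest intermediate is only 864x864:

pkg load statistics                  % for squareform
Xn = data ./ sqrt(sumsq(data, 2));   % unit-normalize each of the 864 rows
D = 1 - Xn * Xn';                    % 864x864 cosine-distance matrix
D = (D + D') / 2;                    % enforce exact symmetry
D(1:rows(D)+1:end) = 0;              % zero the diagonal (rounding noise)
y = squareform(D);                   % 1x372816 vector in pdist's pair order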

Update: as of 19 JUN 2022, a fix for this pdist memory problem has been pushed to the statistics package repository, and will be included in the next major package release. In the meantime the updated function can be found at https://github.com/gnu-octave/statistics/blob/main/inst/pdist.m