Count the number of unique values for each column of a submatrix in a fast manner

Question

284 Views Asked by Elkan At 01 March 2017 at 10:16

I have a matrix X with tens of rows and thousands of columns. All elements are categorical and have been converted to an index matrix. For example, the ith column X(:,i) = [-1,-1,0,2,1,2]' is converted to X2(:,i) = ic, where [x,ia,ic] = unique(X(:,i)), so that it can be passed directly to accumarray. I randomly select a submatrix (a subset of rows) from the matrix and count how many times each unique value occurs in each column of the submatrix, and I repeat this procedure 10,000 times. I know several methods for counting the unique values in a column; the fastest I have found so far is shown below:

mx = max(X);
for iter = 1:numperm
    for j = 1:ny
        % rows where the permuted y equals the j-th unique value uy(j)
        ky = yrand(:,iter)==uy(j);
        % submatrix of X made up of those rows
        Xk = X(ky,:);
        % row range in pxs that receives the counts for class j, per column
        mxj = mx*(j-1);
        mxi = mxj+1;
        mxk = max(Xk)+mxj;
        % count the occurrences of each value, one column at a time
        for i = 1:c
            pxs(mxi(i):mxk(i),i) = accumarray(Xk(:,i),1);
        end
    end
end

This is part of a random permutation test for the information gain between a data matrix X of size n by c and a categorical variable y, in which y is randomly permuted. In the code above, all random permutations of y are stored in the matrix yrand, and the number of permutations is numperm. The unique values of y are stored in uy, and their number is ny. In each iteration of 1:numperm, the submatrix Xk is selected for each unique value of y, and the number of occurrences of each unique value in each column of this submatrix is counted and stored in the matrix pxs.
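To make the indexing concrete, here is a minimal sketch of the conversion and count for a single column, using the example vector from the question. The variable names col, u, and counts are illustrative only and do not appear in the original code.

```matlab
% Example column from the question (categorical values)
col = [-1,-1,0,2,1,2]';

% unique returns the sorted distinct values u and, for each element of col,
% its position ic in u, so ic is a valid subscript vector for accumarray
[u,~,ic] = unique(col);

% count how many times each distinct value occurs
counts = accumarray(ic,1);

% show each distinct value next to its count
disp([u counts])
```

Here u is [-1;0;1;2], ic is [1;1;2;4;3;4], and counts is [2;1;1;2]: the values -1 and 2 each occur twice, and 0 and 1 once.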

The most time-consuming part of the code above is the loop over i = 1:c when c is large.

Is it possible to apply accumarray to the whole matrix at once and avoid the for loop? How else can I improve the code above?

-------

As requested, here is a simplified test function containing the code above:

%% test
function test(x,y)

[r,c] = size(x);
x2 = x;
numperm = 1000;

% convert the original matrix to index matrix for suitable and fast use of accumarray function
for i = 1:c
    [~,~,ic] = unique(x(:,i));
    x2(:,i) = ic;
end

% get 'numperm' rand permutations of y
yrand(r, numperm) = 0;
for i = 1:numperm
    yrand(:,i) = y(randperm(r));
end

% get statistic of y
uy = unique(y);
nuy = numel(uy);

% main iterations
mx = max(x2);
pxs(max(mx),c) = 0;
for iter = 1:numperm
    for j = 1:nuy
        ky = yrand(:,iter)==uy(j);
        xk = x2(ky,:);
        mxj = mx*(j-1);
        mxk = max(xk)+mxj;
        mxi = mxj+1;
        for i = 1:c
            pxs(mxi(i):mxk(i),i) = accumarray(xk(:,i),1);
        end
    end
end

And some test data:

x = round(randn(60,3000));
y = [ones(30,1);ones(30,1)*-1];

Test the function

tic; test(x,y); toc

On my computer this reports "Elapsed time is 15.391628 seconds." The test function uses 1000 permutations, so running 10,000 permutations plus some additional computations (negligible compared with the code above) would take more than 150 s. Can the code be improved? Intuitively, applying accumarray to the whole matrix at once should save a lot of time. Is that possible?

**Elkan** · Answer 1 · 2017-03-05T09:21:56.063000

The approach suggested by @rahnema1 sped up the calculation significantly, so I am posting it here as an answer, as also requested by @Dev-iL.

%% test
function test(x,y)

[r,c] = size(x);
x2 = x;
numperm = 1000;

% convert the original matrix to index matrix for suitable and fast use of accumarray function
for i = 1:c
    [~,~,ic] = unique(x(:,i));
    x2(:,i) = ic;
end

% get 'numperm' rand permutations of y
yrand(r, numperm) = 0;
for i = 1:numperm
    yrand(:,i) = y(randperm(r));
end

% get statistic of y
uy = unique(y);
nuy = numel(uy);

% main iterations
mx = max(max(x2));
% preallocation
pxs(mx*nuy,c) = 0;
% set the edges of the bin for function histc
binrg = (1:mx)';
% row offsets into pxs: counts for class j go into rows mxr(j)+1:mxr(j+1)
mxr = mx*(0:nuy);
for iter = 1:numperm
    yt = yrand(:,iter);
    for j = 1:nuy
        % histc counts down each column; select whole rows with (...,:)
        pxs(mxr(j)+1:mxr(j+1),:) = histc(x2(yt==uy(j),:),binrg);
    end
end

Test results:

>> x = round(randn(60,3000));
>> y = [ones(30,1);ones(30,1)*-1];
>> tic; test(x,y); toc
Elapsed time is 15.632962 seconds.
>> tic; test(x,y); toc % using the way suggested by rahnema1, i.e., revised function posted above
Elapsed time is 2.900463 seconds.
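As a sanity check, the following sketch compares the column-wise histc call with the original per-column accumarray loop on a tiny index matrix. The matrix xk and the names cntHist, cntLoop, cntAcc, and cols are made up for illustration. The last variant, accumarray with two-column subscripts, is not part of the answer above; it is shown only as one possible way to apply accumarray to a whole submatrix at once, and its speed has not been measured here.

```matlab
% Small index matrix: 5 rows, 3 columns, values in 1..mx (hypothetical data)
xk = [1 2 3; 1 2 1; 2 2 3; 3 1 3; 1 1 2];
mx = 3;
[r,c] = size(xk);

% (a) column-wise counts with histc; dim = 1 forces counting down columns,
%     which also covers the case where xk has a single row
cntHist = histc(xk,(1:mx)',1);

% (b) the original approach: one accumarray call per column
cntLoop = zeros(mx,c);
for i = 1:c
    cntLoop(:,i) = accumarray(xk(:,i),1,[mx 1]);
end

% (c) a single accumarray call with [value, column] subscript pairs
cols = repmat(1:c,r,1);
cntAcc = accumarray([xk(:),cols(:)],1,[mx c]);

% all three variants should agree
disp(isequal(cntHist,cntLoop,cntAcc))
```

Note that the MathWorks documentation marks histc as not recommended in favor of histcounts; since histcounts does not count down the columns of a matrix in one call, variant (c) may be worth timing as an alternative on newer releases.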
