Efficient construction of cosine similarity matrix from corpus in Chapel

I have a corpus V of TF/IDF vectors, so they are pretty sparse.
It's an array about 2,500 by 150,000.
I want to calculate the cosine similarity between each document in the corpus.

This is almost the most naive way I can think of to do it. I know of three or four optimizations already, but I don't want to assume the answer. I'd like to know the most computationally efficient way to use Chapel for this calculation. The goal is to get X as a symmetric matrix with diag(X) = 0.

use Norm,
    LinearAlgebra;

var ndocs = 2500,
    nftrs = 150000,
    docs = 1..ndocs,
    ftrs = 1..nftrs,
    V: [docs, ftrs] real,
    X: [docs, docs] real;

for i in docs {
  var n1 = norm(V[i,..]);        // norm of document i's vector
  for j in (i+1)..ndocs {
    var n2 = norm(V[j,..]);      // norm of document j's vector
    // cosine similarity: dot product scaled by both norms
    var c = dot(V[i,..], V[j,..]) / (n1*n2);
    X[i,j] = c;
    X[j,i] = c;                  // X is symmetric
  }
}

Compiled using

chpl -I/usr/local/Cellar/openblas/0.2.20/include -L/usr/local/Cellar/openblas/0.2.20/lib -lblas cosim.chpl

== UPDATED ==

This code should now compile and run. The original code had errors, as pointed out by @bradcray below.

2 Answers

Accepted answer:

Here are some improvements that can be made to the original implementation:

  • Pre-compute and cache dot(V[i, ..], V[i, ..]) for all i into an array to reduce repeated computations.
  • Use 1..V.size or V.domain instead of 1..V.shape[1]
    • V.shape is computed from the domain sizes, rather than stored as a field.
  • Exploit the embarrassingly parallel nature of this program by computing X in parallel (see the sketch after this list)

For more details see the GitHub issue that explores these changes and their impact on the timings.
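
Putting those suggestions together, here is a minimal sketch of what the revised kernel might look like (the rowNorm name is mine; it caches norm(V[i, ..]), i.e. the square root of the self dot product, which yields the same saving as caching the dot products themselves):

use Norm,
    LinearAlgebra;

var ndocs = 2500,
    nftrs = 150000,
    docs = 1..ndocs,
    ftrs = 1..nftrs,
    V: [docs, ftrs] real,
    X: [docs, docs] real;

// Pre-compute each row's norm once instead of inside the loop nest.
var rowNorm: [docs] real = [i in docs] norm(V[i, ..]);

// Each (i, j) pair is independent, so the outer loop can be a forall.
forall i in docs {
  for j in (i+1)..ndocs {
    var c = dot(V[i, ..], V[j, ..]) / (rowNorm[i] * rowNorm[j]);
    X[i, j] = c;
    X[j, i] = c;
  }
}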

Second answer:

[Meta: This question has been weighing on me due to its being unanswered for so long. I've personally shied away from it due to its use of the phrase "most computationally efficient." In practice, it's hard to guarantee that any solution meets that description given variations that may occur from one target machine or dataset to the next. But since nobody else has stepped up, I'll make some comments in hopes that they might be of use.]

A few things that stand out for me in your code:

1) Unless I'm missing something, you are computing norm(V[r, ..]) redundantly many, many times during the computation. Asymptotically speaking, this suggests that you're doing quadratic work where only linear work is required. I'd suggest computing the norm for each row once and storing it in an array to avoid this redundant computation:

var normVrow: [docs] real = [r in docs] norm(V[r,..]);

Then, within the inner loop, you can just refer to normVrow[i] or normVrow[j].

2) Since this is Chapel and your loop seems to have no cross-loop dependencies, rather than using serial for loops, you should probably use a parallel forall loop for this computation. There's a question about whether to:

(a) change your outer loop to a forall (which would result in a load imbalance since the overall iteration space is triangular),

(b) change both loops to forall loops (which would help the load imbalance problem by over-decomposing, but likely also increase overhead), or

(c) make the outer loop into a dynamically scheduled loop to address the load imbalance.

My instinct would be to go with option (c) using Chapel's dynamic iterator:

use DynamicIters;

forall i in dynamic(docs) {  // dynamic() takes a range, e.g. docs = 1..ndocs
  ...
}
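
Applied to the question's loop nest (and reusing the normVrow array from point 1), that might look something like:

use DynamicIters;

// Hand out rows to tasks on demand, so the short rows near the
// bottom of the triangle don't leave some tasks idle.
forall i in dynamic(docs) {
  for j in (i+1)..ndocs {
    var c = dot(V[i, ..], V[j, ..]) / (normVrow[i] * normVrow[j]);
    X[i, j] = c;
    X[j, i] = c;
  }
}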

3) A final thing to consider would be to avoid the triangular iteration space and just redundantly compute X[i,j] and X[j,i] even though they'll have the same value. This may not make sense on a shared-memory run, but if you were computing on a distributed array X, you'd likely reduce communication since those matrix values will be stored by different processors. In this approach, you could iterate with a single forall loop over X.domain, and the result would be well-load-balanced by default without the need for the dynamic iterator.
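
As a sketch of that variant (again reusing normVrow), the whole loop nest collapses into a single forall over X's domain:

// Redundantly compute both X[i, j] and X[j, i]; the rectangular
// iteration space is evenly balanced without a dynamic iterator.
forall (i, j) in X.domain {
  if i != j then
    X[i, j] = dot(V[i, ..], V[j, ..]) / (normVrow[i] * normVrow[j]);
  // diagonal entries keep their default value of 0.0,
  // matching the diag(X) = 0 goal from the question
}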