Optimized way to find pairwise cosine distance matrix using pairwise_distances_chunked


I have a NumPy array with 42000 rows * 110000 dimensions. I am trying to create a pairwise distance matrix (42000 * 42000) on a machine with 32 GB of RAM and 8 cores.

I tried pairwise_distances_chunked, but it only gives me a 3120 * 42000 distance matrix. I also tried pairwise_distances, but it fails with an out-of-memory error.

Any suggestions on what can be done?


On BEST ANSWER

According to the documentation, pairwise_distances_chunked is a generator: it yields the distance matrix one chunk of rows at a time. Based on the way you phrased your question, it seems like you did this:

D_chunk = next(pairwise_distances_chunked(X))

That code (which is the first example from the documentation) only gives you the first chunk.
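To see this on a small scale, here is a minimal sketch. The array X_small and the tiny working_memory value are assumptions chosen only to force several chunks on toy data; they are not taken from the question.

```python
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

# Hypothetical stand-in for the real data: 50 rows, 8 dimensions.
rng = np.random.default_rng(0)
X_small = rng.random((50, 8))

# working_memory is given in MiB; a tiny value forces many small chunks.
gen = pairwise_distances_chunked(X_small, metric="cosine", working_memory=0.001)

# next() pulls only the first chunk out of the generator.
first_chunk = next(gen)

# The chunk covers only some of the rows, but all 50 columns.
print(first_chunk.shape)
```

The number of rows per chunk depends on the working_memory setting, which would explain why you got a block that is 42000 columns wide but much shorter than 42000 rows.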

What you want to do is this:

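from sklearn.metrics import pairwise_distances_chunked

for chunk in pairwise_distances_chunked(X, metric="cosine"):  # the default metric is euclidean
    do_something(chunk)  # each chunk is a block of consecutive rows of the full matrix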
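
If do_something should simply assemble the full matrix, one possible sketch is below. The function name cosine_distance_matrix and its out parameter are made up for illustration; only pairwise_distances_chunked and its metric, working_memory, and n_jobs arguments come from scikit-learn.

```python
import numpy as np
from sklearn.metrics import pairwise_distances_chunked


def cosine_distance_matrix(X, out=None, working_memory=None, n_jobs=None):
    """Fill an (n, n) array with cosine distances, one chunk of rows at a time.

    `out` is a hypothetical hook: pass a preallocated array (or a np.memmap)
    to control where the result is stored.
    """
    n = X.shape[0]
    if out is None:
        out = np.empty((n, n), dtype=np.float64)

    row = 0
    chunks = pairwise_distances_chunked(
        X, metric="cosine", working_memory=working_memory, n_jobs=n_jobs
    )
    for chunk in chunks:
        # Chunks arrive in order, each holding consecutive rows of the matrix.
        out[row:row + chunk.shape[0]] = chunk
        row += chunk.shape[0]
    return out
```

Keep in mind that a 42000 * 42000 float64 matrix takes 42000 * 42000 * 8 bytes, roughly 14 GB, on top of the input array. If that does not fit in RAM, passing a np.memmap (or a float32 array) as out may help, and n_jobs=8 would be the natural setting for your 8 cores. I have not tested this at your data size.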