How to speed up cosine similarity calculation with nested loops in Python


I'm trying to calculate the cosine similarity between every pair of values in two arrays.

The 1,000 × 20,000 calculations take more than 10 minutes.

Code:

from gensim import matutils
# array_A contains 1,000 TF-IDF values
# array_B contains 20,000 TF-IDF values 
for x in array_A:
   for y in array_B:
      matutils.cossim(x,y)

I need to use the gensim package for both the TF-IDF values and the similarity calculation.

Can someone please give me some advice on how to speed this up?


There are 3 best solutions below

2
On

Memoize the function, and maybe convert the vectors to tuples (they may be faster, and unlike lists they can be used as dictionary keys):

def memoize(f):
    memo = {}

    def helper(a, b):
        if (b, a) in memo: return memo[b, a]
        elif (a, b) in memo: return memo[a, b]
        else:
            memo[(a, b)] = f(a, b)
            return memo[a, b]

    return helper


@memoize
def myfunc(a, b):
    # return the result so that memoize has something to cache
    return matutils.cossim(a, b)

EDIT: After defining the code above, you can collect the results like this, in case you need to do something else with the data:

cossim_responses = [myfunc(a, b) for a in array_A for b in array_B]
# you could also do (myfunc(a, b) for a in array_A for b in array_B)
0
On

You can look at the source for gensim's matutils.cossim():

https://github.com/RaRe-Technologies/gensim/blob/2e58a1c899af05ee6a39a1dd1c49dd6641501a9c/gensim/matutils.py#L436

You'll see it's doing a bit of work on its two (sparse-array) arguments to move their non-zero dimensions into temporary dicts, then calculating their lengths – which is repeated every time the same vector is supplied in your loops.

You might get a reasonable speedup by doing those steps on each vector only once, and remembering those dicts & lengths for re-use on each final pairwise calculation. (That is, memoizing the interim values, rather than just the final values.)
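As a rough sketch of that idea (not gensim's own code): the helpers `prepare` and `cossim_prepared` below are hypothetical names, and the toy vectors are made up. It assumes each vector is in gensim's sparse format, a list of `(term_id, weight)` pairs, and it builds each vector's dict and length once before the pairwise loop.

```python
import math

def prepare(vec):
    """Convert a sparse vector [(term_id, weight), ...] to (dict, length) once."""
    weights = dict(vec)
    length = math.sqrt(sum(w * w for w in weights.values()))
    return weights, length

def cossim_prepared(a, b):
    """Cosine similarity of two vectors already processed by prepare()."""
    (wa, na), (wb, nb) = a, b
    if na == 0.0 or nb == 0.0:
        return 0.0  # an empty or all-zero vector is similar to nothing
    if len(wb) < len(wa):
        wa, wb = wb, wa  # iterate over the shorter dict
    dot = sum(w * wb.get(i, 0.0) for i, w in wa.items())
    return dot / (na * nb)

# Made-up toy data standing in for the real TF-IDF vectors.
array_A = [[(0, 1.0), (2, 2.0)], [(1, 3.0)]]
array_B = [[(0, 1.0), (2, 2.0)], [(1, 1.0), (2, 1.0)], []]

# The per-vector work happens once here, not once per pair.
prepared_A = [prepare(v) for v in array_A]
prepared_B = [prepare(v) for v in array_B]

similarities = [[cossim_prepared(a, b) for b in prepared_B]
                for a in prepared_A]
```

With 1,000 and 20,000 vectors this does the dict and length work 21,000 times instead of 40 million times; the pairwise dot products still have to be computed, so how much it helps depends on how sparse the vectors are.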

0
On

You can use Nmslib or Faiss, which are libraries for fast (approximate) nearest-neighbor vector search.