I have a 300,000-row pd.DataFrame with multiple columns, one of which holds a 50-dimensional numpy array of shape (1, 50) per row, like so:
ID Array1
1 [2.4252 ... 5.6363]
2 [3.1242 ... 9.0091]
3 [6.6775 ... 12.958]
...
300000 [0.1260 ... 5.3323]
I then generate a new numpy array (let's call it array2) with the same shape and calculate the cosine similarity between each row of the dataframe and the generated array. For this, I am currently using sklearn.metrics.pairwise.cosine_similarity and saving the results in a new column:
from sklearn.metrics.pairwise import cosine_similarity
df['Cosine'] = cosine_similarity(df['Array1'].tolist(), array2)
This works as intended and takes, on average, 2.5 seconds to execute. I am trying to get this under 1 second, simply to reduce waiting time in the system I am building.
I am beginning to learn about Vaex and Dask as alternatives to pandas, but I have not managed to convert the code above into a working equivalent that is also faster.
Preferably with one of the technologies I mentioned, how can I go about making pairwise cosine calculations even faster for large datasets?
You could use Faiss here and run a k-nearest-neighbour search. To do this, you would put the dataframe vectors into a Faiss index and then search it with your array using k=300000 (or whatever the total number of rows in your dataframe is).
Note that you'll need to normalise the vectors to make this work, since this approach is based on the inner product (the inner product of L2-normalised vectors equals their cosine similarity).
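A minimal sketch of that idea, assuming faiss-cpu is installed and df / array2 are as in your question; the normalize_L2 calls make the inner-product scores equal to cosine similarity, and the exact variable names here are just illustrative:

import numpy as np
import faiss

# Stack the per-row (1, 50) vectors into one contiguous float32 matrix, as Faiss requires.
vectors = np.vstack(df['Array1'].to_numpy()).astype('float32')   # shape (300000, 50)
query = np.asarray(array2, dtype='float32').reshape(1, -1)       # shape (1, 50)

# Normalise both sides so the inner product equals cosine similarity.
faiss.normalize_L2(vectors)
faiss.normalize_L2(query)

# Build a flat inner-product index over the dataframe rows and search with
# k equal to the number of rows, so every row gets a score.
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)
scores, ids = index.search(query, len(df))   # both have shape (1, len(df))

# Results come back sorted by similarity, so scatter them back into the
# original row order before storing them as a column.
cosine = np.empty(len(df), dtype='float32')
cosine[ids[0]] = scores[0]
df['Cosine'] = cosine

With a flat index this is still an exact, brute-force scan, so the win comes from Faiss's optimised inner-product kernels rather than from approximation; if exact scores aren't required, an approximate index type could cut the time further.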