I am working on a problem that needs similarity metrics to extract a subset of data from a larger set for further analysis.

The way I am extracting the subset is by using cosine similarity above certain threshold. The toy set below describes the problem:

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

df = pd.DataFrame(np.random.normal(0, 1, (10, 10)))

the similarity of the 10 "samples" by 10 "features" is given by this matrix:

similarity_df = pd.DataFrame(cosine_similarity(df))

(as an example, the comparison heatmap looks like this): enter image description here Given a new vector:

new_vector = np.array([1.0, 0.49, -0.05, 0.36, -1.0, -1.8, 0.21, 0.31, 0, -1])

I find similarity using this function:

def find_similarity(df, search):
    rankings = []
    for i in range(len(df)):
        sim = int(100*cosine_similarity(search.reshape(1,-1), df.iloc[i].values.reshape(1,-1))[0][0])
        rankings.append((i, sim))
    df= pd.DataFrame(rankings)
    df.columns = ['item', 'similarity']
    df.sort_values(by='similarity', inplace=True, ascending=False)
    df.reset_index(drop=True, inplace=True)
    return df

Then return the items most similar, above a threshold:

r = find_similarity(df, new_vector)
r[r['similarity']>50]

This works fine, the problem I have is that my data set has over 1 million rows by 200 columns, and the similarity_df is 1 million by 1 million. Needless to say, the sequential comparison in my for loop is slow.

for i in range(len(df)):
    sim = int(100*cosine_similarity(search.reshape(1,-1), df.iloc[i].values.reshape(1,-1))[0][0])

This is compounded by the fact that I need to find similarity for about 100 vectors at a time.

Is there a faster way to do this search/ranking? Maybe a hashing function that maximizes similarity (as opposed to minimizing collusion), such as LSH? Although I am using cosine similarity, maybe another similarity metric does a better job, but for now the problem is a faster way to do the one to many comparison.

0

There are 0 best solutions below