I am working on a problem that needs similarity metrics to extract a subset of data from a larger set for further analysis.
The way I am extracting the subset is by keeping rows whose cosine similarity to a query vector is above a certain threshold. The toy set below describes the problem:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
df = pd.DataFrame(np.random.normal(0, 1, (10, 10)))
The pairwise similarity of the 10 "samples" across the 10 "features" is given by this matrix:
similarity_df = pd.DataFrame(cosine_similarity(df))
Given a new vector:
new_vector = np.array([1.0, 0.49, -0.05, 0.36, -1.0, -1.8, 0.21, 0.31, 0, -1])
I find similarity using this function:
def find_similarity(df, search):
    # compare the search vector against every row, one at a time
    rankings = []
    for i in range(len(df)):
        sim = int(100 * cosine_similarity(search.reshape(1, -1), df.iloc[i].values.reshape(1, -1))[0][0])
        rankings.append((i, sim))
    # collect the scores, sorted from most to least similar
    result = pd.DataFrame(rankings, columns=['item', 'similarity'])
    result.sort_values(by='similarity', inplace=True, ascending=False)
    result.reset_index(drop=True, inplace=True)
    return result
Then I return the most similar items, above a threshold:
r = find_similarity(df, new_vector)
r[r['similarity']>50]
This works fine. The problem is that my real data set has over 1 million rows by 200 columns, and the full similarity_df would be 1 million by 1 million. Needless to say, the sequential comparison in my for loop is slow:
for i in range(len(df)):
    sim = int(100*cosine_similarity(search.reshape(1,-1), df.iloc[i].values.reshape(1,-1))[0][0])
This is compounded by the fact that I need to find similarity for about 100 vectors at a time.
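I realize the inner loop can be collapsed into a single call, and that all 100 queries can go through at once. A rough sketch of what I mean (queries here stands in for a hypothetical (100, 200) array of my search vectors, and the 0.5 threshold matches the 50 cutoff above), but even then every query is still compared against all 1 million rows:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# queries: hypothetical (100, 200) array of search vectors
# df: the full (1_000_000, 200) DataFrame
sims = cosine_similarity(queries, df.values)        # shape (100, 1_000_000)

# for each query, keep the row indices whose similarity exceeds the threshold
matches = [np.flatnonzero(row > 0.5) for row in sims]

So brute force gets cheaper per comparison, but the cost still grows linearly with the number of rows, which is what I would like to avoid.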
Is there a faster way to do this search/ranking? Maybe a hashing scheme that deliberately makes similar items collide (as opposed to an ordinary hash, which minimizes collisions), such as LSH? Although I am using cosine similarity, maybe another similarity metric would do a better job, but for now the problem is finding a faster way to do the one-to-many comparison.
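For context, the kind of LSH I have in mind is random-hyperplane hashing for cosine similarity, roughly along these lines (just a sketch of the idea using the toy df and new_vector from above, not something I have working at scale; n_bits and the single hash table are arbitrary choices on my part):

import numpy as np

rng = np.random.default_rng(0)
n_features = df.shape[1]
n_bits = 16                                   # arbitrary; more bits = smaller, purer buckets

# random hyperplanes; the sign pattern of the projections is the hash
planes = rng.normal(size=(n_bits, n_features))

def simhash(X):
    # hash each row of X to an integer bucket via the signs of its projections
    bits = (np.asarray(X) @ planes.T) > 0     # (n_samples, n_bits) booleans
    return bits @ (1 << np.arange(n_bits))    # pack each bit pattern into one integer

# index the big table once, then only run cosine_similarity against a query's bucket
buckets = {}
for idx, h in enumerate(simhash(df.values)):
    buckets.setdefault(int(h), []).append(idx)

candidates = buckets.get(int(simhash(new_vector.reshape(1, -1))[0]), [])

My understanding is that a single table like this misses near neighbors that land one bit away, so in practice it would need multiple tables or multi-probing, which is part of why I'm asking whether this is the right direction at all.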