Filtering cosine similarity scores into a pandas dataframe


I'm trying to calculate cosine similarity scores between all possible combinations of text documents from a corpus, using scikit-learn's cosine_similarity function. Since my corpus is huge (30 million documents), the number of possible document pairs is far too large to store as a dataframe. So I'd like to filter the similarity scores with a threshold, as they're being created, before storing them in a dataframe for future use. While doing that, I also want to assign the corresponding document IDs to the index and column names of the dataframe, so that each value's row and column labels are the IDs of the two documents whose cosine similarity it holds.

similarity_values = pd.DataFrame(cosine_similarity(tfidf_matrix), index = IDs, columns= IDs)

This piece of code works well without the filtering part. IDs is a list holding all document IDs, ordered to match the rows of the tfidf matrix.

similarity_values = pd.DataFrame(cosine_similarity(tfidf_matrix)>0.65, index = IDs, columns= IDs)

This modification helps with the filtering, but the similarity scores are turned into boolean (True/False) values. How can I keep the actual cosine similarity scores here instead of the boolean True/False values?
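As an aside, one way to keep the scores themselves while filtering is DataFrame.where, which replaces values failing the condition with NaN instead of converting them to booleans. Note this is only a sketch on a tiny made-up matrix with hypothetical IDs, and it still materializes the full NxN matrix first, so it does not by itself solve the memory problem:

```python
import pandas as pd

# Hypothetical 3x3 similarity matrix with made-up document IDs.
ids = ["doc_a", "doc_b", "doc_c"]
scores = pd.DataFrame(
    [[1.0, 0.7, 0.2],
     [0.7, 1.0, 0.5],
     [0.2, 0.5, 1.0]],
    index=ids, columns=ids)

# Keep the actual scores above the threshold; everything else becomes NaN.
filtered = scores.where(scores > 0.65)
```

filtered then holds the real similarity values where the condition is met, and NaN elsewhere, with the document IDs preserved as index and column labels.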

There is 1 answer below.


The only thing that comes to mind is breaking the cosine similarity computation into batches. For example, you're using cosine_similarity(tfidf_matrix) to generate an NxN matrix, but you can also use cosine_similarity(tfidf_matrix[:5], tfidf_matrix) to generate a 5xN matrix. In this case we can do the following, based on your followup clarification:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# source: followup clarification #2
# stack the similarity matrix into (ID1, ID2, Score) rows.
def question_followup_transformer(df):
  return df.stack().reset_index().rename(columns={'level_0':'ID1','level_1':'ID2', 0:'Score'})

# corpus is not provided in the example.
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

matrix_length = tfidf_matrix.shape[0]

BATCH_SIZE = 10
FILTER_THRESHOLD = 0.6

df = []
# iterate in batches of BATCH_SIZE rows; the final batch may be smaller.
for i in range(0, matrix_length, BATCH_SIZE):
  # compute cosine similarity between this batch of rows and the full matrix.
  subMatrix = cosine_similarity(tfidf_matrix[i:i+BATCH_SIZE], tfidf_matrix)

  # set the proper row indices of the submatrix in a dataframe;
  # subMatrix.shape[0] handles the last, possibly smaller, batch correctly.
  similarity_values = pd.DataFrame(
      subMatrix,
      index = range(i, i + subMatrix.shape[0]),
      columns= range(0, matrix_length))
  
  # apply the stack transformation from the followup clarification.
  stacked_df = question_followup_transformer(similarity_values)

  # filter out all scores below the filter threshold.
  filtered_df = stacked_df.query("Score > {}".format(FILTER_THRESHOLD))

  # append dataframe to a list.
  df.append(filtered_df)

# concat all dataframes to a single one. 
df = pd.concat(df, ignore_index=True)
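The batched code above labels rows and columns by integer position rather than by the document IDs from the question. Assuming the IDs list from the question (ordered to match the rows of the tfidf matrix), the positions can be mapped back to IDs afterwards; here is a sketch on hypothetical stand-in data:

```python
import pandas as pd

# Hypothetical stand-ins: an IDs list aligned with the tfidf matrix rows,
# and a small filtered result whose ID1/ID2 columns hold integer positions.
IDs = ["doc_a", "doc_b", "doc_c"]
df = pd.DataFrame({"ID1": [0, 1], "ID2": [1, 2], "Score": [0.7, 0.8]})

# Map integer positions back to the corresponding document IDs.
df["ID1"] = df["ID1"].map(lambda pos: IDs[pos])
df["ID2"] = df["ID2"].map(lambda pos: IDs[pos])
```

Alternatively, the same effect can be had inside the loop by passing index=IDs[i:i+BATCH_SIZE] and columns=IDs to the pd.DataFrame constructor.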