I'm trying to calculate cosine similarity scores between all possible pairs of text documents in a corpus, using scikit-learn's cosine_similarity function. Since my corpus is huge (30 million documents), the number of possible pairs is far too large to store as a dataframe. So I'd like to filter the similarity scores by a threshold as they're being created, before storing them in a dataframe for future use. While doing that, I also want to use the document IDs as the index and column names of the dataframe, so that each value in the dataframe is labelled by the two document IDs whose cosine similarity it represents.
similarity_values = pd.DataFrame(cosine_similarity(tfidf_matrix), index = IDs, columns= IDs)
This piece of code works well without the filtering part. IDs is a list that holds all the document IDs, in the same order as the rows of the tfidf matrix.
similarity_values = pd.DataFrame(cosine_similarity(tfidf_matrix)>0.65, index = IDs, columns= IDs)
This modification helps with the filtering, but the similarity scores are turned into boolean (True/False) values. How can I keep the actual cosine similarity scores instead of the boolean values?
The only thing that comes to mind is breaking the cosine similarity computation down into batches. For example, you're using cosine_similarity(tfidf_matrix) to generate an NxN matrix, but you can also use cosine_similarity(tfidf_matrix[:5], tfidf_matrix) to generate a 5xN matrix. In this case we can do the following, based on your followup clarification:
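A minimal sketch of the batched approach, assuming tfidf_matrix is a row-sliceable matrix (for example the sparse output of TfidfVectorizer) and IDs is in the same row order. The function name thresholded_similarities, the batch_size parameter, and the long-format columns id_a, id_b and score are illustrative choices, not part of scikit-learn or pandas. Instead of a square dataframe, it keeps one row per pair whose score passes the threshold, which is what makes the result small enough to store:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def thresholded_similarities(tfidf_matrix, ids, threshold=0.65, batch_size=1000):
    """Return a long-format DataFrame (id_a, id_b, score) of pairs above threshold.

    Hypothetical helper: only one batch_size x N block of scores is held
    in memory at a time.
    """
    ids = np.asarray(ids)
    n = tfidf_matrix.shape[0]
    frames = []
    for start in range(0, n, batch_size):
        stop = min(start + batch_size, n)
        # Dense (stop - start) x N block of cosine similarities for this batch.
        block = cosine_similarity(tfidf_matrix[start:stop], tfidf_matrix)
        rows, cols = np.nonzero(block > threshold)
        # Keep each unordered pair once and drop self-similarity:
        # the global row index (rows + start) must be smaller than the column.
        keep = cols > rows + start
        rows, cols = rows[keep], cols[keep]
        if rows.size:
            frames.append(pd.DataFrame({
                "id_a": ids[rows + start],
                "id_b": ids[cols],
                "score": block[rows, cols],  # actual scores, not booleans
            }))
    if not frames:
        return pd.DataFrame(columns=["id_a", "id_b", "score"])
    return pd.concat(frames, ignore_index=True)


# Toy usage: the first two documents are identical, the third shares no words.
docs = ["the cat sat on the mat", "the cat sat on the mat", "dogs chase cars"]
IDs = ["a", "b", "c"]
tfidf_matrix = TfidfVectorizer().fit_transform(docs)
pairs = thresholded_similarities(tfidf_matrix, IDs, threshold=0.65, batch_size=2)
print(pairs)
```

With 30 million documents even a single batch is wide, so batch_size has to be chosen to fit memory, and each batch's rows could be appended to disk rather than collected in a list. If a square layout is really needed for a small subset, the long-format result can be pivoted afterwards.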