I have a dataframe that looks like the following, but with many rows:
import pandas as pd
data = {'intent': ['order_food', 'order_food','order_taxi','order_call','order_call','order_taxi'],
'Sent': ['i need hamburger','she wants sushi','i need a cab','call me at 6','she called me','i would like a new taxi' ],
'key_words': [['need','hamburger'], ['want','sushi'],['need','cab'],['call','6'],['call'],['new','taxi']]}
df = pd.DataFrame (data, columns = ['intent','Sent','key_words'])
I have calculated the jaccard similarity using the code below (not my solution):
def lexical_overlap(doc1, doc2):
words_doc1 = set(doc1)
words_doc2 = set(doc2)
intersection = words_doc1.intersection(words_doc2)
return intersection
and modified the code given by @Amit Amola to compare overlapping words between every possible two rows and created a dataframe out of it:
overlapping_word_list=[]
for val in list(combinations(range(len(data_new)), 2)):
overlapping_word_list.append(f"the shared keywords between {data_new.iloc[val[0],0]} and {data_new.iloc[val[1],0]} sentences are: {lexical_overlap(data_new.iloc[val[0],1],data_new.iloc[val[1],1])}")
#creating an overlap dataframe
banking_overlapping_words_per_sent = DataFrame(overlapping_word_list,columns=['overlapping_list'])
since my dataset is huge, when i run this code to compare all rows, it takes forever. so i would like to instead only compare the sentences which have the same intents and do not compare sentences that have different intents. I am not sure on how to proceed to do only that
IIUC you just need to iterate over the unique values in the
intent
column and then useloc
to grab just the rows that correspond to that. If you have more than two rows you will still need to usecombinations
to get the uniquecombinations
between similar intents.