Python Text similarity and matching - increase weighting when terms are together


I have two columns in pandas which contain sequences of terms, and my objective is to find the entry from column B which is the closest match for each entry in column A. I have used TF-IDF to find the similarity between the two columns, but the problem is that it looks at the occurrence of individual words and does not give any priority to words grouped together.

How do I give more weight to words which occur together?

e.g. "The cat sat on the mat" should match more with entries that have the phrase "sat on the mat" than with entries that have "cat horse sat dog on elephant the pig mat"

There are 2 solutions below.

What you want is document similarity. I've done a lot of research into this, and in my experience Word Mover's Distance is currently the best-performing algorithm.

The easiest way to do it:

  1. Download the official Google News embeddings.
  2. Load them into Gensim's Word2Vec model using the load_word2vec_format method.
  3. Use the wmdistance method to compute document similarity.
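A minimal sketch of those three steps, assuming a recent Gensim where load_word2vec_format lives on KeyedVectors (older versions exposed it on Word2Vec) and an extra optimal-transport package (pyemd or POT, depending on the Gensim version) is installed for wmdistance; note it returns a distance, so lower means more similar:

from gensim.models import KeyedVectors

# Path to the downloaded Google News embeddings (assumed to be local).
model = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin.gz', binary=True)

# wmdistance expects tokenized documents.
doc_a = 'the cat sat on the mat'.split()
doc_b = 'sat on the mat'.split()
print(model.wmdistance(doc_a, doc_b))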

You could, for instance, iterate over your columns in sliding windows.
Wanting matches on groups of words indicates that you need to pay attention to word order in your sentences.
As an example, take the sentences 'the cat sat on the mat' and 'sat on the mat'.
Build a window the size of the shorter sentence, 'sat on the mat', iterate over both columns, and decrease the window size by 1 once you have finished an iteration.
You then get matches for every window size and can factor them in however you like.

Edit: if you want to rank the longer matches higher, you would need to look up the sentence which has the most matches.

Edit 2: I'm not sure why this is getting downvoted. You need to build tuples or windows over your sentences; there is no other way to match when word order matters. Unfortunately I do not have enough reputation to put this in the comment section.

Edit 3:

def find_ngrams(input_list, n):
    # Slide a window of n tokens over the list; wrap zip in list() so
    # the n-grams can be iterated more than once under Python 3.
    return list(zip(*[input_list[i:] for i in range(n)]))

sent_a = 'the cat sat on the mat'.split()
sent_b = 'sat on the mat'.split()

# Use the length of the shorter sentence as the window size.
nga = find_ngrams(sent_a, len(sent_b))
ngb = find_ngrams(sent_b, len(sent_b))

# Count windows that appear in both sentences.
ct = 0
for ngramone in nga:
    for ngramtwo in ngb:
        if ngramone == ngramtwo:
            ct += 1

print(ct)  # 1

If you wish to find all matches, the parameter 'n' for find_ngrams has to be decreased by one on each iteration until it reaches a value of two; the single words have already been matched by TF-IDF. A sketch of that loop follows below.
As for how you factor the matches in, there is too little data provided. My best guess would be to do a lookup if you wish to rank them higher than the TF-IDF matches.
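A minimal sketch of that loop, reusing the find_ngrams helper defined above; count_window_matches is a hypothetical name, and how you weight the per-size counts against the TF-IDF scores is left open, as discussed:

def count_window_matches(tokens_a, tokens_b):
    # Count exact n-gram matches for every window size from the length
    # of the shorter sentence down to 2 (single words are left to TF-IDF).
    counts = {}
    for n in range(min(len(tokens_a), len(tokens_b)), 1, -1):
        ngb = find_ngrams(tokens_b, n)
        counts[n] = sum(1 for gram in find_ngrams(tokens_a, n) if gram in ngb)
    return counts

print(count_window_matches('the cat sat on the mat'.split(),
                           'sat on the mat'.split()))
# {4: 1, 3: 2, 2: 3}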

I'm not sure whether anything like this is included in the pandas library, but the matching itself is quite simple and can be done in a few lines.