I have two columns in a pandas DataFrame, each containing a sequence of terms, and my objective is to find, for each entry in column A, the entry in column B that is the closest match. I have used TF-IDF to compute the similarity between the two columns, but the problem is that it only looks at the occurrence of individual words and does not give any priority to words grouped together.
How do I give more weight to words which occur together?
For example, "The cat sat on the mat" should match more strongly with entries that contain the phrase "sat on the mat" than with entries that contain "cat horse sat dog on elephant the pig mat".
What you want is document similarity. I've done a lot of research into this, and in my experience Word Mover's Distance is currently the best-performing algorithm.
The easiest way to do it:
Load pretrained word vectors with gensim's load_word2vec_format method.
Use the wmdistance method to compute document similarity.
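The steps above could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: it assumes the gensim library (whose KeyedVectors class provides load_word2vec_format and wmdistance), a pretrained vector file you have downloaded yourself, and a naive lowercase-and-split tokenizer. The helper names load_vectors, tokenize and closest_matches, and the file name vectors.bin, are made up for this example. Depending on the gensim version, wmdistance may also need an optimal-transport backend (POT or pyemd) installed.

```python
def load_vectors(path, binary=True):
    """Load pretrained word2vec-format vectors with gensim."""
    # Imported lazily so the matching helper below works without gensim installed.
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(path, binary=binary)


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase and split on whitespace."""
    return text.lower().split()


def closest_matches(model, column_a, column_b):
    """For each entry in column_a, return (entry, best match in column_b, distance).

    `model` is any object exposing wmdistance(tokens_1, tokens_2), such as
    gensim's KeyedVectors. A smaller distance means the documents are closer.
    """
    column_b = list(column_b)  # plain positional indexing, even for a pandas Series
    tokens_b = [tokenize(b) for b in column_b]
    results = []
    for a in column_a:
        tokens_a = tokenize(a)
        distances = [model.wmdistance(tokens_a, tb) for tb in tokens_b]
        # Index of the candidate with the smallest distance.
        best = min(range(len(distances)), key=distances.__getitem__)
        results.append((a, column_b[best], distances[best]))
    return results


# Example usage ("vectors.bin" is a placeholder for whichever vectors you download):
# model = load_vectors("vectors.bin")
# matches = closest_matches(model, df["A"].tolist(), df["B"].tolist())
```

This compares every entry in A against every entry in B, so it may be slow for large columns. An entry whose words are all missing from the vocabulary should come back with an infinite distance, so those rows are worth checking separately.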