I have the following practical scenario. Imagine you have a column of strings, let's call it "description", and another column of strings (usually shorter), let's call it "name". The task is to find which "name" is contained in each row of "description". The issue is that "description" can have a billion records and "name" around a million, so a naive pairwise comparison is not feasible. I was hoping that some hashing trick exists, similar to SimHash or MinHash, that speeds up the comparison.
Typical scenario: you have movie titles and some vocabulary, and you want to check whether a title uses strings from the vocabulary. The titles can contain spelling errors or other noise, so an exact one-to-one mapping is not possible, only an approximate one.
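To illustrate the kind of hashing trick I mean: index the n-gram sets of all "name" strings once in a MinHash LSH structure, then query each "description" against it to get a small candidate set instead of scanning the whole vocabulary. Below is a minimal sketch using the datasketch library; the parameters (num_perm=128, threshold=0.5) and the toy data are placeholders, and note that MinHash approximates Jaccard similarity rather than the containment score I describe below, so it would only act as a candidate filter before exact re-scoring.

    from datasketch import MinHash, MinHashLSH

    def char_ngrams(text, n=2):
        """Set of character n-grams of a lowercased string."""
        text = text.lower()
        return {text[i:i + n] for i in range(len(text) - n + 1)}

    def minhash_of(text, num_perm=128, n=2):
        """Build a MinHash signature from the string's character n-grams."""
        m = MinHash(num_perm=num_perm)
        for gram in char_ngrams(text, n):
            m.update(gram.encode("utf-8"))
        return m

    # Build the LSH index once over the ~1M "name" strings (toy vocabulary here).
    names = ["terminator", "the godfather", "alien"]
    lsh = MinHashLSH(threshold=0.5, num_perm=128)  # threshold is illustrative
    for i, name in enumerate(names):
        lsh.insert(str(i), minhash_of(name))

    # For each "description", fetch only the candidate names, then re-score them exactly.
    description = "teh godfather part ii (1974)"
    candidate_ids = lsh.query(minhash_of(description))
    candidates = [names[int(i)] for i in candidate_ids]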
Here is my naive approach and the quantity I want to calculate (used for picking the best candidate). For every "description" row I create a set of character 2-grams (just as an example), and I do the same for the "name" column. I then compute the size of the set intersection divided by the size of the "name" set, i.e.
S = len(description_set.intersection(name_set)) / len(name_set)
This similarity score is used for picking the best candidate (above some threshold, otherwise none of them fits). This would be fine for small-scale data, but a better strategy is needed at large scale.
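For concreteness, here is a minimal sketch of that naive scoring in Python (the helper names and the 0.8 threshold are just placeholders):

    from typing import List, Optional, Set, Tuple

    def char_ngrams(text: str, n: int = 2) -> Set[str]:
        """Set of character n-grams (2-grams by default) of a lowercased string."""
        text = text.lower()
        return {text[i:i + n] for i in range(len(text) - n + 1)}

    def overlap_score(description: str, name: str, n: int = 2) -> float:
        """Intersection size divided by the size of the name's n-gram set."""
        name_set = char_ngrams(name, n)
        if not name_set:
            return 0.0
        return len(char_ngrams(description, n) & name_set) / len(name_set)

    def best_candidate(description: str, names: List[str],
                       threshold: float = 0.8, n: int = 2) -> Optional[Tuple[str, float]]:
        """Naive scan over all names: return (name, score) of the best match above threshold, else None."""
        best = max(((name, overlap_score(description, name, n)) for name in names),
                   key=lambda pair: pair[1], default=None)
        return best if best and best[1] >= threshold else None

Even this per-row scan over a million names is exactly the cost I want to avoid at a billion description rows.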