I have 12 million company names in my database, and I want to match them against a list offline. What is the best algorithm for this? I tried Levenshtein distance, but it does not give the expected results. Could you please suggest some algorithms? The problem is matching companies like:
G corp. ---- this needs to be mapped to G corporation
water Inc ---- Water Incorporated
You should probably start by expanding the known suffixes in both lists (the database and the list). This will take some manual work to figure out the correct mapping, e.g. with regexps:
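    \s+inc\.?$  ->  " Incorporated"
    \s+corp\.?$  ->  " Corporation"

(The replacement keeps a leading space because \s+ consumes the whitespace before the suffix.) You may want to do other normalization as well, such as lower-casing everything, removing punctuation, etc.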
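As a rough sketch of that normalization step in Python: the SUFFIXES table and the normalize function are illustrative names, not part of any library, and the table holds only the two suffixes from the question, so you would need to extend it after looking at your own data.

```python
import re

# Hypothetical suffix table: (pattern, expansion). Only the two suffixes
# from the question are listed; real data will need many more entries.
SUFFIXES = [
    (re.compile(r"\s+inc\.?$"), " incorporated"),
    (re.compile(r"\s+corp\.?$"), " corporation"),
]


def normalize(name: str) -> str:
    """Lower-case, expand known suffixes, drop punctuation, collapse spaces."""
    name = name.strip().lower()
    for pattern, expansion in SUFFIXES:
        name = pattern.sub(expansion, name)
    # Remove anything that is neither a word character nor whitespace.
    name = re.sub(r"[^\w\s]", "", name)
    # Collapse runs of whitespace into a single space.
    return re.sub(r"\s+", " ", name).strip()


print(normalize("G corp."))    # same result as normalize("G corporation")
print(normalize("water Inc"))  # same result as normalize("Water Incorporated")
```

Apply the same function to both the database and the list, so that names differing only in suffix spelling, case, or punctuation compare as equal before any fuzzy matching is attempted.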
You can then use Levenshtein distance or another fuzzy matching algorithm.
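For the fuzzy step, here is a minimal pure-Python sketch of Levenshtein distance plus a naive nearest-name lookup. It assumes the names have already been normalized; best_match and the max_distance threshold of 2 are hypothetical choices for illustration, not a recommendation.

```python
from typing import Iterable, Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute), keeping one row in memory."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as the row
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def best_match(query: str, candidates: Iterable[str],
               max_distance: int = 2) -> Optional[str]:
    """Return the closest candidate within max_distance, or None."""
    best, best_distance = None, max_distance + 1
    for candidate in candidates:
        distance = levenshtein(query, candidate)
        if distance < best_distance:
            best, best_distance = candidate, distance
    return best


names = ["water incorporated", "g corporation"]
print(best_match("water incorprated", names))
```

This scans every candidate for each query, which will be slow against 12 million names. One option is to try an exact lookup on the normalized names first (for example with a set or a database index) and fall back to the fuzzy comparison only for names that still have no match.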