I am trying to work out which entries in my data store are near-duplicates using approximate string matching.
Is there an existing implementation of the following approach in Python, or do I need to roll my own?
Thanks :)
...
A brute-force approach would be to compute the edit distance to P for all substrings of T, and then choose the substring with the minimum distance. However, this algorithm would have the running time O(n³ m)
A better solution[3][4], utilizing dynamic programming, uses an alternative formulation of the problem: for each position j in the text T and each position i in the pattern P, compute the minimum edit distance between the first i characters of the pattern, P_i, and any substring T_{j',j} of T that ends at position j.
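For what it's worth, here is a minimal sketch of that dynamic-programming formulation (the function name `fuzzy_substring_distance` is my own, not from any library). The key difference from plain Levenshtein distance is that row 0 is initialized to all zeros, so a match is allowed to start anywhere in T, and the answer is the minimum over the last row, so it can end anywhere too:

    def fuzzy_substring_distance(pattern, text):
        """Minimum edit distance between `pattern` and any substring of `text`."""
        m, n = len(pattern), len(text)
        # prev[j]: min edit distance between the first i characters of the
        # pattern and some substring of text ending at position j.
        # Row 0 is all zeros: an empty pattern matches the empty substring
        # ending at any position, which lets matches start anywhere in T.
        prev = [0] * (n + 1)
        for i in range(1, m + 1):
            curr = [i] + [0] * n  # i deletions to match pattern[:i] against ""
            for j in range(1, n + 1):
                cost = 0 if pattern[i - 1] == text[j - 1] else 1
                curr[j] = min(prev[j - 1] + cost,  # match / substitute
                              prev[j] + 1,         # delete from pattern
                              curr[j - 1] + 1)     # insert into pattern
            prev = curr
        # The best match may end at any text position, so take the minimum
        # over the last row rather than the bottom-right corner.
        return min(prev)

    print(fuzzy_substring_distance("hello", "say helo world"))  # -> 1

This runs in O(nm) per pattern/text pair, compared with O(n³ m) for the brute-force approach.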
What is the most efficient way to apply this to many strings?
Levenshtein distance performs very similarly to fuzzywuzzy's standard ratio() function; fuzzywuzzy uses difflib under the hood (see http://seatgeek.com/blog/dev/fuzzywuzzy-fuzzy-string-matching-in-python).
An example from the fuzzywuzzy documentation: https://github.com/seatgeek/fuzzywuzzy
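For reference, a short example along the lines of the fuzzywuzzy README (the candidate strings here are made up for illustration). `process.extract` scores one query against a whole list, which is one way to flag likely near-duplicates across many strings:

    from fuzzywuzzy import fuzz, process

    print(fuzz.ratio("this is a test", "this is a test!"))          # ~97
    print(fuzz.partial_ratio("this is a test", "this is a test!"))  # 100

    # Score a query against a list of candidates and keep the best matches.
    choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
    print(process.extract("new york jets", choices, limit=2))
    # -> [('New York Jets', 100), ('New York Giants', 79)]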