I have a large pool of short strings and a custom distance function on them (let's say Damerau–Levenshtein distance).
Q: What is the state-of-the-art solution for getting top N strings from the pool according to the custom distance?
I am looking for both a theoretical approach to this problem as well as coded implementation (Java, Python, etc).
The straight forward approach is to iterate over all strings, calculate the distance for each and keep only the best N while you iterate.
If you need to do this task a lot, you should think if you can come up with a upper-bound / lower bound estimation for the costs that can be calculated much faster than your real cost function. E.g. pre-calculate all n-grams (e.g. 3-grams) for your strings. or maybe comparing the length difference can already give a lower bound for the distance. than you can skip the calculation of the distance for all strings which have a lower bound distance higher than your current distance of the n-th best match.