Suggestion for limiting fuzzy search suggestion results

372 Views Asked by At

I've implemented a fuzzy search algorithm based on a N closest neighbors query for given search terms. Each query returns a pre-set number of raw results, in my case a max. of 200 hits / query, sorted descending by score, highest score first.

The raw search already produces good results, but in some rather rare cases not good enough so I've added another post-processing layer or better said another metric to the raw search results based on Levenshtein-Damerau algorithm that measures the word / phrase distance between query term(s) and raw results. The lower the resulting score the better, 0.0 would be an exact match.

Using the Levenshtein-Damerau post-processing algorithm I sort the results ascending, from the lowest to the highest.

The quality of matches is amazingly good and all relevant hits are ranked to the top. Still I have the bulk of 200 hits from the core search and I am looking for a smart way to limit the final result set down to a maximum of 10-20 hits. I could just add a static limit as it is basically done. But I wonder if there is a better way to do this based on the individual metrics I get with each search result set.

I have the following result metrics:

  • The result score of the fuzzy core search search, a value of type float/double. The higher the better
  • The Levenshtein-Damerau post processing weight, another value of type float/double. The lower the better
  • And finally each result set knows its minimum and maximum score limits. Using the Levenshtein-Damerau post processing algorithm on the raw results I take the min/max values from there.

The only ideas I have is to take a sub-range out of the result set, something like the top 20% results which is simple to achieve. More interesting would be to analyse the top result scores/metrics and find some indication where it gets too fuzzy. I could use the metrics I gather inside my Levenshtein-Damerau algorithm layer, respectively the word- and phrase-distance parameters - these values along with 2 other parameters make up the final distance score. For example if the word- and/or phrase distance exceed a certain threshold, then skip the result. This way is a bit more complicated but possible.

Well, I wonder if there are more opportunities I could use and just not obviously see. Once again, I would like to omit a static limit and make it more flexible on each individual result set.

Any hints or further ideas are greatly appreciated.

0

There are 0 best solutions below