Finding the subset of a dictionary that has the minimum edit distance to a given string

50 Views Asked by At

I'm looking for the most efficient way of solving an Levenshtein edit distance problem.

We are given as input:

  1. A set of strings S of size n <= 8, with average length m <= 50
  2. A target string t of length l <= 50

Our task is to 'align' t with S i.e:

Find the subset s* of S, where concat(s*) has the minimum Levenshtein edit distance (among all the subsets) with t

Some thoughts:

  1. For 2 strings of length l1 and l2, we could use the standard dynamic programming algorithm which has O(l1*l2) time complexity
  2. Brute forcing this would require us to compute editDistance(t, concat(s')) for each subset s'. This would be approximately O(l.m.n!)
  3. This could be optimized a bit by memoizing the results i.e if we are computing editDistance(t, S[1, 2, 3, 4]) we could re-use the computation from S[1, 2, 3])
  4. The other option I could think of is to construct a Trie or DAWG (Directed Acyclic Word Graph)

But, I'm not an expert on this, so I might be missing a better solution. How would we do this efficiently?

Thanks in advance!

1

There are 1 best solutions below

0
btilly On

Outline of a memoize solution. (I just didn't do levenshtein for you.)

def best_answer(S, t, p0=0, p1=None, cache=None):
    # Defaults
    if cache is None:
        cache = {}
    if p1 is None:
        p1 = len(t)
    if ('s', p0, p1) not in cache:
        best_strings = []
        best_error = p1 - p0
        for s in S:
            leven = levenshtein(s, t[p0:p1])
            if leven < best_error:
                best_strings = [s]
                best_error = leven
        for p in range(p0+1, p1):
            (best_strings0, best_error0) = best_answer(S, t, p0, p, cache)
            (best_strings1, best_error1) = best_answer(S, t, p, p1, cache)
            if best_error0 + best_error1 < best_error:
                best_strings = best_strings0 + best_strings1
                best_error = best_error0 + best_error1
        cache[(p0, p1)] = (best_strings, best_error)
    return cache[('s', p0, p1)]