Fast approximate string difference for large strings

I'm trying to quantify the difference between two strings as part of a change-monitor system.

The issue I'm having is that the strings are large - I often have to deal with strings of 100K+ characters.

I'm currently using Levenshtein distance, but computing it for large strings is very inefficient: even the best general-purpose implementations still take roughly O(m*n) time (the usual space optimisation only brings the memory down to O(min(m, n))).

Since both strings are of approximately the same length, the distance calculation process can take many seconds.
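
For reference, here is a minimal sketch of the standard two-row dynamic-programming computation (the function name is just illustrative, not code from my system). The nested loops make the cost concrete: for two 100K-character strings that is on the order of 10^10 cell updates, even though only O(min(m, n)) memory is needed.

```python
# Classic two-row Levenshtein DP: O(m*n) time, O(min(m, n)) space.
def levenshtein(a: str, b: str) -> int:
    if len(a) < len(b):
        a, b = b, a                          # keep the shorter string as the inner row
    prev = list(range(len(b) + 1))           # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]
```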

I do not need high precision. A change resolution of 1 in 1000 (i.e. 0.1%) would be plenty for my application.

What options are there for more efficient string distance computation?

1 Answer

If you can tolerate some error, you can try splitting the strings into smaller chunks and calculating their pairwise Levenshtein distances.

The method would yield accurate results for replacements; inserts and deletes, however, would incur an accuracy penalty that depends on the number of chunks (the worst case would give you a distance of 2 * <number of inserts/deletes> * <number of chunks> instead of <number of inserts/deletes>).
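
A rough sketch of this chunking idea, assuming the levenshtein() helper from the question and an arbitrary chunk size of 1000 (both the name and the default are illustrative): corresponding fixed-size chunks are compared pairwise and their distances summed.

```python
# Approximate distance by summing Levenshtein distances of aligned chunks.
def chunked_distance(a: str, b: str, chunk_size: int = 1000) -> int:
    total = 0
    for start in range(0, max(len(a), len(b)), chunk_size):
        # Slices past the end of the shorter string are simply empty.
        total += levenshtein(a[start:start + chunk_size],
                             b[start:start + chunk_size])
    return total
```

This cuts the work from roughly len(a) * len(b) cell updates to roughly len(a) * chunk_size, at the price of the alignment error described above.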

The next step could be to make the process adaptive. I see two ways of doing it, depending on the expected nature of the changes:

  1. Try a small chunk size first, then move on to larger and larger chunks, and observe the drop in measured distance between iterations. That should help you estimate how much of your measured distance is error (though I haven't worked out exactly how).
  2. Once you find a difference between two chunks, try to identify what the difference is (exactly how many characters were added or deleted overall), and shift your next chunk to the left or to the right accordingly (see the sketch after this list).
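
A rough sketch of the second idea (the names, the chunk size, and the use of difflib to estimate the shift are my own illustration, not part of the answer): whenever a chunk pair differs, the last matching block of its alignment indicates how far the content of b has drifted, and the next window into b is shifted by that amount. It reuses the levenshtein() helper from the earlier sketch.

```python
import difflib

# Adaptive variant: carry a running offset into b so that later chunks stay
# aligned after inserts/deletes earlier in the string.
def adaptive_chunked_distance(a: str, b: str, chunk_size: int = 1000) -> int:
    total = 0
    offset = 0                                    # net shift of b relative to a so far
    for start in range(0, len(a), chunk_size):
        chunk_a = a[start:start + chunk_size]
        b_start = max(0, start + offset)
        chunk_b = b[b_start:b_start + chunk_size]
        d = levenshtein(chunk_a, chunk_b)         # helper from the earlier sketch
        total += d
        if d:
            # The last matching block shows where identical content sits in each
            # chunk; the difference of its start positions approximates the net
            # number of characters inserted (positive) or deleted (negative).
            matcher = difflib.SequenceMatcher(None, chunk_a, chunk_b, autojunk=False)
            blocks = matcher.get_matching_blocks()[:-1]   # drop the zero-length sentinel
            if blocks:
                offset += blocks[-1].b - blocks[-1].a
    # Note: any content of b beyond the last window is ignored in this sketch.
    return total
```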