I'm trying to quantify the difference between two strings as part of a change-monitor system.
The issue I'm having is that the strings are large - I often deal with strings of 100K+ characters.
I'm currently using Levenshtein distance, but computing the Levenshtein distance for large strings is very inefficient: even the best practical implementations only manage O(m·n) time.
Since both strings are of approximately the same length, the distance calculation process can take many seconds.
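For context, this is roughly the computation every straightforward Levenshtein implementation performs; the nested loops over both strings are where the quadratic cost comes from (a minimal sketch, not the actual code I'm running):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming recurrence over an
    # (len(a)+1) x (len(b)+1) table, kept as two rolling rows.
    # Time is O(len(a) * len(b)) regardless of how small the answer is.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]
```

For two 100K-character strings that is on the order of 10^10 cell updates, which is why it takes many seconds even in optimized implementations.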
I do not need high precision. A change resolution of 1 in 1000 (i.e. 0.1%) would be plenty for my application.
What options are there for more efficient string distance computation?
If you can tolerate some error, you can try splitting the strings into smaller chunks and computing the Levenshtein distance between corresponding pairs of chunks.
The method would obviously yield an exact result for replacements; inserts and deletes would incur an accuracy penalty that depends on the number of chunks (in the worst case you would measure a distance of

2 * <number of inserts/deletes> * <number of chunks>

instead of <number of inserts/deletes>).
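Here is a minimal sketch of the chunked approach, assuming fixed-size chunks aligned by position (`chunk_size` and the `levenshtein` helper are illustrative; any exact implementation would do):

```python
def chunked_distance(a: str, b: str, chunk_size: int = 1000) -> int:
    # Approximate the distance by summing exact distances between
    # chunks taken from the same positions in both strings.
    # Exact for substitutions; an insert/delete shifts every later
    # chunk and inflates the estimate, as described above.
    total = 0
    for i in range(0, max(len(a), len(b)), chunk_size):
        total += levenshtein(a[i:i + chunk_size], b[i:i + chunk_size])
    return total
```

With k chunks the work drops from O(n²) to roughly O(n²/k), since each of the k comparisons costs (n/k)²; dividing the result by max(len(a), len(b)) gives the relative change to test against your 0.1% threshold.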
The next step could be to make the process adaptive. I see two ways of doing it, depending on the expected nature of the changes: