I've recently been working with Oracle DB, and when evaluating their matching functions (in this case, EDIT_DISTANCE and EDIT_DISTANCE_SIMILARITY, which - I think - implement unnormalized and normalized Levenshtein distance) against standard implementations, I stumbled over some curious behaviour.
Normally, I would expect a normalization equal to:
1 - edit_distance / len(longer_string)
However, this doesn't perfectly match Oracle's results, nor do the unnormalized edit distance values themselves always match expections.
More specifically, my testing suggests that both functions seems to handle special characters strangely. I've tested German and French special characters, and both seem to be handled different from ASCII characters.
For example, here are results given by Oracle DB (I'm scaling Oracle normalization from [0, 100] down to [0, 1]), and what I would expect based on the standard Levenshtein algorithm and the formula above:
| String A | String B | Expected edit distance | Oracle edit distance | Expected normalization | Oracle normalization |
|---|---|---|---|---|---|
| stack | stock | 1 | 1 | 1 - 1/5 = 0.8 |
0.8 |
| müll | mall | 1 | 2 | 1 - 3/4 = 0.75 |
0.6 |
| déja-vu | deja-vu | 1 | 2 | 1 - 1/7 = 0.857 |
0.750 |
| château | château! | 1 | 1 | 1 - 1/8 = 0.875 |
0.89 |
The SQL I'm using to arrive at these results is as follows (replace 'a' and 'b', obviously):
SELECT UTL_MATCH.EDIT_DISTANCE('a', 'b') FROM dual;
SELECT UTL_MATCH.EDIT_DISTANCE_SIMILARITY('a', 'b') FROM dual;
All of the Oracle results are plausible if you count special characters as two characters. This is, at first, somewhat sensible for German umlauts, for which "ae", "oe", and "ue" are common substitutions, but I'm not aware of similarly common substitutions for e.g. French (then again, my French is quite rusty).
Even more strangely, "Müller" compared with "Mller", "Muller" and "Mueller" returns an edit distance of 2 for all three cases, so "ü" isn't simply converted to "ue". This, therefore, definitely does not seem correct.
It seems strange to me that Oracle is treating non-ASCII characters like this. Is there any precedent or "scientific basis" for this behaviour? For me, it seems pretty random, and in fact more like a bug.
Disclaimer: the entire behaviour could also be due to the version of DBVisualizer (the SQL client) or Oracle DB I'm working with. I unfortunately don't have an easy way to eliminate this variable due to installation restrictions in the environment I'm working in.
Look at the byte values of the strings:
Then:
Outputs:
The
müllandmüllvalues look identical but the first is created with a single characterLATIN SMALL LETTER U WITH DIAERESISand the second is created with two charactersLATIN SMALL LETTER Uand then the diacriticCOMBINING DIAERESIS.The edit distance is calculating the edit distance between the bytes (or ASCII values) rather than the edit distance between multi-byte characters (or, even more complicated, the edit distance between multi-byte characters modified by combining diacritics).
fiddle