How does Oracle DB compute edit distance and similarity with non-ASCII characters?

76 Views Asked by At

I've recently been working with Oracle DB, and when evaluating their matching functions (in this case, EDIT_DISTANCE and EDIT_DISTANCE_SIMILARITY, which - I think - implement unnormalized and normalized Levenshtein distance) against standard implementations, I stumbled over some curious behaviour.

Normally, I would expect a normalization equal to:

1 - edit_distance / len(longer_string)

However, this doesn't perfectly match Oracle's results, nor do the unnormalized edit distance values themselves always match expections.

More specifically, my testing suggests that both functions seems to handle special characters strangely. I've tested German and French special characters, and both seem to be handled different from ASCII characters.

For example, here are results given by Oracle DB (I'm scaling Oracle normalization from [0, 100] down to [0, 1]), and what I would expect based on the standard Levenshtein algorithm and the formula above:

String A String B Expected edit distance Oracle edit distance Expected normalization Oracle normalization
stack stock 1 1 1 - 1/5 = 0.8 0.8
müll mall 1 2 1 - 3/4 = 0.75 0.6
déja-vu deja-vu 1 2 1 - 1/7 = 0.857 0.750
château château! 1 1 1 - 1/8 = 0.875 0.89

The SQL I'm using to arrive at these results is as follows (replace 'a' and 'b', obviously):

SELECT UTL_MATCH.EDIT_DISTANCE('a', 'b') FROM dual;
SELECT UTL_MATCH.EDIT_DISTANCE_SIMILARITY('a', 'b') FROM dual;

All of the Oracle results are plausible if you count special characters as two characters. This is, at first, somewhat sensible for German umlauts, for which "ae", "oe", and "ue" are common substitutions, but I'm not aware of similarly common substitutions for e.g. French (then again, my French is quite rusty).

Even more strangely, "Müller" compared with "Mller", "Muller" and "Mueller" returns an edit distance of 2 for all three cases, so "ü" isn't simply converted to "ue". This, therefore, definitely does not seem correct.

It seems strange to me that Oracle is treating non-ASCII characters like this. Is there any precedent or "scientific basis" for this behaviour? For me, it seems pretty random, and in fact more like a bug.

Disclaimer: the entire behaviour could also be due to the version of DBVisualizer (the SQL client) or Oracle DB I'm working with. I unfortunately don't have an easy way to eliminate this variable due to installation restrictions in the environment I'm working in.

1

There are 1 best solutions below

6
MT0 On

Look at the byte values of the strings:

CREATE TABLE table_name (A, B) AS
SELECT 'stack',   'stock'    FROM DUAL UNION ALL
SELECT 'müll',    'mall'     FROM DUAL UNION ALL
SELECT 'müll',    'mall'     FROM DUAL UNION ALL
SELECT 'déja-vu', 'deja-vu'  FROM DUAL UNION ALL
SELECT 'château', 'château!' FROM DUAL;

Then:

SELECT a,
       b,
       DUMP(a),
       DUMP(b),
       UTL_MATCH.EDIT_DISTANCE(a, b) AS edist
FROM   table_name

Outputs:

A B DUMP(A) DUMP(B) EDIST
stack stock Typ=1 Len=5: 115,116,97,99,107 Typ=1 Len=5: 115,116,111,99,107 1
müll mall Typ=1 Len=5: 109,195,188,108,108 Typ=1 Len=4: 109,97,108,108 2
müll mall Typ=1 Len=6: 109,117,204,136,108,108 Typ=1 Len=4: 109,97,108,108 3
déja-vu deja-vu Typ=1 Len=8: 100,195,169,106,97,45,118,117 Typ=1 Len=7: 100,101,106,97,45,118,117 2
château château! Typ=1 Len=8: 99,104,195,162,116,101,97,117 Typ=1 Len=9: 99,104,195,162,116,101,97,117,33 1

The müll and müll values look identical but the first is created with a single character LATIN SMALL LETTER U WITH DIAERESIS and the second is created with two characters LATIN SMALL LETTER U and then the diacritic COMBINING DIAERESIS.

The edit distance is calculating the edit distance between the bytes (or ASCII values) rather than the edit distance between multi-byte characters (or, even more complicated, the edit distance between multi-byte characters modified by combining diacritics).

fiddle