How does Oracle DB compute edit distance and similarity with non-ASCII characters?

Question

How does Oracle DB compute edit distance and similarity with non-ASCII characters?

76 Views Asked by PixelMaster At 12 February 2024 at 14:50

I've recently been working with Oracle DB, and when evaluating their matching functions (in this case, EDIT_DISTANCE and EDIT_DISTANCE_SIMILARITY, which - I think - implement unnormalized and normalized Levenshtein distance) against standard implementations, I stumbled over some curious behaviour.

Normally, I would expect a normalization equal to:

1 - edit_distance / len(longer_string)

However, this doesn't perfectly match Oracle's results, nor do the unnormalized edit distance values themselves always match expections.

More specifically, my testing suggests that both functions seems to handle special characters strangely. I've tested German and French special characters, and both seem to be handled different from ASCII characters.

For example, here are results given by Oracle DB (I'm scaling Oracle normalization from [0, 100] down to [0, 1]), and what I would expect based on the standard Levenshtein algorithm and the formula above:

String A	String B	Expected edit distance	Oracle edit distance	Expected normalization	Oracle normalization
stack	stock	1	1	`1 - 1/5 = 0.8`	0.8
müll	mall	1	2	`1 - 3/4 = 0.75`	0.6
déja-vu	deja-vu	1	2	`1 - 1/7 = 0.857`	0.750
château	château!	1	1	`1 - 1/8 = 0.875`	0.89

The SQL I'm using to arrive at these results is as follows (replace 'a' and 'b', obviously):

SELECT UTL_MATCH.EDIT_DISTANCE('a', 'b') FROM dual;
SELECT UTL_MATCH.EDIT_DISTANCE_SIMILARITY('a', 'b') FROM dual;

All of the Oracle results are plausible if you count special characters as two characters. This is, at first, somewhat sensible for German umlauts, for which "ae", "oe", and "ue" are common substitutions, but I'm not aware of similarly common substitutions for e.g. French (then again, my French is quite rusty).

Even more strangely, "Müller" compared with "Mller", "Muller" and "Mueller" returns an edit distance of 2 for all three cases, so "ü" isn't simply converted to "ue". This, therefore, definitely does not seem correct.

It seems strange to me that Oracle is treating non-ASCII characters like this. Is there any precedent or "scientific basis" for this behaviour? For me, it seems pretty random, and in fact more like a bug.

Disclaimer: the entire behaviour could also be due to the version of DBVisualizer (the SQL client) or Oracle DB I'm working with. I unfortunately don't have an easy way to eliminate this variable due to installation restrictions in the environment I'm working in.

Original Q&A

There are 1 best solutions below

**MT0** · Answer 1 · 2024-02-12T15:25:18.573000

Look at the byte values of the strings:

CREATE TABLE table_name (A, B) AS
SELECT 'stack',   'stock'    FROM DUAL UNION ALL
SELECT 'müll',    'mall'     FROM DUAL UNION ALL
SELECT 'müll',    'mall'     FROM DUAL UNION ALL
SELECT 'déja-vu', 'deja-vu'  FROM DUAL UNION ALL
SELECT 'château', 'château!' FROM DUAL;

Then:

SELECT a,
       b,
       DUMP(a),
       DUMP(b),
       UTL_MATCH.EDIT_DISTANCE(a, b) AS edist
FROM   table_name

Outputs:

A	B	DUMP(A)	DUMP(B)	EDIST
stack	stock	Typ=1 Len=5: 115,116,97,99,107	Typ=1 Len=5: 115,116,111,99,107	1
müll	mall	Typ=1 Len=5: 109,195,188,108,108	Typ=1 Len=4: 109,97,108,108	2
müll	mall	Typ=1 Len=6: 109,117,204,136,108,108	Typ=1 Len=4: 109,97,108,108	3
déja-vu	deja-vu	Typ=1 Len=8: 100,195,169,106,97,45,118,117	Typ=1 Len=7: 100,101,106,97,45,118,117	2
château	château!	Typ=1 Len=8: 99,104,195,162,116,101,97,117	Typ=1 Len=9: 99,104,195,162,116,101,97,117,33	1

The müll and müll values look identical but the first is created with a single character LATIN SMALL LETTER U WITH DIAERESIS and the second is created with two characters LATIN SMALL LETTER U and then the diacritic COMBINING DIAERESIS.

The edit distance is calculating the edit distance between the bytes (or ASCII values) rather than the edit distance between multi-byte characters (or, even more complicated, the edit distance between multi-byte characters modified by combining diacritics).

fiddle

How does Oracle DB compute edit distance and similarity with non-ASCII characters?

There are 1 best solutions below

Related Questions in SQL

Related Questions in ORACLE

Related Questions in LEVENSHTEIN-DISTANCE

Related Questions in EDIT-DISTANCE

Trending Questions

Popular # Hahtags

Popular Questions