Homoglyphs REGEX detection and sql collation

130 Views Asked by At

I have a table containing some regexes. By default the table was created using utf8mb4_general_ci collation. Everything is fine until i try to add a regex containing homoglyphes like this one. The regex // once stored in my database will simply never match.

What is causing this behavior ? I suspect the mysql table row collation to be in fault. If so, what collation should be used to fix this ?

Does changing the collation will affect my other regex already stored in that table (more than 3k, but plain text, no homoglyphes) ?

1

There are 1 best solutions below

0
On

It is possible to create a COLLATION that treats homoglyphs as equal, but I don't believe such exists in MySQL.

I tried these with the default collation:

42           66=x0042  [B]    L  LATIN CAPITAL LETTER B
CE92        914=x0392  [Β]    L  GREEK CAPITAL LETTER BETA
D092       1042=x0412  [В]    L  CYRILLIC CAPITAL LETTER VE

and got

SELECT 'Β' = 'B' AS 'beta:b', 'B' = 'В' AS 'b:Ve';
+--------+------+
| beta:b | b:Ve |
+--------+------+
|      0 |    0 |
+--------+------+

MySQL 8.0 updated its regexp processor to handle utf8mb4, so now

mysql> SELECT 'Β' RLIKE '[ΒBВ]';
+----------------------+
| 'Β' RLIKE '[ΒBВ]'    |
+----------------------+
|                    1 |
+----------------------+

So, with a bunch of 'character classes', you can build a regexp to check for any set of B-loolalikes, hence CRYPTO lookalikes.