Allow only letters and digits in strings but without confusables


Say I want usernames to only consist of letters and digits regardless of language.

I think I can accomplish this with the following regex parts:

(?>\p{L}[\p{Mn}\p{Mc}]*) //match any letter, including a base letter followed by one or more combining marks

\p{Nd} //match any digit
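For readers whose regex engine lacks \p{...} classes, the same rule can be sketched outside a regex. This is a minimal Python sketch, assuming the intent is "one or more letters (each optionally followed by Mn/Mc combining marks) or decimal digits"; the function name is_valid_username is hypothetical and not from the question.

```python
import unicodedata


def is_valid_username(name: str) -> bool:
    """Hypothetical helper: accept only letters (each optionally followed
    by Mn/Mc combining marks) and decimal digits (Nd)."""
    if not name:
        return False
    after_letter = False  # True while combining marks may still attach to a letter
    for ch in name:
        cat = unicodedata.category(ch)
        if cat.startswith("L"):        # any letter category: Lu, Ll, Lt, Lm, Lo
            after_letter = True
        elif cat in ("Mn", "Mc"):      # combining mark: only valid after a letter
            if not after_letter:
                return False
        elif cat == "Nd":              # decimal digit in any script
            after_letter = False
        else:                          # punctuation, spaces, symbols, ...
            return False
    return True
```

With this sketch, "admin42" and "e\u0301cole" (e plus a combining acute accent) are accepted, while "a b", "user_name" and a name starting with a bare combining mark are rejected. It does nothing about confusables; it only mirrors the two regex parts above.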

Now I have the problem that users may pretend to be other users by using a username that has the same look like the one from another user (homograph attack). admin vs admin would be an example.

I guess it's not possible to easily exclude characters that are both letters and confusables using a regex, but what about outside the context of regexes? Do the Unicode code points of confusables lie in certain ranges that we could filter, or something like that?


There are 2 best solutions below


Confusables... that suggests you are mainly talking about Cyrillic characters. If that's right, you can easily exclude them in your regex. Consider the following ranges:

Cyrillic: U+0400–U+04FF, 256 characters

Cyrillic Supplement: U+0500–U+052F, 48 characters

Cyrillic Extended-A: U+2DE0–U+2DFF, 32 characters

Cyrillic Extended-B: U+A640–U+A69F, 96 characters

Phonetic Extensions: U+1D2B, U+1D78, 2 Cyrillic characters

Then:

/[^\x{0400}-\x{04FF}\x{0500}-\x{052F}\x{2DE0}-\x{2DFF}\x{A640}-\x{A69F}\x{1D2B}\x{1D78}]/u

Or, more simply, use [^\p{Cyrillic}]
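The same range check can be written without a regex. The following is a rough Python sketch that uses exactly the code point ranges listed above; the names CYRILLIC_RANGES and contains_cyrillic are made up for illustration.

```python
# Code point ranges taken from the list above (inclusive bounds).
CYRILLIC_RANGES = [
    (0x0400, 0x04FF),  # Cyrillic
    (0x0500, 0x052F),  # Cyrillic Supplement
    (0x2DE0, 0x2DFF),  # Cyrillic Extended-A
    (0xA640, 0xA69F),  # Cyrillic Extended-B
    (0x1D2B, 0x1D2B),  # Phonetic Extensions: single Cyrillic character
    (0x1D78, 0x1D78),  # Phonetic Extensions: single Cyrillic character
]


def contains_cyrillic(text: str) -> bool:
    """Hypothetical helper: True if any character falls in a listed range."""
    return any(
        low <= ord(ch) <= high
        for ch in text
        for low, high in CYRILLIC_RANGES
    )
```

Here contains_cyrillic("admin") is False and contains_cyrillic("\u0430dmin") is True. Keep in mind that rejecting on this check also rejects every legitimate Cyrillic username, and it does not cover look-alikes from other scripts such as Greek.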

0
On

The Unicode standard includes a list of confusable characters at http://www.unicode.org/Public/security/revision-02/confusables.txt

This list is incomplete according to some and too aggressive according to others, but it is worth a look to understand how difficult the problem is to solve generically.
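To give an idea of how such a list might be used, here is a simplified Python sketch. It assumes the file's lines have the form "source ; target ; type # comment" with hexadecimal code points, and it runs on a few illustrative sample lines rather than the real file. The names parse_confusables and skeleton are hypothetical, and the sketch skips the normalization steps that a complete comparison would need.

```python
def parse_confusables(lines):
    """Build a source -> target mapping from lines in the assumed
    'source ; target ; type # comment' format."""
    mapping = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2:
            continue
        # Each field is a space-separated sequence of hex code points.
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        mapping[source] = target
    return mapping


def skeleton(text, mapping):
    """Replace each character by its mapped look-alike, if it has one."""
    return "".join(mapping.get(ch, ch) for ch in text)


# Illustrative sample lines in the assumed format, not copied from the file.
SAMPLE = [
    "# sample data for illustration only",
    "0430 ; 0061 ; MA # CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A",
    "0435 ; 0065 ; MA # CYRILLIC SMALL LETTER IE -> LATIN SMALL LETTER E",
]

mapping = parse_confusables(SAMPLE)
# Two usernames collide if their skeletons are equal.
collides = skeleton("\u0430dmin", mapping) == skeleton("admin", mapping)
```

A new username could then be rejected when its skeleton equals the skeleton of an existing one, which avoids banning whole scripts.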