Say I want usernames to only consist of letters and digits regardless of language.
I think I might accomplish this with the following regex parts:
(?>\p{L}[\p{Mn}\p{Mc}]*) //match any letter, including those consisting of two code points
\p{Nd} //match any digit
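To make the intent of those two parts concrete, here is a rough sketch of the same check in Python. The standard re module has no \p{...} support, so it uses unicodedata categories instead. The function name is_letters_and_digits is made up for illustration, and rejecting the empty string is my own assumption.

```python
import unicodedata


def is_letters_and_digits(name: str) -> bool:
    """Return True if name consists only of letters (each optionally
    followed by combining marks) and decimal digits.

    Hypothetical helper mirroring (?>\\p{L}[\\p{Mn}\\p{Mc}]*) and \\p{Nd}.
    """
    if not name:
        return False  # assumption: an empty username is not valid

    after_letter = False  # True while we are inside a letter + marks cluster
    for ch in name:
        category = unicodedata.category(ch)
        if category.startswith("L"):
            after_letter = True
        elif category in ("Mn", "Mc"):
            # A combining mark is only allowed directly after a letter
            # (or after other marks that follow a letter).
            if not after_letter:
                return False
        elif category == "Nd":
            after_letter = False
        else:
            return False
    return True
```

With this, "admin1" and "e" followed by U+0301 (combining acute accent) pass, while "ad min" or a name starting with a combining mark is rejected.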
Now I have the problem that users may impersonate other users by choosing a username that looks identical to an existing one (a homograph attack). admin vs admin, where one of the two contains a look-alike letter from another script, would be an example.
I guess it's not possible to easily exclude characters that are both letters and confusables using a regex, but what about outside of regexes? Do the Unicode code points of confusables lie in certain ranges that we could filter, or something like that?
Confusables... That brings Cyrillic characters to mind. If that's what you mean, you can easily exclude them in your regex. Consider the following ranges:
Cyrillic: U+0400–U+04FF, 256 characters
Cyrillic Supplement: U+0500–U+052F, 48 characters
Cyrillic Extended-A: U+2DE0–U+2DFF, 32 characters
Cyrillic Extended-B: U+A640–U+A69F, 96 characters
Phonetic Extensions: U+1D2B, U+1D78, 2 Cyrillic characters
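Then exclude those ranges with a negated character class (written here with \uXXXX escapes, which not every regex flavor supports):
[^\u0400-\u052F\u2DE0-\u2DFF\uA640-\uA69F\u1D2B\u1D78]
Or simply use the script property: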
[^\p{Cyrillic}]
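If you want to do the same check outside of a regex, a minimal sketch in Python could look like the following. It only covers the ranges listed above; the names CYRILLIC_RANGES and contains_cyrillic are invented for this example, and newer Unicode versions define further Cyrillic blocks (for instance Cyrillic Extended-C at U+1C80–U+1C8F) that you would have to add yourself.

```python
# Ranges taken from the list above, as (first, last) code points, inclusive.
CYRILLIC_RANGES = [
    (0x0400, 0x04FF),  # Cyrillic
    (0x0500, 0x052F),  # Cyrillic Supplement
    (0x2DE0, 0x2DFF),  # Cyrillic Extended-A
    (0xA640, 0xA69F),  # Cyrillic Extended-B
    (0x1D2B, 0x1D2B),  # Phonetic Extensions: single Cyrillic character
    (0x1D78, 0x1D78),  # Phonetic Extensions: single Cyrillic character
]


def contains_cyrillic(text: str) -> bool:
    """Return True if any character of text falls in one of the listed ranges."""
    return any(
        first <= ord(ch) <= last
        for ch in text
        for first, last in CYRILLIC_RANGES
    )
```

A username would then be rejected when contains_cyrillic(name) is true. Keep in mind that this only blocks Cyrillic; other scripts, such as Greek, also contain letters that look like Latin ones.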