If I accept full Unicode for passwords, how should I normalize the string before passing it to the hash function?
Goals
Without normalization, if someone sets their password to "mañana" (ma\u00F1ana) on one computer and tries to log in with "mañana" (ma\u006E\u0303ana) on another computer, the hashes will be different and the login will fail. Which form is sent is under the control of the user-agent or its operating system.
- I'd like to ensure that those hash to the same thing.
- I am not concerned about homoglyphs such as Α, А, and A (Greek, Cyrillic, Latin).
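To make the goal concrete, here is a minimal sketch that assumes NFC is the chosen form (the question itself leaves the choice open). The helper names `normalize_password` and `demo_hash` are made up for illustration, and the fixed salt and low iteration count are there only to keep the demo deterministic and fast.

```python
import hashlib
import unicodedata


def normalize_password(password: str) -> bytes:
    """Return NFC-normalized UTF-8 bytes, ready to feed to a password hash."""
    return unicodedata.normalize("NFC", password).encode("utf-8")


def demo_hash(password: str, salt: bytes = b"demo-salt") -> str:
    # Illustration only: a real system would use a per-user random salt
    # and properly tuned work parameters.
    digest = hashlib.pbkdf2_hmac("sha256", normalize_password(password), salt, 1000)
    return digest.hex()


precomposed = "ma\u00F1ana"       # ñ as a single code point
decomposed = "ma\u006E\u0303ana"  # n followed by COMBINING TILDE

print(precomposed == decomposed)                        # False
print(demo_hash(precomposed) == demo_hash(decomposed))  # True
```

The two strings differ as code point sequences, but after normalization they produce identical bytes and therefore identical hashes.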
Reference
Unicode normalization forms: http://unicode.org/reports/tr15/#Norm_Forms
Considerations
- Any normalization procedure may cause collisions, e.g. "office" == "office".
- Normalization can change the number of bytes in the string.
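Both considerations can be checked with Python's standard `unicodedata` module. The sketch below assumes the collision in question involves the ligature U+FB03 (LATIN SMALL LIGATURE FFI); that is my choice of example. Only the compatibility forms (NFKC, NFKD) fold the ligature into plain letters; the canonical forms (NFC, NFD) leave it alone.

```python
import unicodedata

ligature = "o\uFB03ce"  # 'o' + LATIN SMALL LIGATURE FFI + 'ce' (assumed example)
plain = "office"

# Which normalization forms make the two strings collide?
collisions = {
    form: unicodedata.normalize(form, ligature) == plain
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}
print(collisions)
# {'NFC': False, 'NFD': False, 'NFKC': True, 'NFKD': True}


def utf8_length(text: str, form: str) -> int:
    """Number of UTF-8 bytes after normalizing to the given form."""
    return len(unicodedata.normalize(form, text).encode("utf-8"))


# The same password occupies a different number of bytes per form.
print(utf8_length("ma\u00F1ana", "NFC"))  # 7
print(utf8_length("ma\u00F1ana", "NFD"))  # 8
```

If length limits are enforced, it seems safest to apply them after normalization so that both spellings are treated alike.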
Further questions
- What happens if the server receives a byte sequence that is not valid UTF-8 (or other format)? Reject, since it can't be normalized?
- What happens if the server receives characters that are unassigned in its version of Unicode?
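One possible policy for both questions is to reject the input. The sketch below is an assumption about how that could look, not a recommendation from any standard: `decode_password` is a hypothetical helper, strict UTF-8 decoding handles the first case, and the general category `Cn` (as reported by the Unicode version bundled with the running Python, see `unicodedata.unidata_version`) stands in for "unassigned" in the second.

```python
import unicodedata


def decode_password(raw: bytes) -> str:
    """Decode, validate, and NFC-normalize a password received as bytes."""
    try:
        text = raw.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError as exc:
        raise ValueError("password is not valid UTF-8") from exc

    for ch in text:
        # 'Cn' covers code points that are unassigned (or noncharacters)
        # in this interpreter's Unicode database.
        if unicodedata.category(ch) == "Cn":
            raise ValueError(f"unassigned code point U+{ord(ch):04X}")

    return unicodedata.normalize("NFC", text)


print(decode_password("ma\u006E\u0303ana".encode("utf-8")) == "ma\u00F1ana")  # True
```

Rejecting unassigned code points has a cost: a client running a newer Unicode version may send characters the server does not know yet, so the answer depends on how often the server's Unicode data is updated.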
As of November 2022, the currently relevant authority from IETF is RFC 8265, “Preparation, Enforcement, and Comparison of Internationalized Strings Representing Usernames and Passwords,” October 2017. This document about usernames and passwords is a special case of the more-general PRECIS specification in the still-authoritative RFC 8264, “PRECIS Framework: Preparation, Enforcement, and Comparison of Internationalized Strings in Application Protocols,” October 2017.
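RFC 8265, § 4.1 specifies that a password is a string of Unicode code points conforming to the OpaqueString profile of the PRECIS FreeformClass.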
RFC 8265, § 4.2 defines the OpaqueString profile, the enforcement of which requires that a set of rules be applied in a fixed order. The profile builds on the FreeformClass string class defined in RFC 8264, § 4.3, which specifies which characters are valid and which are disallowed.
I can’t speak for any other programming language, but the Python package precis-i18n implements the PRECIS framework described in RFCs 8264, 8265, 8266.
With precis-i18n, enforcing the OpaqueString profile on a password string takes only a few lines.
I found Paweł Krawczyk’s “PRECIS, the next step in Unicode validation” a very helpful introduction and source of Python examples.
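To give a feel for what OpaqueString enforcement does, here is a rough standard-library approximation. As I read RFC 8265, § 4.2, the profile maps non-ASCII spaces to U+0020, applies NFC, performs no case or width mapping, and rejects empty strings. The function name `opaque_string_sketch` is hypothetical, and the character check below covers only controls, unassigned code points, and surrogates; it is not the full FreeformClass test, so treat this as a sketch rather than a conforming implementation.

```python
import unicodedata

# Partial stand-in for the FreeformClass check (assumption of this sketch):
# controls, unassigned code points, and surrogates are rejected.
_REJECTED_CATEGORIES = {"Cc", "Cn", "Cs"}


def opaque_string_sketch(password: str) -> str:
    """Approximate OpaqueString enforcement using only the standard library."""
    for ch in password:
        if unicodedata.category(ch) in _REJECTED_CATEGORIES:
            raise ValueError(f"disallowed code point U+{ord(ch):04X}")

    # Additional mapping: every space separator (category Zs) becomes U+0020.
    mapped = "".join(
        " " if unicodedata.category(ch) == "Zs" else ch for ch in password
    )

    # Normalization: NFC. Case and width are deliberately left untouched.
    normalized = unicodedata.normalize("NFC", mapped)

    if not normalized:
        raise ValueError("password must not be empty")
    return normalized


print(opaque_string_sketch("ma\u006E\u0303ana") == "ma\u00F1ana")  # True
print(opaque_string_sketch("a\u00A0b") == "a b")                   # True
```

For real use, the library does the complete job: as I understand the precis-i18n README, `get_profile('OpaqueString')` returns a profile object whose `enforce()` method returns the enforced string and raises `UnicodeEncodeError` when the input is disallowed.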