From the unicodedata documentation:
unicodedata.digit(chr[, default]) Returns the digit value assigned to the character chr as integer. If no such value is defined, default is returned, or, if not given, ValueError is raised.
unicodedata.numeric(chr[, default]) Returns the numeric value assigned to the character chr as float. If no such value is defined, default is returned, or, if not given, ValueError is raised.
Can anybody explain the difference between these two functions?
Here one can read the implementation of both functions, but the difference is not evident to me from a quick look, because I'm not familiar with the CPython implementation.
EDIT 1:
An example that shows the difference would be nice.
EDIT 2:
Some examples to complement the comments and the spectacular answer from @user2357112:
import unicodedata

print(unicodedata.digit('1'))     # DIGIT ONE -> 1
print(unicodedata.digit('١'))     # ARABIC-INDIC DIGIT ONE -> 1
print(unicodedata.numeric('Ⅱ'))   # ROMAN NUMERAL TWO -> 2.0
print(unicodedata.numeric('¼'))   # VULGAR FRACTION ONE QUARTER -> 0.25
print(unicodedata.digit('¼'))     # Not a digit, so this raises "ValueError: not a digit".
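Both functions also accept the optional default described in the documentation quoted above, which is returned instead of raising ValueError:

import unicodedata

print(unicodedata.digit('¼', None))   # None instead of ValueError
print(unicodedata.numeric('x', -1))   # -1: 'x' has no numeric value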
Short answer:
If a character represents a decimal digit, so things like 1, ¹ (SUPERSCRIPT ONE), ① (CIRCLED DIGIT ONE), or ١ (ARABIC-INDIC DIGIT ONE), unicodedata.digit will return the digit that character represents as an int (so 1 for all of these examples). If the character represents any numeric value, so things like ⅐ (VULGAR FRACTION ONE SEVENTH) and all the decimal digit examples, unicodedata.numeric will give that character's numeric value as a float. For technical reasons, more recent digit characters like 🄌 (DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT ZERO) may raise a ValueError from unicodedata.digit.
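A quick illustration of both behaviors:

import unicodedata

for ch in '1¹①١':
    print(unicodedata.digit(ch))      # 1 for every one of these characters

print(unicodedata.numeric('⅐'))       # 0.14285714285714285
print(unicodedata.digit('⅐', None))   # None: ⅐ has a numeric value, but it's not a digit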
Long answer:
Unicode characters all have a Numeric_Type property. This property can have 4 possible values: Numeric_Type=Decimal, Numeric_Type=Digit, Numeric_Type=Numeric, or Numeric_Type=None.
Per the Unicode standard, version 10.0.0, section 4.6, Numeric_Type=Decimal characters are decimal digits that also fit a few other specific technical requirements.
So Numeric_Type=Digit was historically used for other digits not fitting the technical requirements of Numeric_Type=Decimal, but that distinction was decided not to be useful, and digit characters not meeting the Numeric_Type=Decimal requirements have simply been assigned Numeric_Type=Numeric since Unicode 6.3.0. For example, 🄌 (DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT ZERO), introduced in Unicode 7.0, has Numeric_Type=Numeric.
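You can check this yourself, assuming your Python's unicodedata is built against Unicode 7.0 or later so that it knows this character:

import unicodedata

ch = '\U0001F10C'  # DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT ZERO

print(unicodedata.numeric(ch))      # 0.0: it does have a numeric value...
print(unicodedata.digit(ch, None))  # None: ...but digit() rejects it, since its
                                    # Numeric_Type is Numeric rather than Decimal or Digit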
Numeric_Type=Numeric is for all characters that represent numbers and don't fit in the other categories, and Numeric_Type=None is for characters that don't represent numbers (or at least, not under normal usage).
All characters with a non-None Numeric_Type property have a Numeric_Value property representing their numeric value.
unicodedata.digit will return that value as an int for characters with Numeric_Type=Decimal or Numeric_Type=Digit, and unicodedata.numeric will return that value as a float for characters with any non-None Numeric_Type.
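The standard library has no direct accessor for the Numeric_Type property, but as a rough sketch you can infer it by probing which lookups succeed, using the related unicodedata.decimal (which only works for Numeric_Type=Decimal characters) together with the two functions above. This is an approximation built on the behavior described here, not an official API:

import unicodedata

def numeric_type(ch):
    # Probe unicodedata: decimal() succeeds only for Numeric_Type=Decimal,
    # digit() for Decimal or Digit, numeric() for any non-None Numeric_Type.
    if unicodedata.decimal(ch, None) is not None:
        return 'Decimal'
    if unicodedata.digit(ch, None) is not None:
        return 'Digit'
    if unicodedata.numeric(ch, None) is not None:
        return 'Numeric'
    return 'None'

for ch in '1¹⅐x':
    print(ch, numeric_type(ch))   # 1 Decimal, ¹ Digit, ⅐ Numeric, x None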