In Python 2.7 at least, unicodedata.name() doesn't recognise certain characters.
>>> from unicodedata import name
>>> name(u'\n')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
>>> name(u'a')
'LATIN SMALL LETTER A'
Certainly Unicode contains the character \n, and it has a name, specifically "LINE FEED".

NB. unicodedata.lookup('LINE FEED') and unicodedata.lookup(u'LINE FEED') both give a KeyError: undefined character name.

The unicodedata.name() lookup relies on the name field of the UnicodeData.txt database in the standard, which is the second semicolon-separated column (Python 2.7 uses Unicode 5.2.0). If that name starts with < it is ignored. All control codes, including newlines, are in that category; their name field holds nothing other than the generic <control>. The eleventh column (field 10 in the standard's zero-based numbering) holds the old Unicode 1.0 name, which for \n is LINE FEED (LF), and should not be used, according to the standard. In other words, \n has no name other than the generic <control>, which the Python database ignores (as it is not unique).

Python 3.3 added support for NameAliases.txt, which lets you look up names by alias; so
lookup('LINE FEED'), lookup('new line'), lookup('eol'), etc., all reference \n. However, the unicodedata.name() function does not support aliases, nor could it (which would it pick?), so name('\n') still raises ValueError: no such name.

TL;DR: LINE FEED is not the official name for \n; it is but an alias for it. Python 3.3 and up let you look up characters by alias.
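
If you just need something printable for characters that have no official name, one option is to pass a default to name() and fall back to the category and code point. This is only a sketch for Python 3.3+; describe is a made-up helper for illustration, not part of unicodedata, and the fallback format is an arbitrary choice.

```python
import unicodedata


def describe(ch):
    """Return a readable label for ch (hypothetical helper, not a stdlib API)."""
    # name() takes an optional default that is returned instead of raising ValueError.
    official = unicodedata.name(ch, None)
    if official is not None:
        return official
    # Control characters have no official name, so report category and code point.
    return "<{} U+{:04X}>".format(unicodedata.category(ch), ord(ch))


# Aliases from NameAliases.txt resolve in lookup() on Python 3.3 and later.
assert unicodedata.lookup("LINE FEED") == "\n"
assert unicodedata.lookup("eol") == "\n"

print(describe("a"))   # LATIN SMALL LETTER A
print(describe("\n"))  # <Cc U+000A>
```

The reverse direction (character to alias) is not offered by unicodedata, so if you want "LINE FEED" back for \n you would have to keep your own mapping.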