How does the Python XML parser detect the encoding (UTF-8 vs. UTF-16)?


The Python XML parser can parse byte strings in various encodings, even if no encoding is specified in the XML declaration:

from xml.etree import ElementTree as ET

xml_string = '<doc>Glück</doc>'

xml_utf_8 = xml_string.encode('utf-8')
xml_utf_16 = xml_string.encode('utf-16')

print(ET.fromstring(xml_utf_8).text)
print(ET.fromstring(xml_utf_16).text)

Output:

Glück
Glück
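One detail that may explain part of this behaviour: Python's 'utf-16' codec prepends a byte order mark (BOM) when encoding, while 'utf-8' does not. The check below is only a small sketch that reuses the variables from the example above; the name starts_with_bom is introduced here for illustration.

```python
import codecs
from xml.etree import ElementTree as ET

xml_string = '<doc>Glück</doc>'

xml_utf_8 = xml_string.encode('utf-8')
xml_utf_16 = xml_string.encode('utf-16')

# The 'utf-16' codec uses the platform's native byte order and writes a BOM first.
starts_with_bom = xml_utf_16[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

print(xml_utf_8[:2])    # plain ASCII start of the document, no BOM
print(xml_utf_16[:2])   # the BOM; which one depends on the platform
print(starts_with_bom)
```

So the UTF-16 input in the example carries an explicit marker that the parser can look at before it reads any markup.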

Questions:

  • Is it safe to let the parser detect the correct encoding (UTF-8 vs. UTF-16)? Other encodings fail unless they are specified explicitly.
  • The detection seems to be done in the Expat C library. How does it reliably detect the right encoding?

Answer:

At the moment, the code that detects the encoding in Expat is in the function initScan in the file xmltok.c. It inspects individual bytes at the start of the input and compares them, for example, to the byte order marks of little-endian UTF-16, big-endian UTF-16 and UTF-8; null bytes also play a part. To find the places where the code makes a final decision about an encoding, you could run this on a Git clone of Expat:

# git --no-pager grep -F '= encodingTable[UTF'
lib/xmltok.c:      *encPtr = encodingTable[UTF_16BE_ENC];
lib/xmltok.c:      *encPtr = encodingTable[UTF_16LE_ENC];
lib/xmltok.c:      *encPtr = encodingTable[UTF_16LE_ENC];
lib/xmltok.c:        *encPtr = encodingTable[UTF_8_ENC];
lib/xmltok.c:        *encPtr = encodingTable[UTF_16BE_ENC];
lib/xmltok.c:        *encPtr = encodingTable[UTF_16LE_ENC];
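
To make the idea more concrete, here is a rough Python sketch of that kind of check: look for a known byte order mark first, then fall back to the position of a null byte in the first two bytes. This is a simplified illustration and not a port of Expat's initScan; the function name sniff_encoding and the fallback to UTF-8 are assumptions made for the example.

```python
import codecs


def sniff_encoding(data: bytes) -> str:
    """Guess UTF-8 vs. UTF-16 from the leading bytes (hypothetical helper)."""
    # 1. A byte order mark settles the question directly.
    if data.startswith(codecs.BOM_UTF8):
        return 'utf-8'
    if data.startswith(codecs.BOM_UTF16_BE):
        return 'utf-16-be'
    if data.startswith(codecs.BOM_UTF16_LE):
        return 'utf-16-le'

    # 2. Without a BOM, an ASCII character such as '<' encoded as UTF-16
    #    has a null byte either before it (big endian) or after it (little endian).
    if len(data) >= 2:
        if data[0] == 0 and data[1] != 0:
            return 'utf-16-be'
        if data[0] != 0 and data[1] == 0:
            return 'utf-16-le'

    # 3. Assumed default for this sketch: treat everything else as UTF-8.
    return 'utf-8'


xml_string = '<doc>Glück</doc>'
for codec in ('utf-8', 'utf-8-sig', 'utf-16', 'utf-16-le', 'utf-16-be'):
    print(codec, '->', sniff_encoding(xml_string.encode(codec)))
```

This also hints at why the heuristic only covers UTF-8 and UTF-16: it relies on a BOM or on the document starting with an ASCII character, so any other encoding has to be declared explicitly.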