How does the Python XML parser detect the encoding (UTF-8 vs. UTF-16)?

744 Views Asked by Steve At 27 June 2025 at 07:54

The Python XML parser can parse byte strings in various encodings, even if no encoding is specified in the XML declaration:

from xml.etree import ElementTree as ET

xml_string = '<doc>Glück</doc>'

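# Encode the same document twice; no XML declaration names the encoding.
xml_utf_8 = xml_string.encode('utf-8')
# Note: the 'utf-16' codec prepends a byte order mark (BOM).
xml_utf_16 = xml_string.encode('utf-16')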

print(ET.fromstring(xml_utf_8).text)
print(ET.fromstring(xml_utf_16).text)

Output:

Glück
Glück

Questions:

Is it safe to let the parser detect the correct encoding (UTF-8 vs. UTF-16)? Other encodings fail unless they are specified for the parser.
The detection seems to be done in the Expat C library. How does it reliably detect the right encoding?


There is 1 best solution below

Sebastian On 23 October 2023 at 19:44

The encoding detection in Expat currently lives in the function initScan in the file xmltok.c. It inspects individual bytes and, for example, compares them to the byte order marks (BOMs) known for little-endian UTF-16, big-endian UTF-16 and UTF-8; null bytes also play a part. To find the places where the code makes a final decision about an encoding, you could run this on a Git clone of Expat:

# git --no-pager grep -F '= encodingTable[UTF'
lib/xmltok.c:      *encPtr = encodingTable[UTF_16BE_ENC];
lib/xmltok.c:      *encPtr = encodingTable[UTF_16LE_ENC];
lib/xmltok.c:      *encPtr = encodingTable[UTF_16LE_ENC];
lib/xmltok.c:        *encPtr = encodingTable[UTF_8_ENC];
lib/xmltok.c:        *encPtr = encodingTable[UTF_16BE_ENC];
lib/xmltok.c:        *encPtr = encodingTable[UTF_16LE_ENC];
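
To illustrate the idea of BOM and null-byte checks, here is a rough Python sketch. It is a simplified approximation written for this answer, not a port of Expat's initScan: the helper name sniff_encoding is hypothetical, it only distinguishes UTF-8 from UTF-16, and it assumes the document starts with an ASCII character such as '<' when there is no BOM.

```python
import codecs


def sniff_encoding(data: bytes) -> str:
    """Guess UTF-8 vs. UTF-16 from the leading bytes.

    Hypothetical helper for illustration only; the BOM itself is not stripped.
    """
    # 1. Byte order marks give an explicit answer.
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16-le"

    # 2. Without a BOM, a null byte next to an ASCII character such as '<'
    #    hints at UTF-16 and at its byte order.
    if len(data) >= 2:
        if data[0] == 0:
            return "utf-16-be"  # e.g. 00 3C
        if data[1] == 0:
            return "utf-16-le"  # e.g. 3C 00

    # 3. Otherwise fall back to UTF-8 (the assumed default in this sketch).
    return "utf-8"


xml_string = "<doc>Glück</doc>"
print(sniff_encoding(xml_string.encode("utf-8")))      # utf-8
print(sniff_encoding(xml_string.encode("utf-16-le")))  # utf-16-le
print(sniff_encoding(xml_string.encode("utf-16-be")))  # utf-16-be
```

The null-byte heuristic only works because a well-formed XML document begins with a character from the ASCII range, so treat the sketch as an illustration of the principle rather than a general-purpose detector.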
