Flummoxed by Text Encoding in Python 3.4: How to Prevent "UnicodeEncodeError: 'charmap' codec can't encode"

Once again, problems with handling character encoding have started to haunt me. I am opening a text file containing XML and importing it into ElementTree:

import xml.etree.ElementTree as ET
import codecs

f = open('Acta_Diabetol_2008_Jun_29_45(2)_107-127.nxml','r',encoding='cp1252')
myTree = ET.parse( f )
f.close()

of = open( 'Acta_Diabetol_2008_Jun_29_45(2)_107-127.txt','w')      
for elem in myTree.iter('sec'):
    of.write( elem2StringRecurse( elem ) )  #gets mad here
of.close()

The error given is

line 197, in <module>
of.write( elem2StringRecurse( elem ) )
File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2265' 
in position 139: character maps to <undefined>

My problem is twofold. First, although I am an experienced programmer, researching this has been more chaotic than usual because text encoding is handled differently in Python 2 and 3, so I am not sure what the error means. I know that some Italian " a' " looking character is the culprit. Is it telling me that there is no Unicode substitution?

Second, how do I prevent this in the general case? I am trying to write code to pump and dump text files for natural language processing: from XML --> plain text. I can't have it crash over something like this. I believe I could manually edit out the offending character, but I can't do that for 1000 occurrences...

There is 1 best solution below

Use

f = open('Acta_Diabetol_2008_Jun_29_45(2)_107-127.nxml', 'r', encoding='utf-8')

and

of = open('Acta_Diabetol_2008_Jun_29_45(2)_107-127.txt', 'w', encoding='utf-8') 

Encoding XML in anything other than UTF-8 is asking for trouble.
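As a rough sketch of the general case: name the encoding explicitly every time a text file is opened, so the platform default is never used. The sample string, the temporary directory, and the file names below are made up for illustration. The second write shows a possible fallback using the errors argument of open(), assuming the output really must stay in cp1252; it is lossy, since every unencodable character becomes '?'.

```python
import os
import tempfile

# Hypothetical sample text containing U+2265 (the character from the traceback).
text = 'HbA1c \u2265 7%'

# Hypothetical output locations, used only for this illustration.
out_dir = tempfile.mkdtemp()
utf8_path = os.path.join(out_dir, 'out_utf8.txt')
legacy_path = os.path.join(out_dir, 'out_cp1252.txt')

# Preferred: an explicit UTF-8 encoding can represent every Unicode character.
with open(utf8_path, 'w', encoding='utf-8') as out:
    out.write(text)

# Lossy fallback: keep cp1252 but replace unencodable characters with '?'.
with open(legacy_path, 'w', encoding='cp1252', errors='replace') as out:
    out.write(text)

# Read both files back to compare what survived.
with open(utf8_path, encoding='utf-8') as f:
    round_trip = f.read()
with open(legacy_path, encoding='cp1252') as f:
    lossy = f.read()

print(round_trip == text)  # the UTF-8 file preserves the text exactly
print(lossy)               # the cp1252 file has '?' in place of U+2265
```

For natural language processing the UTF-8 route is probably the safer default, since nothing is silently dropped from the corpus.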

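Looking at your file (linked in the comments), it's clear that it's encoded in ASCII, which is a subset of UTF-8 (and also of cp1252, which is probably why Firefox and jEdit are guessing that it's using that encoding). It also contains several character escapes for code points beyond 0xFF, which ElementTree parses into the corresponding Unicode characters; the '\u2265' in your traceback is the greater-than-or-equal sign (≥), not an accented letter. Your output file is opened without an encoding argument, so on Windows Python falls back to the locale default, here cp1252 (hence cp1252.py in the traceback). That code page has no mapping for the character, so trying to write it produces the error that you encountered.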
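The behaviour described above can be reproduced in a few lines. This is only a sketch: the XML snippet is invented, and it assumes the escapes in the real file are numeric character references like the one shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical pure-ASCII XML; &#x2265; is a numeric character reference for U+2265.
xml_text = '<sec>HbA1c &#x2265; 7%</sec>'

# The source itself is plain ASCII, so this encode succeeds.
xml_text.encode('ascii')

# After parsing, the reference has become the actual character.
elem = ET.fromstring(xml_text)
text = elem.text

# cp1252 has no mapping for U+2265, so encoding fails just as in the traceback.
try:
    text.encode('cp1252')
    cp1252_ok = True
except UnicodeEncodeError as exc:
    cp1252_ok = False
    print(exc)

# UTF-8 can encode any Unicode character, so this succeeds.
utf8_bytes = text.encode('utf-8')
print(utf8_bytes)
```

Writing to a file opened with encoding='cp1252' goes through the same encode step, which is why the exception surfaces at of.write() rather than at parse time.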