I'm trying to write unicode strings to a file in Python, but when I read the file back with the Linux cat or less commands the correct characters are not shown; instead they appear as garbage.
I am reading the object from an Oracle database. When I print the type (where a is a row in the database results):
logger.debug(type(a[index]))
it outputs:
<type 'unicode'>
I open the file for writing like so:
ff = codecs.open(filename, mode='w', encoding='utf-8')
and I write the line to the file like:
ff.write(a[index])
but when I read the output file, it doesn't show the correctly accented characters but garbage instead:
$Bu��rger, Udo, -1985. Way to perfect horsemanship
How do I correctly write unicode string objects to a file in Python?
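For reference, here is a stripped-down, runnable sketch of what I am doing; the function name and the rows/index variables are simplified stand-ins for my real script, and the Oracle query itself happens elsewhere.

    import codecs
    import logging

    logger = logging.getLogger(__name__)

    def dump_column(rows, index, filename):
        # rows is the Oracle result set, index the column holding the text.
        ff = codecs.open(filename, mode='w', encoding='utf-8')
        try:
            for a in rows:
                logger.debug(type(a[index]))  # logs <type 'unicode'>
                ff.write(a[index])
                ff.write(u'\n')
        finally:
            ff.close()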
I can guess at how you arrived at that Mojibake of a string. It is quite involved; I am impressed at how mucked up this got to be.
Something decoded the text from bytes to Unicode with errors='replace', masking the fact that the wrong codec was used: the bytes that weren't recognized were simply replaced with replacement characters. The resulting Unicode text, now containing U+FFFD REPLACEMENT CHARACTER codepoints, was then encoded to UTF-8 but decoded again as Latin-1, most likely by your terminal as cat or less output the raw bytes. The text that comes out of that chain is the garbage you pasted.
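Here is a short sketch of the chain I suspect, in Python 2 to match your <type 'unicode'> output; exactly which component performed the bad decode (a driver setting, an intermediate script) is a guess on my part:

    # What was probably stored: 'u' plus U+0308 COMBINING DIAERESIS, as UTF-8 bytes.
    stored = u'Bu\u0308rger, Udo, - 1985.'.encode('utf-8')

    # 1) Something decoded those bytes with the wrong codec and errors='replace',
    #    so the CC 88 pair became two U+FFFD REPLACEMENT CHARACTER codepoints.
    mangled = stored.decode('ascii', 'replace')
    print repr(mangled)                    # u'Bu\ufffd\ufffdrger, Udo, - 1985.'

    # 2) The damaged unicode value was then written out as UTF-8, turning each
    #    U+FFFD into the three bytes EF BF BD.
    written = mangled.encode('utf-8')

    # 3) cat or less dumped those raw bytes and the terminal read them as
    #    Latin-1, which produces the kind of garbage you pasted.
    print repr(written.decode('latin-1'))  # u'Bu\xef\xbf\xbd\xef\xbf\xbdrger, ...'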
Presumably this was meant to be Bürger, Udo, - 1985. Way to perfect horsemanship, with the ü being formed by the character u followed by the U+0308 COMBINING DIAERESIS codepoint, which would have been the byte pair CC 88 in UTF-8 but is not decodable as ASCII.
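A quick illustration of that failing step (again Python 2; the variable names are just for this sketch):

    utf8 = u'Bu\u0308rger'.encode('utf-8')
    print repr(utf8)                   # 'Bu\xcc\x88rger' -- note the CC 88 pair

    try:
        utf8.decode('ascii')           # the honest failure you never got to see
    except UnicodeDecodeError as exc:
        print exc                      # 'ascii' codec can't decode byte 0xcc ...

    print repr(utf8.decode('ascii', 'replace'))
    # u'Bu\ufffd\ufffdrger' -- errors='replace' hides the problem behind two U+FFFDs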
The moral of the story: Don't use errors='replace' unless you are absolutely sure what you are doing.