Python utf-8 conversion to cp1252


I already have code that iterates through all files in a deep directory structure. All files are UTF-8 and need to be converted to cp1252, a.k.a. ANSI.

I need to achieve the same simple result as converting the file in any serious text editor. Why would there be any losses? Yes, some characters are routinely replaced by different ones: Šš=Šš Čč=Èè Ťť=?? Žž=Žž Ěě=Ìì Řř=Øø Ďď=Ïï Ňň=Òò Ůů=Ùù

But since a simple string conversion like

>>> print("Šš Čč Ťť Žž Ěě Řř Ďď Ňň Ůů".encode("utf-8").decode("cp1252"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files\Python310\lib\encodings\cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 8: character maps to <undefined>

... doesn't work, what are my chances? I have been through dozens of articles over the whole day and could not find a working solution or make sense of this code page conversion PITA. I even found complete functions and converters, obviously written for Python 2, none of which work.

Also not working:

chcp 65001

Active code page: 65001

with open(fpath, mode="r", encoding="utf-8") as fd:
    content = fd.read()
with open(fpath, mode="w", encoding="cp1252") as fd:
    fd.write(content)

or

with open(fpath, mode="r", encoding="utf-8") as fd:
    decoded = fd.decode("utf-8")
    content = decoded.encode("cp1252")

There is 1 best solution below


Your first example will never work. Encoding a Unicode string with one codec and decoding the resulting bytes with another is incorrect, because it just misinterprets the bytes. What you can do is decode a file or byte string using the encoding it was written with, then re-encode the resulting text in another encoding. The target encoding must be able to represent every code point in the text, however.
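A minimal sketch of that decode-then-re-encode round trip on an in-memory byte string. The variable names are illustrative, and errors="replace" is only one possible choice of error handler:

```python
# UTF-8 bytes, as they would be read from one of the files in binary mode.
utf8_bytes = "Šš Čč Žž".encode("utf-8")

# Step 1: decode with the encoding the bytes were actually written in.
text = utf8_bytes.decode("utf-8")

# Step 2: re-encode to the target encoding. Characters cp1252 cannot
# represent (here Č and č) become "?" because of errors="replace".
cp1252_bytes = text.encode("cp1252", errors="replace")

print(cp1252_bytes)  # b'\x8a\x9a ?? \x8e\x9e'
```

Without the errors argument, the second step raises UnicodeEncodeError at the first character cp1252 cannot represent.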

UTF-8 can encode every Unicode code point, while cp1252 is a single-byte encoding that covers fewer than 256 characters, so don't expect your files to contain the same information if you go this route.
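To see in advance which characters would be lost, something like the following sketch can help. The helper name unencodable is hypothetical. The comparison with cp1250 (the Windows Central European code page) is an assumption added here, not part of the question: it is one possible explanation for the Čč=Èè style mapping, which is what cp1250 bytes look like when they are displayed as cp1252.

```python
def unencodable(text, encoding):
    """Return the distinct characters of text that encoding cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


s = "Šš Čč Ťť Žž Ěě Řř Ďď Ňň Ůů"

# Only Š, š, Ž and ž exist in cp1252; the other 14 letters do not.
print(unencodable(s, "cp1252"))

# cp1250 can represent every character in the sample string.
print(unencodable(s, "cp1250"))  # []

# cp1250 bytes misread as cp1252 give the substitutions from the question.
print("Čč Řř".encode("cp1250").decode("cp1252"))  # Èè Øø
```

If the files really are meant for a Central European Windows setup, writing them as cp1250 instead of cp1252 may be what the text editor is doing, but that depends on the system and should be checked against a file the editor produced.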

There is an errors parameter that can be used when decoding (reading) a file and encoding (writing) a file. Here's an example of the loss from the example string provided:

>>> s = "Šš Čč Ťť Žž Ěě Řř Ďď Ňň Ůů"
>>> s.encode('cp1252',errors='ignore').decode('cp1252')
'Šš   Žž     '
>>> s.encode('cp1252',errors='replace').decode('cp1252')
'Šš ?? ?? Žž ?? ?? ?? ?? ??'

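There are non-lossy error handlers as well, but they work by substituting escape sequences (for example xmlcharrefreplace or backslashreplace) for the unsupported characters. See Error Handlers in the Python codecs documentation.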
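As a short illustration of those handlers, here is what three of the standard ones produce for two characters that cp1252 lacks. The sample string is arbitrary:

```python
s = "Čč"

# Each handler replaces an unencodable character with an ASCII escape
# that still identifies the original code point.
xml_refs = s.encode("cp1252", errors="xmlcharrefreplace")
backslash = s.encode("cp1252", errors="backslashreplace")
named = s.encode("cp1252", errors="namereplace")

print(xml_refs)   # b'&#268;&#269;'
print(backslash)  # b'\\u010c\\u010d'
print(named)
```

The information survives, but whatever reads the file later has to understand the escape syntax, so this only helps if the consumer is an HTML/XML parser or similar.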

So the second example can work, with loss, if you pass the errors parameter when writing:

with open(fpath, mode="r", encoding="utf-8") as fd:
    content = fd.read()
with open(fpath, mode="w", encoding="cp1252", errors='replace') as fd:
    fd.write(content)
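Applied to a whole directory tree, the same idea might look like the sketch below. The function names convert_file and convert_tree, the "*.txt" pattern and the default arguments are all assumptions for illustration; the question's own iteration code can call convert_file directly instead.

```python
from pathlib import Path


def convert_file(path, src="utf-8", dst="cp1252", errors="replace"):
    """Rewrite one text file in place, from the src to the dst encoding."""
    # newline="" keeps the file's existing line endings untouched.
    with open(path, mode="r", encoding=src, newline="") as fd:
        content = fd.read()
    with open(path, mode="w", encoding=dst, errors=errors, newline="") as fd:
        fd.write(content)


def convert_tree(root, pattern="*.txt", **kwargs):
    """Convert every file under root matching pattern; return the paths."""
    converted = []
    for path in sorted(Path(root).rglob(pattern)):
        if path.is_file():
            convert_file(path, **kwargs)
            converted.append(path)
    return converted
```

Because the files are overwritten and the conversion is lossy, it is worth running this on a copy of the tree first. Running it twice on the same files will fail or corrupt them, since the second pass would try to read cp1252 bytes as UTF-8.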