Characters are wrongly coded

53 Views Asked by Tobias At 19 September 2023 at 15:32

I have a set of strings that are wrongly encoded. For example, some characters now appear as follows:

actual: expected
À : Ã€
é : Ã©
ú : Ãº

I have found a UTF-8 character debug tool: https://www.i18nqa.com/debug/utf8-debug.html but I am not quite sure how to apply this tool.

I want to convert the actual characters to their expected characters. I could create a dictionary and replace the actual characters with the expected ones.

But I would prefer to use a function and understand exactly how it works. Taking the character À as an example, the wrong text results from the UTF-8 bytes of À being wrongly decoded as Windows-1252: the first byte (0xC3) corresponds to Ã and the second byte (0x80) to €.
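The mechanism can be reproduced directly. This is a minimal sketch, assuming the text was originally stored as UTF-8 and later read as Windows-1252; the variable names are illustrative only.

```python
original = 'À'

# Step 1: the character is stored as UTF-8, which takes two bytes here.
utf8_bytes = original.encode('utf-8')
print(utf8_bytes)  # b'\xc3\x80'

# Step 2: the same bytes are misread as Windows-1252, one character per byte.
garbled = utf8_bytes.decode('windows-1252')
print(garbled)  # Ã€

# Show which Windows-1252 character each individual byte turns into.
for b in utf8_bytes:
    print(hex(b), bytes([b]).decode('windows-1252'))
```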

So in order to solve this I tried the following:

test = 'Ã€'
byte = test.encode('Windows-1252')
print(byte)
byte.decode('UTF-8')

This does result in the correct output: 'À'

But if I do the following:

test = 'Ã'
byte = test.encode('Windows-1252')
print(byte)
byte.decode('UTF-8')

I get the error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 0: unexpected end of data. I do not understand why this happens. Is there a method that returns all expected characters instead of their actual counterparts?
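A likely explanation is that 0xC3 is a UTF-8 lead byte: its bit pattern announces a two-byte sequence, so a lone 'Ã' supplies only the first half and the decoder runs out of data. The sketch below illustrates this and wraps the round trip in a helper that leaves the text unchanged when it cannot be repaired. The name `fix_mojibake` is hypothetical, and falling back to the input is just one possible policy.

```python
def fix_mojibake(text):
    """Hypothetical helper: undo 'UTF-8 read as Windows-1252', else return text unchanged."""
    try:
        return text.encode('windows-1252').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text contains characters outside Windows-1252, or the
        # resulting bytes are not valid UTF-8 (e.g. a lone lead byte 0xC3).
        return text

lead = 'Ã'.encode('windows-1252')
print(lead)          # b'\xc3'
print(bin(lead[0]))  # 0b11000011: the 110xxxxx prefix marks the first of two bytes

print(fix_mojibake('Ã€'))  # À
print(fix_mojibake('Ã'))   # Ã (incomplete sequence, left unchanged)
```

Plain ASCII text passes through such a helper untouched, since ASCII bytes are the same in both encodings.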


There is 1 best solution below

Mark Ransom On 19 September 2023 at 17:21

I don't know why you want mojibake, but it should be easy to do.

First you need to create a list of the Windows-1252 characters. The first 128 (the ASCII range) are encoded identically in Windows-1252 and UTF-8, so we won't worry about those.

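# Collect every character that Windows-1252 defines for bytes 0x80-0xFF.
chars = ''
for byte in range(0x80, 0x100):  # the upper bound is exclusive, so 0x100 includes 0xFF
    try:
        chars += bytes([byte]).decode('windows-1252')
    except UnicodeDecodeError:
        pass  # byte is undefined in Windows-1252 (e.g. 0x81)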

Then, for each character, you can encode it as UTF-8, decode the resulting bytes as Windows-1252, and see the result.

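for ch in chars:
    if len(repr(ch)) == 3:  # skip characters whose repr is an escape, i.e. non-printable ones
        try:
            # Encode as UTF-8, then misread the bytes as Windows-1252.
            print(ch, ch.encode('utf-8').decode('windows-1252'))
        except UnicodeDecodeError:
            pass  # one of the UTF-8 bytes has no Windows-1252 mapping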
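If the goal is the reverse direction (turning the garbled text back into the intended characters, as the code in the question does), the same loop could fill a lookup table instead of printing. This is only a sketch: `build_mojibake_map` and `repair` are hypothetical names, and it assumes every garbled sequence in the input came from a single UTF-8 to Windows-1252 misread.

```python
import re

def build_mojibake_map():
    """Map each garbled sequence to the Windows-1252 character it came from."""
    mapping = {}
    for byte in range(0x80, 0x100):
        try:
            ch = bytes([byte]).decode('windows-1252')
            garbled = ch.encode('utf-8').decode('windows-1252')
        except UnicodeDecodeError:
            continue  # undefined byte, or its UTF-8 form cannot be misread
        mapping[garbled] = ch
    return mapping

MOJIBAKE = build_mojibake_map()

# Try longer garbled sequences first so a short key never shadows a long one.
PATTERN = re.compile(
    '|'.join(re.escape(k) for k in sorted(MOJIBAKE, key=len, reverse=True))
)

def repair(text):
    """Replace every known garbled sequence with its intended character."""
    return PATTERN.sub(lambda m: MOJIBAKE[m.group(0)], text)

print(repair('cafÃ© menÃº'))  # café menú
```

Unlike re-encoding the whole string, a table like this only touches the sequences it knows, so it does not raise on incomplete input such as a lone 'Ã'. The trade-off is that it cannot recover characters whose UTF-8 bytes have no Windows-1252 mapping, such as Á.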
