Characters are wrongly coded

I have a set of strings that are wrongly encoded. For example some characters are now:

actual: expected
À : À
é : é
ú : ú

I have found a UTF-8 character debug tool: https://www.i18nqa.com/debug/utf8-debug.html but I am not quite sure how to apply this tool.

I want to convert each actual character to its expected character. I could create a dictionary and replace the actual characters with the expected ones.

But I would prefer to use a function and understand exactly how it works. Taking À as an example, the wrong text arises because the UTF-8 bytes of À were wrongly decoded as Windows-1252. The first byte corresponds to à and the second byte to €.
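To make that reasoning concrete, here is a minimal sketch that reproduces the garbling for a single character and then reverses it. The variable names (`expected`, `raw`, `garbled`, `restored`) are illustrative only, and it assumes the damage came from UTF-8 bytes being read as Windows-1252.

```python
# Illustrative only: reproduce the mojibake for one character.
expected = '\u00c0'                      # the character À
raw = expected.encode('utf-8')           # two bytes: 0xC3 0x80
garbled = raw.decode('windows-1252')     # each byte is read as its own character
print(raw, garbled)

# Reversing the two steps recovers the original character.
restored = garbled.encode('windows-1252').decode('utf-8')
print(restored)
```

The garbled form is two characters long because one UTF-8 sequence of two bytes was read as two separate single-byte characters.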

So in order to solve this I tried the following:

test = 'À'
byte = test.encode('Windows-1252')
print(byte)
byte.decode('UTF-8')

This does result in the correct output: 'À'

But if I do the following:

test = 'Ã'
byte = test.encode('Windows-1252')
print(byte)
byte.decode('UTF-8')

I get the error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 0: unexpected end of data. I do not understand why this happens. Is there still a method to return all expected characters instead of the actual counterpart?
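For what it is worth, a lone Ã encodes to the single byte 0xC3 in Windows-1252, and in UTF-8 that byte only starts a two-byte sequence, so decoding it without a following continuation byte fails. Below is a minimal sketch of a tolerant round trip; `fix_mojibake` is a hypothetical helper name, not a library function, and it assumes the text was produced by reading UTF-8 bytes as Windows-1252.

```python
def fix_mojibake(text):
    """Hypothetical helper: undo UTF-8 bytes that were read as Windows-1252."""
    try:
        return text.encode('windows-1252').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text is not a complete mis-decoded UTF-8 sequence
        # (for example a lone lead byte), so return it unchanged.
        return text

print(fix_mojibake('\u00c3\u20ac'))  # 'Ã' followed by '€'
print(fix_mojibake('\u00c3'))        # a lone 'Ã' is returned as it is
```

Note that this treats the string as a whole: if any part of it fails the round trip, the entire string comes back unchanged. Python's Windows-1252 codec also leaves bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined, so characters whose UTF-8 form contains one of those bytes may not survive this approach.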

Answer by Mark Ransom

I don't know why you want Mojibake but it should be easy to do.

First you need to create a list of the Windows-1252 characters. The first 128 will be identical between Windows-1252 and UTF-8, so we won't worry about those.

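chars = ''
# Bytes 0x80 through 0xff inclusive; the range end is exclusive, so use 0x100.
for byte in range(0x80, 0x100):
    try:
        chars = chars + bytes([byte]).decode('windows-1252')
    except UnicodeDecodeError:
        # A few byte values are undefined in Windows-1252; skip them.
        pass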

Then, for each character, you can encode it as UTF-8, decode those bytes as Windows-1252, and see the result.

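for ch in chars:
    # repr() of a printable character is the character in quotes (length 3);
    # non-printable ones are shown as longer escapes and are skipped.
    if len(repr(ch)) == 3:
        try:
            print(ch, ch.encode('utf-8').decode('windows-1252'))
        except UnicodeDecodeError:
            # The UTF-8 form contains a byte that Windows-1252 leaves undefined.
            pass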
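As a sketch of how the pairs above might be put to use, the same loop can collect them into a lookup table that maps each garbled sequence back to its intended character. The names `mapping`, `pattern` and `replace_all` are hypothetical, and the single-pass regular expression is a design choice made here so that replaced text is never scanned again.

```python
import re

# Build a lookup table from garbled text to the intended character.
mapping = {}
for byte in range(0x80, 0x100):
    try:
        ch = bytes([byte]).decode('windows-1252')
        garbled = ch.encode('utf-8').decode('windows-1252')
    except UnicodeDecodeError:
        # Either the byte or part of its UTF-8 form is undefined in Windows-1252.
        continue
    mapping[garbled] = ch

# One alternation over all garbled sequences, longest first.
pattern = re.compile('|'.join(
    re.escape(key) for key in sorted(mapping, key=len, reverse=True)))

def replace_all(text):
    """Hypothetical helper: replace every known garbled sequence in one pass."""
    return pattern.sub(lambda match: mapping[match.group(0)], text)

print(replace_all('caf\u00c3\u00a9'))  # 'café' garbled as 'cafÃ©'
```

Characters skipped by the `except` branch have no entry in the table, so their garbled forms are left untouched.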