Encoding and decoding of characters is not treated the same for all Polish letters


From another source I get names containing Polish letters (ń and ó), like the two below:

  • piaseczyÅ„ski
  • zielonogórski

Of course, there are more than two such names.

The first should look like piaseczyński, while the second already looks fine. When I try to fix them with str(entity_name).encode('1252').decode('utf-8'), the first is fixed, but the second raises an error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 8: invalid continuation byte

Why are Polish letters not treated the same? How can I fix it?

Thomas On BEST ANSWER

As you probably realise already, those strings have different encodings. The best approach is to fix it at the source, so that it always returns UTF-8 (or at least some consistent, known encoding).
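To see why the two names behave differently, it can help to look at the bytes. The sketch below assumes the first name was UTF-8 that got decoded as Windows-1252 somewhere upstream, while the second arrived correctly decoded; the variable names are only illustrative.

```python
# Assumption: the first name is UTF-8 text that was wrongly decoded as cp1252.
good = "piaseczyński"
mangled = good.encode("utf-8").decode("cp1252")
print(mangled)  # piaseczyÅ„ski

# Reversing the mistake recovers the original text.
repaired = mangled.encode("cp1252").decode("utf-8")
print(repaired)  # piaseczyński

# The second name is already correct, so "reversing" it produces a lone
# 0xF3 byte, which is not a valid UTF-8 sequence on its own.
intact = "zielonogórski"
raw = intact.encode("cp1252")
print(raw)  # b'zielonog\xf3rski'

try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc  # reports byte 0xf3 at position 8
```

So the letters themselves are not treated differently; the two strings simply went through different (mis)decoding steps before reaching you.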

If you really can't do that, you should try to decode as UTF-8 first, because it's more strict: not every string of bytes is valid UTF-8. If you get UnicodeDecodeError, try to decode it as some other encoding:

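def decode_crappy_bytes(b):
    # UTF-8 is strict, so try it first; most non-UTF-8 input fails here.
    try:
        return b.decode('utf-8')
    except UnicodeDecodeError:
        # Fall back to Windows-1252 ('1252' is a Python alias for 'cp1252').
        return b.decode('cp1252')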
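A quick usage check, assuming the raw bytes are still available (the function is repeated so the snippet runs on its own, and the sample inputs are made up from the names in the question):

```python
def decode_crappy_bytes(b):
    try:
        return b.decode("utf-8")
    except UnicodeDecodeError:
        return b.decode("cp1252")

# Hypothetical inputs: one name delivered as UTF-8, one as cp1252.
utf8_bytes = "piaseczyński".encode("utf-8")
cp1252_bytes = "zielonogórski".encode("cp1252")

first = decode_crappy_bytes(utf8_bytes)     # decoded on the first attempt
second = decode_crappy_bytes(cp1252_bytes)  # UTF-8 fails, cp1252 fallback used
print(first, second)  # piaseczyński zielonogórski
```

This only helps if you can get at the bytes before anything has decoded them; once you have a str, the damage may already be done.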

Note that this can still fail, in two ways:

  1. If you get a string in some non-UTF-8 encoding that happens to be decodable as UTF-8 as well.
  2. If you get a string in a non-UTF-8 encoding that's not Windows codepage 1252. Another common one in Europe is ISO-8859-1 (Latin-1). Nearly every byte string that's valid in one is also valid in the other, so a decoding error won't tell them apart.
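Both failure modes can be reproduced in a few lines. This is only an illustration: the sample strings are made up, and cp1250 is my assumption for "some other legacy encoding" (it is a common one for Polish text).

```python
# Failure mode 1: cp1252 bytes that also happen to be valid UTF-8.
# The sender meant the two characters "Ã³", but the UTF-8 attempt succeeds.
ambiguous = "Ã³".encode("cp1252")
as_utf8 = ambiguous.decode("utf-8")
print(ambiguous, as_utf8)  # b'\xc3\xb3' ó

# Failure mode 2: bytes in a legacy encoding other than cp1252.
# Decoding as cp1252 raises no error; it silently yields the wrong letter.
polish = "ń".encode("cp1250")
as_1252 = polish.decode("cp1252")
print(polish, as_1252)  # b'\xf1' ñ
```

Neither case raises an exception, which is why the try/except approach cannot detect them.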

If you do need to deal with multiple different non-UTF-8 encodings and you know that it should be Polish, you could count the number of non-ASCII Polish letters in each possible decoding, and return the one with the highest score. Still not infallible, so really, it's best to fix it at the source.
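A minimal sketch of that scoring idea follows. The candidate list, the letter set, and the name guess_decoding are my own assumptions, not something from the question; adjust them to whatever encodings your source can actually produce.

```python
# Non-ASCII letters of the Polish alphabet, lower and upper case.
POLISH_LETTERS = set("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ")

# Assumed candidates, in order of preference; ties go to the earlier one.
CANDIDATES = ("utf-8", "cp1250", "iso-8859-2", "cp1252")


def guess_decoding(raw, candidates=CANDIDATES):
    """Return the decoding of raw that contains the most Polish letters.

    Returns None if no candidate encoding can decode the bytes at all.
    """
    best_text, best_score = None, -1
    for encoding in candidates:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
        score = sum(ch in POLISH_LETTERS for ch in text)
        if score > best_score:
            best_text, best_score = text, score
    return best_text


print(guess_decoding("piaseczyński".encode("utf-8")))   # piaseczyński
print(guess_decoding("żarski".encode("cp1250")))        # żarski
```

Short inputs with no Polish letters at all score zero everywhere, so they fall back to the first candidate that decodes; that is one of the ways this heuristic can still guess wrong.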

kicaj On

@Thomas I added another except clause and now it works perfectly:

try:
    entity_name = entity_name.encode('1252').decode('utf-8')
except (UnicodeDecodeError, UnicodeEncodeError):
    # Leave the name unchanged if it cannot be round-tripped.
    pass

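This also passes for żarski, where encode('1252') raises UnicodeEncodeError because ż does not exist in that code page.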