Reverse Python's encoding of umlauts to normalize text, or normalize in current form

Python automatically reads German umlauts and punctuation as

Gefrier- und TiefkÃ¼hlmÃ¶bel

How do I normalize this output to remove punctuation?

There are 1 best solutions below

You could "fix" the encoding issue by doing:

the_string = 'Gefrier- und TiefkÃ¼hlmÃ¶bel'.encode('latin-1').decode('utf-8')

And then apply a solution like this one: https://stackoverflow.com/a/518232/2452074

import unicodedata

def strip_accents(s):
    # Decompose each character (NFD), then drop the combining marks (category 'Mn').
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

strip_accents(the_string)
> 'Gefrier- und Tiefkuhlmobel'
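
One caveat: the encode/decode round trip raises an error on text that is already correct, because a real 'ü' is not valid UTF-8 once encoded as Latin-1. A minimal sketch of a guarded version follows; the helper name fix_mojibake is made up for illustration, and it assumes the damage came from UTF-8 bytes being read as Latin-1:

```python
import unicodedata

def fix_mojibake(s):
    """Undo UTF-8-read-as-Latin-1 damage; return s unchanged if the repair does not apply.

    Hypothetical helper, not part of any library.
    """
    try:
        return s.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either s has characters outside Latin-1, or the bytes are not valid UTF-8,
        # so s was most likely not damaged in this particular way.
        return s

def strip_accents(s):
    # Decompose each character (NFD), then drop the combining marks (category 'Mn').
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

broken = 'Gefrier- und TiefkÃ¼hlmÃ¶bel'
clean = 'Gefrier- und Tiefkühlmöbel'

print(strip_accents(fix_mojibake(broken)))
print(strip_accents(fix_mojibake(clean)))
```

Both calls should give the same accent-free string. The guard is only a heuristic: a string that legitimately contains a sequence like 'Ã¼' would still be rewritten.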

But first, I would try to understand why your input looks broken; Python itself shouldn't do that automatically.
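
A common cause of this kind of breakage is UTF-8 data being decoded with a single-byte codec such as Latin-1 somewhere on the way in. The sketch below reproduces that with a temporary file (a stand-in for whatever the real input source is, which the question does not say) and shows that passing the right encoding to open() avoids the problem:

```python
import os
import tempfile

text = 'Gefrier- und Tiefkühlmöbel'

# Write the text as UTF-8 bytes to a temporary file (stand-in for the real input).
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as f:
    f.write(text.encode('utf-8'))
    path = f.name

try:
    # Wrong codec: each UTF-8 byte becomes its own character, so 'ü' turns into 'Ã¼'.
    with open(path, encoding='latin-1') as f:
        wrong = f.read()
    # Matching codec: the text comes back intact.
    with open(path, encoding='utf-8') as f:
        right = f.read()
finally:
    os.remove(path)

print(wrong == text)   # False
print(right == text)   # True
```

If the data comes from somewhere else (a database, an HTTP response, a CSV reader), the same idea applies: find the place where bytes become text and set the encoding there.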

Some background docs on unicode and encodings: https://docs.python.org/3/howto/unicode.html