Is there a comprehensive character replacement module for Python that finds all non-ASCII or non-Unicode characters in a string and replaces them with ASCII or Unicode equivalents? Silently dropping characters via the "ignore" argument during encoding or decoding is insane, but so is a '?' standing in for every character that couldn't be translated.
I'm looking for one module that finds irksome characters and conforms them to whatever standard is requested. I realize that the sheer number of alphabets and encodings in existence makes this somewhat impossible, but surely someone has taken a stab at it? Even a rudimentary solution would be better than the status quo.
This would enormously simplify data transfer.
Looking at any individual character and guessing its encoding would be hard and probably not very accurate. However, you can use chardet to try to detect the encoding of an entire file, then use the decode() and encode() methods to convert it to UTF-8; see the sketch after the link below.
http://pypi.python.org/pypi/chardet
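A minimal sketch of that workflow, assuming the input lives in a file on disk (the filenames here are placeholders): chardet.detect() takes raw bytes and returns a dict with the guessed encoding and a confidence score.

```python
import chardet

# Read raw bytes -- detection has to happen before any decoding.
with open("input.txt", "rb") as f:
    raw = f.read()

# chardet returns e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
guess = chardet.detect(raw)
encoding = guess["encoding"] or "utf-8"  # fall back if detection fails

# Decode with the guessed encoding, then re-encode as UTF-8.
# errors="replace" substitutes U+FFFD instead of raising on bad bytes.
text = raw.decode(encoding, errors="replace")
with open("output.txt", "wb") as f:
    f.write(text.encode("utf-8"))
```

Detection is statistical, so it can guess wrong on short or ambiguous inputs; checking the confidence value before trusting the result is a good habit.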
And UTF-8 is backward compatible with ASCII, so plain ASCII text is already valid UTF-8 and survives the conversion unchanged.