HTMLParser.unescape behaves like this:

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> h.unescape('alpha &lt; &beta;')
u'alpha < \u03b2'

What should I do to get the exact beta symbol instead of \u03b2?

Thanks
\u03b2 is "the exact beta symbol". You must learn to distinguish between a thing and a representation of that thing.
Your string consists of lowercase letter a, lowercase letter l, lowercase letter p, lowercase letter h, lowercase letter a, space, left angle bracket, space, and beta.
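One way to check this for yourself is to ask Python to name each character in the string. The following is a small sketch, not part of the original session: the variable names s and names are made up for illustration, and the string is typed in directly rather than produced by unescape.

```python
import unicodedata

# The same string that the interpreter session in the question displays.
s = u'alpha < \u03b2'

# Look up the official Unicode name of every character in the string.
names = [unicodedata.name(ch) for ch in s]

for ch, name in zip(s, names):
    # ord() gives the code point; formatting it keeps the output pure ASCII,
    # so this runs even on a terminal that cannot display beta.
    print('U+%04X  %s' % (ord(ch), name))
```

The last line printed is U+03B2 followed by GREEK SMALL LETTER BETA: the string holds one real beta character, not the six characters backslash, u, 0, 3, b, 2.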
The u'...' sequence is a representation of a string. It shows you one possible sequence of characters that you could type into a Python source file in order to express the concept of that string. u'foo' is the string foo. So is u'\x66\x6f\x6f'. So is u'\u0066\u006f\u006f'. When you ask Python to display the representation of any of those, it will display u'foo', because that's what Python considers to be the simplest representation of that string.

When you print u'\u0066\u006f\u006f', you will see foo, with no u prefix and no quotes, because now you are asking for a text representation instead of a source code representation. You can do the same with the string you have in your program: print h.unescape('alpha &lt; &beta;'), and if your terminal is currently capable of displaying β, you should see alpha < β. If it isn't, you'll typically get a UnicodeEncodeError, as Python attempts to send a byte representation of the string to your terminal (using some kind of string encoding to turn the characters into bytes), and the encoding isn't designed to handle β. For that problem, please see Python, Unicode, and the Windows console.
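The points above can be tried out without depending on what your terminal can display. This is only a sketch under some assumptions: the try/except import is there so the same code runs on Python 3 (where html.unescape plays the role of HTMLParser.unescape) as well as Python 2, and the names same, utf8_bytes and ascii_failed are invented for illustration. Encoding to ASCII stands in for a terminal whose encoding has no β.

```python
from __future__ import print_function

try:
    from html import unescape            # Python 3.4+
except ImportError:                      # Python 2, as in the question
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape

# Three different source-code spellings of one and the same string.
same = (u'foo' == u'\x66\x6f\x6f' == u'\u0066\u006f\u006f')
print(same)

# The unescaped text really contains the beta character U+03B2.
s = unescape('alpha &lt; &beta;')
print(s == u'alpha < \u03b2')

# An encoding that covers beta turns it into bytes without complaint.
utf8_bytes = s.encode('utf-8')

# An encoding without beta fails the way printing to such a terminal does.
try:
    s.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
print(ascii_failed)
```

All three print calls show True. If print s fails on your machine, the string is fine; it is the encoding step between Python and the terminal that cannot represent β.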