Wrong encoding when displaying an HTTP request in Python


I do not understand why, when I make an HTTP request using the Requests library and then display r.text, special characters (such as accents) come back encoded (é = &eacute; for example).

Yet when I try r.encoding, I get utf-8.

In addition, the problem occurs only on some websites. Sometimes I have the correct characters, but other times, not at all.

Try as follows:

import requests

r = requests.get("https://gks.gs/login")
print(r.text)

Encoded characters are displayed there; for example, we can see Mot de passe oubli&eacute; ?.

I do not understand why. Do you think it may be because of https? How can I fix this?


There are 4 best solutions below

0
On BEST ANSWER

These are HTML character entity references. The easiest way to decode them is:

In Python 2.x:

>>> import HTMLParser
>>> print(HTMLParser.HTMLParser().unescape('oubli&eacute;'))
oublié

In Python 3.x (3.4 and later; HTMLParser.unescape was removed in 3.9):

>>> import html
>>> html.unescape('oubli&eacute;')
'oublié'
0
On

These are HTML escape codes, defined in the HTML Coded Character Set. Even though a certain document may be encoded in UTF-8, HTML (and its grandparent, SGML) were defined back in the good old days of ASCII. A system accessing an HTML page on the WWW may or may not natively support extended characters, and the developers needed a way to define "advanced" characters for some users, while failing gracefully for other users whose systems could not support them. Since UTF-8 standardization was only a gleam in its founders' eyes at that point, an encoding system was developed to describe characters that weren't part of ASCII. It was up to the browser developers to implement a way of displaying those extended characters, either through glyphs or through extended fonts.
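To make the ASCII-fallback idea concrete, here is a minimal Python 3 sketch. It assumes the sample phrase from the question as input and uses only the standard library: the text is encoded to pure ASCII with numeric character references, then decoded back.

```python
import html

# Sample phrase from the question; \u00e9 is the character "é".
text = "Mot de passe oubli\u00e9 ?"

# Encode to pure ASCII. Characters ASCII cannot represent are
# replaced with numeric character references such as &#233;.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")
print(ascii_bytes)

# A consumer that understands the references recovers the original text.
round_tripped = html.unescape(ascii_bytes.decode("ascii"))
print(round_tripped == text)
```

The intermediate bytes contain nothing outside ASCII, which is why a page can declare UTF-8 and still carry such references.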

0
On

Encoding special characters as &something; is legal in any HTML document; although such references look a bit strange, they are valid.

The text is meant to be rendered by an HTML browser, which displays the correct character whether it appears as an entity reference or directly.

For instructions on how to convert these encoded characters, see HTML Entity Codes to Text.
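As a rough illustration of that equivalence in Python 3.4+, the sketch below decodes the same word written four ways. The sample strings are made up for this example, not fetched from the site in the question.

```python
import html

# Illustrative inputs: the same word with "é" written four ways.
samples = [
    "oubli&eacute;",  # named character reference
    "oubli&#233;",    # decimal numeric reference
    "oubli&#xE9;",    # hexadecimal numeric reference
    "oubli\u00e9",    # literal character, left untouched by unescape
]

decoded = [html.unescape(s) for s in samples]
print(decoded)

# All four forms collapse to one string after decoding.
print(len(set(decoded)) == 1)
```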

0
On

Those are HTML escape codes, often referred to as HTML entities. As you see, HTML uses its own code to replace reserved symbols.

You can use the HTMLParser module (Python 2):

import HTMLParser

parser = HTMLParser.HTMLParser()  # instantiate the parser; the original omitted the parentheses
parsed = parser.unescape(r.text)
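If the same script has to run on both Python 2 and Python 3, one possible approach is an import fallback. This is only a sketch: `page_text` is a hypothetical stand-in for `r.text`, so the example runs without a network call.

```python
try:
    from html import unescape  # Python 3.4+
except ImportError:
    # Python 2 fallback: use the bound method of a parser instance.
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape

# Hypothetical stand-in for r.text from the question.
page_text = "Mot de passe oubli&eacute; ?"

decoded = unescape(page_text)
print(decoded)
```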