Python: URL is redirected during requests.get() and raises the famous UnicodeDecodeError


I use Visual Studio Code with Python 3.12.2 and BeautifulSoup 4.12.3 on Windows 11. File encoding is set to UTF-8.

This is my code sample in VS Code:

import requests
import urllib.parse
from urllib.parse import quote

from bs4 import BeautifulSoup

for topic in range(13717, 13718):
    url = 'https://www.scale-rc-car.com/forum/showthread.php?t='+str(topic) +'&pp=1&page=1'
    print(url)
    html_content = requests.get(url)
    soup = BeautifulSoup(html_content.text, 'html.parser')

    

print(url) shows the constructed URL with the correct topic number (13717): https://www.scale-rc-car.com/forum/showthread.php?t=13717&pp=1&page=1, which is exactly what I want.

But here's the rub: I get the frequently reported "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 64: invalid continuation byte".

The thing is, as soon as the html_content = requests.get(url) statement is executed, the URL seems to change to: https://www.scale-rc-car.com/forum/showthread.php?13717-Buggy-d-%E9tag%E8re-Team-Associated-RC10CC&pp=20

I can confirm this by pasting the constructed URL (https://www.scale-rc-car.com/forum/showthread.php?t=13717&pp=1&page=1) into a web browser: when I hit ENTER, the URL changes and gains the phrase -Buggy-d-%E9tag%E8re-Team-Associated-RC10CC. As you can see, the characters é and è are replaced by %E9 and %E8 respectively, and the result is the UnicodeDecodeError. The question is: how can I avoid or error-trap this problem? Extra info: I don't know beforehand whether the URL will contain special characters.

This is the complete error message:

PS C:\xampp\htdocs\python> python dumpy.py
https://www.scale-rc-car.com/forum/showthread.php?t=13717&pp=1&page=1
Traceback (most recent call last):
  File "C:\xampp\htdocs\python\dumpy.py", line 10, in <module>
    html_content = requests.get(url)
                   ^^^^^^^^^^^^^^^^^
  File "C:\Users\bartz\AppData\Roaming\Python\Python312\site-packages\requests\api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\bartz\AppData\Roaming\Python\Python312\site-packages\requests\api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\bartz\AppData\Roaming\Python\Python312\site-packages\requests\sessions.py", line 589, in request     
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\bartz\AppData\Roaming\Python\Python312\site-packages\requests\sessions.py", line 725, in send        
    history = [resp for resp in gen]
              ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\bartz\AppData\Roaming\Python\Python312\site-packages\requests\sessions.py", line 175, in resolve_redirects
    url = self.get_redirect_target(resp)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\bartz\AppData\Roaming\Python\Python312\site-packages\requests\sessions.py", line 124, in get_redirect_target
    return to_native_string(location, "utf8")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\bartz\AppData\Roaming\Python\Python312\site-packages\requests\_internal_utils.py", line 33, in to_native_string
    out = string.decode(encoding)
          ^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 64: invalid continuation byte
PS C:\xampp\htdocs\python>
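
The last two frames of the traceback show requests decoding the redirect target as UTF-8. The failure can be reproduced without any network access. This is only a minimal sketch: it assumes the server sends é and è in the redirect target as single ISO-8859-1 bytes, and the URL is the redirect target quoted in this thread with the percent escapes written as literal characters.

```python
# Redirect target from this thread, with é and è as literal characters.
location = (
    "https://www.scale-rc-car.com/forum/showthread.php"
    "?13717-Buggy-d-étagère-Team-Associated-RC10CC&pp=1"
)

# Assumption: the server emits these characters as single ISO-8859-1 bytes.
raw = location.encode("iso-8859-1")

error = None
try:
    # Decoding the same bytes as UTF-8 is where the traceback ends.
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

# Decoded with ISO-8859-1 instead, the bytes round-trip without error.
print(raw.decode("iso-8859-1") == location)
```

Under that assumption the byte 0xe9 (é) sits at offset 64 and is followed by a plain ASCII "t", which is not a valid UTF-8 continuation byte; that matches the byte and position reported in the error message.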

Answered by JosefZ

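Get the redirected URL with urllib.request (the standard library's extensible library for opening URLs) and pass that URL to requests.get(). The URL returned by urllib is already percent-encoded, so requests no longer has to decode a redirect target; see final_url below: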

import urllib.request
from urllib.parse import unquote

import requests
from bs4 import BeautifulSoup

for topic in range(13717, 13718):
    url = 'https://www.scale-rc-car.com/forum/showthread.php?t=' + str(topic) + '&pp=1&page=1'
    print(url)
    # Let urllib follow the redirect and report the final, percent-encoded URL.
    with urllib.request.urlopen(url) as cm:
        final_url = cm.geturl()
        print(cm.headers.get_content_charset())       # iso-8859-1
    print(final_url)
    # Human-readable form of the URL, decoded with the charset printed above.
    print(unquote(final_url, encoding='iso-8859-1'))
    # Request the final URL directly, so requests has no redirect to decode.
    html_content = requests.get(final_url)
    soup = BeautifulSoup(html_content.text, 'html.parser')
    print(type(soup))

All the print calls are there for debugging purposes only.

Output: .\SO\78094322.py

https://www.scale-rc-car.com/forum/showthread.php?t=13717&pp=1&page=1
iso-8859-1
https://www.scale-rc-car.com/forum/showthread.php?13717-Buggy-d-%E9tag%E8re-Team-Associated-RC10CC&pp=1
https://www.scale-rc-car.com/forum/showthread.php?13717-Buggy-d-étagère-Team-Associated-RC10CC&pp=1
<class 'bs4.BeautifulSoup'>
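
Since it is not known in advance which topics redirect to a URL with special characters, the same idea can be wrapped in an error trap: try the plain request first and resolve the redirect with urllib only when the UnicodeDecodeError occurs. The following is a sketch rather than a tested recipe for this site. The names resolve_final_url and get_with_fallback are hypothetical helpers, not part of requests or urllib, and the get parameter is assumed to be requests.get or any callable with the same shape.

```python
import urllib.request


def resolve_final_url(url):
    """Follow redirects with urllib and return the final URL."""
    with urllib.request.urlopen(url) as response:
        return response.geturl()


def get_with_fallback(url, get, resolve=resolve_final_url):
    """Call get(url); on UnicodeDecodeError, retry with the resolved URL.

    `get` is assumed to behave like requests.get; `resolve` maps a URL
    to the URL it finally redirects to.
    """
    try:
        return get(url)
    except UnicodeDecodeError:
        # The redirect target could not be decoded as UTF-8. Let urllib
        # follow the redirect, then request the final URL directly.
        return get(resolve(url))
```

In the loop above this would be used as html_content = get_with_fallback(url, requests.get). Topics whose redirect target is plain ASCII then cost a single request, and only the problematic ones take the detour through urllib.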