Python, url changes while executed in requests.get() and results in the famous UnicodeDecodeError

Question

Asked by Bart Zakkenwasser on 02 March 2024 at 21:11

I use Visual Studio Code with Python 3.12.2 and BeautifulSoup 4.12.3 on Windows 11. File encoding is set to UTF-8.

This is my code sample in VS Code:

import requests
import urllib.parse
from urllib.parse import quote

from bs4 import BeautifulSoup

for topic in range(13717, 13718):
    url = 'https://www.scale-rc-car.com/forum/showthread.php?t='+str(topic) +'&pp=1&page=1'
    print(url)
    html_content = requests.get(url)
    soup = BeautifulSoup(html_content.text, 'html.parser')

print(url) shows the constructed URL with the correct topic number (13717): https://www.scale-rc-car.com/forum/showthread.php?t=13717&pp=1&page=1, which is exactly what I want.

But here's the rub: I get the often-posted "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 64: invalid continuation byte"

The thing is, as soon as the html_content = requests.get(url) statement is executed, the URL seems to change to: https://www.scale-rc-car.com/forum/showthread.php?13717-Buggy-d-%E9tag%E8re-Team-Associated-RC10CC&pp=20

I can check that by pasting the constructed URL (https://www.scale-rc-car.com/forum/showthread.php?t=13717&pp=1&page=1) into the web browser: when I hit ENTER, it changes and gains the phrase -Buggy-d-%E9tag%E8re-Team-Associated-RC10CC. As you can see, the characters é and è are replaced by %E9 and %E8 respectively, and the result is the UnicodeDecodeError.

The question is: how can I avoid or error-trap this problem? Extra info: I don't know beforehand whether there will be special characters in the URL.

This is the complete error message:

PS C:\xampp\htdocs\python> python dumpy.py
https://www.scale-rc-car.com/forum/showthread.php?t=13717&pp=1&page=1
Traceback (most recent call last):
  File "C:\xampp\htdocs\python\dumpy.py", line 10, in <module>
    html_content = requests.get(url)
                   ^^^^^^^^^^^^^^^^^
  File "C:\Users\bartz\AppData\Roaming\Python\Python312\site-packages\requests\api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\bartz\AppData\Roaming\Python\Python312\site-packages\requests\api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\bartz\AppData\Roaming\Python\Python312\site-packages\requests\sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\bartz\AppData\Roaming\Python\Python312\site-packages\requests\sessions.py", line 725, in send
    history = [resp for resp in gen]
              ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\bartz\AppData\Roaming\Python\Python312\site-packages\requests\sessions.py", line 175, in resolve_redirects
    url = self.get_redirect_target(resp)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\bartz\AppData\Roaming\Python\Python312\site-packages\requests\sessions.py", line 124, in get_redirect_target
    return to_native_string(location, "utf8")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\bartz\AppData\Roaming\Python\Python312\site-packages\requests\_internal_utils.py", line 33, in to_native_string
    out = string.decode(encoding)
          ^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 64: invalid continuation byte
PS C:\xampp\htdocs\python>
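
Reading the traceback: the failure is in get_redirect_target, where requests decodes the redirect's Location header as UTF-8. The snippet below is a minimal offline sketch of that step, not the code requests actually runs. It assumes the server sends é and è as single ISO-8859-1 bytes, which is what a byte 0xe9 at position 64 suggests; the URL is the redirect target quoted above.

```python
import string
from urllib.parse import quote, unquote

# Redirect target as shown in the browser (percent-encoded, plain ASCII).
escaped = (
    "https://www.scale-rc-car.com/forum/showthread.php"
    "?13717-Buggy-d-%E9tag%E8re-Team-Associated-RC10CC&pp=20"
)

# Assumption: the server sends the accented characters unescaped,
# as single ISO-8859-1 bytes, in its Location header.
location = unquote(escaped, encoding="iso-8859-1")
raw = location.encode("iso-8859-1")

decode_error = None
try:
    raw.decode("utf-8")  # the decoding step that fails in the traceback
except UnicodeDecodeError as exc:
    decode_error = exc
    print(exc)

# Percent-encoding the ISO-8859-1 bytes gives back a pure-ASCII URL.
requoted = quote(location, safe=string.punctuation, encoding="iso-8859-1")
print(requoted)
```

The percent-encoded form is plain ASCII, so it decodes under any codec; only the raw, unescaped bytes trip the UTF-8 decoder.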

**JosefZ** · Answer 1 · 2024-03-03T16:44:29.023000

Get the redirected URL using urllib.request (the standard library's extensible library for opening URLs); see final_url below:

import urllib.request
from urllib.parse import unquote

import requests
from bs4 import BeautifulSoup

for topic in range(13717, 13718):
    url = 'https://www.scale-rc-car.com/forum/showthread.php?t=' + str(topic) + '&pp=1&page=1'
    print(url)
    # Let urllib.request follow the redirect and report where it ended up.
    with urllib.request.urlopen(url) as cm:
        final_url = cm.geturl()
        print(cm.headers.get_content_charset())       # iso-8859-1
    print(final_url)
    # Decode the percent-escapes only to show the human-readable thread title.
    print(unquote(final_url, encoding='iso-8859-1'))
    # final_url is percent-encoded ASCII, so requests can fetch it directly.
    html_content = requests.get(final_url)
    soup = BeautifulSoup(html_content.text, 'html.parser')
    print(type(soup))

All print calls are for debugging purposes only.

Output: .\SO\78094322.py

https://www.scale-rc-car.com/forum/showthread.php?t=13717&pp=1&page=1
iso-8859-1
https://www.scale-rc-car.com/forum/showthread.php?13717-Buggy-d-%E9tag%E8re-Team-Associated-RC10CC&pp=1
https://www.scale-rc-car.com/forum/showthread.php?13717-Buggy-d-étagère-Team-Associated-RC10CC&pp=1
<class 'bs4.BeautifulSoup'>
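
To address the error-trapping part of the question, the same idea can be wrapped in a fallback: try requests.get first and resolve the redirect with urllib.request only when the UnicodeDecodeError appears. This is a minimal sketch under that assumption; fetch and resolve_final_url are hypothetical helper names, and the timeout value is arbitrary.

```python
import urllib.request

import requests


def resolve_final_url(url: str, timeout: float = 10.0) -> str:
    """Follow redirects with urllib.request and return the final URL."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.geturl()


def fetch(url: str, timeout: float = 10.0) -> requests.Response:
    """GET url with requests; fall back to urllib for undecodable redirects."""
    try:
        return requests.get(url, timeout=timeout)
    except UnicodeDecodeError:
        # requests could not decode the redirect target as UTF-8, so let
        # urllib.request resolve it and retry with the percent-encoded URL.
        final_url = resolve_final_url(url, timeout=timeout)
        return requests.get(final_url, timeout=timeout)
```

Inside the loop, html_content = fetch(url) would then replace the direct requests.get(url) call, so threads whose redirects decode cleanly skip the extra urllib round trip.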
