The 'html2text' module is not working when used with the 'urllib.request' module

Question

786 Views Asked by dev Asur At 27 July 2025 at 10:07

I want to get all the text of a webpage, so I am trying to use the html2text module together with the urllib.request module:

import urllib.request 
import html2text
request_url = urllib.request.urlopen('https://dev.to/justdevasur/let-s-perform-google-search-with-python-2gpi') 
u=request_url.read()
print(html2text.html2text(u))
print('Done')

But I am getting the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\rauna\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\html2text\__init__.py", line 947, in html2text
    return h.handle(html)
  File "C:\Users\rauna\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\html2text\__init__.py", line 142, in handle
    self.feed(data)
  File "C:\Users\rauna\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\html2text\__init__.py", line 138, in feed
    data = data.replace("</' + 'script>", "</ignore>")
TypeError: a bytes-like object is required, not 'str'

**baduker** · Answer 1

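The error message is a bit misleading. urlopen(...).read() returns bytes, and html2text then calls data.replace(...) on it with str arguments, which is what raises the TypeError. So html2text expects a str, and you should decode the response before passing it in:

import urllib.request
import html2text
request_url = urllib.request.urlopen('https://dev.to/justdevasur/let-s-perform-google-search-with-python-2gpi')
# read() returns bytes; html2text needs str, so decode first (UTF-8 assumed here)
html = request_url.read().decode('utf-8')
print(html2text.html2text(html))
print('Done')

But when I tried it, the request itself came back with a 403, and it seems html2text has had its own Python 3 compatibility problems. See this question, for example.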
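To see where the TypeError in the traceback comes from, here is a minimal sketch that needs no network access. It reproduces the message on a stand-in byte string and shows that decoding first avoids it. The fetch_html helper and its User-Agent value are hypothetical, my own illustration rather than anything from html2text or the question. It only shows one way to send a browser-like header with urllib.request and decode using the charset the server declares, falling back to UTF-8 as an assumption.

```python
import urllib.request

# Stand-in for what urlopen(...).read() returns: bytes, not str.
raw = b"<p>Hello</p><script>var x = 1;</script>"

try:
    # Same kind of call that fails in html2text's feed(): str arguments on bytes.
    raw.replace("</script>", "</ignore>")
except TypeError as exc:
    message = str(exc)
    print(message)

# Decoding first gives a str, so the same call works.
text = raw.decode("utf-8")
cleaned = text.replace("</script>", "</ignore>")
print(cleaned)


def fetch_html(url, user_agent="Mozilla/5.0 (illustrative value)"):
    """Hypothetical helper: fetch a page and return its body as str."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        # Use the charset from the Content-Type header if the server sent one.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

With something like this, html2text.html2text(fetch_html(url)) would receive a str. Whether a particular site accepts the header and stops answering 403 is up to the site, so treat that part as something to try rather than a guarantee.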

So, I would suggest a different approach, for example:

import requests
from bs4 import BeautifulSoup


headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    # "br" is left out: requests only decodes Brotli when a brotli package is installed
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-GB,en;q=0.5",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:81.0) Gecko/20100101 Firefox/81.0",
}


url = 'https://dev.to/justdevasur/let-s-perform-google-search-with-python-2gpi'
# Pass headers by keyword; the second positional argument of requests.get is params
response = requests.get(url, headers=headers)
# Find the first <h1> element and print its text
heading = BeautifulSoup(response.text, "html.parser").find("h1")
print(heading.get_text(strip=True))

Prints: Let's perform Google search with python
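The snippet above only pulls out the h1, while the question asks for all the text of the page. As a rough sketch of that, here is a version built on the standard library's html.parser.HTMLParser instead of BeautifulSoup, so it runs without extra packages. TextExtractor, extract_text and the sample markup are hypothetical names and data made up for illustration. It skips script and style contents and returns each remaining text fragment on its own line, which means inline tags such as b split a sentence across lines.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collect text found outside <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        stripped = data.strip()
        if stripped and self._skip_depth == 0:
            self.chunks.append(stripped)


def extract_text(html):
    """Return the text fragments of an HTML string, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Made-up sample markup; in practice this would be the decoded page source.
sample = (
    "<html><head><style>p {color: red}</style></head><body>"
    "<h1>Heading</h1><p>First <b>bold</b> paragraph.</p>"
    "<script>var x = 1;</script></body></html>"
)
result = extract_text(sample)
print(result)
```

If BeautifulSoup is already installed, calling get_text() on the whole soup object instead of on a single tag is the closer equivalent to the answer's approach.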
