The 'html2text' module is not working when used with the 'urllib.request' module

I want to get all the text of a webpage, so I am trying to use the html2text module together with the urllib.request module:

import urllib.request 
import html2text
request_url = urllib.request.urlopen('https://dev.to/justdevasur/let-s-perform-google-search-with-python-2gpi') 
u=request_url.read()
print(html2text.html2text(u))
print('Done')

But I am getting the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\rauna\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\html2text\__init__.py", line 947, in html2text
    return h.handle(html)
  File "C:\Users\rauna\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\html2text\__init__.py", line 142, in handle
    self.feed(data)
  File "C:\Users\rauna\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\html2text\__init__.py", line 138, in feed
    data = data.replace("</' + 'script>", "</ignore>")
TypeError: a bytes-like object is required, not 'str'

There is 1 best solution below

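The error message is a bit misleading. request_url.read() returns bytes, but html2text works on str: internally it calls data.replace(...) with str arguments, and doing that on a bytes object raises exactly this TypeError. So decode the response before passing it on:

import urllib.request
import html2text

request_url = urllib.request.urlopen('https://dev.to/justdevasur/let-s-perform-google-search-with-python-2gpi')
u = request_url.read().decode('utf-8')  # bytes -> str; utf-8 is assumed here
print(html2text.html2text(u))
print('Done')

However, for me the request itself fails with a 403, most likely because the site rejects urllib's default User-Agent.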
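To see where the TypeError in the traceback comes from without touching the network, here is a minimal sketch. The sample bytes and the helper name to_text are made up for illustration, and the utf-8 default is an assumption that holds for many pages but not all.

```python
# urlopen(...).read() returns bytes. The traceback shows html2text calling
# data.replace(...) with str arguments, which a bytes object rejects.
raw = b"<h1>Hello</h1><script>1</script>"  # stand-in for request_url.read()

error_message = ""
try:
    raw.replace("</script>", "</ignore>")  # str arguments on a bytes object
except TypeError as exc:
    error_message = str(exc)
    print(error_message)


def to_text(body: bytes, charset: str = "utf-8") -> str:
    """Decode a response body to str (hypothetical helper; utf-8 is an assumed default)."""
    return body.decode(charset, errors="replace")


html = to_text(raw)
replaced = html.replace("</script>", "</ignore>")  # works once the data is str
print(replaced)
```

With a real response, the charset can usually be read from request_url.headers.get_content_charset() instead of being hard-coded. Decoding fixes the TypeError, but it does nothing about the 403 response.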

So, I would suggest a different approach, for example:

import requests
from bs4 import BeautifulSoup


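headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    # "br" is left out: requests can only decode Brotli if a brotli package is installed
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-GB,en;q=0.5",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:81.0) Gecko/20100101 Firefox/81.0",
}


# headers must be passed by keyword; the second positional argument of requests.get is params
response = requests.get('https://dev.to/justdevasur/let-s-perform-google-search-with-python-2gpi', headers=headers)
response.raise_for_status()  # fail loudly on a 403 or any other HTTP error
heading = BeautifulSoup(response.text, "html.parser").find("h1")
print(heading.get_text(strip=True))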

Prints: Let's perform Google search with python
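
The snippet above only extracts the h1 heading. Since the question asks for all the text of the page, a sketch along the same lines would call get_text() on the whole document. The inline HTML and the function name page_text are placeholders for illustration; with a live page you would pass the text of the response instead.

```python
from bs4 import BeautifulSoup


def page_text(html: str) -> str:
    """Return the visible text of an HTML document (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop script and style elements so their contents do not end up in the text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # One line per text fragment, with surrounding whitespace stripped.
    return soup.get_text(separator="\n", strip=True)


sample = """
<html><head><title>Demo</title><style>p {color: red}</style></head>
<body><h1>Heading</h1><p>First <b>bold</b> paragraph.</p>
<script>console.log("x")</script></body></html>
"""
text = page_text(sample)
print(text)
```

Unlike html2text, this gives plain text rather than Markdown, so links and emphasis are lost; whether that matters depends on what you want to do with the text.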