I want to get all the text of a webpage, so I am trying to use the html2text module together with urllib.request:
import urllib.request
import html2text
request_url = urllib.request.urlopen('https://dev.to/justdevasur/let-s-perform-google-search-with-python-2gpi')
u=request_url.read()
print(html2text.html2text(u))
print('Done')
But I am getting the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\rauna\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\html2text\__init__.py", line 947, in html2text
return h.handle(html)
File "C:\Users\rauna\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\html2text\__init__.py", line 142, in handle
self.feed(data)
File "C:\Users\rauna\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\html2text\__init__.py", line 138, in feed
data = data.replace("</' + 'script>", "</ignore>")
TypeError: a bytes-like object is required, not 'str'
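The error message is a little misleading here. urlopen(...).read() returns a bytes object, while html2text works on str: inside feed() it calls data.replace() with str arguments, and that call raises the TypeError when data is bytes. So you should decode the response first, for example with u.decode('utf-8'), and pass the resulting str to html2text.html2text().
But that can still fail with a 403, and it also seems that html2text has compatibility problems with Python 3; see this question, for example. So I would suggest a different approach.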
For the URL in the question, it prints:
Let's perform Google search with python
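For reference, here is a minimal sketch of what a different approach could look like using only the standard library. It is not necessarily the code this answer originally used. The names TextExtractor, html_to_text and fetch_html are made up for illustration, and the browser-like User-Agent value is an assumption: some sites reply 403 to urllib's default one, but whether this header is enough for a given site is not guaranteed. The sketch decodes the bytes from read() to str before parsing, which is the step the original code was missing.

```python
import urllib.request
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the text nodes of an HTML string, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def fetch_html(url):
    """Download a page and return it as str (not bytes)."""
    # Assumption: a browser-like User-Agent avoids the 403 on some sites.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request) as response:
        # Use the charset declared by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        # read() returns bytes; decode before handing it to a str-based parser.
        return response.read().decode(charset, errors="replace")
```

With these two helpers, print(html_to_text(fetch_html(url))) would print the page text for a given url. The exact output depends on the page's markup at the time it is fetched.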