I'm following a web tutorial trying to use BeautifulSoup4 to extract data from a html file (stored on my local PC) in Jupyterlab as follows:
from bs4 import BeautifulSoup
with open ('simple.html') as html_file:
simple = BeautifulSoup('html_file','lxml')
print(simple.prettify())
I'm getting the following output irrespective of what is in the html file instead of the expected html
<html>
<body>
<p>
html_file
</p>
</body>
</html>
I've also tried it using the html parser html.parser and I simply get html_file
as the output.
I know it can find the file because when I run the code after removing it from the directory I get a FileNotFoundError.
It works perfectly well when I run python interactively from the same directory. I'm able to run other BeautifulSoup to parse web pages.
I'm using Fedora 32 linux with Python3, Jupyterlab, BeautifulSoup4,requests, lxml installed in a virtual environment using pipenv.
Any help to get to the bottom of the problem is welcome.
Your problem is in this line:
In particular, you're telling BeautifulSoup to parse the literal string
'html_file'
instead of the contents of the variablehtml_file
.Changing it to:
(note the lack of quotes surrounding
html_file
) should give the desired result.