Why can I not get local files to parse using BeautifulSoup4 in Jupyterlab


I'm following a web tutorial, trying to use BeautifulSoup4 in Jupyterlab to extract data from an HTML file stored on my local PC, as follows:

from bs4 import BeautifulSoup

with open ('simple.html') as html_file:
    simple = BeautifulSoup('html_file','lxml')

print(simple.prettify())

I'm getting the following output irrespective of what is in the HTML file, instead of the expected parsed HTML:

<html>
 <body>
  <p>
   html_file
  </p>
 </body>
</html>

I've also tried it with the built-in html.parser instead, and then I simply get html_file as the output. I know it can find the file, because when I remove the file from the directory and re-run the code I get a FileNotFoundError.

It works perfectly well when I run Python interactively from the same directory, and I'm able to run other BeautifulSoup code to parse web pages.

I'm using Fedora 32 Linux with Python 3, Jupyterlab, BeautifulSoup4, requests, and lxml installed in a virtual environment using pipenv.

Any help to get to the bottom of the problem is welcome.

1 Answer

BEST ANSWER

Your problem is in this line:

simple = BeautifulSoup('html_file','lxml')

In particular, you're telling BeautifulSoup to parse the literal string 'html_file' instead of the contents of the variable html_file.
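The distinction can be seen without BeautifulSoup at all. A minimal sketch (the file name simple.html and its contents here are just for illustration): a quoted name is a string literal, while the unquoted name refers to the open file object.

```python
# Create a small sample file for the demonstration.
with open('simple.html', 'w') as f:
    f.write('<p>Hello</p>')

with open('simple.html') as html_file:
    literal = 'html_file'        # just the 9-character string "html_file"
    contents = html_file.read()  # the actual markup read from the file

print(literal)   # html_file
print(contents)  # <p>Hello</p>
```

Passing the quoted version to BeautifulSoup parses the word "html_file" as if it were a document, which is exactly the output you saw.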

Changing it to:

simple = BeautifulSoup(html_file,'lxml')

(note the lack of quotes surrounding html_file) should give the desired result.