Why does soup=BeautifulSoup(data, "html.parser") work but soup2=... does not?


I'm a Python beginner; I hope my question is not too lengthy. Please tell me if I should be more concise in future questions, thank you!

I'm opening an .xhtml file that contains financial data as XML (the iXBRL standard). Right now I'm parsing the file with BeautifulSoup4 ("html.parser").

from bs4 import BeautifulSoup

url = r"tk2021.xhtml"
data = open(url, encoding="utf8")

soup = BeautifulSoup(data, "html.parser")

Then I'm creating different lists, each containing all matching tags. Later I iterate over those lists to pull the relevant data out of each tag and load it into a pd.DataFrame.

ix_nonfraction = soup.find_all({"ix:nonfraction"})
xbrli_unit = soup.find_all({"xbrli:unit"})
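
For illustration, the later extraction step might look roughly like this (a sketch only: name and contextref are standard iXBRL attributes, but the exact columns are an assumption, since that part of the script isn't shown):

import pandas as pd

rows = []
for tag in ix_nonfraction:
    rows.append({
        "name": tag.get("name"),
        "contextRef": tag.get("contextref"),  # html.parser lowercases attribute names
        "value": tag.get_text(strip=True),
    })

df = pd.DataFrame(rows)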

This works as expected. What I'm struggling with is the next step.

I'm trying to create another list containing all <xbrli:context> tags. They have <xbrli:entity> child tags, which I need to remove before I create the list. This is how I'm doing that:

for tag in soup("xbrli:entity"):
    tag.decompose()

xbrli_context = soup.find_all({"xbrli:context"})

This also works fine, but then I can't access the original soup later in my script (all <xbrli:entity> tags are missing). I also read in the BS4 documentation that "the behavior of a decomposed Tag or NavigableString is not defined and you should not use it for anything". So I thought it would be cleaner to create a new soup2 for this operation, so that the original soup can be used later on.

And here's where I don't understand what's happening: when I create a second soup with a different name, soup2 = BeautifulSoup(data, "html.parser"), and use print(soup2.prettify()), it prints nothing. Doing the same with soup works just fine.

Why does soup2 seem to be empty? How do I handle multiple versions of one soup, so that I can always start with the original soup, if I want to?


There are 2 answers below.

Answer from Driftr95:

As already mentioned in the comments, since data is a file object, once BeautifulSoup has read it the first time, it needs to be re-opened (or rewound with data.seek(0)) before it can be read again. You probably wouldn't have that issue if you had used

with open(url, encoding="utf8") as f:
    data = f.read()

since .read() returns a string, data would just be a string, and a string can be parsed as many times as you like.
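
A minimal sketch of the same setup with data as a string:

from bs4 import BeautifulSoup

with open(url, encoding="utf8") as f:
    data = f.read()  # the whole file as one string

soup = BeautifulSoup(data, "html.parser")   # works
soup2 = BeautifulSoup(data, "html.parser")  # also works: the string can be re-read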


You can also just do away with data entirely and use

# soup = BeautifulSoup(open(url, encoding="utf8"), "html.parser") ## less safe
with open(url, encoding="utf8") as f: soup = BeautifulSoup(f, "html.parser")

By the way, it's better to use with open, since a bare open should be followed by .close() later [which you can't do if you write it like the commented line, because there's no variable referencing the file object].
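
As for always being able to start from the original soup: one option this answer doesn't cover, so treat it as a suggestion to verify, is copying the parsed tree with copy.copy(), which BeautifulSoup supports, instead of re-parsing the file:

import copy

soup = BeautifulSoup(data, "html.parser")
soup2 = copy.copy(soup)  # an independent copy of the whole parsed tree

# decompose tags in the copy; the original soup stays intact
for tag in soup2("xbrli:entity"):
    tag.decompose()

xbrli_context = soup2.find_all("xbrli:context")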

Answer from Ghislain Fourny:

I do not recommend reading an Inline XBRL file at the level of XML or XHTML. Rather, it is highly recommended to use an XBRL processor, which will provide the XBRL semantics at the right level of abstraction.

The XBRL data model is based on data cubes, and by reading the data directly as XML, you are essentially re-building an XBRL processor from scratch.

For example, there is an open-source processor called Arelle, which is available in Python:

https://pypi.org/project/arelle/

Main project page: https://arelle.org/arelle/
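
A rough sketch of what loading the same file with Arelle could look like (based on Arelle's scripting interface; verify the exact calls against the Arelle documentation):

from arelle import Cntlr

# load the iXBRL file through Arelle's controller; facts come back
# with their concepts, contexts and units already resolved
cntlr = Cntlr.Cntlr()
model_xbrl = cntlr.modelManager.load("tk2021.xhtml")

for fact in model_xbrl.facts:
    print(fact.concept.qname, fact.contextID, fact.value)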