I am working on a project where I need to break up 10-Ks into their constituent paragraphs. For some 10-Ks I am able to do something simple like soup.find_all('p'), but I am also seeing other 10-Ks that use <div> for everything instead of <p> tags. Here are three different ways I am seeing companies declare paragraph breaks:
Case where empty div tags are used to create space between paragraphs:
<div></div><div>Text of a paragraph</div><div></div>
Case where margins/padding are used on either the top or bottom to create space:
<div style="padding-top: 10pt">Text of a paragraph</div>, <div style="margin-bottom: 10pt"></div>
Case where the company uses <br> tags:
<div><br></div><div>Text of paragraph</div><div><br></div>
I have had to write new code for each of these three cases, and I am worried that there could be other ways of marking paragraphs that I haven't encountered yet.
QUESTION: Is there a package or method I can use to standardize all these different ways of declaring paragraph breaks, or should I continue to write code for each new case I encounter?
I don't think there is a generic approach that you can take here. However, as a heuristic, you can probably treat any <div> with textual content as a paragraph, irrespective of the other tags (e.g. other <div>s) that enclose it. You could even write an XPath query that captures this condition and use an XML parser to enumerate the matching nodes. Alternatively, pass a list of the possible text-enclosing tags to soup.find_all(), then go through the non-empty results that contain only textual content and treat them as paragraphs.
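As a rough illustration of that heuristic, here is a minimal sketch using Beautiful Soup. The tag list `TEXT_TAGS` and the helper name `extract_paragraphs` are my own assumptions, not something from a standard package, and the "leaf" rule (skip any candidate tag that contains another candidate tag) is just one way to interpret "only textual content". It will miss text that sits directly inside a container next to nested <div>s, so treat it as a starting point rather than a complete solution.

```python
from bs4 import BeautifulSoup

# Assumption: the tags a filer might use to wrap paragraph text.
# Extend this list as new conventions turn up.
TEXT_TAGS = ["p", "div"]


def extract_paragraphs(html, tags=TEXT_TAGS):
    """Return the text of every 'leaf' candidate tag that is not empty."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = []
    for node in soup.find_all(tags):
        # Skip wrappers that contain other candidate tags; the nested
        # tags are visited on their own by find_all().
        if node.find(tags) is not None:
            continue
        # Collapse whitespace; spacer tags such as <div></div> or
        # <div><br></div> end up as an empty string and are dropped.
        text = " ".join(node.get_text(" ", strip=True).split())
        if text:
            paragraphs.append(text)
    return paragraphs


sample = (
    "<div></div><div>First paragraph</div><div></div>"
    '<div style="padding-top: 10pt">Second paragraph</div>'
    "<div><br></div><div>Third <b>paragraph</b></div><div><br></div>"
    "<p>Fourth paragraph</p>"
)
print(extract_paragraphs(sample))
```

Because spacing-only elements are dropped no matter how the spacing was produced (empty tags, margins/padding, or <br>), the three cases from the question go through the same code path.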