How to scrape individual paragraphs from SEC 10-Ks


I am working on a project where I need to break up 10-Ks into their constituent paragraphs. For some 10-Ks I am able to do something simple like soup.find_all('p'), but I am also seeing other 10-Ks that use <div> for everything instead of <p> tags. Here are three different ways I am seeing companies declare paragraph breaks:

Case where empty div tags are used to create space between paragraphs:

<div></div><div>Text of a paragraph</div><div></div>

Case where margins/padding are used on either the top or bottom to create space:

<div style="padding-top: 10pt">Text of a paragraph</div>, <div style="margin-bottom: 10pt"></div>

Case where the company uses <br> tags:

<div><br></div><div>Text of paragraph</div><div><br></div>

I have had to write new code for each of these three cases, and I am worried that there could be other ways of marking paragraphs that I haven't encountered yet.
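To illustrate, here is roughly the shape of the per-case handling I have now (a simplified sketch, not my actual code; soup is a BeautifulSoup object of the filing and the helper names are just placeholders):

from bs4 import BeautifulSoup

def paragraphs_from_p_tags(soup):
    # Easy filings: real <p> tags mark the paragraphs.
    return [p.get_text(strip=True) for p in soup.find_all('p')
            if p.get_text(strip=True)]

def paragraphs_from_spacer_divs(soup):
    # Cases 1 and 3: empty <div>s or <div><br></div> only create space,
    # so any <div> that actually contains text is taken as a paragraph.
    return [d.get_text(strip=True) for d in soup.find_all('div')
            if d.get_text(strip=True)]

def paragraphs_from_styled_divs(soup):
    # Case 2: spacing comes from padding/margin in the style attribute;
    # here too, each text-bearing <div> is treated as one paragraph.
    return [d.get_text(strip=True) for d in soup.find_all('div', style=True)
            if d.get_text(strip=True)]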

QUESTION: Is there a package or method I can use to standardize all these different ways of declaring paragraph breaks, or should I continue to write code for each new case I encounter?

1 Answer

I don't think there is a generic approach that you can take here. However, as a heuristic, you can probably treat a <div> with textual content as a paragraph, irrespective of whether it encloses other tags (e.g. other <div>s).
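For example, a rough sketch of that heuristic (the function name is mine, and you may want a stricter emptiness test):

from bs4 import BeautifulSoup

def div_paragraphs(html):
    # Treat the innermost text-bearing <div>s as paragraphs, so an outer
    # wrapper <div> does not double-count the text of its children.
    soup = BeautifulSoup(html, 'html.parser')
    paragraphs = []
    for div in soup.find_all('div'):
        text = div.get_text(strip=True)
        if not text:
            continue  # empty spacer <div> or <div><br></div>
        # Skip wrappers whose text really belongs to a nested <div>.
        if any(d.get_text(strip=True) for d in div.find_all('div')):
            continue
        paragraphs.append(text)
    return paragraphs

print(div_paragraphs('<div></div><div>Text of a paragraph</div><div></div>'))
# prints: ['Text of a paragraph']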

You could even write an XPath query that captures this condition and use an XML parser to enumerate the matching nodes. Or pass a list of possible text-enclosing tags to soup.find_all(), e.g.:

soup.find_all(['div', 'p'])

Then go through the non-empty matches that contain only textual content and treat those as paragraphs.
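For instance, a sketch of the XPath variant with lxml (the exact predicates are an assumption; adjust them to the filings you encounter):

import lxml.html

def paragraphs_xpath(html):
    tree = lxml.html.fromstring(html)
    # Select <div>/<p> elements that have text of their own and do not
    # merely wrap another text-bearing <div> or <p>.
    nodes = tree.xpath(
        '//*[(self::div or self::p)'
        ' and normalize-space(.)'
        ' and not(.//div[normalize-space()])'
        ' and not(.//p[normalize-space()])]'
    )
    return [n.text_content().strip() for n in nodes]

The same filter works on the results of soup.find_all(['div', 'p']): drop elements whose stripped text is empty or that contain a nested <div> or <p> with text, and keep the rest as paragraphs.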