Access link href using BeautifulSoup in Python

204 Views Asked by At

I am looking for help webscraping the SEC's EDGAR database using BeautifulSoup. I have a list of investment firm names that I am trying to iterate through, and ultimately access their 13F filings.

So far, using BeautifulSoup, I am able to specify an entry, but am having trouble finding a way to put together the SEC's base web url with a specific file to actually access the data.

My code so far looks like:

headers = {"user-agent": 'Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0'}

for i in firms: # pre-determined list, but using IFP Advisors for this example as 'i'
    edgar_url = r'https://www.sec.gov/cgi-bin/srch-edgar?text=form-type%3D13F-HR+and+company-name+%3D+%22' + i + '%22&first=2020&last=2021&output=atom'
    
    response = requests.get(url = edgar_url, headers = headers)
    soup = BeautifulSoup(response.content, 'lxml')
    entries = soup.find_all('entry')

which gets me to a list of specific 13F filing entries.

   <entry>
      <title>13F-HR - IFP Advisors, Inc</title>
      <link rel="alternate" type="text/html" href="/Archives/edgar/data/1641866/000164186621000007/0001641866-21-000001-index.htm"/>
      <summary type="html">&lt;b&gt;Filed Date:&lt;/b&gt; 01/25/2021 &lt;b&gt;Accession Number:&lt;/b&gt; 0001641866-21-000001 &lt;b&gt;Size:&lt;/b&gt; 4 MB</summary>
      <updated>01/25/2021</updated>
      <category scheme="http://www.sec.gov/" label="form type" term="4"/>
      <id>urn:tag:sec.gov,2008:accession-number=0001641866-21-000001</id>
   </entry>

Eventually, what I would be looking to do is pull out the href dictated above

/Archives/edgar/data/1641866/000164186621000007/0001641866-21-000007-index

and pair it with the scheme in the entry to access the 13F filing's text file, which can be found here: https://www.sec.gov/Archives/edgar/data/1641866/000164186620000007/0001641866-20-000007.txt

While I have the scheme designated, I am looking for a solution to pulling in the link href from each entry to create a new url to access more data.

Any help or suggestions would be appreciated. Thank you in advance!

1

There are 1 best solutions below

2
On

To get URLs for complete submissions you can use this example:

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0"
}

firms = [
    "IFP Advisors, Inc",
]

entries = []
for i in firms:
    edgar_url = (
        r"https://www.sec.gov/cgi-bin/srch-edgar?text=form-type%3D13F-HR+and+company-name+%3D+%22"
        + i
        + "%22&first=2020&last=2021&output=atom"
    )
    response = requests.get(url=edgar_url, headers=headers)
    soup = BeautifulSoup(response.content, "lxml")
    entries.extend(soup.find_all("entry"))

for e in entries:
    url = "https://www.sec.gov" + e.link["href"]
    print("Getting URL:", url)
    soup = BeautifulSoup(
        requests.get(url, headers=headers).content, "html.parser"
    )
    l = soup.select_one(
        'td:-soup-contains("Complete submission text file") + td a'
    )
    submission_url = "https://www.sec.gov" + l["href"]
    print("Complete submission text file:", submission_url)
    print()

Prints:

Getting URL: https://www.sec.gov/Archives/edgar/data/1641866/000164186621000005/0001641866-21-000005-index.htm
Complete submission text file: https://www.sec.gov/Archives/edgar/data/1641866/000164186621000005/0001641866-21-000005.txt

Getting URL: https://www.sec.gov/Archives/edgar/data/1641866/000164186621000004/0001641866-21-000004-index.htm
Complete submission text file: https://www.sec.gov/Archives/edgar/data/1641866/000164186621000004/0001641866-21-000004.txt

Getting URL: https://www.sec.gov/Archives/edgar/data/1641866/000164186621000001/0001641866-21-000001-index.htm
Complete submission text file: https://www.sec.gov/Archives/edgar/data/1641866/000164186621000001/0001641866-21-000001.txt

Getting URL: https://www.sec.gov/Archives/edgar/data/1641866/000164186620000007/0001641866-20-000007-index.htm
Complete submission text file: https://www.sec.gov/Archives/edgar/data/1641866/000164186620000007/0001641866-20-000007.txt

Getting URL: https://www.sec.gov/Archives/edgar/data/1641866/000164186620000006/0001641866-20-000006-index.htm
Complete submission text file: https://www.sec.gov/Archives/edgar/data/1641866/000164186620000006/0001641866-20-000006.txt

Getting URL: https://www.sec.gov/Archives/edgar/data/1641866/000164186620000002/0001641866-20-000002-index.htm
Complete submission text file: https://www.sec.gov/Archives/edgar/data/1641866/000164186620000002/0001641866-20-000002.txt

Getting URL: https://www.sec.gov/Archives/edgar/data/1641866/000164186620000001/0001641866-20-000001-index.htm
Complete submission text file: https://www.sec.gov/Archives/edgar/data/1641866/000164186620000001/0001641866-20-000001.txt