I am learning Beautiful Soup and dictionaries in Python. I am following a short Beautiful Soup tutorial from Stanford University, available here: http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html
Since access to the website was forbidden, I stored the text presented in the tutorial in a string and then converted that string to a soup object. The printout is the following:
print(soup_string)
<html><body><div class="ec_statements"><div id="legalert_title"><a
href="/Legislation-and-Politics/Legislative-Alerts/Letter-to-Senators-
Urging-Them-to-Support-Cloture-and-Final-Passage-of-the-Paycheck-
Fairness-Act-S.2199">'Letter to Senators Urging Them to Support Cloture
and Final Passage of the Paycheck Fairness Act (S.2199)
</a>
</div>
<div id="legalert_date">
September 10, 2014
</div>
</div>
<div class="ec_statements">
<div id="legalert_title">
<a href="/Legislation-and-Politics/Legislative-Alerts/Letter-to-
Representatives-Urging-Them-to-Vote-on-the-Highway-Trust-Fund-Bill">
Letter to Representatives Urging Them to Vote on the Highway Trust Fund Bill
</a>
</div>
<div id="legalert_date">
July 30, 2014
</div>
</div>
<div class="ec_statements">
<div id="legalert_title">
<a href="/Legislation-and-Politics/Legislative-Alerts/Letter-to-Representatives-Urging-Them-to-Vote-No-on-the-Legislation-Providing-Supplemental-Appropriations-for-the-Fiscal-Year-Ending-Sept.-30-2014">
Letter to Representatives Urging Them to Vote No on the Legislation Providing Supplemental Appropriations for the Fiscal Year Ending Sept. 30, 2014
</a>
</div>
<div id="legalert_date">
July 30, 2014
</div>
</div>
</body></html>
At some point the tutor captures all the elements in the soup object that have the tag "div" with class_="ec_statements":
letters = soup_string.find_all("div", class_="ec_statements")
Then the tutor says:
"We'll go through all of the items in our letters collection, and for each one, pull out the name and make it a key in our dict. The value will be another dict, but we haven't yet found the contents for the other items yet so we'll just create assign an empty dict object."
At this point I take a different approach: I decide to store the data first in lists and then in a DataFrame. The code is the following:
import pandas as pd

lobbying_1 = []
lobbying_2 = []
lobbying_3 = []
for element in letters:
    lobbying_1.append(element.a.get_text())
    lobbying_2.append(element.a.attrs.get('href'))
    lobbying_3.append(element.find(id="legalert_date").get_text())
df = pd.DataFrame(lobbying_1, columns=['Name'])
df['href'] = lobbying_2
df['Date'] = lobbying_3
The output is the following:
print(df)
Name \
0 \n 'Letter to Senators Urging Them to S...
1 \n Letter to Representatives Urging Th...
2 \n Letter to Representatives Urging Th...
href \
0 /Legislation-and-Politics/Legislative-Alerts/L...
1 /Legislation-and-Politics/Legislative-Alerts/L...
2 /Legislation-and-Politics/Legislative-Alerts/L...
Date
0 \n September 10, 2014\n
1 \n July 30, 2014\n
2 \n July 30, 2014\n
My question is: Is there a way to get cleaner data, i.e. strings without the \n and extra spaces (just the actual values), directly through Beautiful Soup? Or do I have to post-process the data using regex?
Your advice will be appreciated.
To get rid of the newlines and surrounding whitespace in the texts, pass strip=True when calling get_text(). This, of course, assumes you still want the data in the form of a DataFrame.
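For example, here is a sketch of the loop from the question with strip=True applied (the sample HTML below is a shortened stand-in for the tutorial's markup, and the href value is a placeholder):

```python
import pandas as pd
from bs4 import BeautifulSoup

# Shortened sample of the tutorial's HTML (href is a placeholder)
html = """<div class="ec_statements">
<div id="legalert_title"><a href="/example-link">
Letter to Representatives Urging Them to Vote on the Highway Trust Fund Bill
</a></div>
<div id="legalert_date">
July 30, 2014
</div>
</div>"""

soup_string = BeautifulSoup(html, "html.parser")
letters = soup_string.find_all("div", class_="ec_statements")

names, hrefs, dates = [], [], []
for element in letters:
    # strip=True removes leading/trailing whitespace, including the \n
    names.append(element.a.get_text(strip=True))
    hrefs.append(element.a.attrs.get("href"))
    dates.append(element.find(id="legalert_date").get_text(strip=True))

df = pd.DataFrame({"Name": names, "href": hrefs, "Date": dates})
print(df)
```

With strip=True, the Name and Date columns contain only the bare text, so no regex post-processing is needed.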