Getting clean data: is Beautiful Soup enough, or do I need regex as well?

I am learning Beautiful Soup and dictionaries in Python. I am following a short Beautiful Soup tutorial from Stanford University, which can be found here: http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html

Since access to the website was forbidden, I stored the HTML presented in the tutorial in a string and then converted that string into a soup object. The printout is the following:

    print(soup_string)

    <html><body><div class="ec_statements"><div id="legalert_title"><a href="/Legislation-and-Politics/Legislative-Alerts/Letter-to-Senators-Urging-Them-to-Support-Cloture-and-Final-Passage-of-the-Paycheck-Fairness-Act-S.2199">'Letter to Senators Urging Them to Support Cloture and Final Passage of the Paycheck Fairness Act (S.2199)
    </a>
    </div>
    <div id="legalert_date">
    September 10, 2014
    </div>
    </div>
    <div class="ec_statements">
    <div id="legalert_title">
    <a href="/Legislation-and-Politics/Legislative-Alerts/Letter-to-  
    Representatives-Urging-Them-to-Vote-on-the-Highway-Trust-Fund-Bill">
    Letter to Representatives Urging Them to Vote on the Highway Trust Fund Bill
    </a>
    </div>
    <div id="legalert_date">
            July 30, 2014
           </div>
    </div>
    <div class="ec_statements">
    <div id="legalert_title">
    <a href="/Legislation-and-Politics/Legislative-Alerts/Letter-to-Representatives-Urging-Them-to-Vote-No-on-the-Legislation-Providing-Supplemental-Appropriations-for-the-Fiscal-Year-Ending-Sept.-30-2014">
             Letter to Representatives Urging Them to Vote No on the Legislation Providing Supplemental Appropriations for the Fiscal Year Ending Sept. 30, 2014
            </a>
    </div>
    <div id="legalert_date">
            July 30, 2014
           </div>
    </div>
 </body></html>
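
For context, the soup object was built from the stored string roughly like this (a minimal sketch; html_string and the html.parser choice are my own assumptions, not from the tutorial):

    from bs4 import BeautifulSoup

    # Parse the stored HTML string into a BeautifulSoup object
    soup_string = BeautifulSoup(html_string, 'html.parser')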

At some point the tutor captures all the elements in the soup object that have the tag "div" with class "ec_statements":

    letters = soup_string.find_all("div", class_="ec_statements")

Then the tutor says:

"We'll go through all of the items in our letters collection, and for each one, pull out the name and make it a key in our dict. The value will be another dict, but we haven't yet found the contents for the other items yet so we'll just create assign an empty dict object."

At this point I take a different approach and decide to store the data first in lists and then in a DataFrame. The code is the following:

import pandas as pd

# Collect the name, link and date of each letter in parallel lists
lobbying_1 = []
lobbying_2 = []
lobbying_3 = []
for element in letters:
    lobbying_1.append(element.a.get_text())
    lobbying_2.append(element.a.attrs.get('href'))
    lobbying_3.append(element.find(id="legalert_date").get_text())

# Build the DataFrame from the lists
df = pd.DataFrame(lobbying_1, columns=['Name'])
df['href'] = lobbying_2
df['Date'] = lobbying_3

The output is the following:

print(df)

                                                Name  \
0  \n        'Letter to Senators Urging Them to S...   
1  \n         Letter to Representatives Urging Th...   
2  \n         Letter to Representatives Urging Th...   

                                                href  \
0  /Legislation-and-Politics/Legislative-Alerts/L...   
1  /Legislation-and-Politics/Legislative-Alerts/L...   
2  /Legislation-and-Politics/Legislative-Alerts/L...   

                                    Date  
0  \n        September 10, 2014\n         
1       \n        July 30, 2014\n         
2       \n        July 30, 2014\n   

My question is: is there a way to get cleaner data, i.e. strings without the \n and surrounding spaces, just the real values, directly through Beautiful Soup? Or do I have to post-process the data with regex?

Your advice will be appreciated.

1 Answer

To get rid of the newlines and surrounding whitespace in the texts, pass strip=True when calling get_text():

for element in letters:
    lobbying_1.append(element.a.get_text(strip=True))
    lobbying_2.append(element.a.attrs.get('href'))
    lobbying_3.append(element.find(id="legalert_date").get_text(strip=True))

This, of course, assumes you still want the data in the form of a DataFrame.
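
If the DataFrame has already been built from the raw text, the same cleanup can also be done afterwards with pandas string methods, no regex needed (a sketch, assuming the df from the question):

    # Strip leading/trailing whitespace (including newlines) from the text columns
    df['Name'] = df['Name'].str.strip()
    df['Date'] = df['Date'].str.strip()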