Python 2.7.10 Trying to print text from website using Beautiful Soup 4


I want my output to look like this:

count:0 - Bournemouth and Watford to go head-to-head for Abdisalam Ibrahim

Olympiacos midfielder Abdisalam Ibrahim is a target for Premier League new-boys Bournemouth and Watford.The former Manchester City man is keen to leave Greece this summer, and his potential availability has alerted Eddie Howe and Quique Sanchez Flores.Lorient of Ligue 1 and La Liga's Rayo Vallacano are also interested in the 24-year-old.

count:1 - Andre-Pierre Gignac set for Mexico move

Former West Brom target Andre-Pierre Gignac is to complete a move to Mexican side Tigres.The France international is a free agent after leaving Marseille and is set to undergo a medical later today.West Ham, Stoke, Newcastle, West Brom and Dynamo Moscow all showed interest in the 30-year-old although Tony Pulis is understood to have cooled his interest after watching Gignac against Monaco towards the end of last season.

My Program:

from bs4 import BeautifulSoup
import urllib2
response = urllib2.urlopen('http://www.dailymail.co.uk/sport/football/article-3129389/Transfer-News-LIVE-Manchester-United-Arsenal-Liverpool-Real-Madrid-Barcelona-latest-plus-rest-Europe.html')
html = response.read()
soup = BeautifulSoup(html)

count=0
for tag in soup.find_all("div", {"id":"lc-commentary-posts"}):
    divTaginb = tag.find_all("div", {"class":"lc-title-container"})
    divTaginp = tag.find_all("div",{"class":"lc-post-body"})
    for tag1 in divTaginb:
        h4Tag = tag1.find_all("b")
        for tag2 in h4Tag:
            print "count:%d - "%count,
            print tag2.text
            print '\n'
            tagp = divTaginp[count].find_all('p')
            for p in tagp:
                print p
            print '\n'
            count +=1

My output:

Count:0 - ....
...
count:37 -  ICYMI: Hamburg target Celtic star Stefan Johansen as part of summer
rebuilding process


<p><strong>STEPHEN MCGOWAN:</strong> Bundesliga giants Hamburg have been linked
 with a move for CelticΓÇÖs PFA Scotland player of the year Stefan Johansen.</p>

<p>German newspapers claim the Norwegian features on a three-man shortlist of po
tential signings for HSV as part of their summer rebuilding process.</p>
<p>Hamburg scouts are reported to have watched Johansen during Friday nightΓÇÖs
scoreless Euro 2016 qualifier draw with Azerbaijan.</p>
<p><a href="http://www.dailymail.co.uk/sport/football/article-3128854/Hamburg-ta
rget-Celtic-star-Stefan-Johansen-summer-rebuilding-process.html"><strong>CLICK H
ERE for more</strong></a></p>


count:38 -  ICYMI: Sevilla agree deal with Chelsea to sign out-of-contract midfi
elder Gael Kakuta


<p>Sevilla have agreed a deal with Premier League champions Chelsea to sign out-
of-contract winger Gael Kakuta.</p>
<p>The French winger, who spent last season on loan in the Primera Division with
 Rayo Vallecano, will arrive in Seville on Thursday to undergo a medical with th
e back-to-back Europa League winners.</p>
<p>A statement published on Sevilla's official website confirmed the 23-year-old
's transfer would go through if 'everything goes well' in the Andalusian city.</
p>
<p><strong><a href="http://www.dailymail.co.uk/sport/football/article-3128756/Se
villa-agree-deal-Chelsea-sign-Gael-Kakuta-contract-winger-aims-resurrect-career-
Europa-League-winners.html">CLICK HERE for more</a></strong></p>


count:39 -  Good morning everybody!


<p>And welcome to <em>Sportsmail's</em> coverage of all the potential movers and
 shakers ahead of the forthcoming summer transfer window.</p>
<p>Whatever deals will be rumoured, agreed or confirmed today you can read all
about them here.</p>

The HTML of the Daily Mail page looks like this:

<div id="lc-commentary-posts"><div id="lc-id-39" class="lc-commentary-post cleared">
    <div class="lc-icons">
        <img src="http://i.mol.im/i/furniture/live_commentary/football_icons/teams/60x60_bournemouth.png" class="lc-icon">
        <img src="http://i.mol.im/i/furniture/live_commentary/football_icons/teams/60x60_watford.png" class="lc-icon">
        <div class="lc-post-time">18:03 </div>
    </div>
    <div class="lc-title-container">
        <h4>
            <a href="http://www.dailymail.co.uk/sport/football/article-3130092/Bournemouth-Watford-want-former-Manchester-City-midfielder.html" target="_blank"><b>Bournemouth and Watford to go head-to-head for Abdisalam Ibrahim</b></a>
        </h4>
    </div>
    <div class="lc-post-body">
        <p><strong>SAMI MOKBEL:&nbsp;</strong>Olympiacos midfielder Abdisalam Ibrahim is a target for Premier League new-boys Bournemouth and Watford.</p>
<p class="mol-para-with-font">The former Manchester City man is keen to leave Greece this summer, and his potential availability has alerted Eddie Howe and Quique Sanchez Flores.</p>
<p class="mol-para-with-font"><font>Lorient of Ligue 1 and La Liga's Rayo Vallacano are also interested in the 24-year-old.</font></p>
    </div>


    <img class="lc-post-image" src="http://i.dailymail.co.uk/i/pix/2015/06/18/18/1434647000147_lc_galleryImage_TEL_AVIV_ISRAEL_JUNE_11_A.JPG">
    <b class="lc-image-caption">Abdisalam Ibrahim could return to England</b>
    <div class="lc-clear"></div>

    <ul class="lc-social">
        <li class="lc-facebook"><span onclick="window.LiveCommentary.socialShare(postToFB, '39', 'facebook')"></span></li>
        <li class="lc-twitter"><span onclick="window.LiveCommentary.socialShare(postToTWTTR, '39', 'twitter', window.twitterVia)"></span></li>
    </ul>
</div>
<div id="lc-id-38" class="lc-commentary-post cleared">
    <div class="lc-icons">
        <img src="http://i.mol.im/i/furniture/live_commentary/football_icons/teams/60x60_west_brom.png" class="lc-icon">
        <img src="http://i.mol.im/i/furniture/live_commentary/flags/60x60_mexico.png" class="lc-icon">
        <div class="lc-post-time">16:54 </div>
    </div>
    <div class="lc-title-container">
            <span><b>Andre-Pierre Gignac set for Mexico move</b></span>
    </div>
    <div class="lc-post-body">
        <p>Former West Brom target Andre-Pierre Gignac is to complete a move to Mexican side Tigres.</p>
<p id="ext-gen225">The France international is a free agent after leaving Marseille and is set to undergo a medical later today.</p>
<p>West Ham, Stoke, Newcastle, West Brom and Dynamo Moscow all showed interest in the 30-year-old although Tony Pulis is understood to have cooled his interest after watching Gignac against Monaco towards the end of last season.</p>
    </div>


    <img class="lc-post-image" src="http://i.dailymail.co.uk/i/pix/2015/06/18/16/1434642784396_lc_galleryImage__FILES_A_file_picture_tak.JPG">
    <b class="lc-image-caption">Andre-Pierre Gignac is to complete a move to Mexican side Tigres</b>
    <div class="lc-clear"></div>

    <ul class="lc-social">
        <li class="lc-facebook"><span onclick="window.LiveCommentary.socialShare(postToFB, '38', 'facebook')"></span></li>
        <li class="lc-twitter"><span onclick="window.LiveCommentary.socialShare(postToTWTTR, '38', 'twitter', window.twitterVia)"></span></li>
    </ul>
</div>

My first target is the <b></b> inside <div class="lc-title-container">, which I can get easily. But when I target all the <p></p> tags inside <div class="lc-post-body">, I cannot get only the text I need. I tried p.text and p.strip(), but I still cannot solve the problem.

Error when using p.text:

count:19 -  City's pursuit of Sterling, Wilshere and Fabian Delph show a need fo
r English quality


MIKE KEEGAN: Colonial explorer Cecil Rhodes is famously reported to have once sa
id that to be an Englishman 'is to have won first prize in the lottery of life'.

Back in the 19th century, the vicar's son was no doubt preaching about the expan
ding Empire and his own experiences in Africa.
Traceback (most recent call last):
  File "app.py", line 24, in <module>
    print p.text
  File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2013' in position
 160: character maps to <undefined>

And when I use p.strip(), I get no output at all. Is there a good way to do this? I have been working on this since morning, and now it is night.

I don't want to use any encoder or decoder if possible.

dammit = UnicodeDammit(html)
print(dammit.unicode_markup)

Best answer:

Here's my code. You should go through it. I was too lazy to add specific fields to the dataset, so I just combined everything into one content string.

from bs4 import BeautifulSoup, element
import urllib2

response = urllib2.urlopen('http://www.dailymail.co.uk/sport/football/article-3129389/Transfer-News-LIVE-Manchester-United-Arsenal-Liverpool-Real-Madrid-Barcelona-latest-plus-rest-Europe.html')
html = response.read()
soup = BeautifulSoup(html)

article_dataset = {}

# Try to make your variable names express what you are trying to do.
# Collect the container that holds all the article posts.
article_post_tags = soup.find_all("div", {"id": "lc-commentary-posts"})

# Set up article_dataset with the article name as its key.
for article_post_tag in article_post_tags:
    container_tags = article_post_tag.find_all("div", {"class": "lc-title-container"})
    body_tags = article_post_tag.find_all("div", {"class": "lc-post-body"})

    # Find the article name and store the matching body tag under it.
    for count, container in enumerate(container_tags):
        # We know there is only one <b> tag in our container,
        # so use find() instead of find_all().
        article_name_tag = container.find('b')

        # Our primary key is the article name; the corresponding value holds the body tag.
        article_dataset[article_name_tag.text] = {'body_tag': body_tags[count]}

for article_name, details in article_dataset.items():
    content = []
    content_line_tags = details['body_tag'].find_all('p')

    # Go through each <p> tag and collect its text.
    for content_tag in content_line_tags:
        for data in content_tag.contents:  # gather the strings in our tags
            if isinstance(data, element.NavigableString):
                data = unicode(data)
            else:
                data = data.text
            content.append(data)

    # Combine the content.
    content = '\n'.join(content)

    # Add the content to our data.
    article_dataset[article_name]['content'] = content

# Remove the body_tag from our article dataset and print each article.
for name, details in article_dataset.items():
    del details['body_tag']

    print
    print
    print 'Article Name: ' + name
    print 'Player: ' + details['content'].split('\n')[0]
    print 'Article Summary: ' + details['content']
    print
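
The traceback in the question comes from the console codec (cp437), which has no mapping for u'\u2013' (an en dash), so printing the extracted text can still fail on that console even once the tags are stripped. Below is a minimal sketch of one way around it, assuming it is acceptable to show unsupported characters as '?'. The helper name console_safe is hypothetical and not part of Beautiful Soup; it uses only the standard library and should run on both Python 2.7 and Python 3.

```python
import sys


def console_safe(text, encoding=None):
    """Return `text` with characters the target encoding lacks replaced by '?'.

    `console_safe` is a hypothetical helper name used for illustration only.
    """
    # sys.stdout.encoding can be None (for example when output is piped),
    # so fall back to ASCII in that case.
    encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'ascii'
    # Encode with the 'replace' error handler, then decode again so the
    # caller still gets text back rather than bytes.
    return text.encode(encoding, 'replace').decode(encoding)


# u'\u2013' is the character named in the UnicodeEncodeError above.
sample = u'Africa \u2013 and beyond'
print(console_safe(sample, 'cp437'))
```

With this in place, a line such as print p.text in the question could become print console_safe(p.text). The trade-off is that the replaced characters are lost in the printed output; the original text in the parsed document is unchanged.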