Avoid image div while parsing description tag

I am parsing an RSS feed with this code:

import requests
from bs4 import BeautifulSoup

resp = requests.get(url)  # url holds the RSS feed address
soup = BeautifulSoup(resp.content, features="xml")  # the "xml" parser requires lxml
items = soup.find_all('item')

news_items = []
for item in items:
    news_item = {}
    news_item['title'] = item.title.text
    news_item['description'] = item.description.text
    news_item['link'] = item.link.text
    news_item['pubDate'] = item.pubDate.text
    news_items.append(news_item)

In the description tag there is a div that contains the img src:

<description>
<![CDATA[ <div><img src="https://library.sportingnews.com/styles/twitter_card_120x120/s3/2023-11/nba-plain--358f0d81-148e-4590-ba34-3164ea0c87eb.png?itok=fG5f5Dwa" style="width: 100%;" /><div>Now back from his foot injury and ready to continue his Golden Boot charge, Erling Haaland looks to return in full as Man City visit Brentford in a Monday Premier League matinee.</div></div> ]]>
</description>

Is there any way I can retrieve everything in the description tag except for the image div? Thanks.

Answer by Avo Asatryan

The description is wrapped in CDATA, so item.description.text gives you the embedded HTML as a plain string. You can parse that string a second time with BeautifulSoup, remove the img tag, and then read the remaining text:

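import requests
from bs4 import BeautifulSoup

resp = requests.get(url)  # url holds the RSS feed address
soup = BeautifulSoup(resp.content, features="xml")  # the "xml" parser requires lxml
items = soup.find_all('item')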

news_items = []
for item in items:
    news_item = {}
    news_item['title'] = item.title.text
    news_item['link'] = item.link.text
    news_item['pubDate'] = item.pubDate.text

    # Parse the HTML content inside the description tag
    description_soup = BeautifulSoup(item.description.text, features="html.parser")

    # Remove the img tag
    for img in description_soup.find_all('img'):
        img.decompose()

    # Get the text content of the description
    news_item['description'] = description_soup.text.strip()

    news_items.append(news_item)
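
To try the image-stripping step without fetching a feed, here is a minimal sketch that applies the same idea to a single description string. The helper name description_without_images and the shortened sample (including the example.com image URL) are made up for illustration and are not part of the original feed or answer.

```python
from bs4 import BeautifulSoup

# Hypothetical sample, shortened from the description shown in the question.
SAMPLE_DESCRIPTION = (
    '<div><img src="https://example.com/image.png" style="width: 100%;" />'
    '<div>Erling Haaland looks to return in full as Man City visit Brentford.</div></div>'
)


def description_without_images(description_html):
    """Return the text of an HTML fragment after removing every <img> tag."""
    fragment = BeautifulSoup(description_html, "html.parser")
    for img in fragment.find_all("img"):
        img.decompose()  # drops the tag from the parsed tree in place
    # strip=True trims each text node; separator joins nodes from nested tags
    return fragment.get_text(separator=" ", strip=True)


print(description_without_images(SAMPLE_DESCRIPTION))
# prints: Erling Haaland looks to return in full as Man City visit Brentford.
```

Inside the loop, this would amount to news_item['description'] = description_without_images(item.description.text). Note that the result is plain text; if the surrounding HTML is wanted instead, str(fragment) after the decompose step should give the markup without the image.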