Not sure why the URL isn't being passed through in an RSS feed scraper


I just want to scrape news items from RSS feeds.

import feedparser
import pandas as pd
from datetime import datetime

archive = pd.read_csv("national_news_scrape.csv")

# Your list of feeds
feeds = [{"type": "news","title": "BBC", "url": "http://feeds.bbci.co.uk/news/uk/rss.xml"},
        {"type": "news","title": "The Economist", "url": "https://www.economist.com/international/rss.xml"},    
        {"type": "news","title": "The New Statesman", "url": "https://www.newstatesman.com/feed"},    
        {"type": "news","title": "The New York Times", "url": "https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml"},
        {"type": "news","title": "Metro UK","url": "https://metro.co.uk/feed/"},
        {"type": "news", "title": "Evening Standard", "url": "https://www.standard.co.uk/rss.xml"},
        {"type": "news","title": "Daily Mail", "url": "https://www.dailymail.co.uk/articles.rss"},
        {"type": "news","title": "Sky News", "url": "https://news.sky.com/feeds/rss/home.xml"},
        {"type": "news", "title": "The Mirror", "url": "https://www.mirror.co.uk/news/?service=rss"},
        {"type": "news", "title": "The Sun", "url": "https://www.thesun.co.uk/news/feed/"},
        {"type": "news", "title": "Sky News", "url": "https://news.sky.com/feeds/rss/home.xml"},
        {"type": "news", "title": "The Guardian", "url": "https://www.theguardian.com/uk/rss"},
        {"type": "news", "title": "The Independent", "url": "https://www.independent.co.uk/news/uk/rss"},
        #{"type": "news", "title": "The Telegraph", "url": "https://www.telegraph.co.uk/news/rss.xml"},
        {"type": "news", "title": "The Times", "url": "https://www.thetimes.co.uk/?service=rss"}]

# Create an empty DataFrame to store the news
news_df = pd.DataFrame(columns=['source', 'title', 'date', 'summary', "url"])

# For each feed, parse it and add the news to the DataFrame
for feed in feeds:
    print(f"Scraping: {feed['title']}")
    d = feedparser.parse(feed['url'])
    for entry in d.entries:
        # Some feeds do not have 'summary' field, handle this case
        summary = entry.summary if hasattr(entry, 'summary') else ''
        url = entry.link
        # Add the news to the DataFrame
        news_df = news_df.append({'source': feed['title'],
                                  'title': entry.title,
                                  'url': url,
                                  'date': datetime(*entry.published_parsed[:6]),
                                  'summary': summary,
                                  }, ignore_index=True)

combined = pd.concat([news_df, archive]).drop_duplicates()

# Save the DataFrame to a CSV file
news_df.to_csv('national_news_scrape.csv', index=False)

Why doesn't it read the URL of an individual article?


1 Answer

Answer from Amira Bedhiafi:

The code you provided does read the URL of each individual article. The URL is extracted from the entry.link attribute inside the loop that iterates over the feed entries, and it is stored in the 'url' column of the news_df DataFrame using:

url = entry.link

The URL is added to the DataFrame using the news_df = news_df.append({...}) line within the same loop.
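If you want to verify this yourself, a quick check (a minimal sketch using the BBC feed from your list; the other feeds behave the same way) is to print the link of the first few entries:

import feedparser

# Parse one feed and print the first few article links.
d = feedparser.parse("http://feeds.bbci.co.uk/news/uk/rss.xml")
for entry in d.entries[:3]:
    # entries are dict-like, so .get() avoids errors on missing fields
    print(entry.get("title", "<no title>"), "->", entry.get("link", "<no link>"))

If the links print here, the URLs are being read correctly; the problem is further down.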

The actual problem is the save step. You build the combined DataFrame by concatenating news_df with archive and dropping duplicates, but you then write news_df to the CSV rather than combined, so the merge with the archive is discarded. Change the last line to:

combined.to_csv('national_news_scrape.csv', index=False)
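As a side note, DataFrame.append was deprecated and removed in pandas 2.0, so on a current pandas your loop will fail outright. A more robust version collects the rows in a plain list, builds the DataFrame once, and saves combined. This is a sketch under that assumption, reusing the feeds list from your question:

import feedparser
import pandas as pd
from datetime import datetime

archive = pd.read_csv("national_news_scrape.csv")

rows = []
for feed in feeds:  # the feeds list defined in the question
    d = feedparser.parse(feed["url"])
    for entry in d.entries:
        rows.append({
            "source": feed["title"],
            "title": entry.get("title", ""),
            "url": entry.get("link", ""),
            # guard against feeds that omit the publication date
            "date": datetime(*entry.published_parsed[:6]) if entry.get("published_parsed") else None,
            "summary": entry.get("summary", ""),
        })

# Build the DataFrame in one go instead of appending row by row
news_df = pd.DataFrame(rows, columns=["source", "title", "date", "summary", "url"])

# Save the combined frame, not news_df, so the archive is preserved
combined = pd.concat([news_df, archive]).drop_duplicates()
combined.to_csv("national_news_scrape.csv", index=False)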