Not sure why the URL isn't being passed through in an RSS feed scraper


I just want to scrape news items from RSS feeds.

import feedparser
import pandas as pd
from datetime import datetime

archive = pd.read_csv("national_news_scrape.csv")

# Your list of feeds
feeds = [{"type": "news","title": "BBC", "url": "http://feeds.bbci.co.uk/news/uk/rss.xml"},
        {"type": "news","title": "The Economist", "url": "https://www.economist.com/international/rss.xml"},    
        {"type": "news","title": "The New Statesman", "url": "https://www.newstatesman.com/feed"},    
        {"type": "news","title": "The New York Times", "url": "https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml"},
        {"type": "news","title": "Metro UK","url": "https://metro.co.uk/feed/"},
        {"type": "news", "title": "Evening Standard", "url": "https://www.standard.co.uk/rss.xml"},
        {"type": "news","title": "Daily Mail", "url": "https://www.dailymail.co.uk/articles.rss"},
        {"type": "news","title": "Sky News", "url": "https://news.sky.com/feeds/rss/home.xml"},
        {"type": "news", "title": "The Mirror", "url": "https://www.mirror.co.uk/news/?service=rss"},
        {"type": "news", "title": "The Sun", "url": "https://www.thesun.co.uk/news/feed/"},
        {"type": "news", "title": "Sky News", "url": "https://news.sky.com/feeds/rss/home.xml"},
        {"type": "news", "title": "The Guardian", "url": "https://www.theguardian.com/uk/rss"},
        {"type": "news", "title": "The Independent", "url": "https://www.independent.co.uk/news/uk/rss"},
        #{"type": "news", "title": "The Telegraph", "url": "https://www.telegraph.co.uk/news/rss.xml"},
        {"type": "news", "title": "The Times", "url": "https://www.thetimes.co.uk/?service=rss"}]

# Create an empty DataFrame to store the news
news_df = pd.DataFrame(columns=['source', 'title', 'date', 'summary', "url"])

# For each feed, parse it and add the news to the DataFrame
for feed in feeds:
    print(f"Scraping: {feed['title']}")
    d = feedparser.parse(feed['url'])
    for entry in d.entries:
        # Some feeds do not have 'summary' field, handle this case
        summary = entry.summary if hasattr(entry, 'summary') else ''
        url = entry.link
        # Add the news to the DataFrame
        news_df = news_df.append({'source': feed['title'],
                                  'title': entry.title,
                                  'url': url,
                                  'date': datetime(*entry.published_parsed[:6]),
                                  'summary': summary,
                                  }, ignore_index=True)

combined = pd.concat([news_df, archive]).drop_duplicates()

# Save the DataFrame to a CSV file
news_df.to_csv('national_news_scrape.csv', index=False)

Why doesn't it read the URL of an individual article?


1 Answer

Answer from Amira Bedhiafi:

The code you provided does read the URL of each individual article. The URL is extracted from the entry.link attribute inside the loop that iterates over the feed entries, and it is stored in the 'url' column of the news_df DataFrame using:

url = entry.link

The URL is added to the DataFrame using the news_df = news_df.append({...}) line within the same loop.
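If you want to verify this yourself, a quick check (a minimal sketch using the BBC feed from your list; the other feeds behave the same way) is to print the link of the first few entries:

import feedparser

# Parse one feed and print the first few article links.
d = feedparser.parse("http://feeds.bbci.co.uk/news/uk/rss.xml")
for entry in d.entries[:3]:
    # entries are dict-like, so .get() avoids errors on missing fields
    print(entry.get("title", "<no title>"), "->", entry.get("link", "<no link>"))

If the links print here, the URLs are being read correctly; the problem is further down.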

The actual problem is the save step. You build the combined DataFrame by concatenating news_df with archive and dropping duplicates, but you then write news_df to the CSV rather than combined, so the merge with the archive is discarded. Change the last line to:

combined.to_csv('national_news_scrape.csv', index=False)
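As a side note, DataFrame.append was deprecated and removed in pandas 2.0, so on a current pandas your loop will fail outright. A more robust version collects the rows in a plain list, builds the DataFrame once, and saves combined. This is a sketch under that assumption, reusing the feeds list from your question:

import feedparser
import pandas as pd
from datetime import datetime

archive = pd.read_csv("national_news_scrape.csv")

rows = []
for feed in feeds:  # the feeds list defined in the question
    d = feedparser.parse(feed["url"])
    for entry in d.entries:
        rows.append({
            "source": feed["title"],
            "title": entry.get("title", ""),
            "url": entry.get("link", ""),
            # guard against feeds that omit the publication date
            "date": datetime(*entry.published_parsed[:6]) if entry.get("published_parsed") else None,
            "summary": entry.get("summary", ""),
        })

# Build the DataFrame in one go instead of appending row by row
news_df = pd.DataFrame(rows, columns=["source", "title", "date", "summary", "url"])

# Save the combined frame, not news_df, so the archive is preserved
combined = pd.concat([news_df, archive]).drop_duplicates()
combined.to_csv("national_news_scrape.csv", index=False)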