I just want to scrape news feeds from RSS. Here is my code:
```python
import feedparser
import pandas as pd
from datetime import datetime

# Load the existing archive; start with an empty frame if the CSV does not exist yet
try:
    archive = pd.read_csv("national_news_scrape.csv")
except FileNotFoundError:
    archive = pd.DataFrame(columns=['source', 'title', 'date', 'summary', 'url'])

# Your list of feeds
feeds = [
    {"type": "news", "title": "BBC", "url": "http://feeds.bbci.co.uk/news/uk/rss.xml"},
    {"type": "news", "title": "The Economist", "url": "https://www.economist.com/international/rss.xml"},
    {"type": "news", "title": "The New Statesman", "url": "https://www.newstatesman.com/feed"},
    {"type": "news", "title": "The New York Times", "url": "https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml"},
    {"type": "news", "title": "Metro UK", "url": "https://metro.co.uk/feed/"},
    {"type": "news", "title": "Evening Standard", "url": "https://www.standard.co.uk/rss.xml"},
    {"type": "news", "title": "Daily Mail", "url": "https://www.dailymail.co.uk/articles.rss"},
    {"type": "news", "title": "Sky News", "url": "https://news.sky.com/feeds/rss/home.xml"},
    {"type": "news", "title": "The Mirror", "url": "https://www.mirror.co.uk/news/?service=rss"},
    {"type": "news", "title": "The Sun", "url": "https://www.thesun.co.uk/news/feed/"},
    {"type": "news", "title": "The Guardian", "url": "https://www.theguardian.com/uk/rss"},
    {"type": "news", "title": "The Independent", "url": "https://www.independent.co.uk/news/uk/rss"},
    # {"type": "news", "title": "The Telegraph", "url": "https://www.telegraph.co.uk/news/rss.xml"},
    {"type": "news", "title": "The Times", "url": "https://www.thetimes.co.uk/?service=rss"},
]
# Create an empty DataFrame to store the news
news_df = pd.DataFrame(columns=['source', 'title', 'date', 'summary', 'url'])

# For each feed, parse it and add the news to the DataFrame
for feed in feeds:
    print(f"Scraping: {feed['title']}")
    d = feedparser.parse(feed['url'])
    for entry in d.entries:
        # Some feeds do not have a 'summary' field; handle this case
        summary = entry.summary if hasattr(entry, 'summary') else ''
        url = entry.link
        # Some entries also lack a parsed publication date; guard against that
        date = (datetime(*entry.published_parsed[:6])
                if getattr(entry, 'published_parsed', None) else None)
        # Add the news to the DataFrame
        # (DataFrame.append was removed in pandas 2.0; this requires pandas < 2)
        news_df = news_df.append({'source': feed['title'],
                                  'title': entry.title,
                                  'url': url,
                                  'date': date,
                                  'summary': summary,
                                  }, ignore_index=True)
combined = pd.concat([news_df, archive]).drop_duplicates()

# Save the DataFrame to a CSV file
news_df.to_csv('national_news_scrape.csv', index=False)
```
Why doesn't it read the URL of an individual article?
The code you provided does read the URL of each individual article. The URL is extracted from the `entry.link` attribute inside the loop that iterates over the feed entries, and it is stored in the 'url' column of the `news_df` DataFrame by the `news_df = news_df.append({...})` line within the same loop. You can confirm this by parsing a single feed and inspecting one entry, as in the sketch below.
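For example, a minimal check (using the BBC feed URL from your list; any of the others would do) prints the link of the first entry:

```python
import feedparser

# Parse one feed and look at the first entry's fields
d = feedparser.parse("http://feeds.bbci.co.uk/news/uk/rss.xml")
if d.entries:
    first = d.entries[0]
    print(first.title)
    print(first.link)  # this is the article URL your loop stores in the 'url' column
```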
The actual problem is in the last step: you are writing the `news_df` DataFrame to the CSV file, but you are not saving the `combined` DataFrame, which is created by concatenating the `news_df` and `archive` DataFrames and removing duplicates. You may need to save `combined` instead:
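```python
# Write the deduplicated union of the new scrape and the archive
combined.to_csv('national_news_scrape.csv', index=False)
```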