Web scraping articles from Google News

I am trying to scrape Google News with the gnews package. However, I don't know how to retrieve older articles, for example articles from 2010.

from gnews import GNews
from newspaper import Article
import pandas as pd
import datetime

google_news = GNews(language='es', country='Argentina', period = '7d')
argentina_news = google_news.get_news('protesta clarin')
print(len(argentina_news))

This code works perfectly for recent articles, but I need older ones. I saw https://github.com/ranahaani/GNews#todo, which shows something like the following:

google_news = GNews(language='es', country='Argentina', period='7d', start_date='01-01-2015', end_date='01-01-2016', max_results=10, exclude_websites=['yahoo.com', 'cnn.com'],
                    proxy=proxy)

But when I try start_date I get:

TypeError: __init__() got an unexpected keyword argument 'start_date'

Can anyone help me get articles for specific dates? Thank you very much, guys!

There are 3 best solutions below

wkl On BEST ANSWER

The example code does not work with gnews==0.2.7, which is the latest release you can install from PyPI via pip. The documentation describes the unreleased mainline code, which you can get directly from the project's git repository.

You can confirm this by inspecting GNews.__init__, which has no start_date or end_date keyword arguments:

In [1]: import gnews

In [2]: gnews.GNews.__init__??
Signature:
gnews.GNews.__init__(
    self,
    language='en',
    country='US',
    max_results=100,
    period=None,
    exclude_websites=None,
    proxy=None,
)
Docstring: Initialize self.  See help(type(self)) for accurate signature.
Source:
    def __init__(self, language="en", country="US", max_results=100, period=None, exclude_websites=None, proxy=None):
        self.countries = tuple(AVAILABLE_COUNTRIES),
        self.languages = tuple(AVAILABLE_LANGUAGES),
        self._max_results = max_results
        self._language = language
        self._country = country
        self._period = period
        self._exclude_websites = exclude_websites if exclude_websites and isinstance(exclude_websites, list) else []
        self._proxy = {'http': proxy, 'https': proxy} if proxy else None
File:      ~/src/news-test/.venv/lib/python3.10/site-packages/gnews/gnews.py
Type:      function
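
The same check can be done programmatically. Below is a minimal sketch using the standard-library inspect module; accepts_keywords and OldGNews are hypothetical names introduced for illustration, and OldGNews is only a stand-in that copies the 0.2.7 signature shown above so the example runs without gnews installed.

```python
import inspect


def accepts_keywords(func, *names):
    """Return True if `func` can be called with every keyword in `names`."""
    params = inspect.signature(func).parameters
    # A **kwargs catch-all accepts any keyword argument.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return True
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    return all(
        name in params and params[name].kind in keyword_kinds for name in names
    )


# Hypothetical stand-in with the same signature as GNews.__init__ in gnews 0.2.7.
class OldGNews:
    def __init__(self, language="en", country="US", max_results=100,
                 period=None, exclude_websites=None, proxy=None):
        pass


print(accepts_keywords(OldGNews.__init__, "start_date", "end_date"))  # False
print(accepts_keywords(OldGNews.__init__, "period", "proxy"))         # True
```

With the real package you would pass gnews.GNews.__init__ instead of the stand-in.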

The start_date and end_date functionality was added only recently, so to use it you will need to install the module from the project's git source.

# use whatever you use to uninstall any pre-existing gnews module
pip uninstall gnews

# install from the project's git main branch
pip install git+https://github.com/ranahaani/GNews.git

Now you can use the start/end functionality:

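import datetime

from gnews import GNews  # import the class itself; "import gnews" alone leaves GNews undefined

start = datetime.date(2015, 1, 15)
end = datetime.date(2015, 1, 16)

# period is omitted because it is ignored when start_date and end_date are set
google_news = GNews(language='es', country='Argentina', start_date=start, end_date=end)
rsp = google_news.get_news("protesta")
print(rsp)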

I get this as a result:

[{'title': 'Latin Roots: The Protest Music Of South America - NPR',
  'description': 'Latin Roots: The Protest Music Of South America  NPR',
  'published date': 'Thu, 15 Jan 2015 08:00:00 GMT',
  'url': 'https://www.npr.org/sections/world-cafe/2015/01/15/377491862/latin-roots-the-protest-music-of-south-america',
  'publisher': {'href': 'https://www.npr.org', 'title': 'NPR'}}]

Also note:

  • period is ignored if you set start_date and end_date
  • Their documentation shows you can pass the dates as tuples like (2015, 1, 15). This doesn't seem to work, so to be safe pass a datetime.date object as in the example above.
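
If your dates arrive in mixed forms, one way to follow the advice above is to normalize them before constructing GNews. This is a minimal sketch; to_date is a hypothetical helper, not part of gnews.

```python
import datetime


def to_date(value):
    """Coerce a date, datetime, or (year, month, day) tuple to datetime.date."""
    # Check datetime first: datetime.datetime is a subclass of datetime.date.
    if isinstance(value, datetime.datetime):
        return value.date()
    if isinstance(value, datetime.date):
        return value
    if isinstance(value, (tuple, list)) and len(value) == 3:
        year, month, day = value
        return datetime.date(year, month, day)
    raise TypeError(f"cannot interpret {value!r} as a date")


start = to_date((2015, 1, 15))
end = to_date(datetime.datetime(2015, 1, 16, 12, 30))
print(start, end)  # 2015-01-15 2015-01-16
```

The resulting start and end values can then be passed as start_date and end_date, assuming you installed gnews from the git source as described.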
JBJ On

You can also use the Python requests module and XPath (via lxml) to get what you need without the gnews package. Here is a snippet of the code:

import requests
from lxml.html import fromstring

url = 'https://www.google.com/search?q=google+news&&hl=es&sxsrf=ALiCzsZoYzwIP0ZR9d6LLa5U6IJ2WDo1sw%3A1660116293247&source=lnt&tbs=cdr%3A1%2Ccd_min%3A8%2F10%2F2010%2Ccd_max%3A8%2F10%2F2022&tbm=nws'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4758.87 Safari/537.36",
    }

r = requests.get(url,  headers=headers, timeout=30)
root = fromstring(r.text)

news = []
# Each result card; these class names are generated by Google and may change without notice.
for i in root.xpath('//div[@class="xuvV6b BGxR7d"]'):
    item = {}
    # xpath() returns a list, so every field below is a list of strings.
    item['title'] = i.xpath('.//div[@class="mCBkyc y355M ynAwRc MBeuO nDgy9d"]//text()')
    item['description'] = i.xpath('.//div[@class="GI74Re nDgy9d"]//text()')
    item['published date'] = i.xpath('.//div[@class="OSrXXb ZE0LJd"]//span/text()')
    item['url'] = i.xpath('.//a/@href')
    item['publisher'] = i.xpath('.//div[@class="CEMjEf NUnG9d"]//span/text()')
    news.append(item)
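
The date range in the URL above is carried by the tbs parameter (cdr:1 with cd_min and cd_max), and tbm=nws selects news results. A rough sketch of building such a URL for your own query and dates follows. build_news_search_url is a hypothetical helper, and the M/D/YYYY date format is an assumption inferred from the URL above, not documented behaviour.

```python
import datetime
from urllib.parse import parse_qs, urlencode, urlsplit


def build_news_search_url(query, start, end, hl="es"):
    """Build a Google search URL limited to news results in a date range."""
    def fmt(d):
        # Assumption: M/D/YYYY without zero padding, as in "8/10/2010" above.
        return f"{d.month}/{d.day}/{d.year}"

    params = {
        "q": query,
        "hl": hl,
        "tbm": "nws",  # news results, as in the URL above
        "tbs": f"cdr:1,cd_min:{fmt(start)},cd_max:{fmt(end)}",
    }
    return "https://www.google.com/search?" + urlencode(params)


url = build_news_search_url(
    "protesta clarin", datetime.date(2010, 1, 1), datetime.date(2010, 12, 31)
)
print(url)
# Decode the query string again to inspect the date filter that was encoded.
print(parse_qs(urlsplit(url).query)["tbs"][0])
```

The returned string can be used in place of the hard-coded url in the snippet above.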

And here is what I get when I print the collected items:

for i in news:
    print(i)

"""
{'published date': ['Hace 1 mes'], 'url': ['https://www.20minutos.es/noticia/5019464/0/google-news-regresa-a-espana-tras-ocho-anos-cerrado/'], 'publisher': ['20Minutos'], 'description': [u'"Google News ayuda a los lectores a encontrar noticias de fuentes \nfidedignas, desde los sitios web de noticias m\xe1s grandes del mundo hasta \nlas publicaciones...'], 'title': [u'Noticias de 20minutos en Google News: c\xf3mo seguir la \xfaltima ...']}
{'published date': ['14 jun 2022'], 'url': ['https://www.bbc.com/mundo/noticias-61803565'], 'publisher': ['BBC'], 'description': [u'C\xf3mo funciona LaMDA, el sistema de inteligencia artificial que "cobr\xf3 \nconciencia y siente" seg\xfan un ingeniero de Google. Alicia Hern\xe1ndez \n@por_puesto; BBC News...'], 'title': [u'C\xf3mo funciona LaMDA, el sistema de inteligencia artificial que "cobr\xf3 \nconciencia y siente" seg\xfan un ingeniero de Google']}
{'published date': ['24 mar 2022'], 'url': ['https://www.theguardian.com/world/2022/mar/24/russia-blocks-google-news-after-it-bans-ads-on-proukraine-invasion-content'], 'publisher': ['The Guardian'], 'description': [u'Russia has blocked Google News, accusing it of promoting \u201cinauthentic \ninformation\u201d about the invasion of Ukraine. The ban came just hours after \nGoogle...'], 'title': ['Russia blocks Google News after ad ban on content condoning Ukraine invasion']}
{'published date': ['2 feb 2021'], 'url': ['https://dircomfidencial.com/medios/google-news-showcase-que-es-y-como-funciona-el-agregador-por-el-que-los-medios-pueden-generar-ingresos-20210202-0401/'], 'publisher': ['Dircomfidencial'], 'description': [u'Google News Showcase: qu\xe9 es y c\xf3mo funciona el agregador por el que los \nmedios pueden generar ingresos. MEDIOS | 2 FEBRERO 2021 | ACTUALIZADO: 3 \nFEBRERO 2021 8...'], 'title': [u'Google News Showcase: qu\xe9 es y c\xf3mo funciona el ...']}
{'published date': ['4 nov 2021'], 'url': ['https://www.euronews.com/next/2021/11/04/google-news-returns-to-spain-after-the-country-adopts-new-eu-copyright-law'], 'publisher': ['Euronews'], 'description': ['News aggregator Google News will return to Spain following a change in \ncopyright law that allows online platforms to negotiate fees directly with \ncontent...'], 'title': ['Google News returns to Spain after the country adopts new EU copyright law']}
{'published date': ['27 may 2022'], 'url': ['https://indianexpress.com/article/technology/tech-news-technology/google-hit-with-fresh-uk-investigation-over-ad-tech-dominance-7938896/'], 'publisher': ['The Indian Express'], 'description': ['The Indian Express website has been rated GREEN for its credibility and \ntrustworthiness by Newsguard, a global service that rates news sources for \ntheir...'], 'title': ['Google hit with fresh UK investigation over ad tech dominance']}
{'published date': [u'Hace 1 d\xeda'], 'url': ['https://indianexpress.com/article/technology/tech-news-technology/google-down-outage-issues-user-error-8079170/'], 'publisher': ['The Indian Express'], 'description': ['The outage also impacted a range of other Google products such as Google \n... Join our Telegram channel (The Indian Express) for the latest news and \nupdates.'], 'title': ['Google, Google Maps and other services recover after global ...']}
{'published date': ['14 nov 2016'], 'url': ['https://www.reuters.com/article/us-alphabet-advertising-idUSKBN1392MM'], 'publisher': ['Reuters'], 'description': ["Google's move similarly does not address the issue of fake news or hoaxes \nappearing in Google search results. That happened in the last few days, \nwhen a search..."], 'title': ['Google, Facebook move to restrict ads on fake news sites']}
{'published date': ['27 sept 2021'], 'url': ['https://news.sky.com/story/googles-appeal-against-eu-record-3-8bn-fine-starts-today-as-us-cases-threaten-to-break-the-company-up-12413655'], 'publisher': ['Sky News'], 'description': ["Google's five-day appeal against the decision is being heard at European \n... told Sky News he expected there could be another appeal after the \nhearing in..."], 'title': [u"Google's appeal against EU record \xa33.8bn fine starts today, as US cases \nthreaten to break the company up"]}
{'published date': ['11 jun 2022'], 'url': ['https://www.washingtonpost.com/technology/2022/06/11/google-ai-lamda-blake-lemoine/'], 'publisher': ['The Washington Post'], 'description': [u"SAN FRANCISCO \u2014 Google engineer Blake Lemoine opened his laptop to the \ninterface for LaMDA, Google's artificially intelligent chatbot generator,..."], 'title': ["The Google engineer who thinks the company's AI has come ..."]}
"""
jfive989 On

You can use feedparser, which parses the RSS feed that Google News publishes.

First, install the Python library:

$ pip3 install feedparser

Here is a code snippet:

import feedparser 
 
url = 'https://news.google.com/rss' 
feed = feedparser.parse(url) 
 
for entry in feed.entries: 
    print(entry)

The result:

{
   "title":"With I-10 Closed by Fire Damage, Los Angeles Drivers Find Ways to Cope - The New York Times",
   "title_detail":{
      "type":"text/plain",
      "language":"None",
      "base":"https://news.google.com/rss?hl=en-US&gl=US&ceid=US:en",
      "value":"With I-10 Closed by Fire Damage, Los Angeles Drivers Find Ways to Cope - The New York Times"
   },
   "links":[
      {
         "rel":"alternate",
         "type":"text/html",
         "href":"https://news.google.com/rss/articles/CBMiTWh0dHBzOi8vd3d3Lm55dGltZXMuY29tLzIwMjMvMTEvMTQvdXMvbG9zLWFuZ2VsZXMtZHJpdmVycy1mcmVld2F5LWNsb3NlZC5odG1s0gEA?oc=5"
      }
   ],
   "link":"https://news.google.com/rss/articles/CBMiTWh0dHBzOi8vd3d3Lm55dGltZXMuY29tLzIwMjMvMTEvMTQvdXMvbG9zLWFuZ2VsZXMtZHJpdmVycy1mcmVld2F5LWNsb3NlZC5odG1s0gEA?oc=5",
   "id":"2595334477",
   "guidislink":false,
   "published":"Wed, 15 Nov 2023 16:06:00 GMT",
   "published_parsed":time.struct_time(tm_year=2023,tm_mon=11,tm_mday=15,tm_hour=16,tm_min=6,tm_sec=0,tm_wday=2,tm_yday=319,tm_isdst=0),
   "summary":"<ol><li><a href=\"https://news.google.com/rss/articles/CBMiTWh0dHBzOi8vd3d3Lm55dGltZXMuY29tLzIwMjMvMTEvMTQvdXMvbG9zLWFuZ2VsZXMtZHJpdmVycy1mcmVld2F5LWNsb3NlZC5odG1s0gEA?oc=5\" target=\"_blank\">With I-10 Closed by Fire Damage, Los Angeles Drivers Find Ways to Cope</a>&nbsp;&nbsp;<font color=\"#6f6f6f\">The New York Times</font></li><li><a href=\"https://news.google.com/rss/articles/CCAiC3lQcnhaSE1lVzFZmAEB?oc=5\" target=\"_blank\">Drivers urged to avoid 10 Freeway, surface streets in downtown Los Angeles amid repair work</a>&nbsp;&nbsp;<font color=\"#6f6f6f\">KTLA 5</font></li><li><a href=\"https://news.google.com/rss/articles/CBMibWh0dHBzOi8va3RsYS5jb20vbmV3cy9sb2NhbC1uZXdzL2ZpcmVmaWdodGVycy1leHRpbmd1aXNoLWFub3RoZXItZmlyZS11bmRlcm5lYXRoLWFub3RoZXItbG9zLWFuZ2VsZXMtZnJlZXdheS_SAXFodHRwczovL2t0bGEuY29tL25ld3MvbG9jYWwtbmV3cy9maXJlZmlnaHRlcnMtZXh0aW5ndWlzaC1hbm90aGVyLWZpcmUtdW5kZXJuZWF0aC1hbm90aGVyLWxvcy1hbmdlbGVzLWZyZWV3YXkvYW1wLw?oc=5\" target=\"_blank\">Another fire breaks out underneath L.A. freeway</a>&nbsp;&nbsp;<font color=\"#6f6f6f\">KTLA Los Angeles</font></li><li><strong><a href=\"https://news.google.com/stories/CAAqNggKIjBDQklTSGpvSmMzUnZjbmt0TXpZd1NoRUtEd2pOMHNiVkNSR1lubFhjRUxHdVdpZ0FQAQ?hl=en-US&amp;gl=US&amp;ceid=US:en&amp;oc=5\" target=\"_blank\">View Full Coverage on Google News</a></strong></li></ol>",
   "summary_detail":{
      "type":"text/html",
      "language":"None",
      "base":"https://news.google.com/rss?hl=en-US&gl=US&ceid=US:en",
      "value":"<ol><li><a href=\"https://news.google.com/rss/articles/CBMiTWh0dHBzOi8vd3d3Lm55dGltZXMuY29tLzIwMjMvMTEvMTQvdXMvbG9zLWFuZ2VsZXMtZHJpdmVycy1mcmVld2F5LWNsb3NlZC5odG1s0gEA?oc=5\" target=\"_blank\">With I-10 Closed by Fire Damage, Los Angeles Drivers Find Ways to Cope</a>&nbsp;&nbsp;<font color=\"#6f6f6f\">The New York Times</font></li><li><a href=\"https://news.google.com/rss/articles/CCAiC3lQcnhaSE1lVzFZmAEB?oc=5\" target=\"_blank\">Drivers urged to avoid 10 Freeway, surface streets in downtown Los Angeles amid repair work</a>&nbsp;&nbsp;<font color=\"#6f6f6f\">KTLA 5</font></li><li><a href=\"https://news.google.com/rss/articles/CBMibWh0dHBzOi8va3RsYS5jb20vbmV3cy9sb2NhbC1uZXdzL2ZpcmVmaWdodGVycy1leHRpbmd1aXNoLWFub3RoZXItZmlyZS11bmRlcm5lYXRoLWFub3RoZXItbG9zLWFuZ2VsZXMtZnJlZXdheS_SAXFodHRwczovL2t0bGEuY29tL25ld3MvbG9jYWwtbmV3cy9maXJlZmlnaHRlcnMtZXh0aW5ndWlzaC1hbm90aGVyLWZpcmUtdW5kZXJuZWF0aC1hbm90aGVyLWxvcy1hbmdlbGVzLWZyZWV3YXkvYW1wLw?oc=5\" target=\"_blank\">Another fire breaks out underneath L.A. freeway</a>&nbsp;&nbsp;<font color=\"#6f6f6f\">KTLA Los Angeles</font></li><li><strong><a href=\"https://news.google.com/stories/CAAqNggKIjBDQklTSGpvSmMzUnZjbmt0TXpZd1NoRUtEd2pOMHNiVkNSR1lubFhjRUxHdVdpZ0FQAQ?hl=en-US&amp;gl=US&amp;ceid=US:en&amp;oc=5\" target=\"_blank\">View Full Coverage on Google News</a></strong></li></ol>"
   },
   "source":{
      "href":"https://www.nytimes.com",
      "title":"The New York Times"
   }
}
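
Since each entry carries a published_parsed field, as in the output above, you can filter entries by date after parsing. This is a minimal sketch: entry_date and filter_by_date are hypothetical helpers, and the sample entries are hand-written stand-ins so the example runs without network access. It only narrows whatever the feed returns; it does not make the feed return older articles.

```python
import datetime
import time


def entry_date(entry):
    """Return the publication date of a feed entry as datetime.date."""
    parsed = entry["published_parsed"]  # a time.struct_time, as shown above
    return datetime.date(parsed.tm_year, parsed.tm_mon, parsed.tm_mday)


def filter_by_date(entries, start, end):
    """Keep entries published between start and end, inclusive."""
    return [
        e for e in entries
        if e.get("published_parsed") is not None
        and start <= entry_date(e) <= end
    ]


# Hypothetical stand-ins for feedparser entries (which also support dict access).
sample = [
    {"title": "A", "published_parsed": time.struct_time((2023, 11, 15, 16, 6, 0, 2, 319, 0))},
    {"title": "B", "published_parsed": time.struct_time((2023, 11, 10, 9, 0, 0, 4, 314, 0))},
    {"title": "C"},  # entry without a publication date is skipped
]

kept = filter_by_date(sample, datetime.date(2023, 11, 14), datetime.date(2023, 11, 16))
print([e["title"] for e in kept])  # ['A']
```

With a real feed you would pass feed.entries instead of sample.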
