Python library newspaper is not returning the published date

Question

159 Views Asked by Sam Hall At 18 October 2022 at 14:26

I am using the newspaper Python library to extract data from news stories. The problem is that I am not getting the published date for some URLs. The URLs themselves work fine: they all return 200. I am doing this for a very large dataset, and this is one of the URLs for which the date extraction did not work. The code works for some links and not for others from the same domain, so I know the problem isn't something like my IP being blocked for too many requests. I also tried it on just this one URL and got the same result (no date).

from newspaper import Article

def split(link):
    try:
        story = Article(link)
        story.download()
        story.parse()
        date_time = str(story.publish_date)
        split_date = date_time.split()
        date = split_date[0]
        if date != "None":
            print(date)
    except:
        print("This URL did not return a published date. Try a different URL.")
        print(link)

if __name__ == "__main__":
    link = "https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one"
    split(link)

I am getting this output:

This URL did not return a published date. Try a different URL.
https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one

There are 2 best solutions below

Full Stack Developer On 18 October 2022 at 14:28

Please check the link. I checked it and it is unavailable now. If the link is unavailable, the code will not work.
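
One way to confirm what the server returns for a link is to request it directly and look at the status code. The sketch below uses only the standard library. `fetch_status` and its `opener` parameter are names made up for this illustration, and some sites answer HEAD requests or unfamiliar user agents differently from a browser, so treat the result as a hint rather than proof.

```python
import urllib.error
import urllib.request


def fetch_status(url, opener=urllib.request.urlopen, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived.

    `opener` is a hypothetical hook (defaults to urllib.request.urlopen) so the
    function can be exercised without touching the network.
    """
    request = urllib.request.Request(
        url,
        method="HEAD",  # assumption: the server answers HEAD like GET
        headers={"User-Agent": "Mozilla/5.0"},
    )
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses are raised as HTTPError; the code is still useful.
        return error.code
    except OSError:
        # URLError, timeouts and other connection problems: no status available.
        return None
```

Calling `fetch_status(link)` on each URL before handing it to newspaper would show whether a link that "works" in a browser also answers with 200 from a script.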

**Life is complex** · Accepted Answer · 2022-10-19T11:58:21.117000

Try adding some error handling to your code to catch URLs that return a 404, such as this one: https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one

from newspaper import Config
from newspaper import Article
from newspaper.article import ArticleException

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

base_url = 'https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one'
try:
    article = Article(base_url, config=config)
    article.download()
    article.parse()
except ArticleException as error:
    print(error)

Output:

Article `download()` failed with 404 Client Error: Not Found for url: https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one on URL https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one
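
The code in the question prints the same message for every failure, so a failed download looks the same as a page with no date. A minimal sketch of how the two cases could be told apart is below. `format_publish_date` and `describe_article` are hypothetical helper names, and the sketch assumes `publish_date` is either a `datetime` or `None`.

```python
from datetime import datetime
from typing import Optional, Tuple


def format_publish_date(publish_date: Optional[datetime]) -> Optional[str]:
    """Return 'YYYY-MM-DD' for a datetime, or None when no date was extracted."""
    if publish_date is None:
        return None
    return publish_date.strftime("%Y-%m-%d")


def describe_article(url: str) -> Tuple[str, Optional[str]]:
    """Return (status, detail); status is 'ok', 'no_date' or 'download_failed'."""
    # Imported here so format_publish_date can be used without newspaper installed.
    from newspaper import Article
    from newspaper.article import ArticleException

    article = Article(url)
    try:
        article.download()
        article.parse()
    except ArticleException as error:
        # Raised by parse() when the download did not succeed, e.g. on a 404.
        return "download_failed", str(error)

    date = format_publish_date(article.publish_date)
    if date is None:
        # The page was fetched and parsed, but no publish date was found.
        return "no_date", None
    return "ok", date
```

For a large dataset, collecting the `(status, detail)` pairs per URL makes it easy to see how many links failed to download and how many simply had no date.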

Newspaper3k has multiple ways to extract the publish dates from articles. Take a look at this document that I wrote on how to use Newspaper3k.

Here is an example for this valid URL https://www.aljazeera.com/program/featured-documentaries/2022/3/31/lords-of-water that extracts data elements from the page's meta tags.

from newspaper import Config
from newspaper import Article
from newspaper.article import ArticleException

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

base_url = 'https://www.aljazeera.com/program/featured-documentaries/2022/3/31/lords-of-water'
try:
    article = Article(base_url, config=config)
    article.download()
    article.parse()
    article_meta_data = article.meta_data

    article_title = [value for (key, value) in article_meta_data.items() if key == 'pageTitle']
    print(article_title)

    article_published_date = str([value for (key, value) in article_meta_data.items() if key == 'publishedDate'])
    print(article_published_date)

    article_description = [value for (key, value) in article_meta_data.items() if key == 'description']
    print(article_description)

except ArticleException as error:
    print(error)

Output:

['Lords of Water']
['2022-03-31T06:08:59']
['Is water the new oil? We expose the financialisation of water.']
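
The meta tag lookup above can also serve as a fallback when `publish_date` comes back empty. The sketch below assumes the `publishedDate` key seen on this Al Jazeera page holds an ISO 8601 string. `pick_published_date` is a hypothetical helper, and other sites will likely need different keys passed in.

```python
from datetime import datetime
from typing import Mapping, Optional, Sequence


def pick_published_date(
    publish_date: Optional[datetime],
    meta_data: Mapping[str, object],
    keys: Sequence[str] = ("publishedDate",),
) -> Optional[datetime]:
    """Prefer the extracted publish_date, else try the given meta tag keys."""
    if publish_date is not None:
        return publish_date
    for key in keys:
        raw = meta_data.get(key)
        if not isinstance(raw, str) or not raw:
            continue  # key missing, empty, or not a plain string value
        try:
            # Parses values such as '2022-03-31T06:08:59'; other date formats
            # are skipped rather than guessed at.
            return datetime.fromisoformat(raw)
        except ValueError:
            continue
    return None
```

After `article.parse()`, this would be called as `pick_published_date(article.publish_date, article.meta_data)`.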
