I am using the newspaper Python library to extract some data from news stories. The problem is that I am not getting this data for some URLs. These URLs work fine; they all return 200. I am doing this for a very large dataset, but this is one of the URLs for which the date extraction did not work. The code works for some links and not others (from the same domain), so I know that the problem isn't something like my IP being blocked for too many requests. I tried it on just one URL and got the same result (no data).
import os
import sys
from newspaper import Article


def split(link):
    try:
        story = Article(link)
        story.download()
        story.parse()
        date_time = str(story.publish_date)
        split_date = date_time.split()
        date = split_date[0]
        if date != "None":
            print(date)
    except:
        print("This URL did not return a published date. Try a different URL.")
        print(link)


if __name__ == "__main__":
    link = "https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one"
    split(link)
I am getting this output:
This URL did not return a published date. Try a different URL.
https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one
Try adding some error handling to your code to catch URLs that return a 404, such as this one:
https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one
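A minimal sketch of that kind of error handling, assuming that a failed download (such as a 404) surfaces as an ArticleException when parse() is called. The function names format_publish_date and get_publish_date are made up for this illustration. The point is to report a failed download separately from a page that downloaded fine but has no detectable date, instead of hiding both behind a bare except.

```python
from datetime import datetime
from typing import Optional


def format_publish_date(publish_date: Optional[datetime]) -> Optional[str]:
    """Return the date part as 'YYYY-MM-DD', or None when no date was found."""
    if publish_date is None:
        return None
    return publish_date.strftime("%Y-%m-%d")


def get_publish_date(link: str) -> Optional[str]:
    """Download and parse one article; return its publish date or None."""
    # Imported here so format_publish_date can be used on its own.
    from newspaper import Article, ArticleException

    story = Article(link)
    try:
        story.download()
        story.parse()
    except ArticleException as err:
        # Assumption: a failed download (for example a 404) ends up here,
        # because parse() raises when download() did not succeed.
        print(f"Download failed for {link}: {err}")
        return None

    date = format_publish_date(story.publish_date)
    if date is None:
        print(f"Page downloaded, but no publish date was found: {link}")
    return date
```

Calling get_publish_date(link) with the URL from the question should then make it visible whether the download itself failed or the date was simply not detected.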
Newspaper3k has multiple ways to extract the publish dates from articles. Take a look at this document that I wrote on how to use Newspaper3k. Here is an example for this valid URL
https://www.aljazeera.com/program/featured-documentaries/2022/3/31/lords-of-water
that extracts data elements from the page's meta tags.
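A rough sketch of the meta tag approach, not the original answer's code. It assumes that Newspaper3k exposes the parsed meta tags as a nested dictionary on Article.meta_data, with a tag like article:published_time stored under meta_data["article"]["published_time"]. The tag names in CANDIDATE_PATHS are guesses for illustration, so inspect meta_data for the site you are scraping and adjust them.

```python
from typing import Any, Mapping, Optional, Tuple

# Hypothetical meta tag paths to try, in order; sites differ in what they publish.
CANDIDATE_PATHS: Tuple[Tuple[str, ...], ...] = (
    ("article", "published_time"),
    ("publishedDate",),
    ("pubdate",),
)


def lookup(meta_data: Mapping[str, Any], path: Tuple[str, ...]) -> Optional[str]:
    """Walk a nested mapping along path; return a non-empty string or None."""
    node: Any = meta_data
    for key in path:
        if not isinstance(node, Mapping) or key not in node:
            return None
        node = node[key]
    return node if isinstance(node, str) and node else None


def publish_date_from_meta(meta_data: Mapping[str, Any]) -> Optional[str]:
    """Return the first candidate meta tag value that is present, else None."""
    for path in CANDIDATE_PATHS:
        value = lookup(meta_data, path)
        if value:
            return value
    return None


def fetch_meta_data(link: str) -> Mapping[str, Any]:
    """Download and parse the page, then return its meta tags."""
    # Imported here so the helpers above can be used without Newspaper3k.
    from newspaper import Article

    story = Article(link)
    story.download()
    story.parse()
    return story.meta_data
```

Something like publish_date_from_meta(fetch_meta_data(link)) can serve as a fallback when story.publish_date is None. The value comes back as the raw string from the page, so it still needs to be parsed into a date.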