I want to scrape an article from a website with the newspaper library (newspaper3k). However, it doesn't find the published_date for the article, which is div.source-date in the website's source text, and the authors (or source rather), which is div.delfi-source-name in the website's source text. How can I scrape the date and the author/source?
Website/URL example: https://www.delfi.lt/en/politics/foreign-ministry-tsikhanouskayas-consultation-needed-for-treating-belarusians-in-lithuania.d?id=91531501
My code:
import newspaper
from newspaper import Article
from newspaper import Source
import pandas as pd
article = Article("url")
article.download()
article.parse()
article.nlp()
df = pd.DataFrame([{'Title':article.title, 'Author':article.authors, 'Text':article.text,
'published_date':article.publish_date, 'Source':article.source_url}])
df.to_excel('Delfi-1.xlsx')
Any suggestions?
The date element in your source is located in 2 locations. The one that you see, Wednesday, October 19, 2022, is located in a div tag that newspaper3k cannot parse without using BeautifulSoup. The second date is hidden in the meta tags, which newspaper3k can parse with some additional code.
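As a rough sketch of the BeautifulSoup route, something like the following could pull both values out of the page HTML. The selectors div.source-date and div.delfi-source-name come from the question. The helper name extract_date_and_source, the sample HTML, and the meta property article:published_time are assumptions for illustration; check the page's actual meta tags and adjust the property name to match.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (article.html in newspaper3k).
# Only the two div class names are taken from the question; the source name
# and the meta value are placeholders.
SAMPLE_HTML = """
<html><head>
<meta property="article:published_time" content="2022-10-19">
</head><body>
<div class="delfi-source-name">Example Source</div>
<div class="source-date">Wednesday, October 19, 2022</div>
</body></html>
"""


def extract_date_and_source(html, meta_property="article:published_time"):
    """Return the visible date, the source name, and the meta-tag date.

    Any value that is not present in the HTML is returned as None.
    """
    soup = BeautifulSoup(html, "html.parser")

    # Visible date and source name, located by the CSS classes from the question.
    date_div = soup.select_one("div.source-date")
    source_div = soup.select_one("div.delfi-source-name")

    # Date stored in a meta tag; the property name is an assumption.
    meta_tag = soup.find("meta", attrs={"property": meta_property})

    return {
        "visible_date": date_div.get_text(strip=True) if date_div else None,
        "source": source_div.get_text(strip=True) if source_div else None,
        "meta_date": meta_tag.get("content") if meta_tag else None,
    }


result = extract_date_and_source(SAMPLE_HTML)
print(result)
```

With newspaper3k you would call article.download() and article.parse() first and then pass article.html to the helper instead of SAMPLE_HTML. The parsed meta tags are also available as a dictionary in article.meta_data, which is another place to look for the hidden date.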
P.S. Newspaper3k has multiple ways to extract the publish dates from articles. Take a look at this document that I wrote on how to use Newspaper3k.