I'm completely stuck with my code right now. First of all, I am trying to retrieve all URLs from an archive page of the Volkskrant. This is the first step where I got stuck. The URL for one specific date looks as follows: http://www.volkskrant.nl/archief/detail/01012016
The last numbers reflect the date, and the pattern is the same for every page, so I came up with the idea of building strings for the day (DD), the month (MM) and the year (2016).
Next, each of these date links directs me to a page with even more links, to the articles I eventually want to get. I know how to get the URLs for one of the dates, but not for all of them. Eventually I need to scrape all the text of each page, which I can also do for a single page. Another problem is that I need to retrieve all the articles on the next page as well, but I have no clue how to implement that in the code either.
Basically, I am having major trouble iterating through links, especially when building them from strings. Hopefully someone is able to help me out with this.
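For the date-iteration part, a minimal sketch of one common approach: walk over a date range with `datetime` and format each date into the DDMMYYYY pattern the archive URL uses. The function name `archive_urls` is hypothetical, not from my code:

```python
from datetime import date, timedelta

def archive_urls(start, end):
    # Build one archive URL per day; the path ends in DDMMYYYY,
    # matching http://www.volkskrant.nl/archief/detail/01012016
    urls = []
    d = start
    while d <= end:
        urls.append('http://www.volkskrant.nl/archief/detail/' + d.strftime('%d%m%Y'))
        d += timedelta(days=1)
    return urls

urls = archive_urls(date(2016, 1, 1), date(2016, 1, 3))
```

This avoids hand-built day/month lists entirely and skips invalid dates such as 31 February automatically.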
The code looks as follows at the moment:
# Scraping the archive of the Volkskrant
from urllib import request
import re
from lxml import html

month = ['{:02d}'.format(m) for m in range(1, 13)]  # '01' .. '12'
day = ['{:02d}'.format(d) for d in range(1, 32)]    # '01' .. '31'
year = '2016'
for x in month:
    for y in day:
        # the URL pattern is DDMMYYYY
        next_date = 'http://www.volkskrant.nl/archief/detail/' + y + x + year
# getting the links of one single date
req = request.Request('http://www.volkskrant.nl/archief/detail/01012016',
                      headers={'User-Agent': 'Mozilla/5.0'})
archive = request.urlopen(req).read()
archive = archive.decode(encoding='utf-8', errors='ignore').replace('\n', ' ').replace('\t', ' ')
# one <article> block per article on the page
links = re.findall(r'<article class="article article--extended".*?</article>', archive)
urls = []
for item in links:
    urls.extend(re.findall(r'href=[\'"]?([^\'">]+)', item))
# go to the next page and retrieve all links there
nextpage = re.findall(r'<span class="pagination__item">.*?</span>', archive)
nextp = ''.join(nextpage)
# take only the first pagination link; 'next' would shadow the built-in
next_link = re.findall(r'href=[\'"]?([^\'">]+)', nextp)[:1]
# retrieving one article and scraping its content
req = request.Request('http://www.volkskrant.nl/politiek/pechtold-wil-d66-blijven-leiden~a4283833/',
                      headers={'User-Agent': 'Mozilla/5.0'})
tekst = request.urlopen(req).read()
tekst = tekst.decode(encoding='utf-8', errors='ignore').replace('\n', ' ').replace('\t', ' ')
# scraping the introduction with XPath, as the regex was not applicable
tree = html.fromstring(tekst)  # parse the page already fetched instead of fetching it twice
artikel3 = tree.xpath('//*[@itemprop="description"]/text()')
... etc
This will get you all the links, including pagination, for every day from Jan 1 to now:
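The code block for this step is missing here; as a minimal sketch of the idea, a helper can pull both the article links and the pagination links out of one archive page's HTML, reusing the same regexes as in the question (the function name `links_from_page` is hypothetical):

```python
import re

HREF_RE = re.compile(r'href=[\'"]?([^\'">]+)')

def links_from_page(page_html):
    # Collect article links and pagination links from one archive page,
    # using the same <article> and pagination patterns as in the question.
    links = []
    for block in re.findall(r'<article class="article article--extended".*?</article>',
                            page_html):
        links.extend(HREF_RE.findall(block))
    for block in re.findall(r'<span class="pagination__item">.*?</span>', page_html):
        links.extend(HREF_RE.findall(block))
    return links
```

Calling this for every date URL, and again for each pagination link it returns, covers all days and all their extra pages.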
To get the text just parse each link:
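The parsing code itself is not shown; a sketch of what it could look like, assuming lxml is installed and using the `itemprop="description"` XPath from the question (the function name `intro_from_article` is hypothetical):

```python
from lxml import html

def intro_from_article(page_html):
    # Extract the introduction text of one article page via the
    # itemprop="description" element, as in the question's XPath.
    tree = html.fromstring(page_html)
    parts = tree.xpath('//*[@itemprop="description"]/text()')
    return ' '.join(p.strip() for p in parts).strip()
```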
First few results: