I'm using scrapinghub's deltafetch feature to fetch only new pages from a website, without re-requesting URLs I have already scraped.
I've noticed that on some websites, Scrapy would still scrape pages with an already-visited URL, so I had to replace the default fingerprint-based deltafetch_key with just the URL.
This works fine with a regular scrapy Spider, since I can set the meta on the requests myself. However, when using CrawlSpider and SitemapSpider, I'm a bit stuck. For example, SitemapSpider has a _parse_sitemap method that builds the Requests, but I can't really override it.
I've tried a custom DOWNLOADER_MIDDLEWARES entry, using process_request to add request.meta['deltafetch_key'] = xxx. But somehow the deltafetch spider middleware is getting called before the custom downloader middleware.
Do you have any idea how to add meta information to the Requests of CrawlSpider and SitemapSpider?
You can override the original meta, something like this:
I got this from https://github.com/scrapy/scrapy/issues/704.