I'm using scrapinghub's deltafetch feature to fetch only new pages from a website, without re-requesting URLs I have already scraped.
I've noticed that on some websites, Scrapy would still scrape pages with an already-visited URL, so I had to replace the default fingerprint-based deltafetch_key with just the URL.
This works fine with a regular scrapy Spider, since I can set the meta on the requests myself. However, when using CrawlSpider and SitemapSpider, I'm a bit stuck. For example, SitemapSpider has a _parse_sitemap method that builds the Requests, but I can't really override it.
I've tried a custom DOWNLOADER_MIDDLEWARES entry, using process_request to add request.meta['deltafetch_key'] = xxx. But somehow the deltafetch spider middleware is getting called before the custom downloader middleware.
Do you have any idea how to add meta information to the Requests of CrawlSpider and SitemapSpider?
You can override the original meta, something like this:
I got this from https://github.com/scrapy/scrapy/issues/704.