Adding a deltafetch_key meta entry to every request in SitemapSpider and CrawlSpider


I'm using scrapinghub's deltafetch feature in order to get only new pages from a website, without re-requesting the URLs I have already scraped.

I've noticed that on some websites, scrapy would still scrape pages with an already visited URL. I had to replace the default fingerprint-based deltafetch_key with just the URL.

It works fine with a plain scrapy Spider, since I can set the meta on the requests myself. However, with CrawlSpider and SitemapSpider I'm a bit stuck. For example, SitemapSpider builds its Requests inside its _parse_sitemap method, which I can't really override.

I've tried adding a custom entry to DOWNLOADER_MIDDLEWARES, with a process_request that sets request.meta['deltafetch_key'] = xxx. But the deltafetch spider middleware is getting called before the custom downloader middleware: spider middleware processes requests as the spider yields them, while a downloader middleware only sees a request once it is scheduled for download.
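For context, the downloader-middleware attempt looked roughly like this (the class name and module path are mine). It sets the key correctly, but too late in the pipeline to influence deltafetch:

```python
class DeltafetchKeyMiddleware:
    """Downloader middleware that stamps each request with a URL-based
    deltafetch_key. Note: deltafetch is a *spider* middleware, so it
    filters requests as the spider yields them, before any downloader
    middleware runs -- which is why this approach does not work."""

    def process_request(self, request, spider):
        request.meta["deltafetch_key"] = request.url
        return None  # continue processing the request normally


# settings.py (hypothetical project path)
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.DeltafetchKeyMiddleware": 543,
}
```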

Do you have any ideas how to add meta information to the Requests generated by CrawlSpider and SitemapSpider?


1 Answer


You can carry the original meta over onto each new request, something like this:

r.meta['original_meta'] = response.meta

I got this from https://github.com/scrapy/scrapy/issues/704