Scrapy-redis slows down item pipelines


I'm only using the dupefilter and scheduler reimplemented by scrapy-redis, to support recovering from interruptions (Redis only contains two keys, dmoz:dupefilter and dmoz:requests), plus a single item pipeline that stores items in a remote MongoDB.
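
For context, the pipeline itself is a plain MongoDB writer. A minimal sketch of what dmoz.pipelines.dmozPipeline looks like (the class body, the MONGO_URI/MONGO_DATABASE setting names, the collection name and the pymongo usage below are illustrative assumptions, not the exact code):

import pymongo


class dmozPipeline(object):
    """Store scraped items in a remote MongoDB collection."""

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Connection details come from settings; the defaults here are placeholders.
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI', 'mongodb://localhost:27017'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'dmoz'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # One insert per item; the round-trip to the remote server happens here.
        self.db['items'].insert_one(dict(item))
        return item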

However, using scrapy-redis slows down item processing, while the speed of requesting and crawling is not affected. With plain Scrapy, new items accumulate at around 1000 items/min; with scrapy-redis, it drops to 0-4 items/min. The stats below were captured after I pressed CTRL-C, while the spider was shutting down gracefully and processing only the remaining items, with no other tasks competing for resources.

INFO: Crawled 228 pages (at 47 pages/min), scraped 7 items (at 1 items/min)

This is not a new problem. I found an issue on scrapy-redis's GitHub page, but the advice there doesn't help (setting SCHEDULER_IDLE_BEFORE_CLOSE to 0, which is already the default according to the source code): https://github.com/rmax/scrapy-redis/issues/43

Here are some of my settings:

LOG_LEVEL = 'DEBUG'
CONCURRENT_REQUESTS = 16
SCHEDULER_IDLE_BEFORE_CLOSE = 0
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
SCHEDULER_PERSIST = True
ITEM_PIPELINES = {
    'dmoz.pipelines.dmozPipeline': 300
}
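
The Redis connection settings are omitted above; assuming a stock local setup, they would look something like the following (the host and port are placeholders):

REDIS_URL = 'redis://127.0.0.1:6379'
# or, equivalently:
# REDIS_HOST = '127.0.0.1'
# REDIS_PORT = 6379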

Thanks for your attention!
