How to optimize Scrapyd settings for 200+ spiders


My Scrapyd instance runs 200 spiders at once every day. Yesterday the server crashed because it ran out of RAM.

I am using the default Scrapyd settings:

[scrapyd]
http_port  = 6800
debug      = off
#max_proc  = 1
eggs_dir   = /var/lib/scrapyd/eggs
dbs_dir    = /var/lib/scrapyd/dbs
items_dir  = /var/lib/scrapyd/items
logs_dir   = /var/log/scrapyd

Here is the code I use to schedule all the spiders:

import urllib
import urllib2

url = 'http://localhost:6800/schedule.json'
crawler = self.crawler_process.create_crawler()
# Schedule every spider in the project through the Scrapyd API
for s in crawler.spiders.list():
    values = {'project': 'myproject', 'spider': s}
    data = urllib.urlencode(values)
    req = urllib2.Request(url, data)
    response = urllib2.urlopen(req)

How can I optimize the Scrapyd settings to handle 200+ spiders?

Thanks

1 Answer

I'd first try running scrapy crawl with the --profile option on those spiders and examine the results to see what takes most of the memory. In general, Scrapy should just pipe and store data; it should not accumulate data in memory.
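
As a rough sketch (the spider name and stats file below are placeholders), you could profile a single run and then inspect the dump with pstats:

# Profile one run of one spider; names here are only examples:
#   scrapy crawl some_spider --profile=some_spider.cprofile
import pstats

stats = pstats.Stats('some_spider.cprofile')
# Show the 20 most expensive calls by cumulative time
stats.sort_stats('cumulative').print_stats(20)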

Otherwise: by default, Scrapyd will run 4 processes per available CPU. This can be adjusted with the following settings (a sample configuration follows their descriptions):

max_proc: The maximum number of concurrent Scrapy processes that will be started. If unset or 0, it will use the number of CPUs available in the system multiplied by the value of the max_proc_per_cpu option. Defaults to 0.

max_proc_per_cpu: The maximum number of concurrent Scrapy processes that will be started per CPU. Defaults to 4.
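
If the server runs out of RAM with 200 spiders scheduled, the usual fix is to cap how many crawls run concurrently so the rest wait in Scrapyd's pending queue. A minimal sketch, reusing the paths from the question; the cap of 8 processes is only an illustration, so tune it to how much RAM one crawl needs on your machine:

[scrapyd]
http_port        = 6800
debug            = off
eggs_dir         = /var/lib/scrapyd/eggs
dbs_dir          = /var/lib/scrapyd/dbs
items_dir        = /var/lib/scrapyd/items
logs_dir         = /var/log/scrapyd
# Hard cap on concurrent Scrapy processes; remaining jobs stay queued
max_proc         = 8
# Only used when max_proc is 0: processes per CPU instead of a fixed cap
max_proc_per_cpu = 2

With a cap like this in place, scheduling all 200 spiders at once is fine: Scrapyd keeps them pending and starts new processes as slots free up, instead of launching everything at the same time.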