My scrapyd instance handles 200 spiders at once daily. Yesterday the server crashed because RAM hit its cap.

I am using the default scrapyd settings:
[scrapyd]
http_port = 6800
debug = off
#max_proc = 1
eggs_dir = /var/lib/scrapyd/eggs
dbs_dir = /var/lib/scrapyd/dbs
items_dir = /var/lib/scrapyd/items
logs_dir = /var/log/scrapyd
Here is the code I use to schedule all the spiders:
import urllib
import urllib2

url = 'http://localhost:6800/schedule.json'
# this runs inside a Scrapy command, so self.crawler_process is available
crawler = self.crawler_process.create_crawler()
for s in crawler.spiders.list():
    values = {'project': 'myproject', 'spider': s}
    data = urllib.urlencode(values)
    req = urllib2.Request(url, data)
    response = urllib2.urlopen(req)
How can I tune the scrapyd settings to handle 200+ spiders?

Thanks
I'd first try running scrapy crawl with the --profile option on those spiders and examine the results to see what is taking up most of the memory. In general, Scrapy should just pipe and store data; it should not accumulate data in memory.
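For example, a minimal sketch (the spider name and output file here are placeholders; --profile writes Python cProfile stats, which show where time is spent and which calls dominate, rather than memory usage directly):

scrapy crawl myspider --profile=myspider.cprofile

Then inspect the dump with the standard pstats module:

import pstats

# load the cProfile dump and print the 20 most expensive
# calls by cumulative time
stats = pstats.Stats('myspider.cprofile')
stats.sort_stats('cumulative').print_stats(20)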
Otherwise: by default scrapyd will run up to 4 processes per CPU. This can be adjusted with the following settings parameters.
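A sketch of the relevant scrapyd.conf entries (max_proc and max_proc_per_cpu are the documented option names; the values shown are only illustrative and should be sized to your RAM):

[scrapyd]
# hard cap on concurrently running Scrapy processes;
# 0 (the default) means no fixed cap, derive it from max_proc_per_cpu
max_proc = 8
# when max_proc is 0, allow this many processes per CPU (default: 4)
max_proc_per_cpu = 2

With 200 spiders queued, scrapyd will not start them all at once; it keeps them in its queue and runs at most the configured number of processes in parallel, so lowering these values trades throughput for a bounded memory footprint.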