CheerioCrawler stops crawling when I set maxRequestsPerCrawl to 1.
Even when I set maxRequestsPerCrawl to 10 or 100, nothing is crawled anymore after the 10th or 100th request. How can I get past this limit?
I create a new CheerioCrawler instance for every single request; parallel requests are not necessary in my use case. However, the crawler counts requests globally, whether I use a new instance for every request or a shared one.
Once the total number of requests reaches the value of maxRequestsPerCrawl, all further requests are denied. The only workaround I have found is to shut down the entire process and start it again.
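For illustration, here is a minimal sketch of what I am doing (simplified; buildCrawler and the example URLs are placeholders, not my real code):

import { CheerioCrawler } from 'crawlee';

// Hypothetical helper that builds a fresh crawler for a single request.
const buildCrawler = () =>
    new CheerioCrawler({
        maxConcurrency: 1,
        maxRequestsPerCrawl: 1,
        async requestHandler({ request, $ }) {
            // process the page...
        },
    });

// The first run crawls one page as expected.
const first = buildCrawler();
await first.run(['https://example.com/page-1']);
await first.teardown();

// The second run uses a brand-new crawler instance, yet it is denied
// immediately because the request count from the first run still applies.
const second = buildCrawler();
await second.run(['https://example.com/page-2']);
await second.teardown();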
Log:
INFO CheerioCrawler: Starting the crawl
INFO CheerioCrawler: Crawler reached the maxRequestsPerCrawl limit of 1 requests and will shut down soon. Requests that are in progress will be allowed to finish.
INFO CheerioCrawler: Earlier, the crawler reached the maxRequestsPerCrawl limit of 1 requests and all requests that were in progress at that time have now finished. In total, the crawler processed 1 requests and will shut down.
INFO CheerioCrawler: Crawl finished. Final request statistics: {"requestsFinished":0,"requestsFailed":0,"retryHistogram":[],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":190}
My Code:
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    minConcurrency: 1,
    maxConcurrency: 1,
    // proxyConfiguration: {},
    // On error, retry each page at most once.
    maxRequestRetries: 1,
    // Timeout for the processing of each page.
    requestHandlerTimeoutSecs: 30,
    // Limit the crawl to a single request.
    maxRequestsPerCrawl: 1,
    async requestHandler({ request, $, proxyInfo }) {
        // ...
    },
});
await crawler.run([url]);
await crawler.teardown();
What am I missing? How can I run the crawler with thousands of requests in a row?