I've built a Crawlee scraper, but for some reason it invokes the same handler multiple times, creating a lot of duplicate requests and entries in my dataset. Also:
- I've already tried manually setting `uniqueKey`s for all my requests.
- I've also tried setting `maxConcurrency: 1` for the crawler.
- As you can see from the logs below, the issue is not that I'm adding the same requests multiple times; it's Crawlee itself that invokes the handlers multiple times with the same request.
Here are the relevant (simplified) files:
`main.ts`:
```ts
await Actor.init();

const crawler = new CheerioCrawler({
  requestHandler: router,
  sameDomainDelaySecs: 3,
  maxRequestRetries: 3,
  maxConcurrency: 1,
});

const originalAddRequestsFn = crawler.addRequests.bind(crawler);
crawler.addRequests = function (requests: Source[], options: CrawlerAddRequestsOptions) {
  if (requests.length > 1) {
    log.info(`INITIAL REQUESTS = ${ requests.length }`);
  } else {
    log.info(`${ requests[0].label } | ${ requests[0].uniqueKey || '-' } = ${ requests[0].url }`);
  }
  return originalAddRequestsFn(requests, options);
};

const requestsOptions: RequestOptions<ScrapperData>[] = [{
  uniqueKey: `ROUTE_A_${ dataset[0].startURL }`,
  url: dataset[0].startURL,
  label: RouterHandlerLabels.ROUTE_A,
  userData: { datasetIndex: 0 },
}, {
  uniqueKey: `ROUTE_A_${ dataset[1].startURL }`,
  url: dataset[1].startURL,
  label: RouterHandlerLabels.ROUTE_A,
  userData: { datasetIndex: 1 },
}];

try {
  await crawler.run(requestsOptions);
  await Dataset.exportToJSON(JSON_OUTPUT_FILE_KEY);
} finally {
  await Actor.exit();
}
```
`router.ts`:
```ts
export enum RouterHandlerLabels {
  ROUTE_A = 'route-a',
  ROUTE_B = 'route-b',
  ROUTE_C = 'route-c',
}

export const router = createCheerioRouter();

router.addHandler(RouterHandlerLabels.ROUTE_A, handlerA);
router.addHandler(RouterHandlerLabels.ROUTE_B, handlerB);
router.addHandler(RouterHandlerLabels.ROUTE_C, handlerC);

router.addDefaultHandler(async ({ log }) => {
  log.info('Default handler...');
});
```
`handler-a.ts`:
```ts
export async function handlerA({ request, $, pushData, log, crawler }: CheerioCrawlingContext<ScrapperData>) {
  const { datasetIndex } = request.userData;
  log.info(`A. ${ datasetIndex }: ${ request.loadedUrl || '?' }`);

  const pageHTML = $('body').html() || '';
  const nextURL = findLinkToB(pageHTML);
  if (!nextURL) return;

  log.info('A. Call addRequests(...)');
  await crawler.addRequests([{
    uniqueKey: `ROUTE_B_${ nextURL }`,
    url: nextURL,
    headers: DEFAULT_REQUEST_HEADERS,
    label: RouterHandlerLabels.ROUTE_B,
    userData: request.userData,
  }]);
}
```
`handler-b.ts`:
```ts
export async function handlerB({ request, $, pushData, log, crawler }: CheerioCrawlingContext<ScrapperData>) {
  const { datasetIndex } = request.userData;
  log.info(`B. ${ datasetIndex }: ${ request.loadedUrl || '?' }`);

  const pageHTML = $('body').html() || '';
  const nextURL = findLinkToC(pageHTML);
  if (!nextURL) return;

  log.info('B. Call addRequests(...)');
  await crawler.addRequests([{
    uniqueKey: `ROUTE_C_${ nextURL }`,
    url: nextURL,
    headers: DEFAULT_REQUEST_HEADERS,
    label: RouterHandlerLabels.ROUTE_C,
    userData: request.userData,
  }]);
}
```
`handler-c.ts`:
```ts
export async function handlerC({ request, $, pushData, log, crawler }: CheerioCrawlingContext<ScrapperData>) {
  const { datasetIndex } = request.userData;
  log.info(`C. ${ datasetIndex }: ${ request.loadedUrl || '?' }`);

  const pageHTML = $('body').html() || '';
  const extractedData = findDataInPageC(pageHTML);
  if (!extractedData) return;

  log.info(`C. Saving data for ${ datasetIndex }`);
  await pushData({ ...extractedData, datasetIndex });
}
```
These are the logs I get:
```
INFO System info {"apifyVersion":"3.1.12","apifyClientVersion":"2.8.1","crawleeVersion":"3.5.8","osType":"Linux","nodeVersion":"v20.8.1"}
INFO INITIAL REQUESTS = 2
INFO CheerioCrawler: Starting the crawler.
INFO CheerioCrawler: A. 0: https://example.com/page-a/user-0
INFO CheerioCrawler: A. Call addRequests(...)
INFO ROUTE_B | ROUTE_B_https://example.com/page-b/user-0 = https://example.com/page-b/user-0
INFO CheerioCrawler: A. 1: https://example.com/page-a/user-1
INFO CheerioCrawler: A. Call addRequests(...)
INFO ROUTE_B | ROUTE_B_https://example.com/page-b/user-1 = https://example.com/page-b/user-1
INFO CheerioCrawler: B. 0: https://example.com/page-b/user-0
INFO CheerioCrawler: B. Call addRequests(...)
INFO ROUTE_C | ROUTE_C_https://example.com/page-c/user-0 = https://example.com/page-c/user-0
INFO CheerioCrawler: A. 1: https://example.com/page-a/user-1
INFO CheerioCrawler: A. Call addRequests(...)
INFO ROUTE_B | ROUTE_B_https://example.com/page-b/user-1 = https://example.com/page-b/user-1
INFO CheerioCrawler: B. 0: https://example.com/page-b/user-0
INFO CheerioCrawler: B. Call addRequests(...)
INFO ROUTE_C | ROUTE_C_https://example.com/page-c/user-0 = https://example.com/page-c/user-0
INFO CheerioCrawler: A. 1: https://example.com/page-a/user-1
INFO CheerioCrawler: A. Call addRequests(...)
INFO ROUTE_B | ROUTE_B_https://example.com/page-b/user-1 = https://example.com/page-b/user-1
INFO CheerioCrawler: B. 0: https://example.com/page-b/user-0
INFO CheerioCrawler: B. Call addRequests(...)
INFO ROUTE_C | ROUTE_C_https://example.com/page-c/user-0 = https://example.com/page-c/user-0
INFO CheerioCrawler: A. 1: https://example.com/page-a/user-1
INFO CheerioCrawler: A. Call addRequests(...)
INFO ROUTE_B | ROUTE_B_https://example.com/page-b/user-1 = https://example.com/page-b/user-1
INFO CheerioCrawler: A. 1: https://example.com/page-a/user-1
INFO CheerioCrawler: A. Call addRequests(...)
INFO ROUTE_B | ROUTE_B_https://example.com/page-b/user-1 = https://example.com/page-b/user-1
INFO Statistics: CheerioCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":5599,"requestsFinishedPerMinute":9,"requestsFailedPerMinute":0,"requestTotalDurationMillis":50388,"requestsTotal":9,"crawlerRuntimeMillis":61279,"retryHistogram":[9]}
INFO CheerioCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":1,"systemStatus":{"isSystemIdle":false,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":true,"limitRatio":0.7,"actualRatio":0.858},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
INFO CheerioCrawler: C. 0: https://example.com/page-c/user-0
INFO CheerioCrawler: C. Saving data for 0
INFO CheerioCrawler: C. 0: https://example.com/page-c/user-0
INFO CheerioCrawler: C. Saving data for 0
INFO CheerioCrawler: C. 0: https://example.com/page-c/user-0
INFO CheerioCrawler: C. Saving data for 0
INFO CheerioCrawler: C. 0: https://example.com/page-c/user-0
INFO CheerioCrawler: C. Saving data for 0
INFO CheerioCrawler: B. 1: https://example.com/page-b/user-1
INFO CheerioCrawler: B. Call addRequests(...)
INFO ROUTE_C | ROUTE_C_https://example.com/page-c/user-1 = https://example.com/page-c/user-1
INFO CheerioCrawler: B. 1: https://example.com/page-b/user-1
INFO CheerioCrawler: B. Call addRequests(...)
INFO ROUTE_C | ROUTE_C_https://example.com/page-c/user-1 = https://example.com/page-c/user-1
INFO CheerioCrawler: C. 1: https://example.com/page-c/user-1
INFO CheerioCrawler: C. Saving data for 1
INFO CheerioCrawler: B. 1: https://example.com/page-b/user-1
INFO CheerioCrawler: B. Call addRequests(...)
INFO ROUTE_C | ROUTE_C_https://example.com/page-c/user-1 = https://example.com/page-c/user-1
INFO CheerioCrawler: C. 1: https://example.com/page-c/user-1
INFO CheerioCrawler: C. Saving data for 1
INFO CheerioCrawler: C. 1: https://example.com/page-c/user-1
INFO CheerioCrawler: C. Saving data for 1
INFO CheerioCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO CheerioCrawler: Final request statistics: {"requestsFinished":19,"requestsFailed":0,"retryHistogram":[19],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":5150,"requestsFinishedPerMinute":10,"requestsFailedPerMinute":0,"requestTotalDurationMillis":97844,"requestsTotal":19,"crawlerRuntimeMillis":115660}
INFO CheerioCrawler: Finished! Total 19 requests: 19 succeeded, 0 failed. {"terminal":true}
```
In this case, it produced a total of 7 results: 4 for the first dataset entry and 3 for the second one (it should actually be only one for each, so 2 results in total).
Line 13 of the logs is the first one that doesn't make sense:

```
INFO CheerioCrawler: A. 1: https://example.com/page-a/user-1
```

At that point, both requests to `page-a`, one for `user-0` and one for `user-1`, have already been handled (lines 4 and 7, respectively).
I've tried adding only 1 initial request (when calling `crawler.run(...)`), but some handlers are still getting invoked more than once for the same request.

I'm using `crawlee` 3.5.8.
Ok, so I got some help from Apify on their Discord, and it turns out this is a known bug.
I've tried versions 3.5.2 and 3.5.0 and I still have the same issue, so I ended up removing `sameDomainDelaySecs` and adding an `await sleep(delayInMs)` before adding new requests. You can do that manually before calling `crawler.addRequests`, or you can overwrite `crawler.addRequests` so that it always waits a few seconds before adding new ones:
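Something along these lines worked for me (a minimal sketch based on the `main.ts` above; `delayInMs` is a placeholder constant you'd tune yourself, the import path for `router` is assumed, and `sleep` is the helper exported by `crawlee`):

```ts
import { CheerioCrawler, sleep } from 'crawlee';
import { router } from './router'; // assumed path to the router shown above

// Placeholder: how long to wait before enqueuing follow-up requests.
const delayInMs = 3_000;

const crawler = new CheerioCrawler({
  requestHandler: router,
  maxRequestRetries: 3,
  maxConcurrency: 1,
  // Note: no `sameDomainDelaySecs` here; the delay is applied manually below.
});

// Wrap the original addRequests so every call pauses first,
// roughly emulating `sameDomainDelaySecs` without relying on it.
const originalAddRequestsFn = crawler.addRequests.bind(crawler);
crawler.addRequests = async function (requests, options) {
  await sleep(delayInMs);
  return originalAddRequestsFn(requests, options);
};
```

Keep in mind this also delays the initial batch passed to `crawler.run(...)`, since that goes through `addRequests` as well (that's how the `INITIAL REQUESTS = 2` line was logged above), but with `maxConcurrency: 1` a few extra seconds at startup is harmless.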