Extending Crawlee scraper requestHandler

I'm using crawlee, following the quick tutorial to spin up a scraper.

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, enqueueLinks }) => {
        console.log(`Processing: ${request.url}`)
        await page.waitForSelector('.ActorStorePagination-pages a');

        await enqueueLinks({
            selector: '.ActorStorePagination-pages > a',
            label: 'LIST',
        })
    },
});

I now need to extend the enqueueLinks function that is passed to the requestHandler. The goal is to run some custom logic whenever I add new URLs to the queue. An example use case is keeping track of how many links of a certain type I have found, so that I can do additional logging or publish messages to other services. Is there any way to do that?

I have tried extending the PlaywrightCrawler class instead. My problem with that approach is that the requestHandler is defined inside the options object passed to super(), so I cannot access the subclass's properties from it.

class CustomCrawler extends PlaywrightCrawler {
    categoryPagesQueued: string[];

    constructor() {
        super({
            requestHandler: async ({ page, request, enqueueLinks }) => {
                console.log(`Processing: ${request.url}`)
                // Wait for the actor cards to render,
                // otherwise enqueueLinks wouldn't enqueue anything.
                await page.waitForSelector('.ActorStorePagination-pages a');

                // Error: "this" here does not give access to CustomCrawler.categoryPagesQueued
                this.categoryPagesQueued.push("foo");
                customLogic(this.categoryPagesQueued);

                await enqueueLinks({
                    selector: '.ActorStorePagination-pages > a',
                    label: 'LIST',
                })
            },
        })
    }
}

There is 1 best solution below

I achieved this, though not in the purest object-oriented, method-overriding way:

There is a good article that shows how to use an arrow function for the requestHandler. An arrow function takes "this" from the scope in which it is defined, so if the handler is defined inside a class, it can reach that class's properties and methods, and you can incorporate your custom processing that way.

It's the only way I've found so far to execute my custom processing.

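import { PlaywrightCrawler } from 'crawlee';

// Scraper is a placeholder name for whatever class holds the custom logic.
class Scraper {
    // Created as a class field, so the arrow function below
    // captures this Scraper instance as "this".
    crawler = new PlaywrightCrawler({
        requestHandler: async ({ page, enqueueLinks }) => {
            const title = await page.title();
            console.log(title);

            // Custom processing, defined as a method on the class.
            this.doStuff();

            // Extract links from the current page
            // and add them to the crawling queue.
            await enqueueLinks();
        },
    });

    doStuff() {
        // Placeholder for the custom logic.
    }
}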
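
To get closer to what the question asks for (custom logic every time links are enqueued), the same lexical "this" idea can be used to wrap enqueueLinks itself. What follows is only a rough sketch: LinkTracker, wrap, record, and fakeEnqueueLinks are hypothetical names, and the small types at the top are simplified stand-ins I'm assuming for the shapes Crawlee uses (an options object with a label, and a result with a processedRequests array), not its real type definitions. It runs without Crawlee installed.

```typescript
// Simplified stand-ins for the parts of Crawlee's API this sketch touches.
// These are assumptions for illustration, not Crawlee's real type definitions.
type EnqueueOptions = { selector?: string; label?: string };
type EnqueueResult = { processedRequests: { uniqueKey: string }[] };
type EnqueueLinks = (options?: EnqueueOptions) => Promise<EnqueueResult>;

class LinkTracker {
    // Number of links queued so far, keyed by label.
    readonly counts = new Map<string, number>();

    // Arrow-function property: "this" is bound to the instance, so the
    // returned wrapper keeps working when passed around as a bare function.
    wrap = (enqueueLinks: EnqueueLinks): EnqueueLinks => async (options) => {
        const result = await enqueueLinks(options);
        const label = options?.label ?? 'UNLABELED';
        this.record(label, result.processedRequests.length);
        return result;
    };

    // Hypothetical hook: swap in logging or publishing to another service.
    private record(label: string, added: number): void {
        this.counts.set(label, (this.counts.get(label) ?? 0) + added);
    }
}

// Stub standing in for the enqueueLinks function handed to requestHandler.
const fakeEnqueueLinks: EnqueueLinks = async () => ({
    processedRequests: [
        { uniqueKey: 'https://example.com/a' },
        { uniqueKey: 'https://example.com/b' },
    ],
});

const tracker = new LinkTracker();
const trackedEnqueueLinks = tracker.wrap(fakeEnqueueLinks);
```

Inside a real requestHandler the idea would be to call tracker.wrap(enqueueLinks) and then use the wrapped function in place of the original. As far as I know, the result Crawlee returns can also include requests that were already in the queue, so treat the count as an upper bound on newly found links unless you filter those out.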