page.on('response') is not accessible in handlePageFunction // PuppeteerCrawler (Apify SDK)


I am trying to collect some data from the page.on('response') event. This data should then be pushed into the dataset with pushData.

It seems that these event handlers:

await page
    .on('response', response => {
        if (response.status() === 404) {
            responseErrors.push(new Object({
                status: response.status(),
                url: response.url()
            }))
        }
    })
    .on('pageerror', err => {
        if (err.message) {
            pageErrors.push(JSON.stringify(err.message))
        }
    })
    .on('console', message => {
        consoleErrors.push(new Object({
            type: message.type(),
            url: message.text()
        }))
    });

never fire when they are registered in handlePageFunction.

If I add them to the gotoFunction of PuppeteerCrawler, I get results. The problem is that I can't push them into the same dataset from there.

So what would be the right way to access this data?


There are 2 best solutions below

Best answer:

Yes, it doesn't work in handlePageFunction because by then the page has already been opened and its responses have already been processed, so the listeners are attached too late. You have two options:

  1. Use the response parameter of handlePageFunction, which gives you the response of the main page request: https://sdk.apify.com/docs/typedefs/puppeteer-handle-page-inputs

  2. Keep the listeners in the gotoFunction as you did, but instead of pushing to the dataset there, store the results in request.userData. Then read them in handlePageFunction, merge them with your other data and push everything to the dataset.
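
The timing issue and the userData approach can be illustrated without a browser. The following is only a minimal sketch: `FakePage` is a hypothetical stand-in built on Node's `EventEmitter`, not part of Puppeteer or the Apify SDK, and the two functions merely mimic the roles of gotoFunction and handlePageFunction. It shows that a listener attached after navigation sees nothing, while a listener attached before navigation fills an array that stays reachable through request.userData.

```javascript
const { EventEmitter } = require('events');

// Hypothetical stand-in for a Puppeteer page: it emits 'response' events
// while goto() runs, as a real page does during navigation.
class FakePage extends EventEmitter {
    async goto(url) {
        this.emit('response', { status: () => 200, url: () => url });
        this.emit('response', { status: () => 404, url: () => `${url}/missing.css` });
    }
}

// Plays the role of gotoFunction: attach listeners first, then navigate.
async function gotoFunction({ page, request }) {
    const responseErrors = [];
    page.on('response', (response) => {
        if (response.status() === 404) {
            responseErrors.push({ status: response.status(), url: response.url() });
        }
    });
    // The array is stored by reference, so entries pushed during
    // navigation are visible through request.userData afterwards.
    request.userData.responseErrors = responseErrors;
    return page.goto(request.url);
}

// Plays the role of handlePageFunction: navigation has already finished.
async function handlePageFunction({ page, request }) {
    const lateErrors = [];
    // Attached too late: the 'response' events were emitted during goto().
    page.on('response', (response) => lateErrors.push(response.url()));
    return {
        url: request.url,
        lateErrors,
        responseErrors: request.userData.responseErrors,
    };
}

async function run() {
    const page = new FakePage();
    const request = { url: 'https://example.com', userData: {} };
    await gotoFunction({ page, request });
    return handlePageFunction({ page, request });
}

run().then((record) => console.log(JSON.stringify(record)));
```

In a real crawler, the object returned by the simulated handlePageFunction is roughly what you would merge with your scraped data before calling pushData.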

Second answer:

With the help of @Lukáš Křivka I found a solution that works for me. Here is a code example:

userData Example:

const crawler = new Apify.PuppeteerCrawler({
    requestQueue,
    launchPuppeteerOptions: {
        headless: true,
        ignoreHTTPSErrors: true,
        // slowMo: 500,
    },
    maxRequestsPerCrawl: settings.maxurls,
    maxConcurrency: settings.maxcrawlers,
    gotoFunction: async ({ page, request }) => {
        // Collected while the page loads; read later in handlePageFunction.
        const responseErrors = [];
        const consoleErrors = [];
        const pageErrors = [];
        await page.authenticate({
            username: settings.authenticate.username,
            password: settings.authenticate.password,
        });
        // Register the listeners before navigation so no event is missed.
        page
            .on('response', (response) => {
                if (response.status() === 404) {
                    responseErrors.push({
                        status: response.status(),
                        url: response.url(),
                    });
                }
            })
            .on('pageerror', (err) => {
                console.log(err);
                if (err.message) {
                    pageErrors.push(JSON.stringify(err.message));
                }
            })
            .on('console', (message) => {
                consoleErrors.push({
                    type: message.type(),
                    url: message.text(),
                });
            });
        // The arrays are shared by reference, so they fill up during page.goto().
        request.userData.responseErrors = responseErrors;
        request.userData.pageErrors = pageErrors;
        request.userData.consoleErrors = consoleErrors;

        return page.goto(request.url, {
            timeout: 120000,
        });
    },

Access:

    handlePageFunction: async ({ request, response, page }) => {
        await page.waitFor(settings.waitForPageload);

        console.log(request.userData);