Blocking specific resources (CSS, images, videos, etc.) using Crawlee and Playwright

I'm using an unreleased build of crawlee (installed from GitHub), and I'm trying to block specific resources from loading with playwrightUtils.blockRequests (which isn't available in earlier versions). When I try the code suggested in the official repo, it works as expected:

import { launchPlaywright, playwrightUtils } from 'crawlee';

const browser = await launchPlaywright();
const page = await browser.newPage();
await playwrightUtils.blockRequests(page, {
    // extraUrlPatterns: ['adsbygoogle.js'],
});
await page.goto('https://cnn.com');
await page.screenshot({ path: 'cnn_no_images.png' });
await browser.close();

I can see from the screenshot that the images aren't loaded. My problem appears when I use PlaywrightCrawler instead:

const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 3,
    async requestHandler({ page, request }) {
        console.log(`Processing: ${request.url}`);
        await playwrightUtils.blockRequests(page);
        await page.screenshot({ path: 'cnn_no_images2.png' });
    },
});

With this approach the resources aren't blocked. My guess is that blockRequests needs launchPlaywright to work, and I don't see a way to pass that to PlaywrightCrawler. blockRequests has already been available for Puppeteer, so maybe someone has tried this before.

I've also tried route interception, but again I couldn't make it work with PlaywrightCrawler.

Best answer

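requestHandler runs only after the crawler has already navigated to the page, so calling blockRequests there is too late: the resources have already been requested. You can register listeners or run any other code before navigation by using preNavigationHooks, like this: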


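import { PlaywrightCrawler, playwrightUtils } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 3,
    // Hooks run before page.goto(), so the blocking is in place before any request is sent.
    preNavigationHooks: [async ({ page }) => {
        await playwrightUtils.blockRequests(page);
    }],
    // By the time this runs, the page has already been loaded with the blocking applied.
    async requestHandler({ page, request }) {
        console.log(`Processing: ${request.url}`);
        await page.screenshot({ path: 'cnn_no_images2.png' });
    },
});

// Start the crawl; the URL is the one used in the question.
await crawler.run(['https://cnn.com']);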
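
The question also mentions route interception. The same idea should apply there: the route has to be registered before navigation, so it belongs in a pre-navigation hook as well. Below is a minimal sketch using Playwright's page.route. The helper names shouldBlock and blockResources, the BLOCKED_TYPES set and the 'adsbygoogle.js' pattern are assumptions made up for illustration; they are not part of crawlee or Playwright.

```javascript
// Resource types to drop. The values are names returned by Playwright's
// request.resourceType(); this particular selection is an assumption.
const BLOCKED_TYPES = new Set(['image', 'media', 'font', 'stylesheet']);

// Hypothetical helper: decides whether a request should be aborted,
// either by resource type or by a substring match on the URL.
function shouldBlock(resourceType, url, extraUrlPatterns = []) {
    if (BLOCKED_TYPES.has(resourceType)) return true;
    return extraUrlPatterns.some((pattern) => url.includes(pattern));
}

// Hypothetical helper: registers a catch-all route on a Playwright page
// and aborts every request that shouldBlock() flags.
async function blockResources(page, extraUrlPatterns = []) {
    await page.route('**/*', (route) => {
        const request = route.request();
        return shouldBlock(request.resourceType(), request.url(), extraUrlPatterns)
            ? route.abort()
            : route.continue();
    });
}

// Intended wiring (not executed here), mirroring the hook-based approach:
// const crawler = new PlaywrightCrawler({
//     preNavigationHooks: [async ({ page }) => blockResources(page, ['adsbygoogle.js'])],
//     async requestHandler({ page }) {
//         await page.screenshot({ path: 'cnn_no_images2.png' });
//     },
// });
```

Filtering on the resource type rather than on file extensions also catches images or stylesheets served from URLs without a recognizable suffix, at the cost of routing every request through the handler.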