How to scrape a webpage with infinite scroll using Crawlee/Apify?

I am trying to scrape some data from Twitch. The problem is that the site uses infinite scroll, and I am only able to get data from the first page.

I have tried scrolling with the built-in infiniteScroll utility, but it scrolls after going to the result page, not on the main page. This is how I have implemented it:

import {
  Dataset,
  createPlaywrightRouter,
  enqueueLinks,
  playwrightUtils,
} from "crawlee";

export const router = createPlaywrightRouter();

router.addDefaultHandler(
  async ({ log, page, request, infiniteScroll }) => {
    log.debug(`Processing: ${request.url}`);
    await page.waitForSelector('[data-a-target="preview-card-image-link"]');
    await page.click("body");

    await infiniteScroll();

    enqueueLinks({
        selector: ".ScTransformWrapper-sc-1wvuch4-1 a",
        label: "detail",
      });
  }
);

router.addHandler("detail", async ({ request, page, log }) => {
  log.debug(`Extracting data: ${request.url}`);

  await page.waitForSelector('[id="live-channel-about-panel"]');
  const instagram = await page
    .locator('a[role="link"][href*="instagram"]')
    .getAttribute("href");
  const twitter = await page
    .locator('a[role="link"][href*="twitter"]')
    .getAttribute("href");
  const discord = await page
    .locator('a[role="link"][href*="discord"]')
    .getAttribute("href");

  const results = { instagram, twitter, discord };

  log.debug(results);
});

Link I am trying to scrape: text

There is 1 best solution below

I recommend saving some snapshots so you can see what is happening. The cookie modal could be blocking the scroll, so try dismissing it first. See this article for recommendations: https://docs.apify.com/academy/node-js/analyzing-pages-and-fixing-errors
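
A minimal sketch of that diagnostic order, assuming the handler context exposes saveSnapshot and closeCookieModals next to infiniteScroll, as recent Crawlee 3.x versions do. The SnapshotContext interface, the scrollWithDiagnostics name, and the snapshot keys are placeholders of mine, not part of the original code; the interface only stands in for the real context so the sketch type-checks on its own.

```typescript
// Hypothetical stand-in for the subset of Crawlee's Playwright handler context
// used below. With Crawlee installed, the real context should supply these.
interface SnapshotContext {
  saveSnapshot(options: { key: string }): Promise<void>;
  closeCookieModals(): Promise<void>;
  infiniteScroll(): Promise<void>;
}

// Take a snapshot around each step so you can compare what the page looked like
// before the cookie modal was handled, before scrolling, and after scrolling.
async function scrollWithDiagnostics({
  saveSnapshot,
  closeCookieModals,
  infiniteScroll,
}: SnapshotContext): Promise<void> {
  await saveSnapshot({ key: "before-cookie-dismiss" });

  // Try to dismiss a consent dialog that might be intercepting the scroll.
  await closeCookieModals();
  await saveSnapshot({ key: "before-scroll" });

  await infiniteScroll();
  await saveSnapshot({ key: "after-scroll" });
}

// Assumed wiring inside the existing router:
// router.addDefaultHandler(async (ctx) => { await scrollWithDiagnostics(ctx); });
```

If the "before-scroll" snapshot still shows a modal, that is a strong hint the scroll never reaches the listing itself.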

You can also try some of the parameters of infiniteScroll, for example taking a snapshot inside stopScrollCallback (https://crawlee.dev/api/playwright-crawler/namespace/playwrightUtils#stopScrollCallback). Run it locally with headless: false so you can watch it in real time.
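
Here is a rough sketch of a stopScrollCallback that saves a snapshot after every scroll and stops once enough cards have loaded. The option names timeoutSecs, waitForSecs and stopScrollCallback are infiniteScroll options in Crawlee; the ScrollDebugContext interface, the scrollWithSnapshots name, the maxCards parameter and the concrete numbers are assumptions made for illustration.

```typescript
// Hypothetical stand-in for the subset of Crawlee's Playwright handler context
// used below, so the sketch type-checks without the library.
interface ScrollDebugContext {
  page: { locator(selector: string): { count(): Promise<number> } };
  saveSnapshot(options: { key: string }): Promise<void>;
  infiniteScroll(options: {
    timeoutSecs?: number;
    waitForSecs?: number;
    stopScrollCallback?: () => unknown | Promise<unknown>;
  }): Promise<void>;
}

async function scrollWithSnapshots(
  ctx: ScrollDebugContext,
  cardSelector: string,
  maxCards: number, // hypothetical cut-off, pick whatever suits your run
): Promise<void> {
  const { page, saveSnapshot, infiniteScroll } = ctx;
  let step = 0;

  await infiniteScroll({
    timeoutSecs: 120, // give up after two minutes (illustrative value)
    waitForSecs: 5, // how long to wait for new content after a scroll (illustrative value)
    // Runs after each scroll; returning true stops the scrolling.
    stopScrollCallback: async () => {
      step += 1;
      await saveSnapshot({ key: `scroll-step-${step}` });
      const loaded = await page.locator(cardSelector).count();
      return loaded >= maxCards;
    },
  });
}

// Assumed wiring, using the card selector from the question:
// router.addDefaultHandler(async (ctx) => {
//   await scrollWithSnapshots(ctx, '[data-a-target="preview-card-image-link"]', 200);
// });
// To watch the browser locally: new PlaywrightCrawler({ headless: false, requestHandler: router })
```

If the card count in successive snapshots never grows, the scroll is probably happening on the wrong element or being blocked.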

Unrelated: you are missing await before enqueueLinks. Without it, the handler can return before the links have been added to the queue.
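
For reference, a sketch of the default handler with that await added. It also takes enqueueLinks from the handler context instead of the top-level import, which I believe is what route handlers normally use, since the context version is already bound to the current page and request queue. The ListingContext interface and the listingHandler name are placeholders of mine; the selectors are the ones from the question.

```typescript
// Hypothetical stand-in for the subset of Crawlee's Playwright handler context
// used below. With Crawlee installed, the real context should supply these.
interface ListingContext {
  page: { waitForSelector(selector: string): Promise<unknown> };
  infiniteScroll(): Promise<void>;
  enqueueLinks(options: { selector: string; label: string }): Promise<unknown>;
}

async function listingHandler({
  page,
  infiniteScroll,
  enqueueLinks,
}: ListingContext): Promise<void> {
  // Wait until the first cards are rendered before scrolling.
  await page.waitForSelector('[data-a-target="preview-card-image-link"]');

  // Scroll until no more content loads (or a timeout/stop callback ends it).
  await infiniteScroll();

  // Awaited, so the handler does not finish before the links are queued.
  await enqueueLinks({
    selector: ".ScTransformWrapper-sc-1wvuch4-1 a",
    label: "detail",
  });
}

// Assumed wiring: router.addDefaultHandler(listingHandler);
```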