I am trying to scrape some data from Twitch. The problem is that the site uses infinite scroll, and I am only able to get data from the first page.
I have tried to scroll using the built-in infiniteScroll utility, but it scrolls after navigating to the result page, not on the main page. This is how I have implemented it:
import {
  Dataset,
  createPlaywrightRouter,
  enqueueLinks,
  playwrightUtils,
} from "crawlee";

export const router = createPlaywrightRouter();

router.addDefaultHandler(
  async ({ log, page, request, infiniteScroll }) => {
    log.debug(`Processing: ${request.url}`);
    await page.waitForSelector('[data-a-target="preview-card-image-link"]');
    await page.click("body");
    await infiniteScroll();
    enqueueLinks({
      selector: ".ScTransformWrapper-sc-1wvuch4-1 a",
      label: "detail",
    });
  }
);

router.addHandler("detail", async ({ request, page, log }) => {
  log.debug(`Extracting data: ${request.url}`);
  await page.waitForSelector('[id="live-channel-about-panel"]');
  const instagram = await page
    .locator('a[role="link"][href*="instagram"]')
    .getAttribute("href");
  const twitter = await page
    .locator('a[role="link"][href*="twitter"]')
    .getAttribute("href");
  const discord = await page
    .locator('a[role="link"][href*="discord"]')
    .getAttribute("href");
  const results = { instagram, twitter, discord };
  log.debug(results);
});
Link I am trying to scrape: text
I recommend adding some snapshots so you can see what is happening. It could be the cookie modal blocking the scroll, so try dismissing it with a click first. Check this article for recommendations: https://docs.apify.com/academy/node-js/analyzing-pages-and-fixing-errors
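If the cookie modal is the culprit, one way to click it away before scrolling is sketched below. This is only an illustration: `dismissIfPresent`, `PageLike`, and `ClickableLocator` are hypothetical names, the two interfaces just describe the handful of Playwright locator methods the helper relies on, and the selector you pass in has to come from inspecting the page yourself.

```typescript
// Minimal structural types covering only the Playwright methods used below,
// so the helper can be exercised without launching a browser.
interface ClickableLocator {
  first(): ClickableLocator;
  waitFor(options: { state: "visible"; timeout: number }): Promise<void>;
  click(): Promise<void>;
}

interface PageLike {
  locator(selector: string): ClickableLocator;
}

// Hypothetical helper: click the element matching `selector` if it becomes
// visible within `timeoutMs`. Returns true if something was clicked.
async function dismissIfPresent(
  page: PageLike,
  selector: string,
  timeoutMs = 5000,
): Promise<boolean> {
  const button = page.locator(selector).first();
  try {
    // waitFor rejects when the element does not show up in time.
    await button.waitFor({ state: "visible", timeout: timeoutMs });
  } catch {
    return false; // no modal appeared, nothing to dismiss
  }
  await button.click();
  return true;
}
```

Inside the default handler this could be called as `await dismissIfPresent(page, "button.cookie-accept")` before `infiniteScroll()`, where `button.cookie-accept` is a placeholder and not Twitch's real selector.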
You can try some of the parameters of infiniteScroll, like taking a snapshot inside stopScrollCallback (https://crawlee.dev/api/playwright-crawler/namespace/playwrightUtils#stopScrollCallback). Also run it locally with headless: false so you can see what happens in real time.
Unrelated: you are missing await before enqueueLinks.
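Putting the suggestions above together, the list-page handler might look roughly like the sketch below. It is a sketch under assumptions, not a tested fix: `handleListPage`, `ListPageContext`, and `maxCards` are hypothetical names, the interfaces only mirror the context helpers used here (`infiniteScroll`, `enqueueLinks`, `saveSnapshot`), the two selectors are copied from the question, and the 60 second timeout and snapshot key are arbitrary choices.

```typescript
// Structural stand-ins for the parts of the Crawlee Playwright context used here.
interface ScrollOptions {
  timeoutSecs?: number;
  waitForSecs?: number;
  stopScrollCallback?: () => unknown | Promise<unknown>;
}

interface ListPageContext {
  page: { locator(selector: string): { count(): Promise<number> } };
  infiniteScroll(options?: ScrollOptions): Promise<void>;
  enqueueLinks(options: { selector: string; label: string }): Promise<unknown>;
  saveSnapshot(options: { key: string }): Promise<void>;
}

// Selectors taken from the question; they may change whenever the site is redeployed.
const CARD_SELECTOR = '[data-a-target="preview-card-image-link"]';
const LINK_SELECTOR = ".ScTransformWrapper-sc-1wvuch4-1 a";

// Hypothetical handler: scroll until `maxCards` cards are loaded, then enqueue.
async function handleListPage(ctx: ListPageContext, maxCards = 200): Promise<void> {
  const { page, infiniteScroll, enqueueLinks, saveSnapshot } = ctx;
  const cards = page.locator(CARD_SELECTOR);

  await infiniteScroll({
    timeoutSecs: 60, // assumed upper bound on total scrolling time
    waitForSecs: 4, // give up if no new content loads for this long
    stopScrollCallback: async () => {
      // Overwrite one snapshot per check to see what the page looks like mid-scroll.
      await saveSnapshot({ key: "scroll-progress" });
      // A truthy return value stops the scrolling.
      return (await cards.count()) >= maxCards;
    },
  });

  // Use the context's enqueueLinks and await it, so links are queued
  // before the handler returns.
  await enqueueLinks({ selector: LINK_SELECTOR, label: "detail" });
}
```

It could then be registered with something like `router.addDefaultHandler((ctx) => handleListPage(ctx))`. Note that the handler takes `enqueueLinks` from the handler context instead of importing it from "crawlee".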