Trouble with Puppeteer Cluster and Excel4node: Not all data is being written to Excel


I'm encountering an issue while trying to scrape data using Puppeteer Cluster and write it to an Excel file using Excel4node.

Here's a summary of the script's functionality:

I'm using Puppeteer Cluster to scrape data from multiple URLs concurrently. For each URL, I scrape various pieces of information from the webpage. I'm using a write queue mechanism to write the scraped data to an Excel file using Excel4node.

The problem is that only some of the data ends up in the Excel file: it contains fewer rows than the number of URLs processed. When I set maxConcurrency to 1, everything works fine.

Minimal reproducible example:

const { Cluster } = require('puppeteer-cluster');
var xl = require('excel4node');

var wb = new xl.Workbook();
var ws = wb.addWorksheet('Sheet 1');

(async () => {

    let row_id = 1;

    const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_PAGE,
        maxConcurrency: 4,
        puppeteerOptions: {
            headless: false
        }
    });

    async function writeDataToExcel(row_id, text) {
        ws.cell(row_id, 1).string(text);
    }

    cluster.task(async ({ page, data: url}) => {
        row_id += 1
        await page.goto(url);
        text = await page.waitForSelector('#mw-content-text > div.mw-content-ltr.mw-parser-output > p:nth-child(13)');
        text = await text.evaluate(el => el.textContent);
        await writeDataToExcel(row_id, text);
    });

    cluster.queue('https://en.wikipedia.org/wiki/JavaScript');
    cluster.queue('https://en.wikipedia.org/wiki/JavaScript');
    cluster.queue('https://en.wikipedia.org/wiki/JavaScript');
    cluster.queue('https://en.wikipedia.org/wiki/JavaScript');
    cluster.queue('https://en.wikipedia.org/wiki/JavaScript');
    cluster.queue('https://en.wikipedia.org/wiki/JavaScript');
    cluster.queue('https://en.wikipedia.org/wiki/JavaScript');
    cluster.queue('https://en.wikipedia.org/wiki/JavaScript');
    cluster.queue('https://en.wikipedia.org/wiki/JavaScript');
    cluster.queue('https://en.wikipedia.org/wiki/JavaScript');
    cluster.queue('https://en.wikipedia.org/wiki/JavaScript');

    await cluster.idle();
    await cluster.close();

    wb.write('Excel.xlsx');
})();
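One possible explanation, sketched here without Puppeteer or Excel4node so it runs on its own: row_id is shared by all tasks and incremented before the first await, so by the time a task writes, other tasks may have incremented it again and several tasks end up writing to the same row. In the sketch below, fakeScrape, runTasks and captureRowLocally are hypothetical names, the Map stands in for the worksheet, and Promise.all stands in for the cluster's concurrency. It is an illustration of the suspected race under those assumptions, not a confirmed diagnosis.

```javascript
// Stand-in for page.goto / waitForSelector: any await lets other tasks run.
const fakeScrape = (url) =>
    new Promise((resolve) => setTimeout(() => resolve(`text from ${url}`), 10));

// Runs all urls concurrently and returns a Map of row -> text (a fake worksheet).
async function runTasks(urls, captureRowLocally) {
    const cells = new Map();
    let row_id = 1; // shared by every task, as in the script above

    const task = async (url) => {
        row_id += 1;
        const myRow = row_id; // this task's own copy, taken before any await
        const text = await fakeScrape(url);
        // By this point other tasks may have incremented row_id again.
        cells.set(captureRowLocally ? myRow : row_id, text);
    };

    await Promise.all(urls.map(task));
    return cells;
}

async function main() {
    const urls = ['a', 'b', 'c', 'd'];
    const shared = await runTasks(urls, false);
    const local = await runTasks(urls, true);
    console.log('rows written via shared counter:', [...shared.keys()]);
    console.log('rows written via local copy:', [...local.keys()]);
}

main();
```

With the shared counter, all four simulated tasks write to the same row and overwrite each other; with a per-task copy taken right after the increment, each task keeps its own row. That would also be consistent with the script behaving correctly when maxConcurrency is 1, since only one task is then in flight between the increment and the write.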


I've already tried to troubleshoot the issue by:

Checking for errors: there are no error messages in the console logs.

Verifying data retrieval: data retrieval from the web pages seems to be working correctly.
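A further check that might help narrow this down is to record which rows actually get written, since an overwritten cell produces no error. The sketch below is a hypothetical helper (trackRowWrites and overwrittenRows are not part of Excel4node) that wraps any (row, text) writer and reports rows written more than once. A Map stands in for the worksheet so the example runs on its own.

```javascript
// Wraps a (row, text) writer so repeated writes to the same row are recorded.
function trackRowWrites(write) {
    const counts = new Map(); // row -> number of writes seen so far
    const tracked = (row, text) => {
        counts.set(row, (counts.get(row) || 0) + 1);
        return write(row, text);
    };
    // Rows written more than once, i.e. cells whose earlier value was replaced.
    tracked.overwrittenRows = () =>
        [...counts].filter(([, n]) => n > 1).map(([row]) => row);
    return tracked;
}

// Usage with a stand-in writer backed by a Map.
const sheet = new Map();
const write = trackRowWrites((row, text) => sheet.set(row, text));
write(2, 'first');
write(3, 'second');
write(3, 'third'); // same row again: 'second' is lost
console.log(write.overwrittenRows()); // [ 3 ]
```

In the real script the wrapped function would be the one that calls ws.cell(row_id, 1).string(text). If overwrittenRows() comes back non-empty after cluster.idle(), the missing rows are being overwritten rather than never scraped.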

Despite these efforts, I'm still unable to pinpoint the exact cause of the problem. Any insights or suggestions on how to troubleshoot and resolve this issue would be greatly appreciated.

Thank you in advance for your help!
