
        Using Puppeteer and jQuery to scrape all unordered list items to an array

        Asked by Boots on 16 January 2024 at 18:22

        I'm using Puppeteer with jQuery and Node.js to try to get list items from a web page:

                <table>
                    <td class="hr">
                        <ul class="people">
                            <li class = "person">Richard</li>
                            <li class = "person">Linus</li>
                            <li class = "person">Brian</li>
                            <li class = "team_lead">Charles</li>
                        </ul>
                    </td>
                    <td class="manufacturing">
                        <ul class="people">
                            <li class = "person">Alan</ul>
                            <li class = "person">Margret</li>
                            <li class = "person">Ken</li>
                            <li class = "person">Edsger</li>
                            <li class = "team_lead">Dennis</li>
                        </ul>
                    </td>
                    <td class="design">
                        <ul class="people">
                            <li class = "person">Bill</li>
                            <li class = "person">Ada</li>
                            <li class = "person">Steve</li>
                            <li class = "person">Ken</li>
                            <li class = "team_lead">Dennis</li>
                        </ul>
                    </td>
                </table>
        

        and using this Node.js code:

        const puppeteer = require("puppeteer");
        const cheerio = require("cheerio");
        
        async function main(){
            const browser = await puppeteer.launch({headless : false, defaultViewport: {width: 1920, height: 1080}});
            const page = await browser.newPage();
            await page.goto("${url}");
            const htmlContent = await page.content();
            const $ = cheerio.load(htmlContent);
        
            let peopleList = [];
        
            $(`table td .people`).each(function(i, li){
                peopleList.push(li.text());
            });
            console.log(`people: ${peopleList}`);
        }
        main();
        

        I adapted this list-parsing code from another Stack Overflow answer (How to store list items within an array with jQuery) and from a Udemy tutorial.

        I am looking to store each name in a two dimensional array, so something like:

        peopleList = [[Richard, Linus, Brian, Charles], [Alan, Margret, Edsger, Dennis], [Bill, Ada, Steve, Ken, Dennis]];
        

        however I am getting a single string:

        RichardLinusBrianCharlesAlanMargretEdsgerDenisBillAdSteveKenDennis,RichardLinusBrianCharlesAlanMargretEdsgerDenisBillAdSteveKenDennis,...
        

        (repeated once per ul element). When I try to go deeper and select the li tags, I just get an empty string.

        1. Is there any way I can save in the desired way?
        2. I am scraping a private site, so I have removed the URL and replaced the people with computer scientists. Is there any way to point Puppeteer at a site running locally, e.g. localhost/index.html?
        jquery node.js web-scraping puppeteer cheerio

        Answer by ggorlen on 16 January 2024 at 18:52

        There is no need to use Cheerio with Puppeteer. Puppeteer already works with the live page, so it generally doesn't make sense to snapshot the page into a string, then dump it into a separate library. This is inefficient and leads to confusing bugs when the snapshot goes stale.

        Instead, use page.$$eval(yourSelector, browserCallback) to do the job:

        const puppeteer = require("puppeteer"); // ^21.6.0
        
        const html = `<HTML pasted from your question>`;
        
        let browser;
        (async () => {
          browser = await puppeteer.launch({headless: "new"});
          const [page] = await browser.pages();
          await page.setContent(html);
          const sel = "table td .people .person";
          await page.waitForSelector(sel);
          const people = await page.$$eval(
            sel,
            els => els.map(el => el.textContent.trim())
          );
          console.log(people);
        })()
          .catch(err => console.error(err))
          .finally(() => browser?.close());
        

        Output:

        [
          'Richard', 'Linus',
          'Brian',   'Alan',
          'Bill',    'Ada',
          'Steve',   'Ken'
        ]
        

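        The joined string issue was resolved above by using the selector table td .people .person, which would technically work in the Cheerio approach as well. Margret, Ken and Edsger from the manufacturing list are missing from this output, most likely because the sample HTML closes that list early with a stray </ul> after Alan, which leaves the remaining items outside .people.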
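
For reference, here is a minimal sketch of that same selector in the Cheerio approach. `collectPeople` is a hypothetical helper name, and `$` is assumed to be whatever `cheerio.load(htmlContent)` returned in the question's code. The detail that matters is that the callback receives raw elements, so each one is wrapped with `$()` before `.text()` is called:

```javascript
// `$` is assumed to be the function returned by cheerio.load(htmlContent).
function collectPeople($) {
  return $("table td .people .person")
    // Cheerio, like jQuery, passes (index, element) to the callback.
    .map((i, li) => $(li).text().trim())
    // .get() turns the Cheerio collection into a plain array of strings.
    .get();
}
```

In the question's `main`, this could replace the `.each` loop with `const peopleList = collectPeople($);`.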

        If you want to keep the categories distinct, you could use a nested query:

        // ...
        const people = await page.$$eval("table td", els =>
          els.map(el => ({
            category: el.className,
            people: [...el.querySelectorAll(".person")].map(e =>
              e.textContent.trim()
            ),
          }))
        );
        // ...
        

        which gives:

        [
          { category: 'hr', people: [ 'Richard', 'Linus', 'Brian' ] },
          {
            category: 'manufacturing',
            people: [ 'Alan', 'Margret', 'Ken', 'Edsger' ]
          },
          { category: 'design', people: [ 'Bill', 'Ada', 'Steve', 'Ken' ] }
        ]
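
If a plain two-dimensional array is preferred, as in the question, the nested query can return arrays instead of objects. The following is a sketch rather than a tested drop-in: `cellsToNames` and `scrapePeopleGrid` are hypothetical names, `page` is assumed to be a Puppeteer Page that already holds the table, and the selector is widened to `li` so that the `team_lead` entries are included too:

```javascript
// Browser-side callback: receives every <td> matched by the selector and
// returns one array of names per cell. It must not reference outer variables,
// because Puppeteer serializes it and runs it inside the page.
const cellsToNames = tds =>
  tds.map(td =>
    [...td.querySelectorAll("li")].map(li => li.textContent.trim())
  );

// `page` is assumed to be a Puppeteer Page created as in the snippet above.
const scrapePeopleGrid = page => page.$$eval("table td", cellsToNames);
```

Usage would be `const peopleList = await scrapePeopleGrid(page);`, with one inner array per table cell.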
        

        All that said, if the page you're working with has the data you want statically, using fetch and Cheerio may make sense. But I'm assuming you're working with a SPA or website that requires some interaction to get to the scrape point, or there's some other good motivator for using Puppeteer.

        As another aside, if you wind up sticking with Puppeteer but prefer to use jQuery, you can either add it, or use it if the page happens to have jQuery included already. You'll then access $ inside an evaluate-family callback that runs in the browser context. This makes more sense than using Cheerio in most cases, since you're taking advantage of the realtime page abilities of Puppeteer and won't suffer from stale data issues.
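
A rough sketch of that jQuery route follows. The helper names are hypothetical, the CDN URL is only an example of where a jQuery build could come from, and `page` is again assumed to be a Puppeteer Page that already holds the table. `jQuery` inside the callbacks is the page's global (the same object as `$`), not a Node import:

```javascript
// Assumption: any jQuery 3.x build works; this CDN URL is only an example.
const JQUERY_URL = "https://code.jquery.com/jquery-3.7.1.min.js";

// Inject jQuery only if the page does not already expose it.
async function ensureJQuery(page) {
  const present = await page.evaluate(() => typeof jQuery === "function");
  if (!present) {
    await page.addScriptTag({ url: JQUERY_URL });
  }
}

// Browser-side callback: builds one array of trimmed names per table cell.
const namesPerCell = () =>
  jQuery("table td")
    .toArray()
    .map(td =>
      jQuery(td).find("li").toArray().map(li => jQuery(li).text().trim())
    );

// Runs the jQuery query against the live page rather than a snapshot.
async function scrapeWithJQuery(page) {
  await ensureJQuery(page);
  return page.evaluate(namesPerCell);
}
```

Because the query runs in the browser at call time, it sees the current DOM, which is the advantage over a Cheerio snapshot described above.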

        To answer your other question, for demo and reproducibility purposes, I use setContent as shown above, but you can run a server and navigate to your page on localhost. Just make sure to include the port.
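
As an illustration of the localhost option, the sketch below serves an HTML string with Node's built-in `http` module and points Puppeteer at it. `serveHtml` and `scrapeLocal` are hypothetical helpers, the `/index.html` path is arbitrary because this toy server answers every request with the same markup, and `page` is assumed to be a Puppeteer Page:

```javascript
const http = require("node:http");

// Serve one HTML string on localhost. Port 0 asks the OS for any free port;
// pass a fixed number (for example 3000) if a stable URL is preferred.
function serveHtml(html, port = 0) {
  return new Promise((resolve, reject) => {
    const server = http.createServer((req, res) => {
      res.writeHead(200, { "Content-Type": "text/html; charset=utf-8" });
      res.end(html);
    });
    server.once("error", reject);
    server.listen(port, "localhost", () => {
      // The port is part of the URL that Puppeteer needs.
      const url = `http://localhost:${server.address().port}/index.html`;
      resolve({ server, url });
    });
  });
}

// `page` is assumed to be a Puppeteer Page, created as in the snippet above.
async function scrapeLocal(page, html) {
  const { server, url } = await serveHtml(html);
  try {
    await page.goto(url);
    return await page.$$eval("table td .people .person", els =>
      els.map(el => el.textContent.trim())
    );
  } finally {
    server.close();
  }
}
```

With a site that is already running locally, only the `page.goto("http://localhost:<port>/index.html")` call is needed, using whatever port that server listens on.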
