How to use Puppeteer to count the occurrences of a specific text on a web page?


I am using Node.js and the Puppeteer library to load a website and check whether a certain text is displayed on the page. I would like to count the number of occurrences of this text, and I would like the search to behave the same way Ctrl+F does in Chrome or Firefox.

Here's the code I have so far:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  // How do I count the occurrences of the specific text here?

  await browser.close();
})();

Can someone please help me with a solution on how to achieve this? Any help would be greatly appreciated.

There are 3 best solutions below

ggorlen On BEST ANSWER

As I mentioned in a comment, the Ctrl+F algorithm may not be as simple as you presume, but you may be able to approximate it by collecting the text contents and form values of every visible element, excluding style, script, and metadata tags.

Here's a simple proof of concept:

const puppeteer = require("puppeteer"); // ^19.7.2

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  const ua =
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36";
  await page.setUserAgent(ua);
  const url = "https://www.google.com";
  await page.goto(url, {waitUntil: "domcontentloaded"});
  await page.evaluate(() =>
    window.isVisible = e =>
      // https://stackoverflow.com/a/21696585/6243352
      e.offsetParent !== null &&
      getComputedStyle(e).visibility !== "hidden" &&
      getComputedStyle(e).display !== "none"
  );
  const excludedTags = [
    "head",
    "link",
    "meta",
    "script",
    "style",
    "title",
  ];
  const text = await page.$$eval(
    "*",
    (els, excludedTags) =>
      els
        .filter(e =>
          !excludedTags.includes(e.tagName.toLowerCase()) &&
          isVisible(e)
        )
        .flatMap(e => [...e.childNodes])
        .filter(e => e.nodeType === Node.TEXT_NODE)
        .map(e => e.textContent.trim())
        .filter(Boolean),
    excludedTags
  );
  const values = await page.$$eval("[value]", els =>
    els
      .filter(isVisible)
      .map(e => e.value.trim())
      .filter(Boolean)
  );
  const visible = [
    ...new Set([...text, ...values].map(e => e.toLowerCase())),
  ];
  console.log(visible);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

Output:

[
  'about',
  'store',
  'gmail',
  'images',
  'sign in',
  'businesses and job seekers',
  'in your community',
  'are growing with help from google',
  'advertising',
  'business',
  'how search works',
  'carbon neutral since 2007',
  'privacy',
  'terms',
  'settings',
  'google search',
  "i'm feeling lucky"
]

Undoubtedly, this has some false positives and negatives, and I've only tested it on google.com. Feel free to post a counterexample and I'll see if I can toss it in.

Also, since we run two separate queries, then combine the results and dedupe, ordering of the text isn't the same as it appears on the page. You could query by *, [value] and use conditions to figure out which you're working with if this matters. I've assumed your final goal is just a true/false "does some text exist?" semantic.
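If the goal is a count rather than a yes/no check, one option is to skip the Set dedupe and tally matches across the collected strings. The following is a minimal sketch and not part of the original answer: `countInStrings` is a hypothetical helper name, and it assumes case-insensitive, non-overlapping matching, which only approximates what a browser's find bar does.

```javascript
// Hypothetical helper: counts non-overlapping, case-insensitive occurrences
// of `needle` across an array of strings (e.g. text + values before dedupe).
function countInStrings(strings, needle) {
  const target = needle.toLowerCase();
  if (target === "") return 0; // an empty search term matches nothing here
  let total = 0;
  for (const s of strings) {
    const haystack = s.toLowerCase();
    let from = haystack.indexOf(target);
    while (from !== -1) {
      total++;
      // Jump past the current match so matches never overlap
      from = haystack.indexOf(target, from + target.length);
    }
  }
  return total;
}

// Sample input shaped like the arrays built above, without the Set dedupe
const sample = ["Google Search", "How Search works", "search"];
console.log(countInStrings(sample, "search")); // 3
```

In the script above, this could be called as `countInStrings([...text, ...values], "search")`. Each string is searched on its own, so a phrase that is split across several elements would not be counted.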

Ayush Gupta On

You can get all the text and then run a regex or a simple search.

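// '*' matches the <html> element first, so this reads the whole page's rendered text
const extractedText = await page.$eval('*', (el) => el.innerText);
console.log(extractedText);
// The built-in constructor is RegExp (not Regex); the 'g' flag finds every match
const regx = new RegExp('--search word--', 'g');
const count = (extractedText.match(regx) || []).length;
console.log(count);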
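
The "simple search" mentioned above can be done without a regular expression at all. This is a rough sketch under the assumption that a literal, case-sensitive match is wanted; `countLiteral` is a hypothetical helper name, and the sample string stands in for the extracted page text.

```javascript
// Hypothetical helper: counts literal, case-sensitive occurrences of `word`
// in `text` without building a regular expression.
function countLiteral(text, word) {
  if (word === "") return 0; // split("") would split between every character
  return text.split(word).length - 1;
}

// Stand-in for the text returned by page.$eval(...)
const pageText = "Example Domain. This domain is for use in illustrative examples.";
console.log(countLiteral(pageText, "domain"));               // 1
console.log(countLiteral(pageText.toLowerCase(), "domain")); // 2
```

Lowercasing both the text and the search word first gives a case-insensitive count, which is closer to the default Ctrl+F behavior.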
Eric Fortis On

import puppeteer from 'puppeteer'

(async () => {
  const textToFind = 'domain'
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto('https://example.com')

  const text = await page.evaluate(() => document.documentElement.innerText)

  const n = [...text.matchAll(new RegExp(textToFind, 'gi'))].length
  console.log(`${textToFind} appears ${n} times`)

  await browser.close()
})()
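
One caveat with building a RegExp directly from the search term: characters such as `.`, `+`, or `$` are treated as pattern syntax, so a term like `c++` would throw and `$5.00` would not match literally. A possible workaround, sketched here as an assumption rather than as part of the answer, is to escape the term first. `escapeRegExp` and `countMatches` are hypothetical helper names, and the sample string stands in for `document.documentElement.innerText`.

```javascript
// Escapes characters that have special meaning in a regular expression so
// the search term is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: case-insensitive count of a literal search term
function countMatches(text, textToFind) {
  if (textToFind === "") return 0; // an empty pattern would match at every position
  const pattern = new RegExp(escapeRegExp(textToFind), "gi");
  return [...text.matchAll(pattern)].length;
}

// Stand-in for the page text
const sampleText = "Total: $5.00 (was $5.00). C++ and c++ are the same here.";
console.log(countMatches(sampleText, "$5.00")); // 2
console.log(countMatches(sampleText, "c++"));   // 2
```

In the script above, the count line could then become `const n = countMatches(text, textToFind)`.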