How to scrape protected sites using Puppeteer and JS


I am trying to make a bot that can scrape any site, but on some sites I run into problems. For now I just open the browser in headless: false mode and then navigate myself, but I still run into problems, so I think the site may be detecting my browser fingerprint.

I have tried a couple of different sets of launch options, which is why there are multiple option variables and only one of them is used.

Here is my current code:

const puppeteer = require("puppeteer-extra");
const pluginStealth = require("puppeteer-extra-plugin-stealth");
const Ua = require("puppeteer-extra-plugin-anonymize-ua");

puppeteer.use(pluginStealth());

puppeteer.use(Ua());

let browser, page;

function log(msg){
    console.log(msg);
}

function delay(time) {
    return new Promise((resolve) => {
        setTimeout(resolve, time);
    });
}

async function openBrowser(){
    if (!browser){

        // Option set 1: reuse an existing Chrome profile so the site sees real cookies/history.
        // Note: userDataDir must point at the "User Data" folder itself; the profile
        // inside it is selected with --profile-directory.
        const options1 = {
            headless: false, 
            executablePath: "C:/Program Files/Google/Chrome/Application/chrome.exe",
            args: ['--profile-directory=Person 1'],
            userDataDir: "C:\\Users\\berti\\AppData\\Local\\Google\\Chrome\\User Data"
        };

        // Option set 2: plain launch with the automation banner disabled.
        const options2 = {
            args: ['--start-maximized', '--disable-gpu', '--disable-infobars', '--disable-extensions', '--ignore-certificate-errors'],
            headless: false,
            ignoreDefaultArgs: ['--enable-automation'], // hides the "controlled by automated software" infobar
            executablePath: "C:/Program Files/Google/Chrome/Application/chrome.exe",
            defaultViewport: null,
        };
        browser = await puppeteer.launch(options2);
        await delay(Math.random() * 1000); // random pause of up to 1 s before opening a page
        page = await browser.newPage(); 
        log("New browser has been booted up");
    } else {
        log("Browser alleready in existience");
    };
}

One of the tests I do is to go to Nike and try to add a shoe to the cart, but it won't let me.

1 Answer

To improve the success rate of your web scraping bot and avoid detection, you can try the following techniques:

User Agent Rotation: Use a library or plugin to rotate and randomize the User Agent string to make your bot appear more like a regular browser.
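
For example, a minimal sketch that picks a random User Agent per page (the UA strings below are placeholders you would swap for current, real ones; page.setUserAgent is a standard Puppeteer API):

// Small pool of User Agent strings; replace these placeholders with current, real ones.
const USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
];

async function newPageWithRandomUa(browser) {
    const page = await browser.newPage();
    // Pick one at random so consecutive sessions don't share the same UA fingerprint.
    const ua = USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
    await page.setUserAgent(ua);
    return page;
}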

JavaScript Rendering: Ensure that the headless browser executes JavaScript properly, as many modern websites rely on it for functionality and content rendering.
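
In Puppeteer this mostly means waiting until client-side JavaScript has actually run before you interact with the page; a sketch along these lines (the #content selector is a placeholder):

async function gotoAndRender(page, url) {
    // networkidle2 waits until there are at most 2 in-flight network requests,
    // which gives client-side rendering time to finish.
    await page.goto(url, { waitUntil: "networkidle2", timeout: 60000 });
    // Prefer waiting for a concrete element over a fixed sleep.
    await page.waitForSelector("#content", { visible: true });
}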

Rate Limiting and Delay: Introduce random delays between your requests to avoid triggering rate-limiting mechanisms. Mimic human-like behavior in your bot.
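
You already have a delay() helper; a sketch of jittered pauses built on top of it (the selector in the usage comment is a placeholder):

// Random pause between minMs and maxMs, reusing the delay() helper from the question.
async function humanPause(minMs = 500, maxMs = 2000) {
    await delay(minMs + Math.random() * (maxMs - minMs));
}

// Usage between actions:
// await page.click(".add-to-cart"); // placeholder selector
// await humanPause();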

IP Rotation and Proxying: Use a pool of rotating IP addresses or proxies to prevent IP-based blocking.
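
Puppeteer accepts a proxy through Chrome's --proxy-server launch flag; a minimal sketch assuming a hypothetical list of proxy URLs:

const PROXIES = [
    "http://proxy1.example.com:8000", // placeholder proxy URLs
    "http://proxy2.example.com:8000"
];

async function launchWithRandomProxy() {
    const proxy = PROXIES[Math.floor(Math.random() * PROXIES.length)];
    const browser = await puppeteer.launch({
        headless: false,
        args: [`--proxy-server=${proxy}`]
    });
    // If the proxy requires credentials, authenticate per page:
    // const page = await browser.newPage();
    // await page.authenticate({ username: "user", password: "pass" }); // placeholders
    return browser;
}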

CAPTCHA Solving: Implement CAPTCHA-solving services or libraries to handle CAPTCHAs programmatically.
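
For example, puppeteer-extra has a reCAPTCHA plugin that forwards challenges to a paid solver such as 2Captcha; a sketch assuming you have an API token (the token below is a placeholder):

const RecaptchaPlugin = require("puppeteer-extra-plugin-recaptcha");

puppeteer.use(
    RecaptchaPlugin({
        provider: { id: "2captcha", token: "YOUR_2CAPTCHA_TOKEN" } // placeholder token
    })
);

// After navigating to a page that shows a reCAPTCHA:
// const { solved } = await page.solveRecaptchas();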