I wanted to build a semi-automatic solution for scraping a website protected by Cloudflare's hCaptcha. My plan was to solve the captcha manually whenever it appears and then let my scraper run for some time until another captcha has to be solved.
To try this out, I open the URL with Selenium while trying to make it look like a regular user:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium_stealth import stealth
from webdriver_manager.chrome import ChromeDriverManager

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)

s = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s, options=options)

stealth(
    driver,
    languages=["en-US", "en"],
    vendor="Google Inc.",
    platform="Win32",
    webgl_vendor="Intel Inc.",
    renderer="Intel Iris OpenGL Engine",
    fix_hairline=True,
)

driver.get(url_to_scrape)  # Solve the captcha manually in the opened window
After solving the captcha I want to reach the actual website so I can scrape some information from it. The problem is that even when I solve the captcha, Cloudflare doesn't let me see the site: it just reloads the captcha page (with a 403 response) and makes me solve another one, then another, and so on.
What am I doing wrong? Solving the captcha itself shouldn't be the problem, so the site must somehow detect Selenium as a bot. I thought that with the snippet above the website would see Selenium no differently from a normal user with the Chrome browser, but surely I'm missing something.
Without the site URL it is impossible to tell exactly what is happening. From previous experience, though, I believe the hCaptcha prompt is probably triggered by the site's protection layer rather than being part of the site itself.
If it's appearing as a result of the site protection, then start your browser using your own Chrome user profile.
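The snippet for this step is not shown here, so the following is only a minimal sketch of one way it might look. It assumes Chrome's `--user-data-dir` and `--profile-directory` command-line switches; the helper names `profile_arguments` and `start_chrome_with_profile` and the example path are hypothetical, and the real profile location depends on your operating system (it is listed on the `chrome://version` page as "Profile Path").

```python
import os


def profile_arguments(user_data_dir, profile_name="Default"):
    """Build the Chrome switches that select an existing user profile.

    user_data_dir: folder that contains the profile folders (assumed path).
    profile_name:  name of the profile folder inside it, e.g. "Default".
    """
    return [
        f"--user-data-dir={os.path.expanduser(user_data_dir)}",
        f"--profile-directory={profile_name}",
    ]


def start_chrome_with_profile(user_data_dir, profile_name="Default"):
    """Start Chrome through Selenium with the given profile (hypothetical helper)."""
    # Imported here so the argument helper above works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in profile_arguments(user_data_dir, profile_name):
        options.add_argument(argument)
    return webdriver.Chrome(options=options)


# Example call (placeholder path, adjust to your machine):
# driver = start_chrome_with_profile("~/chrome-profile-for-scraping")
# driver.get(url_to_scrape)
```

Chrome typically will not share a profile directory that is already in use, so other Chrome windows using that profile probably need to be closed first. Whether this is enough to get past the challenge depends on the site.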
....then run the remaining part of your code to scrape the site.