Incapsula is a web application delivery platform that can be used to prevent scraping.
I am working with Python and Scrapy, and I found this, but it seems to be out of date and no longer works against current Incapsula. When I tested the Scrapy middleware against my target website, I got IndexErrors because the middleware could not extract an obfuscated parameter from the page.
Is it possible to adapt that repo, or has Incapsula changed how it operates?
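To illustrate the failure mode, here is a simplified sketch of what the middleware appears to be doing; the regex and the variable names are my guesses, not the repo's actual code:

import re

import requests

# Fetch the page the same way the middleware would on the first request.
response = requests.get("https://www.radarcupon.es/tienda/fotoprix.com")

# The middleware does something equivalent to this: pull an obfuscated
# value out of the challenge script with a regex and take the first match.
# (The pattern below is only a placeholder for whatever it really looks for.)
matches = re.findall(r'var\s+b\s*=\s*"([^"]+)"', response.text)

# Against the current Incapsula response there are no matches, so indexing
# matches[0] directly raises IndexError instead of returning the parameter.
obfuscated_param = matches[0] if matches else None
print(obfuscated_param)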
I'm also curious why, when I use "Copy as cURL" in Chrome dev tools on the request to my target page, the Chrome response contains the user content, yet the curl response is an "Incapsula incident" page. This is with Chrome's cookies initially cleared:
curl 'https://www.radarcupon.es/tienda/fotoprix.com' \
  -H 'pragma: no-cache' \
  -H 'dnt: 1' \
  -H 'accept-encoding: gzip, deflate, br' \
  -H 'accept-language: en-GB,en-US;q=0.9,en;q=0.8' \
  -H 'upgrade-insecure-requests: 1' \
  -H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/62.0.3202.94 Chrome/62.0.3202.94 Safari/537.36' \
  -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8' \
  -H 'cache-control: no-cache' \
  -H 'authority: www.radarcupon.es' \
  --compressed
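For reference, a rough Python equivalent of that curl command (a sketch using requests, with the header values copied from above) looks like this:

import requests

# Same headers as the curl command; cookies deliberately absent, to mirror
# a Chrome session with cookies cleared. accept-encoding is left to requests
# (it handles gzip/deflate itself), and "authority" is an HTTP/2 pseudo-header
# that does not apply to the HTTP/1.1 request requests sends.
headers = {
    "pragma": "no-cache",
    "dnt": "1",
    "accept-language": "en-GB,en-US;q=0.9,en;q=0.8",
    "upgrade-insecure-requests": "1",
    "user-agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
        "Ubuntu Chromium/62.0.3202.94 Chrome/62.0.3202.94 Safari/537.36"
    ),
    "accept": (
        "text/html,application/xhtml+xml,application/xml;q=0.9,"
        "image/webp,image/apng,*/*;q=0.8"
    ),
    "cache-control": "no-cache",
}

response = requests.get("https://www.radarcupon.es/tienda/fotoprix.com", headers=headers)
print(response.status_code)
print(response.text[:500])  # inspect whether this is the real page or an "Incapsula incident" page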
I was expecting the first request from both to return something like a JavaScript challenge that would set a cookie, but it doesn't seem to work quite like that now?
It's difficult to give a specific answer because Incapsula has a very detailed rules engine that can be used to block or challenge requests. Cookie detection and JavaScript support are the two most common data points used to identify suspicious traffic; user-agent strings, headers, and behavior originating from the client IP address (requests per minute, AJAX requests, etc.) can also cause Incapsula to challenge traffic. The DDoS protection feature blocks requests aggressively if it is not configured sensibly relative to the amount of traffic a site sees.
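One way to narrow it down is to test whether cookies alone explain the difference you're seeing between Chrome and curl: load the page in Chrome, copy the Incapsula cookies from dev tools (they typically have names like visid_incap_*, incap_ses_* and ___utmvc), and replay the request from Python with those cookies attached. This is only a diagnostic sketch; the cookie names below are placeholders you'd replace with the ones your browser actually received:

import requests

# Placeholder values: paste the real cookie names/values from Chrome dev tools
# (Application -> Cookies) after loading the page in the browser.
browser_cookies = {
    "visid_incap_123456": "PASTE_VALUE_HERE",
    "incap_ses_123_123456": "PASTE_VALUE_HERE",
}

headers = {
    # Use the same user agent as the browser session, since Incapsula can
    # also key off user-agent/header mismatches.
    "user-agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
        "Ubuntu Chromium/62.0.3202.94 Chrome/62.0.3202.94 Safari/537.36"
    ),
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

response = requests.get(
    "https://www.radarcupon.es/tienda/fotoprix.com",
    headers=headers,
    cookies=browser_cookies,
)

# If this returns the real page, the cookie/JavaScript validation is the gate;
# if it still returns the incident page, other signals (headers, IP-based
# behavior, rate limits) are likely involved as well.
print("blocked" if "Incapsula incident" in response.text else "ok")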