Incapsula is a web application delivery platform that can be used to prevent scraping.
I am working with Python and Scrapy, and I found this, but it seems to be out of date and no longer works against current Incapsula. When I tested the Scrapy middleware against my target website, I got IndexErrors because the middleware could not extract an obfuscated parameter from the page.
Is it possible to adapt that repo, or has Incapsula changed how it operates?
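To illustrate the failure mode, here is a simplified sketch of what the middleware appears to be doing; the regex and the variable names are my guesses, not the repo's actual code:

import re

import requests

# Fetch the page the same way the middleware would on the first request.
response = requests.get("https://www.radarcupon.es/tienda/fotoprix.com")

# The middleware does something equivalent to this: pull an obfuscated
# value out of the challenge script with a regex and take the first match.
# (The pattern below is only a placeholder for whatever it really looks for.)
matches = re.findall(r'var\s+b\s*=\s*"([^"]+)"', response.text)

# Against the current Incapsula response there are no matches, so indexing
# matches[0] directly raises IndexError instead of returning the parameter.
obfuscated_param = matches[0] if matches else None
print(obfuscated_param)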
I'm also curious why, when I use "Copy as cURL" in Chrome dev tools on the request to my target page, the Chrome response contains the user content, yet the curl response is an "Incapsula incident" page. This is with Chrome's cookies initially cleared:
curl 'https://www.radarcupon.es/tienda/fotoprix.com' \
  -H 'pragma: no-cache' \
  -H 'dnt: 1' \
  -H 'accept-encoding: gzip, deflate, br' \
  -H 'accept-language: en-GB,en-US;q=0.9,en;q=0.8' \
  -H 'upgrade-insecure-requests: 1' \
  -H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/62.0.3202.94 Chrome/62.0.3202.94 Safari/537.36' \
  -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8' \
  -H 'cache-control: no-cache' \
  -H 'authority: www.radarcupon.es' \
  --compressed
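For reference, a rough Python equivalent of that curl command (a sketch using requests, with the header values copied from above) looks like this:

import requests

# Same headers as the curl command; cookies deliberately absent, to mirror
# a Chrome session with cookies cleared. accept-encoding is left to requests
# (it handles gzip/deflate itself), and "authority" is an HTTP/2 pseudo-header
# that does not apply to the HTTP/1.1 request requests sends.
headers = {
    "pragma": "no-cache",
    "dnt": "1",
    "accept-language": "en-GB,en-US;q=0.9,en;q=0.8",
    "upgrade-insecure-requests": "1",
    "user-agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
        "Ubuntu Chromium/62.0.3202.94 Chrome/62.0.3202.94 Safari/537.36"
    ),
    "accept": (
        "text/html,application/xhtml+xml,application/xml;q=0.9,"
        "image/webp,image/apng,*/*;q=0.8"
    ),
    "cache-control": "no-cache",
}

response = requests.get("https://www.radarcupon.es/tienda/fotoprix.com", headers=headers)
print(response.status_code)
print(response.text[:500])  # inspect whether this is the real page or an "Incapsula incident" page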
I was expecting the first request from both to return something like a JavaScript challenge that would set a cookie, but it doesn't seem to work quite like that now?
It's difficult to give a specific answer because Incapsula has a very detailed rules engine that can be used to block or challenge requests. Cookie detection and JavaScript support are the two most common data points used to identify suspicious traffic; user-agent strings, headers, and behavior originating from the client IP address (requests per minute, AJAX requests, etc.) can also cause Incapsula to challenge traffic. The DDoS protection feature blocks requests aggressively if it is not configured sensibly relative to the amount of traffic a site sees.
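One way to narrow it down is to test whether cookies alone explain the difference you're seeing between Chrome and curl: load the page in Chrome, copy the Incapsula cookies from dev tools (they typically have names like visid_incap_*, incap_ses_* and ___utmvc), and replay the request from Python with those cookies attached. This is only a diagnostic sketch; the cookie names below are placeholders you'd replace with the ones your browser actually received:

import requests

# Placeholder values: paste the real cookie names/values from Chrome dev tools
# (Application -> Cookies) after loading the page in the browser.
browser_cookies = {
    "visid_incap_123456": "PASTE_VALUE_HERE",
    "incap_ses_123_123456": "PASTE_VALUE_HERE",
}

headers = {
    # Use the same user agent as the browser session, since Incapsula can
    # also key off user-agent/header mismatches.
    "user-agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
        "Ubuntu Chromium/62.0.3202.94 Chrome/62.0.3202.94 Safari/537.36"
    ),
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

response = requests.get(
    "https://www.radarcupon.es/tienda/fotoprix.com",
    headers=headers,
    cookies=browser_cookies,
)

# If this returns the real page, the cookie/JavaScript validation is the gate;
# if it still returns the incident page, other signals (headers, IP-based
# behavior, rate limits) are likely involved as well.
print("blocked" if "Incapsula incident" in response.text else "ok")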