I am using Scrapy to scrape this site, which is protected by Incapsula:
<meta name="robots" content="noindex,nofollow">
<script src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3">
</script>
I already asked a question about this issue two years ago, but that method (Incapsula-Cracker) no longer works.
I tried to understand how Incapsula works, and attempted the following to bypass it:
import re

from scrapy import Request

# The functions below are methods of the spider class.
def start_requests(self):
    yield Request('https://courses-en-ligne.carrefour.fr', cookies={'store': 92},
                  dont_filter=True, callback=self.init_shop)

def init_shop(self, response):
    result_content = response.text
    # The challenge page carries a hex-encoded script in `var b="..."`.
    RE_ENCODED_FUNCTION = re.compile(r'var b="(.*?)"', re.DOTALL)
    RE_INCAPSULA = re.compile(r'(_Incapsula_Resource\?SWHANEDL=.*?)"')
    INCAPSULA_URL = 'https://courses-en-ligne.carrefour.fr/%s'
    encoded_func = RE_ENCODED_FUNCTION.search(result_content).group(1)
    # Decode each pair of hex digits into one character.
    decoded_func = ''.join(chr(int(encoded_func[i:i+2], 16))
                           for i in range(0, len(encoded_func), 2))
    incapsula_params = RE_INCAPSULA.search(decoded_func).group(1)
    incap_url = INCAPSULA_URL % incapsula_params
    yield Request(incap_url)

def parse(self, response):
    print(response.body)
But I am redirected to a reCAPTCHA page:
<html style="height:100%">
<head>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<meta name="format-detection" content="telephone=no">
<meta name="viewport" content="initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
</head>
<body style="margin:0px;height:100%">
<iframe src="/_Incapsula_Resource?CWUDNSAI=27&xinfo=3-10784678-0%200NNN%20RT%281523525225370%20394%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B12%284%2c316%2c0%29%20U10000&incident_id=459000960022408474-41333502566401539&edet=12&cinfo=04000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 459000960022408474-41333502566401539
</iframe>
</body>
</html>
This is not the best answer, but it gives some pointers on why web scraping is not that easy, especially when a CDN sits in front of the site.
First, it may help to check what you will be fighting against: WAF & Bot Mitigation.
Then, for more ideas, this is a good talk: How Attackers Circumvent CDNs to Attack Origin
This doesn't mean web scraping is impossible. The problem now comes down to time/speed: the faster you send requests, the higher the chances that you trigger the CAPTCHAs and, in the worst case, get fully blocked.
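To make the time/speed point concrete, here is a minimal sketch of Scrapy settings that slow a crawl down. The setting names are Scrapy's own, but every value is an assumption to be tuned against the target site, not a known-good threshold for Incapsula, and the name `THROTTLE_SETTINGS` is only illustrative.

```python
# Conservative throttling settings for a Scrapy spider.
# All values are illustrative assumptions, not tuned thresholds.
THROTTLE_SETTINGS = {
    # Base delay (in seconds) between requests to the same site.
    "DOWNLOAD_DELAY": 5,
    # Wait a random 0.5x to 1.5x of DOWNLOAD_DELAY instead of a fixed interval.
    "RANDOMIZE_DOWNLOAD_DELAY": True,
    # Send at most one request at a time per domain.
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
    # Let the AutoThrottle extension adapt the delay to observed latency.
    "AUTOTHROTTLE_ENABLED": True,
    "AUTOTHROTTLE_START_DELAY": 5,
    "AUTOTHROTTLE_MAX_DELAY": 60,
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 1.0,
}
```

These could be written as plain assignments in the project's settings.py, or the dict could be assigned to the spider's `custom_settings` class attribute.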
There are multiple approaches, such as using a different IP per request (Make requests using Python over Tor), changing the user agent, etc. But most of them are bound to a set of defined timeouts and query patterns that you may need to discover.
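As a rough sketch of the user-agent and IP ideas, a Scrapy downloader middleware could pick a new User-Agent for each request and optionally route it through a proxy. The class name `RotatingIdentityMiddleware`, the user-agent strings, and the proxy address used in the usage note are all hypothetical placeholders. Nothing here is known to get past Incapsula; it only shows where such rotation would plug in.

```python
import random

# Placeholder user-agent strings (assumptions); replace with a list you maintain.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/2.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13) ExampleBrowser/3.0",
]


class RotatingIdentityMiddleware:
    """Hypothetical downloader middleware: random User-Agent and optional proxy."""

    def __init__(self, user_agents=USER_AGENTS, proxy=None, rng=None):
        self.user_agents = list(user_agents)
        self.proxy = proxy  # e.g. the URL of a local HTTP proxy, if any
        self.rng = rng or random.Random()

    def process_request(self, request, spider):
        # Overwrite the User-Agent header with a randomly chosen one.
        request.headers["User-Agent"] = self.rng.choice(self.user_agents)
        if self.proxy:
            # Scrapy reads the proxy to use from request.meta["proxy"].
            request.meta["proxy"] = self.proxy
        # Returning None lets Scrapy continue processing the request normally.
        return None
```

To try it, the class would be registered under `DOWNLOADER_MIDDLEWARES` in the settings, using whatever module path it lives at. For the Tor route, `proxy` would be the address of a local HTTP proxy that forwards to Tor (for instance `http://127.0.0.1:8118`, an assumed address); that setup is not shown here.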