Scrapy spider not working with crawlera middleware


I wrote a spider to crawl a large site. I'm hosting it on Scrapinghub and am using the Crawlera add-on. Without Crawlera my spider runs on Scrapinghub just fine. As soon as I switch to the Crawlera middleware, the spider just exits without doing a single crawl.

I've run the spider without Crawlera and it runs on my local system as well as on Scrapinghub; the only thing I change is enabling the middleware for Crawlera. Without Crawlera it runs, with it it doesn't. I've set concurrent requests to my C10 plan limit:

CRAWLERA_APIKEY = '<apikey>'
CONCURRENT_REQUESTS = 10
CONCURRENT_REQUESTS_PER_DOMAIN = 10
AUTOTHROTTLE_ENABLED = False
DOWNLOAD_TIMEOUT = 600

DOWNLOADER_MIDDLEWARES = {
    #'ytscraper.middlewares.YtscraperDownloaderMiddleware': 543,
    'scrapy_crawlera.CrawleraMiddleware': 300
}


Here is the log dump

0:  2019-02-06 05:54:34 INFO    Log opened.
1:  2019-02-06 05:54:34 INFO    [scrapy.log] Scrapy 1.5.1 started
2:  2019-02-06 05:54:34 INFO    [scrapy.utils.log] Scrapy 1.5.1 started (bot: ytscraper)
3:  2019-02-06 05:54:34 INFO    [scrapy.utils.log] Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.9.0, Python 2.7.15 (default, Nov 16 2018, 23:19:37) - [GCC 4.9.2], pyOpenSSL 18.0.0 (OpenSSL 1.1.1a 20 Nov 2018), cryptography 2.5, Platform Linux-4.4.0-141-generic-x86_64-with-debian-8.11
4:  2019-02-06 05:54:34 INFO    [scrapy.crawler] Overridden settings: {'NEWSPIDER_MODULE': 'ytscraper.spiders', 'STATS_CLASS': 'sh_scrapy.stats.HubStorageStatsCollector', 'LOG_LEVEL': 'INFO', 'CONCURRENT_REQUESTS_PER_DOMAIN': 10, 'CONCURRENT_REQUESTS': 10, 'SPIDER_MODULES': ['ytscraper.spiders'], 'AUTOTHROTTLE_ENABLED': True, 'LOG_ENABLED': False, 'DOWNLOAD_TIMEOUT': 600, 'MEMUSAGE_LIMIT_MB': 950, 'BOT_NAME': 'ytscraper', 'TELNETCONSOLE_HOST': '0.0.0.0'}
5:  2019-02-06 05:54:34 INFO    [scrapy.middleware] Enabled extensions:
6:  2019-02-06 05:54:34 INFO    [scrapy.middleware] Enabled downloader middlewares:
['sh_scrapy.diskquota.DiskQuotaDownloaderMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 u'scrapy_crawlera.CrawleraMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats',
 'sh_scrapy.middlewares.HubstorageDownloaderMiddleware']
7:  2019-02-06 05:54:34 INFO    [scrapy.middleware] Enabled spider middlewares:
['sh_scrapy.diskquota.DiskQuotaSpiderMiddleware',
 'sh_scrapy.middlewares.HubstorageSpiderMiddleware',
 'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
8:  2019-02-06 05:54:34 INFO    [scrapy.middleware] Enabled item pipelines:
9:  2019-02-06 05:54:34 INFO    [scrapy.core.engine] Spider opened
10: 2019-02-06 05:54:34 INFO    [scrapy.extensions.logstats] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
11: 2019-02-06 05:54:34 INFO    [root] Using crawlera at http://proxy.crawlera.com:8010 (user: 11b143d...)
12: 2019-02-06 05:54:34 INFO    [root] CrawleraMiddleware: disabling download delays on Scrapy side to optimize delays introduced by Crawlera. To avoid this behaviour you can use the CRAWLERA_PRESERVE_DELAY setting but keep in mind that this may slow down the crawl significantly
13: 2019-02-06 05:54:34 INFO    TelnetConsole starting on 6023
14: 2019-02-06 05:54:40 INFO    [scrapy.core.engine] Closing spider (finished)
15: 2019-02-06 05:54:40 INFO    [scrapy.statscollectors] Dumping Scrapy stats:
16: 2019-02-06 05:54:40 INFO    [scrapy.core.engine] Spider closed (finished)
17: 2019-02-06 05:54:40 INFO    Main loop terminated.

Here is the log of the same spider without the Crawlera middleware:

0:  2019-02-05 17:42:13 INFO    Log opened.
1:  2019-02-05 17:42:13 INFO    [scrapy.log] Scrapy 1.5.1 started
2:  2019-02-05 17:42:13 INFO    [scrapy.utils.log] Scrapy 1.5.1 started (bot: ytscraper)
3:  2019-02-05 17:42:13 INFO    [scrapy.utils.log] Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.9.0, Python 2.7.15 (default, Nov 16 2018, 23:19:37) - [GCC 4.9.2], pyOpenSSL 18.0.0 (OpenSSL 1.1.1a 20 Nov 2018), cryptography 2.5, Platform Linux-4.4.0-135-generic-x86_64-with-debian-8.11
4:  2019-02-05 17:42:13 INFO    [scrapy.crawler] Overridden settings: {'NEWSPIDER_MODULE': 'ytscraper.spiders', 'STATS_CLASS': 'sh_scrapy.stats.HubStorageStatsCollector', 'LOG_LEVEL': 'INFO', 'CONCURRENT_REQUESTS_PER_DOMAIN': 32, 'CONCURRENT_REQUESTS': 32, 'SPIDER_MODULES': ['ytscraper.spiders'], 'AUTOTHROTTLE_ENABLED': True, 'LOG_ENABLED': False, 'DOWNLOAD_TIMEOUT': 600, 'MEMUSAGE_LIMIT_MB': 950, 'BOT_NAME': 'ytscraper', 'TELNETCONSOLE_HOST': '0.0.0.0'}
5:  2019-02-05 17:42:13 INFO    [scrapy.middleware] Enabled extensions:
6:  2019-02-05 17:42:14 INFO    [scrapy.middleware] Enabled downloader middlewares:
7:  2019-02-05 17:42:14 INFO    [scrapy.middleware] Enabled spider middlewares:
8:  2019-02-05 17:42:14 INFO    [scrapy.middleware] Enabled item pipelines:
9:  2019-02-05 17:42:14 INFO    [scrapy.core.engine] Spider opened
10: 2019-02-05 17:42:14 INFO    [scrapy.extensions.logstats] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
11: 2019-02-05 17:42:14 INFO    [root] Using crawlera at http://proxy.crawlera.com:8010 (user: 11b143d...)
12: 2019-02-05 17:42:14 INFO    [root] CrawleraMiddleware: disabling download delays on Scrapy side to optimize delays introduced by Crawlera. To avoid this behaviour you can use the CRAWLERA_PRESERVE_DELAY setting but keep in mind that this may slow down the crawl significantly
13: 2019-02-05 17:42:14 INFO    TelnetConsole starting on 6023
14: 2019-02-05 17:43:14 INFO    [scrapy.extensions.logstats] Crawled 17 pages (at 17 pages/min), scraped 16 items (at 16 items/min)
15: 2019-02-05 17:44:14 INFO    [scrapy.extensions.logstats] Crawled 35 pages (at 18 pages/min), scraped 34 items (at 18 items/min)
16: 2019-02-05 17:45:14 INFO    [scrapy.extensions.logstats] Crawled 41 pages (at 6 pages/min), scraped 40 items (at 6 items/min)
17: 2019-02-05 17:45:30 INFO    [scrapy.crawler] Received SIGTERM, shutting down gracefully. Send again to force
18: 2019-02-05 17:45:30 INFO    [scrapy.core.engine] Closing spider (shutdown)
19: 2019-02-05 17:45:38 INFO    [scrapy.statscollectors] Dumping Scrapy stats:
20: 2019-02-05 17:45:38 INFO    [scrapy.core.engine] Spider closed (shutdown)
21: 2019-02-05 17:45:38 INFO    Main loop terminated.

I wrote a script in Python to test out my Crawlera connection:

import requests

# Note: requests only routes a URL through the proxy when the URL scheme has a
# matching key in `proxies`, so an https:// URL needs an "https" entry as well.
proxies = {
    "http": "http://<APIkey>:@proxy.crawlera.com:8010/",
    "https": "http://<APIkey>:@proxy.crawlera.com:8010/",
}

response = requests.get(
    "https://www.youtube.com",
    proxies=proxies,
    verify=False,  # for HTTPS through Crawlera, either disable verification or install its CA certificate
)
print(response.text)
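
As an extra sanity check (not in the original post; the header names below are the ones Crawlera normally returns, so treat them as an assumption and compare with the Crawlera docs), the response headers show whether the request actually went through the proxy:

print(response.status_code)
print(response.headers.get("X-Crawlera-Version"))  # typically present when Crawlera handled the request
print(response.headers.get("X-Crawlera-Error"))    # set when Crawlera rejected or failed the request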

This works, but I can't for the life of me get the crawler to work with the Crawlera middleware.

I want to get the same results using Crawlera, because without it I get banned quickly.

Please help.


There are 2 answers below.

Answer by Gallaecio

You are missing CRAWLERA_ENABLED = True in your settings.

See the Configuration section of the scrapy-crawlera documentation for more information.
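
For instance, a minimal settings sketch with the missing flag added (keeping the API key placeholder from the question; 610 is the middleware priority the scrapy-crawlera docs suggest, though the 300 used above should also work once the middleware is enabled):

CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<apikey>'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610,
}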

Answer by Georgiy

The data from the logs doesn't match the issue description. In both cases the spider used the Crawlera proxy, because both logs contain this line:

INFO    [root] Using crawlera at http://proxy.crawlera.com:8010 (user: 11b143d...)

According to the scrapy_crawlera.CrawleraMiddleware source code, this means CrawleraMiddleware was enabled in both cases. I need additional data from the logs (at least the stats, i.e. the final lines of the log that contain the stats dump).

Currently I have the following assumption:
According to the first log you didn't override the cookie settings, so CookiesMiddleware was enabled. Cookie handling is enabled by default in Scrapy, and websites generally use cookies to track visitor activity and sessions. If a website receives requests carrying a single session ID from multiple IPs (as any spider does with Crawlera and cookies both enabled), the webserver can detect the proxy usage and ban every IP it has seen for that session ID. In that case the spider stops working because of the IP ban (and other Crawlera users may be unable to send requests to that site for some time).
Cookies can be disabled by setting COOKIES_ENABLED to False, as in the sketch below.
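
A minimal sketch of that change in the project's settings.py:

COOKIES_ENABLED = False  # don't replay the same session cookie across rotating Crawlera IPs

If the target site genuinely requires cookies, Crawlera sessions (the X-Crawlera-Session request header) are the usual workaround, since they keep related requests on a single outgoing IP.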