Mocking the requests for testing in Scrapy Spider


My objective is to test a spider written with Scrapy (Python). I tried using contracts, but they are quite limited: I cannot test things like pagination or whether particular attributes are extracted correctly.

def parse(self, response):
    """ This function parses a sample response. Some contracts are mingled
    with this docstring.

    @url http://someurl.com
    @returns items 1 16
    @returns requests 0 0
    @scrapes Title Author Year Price
    """

So the second idea is to mock all the requests that the spider makes in one run and use the recorded responses in the testing phase to check against expected results. However, I am unsure how to mock every request made by the spider. I looked into various libraries, and one of them is betamax. However, it only supports HTTP requests made by Python's requests client (as mentioned here). There is another library, vcrpy, but it also supports only a limited set of clients.

Are you using Requests? If you’re not using Requests, Betamax is not for you. You should checkout VCRpy. Are you using Sessions or are you using the functional API (e.g., requests.get)?

The last option is to manually record all the requests and store them somehow, but that is not really feasible at the scale at which the spider makes requests.

Does scrapy.Request use some underlying Python client that can be used to mock those requests? Or is there any other way to mock all the HTTP requests made by the spider in one run and use them to test the spider for expected behavior?


There is 1 best solution below


Scrapy has built-in support for HTTP caching, which can be used to cache all the responses and largely eliminates the need to mock them.

HttpCacheMiddleware is configured through several settings. Some of them are shown below; they go in the settings.py of a Scrapy project.

# Cache settings
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0  # Never Expire
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = [301, 302, 404]
HTTPCACHE_IGNORE_MISSING = False
HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS = ["no-cache", "no-store"]

With these settings, the cached responses are stored on disk in the directory named by HTTPCACHE_DIR. The whole list of options is in the Scrapy documentation for HttpCacheMiddleware.
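
Once the cache has been filled by one normal crawl, a test run can be kept entirely offline by flipping HTTPCACHE_IGNORE_MISSING. The following is only a sketch of such an override, not part of the original answer; the module name test_settings.py is a hypothetical choice, and the other values simply repeat the settings above.

```python
# test_settings.py (hypothetical module name): settings for offline test runs.
# Assumes the cache was already populated by a normal crawl with the settings above.

HTTPCACHE_ENABLED = True
# DummyPolicy caches every response and replays it without revalidation.
HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
HTTPCACHE_DIR = "httpcache"
HTTPCACHE_EXPIRATION_SECS = 0  # cached responses never expire

# Requests that are not found in the cache are ignored instead of downloaded,
# so the run does not touch the network.
HTTPCACHE_IGNORE_MISSING = True
```

The same switch can also be passed for a single run with the -s option, for example scrapy crawl myspider -s HTTPCACHE_IGNORE_MISSING=True (myspider is a placeholder name). Responses whose status code is listed in HTTPCACHE_IGNORE_HTTP_CODES are never stored, so the requests that produced them will be skipped in such a run.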