My objective is to test a spider written with Scrapy (Python). I tried using contracts, but they are really limited in the sense that I cannot test things like pagination or whether certain attributes are extracted correctly.
def parse(self, response):
    """ This function parses a sample response. Some contracts are mingled
    with this docstring.

    @url http://someurl.com
    @returns items 1 16
    @returns requests 0 0
    @scrapes Title Author Year Price
    """
So the second idea is to mock all the requests that the spider makes in one run and use them in the testing phase to check against expected results. However, I am unsure how I can mock every request that the spider makes. I looked into various libraries; one of them is betamax, but it only supports HTTP requests made by Python's requests client (as mentioned here). There is another library, vcrpy, but it too supports only a limited set of clients.
Are you using Requests? If you’re not using Requests, Betamax is not for you. You should checkout VCRpy. Are you using Sessions or are you using the functional API (e.g., requests.get)?
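For concreteness, this is roughly what I picture the testing phase looking like once responses have been captured somehow; a minimal sketch, where MySpider, the import path, and the saved page tests/pages/sample.html are just placeholders:

from scrapy.http import HtmlResponse, Request
from myproject.spiders.sample import MySpider  # placeholder import for the spider under test

def fake_response_from_file(file_path, url="http://someurl.com"):
    # Build a Scrapy response object from a locally saved copy of a page.
    with open(file_path, "rb") as f:
        body = f.read()
    return HtmlResponse(url=url, request=Request(url=url), body=body, encoding="utf-8")

def test_parse_extracts_items():
    spider = MySpider()
    results = list(spider.parse(fake_response_from_file("tests/pages/sample.html")))
    # Separate yielded items from follow-up requests and check the expected counts.
    items = [r for r in results if not isinstance(r, Request)]
    assert 1 <= len(items) <= 16

But this only covers a single saved page; what I am missing is how to capture every request the spider makes over a full run.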
The last option is to manually record all the requests and somehow store them, but that's not really feasible at the scale at which the spider makes requests.
Does scrapy.Request use some underlying Python client that could be used to mock those requests? Or is there any other way I can mock all the HTTP requests made by the spider in one run and use that to test the spider for expected behavior?
So, Scrapy has built-in support for caching, which can be used to cache all the responses, and that really eliminates the need to mock the responses. There are various settings found in HttpCacheMiddleware; some of them are as follows (to be included in settings.py for a Scrapy project).
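A minimal sketch of those settings (the values shown are the usual defaults and can be adjusted):

# Enable the HTTP cache (it is off by default).
HTTPCACHE_ENABLED = True
# How long cached responses stay valid, in seconds; 0 means never expire.
HTTPCACHE_EXPIRATION_SECS = 0
# Directory where cached responses are stored (relative to the project's .scrapy dir).
HTTPCACHE_DIR = 'httpcache'
# HTTP status codes that should never be cached.
HTTPCACHE_IGNORE_HTTP_CODES = []
# Storage backend; the filesystem backend keeps one folder per request fingerprint.
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'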
This also stores the cache in a specified directory. Here is the whole list of options.