As you have probably gathered from the title, I am using Scrapy and XPath to extract data. I feed the XPath expressions to the spider from a file (to keep the spider generic, so I don't have to edit it often), and I am able to extract the data in the required format.
Now I want to verify that each XPath expression is still valid against the web page specified in the spider (if the page's HTML has changed, my XPath will no longer match). I want to run this check before the spider starts.
How do I test the correctness of my XPaths? Is there some way to do this kind of truth testing? Please help.
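To be clear, I know I can at least catch outright syntax errors by compiling each expression with lxml (the library Scrapy's selectors are built on), roughly like this (the expression shown is just a placeholder):

    from lxml import etree

    # etree.XPath compiles an expression without needing a document;
    # a malformed expression raises XPathSyntaxError immediately.
    try:
        etree.XPath("//table[@class='prices']//td")  # placeholder expression
    except etree.XPathSyntaxError as err:
        print("bad XPath syntax:", err)

But a syntax check alone does not tell me whether the page's HTML structure has changed, which is the case I actually care about.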
import json

import scrapy

import gx  # supplies the XPath expressions (defined elsewhere in my code)


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["file:///<filepath>.html"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.mt = ""

    def parse(self, response):
        respDta = dict()
        it_lst = []
        # gx.spcPth[0] selects the blocks to iterate over
        dtData = response.selector.xpath(gx.spcPth[0])
        for ra in dtData:
            commodityObj = ra.xpath(gx.spcPth[1])
            extracted = commodityObj.extract()  # renamed so it doesn't shadow the built-in list
            cmdNme = extracted[0].replace(u'\xa0', u' ')
            cmdNme = cmdNme.replace("Header text: ", '')
            # populate_item is defined elsewhere in the spider
            self.populate_item(response, respDta, cmdNme, it_lst, extracted[0])
        respDta["mt"] = self.mt
        jsonString = json.dumps(respDta, default=lambda o: o.__dict__)
        return jsonString
gx.spcPth comes from another function that supplies the XPath expressions, and it is used in many places throughout the rest of the code. I need to check every XPath expression before the spider starts, wherever one is used.
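What I have in mind is something like the pre-flight check below, which runs every expression from gx.spcPth against a saved copy of the page using parsel (the selector library Scrapy uses internally). The validate_xpaths name and the saved_page.html path are placeholders for illustration. Would something along these lines be reasonable?

    from parsel import Selector

    def validate_xpaths(expressions, html_path):
        """Return (expression, reason) pairs for XPaths that are broken or match nothing."""
        with open(html_path, encoding="utf-8") as f:
            sel = Selector(text=f.read())
        bad = []
        for expr in expressions:
            try:
                if not sel.xpath(expr):  # valid syntax, but nothing matched on this page
                    bad.append((expr, "no match"))
            except ValueError as err:    # parsel raises ValueError for invalid XPath
                bad.append((expr, str(err)))
        return bad

    # e.g. before starting the crawl:
    # problems = validate_xpaths(gx.spcPth, "saved_page.html")
    # if problems:
    #     raise SystemExit("stale XPaths: %r" % problems)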
I understand what you are trying to do; I just don't see why. Running the spider is itself your "testing" process. Simple as this: if the result of an XPath is empty when it should return something, then something is wrong. Why don't you just check the XPath results and use Scrapy's logging to flag the failure as a warning, error, or critical message, whatever you want? Simple as this:
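    # A minimal sketch of the check described above; gx.spcPth[0] stands in
    # for whichever expression you load, and self.logger is the logger that
    # every scrapy.Spider exposes.
    def parse(self, response):
        rows = response.xpath(gx.spcPth[0])
        if not rows:
            self.logger.critical("XPath %s matched nothing on %s",
                                 gx.spcPth[0], response.url)
            return
        # ... carry on extracting as usual ...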
After that, just run your spider and relax. When you see critical messages in the output log, you'll know it's time to shit bricks. Otherwise, everything is OK.