How to check whether an XPath expression is valid before scraping data with a Scrapy spider


As you have probably realized from the title, I am using Scrapy and XPath to extract data. I supply the XPaths to the spider from a file (to keep the spider generic, so I don't have to edit it often). This works, and I am able to extract the data in the required format.

Now I want to check whether each XPath expression is still valid for the web page specified in the spider (if the HTML structure has changed, my XPath will no longer be valid). I want to run this check before the spider starts.

How do I test my XPath's correctness? Is there any way to do truth testing? Please help.

import json

import scrapy

# `gx` is the asker's own module (not shown here); gx.spcPth holds the XPath strings.


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["file:///<filepath>.html"]

    def __init__(self):
        self.mt = ""

    def parse(self, response):
        respDta = dict()
        it_lst = []
        dtData = response.selector.xpath(gx.spcPth[0])
        for ra in dtData:
            comoodityObj = ra.xpath(gx.spcPth[1])
            # renamed from `list` so the builtin is not shadowed
            values = comoodityObj.extract()
            cmdNme = values[0].replace(u'\xa0', u' ')
            cmdNme = cmdNme.replace("Header text: ", '')
            self.populate_item(response, respDta, cmdNme, it_lst, values[0])
        respDta["mt"] = self.mt
        jsonString = json.dumps(respDta, default=lambda o: o.__dict__)
        return jsonString

gx.spcPth comes from another function, which provides the XPaths, and it is used in many places in the rest of the code. I need to check the XPath expressions before the spider starts, everywhere they are used in the code.

6

There are 6 best solutions below

6
On BEST ANSWER

I understand what you are trying to do, I just don't see why. Running the spider is at the same time your "testing" process: if the result of an XPath is empty when it should return something, then something is wrong. Why don't you just check the XPath results and use Scrapy's logging to mark it as a warning, error or critical, whatever you want? It is as simple as this:

import logging

logger = logging.getLogger(__name__)

somedata = response.xpath(my_supper_dupper_xpath)
# we know that this should have captured
# something, so we check
if not somedata:
    logger.critical("This should never happen, XPaths are all wrong, OMG!")
else:
    # do your actual parsing of the captured data,
    # that we now know exists
    pass

After that, just run your spider and relax. When you see critical messages in your output log, you'll know it's time to shit bricks. Otherwise, everything is OK.
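
The same check can be wrapped in a small helper so that every XPath read from the file goes through it. This is only a sketch: `report_empty_xpaths` and the `named_xpaths` mapping are hypothetical names, not part of Scrapy, and it assumes `response` is any object with an `.xpath()` method that returns an empty (falsy) result when nothing matches, as Scrapy responses and selectors do.

```python
import logging

logger = logging.getLogger(__name__)


def report_empty_xpaths(response, named_xpaths):
    """Return the names of the XPaths that matched nothing in `response`.

    `named_xpaths` maps a label to an XPath string (a hypothetical layout).
    Every empty result is also logged at CRITICAL level.
    """
    empty = []
    for name, xpath in named_xpaths.items():
        if not response.xpath(xpath):
            logger.critical("XPath %r (%s) matched nothing", name, xpath)
            empty.append(name)
    return empty
```

Calling it at the top of `parse` and returning early when the list is non-empty would keep the rest of the parsing code from running against a page whose layout has changed.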

0
On

This is a simple way to validate XPath syntax with Selectors (it does not tell you whether the expression matches anything on the page):

from scrapy.selector import Selector

try:
    my_xpath = '//div/some/xpath'
    Selector(text="").xpath(my_xpath)
    print("valid xpath")
except ValueError as e:
    print(e)
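
To run this check over every XPath loaded from the file before the spider is started, the try/except can be moved into a loop. A minimal sketch, assuming the XPaths are available as a list of strings; `scrapy_xpath_error`, `find_invalid_xpaths` and the `check` parameter are hypothetical names. It only catches expressions that fail to compile: an XPath that is well formed but no longer matches the page still passes.

```python
def scrapy_xpath_error(xpath):
    """Return the error message if `xpath` fails to compile, else None."""
    # Imported here so that find_invalid_xpaths can be exercised with a
    # different `check` function on a machine without Scrapy.
    from scrapy.selector import Selector

    try:
        # An empty document is enough: only the expression syntax matters.
        Selector(text="").xpath(xpath)
    except ValueError as e:
        return str(e)
    return None


def find_invalid_xpaths(xpaths, check=scrapy_xpath_error):
    """Map each XPath that fails `check` to its error message."""
    errors = {}
    for xpath in xpaths:
        message = check(xpath)
        if message is not None:
            errors[xpath] = message
    return errors
```

If the returned dict is non-empty, the calling script can report the broken expressions and stop before starting the crawl.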
1
On

The shell is the way to go. If needed, you can even invoke it from within your spider, as described in the documentation. I have found this useful sometimes.

0
On

The Scrapy shell is an interactive shell where you can try and debug your scraping code very quickly.

Ref: http://doc.scrapy.org/en/latest/topics/shell.html

The shell is used for testing XPath or CSS expressions, to see how they work and what data they extract from the web pages you’re trying to scrape.

5
On

Your best bet to test out how Scrapy will use the xpath you provided to the spider is to just use the Scrapy Shell.

$ scrapy shell <url>

That will give you a sel object that you can run xpaths against:

>>> sel.xpath('//title/text()')

If you want some really quick tests, install the "XPath Helper" Chrome extension. It's my favorite extension for testing out and determining xpaths very quickly:

XPath Helper

You simply visit a site in Chrome, press Ctrl+Shift+X, and type in an xpath. You'll see results on the right-hand side.

0
On

You should not only make sure that you get a 200 response code, but also check what the actual response looks like:

view(response)

Then, as JoneLinux said, you need to use

scrapy shell 'URL'

but instead of sel.xpath()

you should use:

response.xpath('//YourXpath...')