Scrapy response.xpath not returning anything for a query

808 Views Asked by At

I am using the scrapy shell to extract some text data. Here are the commands i gave in the scrapy shell:

>>> scrapy shell "http://jobs.parklandcareers.com/dallas/nursing/jobid6541851-nurse-resident-cardiopulmonary-icu-feb2015-nurse-residency-requires-contract-jobs"

>>> response.xpath('//*[@id="jobDesc"]/span[1]/text()')
[<Selector xpath='//*[@id="jobDesc"]/span[1]/text()' data=u'Dallas, TX'>]
>>> response.xpath('//*[@id="jobDesc"]/span[2]/p/text()[2]')
[<Selector xpath='//*[@id="jobDesc"]/span[2]/p/text()[2]' data=u'Responsible for attending assigned nursi'>]
>>> response.xpath('//*[@id="jobDesc"]/span[2]/p/text()[preceding-sibling::*="Education"][following-sibling::*="Certification"]')
[]

The third command is not returning any data. I was trying to extract data between 2 keywords in the command. Where am i wrong?

2

There are 2 best solutions below

2
On BEST ANSWER

//*[@id="jobDesc"]/span[2]/p/text() would return you a list of text nodes. You can filter the relevant nodes in Python. Here's how you can get the text between "Education/Experience:" and "Certification/Registration/Licensure:" text paragraphs:

>>> result = response.xpath('//*[@id="jobDesc"]/span[2]/p/text()').extract()
>>> start = result.index('Education/Experience:')
>>> end = result.index('Certification/Registration/Licensure:')
>>> print ''.join(result[start+1:end])
- Must be a graduate from an accredited school of Nursing.  

UPD (regarding an additional question in comments):

>>> response.xpath('//*[@id="jobDesc"]/span[3]/text()').re('Job ID: (\d+)')
[u'143112']
0
On

Try:

substring-before(
  substring-after('//*[@id="jobDesc"]/span[2]/p/text()', 'Education'), 'Certification')

Note: I couldn't test it.

The idea is that you cannot use preceding-sibling and following-sibling because you look in the same text node. You have to extract the text part that you want using substring-before() and substring-after()

By combining those two functions, you select what is in between.