I am using code
response.xpath("//*[contains(text(), 'Role')]/parent/parent/descendant::td//text()").extract()
to select the text() content of every td in the rows that follow the row where the word 'Role' is found, in the following HTML table:
<table class="wh_preview_detail" border="1">
<tr>
<th colspan="3">
<span class="wh_preview_detail_heading">Names</span>
</th>
</tr>
<tr>
<th>Role</th>
<th>Name No</th>
<th>Name</th>
</tr>
<tr>
<td>Requestor</td>
<td>589528</td>
<td>John</td>
</tr>
<tr>
<td>Helper</td>
<td>589528</td>
<td>Mary</td>
</tr>
</table>
The 'Role' keyword is only acting as an identifier for the table.
In this case I'm expecting results:
['Requestor', '589528', 'John', ...]
However, I get an empty list when I run this in Scrapy.
My aim is ultimately to group the elements again as records. I have spent a few hours trying others' examples and experimenting in the terminal and in Chrome, but anything beyond 'simple' XPath is beyond me right now. I am looking to understand XPath, so ideally I would like a generalised answer with an explanation; that way I can learn and also share. Thank you kindly.
As general advice, it's usually easier to craft your XPath expression by going down the tree, step by step, instead of selecting //typeiwant all the way down and adding predicates for what came before in the tree (with parent or ancestor).
Let's look at how to solve your use case with Scrapy selectors:
Then, the rows you're interested in are at the same tree level as that <tr> with "Role". In XPath terms, these <tr> elements are along the following-sibling axis.
So you have each row, each row having 3 cells, to map to 3 fields: