In Python, how could I count the nodes using XPath? For example, using this webpage and this code:
from lxml import html, etree
import requests
url = "http://intelligencesquaredus.org/debates/past-debates/item/587-islam-is-dominated-by-radicals"
r = requests.get(url)
tree = html.fromstring(r.content)
count = tree.xpath('count(//*[@id="body"])')
print(count)
It prints 1, but the page has 5 div nodes.
Please explain this to me, and how can I do this correctly?
It prints 1 (or 1.0) because there is just one element with id="body" in the HTML file you are fetching. I downloaded the file and verified this is the case: fetching the URL grabs a file named 587-islam-is-dominated-by-radicals, and counting the occurrences of id="body" in it answers 1. Just to be extra sure, I hand-searched the file as well, using vi. Just the one!
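If the five div nodes you have in mind sit inside that id="body" element, the XPath has to select those children rather than the element itself. Here is a minimal sketch of the difference. It uses made-up markup in place of the real page, so the structure (five div children directly under id="body") is an assumption, not something checked against the live site:

```python
from lxml import html

# Made-up markup standing in for the real page: one id="body" element
# that contains five div children.
page = """
<html><body>
  <div id="body">
    <div>one</div><div>two</div><div>three</div><div>four</div><div>five</div>
  </div>
</body></html>
"""

tree = html.fromstring(page)

# Counts the elements whose id is "body"; XPath count() comes back as a float.
body_count = tree.xpath('count(//*[@id="body"])')

# Counts the div elements that are direct children of that element.
child_div_count = tree.xpath('count(//*[@id="body"]/div)')

# Same idea without count(): select the nodes and take the length of the list.
child_divs = tree.xpath('//*[@id="body"]/div')

print(body_count, child_div_count, len(child_divs))
```

If the divs are nested deeper, `//*[@id="body"]//div` (double slash) would count all descendants instead of only direct children.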
Perhaps you are looking for another div node? One with a different id?

Update: By the way, XPath and other HTML/XML parsing is pretty challenging to work with. There is a lot of bad data out there, and a lot of complex markup, multiplied by the complexity of the retrieval, parsing, and traversal process. You will probably be running your tests and trials many times, and they will be a lot faster if you do not "hit the net" for every one of them. Cache the live results. Raw code looks something like this:
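A minimal sketch of that idea, assuming a hypothetical helper `fetch_cached` that stores each page in a local directory named after a hash of its URL. The helper name, the `page_cache` directory, and the `fetch` parameter are all illustrative choices, not part of requests or lxml:

```python
import hashlib
import os


def download(url):
    """Fetch the page body over the network (the only step that hits the net)."""
    import requests  # imported lazily so cached reads do not need it
    return requests.get(url).content


def fetch_cached(url, cache_dir="page_cache", fetch=download):
    """Return the bytes for url, reading a local copy when one exists."""
    os.makedirs(cache_dir, exist_ok=True)
    # Hash the URL to get a safe, unique file name for the cached copy.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, name)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()
    # Cache miss: download once, then save for later runs.
    content = fetch(url)
    with open(path, "wb") as f:
        f.write(content)
    return content
```

With this in place, `html.fromstring(fetch_cached(url))` would parse the saved copy on every run after the first. Note the sketch never expires entries, so delete the cache directory when you want fresh data.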
But you can simplify a lot of that by using a generic caching front-end to requests, such as requests-cache. Happy parsing!