I was using Python 3.8, XPath and Scrapy, where things just seemed to work. I took my XPath expressions for granted.
Now I'm just using Python 3.8, XPath and lxml.html, and things are much less forgiving. For example, using this URL and this XPath:
//dt[text()='Services/Products']/following-sibling::dd[1]
With Scrapy, this would return a paragraph or a list depending on what the inner HTML was. This is how I am attempting to extract the text now:
data = response.text
tree = html.fromstring(data)
Services_Product = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")
This returns Services_Product[], which is a list of "li" elements for this page, but at other times this field can be any of these:
<dd>some text</dd>
or
<dd><p>some text</p></dd>
or
<dd>
<ul>
<li>some text</li>
<li>some text</li>
</ul>
</dd>
or
<dd>
<ul>
<li><p>some text</p></li>
<li><p>some text</p></li>
</ul>
</dd>
What is the best practice for extracting text from situations like this where the target field can be a number of different things?
I used this test code to see what my options are:
from lxml import html

# Read the saved page and parse it
with open('html_01.txt', 'r') as file:
    data = file.read()
tree = html.fromstring(data)
Services_Product = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")
stuff = Services_Product[0].xpath("//li")
for elem in stuff:
    print(elem[0][0].text)
That returned this: Health Health doctors Health doctors
Which is not correct. Here's a screenshot of it in Google Chrome: [screenshot: the XPath tool in Google Chrome along with the HTML in question]
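One likely contributor to the wrong output, as far as I can tell: in lxml an XPath that starts with "//" is evaluated from the document root even when it is called on an element, so "//li" matches every "li" on the page rather than only those inside the selected "dd". A minimal sketch of the difference, using made-up markup (the "outside", "first" and "second" items are hypothetical, not taken from the real page):

```python
from lxml import html

# Hypothetical markup: one <li> outside the target <dd>, two inside it.
doc = html.fromstring("""<html><body>
  <ul><li>outside</li></ul>
  <dl>
    <dt>Services/Products</dt>
    <dd><ul><li>first</li><li>second</li></ul></dd>
  </dl>
</body></html>""")

dd = doc.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")[0]

# "//li" starts at the document root, so it also finds the unrelated <li>.
absolute = [li.text for li in dd.xpath("//li")]
# ".//li" starts at the context node, so it only finds <li> inside dd.
relative = [li.text for li in dd.xpath(".//li")]

print(absolute)
print(relative)
```

The leading dot in ".//li" is what keeps the search inside the element the expression is called on.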
What's the best way to scrape this data using Python and XPath, or with other options? Thank you.
After spending hours googling and then writing the post above, it just came to me. Old code:
and new code that returns a nice list of text:
Adding "/text()" to the end fixed it.
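For anyone with the same problem, here is a rough sketch of one way to cover all four shapes of the "dd" shown earlier. It is not the exact code I ended up with. It assumes the "dt"/"dd" pair sits inside a "dl", and the helper name extract_texts is made up for illustration. It uses ".//text()" (descendant text nodes, relative to the "dd") rather than a single trailing "/text()", which only returns text that is a direct child of the element.

```python
from lxml import html

XPATH = "//dt[text()='Services/Products']/following-sibling::dd[1]"

def extract_texts(markup):
    """Return the non-empty text fragments of the first matching <dd>."""
    tree = html.fromstring(markup)
    matches = tree.xpath(XPATH)
    if not matches:
        return []
    # ".//text()" yields every descendant text node, whatever tags wrap it.
    return [t.strip() for t in matches[0].xpath(".//text()") if t.strip()]

# The four shapes from the question, wrapped in an assumed <dl>.
variants = [
    "<dl><dt>Services/Products</dt><dd>some text</dd></dl>",
    "<dl><dt>Services/Products</dt><dd><p>some text</p></dd></dl>",
    "<dl><dt>Services/Products</dt><dd><ul><li>some text</li>"
    "<li>some text</li></ul></dd></dl>",
    "<dl><dt>Services/Products</dt><dd><ul><li><p>some text</p></li>"
    "<li><p>some text</p></li></ul></dd></dl>",
]

results = [extract_texts(v) for v in variants]
for r in results:
    print(r)
```

Stripping and dropping empty strings gets rid of the whitespace-only text nodes that sit between tags in indented HTML. If a single string is preferred over a list, calling text_content() on the "dd" element is another option.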