Best XPath practices for extracting data from a field that varies in format

I was using Python 3.8, XPath and Scrapy where things just seemed to work. I took my XPath expressions for granted.

Now I'm just using Python 3.8, XPath and lxml.html, and things are much less forgiving. For example, using this URL and this XPath:

//dt[text()='Services/Products']/following-sibling::dd[1]

I would get back a paragraph or a list depending on what the inner HTML was. This is how I am attempting to extract the text now:

from lxml import html

data = response.text
tree = html.fromstring(data)
Services_Product = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")

This returns Services_Product[], which is a list of "li" elements for this page, but at other times this field can be any of these:

<dd>some text</dd>
or
<dd><p>some text</p></dd>
or
<dd>
  <ul>
    <li>some text</li>
    <li>some text</li>
  </ul>
</dd>
or
<dd>
  <ul>
    <li><p>some text</p></li>
    <li><p>some text</p></li>
  </ul>
</dd>

What is the best practice for extracting text from situations like this where the target field can be a number of different things?

I used this test code to see what my options are:

file = open('html_01.txt', 'r')
data = file.read()
tree = html.fromstring(data)
Services_Product = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")
stuff = Services_Product[0].xpath("//li")
for elem in stuff:
    print(elem[0][0].text)

That returned this: Health Health doctors Health doctors

Which is not correct. Here's a screenshot of it in Google Chrome: [screenshot: the XPath tool in Google Chrome along with the HTML in question]

What's the best way to scrape this data using Python and XPath, or are there other options? Thank you.

There is 1 solution below

After spending hours googling and then writing the post above, it just came to me. Old code:

Services_Product = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")
stuff = Services_Product[0].xpath("//li")

and new code that returns a nice list of text:

Services_Product = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")
stuff = Services_Product[0].xpath("//li/text()")

Adding "/text()" to the end fixed it.
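Two caveats may be worth keeping in mind for the other shapes of this field listed in the question. An XPath that starts with "//" is evaluated from the document root even when it is called on an element, so "//li" can pick up list items from elsewhere on the page; a leading "." keeps the search inside the matched dd. Also, "li/text()" only returns text that sits directly inside the li, so it would miss the li > p variant. The following is a minimal sketch, not a definitive implementation, that tries to handle all four variants. The function name extract_items and the sample page are hypothetical; text_content() is the lxml.html element method that joins all descendant text.

```python
from lxml import html

# Same XPath as in the question, locating the <dd> that follows the label.
DD_XPATH = "//dt[text()='Services/Products']/following-sibling::dd[1]"


def extract_items(tree):
    """Return the field's text as a list of strings, whatever sits inside <dd>."""
    matches = tree.xpath(DD_XPATH)
    if not matches:
        return []
    dd = matches[0]
    # The leading "." keeps the search inside this <dd>; a bare "//li"
    # would search the whole document again.
    items = dd.xpath(".//li")
    # No list inside: treat the <dd> itself as the single item.
    nodes = items if items else [dd]
    # text_content() joins all descendant text, so <li><p>...</p></li>
    # works too; split/join collapses runs of whitespace.
    texts = (" ".join(node.text_content().split()) for node in nodes)
    return [text for text in texts if text]


# Hypothetical sample page: an unrelated list plus the target field.
page = """
<html><body>
<ul><li>site navigation</li></ul>
<dl>
  <dt>Services/Products</dt>
  <dd><ul><li><p>Health</p></li><li><p>doctors</p></li></ul></dd>
</dl>
</body></html>
"""

print(extract_items(html.fromstring(page)))  # ['Health', 'doctors']
```

Returning a list in every case (one element for the plain-text and paragraph variants) keeps the calling code the same regardless of which markup a given page uses.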