Python, lxml.html: Need a generic function to return the inner HTML of any element


I found a nice function here by Siva Kannan, but it's not working in my case. I'm using lxml.html, not etree, to get the data from the page. When I use etree I get this exception:

lxml.etree.XMLSyntaxError: error parsing attribute name

Below is his example, modified to first get data from a Yellow Pages page and then attempt to get the inner HTML from a specific div tag.

Any help would be great and should help many people.

Thank you

from lxml import etree
import requests, time, socket
import lxml.html as lxml

def innerXML(elem):
    elemName = elem.xpath('name(/*)')
    resultStr = ''
    for e in elem.xpath('/'+ elemName + '/node()'):
        if(isinstance(e, str) ):
            resultStr = resultStr + ''
        else:
            resultStr = resultStr + etree.tostring(e, encoding='unicode')

    return resultStr

# This works nicely, but not for my data
# XMLElem = etree.fromstring("<div>I am<xxxxxx>Jhon <last.xxxxx> Corner</last.xxxxx></xxxxxx>.I    work as <job>software engineer</job><end meta='bio' />.</div>")
# print(innerXML(XMLElem))

response = requests.get('https://www.yellowpages.com/washington-dc/mip/bnsf-railway-496598824')
data = response.text
# The next line is how I need to get data for all my work.
# tree = lxml.fromstring(data)

# Siva Kannan's way
tree = etree.fromstring(data)
div_node = tree.xpath("//dd[@class='open-hours']")
# div_node = tree.xpath("//dd[@class='open-hours']//div")  # When using lxml.fromstring (my normal code) this returns a list when using 

div_html = innerXML(div_node)
print(div_html)

There is 1 best solution below


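Remove the from lxml import etree line, replace etree.tostring with lxml.tostring, and replace etree.fromstring with lxml.fromstring. Here lxml is the alias the question creates with import lxml.html as lxml, so these calls use the HTML parser, which tolerates markup that the strict XML parser rejects.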

As a side note, this code will also produce an error because tree.xpath returns a list of nodes rather than a single node, so div_node is a list. That is easy to fix: pass one element, such as div_node[0], to innerXML.
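
A small sketch of handling that list, assuming there may be zero, one, or several matches on the page. The markup is invented for the example, and text_content() is used only to keep it short.

```python
import lxml.html

# Hypothetical markup with two matching elements.
page = (
    "<dl>"
    "<dd class='open-hours'><div>Mon: 9-5</div></dd>"
    "<dd class='open-hours'><div>Tue: 9-5</div></dd>"
    "</dl>"
)
tree = lxml.html.fromstring(page)

# xpath() with a node-selecting expression returns a list, even for one match.
matches = tree.xpath("//dd[@class='open-hours']")

# Guard against an empty result before indexing.
first = matches[0] if matches else None

# Or process every match in turn.
hours = [node.text_content() for node in matches]
print(len(matches), hours)
```

Checking for an empty list first avoids an IndexError on pages that have no open-hours block.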