I found a nice function here by Siva Kannan, but it's not working in my case. I'm using lxml.html, not etree, to get the data from the page. When I use etree I get the exception:
lxml.etree.XMLSyntaxError: error parsing attribute name
Below is his example, modified to first fetch a yellowpages page and then attempt to get the inner HTML of a specific div tag.
Any help would be great and should help many people.
Thank you
from lxml import etree
import requests, time, socket
import lxml.html as lxml
def innerXML(elem):
    elemName = elem.xpath('name(/*)')
    resultStr = ''
    for e in elem.xpath('/'+ elemName + '/node()'):
        if(isinstance(e, str) ):
            resultStr = resultStr + ''
        else:
            resultStr = resultStr + etree.tostring(e, encoding='unicode')
    return resultStr
# This works nicely, but not with my data
# XMLElem = etree.fromstring("<div>I am<xxxxxx>Jhon <last.xxxxx> Corner</last.xxxxx></xxxxxx>.I work as <job>software engineer</job><end meta='bio' />.</div>")
# print(innerXML(XMLElem))
response = requests.get('https://www.yellowpages.com/washington-dc/mip/bnsf-railway-496598824')
data = response.text
# The next line is how I need to get data for all my work.
# tree = lxml.fromstring(data)
# Siva Kannan's way
tree = etree.fromstring(data)
div_node = tree.xpath("//dd[@class='open-hours']")
# div_node = tree.xpath("//dd[@class='open-hours']//div") # When using lxml.fromstring (my normal code) this returns a list when using
div_html = innerXML(div_node)
print(div_html)
Erase `from lxml import etree`, replace `etree.tostring` with `lxml.tostring`, and replace `etree.fromstring` with `lxml.fromstring`. As a side note, this code will also produce an error because `div_node` will be a list of nodes rather than a node, but that should be easy to fix.