Here is some HTML code from http://chem.sis.nlm.nih.gov/chemidplus/rn/75-07-0 in Google Chrome that I want to parse the website for some project.
<div id="names">
<h2>Names and Synonyms</h2>
<div class="ds"><button class="toggle1Col"title="Toggle display between 1 column of wider results and multiple columns.">↔</button>
<h3 id="yui_3_18_1_3_1434394159641_407">Name of Substance</h3>
<ul>
<li id="ds2">
`` <div>Acetaldehyde</div>
</li>
</ul>
</div>
I wrote a python script to help me do such a thing by grabbing the name under one of the sections, but it just isn't returning the name. I think it's my xpath query, suggestions?
from lxml import html
import requests
import csv
names1 = []
page = requests.get('http://chem.sis.nlm.nih.gov/chemidplus/rn/75-07-0')
tree = html.fromstring(page.text)
//This will grab the name data
names = tree.xpath('//*[@id="yui_3_18_1_3_1434380225687_700"]')
//Print the name data
print 'Names: ', names
//Convert the data into a string
names1.append(names)
//Print the bit length
print len(names1)
//Write it to csv
b = open('testchem.csv', 'wb')
a = csv.writer(b)
a.writerows(names1)
b.close()
print "The end"
It is important to inspect the string returned by
page.text
and not just rely on the page source as returned by your Chrome browser. Web sites can return different content depending on theUser-Agent
, and moreover, GUI browsers such as your Chrome browser may change the content by executing JavaScript while in contrast,requests.get
does not.If you write the contents to a file
and use a text editor to search for
"yui_3_18_1_3_1434380225687_700"
you'll find that there is no tag with that attribute value.If instead you search for
Name of Substance
you'll findTherefore, instead you could use:
How this XPath was found:
Starting from the
<h3>
tag:The
<div>
tag that we want is not a child but rather it is a subchild of the parent of<h3>
. Therefore, go up to the parent:and then use
//div
to search for all<div>
s inside the parent:The first
div
is the one that we want:and we can extract the text using the
text_content
method: