Python, lxml and XPath: returns "[<Element x at 0x29a9998>]" rather than expected value

I'm trying to scrape TD Asset Management pages (example below; I can't post more than two links) in order to retrieve the "price as on" value, i.e. the dollar amount in this snippet of HTML:

<div class="td-layout-grid9 td-layout-column td-layout-column-first">
Price As On: Jun 12, 2015
<br>
<strong>$14.54  </strong>
<strong class="td-copy-red">-0.01 (-0.07%)</strong>
</div>

I was hoping to achieve this with Python, requests, lxml, and XPath, which I installed as follows:

apt-get update
apt-get install python python-pip python-dev gcc build-essential libxml2-dev libxslt-dev libffi-dev libssl-dev
pip install lxml
pip install requests
pip install requests[security]

Next, to retrieve the page I did this:

python
>>> from lxml import html
>>> import requests
>>> page = requests.get('https://www.tdassetmanagement.com/fundDetails.form?fundId=6320&lang=en')
>>> tree = html.fromstring(page.text)

Finally, I tried to retrieve the desired dollar value using the XPath of the relevant element as obtained from Chrome's "Inspect Element" tool:

>>> price = tree.xpath('//*[@id="fundCardVO"]/div[2]/div[1]/div[1]/div[1]/strong[1]')
>>> print price

Unfortunately the result is [<Element strong at 0x29a9998>] rather than the expected dollar amount $14.54&nbsp;&nbsp;.

To ensure that the expected data was retrieved by the initial "requests.get", I ran this:

>>> print page.content

The result can be seen here: http://pastebin.com/f5C4MFQb.

If I paste the above HTML into this tool: http://videlibri.sourceforge.net/cgi-bin/xidelcgi my XPath query //*[@id="fundCardVO"]/div[2]/div[1]/div[1]/div[1]/strong[1] returns the dollar amount as expected.

Any hints or tips as to how I might be able to use Python, lxml, and XPath to retrieve the desired value for this element would be very much appreciated. If there's a completely different way that I could be going about this to obtain the same result I would be interested in that too.

Thanks.

There are 2 best solutions below

After further Googling to find out what elements are (xpath() returns a list of Element objects, each with attributes like tag and text), followed by more Googling regarding a UnicodeEncodeError (see "UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)"), I was able to obtain my desired value with this:

>>> priceelement = tree.xpath('//*[@id="fundCardVO"]/div[2]/div[1]/div[1]/div[1]/strong[1]')
>>> priceascii = priceelement[0].text
>>> price = priceascii.encode('utf-8')
>>> print price
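
For comparison, here is a minimal Python 3 sketch of the same idea. It is not the session above: it assumes Python 3 (where no explicit encode is needed before printing), parses an inline copy of the HTML fragment from the question instead of fetching the page, and uses a shortened XPath to match that reduced markup. It shows three ways to get a string rather than an Element: indexing the result list and reading .text, selecting the text node with /text(), and wrapping the path in XPath's string() function.

```python
from lxml import html

# Inline stand-in for the real page (assumption: only the relevant fragment,
# wrapped in a div carrying the id used in the question's XPath).
SNIPPET = """
<html><body>
<div id="fundCardVO">
  <div class="td-layout-grid9 td-layout-column td-layout-column-first">
    Price As On: Jun 12, 2015
    <br>
    <strong>$14.54&nbsp;&nbsp;</strong>
    <strong class="td-copy-red">-0.01 (-0.07%)</strong>
  </div>
</div>
</body></html>
"""

tree = html.fromstring(SNIPPET)
path = '//*[@id="fundCardVO"]/div/strong[1]'  # shortened for this fragment

# 1. xpath() returns a list of Element objects: index it, then read .text
price_text = tree.xpath(path)[0].text

# 2. Select the text node itself, which yields a list of strings
price_via_text_node = tree.xpath(path + '/text()')[0]

# 3. Let XPath's string() function convert the first match to a string
price_via_string = tree.xpath('string(' + path + ')')

# repr() makes the trailing non-breaking spaces (\xa0) visible
print(repr(price_text))
```

All three variables hold the same string here, including the two trailing non-breaking spaces that come from the &nbsp; entities.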

Thanks for nudging me in the right direction jonrsharpe.

I still was not able to determine how to obtain a list of the attributes available on the element, but tag and text were available.
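
One possible way to list what an element offers, sketched in Python 3 on a small inline fragment (the fragment and the variable names are made up for illustration): the HTML attributes live in element.attrib, which behaves like a dictionary, and Python's built-in dir() lists the Python-level attributes and methods such as tag, text and tail.

```python
from lxml import html

# Hypothetical fragment, reusing one of the <strong> tags from the question
wrapper = html.fromstring(
    '<div><strong class="td-copy-red">-0.01 (-0.07%)</strong></div>'
)
element = wrapper.find('strong')

# HTML attributes of the element (dictionary-like)
html_attributes = dict(element.attrib)
print(html_attributes)
print(element.get('class'))

# Python-level attributes and methods, skipping private names
public_names = [name for name in dir(element) if not name.startswith('_')]
print(public_names)
```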

I went on to get just the number (without the dollar symbol and trailing non-breaking spaces) with this:

>>> import re
>>> p = re.search(r'[0-9]{1,3}\.[0-9]{2}', price)
>>> price = p.group(0)
>>> print price
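
A caveat with the pattern above: it allows at most three digits before the decimal point, so a price such as 1,234.56 would not be captured in full. Below is a rough Python 3 sketch of a more tolerant variant. The helper name parse_price and the assumption that thousands are separated by commas are mine, not taken from the page. It returns a Decimal so the value can be used in arithmetic without floating-point rounding.

```python
import re
from decimal import Decimal


def parse_price(raw):
    """Return the first dollar amount in raw as a Decimal, or None."""
    # One leading digit, then any mix of digits and commas, then two decimals
    match = re.search(r'[0-9][0-9,]*\.[0-9]{2}', raw)
    if match is None:
        return None
    # Drop the (assumed) thousands separators before converting
    return Decimal(match.group(0).replace(',', ''))


# The .text value from the question, including trailing non-breaking spaces
print(parse_price('$14.54\xa0\xa0'))
```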

Use a for loop. xpath() returns a list of elements, so iterate over it and print each element's text:

for x in price:
    print(x.text)
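
A self-contained Python 3 sketch of that loop, assuming a small inline fragment in place of the real page and an XPath that matches both strong elements rather than only the first:

```python
from lxml import html

# Hypothetical stand-in for the fetched page
tree = html.fromstring(
    '<html><body><div>'
    '<strong>$14.54</strong>'
    '<strong class="td-copy-red">-0.01 (-0.07%)</strong>'
    '</div></body></html>'
)

# xpath() always returns a list here, even when only one element matches
matches = tree.xpath('//div/strong')

# Collect the text of every matched element, then print each one
texts = [element.text for element in matches]
for text in texts:
    print(text)
```

Looping also copes with an empty result list, whereas indexing with [0] would raise an IndexError if the XPath matched nothing.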