Python 3.4 : LXML web scraping

Question

478 Views Asked by Aran Freel At 19 October 2024 at 19:31

I am using the following code to try to return a list of tickers from the Wikipedia page below. The result is an empty list. I copied the XPath from the Chromium developer tools. What am I doing wrong?

from lxml import html
import requests


url = 'http://en.wikipedia.org/wiki/List_of_S%26P_500_companies'

resp = requests.get(url)
tree = html.fromstring(resp.text)

tickers = tree.xpath(r'//*[@id="mw-content-text"]/table[1]/tbody/tr[1]/td[1]/a')

print(tickers)

**Martijn Pieters** · Accepted Answer

Browsers insert missing HTML elements that the HTML specification says are part of the document model. lxml does not add those in.

The most common such element is the <tbody> element. Your document has no such element, but Chrome adds one, and it ends up in the XPath you copied. Another such element is the <thead> element; again, the original HTML lacks it, but Chrome adds it and places the one <tr> row with <th> elements inside it.

As such the correct XPath expression is:

tickers = tree.xpath(r'//*[@id="mw-content-text"]/table[1]/tr[2]/td[1]/a')

That is, the second row in the table (the first row holds the <th> headers), and the first table cell in that row.
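
The difference can be reproduced without fetching the page. The snippet below is a minimal sketch: the HTML string is a hypothetical stand-in for the Wikipedia table (two made-up data rows, no <tbody> or <thead> in the markup), not the real page source.

```python
from lxml import html

# Hypothetical stand-in for the Wikipedia markup: rows sit directly under
# <table>, with no <tbody> or <thead> element in the source.
SAMPLE = """<html><body>
<div id="mw-content-text">
  <table>
    <tr><th>Ticker symbol</th><th>Security</th></tr>
    <tr><td><a href="https://www.nyse.com/quote/XNYS:MMM">MMM</a></td><td>3M Company</td></tr>
    <tr><td><a href="https://www.nyse.com/quote/XNYS:ABT">ABT</a></td><td>Abbott Laboratories</td></tr>
  </table>
</div>
</body></html>"""

tree = html.fromstring(SAMPLE)

# XPath as copied from the browser: it expects a <tbody> that lxml does not add.
with_tbody = tree.xpath('//*[@id="mw-content-text"]/table[1]/tbody/tr[1]/td[1]/a')

# Same path without <tbody>, using tr[2] to skip the header row.
without_tbody = tree.xpath('//*[@id="mw-content-text"]/table[1]/tr[2]/td[1]/a')

print(with_tbody)                        # no match, so an empty list
print([a.text for a in without_tbody])   # text of the link in the second row
```

An XPath that matches nothing returns an empty list rather than raising an error, which is why the original code printed `[]` with no other hint.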

Note that lxml can load URLs directly; you don't really need to use requests in this specific case:

>>> from lxml import html
>>> url = 'http://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
>>> tree = html.parse(url)
>>> tree.xpath(r'//*[@id="mw-content-text"]/table[1]/tr[2]/td[1]/a')
[<Element a at 0x10445e628>]
>>> tree.xpath(r'//*[@id="mw-content-text"]/table[1]/tr[2]/td[1]/a')[0].text
'MMM'
>>> tree.xpath(r'//*[@id="mw-content-text"]/table[1]/tr[2]/td[1]/a')[0].attrib['href']
'https://www.nyse.com/quote/XNYS:MMM'
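
Note that html.parse() returns an ElementTree, whereas html.fromstring() returns an element. It also accepts a file-like object, which may help if direct URL loading is unavailable (depending on the underlying libxml2 build, HTTPS URLs may not load this way). The sketch below assumes the page bytes were already downloaded some other way; `page_bytes` is a hypothetical placeholder for that content, not the real page.

```python
import io
from lxml import html

# Hypothetical page bytes, standing in for e.g. `requests.get(url).content`.
page_bytes = b"""<html><body><div id="mw-content-text"><table>
<tr><th>Ticker symbol</th></tr>
<tr><td><a href="https://www.nyse.com/quote/XNYS:MMM">MMM</a></td></tr>
</table></div></body></html>"""

# html.parse() takes a filename, URL, or file-like object and returns an
# ElementTree; the tree object supports .xpath() just like an element does.
tree = html.parse(io.BytesIO(page_bytes))
link = tree.xpath('//*[@id="mw-content-text"]/table[1]/tr[2]/td[1]/a')[0]
print(link.text, link.attrib['href'])
```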

If you want all <a> elements in that first column, drop the positional predicate on the <tr> element (the [2] in the expression above, or the [1] in your original XPath) so that every row is selected:

links = tree.xpath(r'//*[@id="mw-content-text"]/table[1]/tr/td[1]/a')
for link in links:
    print(link.text, link.attrib['href'])
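
As a variation that is not part of the answer above, the descendant axis (`//tr`) makes the expression indifferent to whether <tbody> is present, so the same XPath works on raw source and on markup that already contains <tbody>. This is a sketch under assumptions: `first_column_links` is a hypothetical helper name and both HTML strings are made-up samples. Note that `//tr` would also match rows of any table nested inside the first one.

```python
from lxml import html

def first_column_links(page_source):
    """Return (text, href) pairs for links in the first column of the first table."""
    tree = html.fromstring(page_source)
    # `//tr` matches rows whether or not they sit inside <tbody>/<thead>.
    # Header rows drop out on their own: they contain <th> cells, not <td>.
    links = tree.xpath('//*[@id="mw-content-text"]/table[1]//tr/td[1]/a')
    return [(a.text, a.get('href')) for a in links]

# Hypothetical sample without <tbody>, like the raw page source.
BARE = (
    '<html><body><div id="mw-content-text"><table>'
    '<tr><th>Ticker symbol</th></tr>'
    '<tr><td><a href="https://www.nyse.com/quote/XNYS:MMM">MMM</a></td></tr>'
    '<tr><td><a href="https://www.nyse.com/quote/XNYS:ABT">ABT</a></td></tr>'
    '</table></div></body></html>'
)

# The same sample with an explicit <tbody>, like the DOM a browser shows.
WITH_TBODY = BARE.replace('<table>', '<table><tbody>').replace('</table>', '</tbody></table>')

bare_links = first_column_links(BARE)
tbody_links = first_column_links(WITH_TBODY)
print(bare_links)
print(bare_links == tbody_links)
```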
