Scraping a webpage with Python


I'm trying to learn to scrape a webpage (http://www.expressobeans.com/public/detail.php/185246), but I don't know what I'm doing wrong. I think it has to do with identifying the XPath, but how do I get the correct path (if that is the issue)? I've tried Firebug in Firefox as well as the Developer Tools in Chrome.

I want to be able to scrape the Manufacturer value (D&L Screenprinting) as well as all the Edition Details.

Python script:

from lxml import html
import requests

page = requests.get('http://www.expressobeans.com/public/detail.php/185246')

tree = html.fromstring(page.text)

buyers = tree.xpath('//*[@id="content"]/table/tbody/tr[2]/td/table/tbody/tr/td[1]/dl/dd[3]')

print(buyers)

returns:

[]

There are 2 best solutions below


I'd suggest looking at the page's HTML for a node close to the value you want and building your path from there; the result is shorter and easier to follow.

On that page there is a "dl" element with class "itemListingInfo", and all the information you are looking for sits under it.

Also, if you want the "D&L Screenprinting" text, you need to extract the text from the link.

Try this modified version; it should be straightforward to add further XPath expressions for the other fields.

from lxml import html
import requests

page = requests.get('http://www.expressobeans.com/public/detail.php/185246')

tree = html.fromstring(page.text)

buyers = tree.xpath('//dl[@class="itemListingInfo"]/dd[2]/a/text()')

print(buyers)
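
For the remaining fields, one option is to walk the dt/dd pairs inside that "dl" rather than writing one XPath per field. The following is only a sketch: the markup in SAMPLE is invented for illustration (the real page may be laid out differently), and listing_info is a hypothetical helper name, not part of lxml.

```python
from lxml import html

# Made-up markup, loosely modelled on a <dl class="itemListingInfo"> block.
# The labels and values are placeholders, not data taken from the real page.
SAMPLE = """
<html><body>
<dl class="itemListingInfo">
  <dt>Artist</dt><dd><a href="#">Some Artist</a></dd>
  <dt>Manufacturer</dt><dd><a href="#">D&amp;L Screenprinting</a></dd>
  <dt>Edition Details</dt><dd>placeholder edition text</dd>
</dl>
</body></html>
"""


def listing_info(tree):
    """Map each <dt> label to the text of the <dd> that directly follows it."""
    info = {}
    for dt in tree.xpath('//dl[@class="itemListingInfo"]/dt'):
        dd = dt.getnext()  # next sibling element, skipping whitespace text
        if dd is not None and dd.tag == 'dd':
            # text_content() also picks up text nested inside links
            info[dt.text_content().strip()] = dd.text_content().strip()
    return info


info = listing_info(html.fromstring(SAMPLE))
print(info['Manufacturer'])
```

To try it on the live page, pass html.fromstring(page.text) instead of the sample, and adjust the path if the labels there are not held in dt elements.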

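Remove tbody from the XPath. Browsers insert tbody elements when they build the DOM, so a path copied from Firebug or Chrome's Developer Tools often contains them even though the raw HTML that lxml parses does not.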

buyers = tree.xpath('//*[@id="content"]/table/tr[2]/td/table/tr/td[1]/dl/dd[3]')
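
A small sketch of the effect, using invented markup rather than the real page: the table in SAMPLE has no tbody in its source, so the path that includes tbody finds nothing when lxml parses it, while the same path without tbody matches.

```python
from lxml import html

# Made-up markup: a table written without <tbody>, as is common in raw HTML.
SAMPLE = """
<html><body>
<div id="content">
  <table>
    <tr><td>first row</td></tr>
    <tr><td>second row</td></tr>
  </table>
</div>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# Path as a browser's developer tools would typically report it.
with_tbody = tree.xpath('//*[@id="content"]/table/tbody/tr[2]/td/text()')

# Same path with the tbody step removed, matching the source as parsed by lxml.
without_tbody = tree.xpath('//*[@id="content"]/table/tr[2]/td/text()')

print(with_tbody)
print(without_tbody)
```

If a page does include tbody in its source, the first form is the one that matches, so it is worth checking the raw HTML (for example via "View Source") rather than the inspector's element tree.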