Python: getting prices of smartphones from website

I want to get the prices of smartphones from this website, http://tweakers.net (a Dutch site). The problem is that no prices actually come back from the site.

The text file 'TweakersTelefoons.txt' contains 3 entries:

samsung-galaxy-s6-32gb-zwart

lg-nexus-5x-32gb-zwart

huawei-nexus-6p-32gb-zwart

I'm using Python 2.7, and this is the code I used:

import urllib
import re

symbolfile = open("TweakersTelefoons.txt")
symbolslist = symbolfile.read()
symbolslist = symbolslist.split("\n")

for symbol in symbolslist:
    url = "http://tweakers.net/pricewatch/[^.]*/" +symbol+ ".html"
## http://tweakers.net/pricewatch/423541/samsung-galaxy-s6-32gb-zwart.html  is the original html

    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()

    regex = '<span itemprop="lowPrice">(.+?)</span>'
## <span itemprop="lowPrice">€ 471,95</span>  is what the original code looks like
    pattern = re.compile(regex)
    price = re.findall(pattern, htmltext)

    print "the price of", symbol, "is ", price

Output:

the price of samsung-galaxy-s6-32gb-zwart is []

the price of lg-nexus-5x-32gb-zwart is []

the price of huawei-nexus-6p-32gb-zwart is []

The prices are not shown. I tried using [^.] to get around the euro sign, but that didn't work.

Furthermore, it might matter that in Europe we use a "," instead of a "." as the decimal separator. Please help.

Thank you in advance.

There are 2 answers below.

Answer 1

I think your problem is that you are expecting the web server to resolve a wildcard inside the URL ("http://tweakers.net/pricewatch/[^.]*/..."), and that you are not checking the returned code, which I suspect is 404.

You need to either identify the product ID, if that is fixed, or post a search request using the form's POST method.
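
A minimal sketch of that suggestion, assuming the product ID from the question's example URL (423541) stays fixed for that product, and checking the returned code before parsing:

import urllib
import re

# the product ID (423541) is taken from the question's example URL and assumed to be fixed
url = "http://tweakers.net/pricewatch/423541/samsung-galaxy-s6-32gb-zwart.html"

response = urllib.urlopen(url)
print response.getcode()  # check the returned code first; a wrong URL gives 404

if response.getcode() == 200:
    htmltext = response.read()
    print re.findall('<span itemprop="lowPrice">(.+?)</span>', htmltext)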

Answer 2

import requests
from bs4 import BeautifulSoup

# fetch the first page of the smartphone category and list (link, price) pairs
soup = BeautifulSoup(requests.get("http://tweakers.net/categorie/215/smartphones/producten/").content, "lxml")

print [(p.a["href"], p.a.text) for p in soup.find_all("p", {"class": "price"})]
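
Each entry pairs the product link with its displayed price, so the listing can be matched against the slugs from TweakersTelefoons.txt; a minimal sketch, assuming each slug appears somewhere in its product's href:

import requests
from bs4 import BeautifulSoup

wanted = [line.strip() for line in open("TweakersTelefoons.txt") if line.strip()]

soup = BeautifulSoup(requests.get("http://tweakers.net/categorie/215/smartphones/producten/").content, "lxml")
listing = [(p.a["href"], p.a.text) for p in soup.find_all("p", {"class": "price"})]

# keep only the phones named in the text file (assumes the slug is part of the link)
for href, price in listing:
    for slug in wanted:
        if slug in href:
            print "the price of", slug, "is", price.strip()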

To get all the pages:

import requests
from bs4 import BeautifulSoup

# base url to pass the page number to (pages 1-69 in this case)
base_url = "http://tweakers.net/categorie/215/smartphones/producten/?page={}"
soup = BeautifulSoup(requests.get("http://tweakers.net/categorie/215/smartphones/producten/").content, "lxml")

# get and store all prices and phone links from the first page
data = {1: [(p.a["href"], p.a.text) for p in soup.find_all("p", {"class": "price"})]}

# the pagination links hold the available page numbers
pag = soup.find("span", attrs={"class": "pageDistribution"}).find_all("a")

# last page number
mx_pg = max(int(a.text) for a in pag if a.text.isdigit())

# get all the pages from the second to mx_pg
for i in range(2, mx_pg + 1):
    req = requests.get(base_url.format(i))
    print req
    soup = BeautifulSoup(req.content, "lxml")
    data[i] = [(p.a["href"], p.a.text) for p in soup.find_all("p", {"class": "price"})]

You will need both requests and BeautifulSoup. The dict holds the link to each phone's page, which you can visit if you want to scrape more data.
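
As a follow-up to that last point, a minimal sketch of visiting one of the stored links and reading the low price, assuming the stored href is an absolute URL and the product page still carries the <span itemprop="lowPrice"> markup quoted in the question; the replace calls are one way to handle the European decimal comma in "€ 471,95":

import requests
from bs4 import BeautifulSoup

# continues from the data dict built above; take the first (link, price) pair on page 1
href, listed_price = data[1][0]

# assumes the stored href is an absolute URL; prepend "http://tweakers.net" if it is relative
soup = BeautifulSoup(requests.get(href).content, "lxml")
low = soup.find("span", itemprop="lowPrice")

if low is not None:
    # e.g. u"\u20ac 471,95" -> 471.95: drop the euro sign, drop any thousands dots, swap the decimal comma
    cleaned = low.text.replace(u"\u20ac", u"").strip().replace(u".", u"").replace(u",", u".")
    print href, float(cleaned)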