I want to get the prices of smartphones from the website http://tweakers.net, which is a Dutch site. The problem is that my script does not pick up any prices from the pages.
The textfile 'TweakersTelefoons.txt' contains 3 entries:
samsung-galaxy-s6-32gb-zwart
lg-nexus-5x-32gb-zwart
huawei-nexus-6p-32gb-zwart
I'm using Python 2.7 and this is the code I used:
import urllib
import re

symbolfile = open("TweakersTelefoons.txt")
symbolslist = symbolfile.read()
symbolslist = symbolslist.split("\n")

for symbol in symbolslist:
    url = "http://tweakers.net/pricewatch/[^.]*/" +symbol+ ".html"
    ## http://tweakers.net/pricewatch/423541/samsung-galaxy-s6-32gb-zwart.html is the original html
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = '<span itemprop="lowPrice">(.+?)</span>'
    ## <span itemprop="lowPrice">€ 471,95</span> is what the original code looks like
    pattern = re.compile(regex)
    price = re.findall(pattern, htmltext)
    print "the price of", symbol, "is ", price
Output:
the price of samsung-galaxy-s6-32gb-zwart is []
the price of lg-nexus-5x-32gb-zwart is []
the price of huawei-nexus-6p-32gb-zwart is []
The prices are not shown. I tried using [^.] to get rid of the euro sign, but that didn't work.
Furthermore, it might matter that in Europe we use "," instead of "." as the decimal separator. Please help.
Thank you in advance.
I think your problem is that you are expecting the web server to resolve a wildcard inside the URL,
"http://tweakers.net/pricewatch/[^.]*/",
and you are not checking the returned status code, which I suspect is 404. A pattern like [^.]* only means something to the regex engine; inside a URL it is sent to the server literally. You need to either identify the product ID, if that is fixed, or submit a search request using the form's POST method.
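To make the first option concrete, here is a rough sketch rather than a tested scraper. It assumes the product ID is fixed and known in advance. Only 423541 appears in your question; the IDs for the other two phones would have to be looked up. The names PRODUCT_IDS, build_url, extract_low_price, parse_euro, fetch and report_price are made up for this example. I am also assuming the page is UTF-8 and still contains the lowPrice span you quoted.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import re

try:
    from urllib2 import urlopen, HTTPError      # Python 2.7
except ImportError:
    from urllib.request import urlopen          # Python 3
    from urllib.error import HTTPError

# Assumption: every product has a fixed numeric ID in its URL.
# Only this one ID is given in the question; add the others once you know them.
PRODUCT_IDS = {
    "samsung-galaxy-s6-32gb-zwart": "423541",
}

LOW_PRICE = re.compile(r'<span itemprop="lowPrice">(.+?)</span>')


def build_url(product_id, slug):
    """Build a concrete URL; no regex syntax belongs in here."""
    return "http://tweakers.net/pricewatch/%s/%s.html" % (product_id, slug)


def extract_low_price(html):
    """Return the raw price text from the page, or None if it is absent."""
    match = LOW_PRICE.search(html)
    return match.group(1) if match else None


def parse_euro(text):
    """Convert a price written with a decimal comma into a float."""
    digits = re.sub(r"[^\d,]", "", text)   # drop euro sign, spaces, thousands dots
    return float(digits.replace(",", "."))


def fetch(url):
    """Return the page text, or None if the server answers with an HTTP error."""
    try:
        response = urlopen(url)
    except HTTPError as err:
        print("request failed with status", err.code, "for", url)
        return None
    # Assumption: the page is served as UTF-8.
    return response.read().decode("utf-8", "replace")


def report_price(slug):
    """Look up the ID, fetch the page, and print the price if one is found."""
    product_id = PRODUCT_IDS.get(slug)
    if product_id is None:
        print("no known product ID for", slug)
        return
    html = fetch(build_url(product_id, slug))
    raw = extract_low_price(html) if html else None
    if raw is None:
        print("no price found for", slug)
    else:
        print("the price of", slug, "is", parse_euro(raw))
```

You would call report_price(symbol) for each line of TweakersTelefoons.txt. If you would rather keep urllib.urlopen from your original script, the response object has a getcode() method you can compare against 200 before running the regex. The euro sign does not need special handling in the pattern, since (.+?) captures it along with the digits; parse_euro strips it afterwards and swaps the decimal comma for a dot.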