So the table I'm trying to scrape can be found here: http://www.betdistrict.com/tipsters
I'm after the table titled 'June Stats'.
Here's my spider:
    from __future__ import division
    from decimal import *
    import scrapy
    import urlparse
    from ttscrape.items import TtscrapeItem

    class BetdistrictSpider(scrapy.Spider):
        name = "betdistrict"
        allowed_domains = ["betdistrict.com"]
        start_urls = ["http://www.betdistrict.com/tipsters"]

        def parse(self, response):
            for sel in response.xpath('//table[1]/tr'):
                item = TtscrapeItem()
                name = sel.xpath('td[@class="tipst"]/a/text()').extract()[0]
                url = sel.xpath('td[@class="tipst"]/a/@href').extract()[0]
                tipster = '<a href="' + url + '" target="_blank" rel="nofollow">' + name + '</a>'
                item['Tipster'] = tipster
                won = sel.xpath('td[2]/text()').extract()[0]
                lost = sel.xpath('td[3]/text()').extract()[0]
                void = sel.xpath('td[4]/text()').extract()[0]
                tips = int(won) + int(void) + int(lost)
                item['Tips'] = tips
                strike = Decimal(int(won) / tips) * 100
                strike = str(round(strike, 2))
                item['Strike'] = [strike + "%"]
                profit = sel.xpath('//td[5]/text()').extract()[0]
                if profit[0] in ['+']:
                    profit = profit[1:]
                item['Profit'] = profit
                yield_str = sel.xpath('//td[6]/text()').extract()[0]
                yield_str = yield_str.replace(' ', '')
                if yield_str[0] in ['+']:
                    yield_str = yield_str[1:]
                item['Yield'] = '<span style="color: #40AA40">' + yield_str + '%</span>'
                item['Site'] = 'Bet District'
                yield item
This gives me a list index out of range error on the very first variable (name).
However, when I rewrite my XPath selectors to start with //, e.g.:
name = sel.xpath('//td[@class="tipst"]/a/text()').extract()[0]
The spider runs, but scrapes the first tipster over and over again.
I think it has something to do with the table not having a thead, but instead containing th tags within the first tr of the tbody.
Any help is much appreciated.
----------EDIT----------
In response to Lars' suggestions:
I've tried what you suggested, but I still get a list index out of range error:
    from __future__ import division
    from decimal import *
    import scrapy
    import urlparse
    from ttscrape.items import TtscrapeItem

    class BetdistrictSpider(scrapy.Spider):
        name = "betdistrict"
        allowed_domains = ["betdistrict.com"]
        start_urls = ["http://www.betdistrict.com/tipsters"]

        def parse(self, response):
            for sel in response.xpath('//table[1]/tr[td[@class="tipst"]]'):
                item = TtscrapeItem()
                name = sel.xpath('a/text()').extract()[0]
                url = sel.xpath('a/@href').extract()[0]
                tipster = '<a href="' + url + '" target="_blank" rel="nofollow">' + name + '</a>'
                item['Tipster'] = tipster
                yield item
Also, am I right in assuming that doing things this way requires multiple for loops, since not all cells have the same class?
I've also tried doing things without a for loop, but in that case it once again scrapes only the first tipster multiple times :s
Thanks
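Regarding the multiple-for-loops question: one loop over the rows is enough, because every cell in a row can then be reached with its own relative path or positional index, whatever its class. Here's a minimal standard-library sketch of that idea (the table markup and field names are invented for the example; it uses `xml.etree.ElementTree`, whose XPath subset supports the same `tr[td]` and `td[@class=...]` predicates):

```python
# Sketch: one loop over data rows; each cell is then reached with a
# relative lookup (by class or by position), so no extra loops per class.
import xml.etree.ElementTree as ET

table = ET.fromstring(
    "<table>"
    "<tr><th>Tipster</th><th>Won</th><th>Lost</th></tr>"
    "<tr><td class='tipst'>Alice</td><td>10</td><td>4</td></tr>"
    "<tr><td class='tipst'>Bob</td><td>7</td><td>9</td></tr>"
    "</table>"
)

rows = []
for row in table.findall("tr[td]"):       # only rows that have td children
    cells = row.findall("td")             # all cells of this one row
    rows.append({
        "name": row.find("td[@class='tipst']").text,  # cell by class
        "won": int(cells[1].text),                    # cell by position
        "lost": int(cells[2].text),
    })

print(rows)
```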
When you say

    name = sel.xpath('td[@class="tipst"]/a/text()').extract()[0]

the XPath expression starts with `td` and so is relative to the context node that you have in the variable `sel` (i.e. the `tr` element in the set of `tr` elements that the `for` loop iterates over). However, when you say

    name = sel.xpath('//td[@class="tipst"]/a/text()').extract()[0]

the XPath expression starts with `//td`, i.e. it selects all `td` elements anywhere in the document; this is not relative to `sel`, and so the results will be the same on every iteration of the `for` loop. That's why it scrapes the first tipster over and over again.

Why does the first XPath expression fail with a list index out of range error? Try taking the XPath expression one location step at a time, printing out the results, and you'll soon find the problem. In this case, it appears to be because the first `tr` child of `table[1]` does not have a `td` child (only `th` children). So the `xpath()` call selects nothing, `extract()` returns an empty list, and you try to reference the first item of that empty list, giving a list index out of range error.

To fix this, you could change your `for` loop's XPath expression to loop only over those `tr` elements that have `td` children:

    for sel in response.xpath('//table[1]/tr[td]'):

You could get fancier, requiring a `td` of the right class:

    for sel in response.xpath('//table[1]/tr[td[@class="tipst"]]'):
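The relative-vs-absolute distinction above can be reproduced outside Scrapy. Here's a small sketch using only the standard library's `xml.etree.ElementTree` (whose XPath subset spells a document-wide search as `.//` rather than `//`; the table markup is invented for the demo):

```python
# Demo: a relative path resolves against each row, while a document-wide
# path ignores the context row and keeps returning the same first match.
import xml.etree.ElementTree as ET

html = """
<table>
  <tr><th>Tipster</th><th>Won</th></tr>
  <tr><td class="tipst">Alice</td><td>10</td></tr>
  <tr><td class="tipst">Bob</td><td>7</td></tr>
</table>
"""
table = ET.fromstring(html)

# Relative path: 'td[...]' is resolved against each row in turn,
# so each iteration yields that row's own cell.
relative = [row.find("td[@class='tipst']").text
            for row in table.findall("tr[td]")]

# Document-wide path: './/td[...]' searches the whole table regardless
# of the current row -- the "first tipster over and over" symptom.
document_wide = [table.find(".//td[@class='tipst']").text
                 for _row in table.findall("tr[td]")]

print(relative)        # ['Alice', 'Bob']
print(document_wide)   # ['Alice', 'Alice']
```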