Stuck scraping a specific table with scrapy


So the table I'm trying to scrape can be found here: http://www.betdistrict.com/tipsters

I'm after the table titled 'June Stats'.

Here's my spider:

from __future__ import division
from decimal import *

import scrapy
import urlparse

from ttscrape.items import TtscrapeItem 

class BetdistrictSpider(scrapy.Spider):
    name = "betdistrict"
    allowed_domains = ["betdistrict.com"]
    start_urls = ["http://www.betdistrict.com/tipsters"]

    def parse(self, response):
        for sel in response.xpath('//table[1]/tr'):
            item = TtscrapeItem()
            name = sel.xpath('td[@class="tipst"]/a/text()').extract()[0]
            url = sel.xpath('td[@class="tipst"]/a/@href').extract()[0]
            tipster = '<a href="' + url + '" target="_blank" rel="nofollow">' + name + '</a>'
            item['Tipster'] = tipster
            won = sel.xpath('td[2]/text()').extract()[0]
            lost = sel.xpath('td[3]/text()').extract()[0]
            void = sel.xpath('td[4]/text()').extract()[0]
            tips = int(won) + int(void) + int(lost)
            item['Tips'] = tips
            strike = Decimal(int(won) / tips) * 100
            strike = str(round(strike,2))
            item['Strike'] = [strike + "%"]
            profit = sel.xpath('//td[5]/text()').extract()[0]
            if profit[0] in ['+']:
                profit = profit[1:]
            item['Profit'] = profit
            yield_str = sel.xpath('//td[6]/text()').extract()[0]
            yield_str = yield_str.replace(' ','')
            if yield_str[0] in ['+']:
                yield_str = yield_str[1:]
            item['Yield'] = '<span style="color: #40AA40">' + yield_str + '%</span>'
            item['Site'] = 'Bet District'
            yield item

This gives me a list index out of range error on the very first variable (name).

However, when I rewrite my XPath selectors to start with //, e.g.:

name = sel.xpath('//td[@class="tipst"]/a/text()').extract()[0]

The spider runs, but scrapes the first tipster over and over again.

I think it has something to do with the table not having a thead, but containing th tags within the first tr of the tbody.

Any help is much appreciated.

----------EDIT----------

In response to Lars's suggestions:

I've tried what you suggested, but I still get a list index out of range error:

from __future__ import division
from decimal import *

import scrapy
import urlparse

from ttscrape.items import TtscrapeItem 

class BetdistrictSpider(scrapy.Spider):
    name = "betdistrict"
    allowed_domains = ["betdistrict.com"]
    start_urls = ["http://www.betdistrict.com/tipsters"]

    def parse(self, response):
        for sel in response.xpath('//table[1]/tr[td[@class="tipst"]]'):
            item = TtscrapeItem()
            name = sel.xpath('a/text()').extract()[0]
            url = sel.xpath('a/@href').extract()[0]
            tipster = '<a href="' + url + '" target="_blank" rel="nofollow">' + name + '</a>'
            item['Tipster'] = tipster
            yield item

Also, I'm assuming by doing things this way, multiple for loops are required since not all cells have the same class?

I've also tried doing things without a for loop, but in that case it once again scrapes only the first tipster multiple times :s

Thanks

Best answer

When you say

name = sel.xpath('td[@class="tipst"]/a/text()').extract()[0]

the XPath expression starts with td, so it is evaluated relative to the context node held in the variable sel (i.e. the current tr element among those the for loop iterates over).

However when you say

name = sel.xpath('//td[@class="tipst"]/a/text()').extract()[0]

the XPath expression starts with //td, i.e. select all td elements anywhere in the document; this is not relative to sel, and so the results will be the same on every iteration of the for loop. That's why it scrapes the first tipster over and over again.
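To see the difference in isolation, here is a minimal sketch that uses Python's standard-library xml.etree.ElementTree instead of Scrapy. The table markup and tipster names are made up for illustration. ElementTree does not accept a leading // on an element, so the document-wide lookup is approximated by searching from the table root with .// on every iteration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page's table (not the real markup).
TABLE = ET.fromstring("""
<table>
  <tr><td class="tipst"><a href="/t/alice">Alice</a></td><td>10</td></tr>
  <tr><td class="tipst"><a href="/t/bob">Bob</a></td><td>7</td></tr>
</table>
""".strip())

relative = []
document_wide = []
for row in TABLE.findall("tr"):
    # Relative path: evaluated against the current row.
    relative.append(row.find("td[@class='tipst']/a").text)
    # Search from the table root: ignores `row`, so the first match in the
    # document comes back every time (the effect of a leading // in XPath).
    document_wide.append(TABLE.find(".//td[@class='tipst']/a").text)

print(relative)       # ['Alice', 'Bob']
print(document_wide)  # ['Alice', 'Alice']
```

The profit and yield lines in the question use //td[5] and //td[6], so they would repeat the first row's values in the same way. Keeping them relative to sel (td[5]/text(), or .//td[5]/text() if the cell may be nested deeper) avoids that.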

Why does the first XPath expression fail with a list index out of range error? Try taking the XPath expression one location step at a time, printing out the results, and you'll soon find the problem. In this case, it appears to be because the first tr child of table[1] does not have a td child (only th children). So the xpath() call selects nothing, extract() returns an empty list, and indexing the first item of that empty list raises the list index out of range error.
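The step-by-step check might look roughly like this. It is only a sketch against made-up markup, again using the standard-library ElementTree as a stand-in for Scrapy's selectors; the header row with th cells is assumed from the description in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: a header row of th cells followed by one data row.
TABLE = ET.fromstring("""
<table>
  <tr><th>Tipster</th><th>Won</th><th>Lost</th></tr>
  <tr><td class="tipst"><a href="/t/alice">Alice</a></td><td>10</td><td>4</td></tr>
</table>
""".strip())

first_row = TABLE.findall("tr")[0]

# Step 1: which child elements does the first row actually have?
child_tags = [child.tag for child in first_row]
# Step 2: the td step matches nothing, because the row only holds th cells.
cells = first_row.findall("td[@class='tipst']")
# Step 3: so the full path is empty as well.
links = first_row.findall("td[@class='tipst']/a")

print(child_tags)    # ['th', 'th', 'th']
print(cells, links)  # [] []

try:
    links[0]  # same failure mode as .extract()[0] on an empty result
except IndexError as exc:
    error = str(exc)
    print(error)  # list index out of range
```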

To fix this, you could change your for loop XPath expression to loop only over those tr elements that have td children:

for sel in response.xpath('//table[1]/tr[td]'):

You could get fancier, requiring a td of the right class:

for sel in response.xpath('//table[1]/tr[td[@class="tipst"]]'):
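Either way, sel is still the tr inside the loop, so the per-cell expressions have to start at a td child. A bare a/text(), as in the edited question, looks for an a that is a direct child of the tr and finds nothing, assuming the link sits inside the td as the original td[@class="tipst"]/a selector implies. One loop is enough; each cell just gets its own relative path. A rough sketch with the standard-library ElementTree and made-up markup follows. Only the simpler tr[td] filter is used, since ElementTree implements a limited XPath subset.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page's table (not the real markup).
TABLE = ET.fromstring("""
<table>
  <tr><th>Tipster</th><th>Won</th><th>Lost</th></tr>
  <tr><td class="tipst"><a href="/t/alice">Alice</a></td><td>10</td><td>4</td></tr>
  <tr><td class="tipst"><a href="/t/bob">Bob</a></td><td>7</td><td>9</td></tr>
</table>
""".strip())

rows = []
direct_links = []
for row in TABLE.findall("tr[td]"):  # skips the header row (th cells only)
    # A bare "a" only matches direct children of the tr, so this is None.
    direct_links.append(row.find("a"))
    # Going through the td reaches the link.
    link = row.find("td[@class='tipst']/a")
    # Positional steps are relative to the current row as well.
    won = int(row.find("td[2]").text)
    lost = int(row.find("td[3]").text)
    rows.append((link.text, link.get("href"), won, lost))

print(direct_links)  # [None, None]
print(rows)          # [('Alice', '/t/alice', 10, 4), ('Bob', '/t/bob', 7, 9)]
```

In the spider itself, the equivalent would be to keep the loop filter from above and use the original relative selectors, such as sel.xpath('td[@class="tipst"]/a/text()'), inside it.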