Scrapy Splash missing elements

I've written a spider to crawl the boardgamegeek.com/browse/boardgame site for information regarding boardgames in the list.

My problem is that two specific selectors in my code do not always return a result: sometimes the XPath matches an element, and other times it returns nothing. When I inspect the response while debugging, the dynamically loaded elements that those selectors target are not present in the HTML.

My two offending lines

bggspider.py    

    bg['txt_cnt'] = response.xpath(
        selector_paths.SEL_TXT_REVIEWS).extract_first()
    bg['vid_cnt'] = response.xpath(
        selector_paths.SEL_VID_REVIEWS).extract_first()

Where the selectors are defined as

selector_paths.py

    SEL_TXT_REVIEWS = '//div[@class="panel-inline-links"]/a[contains(text(), "All Text Reviews")]/text()'
    SEL_VID_REVIEWS = '//div[@class="panel-inline-links"]/a[contains(text(), "All Video Reviews")]/text()'

After the spider yields the bg item, the pipeline processes its fields. Each field is checked before conversion, since many boardgames have very little information in various parts of the page.

pipelines.py

    # Raw strings keep '\d' from being read as an invalid string escape.
    if item['txt_cnt']:
        item['txt_cnt'] = int(re.findall(r'\d+', item['txt_cnt'])[0])
    else:
        item['txt_cnt'] = 0
    if item['vid_cnt']:
        item['vid_cnt'] = int(re.findall(r'\d+', item['vid_cnt'])[0])
    else:
        item['vid_cnt'] = 0

The aim of the field processing is just to extract the number from the string, which is the count of text or video reviews for a boardgame.
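For illustration, the same extraction can be sketched as a small standalone helper. This is only a sketch: the name `parse_review_count` is hypothetical, and the sample link text such as "All Text Reviews (12)" is an assumption about what the page returns, not something confirmed here. It returns 0 both when the selector found nothing (`None`) and when the text contains no digits, which avoids the `IndexError` that indexing an empty `re.findall` result would raise.

```python
import re
from typing import Optional

# Matches the first run of digits in the link text.
_DIGITS = re.compile(r"\d+")


def parse_review_count(link_text: Optional[str]) -> int:
    """Return the first integer found in link_text, or 0 if there is none.

    Hypothetical helper; thousands separators such as "1,234" are not
    handled and would yield only the leading digits.
    """
    if not link_text:
        # extract_first() returns None when the XPath matched nothing.
        return 0
    match = _DIGITS.search(link_text)
    return int(match.group()) if match else 0
```

In the pipeline this could replace each if/else pair with a single call, for example `item['txt_cnt'] = parse_review_count(item['txt_cnt'])`.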

I assume I'm missing something related to Splash, since I get results for most requests but still miss many. I am running the ScrapySplash Docker container locally at localhost:8050.

The code for the spider can be found here: BGGSpider on GitHub.

Any help or information about how to remedy this problem or how ScrapySplash works would be appreciated.
