I've written a spider to crawl boardgamegeek.com/browse/boardgame for information about the boardgames in the list.
My problem is that two specific selectors in my code do not always return a result: sometimes the XPath matches and I get a value, other times I get nothing. When I inspect the response while debugging, the dynamically loaded elements those selectors target are missing from the HTML.
My two offending lines
bggspider.py
bg['txt_cnt'] = response.xpath(
    selector_paths.SEL_TXT_REVIEWS).extract_first()
bg['vid_cnt'] = response.xpath(
    selector_paths.SEL_VID_REVIEWS).extract_first()
Where the selectors are defined as
selector_paths.py
SEL_TXT_REVIEWS = ('//div[@class="panel-inline-links"]'
                   '/a[contains(text(), "All Text Reviews")]/text()')
SEL_VID_REVIEWS = ('//div[@class="panel-inline-links"]'
                   '/a[contains(text(), "All Video Reviews")]/text()')
After the bg item is yielded, the pipeline processes these fields. Each field is checked first, since many boardgames have little or no information for various parts of the page.
pipelines.py
import re

# Keep only the first number in each field; default to 0 when the field is empty.
if item['txt_cnt']:
    item['txt_cnt'] = int(re.findall(r'\d+', item['txt_cnt'])[0])
else:
    item['txt_cnt'] = 0
if item['vid_cnt']:
    item['vid_cnt'] = int(re.findall(r'\d+', item['vid_cnt'])[0])
else:
    item['vid_cnt'] = 0
The field processing just extracts the numeric value from each string, which is the number of text or video reviews for a boardgame.
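A more defensive version of that extraction, as a minimal sketch: extract_count is a hypothetical helper name (it is not in my repo), and the example label "All Text Reviews (12)" is only an assumption about what the link text looks like. It returns 0 when the field is missing or contains no digits, instead of raising IndexError on the [0] lookup.

```python
import re

# Hypothetical helper (not part of the original spider): pull the first
# integer out of a link label such as "All Text Reviews (12)" (assumed format).
_FIRST_INT = re.compile(r'\d+')


def extract_count(text):
    """Return the first run of digits in text as an int, or 0 if there is none."""
    if not text:
        # Covers None (selector matched nothing) and empty strings.
        return 0
    match = _FIRST_INT.search(text)
    return int(match.group()) if match else 0
```

With this, the pipeline check could collapse to item['txt_cnt'] = extract_count(item['txt_cnt']), and likewise for vid_cnt.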
I assume I'm missing something related to Splash, since the selectors return values for most requests but come back empty for many others. I am running the ScrapySplash Docker container locally at localhost:8050.
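One way I could check whether this is a timing issue is to ask Splash directly for the rendered page with a longer wait and see whether the review links show up. The following is only a sketch under assumptions: build_render_url and review_links_present are hypothetical helper names, the wait and timeout values are arbitrary, and it relies on Splash's render.html endpoint accepting url, wait and timeout as query arguments.

```python
from urllib.parse import urlencode

# Assumption: Splash is reachable at the local address mentioned above.
SPLASH_URL = "http://localhost:8050"


def build_render_url(page_url, wait=2.0, timeout=30.0):
    """Build a render.html URL asking Splash to wait `wait` seconds after load."""
    query = urlencode({"url": page_url, "wait": wait, "timeout": timeout})
    return f"{SPLASH_URL}/render.html?{query}"


def review_links_present(html):
    """Check whether the text of either review link appears in the rendered HTML."""
    return "All Text Reviews" in html or "All Video Reviews" in html
```

Fetching build_render_url(...) for a game page that fails in the spider (for example with urllib.request.urlopen) and passing the decoded body to review_links_present would show whether a longer wait makes the missing elements appear.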
Code for the spider can be found here: BGGSpider on GitHub
Any help or information about how to remedy this problem or how ScrapySplash works would be appreciated.