Scrapy not giving individual results of all the reviews of a phone?


This code returns results, but the output is not what I want. What is wrong with my XPath? Also, how do I make the rule step through the review pages in increments of 10? I always have trouble with these two things.

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin


class CompItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    data = scrapy.Field()
    name_reviewer = scrapy.Field()
    date = scrapy.Field()
    model_name = scrapy.Field()
    rating = scrapy.Field()
    review = scrapy.Field()



class criticspider(CrawlSpider):
    name = "flip_review"
    allowed_domains = ["flipkart.com"]

    start_urls = ['http://www.flipkart.com/samsung-galaxy-s5/product-reviews/ITME5Z9GKXGMFSF6?pid=MOBDUUDTADHVQZXG&type=all']
    rules = (
        Rule(
            SgmlLinkExtractor(allow=('.*\&start=.*',)),
            callback="parse_start_url",
            follow=True),
    )

    def parse_start_url(self, response):
        sites = response.css('div.review-list div[review-id]')
        items = []
        model_name = response.xpath('//h1[@class="title"]/text()').re(r'Reviews of (.*?)$')
        for site in sites:
            item = CompItem()
            item['model_name'] = model_name
            item['name_reviewer'] = ''.join(site.xpath('.//div[contains(@class, "date")]/preceding-sibling::*[1]//text()').extract())
            item['date'] = site.xpath('.//div[contains(@class, "date")]/text()').extract()
            item['title'] = site.xpath('.//div[contains(@class,"line fk-font-normal bmargin5 dark-gray")]/strong/text()').extract()
            item['review'] = site.xpath('.//span[contains(@class,"review-text")]/text()').extract()
            yield item

My output is:

 {'date': [u'\n 31 Mar 2015 ', u'\n 23 Mar 2015 '],
  'model_name': [u'\n Reviews of A & K 333 '],
  'name_reviewer': [u'\n pradeep kumar', u'\n vikas agrawal']}

and I want my output to be:

{model_name :xyz
name_reviewer :abc
date:38383
}
{model_name :xyz
name_reviewer :hfhd
date:9283
}

I think the problem is with my XPath.


There are 2 best solutions below

3
On BEST ANSWER

First of all, your XPath expressions are very fragile in general.

The main problem with your approach is that site does not contain a review section, but it should. In other words, you are not iterating over review blocks on a page.

Also, the model name should be extracted outside of a loop since it is the same for every review on a page. I would also use .re() to extract the model name out of the title, e.g. SAMSUNG GALAXY S5 out of REVIEWS OF SAMSUNG GALAXY S5.
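To see what the regular expression captures, here is a minimal sketch that uses the standard re module as a stand-in for Scrapy's .re() call. The heading text is a made-up sample, padded with whitespace the way the strings in the question's output are.

```python
import re

# Hypothetical heading text; the real page content is an assumption.
title_text = "\n Reviews of Samsung Galaxy S5 "

# Same pattern as in the answer: the capture group keeps only the model name.
match = re.search(r"Reviews of (.*?)$", title_text)

# The captured text still carries trailing whitespace, hence the strip().
model_name = match.group(1).strip() if match else None

print(model_name)
```

The guard on match avoids an exception when the heading does not have the expected form, which indexing with [0] would raise on an empty result.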

Here is the complete working code with fixes applied:

def parse_start_url(self, response):
    sites = response.css('div.review-list div[review-id]')

    model_name = response.xpath('//h1[@class="title"]/text()').re(r'Reviews of (.*?)$')[0].strip()
    for site in sites:
        item = CompItem()
        item['model_name'] = model_name
        item['name_reviewer'] = ''.join(site.xpath('.//div[contains(@class, "date")]/preceding-sibling::*[1]//text()').extract()).strip()
        item['date'] = site.xpath('.//div[contains(@class, "date")]/text()').extract()[0].strip()
        yield item

The XPath expressions are also simpler. As an example, the review sections are located with the CSS selector div.review-list div[review-id], which matches every div element that has a review-id attribute and sits anywhere under the div with the review-list class.
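The difference between querying the whole page and querying each review block can be sketched with the standard library alone. The markup below is a simplified, hypothetical stand-in for the real page, and xml.etree.ElementTree is used instead of Scrapy selectors so the snippet runs without extra packages.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is assumed to be more complex.
PAGE = """
<html><body>
  <div class="review-list">
    <div review-id="r1">
      <span class="review-username">pradeep kumar</span>
      <div class="date">31 Mar 2015</div>
    </div>
    <div review-id="r2">
      <span class="review-username">vikas agrawal</span>
      <div class="date">23 Mar 2015</div>
    </div>
  </div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# Page-level query: a single list holding every date on the page.
all_dates = [d.text for d in root.findall(".//div[@class='date']")]

# Block-level queries: one record per review block.
items = []
for block in root.findall(".//div[@class='review-list']/div[@review-id]"):
    items.append({
        "name_reviewer": block.find("./span[@class='review-username']").text.strip(),
        "date": block.find("./div[@class='date']").text.strip(),
    })

print(all_dates)
print(items)
```

The page-level query produces one list for the whole page, which is the shape of the unwanted output in the question. Queries made relative to each block produce one record per review.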

Also, note how name_reviewer is extracted. Reviewers are rendered differently: some appear as a profile link, while others are not registered and appear in a span with the review-username class. Instead of handling each case separately, I took a different approach: locate the review date and take the text of its first preceding sibling.
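The preceding-sibling idea can be illustrated roughly as follows. ElementTree does not support the preceding-sibling axis, so this sketch emulates it by walking the direct children of each block, which is a simplification of the .//div lookup in the answer. The markup and the reviewer_name helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one reviewer shown as a link, one as a plain span.
REVIEWS = """
<div class="review-list">
  <div review-id="r1">
    <a class="load-user-widget" href="#">RISHABH GROVER</a>
    <div class="date">10 Apr 2014</div>
  </div>
  <div review-id="r2">
    <span class="review-username">Joel</span>
    <div class="date">24 May 2014</div>
  </div>
</div>
""".strip()


def reviewer_name(block):
    """Return the text of the element right before the date element, if any."""
    children = list(block)
    for index, child in enumerate(children):
        if "date" in child.get("class", "").split():
            if index == 0:
                return None  # nothing precedes the date element
            return "".join(children[index - 1].itertext()).strip()
    return None  # no date element in this block


root = ET.fromstring(REVIEWS)
names = [reviewer_name(block) for block in root.findall("./div[@review-id]")]
print(names)
```

Because the lookup depends only on the position relative to the date, it does not matter whether the name is wrapped in an a or a span element.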


I'd like to point out that class names like line, fk-font-small, fk-font-11, etc. are layout-oriented classes and are, generally speaking, not a good basis for your XPath expressions and CSS selectors. Note which classes are used to locate elements in the answer: review-list, title, date. They are more data-oriented and a better choice for your locators.
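A small sketch of why this matters, in plain Python. It assumes a hypothetical restyle that swaps fk-font-small for fk-font-11 on the date element. The has_class helper is illustrative and not part of Scrapy.

```python
def has_class(class_attr, name):
    """Check for one class token, ignoring order and unrelated classes."""
    return name in class_attr.split()


before = "date line fk-font-small"
after = "date line fk-font-11"  # hypothetical restyle of the same element

# Matching the full attribute value works only until the layout changes.
exact_before = before == "date line fk-font-small"
exact_after = after == "date line fk-font-small"

# Matching the data-oriented class alone survives the restyle.
token_before = has_class(before, "date")
token_after = has_class(after, "date")

print(exact_before, exact_after, token_before, token_after)
```

Note that the contains(@class, "date") test used in the answer is a substring match, so it would also accept a class such as update. The token check above is stricter.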

3
On

This should help. The problem is with your XPath:

In [1]: data_list = []

In [2]: sites = response.xpath('//div[@class="review-list"]/div')

In [3]: for site in sites:
    data = {}
    data['name_reviewer'] = site.xpath('./div/div[@class="line"]/span[@class="fk-color-title fk-font-11 review-username"]/text()|./div/div[@class="line"]/a[@class="load-user-widget fk-underline"]/text()').extract()[0].strip()
    data['date'] = site.xpath('./div/div[@class="date line fk-font-small"]/text()').extract()[0].strip()
    data['model_name'] =  response.xpath('//h1[@class="title"]/text()').extract()[0].strip()
    data_list.append(data)


In [4]: data_list
Out[4]: 
[{'date': u'10 Apr 2014',
  'model_name': u'Reviews of Samsung Galaxy S5',
  'name_reviewer': u'RISHABH GROVER'},
 {'date': u'11 Apr 2014',
  'model_name': u'Reviews of Samsung Galaxy S5',
  'name_reviewer': u'Hemraj Chaudhari'},
 {'date': u'28 Apr 2014',
  'model_name': u'Reviews of Samsung Galaxy S5',
  'name_reviewer': u'RISHABH GROVER'},
 {'date': u'27 Apr 2014',
  'model_name': u'Reviews of Samsung Galaxy S5',
  'name_reviewer': u'Debadutta Patnaik'},
 {'date': u'24 May 2014',
  'model_name': u'Reviews of Samsung Galaxy S5',
  'name_reviewer': u'Joel'},
 {'date': u'11 Apr 2014',
  'model_name': u'Reviews of Samsung Galaxy S5',
  'name_reviewer': u'Saswat Nayak'},
 {'date': u'14 Apr 2014',
  'model_name': u'Reviews of Samsung Galaxy S5',
  'name_reviewer': u'Amit Thakor'},
 {'date': u'28 May 2014',
  'model_name': u'Reviews of Samsung Galaxy S5',
  'name_reviewer': u'Nishchal Sharma'},
 {'date': u'13 May 2014',
  'model_name': u'Reviews of Samsung Galaxy S5',
  'name_reviewer': u'siddiq hassan'},
 {'date': u'16 May 2014',
  'model_name': u'Reviews of Samsung Galaxy S5',
  'name_reviewer': u'Raja Shekhar'}]