This code returns results, but the output is not what I want. What is wrong with my XPath? And how do I iterate the rule by +10? I always have trouble with these two things.
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin

class CompItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    data = scrapy.Field()
    name_reviewer = scrapy.Field()
    date = scrapy.Field()
    model_name = scrapy.Field()
    rating = scrapy.Field()
    review = scrapy.Field()

class criticspider(CrawlSpider):
    name = "flip_review"
    allowed_domains = ["flipkart.com"]
    start_urls = ['http://www.flipkart.com/samsung-galaxy-s5/product-reviews/ITME5Z9GKXGMFSF6?pid=MOBDUUDTADHVQZXG&type=all']
    rules = (
        Rule(
            SgmlLinkExtractor(allow=('.*\&start=.*',)),
            callback="parse_start_url",
            follow=True),
    )

    def parse_start_url(self, response):
        sites = response.css('div.review-list div[review-id]')
        items = []
        model_name = response.xpath('//h1[@class="title"]/text()').re(r'Reviews of (.*?)$')
        for site in sites:
            item = CompItem()
            item['model_name'] = model_name
            item['name_reviewer'] = ''.join(site.xpath('.//div[contains(@class, "date")]/preceding-sibling::*[1]//text()').extract())
            item['date'] = site.xpath('.//div[contains(@class, "date")]/text()').extract()
            item['title'] = site.xpath('.//div[contains(@class,"line fk-font-normal bmargin5 dark-gray")]/strong/text()').extract()
            item['review'] = site.xpath('.//span[contains(@class,"review-text")]/text()').extract()
            yield item
My output is:
{'date': [u'\n 31 Mar 2015 ', u'\n 23 Mar 2015 '],
'model_name': [u'\n Reviews of A & K 333 '],
'name_reviewer': [u'\n pradeep kumar', u'\n vikas agrawal']}
and I want my output to be:
{model_name: xyz
 name_reviewer: abc
 date: 38383
}
{model_name: xyz
 name_reviewer: hfhd
 date: 9283
}
I think the problem is with my XPath.
First of all, your XPath expressions are very fragile in general.
The main problem with your approach is that `site` does not contain a review section, but it should. In other words, you are not iterating over review blocks on a page.
Also, the model name should be extracted outside of a loop since it is the same for every review on a page. I would also use `.re()` to extract the model name out of the title, e.g. SAMSUNG GALAXY S5 out of REVIEWS OF SAMSUNG GALAXY S5.
Here is the complete working code with fixes applied:
The XPath expressions are also made simpler. For the sake of an example, the review sections are identified by the CSS selector `div.review-list div[review-id]`, which matches all `div` elements that have a `review-id` attribute anywhere under the `div` with the `review-list` class.
Also, note how `name_reviewer` is extracted. Reviewers are represented differently: some appear as a profile link, while others are not registered and are located in a `span` with the `review-username` class. So I've taken a different approach: locating the review date and getting the text of its first preceding sibling.
I'd like to point out that class names like `line`, `fk-font-small`, `fk-font-11`, etc. are layout-oriented classes and are, generally speaking, not a good choice to base your XPath expressions and CSS selectors on. Note which classes are used to locate elements in the answer: `review-list`, `title`, `date`. They are more data-oriented and a better choice for your locators.
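To make the "one item per review block" idea concrete without running a crawl, here is a minimal sketch that uses the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy selectors. The markup in `SAMPLE` is hypothetical and much simpler than the real page, and `parse_reviews` is an illustrative name, not part of Scrapy. ElementTree has no `preceding-sibling` axis, so the sketch looks up the element just before the date `div` by position instead, which assumes the date is a direct child of the review block.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is larger and may differ.
SAMPLE = """
<html><body>
  <h1 class="title"> Reviews of Samsung Galaxy S5 </h1>
  <div class="review-list">
    <div review-id="r1">
      <a class="profile">pradeep kumar</a>
      <div class="date"> 31 Mar 2015 </div>
      <span class="review-text">Good phone.</span>
    </div>
    <div review-id="r2">
      <span class="review-username">vikas agrawal</span>
      <div class="date"> 23 Mar 2015 </div>
      <span class="review-text">Battery could be better.</span>
    </div>
  </div>
</body></html>
"""


def parse_reviews(markup):
    """Return one dict per review block found in the markup."""
    root = ET.fromstring(markup.strip())

    # The model name is page-level, so it is extracted once, outside the loop.
    title = root.find(".//h1[@class='title']").text
    match = re.search(r"Reviews of\s+(.*?)\s*$", title)
    model_name = match.group(1) if match else None

    items = []
    # Iterate over review blocks; every lookup below is relative to one block.
    for block in root.findall(".//div[@class='review-list']/div[@review-id]"):
        children = list(block)
        date_div = block.find("./div[@class='date']")
        # Positional stand-in for preceding-sibling::*[1] of the date element.
        idx = children.index(date_div)
        reviewer = children[idx - 1] if idx > 0 else None
        items.append({
            "model_name": model_name,
            "name_reviewer": "".join(reviewer.itertext()).strip() if reviewer is not None else "",
            "date": date_div.text.strip(),
            "review": block.find(".//span[@class='review-text']").text.strip(),
        })
    return items


for item in parse_reviews(SAMPLE):
    print(item)
```

Each dict carries a single reviewer, date and review text, which is the shape asked for in the question. Stripping whitespace at extraction time also removes the leading `\n` and trailing spaces visible in the original output.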