Scrapy extracting from Link

I am trying to extract information from certain links, but my spider never follows them. It only extracts from the start_url, and I am not sure why.

Here is my code:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from tutorial.items import DmozItem
from scrapy.selector import HtmlXPathSelector

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = [""]
    start_urls = [""]
    rules = [Rule(SgmlLinkExtractor(allow=[r'Books']), callback='parse')] 

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = DmozItem()

        # Extract links
        item['link'] = hxs.select("//li/a/text()").extract()  # XPath selector for the link text

        print item['title']

        for cont, i in enumerate(item['link']):
            print "link: ", cont, i

I don't get the links from the pages the rule should follow; instead I get the links from the start_url.



There is 1 best solution below


For rules to work, your spider needs to subclass CrawlSpider, not the generic scrapy.Spider.

Also, you need to rename your callback to something other than parse. CrawlSpider uses the parse method itself to implement its rule-following logic, so if you override it the rules will not work. See the warning in the CrawlSpider docs.

Your code was scraping the links from the start_url because the rules attribute was being ignored by the generic Spider.
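To make the two points above concrete, here is a small pure-Python sketch. It is a simplified, hypothetical model and not Scrapy's actual implementation: ToySpider, ToyCrawlSpider, handle, and the page dictionary are all invented for illustration. It only shows why a class that never reads rules ignores them, and why overriding parse on a rule-following class disables the rules.

```python
# Simplified, hypothetical model of the behaviour described above.
# None of these classes are part of Scrapy.


class ToySpider:
    """Stands in for the generic Spider: it never looks at `rules`."""

    rules = []

    def handle(self, page):
        # The framework hands every downloaded page to `parse`.
        return self.parse(page)

    def parse(self, page):
        return []


class ToyCrawlSpider(ToySpider):
    """Stands in for CrawlSpider: its own `parse` is what applies `rules`."""

    def parse(self, page):
        results = []
        for pattern, callback_name in self.rules:
            for link in page["links"]:
                if pattern in link:
                    callback = getattr(self, callback_name)
                    results.append(callback(link))
        return results


class PlainSpider(ToySpider):
    rules = [("Books", "parse")]  # silently ignored by ToySpider

    def parse(self, page):
        return ["scraped " + page["url"]]


class BrokenCrawl(ToyCrawlSpider):
    rules = [("Books", "parse")]

    def parse(self, page):  # replaces the rule-following logic
        return ["scraped " + page["url"]]


class WorkingCrawl(ToyCrawlSpider):
    rules = [("Books", "parse_item")]

    def parse_item(self, link):  # `parse` is left untouched
        return "scraped " + link


page = {"url": "start", "links": ["/Books/a", "/Music/b", "/Books/c"]}

plain_result = PlainSpider().handle(page)
broken_result = BrokenCrawl().handle(page)
working_result = WorkingCrawl().handle(page)
```

In this model only WorkingCrawl reaches the matching links; the other two fall back to scraping the start page, which matches the symptom in the question.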

This code should work:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from dmoz.items import DmozItem
from scrapy.selector import HtmlXPathSelector

class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = [""]
    start_urls = [""]
    rules = [Rule(SgmlLinkExtractor(allow=[r'Books']), callback='parse_item')] 

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = DmozItem()

        # Extract links
        item['link'] = hxs.select("//li/a/text()").extract()  # XPath selector for the link text

        print item['link']

        for cont, i in enumerate(item['link']):
            print "link: ", cont, i