How to get all outbound links in a given webpage and follow them?


I have this code that gets all the links within a webpage:

from scrapy.spider import Spider
from scrapy import Selector
from socialmedia.items import SocialMediaItem
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MySpider(Spider):
    name = 'smm'
    allowed_domains = ['*']
    start_urls = ['http://en.wikipedia.org/wiki/Social_media']
    def parse(self, response):
        items = []
        for link in response.xpath("//a"):
            item = SocialMediaItem()
            item['SourceTitle'] = link.xpath('/html/head/title').extract()
            item['TargetTitle'] = link.xpath('text()').extract()
            item['link'] = link.xpath('@href').extract()
            items.append(item)
        return items

I'd like to do the following:

1) Instead of getting all the links, get only the outbound ones or, at least, only those starting with http/s
2) Follow the outbound links
3) Scrape the next webpage only if its metadata contains certain keywords
4) Repeat the whole process for a given number of loops

Can anyone help? Cheers!

Dani


I think you're probably looking for something like Scrapy's Rule and LinkExtractor. Note that rules only take effect on a CrawlSpider subclass, and a CrawlSpider must not override parse, so the final callback gets a different name below.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'smm'
    start_urls = ['http://en.wikipedia.org/wiki/Social_media']

    rules = (
        # Only follow links whose URL starts with http or https
        Rule(LinkExtractor(allow=(r'^https?://',)), callback='pre_parse'),
    )

    def pre_parse(self, response):
        # Only hand the page to the full parser if it mentions the keyword
        if 'keyword' in response.body:
            return self.parse_page(response)

    def parse_page(self, response):
        # full parsing of the matching page goes here
        pass
This code is completely untested, but it should give an idea of how you might go about collecting the links, then checking each follow-on page for keywords before doing the full parsing.
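For point (1), an allow pattern like the one above keeps every absolute http/s URL, including links back into the same site. If by "outbound" you mean links to a *different* domain, you could filter with a small helper like this sketch; is_outbound is a hypothetical name, not part of Scrapy, and it only needs the standard library:

```python
from urllib.parse import urlparse  # urlparse module on Python 2

def is_outbound(url, source_url):
    """Return True if url is an http/s link to a different host than source_url.

    Relative links (no scheme) are never outbound, and absolute links
    back to the same host don't count either.
    """
    target = urlparse(url)
    source = urlparse(source_url)
    if target.scheme not in ('http', 'https'):
        return False
    return target.netloc != source.netloc
```

Inside the spider you would call this on each extracted href, passing response.url as the second argument, and skip links where it returns False.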

Good luck.
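For point (4), Scrapy has a built-in DEPTH_LIMIT setting that caps how many link hops the crawler will follow from the start URLs, which is one way to bound the number of loops. A sketch of the settings.py fragment, with 3 as an arbitrary example value:

```python
# settings.py -- stop following links more than 3 hops from the start URL
DEPTH_LIMIT = 3
```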