I have this code that gets all the links within a webpage:
from scrapy.spider import Spider
from scrapy import Selector
from socialmedia.items import SocialMediaItem
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MySpider(Spider):
    name = 'smm'
    allowed_domains = ['*']
    start_urls = ['http://en.wikipedia.org/wiki/Social_media']

    def parse(self, response):
        items = []
        for link in response.xpath("//a"):
            item = SocialMediaItem()
            item['SourceTitle'] = link.xpath('/html/head/title').extract()
            item['TargetTitle'] = link.xpath('text()').extract()
            item['link'] = link.xpath('@href').extract()
            items.append(item)
        return items
I'd like to do the following:
1) Instead of getting all the links, get only the outbound ones or, at least, only those starting with http/s
2) Follow the outbound links
3) Scrape the next webpage only if it contains some keywords in the metadata
4) Repeat the whole process for a given number of loops
Can anyone help? Cheers!
Dani
I think you're probably looking for something like scrapy's Rule and LinkExtractor.
This code is completely untested, but it should give an idea of how you might get only the outbound links, follow them, and check the follow-on pages for keywords before doing the full parsing.
Good luck.