My Scrapy spider has a bunch of independent target links to crawl.
def start_requests(self):
    search_targets = get_search_targets()
    for search in search_targets:
        request = get_request(search.contract_type, search.postal_code, 1)
        yield request
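(For context, get_request builds the paginated search request for a target. The real site and parameter names are omitted; this is just a made-up sketch of the URL-building part, which get_request then wraps in a scrapy.Request with callback=self.parse:)

```python
# Hypothetical sketch -- the real URL and parameter names differ.
def build_search_url(contract_type: str, postal_code: str, page: int) -> str:
    # get_request() wraps this URL in a scrapy.Request(callback=self.parse)
    return (
        "https://example.com/search"
        f"?contract_type={contract_type}"
        f"&postal_code={postal_code}"
        f"&page={page}"
    )
```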
Each link has multiple pages that will be followed, e.g.:
def parse(self, response, **kwargs):
    # Some logic depending on the response
    # ...
    if cur_page < num_pages:  # Follow the link to the next page
        next_page = cur_page + 1
        request = get_request(contract_type, postal_code, next_page)
        yield request
    for estate_dict in estates:  # Parse the items of the response
        item = EstateItem()
        fill_item(item, estate_dict)
        yield item
After a few pages, each link (target) starts encountering duplicate, already-seen items from previous crawls. Whether an item is a duplicate is decided in the pipeline, with a query to the database:
def save_estate_item(self, item: EstateItem, session: Session):
    query = session.query(EstateModel)
    previous_item = query.filter_by(code=item['code']).first()
    if previous_item is not None:
        logging.info("Duplicate Estate")
        return
    # Save the item in the DB
    # ...
When I find a duplicate estate, I want Scrapy to stop following further pages for that specific link target. How can I do that?
I figured I would raise exceptions.DropItem('Duplicate post') in the pipeline, with info about the finished search target attached, and catch that exception in my spider. But how can I tell Scrapy to stop following links for that specific search target?
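Concretely, the pipeline change I had in mind looks roughly like this (a self-contained sketch: DropItem is stubbed when scrapy isn't importable, and EstateModel is a placeholder for my real SQLAlchemy model):

```python
import logging

try:
    from scrapy.exceptions import DropItem
except ImportError:  # stub so the sketch runs outside a Scrapy project
    class DropItem(Exception):
        pass

class EstateModel:  # placeholder for the real SQLAlchemy model
    pass

class EstatePipeline:
    def save_estate_item(self, item, session):
        previous_item = session.query(EstateModel).filter_by(code=item['code']).first()
        if previous_item is not None:
            logging.info("Duplicate Estate")
            # Raise with the item's code so the spider could, in principle,
            # identify which search target is exhausted -- but I don't know
            # how to make Scrapy stop paginating that target from here.
            raise DropItem(f"Duplicate post: {item['code']}")
        # Save the item in the DB
        # ...
```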