Scrapy: stop following requests for a specific target


My Scrapy spider has a bunch of independent target links to crawl.

def start_requests(self):
    search_targets = get_search_targets()

    for search in search_targets:
        request = get_request(search.contract_type, search.postal_code, 1)
        yield request
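
get_search_targets() and get_request() are project helpers that I haven't shown. For context, a minimal sketch of what get_request() could look like; the URL is made up, but the point is that the search parameters travel with the request via cb_kwargs, so that parse() can build the next-page request:

import scrapy

# Sketch only, the real helper isn't shown here. The URL scheme is a
# placeholder; what matters is that contract_type, postal_code and the
# page number ride along in cb_kwargs.
def get_request(contract_type, postal_code, page):
    url = (f"https://example.com/search"
           f"?type={contract_type}&zip={postal_code}&page={page}")
    return scrapy.Request(  # no callback given, so Scrapy uses parse()
        url,
        cb_kwargs={
            "contract_type": contract_type,
            "postal_code": postal_code,
            "cur_page": page,
        },
    )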

Each link has multiple pages that will be followed, i.e.:

def parse(self, response, **kwargs):
    # Some Logic depending on the response
    # ...

    if cur_page < num_pages:  # Following the link to the next page
        next_page = cur_page + 1
        request = get_request(contract_type, postal_code, next_page)
        yield request

    for estate_dict in estates:  # Parsing the items of response
        item = EstateItem()
        fill_item(item, estate_dict)
        yield item
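
EstateItem and fill_item() aren't shown either; all that matters for the rest of the question is that the item has a code field, which the pipeline below treats as the unique key. A sketch, with the other fields made up:

import scrapy

class EstateItem(scrapy.Item):
    code = scrapy.Field()     # the unique key the dedup pipeline queries on
    price = scrapy.Field()    # placeholder fields, not from my real project
    address = scrapy.Field()

def fill_item(item, estate_dict):
    # Assumed key names, the real mapping is longer.
    item['code'] = estate_dict['code']
    item['price'] = estate_dict.get('price')
    item['address'] = estate_dict.get('address')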

Now, after a few pages, each link (target) will start hitting duplicate items already seen in previous crawls. Whether an item is a duplicate is decided in the pipeline, with a query to the database.

import logging

from sqlalchemy.orm import Session

def save_estate_item(self, item: EstateItem, session: Session):
    query = session.query(EstateModel)
    previous_item = query.filter_by(code=item['code']).first()

    if previous_item is not None:
        logging.info("Duplicate Estate")
        return
    
    # Save the item in the DB
    # ...
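
For completeness, save_estate_item() is called from the pipeline's process_item(self, item, spider). The session handling below is an assumption (SQLAlchemy 1.4-style sessionmaker), but the process_item signature is standard Scrapy, and it matters here because it hands the pipeline a reference to the spider:

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

class EstatePipeline:
    def __init__(self):
        # Assumed setup; how the session is created isn't shown above.
        engine = create_engine("sqlite:///estates.db")
        self.session_factory = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        # Standard Scrapy pipeline hook; note it receives the spider instance.
        with self.session_factory() as session:
            self.save_estate_item(item, session)
            session.commit()
        return item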

Now, when I find a duplicate estate here, I want Scrapy to stop following pages for that specific link target. How could I do that? I figured I would raise exceptions.DropItem('Duplicate post') in the pipeline with the info about the finished search target, and catch that exception in my spider. But how can I tell Scrapy to stop following links for that specific search target?
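
To make the idea concrete, here is a sketch of the kind of signalling I have in mind. The finished_targets set, the is_duplicate() helper, and the assumption that the item carries its search parameters are all inventions of mine, not existing Scrapy machinery:

import scrapy
from scrapy.exceptions import DropItem

class EstatePipeline:
    def process_item(self, item, spider):
        # Because process_item receives the spider, the pipeline can flag
        # the exhausted target directly, in addition to dropping the item.
        if self.is_duplicate(item):  # hypothetical wrapper around the DB query
            spider.finished_targets.add(
                (item['contract_type'], item['postal_code']))
            raise DropItem('Duplicate post')
        return item

class EstateSpider(scrapy.Spider):
    name = 'estates'
    finished_targets = set()  # search targets whose remaining pages to skip

    def parse(self, response, **kwargs):
        # ... same parsing logic as above ...
        if ((contract_type, postal_code) not in self.finished_targets
                and cur_page < num_pages):
            yield get_request(contract_type, postal_code, cur_page + 1)

One limitation I can see with this: requests that were already scheduled before the flag is set would still be fetched, so it only stops new next-page requests from being yielded.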
