How to reset crawlee URL cache?


I'm running a crawler that is called via an Express.js route.

When I call the same route again, the crawler runs again but reports that every request has already been handled, even though I'm removing the './storage' folder between calls.

I've read the documentation but can't get purgeDefaultStorages() to work.

How would I go about "resetting" Crawlee so that there are no cached results?

import express from 'express'
import { PlaywrightCrawler, purgeDefaultStorages, enqueueLinks, Configuration } from 'crawlee';

const app = express();

let crawler

let run = async () => {
    const config = new Configuration({ 'persistStorage': false, persistStorage: false }); // tested with and without quotes
    Configuration.set('persistStorage', false) // added this direct config to see if that might work too
    crawler = new PlaywrightCrawler({
        launchContext: {
            launchOptions: {
                headless: true,
            },
        },

    }, config);
    crawler.router.addDefaultHandler(async ({ request, page, enqueueLinks }) => {
        console.log(`Title of ${request.loadedUrl} ': img: ${request.id}`);
        await enqueueLinks({
            strategy: 'same-domain'
        });
    });

    await crawler.run(['http://localhost:8088/']);

    try {
        await config.getStorageClient().purge()
        await config.getStorageClient().teardown() // tested adding this too, just in case
        console.log('purging')
    } catch (e) {
        console.log(e)
    }
}

app.get('/', async (req, res) => {
    try {
        await run();
        res.sendStatus(200)
    } catch (e) {
        res.sendStatus(500)
    }

});

const PORT = process.env.PORT || 8889;
app.listen(PORT, () => {
    console.log(
        `The container started successfully and is listening for HTTP requests on ${PORT}`
    );
});

There are 2 best solutions below


You will have to use purgeOnStart in addition to purgeDefaultStorages, or you can simply set environment variables and you are good to go:

Set CRAWLEE_PURGE_ON_START environment variable to 0 or false
Set CRAWLEE_PERSIST_STORAGE environment variable to 0 or false

This will solve your current issue. Here is the link for your reference.
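For illustration, here is a minimal sketch of both approaches, assuming the same PlaywrightCrawler setup as in the question. purgeOnStart and persistStorage are the Configuration counterparts of those environment variables, and server.js is just a placeholder filename:

// Option A: set the environment variables before starting the process, e.g.
//   CRAWLEE_PURGE_ON_START=0 CRAWLEE_PERSIST_STORAGE=0 node server.js
//
// Option B: pass the equivalent options to Configuration directly.
import { PlaywrightCrawler, purgeDefaultStorages, Configuration } from 'crawlee';

const config = new Configuration({
    purgeOnStart: false,    // mirrors CRAWLEE_PURGE_ON_START=0
    persistStorage: false,  // mirrors CRAWLEE_PERSIST_STORAGE=0
});

const crawler = new PlaywrightCrawler({ /* same options as in the question */ }, config);

await purgeDefaultStorages(); // explicitly purge the default storages before each run
await crawler.run(['http://localhost:8088/']);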


You can modify the unique key for each request.

By default, Crawlee derives the unique key from the URL, so if you add the same URL to the queue again it will not be crawled.

For example, if you create a UUID for each run, you can start the crawl like this:

await crawler.run([
  { url: "http://localhost:8088", uniqueKey: `http://localhost:8088:${uuid}` },
]);

When adding links to the queue, use transformRequestFunction:

await enqueueLinks({
    strategy: 'same-domain',
    transformRequestFunction: (request) => {
        request.uniqueKey = `${request.url}:${uuid}`;
        return request;
    }
});
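Putting the two pieces together, a sketch of a single run could look like the following. It uses Node's built-in crypto.randomUUID() to generate the per-run value, and the crawler options are elided:

import { randomUUID } from 'node:crypto';
import { PlaywrightCrawler } from 'crawlee';

const uuid = randomUUID(); // a fresh value for every run

const crawler = new PlaywrightCrawler({ /* launch options as in the question */ });

crawler.router.addDefaultHandler(async ({ request, enqueueLinks }) => {
    console.log(`Handled ${request.loadedUrl}`);
    await enqueueLinks({
        strategy: 'same-domain',
        transformRequestFunction: (req) => {
            req.uniqueKey = `${req.url}:${uuid}`; // de-duplicate within this run only
            return req;
        },
    });
});

await crawler.run([
    { url: 'http://localhost:8088', uniqueKey: `http://localhost:8088:${uuid}` },
]);

Because the unique key now changes on every run, the request queue's de-duplication only applies within a single run, so repeated calls to the Express route crawl the site again.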

https://crawlee.dev/api/core/interface/RequestOptions#uniqueKey