Scraping blogs - avoid already scraped items by checking URLs from JSON/CSV in advance


I'd like to scrape news pages / blogs (anything that contains new information on a daily basis).

My crawler works fine and does everything I ask it to do.

But I cannot find a proper solution for making it ignore already scraped URLs (or items, to keep it more general) and only append new URLs/items to an already existing JSON/CSV file.

I've seen many solutions here for checking whether an item exists in a CSV file, but none of them really worked.

Scrapy DeltaFetch apparently cannot be installed on my system: I get errors, and all the hints (e.g. $ sudo pip install bsddb3, upgrade this, update that, etc.) don't do the trick. (I've been trying for 3 hours now and am fed up with hunting for a fix for a package that hasn't been updated since 2017.)

I hope you have a handy and practical solution.

Thank you very much in advance!

Best regards!

1 Answer

BEST ANSWER

An option could be a custom downloader middleware with the following (a rough sketch is shown after the list):

  • A process_response method that puts the URL you crawled into a database
  • A process_request method that checks whether the URL is already present in the database. If it is, raise an IgnoreRequest so the request does not go through anymore.
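A minimal sketch of such a middleware, assuming SQLite as the database. The file name seen_urls.db, the setting name SEEN_URLS_DB, and the class/module names are arbitrary choices for illustration, not part of Scrapy:

```python
import sqlite3

from scrapy.exceptions import IgnoreRequest


class SeenUrlsMiddleware:
    """Skip requests whose URL was already crawled in an earlier run."""

    def __init__(self, db_path="seen_urls.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)")

    @classmethod
    def from_crawler(cls, crawler):
        # SEEN_URLS_DB is a made-up setting name; default to a local SQLite file.
        return cls(crawler.settings.get("SEEN_URLS_DB", "seen_urls.db"))

    def process_request(self, request, spider):
        # Check whether this URL was stored by a previous crawl.
        row = self.conn.execute(
            "SELECT 1 FROM seen WHERE url = ?", (request.url,)
        ).fetchone()
        if row:
            raise IgnoreRequest(f"Already crawled: {request.url}")
        return None  # let the request continue through the middleware chain

    def process_response(self, request, response, spider):
        # Record the URL of every successfully downloaded page.
        if response.status == 200:
            self.conn.execute(
                "INSERT OR IGNORE INTO seen (url) VALUES (?)", (response.url,)
            )
            self.conn.commit()
        return response
```

The middleware then has to be enabled in the project settings (the module path here is hypothetical):

```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.SeenUrlsMiddleware": 543,
}
```

If you want your existing JSON/CSV export to be respected on the first run, you could seed the seen table once from the URLs already present in that file.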