Defensive web scraping techniques for a Scrapy spider

I have been web scraping for about 3 months now, and I have noticed that many of my spiders need constant babysitting because the websites they target keep changing. I use Scrapy, Python, and Crawlera to scrape my sites. For example, 2 weeks ago I created a spider and just had to rebuild it because the website changed its meta tags from singular to plural (so location became locations). Such a small change shouldn't be able to break my spiders, so I would like to take a more defensive approach to my collections moving forward. Does anyone have any advice for web scraping that requires less babysitting? Thank you in advance!
Since you didn't post any code, I can only give general advice:
1. Look for a hidden API that returns the data you're after. Load the page in Chrome, open the developer tools with `F12`, and look under the Network tab. Press `CTRL + F` and search for text you can see on screen and want to collect. If you find a request under the Network tab whose response contains the data as JSON, scrape that instead: the backend of a website changes less frequently than the frontend, so it is more reliable.
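For example, a spider can request such an endpoint directly instead of parsing the rendered HTML. Here is a minimal sketch; the URL and the JSON keys (`results`, `title`, `location`) are made-up placeholders for whatever you actually find in the Network tab:

```python
import json

import scrapy


class HiddenApiSpider(scrapy.Spider):
    name = "hidden_api"
    # Hypothetical endpoint: substitute the request you found in the Network tab.
    start_urls = ["https://example.com/api/v1/listings?page=1"]

    def parse(self, response):
        # Assumes the endpoint returns {"results": [{"title": ..., "location": ...}, ...]}.
        data = json.loads(response.text)
        for item in data.get("results", []):
            yield {
                "title": item.get("title"),
                "location": item.get("location"),
            }
```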
2. Be less specific with selectors. Instead of `body > .content > #datatable > .row::text`, use `#datatable > .row::text`. Your spider will then be less likely to break on small changes, because it no longer depends on the whole ancestor chain staying intact.
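In a Scrapy callback the difference looks like this (using the same made-up `#datatable` markup as above):

```python
def parse(self, response):
    # Brittle: any change to body or .content above #datatable breaks this.
    # rows = response.css("body > .content > #datatable > .row::text").getall()

    # More defensive: anchored only on the element that identifies the data.
    rows = response.css("#datatable > .row::text").getall()
    for row in rows:
        yield {"row": row.strip()}
```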
3. Handle errors with `try`/`except` so that the whole parse function doesn't die when data you expect turns out to be missing or inconsistent.
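For example, inside a spider's parse callback you could wrap each item so that one bad row is logged and skipped rather than crashing the callback (the `.name` and `.price` selectors here are hypothetical):

```python
def parse(self, response):
    for row in response.css("#datatable > .row"):
        try:
            yield {
                "name": row.css(".name::text").get(),
                # float() raises TypeError/ValueError if the price is missing or malformed.
                "price": float(row.css(".price::text").get()),
            }
        except (TypeError, ValueError):
            # Skip the inconsistent row and keep parsing the rest of the page.
            self.logger.warning("Skipping malformed row: %r", row.get()[:120])
```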