Defensive web scraping techniques for a Scrapy spider

I have been web scraping for about 3 months now, and I have noticed that many of my spiders need constant babysitting because the websites they target keep changing. I use Scrapy, Python, and Crawlera to scrape my sites. For example, 2 weeks ago I created a spider and just had to rebuild it because the website changed its meta tags from singular to plural (so location became locations). Such a small change shouldn't be able to break my spiders, so I would like to take a more defensive approach to my collections moving forward. Does anyone have any advice for web scraping that requires less babysitting? Thank you in advance!
Since you didn't post any code, I can only give general advice:
1. Look for a hidden API that returns the data you're after. Load the page in Chrome, open the developer tools with `F12`, and look under the Network tab. Press `CTRL + F` and search for text you can see on screen and want to collect. If you find a request under the Network tab whose response contains the data as JSON, scrape that instead: the backend of a website changes less frequently than the frontend, so it is more reliable.
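For example, a spider can request such an endpoint directly instead of parsing the rendered HTML. Here is a minimal sketch; the URL and the JSON keys (`results`, `title`, `location`) are made-up placeholders for whatever you actually find in the Network tab:

```python
import json

import scrapy


class HiddenApiSpider(scrapy.Spider):
    name = "hidden_api"
    # Hypothetical endpoint: substitute the request you found in the Network tab.
    start_urls = ["https://example.com/api/v1/listings?page=1"]

    def parse(self, response):
        # Assumes the endpoint returns {"results": [{"title": ..., "location": ...}, ...]}.
        data = json.loads(response.text)
        for item in data.get("results", []):
            yield {
                "title": item.get("title"),
                "location": item.get("location"),
            }
```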
2. Be less specific with selectors. Instead of `body > .content > #datatable > .row::text`, use `#datatable > .row::text`. Your spider will then be less likely to break on small changes, because it no longer depends on the whole ancestor chain staying intact.
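In a Scrapy callback the difference looks like this (using the same made-up `#datatable` markup as above):

```python
def parse(self, response):
    # Brittle: any change to body or .content above #datatable breaks this.
    # rows = response.css("body > .content > #datatable > .row::text").getall()

    # More defensive: anchored only on the element that identifies the data.
    rows = response.css("#datatable > .row::text").getall()
    for row in rows:
        yield {"row": row.strip()}
```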
3. Handle errors with `try`/`except` so that the whole parse function doesn't die when data you expect turns out to be missing or inconsistent.
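For example, inside a spider's parse callback you could wrap each item so that one bad row is logged and skipped rather than crashing the callback (the `.name` and `.price` selectors here are hypothetical):

```python
def parse(self, response):
    for row in response.css("#datatable > .row"):
        try:
            yield {
                "name": row.css(".name::text").get(),
                # float() raises TypeError/ValueError if the price is missing or malformed.
                "price": float(row.css(".price::text").get()),
            }
        except (TypeError, ValueError):
            # Skip the inconsistent row and keep parsing the rest of the page.
            self.logger.warning("Skipping malformed row: %r", row.get()[:120])
```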