I'm working on a web scraper that aggregates newspaper articles. I know the AMP protocol mandates a stripped-down version of JavaScript, and I also know that JavaScript (in part) enables website administrators to detect and prevent scraping. So logically, I figured it would be easier to scrape AMP websites. On the other hand, if this were true, I would expect StackOverflow to be on top of it, but I haven't found a single thread confirming my inference. Am I correct, or am I overlooking something?
Is it easier to scrape the AMP versions of webpages?
391 Views Asked by Guy4444 At
There is 1 best solution below
I would say that AMP pages are definitely easier to scrape because they contain virtually no custom JS code. Many regular sites insert content with JS or AJAX, so the text you want is not in the HTML the server sends. AMP restricts which libraries a page can load, so an AMP page ships far fewer scripts than a regular site.
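As a rough illustration of why the lack of custom JS helps, here is a minimal sketch that pulls paragraph text straight out of AMP-style markup without executing any script. The sample HTML, the class name `ParagraphExtractor`, and the helper `extract_paragraphs` are made up for this example, and it uses only Python's standard-library `html.parser` so that it runs without extra dependencies. It assumes the article body sits in plain `<p>` tags, which a real site may not guarantee.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text of every <p> element, ignoring <script>/<style> data."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_paragraph = False
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "p":
            self._in_paragraph = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "p" and self._in_paragraph:
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self._chunks).split())
            if text:
                self.paragraphs.append(text)
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph and self._skip_depth == 0:
            self._chunks.append(data)


def extract_paragraphs(page_html):
    """Return the list of paragraph strings found in page_html."""
    parser = ParagraphExtractor()
    parser.feed(page_html)
    parser.close()
    return parser.paragraphs


# Hypothetical AMP-style page: the article text is already in the markup.
SAMPLE_AMP = """
<!doctype html>
<html amp>
<head>
<script async src="https://cdn.ampproject.org/v0.js"></script>
<style amp-custom>p { margin: 0; }</style>
</head>
<body>
<h1>Council approves budget</h1>
<p>The city council approved the <b>2024</b> budget on Monday.</p>
<amp-img src="hall.jpg" width="600" height="400" layout="responsive"></amp-img>
<p>The vote was 7 to 2.</p>
</body>
</html>
"""

paragraphs = extract_paragraphs(SAMPLE_AMP)
for paragraph in paragraphs:
    print(paragraph)
```

On a page that fills in its text with AJAX, the same parser would come back empty, which is the practical difference between the two kinds of page.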
Furthermore, if you want to scrape content that is rendered by JavaScript, you can use Selenium. If not, PHP is the way to go (IMHO), or BeautifulSoup in Python.
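Whichever tool you pick, the scraper first needs the AMP URL of each article. Sites that publish AMP normally advertise it from the regular page with a `<link rel="amphtml">` tag in the `<head>`. The sketch below looks for that tag; the names `AmpLinkFinder` and `find_amp_url` and the `news.example.com` page are invented for illustration, and it again sticks to the standard library, although BeautifulSoup would make it shorter. Treat it as a starting point rather than a complete solution, since not every site provides an AMP version.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AmpLinkFinder(HTMLParser):
    """Records the href of the first <link rel="amphtml"> tag it sees."""

    def __init__(self):
        super().__init__()
        self.amp_href = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.amp_href is not None:
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated tokens.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "amphtml" in rel_tokens:
            self.amp_href = attr_map.get("href")


def find_amp_url(page_html, page_url):
    """Return the absolute AMP URL advertised by the page, or None."""
    finder = AmpLinkFinder()
    finder.feed(page_html)
    finder.close()
    if finder.amp_href is None:
        return None
    # Resolve relative hrefs against the URL the page was fetched from.
    return urljoin(page_url, finder.amp_href)


# Hypothetical regular article page that points at its AMP counterpart.
SAMPLE_PAGE = """
<html><head>
<link rel="canonical" href="https://news.example.com/story">
<link rel="amphtml" href="/amp/story">
</head><body></body></html>
"""

amp_url = find_amp_url(SAMPLE_PAGE, "https://news.example.com/story")
print(amp_url)
```

If `find_amp_url` returns `None`, the page has no advertised AMP version and you are back to scraping the regular page, possibly with Selenium.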
Happy scraping!