Is it easier to scrape the AMP versions of webpages?


I'm working on a web scraper that aggregates newspaper articles. I know the AMP protocol mandates a stripped-down version of JavaScript, and I also know that JavaScript (in part) enables website administrators to detect and prevent scraping. So logically, I figured it would be easier to scrape AMP websites. On the other hand, if this were true, I would expect StackOverflow to be on top of it, but I haven't found a single thread confirming my inference. Am I correct, or am I overlooking something?


There is 1 best solution below

Haddock-san On

I would say that AMP pages are definitely easier to scrape, because there is virtually no custom JS code. Many regular sites insert their content with JS or AJAX, so it is not in the HTML you first download. AMP restricts which scripts a page may load, so an AMP page carries far fewer of them than a regular site, and the article text is usually already present in the markup.
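To get at the AMP version in the first place: a regular article page normally advertises its AMP counterpart with a `<link rel="amphtml">` tag in the head. Here is a rough sketch of picking that up with only the Python standard library. The names `AmpLinkFinder` and `find_amp_url` and the example.com URLs are made up for illustration, and fetching the page is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AmpLinkFinder(HTMLParser):
    """Remembers the href of the first <link rel="amphtml"> tag it sees."""

    def __init__(self):
        super().__init__()
        self.amp_href = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.amp_href is not None:
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated values, so split before matching
        rel_values = (attr_map.get("rel") or "").lower().split()
        if "amphtml" in rel_values:
            self.amp_href = attr_map.get("href")


def find_amp_url(html, base_url):
    """Return the absolute AMP URL advertised by the page, or None if absent."""
    finder = AmpLinkFinder()
    finder.feed(html)
    if finder.amp_href is None:
        return None
    # The href can be relative, so resolve it against the page URL
    return urljoin(base_url, finder.amp_href)
```

If `find_amp_url` returns None, the site probably has no AMP version of that article and you are back to scraping the regular page.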

Furthermore, if you want to scrape content that is rendered by JavaScript, you should use Selenium. If not, PHP is the way to go (IMHO), or BeautifulSoup in Python.
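For the static case, here is a minimal sketch of pulling paragraph text out of already-downloaded HTML. It sticks to the standard library's `html.parser` so it runs without installing anything; with BeautifulSoup the same idea is roughly a `find_all("p")` loop. The class name `ParagraphExtractor` and the sample markup are invented for the example, and real article pages will likely need a narrower selection than "every `<p>`".

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the visible text of each <p> element, one string per paragraph."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Join the pieces and collapse runs of whitespace
            text = " ".join("".join(self._chunks).split())
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        # Text inside inline tags such as <b> still arrives here
        if self._in_p:
            self._chunks.append(data)


sample = (
    "<article><p>First <b>paragraph</b>.</p>"
    '<amp-img src="x.jpg" width="1" height="1"></amp-img>'
    "<p>Second   paragraph.</p></article>"
)
extractor = ParagraphExtractor()
extractor.feed(sample)
print(extractor.paragraphs)
```

No browser is involved here, which is the whole point: if the text is in the markup, a parser is enough and Selenium is overkill.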

Happy scraping!