I would like to scrape a large online marketplace website that has AJAX pages on it. I would like to set it up on a CentOS VPS that would intermittently pull data from the site, so I can strategize my product offering.
I am new to Python, Scrapy, and scraping in general. I've read through a few sites on how to go about scraping pages with an AJAX component.
Method 1. Have Scrapy interact with Selenium. I am installing the whole setup on my VPS, but I don't know if this will work. Does Selenium need a GUI browser to run? Still, this would be a great setup, as it would allow for quick changes if the web portal changes in the future.
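From what I've read so far, something like this headless setup might work on the VPS, but I'm not sure. This is a rough sketch; the URL is a placeholder and it assumes chromedriver is installed:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")    # run without a GUI, as needed on a VPS
options.add_argument("--no-sandbox")

driver = webdriver.Chrome(options=options)
driver.get("https://marketplace.example.com/products")  # placeholder URL
html = driver.page_source             # fully rendered page, AJAX included
driver.quit()
```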
Method 2. Have Scrapy simulate the XHR requests directly. There's some studying to do on the XHR calls, but this would be faster to process. It would take more time to tweak, though, if the site changes in the future.
Any help is appreciated.
Replicating XHR, AJAX, or any other type of request will always be many times faster and significantly less resource-intensive than employing something like Selenium. However, to get the most performance out of this approach, you need to reverse-engineer and replicate all of the requests by hand. Some websites fire several requests just to populate the product data you seek on a single page.
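For example, a hypothetical product-search endpoint might be replicated in a spider like this. The endpoint, parameters, and JSON field names below are made up for illustration; find the real ones in your browser's devtools (Network tab):

```python
import json
import scrapy

class ProductsSpider(scrapy.Spider):
    name = "products"

    def start_requests(self):
        # Replicate the XHR the page fires; this URL is hypothetical.
        url = "https://marketplace.example.com/api/search?query=widgets&page=1"
        yield scrapy.Request(url, headers={"X-Requested-With": "XMLHttpRequest"})

    def parse(self, response):
        data = json.loads(response.text)
        for item in data.get("results", []):   # field names are assumptions
            yield {"name": item.get("name"), "price": item.get("price")}
```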
On these occasions it does make sense to use something to render the JavaScript instead of reverse-engineering every XHR-or-similar request the website makes.
There's a pretty great tool designed for exactly that called Splash, which is a service that renders a webpage the way a web browser would (it is built on a Qt-based browser engine). This is the lazy approach, and it still outperforms Selenium by a huge margin, but it nevertheless remains slower than the hands-on approach of rewriting the requests in Scrapy.
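If you go that route, the scrapy-splash plugin makes the integration fairly painless. A minimal sketch, assuming a Splash instance running locally on port 8050; the URL and CSS selectors are placeholders:

```python
import scrapy
from scrapy_splash import SplashRequest

class RenderedSpider(scrapy.Spider):
    name = "rendered"
    # Assumes Splash is running locally, e.g. via:
    #   docker run -p 8050:8050 scrapinghub/splash
    custom_settings = {
        "SPLASH_URL": "http://localhost:8050",
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy_splash.SplashCookiesMiddleware": 723,
            "scrapy_splash.SplashMiddleware": 725,
            "scrapy.downloadermiddlewares.httpcompression."
            "HttpCompressionMiddleware": 810,
        },
        "SPIDER_MIDDLEWARES": {
            "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
        },
        "DUPEFILTER_CLASS": "scrapy_splash.SplashAwareDupeFilter",
    }

    def start_requests(self):
        yield SplashRequest(
            "https://marketplace.example.com/products",  # placeholder URL
            self.parse,
            args={"wait": 2},  # give the page time to run its javascript
        )

    def parse(self, response):
        # response.text is now the rendered HTML, so normal selectors work
        for title in response.css("div.product h2::text").getall():
            yield {"title": title}
```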