I'm trying to get data from this website using scrapy-splash, but I'm not able to extract anything. I want to get details for each real estate listing, such as the href, price, etc. Here is my code:
In settings.py:
ROBOTSTXT_OBEY = False
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36"
SPLASH_ENABLED = True
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPLASH_URL = 'http://localhost:8050/'
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
My spider:
import scrapy
from scrapy_splash import SplashRequest


class M2Spider(scrapy.Spider):
    name = "m2"
    allowed_domains = ['metrocuadrado.com']
    start_urls = [
        'https://www.metrocuadrado.com/bodega/arriendo'
    ]

    def start_requests(self):
        for url in self.start_urls:
            # Render the page through Splash, waiting 10 s for JavaScript to run
            yield SplashRequest(url=url, callback=self.parse,
                                endpoint='render.html',
                                args={'wait': 10})

    def parse(self, response):
        print("--------------------------------------------------------------")
        real_states = response.selector.xpath(".//a[@class='sc-bdVaJa ebNrSm']").getall()
        print(real_states)
The printed output is an empty list, []. I am new to Splash. Any suggestions?
What I would do instead is this:
Send a request to https://www.metrocuadrado.com/results/_next/static/chunks/commons.8afec6af6d5add2097bf.js. In the response you'll find an API key if you search for "X-Api-Key". It can be extracted easily with a regex, something like:
re.findall(r'"X-Api-Key":"(\w+)"', response.text)
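As a rough illustration of that extraction step, here is a minimal sketch using only the standard library. The sample string and the key value in it are made up for the example, and `extract_api_key` is a hypothetical helper name. The real bundle may be formatted differently, so the pattern may need adjusting.

```python
import re

# Pattern from the answer: captures the value that follows "X-Api-Key".
API_KEY_PATTERN = re.compile(r'"X-Api-Key":"(\w+)"')


def extract_api_key(js_source):
    """Return the first API key found in the JavaScript text, or None."""
    match = API_KEY_PATTERN.search(js_source)
    return match.group(1) if match else None


# Made-up snippet standing in for the downloaded commons.*.js file.
sample_js = 'headers:{"X-Api-Key":"abc123DEF","Accept":"application/json"}'
print(extract_api_key(sample_js))
```

Returning None when nothing matches makes it easy to notice that the bundle has changed, instead of failing later with an index error.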
Then, once you've extracted the API key, send a request to https://www.metrocuadrado.com/rest-search/search?seo=/bodega/arriendo&from=0&size=50, which is the hidden API behind the page you linked. To get a valid response, you have to attach the key in an X-Api-Key request header.
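A possible way to build that request is sketched below with the standard library. The helper names `build_search_request` and `fetch_results` are made up for illustration, and it is an assumption that the `X-Api-Key` header alone is enough; the site may also check other headers such as the user agent.

```python
import json
import urllib.parse
import urllib.request

SEARCH_ENDPOINT = "https://www.metrocuadrado.com/rest-search/search"


def build_search_request(api_key, seo="/bodega/arriendo", start=0, size=50):
    """Build (but do not send) a GET request for one page of search results."""
    # safe="/" keeps the slashes in the seo path unescaped, as in the URL above.
    query = urllib.parse.urlencode(
        {"seo": seo, "from": start, "size": size}, safe="/"
    )
    return urllib.request.Request(
        f"{SEARCH_ENDPOINT}?{query}",
        headers={"X-Api-Key": api_key},
    )


def fetch_results(api_key, **kwargs):
    """Send the request and decode the JSON body (performs a network call)."""
    request = build_search_request(api_key, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `fetch_results(api_key, start=50)` would then ask for the next page, assuming `from` and `size` behave as an offset and a page size.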
From that API you get JSON-formatted data, which is usually more reliable than parsing the HTML, since the HTML changes more often.
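To show why JSON is convenient here, the sketch below pulls a couple of fields out of a decoded response. The payload and the field names ("results", "link", "price") are placeholders invented for the example, not the site's actual schema; inspect a real response to find the right keys.

```python
import json

# Made-up payload: "results", "link" and "price" are placeholder names only.
sample_body = (
    '{"results": ['
    '{"link": "/example/1", "price": 1500000}, '
    '{"link": "/example/2", "price": 2300000}]}'
)


def extract_listings(body, list_key="results", fields=("link", "price")):
    """Pick the chosen fields out of each item in the decoded JSON list."""
    data = json.loads(body)
    return [
        {field: item.get(field) for field in fields}
        for item in data.get(list_key, [])
    ]


for listing in extract_listings(sample_body):
    print(listing)
```

Because the values are looked up by key rather than by CSS class, a change to the page's generated class names (such as 'sc-bdVaJa ebNrSm') would not affect this code.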