Missing items when scraping a JavaScript-rendered page using Scrapy and Splash


I am trying to scrape the following website for basic real estate listing information:

https://www.propertyfinder.ae/en/search?c=2&fu=0&l=50&ob=nd&page=1&rp=y

Parts of the website are dynamically loaded from a back-end API via JavaScript when the page is scrolled down. To get around this I have tried using Scrapy with Splash to render the JavaScript. The issue I am having is that instead of returning all the listings, it only returns the first 8. I thought the problem was that the page wasn't scrolled down, so it wasn't populated and the divs I needed weren't rendered. I then tried adding some Lua code (which I have no experience with) to scroll the page down in the hope it would be populated, but it hasn't worked. Below is my spider:

import scrapy
from scrapy.shell import inspect_response
from scrapy_splash import SplashRequest



class pfspider(scrapy.Spider):
    name = 'property_finder_spider'

    start_urls = ["https://www.propertyfinder.ae/en/search?c=2&fu=0&l=50&ob=nd&page=1&rp=y"]



    script1 = """function main(splash)
        local num_scrolls = 10
        local scroll_delay = 1.0

        local scroll_to = splash:jsfunc("window.scrollTo")
        local get_body_height = splash:jsfunc(
            "function() {return document.body.scrollHeight;}"
        )
        assert(splash:go(splash.args.url))
        splash:wait(splash.args.wait)

        for _ = 1, num_scrolls do
            scroll_to(0, get_body_height())
            splash:wait(scroll_delay)
        end        
        return splash:html()
    end"""


    def start_requests(self):
        for url in self.start_urls:
            # The 'execute' endpoint is what actually runs the Lua script above;
            # 'render.html' just returns the rendered page and ignores 'lua_source'.
            # yield SplashRequest(url=url, callback=self.parse, endpoint='execute',
            #                     args={'wait': 2, 'lua_source': self.script1})
            yield SplashRequest(url=url, endpoint='render.html', callback=self.parse)



    def parse(self, response):
        # Opens an interactive shell for debugging; remove for unattended runs.
        inspect_response(response, self)

        containers = response.xpath('//div[@class="column--primary"]/div[@class="card-list__item"]')
        self.logger.info('Found %d listing cards', len(containers))

        # Note: an absolute '//' XPath searches the whole document even when
        # called on a sub-selector, so these are queried on `response` directly.
        Listing_names_pf = response.xpath('//h2[@class="card__title card__title-link"]/text()').extract()

        Currency_pf = ['AED'] * len(Listing_names_pf)

        Prices_pf = response.xpath('//span[@class="card__price-value"]/text()').extract()

        type_pf = response.xpath('//p[@class="card__property-amenity card__property-amenity--property-type"]/text()').extract()

        Bedrooms_pf = response.xpath('//p[@class="card__property-amenity card__property-amenity--bedrooms"]/text()').extract()

        Bathrooms_pf = response.xpath('//p[@class="card__property-amenity card__property-amenity--bathrooms"]/text()').extract()

        SQF_pf = response.xpath('//p[@class="card__property-amenity card__property-amenity--area"]/text()').extract()

        Location_pf = response.xpath('//span[@class="card__location-text"]/text()').extract()

        Links_pf = response.xpath('//div[@class="card-list__item"]/a/@href').extract()

        Links_pf_full = ['https://www.propertyfinder.ae/' + link for link in Links_pf]

        # Emit one item per listing so the spider actually outputs the data.
        for row in zip(Listing_names_pf, Currency_pf, Prices_pf, type_pf,
                       Bedrooms_pf, Bathrooms_pf, SQF_pf, Location_pf, Links_pf_full):
            yield dict(zip(['name', 'currency', 'price', 'type', 'bedrooms',
                            'bathrooms', 'area', 'location', 'link'], row))

Another thing I noticed was that when the page is rendered in Splash, the HTML output file contains a script called Tealium that does have the listing data for all items in lists, but the data never appears under the divs in the page.
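
If the listing data really is embedded in that script, parsing it straight out of the page source may be easier than waiting for the divs to render. Below is a minimal, untested sketch of that idea: the "tealium" substring used to locate the script, the utag_data variable name, and the regex are all assumptions you would need to adapt to the actual markup, and it only works if the embedded object happens to be valid JSON.

import json
import re

def extract_embedded_listings(response):
    # Locate the analytics script; the 'tealium' marker is an assumption.
    script = response.xpath('//script[contains(text(), "tealium")]/text()').get()
    if script is None:
        return None
    # Assume the listings live in a JS object literal such as
    # `var utag_data = {...};` (hypothetical name) and that it parses as JSON.
    match = re.search(r'var\s+utag_data\s*=\s*(\{.*?\});', script, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))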


Any and all help or suggestions would be greatly appreciated.

1 Answer

BEST ANSWER

I am not familiar with Scrapy, but this can be done simply with Requests. Just explore the F12 -> XHR tab in your browser's dev tools to find the following URL.

To make it clearer, I break the parameters into a list of tuples that I then re-associate with the base URL. The include parameter can be "lightened" to contain only the data you want to retrieve, but by default it has everything. You can iterate on page[number] (see the pagination sketch after the output below), but beware: you may be blocked if the number of requests per second is excessive.

import requests as rq

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0"}
url = "https://www.propertyfinder.ae/en/api/search?"
params = [
    ("filter[category_id]", "2"),
    ("filter[furnished]","0"),
    ("filter[locations_ids][]","50"),
    ("filter[price_type]","y"),
    ("include","properties,properties.property_type,properties.property_images,properties.location_tree,properties.agent,properties.agent.languages,properties.broker,smart_ads,smart_ads.agent,smart_ads.broker,smart_ads.property_type,smart_ads.property_images,smart_ads.location_tree,direct_from_developer,direct_from_developer.property_type,direct_from_developer.property_images,direct_from_developer.location_tree,direct_from_developer.agent,direct_from_developer.broker,cts,cts.agent,cts.broker,cts.property_type,cts.property_images,cts.location_tree,similar_properties,similar_properties.agent,similar_properties.broker,similar_properties.property_type,similar_properties.property_images,similar_properties.location_tree,agent_smart_ads,agent_smart_ads.broker,agent_smart_ads.languages,agent_properties_smart_ads,agent_properties_smart_ads.agent,agent_properties_smart_ads.broker,agent_properties_smart_ads.location_tree,agent_properties_smart_ads.property_type,agent_properties_smart_ads.property_images"),
    ("page[limit]","25"),
    ("page[number]","4"),
    ("sort","nd")
]

resp = rq.get(url, params=params, headers=headers).json()

Next, you have to search resp to find the data you are interested in:

resultat = []
for el in resp["included"]:
    if el["type"] == "property":
        data = {
            "name": el["attributes"]["name"],
            "default_price": el["attributes"]["default_price"],
            "bathroom_value": el["attributes"]["bathroom_value"],
            "bedroom_value": el["attributes"]["bedroom_value"],
            "coordinates": el["attributes"]["coordinates"],
        }
        resultat.append(data)

resultat contains:

[{'name': '1Bed Apartment | Available | Large Terrace',
  'default_price': 92000,
  'bathroom_value': 2,
  'bedroom_value': 1,
  'coordinates': {'lat': 25.08333, 'lon': 55.144753}},
 {'name': 'Furnished  |Full sea view | All bills included',
  'default_price': 179000,
  'bathroom_value': 3,
  'bedroom_value': 2,
  'coordinates': {'lat': 25.083121, 'lon': 55.141064}},
   ........
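
To walk several result pages, a small loop like the following (untested) reuses url, params and headers from the snippet above and adds a delay to keep the request rate low; treating an empty "included" array as the end of the results is an assumption about how the API behaves.

import time

all_properties = []
for page in range(1, 11):  # first 10 pages; adjust as needed
    page_params = [p for p in params if p[0] != "page[number]"]
    page_params.append(("page[number]", str(page)))
    data = rq.get(url, params=page_params, headers=headers).json()
    included = data.get("included", [])
    if not included:  # assumption: no "included" entries means past the last page
        break
    all_properties.extend(el for el in included if el["type"] == "property")
    time.sleep(1)  # stay well under the rate that gets you blocked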

PS: Selenium should be considered only when all other scraping leads are exhausted.
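
For reference, that fallback could look roughly like this (a sketch, assuming chromedriver is installed and on PATH): scroll until the page height stops growing, then parse the final HTML with the same XPaths as in the question.

import time
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is on PATH
driver.get("https://www.propertyfinder.ae/en/search?c=2&fu=0&l=50&ob=nd&page=1&rp=y")

# Keep scrolling until the page height stops growing, i.e. no more cards load.
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1.5)  # give the lazy-load XHR calls time to return
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

html = driver.page_source  # hand this to parsel/lxml and reuse the XPaths above
driver.quit()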