How can I scrape data beyond the page limit on Zillow?

I wrote a script to scrape Zillow data and it works fine. The only problem is that it's limited to 20 pages even though there are many more results. Is there a way to get around this page limitation and scrape all the data?

I would also like to know whether there is a general solution to this problem, since I run into it on practically every site that I want to scrape.

Thank you

from bs4 import BeautifulSoup
import requests
import json
import pandas as pd  # the DataFrame/CSV step below needs pandas; lxml only has to be installed, not imported



headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9"
}


base_link = 'https://www.zillow.com/homes/Florida--/'
pages_number = 20  # Zillow serves at most 20 pages per search
def OnePage(response):
    soup = BeautifulSoup(response.text, 'lxml')
    # The search results are embedded as JSON inside an HTML comment in a
    # <script data-zrr-shared-data-key=...> tag; strip the comment markers.
    data = json.loads(
        soup.select_one("script[data-zrr-shared-data-key]")
        .contents[0]
        .strip("!<>-")
    )
    all_data = data['cat1']['searchResults']['listResults']

    result = []

    for listing in all_data:
        property_link = listing['detailUrl']
        property_response = requests.get(url=property_link, headers=headers)
        property_page_source = BeautifulSoup(property_response.text, 'lxml')
        # Each detail page embeds its data in the 'hdpApolloPreloadedData'
        # script tag; its 'apiCache' field is itself a JSON string, hence
        # the second json.loads. Parse the tag once instead of twice.
        preloaded = json.loads(property_page_source.find('script', {'id': 'hdpApolloPreloadedData'}).get_text())
        property_data_all = json.loads(preloaded['apiCache'])
        zp_id = str(preloaded['zpid'])
        property_data = property_data_all['ForSaleShopperPlatformFullRenderQuery{"zpid":'+zp_id+',"contactFormRenderParameter":{"zpid":'+zp_id+',"platform":"desktop","isDoubleScroll":true}}']["property"]
        # Build a fresh dict per listing; the original assigned string keys
        # to a list, which raises a TypeError.
        home_info = {
            "Broker Name": property_data['attributionInfo']['brokerName'],
            "Broker Phone": property_data['attributionInfo']['brokerPhoneNumber'],
        }
        result.append(home_info)

    return result
    


all_page_property_info = []
for page in range(1, pages_number + 1):
    # Pages after the first are addressed with a .../<n>_p suffix.
    search_link = base_link if page == 1 else base_link + str(page) + '_p'
    response = requests.get(url=search_link, headers=headers)
    all_page_property_info += OnePage(response)

# Build the DataFrame and write the CSV once, after all pages are collected,
# instead of rewriting the file on every iteration.
data = pd.DataFrame(all_page_property_info)
data.to_csv("/Users//Downloads/Zillow Search Result.csv", index=False)

1 Answer

Answered by Md. Fazlul Hoque

Actually, you can't grab most of the data from Zillow using bs4 alone, because it is loaded dynamically by JavaScript and bs4 can't render JS; only 6 to 8 data items are static. All of the data, however, is sitting in a script tag as JSON wrapped in an HTML comment. The following example shows how to pull the required data out of it; it extracts only the price, so pulling the rest of the items the same way is left to you. One caveat: Zillow is a well-known and well-protected site, so we should respect its terms and conditions.

Example:

import requests
import re
import json
import pandas as pd

url = 'https://www.zillow.com/fl/{page}_p/?searchQueryState=%7B%22usersSearchTerm%22%3A%22FL%22%2C%22mapBounds%22%3A%7B%22west%22%3A-94.21964006249998%2C%22east%22%3A-80.68448381249998%2C%22south%22%3A22.702203494269085%2C%22north%22%3A32.23788425255877%7D%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A14%2C%22regionType%22%3A2%7D%5D%2C%22isMapVisible%22%3Afalse%2C%22filterState%22%3A%7B%22sort%22%3A%7B%22value%22%3A%22days%22%7D%2C%22ah%22%3A%7B%22value%22%3Atrue%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A6%2C%22pagination%22%3A%7B%22currentPage%22%3A2%7D%7D'
lst = []
for page in range(1, 21):
    r = requests.get(url.format(page=page), headers={'User-Agent': 'Mozilla/5.0'})
    # The listing data sits inside an HTML comment as JSON; grab it with a regex.
    data = json.loads(re.search(r'!--(\{"queryState".*?)-->', r.text).group(1))

    for item in data['cat1']['searchResults']['listResults']:
        lst.append({'price': item['price']})

# to_csv() returns None, so build the DataFrame first, then save and print it.
df = pd.DataFrame(lst)
df.to_csv('out.csv', index=False)
print(df)

Output:

       price
0      $354,900
1      $164,900
2      $155,000
3      $475,000
4      $245,000
..          ...
795    $295,000
796     $10,000
797    $385,000
798  $1,785,000
799  $1,550,000

[800 rows x 1 columns]
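
As a side note on the long searchQueryState parameter in the URL above: it is just URL-encoded JSON, so you can decode it, edit the pagination or filters, and re-encode it. Since a single Zillow search tops out at about 20 pages, a common way to approach the original question is to split one broad search into several narrower ones (for example, by price band) and scrape each slice. Below is a minimal sketch of that decode/modify/encode round trip; note that the 'price' filter key with 'min'/'max' is an assumption about Zillow's filter schema, not something verified here.

import json
from urllib.parse import quote, unquote

# The searchQueryState value from the URL above: plain URL-encoded JSON.
encoded = '%7B%22usersSearchTerm%22%3A%22FL%22%2C%22mapBounds%22%3A%7B%22west%22%3A-94.21964006249998%2C%22east%22%3A-80.68448381249998%2C%22south%22%3A22.702203494269085%2C%22north%22%3A32.23788425255877%7D%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A14%2C%22regionType%22%3A2%7D%5D%2C%22isMapVisible%22%3Afalse%2C%22filterState%22%3A%7B%22sort%22%3A%7B%22value%22%3A%22days%22%7D%2C%22ah%22%3A%7B%22value%22%3Atrue%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A6%2C%22pagination%22%3A%7B%22currentPage%22%3A2%7D%7D'

state = json.loads(unquote(encoded))
print(state['pagination'])  # {'currentPage': 2}

# Narrow the search instead of paging past the cap. The 'price' key with
# 'min'/'max' is an assumed filter name, not confirmed against Zillow.
state['filterState']['price'] = {'min': 0, 'max': 200000}
state['pagination'] = {'currentPage': 1}

new_url = 'https://www.zillow.com/fl/?searchQueryState=' + quote(json.dumps(state, separators=(',', ':')))
print(new_url)

Running one such query per price band and concatenating the results should let you collect far more than 20 pages' worth of listings for a broad area, one narrow slice at a time.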