I am trying to scrape https://booking.com with Playwright and Python, but I don't know how to scrape through multiple result pages. How do I solve this pagination problem? How can I loop over the page numbers at the end and scrape the data on the other pages? Here's my code:
from playwright.sync_api import sync_playwright
import pandas as pd

def main():
    with sync_playwright() as p:
        checkin_date = '2023-12-10'
        checkout_date = '2023-12-18'
        page_url = f'https://www.booking.com/searchresults.html?ss=Medina%2C+Saudi+Arabia&label=gog235jc-1DCAEoggI46AdIM1gDaMQBiAEBmAExuAEXyAEP2AED6AEB-AECiAIBqAIDuAL4_M2qBsACAdICJDIyNmY5NThlLTdkNjctNDg2Yi05ZDMzLWY3M2JhZmRkZDdhNtgCBOACAQ&aid=397594&lang=en-us&sb=1&src_elem=sb&src=index&dest_id=-3092186&dest_type=city&checkin={checkin_date}&checkout={checkout_date}&group_adults=1&no_rooms=1&group_children=0&sb_travel_purpose=leisure'
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto(page_url, timeout=6000000)
        hotels = page.locator('//div[@data-testid="property-card"]').all()
        print(f'There are: {len(hotels)} hotels.')
        hotels_list = []
        for hotel in hotels:
            hotel_dict = {}
            hotel_dict['hotel'] = hotel.locator('//div[@data-testid="title"]').inner_text()
            hotel_dict['price'] = hotel.locator('//span[@data-testid="price-and-discounted-price"]').inner_text()
            hotel_dict['Nights'] = hotel.locator('//div[@data-testid="availability-rate-wrapper"]/div[1]/div[1]').inner_text()
            hotel_dict['tax'] = hotel.locator('//div[@data-testid="availability-rate-wrapper"]/div[1]/div[3]').inner_text()
            hotel_dict['score'] = hotel.locator('//div[@data-testid="review-score"]/div[1]').inner_text()
            hotel_dict['distance'] = hotel.locator('//span[@data-testid="distance"]').inner_text()
            hotel_dict['avg review'] = hotel.locator('//div[@data-testid="review-score"]/div[2]/div[1]').inner_text()
            hotel_dict['reviews count'] = hotel.locator('//div[@data-testid="review-score"]/div[2]/div[2]').inner_text().split()[0]
            hotels_list.append(hotel_dict)
        df = pd.DataFrame(hotels_list)
        df.to_excel('hotels_list.xlsx', index=False)
        df.to_csv('hotels_list.csv', index=False)
        browser.close()

if __name__ == '__main__':
    main()
You can create additional pages and operate on them asynchronously.
So what you could do is extract the URLs of the result pages and feed them to newly created pages. For that, write an asynchronous function that takes a freshly created page and one of the URLs, navigates to it, and returns the scraped data.
In the main function, where you have the URLs at hand, call that function once per URL, collect the resulting coroutines in a list, and await them all (for example with asyncio.gather). You're done.