Handling pagination in python playwright


I am trying to scrape https://booking.com with Playwright and Python, but I don't know how to scrape through multiple result pages. How do I solve this pagination problem? How can I loop over the page numbers at the bottom and scrape the data on the other pages? Here's my code:

from playwright.sync_api import sync_playwright
import pandas as pd

def main():
    with sync_playwright() as p:
        checkin_date = '2023-12-10'
        checkout_date = '2023-12-18'
        page_url= f'https://www.booking.com/searchresults.html?ss=Medina%2C+Saudi+Arabia&label=gog235jc-1DCAEoggI46AdIM1gDaMQBiAEBmAExuAEXyAEP2AED6AEB-AECiAIBqAIDuAL4_M2qBsACAdICJDIyNmY5NThlLTdkNjctNDg2Yi05ZDMzLWY3M2JhZmRkZDdhNtgCBOACAQ&aid=397594&lang=en-us&sb=1&src_elem=sb&src=index&dest_id=-3092186&dest_type=city&checkin={checkin_date}&checkout={checkout_date}&group_adults=1&no_rooms=1&group_children=0&sb_travel_purpose=leisure'
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto(page_url,timeout=6000000)

        hotels = page.locator('//div[@data-testid="property-card"]').all()
        print(f'There are: {len(hotels)} hotels.')

        hotels_list =[]
        for hotel in hotels:
            hotel_dict = {}
            hotel_dict['hotel'] = hotel.locator('//div[@data-testid="title"]').inner_text()
            hotel_dict['price'] = hotel.locator('//span[@data-testid="price-and-discounted-price"]').inner_text()
            hotel_dict['Nights'] = hotel.locator('//div[@data-testid="availability-rate-wrapper"]/div[1]/div[1]').inner_text()
            hotel_dict['tax'] = hotel.locator('//div[@data-testid="availability-rate-wrapper"]/div[1]/div[3]').inner_text()
            hotel_dict['score'] = hotel.locator('//div[@data-testid="review-score"]/div[1]').inner_text()
            hotel_dict['distance'] = hotel.locator('//span[@data-testid="distance"]').inner_text()
            hotel_dict['avg review'] = hotel.locator('//div[@data-testid="review-score"]/div[2]/div[1]').inner_text()
            hotel_dict['reviews count'] = hotel.locator('//div[@data-testid="review-score"]/div[2]/div[2]').inner_text().split()[0]

            hotels_list.append(hotel_dict)

        df = pd.DataFrame(hotels_list)
        df.to_excel('hotels_list.xlsx', index=False) 
        df.to_csv('hotels_list.csv', index=False) 
        browser.close()
        
if __name__ == '__main__':
    main()

There is 1 answer below


You can create a new page and also operate on it asynchronously.

So, what you could do is extract the links and then feed them to your newly created pages. For that, create an asynchronous function which:

  • processes one results page, i.e. extracts all the links from that URL
  • returns the links

In the main function, where you have the links at hand, you would call that function with a newly created page as a parameter and feed it one of the URLs. Repeat that for all of the links, add the results to a list, and await the list. You're done.
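
A minimal sketch of that approach using playwright.async_api, assuming you already have a list of result-page URLs to process; the extract_links helper, the placeholder URL, and the property-card link selector are illustrative, not taken from the question:

import asyncio
from playwright.async_api import async_playwright

async def extract_links(page, url):
    # Process one results page: navigate to the URL and
    # collect the property links it contains.
    await page.goto(url)
    return await page.locator(
        '//div[@data-testid="property-card"]//a'
    ).evaluate_all('els => els.map(e => e.href)')

async def main():
    # Result-page URLs you want to process -- placeholder here.
    result_pages = ['https://www.booking.com/searchresults.html?...']

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context()

        # One newly created page per URL; gather awaits all of them together.
        tasks = [extract_links(await context.new_page(), url) for url in result_pages]
        all_links = await asyncio.gather(*tasks)

        print(all_links)
        await browser.close()

asyncio.run(main())

Each URL gets its own page from the same browser context, so the pages are processed concurrently, and asyncio.gather collects the returned links into one list.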