What is the best practice for scraping infinite scroll (one-time scraping vs. progressive scraping)?


I'm scraping a site that has infinite scroll, and I'm wondering what the best way to do it is:

Option 1: Scrape and Scroll (Repeat)

  • Load page
  • Scrape data
  • Scroll
  • Scrape data
  • Scroll
  • Repeat

Question:

  • Could I be scraping data twice?
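
Something like this is what I have in mind for Option 1 (untested sketch; scrape_visible_items is a placeholder for the real extraction logic, and I would dedupe on the event URL to avoid keeping the same item twice):

import time

def scrape_with_scroll(driver, iterations):
    seen_urls = set()   # unique key per item, e.g. the event URL
    records = []

    for _ in range(iterations):
        # Scrape whatever is currently rendered
        for record in scrape_visible_items(driver):  # placeholder helper
            if record['event_url'] not in seen_urls:
                seen_urls.add(record['event_url'])
                records.append(record)

        # Then scroll to load the next batch
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(5)

    return records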

Option 2: Scroll and Scrape (All)

  • Load page
  • Scroll
  • Scroll
  • Scroll
  • etc ...
  • Scrape all Data

Question:

  • Could I be missing data?

I have managed to code "Option 2", and I'm curious whether Option 1 would work too, and what the pros/cons are.

Thanks.

I tried Option 2 (scroll first, then scrape everything), and it's working.

Added information:

Function to scroll:

import time

def scroll_to_bottom(driver):
    # Scroll to the bottom of the page using JavaScript
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)  # Adjust the sleep time as needed
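
If the number of scrolls isn't known in advance, a common variant (sketch, not tested against this site) is to keep scrolling until document.body.scrollHeight stops growing:

def scroll_until_stable(driver, pause=5, max_rounds=30):
    # Keep scrolling until the page height stops changing,
    # i.e. no new content is being loaded any more.
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # crude wait; adjust as needed
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height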

Function to extract data:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

def Site_Extract_Cal_Event(local_webdriver: webdriver, local_event_webdriver: webdriver, url_site: str, iteration: int):

    global df
    
    local_webdriver.get(url_site)
    
    logger.debug('BKRM: Extract Data from current page')
    logger.debug('BKRM: Extract URLs')

    # Scroll the infinitely scrolling page a few times
    # (note: range(1, iteration) performs iteration - 1 scrolls)

    for index in range(1, iteration):
        logger.debug('BKRM: scrolling [' + str(index) + ']')
        scroll_to_bottom(local_webdriver)

    # Search for WebElement for each event

    xpath_event = '//*[starts-with(@id, "ep-")]' 
    div_elements = local_webdriver.find_elements("xpath",xpath_event)                                                                                                                        

    element_index = 1

    for element in div_elements: 

        element_index_text = "{:02d}".format(element_index)
        logger.debug("ELEMENT [" + element_index_text + "] : ")

        # Search for time of event

        xpath_time = ".//time"
        run_time = element.find_element("xpath",xpath_time) 
        logger.debug("BKRM: " + "Time: " + run_time.text)

        # Search for URL of full event description
        # Used to extract organiser

        xpath_meetup_url_link = ".//a[@class='flex h-full flex-col justify-between space-y-5 outline-offset-8 hover:no-underline']"
        run_meetup_url_link = element.find_element("xpath",xpath_meetup_url_link)
        run_meetup_url_link_text = run_meetup_url_link.get_attribute("href")
        logger.debug("BKRM: " + "Meetup url link: " + run_meetup_url_link_text)

        # Search for Event Title

        xpath_title = './/span[@class="ds-font-title-3 block break-words leading-7 utils_cardTitle__lbnC_ text-gray6"]'
        run_title = element.find_element("xpath",xpath_title)
        logger.debug("BKRM: " + "Title: " + run_title.text)
        
        # Search for attendee number
        # in some event, no attendee is written

        xpath_attendee_number = ".//span[@class='hidden sm:inline']"
        try:
            run_attendee_number = element.find_element("xpath",xpath_attendee_number)
            run_attendee_number_text_temp= run_attendee_number.text
            an_text = run_attendee_number_text_temp.split()
            run_attendee_number_text = an_text[0] 
            logger.debug("BKRM: " + "Attendee Number: " + run_attendee_number.text)

        except NoSuchElementException:
            run_attendee_number_text= "0"
            
        # Search for organizer name from details event description

        run_organizer = Site_Extract_Event_Details(local_event_webdriver, run_meetup_url_link_text) 

        # run_organizer = "BKK RUNNERS"

        element_index = element_index +1

        # Create a new record to add
        new_record = {'event_site':BKRM_SITE,
                    'event_date': run_time.text,
                    'event_title': run_title.text,
                    'event_organizer': run_organizer,
                    'event_attendee_number': run_attendee_number_text,
                    'event_url': run_meetup_url_link_text}

        # Append the new record to the DataFrame
        df = df.append(new_record, ignore_index=True)
        logger.debug('Adding Event: ' + run_time.text + " / " + run_title.text)
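One thing to be aware of with the df.append call above: DataFrame.append was removed in pandas 2.0, so on newer pandas versions the equivalent append would look like this (sketch):

import pandas as pd

# Equivalent of: df = df.append(new_record, ignore_index=True)
df = pd.concat([df, pd.DataFrame([new_record])], ignore_index=True)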

There is 1 answer below.

Answer from Harikrishnaa K:

Could I be scraping data twice?

This depends entirely on the website you are crawling: some sites only ever show unique items, while others repeat items as you scroll. So it's safer to keep only unique results, either while crawling or by deduplicating as a post-processing step.
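
For example, with the DataFrame you are building in the question, a post-processing dedupe could be as simple as dropping duplicates on a stable key such as the event URL (assuming event_url uniquely identifies an event):

# Drop rows that were scraped twice, keeping the first occurrence,
# assuming 'event_url' uniquely identifies an event
df = df.drop_duplicates(subset='event_url', keep='first').reset_index(drop=True)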

Could I be missing data?

It depends on the number of scrolls you do. If you do a finite, reasonable number of scrolls and then take the data, there is little chance of missing anything. If you do very many scrolls, the Selenium browser may get stuck after a while, and then there is a chance of missing data.

I'm scraping a site that has infinite scroll, and I'm wondering what the best way to do it is:

If it's possible for you to avoid Selenium, that's better: try to identify the background requests the page makes when you scroll, and make those requests directly. The reason I don't suggest Selenium is that browser and network failures are common, and managing Selenium at scale is also a hassle; when you scroll many times, the browser can get stuck. If you know the background request, you can make one request at a time.
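
As a purely hypothetical illustration (the real endpoint, parameters and pagination scheme have to be discovered in the browser's network tab, since the target site isn't disclosed), paging through a background JSON API usually looks something like this:

import requests

# Hypothetical endpoint and parameters -- replace them with whatever the
# network tab shows when more results load on scroll.
BASE_URL = "https://example.com/api/events"

def fetch_all_events(page_size=20):
    events, offset = [], 0
    while True:
        resp = requests.get(BASE_URL,
                            params={"offset": offset, "limit": page_size},
                            timeout=30)
        resp.raise_for_status()
        batch = resp.json().get("events", [])
        if not batch:
            break
        events.extend(batch)
        offset += page_size
    return events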

To find the background request, open the browser's network debugger and look for the request that returns the data you need.

Since the target website hasn't been disclosed, I can't give more specific information about its background requests.