I'm scraping a site that uses infinite scroll, and I was wondering what the best way to do it is:
Option 1: Scrape and Scroll (Repeat)
- Load page
- Scrape data
- Scroll
- Scrape data
- Scroll
- Repeat
Question:
- Could I be scraping data twice?
Option 2: Scroll and Scrape (All)
- Load page
- Scroll
- Scroll
- Scroll
- etc ...
- Scrape all Data
Question:
- Could I be missing data?
I have managed to code "Option 2", and I'm curious whether Option 1 would work too, and what the pros/cons are; a rough sketch of what I have in mind for Option 1 is below.
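Option 1, as I imagine it, would look something like this (untested sketch; scrape_visible_events stands for whatever per-card extraction I end up writing, and the set of seen URLs is there to avoid keeping the same event twice):

import time

seen_urls = set()
results = []
for _ in range(iteration):
    # Scrape whatever is currently rendered on the page
    for event in scrape_visible_events(driver):  # hypothetical helper
        if event['event_url'] not in seen_urls:
            seen_urls.add(event['event_url'])
            results.append(event)
    # Then scroll to trigger loading of the next batch
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)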
Thanks.
I tried Option 2, and it's working:
Option 2: Scroll and Scrape (All)
- Load page
- Scroll
- Scroll
- Scroll
- etc ...
- Scrape all Data
Added information:
Function to scroll:

import time

def scroll_to_bottom(driver):
    # Scroll to the bottom of the page using JavaScript
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)  # Adjust the sleep time as needed
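A possible refinement (just a sketch, not what I actually have running) would be to replace the fixed sleep with an explicit wait that returns as soon as the page height has actually grown, with a timeout as a safety net:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException

def scroll_and_wait(driver, timeout=10):
    # Remember the current page height, then scroll to the bottom
    old_height = driver.execute_script("return document.body.scrollHeight")
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    try:
        # Wait until new content has been appended (the page height increased)
        WebDriverWait(driver, timeout).until(
            lambda d: d.execute_script("return document.body.scrollHeight") > old_height
        )
        return True
    except TimeoutException:
        # Height did not change within the timeout: probably no more content
        return False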
Function to extract data:
import pandas as pd
from selenium.common.exceptions import NoSuchElementException

def Site_Extract_Cal_Event(local_webdriver: webdriver, local_event_webdriver: webdriver, url_site: str, iteration: int):
    global df
    local_webdriver.get(url_site)
    logger.debug('BKRM: Extract Data from current page')
    logger.debug('BKRM: Extract URLs')
    # Scroll the infinite-scroll page a few times
    # (note: range(1, iteration) scrolls iteration - 1 times)
    for index in range(1, iteration):
        logger.debug('BKRM: scrolling [' + str(index) + ']')
        scroll_to_bottom(local_webdriver)
    # Search for the WebElement of each event
    xpath_event = '//*[starts-with(@id, "ep-")]'
    div_elements = local_webdriver.find_elements("xpath", xpath_event)
    element_index = 1
    for element in div_elements:
        element_index_text = "{:02d}".format(element_index)
        logger.debug("ELEMENT [" + element_index_text + "] : ")
        # Search for the time of the event
        xpath_time = ".//time"
        run_time = element.find_element("xpath", xpath_time)
        logger.debug("BKRM: " + "Time: " + run_time.text)
        # Search for the URL of the full event description
        # (used to extract the organizer)
        xpath_meetup_url_link = ".//a[@class='flex h-full flex-col justify-between space-y-5 outline-offset-8 hover:no-underline']"
        run_meetup_url_link = element.find_element("xpath", xpath_meetup_url_link)
        run_meetup_url_link_text = run_meetup_url_link.get_attribute("href")
        logger.debug("BKRM: " + "Meetup url link: " + run_meetup_url_link_text)
        # Search for the event title
        xpath_title = './/span[@class="ds-font-title-3 block break-words leading-7 utils_cardTitle__lbnC_ text-gray6"]'
        run_title = element.find_element("xpath", xpath_title)
        logger.debug("BKRM: " + "Title: " + run_title.text)
        # Search for the attendee number
        # (some events do not list any attendees)
        xpath_attendee_number = ".//span[@class='hidden sm:inline']"
        try:
            run_attendee_number = element.find_element("xpath", xpath_attendee_number)
            run_attendee_number_text_temp = run_attendee_number.text
            an_text = run_attendee_number_text_temp.split()
            run_attendee_number_text = an_text[0]
            logger.debug("BKRM: " + "Attendee Number: " + run_attendee_number_text)
        except NoSuchElementException:
            run_attendee_number_text = "0"
        # Search for the organizer name in the detailed event description
        run_organizer = Site_Extract_Event_Details(local_event_webdriver, run_meetup_url_link_text)
        # run_organizer = "BKK RUNNERS"
        element_index = element_index + 1
        # Create a new record to add
        new_record = {'event_site': BKRM_SITE,
                      'event_date': run_time.text,
                      'event_title': run_title.text,
                      'event_organizer': run_organizer,
                      'event_attendee_number': run_attendee_number_text,
                      'event_url': run_meetup_url_link_text}
        # Append the new record to the DataFrame
        # (DataFrame.append was removed in pandas 2.x, so use pd.concat instead)
        df = pd.concat([df, pd.DataFrame([new_record])], ignore_index=True)
        logger.debug('Adding Event: ' + run_time.text + " / " + run_title.text)
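The surrounding setup is not shown above; a minimal sketch of what it could look like is below (BKRM_SITE, the column list and the listing URL are placeholders, and Site_Extract_Event_Details is a separate helper that opens the event page and returns the organizer):

import logging
import pandas as pd
from selenium import webdriver

logger = logging.getLogger(__name__)
BKRM_SITE = "meetup"  # placeholder label for the source site

# Global DataFrame that Site_Extract_Cal_Event appends to
df = pd.DataFrame(columns=['event_site', 'event_date', 'event_title',
                           'event_organizer', 'event_attendee_number', 'event_url'])

# One driver for the listing page, one for the event detail pages
main_driver = webdriver.Chrome()
event_driver = webdriver.Chrome()

Site_Extract_Cal_Event(main_driver, event_driver, "https://www.meetup.com/find/", 5)  # placeholder URL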
This depends entirely on the website you are trying to crawl: some websites render each item only once, while others re-render items you have already seen. So it is safer to deduplicate the results as a post-processing step.
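For example, with a DataFrame built like the one in the question, deduplication can be as simple as dropping rows that share the same event URL (a sketch, assuming the column names used above):

# Keep only the first occurrence of each event URL
df = df.drop_duplicates(subset=['event_url'], ignore_index=True)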
It also depends on how many scrolls you do. With a finite, reasonable number of scrolls followed by a single extraction there is little chance of missing data. With a very large number of scrolls, the Selenium-driven browser may get stuck after a while, and then there is a chance of missing data.
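One way to bound the number of scrolls without hard-coding it is to stop as soon as the page height stops growing, with a maximum as a safety cap (a sketch along the lines of the scroll_to_bottom function above; max_scrolls and pause are arbitrary values):

import time

def scroll_until_stable(driver, max_scrolls=50, pause=5):
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load the next batch
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # No new content was appended, assume we reached the end
            break
        last_height = new_height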
If you can avoid Selenium altogether, that is even better: try to identify the background requests the page makes when you scroll and replay those requests directly. The reason I am not suggesting Selenium is that browser and network failures are common, and managing Selenium at scale is a hassle; when you scroll many times, the browser can get stuck. If you know the background request, you can issue one request at a time instead.
To find the background request, open the browser's network debugger (the Network tab of the developer tools) and look for the request that returns the data you need.
Since the target website has not been disclosed, I cannot give more information on its background requests.
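As a purely illustrative sketch (the endpoint, parameters and response shape are made up, since the real site is unknown), replaying the background request usually reduces to a paginated JSON call:

import requests

BASE_URL = "https://example.com/api/events"  # placeholder, not a real endpoint

def fetch_all_events():
    events = []
    page = 1
    while True:
        # Each scroll on the page typically corresponds to one request like this
        response = requests.get(BASE_URL, params={"page": page}, timeout=30)
        response.raise_for_status()
        batch = response.json().get("events", [])
        if not batch:
            break  # no more pages
        events.extend(batch)
        page += 1
    return events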