I'm a beginner and have a lot to learn, so please be patient with me.
Using Python and Selenium, I'm trying to scrape table data from a website while navigating through different pages. As I navigate, the table shows the updated data, but the page does not reload and the URL stays the same.
To get the refreshed data from the table and avoid a stale element exception, I used WebDriverWait and expected_conditions (on the tr elements). Even with the wait, my code didn't get the refreshed data: it kept reading the old data from the previous page and raised the exception. So I added time.sleep() after clicking the next page button, which solved the problem.
However, I noticed my code was getting slower the more pages I navigated. At around page 120 it raised the stale element exception again and could not get the refreshed data. I assume this is because the for loop inside the while loop slows down the performance.
I tried an implicit wait and gradually increased time.sleep() to avoid the staleness exception, but nothing worked. There are 100 table rows on each page and around 3,100 pages in total.
My questions are the following:
- Why do I get the stale element exception, and how can I avoid it?
- How can I increase the efficiency of the code?
I searched a lot and really tried to fix this on my own before deciding to post here. I'm stuck and don't know what to do. Please help, and thank you so much for your time.
while True:
    # waits until the table rows are visible after the page is loaded;
    # this is a must for Selenium to scrape data from the dynamic table when navigating through pages
    tr = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//*[@id='erdashboard']/tbody/tr")))
    for record in tr:
        count += 1
        posted_date = datetime.strptime(record.find_element(By.XPATH, './td[7]').text, "%m/%d/%Y").date()
        exclusion_request_dict["ID"].append(int(record.find_element(By.XPATH, './td[1]').text))
        exclusion_request_dict["Company"].append(record.find_element(By.XPATH, './td[2]').text)
        exclusion_request_dict["Product"].append(record.find_element(By.XPATH, './td[3]').text)
        exclusion_request_dict["HTSUSCode"].append(record.find_element(By.XPATH, './td[4]').text)
        exclusion_request_dict["Status"].append(record.find_element(By.XPATH, './td[5]').text)
        exclusion_request_dict["Posted Date"].append(posted_date)
    next_button = driver.find_element(By.ID, "erdashboard_next")
    next_button_clickable = next_button.get_attribute("class").split(" ")
    print(next_button_clickable)
    print("Current Page:", page, "Total Counts:", count)
    if next_button_clickable[-1] == "disabled":
        break
    next_button.click()  # goes to the next page
    time.sleep(wait + 0.01)
When you click the next page button, you can avoid the stale element exception by, for example, waiting until the ID shown in the first row has changed. This is the part of the code commented # wait until new page is loaded (see the full code below).
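Here is a minimal sketch of that check. The helper name and the 10-second timeout are my own choices, not from the original code:

from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

def wait_for_new_page(driver, old_first_id, timeout=10):
    """Block until the ID in the first table row differs from the previous page's."""
    WebDriverWait(
        driver, timeout, ignored_exceptions=[StaleElementReferenceException]
    ).until(
        lambda d: d.find_element(
            By.XPATH, "//*[@id='erdashboard']/tbody/tr[1]/td[1]"
        ).text != old_first_id
    )

Remember the first-row ID just before clicking the next button, click, and then call wait_for_new_page(driver, old_first_id) instead of time.sleep().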
When scraping data from a table, you can increase the efficiency of the code with two tricks. First: loop over columns rather than over rows, because there are (almost always) more rows than columns. Second: use javascript instead of the selenium command .text, because js is way faster than .text. For example, to scrape the values in the first column, the selenium approach (reading each cell's .text) takes about 1.2 seconds on my computer, while the corresponding javascript command (see the code inside for idx in range(1, 8) below) takes only about 0.008 seconds (150 times faster!). Actually, the first trick is barely noticeable when using .text, but with javascript it is really effective: scraping the whole table by rows with js takes about 0.52 seconds, while by columns it takes about 0.05 seconds. Here is the full code:
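The original listing is not reproduced here; what follows is a minimal sketch of the loop it describes, assuming the erdashboard table has seven columns and reusing the wait_for_new_page helper sketched above. The dashboard URL placeholder, the data/JS_COLUMN names, the col1..col7 keys, and the ~3,100-page total used for the ETA are illustrative assumptions:

import time
from collections import defaultdict

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

# returns every cell of one column (1-based index) in a single javascript call
JS_COLUMN = ("return [...document.querySelectorAll("
             "'#erdashboard tbody tr td:nth-child(' + arguments[0] + ')')]"
             ".map(td => td.innerText);")

driver = webdriver.Chrome()
driver.get("...")  # dashboard URL goes here

data = defaultdict(list)
page, start = 1, time.time()
total_pages = 3100  # roughly, per the question

while True:
    # make sure the table body has rows before reading it
    WebDriverWait(driver, 10).until(
        lambda d: d.find_elements(By.CSS_SELECTOR, "#erdashboard tbody tr"))

    # loop over columns instead of rows: one javascript call per column
    for idx in range(1, 8):
        data[f"col{idx}"] += driver.execute_script(JS_COLUMN, idx)

    # progress output: mean time per loop and a rough ETA in minutes:seconds
    mean = (time.time() - start) / page
    eta = mean * (total_pages - page)
    print(f"page {page}, mean per page {mean:.2f}s, "
          f"ETA {int(eta // 60)}:{int(eta % 60):02d}")

    next_button = driver.find_element(By.ID, "erdashboard_next")
    if "disabled" in next_button.get_attribute("class").split():
        break

    # remember the current first-row ID, click next, then wait until it changes
    old_first_id = driver.find_element(
        By.XPATH, "//*[@id='erdashboard']/tbody/tr[1]/td[1]").text
    next_button.click()
    wait_for_new_page(driver, old_first_id)  # helper sketched above
    page += 1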
While the loop is running you get a progress line for each page, where "ETA" is the estimated remaining time in the format minutes:seconds and "mean per page" is the mean time it takes to execute each loop iteration. Then, when the loop has finished, running pandas.DataFrame(data) gives you the scraped table as a DataFrame.
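For completeness, a minimal sketch of that last step (the CSV file name is just an example):

import pandas as pd

df = pd.DataFrame(data)   # one column per table column (col1 .. col7)
print(df.head())
df.to_csv("exclusion_requests.csv", index=False)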