Web scraping with Selenium not capturing full text


I'm trying to mine quite a bit of text from a list of links using Selenium/Python.

In this example, I scrape only one of the pages and that successfully grabs the full text:

    from selenium import webdriver

    page = 'https://xxxxxx.net/xxxxx/September%202020/2020-09-24'
    driver = webdriver.Firefox()
    driver.get(page)

    elements = driver.find_element_by_class_name('text').text
    elements

Then, when I loop through the whole list of links (all of the by-day links on this page: https://overrustlelogs.net/Destinygg%20chatlog/September%202020), using the same method that worked for the single page, it does not grab the full text:

    for i in tqdm(chat_links):
        driver.get(i)
        #driver.implicitly_wait(200)
        elements = driver.find_element_by_class_name('text').text
        #elements = driver.find_element_by_xpath('/html/body/main/div[1]/div[1]').text
        #elements = elements.text
        temp = {'elements': elements}
        chat_text.append(temp)

    driver.close()

    chat_text

My thought is that maybe the pages don't get a chance to load fully, but the same code works on a single page, and driver.get is supposed to block until the page has loaded.

Any ideas? Thanks, much appreciated.

1 Answer

The page lazy-loads its content, so you need to scroll to the bottom repeatedly and collect the data as it appears.

    import time

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    driver.get("https://overrustlelogs.net/Destinygg%20chatlog/September%202020/2020-09-30")
    # Wait until at least one chat line is visible before scrolling.
    WebDriverWait(driver, 10).until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, ".text>span")))

    height = driver.execute_script("return document.body.scrollHeight")
    data = []
    while True:
        # Scroll to the bottom to trigger the next chunk of lazy-loaded lines.
        driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
        time.sleep(1)
        for item in driver.find_elements_by_css_selector(".text>span"):
            if item.text in data:
                continue
            data.append(item.text)

        # Stop once scrolling no longer grows the page.
        lastheight = driver.execute_script("return document.body.scrollHeight")
        if height == lastheight:
            break
        height = lastheight

    print(data)
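One caveat with the `if item.text in data` membership check: chat logs often contain genuinely repeated messages, and that check silently drops every repeat (it is also O(n) per line). Since lazy loading only appends new lines to the DOM, a browser-free sketch of an alternative is to remember how many lines were already collected and take only the tail (`collect_new` is a hypothetical helper name, not part of Selenium):

```python
def collect_new(texts, collected):
    """Append only the items beyond what was already collected.

    `texts` is the full list of line texts currently in the DOM.
    Lazy loading only appends, so everything past len(collected) is new.
    Unlike a membership check, repeated chat messages are kept.
    """
    collected.extend(texts[len(collected):])
    return collected

# After one more scroll, two new lines appear; one repeats an old message.
seen = collect_new(["hi", "lol"], [])
seen = collect_new(["hi", "lol", "hi", "gg"], seen)
# seen == ["hi", "lol", "hi", "gg"]
```

Inside the scroll loop you would pass `[el.text for el in driver.find_elements_by_css_selector(".text>span")]` as `texts`.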