Website scraping: issue with the "Load More" button


I am unable to scrape all of the newsletter URLs; I only get the URLs from the first page. This is the link to the website: https://news.matdesousa.com/

from bs4 import BeautifulSoup
import requests

url = "https://news.matdesousa.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}

# Make a request to the URL to get the HTML content
response = requests.get(url, headers=headers)

soup = BeautifulSoup(response.text, 'html.parser')

# Find inner divs within the specified class
inner_divs = soup.find_all('div', class_='group h-full overflow-hidden transition-all shadow-none hover:shadow-none rounded-lg')

# Extract links from each inner div
for inner_div in inner_divs:
    link = inner_div.find('a')['href']
    print(link)

1 Answer

Answered by devin:

For this specific task you will need control over the browser. Selenium or Puppeteer are good for automating things a human would do in a browser, such as moving the mouse, clicking, or scrolling. You will need to click the "Load More" button repeatedly. I located it with an XPath selector; XPath tends to be more reliable for this kind of scraping. The code below should work for your implementation.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
import time

# Set up the Selenium WebDriver (Selenium 4 takes the driver path via a Service object)
driver = webdriver.Chrome(service=Service('path/to/chromedriver'))

try:
    driver.get("https://news.matdesousa.com/")
    
    while True:
        # Find the "Load More" button by its text
        load_more_button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.XPATH, "//button[contains(., 'Load More')]"))
        )
        load_more_button.click()

        # Waiting for the page to load, adjust the time as needed
        time.sleep(5)

except TimeoutException:
    # No clickable "Load More" button appeared within 10 seconds,
    # so all of the content should now be loaded
    print("No more 'Load More' button found.")

# Now that all content is loaded, use BeautifulSoup to parse it
soup = BeautifulSoup(driver.page_source, 'html.parser')

# Find inner divs within the specified class, update the class name as needed
inner_divs = soup.find_all('div', class_='group h-full overflow-hidden transition-all shadow-none hover:shadow-none rounded-lg')

# Extract links from each inner div
for inner_div in inner_divs:
    link = inner_div.find('a')['href']
    print(link)

driver.quit()
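
If you would rather avoid the fixed sleep and the BeautifulSoup round-trip, here is a minimal alternative sketch: it runs Chrome headless, exits the loop as soon as the button disappears (find_elements returns an empty list instead of raising), and reads the hrefs straight from Selenium. The CSS selector is an assumption based on the classes in your snippet, so adjust it if the markup differs.

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # run Chrome without opening a window

# Selenium 4.6+ can locate a matching chromedriver on its own
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://news.matdesousa.com/")
    time.sleep(3)  # rough wait for the initial render; tune as needed

    while True:
        # find_elements returns [] (no exception) once the button is gone
        buttons = driver.find_elements(By.XPATH, "//button[contains(., 'Load More')]")
        if not buttons:
            break
        buttons[0].click()
        time.sleep(2)  # rough wait for the next batch of cards; tune as needed

    # Assumed selector: a subset of the card classes from your snippet
    for anchor in driver.find_elements(By.CSS_SELECTOR, "div.group.rounded-lg a[href]"):
        print(anchor.get_attribute("href"))
finally:
    driver.quit()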