Scraping using BS4 and Requests-HTML works only on first page, then ('NoneType' object has no attribute 'find')

68 Views Asked by greenbananas2 At 08 February 2024 at 17:10

I'm new to web scraping with Python and BeautifulSoup, and I'm trying to extract car data (model, price, etc.) from a public site with Requests-HTML. I can output the data I need from page 1, but when I try to loop through the rest of the pages using my function getnextpage(soup), I get the following error after it outputs the contents and the correct full URL of page 1:

Traceback (most recent call last):
  File "C:\Users\...\PycharmProjects\WebScraping\scrape_test.py", line 40, in <module>
    url = getnextpage(soup)
  File "C:\Users\...\PycharmProjects\WebScraping\scrape_test.py", line 31, in getnextpage
    if not page.find('a', {'class': 'page-number page-number--next page-number--icon disabled'}):
AttributeError: 'NoneType' object has no attribute 'find'

Followed by

None

This is most of the code for context:

from bs4 import BeautifulSoup
from requests_html import HTMLSession

s = HTMLSession()
url = 'https://ksa.carswitch.com/en/saudi/used-cars/search'


def getdata(url):
    r = s.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

def scrapesite(soup):
    [code for scraping data]

def getnextpage(soup):
    page = soup.find('div', {'class': 'pagination_navigation'})
    if not page.find('a', {'class': 'page-number page-number--next page-number--icon disabled'}):
        url = 'https://ksa.carswitch.com/' + str(page.find('a', {'class': 'page-number page-number--next page-number--icon'})['href'])
        return url
    else:
        return

while True:
    soup = getdata(url)
    scrapesite(soup)
    url = getnextpage(soup)
    if not url:
        break
    print(url)

The idea is to go through the loop to output the data, check if the Next Page button is not disabled, and do it over again after getting the href for the next page. I know this error typically occurs when there's a typo in the class name, but it's accessing the correct div tag contents when I print it.

This is the HTML that I'm accessing:

<div class="pagination_navigation">
<a class="page-number page-number--prev page-number--icon disabled" href="javascript:void();" onclick="changePage(0);event.preventDefault();">
Prev. page
</a>
<a class="page-number page-number--next page-number--icon" href="[href for next page]" onclick="changePage(1);event.preventDefault();">
Next page
</a>
</div>

I had initially started with requests and lxml, but I was getting a similar error. I decided to switch to Requests-HTML after some research, thinking it was a JavaScript issue, but the same thing happens. I would appreciate any advice. Thanks in advance.

**thetaco** · Answer 1 · 2024-02-09T16:51:03.057000

I modified your code to scrape the final number in the page count, since the website already displays it on page 1, and then construct the URLs from the known number of pages. You will need to add your scraping logic to the for loop:

from bs4 import BeautifulSoup
from requests_html import HTMLSession

s = HTMLSession()
url = 'https://ksa.carswitch.com/en/saudi/used-cars/search'


def getdata(url):
    r = s.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

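def scrapesite(soup):
    # Placeholder: replace with the real scraping logic
    print('scraped')

def getmaxpage(soup):
    # The last 'page-number' link is expected to hold the final page number
    pagination_links = soup.find_all('a', class_='page-number')
    link = pagination_links[-1]
    last_page = int(link.text)
    return last_page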

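soup = getdata(url)  # fetch page 1 once, only to read the total page count
max_pages = getmaxpage(soup)
for i in range(max_pages):
    # Pages are numbered from 1, so page 1 is scraped here rather than above
    url = 'https://ksa.carswitch.com/en/saudi/used-cars/search?page=' + str(i + 1)
    print(url)
    soup = getdata(url)
    scrapesite(soup)  # put the real scraping logic in scrapesite()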
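
As for the original traceback: soup.find() returns None when nothing matches, so the AttributeError means the 'pagination_navigation' div was not in the HTML fetched for that page. If you would rather keep the "follow the Next link" approach, here is a minimal sketch of a None-safe version. The name get_next_page and the sample href are my own placeholders, not taken from the site, and I have not run this against the live pages:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'https://ksa.carswitch.com/'  # site root used in the question


def get_next_page(soup, base_url=BASE_URL):
    """Return the absolute URL of the next page, or None when there is none."""
    page = soup.find('div', class_='pagination_navigation')
    if page is None:
        # The pagination div is missing from this response: stop, don't crash.
        return None
    # class_ with a single name matches tags that carry it among other classes.
    link = page.find('a', class_='page-number--next')
    if link is None or 'disabled' in link.get('class', []) or not link.get('href'):
        return None
    # urljoin avoids a doubled slash when the href already starts with '/'.
    return urljoin(base_url, link['href'])


# Hypothetical markup modelled on the snippet in the question; the href is made up.
sample = '''
<div class="pagination_navigation">
<a class="page-number page-number--next page-number--icon"
   href="/en/saudi/used-cars/search?page=2">Next page</a>
</div>
'''
print(get_next_page(BeautifulSoup(sample, 'html.parser')))
print(get_next_page(BeautifulSoup('<p>no pagination here</p>', 'html.parser')))
```

With this, the while loop from the question ends cleanly when the div is absent instead of raising. It does not explain why the div is absent on later pages, so printing the response status and a slice of r.text for the failing URL is still worth doing.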

I am always thankful when websites show the last page from the start.
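
One caveat: getmaxpage() above relies on the last 'page-number' link being the final page number. If a "Next page" link ever comes after it, int() will fail on the text. A slightly more defensive sketch, assuming the page numbers appear as plain digits inside 'page-number' links (the name get_max_page and the sample markup are hypothetical, only there to exercise the function):

```python
from bs4 import BeautifulSoup


def get_max_page(soup):
    """Return the largest numeric label among the pagination links, or 1."""
    numbers = []
    for link in soup.find_all('a', class_='page-number'):
        text = link.get_text(strip=True)
        if text.isdigit():  # skips "Prev. page" / "Next page" style links
            numbers.append(int(text))
    # Fall back to a single page when no numeric link is found.
    return max(numbers, default=1)


# Hypothetical markup, not copied from the real site.
sample = '''
<a class="page-number page-number--prev disabled">Prev. page</a>
<a class="page-number">1</a>
<a class="page-number">2</a>
<a class="page-number">57</a>
<a class="page-number page-number--next">Next page</a>
'''
print(get_max_page(BeautifulSoup(sample, 'html.parser')))
```

Taking the maximum rather than the last element means the order of the links no longer matters, and the default of 1 keeps the for loop working on a result set with only one page.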
