Scraping using BS4 and Requests-HTML works only on first page, then ('NoneType' object has no attribute 'find')


I'm new to web scraping using Python and BeautifulSoup and I'm trying to extract car data (model, price, etc.) from a public site with Requests-HTML. I can successfully output the data I need from page 1, but when I try to go through the rest of the pages using my function getnextpage(soup) with a loop, I get the following error after it outputs the contents and the correct full url of page 1:

Traceback (most recent call last):
  File "C:\Users\...\PycharmProjects\WebScraping\scrape_test.py", line 40, in <module>
    url = getnextpage(soup)
  File "C:\Users\...\PycharmProjects\WebScraping\scrape_test.py", line 31, in getnextpage
    if not page.find('a', {'class': 'page-number page-number--next page-number--icon disabled'}):
AttributeError: 'NoneType' object has no attribute 'find'

Followed by

None

This is most of the code for context:

from bs4 import BeautifulSoup
from requests_html import HTMLSession

s = HTMLSession()
url = 'https://ksa.carswitch.com/en/saudi/used-cars/search'


def getdata(url):
    r = s.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

def scrapesite(soup):
    [code for scraping data]

def getnextpage(soup):
    page = soup.find('div', {'class': 'pagination_navigation'})
    if not page.find('a', {'class': 'page-number page-number--next page-number--icon disabled'}):
        url = 'https://ksa.carswitch.com/' + str(page.find('a', {'class': 'page-number page-number--next page-number--icon'})['href'])
        return url
    else:
        return

while True:
    soup = getdata(url)
    scrapesite(soup)
    url = getnextpage(soup)
    if not url:
        break
    print(url)

The idea is to go through the loop to output the data, check if the Next Page button is not disabled, and do it over again after getting the href for the next page. I know this error typically occurs when there's a typo in the class name, but it's accessing the correct div tag contents when I print it.

This is the HTML that I'm accessing:

<div class="pagination_navigation">
<a class="page-number page-number--prev page-number--icon disabled" href="javascript:void();" onclick="changePage(0);event.preventDefault();">
Prev. page
</a>
<a class="page-number page-number--next page-number--icon" href="[href for next page]" onclick="changePage(1);event.preventDefault();">
Next page
</a>
</div>

I had initially started with requests and lxml, but I was getting a similar error. I decided to switch to Requests-HTML after some research, thinking it was a JavaScript issue, but the same thing happens. I would appreciate any advice. Thanks in advance.


There is 1 best solution below

4
thetaco On

I modified your code to scrape the final number in the page count, since the website displays it from page 1, and then construct the URLs knowing how many pages there are. You will need to add the scraping logic to the for loop:

from bs4 import BeautifulSoup
from requests_html import HTMLSession

s = HTMLSession()
url = 'https://ksa.carswitch.com/en/saudi/used-cars/search'


def getdata(url):
    r = s.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

def scrapesite(soup):
    print('scraped')

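def getmaxpage(soup):
    # The Prev/Next links may share the 'page-number' class, so keep only
    # the links whose text is a number and take the largest one.
    pagination_links = soup.find_all('a', class_='page-number')
    numbers = [int(a.text) for a in pagination_links if a.text.strip().isdigit()]
    return max(numbers) if numbers else 1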

soup = getdata(url)
max_pages = getmaxpage(soup)  # read the page count from page 1
for i in range(max_pages):
    url = 'https://ksa.carswitch.com/en/saudi/used-cars/search?page=' + str(i + 1)
    print(url)
    soup = getdata(url)
    scrapesite(soup)  # put the real scraping logic inside scrapesite()
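
As a small variation on the URL construction above, the page URLs could also be built with the standard library instead of string concatenation. This is only a sketch: the helper name page_urls is made up for illustration, and it assumes the site keeps accepting the page query parameter used above.

```python
from urllib.parse import urlencode

SEARCH_URL = 'https://ksa.carswitch.com/en/saudi/used-cars/search'


def page_urls(max_pages, base=SEARCH_URL):
    """Yield the search URL for pages 1..max_pages inclusive.

    Hypothetical helper: 'page' is the query parameter used in the loop above.
    """
    for page in range(1, max_pages + 1):
        # urlencode takes care of escaping if more parameters are added later.
        yield f"{base}?{urlencode({'page': page})}"
```

The loop then becomes "for url in page_urls(max_pages):", which avoids the i + 1 arithmetic.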

I am always thankful when websites show the last page from the start.
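
As for the original AttributeError: soup.find() returns None when no matching element exists, so calling page.find(...) on the result fails whenever the pagination div is missing from the HTML that was actually fetched. I have not confirmed why the div is missing on the later pages, so treat the following as a defensive sketch rather than a diagnosis. It keeps the getnextpage name from the question, matches on the single class page-number--next instead of the full class string, and uses urljoin on the assumption that the href is a site-relative path.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'https://ksa.carswitch.com/'


def getnextpage(soup, base_url=BASE_URL):
    """Return the absolute URL of the next page, or None when there is none."""
    # find() returns None when the div is not in the parsed HTML.
    page = soup.find('div', class_='pagination_navigation')
    if page is None:
        return None

    # Match one class so extra classes such as 'disabled' do not hide the link.
    next_link = page.find('a', class_='page-number--next')
    if next_link is None or 'disabled' in next_link.get('class', []):
        return None

    href = next_link.get('href')
    if not href or href.startswith('javascript:'):
        return None

    # urljoin handles hrefs with or without a leading slash.
    return urljoin(base_url, href)
```

With this version the while loop from the question stops cleanly when the link is missing instead of raising, which also makes it easier to print the fetched HTML and check what the later pages really contain.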