I'm new to web scraping using Python and BeautifulSoup and I'm trying to extract car data (model, price, etc.) from a public site with Requests-HTML. I can successfully output the data I need from page 1, but when I try to go through the rest of the pages using my function getnextpage(soup) with a loop, I get the following error after it outputs the contents and the correct full url of page 1:
Traceback (most recent call last):
  File "C:\Users\...\PycharmProjects\WebScraping\scrape_test.py", line 40, in <module>
    url = getnextpage(soup)
  File "C:\Users\...\PycharmProjects\WebScraping\scrape_test.py", line 31, in getnextpage
    if not page.find('a', {'class': 'page-number page-number--next page-number--icon disabled'}):
AttributeError: 'NoneType' object has no attribute 'find'
Followed by
None
This is most of the code for context:
from bs4 import BeautifulSoup
from requests_html import HTMLSession

s = HTMLSession()
url = 'https://ksa.carswitch.com/en/saudi/used-cars/search'

def getdata(url):
    r = s.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

def scrapesite(soup):
    [code for scraping data]

def getnextpage(soup):
    page = soup.find('div', {'class': 'pagination_navigation'})
    if not page.find('a', {'class': 'page-number page-number--next page-number--icon disabled'}):
        url = 'https://ksa.carswitch.com/' + str(page.find('a', {'class': 'page-number page-number--next page-number--icon'})['href'])
        return url
    else:
        return

while True:
    soup = getdata(url)
    scrapesite(soup)
    url = getnextpage(soup)
    if not url:
        break
    print(url)
The idea is to go through the loop to output the data, check if the Next Page button is not disabled, and do it over again after getting the href for the next page. I know this error typically occurs when there's a typo in the class name, but it's accessing the correct div tag contents when I print it.
This is the HTML that I'm accessing:
<div class="pagination_navigation">
    <a class="page-number page-number--prev page-number--icon disabled" href="javascript:void();" onclick="changePage(0);event.preventDefault();">
        Prev. page
    </a>
    <a class="page-number page-number--next page-number--icon" href="[href for next page]" onclick="changePage(1);event.preventDefault();">
        Next page
    </a>
</div>
I had initially started with requests and lxml, but I was getting a similar error. I decided to switch to Requests-HTML after some research, thinking it was a JavaScript issue, but the same thing happens. I would appreciate any advice. Thanks in advance.
I modified your code to scrape the final number in the page count, since the website displays it from page 1 (I am always thankful when websites show the last page from the start), and then construct the URLs knowing how many pages there are. You will need to add the scraping logic to the for loop.
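A minimal sketch of that approach, not the exact code from this answer. Several things here are assumptions rather than facts about the site: that the numbered pagination links sit inside the pagination_navigation div with a plain number as their text, and that pages can be requested with a page query parameter. The helper names last_page_number, build_page_urls and scrape_all_pages are made up for illustration. Check the real markup and URL pattern in your browser before relying on any of it.

```python
import re

BASE_URL = 'https://ksa.carswitch.com/en/saudi/used-cars/search'


def last_page_number(link_texts):
    """Return the largest page number among the pagination link texts.

    Labels such as 'Prev. page' are ignored; falls back to 1 if no
    numeric label is found.
    """
    numbers = [int(text.strip()) for text in link_texts
               if re.fullmatch(r'\d+', text.strip())]
    return max(numbers, default=1)


def build_page_urls(base_url, last_page, param='page'):
    """Build one URL per page (the query parameter name is an assumption)."""
    return [f'{base_url}?{param}={n}' for n in range(1, last_page + 1)]


def scrape_all_pages():
    # Imported here so the two helpers above can be tried out without
    # the scraping libraries installed.
    from bs4 import BeautifulSoup
    from requests_html import HTMLSession

    session = HTMLSession()
    first = BeautifulSoup(session.get(BASE_URL).text, 'html.parser')

    # find() returns None when the div is absent from the fetched HTML;
    # calling .find on that None is what raises the AttributeError in
    # the traceback, so guard against it here.
    pagination = first.find('div', {'class': 'pagination_navigation'})
    texts = [a.get_text() for a in pagination.find_all('a')] if pagination else []

    for url in build_page_urls(BASE_URL, last_page_number(texts)):
        soup = BeautifulSoup(session.get(url).text, 'html.parser')
        # Add the scraping logic here, e.g. scrapesite(soup).
        print(url)
```

Calling scrape_all_pages() runs the whole thing. Because the page count is read once from page 1, the loop never depends on finding a Next page link on later pages, which is where the original getnextpage failed.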