How to skip a result when using BeautifulSoup find_all?


This is my code. It finds all the car links, which come without "https://" and the domain name. However, one of them is a full link starting with "https://...". How can I write code that skips this one result, i.e. ignores any href that contains "https://" or other unwanted text?

import requests
from bs4 import BeautifulSoup

for page_number in range(1, 10):
    url = f"xyz{page_number}"
    req = requests.get(url)
    src = req.text
    soup = BeautifulSoup(src, "lxml")
    get_car_links = soup.find_all(class_="info-container")
    for i in get_car_links:
        car_links = i.find("a", class_="title")
        car_datas = car_links.get("href")
        print(car_datas)

There are 2 best solutions below

Answer by Zero (best answer)

You can add an if condition that checks each href and skips it when it is a full "http" link.

from bs4 import BeautifulSoup
import requests

for page_number in range(1, 10):
    url = f"xyz{page_number}"
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "lxml")

    get_car_links = soup.find_all(class_="info-container")
    for i in get_car_links:
        car_links = i.find("a", class_="title")
        # Skip containers that have no title link at all
        if car_links is None:
            continue
        car_datas = car_links.get("href")
        # Print only relative links; skip full ones such as "https://..."
        if car_datas and not car_datas.startswith("http"):
            print(car_datas)
Answer by SIGHUP

What you're trying to do is ignore HREFs that have a scheme, e.g. https, http, ftp or mailto.

Therefore it seems sensible to use a URL parser rather than searching for constant strings.

Something like this:

from urllib.parse import urlparse
import requests
from bs4 import BeautifulSoup as BS

URL = 'https://example.com/bar'

def main():
    for page in range(1, 10):
        with requests.get(f'{URL}{page}') as response:
            response.raise_for_status()
            soup = BS(response.text, 'lxml')
            for car_link in soup.find_all(class_='info-container'):
                if (a := car_link.find('a', class_='title')):
                    if not urlparse(href := a['href']).scheme:
                        print(href)

if __name__ == '__main__':
    main()
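
To see the scheme check on its own, without any network requests, here is a minimal sketch using only the standard library. The helper name is_relative and the sample hrefs are made up for illustration. The extra netloc check is an assumption that goes beyond the code above: it also rejects protocol-relative links such as "//cdn.example.com/...", which have no scheme but still point at another host.

```python
from urllib.parse import urlparse


def is_relative(href: str) -> bool:
    """Return True when href has neither a scheme nor a network location."""
    parts = urlparse(href)
    return not parts.scheme and not parts.netloc


# Hypothetical hrefs, similar to what the scraper might collect
hrefs = [
    "/cars/audi-a4-2015",            # relative link: keep
    "https://example.com/cars/bmw",  # full link with a scheme: skip
    "mailto:dealer@example.com",     # scheme without a host: skip
    "//cdn.example.com/cars/vw",     # protocol-relative, has a netloc: skip
]

relative_only = [h for h in hrefs if is_relative(h)]
print(relative_only)
```

Inside the scraping loop, the same helper could replace the inline urlparse(...).scheme test if protocol-relative links should be skipped as well.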