I'm trying to scrape a website, but the products don't have an href.
I'm using Selenium to navigate the site and BeautifulSoup to digest the results, but I don't know how to open every product to get more data.
You can use Wallapop's API to get this data. There are many ways to handle it; I suggest using the Scrapy framework. Here is the code to get the item URLs (and save each item's data as a JSON file) using Scrapy:
import json

import scrapy


class WallapopSpider(scrapy.Spider):
    name = "wallapop"
    search_url = "https://api.wallapop.com/api/v3/general/search?keywords={keywords}&latitude={latitude}&longitude={longitude}&start={start}&items_count={items_count}&filters_source=search_box&order_by={order_by}"
    item_url = "https://api.wallapop.com/api/v3/items/"

    def start_requests(self):
        meta = {
            "keywords": "monitor",
            "latitude": 39.46895,
            "longitude": -0.37686,
            "start": 0,
            "items_count": 40,
            "order_by": "most_relevance",
        }
        yield scrapy.Request(
            url=self.search_url.format(**meta), callback=self.parse_search
        )

    def parse_search(self, response):
        # parsing the response into a dict
        dict_data = json.loads(response.text)
        # iterating over the search results
        for item in dict_data["search_objects"]:
            yield scrapy.Request(
                url=self.item_url + item["id"], callback=self.parse_item
            )

    def parse_item(self, response):
        # parsing the response into a dict
        item_data = json.loads(response.text)
        print(item_data["share_url"])
        print(item_data["description"])
        # saving the item data on a json file
        with open(f"{item_data['id']}.json", "w") as f:
            json.dump(item_data, f)
I found both API endpoints by inspecting the network traffic in the browser dev tools. I'm using Valencia's latitude and longitude values and "monitor" as the keyword. The required parameters to run the query are keywords, latitude and longitude.
I can also suggest running the crawler on a spider management solution like estela.
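If you don't have a full Scrapy project set up, you can also run the spider straight from a script with Scrapy's CrawlerProcess (the scrapy runspider CLI command works as well). A minimal sketch, assuming the spider class above is defined in the same file:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
process.crawl(WallapopSpider)  # the spider class defined above
process.start()  # blocks until the crawl finishes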
While I think the API is more efficient, if you want a selenium+bs4 solution, here is an example.
Using these functions:
# from selenium import webdriver
# from selenium.webdriver.common.by import By
# from selenium.webdriver.support.ui import WebDriverWait
# from selenium.webdriver.support import expected_conditions as EC
# from bs4 import BeautifulSoup
# from urllib.parse import urljoin
## scroll to an element and click [targetEl can be an element or a selector] ##
def scrollClick(driverX, targetEl, maxWait=5, scroll2Top=False, printErr=True):
    try:
        xWait = WebDriverWait(driverX, maxWait)
        if isinstance(targetEl, str):  # if a selector was passed, resolve it to an element
            xWait.until(
                EC.presence_of_element_located((By.CSS_SELECTOR, targetEl)))
            targetEl = driverX.find_element(By.CSS_SELECTOR, targetEl)
        xWait.until(EC.element_to_be_clickable(targetEl))
        driverX.execute_script('''
            arguments[0].scrollIntoView(arguments[1]);
        ''', targetEl, bool(scroll2Top))  ## execute js to scroll
        targetEl.click()
    except Exception as e:
        if printErr: print(repr(e), '\nFailed to click', targetEl)
## find a nextSibling of refEl that matches selector [if specified by sel] ##
def selectNextSib(driverX, refEl, sel=False, printError=False):
    sel = sel.strip() if isinstance(sel, str) and sel.strip() else False
    try:  ## execute js code to find next card
        return driverX.execute_script('''
            var sibling = arguments[0].nextElementSibling;
            while (sibling && arguments[1]) {
                if (sibling.matches(arguments[1])) break;
                sibling = sibling.nextElementSibling; }
            return sibling;''', refEl, sel)
    except Exception as e:
        if printError: print(f'Error finding next "{sel}":', repr(e))
## [bs4] extract text or attribute from a tag inside tagSoup ##
def selectGet(tagSoup, selector='', ta='', defaultVal=None):
    el = tagSoup.select_one(selector) if selector else tagSoup
    if el is None: return defaultVal
    return el.get(ta, defaultVal) if ta else el.get_text(' ', strip=True)
## parse product page html and extract product details ##
def getProductDetails(prodPgHtml: str, prodUrl=None):
    pSoup = BeautifulSoup(prodPgHtml.encode('utf-8'), 'lxml')
    detsDiv = pSoup.select_one('div.detail-item')
    detKeys = ['category_id', 'is_bulky', 'is_bumped',
               'is_free_shipping_allowed', 'item_id', 'item_uuid',
               'main_image_thumbnail', 'mine', 'sell_price',
               'seller_user_id', 'subcategory_id', 'title']
    ## copy the data-* attributes of the detail container
    ## [removeprefix needs Python 3.9+; lstrip('data-') would mangle 'title' into 'itle']
    pDets = {} if detsDiv is None else {
        k.removeprefix('data-').replace('-', '_'): v
        for k, v in sorted(detsDiv.attrs.items(), key=lambda x: x[0])
        if k.removeprefix('data-').replace('-', '_') in detKeys
    }
    pDets['description'] = selectGet(pSoup, 'div.card-product-detail-top>p')
    pDets['date_posted'] = selectGet(pSoup, 'div[class$="published"]')
    pDets['views_count'] = selectGet(pSoup, 'i.ico-eye+span')
    pDets['likes_count'] = selectGet(pSoup, 'i.ico-coounter_favourites+span')
    pDets['seller_name'] = selectGet(pSoup, 'h2.card-user-detail-name')
    uLink = selectGet(pSoup, 'a.card-user-right[href]', 'href')
    if uLink: pDets['seller_link'] = urljoin(prodUrl, uLink)
    ### EXTRACT ANY OTHER DETAILS YOU WANT ###
    pDets['product_link'] = prodUrl
    return pDets
you can loop through the cards on the results page, clicking on each one to open it in a new tab and scrape the product details:
kSearch, maxItems = 'monitor', 1500  ## adjust as preferred
url = f'https://es.wallapop.com/app/search?keywords={"+".join(kSearch.split())}'
url = f'{url}&filters_source=search_box&latitude=39.46895&longitude=-0.37686'

browser = webdriver.Chrome()
browser.get(url)
browser.maximize_window()
scrollClick(browser, 'button[id="onetrust-accept-btn-handler"]')  ## accept cookies
scrollClick(browser, 'tsl-button[id="btn-load-more"]')  ## load more [then ∞-scroll]

itemCt, scrapedLinks, products = 0, [], []  ## initiate
itemSel, nextItem = 'a.ItemCardList__item[title]', None
try: nextItem = browser.find_element(By.CSS_SELECTOR, itemSel)  ## first card
except Exception as e: print('No items found:', repr(e))

while nextItem:
    itemCt += 1  # counter
    cpHtml, cpTxt = '', ''  # clear/initiate
    resultsTab = browser.current_window_handle  # to go back
    try:  # click card -> open new tab -> scrape product details
        cpHtml, cpTxt = nextItem.get_attribute('outerHTML'), nextItem.text
        scrollClick(browser, nextItem)  ## click current card
        # add wait ?
        browser.switch_to.window(browser.window_handles[1])  ## go to 2nd tab
        WebDriverWait(browser, 5).until(EC.presence_of_element_located(
            (By.CSS_SELECTOR, 'div.detail-item')))  ## wait to load details
        pLink = browser.current_url  ## product URL
        if pLink not in scrapedLinks:  # skip duplicates [just in case]
            products.append(getProductDetails(browser.page_source, pLink))
            scrapedLinks.append(pLink)
    except Exception as e:
        print('!', [itemCt], ' '.join(cpTxt.split()), repr(e))  ## print error
        pSoup = BeautifulSoup(cpHtml.encode('utf-8'), 'lxml')
        products.append({
            'title': selectGet(pSoup, 'a[title]', 'title'),  # title attr of the card link
            'price': selectGet(pSoup, 'span.ItemCard__price'),
            'errorMsg': f'{type(e)} {e}'
        })  ## [ make do with info in card ]
    try:  # close all tabs other than results tab
        for w in browser.window_handles:
            if w != resultsTab:
                browser.switch_to.window(w)
                browser.close()
        browser.switch_to.window(resultsTab)
    except Exception as e:
        print('Failed to restore results-tab-only window:', repr(e))
        break
    # print('', end=f"\r[{itemCt} of {maxItems}] {' '.join(cpTxt.split())} {repr(e)}")
    if isinstance(maxItems, int):
        if maxItems < itemCt: break
    nextItem = selectNextSib(browser, nextItem, itemSel)  # get next result card
Some notes:
The second scrollClick call is for the "Load More" button; it only needs to be clicked once, and after that more results load as you scroll down.
scrapedLinks doesn't seem to be necessary, as there don't appear to be any duplicates to filter out. [It's just a habit, since some sites do have duplicates, especially if there's pagination...]
With maxItems set to None [or a really high number], it should keep scrolling and scraping until either it runs out of result cards or an error breaks the loop.
products should be a list of dictionaries, each containing details about a different product.
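Each entry should look something like this sketch (keys as produced by getProductDetails; the values are just placeholders, not real data):

# illustrative shape only - placeholder values
{
    'item_id': '...', 'item_uuid': '...', 'category_id': '...', 'subcategory_id': '...',
    'title': '...', 'sell_price': '...', 'main_image_thumbnail': '...',
    'is_bulky': '...', 'is_bumped': '...', 'is_free_shipping_allowed': '...', 'mine': '...',
    'seller_user_id': '...', 'description': '...', 'date_posted': '...',
    'views_count': '...', 'likes_count': '...', 'seller_name': '...',
    'seller_link': '...', 'product_link': '...'
}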
Btw, you can also save the results with something like pandas' .to_csv:
# import pandas as pd
pd.DataFrame(products).to_csv('monitors.csv', index=False)
You can use their Ajax API to download the information about the items (one of the fields is the URL).
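A minimal sketch of that idea with requests, reusing the two endpoints shown in the Scrapy answer above (extra headers may be needed if the API rejects plain requests):

import requests

search_url = ("https://api.wallapop.com/api/v3/general/search"
              "?keywords=monitor&latitude=39.46895&longitude=-0.37686"
              "&filters_source=search_box&start=0&items_count=40")
search_data = requests.get(search_url).json()
for result in search_data["search_objects"]:
    # the per-item endpoint returns share_url, i.e. the product URL
    item_data = requests.get("https://api.wallapop.com/api/v3/items/" + result["id"]).json()
    print(item_data["share_url"])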
EDIT: To get the products from the next pages, you can add the start= parameter to the URL.
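For example, a sketch building on the search URL above (stepping start by 40 to match items_count is an assumption):

# paginate by increasing the start offset
base_url = ("https://api.wallapop.com/api/v3/general/search"
            "?keywords=monitor&latitude=39.46895&longitude=-0.37686"
            "&filters_source=search_box&items_count=40")
for start in range(0, 200, 40):
    page_url = f"{base_url}&start={start}"
    # fetch page_url (with requests or scrapy.Request) and parse "search_objects" as before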