Why can't I get the complete 'href' as shown in the browser from noon.com

Question

109 Views Asked by Tabraiz Yasin At 06 January 2021 at 08:06

Here is what I'm doing

import requests
from requests.adapters import HTTPAdapter
from bs4 import BeautifulSoup

HEADERS = {
    'authority': 'www.noon.com',
    'pragma': 'no-cache',
    'cache-control': 'no-cache',
    'dnt': '1',
    'upgrade-insecure-requests': '1',
    'accept': '*/*',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
    'sec-fetch-site': 'none',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-dest': 'document'
}

response = requests.get('https://www.noon.com/uae-en/electronics-and-mobiles/mobiles-and-accessories/mobiles-20905',headers=HEADERS,stream=True)
soup = BeautifulSoup(response.content,'lxml')
results = soup.find_all("div", {"class" : "productContainer"})
result = results[0]

print("https://www.noon.com" + result.a.get('href'))

Output

https://www.noon.com/uae-en

But the expected output should be 'https://www.noon.com/uae-en/product/N35521717A/p?o=f885efe0b6534e9f'

As you can see here, this is what the browser shows:

<div class="productContainer"><a class="sc-7vj7do-0 ftlAjW" href="/uae-en/product/N35521717A/p?o=f885efe0b6534e9f" id="productBox-N35521717A"><div class="kcs0h5-0 diNcmV grid" title="Samsung Galaxy M31 Dual SIM Blue 6GB RAM 128GB 4G LTE "><div class="e3js0d-1 efqIDW"><div class="productImage" data-qa-id="productImagePLP_Galaxy M31 Dual SIM Blue 6GB RAM 128GB 4G LTE "><div class="lazyload-wrapper"><div class="puv25r-0 hfEfTS"><div class="puv25r-2 hJKuPa"><img alt="Galaxy M31 Dual SIM Blue 6GB RAM 128GB 4G LTE " src="https://a.nooncdn.com/t_desktop-pdp-v1/v1605814225/N35521717A_1.jpg"/></div></div></div></div><div class="e3js0d-2 dqjnoR"><div class="tagContainer"></div></div></div><div class="e3js0d-6 iKEZJh"><div class="e3js0d-7 jULUCI"><div class="e3js0d-10 cyUANN"><span class="e3js0d-11 gXshOX">Samsung</span>Galaxy M31 Dual SIM Blue 6GB RAM 128GB 4G LTE </div></div><div class="e3js0d-8 jtiosv"><div class="sc-3751lm-0 hSumnU"><div class="sc-3751lm-1 eUJkVt large"><span class="currency">AED</span><strong>819.00</strong></div><div class="sc-3751lm-2 kWnsOk"><span class="oldPrice">AED<!-- --> <!-- -->859</span></div></div></div><div class="e3js0d-9 kDpjlW"><div class="e3js0d-12 gMFqig"><div class="u8zs36-0 kRPdZJ"><img alt="noon-express" height="20px" src="https://a.nooncdn.com/s/app/com/noon/images/fulfilment_express-en.png" width="80px"/></div></div></div></div></div></a></div>

**HedgeHog** · Accepted Answer · 2021-01-06T09:58:34.167000

What happens and steps to reproduce

The website appears to generate this content dynamically: the HTML sent by the server contains only a partial href, and the rest is added in the browser.

1. Open the website in a browser.
2. Open the page source (Ctrl+U) and search for class="productContainer". The href of the <a> contains only /uae-en, which is what you get with requests.
3. Open the inspector (Ctrl+Shift+I) and inspect the same <a>. There you will find the dynamically added part, which is what you get with Selenium.
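The difference between the two views can be sketched offline. This is only an illustration using the standard library: `ProductLinkParser`, `product_hrefs`, and the two sample strings are hypothetical names and hand-written stand-ins for the page source and the rendered DOM, not actual responses from noon.com.

```python
from html.parser import HTMLParser


class ProductLinkParser(HTMLParser):
    """Collect the href of each <a> that opens right after a productContainer <div>."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self._expect_link = False  # True right after <div class="productContainer">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            self._expect_link = "productContainer" in classes
        elif tag == "a" and self._expect_link:
            self.hrefs.append(attrs.get("href"))
            self._expect_link = False
        else:
            self._expect_link = False


def product_hrefs(html):
    """Return the product link hrefs found in an HTML string."""
    parser = ProductLinkParser()
    parser.feed(html)
    return parser.hrefs


# Assumed stand-in for what the server sends (the Ctrl+U view).
raw_source = '<div class="productContainer"><a href="/uae-en"></a></div>'
# Assumed stand-in for the DOM after the page's scripts have run (the inspector view).
rendered_dom = (
    '<div class="productContainer">'
    '<a href="/uae-en/product/N35521717A/p?o=f885efe0b6534e9f"></a></div>'
)

print(product_hrefs(raw_source))
print(product_hrefs(rendered_dom))
```

Running the same extraction on both strings shows why the parser is not at fault: the full href is simply absent from the HTML that requests receives.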

Minimal example

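import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By

# Raw string so the backslashes in the Windows path are not read as escapes.
# Selenium 4 style: the driver path is passed via Service, and
# find_elements(By.XPATH, ...) replaces the removed find_element_by_xpath.
browser = webdriver.Chrome(service=Service(r'C:\Program Files\ChromeDriver\chromedriver.exe'))

browser.get('https://www.noon.com/uae-en/electronics-and-mobiles/mobiles-and-accessories/mobiles-20905')

# Give the page's JavaScript time to render the product links.
time.sleep(3)
elements = browser.find_elements(By.XPATH, "//div[contains(@class, 'productContainer')]/a")

# Hover over each product link and print its href.
for element in elements:
    ActionChains(browser).move_to_element(element).perform()
    print(element.get_attribute('href'))

browser.quit()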

Output

https://www.noon.com/uae-en/product/N35521717A/p?o=f885efe0b6534e9f
https://www.noon.com/uae-en/product/N41247213A/p?o=ca38c8921770ea2a
https://www.noon.com/uae-en/product/N41247235A/p?o=c97b8bfdc0114cba
https://www.noon.com/uae-en/product/N39790555A/p?o=d7354e20a0bb00ad
https://www.noon.com/uae-en/product/N32046052A/p?o=faea2e69f38bbf6a
...

EDIT

You won't get this information with requests by scraping the page source, but there is an alternative.

You could call the API with requests and build the link yourself (a simple example you can customize):

import requests

url = "https://www.noon.com/_svc/catalog/api/u/electronics-and-mobiles/mobiles-and-accessories/mobiles-20905"
headers = {
    "user-agent": "Mozilla/5.0"
}

response = requests.get(url, headers=headers)
response.raise_for_status()

records = response.json()["hits"]

# Build one product link per record; "slug" avoids overwriting the request url above.
for record in records:
    offer_code = record["offer_code"]
    sku = record["sku"]
    slug = record["url"]
    print(f"https://www.noon.com/uae-en/{slug}/{sku}/p?o={offer_code}")

Output

https://www.noon.com/uae-en/galaxy-m31-dual-sim-blue-6gb-ram-128gb-4g-lte/N35521717A/p?o=f885efe0b6534e9f
https://www.noon.com/uae-en/iphone-12-pro-max-with-facetime-128gb-pacific-blue-5g-international-specs/N41247213A/p?o=ca38c8921770ea2a
https://www.noon.com/uae-en/iphone-12-pro-with-facetime-256gb-pacific-blue-5g-international-specs/N41247235A/p?o=cfab59c09cab747b
...
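If some records lack one of these fields, the link building could be moved into a small helper. The following is only a sketch: `build_product_url` and `BASE_URL` are hypothetical names, the sample record is reconstructed from the first output line above, and the assumption that the API may omit fields is mine and not confirmed by the source.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://www.noon.com/uae-en"  # prefix taken from the output above


def build_product_url(record):
    """Return the product URL for one record, or None if a field is missing."""
    slug = record.get("url")
    sku = record.get("sku")
    offer_code = record.get("offer_code")
    if not (slug and sku and offer_code):
        return None
    # Percent-encode the path parts and the query value in case they
    # contain characters that are not URL-safe.
    path = f"{quote(slug)}/{quote(sku)}/p"
    query = urlencode({"o": offer_code})
    return f"{BASE_URL}/{path}?{query}"


# Sample record reconstructed from the first line of the output above.
sample = {
    "url": "galaxy-m31-dual-sim-blue-6gb-ram-128gb-4g-lte",
    "sku": "N35521717A",
    "offer_code": "f885efe0b6534e9f",
}
print(build_product_url(sample))
```

In the loop, records for which the helper returns None can then be skipped instead of raising a KeyError.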
