How to scrape news articles from cnbc with keyword "Green hydrogen"?


I am trying to scrape the news articles listed at this URL; all the articles are in span.Card-title. But this gives blank output. Is there any way to resolve this?

from bs4 import BeautifulSoup as soup
import requests

cnbc_url = "https://www.cnbc.com/search/?query=green%20hydrogen&qsearchterm=green%20hydrogen"
html = requests.get(cnbc_url)
bsobj = soup(html.content, 'html.parser')

day = bsobj.find(id="root")
print(day.find_all('span', class_='Card-title'))

for link in bsobj.find_all('span', class_='Card-title'):
    print('Headlines : {}'.format(link.text))

2 Answers

Tornike Skhulukhia

The problem is that the content is not present in the page when it first loads; it is fetched from the server afterwards, using a URL like this, and then added to the page:

https://api.queryly.com/cnbc/json.aspx?queryly_key=31a35d40a9a64ab3&query=green%20hydrogen&endindex=0&batchsize=10&callback=&showfaceted=false&timezoneoffset=-240&facetedfields=formats&facetedkey=formats%7C&facetedvalue=!Press%20Release%7C&needtoptickers=1&additionalindexes=4cd6f71fbf22424d,937d600b0d0d4e23,3bfbe40caee7443e,626fdfcd96444f28

Take a look at the /json.aspx endpoint in devtools; the data seems to be there.
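A sketch of calling that endpoint directly. The parameter names and the key are taken from the URL above; the response shape ('results' / 'cn:title') is an assumption based on what this endpoint returns for the search page, so verify it against the actual JSON in devtools:

```python
import urllib.parse

# parameters seen in the copied URL above
params = {
    'queryly_key': '31a35d40a9a64ab3',
    'query': 'green hydrogen',
    'endindex': 0,
    'batchsize': 10,
}
data_url = 'https://api.queryly.com/cnbc/json.aspx?' + urllib.parse.urlencode(params)

# with requests you would then fetch and parse it:
# payload = requests.get(data_url).json()

# the parsing step, shown on a stand-in payload with the assumed shape:
payload = {'results': [{'cn:title': 'Example green hydrogen headline'}]}
titles = [r['cn:title'] for r in payload['results']]
print(titles)
```
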

Driftr95

As mentioned in the other answer, the data about the articles is loaded via another link, which you can find through the Network tab in devtools. [In Chrome, you can open devtools with Ctrl+Shift+I, go to the Network tab to see the requests made, click on the name starting with 'json.aspx?...' to see its details, and then copy the Request URL from the Headers section.]

Once you have the Request URL, you can copy it and make the request in your code to get the data:

# import requests
# dataReqUrl contains the copied Request URL
dataReq = requests.get(dataReqUrl)
for r in dataReq.json()['results']:
    print(r['cn:title'])

If you don't feel like trying to find that one request among 250+ others, you can also try to assemble a shorter form of the URL with something like:

# import urllib.parse

# find link to js file with api key
jsLinks = bsobj.select('link[href][rel="preload"]')
jUrl = [m.get('href') for m in jsLinks if 'main' in m.get('href')][0]

jRes = requests.get(jUrl) # request js file api key

# get api key from javascript
qKey = jRes.text.replace(' ', '').split(
    'QUERYLY_KEY:'
)[-1].split(',')[0].replace('"', '').strip()

# form url
qParams = {
    'queryly_key': qKey,
    'query': search_for, # = 'green hydrogen'
    'batchsize': 10 # can go up to 100 apparently
}
qUrlParams = urllib.parse.urlencode(qParams, quote_via=urllib.parse.quote)
dataReqUrl = f'https://api.queryly.com/cnbc/json.aspx?{qUrlParams}'
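The copied URL also carries an endindex parameter alongside batchsize, which suggests the results are paged. A hedged sketch of stepping through pages by incrementing endindex in steps of batchsize (assuming the endpoint pages this way; the 'YOUR_KEY' placeholder stands in for the qKey extracted above):

```python
import urllib.parse

def page_urls(base_params, pages, batchsize=10):
    """Build one request URL per page by stepping endindex (assumed paging scheme)."""
    urls = []
    for i in range(pages):
        params = dict(base_params, endindex=i * batchsize, batchsize=batchsize)
        urls.append('https://api.queryly.com/cnbc/json.aspx?'
                    + urllib.parse.urlencode(params))
    return urls

urls = page_urls({'queryly_key': 'YOUR_KEY', 'query': 'green hydrogen'}, pages=3)
for u in urls:
    print(u)
# each URL could then be fetched with requests.get(u).json()['results']
```
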

Even though the assembled dataReqUrl is not identical to the copied one, it seems to be giving the same results (I checked with a few different search terms). However, I don't know how reliable this method is, especially compared to the much less convoluted approach with selenium:

# from selenium import webdriver
# from selenium.webdriver.common.by import By
# from selenium.webdriver.support.ui import WebDriverWait
# from selenium.webdriver.support import expected_conditions as EC

# define chromeDriver_path <-- where you saved 'chromedriver.exe'
cnbc_url = "https://www.cnbc.com/search/?query=green%20hydrogen&qsearchterm=green%20hydrogen"
driver = webdriver.Chrome(chromeDriver_path)
# note: selenium 4+ expects webdriver.Chrome(service=Service(chromeDriver_path))
# with `from selenium.webdriver.chrome.service import Service`
driver.get(cnbc_url)

ctSelector = 'span.Card-title'
WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located(
        (By.CSS_SELECTOR, ctSelector)))
cardTitles = driver.find_elements(By.CSS_SELECTOR, ctSelector)

cardTitles_text = [ct.get_attribute('innerText') for ct in cardTitles] 
for c in cardTitles_text: print(c)

In my opinion, this approach is more reliable as well as simpler.