How to scrape news articles from cnbc with keyword "Green hydrogen"?


I am trying to scrape the news articles listed at this URL; all the articles are in span.Card-title. But this gives blank output. Is there any way to resolve this?

from bs4 import BeautifulSoup as soup
import requests

cnbc_url = "https://www.cnbc.com/search/?query=green%20hydrogen&qsearchterm=green%20hydrogen"
html = requests.get(cnbc_url)
bsobj = soup(html.content, 'html.parser')

day = bsobj.find(id="root")
print(day.find_all('span', class_='Card-title'))

for link in bsobj.find_all('span', class_='Card-title'):
    print('Headlines : {}'.format(link.text))

2 Answers

Tornike Skhulukhia

The problem is that the content is not present in the page when it first loads; it is fetched from the server afterwards, using a URL like this, and then added to the page:

https://api.queryly.com/cnbc/json.aspx?queryly_key=31a35d40a9a64ab3&query=green%20hydrogen&endindex=0&batchsize=10&callback=&showfaceted=false&timezoneoffset=-240&facetedfields=formats&facetedkey=formats%7C&facetedvalue=!Press%20Release%7C&needtoptickers=1&additionalindexes=4cd6f71fbf22424d,937d600b0d0d4e23,3bfbe40caee7443e,626fdfcd96444f28

Take a look at the /json.aspx endpoint in devtools; the data seems to be there.
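A sketch of calling that endpoint directly. The parameter names and the key are taken from the URL above; the response shape ('results' / 'cn:title') is an assumption based on what this endpoint returns for the search page, so verify it against the actual JSON in devtools:

```python
import urllib.parse

# parameters seen in the copied URL above
params = {
    'queryly_key': '31a35d40a9a64ab3',
    'query': 'green hydrogen',
    'endindex': 0,
    'batchsize': 10,
}
data_url = 'https://api.queryly.com/cnbc/json.aspx?' + urllib.parse.urlencode(params)

# with requests you would then fetch and parse it:
# payload = requests.get(data_url).json()

# the parsing step, shown on a stand-in payload with the assumed shape:
payload = {'results': [{'cn:title': 'Example green hydrogen headline'}]}
titles = [r['cn:title'] for r in payload['results']]
print(titles)
```
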

Driftr95

As mentioned in the other answer, the data about the articles is loaded via another link, which you can find through the Network tab in devtools. [In Chrome, you can open devtools with Ctrl+Shift+I, go to the Network tab to see the requests made, click on the name starting with 'json.aspx?...' to see its details, and then copy the Request URL from the Headers section.]

Once you have the Request URL, you can copy it and make the request in your code to get the data:

# import requests
# dataReqUrl contains the copied Request URL
dataReq = requests.get(dataReqUrl)
for r in dataReq.json()['results']:
    print(r['cn:title'])

If you don't feel like trying to find that one request among 250+ others, you can also try to assemble a shorter form of the URL with something like:

# import urllib.parse

# find link to js file with api key
jsLinks = bsobj.select('link[href][rel="preload"]')
jUrl = [m.get('href') for m in jsLinks if 'main' in m.get('href')][0]

jRes = requests.get(jUrl) # request js file api key

# get api key from javascript
qKey = jRes.text.replace(' ', '').split(
    'QUERYLY_KEY:'
)[-1].split(',')[0].replace('"', '').strip()

# form url
qParams = {
    'queryly_key': qKey,
    'query': search_for, # = 'green hydrogen'
    'batchsize': 10 # can go up to 100 apparently
}
qUrlParams = urllib.parse.urlencode(qParams, quote_via=urllib.parse.quote)
dataReqUrl = f'https://api.queryly.com/cnbc/json.aspx?{qUrlParams}'
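The copied URL also carries an endindex parameter alongside batchsize, which suggests the results are paged. A hedged sketch of stepping through pages by incrementing endindex in steps of batchsize (assuming the endpoint pages this way; the 'YOUR_KEY' placeholder stands in for the qKey extracted above):

```python
import urllib.parse

def page_urls(base_params, pages, batchsize=10):
    """Build one request URL per page by stepping endindex (assumed paging scheme)."""
    urls = []
    for i in range(pages):
        params = dict(base_params, endindex=i * batchsize, batchsize=batchsize)
        urls.append('https://api.queryly.com/cnbc/json.aspx?'
                    + urllib.parse.urlencode(params))
    return urls

urls = page_urls({'queryly_key': 'YOUR_KEY', 'query': 'green hydrogen'}, pages=3)
for u in urls:
    print(u)
# each URL could then be fetched with requests.get(u).json()['results']
```
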

Even though the assembled dataReqUrl is not identical to the copied one, it seems to be giving the same results (I checked with a few different search terms). However, I don't know how reliable this method is, especially compared to the much less convoluted approach with selenium:

# from selenium import webdriver
# from selenium.webdriver.common.by import By
# from selenium.webdriver.support.ui import WebDriverWait
# from selenium.webdriver.support import expected_conditions as EC

# define chromeDriver_path <-- where you saved 'chromedriver.exe'
cnbc_url = "https://www.cnbc.com/search/?query=green%20hydrogen&qsearchterm=green%20hydrogen"
driver = webdriver.Chrome(chromeDriver_path)
# note: selenium 4+ expects webdriver.Chrome(service=Service(chromeDriver_path))
# with `from selenium.webdriver.chrome.service import Service`
driver.get(cnbc_url)

ctSelector = 'span.Card-title'
WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located(
        (By.CSS_SELECTOR, ctSelector)))
cardTitles = driver.find_elements(By.CSS_SELECTOR, ctSelector)

cardTitles_text = [ct.get_attribute('innerText') for ct in cardTitles] 
for c in cardTitles_text: print(c)

In my opinion, this approach is more reliable as well as simpler.