I wrote the following Python code to extract 'odor' information from PubChem for a particular molecule, in this case nonanal (CID=31289). The webpage for this molecule is: https://pubchem.ncbi.nlm.nih.gov/compound/31289#section=Odor
import requests
from bs4 import BeautifulSoup
url = 'https://pubchem.ncbi.nlm.nih.gov/compound/31289#section=Odor'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
odor_section = soup.find('section', {'id': 'Odor'})
odor_info = odor_section.find('div', {'class': 'section-content'})
print(odor_info.text.strip())
I get the following error: AttributeError: 'NoneType' object has no attribute 'find'. It seems that BeautifulSoup does not receive the whole page, so the 'Odor' section is never found.
I expect the following output: Orange-rose odor, Floral, waxy, green
The page in question makes an AJAX request to load its data. You can see this request in the Network tab of the browser's dev tools (F12 in many browsers).
That is to say, the data simply isn't there when the initial page loads - so it isn't found by BeautifulSoup.
To solve the problem:
use Selenium, which can actually run the JavaScript code and thus populate the page with the desired data; or
simply query the API according to the request seen when loading the page in the browser. Thus:
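A minimal sketch of the second option, assuming the request seen in the Network tab goes to PubChem's PUG View endpoint in the form shown below (check the URL against what your own browser shows). It uses only the standard library; `requests.get(url).json()` would work the same way. The helper names `pug_view_url` and `fetch_record` are made up for this example.

```python
import json
import urllib.request

# Assumed endpoint pattern; verify it against the request in the Network tab.
PUG_VIEW = "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON"


def pug_view_url(cid):
    """Build the JSON data URL for a compound ID (hypothetical helper)."""
    return PUG_VIEW.format(cid=int(cid))


def fetch_record(cid, timeout=30):
    """Download and decode the full JSON record for one compound."""
    with urllib.request.urlopen(pug_view_url(cid), timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_record(31289)` should then return the nested dictionary for nonanal, with no HTML parsing involved.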
Parsing the JSON Reply
Parsing it proves a bit of a challenge, as it is comprised of many lists. If the order of properties isn't guaranteed, you could opt for a solution like this:
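One way to avoid depending on list positions is to walk the whole reply and pick out sections by name. The sketch below assumes the reply nests sections under `Section` lists, labels each with a `TOCHeading` key, and stores text under `Information` → `Value` → `StringWithMarkup` → `String`; confirm those key names against the actual JSON. The `sample` dictionary is a hand-written stand-in for the real reply, not data copied from PubChem.

```python
def find_sections(node, heading):
    """Yield every dict whose 'TOCHeading' equals heading, at any depth."""
    if isinstance(node, dict):
        if node.get("TOCHeading") == heading:
            yield node
        for value in node.values():
            yield from find_sections(value, heading)
    elif isinstance(node, list):
        for item in node:
            yield from find_sections(item, heading)


def section_strings(section):
    """Collect the text entries of one section (key names are assumptions)."""
    strings = []
    for info in section.get("Information", []):
        for markup in info.get("Value", {}).get("StringWithMarkup", []):
            strings.append(markup["String"])
    return strings


# Hand-written miniature of the assumed reply structure, for illustration only.
sample = {
    "Record": {
        "Section": [
            {
                "TOCHeading": "Chemical and Physical Properties",
                "Section": [
                    {
                        "TOCHeading": "Odor",
                        "Information": [
                            {"Value": {"StringWithMarkup": [{"String": "Orange-rose odor"}]}},
                            {"Value": {"StringWithMarkup": [{"String": "Floral, waxy, green"}]}},
                        ],
                    }
                ],
            }
        ]
    }
}

odors = [text for sec in find_sections(sample, "Odor") for text in section_strings(sec)]
print(", ".join(odors))  # Orange-rose odor, Floral, waxy, green
```

With the real reply, pass the decoded JSON in place of `sample`. Because the search is recursive, it keeps working if sections are reordered or nested one level deeper.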
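Otherwise, if the structure is fixed, paste the reply into a JSON-formatting website like jsonformatter.com, work out the path to the value from the schema, and index into it directly.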