How to extract 'Odor' information from PubChem using BeautifulSoup

Question

148 Views Asked by John Mommers At 18 February 2023 at 13:26

I wrote the following Python code to extract 'odor' information from PubChem for a particular molecule, in this case nonanal (CID=31289). The webpage for this molecule is: https://pubchem.ncbi.nlm.nih.gov/compound/31289#section=Odor

import requests
from bs4 import BeautifulSoup

url = 'https://pubchem.ncbi.nlm.nih.gov/compound/31289#section=Odor'
page = requests.get(url)

soup = BeautifulSoup(page.content, 'html.parser')
odor_section = soup.find('section', {'id': 'Odor'})
odor_info = odor_section.find('div', {'class': 'section-content'})

print(odor_info.text.strip())

I get the following error: AttributeError: 'NoneType' object has no attribute 'find'. It seems that BeautifulSoup does not extract all of the page's information.

I expect the following output: Orange-rose odor, Floral, waxy, green

**Yarin_007** · Accepted Answer · 2023-02-18T13:37:34.167000

The page in question makes an AJAX request to load its data. You can see this in a web browser by opening the Network tab of the developer tools (F12 in many browsers) and reloading the page.

That is to say, the data simply isn't there when the initial page loads - so it isn't found by BeautifulSoup.

To solve the problem:

- use Selenium, which can actually run the JavaScript code and thus populate the page with the desired data; or
- simply query the API directly, repeating the request seen when loading the page in the browser. Thus:

import requests

PubChem_Nonanal_CID = 31289
# PUG View endpoint that the compound page itself calls to load its data
compound_data_url = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{}/JSON/'
compound_info = requests.get(compound_data_url.format(PubChem_Nonanal_CID))

print(compound_info.json())
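
The request above can be wrapped in a small helper that fails loudly on HTTP errors instead of handing back an error page. This is only a minimal sketch, not an official PubChem client: the names `pug_view_url` and `fetch_compound_record` are made up for illustration, and the 30-second timeout is an arbitrary assumption.

```python
PUG_VIEW_URL = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{}/JSON/'


def pug_view_url(cid):
    """Build the PUG View URL for a compound ID (hypothetical helper)."""
    return PUG_VIEW_URL.format(cid)


def fetch_compound_record(cid, timeout=30):
    """Download and decode the JSON record for one compound (hypothetical helper)."""
    import requests  # third-party; imported here so pug_view_url works without it

    response = requests.get(pug_view_url(cid), timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on a 4xx/5xx reply
    return response.json()
```

Calling `fetch_compound_record(31289)` should then give the same dictionary as `compound_info.json()` does for nonanal.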

Parsing the JSON Reply

Parsing it proves a bit of a challenge, as it is composed of many nested lists. If the order of properties isn't guaranteed, you could opt for a solution like this:

# Walk the nested sections by heading instead of relying on list positions.
for section in compound_info.json()['Record']['Section']:
    if section['TOCHeading'] == 'Chemical and Physical Properties':
        for sub_section in section['Section']:
            if sub_section['TOCHeading'] == 'Experimental Properties':
                for sub_sub_section in sub_section['Section']:
                    if sub_sub_section['TOCHeading'] == 'Odor':
                        # First string of the first 'Odor' entry
                        print(sub_sub_section['Information'][0]['Value']['StringWithMarkup'][0]['String'])
                        break
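
The heading-by-heading search can also be written recursively, so that it does not depend on how deep the 'Odor' section sits and gathers every entry rather than only the first. This is a minimal sketch, assuming the reply keeps the `TOCHeading` / `Section` / `Information` layout used above; `find_section` and `section_strings` are hypothetical helper names, and `sample` is a hand-made stand-in for the real reply, not actual PubChem data.

```python
def find_section(sections, heading):
    """Return the first section whose TOCHeading equals heading (depth-first), or None."""
    for section in sections:
        if section.get('TOCHeading') == heading:
            return section
        found = find_section(section.get('Section', []), heading)
        if found is not None:
            return found
    return None


def section_strings(section):
    """Collect every 'String' found under a section's Information entries."""
    strings = []
    for info in section.get('Information', []):
        for item in info.get('Value', {}).get('StringWithMarkup', []):
            strings.append(item['String'])
    return strings


# Hand-made stand-in for compound_info.json(); not real PubChem data.
sample = {'Record': {'Section': [
    {'TOCHeading': 'Names and Identifiers'},
    {'TOCHeading': 'Chemical and Physical Properties', 'Section': [
        {'TOCHeading': 'Experimental Properties', 'Section': [
            {'TOCHeading': 'Odor', 'Information': [
                {'Value': {'StringWithMarkup': [{'String': 'first odor note'}]}},
                {'Value': {'StringWithMarkup': [{'String': 'second odor note'}]}},
            ]},
        ]},
    ]},
]}}

odor_section = find_section(sample['Record']['Section'], 'Odor')
odors = section_strings(odor_section) if odor_section else []
print(', '.join(odors))
```

With the real reply, passing `compound_info.json()['Record']['Section']` in place of `sample['Record']['Section']` should work the same way, provided the layout assumption holds.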

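Otherwise, paste the reply into a JSON viewer such as jsonformatter.com and follow the path it shows to the value. Note that the positional indices only work as long as the sections keep their order: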

# object►Record►Section►3►Section►1►Section►2►Information►0►Value►StringWithMarkup►0►String

odor = compound_info.json()['Record']['Section'][3]['Section'][1]['Section'][2]['Information'][0]['Value']['StringWithMarkup'][0]['String']
