Removing <wbr> tags and grabbing the info between

Question

1k Views Asked by TimTom At 20 August 2025 at 02:48

I'm scraping data from a webpage and have done so for a certain section that has the <br> tag.

<div class="scrollWrapper">
    <h3>Smiles</h3>
    CC=O<br>
    <button type="button" id="downloadSmiles">Download</button>
</div>

I solved this with the script below, which outputs CC=O.

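import requests
from lxml import html

# `substance` is assumed to hold the chemical name being looked up.
page = requests.get('http://chem.sis.nlm.nih.gov/chemidplus/name/' + substance)
tree = html.fromstring(page.text)
if "Smiles" in page.text:
    # Take the text node immediately before the first <br> under the heading's parent.
    smiles = tree.xpath('normalize-space(//*[text()="Smiles"]/..//br[1]/preceding-sibling::text()[1])')
else:
    smiles = ""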

However, while browsing the pages of other chemicals I found some that have the <wbr> tag in them. I have no idea how to get rid of these tags while grabbing the information between them. An example is shown below; my desired output is c1(c2ccccc2)ccc(N)cc1.

<div class="scrollWrapper">
   <h3>Smiles</h3>
   c1(c2ccccc2)<wbr>ccc(N)<wbr>cc1<br>
   <button type="button" id="downloadSmiles">Download</button>
</div>

Bun On 07 July 2015 at 18:17

The <wbr> (Word Break Opportunity) tag specifies where in a text it would be ok to add a line-break. Tip: When a word is too long, or you are afraid that the browser will break your lines at the wrong place, you can use the <wbr> element to add word break opportunities.

I use BeautifulSoup to parse this data.

from bs4 import BeautifulSoup as bs

html = """
<div class="scrollWrapper">
   <h3>Smiles</h3>
   c1(c2ccccc2)<wbr>ccc(N)<wbr>cc1<br>
   <button type="button" id="downloadSmiles">Download</button>
</div>
"""

soup = bs(html, "html.parser")
rows = soup.get_text().split()
print(rows[1])

Output:

   c1(c2ccccc2)ccc(N)cc1
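
Picking `rows[1]` depends on the SMILES string being the second whitespace-separated token, which holds for this fragment but may not hold on a full page. A minimal sketch of a less positional variant, assuming the value always sits between the `Smiles` heading and the next `<br>` (the function name `extract_smiles` and the variable `sample` are made up for illustration):

```python
from bs4 import BeautifulSoup, NavigableString

sample = """
<div class="scrollWrapper">
   <h3>Smiles</h3>
   c1(c2ccccc2)<wbr>ccc(N)<wbr>cc1<br>
   <button type="button" id="downloadSmiles">Download</button>
</div>
"""

def extract_smiles(page_source):
    """Join the text found between the 'Smiles' heading and the next <br>."""
    soup = BeautifulSoup(page_source, "html.parser")
    heading = soup.find("h3", string="Smiles")
    if heading is None:
        return ""
    parts = []
    # Walk the document in order, starting right after the heading.
    node = heading.next_sibling
    while node is not None and getattr(node, "name", None) != "br":
        if isinstance(node, NavigableString):
            parts.append(str(node))  # keep text, skip tags such as <wbr>
        node = node.next_element
    return "".join(parts).strip()

print(extract_smiles(sample))  # c1(c2ccccc2)ccc(N)cc1
```

The walk uses `next_element` rather than `next_sibling` so that it still collects the pieces if the parser happens to nest the text inside the `<wbr>` tags.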

AudioBubble On 07 July 2015 at 18:18

Just to point out: you can get rid of a specific string by doing:

str.replace(old, "")

So for instance:

"c1(c2ccccc2)<wbr>ccc(N)<wbr>cc1<br>".replace("<wbr>", "").replace("<br>", "")

However, the other answers get closer to the desired result.
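
If the markup might also contain variants such as `<wbr/>` or `<BR />`, chained `replace` calls get awkward. A small sketch using only the standard library's `re` module, assuming these are the only tags to drop (the names `BREAK_TAG` and `strip_break_tags` are made up for illustration):

```python
import re

# Matches <br>, <wbr>, and self-closing or upper-case variants of either.
BREAK_TAG = re.compile(r"<\s*w?br\s*/?\s*>", re.IGNORECASE)

def strip_break_tags(fragment):
    """Remove <br> and <wbr> tags, leaving the surrounding text untouched."""
    return BREAK_TAG.sub("", fragment)

print(strip_break_tags("c1(c2ccccc2)<wbr>ccc(N)<wbr>cc1<br>"))  # c1(c2ccccc2)ccc(N)cc1
```

This only cleans a fragment that has already been isolated; it does not locate the SMILES section within a full page.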

**Anand S Kumar** · Accepted Answer

The easiest thing to do would be to replace the <wbr> string in page.text with an empty string before you parse it into HTML. Since it's within < and >, I doubt that any of the useful info you are looking for would contain it.

Example -

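import requests
from lxml import html

# `substance` is assumed to hold the chemical name being looked up.
page = requests.get('http://chem.sis.nlm.nih.gov/chemidplus/name/' + substance)
# Drop every <wbr> before parsing so the SMILES string stays in one text node.
tree = html.fromstring(page.text.replace('<wbr>', ''))
if "Smiles" in page.text:
    smiles = tree.xpath('normalize-space(//*[text()="Smiles"]/..//br[1]/preceding-sibling::text()[1])')
else:
    smiles = ""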

Otherwise you can use @Bun's BeautifulSoup solution, or write more complex xpaths.

Also, an easier xpath for your case should be -

'normalize-space(//*[text()="Smiles"]/following-sibling::text()[1])'

Rather than finding the Smiles element, taking its parent, finding the first br element that is a descendant of that parent, and then taking the text that is its preceding sibling, you can directly take the first text node that follows the Smiles element as a sibling.
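
Putting the two suggestions together, here is a minimal offline sketch: it strips `<wbr>` from the raw source and then applies the shorter xpath. The function name `extract_smiles` and the variable `sample` are made up for illustration, and the network request is left out so the block runs on its own.

```python
from lxml import html

sample = """
<div class="scrollWrapper">
   <h3>Smiles</h3>
   c1(c2ccccc2)<wbr>ccc(N)<wbr>cc1<br>
   <button type="button" id="downloadSmiles">Download</button>
</div>
"""

def extract_smiles(page_source):
    """Return the text right after the 'Smiles' heading, or '' if there is none."""
    # Remove <wbr> first so the value is parsed as a single text node.
    tree = html.fromstring(page_source.replace("<wbr>", ""))
    # normalize-space() of an empty node set is '', so no separate
    # '"Smiles" in page_source' check is needed here.
    return tree.xpath(
        'normalize-space(//*[text()="Smiles"]/following-sibling::text()[1])'
    )

print(extract_smiles(sample))  # c1(c2ccccc2)ccc(N)cc1
```

On a full page, `//*[text()="Smiles"]` could match more than one element; narrowing it to `//h3[text()="Smiles"]` may be safer if the heading is always an `h3`.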
