Removing <wbr> tags and grabbing the info between


I'm scraping data from a webpage and have already handled a section that contains a <br> tag.

<div class="scrollWrapper">
    <h3>Smiles</h3>
    CC=O<br>
    <button type="button" id="downloadSmiles">Download</button>
</div>

I solved this with the script below, which outputs CC=O.

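import requests
from lxml import html

# substance holds the chemical name to look up (defined elsewhere)
page = requests.get('http://chem.sis.nlm.nih.gov/chemidplus/name/' + substance)
tree = html.fromstring(page.text)
if "Smiles" in page.text:
    # text node just before the first <br> under the parent of the "Smiles" heading
    smiles = tree.xpath('normalize-space(//*[text()="Smiles"]/..//br[1]/preceding-sibling::text()[1])')
else:
    smiles = ""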

However, as I browsed the pages of other chemicals, I found some that contain <wbr> tags. I have no idea how to get rid of them while still grabbing the text between them. An example is shown below; my desired output is c1(c2ccccc2)ccc(N)cc1.

<div class="scrollWrapper">
   <h3>Smiles</h3>
   c1(c2ccccc2)<wbr>ccc(N)<wbr>cc1<br>
   <button type="button" id="downloadSmiles">Download</button>
</div>

There are 3 best solutions below

Best answer

The easiest approach is to replace the <wbr> string in page.text with an empty string before you parse it as HTML. Since the match is the literal tag, including the < and >, it is unlikely to remove any of the data you are looking for.

Example:

import requests
from lxml import html

# substance holds the chemical name to look up (defined elsewhere)
page = requests.get('http://chem.sis.nlm.nih.gov/chemidplus/name/' + substance)
# drop every <wbr> tag before parsing so the SMILES string ends up in one text node
tree = html.fromstring(page.text.replace('<wbr>', ''))
if "Smiles" in page.text:
    smiles = tree.xpath('normalize-space(//*[text()="Smiles"]/..//br[1]/preceding-sibling::text()[1])')
else:
    smiles = ""

Otherwise, you can use @Bun's BeautifulSoup solution, or write more complex XPath expressions.

Also, a simpler XPath for your case would be:

'normalize-space(//*[text()="Smiles"]/following-sibling::text()[1])'

The original expression finds the Smiles element, goes up to its parent, finds the first br element among the parent's descendants, and then takes the text node that precedes it.

Instead, you can directly take the first text node that follows the Smiles element.
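To check that the two XPath expressions agree, here is a minimal offline sketch that runs both against the two snippets from the question instead of the live site. The names PLAIN, WITH_WBR, OLD_XPATH, NEW_XPATH and extract_smiles are made up for illustration, and it assumes lxml is installed.

```python
from lxml import html

# The two snippets from the question, inlined so no network access is needed.
PLAIN = """
<div class="scrollWrapper">
    <h3>Smiles</h3>
    CC=O<br>
    <button type="button" id="downloadSmiles">Download</button>
</div>
"""

WITH_WBR = """
<div class="scrollWrapper">
   <h3>Smiles</h3>
   c1(c2ccccc2)<wbr>ccc(N)<wbr>cc1<br>
   <button type="button" id="downloadSmiles">Download</button>
</div>
"""

# Original expression from the question and the shorter one suggested above.
OLD_XPATH = 'normalize-space(//*[text()="Smiles"]/..//br[1]/preceding-sibling::text()[1])'
NEW_XPATH = 'normalize-space(//*[text()="Smiles"]/following-sibling::text()[1])'


def extract_smiles(markup, xpath=NEW_XPATH):
    """Remove literal <wbr> tags, parse the markup, and evaluate the XPath."""
    tree = html.fromstring(markup.replace('<wbr>', ''))
    return tree.xpath(xpath)


for snippet in (PLAIN, WITH_WBR):
    print(extract_smiles(snippet, OLD_XPATH), extract_smiles(snippet, NEW_XPATH))
```

Both expressions should give the same string for each snippet, because once the <wbr> tags are gone the SMILES text is a single text node sitting between the h3 and the br.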


<wbr>

The <wbr> (Word Break Opportunity) tag specifies where in a text it would be OK to add a line break. Tip: when a word is too long, or you are afraid that the browser will break your lines at the wrong place, you can use the <wbr> element to add word break opportunities.

I use BeautifulSoup to parse this data.

from bs4 import BeautifulSoup as bs

html = """
<div class="scrollWrapper">
   <h3>Smiles</h3>
   c1(c2ccccc2)<wbr>ccc(N)<wbr>cc1<br>
   <button type="button" id="downloadSmiles">Download</button>
</div>
"""

soup = bs(html, "html.parser")
rows = soup.get_text().split()
print(rows[1])

Output:

   c1(c2ccccc2)ccc(N)cc1
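
Indexing rows[1] relies on the SMILES string being the second whitespace-separated token on the page, which may not hold for a full page. A slightly more targeted sketch, assuming the same markup as above, is to anchor on the Smiles heading and collect the text that follows it up to the first <br>. The function name extract_smiles and the SAMPLE constant are made up for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

SAMPLE = """
<div class="scrollWrapper">
   <h3>Smiles</h3>
   c1(c2ccccc2)<wbr>ccc(N)<wbr>cc1<br>
   <button type="button" id="downloadSmiles">Download</button>
</div>
"""


def extract_smiles(markup):
    """Return the text between the 'Smiles' heading and the next <br>."""
    soup = BeautifulSoup(markup, "html.parser")
    heading = soup.find("h3", string="Smiles")
    if heading is None:
        return ""
    # Remove every <wbr> tag but keep any content it may wrap.
    for wbr in soup.find_all("wbr"):
        wbr.unwrap()
    parts = []
    for node in heading.next_siblings:
        if isinstance(node, NavigableString):
            parts.append(str(node))
        elif node.name == "br":
            break  # the SMILES string ends at the first <br>
    return "".join(parts).strip()


print(extract_smiles(SAMPLE))
```

If the page has no Smiles heading, this returns an empty string, which mirrors the else branch in the question's script.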

Just to point out: you can get rid of a specific string by doing:

str.replace(old, "")

So for instance:

"c1(c2ccccc2)<wbr>ccc(N)<wbr>cc1<br>".replace("<wbr>", "").replace("<br>", "")

However, the other answers get closer to the desired result.
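
One caveat with a literal str.replace is that it only removes the exact spelling <wbr>. If the pages might also contain variants such as <WBR> or <wbr/>, which is an assumption and not something shown in the question, a regular expression from the standard library is one way to cover them. The names WBR_TAG and strip_wbr are made up for illustration.

```python
import re

# Matches <wbr>, <WBR>, <wbr/>, <wbr /> and </wbr>; tags with attributes are not handled.
WBR_TAG = re.compile(r"</?wbr\s*/?>", re.IGNORECASE)


def strip_wbr(markup):
    """Return markup with every <wbr> tag variant removed."""
    return WBR_TAG.sub("", markup)


print(strip_wbr("c1(c2ccccc2)<wbr>ccc(N)<WBR/>cc1"))
```

The cleaned string can then be passed to whichever parser you prefer.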