&nbsp; in Python 3.6 - getting text using an XPath expression

Question

Asked by Valdrin Shala on 07 November 2018 at 18:01

<div class="card-block cms">
<p>and then have a tea or coffee on the balcony of the cafeteria.</p>
<p>&nbsp;</p>
</div>

I am trying to check whether the text I crawl from a website contains &nbsp;:

texts = driver.find_element_by_xpath("//div[@class='card-block cms']")
textInDivTag = texts.text
print(textInDivTag)
if u"\xa0" in textInDivTag:
    print("yes")

My output is as follows:

and then have a tea or coffee on the balcony of the cafeteria.

As you can see, it doesn't recognize the non-breaking space.

There are 3 best solutions below

ewwink On 08 November 2018 at 10:25

To match u"\xa0" (the non-breaking space), read the innerText attribute:

textInDivTag = texts.get_attribute('innerText')

To match u"\x20" (a regular space), use the .text property:

textInDivTag = texts.text

undetected Selenium On 13 November 2018 at 10:54

Non-breaking Space (&nbsp;)

A non-breaking space, i.e. &nbsp;, is a space that will not break into a new line. Two words separated by a non-breaking space will stick together (not break into a new line). This is handy when breaking the words might be disruptive. Examples:

§ 10
10 km/h
10 PM

Another common use of the non-breaking space is to prevent browsers from collapsing spaces in HTML pages. If you write 10 spaces in your text, the browser will remove 9 of them. To add real spaces to your text, you can use the &nbsp; character entity.
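
To make the distinction concrete in Python, here is a minimal sketch that uses only the standard library. It shows that the &nbsp; entity decodes to U+00A0, a different code point from the regular space U+0020, so a substring test for one does not match the other. The sample string is made up for illustration.

```python
import html
import unicodedata

# Decode the HTML entity into the character it stands for.
nbsp = html.unescape("&nbsp;")

print(repr(nbsp))                # '\xa0'
print(unicodedata.name(nbsp))    # NO-BREAK SPACE
print(unicodedata.name(" "))     # SPACE

# Hypothetical sample text: "10", a non-breaking space, then "km/h".
sample = html.unescape("10&nbsp;km/h")

has_nbsp = "\xa0" in sample      # the non-breaking space is present
has_space = " " in sample        # there is no regular space in the sample
print(has_nbsp, has_space)
```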

Element.innerHTML

Syntax:

const content = element.innerHTML;
element.innerHTML = htmlString;

Value: Element.innerHTML is a DOMString containing the HTML serialization of the element's descendants. Setting the value of innerHTML removes all of the element's descendants and replaces them with nodes constructed by parsing the HTML given in the string htmlString.
Note: If a <div>, <span>, or <noembed> node has a child text node that includes the characters (&), (<), or (>), innerHTML returns these characters as the HTML entities &amp;, &lt; and &gt; respectively. Use Node.textContent to get a raw copy of these text nodes' contents.

Node.innerText

Node.innerText is a property that represents the rendered text content of a node and its descendants. As a getter, it approximates the text the user would get if they highlighted the contents of the element with the cursor and then copied to the clipboard.

Node.textContent

Node.textContent property represents the text content of a node and its descendants.

Syntax:

var text = element.textContent;
element.textContent = "this is some sample text";

Description:
textContent returns null if the node is a document, a DOCTYPE, or a notation. To grab all of the text and CDATA data for the whole document, one could use document.documentElement.textContent.
If the node is a CDATA section, comment, processing instruction, or text node, textContent returns the text inside this node (the nodeValue).
For other node types, textContent returns the concatenation of the textContent of every child node, excluding comments and processing instructions. This is an empty string if the node has no children.
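
The concatenation rule described above can be approximated outside a browser. The sketch below is not how a browser implements textContent; it is a rough stand-in that uses Python's standard html.parser module to join every text node while skipping comments and processing instructions. The names TextContentCollector and text_content are hypothetical, chosen only for this example.

```python
from html.parser import HTMLParser


class TextContentCollector(HTMLParser):
    """Collect text nodes in document order (a rough textContent stand-in)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text only; comments and processing instructions go to
        # handle_comment / handle_pi, which do nothing by default.
        self.parts.append(data)


def text_content(markup):
    """Return the concatenated text of all text nodes in the markup."""
    parser = TextContentCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


# The markup from the question.
markup = (
    '<div class="card-block cms">'
    "<p>and then have a tea or coffee on the balcony of the cafeteria.</p>"
    "<p>&nbsp;</p>"
    "</div>"
)

result = text_content(markup)
print(repr(result))
print("\xa0" in result)
```

Because the entity is decoded rather than normalized, the non-breaking space survives as u"\xa0" in the collected text.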

This use case

As your use case is to check whether the website contains &nbsp;, you have to use the textContent property as follows:

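# find_elements (plural) returns a list, so the result can be iterated.
texts = driver.find_elements_by_xpath("//div[@class='card-block cms']")
for my_text in texts:
    # textContent is a DOM property, so read it through get_attribute().
    textInDivTag = my_text.get_attribute("textContent")
    print(textInDivTag)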
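
A small helper can wrap this check so that it works with any object exposing Selenium's get_attribute method. This is a sketch under assumptions: the function name contains_nbsp is hypothetical, and the FakeElement class exists only so the example runs without a browser. With a real driver you would pass the WebElement returned by the locator call instead.

```python
NBSP = "\xa0"  # U+00A0, the character that &nbsp; decodes to


def contains_nbsp(element):
    """Return True if the element's textContent holds a non-breaking space."""
    # get_attribute("textContent") reads the DOM property through Selenium.
    # It may return None, so fall back to an empty string.
    content = element.get_attribute("textContent") or ""
    return NBSP in content


class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement, for illustration only."""

    def __init__(self, text_content):
        self._text_content = text_content

    def get_attribute(self, name):
        return self._text_content if name == "textContent" else None


with_nbsp = FakeElement("cafeteria.\xa0")
without_nbsp = FakeElement("cafeteria. ")
print(contains_nbsp(with_nbsp))
print(contains_nbsp(without_nbsp))
```

In a live session the call would look like contains_nbsp(driver.find_element_by_xpath("//div[@class='card-block cms']")), using the same locator as the question.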

soerface (Accepted Answer) On 07 November 2018 at 20:49

The character is recognized, but it is being converted to a normal space (u"\x20").

According to the comment in the Java Selenium source code, .text / .getText() returns the visible text; the comment references the W3C WebDriver specification, section "11.3.5 Get Element Text" (emphasis added by me):

The Get Element Text command intends to return an element’s text “as rendered”. An element’s rendered text is also used for locating a elements by their link text and partial link text.

One of the major inputs to this specification was the open source Selenium project. This was in wide-spread use before this specification written, and so had set user expectations of how the Get Element Text command should work. As such, the approach presented here is known to be flawed, but provides the best compatibility with existing users.

So this behavior probably follows the specification, but I could not yet find the source code that specifically replaces non-breaking spaces with regular whitespace. I also could not find an issue in the Selenium repository, but you could try opening one.
