XPath - how to exclude text from child node


I want this output (example):

I want this

I'm working with an XML/TEI document and need an XPath expression that returns the text in div/u, but without the text inside child elements such as <desc> or <vocal><desc>, and without the text that sits between two <anchor/> elements (example).

From the code (example):

<div>
<u> 
I want this but 
     *<anchor/><desc>I don't want this</desc><anchor/>
      <anchor/>I don't want this also<anchor/>
     <del type="">I don't want this too</del>*
I want this
</u>
</div>

I tried (example):

TEI//u[not(desc)]

But that excludes every <u> that has a <desc> child, rather than just leaving out the <desc> text.

y.arazim On BEST ANSWER

If I read your requirements as:

select any text node that is a child of u (i.e. not inside another element such as desc or del), but exclude text nodes that are in-between two anchor elements

then I arrive at the following expression:

//u/text()[not(preceding-sibling::*[1][self::anchor] and following-sibling::*[1][self::anchor])]

Applying it to the given input produces:

" 
I want this but 
     **
I want this
"

which is different from the output you say you want, but nevertheless conforms to the stated requirements.
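
For anyone who wants to check the behaviour without an XPath 1.0 engine at hand, here is a minimal Python sketch of the same selection logic. It uses only the standard library's xml.etree.ElementTree, whose XPath subset does not support text() or the sibling axes, so the predicate is re-implemented by hand rather than evaluated. The helper name is hypothetical, and the sample is an assumed copy of the question's input without the asterisks and without a TEI namespace.

```python
import xml.etree.ElementTree as ET

# Assumed sample: the question's input, minus the asterisks, no namespace.
SAMPLE = """<div>
<u>
I want this but
     <anchor/><desc>I don't want this</desc><anchor/>
      <anchor/>I don't want this also<anchor/>
     <del type="">I don't want this too</del>
I want this
</u>
</div>"""


def direct_text_outside_anchor_pairs(u):
    """Hand-written equivalent of
    u/text()[not(preceding-sibling::*[1][self::anchor]
                 and following-sibling::*[1][self::anchor])]
    for a single <u> element."""
    children = list(u)
    kept = []
    # Slot -1 is u.text; slot i is the tail that follows children[i].
    for i in range(-1, len(children)):
        text = u.text if i < 0 else children[i].tail
        if not text:
            continue
        prev_is_anchor = i >= 0 and children[i].tag == "anchor"
        next_is_anchor = i + 1 < len(children) and children[i + 1].tag == "anchor"
        if not (prev_is_anchor and next_is_anchor):
            kept.append(text)
    return kept


root = ET.fromstring(SAMPLE)
pieces = [t for u in root.iter("u") for t in direct_text_outside_anchor_pairs(u)]
# Collapse runs of whitespace so the result is easy to read.
result = " ".join("".join(pieces).split())
print(result)  # I want this but I want this
```

In a real TEI file the elements are normally in the TEI namespace, so the tag comparisons (and the XPath itself) would need the namespace taken into account.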

nelegalas On

This XPath expression returns every text node under the "u" elements, at any depth, except those that have a "desc" or "anchor" ancestor:

TEI//u//text()[not(ancestor::desc) and not(ancestor::anchor)]
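
To see what an ancestor-based filter of this kind keeps, here is a rough Python sketch using only the standard library (ElementTree cannot evaluate text() or ancestor:: itself, so the walk is written by hand). The function name and the tiny sample are assumptions made up for illustration. Because <anchor/> is empty, text that merely sits between two anchors has no anchor ancestor, and text inside <del> is not filtered either.

```python
import xml.etree.ElementTree as ET


def descendant_text_outside(elem, skipped=("desc", "anchor")):
    """Yield text nodes under elem that have no ancestor named in `skipped`."""
    if elem.text:
        yield elem.text
    for child in elem:
        if child.tag not in skipped:
            # Descend only into children that are not filtered out.
            yield from descendant_text_outside(child, skipped)
        if child.tail:
            # The tail follows the child, so its parent is elem, not child.
            yield child.tail


# Hypothetical miniature of the question's input.
u = ET.fromstring(
    "<u>keep <anchor/><desc>drop</desc><anchor/>"
    "<anchor/>between<anchor/><del>deleted</del> keep</u>"
)
texts = list(descendant_text_outside(u))
print(texts)  # ['keep ', 'between', 'deleted', ' keep']
```

So on input shaped like the question's, only the <desc> text is dropped; the text between the anchors and the <del> text would still come through.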
kjhughes On

Old answer

This XPath,

//u/text()

will select all text node children of all u elements in the document:

I want this but 
I want this

If you only want the first text node child of each u element, use

//u/text()[1]

Note that this will select first text nodes of all u elements in the document. If you only want the first of these text nodes, use

(//u/text())[1]

Updated answer

Oops, a comment by @y.arazim made me realize that the tags here,

<anchor/>I don't want this also<anchor/>

despite their positioning, are self-closing, not opening and closing tags around the text. I wrote the old answer based on that mistake.

See @y.arazim's answer (+1) for an XPath that meets his interpretation of OP's requirements (and properly accounts for the self-closing anchor tags).

If OP more simply wants the u text node children that come before the first anchor sibling or after the last one, then this XPath would suffice:

//u/text()[not(preceding-sibling::anchor and following-sibling::anchor)]
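
The difference from the immediate-neighbour test shows up when other elements sit between the anchors. Below is a small Python sketch of this simpler predicate, again hand-written against the standard library's ElementTree rather than evaluated as XPath; the function name and the sample string are hypothetical.

```python
import xml.etree.ElementTree as ET


def text_not_between_any_anchors(u):
    """Hand-written equivalent of
    u/text()[not(preceding-sibling::anchor and following-sibling::anchor)]
    for a single <u> element."""
    children = list(u)
    tags = [child.tag for child in children]
    kept = []
    # Slot -1 is u.text; slot i is the tail that follows children[i].
    for i in range(-1, len(children)):
        text = u.text if i < 0 else children[i].tail
        if not text:
            continue
        anchor_before = "anchor" in tags[: i + 1]   # any earlier anchor sibling
        anchor_after = "anchor" in tags[i + 1 :]    # any later anchor sibling
        if not (anchor_before and anchor_after):
            kept.append(text)
    return kept


# Hypothetical input: a <del> sits between the two anchors.
u = ET.fromstring("<u>a<anchor/>b<del>x</del>c<anchor/>d</u>")
kept = text_not_between_any_anchors(u)
print(kept)  # ['a', 'd']
```

Here "b" and "c" are dropped because each has an anchor somewhere before it and somewhere after it. A predicate that only looks at the nearest element on each side would keep both, since each of them has <del> as one of its immediate neighbours.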