How to remove XML nodes that are not selected by an array of XPath strings?

I have an array of XPath expressions and an XML feed.

When the feed comes in, I want to filter each XML file by removing the nodes that are not selected by any of the XPath expressions in my array.

I can think of a very dirty way to do this:

1) For each node in the XML, I build its XPath.

2) I check whether that XPath is in the array.

3) If not, I remove the node.

Is there a cleaner way?

There are 2 best solutions below

1
On BEST ANSWER

"When the feed comes in, I want to filter each xml file by removing the nodes that are not in my array of xpath's"

Step 1. Select all nodes that aren't selected by the given XPath expressions

I guess that by "nodes" you mean elements. If so, this XPath expression:

//*[count(. | yourExpr1 | yourExpr2 ... | yourExprN)
   >
    count(yourExpr1 | yourExpr2 ... | yourExprN)
   ]

selects all elements in the XML document that aren't selected by any of your N XPath expressions yourExpr1, yourExpr2, ... , yourExprN.

If by "nodes" you mean elements, text-nodes, processing-instruction-nodes (PIs), comment-nodes and attribute nodes, use this XPath expression to select all nodes not selected by your N XPath expressions:

(//node() | //*/@*)
   [count(. | yourExpr1 | yourExpr2 ... | yourExprN)
   >
    count(yourExpr1 | yourExpr2 ... | yourExprN)
   ]

Step 2. Delete all nodes selected in Step 1.

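For each of the nodes selected in Step 1 above, use:

 node.ParentNode.RemoveChild(node);

Attribute nodes have no parent in the DOM (ParentNode is null for them), so remove those through their owner element instead.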
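As a rough illustration of Steps 1 and 2 together, here is a minimal Java sketch using the JDK's built-in DOM and XPath 1.0 support (removeChild is the Java spelling of the DOM call shown above). The class name XPathFilter, the filter helper and the sample feed are made up for illustration. It assumes the expressions are absolute paths, and that they also select the ancestors of everything you want to keep, since removing an unselected ancestor takes its whole subtree with it.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathFilter {
    // Hypothetical helper: removes every element not selected by keepExprs
    // and returns the remaining document as a string.
    public static String filter(String xml, List<String> keepExprs) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // Step 1: build and evaluate the "not selected by any expression" XPath.
        String union = String.join(" | ", keepExprs);
        String complement = "//*[count(. | " + union + ") > count(" + union + ")]";
        NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(complement, doc, XPathConstants.NODESET);

        // Copy the result into a list before changing the tree.
        List<Node> toRemove = new ArrayList<>();
        for (int i = 0; i < hits.getLength(); i++) {
            toRemove.add(hits.item(i));
        }

        // Step 2: detach every selected element from its parent.
        for (Node node : toRemove) {
            Node parent = node.getParentNode();
            if (parent != null) {
                parent.removeChild(node);
            }
        }

        // Serialize without the XML declaration so the result is easy to inspect.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<feed><item><title>A</title><price>1</price></item><ad>x</ad></feed>";
        // Keep the root, the items and their titles; price and ad are not listed.
        System.out.println(filter(xml, List.of("/feed", "/feed/item", "/feed/item/title")));
    }
}
```

In this sample the price and ad elements are not covered by any expression, so they are the ones removed. Note that /feed and /feed/item are listed explicitly; without them the root itself would be selected for removal.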

Explanation:

  1. The XPath union operator | produces the union of two node-sets. Therefore the expression yourExpr1 | yourExpr2 ... | yourExprN, when evaluated against the XML document, produces the set of all nodes that are selected by any of the N given XPath expressions.

  2. A node $n doesn't belong to a set of nodes $ns exactly when ...

    count($n | $ns) > count($ns)
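
The reasoning behind this test is that a union never contains duplicates: if $n is already in $ns, the union adds nothing and the two counts are equal; otherwise the union is one node larger. A small Java sketch of the check on a single node follows. The class NotInSet, the helper isOutside and the sample document are hypothetical, and the context node "." stands in for $n.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class NotInSet {
    // Hypothetical helper: true when the first node selected by nodeExpr
    // is NOT a member of the node-set selected by setExpr.
    // Assumes nodeExpr matches at least one node and setExpr is an absolute path.
    public static boolean isOutside(String xml, String nodeExpr, String setExpr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        Node n = (Node) xpath.evaluate(nodeExpr, doc, XPathConstants.NODE);

        // With n as the context node, "." plays the role of $n and setExpr of $ns.
        String test = "count(. | " + setExpr + ") > count(" + setExpr + ")";
        return (Boolean) xpath.evaluate(test, n, XPathConstants.BOOLEAN);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<a><b/><b/><c/></a>";
        System.out.println(isOutside(xml, "/a/c", "/a/b"));    // true: c is not one of the b elements
        System.out.println(isOutside(xml, "/a/b[1]", "/a/b")); // false: the first b is in the set
    }
}
```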

2
On

Your approach is backwards (and error-prone, since any given node can be selected by multiple valid XPath expressions). You should:

  • First, iterate the array of expressions and somehow mark the nodes that each one selects (simply set some flag on each node, for example). Even better: evaluate the union of all the expressions and select everything in one step.
  • Then, traverse the DOM and remove any element that wasn't marked in the first step.
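
The two steps above can be sketched in Java roughly as follows. This is only one possible reading of the suggestion: the "flag" is modelled as an identity-based set of nodes, the names MarkAndSweep, keepOnly and sweep are invented for the example, and the sketch always keeps the root element so the result stays a well-formed document.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class MarkAndSweep {
    // Hypothetical helper: keeps the root plus every element selected by keepExprs.
    public static String keepOnly(String xml, List<String> keepExprs) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Pass 1 (mark): collect every node selected by any expression.
        // A node selected by several expressions is simply added once.
        Set<Node> marked = Collections.newSetFromMap(new IdentityHashMap<>());
        for (String expr : keepExprs) {
            NodeList hits = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
            for (int i = 0; i < hits.getLength(); i++) {
                marked.add(hits.item(i));
            }
        }

        // Pass 2 (sweep): walk the tree and drop unmarked elements.
        sweep(doc.getDocumentElement(), marked);

        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    // Removes unmarked child elements of 'node' and recurses into marked ones.
    static void sweep(Node node, Set<Node> marked) {
        // Snapshot the children first, because getChildNodes() is a live list.
        NodeList kids = node.getChildNodes();
        List<Node> snapshot = new ArrayList<>();
        for (int i = 0; i < kids.getLength(); i++) {
            snapshot.add(kids.item(i));
        }
        for (Node child : snapshot) {
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue; // text, comments, etc. are left alone in this sketch
            }
            if (marked.contains(child)) {
                sweep(child, marked);
            } else {
                node.removeChild(child); // the whole unmarked subtree goes with it
            }
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<feed><item><title>A</title><price>1</price></item><ad>x</ad></feed>";
        System.out.println(keepOnly(xml, List.of("/feed/item", "//title")));
    }
}
```

To follow the "even better" variant, the loop in pass 1 could be replaced by a single evaluation of the expressions joined with " | ". Either way, an unmarked element is removed together with its descendants, so a marked node survives only if its ancestors are marked too.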