Delphi, MSXML: how to retrieve node XML without the document namespace?


I need to do some parsing and information retrieval from XML documents. The XML document is bound to an XML data binding, then parsed for specific elements. Once I have isolated the elements I need to dissect, I take each one in turn (let's call it E_parent) and try to identify the location of each non-text child element (E_child) within the overall XML text of E_parent, then do some manipulation on it.

The problem I'm having is that the XML document's namespace is added to the child elements' XML when they are accessed individually.

To give an example, say the original document looks like:

<?xml version="1.0" encoding="windows-1252"?>
<RootNode xml:lang="en" xmlns="urn:blah:names:blahblah">
<E_parent>Some text <E_child>child text</E_child> more parent text</E_parent>
</RootNode>

When I try to access the XML from either the E_parent or E_child element by doing something like:

xmlParent := parentNode.XML;

I get:

<E_parent xmlns="urn:blah:names:blahblah">Some text <E_child>child text</E_child> more parent text</E_parent>

The same thing happens if I try to access the XML for E_child; I get:

<E_child xmlns="urn:blah:names:blahblah">child text</E_child>

That's a problem when I then try to do a text search on the parent element, since the "real" text does not contain that namespace declaration:

Some text <E_child>child text</E_child> more parent text

So far, I've dealt with this by finding/deleting unwanted namespace attributes in the strings, but it's highly inefficient, and kind of ugly ;o) So, my question is, how can I retrieve the various nodes' XML from a bound XML document, without the document namespace being added to the tags?

=========

Thanks Remy, it was so obvious: I just need to start from a blank string and build it up rather than start from the inner XML!

Note though, that this is a better workaround than the one I had for this specific situation, but not quite what I wanted - obtaining the XML of elements without the namespace would still be useful for other things, such as logging, where I would want the exact XML of the node as it appears in the original document.


1
On BEST ANSWER

Use the DOM for processing E_parent's contents. Rather than retrieving the XML of E_parent and then searching for an E_child tag inside of it, use the DOM to determine what plain text exists in front of the E_child node (the plain text will have its own child node). The length of that plain text will tell you the exact text position of E_child without needing to retrieve E_parent's XML at all. E_parent will have multiple plain-text child nodes in the relevant positions for each section of untagged text.

In other words, given the XML you showed, the structure of the DOM will look something like this:

RootNode
|
-- E_parent
   |
   |- "Some text "
   |
   |- E_child
   |  |
   |  -- "child text"
   |
   -- " more parent text"
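
As a rough sketch of the idea above, the loop below walks E_parent's child nodes and records how much text precedes each element child. It assumes the MSXML interfaces from an imported MSXML2_TLB unit; ListChildOffsets and the Offsets list are hypothetical names used only for illustration. It also assumes the document was loaded with preserveWhiteSpace set to True, since MSXML may otherwise normalize whitespace and change the counts.

```pascal
uses
  SysUtils, Classes,
  MSXML2_TLB; // assumption: your own import of the MSXML type library

// Hypothetical helper: for each element child of Parent, adds a
// "name=offset" entry, where offset is the number of characters of
// text content that come before that child inside Parent.
procedure ListChildOffsets(const Parent: IXMLDOMNode; Offsets: TStrings);
var
  i, Offset: Integer;
  Children: IXMLDOMNodeList;
  Child: IXMLDOMNode;
begin
  Offset := 0;
  Children := Parent.childNodes;
  for i := 0 to Children.length - 1 do
  begin
    Child := Children.item[i];
    // Element children are the ones whose position we want to report
    if Child.nodeType = NODE_ELEMENT then
      Offsets.Add(Format('%s=%d', [Child.nodeName, Offset]));
    // Text nodes contribute their own text; elements contribute the
    // concatenated text of their descendants
    Inc(Offset, Length(Child.text));
  end;
end;
```

Because the offsets are counted in text content rather than in serialized XML, the namespace declaration that MSXML adds to the node's XML never enters the calculation.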
0
On

Basically, you cannot use anything but an XML parser to parse XML. RegEx won't work. Anything simpler than RegEx won't work either.

At some moment, the XML you try to parse will change, breaking your simple search/replace code.

What you need to do is define in XML terms what should be replaced by which, not in Text terms.

You will end up with a definition of which nodes should be changed/inserted/removed.

Then you need to translate that into Delphi DOM code.

Something that can help big time is an XML tool (like XML Spy, but there are plenty more) that gives you a DOM tree view of your XML.

Put the original old XML and changed new XML next to each other.

From there, you can visually compare the old and new trees, which leads you to writing down the changes needed in terms of XML nodes.

--jeroen

3
On

Use the code you have, and then use Pos/PosEx to find the start and end of the E_child element.

// PosEx is declared in the StrUtils unit
var
  cStart, cEnd: Integer;
  ChildName, ChildText: string;
begin
  // ... other code
  xmlParent := parentNode.XML;
  // XML names are case-sensitive; the element in the sample is E_child
  ChildName := 'E_child';
  // Find starting position of child tag
  cStart := Pos('<' + ChildName, xmlParent);
  // You now have the opening <
  cEnd   := PosEx('</' + ChildName, xmlParent, cStart);
  // You now have the final < of the child.
  // Add the length of the child's name + the closing >
  Inc(cEnd, Length('</' + ChildName + '>'));
  // Grab the entire child XML
  ChildText := System.Copy(xmlParent, cStart, cEnd - cStart);
  // Do whatever you want with the child. For instance,
  // remove the original text.
  System.Delete(xmlParent, cStart, cEnd - cStart);
  // Replace it with new text
  System.Insert(NewChildText, xmlParent, cStart);
end;
0
On

Another approach would be to use XPath to navigate your XML.

Given the sample XML

<?xml version="1.0" encoding="windows-1252"?>
<RootNode xml:lang="en" xmlns="urn:blah:names:blahblah">
<E_parent>Some text <E_child>child text</E_child> more parent text</E_parent>
</RootNode>

You could use the MSXML parser to navigate to your E_child element directly using a little bit of XPath. First you need to make your own copy of the MSXML2_TLB unit. Then you can use Delphi code that looks something like this to access the E_child nodes:

uses MSXMLDOM,MSXML2_TLB;

procedure Sample;
var
  doc: IXMLDOMDocument2;
  root: IXMLDomElement;
  nodes: IXMLDOMNodeList;
  node: IXMLDOMNode;
begin

  doc := CoDOMDocument60.Create;
  doc.async := false;
  // Use same namespace as the default namespace here
  doc.setProperty('SelectionNamespaces', 'xmlns:t="urn:blah:names:blahblah"');
  doc.setProperty('SelectionLanguage', 'XPath');
  doc.loadXML(XmlSource.Text);

  root := doc.documentElement;
  nodes := root.selectNodes('//t:E_child');

  // Now nodes contains all E_child nodes
  // Process them here
  // ...
end;

The key point is that you bind a prefix of your own choosing to the document's default namespace for the XPath query. The //t:E_child is the actual XPath expression used to find the E_child elements.
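
To illustrate the "process them here" step, here is a minimal sketch of iterating the node list returned by selectNodes. CollectNodeText and the Output list are hypothetical names; the sketch assumes the same imported MSXML2_TLB unit as the code above.

```pascal
uses
  Classes,
  MSXML2_TLB; // assumption: same imported type library unit as above

// Hypothetical helper: copies the text content of every node in the
// list (for example, the result of selectNodes('//t:E_child')) into Output.
procedure CollectNodeText(const Nodes: IXMLDOMNodeList; Output: TStrings);
var
  i: Integer;
begin
  for i := 0 to Nodes.length - 1 do
    // text returns the concatenated text of the node and its descendants
    Output.Add(Nodes.item[i].text);
end;
```

Reading the text property here rather than the xml property avoids the namespace declaration that MSXML adds when a node is serialized on its own.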