Preserving XML comments and processing instructions that occur before the root element

1.2k Views Asked by At

I need to add a new tag and write back to an XML. Here is my XML file.

<?xml version="1.0" encoding="UTF-8"?>
    <!--Arbortext, Inc., 1988-2011, v.4002-->
    <!DOCTYPE reference-configuration-statement PUBLIC "-//Juniper Networks//DTD Jbook Software Guide//EN"
     "file:////cmsxml/IWServer/default/main/TechPubsWorkInProgress/STAGING/bin/dtds/jbook-sw/jbook-sw.dtd">
    <?Pub UDT _nopagebreak _touchup KeepsKeep="yes" KeepsPrev="no" KeepsNext="no" KeepsBoundary="page"?>
    <?Pub UDT _bookmark _target?>
    <?Pub UDT instructions _comment FontColor="red"?>
    <?Pub UDT instructions-DUPLICATE1 _comment FontColor="red"?>
    <?Pub UDT __target_1 _target?>
    <?Pub UDT __target_3 _target?>
    <?Pub UDT __target_2 _target?>
    <?Pub UDT _bookmark-DUPLICATE1 _target?>
    <?Pub UDT __target_4 _target?>
    <?Pub EntList copy trade micro reg plusmn deg middot mdash ndash nbsp
    caret cent check acute frac12 frac13 frac14 frac15 frac16 frac18 frac23
    frac25 frac34 frac35 frac38 frac45 frac56 frac58 frac78 ohm pi sup sup1
    sup2 sup3 rsquo?>
    <?Pub Inc?>
    <root topic-id="25775"

Am able to complete the task with etree.

path="C:/Users/pshahul/Desktop/Official/Automation/Write_XMl_files/Source/"
            add=(path, Filename)
            myfile=s.join(add)
            try:
                et = xml.etree.ElementTree.parse(myfile)
                tree=etree.parse(myfile)
                docinfo=tree.docinfo.encoding
                root=et.getroot()
                elem = root.find('cli-help')
                if elem is None:
                    new_tag=ET.Element("cli-help")
                    new_tag.text=final
                    root.insert(2,new_tag)
                    et.write(myfile,encoding=docinfo, xml_declaration=True)
                else:
                    elem.text=final
                    et.write(myfile,encoding=docinfo, xml_declaration=True)
            except OSError:
                pass
        else:
            raise TypeError
    except TypeError:
        continue

Now, i got the DOCTYPE and XML declaration, but the following are skipped.

<!--Arbortext, Inc., 1988-2011, v.4002-->
     <?Pub UDT _nopagebreak _touchup KeepsKeep="yes" KeepsPrev="no" KeepsNext="no" KeepsBoundary="page"?>
    <?Pub UDT _bookmark _target?>
    <?Pub UDT instructions _comment FontColor="red"?>
    <?Pub UDT instructions-DUPLICATE1 _comment FontColor="red"?>
    <?Pub UDT __target_1 _target?>
    <?Pub UDT __target_3 _target?>
    <?Pub UDT __target_2 _target?>
    <?Pub UDT _bookmark-DUPLICATE1 _target?>
    <?Pub UDT __target_4 _target?>
    <?Pub EntList copy trade micro reg plusmn deg middot mdash ndash nbsp
    caret cent check acute frac12 frac13 frac14 frac15 frac16 frac18 frac23
    frac25 frac34 frac35 frac38 frac45 frac56 frac58 frac78 ohm pi sup sup1
    sup2 sup3 rsquo?>
    <?Pub Inc?>

How do i preserve that? I need those lines back in my XML file. Plus comments. I find the comments missing too.

2

There are 2 best solutions below

0
On

The documentation of ElementTree makes it clear that this is not possible:

Note: Not all elements of the XML input will end up as elements of the parsed tree. Currently, this module skips over any XML comments, processing instructions, and document type declarations in the input

What worked out of the box for me was minidom. Unless the document is very large you're fine to keep it in memory

from xml.dom import minidom
from xml.dom import Node

xml_string = "<?xml version='1.0'?><!--comment--><root><!--inside comment--><child/></root>"
xml_doc = minidom.parseString(xml_string)
for node in xml_doc.getchildNodes:
    if node.nodeType == Node.COMMENT_NODE:
        print("Comment", node.data)
0
On

As suggested by OP, the (or a) solution here is to utilize lxml as follows, which will preserve comments as well as processing instructions:

import lxml.etree as ET
tree = ET.parse(filename)