I have an XML file (the document.xml of a Word .docx file) from which I need to delete certain sections.
The structure is something like:
<?xml version='1.0' encoding='UTF-8' standalone='yes'?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      Bunch of nested tags
    </w:p>
    <w:p>
      Bunch of nested tags to delete
    </w:p>
    <w:p>
      Bunch of nested tags to delete
    </w:p>
    <w:tbl>
      Bunch of nested tags to delete
    </w:tbl>
    <w:p>
      Bunch of nested tags
    </w:p>
  </w:body>
</w:document>
I want to delete all tags, and all their content, between two specified boundary tags. The deletion should include the startTag itself and everything after it, but stop just before the endTag, which must be kept.
My two boundary tags are <w:p> tags, and I have a bunch of other tags in between the <w:p> tags, such as <w:tbl> tags, that I also wish to delete.
My problem is that I do not know how to remove all these tags. Any help?
The desired output is:
<?xml version='1.0' encoding='UTF-8' standalone='yes'?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      Bunch of nested tags
    </w:p>
    <w:p>
      Bunch of nested tags
    </w:p>
  </w:body>
</w:document>
This is what I tried:
I managed to obtain the boundary tags:
startTag = parentBoundaryTags[3]
endTag = parentBoundaryTags[4]
The boundary tag values are:
<Element p at 0x12cf32ccfa0>
<Element p at 0x12cf32ccff0>
I tried getting the common parent of the boundary tags, because from my research it seemed that I need it in order to remove the elements below it:
common_ancestor = startTag.getparent()
The common_ancestor value is:
<Element body at 0x12cf32cccd0>
This makes sense to me because it corresponds to my xml structure; it's what I expect to see.
I used getchildren() to iterate over all direct children of the <w:body> tag. The idea is to start removing direct children of <w:body> at the child that is my startTag boundary tag,
and to keep removing them until I reach the child that is my endTag boundary tag.
# Flag to indicate whether to start removing elements
start_removal = False
# List to store elements to be removed
elements_to_remove = []
# Iterate over the children of the common ancestor
for child in common_ancestor.getchildren():
    if child == startTag:
        start_removal = True
        elements_to_remove.append(child)
    elif child == endTag:
        start_removal = False
        break
    elif start_removal:
        elements_to_remove.append(child)
# Remove the collected elements
for element in elements_to_remove:
common_ancestor.remove(element)
# Write the modified XML tree back to the document.xml file
tree.write(document_xml, encoding='utf-8', xml_declaration=True)
I expected this to delete all tags between my boundary tags, but it is not deleting anything at all.
Would anyone be able to help?
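For reference, the removal described above can be reduced to a small self-contained sketch with lxml. It walks the following siblings of the start element with `itersiblings()` and stops at the end element. The names `remove_between`, `SAMPLE` and `W_NS`, and the simplified sample document, are made up for this illustration and are not part of the original code.

```python
from lxml import etree

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Simplified stand-in for document.xml (hypothetical content).
SAMPLE = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    '<w:p>keep 1</w:p>'
    '<w:p>delete 1</w:p>'
    '<w:p>delete 2</w:p>'
    '<w:tbl>delete 3</w:tbl>'
    '<w:p>keep 2</w:p>'
    '</w:body></w:document>'
)


def remove_between(start, end):
    """Remove `start` and each following sibling up to, but not including, `end`."""
    parent = start.getparent()
    if end.getparent() is not parent:
        raise ValueError("start and end must share the same parent")
    doomed = [start]
    for sibling in start.itersiblings():
        if sibling is end:
            break
        doomed.append(sibling)
    else:
        # The loop finished without meeting `end`.
        raise ValueError("end does not follow start")
    # Remove only after the walk, so the tree is not changed while iterating.
    for element in doomed:
        parent.remove(element)


root = etree.fromstring(SAMPLE)
body = root[0]
remove_between(body[1], body[4])
remaining = [child.text for child in body]
print(remaining)
```

Collecting the elements first and removing them afterwards mirrors the two-step structure of the code in the question. Comparing with `is` relies on lxml handing back the same Python object for an element for as long as a reference to it is held.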
Here is a solution based on XSLT.
It uses the so-called identity transform pattern: every node is copied unchanged unless a more specific template overrides it.
Input XML
XSLT
Output XML
Python
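As an illustration of the pattern (not necessarily the exact stylesheet of this answer), a minimal sketch with lxml might look like the following. It assumes the boundaries are addressed by their 1-based position among the element children of `<w:body>`. The parameter names `start` and `end`, the constant names, and the simplified sample input are assumptions made for this example.

```python
from lxml import etree

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# XSLT 1.0 stylesheet: identity transform plus one override for w:body.
STYLESHEET = f"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="{W_NS}">
  <xsl:param name="start" select="0"/>
  <xsl:param name="end" select="0"/>

  <!-- Identity transform: copy every node and attribute unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Override: skip element children of w:body at positions start .. end-1. -->
  <xsl:template match="w:body">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:apply-templates
          select="*[position() &lt; $start or position() &gt;= $end]"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
"""

# Simplified stand-in for document.xml (hypothetical content).
SAMPLE = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    '<w:p>keep 1</w:p>'
    '<w:p>delete 1</w:p>'
    '<w:p>delete 2</w:p>'
    '<w:tbl>delete 3</w:tbl>'
    '<w:p>keep 2</w:p>'
    '</w:body></w:document>'
)

transform = etree.XSLT(etree.fromstring(STYLESHEET))
doc = etree.fromstring(SAMPLE)

# Parameters are passed as XPath expressions, so "2" and "5" are numbers.
result = transform(doc, start="2", end="5")
kept = [child.text for child in result.getroot()[0]]
print(kept)
```

The override selects only element children (`*`), so whitespace-only text nodes directly inside `<w:body>` are not copied. With the default parameter values of 0, nothing is skipped and the stylesheet behaves as a plain identity transform.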