Python LXML: How do I delete all tags between two specified tags?


I have an xml file (document.xml of a word.docx file) that I need to delete certain sections from.

The structure is something like:

<?xml version='1.0' encoding='UTF-8' standalone='yes'?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
 <w:body>
        <w:p>
            Bunch of nested tags
        </w:p>
        <w:p>
            Bunch of nested tags to delete
        </w:p>
        <w:p>
            Bunch of nested tags to delete
        </w:p>
        <w:tbl>
            Bunch of nested tags to delete
        </w:tbl>
        <w:p>
            Bunch of nested tags
        </w:p>
 </w:body>
</document>

I want to delete all tags and all their content between 2 specified boundary tags. I want to include the startTag and exclude the endTag, and delete everything in between.

My two boundary tags are <w:p> tags, and I have a bunch of other tags in between the <w:p> tags, such as <w:tbl> tags, that I also wish to delete.

My problem is that I do not know how to remove all these tags. Any help?

The desired output is:

<?xml version='1.0' encoding='UTF-8' standalone='yes'?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
 <w:body>
        <w:p>
            Bunch of nested tags
        </w:p>
        <w:p>
            Bunch of nested tags
        </w:p>
 </w:body>
</document>

This is what I tried:

I managed to obtain the boundary tags:

startTag = parentBoundaryTags[3]
endTag = parentBoundaryTags[4]

The boundary tag values are:

<Element p at 0x12cf32ccfa0>
<Element p at 0x12cf32ccff0>

I tried getting the common parent of the boundary tags, because from my research it seemed that I need the parent in order to remove elements below it:

common_ancestor = startTag.getparent()

The common_ancestor value is: <Element body at 0x12cf32cccd0>

This makes sense to me because it corresponds to my xml structure; it's what I expect to see.

I used getchildren() to iterate over all direct children of the <w:body> tag. I'm trying to remove all the direct children of the <w:body> tag, starting from the point where the direct child of the <w:body> tag is equivalent to my startTag boundary tag.

I'm trying to keep removing direct children of <w:body>, until I reach a direct child which is equivalent to my endTag boundary tag.

# Flag to indicate whether to start removing elements
start_removal = False

# List to store elements to be removed
elements_to_remove = []

# Iterate over the children of the common ancestor
for child in common_ancestor.getchildren():
    if child == startTag:
        start_removal = True
        elements_to_remove.append(child)
    elif child == endTag:
        start_removal = False
        break
    elif start_removal:
        elements_to_remove.append(child)

# Remove the collected elements
for element in elements_to_remove:
    common_ancestor.remove(element)

# Write the modified XML tree back to the document.xml file
tree.write(document_xml, encoding='utf-8', xml_declaration=True)

I expected this to delete all tags between my boundary tags, but it is not deleting anything at all.

Would anyone be able to help?


There are 2 best solutions below

Yitzhak Khabinsky On

Here is a solution based on XSLT.

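It uses the so-called identity transform pattern: one template copies every node unchanged, and a second, empty template suppresses the nodes that should be dropped.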

Input XML

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
    <w:body>
        <w:p>Bunch of nested tags</w:p>
        <w:p>Bunch of nested tags to delete</w:p>
        <w:p>Bunch of nested tags to delete</w:p>
        <w:tbl>Bunch of nested tags to delete</w:tbl>
        <w:p>Bunch of nested tags</w:p>
    </w:body>
</w:document>

XSLT

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
    <xsl:output method="xml" omit-xml-declaration="no"
                encoding="UTF-8" indent="yes"
                standalone="yes"/>
    <xsl:strip-space elements="*"/>

    <!--Identity transform-->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="w:body/w:*[position() != 1 and position() != last()]"/>
</xsl:stylesheet>

Output XML

<?xml version='1.0' encoding='UTF-8' standalone='yes' ?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>Bunch of nested tags</w:p>
    <w:p>Bunch of nested tags</w:p>
  </w:body>
</w:document>

Python

import lxml.etree as lx

# PARSE XML AND XSLT
doc = lx.parse("Input.xml")
style = lx.parse("Style.xslt")
outfile = "Output.xml"

# CONFIGURE AND RUN TRANSFORMER
transformer = lx.XSLT(style)
result = transformer(doc)

# OUTPUT TO FILE
with open(outfile, "wb") as f:
    f.write(result)
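
To try the stylesheet without creating any files, the same transform can be run in memory. This is only a minimal sketch: the sample strings and the names W_NS, XML, XSLT and remaining are made up for illustration, and it assumes, like the stylesheet above, that the elements to keep are the first and last children of <w:body>.

```python
import lxml.etree as lx

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Shortened sample input (illustrative only, not a real document.xml)
XML = f"""<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p>keep 1</w:p>
    <w:p>delete 1</w:p>
    <w:p>delete 2</w:p>
    <w:tbl>delete 3</w:tbl>
    <w:p>keep 2</w:p>
  </w:body>
</w:document>"""

# Same identity transform plus empty template as in the stylesheet above
XSLT = f"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="{W_NS}">
  <xsl:strip-space elements="*"/>
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <xsl:template match="w:body/w:*[position() != 1 and position() != last()]"/>
</xsl:stylesheet>"""

doc = lx.fromstring(XML)
transform = lx.XSLT(lx.fromstring(XSLT))
result = transform(doc)

# Inspect which children of <w:body> survived the transform
body = result.getroot().find(f"{{{W_NS}}}body")
remaining = [child.text for child in body]
print(remaining)
```

Note that in a real document.xml the last child of <w:body> is often a <w:sectPr> element rather than a <w:p>, so the position-based match may need adjusting.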
Hermann12 On

You have to find the <w:body> parent element and collect the child elements you want to keep, e.g. the first and last <w:p>.

Note: I added the missing w: prefix to your closing tag, so that it reads </w:document>.

Input file:

<?xml version='1.0' encoding='utf-8' standalone='yes'?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>Bunch of nested tags</w:p>
    <w:p>Bunch of nested tags to delete</w:p>
    <w:p>Bunch of nested tags to delete</w:p>
    <w:tbl>Bunch of nested tags to delete</w:tbl>
    <w:p>Bunch of nested tags</w:p>
  </w:body>
</w:document>

This is my suggested code:

from lxml import etree

root = etree.parse("xml_file.xml")
ns = {"w":"http://schemas.openxmlformats.org/wordprocessingml/2006/main"}

body = root.find(".//w:body", ns)
# only direct <w:p> children of body, so paragraphs nested in tables are ignored
all_p = body.findall("w:p", ns)

# hold the first and last p-tag element
keep = [all_p[0], all_p[-1]]

# remove all direct children of body (iterate over a copy while removing)
for el in list(body):
    body.remove(el)

# insert the saved elements again
for k in keep:
    body.append(k)

etree.indent(root, space='  ')
root.write("out_new.xml", xml_declaration=True, pretty_print=True, encoding='UTF-8', standalone=True)

Output file:

<?xml version='1.0' encoding='UTF-8' standalone='yes'?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>Bunch of nested tags</w:p>
    <w:p>Bunch of nested tags</w:p>
  </w:body>
</w:document>

Alternatively, and a little shorter, you can remove every child of <w:body> except the first and the last one. This assumes that the <w:p> elements you want to keep are always the first and last children:

body = root.find(".//w:body", ns)
# all direct children of body
children = list(body)

# remove all children of body except the first and last one
for el in children[1:-1]:
    body.remove(el)
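
Both variants above rely on the boundaries being the first and last children. If the boundary elements can sit anywhere under the same parent, as in the question (start included, end excluded), something along these lines might work. It is a rough sketch only: remove_range is a hypothetical helper name, the sample XML is made up, and it assumes both boundaries are siblings in the tree that is later written out.

```python
from lxml import etree

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ns = {"w": W_NS}


def remove_range(start, end):
    """Remove `start` and its following siblings up to, but not including, `end`."""
    parent = start.getparent()
    if parent is None or end.getparent() is not parent:
        raise ValueError("start and end must be siblings under the same parent")
    # Collect first, then remove, so the tree is not changed while walking it
    doomed = [start]
    for sibling in start.itersiblings():
        if sibling is end:
            break
        doomed.append(sibling)
    else:
        # Loop ended without meeting `end`
        raise ValueError("end does not come after start")
    for el in doomed:
        parent.remove(el)


# Shortened sample input (illustrative only)
xml = f"""<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p>keep 1</w:p>
    <w:p>delete 1</w:p>
    <w:p>delete 2</w:p>
    <w:tbl>delete 3</w:tbl>
    <w:p>keep 2</w:p>
  </w:body>
</w:document>"""

root = etree.fromstring(xml)
body = root.find("w:body", ns)
children = list(body)

# Start boundary is the second child, end boundary is the last <w:p>
remove_range(children[1], children[4])

remaining = [el.text for el in body]
print(remaining)
```

The helper compares elements with `is` and checks the parent up front, so a boundary element taken from a different parse of the file fails loudly instead of silently removing nothing.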