I'm new to XSLT and am trying to sort an arbitrarily sized XML document according to certain instructions:
- All attributes should be sorted in alphabetical order
- Child elements should be sorted in alphabetical order by element name
- If two children have the same element name, they should be sorted using a consistent, well-defined process (preferably alphabetically in some way)
I'm currently stuck on implementing step 3. I referenced this post for steps 1 and 2, and this post in an attempt to implement step 3. My current implementation is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="@*">
        <xsl:sort select="name()"/>
      </xsl:apply-templates>
      <xsl:apply-templates select="node()">
        <xsl:sort select="name()"/>
        <xsl:sort select="."/>
      </xsl:apply-templates>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
However, I noticed this isn't a perfect solution due to some edge cases, as when I run it on the following input:
<colours b="x" c="y" a="t">
<blue/>
<aa>
<bb>y</bb>
<bb>a</bb>
</aa>
<aa>
<bb>a</bb>
<bb>z</bb>
</aa>
<violet/>
</colours>
It outputs the following:
<?xml version="1.0" encoding="UTF-8"?>
<colours a="t" b="x" c="y">
<aa>
<bb>a</bb>
<bb>z</bb>
</aa>
<aa>
<bb>a</bb>
<bb>y</bb>
</aa>
<blue/>
<violet/>
</colours>
This output has the aa sections in the wrong order according to how the sort works (a,z should come after a,y). I understand that this happens because the aa elements are ordered by their original string values before their own children have been sorted, but it seems the only workaround is to repeatedly run this XSLT on the output until no more changes are made.
From my understanding, the secondary sort key in my implementation (<xsl:sort select="."/>), which I'm using to implement step 3, applies when two elements have the same name and so can't be ordered by name alone; it uses the string value of each element as the sort key. From what I've read, this string value is the concatenation of the string values of all its text node descendants in document order. I believe that if the element is empty, its string value is the zero-length string. However, I think this leads to another edge case, where no sorting can occur for the following input:
<colours b="x" c="y" a="t">
<aa>
<cc/>
</aa>
<aa>
<bb/>
</aa>
<violet/>
</colours>
No matter what order the aa sections are put in, the output is identical to the input, meaning no sorting is occurring in this situation.
Are there better ways to sort these duplicate element names? Or is my implementation the best XSLT can do, so that I have to stick with the workaround for the first case and accept that the second can't be fixed? One idea: if two elements have the same name, sort them by recursively comparing their child nodes (in alphabetical order), perhaps via a third sort key, though I'm unsure how to implement that in XSLT.
Any help would be appreciated.
EDIT: As advised by the commenters, I've expanded on the third instruction below to establish a more concrete specification.
- If two children have the same element name, they should be sorted using a consistent, well-defined process (preferably alphabetically in some way)
If two children have the same element name, is it possible to define their sort key as a concatenation of their subtree of element names in document order? E.g., for the following code:
<colours>
<aa>
<cc>
<rr/>
</cc>
<ff/>
</aa>
<aa>
<cc>
<ee/>
</cc>
<zz/>
</aa>
<violet/>
</colours>
The first aa sort key would be ccrrff while the next one would be cceezz. This way they can be sorted in alphabetical order (i.e., the second aa section would go first).
This should fix the issue in my second example (I'm fine with the workaround for the first), and I'm thinking that this sort key would be the secondary sort key going in between the two sort keys I have in my current implementation.
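As a rough sketch of the idea above (not a confirmed solution), the proposed key could be written as a second xsl:sort placed between the two existing ones. The expression string-join(descendant::*/name(), '') is an assumption about how to spell "concatenation of the subtree's element names in document order"; everything else is the stylesheet from earlier in the question.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="node()|@*">
    <xsl:copy>
      <!-- Attributes in alphabetical order by name -->
      <xsl:apply-templates select="@*">
        <xsl:sort select="name()"/>
      </xsl:apply-templates>
      <xsl:apply-templates select="node()">
        <!-- Key 1: element name -->
        <xsl:sort select="name()"/>
        <!-- Key 2 (assumed spelling of the proposed key): names of all
             descendant elements, concatenated in document order -->
        <xsl:sort select="string-join(descendant::*/name(), '')"/>
        <!-- Key 3: string value, as in the original stylesheet -->
        <xsl:sort select="."/>
      </xsl:apply-templates>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

For the example above, the two aa keys come out as ccrrff and cceezz, so the second aa section should sort first. Note that the key is computed from the input nodes, before their own children are reordered.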
Firstly, attributes in the XDM data model are unordered, so there's no guarantee that they will be serialized in any particular order. You may be lucky; it depends on the implementation.
As for element sorting, I would start by ensuring whitespace text nodes are eliminated: use xsl:strip-space elements="*". But I think your real problem is specifying your requirements rather than writing code to implement them. You've given us a couple of examples to play with, but two examples don't constitute a specification. You could achieve your two examples by computing a sort key that's something like string-join(*/(local-name()||string())), but that might not give the results you want for your next example.
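To make that suggestion concrete, here is a minimal sketch that adds xsl:strip-space and the key above to the stylesheet from the question. Two things are assumptions rather than part of the suggestion: the explicit '' separator passed to string-join, and keeping the question's string-value sort as a final tie-breaker.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
  <xsl:output method="xml" indent="yes"/>
  <!-- Drop whitespace-only text nodes so they play no part in sorting -->
  <xsl:strip-space elements="*"/>
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="@*">
        <xsl:sort select="name()"/>
      </xsl:apply-templates>
      <xsl:apply-templates select="node()">
        <!-- Primary key: element name -->
        <xsl:sort select="name()"/>
        <!-- Secondary key: for each child element, its local name followed
             by its string value, all concatenated -->
        <xsl:sort select="string-join(*/(local-name() || string()), '')"/>
        <!-- Assumed tie-breaker, kept from the question's stylesheet -->
        <xsl:sort select="."/>
      </xsl:apply-templates>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

For the second example, the two aa keys are cc and bb, so the aa containing bb should come first. The keys are still computed from the children in their input order, so this is only one possible reading of the requirements.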