Using xsl:result-document in Java (SAX)


I have read and tested a lot, but I can't get a working Java solution:

I have a large XML file (more than 100 MB) that is currently processed via JAXB. The aim is to split it into many XML documents, one per child element of the root.
Important: because of the file size, a SAX-based approach is preferred.

I found a lot of information about xsl:result-document, but no way to get it running from Java, and I am not sure whether it would keep memory usage low.

This is my Java-Code:

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class TestParse {

    public static void main(final String[] args) throws Throwable {
        final TransformerFactory factory = TransformerFactory.newInstance();
        final Transformer transformer = factory.newTransformer(new StreamSource("D:\\split.xsl"));

        final StreamSource in = new StreamSource("D:\\input.xml");
        final StreamResult out = new StreamResult("D:\\output.xml");
        transformer.transform(in, out);
    }
}

This is an example-xml ("input.xml"):

<?xml version="1.0" encoding="ISO-8859-1"?>
<Taskname>
  <Item attr="ab" attr2="c">
    <MoreNodes>...</MoreNodes>
  </Item>
  <Item attr="xy" attr2="z">
    <MoreNodes>...</MoreNodes>
  </Item>
  <!-- ...and many items more -->
</Taskname>

This is my xsl (split.xsl):

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:strip-space elements="*"/>
    <xsl:param name="dir" select="'file:///D:/'"/>

    <xsl:template match="Item">
        <xsl:result-document href="{$dir}section{position()}.xml" method="xml">
            <Taskname>
            <xsl:copy-of select="." />
            </Taskname>
        </xsl:result-document>
    </xsl:template>

</xsl:stylesheet>

So one result-xml should look like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<Taskname>
  <Item attr="..." attr2="...">
    <MoreNodes>...</MoreNodes>
  </Item>
</Taskname>

My problem:

I really don't know how to get at the different outputs of the XSLT. More than that, I need them as streams and not as files, and I need them item by item (like SAX's endElement) to use less memory.

Maybe there is another, better way than using XSLT; if so, please just tell me.


There are 2 best solutions below


Firstly, if you want to avoid building a tree for the source document in memory, you will have to run this with XSLT 3.0 streaming, which means you need a Saxon-EE license. (However, it is quite feasible to process a 100 MB file the traditional way, with a tree in memory.)

Secondly, if you want the output of xsl:result-document to be captured as in-memory streams rather than being written to filestore, then in Saxon the way to achieve this is to write and register an OutputURIResolver. This will be called once for each result document, and can specify a destination (such as a StreamResult or SAXResult) to receive the document.
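
Both points assume that Saxon is the processor actually picked up by TransformerFactory.newInstance(). If no other implementation is on the classpath, the JDK falls back to its built-in XSLT 1.0 processor, which does not implement xsl:result-document. As a minimal sketch for checking this (the class name XsltVersionCheck, the method reportedXsltVersion and the probe stylesheet are made up for illustration), one can ask the processor which XSLT version it reports:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, not part of the question or of any library.
final class XsltVersionCheck {

    // Tiny probe stylesheet that outputs the XSLT version the processor reports.
    private static final String PROBE =
        "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'>"
      + "<xsl:value-of select=\"system-property('xsl:version')\"/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Runs the probe stylesheet against a dummy document and returns its text output.
    static String reportedXsltVersion(TransformerFactory factory) throws TransformerException {
        Transformer probe = factory.newTransformer(new StreamSource(new StringReader(PROBE)));
        StringWriter out = new StringWriter();
        probe.transform(new StreamSource(new StringReader("<probe/>")), new StreamResult(out));
        return out.toString().trim();
    }

    public static void main(String[] args) throws TransformerException {
        TransformerFactory factory = TransformerFactory.newInstance();
        System.out.println("Factory class: " + factory.getClass().getName());
        System.out.println("XSLT version:  " + reportedXsltVersion(factory));
    }
}
```

If the reported version is below 2.0, the stylesheet from the question cannot work with that factory, regardless of how the results are captured.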


I would probably dispense with XSLT for this task and just use something like the StAX API directly, but it depends on what you want to do with the split-up files in the end. You mention JAXB in the question; note that a JAXB Unmarshaller can read from a StAX XMLStreamReader, which allows a kind of "semi-streaming" processing model: you stream through the input file and unmarshal it one Item at a time. Assuming you have an Item class that represents the Item element type:

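// Imports needed by the fragment below (javax.xml.bind is the JAXB 2.x
// package; newer Jakarta releases use jakarta.xml.bind instead).
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBElement;
import javax.xml.bind.Unmarshaller;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

JAXBContext ctx = JAXBContext.newInstance(Item.class);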
Unmarshaller u = ctx.createUnmarshaller();
XMLInputFactory inFactory = XMLInputFactory.newFactory();
try(InputStream stream = Files.newInputStream(Paths.get("input.xml"))) {
  XMLStreamReader reader = inFactory.createXMLStreamReader(stream);
  try {
    reader.nextTag(); // the root Taskname start tag
    reader.nextTag(); // the start tag of the first Item, if there is
                      // one, the end of the Taskname if there isn't
    while(reader.getEventType() == XMLStreamConstants.START_ELEMENT) {
      JAXBElement<Item> theItem = u.unmarshal(reader, Item.class);
      // do whatever you want to do with this item
      process(theItem.getValue());

      // this is an oddity of the JAXB API - when unmarshalling from
      // a stream reader the reader is left pointing to the event
      // *after* the closing tag, not to the closing tag itself,
      // so whether or not we need to advance to the next tag depends
      // on whether there is whitespace between the close of one Item
      // and the start of the next.
      if(reader.getEventType() != XMLStreamConstants.START_ELEMENT &&
           reader.getEventType() != XMLStreamConstants.END_ELEMENT) {
        reader.nextTag();
      }
    }
  } finally {
    reader.close();
  }
}
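
If the split-up pieces are needed as XML text rather than as JAXB objects, the same item-by-item idea can be sketched with plain StAX events and no XSLT at all. The following is only an illustration under some assumptions: the names StaxSplitter, split and sink are made up, the root element is assumed to carry no attributes or namespace declarations worth keeping, and each piece is buffered as a String (which can be wrapped in a stream if required). Only one child of the root is held in memory at a time.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: hands every child of the root to the sink as its own
// small document, wrapped in a fresh copy of the root element.
final class StaxSplitter {

    static void split(InputStream in, Consumer<String> sink) throws XMLStreamException {
        XMLEventFactory eventFactory = XMLEventFactory.newFactory();
        XMLOutputFactory outFactory = XMLOutputFactory.newFactory();
        XMLEventReader reader = XMLInputFactory.newFactory().createXMLEventReader(in);
        try {
            String rootName = null;
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (!event.isStartElement()) {
                    continue; // whitespace, comments, the root end tag, ...
                }
                if (rootName == null) {
                    // First start tag is the root; remember its name only.
                    rootName = event.asStartElement().getName().getLocalPart();
                    continue;
                }
                // Any start tag seen here is a direct child of the root,
                // because each child's subtree is consumed completely below.
                StringWriter buffer = new StringWriter();
                XMLEventWriter writer = outFactory.createXMLEventWriter(buffer);
                writer.add(eventFactory.createStartElement("", "", rootName));
                writer.add(event);
                int depth = 1;
                while (depth > 0) {
                    XMLEvent inner = reader.nextEvent();
                    if (inner.isStartElement()) {
                        depth++;
                    } else if (inner.isEndElement()) {
                        depth--;
                    }
                    writer.add(inner);
                }
                writer.add(eventFactory.createEndElement("", "", rootName));
                writer.flush();
                writer.close();
                sink.accept(buffer.toString());
            }
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<Taskname><Item attr=\"ab\"><MoreNodes>one</MoreNodes></Item>"
                   + "<Item attr=\"xy\"><MoreNodes>two</MoreNodes></Item></Taskname>";
        // Prints one small document per Item.
        split(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), System.out::println);
    }
}
```

For the real file, the ByteArrayInputStream would be replaced by something like Files.newInputStream(Paths.get("input.xml")), and the sink by whatever consumes each piece.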