How to format XML using Pentaho

I'm producing an XML document in several steps. Because the XML is deeply nested (several levels of nesting), I ended up using a Text File Output step and simply setting the 'Extension' option to '.xml'.

The problem is that I get a one-line .xml file instead of a well-formatted XML document. If I copy and paste that one line into an online XML formatter, it is formatted perfectly.

Is there any way to read that one-line file as a String and turn it into a well-formatted XML file?

Obtained: [image: obtained XML]

Expected: [image: expected XML]

Thanks in advance.

1 Answer

I would recommend using a User Defined Java Class step and writing your own code to transform your XML one-liner into a pretty-printed version. Pentaho already ships with several XML library JARs that you can use directly.

Here is what my test transformation looks like:

[image: test transformation]

Generate XML String writes a single row containing an XML one-liner string in the 'xml' field.

Format XML contains the following code under its 'Processor' tab:

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import java.io.StringWriter;
import java.io.ByteArrayInputStream;

public static String toPrettyString(String xml) {
    try {
        // Turn xml string into a document
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new ByteArrayInputStream(xml.getBytes("utf-8"))));

        // Remove whitespaces outside tags
        document.normalize();
        XPath xPath = XPathFactory.newInstance().newXPath();
        NodeList nodeList = (NodeList) xPath.evaluate("//text()[normalize-space()='']",
                                                      document,
                                                      XPathConstants.NODESET);

        for (int i = 0; i < nodeList.getLength(); ++i) {
            Node node = nodeList.item(i);
            node.getParentNode().removeChild(node);
        }

        // Setup pretty print options
        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        Transformer transformer = transformerFactory.newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");

        // Return pretty print xml string
        StringWriter stringWriter = new StringWriter();
        transformer.transform(new DOMSource(document), new StreamResult(stringWriter));
        return stringWriter.toString();
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}


public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
 
    // First, get a row from the default input hop
    Object[] r = getRow();
 
    // If the row object is null, we are done processing.
    if (r == null) {
        setOutputDone();
        return false;
    }

    // Init output row
    Object[] outputRow = createOutputRow(r, data.outputRowMeta.size());
 
    // Getting fields
    String xml = get(Fields.In, "xml").getString(r);
    
    // Init error handling  
    boolean rowInError = false;
    String errMsg = "";
    int errCnt = 0;

    // Init Output
    String xml_pretty = "";

    // Put the xml in pretty format
    try{
        xml_pretty = toPrettyString(xml);
    }
    catch (Exception ex) {
        errMsg = ex.getMessage();
        errCnt++;
        rowInError = true;
    }

    // Set the value in the output field
    //
    get(Fields.Out, "result").setValue(outputRow, true);
    get(Fields.Out, "xml_pretty").setValue(outputRow, xml_pretty);

    if ( !rowInError ) {
        // putRow will send the row on to the default output hop.
        //
        putRow(data.outputRowMeta, outputRow);
    }
    else {
        // putError will send the row on to the error hop.
        //
        get(Fields.Out, "result").setValue(outputRow, false);
        get(Fields.Out, "xml_pretty").setValue(outputRow, "");
        putError(data.outputRowMeta, outputRow, errCnt, errMsg, "", "ERR_0");
    }

    return true;
}

The implementation of toPrettyString(String xml) is up to you; here I used the code found in this SO answer. You also have to define the output fields (name and type) under the 'Fields' tab of the step. For the code above, those are 'result' (Boolean) and 'xml_pretty' (String).

The above code was tested using the Spoon/PDI Client on Pentaho 8.3.0.10.
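
Since the implementation of toPrettyString is left open, here is one possible variation that also sets an explicit indentation width. It is a minimal standalone sketch meant to be run outside Pentaho: the class name PrettyXml, the method format and its indent parameter are illustrative assumptions, not part of the answer above or of any Pentaho API. The indent-amount key is an Apache Xalan output property that the JDK's built-in transformer honors; other transformer implementations may ignore it, in which case the output still gets line breaks but the indentation width may differ.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration only.
final class PrettyXml {

    // Pretty-prints an XML string, asking for `indent` spaces per nesting level.
    static String format(String xml, int indent) {
        try {
            // Parse the one-line XML string into a DOM document
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Drop whitespace-only text nodes so they do not interfere with indentation
            NodeList blanks = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//text()[normalize-space()='']", document, XPathConstants.NODESET);
            for (int i = 0; i < blanks.getLength(); i++) {
                Node node = blanks.item(i);
                node.getParentNode().removeChild(node);
            }

            // Configure the serializer; the namespaced indent-amount key is
            // implementation-specific (assumption: a Xalan-derived transformer is in use)
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount",
                                          String.valueOf(indent));

            // Serialize the document back to a string
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(document), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(format("<a><b>1</b><c><d/></c></a>", 2));
    }
}
```

Inside the User Defined Java Class step, the same idea would amount to adding the single setOutputProperty call for indent-amount to toPrettyString. Whether it has any effect depends on which TransformerFactory the PDI classpath selects, so it is worth checking the output on your own installation.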