How to read/write Java ASCII Characters value with XMLStreamReader?

I'd like to use XMLStreamReader to read an XML file that contains the Horizontal Tab character reference &#009;, for example:

<tag>foo&#009;bar</tag>

and print it out or write it back to another XML file unchanged.

Google tells me to set javax.xml.stream.isCoalescing to true in XMLInputFactory, but my test code below does not work as expected.

public static void main(String[] args) throws IOException, XMLStreamException {
    XMLInputFactory factory = XMLInputFactory.newInstance();
    factory.setProperty(factory.IS_COALESCING, true);

    System.out.println("IS_COALESCING supported ? " + factory.isPropertySupported(factory.IS_COALESCING));
    System.out.println("factory IS_COALESCING value is " +factory.getProperty(factory.IS_COALESCING));

    String rawString = "<tag>foo&#009;bar</tag>";
    XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(rawString));
    System.out.println("reader IS_COALESCING value is " +reader.getProperty(factory.IS_COALESCING));

    PrintWriter pw = new PrintWriter(System.out, true);
    while (reader.hasNext())
    {
        reader.next();
        pw.print(reader.getEventType());
        if (reader.hasText())
            pw.append(' ').append(reader.getText());
        pw.println();
    }
}

The output is

IS_COALESCING supported ? true
factory IS_COALESCING value is true
reader IS_COALESCING value is true
1
4 foo   bar
2
8

But I want to keep the Horizontal Tab reference exactly as it was written, like this:

IS_COALESCING supported ? true
factory IS_COALESCING value is true
reader IS_COALESCING value is true
1
4 foo&#009;bar
2
8

What am I missing here? Thanks.

There is 1 best solution below

From what I can see, the parsing is correct; the text is just not printed the way you expect. The XML reader resolves the numeric character reference &#009; to the tab character (\t), so getText() returns a Java string that contains a literal tab, not the original reference.
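To make that concrete, here is a minimal sketch that checks what getText() actually hands back. The class name TabReferenceDemo and the helper readText are made up for illustration; only the javax.xml.stream calls come from the JDK.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class TabReferenceDemo {
    // Hypothetical helper: concatenates all character data found in the document.
    public static String readText(String xml) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            factory.setProperty(XMLInputFactory.IS_COALESCING, true);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            StringBuilder text = new StringBuilder();
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    text.append(reader.getText());
                }
            }
            reader.close();
            return text.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String text = readText("<tag>foo&#009;bar</tag>");
        // The reference has already been replaced by the character it stands for.
        System.out.println("contains literal tab: " + text.contains("\t"));
        System.out.println("contains &#009;:      " + text.contains("&#009;"));
    }
}
```

So IS_COALESCING is not the relevant switch here: it only controls whether adjacent character data is merged into one event. If the reference has to show up in the output, the text must be escaped again when it is written.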

Using Guava's XmlEscapers, I can produce something similar to what you want to have:

import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringReader;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

import com.google.common.xml.XmlEscapers;

public class Test {
    public static void main(String[] args) throws IOException, XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_COALESCING, true);

        System.out.println("IS_COALESCING supported ? " + factory.isPropertySupported(XMLInputFactory.IS_COALESCING));
        System.out.println("factory IS_COALESCING value is " + factory.getProperty(XMLInputFactory.IS_COALESCING));

        String rawString = "<tag>foo&#009;bar</tag>";
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(rawString));
        System.out.println("reader IS_COALESCING value is " + reader.getProperty(XMLInputFactory.IS_COALESCING));

        PrintWriter pw = new PrintWriter(System.out, true);
        while (reader.hasNext()) {
            reader.next();
            pw.print(reader.getEventType());
            if (reader.hasText()) {
                // getText() contains a literal tab; the attribute escaper turns it back into a character reference
                pw.append(' ').append(XmlEscapers.xmlAttributeEscaper().escape(reader.getText()));
            }
            pw.println();
        }
    }
}

The output looks like this (note that the reference comes back in hexadecimal form, &#x9;, not as &#009;):

IS_COALESCING supported ? true
factory IS_COALESCING value is true
reader IS_COALESCING value is true
1
4 foo&#x9;bar
2
8

Some remarks on this:

  • The library itself is marked as unstable, so there might be better alternatives.
  • \t does not need to be escaped in XML element content, so the content escaper leaves it alone; that is why I had to choose the attribute escaper. It works, but it has side effects: the attribute escaper also escapes quotes, \n and \r, which element content does not require.
  • Is a 100% copy of the content really required? If not, I would suggest letting the XML libraries do their work and produce the correct encoding themselves.
    • If you really want a 1:1 copy, is it an option to wrap the input in CDATA?
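
If you would rather avoid the Guava dependency and get exactly &#009; instead of &#x9;, a small hand-written escaper is one option. This is only a sketch: the class TabEscaper and the method escapeKeepingTabReference are hypothetical names, and it assumes that every tab in the text should be written as a reference, because after parsing there is no way to tell whether a tab was originally a literal tab or &#009;.

```java
class TabEscaper {
    // Hypothetical helper: escapes text for XML element content and
    // writes every tab as the decimal character reference &#009;.
    public static String escapeKeepingTabReference(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '\t': out.append("&#009;"); break; // assumption: all tabs become references
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A literal tab goes in, the character reference comes out.
        System.out.println(escapeKeepingTabReference("foo\tbar")); // prints foo&#009;bar
    }
}
```

The result is meant to be written straight to a Writer or PrintWriter. Passing it to XMLStreamWriter.writeCharacters would escape the ampersand a second time and produce &amp;#009; in the file.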