Why is the output different when serializing to different streams? (Java)

I have a problem with an XML document that contains special characters (the problematic string is löööschee`*‘‘§a). The XML arrives as a XOM object in Java. While investigating the problem, I tried to print the text of the XML with a Serializer, and I noticed that writing directly to System.out was the only way to get the correct string.

Here is the code I used to print the XML:

Element pEntry; //this is the XOM object I get, it contains the xml
Document document = pEntry.getDocument();
ByteArrayOutputStream stream = new ByteArrayOutputStream();
Serializer serializer = new Serializer(stream);
Serializer serializer2 = new Serializer(System.out);
try {
    serializer.write(document);
    serializer2.write(document);
} catch (IOException e) {
    System.out.println(e.getMessage());
}
System.out.println("#####################################################################");
System.out.println(stream);

serializer2 writes directly to System.out, and there the string is as it should be. The System.out.println(stream) call, however, prints the string as l??????schee`*????????a. I tried many things with different encodings (the default encoding for the Serializer is "UTF-8", which seems correct), but the only approach I found that prints the correct string is writing directly to System.out.
I also printed the bytes of the first stream (the one that does not work), and this was the output:
6c ffffffc3 ffffffb6 ffffffc3 ffffffb6 ffffffc3 ffffffb6 73 63 68 65 65 60 2a ffffffe2 ffffff80 ffffff98 ffffffe2 ffffff80 ffffff98 ffffffc2 ffffffa7 61.
I don't really know whether this is correct, and I can't print the bytes that are written directly to System.out. I saw that c3 b6, for example, should be an ö, which would be correct, but I don't know what the ffffff prefixes mean.
Why are the two outputs different, even though they use the same encoding?

Other things I tried:

  • adding -J-Dfile.encoding=UTF-8 to the javac command -> made no difference
  • initializing the serializers with different encodings (UTF-8, UTF-16, US-ASCII) -> the only one that worked correctly for serializer2 was UTF-8, so I assume this is the correct encoding
  • replacing System.out.println(stream) with
    String xmlContent = stream.toString(StandardCharsets.UTF_8);
    System.out.println(xmlContent);
    
    -> this was at least an improvement, I think; the string then looked like l???schee`*?????a

Michael Lamprecht On BEST ANSWER

Putting the line System.setOut(new PrintStream(System.out, true, StandardCharsets.UTF_8)); above the console output solved the problem. The console now always shows the correct string.
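A likely reason this helps: System.out is a PrintStream that encodes characters with its own charset, and a charset that cannot represent a character substitutes '?'. The sketch below illustrates that under stated assumptions. The class name Utf8ConsoleDemo and the helper encodeViaPrintStream are made up for illustration, a ByteArrayOutputStream stands in for the console, and the PrintStream(OutputStream, boolean, Charset) constructor needs Java 10 or later.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf8ConsoleDemo {

    // Hypothetical helper: prints text through a PrintStream that encodes
    // with the given charset and returns the bytes that reached the sink.
    static byte[] encodeViaPrintStream(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(sink, true, charset);
        out.print(text);
        out.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        // "löööschee", written with escapes so the source file encoding does not matter
        String text = "l\u00f6\u00f6\u00f6schee";

        // US-ASCII cannot represent 'ö', so each one is replaced by '?'
        byte[] ascii = encodeViaPrintStream(text, StandardCharsets.US_ASCII);
        System.out.println(new String(ascii, StandardCharsets.US_ASCII)); // l???schee

        // The fix from this answer: make System.out itself encode as UTF-8
        System.setOut(new PrintStream(System.out, true, StandardCharsets.UTF_8));
        System.out.println(text);
    }
}
```

Whether the last line displays correctly still depends on the terminal interpreting the bytes as UTF-8, which appears to be the case in the question since serializer2's raw UTF-8 output looked right.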

Pradipta Sarma On

You probably get the right output with serializer2 because the console uses your system's default encoding (most likely UTF-8, which can display the special characters correctly).

With serializer you're writing to a ByteArrayOutputStream, which only holds raw bytes and does not handle character encoding the way the console does. Try providing the encoding explicitly when converting the ByteArrayOutputStream to a string, with something like new String(stream.toByteArray(), StandardCharsets.UTF_8).
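To check that the bytes in the stream are fine, here is a small sketch that decodes the byte dump from the question explicitly as UTF-8. The class ByteDumpDemo and its helpers are made up for illustration. It also shows where the ffffff prefixes most likely come from: a Java byte is signed, so if the dump was produced with something like Integer.toHexString(b) (an assumption, the question does not show that code), every byte above 0x7f is sign-extended to a negative int. Masking with 0xff removes the prefix.

```java
import java.nio.charset.StandardCharsets;

public class ByteDumpDemo {

    // The byte values from the question's dump, without the ffffff prefixes
    static final int[] DUMP = {
        0x6c, 0xc3, 0xb6, 0xc3, 0xb6, 0xc3, 0xb6, 0x73, 0x63, 0x68, 0x65, 0x65,
        0x60, 0x2a, 0xe2, 0x80, 0x98, 0xe2, 0x80, 0x98, 0xc2, 0xa7, 0x61
    };

    static byte[] dumpBytes() {
        byte[] bytes = new byte[DUMP.length];
        for (int i = 0; i < DUMP.length; i++) {
            bytes[i] = (byte) DUMP[i];
        }
        return bytes;
    }

    // Formats each byte as two hex digits; "& 0xff" undoes the sign extension
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = dumpBytes();

        // The byte 0xc3 is -61 as a signed byte, which widens to 0xffffffc3
        System.out.println(Integer.toHexString(bytes[1])); // ffffffc3
        System.out.println(toHex(bytes));

        // Decoding explicitly as UTF-8 yields the original string
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(decoded.equals("l\u00f6\u00f6\u00f6schee`*\u2018\u2018\u00a7a")); // true
    }
}
```

So the stream content is valid UTF-8. Printing the decoded string can still show question marks if System.out encodes with a charset that lacks these characters, which would match the l???schee`*?????a output reported in the question.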