I am receiving XML from a third-party system in UTF-8 and I am trying to parse it properly and save it in my database. Below are four sample strings from that XML. When I run them through StringEscapeUtils.unescapeXml, everything works except the en dash.
String one = "<Name>test ' test</Name>";
String two = "<Fi>Em &#150; S</Fi>";
String three = "<FirstName>a1 ä</FirstName>";
String four = "crapÉ";
System.out.println(StringEscapeUtils.unescapeXml(one));
System.out.println(StringEscapeUtils.unescapeXml(two));
System.out.println(StringEscapeUtils.unescapeXml(three));
System.out.println(StringEscapeUtils.unescapeXml(four));
Output:
<Name>test ' test</Name>
<Fi>Em S</Fi>
<FirstName>a1 ä</FirstName>
crapÉ
Everything looks fine except the string "two", which should actually print as "Em – S".
I am trying to figure out what I am doing wrong and what the best way to decode such XML strings is.
A console may simply be unable to print the character (&#150;). But when you examine the unescaped string, you will find that the character reference is correctly unescaped to a Java character with code point 150.
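One way to check this is to print the code points of the unescaped string instead of relying on the console. The sketch below is only an illustration: the class and method names (CodePointInspector, describe, remapC1) are made up for this example, and the hard-coded string stands in for what unescapeXml would return, so the example needs no third-party library. Code point 150 is U+0096, a C1 control character, which is why nothing visible is printed. In Windows-1252, byte 0x96 is the en dash, so the sender most likely wrote &#150; where &#8211; (U+2013) was meant. If that assumption holds for your data, remapping the C1 range through Windows-1252 recovers the intended character.

```java
import java.nio.charset.Charset;

public class CodePointInspector {

    // Returns each code point of s as "U+XXXX (decimal)", separated by spaces.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X (%d)", cp, cp));
        });
        return sb.toString();
    }

    // Assumption: C1 controls (U+0080..U+009F) in the input are really
    // Windows-1252 byte values that were written as numeric references.
    // Each one is reinterpreted as a Windows-1252 byte; anything the
    // charset cannot map is left unchanged.
    static String remapC1(String s) {
        Charset cp1252 = Charset.forName("windows-1252");
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp >= 0x80 && cp <= 0x9F) {
                String decoded = new String(new byte[] {(byte) cp}, cp1252);
                if (decoded.equals("\uFFFD")) {
                    sb.appendCodePoint(cp); // undefined in Windows-1252, keep as-is
                } else {
                    sb.append(decoded);
                }
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for StringEscapeUtils.unescapeXml("<Fi>Em &#150; S</Fi>")
        String unescaped = "<Fi>Em \u0096 S</Fi>";

        // Shows the invisible character between the two spaces.
        System.out.println(describe(unescaped));

        // After remapping, the string contains U+2013 (en dash) instead.
        System.out.println(describe(remapC1(unescaped)));
    }
}
```

Whether remapping is appropriate depends on the sender: it is only safe if the third-party system never intends real C1 control characters. Asking them to emit &#8211; would be the cleaner fix.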