StringEscapeUtils: unescaping an en dash encoded as &#150;

I receive XML from a third-party system in UTF-8, and I am trying to parse it correctly and save it in my database. Below are four sample lines from that XML. When I call unescapeXml on them, it works for everything except the en dash.

String  one  = "<Name>test &apos; test</Name>";
String  two  = "<Fi>Em &#150; S</Fi>";
String three = "<FirstName>a1 &#228;</FirstName>";
String four = "crap&#201;";

System.out.println(StringEscapeUtils.unescapeXml(one));
System.out.println(StringEscapeUtils.unescapeXml(two));
System.out.println(StringEscapeUtils.unescapeXml(three));
System.out.println(StringEscapeUtils.unescapeXml(four));

Output:

<Name>test ' test</Name>

<Fi>Em  S</Fi>

<FirstName>a1 ä</FirstName>

crapÉ

Everything looks fine except the string "two", which should actually be "Em – S".

I am trying to figure out what I am doing wrong, and what the best way to decode such XML strings is.

There are 1 best solutions below

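A console may simply not be able to print the character that &#150; refers to. Note that in Unicode, code point 150 (U+0096) is a C1 control character, not the en dash; the en dash is U+2013 (&#8211;). Byte 0x96 is the en dash only in the Windows-1252 encoding.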

But when you examine the unescaped string:

String two = "<Fi>Em &#150; S</Fi>";
String twoUnescaped = StringEscapeUtils.unescapeXml(two);
System.out.println(twoUnescaped.codePointAt(7));

you will find that the character reference is correctly unescaped to a Java character with code point 150 (U+0096).
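If the goal is to end up with a real en dash, one possible approach is to remap the C1 range after unescaping. The sketch below assumes the third-party system wrote Windows-1252 byte values as numeric character references, which is a guess based on &#150; and not something the question confirms. The class and method names (Cp1252Fixer, fixC1Controls) are hypothetical, and only the JDK is used, so the input here stands in for the string that unescapeXml returns.

```java
import java.nio.charset.Charset;

// Hypothetical helper: reinterprets characters in the C1 range
// (U+0080..U+009F) as Windows-1252 bytes. Assumes the producer emitted
// cp1252 byte values as numeric references, e.g. &#150; for an en dash.
final class Cp1252Fixer {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    static String fixC1Controls(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x80 && c <= 0x9F) {
                // Decode the single byte with the Windows-1252 table.
                char decoded = new String(new byte[] {(byte) c}, CP1252).charAt(0);
                // Bytes with no cp1252 mapping decode to U+FFFD; keep the original then.
                out.append(decoded == '\uFFFD' ? c : decoded);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in for StringEscapeUtils.unescapeXml("<Fi>Em &#150; S</Fi>")
        String twoUnescaped = "<Fi>Em \u0096 S</Fi>";
        String fixed = fixC1Controls(twoUnescaped);
        System.out.println(fixed.codePointAt(7)); // prints 8211 (U+2013, en dash)
    }
}
```

Characters outside U+0080..U+009F, such as ä and É from the other sample lines, pass through unchanged. Whether this remapping is appropriate depends on what the sending system really does, so it is worth confirming with the provider before applying it to stored data.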