I am using "org.apache.commons.lang.StringEscapeUtils.unescapeHtml(myHtmlString)" to convert HTML entity escapes to a string containing the actual Unicode characters corresponding to the escapes. However, it doesn't handle the "em dash" and "en dash" symbols properly. StringEscapeUtils replaces the en dash escape with "\u0096", while the correct replacement is "\u2013". And as I have read, 0x96 is the cp1252 code for the en dash. So how can I make it work correctly? I know that I can replace it manually, but I wonder whether I can do it with StringEscapeUtils or with any other utility.
"org.apache.commons.lang.StringEscapeUtils" and "en dash"
Asked by Zalivaka
There are 2 best solutions below

I suspect that the problem is not in the StringEscapeUtils.unescapeHtml(...) call. Instead, I suspect that the character has been turned into '\u0096' before the call. More specifically, I suspect that your code has used the wrong character set when reading the HTML as characters.
As you say, an en dash is code point 0x96 in cp1252. So one way to get an en dash mistranslated to the Unicode code point \u0096 would be to start with a byte stream that was encoded using cp1252 and read / decode it using an InputStreamReader(is, "ISO-8859-1"), i.e. as Latin-1.
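To see the effect described above in isolation, here is a minimal sketch that decodes the single byte 0x96 with two different charsets. The class name DashDecoding and the helper decode are made up for illustration, and the sketch assumes the "windows-1252" charset is available in your JDK (it normally is).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DashDecoding {
    // Decode the single byte 0x96 (an en dash in cp1252) with the given charset.
    static String decode(Charset charset) {
        byte[] cp1252Bytes = { (byte) 0x96 };
        return new String(cp1252Bytes, charset);
    }

    public static void main(String[] args) {
        // Wrong charset: ISO-8859-1 maps every byte straight to the same code point.
        String wrong = decode(StandardCharsets.ISO_8859_1);
        // Right charset: windows-1252 (cp1252) maps 0x96 to the en dash.
        String right = decode(Charset.forName("windows-1252"));

        System.out.printf("ISO-8859-1:   U+%04X%n", (int) wrong.charAt(0));   // U+0096
        System.out.printf("windows-1252: U+%04X%n", (int) right.charAt(0));   // U+2013
    }
}
```

If your own input shows the same pattern, passing the correct charset to the InputStreamReader is probably a cleaner fix than patching the string afterwards.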
I don't think so. 0x0096 in Unicode is a C1 control code (http://en.wikipedia.org/wiki/C0_and_C1_control_codes) and is unlikely to be the replacement for "-" (as you wrote).
Well, if StringEscapeUtils really messes this up (en dash should indeed be \u2013), and if it's the only escape it is messing up, and if there's no reason to have any other 0x0096 in your String, then a replaceAll after calling StringEscapeUtils should work.
The following does the replace you expect:
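A minimal sketch of such a replacement, assuming every U+0096 in the unescaped string is really a mangled en dash. The class name EnDashFix and the method fixEnDash are hypothetical, and the input string stands in for whatever unescapeHtml returned.

```java
class EnDashFix {
    // Replace the C1 control character U+0096 with a real en dash (U+2013).
    // Assumption: no legitimate U+0096 occurs in the input.
    static String fixEnDash(String unescaped) {
        return unescaped.replaceAll("\u0096", "\u2013");
    }

    public static void main(String[] args) {
        // Stand-in for the result of StringEscapeUtils.unescapeHtml(...)
        String broken = "pages 10\u009620";
        String fixed = fixEnDash(broken);

        System.out.println(fixed.equals("pages 10\u201320")); // true
    }
}
```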
However you should first make sure that StringEscapeUtils really messes things up and really, really, understand why/how you get that 0x0096 in a Java String.
Then, also, it should probably be pointed out to you that sadly Java's Unicode support is a major SNAFU because Java was conceived before Unicode 3.1 came out.
Hence it seemed a smart idea to use 16 bits for the char primitive, it seemed a smart idea to use a 4-hexdigits '\uxxxx' escape sequence, it seemed a smart idea to represent the length of the char[] in String's length() method, etc.
These were actually all very, very stupid ideas, leading to one of the major Java SNAFUs, where the char primitive can no longer hold an arbitrary Unicode character and where String's length() method does not actually return a String's real length in characters.
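To make the length() complaint concrete, here is a small sketch using a character outside the Basic Multilingual Plane. The class and method names are made up for illustration, and U+1D11E (MUSICAL SYMBOL G CLEF) is just one convenient example of such a character.

```java
class StringLengthDemo {
    // Number of 16-bit char units, which is what String.length() reports.
    static int charUnits(String s) {
        return s.length();
    }

    // Number of actual Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1D11E does not fit in one 16-bit char, so it is stored as a surrogate pair.
        String clef = new String(Character.toChars(0x1D11E));

        System.out.println(charUnits(clef));  // 2
        System.out.println(codePoints(clef)); // 1
    }
}
```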
I like the following:
Why this rant? Well, because I don't know how the regexp replacement in String's replaceAll is implemented, but I really wouldn't be surprised if there were cases (i.e. certain codepoints) where String's replaceAll was, like char and like length and like \uxxxx, well.. hmmm, totally broken.
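One way to check that worry for this particular case is to run the replacement on a string that also contains a surrogate pair, and to compare it against String.replace(char, char), which does not involve the regex engine at all. This is only a sketch with hypothetical names (ReplaceCheck, viaRegex, viaCharReplace); it covers this one input and does not prove anything about other code points.

```java
class ReplaceCheck {
    // Regex-based replacement, as suggested above.
    static String viaRegex(String s) {
        return s.replaceAll("\u0096", "\u2013");
    }

    // Plain char-for-char replacement, no regex involved.
    static String viaCharReplace(String s) {
        return s.replace('\u0096', '\u2013');
    }

    public static void main(String[] args) {
        // Mixed input: a stray U+0096 next to a supplementary character (U+1D11E).
        String clef = new String(Character.toChars(0x1D11E));
        String input = "a\u0096b" + clef;
        String expected = "a\u2013b" + clef;

        System.out.println(viaRegex(input).equals(expected));       // true
        System.out.println(viaCharReplace(input).equals(expected)); // true
    }
}
```

Since both U+0096 and U+2013 fit in a single char, the char-based variant is enough here and avoids the regex question entirely.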