Replacing HTML 5 codes with their equivalent characters in Java

I'm trying to decode HTML 5 entities using StringEscapeUtils.unescapeHtml4(), but a lot of symbols still aren't being replaced, such as "&nbsp" and "&amp". What would you recommend using?

Best answer

&nbsp and &amp aren't entities. &nbsp; and &amp; are entities. If your string is really missing the ; on them, that's why they're not being decoded.

I just checked (just to be thorough!), and StringEscapeUtils.unescapeHtml4 does correctly decode &nbsp; and &amp;.

The correct fix is to fix whatever's giving you that string with the incomplete entities in it.

You could work around it by also turning &nbsp and &amp into \u00A0 and & with String#replace after calling StringEscapeUtils.unescapeHtml4:

// Ugly, technically-incorrect workaround (but we do these things sometimes)
String result =
    StringEscapeUtils.unescapeHtml4(sourceString)
    .replace("&nbsp", "\u00A0")
    .replace("&amp", "&");

...but it's not correct, because those aren't entities. Best to correct the string.
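
If the source really can't be fixed, a variation on the workaround above is to add the missing ; before decoding rather than patching the result afterwards. The sketch below is only that, a sketch: the class name EntityRepair and the helper repairEntities are hypothetical, it uses nothing but java.util.regex, and it assumes &nbsp and &amp are the only names that need repair. It's still a heuristic (it would turn "&ampersand" into "&amp;ersand"), so the same caveat applies.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Commons Lang or Commons Text.
class EntityRepair {
    // Matches "&nbsp" or "&amp" only when NOT already followed by ';'.
    // Assumption: just these two names are affected; extend the alternation if needed.
    private static final Pattern MISSING_SEMICOLON =
        Pattern.compile("&(nbsp|amp)(?!;)");

    // Appends the missing ';' so a standard decoder can recognise the entity.
    static String repairEntities(String source) {
        return MISSING_SEMICOLON.matcher(source).replaceAll("&$1;");
    }

    public static void main(String[] args) {
        String broken = "Fish&nbsp&amp&nbsp;chips &amp; peas";
        // prints: Fish&nbsp;&amp;&nbsp;chips &amp; peas
        System.out.println(repairEntities(broken));
        // The repaired string could then be passed to
        // StringEscapeUtils.unescapeHtml4(...) as usual.
    }
}
```

Repairing first means a properly escaped "&amp;nbsp" is left alone and still decodes to the literal text "&nbsp", whereas replacing after decoding would turn it into a non-breaking space.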