I am using "org.apache.commons.lang.StringEscapeUtils.unescapeHtml(myHtmlString)" to convert HTML entity escapes to a string containing the actual Unicode characters corresponding to the escapes. However, it doesn't parse the "em dash" and "en dash" symbols properly. StringEscapeUtils replaces "" with "\u0096" while the correct replacement is "\u2013". And as I have read, "\u0096" is the cp1252 equivalent for "". So how can I make it work correctly? I know that I can replace it manually, but I wonder if I can do it with StringEscapeUtils or with any other util.
"org.apache.commons.lang.StringEscapeUtils" and "en dash"
2.8k views, asked by Zalivaka
There are 2 solutions below
Stephen C
I suspect that the problem is not in the StringEscapeUtils.unescapeHtml(...) call.
Instead, I suspect that the character has been turned into '\u0096' before the call. More specifically, I suspect that your code has used the wrong character set when reading the HTML as characters.
As you say, an en dash is code point 0x96 in cp1252. So one way to get an en dash mistranslated to the Unicode code point \u0096 would be to start with a byte stream that was encoded using cp1252 and read / decode it using an InputStreamReader(is, "ISO-8859-1"), i.e. as Latin-1.
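The mistranslation described above can be reproduced without any HTML involved. The following is only a minimal sketch: the class and method names (DashDecoding, decode) are made up for illustration, and it assumes the input really is a single cp1252-encoded en dash byte.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DashDecoding {
    // Decode raw bytes with an explicit charset instead of relying on a default.
    // (Hypothetical helper, not part of any library.)
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // 0x96 is the en dash in cp1252 (windows-1252).
        byte[] enDashInCp1252 = { (byte) 0x96 };

        String right = decode(enDashInCp1252, Charset.forName("windows-1252"));
        String wrong = decode(enDashInCp1252, StandardCharsets.ISO_8859_1);

        // Decoding with the matching charset yields U+2013 (en dash).
        System.out.printf("windows-1252: U+%04X%n", (int) right.charAt(0));
        // Decoding as Latin-1 maps the byte straight to the C1 control U+0096.
        System.out.printf("ISO-8859-1:   U+%04X%n", (int) wrong.charAt(0));
    }
}
```

If the second line of output matches what you see in your own data, the charset passed to the reader is the more likely culprit than the unescape call.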
I don't think so. 0x0096 in Unicode is a C1 control code:
http://en.wikipedia.org/wiki/C0_and_C1_control_codes
and is unlikely to be the replacement for "-" (as you wrote).
Well, if StringEscapeUtils really messes this up (en dash should indeed be \u2013), if it's the only escape it is messing up, and if there's no reason to have any other 0x0096 in your String, then a replaceAll after calling StringEscapeUtils should work.
The following does the replace you expect:
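A minimal sketch of such a replacement, assuming the string really contains U+0096 and that no other use of that character is legitimate in your data. The names DashFix and fixEnDash are hypothetical, and the sketch uses String.replace(char, char) rather than replaceAll, since no regular expression is needed for a single character:

```java
final class DashFix {
    // Replace the C1 control character U+0096 with a real en dash (U+2013).
    // String.replace(char, char) is a literal swap, so no regex is involved.
    static String fixEnDash(String s) {
        return s.replace('\u0096', '\u2013');
    }

    public static void main(String[] args) {
        // "10", then U+0096, then "20": what a mis-decoded "10–20" looks like.
        String broken = "pages 10\u009620";
        String fixed = fixEnDash(broken);

        // The control character is gone and an en dash sits in its place.
        System.out.println(fixed.equals("pages 10\u201320"));
    }
}
```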
However, you should first make sure that StringEscapeUtils really messes things up and really, really understand why/how you get that 0x0096 in a Java String.
Then, also, it should probably be pointed out to you that sadly Java's Unicode support is a major SNAFU because Java was conceived before Unicode 3.1 came out.
Hence it seemed a smart idea to use 16 bits for the char primitive, it seemed a smart idea to use a 4-hex-digit '\uxxxx' escape sequence, it seemed a smart idea to have String's length() method return the length of the underlying char[], etc.
These were actually all very, very stupid ideas, leading to one of the major Java SNAFUs, where the char primitive can no longer hold every Unicode character and where String's length method does not actually return a String's real length in characters.
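The length point can be seen with any character outside the Basic Multilingual Plane. This is just an illustrative sketch; the class name CodePointLength and the helper codePoints are made up, and U+1D11E (musical symbol G clef) is an arbitrary choice of supplementary character:

```java
final class CodePointLength {
    // Count Unicode code points rather than 16-bit char units.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1D11E does not fit in one 16-bit char, so it is stored as a surrogate pair.
        String clef = new String(Character.toChars(0x1D11E));

        System.out.println(clef.length());     // prints 2 (char units)
        System.out.println(codePoints(clef));  // prints 1 (code points)
    }
}
```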
I like the following:
Why this rant? Well, because I don't know how the regex replacement in String's replaceAll is implemented, but I really wouldn't be surprised if there were cases (i.e. certain code points) where String's replaceAll was, like char and like length and like \uxxxx, well... hmmm, totally broken.