How can I convert a Java string to xml entities for versions of Unicode beyond 3.0?

To convert Java characters to XML entities, I can do the following for each char in a String:

buf.append("&#x"+ Integer.toHexString(c | 0x10000).substring(1) +";");

However, according to other Stack Overflow questions, this only works up through Unicode 3.0.

If I use a UTF-8 Reader to read in a String, then presumably that String contains the characters in a format that works up through Unicode 6.0 (because Java 7 supports Unicode 6.0 according to the javadoc).

Once I have that String, how can I write it out as XML entities? Ideally I'd use some API that would continue working as new versions of Unicode come out.

There are 2 answers below.

BEST ANSWER

Either you are not using correct terminology, or there is a great deal of confusion here.

The &#x character reference notation just specifies a numeric codepoint; it is independent of the version of Unicode used by any reader or parser.
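For example, &#x1F600; always denotes the code point U+1F600 (a character that wasn't assigned until Unicode 6.0); any conforming XML parser resolves it purely numerically. A quick check with the JDK's built-in DOM parser (a minimal sketch):

import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new InputSource(new StringReader("<doc>&#x1F600;</doc>")));
// The reference was resolved numerically, not looked up in a Unicode table:
System.out.println(doc.getDocumentElement().getTextContent().codePointAt(0) == 0x1F600); // true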

Your code is actually only compatible with Unicode 1.x, because it assumes a character's numeric value is less than 2^16 (65,536). As of Unicode 2.0 that is not a correct assumption. Some characters are represented by a single Java char, while other characters are represented by two Java chars (known as surrogates).
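To see the difference (a short demonstration, using U+1D11E MUSICAL SYMBOL G CLEF as an example of a supplementary character):

String s = "\uD834\uDD1E";                           // U+1D11E as a surrogate pair
System.out.println(s.length());                      // 2 -- two Java chars
System.out.println(s.codePointCount(0, s.length())); // 1 -- one Unicode character
// Your one-char-at-a-time loop would emit two bogus references
// (&#xd834; and &#xdd1e;) instead of the correct &#x1d11e;.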

I'm not sure what a "UTF-8 Reader" is. A Reader just reads char values and knows nothing about UTF-8 or any other charset. The exception is InputStreamReader, which uses a CharsetDecoder to translate bytes to chars using UTF-8 (or whatever encoding that particular CharsetDecoder was constructed with).
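For example, what people usually mean by a "UTF-8 Reader" is something like this (with "input.txt" as a placeholder file name):

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// The CharsetDecoder inside InputStreamReader turns UTF-8 bytes into
// UTF-16 chars; the Reader API itself only ever hands back char values.
Reader reader = new InputStreamReader(
        new FileInputStream("input.txt"), StandardCharsets.UTF_8);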

In any event, no Reader will parse the XML &#x character reference notation. You must use an XML parser for that.

No Reader or XML parser is affected by the Unicode version known to Java, because no Reader or XML parser consults a Unicode database in any way. The characters are just treated as numeric values as they are parsed. Whether they correspond to assigned codepoints in any Unicode version is never considered.

Finally, to write out a String as XML, you can use a Formatter:

import java.util.Formatter;

static String toXML(String s) {
    Formatter formatter = new Formatter();
    int len = s.length();
    // Walk the string one code point at a time (not one char at a time),
    // so surrogate pairs are handled as single characters.
    for (int i = 0; i < len; i = s.offsetByCodePoints(i, 1)) {
        int c = s.codePointAt(i);
        if (c < 32 || c > 126 || c == '&' || c == '<' || c == '>') {
            // Escape control chars, non-ASCII chars, and XML-special chars.
            formatter.format("&#x%x;", c);
        } else {
            formatter.format("%c", c);
        }
    }
    return formatter.toString();
}
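For example, calling toXML on a string that mixes an XML-special char with a supplementary character (a made-up input, just to illustrate the escaping):

System.out.println(toXML("A<B\uD834\uDD1E"));
// prints: A&#x3c;B&#x1d11e;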

As you can see, there is no code that depends on the Unicode version, because the characters are just numeric values. Whether each numeric value is an assigned Unicode codepoint is not relevant.

(My first inclination was to use the XMLStreamWriter class, but it turns out an XMLStreamWriter that uses a non-Unicode encoding such as ISO-8859-1 or US-ASCII does not properly output surrogate pairs as single character entities, as of Java 1.8.0_05.)

ANSWER

Originally, Java supported Unicode 1.0 by making the char type 16 bits wide. Unicode 2.0 then introduced the surrogate mechanism to support more characters than fit in 16 bits, so Java strings became UTF-16 encoded; that means some characters need two Java chars to be represented, called the high surrogate char and the low surrogate char.

To know which chars in a String are actually part of high/low surrogate pairs, you can use the utility methods in Character:

Character.isHighSurrogate(myChar); // returns true if myChar is a high surrogate
Character.isLowSurrogate(myChar); // same for low surrogate

Character.isSurrogate(myChar); // just to know if myChar is a surrogate

Once you know which chars are high or low surrogates, you need to convert each pair to a Unicode code point with this method:

int codePoint = Character.toCodePoint(highSurrogate, lowSurrogate);
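(Alternatively, if you iterate by index rather than by char, String.codePointAt does the pairing for you: it returns the full code point when the index falls on a high surrogate.)

int codePoint = "\uD834\uDD1E".codePointAt(0); // 0x1D11E, pair combined automatically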

As a piece of code is worth a thousand words, here is an example method that replaces non-US-ASCII (and XML-special) chars inside a string with XML character references:

public static String replaceToCharEntities(String str) {
    StringBuilder result = new StringBuilder(str.length());

    char surrogate = 0;
    for(char c: str.toCharArray()) {

        // if char is a high surrogate, keep it to match it
        // against the next char (low surrogate)
        if(Character.isHighSurrogate(c)) {
            surrogate = c;
            continue;
        }

        // get codePoint
        int codePoint;
        if(surrogate != 0) {
            codePoint = Character.toCodePoint(surrogate, c);
            surrogate = 0;
        } else {
            codePoint = c;
        }

        // decide whether to use just a char or a character reference
        if(codePoint < 0x20 || codePoint > 0x7E || codePoint == '<'
                || codePoint == '>' || codePoint == '&' || codePoint == '"'
                || codePoint == '\'') {
            result.append(String.format("&#x%x;", codePoint));
        } else {
            result.append(c);
        }
    }

    return result.toString();
}

The next string example is a good one to test with, as it contains a non-ASCII char that can be represented with a single 16-bit value and also a char that needs a high/low surrogate pair (here U+1D11E, MUSICAL SYMBOL G CLEF):

String myString = "text with some non-US chars: 'Ñ' and '\uD834\uDD1E'"; // \uD834\uDD1E is U+1D11E
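Running it through replaceToCharEntities (with U+1D11E standing in for the supplementary character) would produce:

text with some non-US chars: &#x27;&#xd1;&#x27; and &#x27;&#x1d11e;&#x27;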