Java XML Parsing - incorrect string version of the data with VTD-XML


I am parsing an XML document in UTF-8 encoding with Java using VTD-XML.

A small excerpt looks like:

<literal></literal>
<literal></literal>
<literal></literal>

I want to iterate through each literal and print it out to the console. However, what I get is:

¢

I am correctly navigating to each element. The way that I get the text value is by calling:

private static String toNormalizedString(String name, int val, final VTDNav vn) throws NavException {
    String strValue = null;
    if (val != -1) {
        strValue = vn.toNormalizedString(val);
    }
    return strValue;
}

I've also tried vn.getXPathStringVal(), but it yields the same result.

I know that the literals above aren't just strings of length one. Rather, each seems to be a single Unicode character that Java represents with two chars. I am able to correctly parse and output kanji characters whose length is just one.
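To illustrate what I mean by "composed of two characters", here is a minimal sketch using only the JDK. The class name SupplementaryDemo and its helper methods are made up for this example, and U+20BB7 is just an illustrative ideograph outside the Basic Multilingual Plane, not necessarily one of the characters in my file.

```java
import java.nio.charset.StandardCharsets;

class SupplementaryDemo {
    // Number of UTF-16 code units, which is what String.length() reports.
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of Unicode code points, i.e. "real" characters.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // Build the string from its code point so no surrogates are hand-written.
        String kanji = new String(Character.toChars(0x20BB7));

        System.out.println(utf16Units(kanji));                              // 2
        System.out.println(codePoints(kanji));                              // 1
        System.out.println(kanji.getBytes(StandardCharsets.UTF_8).length);  // 4
        System.out.printf("U+%X%n", kanji.codePointAt(0));                  // U+20BB7
    }
}
```

So a single such character occupies two Java chars (a surrogate pair) and four bytes in UTF-8, which matches what I see for the literals that fail.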

My question is: how can I correctly parse and output these characters using VTD-XML? Is there a way to get the underlying bytes of the text between the literal tags so that I can parse the bytes myself?
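For reference, "parsing the bytes myself" could look roughly like the sketch below. It assumes I can get the document's raw byte array plus the byte offset and byte length of the text node. I believe VTDNav exposes these through getXML(), getTokenOffset(val) and getTokenLength(val), but that those values are byte-based for a UTF-8 document is an assumption I have not verified. The class Utf8Slice and the method decode are hypothetical names, and the example uses only the JDK.

```java
import java.nio.charset.StandardCharsets;

class Utf8Slice {
    // Decodes `length` bytes of `doc`, starting at byte `offset`, as UTF-8.
    // Assumption: offset and length are expressed in bytes, not chars.
    static String decode(byte[] doc, int offset, int length) {
        return new String(doc, offset, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Stand-in for the real document: one literal holding U+20BB7.
        String ch = new String(Character.toChars(0x20BB7));
        String prefix = "<literal>";
        byte[] doc = (prefix + ch + "</literal>").getBytes(StandardCharsets.UTF_8);

        // In real code these two values would come from the parser.
        int offset = prefix.getBytes(StandardCharsets.UTF_8).length;
        int length = ch.getBytes(StandardCharsets.UTF_8).length;

        String text = decode(doc, offset, length);
        System.out.println(text.codePointAt(0) == 0x20BB7); // true
    }
}
```

Decoding the slice directly with an explicit charset sidesteps any per-character conversion inside the parser, at the cost of skipping entity and whitespace normalization.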

EDIT

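Code that processes each line of the XML, converting it to a byte array and then back to a String:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Read the file as UTF-8 explicitly; FileReader would use the platform default charset.
try (BufferedReader br = Files.newBufferedReader(Paths.get("res/sample.xml"), StandardCharsets.UTF_8)) {
    String line;
    while ((line = br.readLine()) != null) {
        // Round-trip through UTF-8 bytes, decoding with the same charset used to encode.
        byte[] myBytes = line.getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(myBytes, StandardCharsets.UTF_8));
    }
} catch (IOException e) {
    // Covers a missing file as well as read or decoding errors.
    e.printStackTrace();
}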

Best answer, by vtd-xml-author:

You are probably trying to get a string involving characters whose code points are 0x10000 or greater, that is, outside the Basic Multilingual Plane (BMP). That bug is known and is in the process of being addressed. I will notify you once the fix is out. This question may be identical to this one: "Map supplementary Unicode characters to BMP (if possible)".