Escaped unicode characters are escaped again with StringEscapeUtils.escapeXml


I have the text "Begünstigter", in which I'm trying to escape the character 'ü' with StringEscapeUtils.escapeXml. As the code for 'ü' is &#252;, I would expect the method to return Beg&#252;nstigter. However, StringEscapeUtils.escapeXml seems to keep escaping until there is nothing left to escape: after producing Beg&#252;nstigter, it escapes & as &amp;. The final result I get is therefore Beg&amp;#252;nstigter. I've tried commons-text, commons-lang, and commons-lang3 with the escapeXml10 and escapeXml11 methods, as well as some other posted solutions, but nothing seems to work. What am I overlooking, and how can I solve this?

Here is the full code of where I'm doing this:

private void exportRecords(XMLStreamWriter writer, XmlExportDataDescription exportDataDescription) throws XMLStreamException {
        Long companyId = exportDataDescription.getCompanyId();
        String mainTagName = exportDataDescription.getMainTagNameInXml();

        long count = 0;

        Clock clock = Clock.systemDefaultZone();
        writer.writeStartElement(mainTagName);
        while (true) {
            Map<String, Object> parameter = new HashMap<>();
            parameter.put("companyId", companyId);
            parameter.put("offset", count + 1);
            parameter.put("rowNum", count + MANUAL_XML_CREATION_BATCH_SIZE);

            long startTimeResults = clock.millis();
            List<Map<String, Object>> resultList = getSqlMapClientTemplate().queryForList("XML_EXPORT." + mainTagName, parameter);
            long endTimeResults = clock.millis();

            if (resultList.isEmpty()) {
                break;
            }

            log.debug("---- Retrieving " + resultList.size() + " results for table " + exportDataDescription.getMainTagNameInXml() + " took " + (endTimeResults - startTimeResults) + " ms");

            count += resultList.size();

            long startTimeBatchWriting = clock.millis();
            for (Map<String, Object> listEntry : resultList) {
                writer.writeStartElement(mainTagName + "_ROW");

                for (Entry<String, Object> entry : listEntry.entrySet()) {
                    if (entry.getKey().toLowerCase().equals("rn")) {
                        continue;
                    }

                    if (entry.getValue() == null) {
                        writer.writeEmptyElement(entry.getKey());
                    } else {
                        writer.writeStartElement(entry.getKey());
                        writer.writeCharacters(StringEscapeUtils.escapeXml(entry.getValue().toString()));
                        writer.writeEndElement();
                    }
                }

                writer.writeEndElement();
            }

            long endTimeBatchWriting = clock.millis();
            log.debug("---- Writing batch results for table " + exportDataDescription.getMainTagNameInXml() + " took " + (endTimeBatchWriting - startTimeBatchWriting) + " ms");
        }

        writer.writeEndElement();
        exportDataDescription.setNumberOfDatasets(BigDecimal.valueOf(count));
    }

There are 2 best solutions below

0
snaikar On

One way to handle it is to unescape the parts that you do not want escaped:

writer.writeCharacters(
     StringEscapeUtils.escapeXml(
       entry.getValue().toString()
     ).replaceAll("&amp;#(\\d+);", "&#$1;")
  );

The regular expression replaces each &amp; that directly precedes a decimal character reference (such as #252;) with a plain &.
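
To see what the pattern does in isolation, here is a small sketch that uses only String.replaceAll from the JDK. The class name UnescapeNumericRefs and the method name unescapeDecimalRefs are made up for illustration. Note that the pattern restores decimal references only; hexadecimal forms such as &amp;#xFC; and ordinary &amp; sequences are left as they are.

```java
class UnescapeNumericRefs {
    // Turns "&amp;#NNN;" back into "&#NNN;" (decimal references only).
    // In the replacement string, $1 is the captured run of digits.
    static String unescapeDecimalRefs(String escaped) {
        return escaped.replaceAll("&amp;#(\\d+);", "&#$1;");
    }

    public static void main(String[] args) {
        // prints Beg&#252;nstigter
        System.out.println(unescapeDecimalRefs("Beg&amp;#252;nstigter"));
        // prints A &amp; B (no numeric reference follows, so nothing changes)
        System.out.println(unescapeDecimalRefs("A &amp; B"));
    }
}
```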

2
David Conrad On

Here is a minimal, reproducible example that shows escaping is not necessary before calling XMLStreamWriter::writeCharacters:

import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

StringWriter sw = new StringWriter();
XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(sw);
writer.writeStartDocument();
writer.writeStartElement("value");
writer.writeCharacters("<Begünstigter>");
writer.writeEndElement();
writer.writeEndDocument();
writer.close();
System.out.println(sw.toString());

You can run this in JShell; the output is:

<?xml version="1.0" ?><value>&lt;Begünstigter&gt;</value>

In short, XMLStreamWriter already knows how to write XML. You do not need to, and should not, escape text before passing it to the writeCharacters method.
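
Applied to the loop in the question, that means passing the raw value straight to writeCharacters. The following is a minimal sketch rather than the asker's actual code: the class RowWriter, the methods writeRow and render, the "ROW" tag, and the sample map are all hypothetical, and the database access is left out.

```java
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class RowWriter {
    // Writes one child element per map entry; null values become empty elements.
    static void writeRow(XMLStreamWriter writer, String rowTag, Map<String, Object> row)
            throws XMLStreamException {
        writer.writeStartElement(rowTag);
        for (Map.Entry<String, Object> entry : row.entrySet()) {
            if (entry.getValue() == null) {
                writer.writeEmptyElement(entry.getKey());
            } else {
                writer.writeStartElement(entry.getKey());
                // No StringEscapeUtils call: the writer escapes markup characters itself.
                writer.writeCharacters(entry.getValue().toString());
                writer.writeEndElement();
            }
        }
        writer.writeEndElement();
    }

    // Renders a single row to a string so the result can be inspected.
    static String render(Map<String, Object> row) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writeRow(writer, "ROW", row);
            writer.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("NAME", "Begünstigter & Co");
        row.put("NOTE", null);
        System.out.println(render(row));
    }
}
```

The ampersand in the sample value is escaped exactly once by the writer, and with a StringWriter as the target the 'ü' is written as-is, as in the example above.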

Note: some implementations might only escape the < (left angle bracket) and not the > (right angle bracket); the former is required to be encoded while the latter is optional, but the result will still be correctly encoded and will be parsed correctly by an XML parser.
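
One way to check that claim is to parse the writer's output again and compare it with the input. This is a sketch that assumes a standard StAX implementation such as the one bundled with the JDK; the class RoundTrip and its method roundTrip are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

class RoundTrip {
    // Writes the text into a <value> element, parses it back, and returns the parsed text.
    static String roundTrip(String text) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartElement("value");
            writer.writeCharacters(text); // unescaped input; the writer does the escaping
            writer.writeEndElement();
            writer.close();

            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(out.toString()));
            reader.nextTag();                       // advance to <value>
            String parsed = reader.getElementText(); // entity references are resolved here
            reader.close();
            return parsed;
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String original = "<Begünstigter> & Co";
        // prints true, whether or not the implementation escapes '>'
        System.out.println(original.equals(roundTrip(original)));
    }
}
```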