XSL Transformation, emoji and attributes

I'm running into a problem with emojis when generating HTML output through an XSL transformation under certain circumstances.

For instance, I've tested the following XSL with different transformation engines:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" encoding="UTF-8"/>
  <xsl:template match="/">
    <xsl:text disable-output-escaping="yes">&lt;!doctype html&gt;</xsl:text>
    <html>
      <head>
        <meta charset="UTF-8"/>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
      </head>
      <body>
        <textarea></textarea><br/>
        <input type="text" value=""/>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>

I tested with exactly the same JAXP-based code for all transformers. I only changed the class name of the transformer factory.

Saxon gives the correct result.

Java's internal repackaged transformer based on Xalan (com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl) is correct when the emoji is placed as text in the <textarea> body, but generates a wrong result for the <input> field: the emoji seems to be encoded incorrectly when it is placed in the value attribute.

Xalan 2.7.2 gives an even worse result.

For several reasons (mainly licensing), I would prefer to use the Xalan transformer. Any idea how I can make Xalan handle emojis correctly?

EDIT

The transformation is performed with the following code:

TransformerFactory factory = TransformerFactory.newInstance(
        "com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl",
        null);
Transformer transformer = factory.newTransformer(new StreamSource(xsl));
DocumentSource domSource = new DocumentSource(doc);
OutputStream stream = response.getOutputStream();

transformer.transform(domSource, new StreamResult(stream));

stream.flush();
stream.close();

where doc is a dom4j document, xsl is the InputStream containing the stylesheet above, and response is the HttpServletResponse object that receives the transformation result.

Martin Honnen On

I have tried the following stylesheet

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" encoding="UTF-8" doctype-system="about:legacy-compat"/>
  <xsl:template match="/">
    <html>
      <head>
      </head>
      <body>
        <textarea></textarea><br/>
        <input type="text" value=""/>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>

with Xalan 2.7.1 at http://xsltransform.net/, and both thumbs seem to be shown fine, i.e. the serialized HTML is

<!DOCTYPE HTML SYSTEM "about:legacy-compat">
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<textarea></textarea>
<br>
<input value="" type="text">
</body>
</html>

morbac On

After a day of research, I have come to the conclusion that this is a bug in Xalan's HTML serializer (line 1440 and following) in its handling of surrogate characters (a high surrogate between \ud800 and \udbff, followed by a low surrogate). As mentioned in the comments, Xalan 2.6.0 performs the transformation correctly, but Xalan 2.7.* does not.
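
To illustrate why surrogates matter here, the following is a minimal sketch and is not taken from Xalan's source. It assumes the emoji is a thumbs-up (U+1F44D), and the class and method names are hypothetical. In Java, a character outside the Basic Multilingual Plane occupies two char values. A serializer that escapes each char on its own emits two references to lone surrogates, which is one plausible way for this kind of output to go wrong, whereas escaping per code point yields a single valid reference.

```java
import java.nio.charset.StandardCharsets;

public class SurrogateDemo {

    /** Formats bytes as space-separated upper-case hex. */
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Valid escaping: one numeric character reference per code point. */
    public static String toCharRefs(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append("&#").append(cp).append(';'));
        return sb.toString();
    }

    /** Invalid escaping: one reference per UTF-16 code unit (lone surrogates). */
    public static String toCharRefsPerChar(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append("&#").append((int) c).append(';');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: U+1F44D (thumbs-up) as the sample emoji.
        String thumbsUp = "\uD83D\uDC4D";

        System.out.println(thumbsUp.length());                            // 2 chars
        System.out.println(thumbsUp.codePointCount(0, thumbsUp.length())); // 1 code point
        System.out.println(Character.isHighSurrogate(thumbsUp.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(thumbsUp.charAt(1)));  // true

        // Correct UTF-8 is a single four-byte sequence.
        System.out.println(hex(thumbsUp.getBytes(StandardCharsets.UTF_8))); // F0 9F 91 8D

        System.out.println(toCharRefs(thumbsUp));        // &#128077;
        System.out.println(toCharRefsPerChar(thumbsUp)); // &#55357;&#56397;
    }
}
```

Browsers do not recombine two separate surrogate references into one character, so the second form cannot display the emoji.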

Martin Honnen mentioned XALANJ-2419. I also found other tickets related to this issue (XALANJ-2617, https://github.com/apache/xalan-j/pull/4, etc.) and tried to apply some of the proposed fixes. For instance, the version suggested here does fix the issue for my <input> field, but the issue with the textarea remains.

I'll try to fork Xalan and fix the issue for both attributes and text. Meanwhile, the easiest workaround is to replace the "UTF-8" encoding with "UTF-16" in xsl:output. This fixes both issues.
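
If editing the stylesheet is not convenient, the same workaround can probably be applied from the JAXP side, because an output property set on the Transformer overrides the value declared in xsl:output. The sketch below is only an illustration under several assumptions: it uses the default factory returned by TransformerFactory.newInstance() instead of a specific Xalan build, an inline stylesheet with a thumbs-up (U+1F44D) as the sample emoji, a dummy <root/> input, and a hypothetical class name Utf16Workaround.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.Charset;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class Utf16Workaround {

    // Inline stand-in for the stylesheet from the question; it still declares UTF-8.
    static final String XSL =
          "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"html\" encoding=\"UTF-8\"/>"
        + "<xsl:template match=\"/\">"
        + "<html><body>"
        + "<textarea>\uD83D\uDC4D</textarea>"
        + "<input type=\"text\" value=\"\uD83D\uDC4D\"/>"
        + "</body></html>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Runs the stylesheet and serializes the result in the given encoding. */
    public static String transform(String encoding) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));

        // Overrides the encoding attribute of xsl:output for this run.
        transformer.setOutputProperty(OutputKeys.ENCODING, encoding);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(new StreamSource(new StringReader("<root/>")),
                              new StreamResult(out));

        // Decode with the same charset that was used for serialization.
        return new String(out.toByteArray(), Charset.forName(encoding));
    }

    public static void main(String[] args) throws TransformerException {
        System.out.println(transform("UTF-16"));
    }
}
```

In a servlet, the response headers would have to announce the same encoding, for example with response.setCharacterEncoding("UTF-16") before the output stream is obtained, otherwise the browser may decode the bytes with the wrong charset.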

morbac On

I finally decided to fork the xalan-java project and patch the serializer myself. With the patched build, emojis come out correctly in both attributes and text with UTF-8 XSL output.

The patch commit is https://github.com/morbac/xalan-java/commit/a685171e1b621e9b63c8507f467a395fd1fc96a4. It fixes the issue for both input and textarea. The jar with the fixed classes is available here