Unicode conversion in XML failed


In response to a web service call, I receive the following XML:

<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"><soap:Body><ns1:executeAPI xmlns:ns1="http://mobi.ce.webservices.inf.com/">
<Message>
<Body>
<P_NAME>Shampoo</P_NAME>
<P_DISC>SM - Premium Starter Kits Set / Anti Hair Loss &#x1d402;&#x1d428;&#x1d42b;&#x1d41a;&#x1d425; &#x1d402;&#x1d41a;&#x1d425;&#x1d41c;&#x1d422;&#x1d42e;&#x1d426; Shampoo + Treatment + Essence + Plasma Scalp Massager</P_DISC>
</Body>
</Message>
</ns1:executeAPI>
</soap:Body>
</soap:Envelope>

I then have to convert this XML to JSON for the next call. The transformation fails with this error:

F-XSLT 41252: XSLT transformation error: org.xml.sax.SAXParseException; Character reference "&#55349" is an invalid XML character.

I tried changing the content type to

application/xml; charset=UTF-16
application/xml; charset=UTF-8

I also tried simply passing it through XSLT to convert the Unicode character references to a string, but no luck.

Here is the XSLT:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="xml" indent="yes"/>

    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()"/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>

With this XSL, the Unicode character references

&#x1d402;&#x1d428;&#x1d42b;&#x1d41a;&#x1d425; &#x1d402;&#x1d41a;&#x1d425;&#x1d41c;&#x1d422;&#x1d42e;&#x1d426;

are converted to

&#55349;&#56322;&#55349;&#56360;&#55349;&#56363;&#55349;&#56346;&#55349;&#56357; &#55349;&#56322;&#55349;&#56346;&#55349;&#56357;&#55349;&#56348;&#55349;&#56354;&#55349;&#56366;&#55349;&#56358;

Any XSLT help on this?

Thanks all for the help. I found a solution; the XSL below works:

<?xml version="1.0"?>
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output indent="yes" method="xml" encoding="utf-8"/>
    <!-- template to copy elements -->
    <xsl:template match="*">
        <xsl:element name="{local-name()}">
            <xsl:apply-templates select="@* | node()"/>
        </xsl:element>
    </xsl:template>
</xsl:stylesheet>

There is 1 best solution below

In JavaScript,

String.fromCharCode(55349, 56322) === String.fromCodePoint(0x1d402)

In other words: the two decimal numbers 55349 and 56322 are the UTF-16 surrogate pair for the code point with the hexadecimal number 1d402.
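
The arithmetic behind that equality can be sketched as follows. The helper name toSurrogatePair is hypothetical and only serves as an illustration; it is not part of any library.

```javascript
// Hypothetical helper: split a supplementary code point (above U+FFFF)
// into its UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;      // 20-bit value
  const high = 0xD800 + (offset >> 10);    // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
  return [high, low];
}

const [highUnit, lowUnit] = toSurrogatePair(0x1d402);
console.log(highUnit, lowUnit); // 55349 56322
```

The same two numbers appear as the first two character references in the failing output, which suggests that each UTF-16 code unit was serialized as its own reference.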

But this is a UTF-16-specific concept and your XML input contains only US-ASCII characters. In any case,

<P_DISC>SM - Premium Starter Kits Set / Anti Hair Loss
&#55349;&#56322;&#55349;&#56360;&#55349;&#56363;&#55349;&#56346;&#55349;&#56357; &#55349;&#56322;&#55349;&#56346;&#55349;&#56357;&#55349;&#56348;&#55349;&#56354;&#55349;&#56366;&#55349;&#56358;
Shampoo + Treatment + Essence + Plasma Scalp Massager</P_DISC>

is not well-formed XML, because character references to surrogate code points (U+D800 to U+DFFF) are not allowed. I suspect that your (JavaScript-based?) XSLT processor does not handle surrogate pairs correctly. The incorrect handling seems to happen during <xsl:copy> of a text node (in your original transformation), but not when a text node is processed by the default template (reached through <xsl:apply-templates select="@* | node()"/> in your transformation that works).
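
If the stylesheet cannot be changed, one possible workaround is to repair the serialized output as plain text before it is parsed again, by merging each adjacent high/low surrogate reference pair into a single reference to the real code point. The following is only a sketch under assumptions: the function name mergeSurrogateReferences is hypothetical, and it assumes the broken references are decimal and directly adjacent, as in the output shown above.

```javascript
// Hypothetical helper: replace adjacent decimal character references that
// form a UTF-16 surrogate pair (e.g. "&#55349;&#56322;") with one
// hexadecimal reference to the real code point (e.g. "&#x1d402;").
function mergeSurrogateReferences(xmlText) {
  const pairPattern = /&#(\d+);&#(\d+);/g;
  let result = "";
  let position = 0; // start of the text not yet copied to result
  let match;
  while ((match = pairPattern.exec(xmlText)) !== null) {
    const high = Number(match[1]);
    const low = Number(match[2]);
    const isPair = high >= 0xD800 && high <= 0xDBFF &&
                   low >= 0xDC00 && low <= 0xDFFF;
    if (isPair) {
      const codePoint = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
      result += xmlText.slice(position, match.index) +
                "&#x" + codePoint.toString(16) + ";";
      position = pairPattern.lastIndex;
    } else {
      // Not a pair: retry from the second reference, which may itself
      // be the high half of a following pair.
      pairPattern.lastIndex = match.index + match[0].indexOf(";") + 1;
    }
  }
  return result + xmlText.slice(position);
}

console.log(mergeSurrogateReferences("&#55349;&#56322;&#55349;&#56360;"));
// &#x1d402;&#x1d428;
```

Ordinary references such as &#65; are left untouched, and the repair works on the raw text because an XML parser rejects the broken document before any DOM-level fix could run.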