Getting Apache Camel to honor the encoding declared in an XML file

I'm trying to parse a UTF-16 encoded document using the Apache Camel Splitter with xtokenize, which delegates to Woodstox (com.ctc.wstx.sr.BasicStreamReader). I cannot know the encoding of a file before I read it; currently some files are UTF-16 and others are UTF-8. The split is defined as:

.split().xtokenize(getToken(), 'w', NAMESPACES)

The problem I encounter is that Camel tells Woodstox which encoding to use:

String charset = IOHelper.getCharsetName(exchange);

It sets the default, UTF-8, as the encoding, so BasicStreamReader tries to read the BOM bytes as UTF-8 and fails with:

com.ctc.wstx.exc.WstxUnexpectedCharException: Unexpected character '�' (code 65533 / 0xfffd) in prolog; expected '<'
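The replacement character in that message can be reproduced outside Camel. The following is a minimal JDK-only sketch (the class and method names are made up for illustration, and it does not involve Woodstox): it decodes the first bytes of a little-endian UTF-16 document as UTF-8, which turns the BOM byte 0xFF into U+FFFD (code 65533), and then decodes the same bytes as UTF-16.

```java
import java.nio.charset.StandardCharsets;

public class BomAsUtf8Demo {
    // Start of a little-endian UTF-16 document: BOM (FF FE) followed by '<'.
    public static final byte[] UTF16LE_START = {(byte) 0xFF, (byte) 0xFE, 0x3C, 0x00};

    // Decodes the bytes with the wrong charset (UTF-8), which is what happens
    // when the charset is forced instead of detected.
    public static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Decodes the same bytes as UTF-16; the decoder uses the BOM to pick the
    // byte order and does not emit it as a character.
    public static String decodeAsUtf16(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        String wrong = decodeAsUtf8(UTF16LE_START);
        // 0xFF is never valid in UTF-8, so the first character is U+FFFD.
        System.out.println((int) wrong.charAt(0));
        System.out.println(decodeAsUtf16(UTF16LE_START));
    }
}
```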

As specified in https://www.w3.org/TR/xml/#sec-guessing, the XML parser (Woodstox) should be able to autodetect the file encoding, if only Camel would let it do the work.

Is there a way to avoid implementing the encoding detection myself?

There are 2 best solutions below

Best answer

I created a Camel JIRA ticket: https://issues.apache.org/jira/browse/CAMEL-11846. From my comments there you can see that there is no easy way to split UTF-16 XML with Camel without knowing in advance that it is UTF-16.

Subclassing XMLTokenExpressionIterator (which is an ExpressionAdapter) and switching it to an InputStream works at first, but there are several other places (XSLT, XPath, and conversion to StaxSource) where it breaks for the same reason.

As a workaround, I find it easier to let XmlStreamReader determine the encoding in advance (this happens at initialization) and then set the Exchange.CHARSET_NAME header or property.
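The detection step of that workaround can be sketched with the JDK alone. This is not XmlStreamReader itself but a hand-rolled stand-in, loosely following the byte patterns in https://www.w3.org/TR/xml/#sec-guessing and assuming only UTF-8 and UTF-16 input, as in the question. The class and method names are hypothetical, and the Camel call is only indicated in a comment.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class XmlCharsetSniffer {

    // Guesses the charset from the first bytes of the document.
    // Assumption: only UTF-8 and UTF-16 occur; everything else maps to UTF-8.
    public static String guess(byte[] b, int len) {
        if (len >= 2) {
            int b0 = b[0] & 0xFF;
            int b1 = b[1] & 0xFF;
            // With a BOM, "UTF-16" lets the Java decoder pick the byte order.
            if (b0 == 0xFE && b1 == 0xFF) return "UTF-16";
            if (b0 == 0xFF && b1 == 0xFE) return "UTF-16";
            // No BOM: the position of the zero byte next to '<' gives the order.
            if (b0 == 0x00 && b1 == 0x3C) return "UTF-16BE";
            if (b0 == 0x3C && b1 == 0x00) return "UTF-16LE";
        }
        return "UTF-8";
    }

    // Peeks at the first bytes and rewinds, so the same stream can still be
    // handed to the parser afterwards.
    public static String sniff(BufferedInputStream in) {
        try {
            byte[] head = new byte[4];
            in.mark(head.length);
            int len = 0;
            while (len < head.length) {
                int n = in.read(head, len, head.length - len);
                if (n < 0) break; // end of stream
                len += n;
            }
            in.reset();
            return guess(head, len);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Java's UTF-16 encoder writes a big-endian BOM (FE FF) first.
        byte[] doc = "<?xml version=\"1.0\" encoding=\"UTF-16\"?><a/>"
                .getBytes(StandardCharsets.UTF_16);
        BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(doc));
        String charset = sniff(in);
        // In a Camel route this value would be stored before the splitter
        // runs, e.g. exchange.setProperty(Exchange.CHARSET_NAME, charset).
        System.out.println(charset);
    }
}
```

Because the stream is rewound after the peek, the splitter still sees the document from its first byte.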

Other answer

Okay, I can see that the current source code will fall back to the platform encoding, so your use case, with the encoding provided in the XML declaration, is not supported.

I am not sure Camel really needs to fall back to a default platform encoding, as it uses java.util.Scanner in the splitter, which supports scanning without a specific encoding.

Maybe you can try patching the source code of XMLTokenExpressionIterator, test it locally, and report back here.

We can then likely look at making the fallback encoding optional in Apache Camel.

In your current version of Apache Camel, you can always extend XMLTokenExpressionIterator, override the doEvaluate method, and call the createIterator method without a charset parameter. Then use your custom iterator with the Camel splitter.