Woodstox StAX - How to turn off text content validation?


I'm reading an XML file using the default Woodstox EventReader, e.g.:

XMLInputFactory.newInstance().createXMLEventReader(new FileInputStream(fileName));

If an input file happens to have the Unicode NULL character in some textual content, the following Exception/Stacktrace occurs:

WstxUnexpectedCharException.<init>(String, Location, char) line: 17 
ValidatingStreamReader(StreamScanner).constructNullCharException() line: 604    
ValidatingStreamReader(StreamScanner).throwInvalidSpace(int, boolean) line: 633 
ValidatingStreamReader(BasicStreamReader).readTextSecondary(int, boolean) line: 4624    
ValidatingStreamReader(BasicStreamReader).finishToken(boolean) line: 3661   
ValidatingStreamReader(BasicStreamReader).next() line: 1063 
WstxEventReader(Stax2EventReaderImpl).nextEvent() line: 255 

I'd like to avoid validating textual content. Setting IS_VALIDATING on the XMLInputFactory does not solve the problem.

After inspecting the source code, it looks like BasicStreamReader's next() refers to the "mValidateText" variable to determine whether to validate or not.

From the Source:

/**
 * Flag that indicates that textual content (CDATA, CHARACTERS) is to
 * be validated within current element's scope. Enabled if one of
 * validators returns {@link XMLValidator#CONTENT_ALLOW_VALIDATABLE_TEXT},
 * and will prevent lazy parsing of text.
 */
protected boolean mValidateText = false;

I can't figure out how to change or set this value through the InputFactory or the EventReader. Perhaps I need to direct the InputFactory to use the TypedStreamReader instead of the ValidatingStreamReader?


There are 2 best solutions below


That is not validation but a basic well-formedness problem. Validation works against a schema (DTD, RelaxNG or XML Schema), which can define a specific structure or specific values for textual content. Validation-related settings will therefore have no effect here: validation only comes into play once the content is already well-formed XML, and a NULL character is never allowed in well-formed XML.

What you need to do is pre-process the content to remove or replace the small number of characters that are illegal in XML. This includes the NULL character (0 byte).
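As a rough sketch of that pre-processing step, assuming you can hand the parser a Reader instead of the raw FileInputStream: the class names XmlCharFilterReader and FilterDemo below are made up for illustration and are not part of Woodstox or StAX. The filter replaces the control characters that XML 1.0 forbids with a space; it does not try to detect unpaired surrogates. Note that once you wrap the input in a Reader you choose the character encoding yourself, since the parser can no longer sniff it from the bytes.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

/** Hypothetical helper: replaces characters that XML 1.0 forbids with a space. */
class XmlCharFilterReader extends FilterReader {

    XmlCharFilterReader(Reader in) {
        super(in);
    }

    /** True for control characters (other than tab, LF, CR) and U+FFFE/U+FFFF. */
    static boolean isIllegal(char c) {
        return (c < 0x20 && c != '\t' && c != '\n' && c != '\r')
                || c == 0xFFFE || c == 0xFFFF;
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        return (c != -1 && isIllegal((char) c)) ? ' ' : c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // When n is -1 (end of stream) the loop body never runs.
        for (int i = off; i < off + n; i++) {
            if (isIllegal(buf[i])) {
                buf[i] = ' ';
            }
        }
        return n;
    }
}

/** Hypothetical demo: parses a string through the filter and collects its text. */
class FilterDemo {
    static String textOf(String xml) throws XMLStreamException {
        XMLEventReader events = XMLInputFactory.newInstance()
                .createXMLEventReader(new XmlCharFilterReader(new StringReader(xml)));
        StringBuilder text = new StringBuilder();
        while (events.hasNext()) {
            XMLEvent event = events.nextEvent();
            if (event.isCharacters()) {
                text.append(event.asCharacters().getData());
            }
        }
        return text.toString();
    }
}
```

For a file you would pass something like new XmlCharFilterReader(new InputStreamReader(new FileInputStream(fileName), StandardCharsets.UTF_8)), assuming the file really is UTF-8. Filtering at the character level rather than the byte level matters: in a UTF-16 file, zero bytes are a normal part of the encoding and must not be stripped.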


A conformant XML parser is required to reject ill-formed content. You need to fix your (non-)XML, and let the parser do its job.