I'm reading an XML file using the default Woodstox EventReader, e.g.:
XMLInputFactory.newInstance().createXMLEventReader(new FileInputStream(fileName));
If an input file happens to contain the Unicode NULL character (U+0000) in some textual content, the following exception is thrown (stack trace):
WstxUnexpectedCharException.<init>(String, Location, char) line: 17
ValidatingStreamReader(StreamScanner).constructNullCharException() line: 604
ValidatingStreamReader(StreamScanner).throwInvalidSpace(int, boolean) line: 633
ValidatingStreamReader(BasicStreamReader).readTextSecondary(int, boolean) line: 4624
ValidatingStreamReader(BasicStreamReader).finishToken(boolean) line: 3661
ValidatingStreamReader(BasicStreamReader).next() line: 1063
WstxEventReader(Stax2EventReaderImpl).nextEvent() line: 255
I'd like to avoid validating textual content. Setting IS_VALIDATING on the XMLInputFactory does not solve the problem.
After inspecting the source code, it looks like BasicStreamReader's next() refers to the "mValidateText" variable to determine whether to validate or not.
From the Source:
/**
* Flag that indicates that textual content (CDATA, CHARACTERS) is to
* be validated within current element's scope. Enabled if one of
* validators returns {@link XMLValidator#CONTENT_ALLOW_VALIDATABLE_TEXT},
* and will prevent lazy parsing of text.
*/
protected boolean mValidateText = false;
I can't figure out how to change or set this value through the InputFactory or the EventReader. Perhaps I need to direct the InputFactory to use the TypedStreamReader instead of the ValidatingStreamReader?
That is not a validation problem but a basic well-formedness problem. Validation is done against schemas such as DTD, RelaxNG or XML Schema, which can define specific structure or values for textual content. Validation-related settings therefore have no effect here: validation only comes into play once the content is already well-formed XML, and a NULL character is not allowed anywhere in an XML document.
What you need to do is pre-process the content to remove or replace the small number of characters that are illegal in XML. This includes the NULL character (a 0 byte).
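One way to do that pre-processing is to wrap the input in a Reader that drops illegal characters before the parser sees them. The following is only a rough sketch, not something Woodstox provides: the class name XmlCharFilterReader and its helper stripIllegal are made up for illustration, and it assumes XML 1.0 character rules and that any surrogate pairs in the input are well-formed (surrogates are passed through unchecked).

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper (not part of Woodstox or StAX): drops characters
// that XML 1.0 does not allow, such as NUL, from the wrapped Reader.
final class XmlCharFilterReader extends FilterReader {

    XmlCharFilterReader(Reader in) {
        super(in);
    }

    // XML 1.0 allows tab, LF, CR and everything from 0x20 upwards except
    // 0xFFFE and 0xFFFF. Assumption: surrogates are let through unchecked.
    static boolean isAllowed(char c) {
        if (c == 0x9 || c == 0xA || c == 0xD) {
            return true;
        }
        if (c < 0x20) {
            return false;
        }
        return c != 0xFFFE && c != 0xFFFF;
    }

    @Override
    public int read() throws IOException {
        int c;
        while ((c = super.read()) != -1) {
            if (isAllowed((char) c)) {
                return c;
            }
        }
        return -1;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int kept;
        do {
            int n = super.read(buf, off, len);
            if (n == -1) {
                return -1;
            }
            // Compact the allowed characters to the front of the region.
            kept = 0;
            for (int i = off; i < off + n; i++) {
                if (isAllowed(buf[i])) {
                    buf[off + kept++] = buf[i];
                }
            }
        } while (kept == 0); // chunk held only illegal chars: read again
        return kept;
    }

    // Convenience for small inputs and for trying the filter out.
    static String stripIllegal(String s) {
        try (Reader r = new XmlCharFilterReader(new StringReader(s))) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4];
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

It could then be plugged in with something like XMLInputFactory.newInstance().createXMLEventReader(new XmlCharFilterReader(new InputStreamReader(new FileInputStream(fileName), StandardCharsets.UTF_8))). Note that handing the parser a Reader means you pick the encoding yourself (UTF-8 is assumed here) instead of letting the parser detect it from the XML declaration. Also, silently dropping characters changes the data, so replacing them with a placeholder may be preferable in some cases.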