I'm using Woodstox to process an XML that contains some entities (most notably >) in the value of one of the nodes. To use an extreme example, it's something like this:
<parent> < > & " ' </parent>
I have tried a lot of different configuration options for both WstxInputFactory (IS_REPLACING_ENTITY_REFERENCES, P_TREAT_CHAR_REFS_AS_ENTS, P_CUSTOM_INTERNAL_ENTITIES...) and WstxOutputFactory, but no matter what I try, the output is always something like this:
<parent>nbsp; < nbsp; > & " ' nbsp;</parent>
(> gets converted to >, < stays the same, loses the &...)
I'm reading the XML with an XMLEventReader created with
XMLEventReader reader = wstxInputFactory.createXMLEventReader(new StringReader(fulltext));
after configuring the WstxInputFactory.
Is there any way to configure Woodstox to just ignore all entities and output the text exactly as it was in the input String?
The basic five XML entities (quot, amp, apos, lt, gt) will be always processed. As far as I know there is no way to get the source of them with Sax.
For the other entities you can process them manually. You can capture the events until the end of the element and concatenate the values:
For your example input:
The output will be:
Otherwise if you want to convert all the entities maybe you could use a custom XmlResolver for undeclared entities:
And then tell Woodstox to use it for the undeclared entities: