I'm using Dom4j to parse HTML documents. Dom4j expects XML, so HTML entities are not declared. They could be declared in the document's DTD, but I am parsing external input, so that's not appropriate. I'd rather declare them programmatically in the parser.
Here's my code:
// Read.
final DocumentFactory df = DOMDocumentFactory.getInstance();
SAXReader reader = new SAXReader();
Document doc, outDoc;
try {
doc = reader.read( new StringReader(htmlStr) );
}
catch( Exception ex ){
throw new RuntimeException("Error parsing the HTML:\n " + ex, ex); // keep the original exception as the cause
}
I see that SAXReader has reader.setEntityResolver( ??? ), but that doesn't seem to be the solution, because the method to override looks like this:
public InputSource resolveEntity(String publicId, String systemId) throws SAXException, IOException
What I am looking for is something like this (hypothetical API):
reader.setTrueEntityResolver( new EntityResolver(){
public InputStream resolve( String name ){ ... }
});
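For context, here is a minimal sketch of what resolveEntity() is actually called for: once per external DTD or external entity, identified by publicId/systemId, and not once per entity name. The sketch uses the JDK's DocumentBuilder instead of Dom4j so that it runs without extra dependencies; SAXReader.setEntityResolver() accepts the same org.xml.sax.EntityResolver. The class name, the systemId urn:x-hypothetical:html-entities, and the two-entity list are assumptions made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

final class HtmlEntityResolverSketch {

    // Hypothetical, deliberately tiny entity set; a real one would list all HTML 4 entities.
    static final String ENTITY_DECLS =
        "<!ENTITY nbsp \"&#160;\">\n" +
        "<!ENTITY copy \"&#169;\">\n";

    // Ignores publicId/systemId and always serves the declarations above as the external DTD.
    static final EntityResolver RESOLVER =
        (publicId, systemId) -> new InputSource(new StringReader(ENTITY_DECLS));

    // Parses the XML string and returns the text content of the root element.
    static String parseText(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setEntityResolver(RESOLVER);
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception ex) {
            throw new RuntimeException("Error parsing the HTML:\n " + ex, ex);
        }
    }
}
```

Calling HtmlEntityResolverSketch.parseText() on a document that starts with `<!DOCTYPE html SYSTEM "urn:x-hypothetical:html-entities">` should resolve `&nbsp;` and `&copy;`, because the DOCTYPE gives the parser an external subset to ask the resolver for.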
I've found a possible solution at http://evc-cit.info/dom4j/dom4j_groovy.html, which suggests adding XML Commons Catalog support.
However, that seems like overkill, as there's no doctype specified anyway, and I only intend to resolve the common HTML 4 entities.
Update: It turned out that without an explicit DOCTYPE declaration, this doesn't have any effect: the EntityResolver is never called.
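One possible workaround, sketched here under assumptions and not taken from the linked page: since the resolver is only consulted when a DOCTYPE is present, the input can be given a DOCTYPE with an internal subset that declares the entities directly, so no resolver is needed at all. The class name, the helper withEntityDoctype(), and the two-entity list are hypothetical, and the `<!DOCTYPE` check is a crude string test. The JDK's DocumentBuilder is used so the sketch runs without Dom4j.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class DoctypeInjectionSketch {

    // Hypothetical, deliberately tiny list; extend with the HTML 4 entities actually needed.
    static final String ENTITY_DECLS =
        "<!ENTITY nbsp \"&#160;\"><!ENTITY copy \"&#169;\">";

    // Returns the input unchanged if it already has a DOCTYPE; otherwise inserts one
    // whose internal subset declares the entities, after the XML declaration if any.
    static String withEntityDoctype(String xml, String rootName) {
        if (xml.contains("<!DOCTYPE")) {
            return xml;
        }
        String doctype = "<!DOCTYPE " + rootName + " [" + ENTITY_DECLS + "]>";
        int declEnd = xml.startsWith("<?xml") ? xml.indexOf("?>") : -1;
        if (declEnd < 0) {
            return doctype + xml;
        }
        int insertAt = declEnd + 2; // position just past "?>"
        return xml.substring(0, insertAt) + doctype + xml.substring(insertAt);
    }

    // Parses the (possibly augmented) XML and returns the root element's text content.
    static String parseText(String xml, String rootName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(withEntityDoctype(xml, rootName))));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception ex) {
            throw new RuntimeException("Error parsing the HTML:\n " + ex, ex);
        }
    }
}
```

With Dom4j the same string could presumably be passed on as reader.read( new StringReader( withEntityDoctype(htmlStr, "html") ) ). The trade-off is that the input is modified before parsing, which may matter if the original DOCTYPE-less text has to be preserved.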
Maven dep:
Config in /CatalogManager.properties on classpath:
Tell the parser to use the catalog resolver when it encounters the DTD: