Dom4j parsing - How to declare HTML entities programmatically? "The entity "nbsp" was referenced, but not declared."


I'm using Dom4j to parse HTML documents. Dom4j expects XML, so HTML entities such as &nbsp; are not declared. It's possible to declare them in the document's DTD, but I am parsing external input, so that's not appropriate. I'd rather declare them programmatically in the parser.

Here's my code:

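    // Needs: org.dom4j.Document, org.dom4j.DocumentFactory,
    //        org.dom4j.dom.DOMDocumentFactory, org.dom4j.io.SAXReader,
    //        java.io.StringReader

    // Read.
    final DocumentFactory df = DOMDocumentFactory.getInstance();
    SAXReader reader = new SAXReader();
    Document doc, outDoc;
    try {
        doc = reader.read( new StringReader(htmlStr) );
    }
    catch( Exception ex ){
        // Pass the original exception as the cause so its stack trace is kept.
        throw new RuntimeException("Error parsing the HTML:\n       " + ex.toString(), ex );
    }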

I see that SAXReader has reader.setEntityResolver( ??? ), but it doesn't seem to be the solution, because the overridable method resolves external entities by public and system ID rather than by entity name:

    public InputSource resolveEntity(String publicId, String systemId) throws SAXException, IOException

What I am looking for is something like

    // Hypothetical API, does not exist in Dom4j:
    reader.setTrueEntityResolver( new EntityResolver(){
        public InputStream resolve( String name ){ ... }
    });

There are 2 best solutions below

Ondra Žižka On

I've found a possible solution at http://evc-cit.info/dom4j/dom4j_groovy.html, which suggests adding the XML Commons catalog resolver.

However, that seems like overkill, as there's no doctype specified anyway, and I only intend to resolve the common HTML 4 entities.
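
For just a handful of entities, one lighter option might be to prepend a DOCTYPE with an internal subset to the input string before parsing. This is only a sketch under assumptions: it uses the JDK's built-in parser instead of Dom4j so that it runs standalone, the class and method names (InlineEntities, withEntities, textOf) are made up for illustration, and it assumes the input is well-formed XML without its own XML declaration or DOCTYPE. Only three of the roughly 250 HTML 4 entities are declared here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

public class InlineEntities {
    // A few HTML 4 entities as examples; extend the list as needed.
    static final String DECLS =
        "<!ENTITY nbsp \"&#160;\">" +
        "<!ENTITY copy \"&#169;\">" +
        "<!ENTITY eacute \"&#233;\">";

    /** Prepends a DOCTYPE whose internal subset declares the entities above. */
    static String withEntities(String rootName, String body) {
        return "<!DOCTYPE " + rootName + " [" + DECLS + "]>" + body;
    }

    /** Parses the wrapped input and returns the text content of the root element. */
    static String textOf(String rootName, String body) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        InputSource in = new InputSource(new StringReader(withEntities(rootName, body)));
        return builder.parse(in).getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Without the prepended DOCTYPE this input fails with
        // 'The entity "eacute" was referenced, but not declared.'
        System.out.println(textOf("p", "<p>caf&eacute;&nbsp;&copy;</p>"));
    }
}
```

With Dom4j, the string returned by withEntities(...) could presumably be passed to reader.read( new StringReader(...) ) in the same way.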

Update: It turned out that without an explicit DOCTYPE declaration this has no effect: the EntityResolver is never called.
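
To illustrate that behaviour, here is a small sketch that counts how often an EntityResolver is consulted. It is not Dom4j code: it uses the JDK's DocumentBuilder so it can run on its own, and the class name ResolverDemo, the DTD URL, and the one-line "DTD" are assumptions made up for the example. The resolver ignores the IDs it is given and always returns the same entity declarations.

```java
import java.io.StringReader;
import java.util.concurrent.atomic.AtomicInteger;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class ResolverDemo {
    // Stand-in for an external DTD: only the entity declaration we need.
    static final String ENTITY_DECLS = "<!ENTITY nbsp \"&#160;\">";

    /**
     * Parses xml and returns the root element's text, or null if parsing fails.
     * Every call to the entity resolver increments resolverCalls.
     */
    static String parse(String xml, AtomicInteger resolverCalls) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors instead of printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        builder.setEntityResolver((publicId, systemId) -> {
            resolverCalls.incrementAndGet();
            // Serve the declarations from memory; nothing is fetched from systemId.
            return new InputSource(new StringReader(ENTITY_DECLS));
        });
        try {
            return builder.parse(new InputSource(new StringReader(xml)))
                          .getDocumentElement().getTextContent();
        } catch (SAXException e) {
            return null; // e.g. an undeclared entity
        }
    }

    public static void main(String[] args) throws Exception {
        String body = "<p>a&nbsp;b</p>";
        AtomicInteger calls = new AtomicInteger();

        // With a DOCTYPE that names an external DTD, the resolver is consulted.
        String dtd = "<!DOCTYPE p SYSTEM \"http://example.invalid/html-entities.dtd\">";
        System.out.println(parse(dtd + body, calls) != null);
        System.out.println("resolver calls: " + calls.get());

        // Without a DOCTYPE there is nothing to resolve, so the parse fails.
        calls.set(0);
        System.out.println(parse(body, calls) != null);
        System.out.println("resolver calls: " + calls.get());
    }
}
```

So the resolver only helps if a DOCTYPE with an external identifier is present in the input, or is added to it before parsing.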

Maven dep:

    <dependency>
        <groupId>xml-resolver</groupId>
        <artifactId>xml-resolver</artifactId>
        <version>1.2</version>
        <scope>test</scope>
    </dependency>

Configuration in /CatalogManager.properties on the classpath:

    # allow location to be relative to this file's directory
    relative-catalogs=yes

    # A semicolon-delimited list of catalog files.
    # In this instance, we have a single catalog file, and it's a relative path name
    catalogs=sgml-lib/xml.soc

    # no debugging messages, please
    verbosity=0

    # Use the SYSTEM identifier
    prefer=system

Tell the parser to use the catalog resolver when it encounters the DTD:

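    // Needs: org.apache.xml.resolver.CatalogManager, org.apache.xml.resolver.tools.CatalogResolver
    // cMgr was not shown originally; assuming a default CatalogManager,
    // which reads CatalogManager.properties from the classpath.
    CatalogManager cMgr = new CatalogManager();
    CatalogResolver cResolver = new CatalogResolver( cMgr );
    SAXReader reader = new SAXReader();
    reader.setEntityResolver( cResolver );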
forty-two On

Well, as you said, DOM4J is not meant to parse HTML. I would rather use something like TagSoup or HtmlCleaner. It's not just the entities: HTML is not XML.