JDOM 1.1: hyphen is not a valid comment character


I'm using TagSoup to clean some HTML I'm scraping from the internet, and I'm getting the following error when parsing pages that contain comments:

The data "- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - " is not legal for a JDOM comment: Comment data cannot start with a hyphen.

I'm using JDOM 1.1, and here's the code that does the actual cleaning:

    // Use TagSoup as the underlying SAX parser so malformed HTML gets repaired.
    SAXBuilder builder = new org.jdom.input.SAXBuilder("org.ccil.cowan.tagsoup.Parser");
    // Don't fetch the doctype! At our usage rate, we'd get 503 responses
    // from the W3C.
    builder.setEntityResolver(dummyEntityResolver);
    Reader in = new StringReader(str);
    org.jdom.Document doc = builder.build(in);
    // Serialize the cleaned tree back to an XML string.
    String cleanXmlDoc = new org.jdom.output.XMLOutputter().outputString(doc);

Any idea what's going wrong, or how to fix this? I need to be able to parse pages that contain long comment strings such as <!--------- data ------------>.

Best answer

An XML/HTML/SGML comment begins with --, ends with --, and does not contain -- anywhere in between. A comment declaration (the whole <! ... > construct) contains zero or more such comments.

Your example string can be reformatted as:

<!----
  ----
  - data
  ----
  ----
  ---->

As you can see, - data is not a valid comment, and therefore the document is not valid HTML. In your specific case you can probably fix it by replacing every match of the regular expression /<!--.*?-->/ with the empty string before parsing (let . match newlines so that multi-line comments are caught too), but be aware that this change might also break some valid documents.
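
A minimal sketch of that regex workaround, assuming the page is already in memory as a String and that losing all comments is acceptable. The class name CommentStripper and the method stripComments are hypothetical names chosen for illustration; only java.util.regex is used, and the result would then be passed to the StringReader in the question's code.

```java
import java.util.regex.Pattern;

public class CommentStripper {
    // Reluctant match from "<!--" to the nearest "-->".
    // DOTALL lets '.' match line terminators, so multi-line comments are removed too.
    private static final Pattern COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    /** Returns the input with every "<!-- ... -->" span removed (hypothetical helper). */
    public static String stripComments(String html) {
        return COMMENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<p>before</p><!--------- data ------------><p>after</p>";
        // prints: <p>before</p><p>after</p>
        System.out.println(stripComments(page));
    }
}
```

Because the pattern knows nothing about HTML structure, it will also cut anything that merely looks like a comment, for example text inside a script block or a CDATA section, which is the kind of breakage the caveat above refers to.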