How to clean HTML before parsing it with HtmlUnit

I am scraping HTML with HtmlUnit, but the HTML is malformed: a few tags are left unclosed, so HtmlUnit gives wrong results. I need to clean the markup before passing it to HtmlUnit.

How can I do that?

A short code snippet or tutorial would be appreciated.

There is 1 solution below

I believe you could do this by implementing your own WebConnectionWrapper. You will also need an HTML library that repairs the markup properly (if that is possible for your pages). The wrapper then only has to pass each response body through that library, so that the HTML is already cleaned by the time it reaches HtmlUnit's parser.
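
A minimal sketch of that idea follows. It assumes jsoup as the cleaning library (the answer does not name one, so this is only one possible choice) and the `com.gargoylesoftware.htmlunit` package names of HtmlUnit 2.x; in HtmlUnit 3.x the same classes live under `org.htmlunit`. The class name `CleaningConnection` and the helper `clean` are made up for illustration, and whether jsoup repairs your particular pages the way you need has to be checked against real input.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.WebResponse;
import com.gargoylesoftware.htmlunit.WebResponseData;
import com.gargoylesoftware.htmlunit.util.NameValuePair;
import com.gargoylesoftware.htmlunit.util.WebConnectionWrapper;

// Hypothetical wrapper: cleans every text/html response before HtmlUnit parses it.
public class CleaningConnection extends WebConnectionWrapper {

    public CleaningConnection(WebClient client) {
        // The superclass constructor wraps the client's current connection
        // and installs this wrapper on the client.
        super(client);
    }

    @Override
    public WebResponse getResponse(WebRequest request) throws IOException {
        WebResponse response = super.getResponse(request);

        // Leave scripts, stylesheets, images, etc. untouched.
        if (!"text/html".equals(response.getContentType())) {
            return response;
        }

        String cleaned = clean(response.getContentAsString());
        byte[] body = cleaned.getBytes(response.getContentCharset());

        // The new body is plain, uncompressed text of a different length,
        // so drop the headers that described the original body.
        List<NameValuePair> headers = new ArrayList<>();
        for (NameValuePair header : response.getResponseHeaders()) {
            String name = header.getName();
            if (!"Content-Encoding".equalsIgnoreCase(name)
                    && !"Content-Length".equalsIgnoreCase(name)) {
                headers.add(header);
            }
        }

        WebResponseData data = new WebResponseData(
                body, response.getStatusCode(), response.getStatusMessage(), headers);
        return new WebResponse(data, request, response.getLoadTime());
    }

    // Assumption: jsoup is the cleaning library. It parses leniently and
    // serializes a document in which every open element is closed.
    public static String clean(String html) {
        return Jsoup.parse(html).outerHtml();
    }
}
```

To use it, create the wrapper once right after the client, for example `WebClient client = new WebClient(); new CleaningConnection(client);`, and then call `client.getPage(url)` as usual. Swapping in a different cleaner should only require changing the body of `clean`.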