I am scraping HTML using HtmlUnit, but the HTML is malformed: a few tags are left unclosed, so HtmlUnit gives wrong results. I need to clean the HTML before passing it to HtmlUnit.
How can I do that?
A short code snippet or tutorial would be appreciated.
I believe you could do this by implementing your own WebConnectionWrapper. You'll then need to find an HTML library that repairs the markup properly (if possible). After that, all you have to do is make sure the wrapper passes the content through that library, so that by the time it reaches HtmlUnit's parser the HTML has already been cleaned.
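Something along these lines might work as a starting point. It is only a sketch under a few assumptions: it uses jsoup as the cleaning library (my choice, any library that can re-serialize broken markup would do), it assumes HtmlUnit 2.x package names (`com.gargoylesoftware.htmlunit`; in 3.x they moved to `org.htmlunit`) and a version where `getContentCharset()` returns a `Charset`. The class name `CleaningConnection` and the URL are made up for illustration.

```java
import java.io.IOException;

import org.jsoup.Jsoup;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.WebResponse;
import com.gargoylesoftware.htmlunit.WebResponseData;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.util.WebConnectionWrapper;

// Hypothetical wrapper: rewrites HTML responses before HtmlUnit parses them.
public class CleaningConnection extends WebConnectionWrapper {

    public CleaningConnection(WebClient webClient) {
        // The super constructor wraps the client's current connection
        // and installs this wrapper on the client.
        super(webClient);
    }

    @Override
    public WebResponse getResponse(WebRequest request) throws IOException {
        WebResponse response = super.getResponse(request);

        // Leave scripts, stylesheets, images etc. untouched.
        String contentType = response.getContentType();
        if (contentType == null || !contentType.startsWith("text/html")) {
            return response;
        }

        // Assumption: jsoup does the repair. Parsing and re-serializing
        // yields markup in which every element is properly closed.
        String raw = response.getContentAsString();
        String cleaned = Jsoup.parse(raw, request.getUrl().toExternalForm()).outerHtml();

        // Re-encode with the charset HtmlUnit will use to decode the body,
        // and keep the original status line and headers.
        WebResponseData data = new WebResponseData(
                cleaned.getBytes(response.getContentCharset()),
                response.getStatusCode(),
                response.getStatusMessage(),
                response.getResponseHeaders());
        return new WebResponse(data, request, response.getLoadTime());
    }

    public static void main(String[] args) throws IOException {
        try (WebClient webClient = new WebClient()) {
            new CleaningConnection(webClient);
            // Placeholder URL, replace with the page you are scraping.
            HtmlPage page = webClient.getPage("http://example.com/");
            System.out.println(page.asXml());
        }
    }
}
```

Note that the cleaning library decides where the missing closing tags go, so it is worth checking that the repaired structure matches what you expect before relying on your XPath or selector queries.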