I'm parsing some pretty bad HTML with HtmlCleaner. I've had good success until I noticed that, in some elements, the attribute values contain a literal "<".
Ex:
<a href="#Anchor-<ht-42368">40</a>
comes out as
<a href="#Anchor-">
<ht-42368>40</ht-42368>
</a>
The original renders fine in the browser, but HtmlCleaner treats the "<" as the start of a new tag. It inserts '">' to close the attribute and the tag before beginning the new one, which I don't want.
What is the best way to fix this? I'm not sure whether HtmlCleaner has any properties I can configure to handle it. If not, how should I preprocess the HTML to fix these characters?
EDIT: fixed example
EDIT: I'm thinking I could apply a replaceAll() with a regex before passing the input to HtmlCleaner. Maybe match something like ="[^"]*" and check whether the match contains "<"; if it does, replace the "<" with its HTML entity (&lt;). Would that work?
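Here is a minimal sketch of that preprocessing idea in Java, in case it helps make the question concrete. The class name AttributeEscaper and the method escapeAngleBrackets are hypothetical names made up for illustration; they are not part of HtmlCleaner. A plain replaceAll() cannot easily rewrite only part of a match, so the sketch uses a Matcher loop instead. It assumes attribute values are always double-quoted and that the text never contains ="..." outside of a tag (for example inside a script), which may not hold for badly formed input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HtmlCleaner.
final class AttributeEscaper {

    // Matches ="..." : an equals sign followed by a double-quoted value.
    // Assumption: values are double-quoted and contain no embedded quotes.
    private static final Pattern QUOTED_VALUE = Pattern.compile("=\"([^\"]*)\"");

    // Returns the input with every "<" inside a double-quoted value
    // replaced by the entity &lt;. Everything else is copied unchanged.
    static String escapeAngleBrackets(String html) {
        Matcher m = QUOTED_VALUE.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String escaped = m.group(1).replace("<", "&lt;");
            // quoteReplacement keeps "$" and "\" in the value literal.
            m.appendReplacement(out, Matcher.quoteReplacement("=\"" + escaped + "\""));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The idea would be to call AttributeEscaper.escapeAngleBrackets(rawHtml) and hand the result to HtmlCleaner instead of the raw string. Single-quoted or unquoted attribute values are not handled by this pattern.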