Parsing HTML in Java with HtmlCleaner: how can I recognize a "<" character within attributes?

I'm parsing some pretty bad HTML. I had good success until I noticed that, in some elements, the attribute values contain "<".

Ex:

<a href="#Anchor-<ht-42368">40</a>

results in

<a href="#Anchor-">
    <ht-42368>40</ht-42368>
</a>

A browser renders this fine, but HtmlCleaner treats the "<" as the start of a new tag. It inserts "> before beginning the new tag, which closes the attribute and the <a> tag early. I don't want that.

What is the best way to fix this? I'm not sure whether HtmlCleaner has any properties I can configure to handle this. If not, how should I preprocess the HTML to fix these characters?

EDIT: fixed example

EDIT: I'm thinking I could apply a regex-based replacement before passing the input to HtmlCleaner. Maybe match something like ="[^"]*", check whether the match contains "<", and if it does, replace the "<" with its escaped HTML entity (&lt;). Would that work?
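
A minimal sketch of that preprocessing idea, using only java.util.regex. The class name AttributeEscaper and the method escapeLtInAttributes are hypothetical names chosen for illustration. The sketch assumes attribute values are always double-quoted and that the pattern ="..." does not also occur in ordinary text or script content, where it would be rewritten as well.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HtmlCleaner.
final class AttributeEscaper {

    // Matches ="..." : an equals sign followed by a double-quoted value.
    // Assumption: values are double-quoted and contain no embedded double quotes.
    private static final Pattern QUOTED_VALUE = Pattern.compile("=\"([^\"]*)\"");

    // Returns the input with every "<" inside a double-quoted value replaced by "&lt;".
    static String escapeLtInAttributes(String html) {
        Matcher m = QUOTED_VALUE.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = m.group(1).replace("<", "&lt;");
            // quoteReplacement keeps "$" and "\" in the value from being
            // interpreted as group references by appendReplacement.
            m.appendReplacement(out, Matcher.quoteReplacement("=\"" + value + "\""));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "<a href=\"#Anchor-<ht-42368\">40</a>";
        // Prints: <a href="#Anchor-&lt;ht-42368">40</a>
        System.out.println(escapeLtInAttributes(raw));
    }
}
```

The escaped string could then be handed to HtmlCleaner (for example via new HtmlCleaner().clean(escaped)) in place of the raw input. A plain replaceAll() cannot do the conditional replacement in one call, which is why the sketch walks the matches with a Matcher. Single-quoted or unquoted attribute values would need a separate pattern.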
