How can I create a basic human-readable plain-text representation of XHTML using Java?

Given some simple XHTML, I'd like to create a human-readable plain-text version of it. This would involve removing all HTML tags while adding or preserving some whitespace.

For example, this input:

<div>
<p>This is some text, some is <b>bold</b>.</p>
<ul>
  <li>Point one</li>
  <li>Point two</li>
</ul>
</div>

would become:

"This is some text, some is bold. Point one Point two"

(commas between the LIs would be ideal... :)

Best answer

Use the Jericho HTML Parser. You can either strip all the tags (its TextExtractor class) or use its Renderer class, which tries to mimic the look of the page (e.g. your bulleted lists would be indented).
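
If you would rather not add a dependency, here is a rough sketch of the same idea using only the JDK's built-in XML parser. This is not Jericho and not what it does internally. It relies on the input being well-formed XHTML, and it assumes there is no DOCTYPE and no HTML-only entities such as &nbsp; (the plain XML parser would reject those). The class name XhtmlToText, the method toPlainText, and the short list of "block" tags are made up for illustration. It also joins list items with commas, as the question asked.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: flattens well-formed XHTML into one line of readable text.
final class XhtmlToText {

    // Assumed set of tags that should be followed by whitespace; extend as needed.
    private static final Set<String> BLOCK_TAGS =
            Set.of("p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6");

    static String toPlainText(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            StringBuilder out = new StringBuilder();
            render(doc.getDocumentElement(), out);
            return normalize(out.toString());
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Input that is not well-formed XML ends up here.
            throw new IllegalArgumentException("Could not parse input as XHTML", e);
        }
    }

    // Depth-first walk: keep text nodes, drop the tags around them.
    private static void render(Node node, StringBuilder out) {
        if (node instanceof Text) {
            out.append(node.getNodeValue());
            return;
        }
        String name = node.getNodeName();
        if (name.equals("ul") || name.equals("ol")) {
            // Render each <li> separately, then join the items with commas.
            List<String> items = new ArrayList<>();
            for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
                if (c.getNodeName().equals("li")) {
                    StringBuilder item = new StringBuilder();
                    render(c, item);
                    items.add(normalize(item.toString()));
                }
            }
            out.append(' ').append(String.join(", ", items)).append(' ');
            return;
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            render(c, out);
        }
        if (BLOCK_TAGS.contains(name)) {
            out.append(' '); // keep adjacent blocks from running together
        }
    }

    // Collapse runs of whitespace (including newlines) into single spaces.
    private static String normalize(String s) {
        return s.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String input = "<div>\n<p>This is some text, some is <b>bold</b>.</p>\n"
                + "<ul>\n  <li>Point one</li>\n  <li>Point two</li>\n</ul>\n</div>";
        System.out.println(toPlainText(input));
    }
}
```

Run against the XHTML from the question, this prints "This is some text, some is bold. Point one, Point two". For real-world HTML that is not strictly well-formed, a tolerant parser such as Jericho is the safer choice.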