I searched for an HTML parser and came up with tidy. Now that I have installed it, I can't find out how to strip all HTML tags (and also JavaScript functions, if that's possible). The example code turns HTML into XHTML, and I'm starting to get the feeling that I have downloaded an unsuitable package. I couldn't find any documentation or manuals that explain it either.
Any suggestions on how this might be done with tidy?
EDIT:
As I understand it, tidy is an HTML parser. What I am trying to achieve is to keep only the plain text, i.e. <h3>Test</h3> should come out as Test.
Tidy is basically used to clean up HTML pages; it does not extract text by itself. You can send the output of Tidy to libxml++ to parse the generated XHTML.
For a working example of using libxml++, look at this link: Parsing a XHTML using libxml++. You can use one of its 3 parsers (DOM, SAX, or TextReader) to parse the string and keep only the text nodes, without any tags.
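To make the goal concrete (<h3>Test</h3> becoming Test, with JavaScript dropped as well), here is a minimal sketch that uses only the C++ standard library. It is not Tidy and not libxml++: strip_tags and to_lower are hypothetical helpers written for this illustration, and the scanner is a naive stand-in for a real parser. It assumes well-formed markup such as Tidy's XHTML output, assumes no '>' appears inside attribute values, does not handle CDATA sections, and does not decode entities such as &amp;.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <initializer_list>
#include <string>

// Hypothetical helper: ASCII-lowercase copy, same length as the input.
static std::string to_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Hypothetical helper (not part of Tidy or libxml++): returns the text
// outside of tags, dropping comments and <script>/<style> bodies.
// Assumes well-formed input; stops at the first unterminated construct.
std::string strip_tags(const std::string& html) {
    const std::string lower = to_lower(html);  // for case-insensitive tag names
    std::string text;
    std::size_t i = 0;
    while (i < html.size()) {
        if (html[i] != '<') {  // ordinary character: keep it
            text += html[i++];
            continue;
        }
        if (html.compare(i, 4, "<!--") == 0) {  // comment: skip to "-->"
            const std::size_t end = html.find("-->", i + 4);
            if (end == std::string::npos) break;
            i = end + 3;
            continue;
        }
        std::size_t close = html.find('>', i);  // end of this tag
        if (close == std::string::npos) break;
        for (const char* name : {"script", "style"}) {
            const std::string open = std::string("<") + name;
            const bool is_open =
                lower.compare(i, open.size(), open) == 0 &&
                !std::isalnum(static_cast<unsigned char>(lower[i + open.size()]));
            // Self-closing tags such as <script src="x"/> have no body to skip.
            if (!is_open || html[close - 1] == '/') continue;
            // Jump straight to the matching end tag so that '<' and '>'
            // inside the script or style body are never treated as markup.
            const std::size_t end = lower.find(std::string("</") + name, close + 1);
            close = (end == std::string::npos) ? std::string::npos
                                               : lower.find('>', end);
            break;
        }
        if (close == std::string::npos) break;
        i = close + 1;  // continue after the tag (or after the skipped element)
    }
    return text;
}
```

Calling strip_tags("<h3>Test</h3>") returns "Test". For real pages, running Tidy first and then collecting the text nodes with libxml++ as described above is the more robust route; the scanner here only illustrates what "keep the text, drop the tags" means.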