Transform a TXT file containing HTML to plain text

I am trying to find a tool that converts a TXT file containing HTML into plain text while keeping the formatting (lists and so on).

I found this example, http://jsoup.org/apidocs/org/jsoup/examples/HtmlToPlainText.html, which works perfectly. The only problem is that it reads from a URL, not a file. I tried changing the code, but without success.

Can someone point me in the right direction on how to have it read my TXT file as input?

Best answer

You can start by looking at the source code of the example program: https://github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/examples/HtmlToPlainText.java

It is pretty easy to load the HTML from a file instead of a URL, because jsoup can parse a string directly.

Example

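import java.io.File;
import java.util.Scanner;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

String fileName = "YOURFILE.htm";
String content;
// try-with-resources closes the scanner even if reading fails
try (Scanner scanner = new Scanner(new File(fileName))) {
    content = scanner.useDelimiter("\\A").next(); // "\\A" is start of input, so next() returns the whole file
}

Document doc = Jsoup.parse(content);
// do whatever with the jsoup document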
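
If you would rather not use Scanner, the file can also be read in one call with java.nio.file.Files. This is only a sketch: the class name HtmlFileReader and the method readHtml are made up for illustration, and it assumes the file is UTF-8 encoded, so adjust the charset if yours is not.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class HtmlFileReader {

    /** Reads the whole file into a String, decoding it as UTF-8 (an assumption). */
    public static String readHtml(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Falls back to the placeholder file name used in the answer
        Path file = Paths.get(args.length > 0 ? args[0] : "YOURFILE.htm");
        String content = readHtml(file);
        System.out.println(content.length() + " characters read");
        // Next step, as in the answer: Document doc = Jsoup.parse(content);
    }
}
```

The returned string can be passed to Jsoup.parse(content) exactly as above. jsoup also has a Jsoup.parse(File, String charsetName) overload, which should let you skip the manual read. To keep the list formatting, the linked example class exposes a getPlainText(Element) method, if I read its source correctly, so you could hand it the parsed document instead of fetching a URL.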