How to convert multiple html files to text files?

Hello everyone, I have a folder full of HTML files that I want to convert to text files. I am working on Ubuntu and unfortunately cannot get lynx installed, so lynx --dump is not an option. Is there an alternative way to convert the HTML files to text files? Please help! Thanks in advance.

This question is tagged python, so my first choice would be Aaron Swartz's html2text. It produces text in Markdown format.

Python solutions are also possible with BeautifulSoup.
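If installing either package is a problem, here is a minimal sketch that uses only Python's standard library (html.parser) rather than html2text or BeautifulSoup. The names TextExtractor, html_to_text and convert_folder are made up for this example, and the list of tags treated as line breaks is an assumption you may want to adjust. It assumes the files end in .html and are UTF-8 encoded, and it writes foo.txt next to each foo.html.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}
    # Assumption: these tags should start a new line in the text output.
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


def convert_folder(folder):
    """Write foo.txt next to every foo.html in folder; return the file count."""
    count = 0
    for src in sorted(Path(folder).glob("*.html")):
        text = html_to_text(src.read_text(encoding="utf-8", errors="replace"))
        src.with_suffix(".txt").write_text(text + "\n", encoding="utf-8")
        count += 1
    return count
```

Calling convert_folder("path/to/folder") converts every .html file in that folder. The output is cruder than what html2text gives you (no Markdown, no link handling), but it needs nothing beyond a stock Python 3.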

If you like Perl, here is a simple Perl script that converts one HTML file to text and prints the result to standard output:

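#!/usr/bin/perl
use strict;
use warnings;

use HTML::Parse;
use HTML::FormatText;

# The HTML file to convert is the first command-line argument.
my $file = $ARGV[0];
die "Usage: $0 file.html\n" unless defined $file;
if (not -r $file) {
    die "ERROR: File ($file) is not readable\n";
}

# Slurp the whole file into one string.
my $html = do {
    local $/;
    open(my $in, '<', $file) or die "ERROR: Cannot open $file: $!\n";
    <$in>;
};

# Parse the HTML into a tree and render it as plain text.
my $plain = HTML::FormatText->new->format(parse_html($html));
print $plain;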