I'm trying to find out how to use Mojo::DOM with UTF-8 (and other encodings, not just UTF-8). It seems to mess up the encoding:
my $dom = Mojo::DOM->new($html);
$dom->find('script')->reverse->each(sub {
    #print "$_->{id}\n";
    $_->remove;
});
$dom->find('style')->reverse->each(sub {
    #print "$_->{id}\n";
    $_->remove;
});
$html = "$dom"; # pass back to $html, now we have cleaned it up...
This is what I get when saving the file without running it through Mojo:
...and then once through Mojo:
FWIW, I'm grabbing the HTML file using Path::Tiny, with:
my $utf8 = path($_[0])->slurp_raw;
Which, to my understanding, should already have the string decoded into bytes, ready for Mojo?
UPDATE: After Brian's suggestion, I looked into how to work out the encoding so that I could decode the file correctly. I tried Encode::Guess and a few others, but they got it wrong on quite a few files. This seems to do the trick:
my $enc_tmp = `encguess $_[0]`;
my ($fname,$type) = split /\s+/, $enc_tmp;
my $decoded = decode( $type||"UTF-8", path($_[0])->slurp_raw );
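If shelling out to encguess is not an option, one simpler alternative (a different technique, not what encguess does) is to try a strict UTF-8 decode first and only fall back to another encoding when that fails. This is a minimal sketch using only the core Encode module; the helper name decode_with_fallback and the ISO-8859-1 default are assumptions made for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: try strict UTF-8 first, then fall back to $fallback.
sub decode_with_fallback {
    my ($octets, $fallback) = @_;
    $fallback //= 'ISO-8859-1';    # assumption: a single-byte default

    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # $octets untouched so it can be decoded a second time below.
    my $chars = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $chars if defined $chars;

    # Not valid UTF-8, so interpret the octets in the fallback encoding.
    return decode($fallback, $octets);
}

# Valid UTF-8 octets and Latin-1 octets both end up as the same characters.
my $from_utf8   = decode_with_fallback("caf\xC3\xA9");
my $from_latin1 = decode_with_fallback("caf\xE9");
print $from_utf8 eq $from_latin1 ? "same\n" : "different\n";
```

The fallback only catches input that is not valid UTF-8. It cannot tell two single-byte encodings apart, so a real guesser is still needed when the files come in many legacy encodings.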
You are slurping raw octets but not decoding them (storing the raw octets in $utf8). Then you treat the result as if you had decoded it, so the output is mojibake.

If you read raw octets, decode them before you use them. Path::Tiny's slurp_utf8 will decode for you. Likewise, you have to encode again on output; the open pragma does that in this example. Mojolicious also has Mojo::File->slurp to get raw octets, so you can reduce your dependency list.
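A minimal sketch of that decode, process, encode round trip, assuming UTF-8 input. The helper name clean_html and the sample markup are made up for illustration; decode and encode come from the core Encode module, and the open pragma puts a UTF-8 encoding layer on STDOUT.

```perl
use strict;
use warnings;
use open qw(:std :encoding(UTF-8));    # encode character strings on output
use Encode qw(decode encode);
use Mojo::DOM;

# Hypothetical helper: raw octets in, cleaned-up character string out.
sub clean_html {
    my ($octets, $encoding) = @_;

    # Decode first, so Mojo::DOM sees characters rather than bytes.
    my $dom = Mojo::DOM->new(decode($encoding, $octets));

    # Drop every <script> and <style> element.
    $dom->find('script, style')->each(sub { $_->remove });

    return "$dom";    # still a character string, not octets
}

# Stand-in for what slurp_raw would return: UTF-8 encoded octets.
# The text inside the paragraph is "Grüße".
my $octets = encode('UTF-8', "<p>Gr\x{fc}\x{df}e</p><script>alert(1)</script>");

print clean_html($octets, 'UTF-8'), "\n";
```

When writing to a file instead of STDOUT, the same rule applies: either open the handle with an encoding layer, or call encode on the string and write the resulting octets raw.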