I'm using Mojo::DOM to identify and print out phrases (meaning strings of text between selected HTML tags) in hundreds of HTML documents that I'm extracting from existing content in the Movable Type content management system.
I'm writing those phrases out to a file so that they can be translated into other languages. The extraction code looks like this:
$dom = Mojo::DOM->new(Mojo::Util::decode('UTF-8', $page->text));
##########
#
# Break down the Body into phrases. This is done by listing the tags and tag combinations that
# surround each block of text that we're looking to capture.
#
##########
print FILE "\n\t### Body\n\n";
for my $phrase ( $dom->find('h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a')->map('text')->each ) {
    print_phrase($phrase); # utility function to write out the phrase to a file
}
When Mojo::DOM encountered embedded HTML entities (such as &trade; and &nbsp;), it converted them into the corresponding characters rather than passing them along as written. I wanted the entities to be passed through as written.
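A minimal reproduction of the behavior described above might look like the following. The sample markup is hypothetical and is not taken from the Movable Type content; it only assumes that Mojolicious is installed.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical sample markup containing two named entities.
my $repro_html = '<p>Acme&trade; Widgets &amp; Co.</p>';

my $repro_dom  = Mojo::DOM->new($repro_html);
my $repro_text = $repro_dom->at('p')->text;

# The parser has already decoded the entities at this point:
# $repro_text holds the character U+2122 and a bare "&",
# not the literal strings "&trade;" and "&amp;".
binmode STDOUT, ':encoding(UTF-8)';
print "$repro_text\n";
```

The entity decoding happens while the document is parsed, so the original spelling of each entity is no longer available by the time text is called.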
I recognized that I could use Mojo::Util::decode to pass these HTML entities through to the file I'm writing. The problem is "You can only call decode 'UTF-8' on a string that contains valid UTF-8. If it doesn't, for example because it is already converted to Perl characters, it will return undef."
If this is the case, I have to either try to figure out how to test the encoding of the current HTML page before calling Mojo::Util::decode('UTF-8', $page->text), or I must use some other technique to preserve the encoded HTML entities.
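One way to sidestep testing the encoding up front is to attempt the decode and fall back to the original string when it fails. The sketch below assumes that fallback is acceptable for the content at hand; to_characters is a hypothetical helper name, not something from the script above.

```perl
use strict;
use warnings;
use Mojo::Util qw(decode);

# Hypothetical helper: return decoded characters when the input is
# valid UTF-8 bytes, otherwise return the input unchanged.
sub to_characters {
    my ($raw) = @_;
    # decode() returns undef on failure, so // supplies the fallback.
    return decode('UTF-8', $raw) // $raw;
}

# Valid UTF-8 bytes are decoded into Perl characters.
my $from_bytes = to_characters("caf\xC3\xA9");

# A string that already holds Perl characters is passed through.
my $from_chars = to_characters("caf\x{e9} \x{2122}");
```

This is only a heuristic: a string that was already decoded but whose characters happen to form valid UTF-8 byte sequences would be decoded a second time.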
How do I most reliably preserve encoded HTML Entities when processing HTML documents with Mojo::DOM?
Through testing, my colleagues and I determined that Mojo::DOM->new() was decoding ampersand characters (&) automatically, which made it impossible to preserve HTML entities as written. To get around this, we added a subroutine, encode_amp(), that double-encodes ampersands.
Later in the script, we pass $page->text through encode_amp() as we instantiate a new Mojo::DOM object.
Our code incorporates earlier suggestions from @Grinnz, as seen in the comments on this question. Thanks also to @Robert for his answer, which had a good observation about how Mojo::DOM works.
This code definitely works for my application.
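The original code listing did not survive in this copy of the answer, so here is a minimal sketch of what a double-encoding approach could look like. The name encode_amp comes from the answer, but its body here is an assumption (a plain substitution of every & with &amp;), and the sample markup is hypothetical.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Assumed implementation: escape every ampersand once more, so that
# "&trade;" becomes "&amp;trade;" before Mojo::DOM parses the markup.
sub encode_amp {
    my ($html) = @_;
    $html =~ s/&/&amp;/g;
    return $html;
}

# Hypothetical sample markup; in the script this would be $page->text.
my $source = '<p>Acme&trade; Widgets &amp; Co.</p>';

my $dom    = Mojo::DOM->new(encode_amp($source));
my $phrase = $dom->at('p')->text;

# The parser decodes only the outer "&amp;", which leaves the
# original entities in the extracted text as literal strings.
print "$phrase\n";
```

Because the parser undoes exactly one level of escaping, the extracted phrase carries the entities as they appeared in the source. Note that under this assumption every ampersand is escaped, including any that appear inside attribute values.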