Parsing RDFa in HTML/XHTML?


I'm using the RDF::RDFa::Parser module in Perl to parse RDF data out of websites. On a site with <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> it works, but on sites using XHTML, i.e. <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">, I get no output.

Test website: http://www.filmstarts.de/kritiken/186918.html

use RDF::RDFa::Parser;

my $url     = 'http://www.filmstarts.de/kritiken/186918.html';
my $options = RDF::RDFa::Parser::Config->tagsoup;
my $rdfa    = RDF::RDFa::Parser->new_from_url($url, $options);

print $rdfa->opengraph('image');
print $rdfa->opengraph('description');

There is 1 best solution below


(I'm the author of RDF::RDFa::Parser.)

It looks like the HTML parser used by the RDFa parser is failing on that page. (I'm also the maintainer of the HTML parser in question, so I can't shift the blame onto anyone else!) Thus, by the time the RDFa parsing starts, all it sees is an empty DOM tree.

The page is quite hideously invalid XHTML, yet I would still have expected the HTML parser to do a reasonable job. I've filed a bug report for you.

In the meantime, a workaround might be to build the XML::LibXML DOM tree outside of RDF::RDFa::Parser (perhaps using libxml's built-in HTML parser?). You could pass that tree directly to the RDFa parser:

use RDF::RDFa::Parser;
use LWP::Simple qw(get);

my $url     = 'http://www.filmstarts.de/kritiken/186918.html';
my $xhtml   = get($url);
my $dom     = somehow_build_a_dom_tree($xhtml);  # hand-waving!!
my $options = RDF::RDFa::Parser::Config->tagsoup;
my $rdfa    = RDF::RDFa::Parser->new($dom, $url, $options);

print $rdfa->opengraph('image');
print $rdfa->opengraph('description');

I hope that helps!

Update: here's a possible implementation of somehow_build_a_dom_tree...

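use XML::LibXML;

sub somehow_build_a_dom_tree {
    my ($html) = @_;                   # raw markup as a single string
    my $p = XML::LibXML->new;
    $p->recover_silently(1);           # tolerate invalid markup without warnings
    return $p->load_html( string => $html );
}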
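
To check that libxml's HTML parser really does produce a non-empty tree before handing it to the RDFa parser, a small offline test along these lines might help. This is only a sketch: the markup in $html is a made-up, deliberately broken stand-in for the real page, and the example.com URL and the description text are placeholders, not values taken from filmstarts.de.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical, intentionally invalid markup standing in for the real page.
my $html = <<'HTML';
<html><head>
<meta property="og:image" content="http://example.com/poster.jpg">
<meta property="og:description" content="Example description">
</head><body><p>Unclosed paragraph<div>stray closing tag</b></body>
HTML

# Same parser settings as in somehow_build_a_dom_tree.
my $p = XML::LibXML->new;
$p->recover_silently(1);
my $dom = $p->load_html( string => $html );

# List every <meta> element carrying a property attribute.
my @meta = $dom->findnodes('//meta[@property]');
for my $m (@meta) {
    printf "%s = %s\n", $m->getAttribute('property'), $m->getAttribute('content');
}
```

If the loop prints nothing for the real page's markup, the problem lies in building the DOM rather than in the RDFa processing.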