Parsing RDFa in html/xhtml?

369 Views Asked by armin884 At 24 December 2013 at 23:53

I'm using the RDF::RDFa::Parser module in Perl to extract RDF data from websites. On pages declared with <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> it works, but on pages that use XHTML, <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">, I get no output.

Test website: http://www.filmstarts.de/kritiken/186918.html

use RDF::RDFa::Parser;

my $url     = 'http://www.filmstarts.de/kritiken/186918.html';
my $options = RDF::RDFa::Parser::Config->tagsoup;
my $rdfa    = RDF::RDFa::Parser->new_from_url($url, $options);

print $rdfa->opengraph('image');
print $rdfa->opengraph('description');

tobyink On 25 December 2013 at 00:20

(I'm the author of RDF::RDFa::Parser.)

It looks like the HTML parser used by the RDFa parser is failing on that page. (I'm also the maintainer of the HTML parser in question, so I can't shift the blame onto anyone else!) Thus, by the time the RDFa parsing starts, all it sees is an empty DOM tree.

The page is quite hideously invalid XHTML, yet I would still have expected the HTML parser to do a reasonable job. I've filed a bug report for you.

In the meantime, a workaround might be to build the XML::LibXML DOM tree outside of RDF::RDFa::Parser (perhaps using libxml's built-in HTML parser?). You could then pass that tree directly to the RDFa parser:

use RDF::RDFa::Parser;
use LWP::Simple qw(get);

my $url     = 'http://www.filmstarts.de/kritiken/186918.html';
my $xhtml   = get($url);
my $dom     = somehow_build_a_dom_tree($xhtml);  # hand-waving!!
my $options = RDF::RDFa::Parser::Config->tagsoup;
my $rdfa    = RDF::RDFa::Parser->new($dom, $url, $options);

print $rdfa->opengraph('image');
print $rdfa->opengraph('description');

I hope that helps!

Update: here's a possible implementation of somehow_build_a_dom_tree...

use XML::LibXML;

sub somehow_build_a_dom_tree {
    my ($html) = @_;
    my $p = XML::LibXML->new;
    $p->recover_silently(1);    # keep parsing past invalid markup, without warnings
    return $p->load_html( string => $html );
}
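
Since the symptom described above is an empty DOM tree, it may be worth checking what libxml actually recovered before handing the tree to the RDFa parser. The following is only a minimal sketch: the sample markup, the example.com URL and the variable names are made up for illustration, and it exercises XML::LibXML on its own rather than RDF::RDFa::Parser.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample markup, deliberately malformed
# (unclosed <p>, stray </b>); not taken from the real page.
my $html = <<'HTML';
<html>
<head>
<meta property="og:image" content="http://example.com/poster.jpg">
<title>Sample</title>
</head>
<body><p>Unclosed paragraph</b>
</body>
</html>
HTML

# Same parser setup as in somehow_build_a_dom_tree.
my $parser = XML::LibXML->new;
$parser->recover_silently(1);
my $dom = $parser->load_html( string => $html );

# An empty tree has no document element, which would explain
# why the RDFa parser finds nothing to report.
my $root = $dom->documentElement
    or die "HTML parser produced an empty DOM tree\n";

# List the Open Graph <meta> elements that survived parsing.
my @og = $root->findnodes('//meta[starts-with(@property, "og:")]');
printf "%s = %s\n", $_->getAttribute('property'), $_->getAttribute('content')
    for @og;
```

If this prints the expected meta elements for the real page's markup, the tree is probably usable as the $dom argument to RDF::RDFa::Parser->new; if it dies or prints nothing, the problem lies in the HTML parsing step rather than in the RDFa processing.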
