Using RDFLib to extract non-RDF data as RDF from web pages


I recently installed RDFLib to work with RDF data. I want to extract RDF from any web page, whether or not it already contains RDF, the way the Virtuoso Sponger does

[like this link does](http://linkeddata.uriburner.com/about/html/http/www.slideshare.net/kleinerperkins/internet-trends-v1)

and store the result as N-Triples (nt) or N3/Turtle (as offered in the options in that page's footer). I get warnings and errors if I perform

 g.parse("http://www.slideshare.net/kleinerperkins/internet-trends-v1.html",format="n3")

Also, is there any built-in functionality for ontology mapping in RDFLib?


I get warnings and errors if I perform

g.parse("http://www.slideshare.net/kleinerperkins/internet-trends-v1.html",format="n3")

This is not really surprising as you're essentially asking it to parse an HTML page with the n3 parser.

You could run

g.parse("http://www.slideshare.net/kleinerperkins/internet-trends-v1.html", format="html")

but this is probably not what you want either. RDFLib can work with RDF that is embedded in HTML (such as RDFa or microdata), and it can also extract some "general purpose RDF" from HTML, but the results are quite different from what you get back from uriburner. The reason is that uriburner uses a custom "slideshare" Virtuoso Sponger, which is tailored to extract a lot more useful information from the slideshare HTML. If you want to use the knowledge that was put into that special sponger, you can query the page "through" uriburner by parsing the RDF version that uriburner serves (the link can be found at the bottom of the page):

g.parse(
    'http://linkeddata.uriburner.com/sparql?default-graph-uri=http%3A%2F%2Fwww.slideshare.net%2Fkleinerperkins%2Finternet-trends-v1&query=DESCRIBE%20%3Chttp%3A%2F%2Flinkeddata.uriburner.com%2Fabout%2Fid%2Fentity%2Fhttp%2Fwww.slideshare.net%2Fkleinerperkins%2Finternet-trends-v1%3E&output=text%2Frdf%2Bn3',
    format='n3'
)
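
If you need the same thing for other pages, the long URL above can be assembled rather than copied by hand. The following is only a minimal sketch: `build_describe_url` is a hypothetical helper, and both the endpoint and the entity URI pattern (`.../about/id/entity/<scheme>/<host><path>`) are assumptions inferred from the single example URL above, not a documented uriburner contract.

```python
from urllib.parse import quote, urlencode, urlsplit

# Assumption: both values are copied from the single example URL above.
SPARQL_ENDPOINT = "http://linkeddata.uriburner.com/sparql"
ENTITY_PREFIX = "http://linkeddata.uriburner.com/about/id/entity"


def build_describe_url(page_url, output="text/rdf+n3"):
    """Return a uriburner DESCRIBE URL for page_url (hypothetical helper)."""
    parts = urlsplit(page_url)
    # Assumed pattern: <prefix>/<scheme>/<host><path>, as seen in the example URL.
    entity = f"{ENTITY_PREFIX}/{parts.scheme}/{parts.netloc}{parts.path}"
    params = {
        "default-graph-uri": page_url,
        "query": f"DESCRIBE <{entity}>",
        "output": output,
    }
    # quote_via=quote encodes spaces as %20 (not "+"), matching the example URL.
    return f"{SPARQL_ENDPOINT}?{urlencode(params, quote_via=quote)}"


if __name__ == "__main__":
    print(build_describe_url(
        "http://www.slideshare.net/kleinerperkins/internet-trends-v1"
    ))
```

The returned string can be passed to `g.parse(..., format='n3')` exactly like the literal URL above. To store the result in the formats mentioned in the question, `g.serialize(destination="slides.nt", format="nt")` or `format="turtle"` should work; the file name `slides.nt` is just a placeholder.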