I need only the URLs from the DMOZ/ODP dump file, but the file is in RDF. How do I get only the URLs out of the ODP file? I want to extract all the URLs into a text file.
Does anyone know of a script that parses only the URLs from an RDF file?
Several of the popular SemWeb APIs (Jena, Sesame and dotNetRDF) provide fully streaming parsers for RDF files, so you can write a custom data handler that keeps only the URIs it sees and throws away everything you aren't interested in.
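To illustrate the streaming-handler idea without pulling in one of those libraries, here is a rough sketch using Python's standard xml.sax module instead. It is a plain XML stream parser, not an RDF parser, and it assumes the dump lists each site as an <ExternalPage about="..."> element, so check that against your copy of the file. The names UrlHandler and extract_urls are made up for this example.

```python
import xml.sax


class UrlHandler(xml.sax.ContentHandler):
    """Write the 'about' attribute of every <ExternalPage> element to `out`."""

    def __init__(self, out):
        super().__init__()
        self.out = out
        self.count = 0

    def startElement(self, name, attrs):
        # Assumption: URLs live in <ExternalPage about="...">.
        # Every other element and all text content is simply ignored.
        if name == "ExternalPage" and "about" in attrs:
            self.out.write(attrs["about"] + "\n")
            self.count += 1


def extract_urls(source, out):
    """Stream-parse `source` (a file name or binary file object) and
    write one URL per line to `out`. Returns the number of URLs written."""
    handler = UrlHandler(out)
    xml.sax.parse(source, handler)
    return handler.count


# Example use (file names are placeholders):
#     with open("urls.txt", "w", encoding="utf-8") as out:
#         extract_urls("content.rdf.u8", out)
```

Because the parser streams, memory use stays flat no matter how large the dump is. A strict XML parser will stop at the first malformed character, though, so a dump that is not well-formed may need cleaning first.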
You can probably do something hacky with Perl, and it may be faster, but it may not be entirely accurate, particularly if the RDF uses relative URIs that need to be resolved.
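For comparison, the hacky route might look something like the sketch below, written in Python rather than Perl. It assumes each URL sits in a double-quoted about="..." attribute on a single line, and the base argument is a hypothetical base URI you would have to supply yourself if the file contains relative URIs. quick_extract is just a name picked for this example.

```python
import re
from urllib.parse import urljoin
from xml.sax.saxutils import unescape

# Assumption: URLs appear as about="..." (also matches rdf:about / r:about).
ABOUT_RE = re.compile(r'\babout="([^"]*)"')

# saxutils.unescape handles &amp;, &lt; and &gt; by default; add the quotes.
EXTRA_ENTITIES = {"&quot;": '"', "&apos;": "'"}


def quick_extract(lines, base=""):
    """Yield every about="..." value found in `lines`, with XML entities
    decoded and relative URIs resolved against `base`."""
    for line in lines:
        for value in ABOUT_RE.findall(line):
            yield urljoin(base, unescape(value, EXTRA_ENTITIES))


# Example use (file names are placeholders):
#     with open("content.rdf.u8", encoding="utf-8", errors="replace") as src, \
#          open("urls.txt", "w", encoding="utf-8") as out:
#         for url in quick_extract(src):
#             out.write(url + "\n")
```

This never checks that the input is well-formed, so it will miss attributes that are single-quoted or split across lines, which is exactly the accuracy trade-off mentioned above.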
Option 1. Download dmoz_v3.zip from http://sourceforge.net/projects/dmoz2mysql/files/latest/download. This is a PHP script that parses the DMOZ RDF data dump files automatically. It handles downloading the files, extracting, cleaning and parsing them, and inserting the data into a MySQL database.
Option 2. Use the following link to find the tools to extract URLs from RDF dumpfiles
Maybe something like this then?
And then print the contents of @urls to a text file.