I am working on a project that involves finding the 'domains of knowledge' a given keyword is related to. I plan to do this using DMOZ. For example, 'Brad Pitt' gives:
Arts: People: P: Pitt, Brad: Fan Pages (10)
Arts: People: P: Pitt, Brad: Articles and Interviews (5)
Arts: People: P: Pitt, Brad (4)
Arts: People: P: Pitt, Brad: Image Galleries (2)
Arts: People: P: Pitt, Brad: Movies (2)
and so on...
I have the structure.rdf.u8 dump from the DMOZ website. I was told that if I do not need the URLs, this file alone is enough (I don't need the websites, only the categories pertaining to keywords). Is that correct, or do I also need the content file?
Moreover, I would like to know the best way to parse the structure file using Python (any library is fine). I don't have any knowledge of XML, though I am comfortable with Python.
I started with https://github.com/kremso/dmoz-parser and made a simple topic filter: https://github.com/lawrencecreates/dmoz-parser/blob/master/sample.py#L6
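
For reference, here is a minimal sketch of the kind of streaming parse this seems to need, using only the standard library's xml.etree.ElementTree.iterparse. It is not taken from either repository above. It assumes that each category in structure.rdf.u8 is a Topic element whose path is stored in a namespaced id attribute (for example r:id="Top/Arts/People/P/Pitt,_Brad"), and the names matching_topics and local_name are made up for illustration. The tiny in-memory sample stands in for the real dump.

```python
import io
import xml.etree.ElementTree as ET


def local_name(qualified):
    """Strip a '{namespace}' prefix from a tag or attribute name."""
    return qualified.rsplit("}", 1)[-1]


def matching_topics(source, keyword):
    """Yield topic paths that contain every word of `keyword`.

    `source` is a filename or a binary file object. Matching is a plain
    case-insensitive substring test, so 'brad' would also match 'Bradley'.
    """
    words = keyword.lower().split()
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event != "end" or local_name(elem.tag) != "Topic":
            continue
        for attr, value in elem.attrib.items():
            # Assumption: the category path lives in the Topic's 'id' attribute.
            if local_name(attr) == "id":
                haystack = value.lower().replace("_", " ")
                if all(word in haystack for word in words):
                    yield value
        root.clear()  # drop already-processed topics to keep memory flat


# Hypothetical sample mimicking the assumed layout of the structure dump.
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/" xmlns="http://dmoz.org/rdf/">
<Topic r:id="Top/Arts/People/P/Pitt,_Brad"><catid>1</catid></Topic>
<Topic r:id="Top/Arts/People/P/Pitt,_Brad/Fan_Pages"><catid>2</catid></Topic>
<Topic r:id="Top/Science/Physics"><catid>3</catid></Topic>
</RDF>"""

hits = list(matching_topics(io.BytesIO(sample), "Brad Pitt"))
for path in hits:
    print(path)
```

For the real file, the call would presumably be matching_topics("structure.rdf.u8", "Brad Pitt"). Because the elements are processed and discarded one at a time, the whole dump never has to fit in memory.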