Split all the different graphs included in an N-Quads file

I have a large N-Quads file containing many statements spread across a large number of different graphs. The lines of the file look like this:

<http://voag.linkedmodel.org/voag#useGuidelines> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/1999/02/22-rdf-syntax-ns#Property> <http://voag.linkedmodel.org/schema/voag> .

The fourth element corresponds to the graph's URI.

I would like to parse this file and split the different graphs into new files or data structures, one object per graph, preferably with RDFLib. I really don't know how to tackle this problem, so any help would be appreciated.

There is 1 best solution below

If the lines are ordered so that all the statements belonging to one graph URI are contiguous, you can use itertools' groupby to parse each graph in turn:

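from itertools import groupby
import rdflib

def parse_nquads(lines):
    # groupby only merges adjacent lines, so each graph's quads must be contiguous
    for group, quad_lines in groupby(lines, get_quad_label):
        # one rdflib.Graph per graph URI, named after that URI
        graph = rdflib.Graph(identifier=group)
        graph.parse(data=''.join(quad_lines), format='nquads')
        yield graph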

If the fourth element is always present and is always a URI (neither of which is guaranteed by the specification), you can extract it with a regular expression that matches the last <...> term before the terminating dot.

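import re
# last <...> term before the terminating dot; the dot is escaped so it matches literally
RDF_QUAD_LABEL_RE = re.compile(r"[ \t]+<([^>]*)>[ \t]*\.[ \t]*\r?\n?$")
def get_quad_label(line):
    return RDF_QUAD_LABEL_RE.search(line).group(1)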
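
The search above relies on every line ending with a graph URI. As a hedged sketch of a stricter alternative, the pattern below tokenizes the whole statement, so a plain triple whose object is a URI is not mistaken for a quad. The names QUAD_RE and graph_label are hypothetical, and the term patterns are a simplification of the N-Quads grammar, not a full implementation of it.

```python
import re

# Simplified term patterns; the real N-Quads grammar is stricter about
# which characters may appear inside IRIs and blank node labels.
_IRI = r"<[^>]*>"
_BNODE = r'_:[^\s<"]+'
_LITERAL = r'"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z]+(?:-[A-Za-z0-9]+)*)?'

# subject, predicate, object, then an optional graph label before the dot
QUAD_RE = re.compile(
    rf"^\s*(?:{_IRI}|{_BNODE})\s+{_IRI}\s+(?:{_IRI}|{_BNODE}|{_LITERAL})"
    rf"(?:\s+(?P<graph>{_IRI}|{_BNODE}))?\s*\.\s*$"
)


def graph_label(line):
    """Return the graph label of one statement, or None for the default graph."""
    match = QUAD_RE.match(line)
    if match is None:
        raise ValueError(f"not a recognisable N-Quads statement: {line!r}")
    label = match.group("graph")
    if label is None:
        return None
    # Strip the angle brackets from IRIs; blank node labels are returned as-is.
    return label[1:-1] if label.startswith("<") else label
```

Used as the key function for groupby, this would group statements without a fourth element under the key None instead of raising an AttributeError.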

Then you can process each graph from the input file into a new file or dataset:

with open('myfile.nquads', 'rt') as f:
    for graph in parse_nquads(f):
        ...
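
Note that groupby only merges adjacent lines, so if statements from different graphs are interleaved, the generator yields several partial graphs for the same URI. A minimal sketch of one workaround, assuming the file fits in memory and every statement ends with a graph URI, is to collect the lines per graph first. The names LABEL_RE and bucket_lines_by_graph are hypothetical and not part of rdflib.

```python
import re
from collections import defaultdict

# Simplified label pattern: the last <...> term before the closing dot.
# Assumption: every statement carries a graph URI as its fourth element.
LABEL_RE = re.compile(r"[ \t]+<([^>]*)>[ \t]*\.[ \t]*\r?\n?$")


def bucket_lines_by_graph(lines):
    """Map each graph URI to the list of raw N-Quads lines that end with it."""
    buckets = defaultdict(list)
    for line in lines:
        # Skip blank lines and comments.
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        match = LABEL_RE.search(line)
        # Lines the pattern cannot label are collected under the key None.
        label = match.group(1) if match else None
        buckets[label].append(line)
    return dict(buckets)
```

Each bucket can then be joined with ''.join(...) and handed to graph.parse(data=..., format='nquads') in the same way parse_nquads does. For files too large to hold in memory, sorting the file by graph URI beforehand and keeping the groupby approach is likely the better fit.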