Loading 220 million triples in AnzoGraph


I've got a dataset with 220 million triples in one TTL file. Is there a way I can upload this data into AnzoGraph?

In the AnzoGraph documentation, https://docs.cambridgesemantics.com/anzograph/userdoc/load-reqs.htm, I came across the text below:

> AnzoGraph supports a maximum URI length of 16K characters. There is also a limit of 64K on the number of unique URIs you can load into AnzoGraph. That is, the number of unique URIs, including graph URIs and predicate URIs, that you can load into AnzoGraph must be less than 64K. If you exceed this limit, the Load operation exceeding the limit will fail and AnzoGraph returns the message "m_lowest_unused_index <= a_max_value()".

With a limit of 64K unique URIs, I expect the upload of 220 million triples to fail, especially since it's a linking dataset that connects multiple sources and therefore contains lots of unique URIs.

Is there a way around this limitation?

> 220 million triples, in one TTL file.

Loading from a single TTL file is very slow because only a single CPU core is engaged to ingest the data. If you can load the data just once into a graph, e.g. <yourgraph>, then use the command

`COPY <yourgraph> TO <dir:/mydir/myfiles.ttl.gz>`

which will split your dataset into many gzip-compressed TTL files. The next time you load, ingest MPP-style from that data directory instead, so that every CPU core in your AnzoGraph server or cluster loads a subset of the data in parallel.

I should also note that 220 million triples is actually a very small dataset for AnzoGraph. I have loaded over 100 million on my T470s ThinkPad while just fiddling around, a single server-class system will easily handle billions, and a large cluster was tested to over a trillion triples in a record-breaking LUBM run some years ago. Typical production use cases are in the tens of billions.
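For reference, here is a minimal sketch of the whole sequence. The graph URI `<http://example.org/yourgraph>` and the paths under `/mydir` are placeholders; adjust them to your environment, and make sure the directory is readable and writable by the AnzoGraph process:

```sparql
# One-time load from the single TTL file (slow: single-core ingest)
LOAD <file:/mydir/mydata.ttl> INTO GRAPH <http://example.org/yourgraph>

# Export the graph as many gzip-compressed TTL files
COPY <http://example.org/yourgraph> TO <dir:/mydir/myfiles.ttl.gz>

# Subsequent loads read the whole directory in parallel,
# with the files distributed across all CPU cores
LOAD <dir:/mydir/myfiles.ttl.gz> INTO GRAPH <http://example.org/yourgraph>
```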

Disclaimer: I work for Cambridge Semantics.