So I tried loading some postcode and address data into Neo4j. There are effectively three labels: POSTCODE, ADDRESS, and REGION. REGION and POSTCODE each have a unique constraint on their single property. The insertion query MERGEs the REGION, MERGEs the POSTCODE, CREATEs the ADDRESS, and then CREATEs the relationships. The idea is to be able to see which postcodes are in which region, and how many addresses share a single postcode, so the MERGE behaviour is important.
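A minimal sketch of that insertion pattern (the property names `code`, `name`, `line1` and the relationship types are placeholders, not the actual schema):

```cypher
// MERGE so repeated regions/postcodes reuse the same node,
// CREATE a fresh address each time, then wire up the relationships.
MERGE (r:REGION {name: {regionName}})
MERGE (p:POSTCODE {code: {postcode}})
CREATE (a:ADDRESS {line1: {line1}})
CREATE (p)-[:IN_REGION]->(r)
CREATE (a)-[:HAS_POSTCODE]->(p)
```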
However, we have found this to be very slow once the database reaches even a moderate size. We expected some slowdown, but we expected constraint checks to scale as O(log n). Instead, performance is linear in the size of the database, which is very unexpected.
What can I do to improve this without giving up the MERGE behaviour? Is this a consequence of the UNIQUE constraint? In theory there should be no difference between a unique constraint and a plain index when you MERGE on a single property: either way, MERGE needs to look up whether that property value already exists to decide whether to merge or create.
I'm aware that I can do various things to speed up the insertion (use the CSV loader, etc.), but what I am interested in here is improving the asymptotic performance. I thought unique constraints should have a time cost of O(log n), not O(n), and that potentially makes a huge difference.
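For reference, the two variants I am comparing differ only in the DDL; both give MERGE a schema index to seek into (label from above, property name assumed):

```cypher
// Unique constraint: index lookup plus a uniqueness check on write
CREATE CONSTRAINT ON (p:POSTCODE) ASSERT p.code IS UNIQUE;

// Plain index: same lookup, no uniqueness enforcement
CREATE INDEX ON :POSTCODE(code);
```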
EDIT: Further investigation has revealed that the issue is not index lookups, but R-tree insertion into the spatial layer. The code being used for insertion uses the embedded API, not Cypher, and the snippet:
graphDB.index().forNodes(s).add(node, "dummy", "variable");
gets progressively slower, at O(n), as the tree grows. This is apparently the expected behaviour for R-trees. Each insertion takes about 0.0005 * (number of nodes in the layer). With the spatial insertion removed, it goes orders of magnitude faster and shows no scaling behaviour; I assume the initial speed-up is just the cache warming up after start-up.
Incidentally, I am using the following code to set up the spatial index:
// Configure a simple point layer and create the index inside a transaction
Map<String, String> config = SpatialIndexProvider.SIMPLE_POINT_CONFIG;
try (Transaction tx = graphDB.beginTx()) {
    IndexManager indexMan = graphDB.index();
    indexMan.forNodes(lab.name(), config);
    tx.success();
}
This gives you the Cypher entry point, but is there a qualitative difference between indexes and layers? Would a layer have better performance than the index, or are they both backed by identical R-trees?
The suggestion at this question: "Neo4J huge performance degradation after records added to spatial layer" seems to be that I should put all the nodes into the database before creating the spatial layer, as bulk indexing is much faster than incremental insertion.
I'll try that tomorrow.
Which Neo4j version do you use?
Yes, please share your queries.
If you use LOAD CSV, you will get better performance by creating the nodes separately first with MERGE, and then creating the relationships in a second pass with MATCH ... MATCH ... CREATE ...
see also: http://www.markhneedham.com/blog/2014/10/23/neo4j-cypher-avoiding-the-eager/
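The two-pass pattern would look something like this (the file URL and column names are placeholders):

```cypher
// Pass 1: nodes only, so the MERGE never forces an Eager operation
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///addresses.csv" AS row
MERGE (:POSTCODE {code: row.postcode});

// Pass 2: relationships only, matching on the already-indexed properties
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///addresses.csv" AS row
MATCH (p:POSTCODE {code: row.postcode})
MATCH (r:REGION {name: row.region})
CREATE (p)-[:IN_REGION]->(r);
```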
If you don't use LOAD CSV do you run individual small transactions? If so, it makes sense to batch them into for instance 1000 operations per transaction.
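One way to batch without LOAD CSV is to pass a list parameter and UNWIND it, so a single transaction covers, say, 1000 rows (the parameter name `rows` and its keys are illustrative):

```cypher
// One transaction, many rows: the client sends ~1000 maps in {rows}
UNWIND {rows} AS row
MERGE (p:POSTCODE {code: row.postcode})
```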
Can you also verify that your constraints are in place, with ":schema" in the browser or "schema" in the shell?
And check that the index/constraint is actually used by profiling your query in the shell: just prefix it with profile.
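For example (property name assumed), the plan should show an index seek rather than a label scan:

```cypher
profile MATCH (p:POSTCODE {code: "AB1 2CD"}) RETURN p;
```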