I'm using Property Graph and Cypher from Neo4j. As described in the title, I'm trying to delete a number of nodes which can be reached from another node without passing other nodes and only have 1 incoming relationship. Here is the example of this case:
Each node has its label (big, bold character) and a property called nodeId, which is unique among nodes. The labels of relationships are omitted because we cannot rely on it for some reasons. The nodeId property is already indexed with a unique constraint.
Now, starting from node A {nodeId: 1}, I want to delete it and all other nodes which:
- can be reached from
A {nodeId: 1}without passing another A-label node. - only has 1 incoming relationship
So, the nodes will be deleted are: A {nodeId: 1}, B {nodeId: 3}, C {nodeId: 4}, and C {nodeId: 8}.
Below is my Cypher code:
MATCH p = (s:A {nodeId: 1 }) -[*1..10]-> (e)
WHERE NONE (x in NODES(p) WHERE x:A AND NOT x.nodeId = 1)
WITH s, e
MATCH (e) <-[r]- ()
WITH count(r) AS num_r, s, e
WHERE num_r < 2
DETACH DELETE e
DETACH DELETE s
The code works fine but as my graph grows, it becomes slower and slower. In the beginning, it takes less than 10 ms. But now, when I have around 1 million of nodes and 2 million of relationships, it takes more than 1 second.
What should I do to improve the performance of that code?
Thank you for your help.

Since you only care if there is A path, you should use shortestPath instead of just
(a)-[*]->(b). That way, Cypher just needs to find 1 valid path instead of all possible paths (This can be a life saver in larger sets) Also, you can use TAIL to cut off the first item in a list so that you can (Cypher can) skip that check.Depending on your Neo4j version, Using
MATCH <path> WHERE <stuff> WITH DISTINCT startnode ,endnodemay be more effective, as later Cypher Planners can use the WITH DISTINCT hint to do a faster, less exhaustive path matching. On earlier versions, this will hang Neo4j, and you will need to use the APOC neo4j library.You can also change
NOT ()-->(e)<--()toSIZE(()-->(e)) < 2if you need to change that number. The former may perform better in some Cypher Planners though. You may need to change that to "All parents of e are contained in path" if that is a scenario where e can have more than 2 incoming relationships but still need to be deleted.If your logic gets more complicated than that (where what nodes get deleted can change what other nodes can be deleted