Can I load in nodes and relationships from a csv file using 1 cypher command?

971 Views Asked by At

I have 2 csv files which I am trying to load into a Neo4j database using cypher: drivers.csv which holds every formula 1 driver and lap times.csv which stores every lap ever raced in F1.

I have managed to load in all of the nodes, although the lap times file is very large so it took quite a long time! I then tried to add relationships after, but there is so many that needs to be added that I gave up on it waiting (it was taking multiple days and still had not loaded in fully).

I’m pretty sure there is a way to load in the nodes and relationships at the same time, which would allow me to use periodic commit for the relationships which I cannot do right now. Essentially I just need to combine the 2 commands into one and after some attempts I can’t seem to work out how to do it?

// load in the lap_times.csv, changing the variable names - about half million nodes (takes 3-4 days)
PERIODIC COMMIT 25000
LOAD CSV WITH HEADERS from 'file:///lap_times.csv'
AS row
MERGE (lt: lapTimes {raceId: row.raceId, driverId: row.driverId, lap: row.lap, position: row.position, time: row.time, milliseconds: row.milliseconds})
RETURN lt;
// add a relationship between laptimes, drivers and races - takes 3-4 days
MATCH (lt:lapTimes),(d:Driver),(r:race)
WHERE lt.raceId = r.raceId AND lt.driverId = d.driverId
MERGE (d)-[rel8:LAPPING_AT]->(lt)
MERGE (r)-[rel9:TIMED_LAP]->(lt)
RETURN type(rel8), type(rel9)

Thanks in advance for any help!

1

There are 1 best solutions below

0
On BEST ANSWER

You should review the documentation for indexes here: https://neo4j.com/docs/cypher-manual/current/administration/indexes-for-search-performance/

Basically, indexes, once created, allow quick lookups of nodes of a given label, for the given property or properties. If you DON'T have an index and you do a MATCH or MERGE of a node, then for every row of that MATCH or MERGE, it has to do a label scan of all nodes of the given label and check all of their properties to find the nodes, and that becomes very expensive, especially when loading CSVs because those operations are likely happening for each row in the CSV.

For your :lapTimes nodes (though we would recommend you use singular labels in most cases), if there are none of them in your graph to start with, then a CREATE instead of a MERGE is fine. You may want a composite index on :lapTimes(raceId, driverId, lap), since that should uniquely identify the node, if you need to look it up later. Using CREATE instead of MERGE here should process much much faster.

Your second query should be MATCHing on :lapTimes nodes (label scan), and from each doing an index lookup on the :race and :driver nodes, so indexes are key here for performance.

You need indexes on: :race(raceId) and :Driver(driverId).

MATCH (lt:lapTimes)
WITH lt, lt.raceId as raceId, lt.driverId as driverId
MATCH (d:Driver), (r:race)
WHERE r.raceId = raceId AND d.driverId = driverId
MERGE (d)-[:LAPPING_AT]->(lt)
MERGE (r)-[:TIMED_LAP]->(lt)

You might consider CREATE instead of MERGE for the relationships, if you know there are no duplicate entries.

I removed your RETURN because returning the types isn't useful information.

Also, consider using consistent cases for your node labels, and that you are using the same case between the labels in your graph and the indexes you create.

Also, you would probably want to batch these changes instead of trying to process them all at once.

If you install APOC Procedures you can make use of apoc.periodic.iterate(), which can be used to batch changes, which will be faster and easier on your heap. You will still need indexes first.

CALL apoc.periodic.iterate("
 MATCH (lt:lapTimes)
 WITH lt, lt.raceId as raceId, lt.driverId as driverId
 MATCH (d:Driver), (r:race)
 WHERE r.raceId = raceId AND d.driverId = driverId
 RETURN lt, d, ir",
 "MERGE (d)-[:LAPPING_AT]->(lt)
 MERGE (r)-[:TIMED_LAP]->(lt)", {}) YIELD batches, total, errorMessages
RETURN batches, total, errorMessages

Single CSV load

If you want to handle everything all at once in a single CSV load, you can do that, but again you will need indexes first. Here's what you'll need at a minimum:

CREATE INDEX ON :Driver(driverId);
CREATE INDEX ON :Race(raceId);

After those are created, you can use this, assuming you are starting from scratch (I fixed the case of your labels and made them singular:

USING PERIODIC COMMIT 25000
LOAD CSV WITH HEADERS from 'file:///lap_times.csv' AS row
MERGE (d:Driver {driverId:row.driverId})
MERGE (r:Race {raceId:row.raceId})
CREATE (lt:LapTime {raceId: row.raceId, driverId: row.driverId, lap: row.lap, position: row.position, time: row.time, milliseconds: row.milliseconds})
CREATE (d)-[:LAPPING_AT]->(lt)
CREATE (r)-[:TIMED_LAP]->(lt)