I'm trying to benchmark the performance of Gremlin for biology-related knowledge graphs.
I need to write a Gremlin query that is equivalent to this Neo4j/Cypher:
MATCH path = (gene:Gene) - [:enc] -> (prot:Protein)
- [:h_s_s|ortho|xref*0..2] - (prot1:Protein)
- [:is_a|ac_by] - (enz:Enzyme)
- [:ac_by|in_by] -> (cmp:Comp)
- [:cs_by|pd_by] -> (trn:Transport)
- [:part_of*0..3] -> (pwy:Path)
RETURN
[ n in nodes(path) | n.iri ] as nodeIris,
rand() AS rnd
ORDER BY rnd
LIMIT 100
That is, proteins might be linked to other proteins by a chain of 0 to 2 relations such as xref (in reality the chains can be much longer, but I'm setting a limit), and pathways (Path) might be part of other pathways (again, I'm limiting the depth, here to 3). For both proteins and pathways there are chains of variable length, and I want to catch all of them up to the maximum length.
My understanding is that this is the Gremlin equivalent (label names are changed to support multiple labels):
g.V().hasLabel ( 'Concept:Gene:Resource' )
.out ( 'enc' ).hasLabel ( 'Concept:Protein:Resource' )
.emit ()
.repeat ( both ( 'h_s_s', 'ortho', 'xref' ).simplePath().hasLabel ( 'Concept:Protein:Resource' ) )
.times ( 2 )
.both ( 'is_a', 'ac_by' ).hasLabel ( 'Concept:Enzyme:Resource' )
.out ( 'ac_by', 'in_by' ).hasLabel ( 'Comp:Concept:Resource' )
.out ( 'cs_by', 'pd_by' ).hasLabel ( 'Concept:Resource:Transport' )
.emit ()
.repeat ( both ( 'part_of' ).simplePath().hasLabel ( 'Concept:Path:Resource' ) )
.times ( 3 )
.sample ( 100 )
.path ().by ( 'iri' )
While this works, it's extremely slow (about 10-20 seconds). Is emit()/repeat()/times() the most efficient way to do it?
I could try a union of explicit fixed-length paths, one branch per chain length, but that approach is neither very expressive nor easy to write.
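For concreteness, the union alternative might look roughly like the sketch below. It mirrors the Gremlin query above (not the Cypher one), with each emit()/repeat()/times() block replaced by a union() that has one branch per chain length. It is only a sketch: it has not been run against the dataset, applying simplePath() once after each union() instead of after every hop is an assumption made to keep it short, and it is not guaranteed to return exactly the same paths or to be any faster.

```groovy
// Sketch only: explicit branches instead of emit()/repeat()/times().
g.V().hasLabel ( 'Concept:Gene:Resource' )
  .out ( 'enc' ).hasLabel ( 'Concept:Protein:Resource' )
  .union (
    // 0 hops: keep the protein reached via 'enc'
    identity (),
    // 1 hop
    both ( 'h_s_s', 'ortho', 'xref' ).hasLabel ( 'Concept:Protein:Resource' ),
    // 2 hops
    both ( 'h_s_s', 'ortho', 'xref' ).hasLabel ( 'Concept:Protein:Resource' )
      .both ( 'h_s_s', 'ortho', 'xref' ).hasLabel ( 'Concept:Protein:Resource' )
  )
  // Assumption: one filter here stands in for the per-hop simplePath() above
  .simplePath ()
  .both ( 'is_a', 'ac_by' ).hasLabel ( 'Concept:Enzyme:Resource' )
  .out ( 'ac_by', 'in_by' ).hasLabel ( 'Comp:Concept:Resource' )
  .out ( 'cs_by', 'pd_by' ).hasLabel ( 'Concept:Resource:Transport' )
  .union (
    // 0 hops: keep the Transport vertex itself
    identity (),
    // 1 hop
    both ( 'part_of' ).hasLabel ( 'Concept:Path:Resource' ),
    // 2 hops
    both ( 'part_of' ).hasLabel ( 'Concept:Path:Resource' )
      .both ( 'part_of' ).hasLabel ( 'Concept:Path:Resource' ),
    // 3 hops
    both ( 'part_of' ).hasLabel ( 'Concept:Path:Resource' )
      .both ( 'part_of' ).hasLabel ( 'Concept:Path:Resource' )
      .both ( 'part_of' ).hasLabel ( 'Concept:Path:Resource' )
  )
  .simplePath ()
  .sample ( 100 )
  .path ().by ( 'iri' )
```

Even at these small limits the query needs seven hand-written branches, and raising either limit means adding more, which is why this form is harder to write and maintain than repeat().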