Efficient Gremlin queries for Variable-length paths

I'm trying to benchmark the performance of Gremlin for biology-related knowledge graphs.

I need to write a Gremlin query that is equivalent to this Neo4j/Cypher query:

MATCH path = (gene:Gene) - [:enc] -> (prot:Protein)
  - [:h_s_s|ortho|xref*0..2] - (prot1:Protein)
  - [:is_a|ac_by] - (enz:Enzyme)
  - [:ac_by|in_by] -> (cmp:Comp)
  - [:cs_by|pd_by] -> (trn:Transport) 
  - [:part_of*0..3] -> (pwy:Path)

RETURN 
  [ n in nodes(path) | n.iri ] as nodeIris, 
  rand() AS rnd
ORDER BY rnd
LIMIT 100

That is, proteins might be linked to other proteins by a chain of up to two relations such as xref (the real chains can be much longer, but I'm capping the length), and pathways (Path) might be part of other pathways (again capped, here at three hops). For both proteins and pathways, the chains have variable lengths, and I want to match all of them up to the maximum length.

My understanding is that this is the Gremlin equivalent (label names are changed to support multiple labels):

g.V().hasLabel ( 'Concept:Gene:Resource' )
  .out ( 'enc' ).hasLabel ( 'Concept:Protein:Resource' )

  .emit ()
  .repeat ( both ( 'h_s_s', 'ortho', 'xref' ).simplePath().hasLabel ( 'Concept:Protein:Resource' ) )
  .times ( 2 )

  .both ( 'is_a', 'ac_by' ).hasLabel ( 'Concept:Enzyme:Resource' )

  .out ( 'ac_by', 'in_by' ).hasLabel ( 'Comp:Concept:Resource' ) 
  .out ( 'cs_by', 'pd_by' ).hasLabel ( 'Concept:Resource:Transport' )

  .emit ()
  // out(), to match the directed - [:part_of*0..3] -> in the Cypher pattern
  .repeat ( out ( 'part_of' ).simplePath().hasLabel ( 'Concept:Path:Resource' ) ) 
  .times ( 3 )

.sample ( 100 )
.path ().by ( 'iri' )

This works, but it's extremely slow (about 10-20 seconds). Is emit()/repeat()/times() the most efficient way to do it?

I might try unions of explicit paths, one per length, but that approach is neither very expressive nor easy to write.
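
To make the comparison concrete, here is a minimal plain-Python sketch (not Gremlin, and not run against any graph database) of what the two formulations enumerate for the protein-to-protein step. The toy edge list and the function names are made up for illustration. It models emit().repeat(both().simplePath()).times(2) as "emit the current paths, then extend each by one hop, never revisiting a node", and it models the union idea as one separate walk per exact length. It only shows that the two produce the same paths; it says nothing about how a particular Gremlin engine would execute or optimise either form.

```python
from collections import defaultdict

# Hypothetical toy data: undirected protein-protein links,
# standing in for h_s_s / ortho / xref edges.
EDGES = [("p1", "p2"), ("p2", "p3"), ("p3", "p4"), ("p1", "p3")]


def neighbours(edges):
    """Build an undirected adjacency map (the analogue of both())."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def extend(adj, paths):
    """One hop: extend every path by a neighbour not already on it (simplePath)."""
    return [p + (n,) for p in paths for n in sorted(adj[p[-1]]) if n not in p]


def repeat_with_emit(adj, start, max_hops):
    """Model of emit().repeat(...).times(max_hops): paths of 0..max_hops hops."""
    results = []
    frontier = [(start,)]
    for hop in range(max_hops + 1):
        results.extend(frontier)  # emit what we have at this depth
        if hop < max_hops:
            frontier = extend(adj, frontier)  # shared work: reuse shorter paths
    return results


def union_of_lengths(adj, start, max_hops):
    """Model of a union of explicit paths: one independent walk per length."""
    results = []
    for hops in range(max_hops + 1):
        paths = [(start,)]
        for _ in range(hops):  # each branch re-walks the shorter prefixes
            paths = extend(adj, paths)
        results.extend(paths)
    return results


if __name__ == "__main__":
    adj = neighbours(EDGES)
    for path in repeat_with_emit(adj, "p1", 2):
        print(" -> ".join(path))
```

In this model the union variant re-walks the one-hop prefixes when it builds the two-hop paths, while the repeat variant extends the frontier it already has, so at least on paper the union is not obviously cheaper.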
