In a graph database, I have graphs like:
v1: Protein{prefName: 'QP1'}
-- r1: part_of{evidence: 'ns:testdb'}
--> v2: Protcmplx{prefName: 'P12 Complex'}
ev: EvidenceType{ iri = "ns:testdb", label = "Test Database" }
I'd like to write a Gremlin query to fetch instances of the part_of relationship and return v1 and v2's prefName, along with the evidence's label. So far I've tried this:
g.V().hasLabel( containing('Protein') ).as('p')
.outE().hasLabel( 'is_part_of' ).as('pr')
.inV().hasLabel( containing('Protcmplx') ).as('cpx')
.V().hasLabel( containing('EvidenceType') ).as('ev')
.has( 'iri', eq( select('pr').by('evidence') ) )
.select( 'p', 'cpx', 'ev', 'pr' )
.by('prefName')
.by('prefName')
.by('label')
.by('evidence')
.limit(100)
But it takes a long time for a few thousand nodes+edges and, eventually, it doesn't return anything. I'm sure the values are there, and I think the problem is with has( 'iri', ... ), but I can't figure out how to match an edge property with a property of another, unconnected vertex.
The graph is modelled this way because the LPG model doesn't allow for hyper-edges (edges linking more than two vertices).
I've found a way using where() and by(), but it is quite slow (11 s to get 100 tuples from a few thousand nodes+edges):
Any help with optimisation would be welcome!
EDIT: following a suggestion from the comments (thanks!), I've rewritten the solution a bit (it's still slow) and used .profile() at the end, obtaining this:
So, the problem seems to be that the second V() picks up all the vertices before the filters from the preceding part of the traversal (the ones in where()) can be applied. However, I can't find a way to avoid this. Does Gremlin have subqueries?
EDIT/2: inspired by the suggestion in the comments to use two separated queries (thanks!), I've tried this:
This avoids a full Cartesian-product join by accumulating sub-query results into a map. It is much faster than the original query (under 1 s for 100 edges), but it is not very easy to read, and I'm sure there is a better way to write the same thing.
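To make the difference between the two strategies concrete, here is a minimal sketch in plain Python rather than Gremlin, so it runs without a graph server. It is not the query from this post: the dictionaries and the function names join_nested and join_with_map are hypothetical stand-ins I made up for the sample graph above. It only illustrates why indexing the evidence labels by iri once is cheaper than comparing every edge with every EvidenceType vertex.

```python
# Hypothetical in-memory stand-ins for the sample graph in the question.
proteins = {"v1": "QP1"}              # vertex id -> prefName
complexes = {"v2": "P12 Complex"}     # vertex id -> prefName
# Each edge: (source vertex id, target vertex id, evidence IRI)
part_of_edges = [("v1", "v2", "ns:testdb")]
evidence_types = [{"iri": "ns:testdb", "label": "Test Database"}]


def join_nested(edges, evidences):
    """Cartesian-style join: every edge is compared with every evidence vertex."""
    rows = []
    for src, dst, ev_iri in edges:
        for ev in evidences:
            if ev["iri"] == ev_iri:
                rows.append((proteins[src], complexes[dst], ev["label"], ev_iri))
    return rows


def join_with_map(edges, evidences):
    """Map-based join: index labels by IRI once, then do one lookup per edge."""
    label_by_iri = {ev["iri"]: ev["label"] for ev in evidences}
    return [
        (proteins[src], complexes[dst], label_by_iri[ev_iri], ev_iri)
        for src, dst, ev_iri in edges
        if ev_iri in label_by_iri  # skip edges whose evidence is unknown
    ]


rows = join_with_map(part_of_edges, evidence_types)
print(rows)  # [('QP1', 'P12 Complex', 'Test Database', 'ns:testdb')]
```

Both functions return the same tuples. The nested version does work proportional to (number of edges) x (number of evidence vertices), which is roughly what a mid-traversal V() followed by a filter amounts to, while the map version does one pass over each collection.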