Gremlin, linking an edge to a vertex via property

107 Views Asked by At

In a graph database, I have graphs like:

v1: Protein{prefName: 'QP1'} 
  -- r1: part_of{evidence: 'ns:testdb'} 
  --> v2: Protcmplx{prefName: 'P12 Complex'}
ev: EvidenceType{ iri = "ns:testdb", label = "Test Database" }

I'd like to write a Gremlin query to fetch instances of the part_of relationship and return v1 and v2's prefName, along with the evidence's label. So far I've tried this:

g.V().hasLabel( containing('Protein') ).as('p')
  .outE().hasLabel( 'is_part_of' ).as('pr')
  .inV().hasLabel( containing('Protcmplx') ).as('cpx')
.V().hasLabel( containing('EvidenceType') ).as('ev')
  .has( 'iri', eq( select('pr').by('evidence') ) )
.select( 'p', 'cpx', 'ev', 'pr' )
  .by('prefName')
  .by('prefName')
  .by('label')
  .by('evidence')
.limit(100)

But it takes a lot of time for a few thousand nodes+edeges, and eventually, it doesn't return anything. I'm sure the values are there and I think the problem is with has( 'iri', ... ), but I can't figure out how to match an edge property with another property in an unconnected vertex.

The graph is modelled this way, cause the LPG model doesn't allow for hyper-edges (linking >2 vertices).

2

There are 2 best solutions below

6
zakmck On BEST ANSWER

I've found a way using where() and by(), but it is quite slow (11secs to get 100 tuples from a few thousands nodes+edges):

g.V().hasLabel ( containing ( 'Protcmplx' ) ).as ( 'cpx' )
  .inE().hasLabel ( 'is_part_of' ).limit ( 10 ).as ( 'pr' )
  .outV ().hasLabel ( containing ( 'Protein' ) ).as ( 'p' )  
.V().hasLabel ( containing ( 'EvidenceType' ) ).as ( 'ev' )
    .where ( 'ev', eq ( 'pr' ) ).by ( 'iri' ).by ( 'evidence' ) 
.select ( 'p', 'cpx', 'ev' )
.by ( 'prefName' )
.by ( 'prefName' )
.by ( 'label' )

Any help with optimisation would be welcome!

EDIT: following a suggestion from the comments (thanks!), I've rewritten the solution a bit (it's still slow) and used .profile() at the end, obtaining this:

Traversal Metrics
Step                                                               Count  Traversers       Time (ms)    % Dur
=============================================================================================================
GraphStep(vertex,[])                                              123591      123591         507.179     9.09
HasStep([~label.containing(Protcmplx)])@[cpx]                         10          10          34.313     0.61
VertexStep(IN,[is_part_of],edge)@[pr]                                 13          13           5.089     0.09
RangeGlobalStep(0,10)                                                 10          10           0.094     0.00
EdgeVertexStep(OUT)                                                   10          10           1.618     0.03
HasStep([~label.containing(Protein)])@[p]                             10          10           0.065     0.00
GraphStep(vertex,[])                                             1738360     1738360        4574.578    81.99
HasStep([~label.containing(EvidenceType)])@[ev]                      510         510         447.546     8.02
WherePredicateStep(ev,eq(pr),[value(iri), value...                    10          10           6.747     0.12
NoOpBarrierStep(2500)                                                 10          10           1.444     0.03
SelectStep(last,[p, cpx, ev],[value(prefName), ...                    10          10           0.154     0.00
NoOpBarrierStep(2500)                                                 10           8           0.785     0.01
                                            >TOTAL                     -           -        5579.617        -

So, the problem seems to be that the second V() picks up all the vertexes before the filters from the former traversal (on the where) can be applied. However, I can't find a way to avoid this. Does Gremlin have subqueries?

EDIT/2: inspired by the suggestion in the comments to use two separated queries (thanks!), I've tried this:

evLabels = [:]
g.V().hasLabel ( containing ( 'Protcmplx' ) ).as ( 'cpx' )
  // Trying to put the limit early-on
  .inE().hasLabel ( 'is_part_of' ).limit ( 100 ).as ( 'pr' )
  .outV ().hasLabel ( containing ( 'Protein' ) ).as ( 'p' )
.select ( 'p', 'cpx', 'pr' )
  .by ( 'prefName' )
  .by ( 'prefName' )
  .by ( map{
    pr = it.get()
    evIri = pr.values ( 'evidence' ).next ();
    lbl = evLabels [ evIri ];
    if ( lbl != null ) return lbl;
    lbl = g.V().hasLabel ( containing ( 'EvidenceType' ) )
             .has ( 'iri', evIri )
             .values ( 'label' ).next ();
    evLabels [ evIri ] = lbl == null ? "" : lbl;
    return lbl;
  })

Which avoids a full cartesian product join by accumulating sub-query results into a map. This is much faster than the original query (like <1s for 100 edges), but not very simple to read, I'm sure there is a better way to write the same.

1
Lvca On

The issue is with ArcadeDB query optimizer and the contains operator. If remove contains and just use the label names, it would use the index and should return in <10ms:

evLabels = [:]
g.V().hasLabel ( 'Concept:Protcmplx:Resource' ).as ( 'cpx' )
  // Trying to put the limit early-on
  .inE().hasLabel ( 'is_part_of' ).limit ( 100 ).as ( 'pr' )
  .outV ().hasLabel ( 'Concept:Protein:Resource' ).as ( 'p' )
.select ( 'p', 'cpx', 'pr' )
  .by ( 'prefName' )
  .by ( 'prefName' )
  .by ( map{
    pr = it.get()
    evIri = pr.values ( 'evidence' ).next ();
    lbl = evLabels [ evIri ];
    if ( lbl != null ) return lbl;
    lbl = g.V().hasLabel ( 'EvidenceType:Resource' )
             .has ( 'iri', evIri )
             .values ( 'label' ).next ();
    evLabels [ evIri ] = lbl == null ? "" : lbl;
    return lbl;
  })