Performance issues with Neptune gremlin query

293 Views Asked by popcoder At 23 June 2023 at 11:41

We have an analysis project for which we are using Amazon Neptune. Instance configuration: db.r5d.24xlarge (96 cores, 768 GB memory).

It contains around 50 million vertices and 500 million edges.

Vertices have id, name and category properties. Edges have id, date, tag, score and a few other properties.

I've been running queries to find all possible paths with 1 and 2 hops, but they mostly time out (the timeout is 60 seconds). Queries only return proper responses when the number of vertices/edges processed is small.

Example query:

    g.V()
        .filter(values("name").is(fromName))
        .repeat(
            outE()
                .and(
                    filter(values("tag").is(tag)),
                    filter(values("date").is(P.gte(dateVal))))
                .inV())
        .until(loops().is(2))
        .filter(values("name").is(toName))
        .simplePath()
        .path()
            .by(T.id)
            .by("score");

My Neptune Java client config is as follows:

Cluster cluster =
    Cluster.build()
        .addContactPoint(neptuneHost)
        .port(neptunePort)
        .enableSsl(true)
        .serializer(Serializers.GRAPHBINARY_V1D0)
        .maxConnectionPoolSize(poolSizeMax)
        .minConnectionPoolSize(poolSizeMin).build();

DriverRemoteConnection connection = DriverRemoteConnection.using(cluster);

g = traversal().withRemote(connection);

where poolSizeMax and poolSizeMin are both 8.

I've also tried different values for the following connection properties, with no luck so far:

workerPoolSize
maxInProcessPerConnection
maxSimultaneousUsagePerConnection
minSimultaneousUsagePerConnection

We've noticed that the CPU and memory usage of the Neptune instance is pretty low as well.

Any pointers on optimizing the query or any other configurations to run the queries without timing out would be really helpful. Note that we can't put a limit() to vertices or edges since that will result in incorrect output.


There is 1 best solution below

Taylor Riggan On 23 June 2023 at 13:58

This seems very much in line with an analytics (OLAP) style of using a graph data store. Neptune was originally designed as an OLTP (transactional) graph database. It is built for high concurrency of more constrained graph queries: ones that start from one or a few vertices and traverse the graph until they reach an ending condition.

If you were to attempt to do something like this on Neptune today, you would need to build a multi-threaded app and split your query into multiple concurrent sub-queries that could be executed in parallel.
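
As a rough illustration of that fan-out pattern, here is a minimal sketch using only java.util.concurrent. The class name ParallelSubQueries, the method fanOut and the subQuery parameter are hypothetical names chosen for this example, not part of Neptune or the Gremlin driver. It assumes the work can be partitioned by start key (for instance, one bounded traversal per starting vertex ID), and that in a real application subQuery would submit such a traversal through the shared traversal source; the toy lambda in main only stands in for it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelSubQueries {

    /**
     * Runs one sub-query per start key on a fixed thread pool and merges the results.
     * subQuery is a hypothetical stand-in for a bounded Gremlin traversal, e.g. one
     * that starts from a single vertex ID instead of scanning all of g.V().
     */
    public static <K, R> List<R> fanOut(List<K> startKeys,
                                        Function<K, List<R>> subQuery,
                                        int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // One task per start key; each task is an independent sub-query.
            List<Callable<List<R>>> tasks = new ArrayList<>();
            for (K key : startKeys) {
                tasks.add(() -> subQuery.apply(key));
            }
            List<R> merged = new ArrayList<>();
            // invokeAll blocks until every task has finished and returns
            // the futures in the same order as the submitted tasks.
            for (Future<List<R>> future : pool.invokeAll(tasks)) {
                merged.addAll(future.get());
            }
            return merged;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Toy stand-in for the real sub-query: each start vertex "yields" two paths.
        Function<String, List<String>> fakeSubQuery =
            id -> List.of(id + "->x", id + "->y");
        List<String> paths = fanOut(List.of("a", "b", "c"), fakeSubQuery, 3);
        System.out.println(paths);
    }
}
```

The thread count here is an arbitrary choice for the sketch; in practice it would probably need to be tuned together with the driver's connection pool settings so that the sub-queries are not queued on the client side.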
