Performance issues with Neptune Gremlin query


We have an analysis project for which we are using Amazon Neptune. Instance configuration: db.r5d.24xlarge (96 cores, 768 GB memory).

It contains around 50 million vertices and 500 million edges.

Vertices have id, name and category properties. Edges have id, date, tag, score and a few other properties.

I've been running queries to find all possible paths of 1 and 2 hops, but they mostly time out (the timeout is 60 seconds). Queries only return proper responses when the number of vertices and edges processed is small.

Example query:

    g.V()
        .filter(values("name").is(fromName))
        .repeat(
            outE()
                .and(
                    filter(values("tag").is(tag)),
                    filter(values("date").is(P.gte(dateVal))))
                .inV())
        .until(loops().is(2))
        .filter(values("name").is(toName))
        .simplePath()
        .path()
        .by(T.id)
        .by("score");

My Neptune Java client is configured as follows:

    Cluster cluster =
        Cluster.build()
            .addContactPoint(neptuneHost)
            .port(neptunePort)
            .enableSsl(true)
            .serializer(Serializers.GRAPHBINARY_V1D0)
            .maxConnectionPoolSize(poolSizeMax)
            .minConnectionPoolSize(poolSizeMin)
            .build();

    DriverRemoteConnection connection = DriverRemoteConnection.using(cluster);

    g = traversal().withRemote(connection);

where poolSizeMax and poolSizeMin are both set to 8.

I've also tried different values for the following connection properties, with no luck so far:

workerPoolSize
maxInProcessPerConnection
maxSimultaneousUsagePerConnection
minSimultaneousUsagePerConnection

We've noticed that the CPU and memory usage of the Neptune instance is pretty low as well.

Any pointers on optimizing the query, or on other configuration changes that would let the queries run without timing out, would be really helpful. Note that we can't apply a limit() to vertices or edges, since that would produce incorrect output.

Taylor Riggan

This looks very much like an analytics (OLAP) style of using a graph data store. Neptune was originally designed as a transactional (OLTP) graph database. It is built for high concurrency of more constrained graph queries: queries that start from one or a few starting points and traverse the graph until they reach an ending condition.

If you were to attempt to do something like this on Neptune today, you would need to build a multi-threaded app and split your query into multiple concurrent sub-queries that could be executed in parallel.
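As a rough illustration of that fan-out idea, here is a minimal sketch that uses only the Java standard library. It assumes you first fetch the first-hop vertex ids with a cheap query, then run the second hop as many small sub-queries in parallel. The names `ParallelSubQueries`, `partition` and `runInParallel` are hypothetical, and the `subQuery` function is a placeholder for whatever Gremlin traversal you would actually submit for one batch of start ids; nothing here is Neptune-specific API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class ParallelSubQueries {

    /** Splits the items into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            int end = Math.min(i + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(i, end)));
        }
        return batches;
    }

    /**
     * Runs one sub-query per batch of start ids on a fixed-size thread pool
     * and concatenates the results in batch order.
     */
    static <K, R> List<R> runInParallel(List<K> startIds,
                                        int batchSize,
                                        int threads,
                                        Function<List<K>, List<R>> subQuery)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<List<R>>> tasks = new ArrayList<>();
            for (List<K> batch : partition(startIds, batchSize)) {
                tasks.add(() -> subQuery.apply(batch));
            }
            List<R> results = new ArrayList<>();
            // invokeAll returns the futures in the same order as the tasks.
            for (Future<List<R>> future : pool.invokeAll(tasks)) {
                results.addAll(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical ids of the vertices reached after the first hop.
        List<String> firstHopIds = List.of("v1", "v2", "v3", "v4", "v5");

        // Placeholder sub-query: in a real application this lambda would
        // submit a Gremlin traversal that starts from the ids in the batch
        // and return the matching paths. Here it only tags each id.
        List<String> paths = runInParallel(firstHopIds, 2, 4, batch -> {
            List<String> found = new ArrayList<>();
            for (String id : batch) {
                found.add("path-via-" + id);
            }
            return found;
        });

        System.out.println(paths);
    }
}
```

The batch size and thread count are tuning knobs rather than recommended values. It is probably sensible to keep the number of threads in line with what the driver's connection pool can serve, so that sub-queries do not queue up on the client side.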