Changing the saga example from camel-spring-boot-examples/saga to execute multiple successful saga transactions causes the narayana-lra server to fail and restart with a java.lang.OutOfMemoryError: Java heap space.

I am using org.apache.camel.springboot.example version 3.20.0.

pom.xml

    <parent>
        <groupId>org.apache.camel.springboot</groupId>
        <artifactId>spring-boot</artifactId>
        <version>3.20.0</version>
    </parent>

    <groupId>org.apache.camel.springboot.example</groupId>
    <artifactId>examples</artifactId>
    <name>Camel SB :: Examples</name>
    <description>Camel Examples</description>
    <packaging>pom</packaging>
    <properties>
        <camel-version>3.20.0</camel-version>
        <skip.starting.camel.context>false</skip.starting.camel.context>
        <javax.servlet.api.version>4.0.1</javax.servlet.api.version>
        <jkube-maven-plugin-version>1.9.1</jkube-maven-plugin-version>
        <kafka-avro-serializer-version>5.2.2</kafka-avro-serializer-version>
        <reactor-version>3.2.16.RELEASE</reactor-version>
        <testcontainers-version>1.16.3</testcontainers-version>
        <hapi-structures-v24-version>2.3</hapi-structures-v24-version>
        <narayana-spring-boot-version>2.6.3</narayana-spring-boot-version>
    </properties>

changed docker-compose

version: "3.9"
services:
  lra-coordinator:
    image: "quay.io/jbosstm/lra-coordinator:7.0.0.Final-3.2.2.Final"
    network_mode: "host"
    deploy:
      resources:
        limits:
          memory: 400M
    environment:
      - 'JAVA_TOOL_OPTIONS=-Dquarkus.log.level=DEBUG 
        -Dcom.sun.management.jmxremote=true
        -Dcom.sun.management.jmxremote.port=7091
        -Dcom.sun.management.jmxremote.ssl=false 
        -Dcom.sun.management.jmxremote.authenticate=false'

  amq-broker:
    image: "registry.redhat.io/amq7/amq-broker-rhel8:7.10"
    environment:
      - AMQ_USER=admin
      - AMQ_PASSWORD=admin
      - AMQ_REQUIRE_LOGIN=true
    ports:
      - "8161:8161"
      - "61616:61616"

changed sagaRoute

public class SagaRoute extends RouteBuilder {

    private static final String DIRECT_SAGA = "direct:saga";
    @Autowired
    private ProducerTemplate producerTemplate;

    private final ExecutorService executor = Executors.newFixedThreadPool(10);

    @Override
    public void configure() throws Exception {

        rest().get("/perf")
                .param().type(RestParamType.query).name("n").dataType("int").required(true).endParam()
                .to("direct:perf");

        from("direct:perf")
            .process(exchange -> {
                Integer limit = exchange.getMessage().getHeader("n", Integer.class);
                for (int i = 0; i < limit; i++) {
                    int finalI = i;
                    executor.submit(() -> producerTemplate.sendBodyAndHeader(DIRECT_SAGA, finalI, "id", finalI));
                }
            });

        rest().post("/saga")
                .param().type(RestParamType.query).name("id").dataType("int").required(true).endParam()
                .to(DIRECT_SAGA);

        from(DIRECT_SAGA)
                .saga()
                .compensation("direct:cancelOrder")
                    .log("Executing saga #${header.id} with LRA ${header.Long-Running-Action}")
                    .setHeader("payFor", constant("train"))
                    .to("activemq:queue:{{example.services.train}}?exchangePattern=InOut" +
                            "&replyTo={{example.services.train}}.reply")
                    .log("train seat reserved for saga #${header.id} with payment transaction: ${body}")
                    .setHeader("payFor", constant("flight"))
                    .to("activemq:queue:{{example.services.flight}}?exchangePattern=InOut" +
                            "&replyTo={{example.services.flight}}.reply")
                    .log("flight booked for saga #${header.id} with payment transaction: ${body}")
                .setBody(header("Long-Running-Action"))
                .end();

        from("direct:cancelOrder")
                .log("Transaction ${header.Long-Running-Action} has been cancelled due to flight or train failure");

    }

}

changed paymentRoute https://github.com/apache/camel-spring-boot-examples/blob/camel-spring-boot-examples-3.20.0/saga/saga-payment-service/src/main/java/org/apache/camel/example/saga/PaymentRoute.java

public class PaymentRoute extends RouteBuilder {

    @Override
    public void configure() throws Exception {

        from("activemq:queue:{{example.services.payment}}")
                .routeId("payment-service")
                .saga()
                    .propagation(SagaPropagation.MANDATORY)
                    .option("id", header("id"))
                    .compensation("direct:cancelPayment")
                    .log("Paying ${header.payFor} for order #${header.id}")
                    .setBody(header("JMSCorrelationID"))
                    .log("Payment ${header.payFor} done for order #${header.id} with payment transaction ${body}")
                .end();

        from("direct:cancelPayment")
                .routeId("payment-cancel")
                .log("Payment for order #${header.id} has been cancelled");
    }
}

Executing run-local.sh to start the services.

Executing the command below with the number of transactions to be executed:

curl http://localhost:8084/api/perf?n=100000

Doing this, we get the following problem in Flight Recorder:

The live set on the heap seems to increase at a rate of about 26.2 KiB per second during the recording.
An analysis of the reference tree found 1 leak candidate. The main candidate is java.util.concurrent.ConcurrentHashMap$Node[], referenced by this chain:
java.util.concurrent.ConcurrentHashMap.table
io.narayana.lra.coordinator.domain.service.LRAService.participants
io.narayana.lra.coordinator.internal.LRARecoveryModule.service
java.lang.Object[]
java.util.Vector.elementData
com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery._recoveryModules

[Flight Recorder diagnosis image]

and these errors on the Narayana LRA side:

2023-08-15 19:16:35,438 DEBUG [org.jbo.res.rea.com.cor.AbstractResteasyReactiveContext] (executor-thread-59) Restarting handler chain for exception exception: java.lang.OutOfMemoryError: Java heap space

2023-08-15 19:16:37,729 DEBUG [org.jbo.res.rea.com.cor.AbstractResteasyReactiveContext] (executor-thread-59) Attempting to handle unrecoverable exception: java.lang.OutOfMemoryError: Java heap space

2023-08-15 19:16:38,327 DEBUG [io.ver.ext.web.RoutingContext] (executor-thread-59) RoutingContext failure (500): java.lang.OutOfMemoryError: Java heap space

The attribute that holds the data causing the memory leak is

io.narayana.lra.coordinator.domain.service.LRAService.participants

public class LRAService {
    private final Map<String, String> participants = new ConcurrentHashMap<>();

In the LRAService class I could not find where items are removed from this map.
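To make the growth pattern concrete, here is a minimal, self-contained sketch of the suspected leak (all names are hypothetical; this is not Narayana code): entries are added to a ConcurrentHashMap when a participant enlists, but nothing removes them when the saga finishes, so every successful transaction leaks one entry.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the leak pattern: the map is populated on enlistment
// but never cleaned up on completion, so it grows without bound.
class LeakySaga {
    private final Map<String, String> participants = new ConcurrentHashMap<>();

    void enlist(String lraId, String participantUrl) {
        participants.put(lraId, participantUrl); // added when the participant joins
    }

    void finish(String lraId) {
        // Leak analogue: the entry for lraId is NOT removed here,
        // so completed sagas keep their participant entries alive.
    }

    int size() {
        return participants.size();
    }

    public static void main(String[] args) {
        LeakySaga svc = new LeakySaga();
        for (int i = 0; i < 100_000; i++) {
            String id = "lra-" + i;
            svc.enlist(id, "http://participant/" + i);
            svc.finish(id); // every saga completes successfully...
        }
        // ...yet the map still holds one entry per saga
        System.out.println(svc.size()); // prints 100000
    }
}
```

Running this with many iterations mirrors what the curl loop above does to the coordinator: the retained size of the map tracks the total number of transactions, not the number of in-flight ones.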

Is it a bug? A misconfiguration of Narayana LRA? A bug in Apache Camel saga?

thanks a lot

There are 2 answers below.

I received an answer on the Narayana Zulip chat:

https://narayana.zulipchat.com/#narrow/stream/323714-users/topic/apache.20camel.20saga.20causes.20Narayana.20-LRA.20memory.20leak/near/386195429

Michael Musgrove: Well spotted, the participant should be removed here when the transaction finishes (and updated if participants move).

We'll get an issue tracker raised for you to monitor the fix.

Michael Musgrove: You may track our progress using issue https://issues.redhat.com/browse/JBTM-3795

Content of the issue:

Description: The LRA module maintains a map of participants [1] which should be cleaned up when an LRA finishes [2] (and if a participant wants to be notified on a different endpoint).

[1] https://github.com/jbosstm/narayana/blob/7.0.0.Final/rts/lra/coordinator/src/main/java/io/narayana/lra/coordinator/domain/service/LRAService.java#L46

[2] https://github.com/jbosstm/narayana/blob/7.0.0.Final/rts/lra/coordinator/src/main/java/io/narayana/lra/coordinator/domain/service/LRAService.java#L195
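Based on the issue description, the fix would presumably remove the participant entry when its LRA ends. A hedged sketch of that cleanup (hypothetical names, not the actual Narayana patch):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration of the cleanup described in JBTM-3795:
// the participant entry is removed when its LRA finishes, so completed
// transactions no longer pin map entries in the coordinator's heap.
class ParticipantRegistry {
    private final Map<String, String> participants = new ConcurrentHashMap<>();

    void enlist(String lraId, String participantUrl) {
        participants.put(lraId, participantUrl); // added when the participant joins
    }

    void endLRA(String lraId) {
        participants.remove(lraId); // cleanup when the LRA completes or is cancelled
    }

    int size() {
        return participants.size();
    }

    public static void main(String[] args) {
        ParticipantRegistry registry = new ParticipantRegistry();
        for (int i = 0; i < 100_000; i++) {
            String id = "lra-" + i;
            registry.enlist(id, "http://participant/" + i);
            registry.endLRA(id); // with cleanup, the map stays empty
        }
        System.out.println(registry.size()); // prints 0
    }
}
```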


As written in the Zulip thread, this issue is related to Netty direct memory not respecting the Docker memory limit. To fix the OOM error you could set

JAVA_TOOL_OPTIONS='-Dio.netty.maxDirectMemory=0'

as an environment variable. You might also make it work by limiting io.netty.maxDirectMemory to a value lower than the Docker container memory limit (e.g. -Dio.netty.maxDirectMemory=100m).
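Applied to the docker-compose file from the question, the workaround could look like this (a sketch; only the JAVA_TOOL_OPTIONS value changes, and the other debug flags can be kept or dropped as needed):

```yaml
services:
  lra-coordinator:
    image: "quay.io/jbosstm/lra-coordinator:7.0.0.Final-3.2.2.Final"
    network_mode: "host"
    deploy:
      resources:
        limits:
          memory: 400M
    environment:
      # Disable Netty direct (off-heap) memory so allocations stay on the
      # JVM heap and therefore respect the container memory limit.
      - 'JAVA_TOOL_OPTIONS=-Dio.netty.maxDirectMemory=0'
```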