Quartz race condition between threads

Setup

We have two Quartz Scheduler instances inside a Spring Boot app running in two AWS ECS containers. The schedulers share one clustered JDBC job store (an AWS Aurora PostgreSQL database). Quartz is auto-configured by Spring Boot with the following custom settings:

spring.quartz.job-store-type=jdbc
spring.quartz.jdbc.initialize-schema=never
spring.quartz.properties.org.quartz.jobStore.driverDelegateClass = org.quartz.impl.jdbcjobstore.PostgreSQLDelegate
spring.quartz.properties.org.quartz.jobStore.isClustered = true
spring.quartz.properties.org.quartz.scheduler.instanceId = AUTO
spring.quartz.properties.org.quartz.scheduler.skipUpdateCheck = true
spring.quartz.properties.org.quartz.threadPool.threadCount = 2

Quartz also has its own connection pool (@QuartzDataSource), which is a normal HikariDataSource that connects to the same database as the application's main pool.

Metadata:

  • Quartz v2.3.2
  • Spring Boot v3.1.3
  • HikariCP v5.0.1

The problem

Randomly (but frequently), a race condition occurs within a single scheduler instance: two threads try to fire the same trigger. The first succeeds and executes the job fine, but the slower one throws a JobPersistenceException because it cannot find the trigger record it expected:

org.quartz.JobPersistenceException: Couldn't acquire next trigger: Couldn't retrieve trigger: No record found for selection of Trigger with key: ...

I have verified that this error is logged almost instantly after the other thread has started job execution.

Any ideas on how to fix this race condition?

Solutions already tried

  • Using the main connection pool. For some reason, Quartz then cannot schedule new jobs while others are running. When a job runs for a long time, this creates a catastrophic domino effect in which all database connections are consumed by threads waiting for the slow job to finish.
  • Configuring a transaction manager as suggested in https://stackoverflow.com/a/39725927/10396261
  • Reducing the Quartz connection pool size from 10 to 1. This prevented the race conditions, but after a while Quartz started failing to get database connections.
  • Delaying triggers so that they fire after a few seconds rather than instantly. This made no difference.
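
One setting that does not appear in the list above is Quartz's org.quartz.jobStore.acquireTriggersWithinLock, which makes the JDBC job store acquire the next triggers inside an explicit database lock. The fragment below is an untested sketch in the same Spring Boot property style as the configuration above; whether it removes the error in this setup is an assumption, not a confirmed result.

```properties
# Assumption: untested in this setup. Asks the JDBC job store to acquire
# the next triggers to fire inside an explicit database lock (Quartz default: false).
spring.quartz.properties.org.quartz.jobStore.acquireTriggersWithinLock = true
```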