UPDATED BELOW 11.09.2023
We have a sizeable database of roughly 200,000 lines of code, complete with a read replica hosted on AWS. The problem is we're occasionally encountering a deadlock/timeout error that I can't seem to track down. The error surfaces as a SAVEPOINT does not exist error. I believe a sub-procedure is hitting the deadlock or timeout, which is why the main procedure loses its savepoint (which it is designed to roll back to in case of an error). Apparently a deadlock or timeout in a sub-procedure erases the savepoint...
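A minimal sketch of what I believe is happening (the savepoint name sp1 is illustrative; worth verifying against your MySQL version):

```sql
START TRANSACTION;
SAVEPOINT sp1;

-- ... the sub-procedure hits a deadlock here (ER_LOCK_DEADLOCK, 1213).
-- By default InnoDB rolls back the ENTIRE victim transaction, which
-- discards sp1 along with everything else. (A lock wait timeout,
-- ER_LOCK_WAIT_TIMEOUT 1205, only rolls back the statement unless
-- innodb_rollback_on_timeout is ON.)

ROLLBACK TO SAVEPOINT sp1;  -- fails: SAVEPOINT sp1 does not exist
```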
We have the slow log turned on, and there's nothing in it indicating a record lock held for any concerning length of time, or anything else. The whole thing is rather frustrating.
And of course this happened while I was asleep, so I couldn't query the data_locks or data_lock_waits tables. It's so random that by the time we know it's happening, we almost never catch it.
Note that our backend system reuses open connections rather than closing them when everything is done. I suspect the issue is related to the reuse of a connection - but in reality that should be fine, since our system is designed to allow 'core' procedure calls by the backend at any time on any connection.
Here's a snapshot of the main DB and the waits. I find it interesting that the biggest waits are for the BIN_LOG, which synchronizes with the read replica.
Sadly, other than this, we have nothing to go on. As mentioned, the slow log has nothing usable in it; the highest lock time value is 00:00:00.000035.
In many regards, it makes no sense. I know we have queries that lock for longer than that. Perhaps the queries that lock aren't slow? Is there a setting I need to adjust?
Any advice on tracking this down would be appreciated.
So I've tried the slow-log approach and, no kidding, none of my lock times exceed 0.046564 - even with the slow log threshold set to 0 - and yes, timeouts did take place. So I'm trying a different technique: an event-based monitor that watches sys.innodb_lock_waits and, whenever there are records, stores the data for review. We're testing it out today. I'll let everyone know how it goes.
The deadlock error happened again, but my event monitor, which runs every 5 seconds checking for lock waits, returned nothing, which tells me the deadlock is returning immediately and not 'timing out'. Is this possible? I thought all deadlocks would time out...
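If that's right, polling can't catch it, so my next step (untested on our system, but these are standard server options as I understand them) is to have the server record deadlocks itself:

```sql
-- A detected deadlock never waits out innodb_lock_wait_timeout; InnoDB
-- picks a victim and rolls it back immediately, so a 5-second poller
-- will almost always miss it. Log every detected deadlock to the
-- error log instead:
SET GLOBAL innodb_print_all_deadlocks = ON;

-- The most recent deadlock (queries involved, locks held and waited
-- for) also appears in the LATEST DETECTED DEADLOCK section of:
SHOW ENGINE INNODB STATUS;
```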
SQL code is:

SELECT JSON_ARRAYAGG(JSON_OBJECT(
         'waiting_trx_id', waiting_trx_id,
         'waiting_pid', waiting_pid,
         'waiting_query', waiting_query,
         'blocking_trx_id', blocking_trx_id,
         'blocking_pid', blocking_pid,
         'blocking_query', blocking_query))
  INTO json_results
  FROM sys.innodb_lock_waits;
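For reference, the surrounding monitor looks roughly like this (the table and event names here are illustrative, not our real ones):

```sql
CREATE TABLE IF NOT EXISTS lock_wait_log (
  captured_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  waits JSON
);

DELIMITER $$
CREATE EVENT capture_lock_waits
ON SCHEDULE EVERY 5 SECOND
DO
BEGIN
  DECLARE json_results JSON;
  SELECT JSON_ARRAYAGG(JSON_OBJECT(
           'waiting_trx_id', waiting_trx_id,
           'waiting_pid', waiting_pid,
           'waiting_query', waiting_query,
           'blocking_trx_id', blocking_trx_id,
           'blocking_pid', blocking_pid,
           'blocking_query', blocking_query))
    INTO json_results
    FROM sys.innodb_lock_waits;
  -- Only store something when there are actual lock waits.
  IF json_results IS NOT NULL THEN
    INSERT INTO lock_wait_log (waits) VALUES (json_results);
  END IF;
END$$
DELIMITER ;
```

(The event scheduler must be on: SET GLOBAL event_scheduler = ON.)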