I am occasionally getting deadlocks from Postgres when executing a SELECT query on a hypertable that has both a compression policy and a retention policy.
From the logs, it looks like the deadlock is occurring because the retention policy job is running at the same time a SELECT query is executing.
Here's a sanitized snippet from the postgres logs:
2023-08-16T15:02:21.680047 [14-1] pid=235144,user=user,db=test,app=[unknown],client=9.9.9.9 ERROR: deadlock detected
2023-08-16T15:02:21.680244 [14-2] pid=235144,user=user,db=test,app=[unknown],client=9.9.9.9 DETAIL: Process 235144 waits for ShareLock on transaction 6376925; blocked by process 252148.
2023-08-16T15:02:21.680315 [14-3] Process 252148 waits for AccessExclusiveLock on relation 2999269 of database 16413; blocked by process 235144.
2023-08-16T15:02:21.680392 [14-4] Process 235144: select coalesce(max(i.item_sequence),0) from items i where i.item_time >= now() and i.item_category_id = $1
2023-08-16T15:02:21.680432 [14-5] Process 252148: CALL _timescaledb_internal.policy_retention()
2023-08-16T15:02:21.680465 [14-6] pid=235144,user=user,db=test,app=[unknown],client=9.9.9.9 HINT: See server log for query details.
2023-08-16T15:02:21.680493 [14-7] pid=235144,user=user,db=test,app=[unknown],client=9.9.9.9 CONTEXT: while locking tuple (0,158) in relation "dimension_slice"
2023-08-16T15:02:21.680526 [14-8] pid=235144,user=user,db=test,app=[unknown],client=9.9.9.9 STATEMENT: select coalesce(max(i.item_sequence),0) from items i where i.item_time >= now() and i.item_category_id = $1
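(For reference, the relation OID from the DETAIL line can be resolved to a name with a regclass cast; the dimension_slice mentioned in the CONTEXT line is TimescaleDB's internal catalog table _timescaledb_catalog.dimension_slice.)
-- resolve the relation OID reported in the deadlock DETAIL line
SELECT 2999269::regclass;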
Basics of the hypertable in question:
CREATE TABLE IF NOT EXISTS items (
    item_time bigint NOT NULL,
    item_category_id varchar NOT NULL,
    item_sequence int NOT NULL
);
CREATE INDEX ON items (item_category_id, item_time ASC, item_sequence ASC);
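For completeness, the hypertable conversion itself isn't shown above; it is done along these lines (the chunk interval and the integer-now function name below are placeholders, but since item_time is epoch milliseconds in a bigint, some integer-now function has to be registered for the BIGINT-based policies to work):
-- convert to a hypertable; the chunk interval here (1 hour) is a placeholder
SELECT create_hypertable('items', 'item_time', chunk_time_interval => BIGINT '3600000');
-- register an "integer now" function so integer-based policies know the current time
CREATE OR REPLACE FUNCTION items_epoch_ms_now() RETURNS BIGINT
LANGUAGE SQL STABLE AS $$
  SELECT (extract(epoch FROM now()) * 1000)::bigint
$$;
SELECT set_integer_now_func('items', 'items_epoch_ms_now');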
With the following data retention policy, which drops data older than 2 days and runs every 2 hours:
-- retention policy to drop content older than 2 days
SELECT add_retention_policy('items', BIGINT '172800000'); -- 2 days = 172800000 milliseconds
-- alter the retention policy job to drop old data every 2 hours instead of the default of once a day
SELECT alter_job(job_id, schedule_interval => INTERVAL '2 hours') FROM timescaledb_information.jobs WHERE proc_name = 'policy_retention' AND hypertable_name = 'items';
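For reference, a way to preview which chunks that job would drop on its next run (for integer-time hypertables, older_than is an absolute value in the time column's units, hence the subtraction; the epoch-ms expression is just a stand-in for the integer-now function):
-- list chunks whose data falls entirely outside the 2-day retention window
SELECT show_chunks('items', older_than => (extract(epoch FROM now()) * 1000)::bigint - 172800000);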
...and a similar compression policy:
ALTER TABLE items SET (timescaledb.compress, timescaledb.compress_segmentby = 'item_category_id');
-- compress data older than 4 hours (4 hours = 14400000 milliseconds)
SELECT add_compression_policy('items', BIGINT '14400000');
-- Alter the new compression job so that it kicks off every 2 hours instead of the default of once a day, so we can compress old data quickly
SELECT alter_job(job_id, schedule_interval => INTERVAL '2 hours') FROM timescaledb_information.jobs WHERE proc_name = 'policy_compression' AND hypertable_name = 'items';
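To line up the deadlock timestamps with the background job runs, I can join the standard information views, something like this (column names as in recent TimescaleDB versions):
-- when did the retention/compression jobs for this hypertable last run, and when do they run next?
SELECT j.job_id, j.proc_name, s.last_run_started_at, s.last_run_status, s.next_start
FROM timescaledb_information.jobs j
JOIN timescaledb_information.job_stats s USING (job_id)
WHERE j.hypertable_name = 'items';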
If I'm interpreting everything above correctly, is it possible that the two locks in question come from the retention policy job dropping older chunks while the SELECT statement scans the entire table (possibly triggering decompression), with each process waiting on a lock the other holds?
I would have thought the predicate where i.item_time >= now() and i.item_category_id = $1 would restrict the query to only the latest chunk?
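One way to check whether chunk exclusion actually happens for this predicate (substituting a literal for $1 and an explicit epoch-millisecond expression for the sanitized now() call, since item_time is a bigint):
-- with a non-constant time predicate, exclusion happens at execution time and the
-- ChunkAppend node typically reports something like "Chunks excluded during startup"
EXPLAIN (COSTS OFF)
SELECT coalesce(max(i.item_sequence), 0)
FROM items i
WHERE i.item_time >= (extract(epoch FROM now()) * 1000)::bigint
  AND i.item_category_id = 'some-category';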
I feel like I am missing something fundamental.
There ARE alternatives, from running the retention policy less often to storing the latest item_sequence in a continuous aggregate (maybe?), but I am not sure what the simplest option is.
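For what it's worth, the continuous-aggregate idea would look roughly like this (view name, bucket width, and refresh offsets are guesses on my part):
-- track the max item_sequence per category in 1-hour buckets so the hot-path
-- SELECT could read the aggregate instead of the raw hypertable
CREATE MATERIALIZED VIEW items_max_sequence
WITH (timescaledb.continuous) AS
SELECT
  time_bucket(BIGINT '3600000', item_time) AS bucket,
  item_category_id,
  max(item_sequence) AS max_sequence
FROM items
GROUP BY bucket, item_category_id;
-- refresh policy; offsets are in epoch milliseconds because the hypertable uses integer time
SELECT add_continuous_aggregate_policy('items_max_sequence',
  start_offset => BIGINT '14400000',  -- 4 hours
  end_offset   => BIGINT '3600000',   -- 1 hour
  schedule_interval => INTERVAL '1 hour');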