We use postgres 11 in Azure cloud as a temporary store. Two tables in particular receive large number of inserts and deletes (hundreds of millions) in lots of small transactions with heavy lock contention during our batch processing. The tables will eventually be empty when all the data is processed. Because of the amount of traffic going through these two tables we have tuned autovacuum to be aggressive on them
autovacuum_vacuum_cost_limit=2000, autovacuum_vacuum_cost_delay=1, autovacuum_vacuum_scale_factor=0.02, autovacuum_vacuum_threshold=10000
Autovacuum generally keeps up with the load and keeps the table size down. We also sometimes truncate or run vacuum full between loads. But once in a while we are noticing that the autovacuum process stops running completely. Around the same time we start seeing xid wraparound warnings in the logs. When this happens relfrozenxid age for some pg_catalog tables is over 200 million but not for the tables we are using.
Autovacuum doesn't come back even if we leave the database in a quiet state for a few hours. pg_stat_progress_vacuum shows no activity. pg_stat_activity does not show any long running transactions. There are no abandoned replication slots or prepared transactions that could block autovacuum. We can't vacuum tables in pg_catalog because it is owned by azure super user. We also don't have access to see if autovacuum threads are running at the os level.
Restarting postgres gets the autovacuum going again but it is not a viable option for us in the long run.
What could be going wrong here?
When your tables stop being vacuumed, you should check how many autovacuum workers you see in
pg_stat_activity. If there are as many asautovacuum_max_workers, the explanation is that there is no room for your tables beingVACUUMed. Anti-wraparound autovacuum may well cause that, since those runs are normally longer and more resource-intense than normal autovacuum runs.You can either speed up the anti-wraparound autovacuum workers by setting the parameters appropriately, or you can increase
autovacuum_max_workersto avoid starving your important tables. Both options will make autovacuum use more of your system's resources.