Problem Statement
To make sure disk usage isn't growing unnecessarily, I want to be able to delete rows from my outbox table once they have been replicated.
Context
Postgres is at v12
We are using a Kafka source connector to stream changes made to a Postgres table. These changes are insert-only and are no longer needed once they have been written to Kafka. The source connector uses logical replication to stream the changes, and the state of that replication can be seen in pg_replication_slots.
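For reference, a slot like this can be created and inspected by hand; the sketch below just mirrors the slot name and plugin that show up in the output further down (peeking only works while the connector isn't actively holding the slot):

-- create a logical replication slot using the wal2json output plugin
select * from pg_create_logical_replication_slot('debezium', 'wal2json');

-- peek at pending changes without consuming them (errors if the slot is active)
select * from pg_logical_slot_peek_changes('debezium', null, null);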
Looking at pg_replication_slots, you can see the data each slot stores in order to know which WAL it has to keep so that replication can still happen for the client.
For example, when I run:
select * from pg_replication_slots;
I might see:
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
-----------+----------+-----------+--------+--------------------+-----------+--------+------------+------+--------------+-------------+---------------------
debezium | wal2json | logical | 26593 | database_name | f | t | 7404 | | 26729 | 0/DCD98E8 | 0/DCD9920
(1 row)
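The two columns that seem relevant here are catalog_xmin and confirmed_flush_lsn; they can be pulled out directly for the slot (slot name 'debezium' as in the output above):

select slot_name, catalog_xmin, restart_lsn, confirmed_flush_lsn
from pg_replication_slots
where slot_name = 'debezium';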
What I'm interested in knowing is whether I can reliably use that data, together with the PostgreSQL system columns on the table, to select all rows that have already been replicated through that slot.
For example, the following doesn't work as far as I can tell, but ideally it would return the rows that have been replicated and are now safe to prune from the table:
select * from outbox where age(xmin) < (select age(catalog_xmin) from pg_replication_slots);
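For completeness, the shape of the delete I imagine ending up with is sketched below (slot name 'debezium' as above); whether comparing a row's xmin against the slot's catalog_xmin is actually a safe criterion is exactly what I'm unsure about:

-- sketch only: prune rows whose inserting transaction looks older than the slot's catalog_xmin
delete from outbox
where age(xmin) >= (
    select age(catalog_xmin)
    from pg_replication_slots
    where slot_name = 'debezium'
);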
Any guidance would be sweet! Cheers!
I have been implementing the Outbox pattern using Debezium with MySQL, and I delete the outbox record straight after inserting it, which I saw done here: https://debezium.io/blog/2019/02/19/reliable-microservices-data-exchange-with-the-outbox-pattern/ The insert is picked up and sent, and the delete is ignored, so essentially there should never be anything in the outbox table (outside of the transaction). I also pre-generate the primary keys for the entries (which I use for the event ID in Kafka), so I can bulk insert and delete.
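As a rough SQL sketch of that flow (table and column names here are illustrative, not the exact schema from the blog post):

-- all in one transaction: the connector picks the insert up from the log,
-- the delete is ignored, and the table stays empty afterwards
begin;
insert into outbox (id, aggregate_type, aggregate_id, event_type, payload)
values ('a81bc81b-dead-4e5d-abff-90865d1e13b1', 'order', '42', 'OrderCreated', '{"orderId": 42}');
delete from outbox where id = 'a81bc81b-dead-4e5d-abff-90865d1e13b1';
commit;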