Re: Restrict copying of invalidated replication slots - Mailing list pgsql-hackers
From | vignesh C |
---|---|
Subject | Re: Restrict copying of invalidated replication slots |
Date | |
Msg-id | CALDaNm2rrxO5mg6OKoScw84K5P1Tw_cbjniHm+Geyxme8Ei-nQ@mail.gmail.com |
List | pgsql-hackers |
On Tue, 4 Feb 2025 at 15:27, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:
>
> Hi,
>
> Currently, we can copy an invalidated slot using the function
> 'pg_copy_logical_replication_slot'. As per the suggestion in the
> thread [1], we should prohibit copying of such slots.
>
> I have created a patch to address the issue.

This patch does not fix all the copy_replication_slot scenarios completely; there is a corner-case concurrency scenario in which an invalidated slot still gets copied:

+ /* We should not copy invalidated replication slots */
+ if (src_isinvalidated)
+     ereport(ERROR,
+             (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+              errmsg("cannot copy an invalidated replication slot")));

Consider the following scenario:

step 1) Set up streaming replication between the primary and standby nodes.
step 2) Create a logical replication slot (test1) on the standby node.
step 3) Set a breakpoint in InvalidatePossiblyObsoleteSlot for the case where cause is RS_INVAL_WAL_LEVEL (there is no need to hold other invalidation causes), or add a sleep in the InvalidatePossiblyObsoleteSlot function like below:

    if (cause == RS_INVAL_WAL_LEVEL)
    {
        while (bsleep)
            sleep(1);
    }

step 4) Reduce wal_level on the primary to replica and restart the primary node.
step 5) SELECT 'copy' FROM pg_copy_logical_replication_slot('test1', 'test2'); -- While trying to create the slot, it will wait until the lock held by InvalidatePossiblyObsoleteSlot is released.
step 6) Increase wal_level back to logical on the primary node and restart the primary.
step 7) Now allow the invalidation to happen (continue from the breakpoint held at step 3); the replication control lock will be released and the invalidated slot will be copied.

After this:

postgres=# SELECT 'copy' FROM pg_copy_logical_replication_slot('test1', 'test2');
 ?column?
----------
 copy
(1 row)

-- The invalidated slot (test1) is copied successfully:
postgres=# select * from pg_replication_slots ;
 slot_name |    plugin     | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase |          inactive_since          | conflicting |  invalidation_reason   | failover | synced
-----------+---------------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+-----------+----------------------------------+-------------+------------------------+----------+--------
 test1     | test_decoding | logical   |      5 | postgres | f         | f      |            |      |          745 | 0/4029060   | 0/4029098           | lost       |               | f         | 2025-02-13 15:26:54.666725+05:30 | t           | wal_level_insufficient | f        | f
 test2     | test_decoding | logical   |      5 | postgres | f         | f      |            |      |          745 | 0/4029060   | 0/4029098           | reserved   |               | f         | 2025-02-13 15:30:30.477836+05:30 | f           |                        | f        | f
(2 rows)

-- A subsequent attempt to decode changes from the copied slot (test2) fails:
postgres=# SELECT data FROM pg_logical_slot_get_changes('test2', NULL, NULL);
WARNING:  detected write past chunk end in TXN 0x5e77e6c6f300
ERROR:  logical decoding on standby requires "wal_level" >= "logical" on the primary

-- Alternatively, only the warning may be raised and changes are still returned:
postgres=# SELECT data FROM pg_logical_slot_get_changes('test2', NULL, NULL);
WARNING:  detected write past chunk end in TXN 0x582d1b2d6ef0
    data
------------
 BEGIN 744
 COMMIT 744
(2 rows)

This is an edge case that can occur under specific conditions involving replication slot invalidation when there is a huge lag between the primary and the standby. There might be a similar concurrency case for wal_removed too.

Regards,
Vignesh
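
The scenario above can be read as a check-then-act race. The following is only a toy, single-threaded model of that reading, not PostgreSQL code: ToySlot, toy_invalidate, toy_copy_slot and the recheck flag are all hypothetical names, and it assumes the invalidation flag is read before the copy blocks on the lock. The hook stands in for the invalidation that completes while step 5 is waiting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a replication slot; NOT PostgreSQL's ReplicationSlot. */
typedef struct ToySlot
{
    bool in_use;
    bool invalidated;
} ToySlot;

/* Hypothetical hook that runs while the copy is "blocked" (steps 5 to 7). */
typedef void (*ToyHook)(ToySlot *src);

/* Models the invalidation finishing once the breakpoint is released. */
void toy_invalidate(ToySlot *src)
{
    src->invalidated = true;
}

/*
 * Copy src into dst; returns true if the copy was made.
 * recheck = false: only the early snapshot is tested.
 * recheck = true:  the flag is read again after the wait
 *                  (one possible mitigation, assumed for illustration).
 */
bool toy_copy_slot(ToySlot *src, ToySlot *dst, ToyHook during_wait, bool recheck)
{
    assert(src != NULL && dst != NULL);

    /* Snapshot taken before the wait (assumption about ordering). */
    bool src_isinvalidated = src->invalidated;

    if (src_isinvalidated)
        return false;           /* "cannot copy an invalidated replication slot" */

    /* The source may change while the copy waits for the lock. */
    if (during_wait != NULL)
        during_wait(src);

    if (recheck && src->invalidated)
        return false;           /* stale snapshot detected */

    /* The copy looks healthy even though the source is now invalidated. */
    dst->in_use = true;
    dst->invalidated = false;
    return true;
}
```

With recheck set to false, toy_copy_slot(&src, &dst, toy_invalidate, false) succeeds even though src ends up invalidated, which mirrors test1 and test2 in the output above; with recheck set to true the same call is rejected. Whether a recheck after slot creation is the right fix for the real function is a separate question.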