Re: Restrict copying of invalidated replication slots - Mailing list pgsql-hackers

From vignesh C
Subject Re: Restrict copying of invalidated replication slots
Msg-id CALDaNm2rrxO5mg6OKoScw84K5P1Tw_cbjniHm+Geyxme8Ei-nQ@mail.gmail.com
List pgsql-hackers
On Tue, 4 Feb 2025 at 15:27, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:
>
> Hi,
>
> Currently, we can copy an invalidated slot using the function
> 'pg_copy_logical_replication_slot'. As per the suggestion in the
> thread [1], we should prohibit copying of such slots.
>
> I have created a patch to address the issue.

This patch does not cover every copy_replication_slot scenario. There
is a corner case involving concurrency in which an invalidated slot
still gets copied despite the new check:
+       /* We should not copy invalidated replication slots */
+       if (src_isinvalidated)
+               ereport(ERROR,
+                               (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                                errmsg("cannot copy an invalidated replication slot")));

Consider the following scenario:
step 1) Set up streaming replication between the primary and standby nodes.
step 2) Create a logical replication slot (test1) on the standby node.
step 3) Set a breakpoint in InvalidatePossiblyObsoleteSlot that
triggers only when cause is RS_INVAL_WAL_LEVEL (other invalidation
causes do not need to be held). Alternatively, add a sleep loop to
InvalidatePossiblyObsoleteSlot like the one below:
    if (cause == RS_INVAL_WAL_LEVEL)
    {
        while (bsleep)
            sleep(1);
    }
step 4) Reduce wal_level on the primary to replica and restart the primary node.
step 5) On the standby, run:
SELECT 'copy' FROM pg_copy_logical_replication_slot('test1', 'test2');
While trying to create the new slot, this call waits until the lock
held by InvalidatePossiblyObsoleteSlot is released.
step 6) Increase wal_level back to logical on the primary node and
restart the primary.
step 7) Now allow the invalidation to proceed (continue from the
breakpoint set in step 3). The replication slot control lock is
released and the invalidated slot is copied.
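
The interleaving in steps 3 to 7 can be pictured with a small,
self-contained C model. This is only a sketch: ModelSlot, WaitHook and
the copy_slot_* functions are hypothetical names and not PostgreSQL
code, and the position of the check relative to the lock wait is an
assumption based on the behaviour described above. The second function
shows one possible mitigation (re-reading the flag after the wait); it
is not claimed to be what the patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a replication slot; not PostgreSQL's struct. */
typedef struct ModelSlot
{
    bool in_use;
    bool invalidated;
} ModelSlot;

typedef enum { COPY_OK, COPY_REJECTED } CopyResult;

/*
 * Hook that runs while the copier is "blocked on the lock". It models
 * whatever a concurrent backend does to the source slot in the meantime.
 */
typedef void (*WaitHook) (ModelSlot *src);

/* Models step 7: the held invalidation completes during the wait. */
void
invalidate_while_waiting(ModelSlot *src)
{
    src->invalidated = true;
}

/* Models the uncontended case: nothing changes during the wait. */
void
nothing_happens(ModelSlot *src)
{
    (void) src;
}

/* Assumed shape of the race: the flag is checked once, before the wait. */
CopyResult
copy_slot_check_once(ModelSlot *src, ModelSlot *dst, WaitHook wait)
{
    assert(src != NULL && dst != NULL && wait != NULL);

    bool src_isinvalidated = src->invalidated;  /* stale once we wait */

    if (src_isinvalidated)
        return COPY_REJECTED;

    wait(src);                  /* invalidation may finish here */

    dst->in_use = true;
    dst->invalidated = false;   /* the copy looks healthy */
    return COPY_OK;
}

/* One possible mitigation: re-read the flag after the wait. */
CopyResult
copy_slot_recheck(ModelSlot *src, ModelSlot *dst, WaitHook wait)
{
    assert(src != NULL && dst != NULL && wait != NULL);

    if (src->invalidated)
        return COPY_REJECTED;

    wait(src);

    if (src->invalidated)       /* state may have changed meanwhile */
        return COPY_REJECTED;

    dst->in_use = true;
    dst->invalidated = false;
    return COPY_OK;
}
```

With invalidate_while_waiting as the hook, copy_slot_check_once returns
COPY_OK and leaves a healthy-looking copy of a source that is by then
invalidated, whereas copy_slot_recheck returns COPY_REJECTED and leaves
the destination untouched.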

After this:
postgres=# SELECT 'copy' FROM pg_copy_logical_replication_slot('test1', 'test2');
 ?column?
----------
 copy
(1 row)

-- The invalidated slot (test1) is copied successfully:
postgres=# select * from pg_replication_slots ;
 slot_name |    plugin     | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase |          inactive_since          | conflicting |  invalidation_reason   | failover | synced
-----------+---------------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+-----------+----------------------------------+-------------+------------------------+----------+--------
 test1     | test_decoding | logical   |      5 | postgres | f         | f      |            |      |          745 | 0/4029060   | 0/4029098           | lost       |               | f         | 2025-02-13 15:26:54.666725+05:30 | t           | wal_level_insufficient | f        | f
 test2     | test_decoding | logical   |      5 | postgres | f         | f      |            |      |          745 | 0/4029060   | 0/4029098           | reserved   |               | f         | 2025-02-13 15:30:30.477836+05:30 | f           |                        | f        | f
(2 rows)

-- A subsequent attempt to decode changes from the copy (test2) of the
-- invalidated slot fails:
postgres=# SELECT data FROM pg_logical_slot_get_changes('test2', NULL, NULL);
WARNING:  detected write past chunk end in TXN 0x5e77e6c6f300
ERROR:  logical decoding on standby requires "wal_level" >= "logical" on the primary

-- Alternatively, decoding may return changes but still emit the warning:
postgres=# SELECT data FROM pg_logical_slot_get_changes('test2', NULL, NULL);
WARNING:  detected write past chunk end in TXN 0x582d1b2d6ef0
    data
------------
 BEGIN 744
 COMMIT 744
(2 rows)

This is an edge case: it requires a slot invalidation to be in progress
while the slot is being copied, which can happen when there is a huge
lag between the primary and the standby.
There might be a similar concurrency case for wal_removed too.

Regards,
Vignesh


