Re: Restrict copying of invalidated replication slots - Mailing list pgsql-hackers
From | Masahiko Sawada |
---|---|
Subject | Re: Restrict copying of invalidated replication slots |
Date | |
Msg-id | CAD21AoCjJ72yp56Z5yWgWUObN6+-mca7RPQzQgLpXKVM1003hw@mail.gmail.com Whole thread Raw |
In response to | Re: Restrict copying of invalidated replication slots (Masahiko Sawada <sawada.mshk@gmail.com>) |
List | pgsql-hackers |
On Wed, Apr 2, 2025 at 2:58 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Mar 18, 2025 at 2:28 AM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote: > > > > On Mon, 17 Mar 2025 at 22:57, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Sun, Mar 9, 2025 at 11:16 PM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote: > > > > > > > > On Fri, 28 Feb 2025 at 08:56, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > On Fri, Feb 28, 2025 at 5:10 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > > > > > On Thu, Feb 27, 2025 at 12:52 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > > On Thu, Feb 27, 2025 at 10:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > > > > > > > > > On Tue, Feb 25, 2025 at 7:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > > > > > > AFAICU, InvalidateObsoleteReplicationSlots() is not serialized with > > > > > > > > > this operation. So, isn't it possible that the source slot exists at > > > > > > > > > the later position in ReplicationSlotCtl->replication_slots and the > > > > > > > > > loop traversing slots is already ahead from the point where the newly > > > > > > > > > copied slot is created? > > > > > > > > > > > > > > > > Good point. I think it's possible. > > > > > > > > > > > > > > > > > If so, the newly created slot won't be marked > > > > > > > > > as invalid whereas the source slot will be marked as invalid. I agree > > > > > > > > > that even in such a case, at a later point, the newly created slot > > > > > > > > > will also be marked as invalid. > > > > > > > > > > > > > > > > The wal_status of the newly created slot would immediately become > > > > > > > > 'lost' and the next checkpoint will invalidate it. Do we need to do > > > > > > > > something to deal with this case? > > > > > > > > > > > > > > > > > > > > > > + /* Check if source slot became invalidated during the copy operation */ > > > > > > > + if (second_slot_contents.data.invalidated != RS_INVAL_NONE) > > > > > > > + ereport(ERROR, > > > > > > > + errmsg("cannot copy replication slot \"%s\"", > > > > > > > + NameStr(*src_name)), > > > > > > > + errdetail("The source replication slot was invalidated during the > > > > > > > copy operation.")); > > > > > > > > > > > > > > How about adding a similar test as above for MyReplicationSlot? That > > > > > > > should suffice the need because we have already acquired the new slot > > > > > > > by this time and invalidation should signal this process before > > > > > > > marking the new slot as invalid. > > > > > > > > > > > > IIUC in the scenario you mentioned, the loop traversing slots already > > > > > > passed the position of newly created slot in > > > > > > ReplicationSlotCtl->replication_slots array, so > > > > > > MyReplicationSlot->data.invalidated is still RS_INVAL_NONE, no? > > > > > > > > > > > > > > > > Right, I don't have a simpler solution for this apart from making it > > > > > somehow serialize this operation with > > > > > InvalidateObsoleteReplicationSlots(). I don't think we need to go that > > > > > far to handle the scenario being discussed. It is better to add a > > > > > comment for this race and why it won't harm. > > > > > > > > I tried to reproduce the scenario described using the following steps: > > > > > > > > Here are the steps: > > > > > > > > 1. set max_replication_slots = 2 > > > > > > > > 2. create two logical replication slot, lets say slot1 and slot2. drop > > > > the first slot (slot1). > > > > > > > > 3. Now run pg_copy_logical_replication slot function on slot2. Lets > > > > say copied slot is 'slot_cp'. > > > > In function copy_replication_slot add breakpoint just after function > > > > 'create_logical_replication_slot'. > > > > slot_cp will be created. In array > > > > ReplicationSlotCtl->replication_slots, slot_cp will come before slot2. > > > > > > > > 4. Now invalidate the 'slot2'. > > > > Function 'InvalidateObsoleteReplicationSlots' will be called. Now it > > > > will loop through ReplicationSlotCtl->replication_slots and will get > > > > 'slot_cp' first. Since the slot is acquired by copy_replication_slot, > > > > It will wait for the slot to be released. Once slot is released, > > > > 'slot_cp' and 'slot2' will be invalidated. > > > > > > > > I have tried it with 'wal_removal' invalidation. So it is issue in > > > > PG_13 to HEAD. > > > > I also see a kill signal sent to function 'copy_replication_slot'. > > > > > > > > Terminal output: > > > > postgres=# select pg_copy_logical_replication_slot('slot2', 'slot_cp'); > > > > FATAL: terminating connection due to administrator command > > > > server closed the connection unexpectedly > > > > This probably means the server terminated abnormally > > > > before or while processing the request. > > > > The connection to the server was lost. Attempting reset: Succeeded. > > > > > > > > 5. Execute 'SELECT * from pg_replication_slots'. slot_cp is present in > > > > the list with wal_status as lost. > > > > > > > > I have added the comment on existing patches for REL_16 to master. > > > > Created a new patch to add only comments in REL13-REL15. > > > > > > I think we don't need a patch adding only comments to v15 or older. We > > > can either write the fact in the commit message that v15 or older are > > > not affected fortunately or add an explicit check if the copied slot > > > doesn't have invalid restart_lsn. > > > > > I felt adding a message in the commit message would be sufficient. I > > have made the changes for the same in [1]. > > Thank you for updating the patches. These patches look good to me. I'm > going to push these fixes to master, v17, and v16 branches tomorrow > unless there is further comment. Pushed (yesterday). Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
pgsql-hackers by date: