Re: Synchronizing slots from primary to standby - Mailing list pgsql-hackers

From Drouvot, Bertrand
Subject Re: Synchronizing slots from primary to standby
Date
Msg-id 1e0b2eb4-c977-482d-b16e-c52711c34d6c@gmail.com
In response to Re: Synchronizing slots from primary to standby  (shveta malik <shveta.malik@gmail.com>)
List pgsql-hackers
Hi,

On 11/16/23 1:03 PM, shveta malik wrote:
> On Thu, Nov 16, 2023 at 3:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> PFA v35. It has below changes:

Thanks for the update!

> 6) shutdown the slotsync worker on promotion.

+   /*
+    * Shutdown the slot sync workers to prevent potential conflicts between
+    * user processes and slotsync workers after a promotion. Additionally,
+    * drop any slots that have initiated but not yet completed the sync
+    * process.
+    */
+   ShutDownSlotSync();
+   slotsync_drop_initiated_slots();

I think there is a corner case here.

If a promotion happens while slot creation is in progress (the slot has just
been created and is in 'i' state), then shutting down the slot sync worker
in ShutDownSlotSync() sets slot->in_use = false in ReplicationSlotDropPtr().

Indeed, when we shut the sync worker down:

(gdb) bt
#0  ReplicationSlotDropPtr (slot=0x7f25af5c9bb0) at slot.c:734
#1  0x000056266c8106a7 in ReplicationSlotDropAcquired () at slot.c:725
#2  0x000056266c810170 in ReplicationSlotRelease () at slot.c:583
#3  0x000056266c80f420 in ReplicationSlotShmemExit (code=1, arg=0) at slot.c:189
#4  0x000056266c86213b in shmem_exit (code=1) at ipc.c:243
#5  0x000056266c861fdf in proc_exit_prepare (code=1) at ipc.c:198
#6  0x000056266c861f23 in proc_exit (code=1) at ipc.c:111

So later on, when we try to drop this slot in slotsync_drop_initiated_slots(),
we get errors like:

2023-11-17 11:22:08.526 UTC [2195486] FATAL:  replication slot "logical_slot4" does not exist

The reason is that slotsync_drop_initiated_slots() ends up calling SearchNamedReplicationSlot():

(gdb) bt
#0  SearchNamedReplicationSlot (name=0x7f743f5c9ab8 "logical_slot4", need_lock=false) at slot.c:388
#1  0x0000556ef0974ec1 in ReplicationSlotAcquire (name=0x7f743f5c9ab8 "logical_slot4", nowait=true) at slot.c:484
#2  0x0000556ef09754e7 in ReplicationSlotDrop (name=0x7f743f5c9ab8 "logical_slot4", nowait=true, user_cmd=false) at slot.c:668
#3  0x0000556ef095f0a3 in slotsync_drop_initiated_slots () at slotsync.c:369

which returns NULL if slot->in_use is false.

One option could be to check that slot->in_use is still true before calling ReplicationSlotDrop() here?

+   foreach(lc, slots)
+   {
+       ReplicationSlot *s = (ReplicationSlot *) lfirst(lc);
+
+       ReplicationSlotDrop(NameStr(s->data.name), true, false);
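
To illustrate the idea, here is a toy model of the situation (only a sketch,
not the patch's code: ToySlot, toy_search_slot(), toy_drop_slot() and
toy_drop_initiated_slots() are made-up names, and the real code would also
have to deal with locking and with the slot array entry being reused):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_SLOTS 4

/* Toy stand-in for a replication slot; only the fields needed here. */
typedef struct ToySlot
{
    bool    in_use;     /* false once the slot has been dropped */
    char    name[64];
} ToySlot;

ToySlot toy_slots[MAX_SLOTS];

/*
 * Models the lookup behaviour described above: a slot whose in_use flag
 * is false is invisible, so the search returns NULL for it.
 */
ToySlot *
toy_search_slot(const char *name)
{
    assert(name != NULL);

    for (int i = 0; i < MAX_SLOTS; i++)
    {
        ToySlot *s = &toy_slots[i];

        if (s->in_use && strcmp(s->name, name) == 0)
            return s;
    }
    return NULL;
}

/*
 * Drop by name. Returns false where the real code would raise
 * "replication slot ... does not exist".
 */
bool
toy_drop_slot(const char *name)
{
    ToySlot *s = toy_search_slot(name);

    if (s == NULL)
        return false;

    s->in_use = false;
    return true;
}

/*
 * Cleanup loop over previously collected slots. With guard = true, slots
 * that were already released (in_use == false) are skipped instead of
 * being dropped a second time. Returns the number of failed drops.
 */
int
toy_drop_initiated_slots(ToySlot **candidates, int n, bool guard)
{
    int failures = 0;

    for (int i = 0; i < n; i++)
    {
        ToySlot *s = candidates[i];

        if (guard && !s->in_use)
            continue;       /* already dropped when the worker exited */

        if (!toy_drop_slot(s->name))
            failures++;
    }
    return failures;
}
```

In this model, if the slot has already been dropped once (as happens when the
worker exits), the unguarded loop reports a failed drop for it, while the
guarded loop simply skips it.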

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


