Hi,
On 11/16/23 1:03 PM, shveta malik wrote:
> On Thu, Nov 16, 2023 at 3:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> PFA v35. It has below changes:
Thanks for the update!
> 6) shutdown the slotsync worker on promotion.
+ /*
+ * Shutdown the slot sync workers to prevent potential conflicts between
+ * user processes and slotsync workers after a promotion. Additionally,
+ * drop any slots that have initiated but not yet completed the sync
+ * process.
+ */
+ ShutDownSlotSync();
+ slotsync_drop_initiated_slots();
I think there is a corner case here.
If a promotion happens while slot creation is in progress (the slot has just
been created and is in the 'i' state), then shutting down the slot sync worker
in ShutDownSlotSync() sets slot->in_use = false in ReplicationSlotDropPtr().
Indeed, this is what happens when the sync worker shuts down:
(gdb) bt
#0 ReplicationSlotDropPtr (slot=0x7f25af5c9bb0) at slot.c:734
#1 0x000056266c8106a7 in ReplicationSlotDropAcquired () at slot.c:725
#2 0x000056266c810170 in ReplicationSlotRelease () at slot.c:583
#3 0x000056266c80f420 in ReplicationSlotShmemExit (code=1, arg=0) at slot.c:189
#4 0x000056266c86213b in shmem_exit (code=1) at ipc.c:243
#5 0x000056266c861fdf in proc_exit_prepare (code=1) at ipc.c:198
#6 0x000056266c861f23 in proc_exit (code=1) at ipc.c:111
So later on, when we want to drop this slot in slotsync_drop_initiated_slots(),
we get errors like:
2023-11-17 11:22:08.526 UTC [2195486] FATAL: replication slot "logical_slot4" does not exist
The reason is that slotsync_drop_initiated_slots() ends up calling SearchNamedReplicationSlot():
(gdb) bt
#0 SearchNamedReplicationSlot (name=0x7f743f5c9ab8 "logical_slot4", need_lock=false) at slot.c:388
#1 0x0000556ef0974ec1 in ReplicationSlotAcquire (name=0x7f743f5c9ab8 "logical_slot4", nowait=true) at slot.c:484
#2 0x0000556ef09754e7 in ReplicationSlotDrop (name=0x7f743f5c9ab8 "logical_slot4", nowait=true, user_cmd=false) at slot.c:668
#3 0x0000556ef095f0a3 in slotsync_drop_initiated_slots () at slotsync.c:369
which returns NULL when slot->in_use is false.
One option could be to make sure slot->in_use is still true before calling ReplicationSlotDrop() here?
+ foreach(lc, slots)
+ {
+ ReplicationSlot *s = (ReplicationSlot *) lfirst(lc);
+
+ ReplicationSlotDrop(NameStr(s->data.name), true, false);
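To make the sequence above easier to follow, here is a toy model in plain C.
It is not PostgreSQL code: ToySlot, toy_search_slot(), toy_create_slot(),
toy_drop_slot() and toy_drop_initiated_slots() are made-up names, a failed drop
returns false where the real code raises an error, and the locking that a real
in_use check would need is left out entirely. It only sketches why the second
drop fails and how an in_use check would skip the already-dropped slot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_SLOTS 4
#define NAME_LEN 64

/* Toy stand-in for a replication slot: only the fields this sketch needs. */
typedef struct ToySlot
{
    bool in_use;
    char name[NAME_LEN];
} ToySlot;

static ToySlot toy_slots[MAX_SLOTS];

/* Look a slot up by name; entries with in_use == false are never returned. */
static ToySlot *
toy_search_slot(const char *name)
{
    assert(name != NULL);
    for (int i = 0; i < MAX_SLOTS; i++)
    {
        ToySlot *s = &toy_slots[i];

        if (s->in_use && strcmp(s->name, name) == 0)
            return s;
    }
    return NULL;
}

/* Claim the first free entry and give it a name; NULL if the array is full. */
static ToySlot *
toy_create_slot(const char *name)
{
    for (int i = 0; i < MAX_SLOTS; i++)
    {
        ToySlot *s = &toy_slots[i];

        if (!s->in_use)
        {
            strncpy(s->name, name, NAME_LEN - 1);
            s->name[NAME_LEN - 1] = '\0';
            s->in_use = true;
            return s;
        }
    }
    return NULL;
}

/* Drop by name; false models the "does not exist" error from the log above. */
static bool
toy_drop_slot(const char *name)
{
    ToySlot *s = toy_search_slot(name);

    if (s == NULL)
        return false;
    s->in_use = false;          /* the name is left behind in the entry */
    return true;
}

/*
 * Drop every slot in the list and return the number of failed drops.
 * With check_in_use set, entries that were already dropped are skipped.
 */
static int
toy_drop_initiated_slots(ToySlot **slots, int nslots, bool check_in_use)
{
    int failures = 0;

    for (int i = 0; i < nslots; i++)
    {
        ToySlot *s = slots[i];

        if (check_in_use && !s->in_use)
            continue;           /* already dropped when the worker exited */
        if (!toy_drop_slot(s->name))
            failures++;
    }
    return failures;
}
```

In this model, creating "logical_slot4", dropping it once (the worker exit
path) and then calling toy_drop_initiated_slots() without the check reports one
failure, while the same call with check_in_use set reports none.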
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com