Re: Newly created replication slot may be invalidated by checkpoint - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Newly created replication slot may be invalidated by checkpoint
Date
Msg-id CAA4eK1J7Ep6p4CqcETCcYTeQzEjGJt+p_z50aKaEc9n6zZHoyg@mail.gmail.com
Whole thread Raw
In response to Re: Newly created replication slot may be invalidated by checkpoint  (vignesh C <vignesh21@gmail.com>)
Responses RE: Newly created replication slot may be invalidated by checkpoint
List pgsql-hackers
On Wed, Jan 7, 2026 at 2:52 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Mon, 5 Jan 2026 at 15:01, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Dec 30, 2025 at 5:31 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Sun, Dec 14, 2025 at 8:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Thu, Dec 11, 2025 at 12:39 PM Zhijie Hou (Fujitsu)
> > > > <houzj.fnst@fujitsu.com> wrote:
> > > > >
> > > > > >
> > > > > > The other idea to fix this problem is suggested by Alexander in his
> > > > > > email [1] which is to introduce a new ReplicationSlotReserveWALLock
> > > > > > for this purpose. I think introducing LWLock in back branches could be
> > > > > > questionable. Did you evaluate the pros and cons of using that
> > > > > > approach?
> > > > >
> > > > > I reviewed that approach, and I think the main distinction lies in whether to
> > > > > use a new LWLock to serialize the process or rely on an existing lock.
> > > > > Introducing a new LWLock in back branches would alter the size of
> > > > > MainLWLockArray and affect NUM_INDIVIDUAL_LWLOCKS/LWTRANCHE_FIRST_USER_DEFINED.
> > > > > Although this may not directly impact user applications since users typically
> > > > > use standard APIs like RequestNamedLWLockTranche and LWLockNewTrancheId to add
> > > > > private LWLocks, it still has a slight risk. Additionally, using an existing
> > > > > lock could keep code similarity with the HEAD, which can be helpful for future
> > > > > bug fixes and analysis.
> > > > >
> > > >
> > > > Fair enough. I'll wait for Sawada-san/Vitaly to see what their opinion
> > > > on this matter is.
> > >
> > > While it's hacky that the proposed approach takes
> > > ReplicationSlotAllocationLock before
> > > XLogGetReplicationSlotMinimumLSN() during checkpoint, I find that it's
> > > better than introducing a new LWLock in minor versions which could
> > > lead unexpected compatibility issues.
> > >
> > > Regarding the v10 patch, do we need to take
> > > ReplicationSlotAllocationLock also at the following place?
> > >
> > >         /*
> > >          * Recalculate the current minimum LSN to be used in the WAL segment
> > >          * cleanup.  Then, we must synchronize the replication slots again in
> > >          * order to make this LSN safe to use.
> > >          */
> > >         slotsMinReqLSN = XLogGetReplicationSlotMinimumLSN();
> > >         CheckPointReplicationSlots(shutdown);
> > >
> >
> > No, we don't need the allocation lock here because RedoRecPtr won't
> > change after the previous computation so the WAL reservation during
> > creation of slots won't be impacted. I mean if the slot reservation
> > starts just before this computation, it should use the latest (this
> > checkpoint cycle's RedoRecPtr) whereas that was not true with the case
> > where the patch acquires it.
> >
> > > I think we need to add some comments regardless of taking the lwlock.
> > >
> >
> > I have added a comment though not sure if it is required.
>
> Here is a version which includes back branch version patches with
> pgindent changes. I have verified the following scenario in PG17,
> PG16, PG15 and PG14 branches and works as expected with the patches:
>

Thanks for preparing and testing backbranch patches. One thing I
observed a bit different in PG15 (and PG14) is that we need to do
LogStandbySnapshot() and XLogFlush() under
ReplicationSlotAllocationLock because of the current coding pattern in
those branches. But even without this patch, we perform those actions
in a loop if the required WAL is removed which is not ideal and
changed in 16 and above. AFAICS, there is no problem in doing
LogStandbySnapshot() and XLogFlush() under
ReplicationSlotAllocationLock because the LW locks acquired in earlier
functions are not acquired after ReplicationSlotAllocationLock, so
there is no locking order problem. Also, the slot_creation shouldn't
be such a frequent operation to matter. However, if required, we can
move LogStandbySnapshot() and XLogFlush() outside the loop similar to
16 and above branches. OTOH, doing minimal changes in bank branches
and that too in 15 and 14 to fix this issue seems like a prudent
choice.

Thoughts?

--
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: "zengman"
Date:
Subject: [PATCH] Fix typo in pgstat_replslot.c
Next
From: "Hayato Kuroda (Fujitsu)"
Date:
Subject: RE: Newly created replication slot may be invalidated by checkpoint