Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Date
Msg-id CAA4eK1LmVoZ8jm5Jt3DERN9Z=y6L6xj5O0asw0smWw31Y8KfVw@mail.gmail.com
Whole thread Raw
In response to Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly  ("Vitaly Davydov" <v.davydov@postgrespro.ru>)
Responses Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
List pgsql-hackers
On Mon, May 26, 2025 at 3:52 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:
>
> Dear Alexander, Amit, All
>
> > Amit wrote:
> > > Is my understanding correct that we need 0001 because
> > > PhysicalConfirmReceivedLocation() doesn't save the slot to disk after
> > > changing the slot's restart_lsn?
> >
> > Yes.  Also, even if it would save slot to the disk, there is still
> > race condition that concurrent checkpoint could use updated value from
> > the shared memory to clean old WAL segments, and then crash happens
> > before we managed to write the slot to the disk.
> >
> > How can that happen, if we first write the updated value to disk and
> > then update the shared memory as we do in
> > LogicalConfirmReceivedLocation?
>
> I guess, that the problem with logical slots still exist. Please, see the tap
> test: src/test/recovery/t/046_logical_slot.pl from the v6 version of the patch.
> A race condition may happen when logical slot's restart_lsn was changed but not
> yet written to the disk. Imagine, there is another physical slot which is
> advanced at this moment. It recomputes oldest min LSN and takes into account
> changed but not saved to disk restart_lsn of the logical slot. We come to the
> situation when the WAL segment for the logical slot's restart_lsn may be
> truncated after immediate restart.
>

Okay, so I was missing the point that the physical slots can consider
the updated value of the logical slot's restart_lsn. The point I was
advocating for logical slots sanctity was when no physical slots are
involved. When updating replicationSlotMinLSN value in shared memory,
the logical slot machinery took care that the value we use should be
flushed to disk. One can argue that we should improve physical slots
machinery so that it also takes care to write the slot to disk before
updating the replicationSlotMinLSN, which is used to remove WAL. I
understand that the downside is physical slots will be written to disk
with a greater frequency, which will not be good from the performance
point of view, but can we think of doing it for the period when a
checkpoint is in progress? OTOH, if we don't want to adjust physical
slot machinery, it seems saving the logical slots to disk immediately
when its restart_lsn is updated is a waste of effort after your patch,
no? If so, why are we okay with that?

I understand that your proposed patch fixes the reported problem but I
am slightly afraid that the proposed solution is not a good idea w.r.t
logical slots so I am trying to see if there are any other alternative
ideas to fix this issue.

--
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: Daniel Gustafsson
Date:
Subject: Re: Fix a minor typo in jit/README
Next
From: Tender Wang
Date:
Subject: Re: MERGE issues around inheritance