Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Date
Msg-id CAPpHfds0ScFbExsT+pnY7K0JGH6pFPHMOSfCOknC-1s7ahmRFA@mail.gmail.com
Whole thread Raw
In response to Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly  ("Vitaly Davydov" <v.davydov@postgrespro.ru>)
List pgsql-hackers
Hi, Vitaly!

On Tue, Jun 17, 2025 at 6:02 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:
> Thank you for reporting the issue.
>
> >While tracking buildfarm for one of other commits, I noticed this failure:
> >TRAP: failed Assert("s->data.restart_lsn >=
> >s->last_saved_restart_lsn"), File:
> >"../pgsql/src/backend/replication/slot.c", Line: 1813, PID: 3945797
> >postgres: standby: checkpointer (ExceptionalCondition+0x83) [0x55fa69b79f5e]
> >postgres: standby: checkpointer
> >(InvalidateObsoleteReplicationSlots+0x53c) [0x55fa69982171]
> >postgres: standby: checkpointer (CreateCheckPoint+0x9ad) [0x55fa6971feb2]
>
> This assert was introduced in the patch. Now, I think, it is a wrong one. Let me
> please explain one of the possible scenarios when it can be triggered. In case
> of physical replication, when walsender receives a standby reply message, it
> calls PhysicalConfirmReceivedLocation function which updates slots' restart_lsn
> from received flush_lsn value. This value may be older than the saved value. If
> it happens during checkpoint, after slot saving to disk, this assert will be
> triggered, because last_saved_restart_lsn value may be lesser than the new
> restart_lsn value, updated from walsender.
>
> I propose to remove this assert.

Yes, but is it OK for restart_lsn to move backward?  That might mean
that if checkpoint happen faster than
PhysicalConfirmReceivedLocation(), then crash could cause this WAL
location to be unavailable.  Is that true?

Also, what do you think about proposed changes in [1]?  I wonder if it
could somehow decrease the coverage.

Links.
1.
https://www.postgresql.org/message-id/OSCPR01MB149665B3F0629D10731B18E5AF570A%40OSCPR01MB14966.jpnprd01.prod.outlook.com

------
Regards,
Alexander Korotkov
Supabase



pgsql-hackers by date:

Previous
From: Daniel Gustafsson
Date:
Subject: Re: Avoid possible dereference null pointer (src/backend/utils/cache/relcache.c)
Next
From: Konstantin Knizhnik
Date:
Subject: Re: Non-reproducible AIO failure