Home > mailing lists

Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly - Mailing list pgsql-hackers

From	Alexander Korotkov
Subject	Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Date	June 17 18:08:11
Msg-id	CAPpHfds0ScFbExsT+pnY7K0JGH6pFPHMOSfCOknC-1s7ahmRFA@mail.gmail.com Whole thread Raw
In response to	Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly ("Vitaly Davydov" <v.davydov@postgrespro.ru>)
List	pgsql-hackers

Tree view

Hi, Vitaly!

On Tue, Jun 17, 2025 at 6:02 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:
> Thank you for reporting the issue.
>
> >While tracking buildfarm for one of other commits, I noticed this failure:
> >TRAP: failed Assert("s->data.restart_lsn >=
> >s->last_saved_restart_lsn"), File:
> >"../pgsql/src/backend/replication/slot.c", Line: 1813, PID: 3945797
> >postgres: standby: checkpointer (ExceptionalCondition+0x83) [0x55fa69b79f5e]
> >postgres: standby: checkpointer
> >(InvalidateObsoleteReplicationSlots+0x53c) [0x55fa69982171]
> >postgres: standby: checkpointer (CreateCheckPoint+0x9ad) [0x55fa6971feb2]
>
> This assert was introduced in the patch. Now, I think, it is a wrong one. Let me
> please explain one of the possible scenarios when it can be triggered. In case
> of physical replication, when walsender receives a standby reply message, it
> calls PhysicalConfirmReceivedLocation function which updates slots' restart_lsn
> from received flush_lsn value. This value may be older than the saved value. If
> it happens during checkpoint, after slot saving to disk, this assert will be
> triggered, because last_saved_restart_lsn value may be lesser than the new
> restart_lsn value, updated from walsender.
>
> I propose to remove this assert.

Yes, but is it OK for restart_lsn to move backward?  That might mean
that if checkpoint happen faster than
PhysicalConfirmReceivedLocation(), then crash could cause this WAL
location to be unavailable.  Is that true?

Also, what do you think about proposed changes in [1]?  I wonder if it
could somehow decrease the coverage.

Links.
1.
https://www.postgresql.org/message-id/OSCPR01MB149665B3F0629D10731B18E5AF570A%40OSCPR01MB14966.jpnprd01.prod.outlook.com

------
Regards,
Alexander Korotkov
Supabase

pgsql-hackers by date:

From: Daniel Gustafsson
Date: 17 June, 17:59:31
Subject: Re: Avoid possible dereference null pointer (src/backend/utils/cache/relcache.c)

From: Konstantin Knizhnik
Date: 17 June, 18:08:30
Subject: Re: Non-reproducible AIO failure

Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly - Mailing list pgsql-hackers

Previous

Next