Hi, Vitaly!
On Tue, Jun 17, 2025 at 6:02 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:
> Thank you for reporting the issue.
>
> >While tracking buildfarm for one of other commits, I noticed this failure:
> >TRAP: failed Assert("s->data.restart_lsn >=
> >s->last_saved_restart_lsn"), File:
> >"../pgsql/src/backend/replication/slot.c", Line: 1813, PID: 3945797
> >postgres: standby: checkpointer (ExceptionalCondition+0x83) [0x55fa69b79f5e]
> >postgres: standby: checkpointer
> >(InvalidateObsoleteReplicationSlots+0x53c) [0x55fa69982171]
> >postgres: standby: checkpointer (CreateCheckPoint+0x9ad) [0x55fa6971feb2]
>
> This assert was introduced in the patch. Now, I think, it is a wrong one. Let me
> please explain one of the possible scenarios when it can be triggered. In case
> of physical replication, when walsender receives a standby reply message, it
> calls PhysicalConfirmReceivedLocation function which updates slots' restart_lsn
> from received flush_lsn value. This value may be older than the saved value. If
> it happens during checkpoint, after slot saving to disk, this assert will be
> triggered, because last_saved_restart_lsn value may be lesser than the new
> restart_lsn value, updated from walsender.
>
> I propose to remove this assert.
Yes, but is it OK for restart_lsn to move backward? That might mean
that if checkpoint happen faster than
PhysicalConfirmReceivedLocation(), then crash could cause this WAL
location to be unavailable. Is that true?
Also, what do you think about proposed changes in [1]? I wonder if it
could somehow decrease the coverage.
Links.
1.
https://www.postgresql.org/message-id/OSCPR01MB149665B3F0629D10731B18E5AF570A%40OSCPR01MB14966.jpnprd01.prod.outlook.com
------
Regards,
Alexander Korotkov
Supabase