Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Date
Msg-id 8679fea7-94ce-4a52-8e48-1a8cd0857fcb@vondra.me
Whole thread Raw
In response to Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly  (Tomas Vondra <tomas@vondra.me>)
Responses Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
List pgsql-hackers
On 11/20/24 18:24, Tomas Vondra wrote:
>
> ...
>
> What confuses me a bit is that we update the restart_lsn (and call
> ReplicationSlotsComputeRequiredLSN() to recalculate the global value)
> all the time. Walsender does that in PhysicalConfirmReceivedLocation for
> example. So we actually see the required LSN to move during checkpoint
> very often. So how come we don't see the issues much more often? Surely
> I miss something important.
> 

This question "How come we don't see this more often?" kept bugging me,
and the answer is actually pretty simple.

The restart_lsn can move backwards after a hard restart (for the reasons
explained), but physical replication does not actually rely on that. The
replica keeps track of the LSN it received (well, it uses the same LSN),
and on reconnect it sends the startpoint to the primary. And the primary
just proceeds use that instead of the (stale) restart LSN for the slot.
And the startpoint is guaranteed (I think) to be at least restart_lsn.

AFAICS this would work for pg_replication_slot_advance() too, that is if
you remember the last LSN the slot advanced to, it should be possible to
advance to it just fine. Of course, it requires a way to remember that
LSN, which for a replica is not an issue. But this just highlights we
can't rely on restart_lsn for this purpose.


(Apologies if this repeats something obvious, or something you already
said, Vitaly.)

regards

-- 
Tomas Vondra




pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: wrong comment in libpq.h
Next
From: Mark Dilger
Date:
Subject: Re: cannot freeze committed xmax