
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly - Mailing list pgsql-hackers

From	Tomas Vondra
Subject	Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Date	November 21 04:19:52
Msg-id	8679fea7-94ce-4a52-8e48-1a8cd0857fcb@vondra.me
In response to	Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly (Tomas Vondra <tomas@vondra.me>)
Responses	Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
List	pgsql-hackers

On 11/20/24 18:24, Tomas Vondra wrote:
>
> ...
>
> What confuses me a bit is that we update the restart_lsn (and call
> ReplicationSlotsComputeRequiredLSN() to recalculate the global value)
> all the time. Walsender does that in PhysicalConfirmReceivedLocation for
> example. So we actually see the required LSN move during checkpoint
> very often. So how come we don't see the issues much more often? Surely
> I'm missing something important.
> 
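A minimal Python sketch of the situation the quoted paragraph worries about, under loudly simplified assumptions. This is not PostgreSQL code: `Slot`, `compute_required_lsn()`, `segment_of()` and `oldest_kept_segment()` are hypothetical, segments are assumed to be 16 MB, and settings such as wal_keep_size are ignored. In this model the checkpoint keeps WAL starting from the minimum in-memory restart_lsn (the global value ReplicationSlotsComputeRequiredLSN() recalculates), while the slot's on-disk restart_lsn may lag behind if the slot moved after it was last saved.

```python
from dataclasses import dataclass

WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumed default 16 MB segments


@dataclass
class Slot:
    """Hypothetical model of a replication slot (not PostgreSQL's struct)."""
    restart_lsn: int            # in-memory value, updated as the slot advances
    persisted_restart_lsn: int  # value last written to disk


def compute_required_lsn(slots):
    """Minimum in-memory restart_lsn across all slots, in the spirit of
    ReplicationSlotsComputeRequiredLSN()."""
    return min(slot.restart_lsn for slot in slots)


def segment_of(lsn):
    """Number of the WAL segment containing the given LSN."""
    return lsn // WAL_SEGMENT_SIZE


def oldest_kept_segment(slots):
    """In this model, segments older than this one are removed at checkpoint."""
    return segment_of(compute_required_lsn(slots))


# The slot advanced in memory (segment 5), but the file on disk still says segment 2.
slot = Slot(restart_lsn=5 * WAL_SEGMENT_SIZE + 100,
            persisted_restart_lsn=2 * WAL_SEGMENT_SIZE + 100)
keep_from = oldest_kept_segment([slot])

# Hard restart: the in-memory value is lost and the slot is reloaded from disk.
slot.restart_lsn = slot.persisted_restart_lsn
print(keep_from, segment_of(slot.restart_lsn))  # prints: 5 2
```

The reloaded restart_lsn (segment 2) now points before the oldest segment the checkpoint kept (segment 5), which is the symptom in the thread subject.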

This question "How come we don't see this more often?" kept bugging me,
and the answer is actually pretty simple.

The restart_lsn can move backwards after a hard restart (for the reasons
explained), but physical replication does not actually rely on the slot's
restart_lsn when resuming. The replica keeps track of the LSN it received
(well, it uses the same LSN), and on reconnect it sends the startpoint to
the primary. The primary then simply uses that startpoint instead of the
(stale) restart_lsn for the slot. And the startpoint is guaranteed (I
think) to be at least restart_lsn.
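A toy Python model of the reconnect described above, only to illustrate which LSN decides where streaming resumes. `Primary`, `Replica` and `start_streaming()` are hypothetical names, and the model ignores the START_REPLICATION protocol details and how the real walreceiver computes its start position. The assertion encodes the "at least restart_lsn" invariant that the message states with an "I think".

```python
from dataclasses import dataclass


@dataclass
class Primary:
    """Hypothetical primary holding one physical slot (simplified model)."""
    slot_restart_lsn: int  # may be stale after a hard restart


@dataclass
class Replica:
    """Hypothetical replica that remembers how far it has received WAL."""
    received_lsn: int

    def startpoint(self):
        # On reconnect, the replica asks to stream from its own position.
        return self.received_lsn


def start_streaming(primary, replica):
    """Return the LSN streaming resumes from in this model: the replica's
    requested startpoint, not the slot's (possibly stale) restart_lsn."""
    start = replica.startpoint()
    # Assumed invariant (hedged in the message): startpoint >= restart_lsn.
    assert start >= primary.slot_restart_lsn
    return start


primary = Primary(slot_restart_lsn=0x2000100)  # stale value reloaded from disk
replica = Replica(received_lsn=0x5000100)      # what the replica actually received
print(hex(start_streaming(primary, replica)))  # prints: 0x5000100
```

Because the stale restart_lsn never enters the calculation of the resume point, a backward move of the slot goes unnoticed in this scenario.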

AFAICS this would work for pg_replication_slot_advance() too: if you
remember the last LSN the slot advanced to, it should be possible to
advance to it again just fine. Of course, this requires a way to remember
that LSN, which is not an issue for a replica. But it just highlights that
we can't rely on restart_lsn for this purpose.

(Apologies if this repeats something obvious, or something you already
said, Vitaly.)

regards

-- 
Tomas Vondra
