Re: confusing results from pg_get_replication_slots() - Mailing list pgsql-hackers

From Andrey Borodin
Subject Re: confusing results from pg_get_replication_slots()
Date
Msg-id 744A2B9B-EDAF-42D0-955E-1B1AA1421849@yandex-team.ru
In response to confusing results from pg_get_replication_slots()  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: confusing results from pg_get_replication_slots()
Re: confusing results from pg_get_replication_slots()
List pgsql-hackers
Hi Robert!

I've tried to look at how people use wal_status.
There are lots of monitoring usages where transient race conditions do not matter much.
But in some cases fatal decisions are made on the basis of "lost", e.g.

https://github.com/readysettech/readyset/blob/cb77b75a56d952fb6b1c4171afa9f0b0175fb6d8/replicators/src/postgres_connector/connector.rs#L381

I concur that showing "unreserved" when there is no actual WAL is a bug.
The proposed fix will work and is very succinct. The resulting code structure is not super elegant, but acceptable.

I don't fully understand the circumstances in which this bug can do any harm. Maybe a negative safe_wal_size could be a
surprise for some monitoring tools.

> On 2 Jan 2026, at 20:40, Robert Haas <robertmhaas@gmail.com> wrote:
>
> For all practical intents and purposes, such a slot is no
> more - has ceased to be - has expired and gone to meet its maker -
> it's an ex-slot. It makes no sense to me to display that slot with a
> status that shows that there is some hope of recovery when in fact
> there is none.
>
> Note, by the way, that in existing releases, connections to
> already-invalidated physical slots are not blocked. This has been
> changed, but only in master.

I don't see a reason to disallow reviving a slot, of course with some new LSN that is currently available in pg_wal.

Imagine the following scenario: in a cluster of a Primary and a Standby, a long analytical query is causing huge lag,
the primary removes some WAL segments due to max_slot_wal_keep_size, the standby is disconnected, consumes several WAL
segments from the archive, catches up and continues. Or, if something was vacuumed, it cancels the analytical query. If
we disallow reconnection of this standby, it will stay in archive recovery. I don't see how that's a good thing.



> On 3 Jan 2026, at 02:10, Robert Haas <robertmhaas@gmail.com> wrote:
>
> Maybe we shouldn't display "lost" when the slot
> is invalidated but "invalidated", for example, and any other value
> means we're just returning whatever GetWALAvaliability() told us.
> Also, maybe the exception for connect slots should just be removed, on
> the assumption that the race condition isn't common enough to matter,
> or maybe that logic should be pushed down into GetWALAvailability() if
> we want to keep it.

I don't think the following logic works: "someone seems to be connected to this slot, perhaps it's still not lost".
This is an error-prone heuristic that tries to work around a possibly stale restart_lsn.
For HEAD I'd propose to actually read restart_lsn and determine whether walsender will issue "requested WAL segment has
already been removed" on its next attempt to send something. In that case the slot is "lost".

If I understand correctly, a slot might be "invalidated" but not yet "lost" in this sense: the timeout occurred, but the
WAL is still there.
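
To make the distinction concrete, here is a rough standalone sketch of the check I have in mind. It is not PostgreSQL
code: the type Lsn, the fixed 16 MB WAL_SEGMENT_SIZE, the function names and the "available" label are all made up for
illustration, and the oldest retained segment number is simply passed in rather than looked up.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not PostgreSQL's actual types. */
typedef uint64_t Lsn;

/* Assumption: a fixed 16 MB segment size; the real value is configurable. */
#define WAL_SEGMENT_SIZE ((uint64_t) 16 * 1024 * 1024)

/* Number of the WAL segment that contains the given LSN. */
static uint64_t
lsn_to_segno(Lsn lsn)
{
    return lsn / WAL_SEGMENT_SIZE;
}

/*
 * True if the segment holding restart_lsn is older than the oldest segment
 * still kept, i.e. the next attempt to send from restart_lsn would fail
 * because the requested segment has already been removed.
 */
static bool
slot_wal_is_gone(Lsn restart_lsn, uint64_t oldest_kept_segno)
{
    return lsn_to_segno(restart_lsn) < oldest_kept_segno;
}

/*
 * "lost" depends only on whether the WAL is really gone. "invalidated" is
 * reported separately when the slot was invalidated but its WAL still exists.
 * "available" is a placeholder for every other state.
 */
static const char *
slot_status(bool invalidated, Lsn restart_lsn, uint64_t oldest_kept_segno)
{
    if (slot_wal_is_gone(restart_lsn, oldest_kept_segno))
        return "lost";
    return invalidated ? "invalidated" : "available";
}
```

With this shape, whether anyone is currently connected to the slot plays no role in the answer; only restart_lsn and
the oldest retained segment do.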


Best regards, Andrey Borodin.

