Re: confusing results from pg_get_replication_slots() - Mailing list pgsql-hackers
| From | Robert Haas |
|---|---|
| Subject | Re: confusing results from pg_get_replication_slots() |
| Date | |
| Msg-id | CA+TgmoaLgv13eMwXuqNkipQU3ScK4+YJvBoJHobYGizojpy9iA@mail.gmail.com |
| In response to | Re: confusing results from pg_get_replication_slots() (Andrey Borodin <x4mmm@yandex-team.ru>) |
| Responses | Re: confusing results from pg_get_replication_slots() |
| List | pgsql-hackers |
On Sat, Jan 3, 2026 at 7:22 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> I concur that showing "unreserved" when there is no actual WAL is a bug.
> Proposed fix will work and is very succinct. Resulting code structure is not super elegant, but acceptable.

Agreed.

> I don't fully understand circumstances when this bug can do any harm. Maybe negative safe_wal_size could be a surprise for some monitoring tools.

Yes, the fact that safe_wal_size can go negative is one of the things that makes me think this outcome was not really intended.

> I don't understand a reason to disallow reviving a slot. Ofc with some new LSN that is currently available in pg_wal.
>
> Imagine a following scenario: in a cluster of a Primary and a Standby a long analytical query is causing huge lag, primary removes some WAL segments due to max_slot_wal_keep_size, standby is disconnected, consumes several WALs from archive, catches up and continues. Or, if something was vacuumed, cancels analytical query. If we disallow reconnection of this standby, it will stay in archive recovery. I don't see how it's a good thing.

I think for physical slots invalidation is a little bit of an odd concept -- why do we ever invalidate a physical slot at all, rather than just stop reserving WAL at some point and let what happens, happen? But the reality is that the slot cannot be resurrected once invalidated; you have to drop and recreate it. Possibly we should revisit that decision or document the logic more clearly, but that's not something to think of back-patching.

> > On 3 Jan 2026, at 02:10, Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > Maybe we shouldn't display "lost" when the slot
> > is invalidated but "invalidated", for example, and any other value
> > means we're just returning whatever GetWALAvailability() told us.
> > Also, maybe the exception for connected slots should just be removed, on
> > the assumption that the race condition isn't common enough to matter,
> > or maybe that logic should be pushed down into GetWALAvailability() if
> > we want to keep it.
>
> I don't think following logic works: "someone seems to be connected to this slot, perhaps it's still not lost". This is error-prone heuristics that is trying to workaround possibly stale restart_lsn.
> For HEAD I'd propose to actually read restart_lsn, and determine if walsender will issue "requested WAL segment has already been removed" on next attempt to send something. In this case slot is "lost".
>
> If I understand correctly, slot might be "invalidated", but not "lost" in this sense yet: timeout occurred, but WAL is still there.

What I think is *really bad* about this situation is that, when the slot is invalidated, showing it as unreserved makes it still look potentially useful. But no matter whether the WAL is present or not, the slot serves neither to reserve WAL nor to hold back xmin once invalidated. Therefore it is not useful. The user would be better off using no slot at all, in which case xmin would be held back and WAL reserved at least while the walreceiver is connected. It is not a question of whether the user can stream from the slot: the user doesn't need a slot to stream. It's a question of whether the user erroneously believes themselves to be protected against something when in fact they are using a defunct slot that is worse than no slot at all.

--
Robert Haas
EDB: http://www.enterprisedb.com
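To make the reporting rule argued for above concrete, here is a minimal sketch in Python. It is a toy model, not PostgreSQL source: the function name `reported_wal_status` and its parameters are hypothetical, and the only facts taken from the discussion are that an invalidated slot should never be shown as "unreserved" and that every other case could simply pass through whatever the WAL availability check reports. The label used for invalidated slots is left as a parameter, since the thread floats both "lost" and "invalidated".

```python
from typing import Optional

# Values the wal_status column can show; None models SQL NULL.
WAL_STATES = {"reserved", "extended", "unreserved", "lost"}


def reported_wal_status(
    invalidated: bool,
    wal_availability: Optional[str],
    invalidated_label: str = "lost",
) -> Optional[str]:
    """Hypothetical model of the status a slot should report.

    invalidated       -- whether the slot has been invalidated
    wal_availability  -- what a WAL availability check would say for the
                         slot's restart_lsn (assumed input, not computed here)
    invalidated_label -- label for invalidated slots ("lost" or "invalidated")
    """
    # An invalidated slot no longer reserves WAL or holds back xmin, so it
    # must not look usable, even if the WAL happens to still be on disk.
    if invalidated:
        return invalidated_label

    if wal_availability is not None and wal_availability not in WAL_STATES:
        raise ValueError(f"unknown WAL availability: {wal_availability!r}")

    # Otherwise, report the availability check's answer unchanged.
    return wal_availability
```

With this rule, `reported_wal_status(True, "unreserved")` reports the slot as lost rather than unreserved, while a slot that has not been invalidated keeps whatever status the availability check produced.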