Hi,

Since v13, pg_get_replication_slots() returns a wal_status field that
supposedly tells you whether the slot is reserving WAL. Its value is
one of "reserved", "extended", "unreserved", or "lost". However, the
logic is more complicated than you might expect from a reporting
function. We normally call GetWALAvailability() and report whatever it
tells us, but there are two exceptions. First, if the slot is
invalidated, we skip calling GetWALAvailability() and assume that the
answer is "lost". Second, if something is still connected to the slot,
we assume that any apparent "lost" answer is due to a race condition
and instead return "unreserved". Both of these exceptions can occur at
the same time, and the checks are done in the order I've listed here.
Therefore, a still-connected slot which is invalidated is shown as
"unreserved" rather than, as I would have expected, as "lost".
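
To make the ordering concrete, here is a minimal sketch of the decision
as described above. It is a simplified model, not the actual source:
the type, field, and function names are made up for illustration, and
the "computed" field merely stands in for whatever
GetWALAvailability() would have reported.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; not PostgreSQL definitions. */
typedef enum
{
    WAL_RESERVED,
    WAL_EXTENDED,
    WAL_UNRESERVED,
    WAL_LOST
} WalStatus;

typedef struct
{
    bool        invalidated;    /* slot has been invalidated */
    bool        in_use;         /* some process is still connected */
    WalStatus   computed;       /* stand-in for GetWALAvailability() */
} SlotModel;

/* Models the reporting logic as described: both exceptions, in order. */
static WalStatus
reported_status_current(const SlotModel *slot)
{
    WalStatus   status;

    /* Exception 1: invalidated slot, so skip the availability check. */
    if (slot->invalidated)
        status = WAL_LOST;
    else
        status = slot->computed;

    /* Exception 2: a connected process turns "lost" into "unreserved". */
    if (status == WAL_LOST && slot->in_use)
        status = WAL_UNRESERVED;

    return status;
}
```

In this model, a slot with invalidated and in_use both set comes out
as WAL_UNRESERVED, because the second check cannot tell where the
"lost" answer came from.
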
I don't believe we should apply both of these exceptions at the same
time. If we actually called GetWALAvailability() and it said the WAL
was lost, then perhaps the fact that somebody is still connected to
the slot is contrary evidence, and maybe, due to some race condition,
they can catch up again. But if we didn't call GetWALAvailability() and
thought that the WAL was lost because the slot is invalidated, the
fact that some process is still connected to that slot doesn't
invalidate the conclusion. Once the slot is invalidated, it's ignored
for purposes of deciding how much WAL to retain in the future, and
it's ignored for hot_standby_feedback purposes. It is no longer
protecting against any of the things against which slots are supposed
to protect. For all practical intents and purposes, such a slot is no
more - has ceased to be - has expired and gone to meet its maker -
it's an ex-slot. It makes no sense to me to display that slot with a
status that shows that there is some hope of recovery when in fact
there is none.

Note, by the way, that in existing releases, connections to
already-invalidated physical slots are not blocked. This has been
changed, but only in master.

Here is a patch to make invalidated slots always report as "lost",
which I propose to back-patch to all supported versions.
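
The intended rule can be sketched roughly as follows. This is a
simplified model under assumed names, not the patch itself: the
types, fields, and function are hypothetical, and "computed" stands in
for the result of GetWALAvailability().

```c
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; not PostgreSQL definitions. */
typedef enum
{
    WAL_RESERVED,
    WAL_EXTENDED,
    WAL_UNRESERVED,
    WAL_LOST
} WalStatus;

typedef struct
{
    bool        invalidated;    /* slot has been invalidated */
    bool        in_use;         /* some process is still connected */
    WalStatus   computed;       /* stand-in for GetWALAvailability() */
} SlotModel;

/* Models the proposed rule: invalidation is final. */
static WalStatus
reported_status_proposed(const SlotModel *slot)
{
    /* An invalidated slot is always "lost", connected process or not. */
    if (slot->invalidated)
        return WAL_LOST;

    /*
     * Only a "lost" answer that came from the availability check may be
     * softened to "unreserved" when a process is still connected.
     */
    if (slot->computed == WAL_LOST && slot->in_use)
        return WAL_UNRESERVED;

    return slot->computed;
}
```

The only case whose result changes relative to the ordering described
earlier is the one where the slot is both invalidated and still in use.
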
Many people were involved in the diagnosis of this issue, but
particular shout-outs are appropriate to my colleague Nitin Chobisa,
who produced the first reproducible test case demonstrating the issue,
and my colleague Pavan Deolasee, who further refined the test case and
clearly established that it was possible for slots to emerge from the
"lost" state, going back to "unreserved".

--
Robert Haas
EDB: http://www.enterprisedb.com