Re: confusing results from pg_get_replication_slots() - Mailing list pgsql-hackers

From Robert Haas
Subject Re: confusing results from pg_get_replication_slots()
Date
Msg-id CA+Tgmobu=_Q+caYOzefZw1f7_taAyt6L2wagT_CXVzPqbGVLUg@mail.gmail.com
In response to Re: confusing results from pg_get_replication_slots()  (Matheus Alcantara <matheusssilv97@gmail.com>)
List pgsql-hackers
On Fri, Jan 2, 2026 at 3:48 PM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:
> On 02/01/26 12:40, Robert Haas wrote:
> > Here is a patch to make invalidated slots always report as "lost",
> > which I propose to back-patch to all supported versions.
>
> The patch looks correct to me. I'm just wondering if/how we could
> create a test for this. Is it possible to create a regression test or a
> TAP test? Or is it not worthwhile?

It's relatively difficult to reproduce this, especially on master.
Amit Kapila's commit f41d8468ddea34170fe19fdc17b5a247e7d3ac78 changed
the behavior for physical replication slots. Before that commit, you
couldn't connect to an invalidated logical replication slot, but you
could still connect to an invalidated physical replication slot. After
that commit, both are
prohibited. I imagine that Amit thought this was a distinction without
a difference, because of course if the WAL is actually removed then
use of the slot will fail later -- but that's not completely true,
because there's no guarantee if or when the connection will be used to
fetch WAL that has been removed. Nonetheless, I think it's a good
change: because invalidated replication slots are ignored, having
stuff connect to them and pretend to use them is bad.

However, this means that if you wanted a TAP test for this, you would
have to let a replication slot get behind far enough that it could be
invalidated, trigger a checkpoint that actually invalidates it, and
then have the process using the connection catch up quickly enough
that it never tries to fetch removed WAL. In older releases, I believe
it's a little easier to hit the problem, because you can actually
reconnect to an invalidated slot, but I think you still need the
timing to be just right, so that you catch up after the invalidation
happens but before the files are actually removed. Even there, I don't
see how you could construct a TAP test without injection points, and
I'm not really convinced that it's worth adding a bunch of new
infrastructure for this. Such a test wouldn't be likely to catch the
next bug of this type, if there is one.

The best thing to do to really avoid future bugs of this type, IMHO,
would be to modify pg_get_replication_slots() so that it does not
editorialize on the value returned by GetWALAvailability(), but how to
get there is arguable. Maybe we should display "invalidated" rather
than "lost" when the slot is invalidated, for example, so that any
other value means we're just returning whatever GetWALAvailability()
told us. Also, maybe the exception for connected slots should just be
removed, on
the assumption that the race condition isn't common enough to matter,
or maybe that logic should be pushed down into GetWALAvailability() if
we want to keep it. I'm not sure. Any of that seems like too much to
change in the back-branches, but I personally believe rethinking the
logic here would be a better use of energy than developing test cases
that verify the exact details of the current logic.
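
To make the "invalidated" idea above concrete, here is a minimal
standalone sketch. It is not the patch and not the server code: the
enum is a local stand-in for the server's WALAvailability type, and
wal_status_label() and label_equals() are hypothetical names invented
for illustration. It assumes an invalidated slot is reported as
"invalidated" and that every other value is passed through without
adjustment.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Local stand-in for the server's WALAvailability enum (assumption). */
typedef enum
{
    WALAVAIL_INVALID_LSN,
    WALAVAIL_RESERVED,
    WALAVAIL_EXTENDED,
    WALAVAIL_UNRESERVED,
    WALAVAIL_REMOVED
} WALAvailability;

/*
 * Hypothetical helper: map a slot's state to the text shown in the
 * wal_status column. An invalidated slot always reports "invalidated";
 * otherwise the availability value is translated directly, with no
 * special case for slots that currently have a connection.
 * Returns NULL where the column would be SQL NULL.
 */
static const char *
wal_status_label(bool slot_invalidated, WALAvailability avail)
{
    if (slot_invalidated)
        return "invalidated";

    switch (avail)
    {
        case WALAVAIL_INVALID_LSN:
            return NULL;
        case WALAVAIL_RESERVED:
            return "reserved";
        case WALAVAIL_EXTENDED:
            return "extended";
        case WALAVAIL_UNRESERVED:
            return "unreserved";
        case WALAVAIL_REMOVED:
            return "lost";
    }

    /* Not reached for valid enum values. */
    assert(false);
    return NULL;
}

/* Hypothetical helper: NULL-safe comparison of a label with a string. */
static bool
label_equals(const char *label, const char *expected)
{
    return label != NULL && strcmp(label, expected) == 0;
}
```

With this shape, label_equals(wal_status_label(true, WALAVAIL_RESERVED),
"invalidated") holds regardless of the availability value, which is the
property the display logic would need in order to stop second-guessing
the lower-level function.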

--
Robert Haas
EDB: http://www.enterprisedb.com


