Re: confusing results from pg_get_replication_slots() - Mailing list pgsql-hackers
| From | Robert Haas |
|---|---|
| Subject | Re: confusing results from pg_get_replication_slots() |
| Date | |
| Msg-id | CA+Tgmobu=_Q+caYOzefZw1f7_taAyt6L2wagT_CXVzPqbGVLUg@mail.gmail.com |
| In response to | Re: confusing results from pg_get_replication_slots() (Matheus Alcantara <matheusssilv97@gmail.com>) |
| List | pgsql-hackers |
On Fri, Jan 2, 2026 at 3:48 PM Matheus Alcantara <matheusssilv97@gmail.com> wrote:
> On 02/01/26 12:40, Robert Haas wrote:
> > Here is a patch to make invalidated slots always report as "lost",
> > which I propose to back-patch to all supported versions.
>
> The patch looks correct to me. I'm just wondering if/how we could
> create a test for this. Is it possible to create a regression test or a
> TAP test? Or it's not worthwhile?

It's relatively difficult to reproduce this, especially on master. Amit Kapila's commit f41d8468ddea34170fe19fdc17b5a247e7d3ac78 changed the behavior for physical replication slots. Before that commit, you couldn't connect to an invalidated logical replication slot, but you could connect to an invalidated physical replication slot. After this commit, both are prohibited. I imagine that Amit thought this was a distinction without a difference, because of course if the WAL is actually removed then use of the slot will fail later -- but that's not completely true, because there's no guarantee if or when the connection will be used to fetch WAL that has been removed. Nonetheless, I think it's a good change: because invalidated replication slots are ignored, having stuff connect to them and pretend to use them is bad.

However, this means that if you wanted a TAP test for this, you would have to let a replication slot get behind far enough that it could be invalidated, trigger a checkpoint that actually invalidates it, and then have the process using the connection catch up quickly enough that it never tries to fetch removed WAL.

In older releases, I believe it's a little easier to hit the problem, because you can actually reconnect to an invalidated slot, but I think you still need the timing to be just right, so that you catch up after the invalidation happens but before the files are actually removed.
Even there, I don't see how you could construct a TAP test without injection points, and I'm not really convinced that it's worth adding a bunch of new infrastructure for this. Such a test wouldn't be likely to catch the next bug of this type, if there is one.

The best thing to do to really avoid future bugs of this type, IMHO, would be to modify pg_get_replication_slots() so that it does not editorialize on the value returned by GetWALAvailability(), but how to get there is arguable. Maybe we should display "invalidated" rather than "lost" when the slot is invalidated, for example, so that any other value means we're just returning whatever GetWALAvailability() told us. Also, maybe the exception for connected slots should just be removed, on the assumption that the race condition isn't common enough to matter, or maybe that logic should be pushed down into GetWALAvailability() if we want to keep it. I'm not sure. Any of that seems like too much to change in the back-branches, but I personally believe rethinking the logic here would be a better use of energy than developing test cases that verify the exact details of the current logic.

--
Robert Haas
EDB: http://www.enterprisedb.com
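
The reporting logic discussed in this message can be modeled roughly as follows. This is a simplified Python sketch for illustration, not PostgreSQL source: the names `WalAvail`, `wal_status`, `wal_status_plain`, and the `always_lost_when_invalidated` flag are hypothetical, and it assumes that, before the patch, an invalidated slot is treated like removed WAL and can therefore fall into the exception for connected slots.

```python
from enum import Enum
from typing import Optional


class WalAvail(Enum):
    """Hypothetical stand-in for the values GetWALAvailability() can return."""
    INVALID_LSN = "invalid_lsn"
    RESERVED = "reserved"
    EXTENDED = "extended"
    UNRESERVED = "unreserved"
    REMOVED = "removed"


def wal_status(avail: WalAvail, invalidated: bool, active_pid: int,
               always_lost_when_invalidated: bool) -> Optional[str]:
    """Model of the status that pg_get_replication_slots() reports.

    always_lost_when_invalidated=False models the assumed pre-patch
    behavior; True models the patch, where invalidated slots always
    report "lost".
    """
    if invalidated:
        if always_lost_when_invalidated:
            return "lost"
        # Assumption: an invalidated slot is handled like removed WAL.
        avail = WalAvail.REMOVED
    if avail is WalAvail.INVALID_LSN:
        return None  # modeled as SQL NULL
    if avail is WalAvail.REMOVED:
        # Exception for connected slots: a process still attached to the
        # slot may have moved past the removed WAL, so "unreserved" is
        # reported instead of "lost".
        return "unreserved" if active_pid != 0 else "lost"
    return avail.value


def wal_status_plain(avail: WalAvail, invalidated: bool) -> Optional[str]:
    """Model of the alternative floated above: report "invalidated" for
    invalidated slots, and otherwise pass the availability through without
    any exception for connected slots."""
    if invalidated:
        return "invalidated"
    if avail is WalAvail.INVALID_LSN:
        return None
    if avail is WalAvail.REMOVED:
        return "lost"
    return avail.value
```

In this model, an invalidated slot that still has a connected process reports "unreserved" when `always_lost_when_invalidated` is false and "lost" when it is true. The second function shows why dropping the special cases makes the output easier to reason about: each result depends on a single input.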