Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune() - Mailing list pgsql-bugs
From | Noah Misch |
---|---|
Subject | Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune() |
Date | |
Msg-id | 20240106202413.e5@rfd.leadboat.com Whole thread Raw |
In response to | Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune() (Peter Geoghegan <pg@bowt.ie>) |
Responses |
Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()
Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune() Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune() |
List | pgsql-bugs |
On Sun, Dec 31, 2023 at 03:53:34PM -0800, Peter Geoghegan wrote: > On Sun, Dec 24, 2023 at 6:44 PM Noah Misch <noah@leadboat.com> wrote: > > On Mon, Nov 01, 2021 at 04:15:27PM +0100, Matthias van de Meent wrote: > > > Another alternative would be to replace the use of vacrel->OldestXmin > > > with `vacrel->vistest->maybe_needed` in lazy_scan_prune, but I believe > > > > v17 commit 1ccc1e05ae essentially did that. > > Obviously, 1ccc1e05ae would fix the immediate problem of infinite > retries, since it just rips out the loop. Yep. > > > that is not legal in how vacuum works (we cannot unilaterally decide > > > that we want to retain tuples < OldestXmin). > > > > Do you think commit 1ccc1e05ae creates problems in that respect? It does have > > the effect of retaining tuples for which GlobalVisState rules "retain" but > > HeapTupleSatisfiesVacuum() would have ruled "delete". If that doesn't create > > problems, then back-patching commit 1ccc1e05ae could be a fix for remaining > > infinite-retries scenarios, if any. > > My guess is that there is a decent chance that backpatching 1ccc1e05ae > would be okay, but that isn't much use. I really don't know either way > right now. And I wouldn't like to speculate too much further before > gaining a proper understanding of what's going on here. Fair enough. While I agree there's a decent chance back-patching would be okay, I think there's also a decent chance that 1ccc1e05ae creates the problem Matthias theorized. Something like: we update relfrozenxid based on OldestXmin, even though GlobalVisState caused us to retain a tuple older than OldestXmin. Then relfrozenxid disagrees with table contents. > Seems to be > specific to partitioning with cross-partition updates. > > > Using the https://postgr.es/m/d5d5af5d-ba46-aff3-9f91-776c70246cc3@gmail.com > > procedure, I see these results: > > > > - A commit from the day of that email, 2021-10-29, (5ccceb2946) reproduced the > > "numretries" assertion failure in each of five 10m runs. > > > > - At the 2022-01-13 commit (18b87b201f^) just before the fix for #17255, the > > script instead gets: FailedAssertion("HeapTupleHeaderIsHeapOnly(htup)", > > File: "pruneheap.c", Line: 964. That happened once in two 10m runs, so it > > was harder to reach than the numretries failure. > > > > - At 18b87b201f, a 1440m script run got no failures. > > > > I've seen symptoms that suggest the infinite-numretries bug remains reachable, > > but I don't know how to reproduce them. (Given the upthread notes about xmin > > going backward during end-of-xact, I'd first try some pauses there.) If it > > does remain reachable, likely some other code change between 2021-10 and > > 2022-01 made this particular test script no longer reach it. > > I am aware of a production database that appears to run into the same > problem. Inserts and concurrent cross-partition updates are used > heavily on this instance (the table in question uses partitioning). > Perhaps you happened upon a similar problematic production database, > and found this thread when researching the issue? Maybe we've both > seen the same problem in the wild? I did find this thread while researching the symptoms I was seeing. No partitioning where I saw them. > I have seen VACUUM get stuck like this on multiple versions, all > associated with the same application code/partitioning > scheme/workload. This includes a 15.4 instance, and various 14.* point > release instances. It seems likely that there is a bug here, and that > it affects Postgres 14, 15, and 16. Agreed.
pgsql-bugs by date: