Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune() - Mailing list pgsql-bugs

From Noah Misch
Subject Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()
Date
Msg-id 20240106202413.e5@rfd.leadboat.com
Whole thread Raw
In response to Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()  (Peter Geoghegan <pg@bowt.ie>)
Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()  (Peter Geoghegan <pg@bowt.ie>)
Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()  (Andres Freund <andres@anarazel.de>)
List pgsql-bugs
On Sun, Dec 31, 2023 at 03:53:34PM -0800, Peter Geoghegan wrote:
> On Sun, Dec 24, 2023 at 6:44 PM Noah Misch <noah@leadboat.com> wrote:
> > On Mon, Nov 01, 2021 at 04:15:27PM +0100, Matthias van de Meent wrote:
> > > Another alternative would be to replace the use of vacrel->OldestXmin
> > > with `vacrel->vistest->maybe_needed` in lazy_scan_prune, but I believe
> >
> > v17 commit 1ccc1e05ae essentially did that.
> 
> Obviously, 1ccc1e05ae would fix the immediate problem of infinite
> retries, since it just rips out the loop.

Yep.

> > > that is not legal in how vacuum works (we cannot unilaterally decide
> > > that we want to retain tuples < OldestXmin).
> >
> > Do you think commit 1ccc1e05ae creates problems in that respect?  It does have
> > the effect of retaining tuples for which GlobalVisState rules "retain" but
> > HeapTupleSatisfiesVacuum() would have ruled "delete".  If that doesn't create
> > problems, then back-patching commit 1ccc1e05ae could be a fix for remaining
> > infinite-retries scenarios, if any.
> 
> My guess is that there is a decent chance that backpatching 1ccc1e05ae
> would be okay, but that isn't much use. I really don't know either way
> right now. And I wouldn't like to speculate too much further before
> gaining a proper understanding of what's going on here.

Fair enough.  While I agree there's a decent chance back-patching would be
okay, I think there's also a decent chance that 1ccc1e05ae creates the problem
Matthias theorized.  Something like: we update relfrozenxid based on
OldestXmin, even though GlobalVisState caused us to retain a tuple older than
OldestXmin.  Then relfrozenxid disagrees with table contents.

> Seems to be
> specific to partitioning with cross-partition updates.
> 
> > Using the https://postgr.es/m/d5d5af5d-ba46-aff3-9f91-776c70246cc3@gmail.com
> > procedure, I see these results:
> >
> > - A commit from the day of that email, 2021-10-29, (5ccceb2946) reproduced the
> >   "numretries" assertion failure in each of five 10m runs.
> >
> > - At the 2022-01-13 commit (18b87b201f^) just before the fix for #17255, the
> >   script instead gets: FailedAssertion("HeapTupleHeaderIsHeapOnly(htup)",
> >   File: "pruneheap.c", Line: 964.  That happened once in two 10m runs, so it
> >   was harder to reach than the numretries failure.
> >
> > - At 18b87b201f, a 1440m script run got no failures.
> >
> > I've seen symptoms that suggest the infinite-numretries bug remains reachable,
> > but I don't know how to reproduce them.  (Given the upthread notes about xmin
> > going backward during end-of-xact, I'd first try some pauses there.)  If it
> > does remain reachable, likely some other code change between 2021-10 and
> > 2022-01 made this particular test script no longer reach it.
> 
> I am aware of a production database that appears to run into the same
> problem. Inserts and concurrent cross-partition updates are used
> heavily on this instance (the table in question uses partitioning).
> Perhaps you happened upon a similar problematic production database,
> and found this thread when researching the issue? Maybe we've both
> seen the same problem in the wild?

I did find this thread while researching the symptoms I was seeing.  No
partitioning where I saw them.

> I have seen VACUUM get stuck like this on multiple versions, all
> associated with the same application code/partitioning
> scheme/workload. This includes a 15.4 instance, and various 14.* point
> release instances. It seems likely that there is a bug here, and that
> it affects Postgres 14, 15, and 16.

Agreed.



pgsql-bugs by date:

Previous
From: PG Bug reporting form
Date:
Subject: BUG #18273: Incorrect memory access occurs when gist__int index is building on data containing 2^31-1
Next
From: Peter Geoghegan
Date:
Subject: Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()