Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune() - Mailing list pgsql-bugs

From Peter Geoghegan
Subject Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()
Date
Msg-id CAH2-Wzn57T=d7eB90m0wr+AiAXetk-NWA=ntS89R2mOcDimNsQ@mail.gmail.com
In response to Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()  (Noah Misch <noah@leadboat.com>)
Responses Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-bugs
On Tue, Jan 9, 2024 at 4:44 PM Noah Misch <noah@leadboat.com> wrote:
> On Tue, Jan 09, 2024 at 03:59:19PM -0500, Peter Geoghegan wrote:
> > Did the affected system that you investigated happen to have an
> > atypically high number of databases? The 15.4 system that I saw
> > the problem on had almost 3,000 databases.
>
> No, single-digit database count here.

My suspicion was that this factor might increase the propensity of
calls to GetOldestNonRemovableTransactionId (used to establish
VACUUM's OldestXmin) to not agree with the GlobalVis* based state used
by pruneheap.c, in the way that we need to worry about here (i.e.
inconsistencies that lead to VACUUM getting stuck inside
lazy_scan_prune's loop).

Using gdb I was able to determine that
ComputeXidHorizonsResultLastXmin == RecentXmin at some point long
after the system got stuck (when I actually looked). So
GlobalVisTestShouldUpdate() doesn't return true at that point. And, I
see that VACUUM's OldestXmin value is between
GlobalVisDataRels.maybe_needed and
GlobalVisDataRels.definitely_needed. The deleted tuple's xmax is
committed according to OldestXmin (i.e. it's a value < OldestXmin),
and is < GlobalVisDataRels.definitely_needed, too. The same tuple xmax
is > GlobalVisDataRels.maybe_needed. As for this tuple's xmin, it was
already frozen by a previous VACUUM operation. The tuple infomask
flags indicate that it's a pretty standard deleted tuple.
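
To make the state described above concrete, here is a toy model of the
two tests that can disagree. This is a sketch under stated assumptions,
not PostgreSQL code: the Toy* types and toy_* functions are
hypothetical, plain 64-bit counters stand in for XIDs (no wraparound
handling), and the "in between" branch assumes that the horizons are
not recomputed, which matches what gdb showed here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an XID: a plain counter, no wraparound. */
typedef uint64_t ToyXid;

/* Hypothetical stand-in for the GlobalVisDataRels boundaries. */
typedef struct ToyGlobalVis
{
    ToyXid maybe_needed;      /* XIDs below this are certainly removable */
    ToyXid definitely_needed; /* XIDs at or above this are certainly needed */
} ToyGlobalVis;

/* VACUUM-style test: a deleted tuple is dead if its xmax precedes OldestXmin. */
bool
toy_dead_per_oldest_xmin(ToyXid xmax, ToyXid oldest_xmin)
{
    return xmax < oldest_xmin;
}

/*
 * Pruning-style test against the two boundaries.  Assumption: no horizon
 * recomputation takes place, so an xmax that falls between the boundaries
 * gets the conservative answer "not removable".
 */
bool
toy_removable_per_globalvis(const ToyGlobalVis *vis, ToyXid xmax)
{
    assert(vis->maybe_needed <= vis->definitely_needed);

    if (xmax < vis->maybe_needed)
        return true;
    if (xmax >= vis->definitely_needed)
        return false;
    return false;               /* in between, and horizons not refreshed */
}

/*
 * The problematic combination: one test says the tuple is dead, the other
 * declines to remove it.  Repeating both tests without any state change
 * yields the same answers every time.
 */
bool
toy_vacuum_would_retry(const ToyGlobalVis *vis, ToyXid xmax, ToyXid oldest_xmin)
{
    return toy_dead_per_oldest_xmin(xmax, oldest_xmin) &&
        !toy_removable_per_globalvis(vis, xmax);
}
```

With made-up values maybe_needed = 1000, OldestXmin = 1050,
definitely_needed = 1100 and a tuple xmax of 1020,
toy_vacuum_would_retry() returns true. An xmax of 990 is below
maybe_needed, so both tests agree and it returns false.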

Overall, few of these details seem out of the ordinary in a way that
would hint at a specific underlying cause.

It looks more like the assumptions that we make about OldestXmin
agreeing with GlobalVis* state just aren't quite robust, in general.
Ideally I'd be able to point to some specific assumption that has been
violated -- and we might yet tie the problem to some specific detail
that I've yet to identify. As I said upthread, I'm concerned that code
in places like procarray.c is rather loose about how the horizons are
recomputed, in a way that doesn't sit well with me.
GlobalVisTestShouldUpdate() thinks that it's okay to use
ComputeXidHorizonsResultLastXmin-based heuristics to decide when to
recompute horizons. It is more or less treated as a matter of weighing
costs against benefits -- not as a potential correctness issue.
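
For readers who don't have procarray.c open, the kind of heuristic I
mean looks roughly like the sketch below. It is a simplified model, not
the actual function: ToyHorizonState and toy_should_update() are
hypothetical names, XIDs are plain counters, and the first two checks
are my assumptions about the shape of the real code. Only the final
comparison is the part discussed in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an XID: a plain counter, no wraparound. */
typedef uint64_t ToyXid;

#define TOY_INVALID_XID ((ToyXid) 0)

/* Hypothetical bundle of the state the heuristic looks at. */
typedef struct ToyHorizonState
{
    ToyXid maybe_needed;
    ToyXid definitely_needed;
    ToyXid last_computed_xmin; /* models ComputeXidHorizonsResultLastXmin */
    ToyXid recent_xmin;        /* models RecentXmin */
} ToyHorizonState;

/*
 * Decide whether recomputing the horizons is "worth it".  Nothing here
 * considers whether a caller needs a fresher answer for correctness.
 */
bool
toy_should_update(const ToyHorizonState *s)
{
    /* Assumption: always recompute if the horizons were never computed. */
    if (s->last_computed_xmin == TOY_INVALID_XID)
        return true;

    /* Assumption: no gap between the boundaries means nothing to gain. */
    if (s->maybe_needed >= s->definitely_needed)
        return false;

    /* Recompute only if the latest snapshot's xmin has moved. */
    return s->recent_xmin != s->last_computed_xmin;
}
```

In the stuck state, the modeled ComputeXidHorizonsResultLastXmin equals
the modeled RecentXmin, so toy_should_update() keeps returning false
and the boundaries never move.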

--
Peter Geoghegan


