Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune() - Mailing list pgsql-bugs

From Peter Geoghegan
Subject Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()
Date
Msg-id CAH2-Wznv94Q_Td8OS8bAN7fYLpfU6CGgjn6Xau5eJ_sDxEGeBA@mail.gmail.com
In response to Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()  (Noah Misch <noah@leadboat.com>)
List pgsql-bugs
On Tue, Jan 9, 2024 at 6:16 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Jan 9, 2024 at 4:44 PM Noah Misch <noah@leadboat.com> wrote:
> > On Tue, Jan 09, 2024 at 03:59:19PM -0500, Peter Geoghegan wrote:
> > > Did the affected system that you investigated happen to have an
> > > atypically high number of databases? The 15.4 system that I saw
> > > the problem on had almost 3,000 databases.
> >
> > No, single-digit database count here.
>
> My suspicion was that this factor might increase the propensity of
> calls to GetOldestNonRemovableTransactionId (used to establish
> VACUUM's OldestXmin) to not agree with the GlobalVis* based state used
> by pruneheap.c, in the way that we need to worry about here (i.e.
> inconsistencies that lead to VACUUM getting stuck inside
> lazy_scan_prune's loop).

Another question about your database/system: does VACUUM get stuck
while scanning a page some time after it has already completed a round
of index vacuuming? And if so, does an nbtree bulk delete end up
deleting and then recycling many index leaf pages (e.g., due to bulk
range deletions)?

That's what I see here -- I don't think that pruning leaves behind
even a single live heap tuple, despite scanning thousands of pages
before reaching the page that it gets stuck on. Could be another red
herring. But it doesn't seem impossible that some of the nbtree calls
to procarray.c routines performed by code added by my commit
9dd963ae25, "Recycle nbtree pages deleted during same VACUUM", are
somehow related. That is, that code could be part of the chain of
events that cause the problem (whether or not the code itself is
technically at fault).

I'm referring to calls such as the
"GetOldestNonRemovableTransactionId(NULL)" and
"GlobalVisCheckRemovableFullXid()" calls that take place inside
_bt_pendingfsm_finalize(). It's not like we do stuff like that in very
many other places.
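
For anyone following along, here is a toy model of the failure mode under
discussion. It is only a sketch, not PostgreSQL code: the names HeapTuple,
prune, and scan_prune are hypothetical, XIDs are plain integers, and every
deleting transaction is assumed to have committed. It relies on my
understanding that lazy_scan_prune prunes the page using the GlobalVis*
state, then rechecks each remaining tuple against OldestXmin and retries if
any tuple still looks dead. If the horizon used by pruning is behind
OldestXmin, the retry can never make progress.

```python
# Toy model only: not PostgreSQL code. All names here are hypothetical.
from dataclasses import dataclass


@dataclass
class HeapTuple:
    xmax: int  # XID of the (committed, in this model) deleting transaction


def prune(page, prune_horizon):
    """Stand-in for pruning: drop tuples whose deleter precedes the
    horizon that pruning uses (the GlobalVis*-like state)."""
    return [t for t in page if not (t.xmax < prune_horizon)]


def scan_prune(page, prune_horizon, oldest_xmin, max_retries=5):
    """Stand-in for the prune-then-recheck loop.

    Returns (retries, stuck). The retry cap is an assumption added so
    that the model terminates; the loop being modelled has no cap.
    """
    retries = 0
    while True:
        page = prune(page, prune_horizon)
        # Recheck with VACUUM's own cutoff (OldestXmin-like).
        still_dead = any(t.xmax < oldest_xmin for t in page)
        if not still_dead:
            return retries, False
        if retries == max_retries:
            return retries, True
        retries += 1


# Consistent horizons: pruning removes everything the recheck calls dead.
print(scan_prune([HeapTuple(xmax=100)], prune_horizon=150, oldest_xmin=120))

# Pruning horizon behind OldestXmin: the tuple survives every prune,
# yet the recheck keeps calling it dead, so the loop never exits.
print(scan_prune([HeapTuple(xmax=100)], prune_horizon=90, oldest_xmin=120))
```

The second call is the interesting one: nothing changes between iterations,
so retrying cannot help. Whether the real disagreement arises this way, and
whether the procarray.c calls mentioned above play any part in it, is
exactly what remains unconfirmed.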

--
Peter Geoghegan


