Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune() - Mailing list pgsql-bugs

From Noah Misch
Subject Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()
Msg-id 20240110193851.f0.nmisch@google.com
In response to Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-bugs
On Wed, Jan 10, 2024 at 02:06:42PM -0500, Peter Geoghegan wrote:
> On Tue, Jan 9, 2024 at 6:16 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > On Tue, Jan 9, 2024 at 4:44 PM Noah Misch <noah@leadboat.com> wrote:
> > > On Tue, Jan 09, 2024 at 03:59:19PM -0500, Peter Geoghegan wrote:
> > > > Did the affected system that you investigated happen to have an
> > > > atypically high number of databases? The 15.4 system that I saw
> > > > the problem on had almost 3,000 databases.
> > >
> > > No, single-digit database count here.
> >
> > My suspicion was that this factor might increase the propensity of
> > calls to GetOldestNonRemovableTransactionId (used to establish
> > VACUUM's OldestXmin) to not agree with the GlobalVis* based state used
> > by pruneheap.c, in the way that we need to worry about here (i.e.
> > inconsistencies that lead to VACUUM getting stuck inside
> > lazy_scan_prune's loop).
> 
> Another question about your database/system: does VACUUM get stuck
> while scanning a page some time after it has already completed a round
> of index vacuuming?

I don't know.  That particular system experienced the infinite loop only once.

> That's what I see here -- I don't think that pruning leaves behind
> even a single live heap tuple, despite scanning thousands of pages
> before reaching the page that it gets stuck on. Could be another red
> herring. But it doesn't seem impossible that some of the nbtree calls
> to procarray.c routines performed by code added by my commit
> 9dd963ae25, "Recycle nbtree pages deleted during same VACUUM", are
> somehow related. That is, that code could be part of the chain of
> events that cause the problem (whether or not the code itself is
> technically at fault).
> 
> I'm referring to calls such as the
> "GetOldestNonRemovableTransactionId(NULL)" and
> "GlobalVisCheckRemovableFullXid()" calls that take place inside
> _bt_pendingfsm_finalize(). It's not like we do stuff like that in very
> many other places.

I see what you mean about the rarity and potential importance of
"GetOldestNonRemovableTransactionId(NULL)".  There's just one other caller,
vac_update_datfrozenxid(), which calls it for an unrelated reason.
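
For readers following the thread, the suspected failure mode quoted above
(VACUUM's OldestXmin disagreeing with the GlobalVis*-based state used by
pruneheap.c) can be illustrated with a toy model.  This is only a sketch under
stated assumptions, not the PostgreSQL code: every name prefixed with toy_ is
hypothetical, XIDs are plain integers with no wraparound handling, and the page
holds a single tuple.  It assumes the scan retries a page whenever it sees a
tuple that it classifies as dead but that pruning did not remove, and it adds a
retry cap (which the real loop is not assumed to have) so the stuck case can be
observed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy transaction ID; 0 means "tuple was never deleted". No wraparound. */
typedef uint32_t ToyXid;

typedef enum { TOY_LIVE, TOY_RECENTLY_DEAD, TOY_DEAD } ToyStatus;

/* Classification by the scan, using the cutoff fixed when VACUUM started. */
ToyStatus toy_scan_status(ToyXid xmax, ToyXid oldest_xmin)
{
    if (xmax == 0)
        return TOY_LIVE;
    return (xmax < oldest_xmin) ? TOY_DEAD : TOY_RECENTLY_DEAD;
}

/* Decision by pruning, using a separately maintained horizon. */
bool toy_prune_removes(ToyXid xmax, ToyXid prune_horizon)
{
    return xmax != 0 && xmax < prune_horizon;
}

/*
 * Process a one-tuple page.  Returns the number of passes needed, or -1 if
 * the retry cap was reached, i.e. the two horizons disagree in the way that
 * would keep an uncapped loop spinning forever.
 */
int toy_scan_page(ToyXid xmax, ToyXid oldest_xmin, ToyXid prune_horizon,
                  int max_retries)
{
    assert(max_retries > 0);

    for (int pass = 1; pass <= max_retries; pass++)
    {
        if (toy_prune_removes(xmax, prune_horizon))
            return pass;    /* tuple pruned away, nothing left to examine */

        if (toy_scan_status(xmax, oldest_xmin) != TOY_DEAD)
            return pass;    /* live or recently dead, page is done */

        /* Dead per the scan's cutoff, yet pruning kept it: retry the page. */
    }
    return -1;
}
```

With both horizons at 150, toy_scan_page(100, 150, 150, 10) finishes in one
pass.  With the pruning horizon behind the scan's cutoff,
toy_scan_page(100, 150, 90, 10) returns -1: nothing changes between passes, so
each retry reaches the same conclusion.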


