On Wed, Jan 10, 2024 at 02:06:42PM -0500, Peter Geoghegan wrote:
> On Tue, Jan 9, 2024 at 6:16 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > On Tue, Jan 9, 2024 at 4:44 PM Noah Misch <noah@leadboat.com> wrote:
> > > On Tue, Jan 09, 2024 at 03:59:19PM -0500, Peter Geoghegan wrote:
> > > > Did the affected system that you investigated happen to have an
> > > > atypically high number of databases? The 15.4 system that I saw
> > > > the problem on had almost 3,000 databases.
> > >
> > > No, single-digit database count here.
> >
> > My suspicion was that this factor might increase the propensity of
> > calls to GetOldestNonRemovableTransactionId (used to establish
> > VACUUM's OldestXmin) to not agree with the GlobalVis* based state used
> > by pruneheap.c, in the way that we need to worry about here (i.e.
> > inconsistencies that lead to VACUUM getting stuck inside
> > lazy_scan_prune's loop).
>
> Another question about your database/system: does VACUUM get stuck
> while scanning a page some time after it has already completed a round
> of index vacuuming?

I don't know. That particular system experienced the infinite loop only once.

> That's what I see here -- I don't think that pruning leaves behind
> even a single live heap tuple, despite scanning thousands of pages
> before reaching the page that it gets stuck on. Could be another red
> herring. But it doesn't seem impossible that some of the nbtree calls
> to procarray.c routines performed by code added by my commit
> 9dd963ae25, "Recycle nbtree pages deleted during same VACUUM", are
> somehow related. That is, that code could be part of the chain of
> events that cause the problem (whether or not the code itself is
> technically at fault).
>
> I'm referring to calls such as the
> "GetOldestNonRemovableTransactionId(NULL)" and
> "GlobalVisCheckRemovableFullXid()" calls that take place inside
> _bt_pendingfsm_finalize(). It's not like we do stuff like that in very
> many other places.

I see what you mean about the rarity and potential importance of
"GetOldestNonRemovableTransactionId(NULL)". There's just one other caller,
vac_update_datfrozenxid(), which calls it for an unrelated reason.
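
For anyone following along, here is a toy model of the suspected failure
mode discussed above: pruning decides removability from one horizon
(GlobalVis*-like state), while the check after pruning classifies tuples
against VACUUM's OldestXmin, and lazy_scan_prune retries whenever it still
finds a DEAD tuple. This is only a sketch under stated assumptions, not
PostgreSQL code. Every name in it (ToyXid, toy_prune_removes,
toy_tuple_is_dead, toy_scan_prune, max_retries) is hypothetical, XID
wraparound is ignored, and the retry cap exists only so the model
terminates. The real loop has no such cap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a transaction ID; ignores wraparound entirely. */
typedef uint32_t ToyXid;

/*
 * Hypothetical model of pruning: the tuple is removed only if its deleter
 * precedes the horizon that pruning believes in (GlobalVis*-like state).
 */
static bool
toy_prune_removes(ToyXid deleter_xid, ToyXid prune_horizon)
{
    return deleter_xid < prune_horizon;
}

/*
 * Hypothetical model of the post-prune check: the tuple counts as DEAD if
 * its deleter precedes VACUUM's OldestXmin.
 */
static bool
toy_tuple_is_dead(ToyXid deleter_xid, ToyXid oldest_xmin)
{
    return deleter_xid < oldest_xmin;
}

/*
 * Model of the retry loop for a single deleted tuple.  Returns the number
 * of retries needed, or -1 if max_retries was reached (the stand-in for
 * "stuck forever").  Both horizons stay fixed across retries here, which
 * is an assumption of this model.
 */
static int
toy_scan_prune(ToyXid deleter_xid, ToyXid oldest_xmin,
               ToyXid prune_horizon, int max_retries)
{
    int retries = 0;

    assert(max_retries >= 0);

    for (;;)
    {
        bool removed = toy_prune_removes(deleter_xid, prune_horizon);

        /* Done if pruning removed the tuple or it is not DEAD anyway. */
        if (removed || !toy_tuple_is_dead(deleter_xid, oldest_xmin))
            return retries;

        /* Tuple is DEAD per OldestXmin, yet pruning left it in place. */
        if (retries >= max_retries)
            return -1;
        retries++;
    }
}
```

With agreeing horizons, e.g. toy_scan_prune(90, 100, 100, 5), the model
finishes without retrying. When the pruning horizon lags OldestXmin and the
deleter falls between them, e.g. toy_scan_prune(95, 100, 90, 5), every
retry sees the same DEAD tuple and the model hits its cap.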