Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune() - Mailing list pgsql-bugs

From Peter Geoghegan
Subject Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()
Msg-id CAH2-WzkCOh3r-qpS16-TXFg6kzeg12qawJSTTHtNMavtXXG-sg@mail.gmail.com
In response to Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()  (Noah Misch <noah@leadboat.com>)
Responses Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-bugs
On Sat, Jan 6, 2024 at 12:24 PM Noah Misch <noah@leadboat.com> wrote:
> On Sun, Dec 31, 2023 at 03:53:34PM -0800, Peter Geoghegan wrote:
> > My guess is that there is a decent chance that backpatching 1ccc1e05ae
> > would be okay, but that isn't much use. I really don't know either way
> > right now. And I wouldn't like to speculate too much further before
> > gaining a proper understanding of what's going on here.
>
> Fair enough.  While I agree there's a decent chance back-patching would be
> okay, I think there's also a decent chance that 1ccc1e05ae creates the problem
> Matthias theorized.  Something like: we update relfrozenxid based on
> OldestXmin, even though GlobalVisState caused us to retain a tuple older than
> OldestXmin.  Then relfrozenxid disagrees with table contents.

Either every relevant code path has the same OldestXmin to work off
of, or the whole NewRelfrozenXid/relfrozenxid-tracking thing can't be
expected to work as designed. I find it a bit odd that
pruneheap.c/GlobalVisState has no direct understanding of this
dependency (none that I can discern, at least). Wouldn't it at least
be more natural if pruneheap.c could access OldestXmin when run inside
VACUUM? (Could just be used by defensive hardening code.)
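
A minimal sketch of what such a hardening check could look like, written as a
self-contained toy model rather than as PostgreSQL code. Every name here
(ToyXid, toy_xid_precedes, toy_dead_under, toy_prune_decision_consistent,
toy_check_retained_tuple) is hypothetical, and the model assumes each deleter
XID has committed. It only illustrates the dependency: a tuple that pruning
kept must not count as dead under VACUUM's OldestXmin.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a 32-bit transaction ID. */
typedef uint32_t ToyXid;

/* Modulo-2^32 comparison: true if a is logically older than b. */
bool toy_xid_precedes(ToyXid a, ToyXid b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Toy rule: a deleted tuple whose deleter XID is xmax (assumed committed)
 * is dead to everyone once xmax is older than the given horizon.
 */
bool toy_dead_under(ToyXid xmax, ToyXid horizon)
{
    return toy_xid_precedes(xmax, horizon);
}

/*
 * Pruning decides with prune_horizon; VACUUM decides with oldest_xmin.
 * The two agree for this tuple unless pruning kept a tuple that VACUUM
 * considers dead.
 */
bool toy_prune_decision_consistent(ToyXid xmax, ToyXid prune_horizon,
                                   ToyXid oldest_xmin)
{
    bool pruned = toy_dead_under(xmax, prune_horizon);
    bool dead_for_vacuum = toy_dead_under(xmax, oldest_xmin);

    return pruned || !dead_for_vacuum;
}

/* Assertion form of the check for a tuple that survived pruning. */
void toy_check_retained_tuple(ToyXid xmax, ToyXid oldest_xmin)
{
    assert(!toy_dead_under(xmax, oldest_xmin));
}
```

In this model the check fails exactly when the horizon used for pruning is
older than OldestXmin and the tuple's deleter XID falls between the two.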

We're also relying on vacuumlazy.c's call to vacuum_get_cutoffs()
(which itself calls GetOldestNonRemovableTransactionId()) taking place
before vacuumlazy.c goes on to call GlobalVisTestFor() a few lines
further down (I think). It seems like the code in procarray.c ought to
have something to say about the vacuumlazy.c dependency, too.
But offhand it doesn't look like it does, either. Why shouldn't we
expect random implementation details in code like ComputeXidHorizons()
to break the assumption/dependency within vacuumlazy.c?

I also worry about the possibility that GlobalVisTestShouldUpdate()
masks problems in this area (as opposed to causing the problems). It
seems very hard to test.

> I did find this thread while researching the symptoms I was seeing.  No
> partitioning where I saw them.

If this was an isolated incident then it could perhaps have been a
symptom of corruption. Though corruption seems highly unlikely to be
involved with the cases that I've seen, since they appear
intermittently, across a variety of different contexts/versions.

--
Peter Geoghegan