Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic
Date
Msg-id CAH2-Wzm3tnHJExZ+fPgVJX7v1Vo7jWDFtN040zeYsGyjfjP7qw@mail.gmail.com
Whole thread Raw
In response to Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic  (Matthias van de Meent <boekewurm+postgres@gmail.com>)
Responses Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic
List pgsql-hackers
On Wed, Jun 16, 2021 at 3:59 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> On Tue, 15 Jun 2021 at 03:22, Andres Freund <andres@anarazel.de> wrote:
> > > @@ -4032,6 +4039,24 @@ GlobalVisTestShouldUpdate(GlobalVisState *state)
> > >  static void
> > >  GlobalVisUpdateApply(ComputeXidHorizonsResult *horizons)
> > >  {
> > > +     /* assert non-decreasing nature of horizons */
> >
> > Thinking more about it, I don't think these are correct. See the
> > following comment in procarray.c:
> >
> >  * Note: despite the above, it's possible for the calculated values to move
> >  * backwards on repeated calls.
>
> So the implicit assumption in heap_page_prune that
> HeapTupleSatisfiesVacuum(OldestXmin) is always consistent with
> heap_prune_satisfies_vacuum(vacrel) has never been true. In that case,
> we'll need to redo the condition in heap_page_prune as well.

I don't think that this shows that the assumption within
lazy_scan_prune() (the assumption that both "satisfies vacuum"
functions agree) is wrong, with the obvious exception of cases
involving the bug that Justin reported. GlobalVis*.maybe_needed is
supposed to be conservative.

> PFA my adapted patch that fixes this new-ish issue, and does not
> include the (incorrect) assertions in GlobalVisUpdateApply. I've
> tested this against the reproducing case, both with and without the
> fix in GetOldestNonRemovableTransactionId, and it fails fall into an
> infinite loop.
>
> I would appreciate it if someone could validate the new logic in the
> HEAPTUPLE_DEAD case. Although I believe it correctly handles the case
> where the vistest non-removable horizon moved backwards, a second pair
> of eyes would be appreciated.

If you look at the lazy_scan_prune() logic immediately prior to commit
8523492d4e3, you'll see that it used to have a HEAPTUPLE_DEAD case
that didn't involve a restart -- this was the "tupgone" mechanism.
Back then we actually had to remove any corresponding index tuples
from indexes when in this rare case. Plus there was a huge amount of
complicated mechanism to handle a very rare case, all of which was
removed by commit 8523492d4e3. Things like extra recovery conflict
code just for this rare case, or needing to acquire a super exclusive
lock on pages during VACUUM's second heap pass. This is all cruft that
I was happy to get rid of.

This is a good discussion of the tupgone stuff and the problems it
caused, which is good background information:

https://www.postgresql.org/message-id/20200724165514.dnu5hr4vvgkssf5p%40alap3.anarazel.de

Even if it was true that heap_prune_satisfies_vacuum() won't agree
with HeapTupleSatisfiesVacuum() after repeated retries within
lazy_scan_prune(), it would probably best if we then made code outside
vacuumlazy.c agree with the lazy_scan_prune() assumption, rather than
the other way around.

Have you actually been able to demonstrate a problem involving
lazy_scan_prune()'s goto, except the main
GetOldestNonRemovableTransactionId() bug reported by Justin?

-- 
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: a path towards replacing GEQO with something better
Next
From: Robert Haas
Date:
Subject: Re: Transactions involving multiple postgres foreign servers, take 2