Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic - Mailing list pgsql-hackers

From Matthias van de Meent
Subject Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic
Date
Msg-id CAEze2WiU-crhst9Xtk=6sk8rBspD0LGE1N=cafVg091Twu4FQw@mail.gmail.com
Whole thread Raw
In response to Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic
List pgsql-hackers
On Thu, 10 Jun 2021 at 18:03, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Thu, Jun 10, 2021 at 8:49 AM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > Could you elaborate on what this "matches what we expect" entails?
> >
> > Apart from this, I'm also quite certain that the goto-branch that
> > created this infinite loop should have been dead code: In a correctly
> > working system, the GlobalVis*Rels should always be at least as strict
> > as the vacrel->OldestXmin, but at the same time only GlobalVis*Rels
> > can be updated (i.e. move their horizon forward) during the vacuum. As
> > such, heap_prune_satisfies_vacuum should never fail to vacuum a tuple
> > that also satisifies the condition of HeapTupleSatisfiesVacuum.
>
> It's true that these two similar functions should be in perfect
> agreement in general (given the same OldestXmin). That in itself
> doesn't mean that they must always agree about a tuple in practice,
> when they're called in turn inside lazy_scan_prune(). In particular,
> nothing stops a transaction that was in progress to
> heap_prune_satisfies_vacuum (when it saw some tuples it inserted)
> concurrently aborting. That will render the same tuples fully DEAD
> inside HeapTupleSatisfiesVacuum(). So we need to restart using the
> goto purely to cover that case. See the commit message of commit
> 8523492d4e3.

I totally overlooked that HeapTupleSatisfiesVacuumHorizon does the
heavyweight XID validation and does return HEAPTUPLE_DEAD in those
recently rolled back cases. Thank you for reminding me.

> By "matches what we expect", I meant "involves a just-aborted
> transaction". We could defensively verify that the inserting
> transaction concurrently aborted at the point of retrying/calling
> heap_page_prune() a second time. If there is no aborted transaction
> involved (as was the case with this bug), then we can be confident
> that something is seriously broken.

I believe there are more cases than only the rolled back case, but
checking for those cases would potentially help, yes.

With regards,

Matthias van de Meent.



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Transactions involving multiple postgres foreign servers, take 2
Next
From: Peter Geoghegan
Date:
Subject: Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic