Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic
Date
Msg-id CAH2-Wz=nJ93NtbqT966_gmpVRekd4=Y9Lngv1fMNzfeVwdM+5g@mail.gmail.com
Whole thread Raw
In response to Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic  (Matthias van de Meent <boekewurm+postgres@gmail.com>)
Responses Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic  (Matthias van de Meent <boekewurm+postgres@gmail.com>)
List pgsql-hackers
On Thu, Jun 10, 2021 at 8:49 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> Could you elaborate on what this "matches what we expect" entails?
>
> Apart from this, I'm also quite certain that the goto-branch that
> created this infinite loop should have been dead code: In a correctly
> working system, the GlobalVis*Rels should always be at least as strict
> as the vacrel->OldestXmin, but at the same time only GlobalVis*Rels
> can be updated (i.e. move their horizon forward) during the vacuum. As
> such, heap_prune_satisfies_vacuum should never fail to vacuum a tuple
> that also satisifies the condition of HeapTupleSatisfiesVacuum.

It's true that these two similar functions should be in perfect
agreement in general (given the same OldestXmin). That in itself
doesn't mean that they must always agree about a tuple in practice,
when they're called in turn inside lazy_scan_prune(). In particular,
nothing stops a transaction that was in progress to
heap_prune_satisfies_vacuum (when it saw some tuples it inserted)
concurrently aborting. That will render the same tuples fully DEAD
inside HeapTupleSatisfiesVacuum(). So we need to restart using the
goto purely to cover that case. See the commit message of commit
8523492d4e3.

By "matches what we expect", I meant "involves a just-aborted
transaction". We could defensively verify that the inserting
transaction concurrently aborted at the point of retrying/calling
heap_page_prune() a second time. If there is no aborted transaction
involved (as was the case with this bug), then we can be confident
that something is seriously broken.

-- 
Peter Geoghegan



pgsql-hackers by date:

Previous
From: David Rowley
Date:
Subject: Re: "an SQL" vs. "a SQL"
Next
From: Tom Lane
Date:
Subject: Re: "an SQL" vs. "a SQL"