Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune() - Mailing list pgsql-bugs

From Melanie Plageman
Subject Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()
Msg-id CAAKRu_ai8PMW5cqCFhu-U46CWLmgP2d_FnpLOqCSvMxY-UQ9xw@mail.gmail.com
In response to Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()  (Andres Freund <andres@anarazel.de>)
Responses Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()
List pgsql-bugs
On Mon, Apr 15, 2024 at 1:39 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> I've tried a couple times to catch up with this thread. But always kinda felt
> I must be missing something. It might be that this is one part of the
> confusion:
>
> On 2024-01-06 12:24:13 -0800, Noah Misch wrote:
> > Fair enough.  While I agree there's a decent chance back-patching would be
> > okay, I think there's also a decent chance that 1ccc1e05ae creates the problem
> > Matthias theorized.  Something like: we update relfrozenxid based on
> > OldestXmin, even though GlobalVisState caused us to retain a tuple older than
> > OldestXmin.  Then relfrozenxid disagrees with table contents.
>
> Looking at the state as of 1ccc1e05ae, I don't see how - in lazy_scan_prune(),
> if heap_page_prune() spuriously didn't prune a tuple, because the horizon went
> backwards, we'd encounter the tuple in the loop below and call
> heap_prepare_freeze_tuple(), which would error out with one of
>
>     /*
>      * Process xmin, while keeping track of whether it's already frozen, or
>      * will become frozen iff our freeze plan is executed by caller (could be
>      * neither).
>      */
>     xid = HeapTupleHeaderGetXmin(tuple);
>     if (!TransactionIdIsNormal(xid))
>         xmin_already_frozen = true;
>     else
>     {
>         if (TransactionIdPrecedes(xid, cutoffs->relfrozenxid))
>             ereport(ERROR,
>                     (errcode(ERRCODE_DATA_CORRUPTED),
>                      errmsg_internal("found xmin %u from before relfrozenxid %u",
>                                      xid, cutoffs->relfrozenxid)));
>
> or
>                 if (TransactionIdPrecedes(update_xact, cutoffs->relfrozenxid))
>                         ereport(ERROR,
>                                         (errcode(ERRCODE_DATA_CORRUPTED),
>                                          errmsg_internal("multixact %u contains update XID %u from before relfrozenxid %u",
>                                                                          multi, update_xact,
>                                                                          cutoffs->relfrozenxid)));
> or
>                 /* Raw xmax is normal XID */
>                 if (TransactionIdPrecedes(xid, cutoffs->relfrozenxid))
>                         ereport(ERROR,
>                                         (errcode(ERRCODE_DATA_CORRUPTED),
>                                          errmsg_internal("found xmax %u from before relfrozenxid %u",
>                                                                          xid, cutoffs->relfrozenxid)));
>
>
> I'm not saying that spuriously erroring out would be ok. But I guess I just
> don't understand the data corruption theory in this subthread, because we'd
> error out if we encountered a tuple that should have been frozen but wasn't?

I have a more basic question. How could GlobalVisState->maybe_needed
going backwards cause a problem with relfrozenxid? Yes, if
maybe_needed goes backwards, we may not remove a tuple whose xmin/xmax
are older than VacuumCutoffs->OldestXmin. But, if that tuple's
xmin/xmax are older than OldestXmin, then wouldn't we freeze it? If we
freeze it, there isn't an issue. And if the tuple's xids are not newer
than OldestXmin, then how could we end up advancing relfrozenxid to a
value greater than the tuple's xids?
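
To make the question concrete, here is a toy model of that argument. It is
not PostgreSQL code: ToyXid, ToyTuple, toy_freeze_page() and
toy_relfrozenxid_is_consistent() are hypothetical names, XIDs are compared
as plain integers (no wraparound), and xmax and multixacts are ignored. It
also assumes that every surviving tuple with an xmin older than OldestXmin
gets frozen, which is exactly the premise I am asking about.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: plain integer XIDs, no wraparound, xmin only. */
typedef uint32_t ToyXid;

/* Stand-in for a frozen xmin (hypothetical constant for this sketch). */
#define TOY_FROZEN_XID ((ToyXid) 2)

typedef struct ToyTuple
{
    ToyXid      xmin;
} ToyTuple;

/*
 * Freeze every tuple that survived pruning and whose xmin is older than
 * oldest_xmin.  Whether pruning "should" have removed the tuple does not
 * matter here: if it is still on the page, it is frozen like any other.
 */
static void
toy_freeze_page(ToyTuple *tuples, size_t ntuples, ToyXid oldest_xmin)
{
    assert(oldest_xmin > TOY_FROZEN_XID);

    for (size_t i = 0; i < ntuples; i++)
    {
        if (tuples[i].xmin == TOY_FROZEN_XID)
            continue;           /* already frozen */
        if (tuples[i].xmin < oldest_xmin)
            tuples[i].xmin = TOY_FROZEN_XID;
    }
}

/*
 * relfrozenxid agrees with the page contents if no unfrozen tuple has an
 * xmin older than relfrozenxid.
 */
static bool
toy_relfrozenxid_is_consistent(const ToyTuple *tuples, size_t ntuples,
                               ToyXid relfrozenxid)
{
    for (size_t i = 0; i < ntuples; i++)
    {
        if (tuples[i].xmin != TOY_FROZEN_XID &&
            tuples[i].xmin < relfrozenxid)
            return false;
    }
    return true;
}
```

Under those assumptions, calling toy_freeze_page() with OldestXmin and then
setting relfrozenxid to OldestXmin always passes the consistency check, no
matter how many tuples pruning left behind. The check can only fail if
relfrozenxid is advanced beyond OldestXmin, or if a surviving old tuple is
somehow not frozen.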

- Melanie


