Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae - Mailing list pgsql-bugs

From Robert Haas
Subject Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae
Date
Msg-id CA+Tgmob1BtWcP6R5-toVHB5wqHasPTSR2TJkcDCutMzaUYBaHQ@mail.gmail.com
Whole thread Raw
In response to Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae  (Peter Geoghegan <pg@bowt.ie>)
Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae  (Melanie Plageman <melanieplageman@gmail.com>)
Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae  (Andres Freund <andres@anarazel.de>)
List pgsql-bugs
On Fri, Mar 22, 2024 at 3:34 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I think we need to understand what's actually happening here before we
> jump to solutions. I think it's clear that we can't determine a
> relfrozenxid in a way that is untethered from the decisions made while
> pruning, but surely whoever wrote this code intended for there to be
> some relationship between GlobalVisState and
> VacuumCutoffs->OldestXmin. For example, it may be that they intended
> for VacuumCutoffs->OldestXmin to always contain an XID that is older
> than any XID that the GlobalVisState could possibly have pruned; but
> they might have done that incorrectly, in such a fashion that when
> GlobalVisState->maybe_needed moves backward the intended invariant no
> longer holds.

OK, I did some more analysis of this. I kind of feel like maybe this
should be getting more attention from Melanie (since it was the commit
of her patch that triggered the addition of this to the open items
list) or Andres (because it has been alleged to be related to the
GlobalVisTest stuff), but here we are.

Maybe this is already clear to some other people here, but I think the
bug is essentially the result of this code in heap_vacuum_rel:

    vacrel->aggressive = vacuum_get_cutoffs(rel, params, &vacrel->cutoffs);
    vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
    vacrel->vistest = GlobalVisTestFor(rel);

The idea here seems to be that because we call vacuum_get_cutoffs()
before we call GlobalVisTestFor(), the cutoff established by the
former should precede the cutoff established by the latter.
vacrel->cutoffs.OldestXmin is propagated into vacrel->NewRelfrozenXid,
and although the latter value seems like it can get updated later, it
remains true that vacrel->cutoffs.OldestXmin affects how relfrozenxid
ultimately gets set. However, that value is set in
vacuum_get_cutoffs() to the result of
GetOldestNonRemovableTransactionId(rel); and that function works with
the results of ComputeXidHorizons(). Meanwhile, GlobalVisTestFor()
creates a visibility test object which is ultimately going to be
tested inside heap_page_prune() using GlobalVisTestIsRemovableXid()
which is a wrapper around GlobalVisTestIsRemovableFullXid() which can
call GlobalVisUpdate() which can call ... ta da ...
ComputeXidHorizons(). So the behavior of ComputeXidHorizons() is the
key point here. And sure enough, the header comment for that function
says this:

 * Note: despite the above, it's possible for the calculated values to move
 * backwards on repeated calls. The calculated values are conservative, so
 * that anything older is definitely not considered as running by anyone
 * anymore, but the exact values calculated depend on a number of things. For
 * example, if there are no transactions running in the current database, the
 * horizon for normal tables will be latestCompletedXid. If a transaction
 * begins after that, its xmin will include in-progress transactions in other
 * databases that started earlier, so another call will return a lower value.
 * Nonetheless it is safe to vacuum a table in the current database with the
 * first result.  There are also replication-related effects: a walsender
 * process can set its xmin based on transactions that are no longer running
 * on the primary but are still being replayed on the standby, thus possibly
 * making the values go backwards.  In this case there is a possibility that
 * we lose data that the standby would like to have, but unless the standby
 * uses a replication slot to make its xmin persistent there is little we can
 * do about that --- data is only protected if the walsender runs continuously
 * while queries are executed on the standby.  (The Hot Standby code deals
 * with such cases by failing standby queries that needed to access
 * already-removed data, so there's no integrity bug.)

That means that heap_vacuum_rel() isn't guaranteeing anything by
calling vacuum_get_cutoffs() before it calls GlobalVisTestFor(). I
started wondering when this bug got introduced, and concluded that it
probably dates all the way back to
dc7420c2c9274a283779ec19718d2d16323640c0, when Andres first introduced
the GlobalVisTest stuff. The code was substantially refactored
afterwards by Peter Geoghegan, in
b4af70cb210393c9c8f41643acf6b213e21178e7 and
73f6ec3d3c8d5786c54373e71a096e5acf78e7ca. However, as far as I can
see, neither of those commits created the problem; they just moved
code around, and in fact I'd say that the problem is easier to see now
than it was before all of that refactoring because the two problematic
initializations are now closer together than they were originally. I
think that Peter may have been under the impression at the time of the
latter commit that the order of initializing vacrel->cutoffs vs.
vacrel->vistest was significant, because the commit message says:

    Explain the relationship between vacuumlazy.c's vistest and OldestXmin
    cutoffs.  These closely related cutoffs are different in subtle but
    important ways.  Also document a closely related rule: we must establish
    rel_pages _after_ OldestXmin to ensure that no XID < OldestXmin can be
    missed by lazy_scan_heap().

And a comment in that commit says:

+     * We expect vistest will always make heap_page_prune remove any deleted
+     * tuple whose xmax is < OldestXmin.  lazy_scan_prune must never become

Neither of these is perhaps 100% explicit about the ordering being
relevant, but if that wasn't thought to make this code safe, then I
don't know what was thought to make it safe.

At any rate, unless I'm missing something, which is definitely
possible, the actual bug dates to dc7420c2c, which means whatever fix
we come up with here will need to be back-patched back to v14.

But what exactly is the fix here? I now understand why there was
discussion of preventing maybe_needed from going backward, but I'm not
sure that's quite the right idea. I think the problem isn't exactly
that the maybe_needed value that is updated by GlobalVisUpdateApply()
can go backwards, but rather that the SOMETHING_oldest_removable value
on which it depends can go backwards. AFAICS, the problem isn't so
much that the GlobalVisTest itself can retreat to an older value, but
that it can retreat compared to the OldestXmin obtained from calling
GetOldestNonRemovableTransactionId(), which depends on the
SOMETHING_oldest_removable threshold but *not* on maybe_needed.
Perhaps, for some reason I'm not grokking at the moment, preventing
maybe_needed from retreating would be good enough to prevent trouble
in practice, but it doesn't appear to me to be the most principled
solution as of this writing.

I kind of feel like the code is just going at this in a way that's
completely backwards. It feels like what we should be doing is looking
at the actual states of the pages post-pruning, and specifically what
the oldest (M)XIDs that are really present there are. But changing
that seems like a lot to do in a fix that needs to be back-patched.
It's tempting to look for a narrower fix, where we just find out the
oldest horizon that was ever actually used. But I don't feel very good
about doing that by tampering with ComputeXidHorizons(), either. That
seems like it could have nasty side effects in terms of either
performance or correctness, which is really scary for a fix that needs
to be back-patched. I wonder if there's some way that we can get a
more reliable idea of what XIDs were used for pruning that is strictly
local to vacuum and can't affect the rest of the system (and is also
hopefully not so conservative that it causes damage by unduly reducing
relfrozenxid advancement). However, I don't have a good idea about
what such a fix might look like right now, and I'm out of time for
today.

--
Robert Haas
EDB: http://www.enterprisedb.com



pgsql-bugs by date:

Previous
From: vignesh C
Date:
Subject: Re: BUG #18415: Logical replication errors.
Next
From: Peter Geoghegan
Date:
Subject: Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae