Re: New strategies for freezing, advancing relfrozenxid early - Mailing list pgsql-hackers

From: Peter Geoghegan
Subject: Re: New strategies for freezing, advancing relfrozenxid early
Msg-id: CAH2-WzksXKMS=Nt7kHzZdnpXeQ6yTOLzbyL-qCUbGcZruqnm3w@mail.gmail.com
In response to: Re: New strategies for freezing, advancing relfrozenxid early (Jeff Davis <pgsql@j-davis.com>)
Responses: Re: New strategies for freezing, advancing relfrozenxid early (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers
On Tue, Dec 20, 2022 at 5:44 PM Jeff Davis <pgsql@j-davis.com> wrote:
> Comments on 0002:
>
> Can you explain the following portion of the diff:
>
>
>   - else if (MultiXactIdPrecedes(multi, cutoffs->MultiXactCutoff))
>   + else if (MultiXactIdPrecedes(multi, cutoffs->OldestMxact))
>
>   ...
>
>   + /* Can't violate the MultiXactCutoff invariant, either */
>   + if (!need_replace)
>   +     need_replace = MultiXactIdPrecedes(
>   +        multi, cutoffs->MultiXactCutoff);

Don't forget the historical context: before Postgres 15's commit
0b018fab, VACUUM's final relfrozenxid always came from FreezeLimit.
Almost all of this code predates that work. So the general idea that
you can make a "should I freeze or should I ratchet back my
relfrozenxid tracker instead?" trade-off at the level of individual
tuples and pages is still a very new one. Right now it's only applied
within lazy_scan_noprune(), but 0002 leverages the same principles
here.

Before now, these heapam.c freezing routines had cutoffs called
cutoff_xid and cutoff_multi. These had values that actually came from
vacuumlazy.c's FreezeLimit and MultiXactCutoff cutoffs (which was
rather unclear). But cutoff_xid and cutoff_multi were *also* used as
inexact proxies for OldestXmin and OldestMxact (also kind of unclear,
but true). For example, there are some sanity checks in heapam.c that
kind of pretend that cutoff_xid is OldestXmin, even though it usually
isn't the same value (it can be, but only during VACUUM FREEZE, or
when the min freeze age is 0 in some other way).
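
To make that concrete, here's a standalone sketch of the cutoffs that
0002 passes down together (illustration only, not the actual structs;
the real code uses wraparound-aware comparisons such as
TransactionIdPrecedes() rather than plain "<"):

#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint32_t MultiXactId;

/* The four cutoffs, named as in the patch */
struct cutoffs_sketch
{
    TransactionId OldestXmin;       /* XIDs >= this might still be running */
    MultiXactId   OldestMxact;      /* same idea, for MultiXactIds */
    TransactionId FreezeLimit;      /* XIDs < this *must* be frozen */
    MultiXactId   MultiXactCutoff;  /* MXIDs < this *must* be frozen */
};

static void
check_cutoff_ordering(const struct cutoffs_sketch *c)
{
    /*
     * FreezeLimit never exceeds OldestXmin (they're equal under VACUUM
     * FREEZE), and likewise for the multixact pair -- which is how one
     * cutoff could get away with playing both roles for so long.
     */
    assert(c->FreezeLimit <= c->OldestXmin);
    assert(c->MultiXactCutoff <= c->OldestMxact);
}

The point is that FreezeLimit/MultiXactCutoff are the hard requirements,
while OldestXmin/OldestMxact only describe what is removable at all.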

So 0002 teaches the same heapam.c code about everything -- about all
of the different cutoffs, and about the true requirements of VACUUM
around relfrozenxid advancement. In fact, 0002 makes vacuumlazy.c cede
a lot of control of "XID stuff" to the same heapam.c code, freeing
vacuumlazy.c up to think about freezing as something that works at the
level of physical pages. This is key to allowing vacuumlazy.c to reason about
freezing at the level of the whole table. It thinks about physical
blocks, leaving logical XIDs up to heapam.c code.

This business that you asked about in FreezeMultiXactId() is needed so
that we can allow vacuumlazy.c to "think in terms of physical pages",
while at the same time avoiding allocating new Multis in VACUUM --
which requires "thinking about individual xmax fields" instead -- a
somewhat conflicting goal. We're really trying to have it both ways
(we get page-level freezing, with a little tuple-level freezing on the
side, sufficient to avoid allocating new Multis during VACUUMs in
roughly the same way as we do right now).

In most cases "freezing a page" removes all XIDs < OldestXmin, and all
MXIDs < OldestMxact. It doesn't quite work that way in certain rare
cases involving MultiXacts, though. It is convenient to define "freeze
the page" in a way that gives heapam.c's FreezeMultiXactId() the
leeway to put off the work of processing an individual tuple's xmax,
whenever it happens to be a MultiXactId that would require an
expensive second pass to process aggressively (allocating a new Multi
during VACUUM is especially worth avoiding here).
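
Spelled out, the rule 0002 ends up with in FreezeMultiXactId() is
roughly the following (simplified standalone sketch, not the real code
-- the real function handles several other FRM_* cases first, and uses
wraparound-aware comparisons):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint32_t MultiXactId;

/*
 * Must we replace this multi when freezing its page?  member_xids are
 * the multi's remaining member XIDs.
 */
static bool
sketch_need_replace(MultiXactId multi,
                    const TransactionId *member_xids, int nmembers,
                    TransactionId FreezeLimit,
                    MultiXactId MultiXactCutoff)
{
    bool need_replace = false;

    /* Can't leave behind any member XID that violates FreezeLimit */
    for (int i = 0; i < nmembers; i++)
    {
        if (member_xids[i] < FreezeLimit)
        {
            need_replace = true;
            break;
        }
    }

    /* Can't violate the MultiXactCutoff invariant, either */
    if (!need_replace)
        need_replace = multi < MultiXactCutoff;

    /*
     * false here means FRM_NOOP: the multi stays in the tuple's xmax,
     * and the page's would-be relfrozenxid/relminmxid trackers get
     * ratcheted back instead, so that it's still safe to advance them.
     */
    return need_replace;
}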

Our definition of "freeze the page" is a bit creative, at least if
you're used to thinking about it in terms of strict XID-wise cutoffs
like OldestXmin/FreezeLimit. But even if you do think of it in terms
of XIDs, the practical difference is extremely small.

FreezeMultiXactId() effectively makes a decision on how to proceed
with processing at the level of each individual xmax field.  Its no-op
multi processing "freezes" a costly-to-process xmax in place when (for
whatever reason) page-level freezing is triggered for the page.
If, on the other hand, page-level freezing isn't triggered
for the page, then page-level no-op processing takes care of the multi
for us instead. Either way, the remaining Multi will ratchet back
VACUUM's relfrozenxid and/or relminmxid trackers as required, and we
won't need an expensive second pass over the multi (unless we really
have no choice, for example during a VACUUM FREEZE, where
OldestXmin==FreezeLimit).

> Regarding correctness, it seems like the basic structure and invariants
> are the same, and it builds on the changes already in 9e5405993c. Patch
> 0002 seems *mostly* about making choices within the existing framework.
> That gives me more confidence.

You're right that it's the same basic invariants as before, of course.
Turns out that those invariants can be pushed quite far.

Though note that I kind of invented a new invariant (not really, sort
of). Well, it's a postcondition, which is a sort of invariant: any
scanned heap page that can be cleanup locked must never have any
remaining XIDs < FreezeLimit, nor can any MXIDs < MultiXactCutoff
remain. But a cleanup-locked page does *not* need to get rid of all
XIDs < OldestXmin, nor MXIDs < OldestMxact.
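
Written as an assertion, the postcondition for a scanned page that we
managed to cleanup lock is roughly this (sketch only, again ignoring
XID/MXID wraparound):

#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint32_t MultiXactId;

/*
 * Check every XID/MXID still present on the page once lazy_scan_prune
 * is done with it.  Note what is deliberately *not* asserted: values in
 * [FreezeLimit, OldestXmin) and [MultiXactCutoff, OldestMxact) may
 * legitimately remain.
 */
static void
sketch_page_postcondition(const TransactionId *remaining_xids, int nxids,
                          const MultiXactId *remaining_mxids, int nmxids,
                          TransactionId FreezeLimit,
                          MultiXactId MultiXactCutoff)
{
    for (int i = 0; i < nxids; i++)
        assert(remaining_xids[i] >= FreezeLimit);
    for (int i = 0; i < nmxids; i++)
        assert(remaining_mxids[i] >= MultiXactCutoff);
}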

This flexibility is mostly useful because it allows lazy_scan_prune to
just decide to not freeze. But, to a much lesser degree, it's useful
because of the edge case with multis -- in general we might just need
the same leeway when lazy_scan_prune "freezes the page".

> That being said, it does push harder against the limits on both sides.
> If I understand correctly, that means pages with wider distributions of
> xids are going to persist longer, which could expose pre-existing bugs
> in new and interesting ways.

I don't think it's fundamentally different from what we're already doing
in lazy_scan_noprune. It's just more complicated, because you have to
tease apart slightly different definitions of freezing to understand the
code around FreezeMultiXactId(). This is more or less needed to
provide maximum flexibility, where we delay decisions about what to do
until the very last moment.

> Next, the 'freeze_required' field suggests that it's more involved in
> the control flow that causes freezing than it actually is. All it does
> is communicate how the trackers need to be adjusted. The return value
> of heap_prepare_freeze_tuple() (and underneath, the flags set by
> FreezeMultiXactId()) are what actually control what happens. It would
> be nice to make this more clear somehow.

I'm not sure what you mean. Page-level freezing *doesn't* have to go
ahead when freeze_required is never set to true for any tuple on
the page (which is most of the time, in practice). lazy_scan_prune
gets to make a choice about freezing the page, when the choice is
available.

Note also that in the FRM_NOOP case, the call to FreezeMultiXactId()
won't leave behind a freeze plan for the tuple (unless its xmin happens
to necessitate a freeze plan for the same tuple). And yet it still does
useful work, needed iff the
"freeze the page" path is ultimately taken by lazy_scan_prune --
FreezeMultiXactId() itself will ratchet back
FreezePageRelfrozenXid/NewRelfrozenXid as needed to make everything
safe.
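
Maybe a sketch of the hand-off makes the control flow clearer (names
approximately as in the patch; this isn't the real code, just its
shape):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint32_t MultiXactId;

/* What the heapam.c freezing routines accumulate for a whole page */
struct page_freeze_sketch
{
    bool          freeze_required;         /* no choice: must freeze */
    /* final trackers if lazy_scan_prune freezes the page */
    TransactionId FreezePageRelfrozenXid;
    MultiXactId   FreezePageRelminMxid;
    /* final trackers if it opts not to */
    TransactionId NoFreezePageRelfrozenXid;
    MultiXactId   NoFreezePageRelminMxid;
};

/*
 * The choice lazy_scan_prune makes once every tuple on the page has
 * been through heap_prepare_freeze_tuple().  heuristics_favor_freezing
 * stands in for whatever page-level policy applies.  Plain assignment
 * is enough here, assuming the page trackers started out as copies of
 * the running table-wide values and were only ever ratcheted back.
 */
static bool
sketch_will_freeze_page(const struct page_freeze_sketch *pagefrz,
                        bool heuristics_favor_freezing,
                        TransactionId *NewRelfrozenXid,
                        MultiXactId *NewRelminMxid)
{
    bool freeze_page = pagefrz->freeze_required || heuristics_favor_freezing;

    if (freeze_page)
    {
        *NewRelfrozenXid = pagefrz->FreezePageRelfrozenXid;
        *NewRelminMxid = pagefrz->FreezePageRelminMxid;
    }
    else
    {
        *NewRelfrozenXid = pagefrz->NoFreezePageRelfrozenXid;
        *NewRelminMxid = pagefrz->NoFreezePageRelminMxid;
    }

    return freeze_page;
}

So freeze_required only takes the choice away from lazy_scan_prune; the
freeze plans (and the FRM_* flags underneath) are still what determine
which individual xmax fields actually get replaced.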

> The comment:
>
>   /*
>    * If we freeze xmax, make absolutely sure that it's not an XID that
>    * is important.  (Note, a lock-only xmax can be removed independent
>    * of committedness, since a committed lock holder has released the
>    * lock).
>    */
>
> caused me to go down a rabbit hole looking for edge cases where we
> might want to freeze an xmax but not an xmin; e.g. tup.xmax <
> OldestXmin < tup.xmin or the related case where tup.xmax < RecentXmin <
> tup.xmin. I didn't find a problem, so that's good news.

This is an example of what I meant about the heapam.c code using a
cutoff that actually comes from FreezeLimit, when it would be more
sensible to use OldestXmin instead.

> I also tried some pgbench activity along with concurrent vacuums (and
> vacuum freezes) along with periodic verify_heapam(). No problems there.
>
> Did you already describe the testing you've done for 0001+0002
> specfiically? It's not radically new logic, but it would be good to try
> to catch minor state-handling errors.

Lots of stuff with contrib/amcheck, which, as you must already know,
will notice when an XID/MXID is contained in a table whose
relfrozenxid and/or relminmxid indicates that it shouldn't be there.
(Though VACUUM itself does the same thing, albeit not as effectively.)

Obviously the invariants haven't changed here. In many ways it's a
very small set of changes. But in one or two ways it's a significant
shift. It depends on how you think about it.

--
Peter Geoghegan


