Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Date
Msg-id 20220225071422.5fhmaa7rxfazduzr@alap3.anarazel.de
Whole thread Raw
In response to Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
List pgsql-hackers
Hi,

On 2022-02-24 20:53:08 -0800, Peter Geoghegan wrote:
> 0002 makes page-level freezing a first class thing.
> heap_prepare_freeze_tuple now has some (limited) knowledge of how this
> works. heap_prepare_freeze_tuple's cutoff_xid argument is now always
> the VACUUM caller's OldestXmin (not its FreezeLimit, as before). We
> still have to pass FreezeLimit to heap_prepare_freeze_tuple, which
> helps us to respect FreezeLimit as a backstop, and so now it's passed
> via the new backstop_cutoff_xid argument instead.

I am not a fan of the backstop terminology. It's still the reason we need to
do freezing for correctness reasons. It'd make more sense to me to turn it
around and call the "non-backstop" freezing opportunistic freezing or such.


> Whenever we opt to
> "freeze a page", the new page-level algorithm *always* uses the most
> recent possible XID and MXID values (OldestXmin and oldestMxact) to
> decide what XIDs/XMIDs need to be replaced. That might sound like it'd
> be too much, but it only applies to those pages that we actually
> decide to freeze (since page-level characteristics drive everything
> now). FreezeLimit is only one way of triggering that now (and one of
> the least interesting and rarest).

That largely makes sense to me and doesn't seem weird.

I'm a tad concerned about replacing mxids that have some members that are
older than OldestXmin but not older than FreezeLimit. It's not too hard to
imagine that accelerating mxid consumption considerably.  But we can probably,
if not already done, special case that.


> It seems that heap_prepare_freeze_tuple allocates new MXIDs (when
> freezing old ones) in large part so it can NOT freeze XIDs that it
> would have been useful (and much cheaper) to remove anyway.

Well, we may have to allocate a new mxid because some members are older than
FreezeLimit but others are still running. When do we not remove xids that
would have been cheaper to remove once we decide to actually do work?


> On HEAD, FreezeMultiXactId() doesn't get passed down the VACUUM operation's
> OldestXmin at all (it actually just gets FreezeLimit passed as its
> cutoff_xid argument). It cannot possibly recognize any of this for itself.

It does recognize something like OldestXmin in a more precise and expensive
way - MultiXactIdIsRunning() and TransactionIdIsCurrentTransactionId().


> Does that theory about MultiXacts sound plausible? I'm not claiming
> that the patch makes it impossible that FreezeMultiXactId() will have
> to allocate a new MultiXact to freeze during VACUUM -- the
> freeze-the-dead isolation tests already show that that's not true. I
> just think that page-level freezing based on page characteristics with
> oldestXmin and oldestMxact (not FreezeLimit and MultiXactCutoff)
> cutoffs might make it a lot less likely in practice.

Hm. I guess I'll have to look at the code for it. It doesn't immediately
"feel" quite right.


> oldestXmin and oldestMxact map to the same wall clock time, more or less --
> that seems like it might be an important distinction, independent of
> everything else.

Hm. Multis can be kept alive by fairly "young" member xids. So it may not be
removably (without creating a newer multi) until much later than its creation
time. So I don't think that's really true.



> From 483bc8df203f9df058fcb53e7972e3912e223b30 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <pg@bowt.ie>
> Date: Mon, 22 Nov 2021 10:02:30 -0800
> Subject: [PATCH v9 1/4] Loosen coupling between relfrozenxid and freezing.
>
> When VACUUM set relfrozenxid before now, it set it to whatever value was
> used to determine which tuples to freeze -- the FreezeLimit cutoff.
> This approach was very naive: the relfrozenxid invariant only requires
> that new relfrozenxid values be <= the oldest extant XID remaining in
> the table (at the point that the VACUUM operation ends), which in
> general might be much more recent than FreezeLimit.  There is no fixed
> relationship between the amount of physical work performed by VACUUM to
> make it safe to advance relfrozenxid (freezing and pruning), and the
> actual number of XIDs that relfrozenxid can be advanced by (at least in
> principle) as a result.  VACUUM might have to freeze all of the tuples
> from a hundred million heap pages just to enable relfrozenxid to be
> advanced by no more than one or two XIDs.  On the other hand, VACUUM
> might end up doing little or no work, and yet still be capable of
> advancing relfrozenxid by hundreds of millions of XIDs as a result.
>
> VACUUM now sets relfrozenxid (and relminmxid) using the exact oldest
> extant XID (and oldest extant MultiXactId) from the table, including
> XIDs from the table's remaining/unfrozen MultiXacts.  This requires that
> VACUUM carefully track the oldest unfrozen XID/MultiXactId as it goes.
> This optimization doesn't require any changes to the definition of
> relfrozenxid, nor does it require changes to the core design of
> freezing.


> Final relfrozenxid values must still be >= FreezeLimit in an aggressive
> VACUUM (FreezeLimit is still used as an XID-age based backstop there).
> In non-aggressive VACUUMs (where there is still no strict guarantee that
> relfrozenxid will be advanced at all), we now advance relfrozenxid by as
> much as we possibly can.  This exploits workload conditions that make it
> easy to advance relfrozenxid by many more XIDs (for the same amount of
> freezing/pruning work).

Don't we now always advance relfrozenxid as much as we can, particularly also
during aggressive vacuums?



>   * FRM_RETURN_IS_MULTI
>   *        The return value is a new MultiXactId to set as new Xmax.
>   *        (caller must obtain proper infomask bits using GetMultiXactIdHintBits)
> + *
> + * "relfrozenxid_out" is an output value; it's used to maintain target new
> + * relfrozenxid for the relation.  It can be ignored unless "flags" contains
> + * either FRM_NOOP or FRM_RETURN_IS_MULTI, because we only handle multiXacts
> + * here.  This follows the general convention: only track XIDs that will still
> + * be in the table after the ongoing VACUUM finishes.  Note that it's up to
> + * caller to maintain this when the Xid return value is itself an Xid.
> + *
> + * Note that we cannot depend on xmin to maintain relfrozenxid_out.

What does it mean for xmin to maintain something?



> + * See heap_prepare_freeze_tuple for information about the basic rules for the
> + * cutoffs used here.
> + *
> + * Maintains *relfrozenxid_nofreeze_out and *relminmxid_nofreeze_out, which
> + * are the current target relfrozenxid and relminmxid for the relation.  We
> + * assume that caller will never want to freeze its tuple, even when the tuple
> + * "needs freezing" according to our return value.

I don't understand the "will never want to" bit?


> Caller should make temp
> + * copies of global tracking variables before starting to process a page, so
> + * that we can only scribble on copies.  That way caller can just discard the
> + * temp copies if it isn't okay with that assumption.
> + *
> + * Only aggressive VACUUM callers are expected to really care when a tuple
> + * "needs freezing" according to us.  It follows that non-aggressive VACUUMs
> + * can use *relfrozenxid_nofreeze_out and *relminmxid_nofreeze_out in all
> + * cases.

Could it make sense to track can_freeze and need_freeze separately?


> @@ -7158,57 +7256,59 @@ heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
>      if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
>      {
>          MultiXactId multi;
> +        MultiXactMember *members;
> +        int            nmembers;
>
>          multi = HeapTupleHeaderGetRawXmax(tuple);
> -        if (!MultiXactIdIsValid(multi))
> -        {
> -            /* no xmax set, ignore */
> -            ;
> -        }

> -        else if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
> +        if (MultiXactIdIsValid(multi) &&
> +            MultiXactIdPrecedes(multi, *relminmxid_nofreeze_out))
> +            *relminmxid_nofreeze_out = multi;

I may be misreading the diff, but aren't we know continuing to use multi down
below even if !MultiXactIdIsValid()?


> +        if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
>              return true;
> -        else if (MultiXactIdPrecedes(multi, cutoff_multi))
> -            return true;
> -        else
> +        else if (MultiXactIdPrecedes(multi, backstop_cutoff_multi))
> +            needs_freeze = true;
> +
> +        /* need to check whether any member of the mxact is too old */
> +        nmembers = GetMultiXactIdMembers(multi, &members, false,
> +                                         HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask));

Doesn't this mean we unpack the members even if the multi is old enough to
need freezing? Just to then do it again during freezing? Accessing multis
isn't cheap...


> +            if (TransactionIdPrecedes(members[i].xid, backstop_cutoff_xid))
> +                needs_freeze = true;
> +            if (TransactionIdPrecedes(members[i].xid,
> +                                      *relfrozenxid_nofreeze_out))
> +                *relfrozenxid_nofreeze_out = xid;
>          }
> +        if (nmembers > 0)
> +            pfree(members);
>      }
>      else
>      {
>          xid = HeapTupleHeaderGetRawXmax(tuple);
> -        if (TransactionIdIsNormal(xid) &&
> -            TransactionIdPrecedes(xid, cutoff_xid))
> -            return true;
> +        if (TransactionIdIsNormal(xid))
> +        {
> +            if (TransactionIdPrecedes(xid, *relfrozenxid_nofreeze_out))
> +                *relfrozenxid_nofreeze_out = xid;
> +            if (TransactionIdPrecedes(xid, backstop_cutoff_xid))
> +                needs_freeze = true;
> +        }
>      }
>
>      if (tuple->t_infomask & HEAP_MOVED)
>      {
>          xid = HeapTupleHeaderGetXvac(tuple);
> -        if (TransactionIdIsNormal(xid) &&
> -            TransactionIdPrecedes(xid, cutoff_xid))
> -            return true;
> +        if (TransactionIdIsNormal(xid))
> +        {
> +            if (TransactionIdPrecedes(xid, *relfrozenxid_nofreeze_out))
> +                *relfrozenxid_nofreeze_out = xid;
> +            if (TransactionIdPrecedes(xid, backstop_cutoff_xid))
> +                needs_freeze = true;
> +        }
>      }

This stanza is repeated a bunch. Perhaps put it in a small static inline
helper?


>      /* VACUUM operation's cutoff for freezing XIDs and MultiXactIds */
>      TransactionId FreezeLimit;
>      MultiXactId MultiXactCutoff;
> -    /* Are FreezeLimit/MultiXactCutoff still valid? */
> -    bool        freeze_cutoffs_valid;
> +    /* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
> +    TransactionId NewRelfrozenXid;
> +    MultiXactId NewRelminMxid;

Struct member names starting with an upper case look profoundly ugly to
me...  But this isn't the first one, so I guess... :(




> From d10f42a1c091b4dc52670fca80a63fee4e73e20c Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <pg@bowt.ie>
> Date: Mon, 13 Dec 2021 15:00:49 -0800
> Subject: [PATCH v9 2/4] Make page-level characteristics drive freezing.
>
> Teach VACUUM to freeze all of the tuples on a page whenever it notices
> that it would otherwise mark the page all-visible, without also marking
> it all-frozen.  VACUUM typically won't freeze _any_ tuples on the page
> unless _all_ tuples (that remain after pruning) are all-visible.  This
> makes the overhead of vacuuming much more predictable over time.  We
> avoid the need for large balloon payments during aggressive VACUUMs
> (typically anti-wraparound autovacuums).  Freezing is proactive, so
> we're much less likely to get into "freezing debt".


I still suspect this will cause a very substantial increase in WAL traffic in
realistic workloads. It's common to have workloads where tuples are inserted
once, and deleted once/ partition dropped. Freezing all the tuples is a lot
more expensive than just marking the page all visible. It's not uncommon to be
bound by WAL traffic rather than buffer dirtying rate (since the latter may be
ameliorated by s_b and local storage, whereas WAL needs to be
streamed/archived).

This is particularly true because log_heap_visible() doesn't need an FPW if
checkpoints aren't enabled. A small record vs an FPI is a *huge* difference.


I think we'll have to make this less aggressive or tunable. Random ideas for
heuristics:

- Is it likely that freezing would not require an FPI or conversely that
  log_heap_visible() will also need an fpi?  If the page already was recently
  modified / checksums are enabled the WAL overhead of the freezing doesn't
  play much of a role.

- #dead items / #force-frozen items on the page - if we already need to do
  more than just setting all-visible, we can probably afford the WAL traffic.

- relfrozenxid vs max_freeze_age / FreezeLimit. The closer they get, the more
  aggressive we should freeze all-visible pages. Might even make sense to
  start vacuuming an increasing percentage of all-visible pages during
  non-aggressive vacuums, the closer we get to FreezeLimit.

- Keep stats about the age of dead and frozen over time. If all tuples are
  removed within a reasonable fraction of freeze_max_age, there's no point in
  freezing them.


> The new approach to freezing also enables relfrozenxid advancement in
> non-aggressive VACUUMs, which might be enough to avoid aggressive
> VACUUMs altogether (with many individual tables/workloads).  While the
> non-aggressive case continues to skip all-visible (but not all-frozen)
> pages (thereby making relfrozenxid advancement impossible), that in
> itself will no longer hinder relfrozenxid advancement (outside of
> pg_upgrade scenarios).

I don't know how to parse "thereby making relfrozenxid advancement impossible
... will no longer hinder relfrozenxid advancement"?


> We now consistently avoid leaving behind all-visible (not all-frozen) pages.
> This (as well as work from commit 44fa84881f) makes relfrozenxid advancement
> in non-aggressive VACUUMs commonplace.

s/consistently/try to/?


> The system accumulates freezing debt in proportion to the number of
> physical heap pages with unfrozen tuples, more or less.  Anything based
> on XID age is likely to be a poor proxy for the eventual cost of
> freezing (during the inevitable anti-wraparound autovacuum).  At a high
> level, freezing is now treated as one of the costs of storing tuples in
> physical heap pages -- not a cost of transactions that allocate XIDs.
> Although vacuum_freeze_min_age and vacuum_multixact_freeze_min_age still
> influence what we freeze, and when, they effectively become backstops.
> It may still be necessary to "freeze a page" due to the presence of a
> particularly old XID, from before VACUUM's FreezeLimit cutoff, though
> that will be rare in practice -- FreezeLimit is just a backstop now.

I don't really like the "rare in practice" bit. It'll be rare in some
workloads but others will likely be much less affected.



> + * Although this interface is primarily tuple-based, vacuumlazy.c caller
> + * cooperates with us to decide on whether or not to freeze whole pages,
> + * together as a single group.  We prepare for freezing at the level of each
> + * tuple, but the final decision is made for the page as a whole.  All pages
> + * that are frozen within a given VACUUM operation are frozen according to
> + * cutoff_xid and cutoff_multi.  Caller _must_ freeze the whole page when
> + * we've set *force_freeze to true!
> + *
> + * cutoff_xid must be caller's oldest xmin to ensure that any XID older than
> + * it could neither be running nor seen as running by any open transaction.
> + * This ensures that the replacement will not change anyone's idea of the
> + * tuple state.  Similarly, cutoff_multi must be the smallest MultiXactId used
> + * by any open transaction (at the time that the oldest xmin was acquired).

I think this means my concern above about increasing mxid creation rate
substantially may be warranted.


> + * backstop_cutoff_xid must be <= cutoff_xid, and backstop_cutoff_multi must
> + * be <= cutoff_multi.  When any XID/XMID from before these backstop cutoffs
> + * is encountered, we set *force_freeze to true, making caller freeze the page
> + * (freezing-eligible XIDs/XMIDs will be frozen, at least).  "Backstop
> + * freezing" ensures that VACUUM won't allow XIDs/XMIDs to ever get too old.
> + * This shouldn't be necessary very often.  VACUUM should prefer to freeze
> + * when it's cheap (not when it's urgent).

Hm. Does this mean that we might call heap_prepare_freeze_tuple and then
decide not to freeze? Doesn't that mean we might create new multis over and
over, because we don't end up pulling the trigger on freezing the page?


> +
> +            /*
> +             * We allocated a MultiXact for this, so force freezing to avoid
> +             * wasting it
> +             */
> +            *force_freeze = true;

Ah, I guess not. But it'd be nicer if I didn't have to scroll down to the body
of the function to figure it out...



> From d2190abf366f148bae5307442e8a6245c6922e78 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <pg@bowt.ie>
> Date: Mon, 21 Feb 2022 12:46:44 -0800
> Subject: [PATCH v9 3/4] Remove aggressive VACUUM skipping special case.
>
> Since it's simply never okay to miss out on advancing relfrozenxid
> during an aggressive VACUUM (that's the whole point), the aggressive
> case treated any page from a next_unskippable_block-wise skippable block
> range as an all-frozen page (not a merely all-visible page) during
> skipping.  Such a page might not be all-visible/all-frozen at the point
> that it actually gets skipped, but it could nevertheless be safely
> skipped, and then counted in frozenskipped_pages (the page must have
> been all-frozen back when we determined the extent of the range of
> blocks to skip, since aggressive VACUUMs _must_ scan all-visible pages).
> This is necessary to ensure that aggressive VACUUMs are always capable
> of advancing relfrozenxid.

> The non-aggressive case behaved slightly differently: it rechecked the
> visibility map for each page at the point of skipping, and only counted
> pages in frozenskipped_pages when they were still all-frozen at that
> time.  But it skipped the page either way (since we already committed to
> skipping the page at the point of the recheck).  This was correct, but
> sometimes resulted in non-aggressive VACUUMs needlessly wasting an
> opportunity to advance relfrozenxid (when a page was modified in just
> the wrong way, at just the wrong time).  It also resulted in a needless
> recheck of the visibility map for each and every page skipped during
> non-aggressive VACUUMs.
>
> Avoid these problems by conditioning the "skippable page was definitely
> all-frozen when range of skippable pages was first determined" behavior
> on what the visibility map _actually said_ about the range as a whole
> back when we first determined the extent of the range (don't deduce what
> must have happened at that time on the basis of aggressive-ness).  This
> allows us to reliably count skipped pages in frozenskipped_pages when
> they were initially all-frozen.  In particular, when a page's visibility
> map bit is unset after the point where a skippable range of pages is
> initially determined, but before the point where the page is actually
> skipped, non-aggressive VACUUMs now count it in frozenskipped_pages,
> just like aggressive VACUUMs always have [1].  It's not critical for the
> non-aggressive case to get this right, but there is no reason not to.
>
> [1] Actually, it might not work that way when there happens to be a mix
> of all-visible and all-frozen pages in a range of skippable pages.
> There is no chance of VACUUM advancing relfrozenxid in this scenario
> either way, though, so it doesn't matter.

I think this commit message needs a good amount of polishing - it's very
convoluted. It's late and I didn't sleep well, but I've tried to read it
several times without really getting a sense of what this precisely does.




> From 15dec1e572ac4da0540251253c3c219eadf46a83 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <pg@bowt.ie>
> Date: Thu, 24 Feb 2022 17:21:45 -0800
> Subject: [PATCH v9 4/4] Avoid setting a page all-visible but not all-frozen.

To me the commit message body doesn't actually describe what this is doing...


> This is pretty much an addendum to the work in the "Make page-level
> characteristics drive freezing" commit.  It has been broken out like
> this because I'm not even sure if it's necessary.  It seems like we
> might want to be paranoid about losing out on the chance to advance
> relfrozenxid in non-aggressive VACUUMs, though.

> The only test that will trigger this case is the "freeze-the-dead"
> isolation test.  It's incredibly narrow.  On the other hand, why take a
> chance?  All it takes is one heap page that's all-visible (and not also
> all-frozen) nestled between some all-frozen heap pages to lose out on
> relfrozenxid advancement.  The SKIP_PAGES_THRESHOLD stuff won't save us
> then [1].

FWIW, I'd really like to get rid of SKIP_PAGES_THRESHOLD. It often ends up
causing a lot of time doing IO that we never need, completely trashing all CPU
caches, while not actually causing decent readaead IO from what I've seen.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Kyotaro Horiguchi
Date:
Subject: Re: Add checkpoint and redo LSN to LogCheckpointEnd log message
Next
From: Andres Freund
Date:
Subject: Re: [Proposal] Global temporary tables