Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Date
Msg-id CAH2-Wz=sKE-1yDCOTAq0jf9imE4AEh6G=Lvih5RmrZcG0uFBeQ@mail.gmail.com
In response to Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Fri, Feb 18, 2022 at 1:56 PM Robert Haas <robertmhaas@gmail.com> wrote:
> + * Also maintains *NewRelfrozenxid and *NewRelminmxid, which are the current
> + * target relfrozenxid and relminmxid for the relation.  Assumption is that
>
> "maintains" is fuzzy. I think you should be saying something much more
> explicit, and the thing you are saying should make it clear that these
> arguments are input-output arguments: i.e. the caller must set them
> correctly before calling this function, and they will be updated by
> the function.

Makes sense.

> I don't think you have to spell all of that out in every
> place where this comes up in the patch, but it needs to be clear from
> what you do say. For example, I would be happier with a comment that
> said something like "Every call to this function will either set
> HEAP_XMIN_FROZEN in the xl_heap_freeze_tuple struct passed as an
> argument, or else reduce *NewRelfrozenxid to the xmin of the tuple if
> it is currently newer than that. Thus, after a series of calls to this
> function, *NewRelfrozenxid represents a lower bound on unfrozen xmin
> values in the tuples examined. Before calling this function, caller
> should initialize *NewRelfrozenxid to <something>."

We have to worry about XIDs from MultiXacts (and xmax values more
generally). And we have to worry about the case where we start out
with only xmin frozen (by an earlier VACUUM), and then have to freeze
xmax too. I believe that we have to generally consider xmin and xmax
independently. For example, we cannot ignore xmax, just because we
looked at xmin, since in general xmin alone might have already been
frozen.
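
To make the in/out contract concrete, here's a minimal toy sketch of
the per-tuple ratcheting step (this is not the actual heapam.c code --
the helper name, the argument list, and the non-wraparound-aware
comparisons are all simplified for illustration):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint32_t MultiXactId;

/* toy stand-in for TransactionIdPrecedes(); the real one is wraparound-aware */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    return a < b;
}

/*
 * Caller initializes *NewRelfrozenxid/*NewRelminmxid to the most
 * aggressive values it could possibly use (e.g. OldestXmin/OldestMxact),
 * and each per-tuple call can only ratchet them back, never forward.
 * xmin and xmax are handled independently, because either one may
 * already have been frozen by an earlier VACUUM.  (The real code also
 * has to descend into the member XIDs of a MultiXact xmax, and uses
 * MultiXactIdPrecedes rather than a plain < comparison.)
 */
static void
track_unfrozen_ids(TransactionId xmin, bool xmin_frozen,
                   TransactionId xmax, bool xmax_frozen,
                   MultiXactId mxid, bool mxid_frozen,
                   TransactionId *NewRelfrozenxid,
                   MultiXactId *NewRelminmxid)
{
    if (!xmin_frozen && xid_precedes(xmin, *NewRelfrozenxid))
        *NewRelfrozenxid = xmin;
    if (!xmax_frozen && xid_precedes(xmax, *NewRelfrozenxid))
        *NewRelfrozenxid = xmax;
    if (!mxid_frozen && mxid < *NewRelminmxid)
        *NewRelminmxid = mxid;
}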

> + * Also maintains *NewRelfrozenxid and *NewRelminmxid, which are the current
> + * target relfrozenxid and relminmxid for the relation.  Assumption is that
> + * caller will never freeze any of the XIDs from the tuple, even when we say
> + * that they should.  If caller opts to go with our recommendation to freeze,
> + * then it must account for the fact that it shouldn't trust how we've set
> + * NewRelfrozenxid/NewRelminmxid.  (In practice aggressive VACUUMs always take
> + * our recommendation because they must, and non-aggressive VACUUMs always opt
> + * to not freeze, preferring to ratchet back NewRelfrozenxid instead).
>
> I don't understand this one.
>
> +        * (Actually, we maintain NewRelminmxid differently here, because we
> +        * assume that XIDs that should be frozen according to cutoff_xid won't
> +        * be, whereas heap_prepare_freeze_tuple makes the opposite assumption.)
>
> This one either.

The difference between the cleanup lock path (in
lazy_scan_prune/heap_prepare_freeze_tuple) and the share lock path (in
lazy_scan_noprune/heap_tuple_needs_freeze) is what is at issue in both
of these confusing comment blocks, really. Note that cutoff_xid is the
name that both heap_prepare_freeze_tuple and heap_tuple_needs_freeze
have for FreezeLimit (maybe we should rename every occurrence of
cutoff_xid in heapam.c to FreezeLimit).

At a high level, we aren't changing the fundamental definition of an
aggressive VACUUM in any of the patches -- we still need to advance
relfrozenxid up to FreezeLimit in an aggressive VACUUM, just like on
HEAD, today (we may be able to advance it *past* FreezeLimit, but
that's just a bonus). But in a non-aggressive VACUUM, where there is
still no strict requirement to advance relfrozenxid (by any amount),
the code added by 0001 can set relfrozenxid to any known safe value,
which could either be from before FreezeLimit, or after FreezeLimit --
almost anything is possible (provided we respect the relfrozenxid
invariant, and provided we see that we didn't skip any
all-visible-not-all-frozen pages).
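
Expressed as a toy function (reusing the TransactionId typedef from the
sketch above; the name and arguments are invented, not the actual
vacuumlazy.c code), the end-of-VACUUM rule is roughly:

/*
 * relfrozenxid can only be advanced when every all-visible page we
 * skipped was also all-frozen; otherwise unseen, unfrozen XIDs could be
 * older than the NewRelfrozenxid value tracked during the scan.
 */
static TransactionId
choose_new_relfrozenxid(TransactionId old_relfrozenxid,
                        TransactionId NewRelfrozenxid,
                        bool skipped_all_visible_not_all_frozen)
{
    if (skipped_all_visible_not_all_frozen)
        return old_relfrozenxid;    /* cannot advance at all this time */

    /*
     * In an aggressive VACUUM, NewRelfrozenxid >= FreezeLimit by
     * construction; in a non-aggressive VACUUM it can land on either
     * side of FreezeLimit.  Either way it never moves backwards.
     */
    return NewRelfrozenxid;
}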

Since we still need to "respect FreezeLimit" in an aggressive VACUUM,
the aggressive case might need to wait for a full cleanup lock the
hard way, having tried and failed to do it the easy way within
lazy_scan_noprune (lazy_scan_noprune will still return false when any
call to heap_tuple_needs_freeze for any tuple returns true) -- same
as on HEAD, today.
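
The caller's control flow is roughly the following -- a trimmed-down
fragment of the per-block loop, with the lazy_scan_* argument lists
abbreviated for illustration (only the buffer locking calls are shown
with their real shape):

if (!ConditionalLockBufferForCleanup(buf))
{
    LockBuffer(buf, BUFFER_LOCK_SHARE);

    /*
     * The easy way: lazy_scan_noprune only needs a share lock.  It
     * returns false when some tuple would have to be frozen to respect
     * FreezeLimit, which only a cleanup lock (and lazy_scan_prune) can
     * actually do.
     */
    if (lazy_scan_noprune(vacrel, buf, blkno, page))
    {
        UnlockReleaseBuffer(buf);
        continue;               /* processed page without cleanup lock */
    }

    /*
     * Only an aggressive VACUUM gets here: it isn't allowed to leave an
     * unfrozen XID < FreezeLimit behind.  A non-aggressive VACUUM just
     * ratchets NewRelfrozenxid back inside lazy_scan_noprune and
     * returns true instead.
     */
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    LockBufferForCleanup(buf);
}

/* the hard way: cleanup lock held, prune and freeze for real */
lazy_scan_prune(vacrel, buf, blkno, page);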

And so the difference at issue here is: FreezeLimit/cutoff_xid only
needs to affect the NewRelfrozenxid value we use for relfrozenxid in
heap_prepare_freeze_tuple, which is involved in real freezing -- not
in heap_tuple_needs_freeze, whose main purpose is still to help us
avoid freezing where a cleanup lock isn't immediately available.
Within heap_tuple_needs_freeze, the purpose of FreezeLimit/cutoff_xid
is to determine the bool return value, which will only be of interest
to the aggressive case (which might have to get a cleanup lock and do
it the hard way), not to the non-aggressive case (where ratcheting back
NewRelfrozenxid is generally possible, and generally leaves us with
almost as good a value).
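
As a sketch (same toy types and xid_precedes helper as above, and not
the real signature -- only xmin is shown), the share lock path looks
like:

/*
 * Share lock path: the page is left exactly as it is, so every unfrozen
 * XID on it bounds NewRelfrozenxid.  The bool return value ("would
 * FreezeLimit force freezing here?") only matters to aggressive VACUUMs.
 */
static bool
needs_freeze_sketch(TransactionId xmin, TransactionId FreezeLimit,
                    TransactionId *NewRelfrozenxid)
{
    if (xid_precedes(xmin, *NewRelfrozenxid))
        *NewRelfrozenxid = xmin;        /* the page as it actually is */

    return xid_precedes(xmin, FreezeLimit);
}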

In other words: the calls to heap_tuple_needs_freeze made from
lazy_scan_noprune are simply concerned with the page as it actually
is, whereas the similar/corresponding calls to
heap_prepare_freeze_tuple from lazy_scan_prune are concerned with
*what the page will actually become*, after freezing finishes, and
after lazy_scan_prune is done with the page entirely (ultimately
the final NewRelfrozenxid value set in pg_class.relfrozenxid only has
to be <= the oldest extant XID *at the time the VACUUM operation is
just about to end*, not some earlier time, so "being versus becoming"
is an interesting distinction for us).
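
Contrast that with the cleanup lock path (again just a toy sketch with
invented names, only xmin shown):

/*
 * Cleanup lock path: a tuple whose xmin will be frozen before
 * lazy_scan_prune is done with the page must not hold NewRelfrozenxid
 * back, because that XID won't exist anymore by the time this VACUUM
 * ends -- we track what the page will become, not what it is.
 */
static bool
prepare_freeze_sketch(TransactionId xmin, TransactionId FreezeLimit,
                      TransactionId *NewRelfrozenxid)
{
    bool        freeze = xid_precedes(xmin, FreezeLimit);

    if (!freeze && xid_precedes(xmin, *NewRelfrozenxid))
        *NewRelfrozenxid = xmin;        /* only XIDs left unfrozen count */

    return freeze;
}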

Maybe the way that FreezeLimit/cutoff_xid is overloaded can be fixed
here, to make all of this less confusing. I only now fully realized
how confusing all of this stuff is -- very.

> I haven't really grokked exactly what is happening in
> heap_tuple_needs_freeze yet, and may not have time to study it further
> in the near future. Not saying it's wrong, although improving the
> comments above would likely help me out.

Definitely needs more polishing.

> You've used the term "freezing cliff" repeatedly in earlier emails,
> and this is the first time I've been able to understand what you
> meant. I'm glad I do, now.

Ugh. I thought that a snappy term like that would catch on quickly. Guess not!

> But can you describe the algorithm that 0002 uses to accomplish this
> improvement? Like "if it sees that the page meets criteria X, then it
> freezes all tuples on the page, else if it sees that that individual
> tuples on the page meet criteria Y, then it freezes just those." And
> like explain what of that is same/different vs. now.

The mechanics themselves are quite simple (again, understanding the
implications is the hard part). The approach taken within 0002 is
still rough, to be honest, but wouldn't take long to clean up (there
are XXX/FIXME comments about this in 0002).

As a general rule, we try to freeze all of the remaining live tuples
on a page (following pruning) together, as a group, or none at all.
Most of the time this is triggered by our noticing that the page is
about to be set all-visible (but not all-frozen), and doing work
sufficient to mark it fully all-frozen instead. Occasionally there is
FreezeLimit to consider, which is now more of a backstop thing, used
to make sure that we never get too far behind in terms of unfrozen
XIDs. This is useful in part because it avoids leaving behind pages
that would make some future non-aggressive VACUUM fundamentally unable
to advance relfrozenxid (skipping all-visible-but-not-all-frozen pages
is what blocks relfrozenxid advancement; if the VM only ever contains
all-frozen pages, there is nothing like that to skip).
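
A toy model of that page-level decision (the trigger names are
invented, and 0002's actual code is still rough here, as mentioned):

#include <stdbool.h>

/*
 * 0002's policy in miniature: freeze all of the page's remaining live
 * tuples together, or freeze none of them.
 */
static bool
should_freeze_whole_page(bool page_will_become_all_visible,
                         bool page_has_xid_before_freeze_limit)
{
    /*
     * Main trigger: the page is about to be set all-visible anyway, so
     * do the extra work now and mark it all-frozen instead.
     */
    if (page_will_become_all_visible)
        return true;

    /*
     * Backstop trigger: some XID on the page already precedes
     * FreezeLimit, so don't fall further behind on unfrozen XIDs.
     */
    if (page_has_xid_before_freeze_limit)
        return true;

    return false;               /* leave all of the page's tuples unfrozen */
}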

We're generally doing a lot more freezing with 0002, but we still
manage to avoid freezing too much in tables like pgbench_tellers or
pgbench_branches -- tables where it makes the least sense. Such tables
will be updated so frequently that VACUUM is relatively unlikely to
ever mark any page all-visible, implicitly avoiding the main criterion
for freezing. It's also unlikely that they'll ever have an XID old
enough to trigger the fallback FreezeLimit-style criterion for freezing.

In practice, freezing tuples like this is generally not that expensive in
most tables where VACUUM freezes the majority of pages immediately
(tables that aren't like pgbench_tellers or pgbench_branches), because
they're generally big tables, where the overhead of FPIs (full-page
images in WAL) tends to dominate anyway (gambling that we can avoid
more FPIs later on is not a bad gamble, as gambles go). This seems to
make the overhead
acceptable, on balance. Granted, you might be able to poke holes in
that argument, and reasonable people might disagree on what acceptable
should mean. There are many value judgements here, which makes it
complicated. (On the other hand, we might be able to do better if a
particularly bad case for the 0002 work came to light.)

--
Peter Geoghegan


