From: Peter Geoghegan
Subject: Re: New strategies for freezing, advancing relfrozenxid early
Msg-id: CAH2-WznXHb7zJCpo5pNEScX-ATmr33RU2uRhic9+kfq5gX8tAQ@mail.gmail.com
In response to: Re: New strategies for freezing, advancing relfrozenxid early (Jeff Davis <pgsql@j-davis.com>)
List: pgsql-hackers
On Mon, Aug 29, 2022 at 11:47 AM Jeff Davis <pgsql@j-davis.com> wrote:
> Sounds like a good goal, and loosely follows the precedent of
> checkpoint targets and vacuum cost delays.

Right.

> Why is the threshold per-table? Imagine someone who has a bunch of 4GB
> partitions that add up to a huge amount of deferred freezing work.

I think it's possible that our cost model will eventually become very
sophisticated, and weigh all kinds of different factors, and work as
one component of a new framework that dynamically schedules autovacuum
workers. My main goal in posting this v1 was validating the *general
idea* of strategies with cost models, and the related question of how
we might use VM snapshots for that. After all, even the basic concept
is totally novel.

> The initial problem you described is a system-level problem, so it
> seems we should track the overall debt in the system in order to keep
> up.

I agree that the problem is fundamentally a system-level problem. One
reason why vacuum_freeze_strategy_threshold works at the table level
right now is to get the ball rolling. In any case the specifics of how
we trigger each strategy are far from settled. That's not the only
reason why we think about things at the table level in the patch set,
though.
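
To make the current trigger concrete, the rule in v1 boils down to a
simple per-table size test. A minimal sketch (simplified/invented
names, not the patch's actual code):

#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch only: tables whose heap exceeds
 * vacuum_freeze_strategy_threshold use the eager freezing strategy,
 * everything else stays lazy.  Names simplified for illustration.
 */
static bool
use_eager_freeze_strategy(int64_t rel_size_blocks, int64_t threshold_blocks)
{
    return rel_size_blocks >= threshold_blocks;
}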

There *are* some fundamental reasons why we need to care about
individual tables, rather than caring about unfrozen pages at the
system level *exclusively*. This is something that
vacuum_freeze_strategy_threshold kind of gets right already, despite
its limitations. There are 2 aspects of the design that seemingly have
to work at the whole table level:

1. Concentration matters when it comes to wraparound risk.

Fundamentally, each VACUUM still targets exactly one heap rel, and
advances relfrozenxid at most once per VACUUM operation. While the
total number of "unfrozen heap pages" across the whole database is the
single most important metric, it's not *everything*.

As a general rule, there is much less risk in having a certain fixed
number of unfrozen heap pages spread fairly evenly among several
larger tables, compared to the case where the same number of unfrozen
pages are all concentrated in one particular table -- which, right
now, will often be a table that is far larger than any other. Today
the pain is generally felt with large tables only.
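
As a very rough illustration of the kind of signal I mean (invented
helper, not from the patch), think of the largest single table's share
of all unfrozen pages:

#include <stdint.h>

/*
 * Illustration only: given the same total number of unfrozen pages, a
 * higher "concentration" means more of them sit in one table, which
 * must then be covered by a single big VACUUM in order to advance its
 * relfrozenxid.
 */
static double
unfrozen_page_concentration(const int64_t *unfrozen_pages, int ntables)
{
    int64_t total = 0;
    int64_t max = 0;

    for (int i = 0; i < ntables; i++)
    {
        total += unfrozen_pages[i];
        if (unfrozen_pages[i] > max)
            max = unfrozen_pages[i];
    }

    return (total > 0) ? (double) max / (double) total : 0.0;
}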

2. We need to think about things at the table level to manage costs
*over time* holistically. (Closely related to #1.)

The ebb and flow of VACUUM for one particular table is a big part of
the picture here -- and will be significantly affected by table size.
We can probably always afford to risk falling behind on
freezing/relfrozenxid (i.e. we should prefer laziness) if we know that
we'll almost certainly be able to catch up later when things don't
quite work out. That makes small tables much less trouble, even when
there are many more of them (at least up to a point).

As you know, my high level goal is to avoid ever having to make huge
balloon payments to catch up on freezing, which is a much bigger risk
with a large table -- this problem is mostly a per-table problem (both
now and in the future).

A large table will naturally require fewer, larger VACUUM operations
than a small table, no matter what approach is taken with the strategy
stuff. We therefore have fewer VACUUM operations in a given
week/month/year/whatever to spread out the burden -- there will
naturally be fewer opportunities. We want to create the impression
that each autovacuum does approximately the same amount of work (or at
least the same per new heap page for large append-only tables).

It also becomes much more important to only dirty each heap page
during vacuuming ~once with larger tables. With a smaller table, there
is a much higher chance that the pages we modify will already be dirty
from user queries.

> > for this table, at this time: Is it more important to advance
> > relfrozenxid early (be eager), or to skip all-visible pages instead
> > (be lazy)? If it's the former, then we must scan every single page
> > that isn't all-frozen according to the VM snapshot (including every
> > all-visible page).
>
> This feels too absolute, to me. If the goal is to freeze more
> incrementally, well in advance of wraparound limits, then why can't we
> just freeze 1000 out of 10000 freezable pages on this run, and then
> leave the rest for a later run?

My remarks here applied only to the question of relfrozenxid
advancement -- not to freezing. Skipping strategy (relfrozenxid
advancement) is distinct from, though related to, freezing strategy.
So I was making a very narrow statement about
invariants/basic correctness rules -- I wasn't arguing against
alternative approaches to freezing beyond the 2 freezing strategies
(not to be confused with skipping strategies) that appear in v1.
That's all I meant -- there is definitely no point in scanning only a
subset of the table's all-visible pages, as far as relfrozenxid
advancement is concerned (and skipping strategy is fundamentally a
choice about relfrozenxid advancement vs work avoidance, eagerness vs
laziness).
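
Roughly, the invariant could be written down like this (a sketch with
simplified names, not the patch's actual code, though the two bit
values match visibilitymap.h):

#include <stdbool.h>
#include <stdint.h>

#define VM_ALL_VISIBLE  0x01    /* all tuples visible to everyone */
#define VM_ALL_FROZEN   0x02    /* all tuples are also frozen */

/*
 * Sketch of the skipping invariant: under the eager skipping strategy
 * only all-frozen pages may be skipped, because advancing relfrozenxid
 * requires scanning every page that could hold an unfrozen XID.  The
 * lazy strategy may also skip all-visible pages, giving up on
 * relfrozenxid advancement for this VACUUM.
 */
static bool
vacuum_can_skip_page(uint8_t vmbits, bool eager_skipping)
{
    if (vmbits & VM_ALL_FROZEN)
        return true;    /* skippable under either strategy */
    if (!eager_skipping && (vmbits & VM_ALL_VISIBLE))
        return true;    /* lazy only: forfeits relfrozenxid advancement */
    return false;       /* must be scanned */
}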

Maybe you're right that there is room for additional freezing
strategies, besides the two added by v1-0003-*patch. Definitely seems
possible. The freezing strategy concept should be usable as a
framework for adding additional strategies, including (just for
example) a strategy that decides ahead of time to freeze only so many
pages and no more (even though the pages we do freeze may not be very
different from those we leave unfrozen in the current VACUUM).
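
Something like this, perhaps (pure sketch, invented names -- just to
pin down what "only so many pages" might mean):

#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical third freezing strategy, not in the patch: freeze
 * eligible pages only until a per-VACUUM budget chosen up front is
 * exhausted, then fall back to lazy behavior for the rest of the scan.
 */
typedef struct FreezeBudget
{
    int64_t pages_budgeted;     /* decided before the scan begins */
    int64_t pages_frozen;       /* running total for this VACUUM */
} FreezeBudget;

static bool
budget_allows_freezing(FreezeBudget *budget)
{
    if (budget->pages_frozen >= budget->pages_budgeted)
        return false;           /* budget spent: stay lazy from here on */
    budget->pages_frozen++;
    return true;
}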

I'm definitely open to that. It's just a matter of characterizing what
set of workload characteristics this third strategy would solve, how
users might opt in or opt out, etc. Both the eager and the lazy
freezing strategies are based on some notion of what's important for
the table, on its known characteristics, and on what seems likely to
happen to the table in the future (the next VACUUM, at least).
I'm not completely sure how many strategies we'll end up needing.
Though it seems like the eager/lazy trade-off is a really important
part of how these strategies will need to work, in general.

(Thinks some more) I guess that such an alternative freezing strategy
would probably have to affect the skipping strategy too. It's tricky
to tease apart because it breaks the idea that skipping strategy and
freezing strategy are basically distinct questions. That is a factor
that makes it a bit more complicated to discuss. In any case, as I
said, I have an open mind about alternative freezing strategies beyond
the 2 basic lazy/eager freezing strategies from the patch.

> What if we thought about this more like a "background freezer". It
> would keep track of the total number of unfrozen pages in the system,
> and freeze them at some kind of controlled/adaptive rate.

I like the idea of storing metadata in shared memory, and of
scheduling and deprioritizing running autovacuums. The VM snapshot
concept is what makes it possible to slow down or even totally halt a
given autovacuum worker without much consequence.
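
The key property (sketched below with invented names) is that the set
of pages to scan is fixed once the VM snapshot is taken, so a worker's
progress reduces to a single block number that can be saved and picked
up again later:

#include <stdint.h>

/*
 * Sketch only: because a VM snapshot fixes the scan's page set up
 * front, suspending and resuming a worker just means remembering the
 * next block to process.
 */
typedef struct VacuumScanProgress
{
    int64_t next_block;         /* first block not yet processed */
    int64_t nblocks;            /* total blocks, per the VM snapshot */
} VacuumScanProgress;

static void
resume_scan(VacuumScanProgress *prog, void (*process_block)(int64_t))
{
    while (prog->next_block < prog->nblocks)
        process_block(prog->next_block++);  /* can stop at any point */
}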

That said, this seems like future work to me. Worth discussing, but
something I'd like to keep out of scope for the first version of this
that gets committed.

> Regular autovacuum's job would be to keep advancing relfrozenxid for
> all tables and to do other cleanup, and the background freezer's job
> would be to keep the absolute number of unfrozen pages under some
> limit. Conceptually those two jobs seem different to me.

The problem with making it such a sharp distinction is that it can be
very useful to manage costs by making it the job of VACUUM to do both
-- we can avoid dirtying the same page multiple times.

I think that we can accomplish the same thing by giving VACUUM more
freedom to do either more or less work, based on the observed
characteristics of the table, and some sense of how costs will tend to
work over time, across multiple distinct VACUUM operations. In
practice that might end up looking very similar to what you describe.

It seems undesirable for VACUUM to ever be too sure of itself -- the
information that triggers autovacuum may not be particularly reliable,
which can be solved to some degree by making as many decisions as
possible at runtime, dynamically, based on the most authoritative and
recent information. Delaying committing to one particular course of
action isn't always possible, but when it is possible (and not too
expensive) we should do it that way on general principle.

> Also, regarding patch v1-0001-Add-page-level-freezing, do you think
> that narrows the conceptual gap between an all-visible page and an all-
> frozen page?

Yes, definitely. However, I don't think that we can just get rid of
the distinction completely -- though I did think about it for a while.
For one thing, we need to be able to handle cases like the one where
heap_lock_tuple() modifies an all-frozen page: the page should become
merely all-visible, without becoming completely unskippable to every
future VACUUM operation.
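
Concretely (the bit values below match visibilitymap.h; the helper
itself is invented for illustration, not the real code path):

#include <stdint.h>

#define VM_ALL_VISIBLE  0x01    /* all tuples visible to everyone */
#define VM_ALL_FROZEN   0x02    /* all tuples are also frozen */

/*
 * Illustration only: locking a tuple stores a locker XID in its xmax,
 * so a previously all-frozen page stops being all-frozen -- but
 * nothing about visibility changed, so the page can keep its
 * all-visible status and remain skippable to lazy VACUUMs.
 */
static uint8_t
vmbits_after_tuple_lock(uint8_t vmbits)
{
    return (uint8_t) (vmbits & ~VM_ALL_FROZEN);
}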

-- 
Peter Geoghegan


