Re: New strategies for freezing, advancing relfrozenxid early - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: New strategies for freezing, advancing relfrozenxid early
Msg-id CAH2-WzkvBdD13T+O1NFv97Pt-_kL3NxiAA9JZ5-nAFZS2xuZSA@mail.gmail.com
In response to Re: New strategies for freezing, advancing relfrozenxid early  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: New strategies for freezing, advancing relfrozenxid early
List pgsql-hackers
On Mon, Dec 12, 2022 at 3:47 PM Jeff Davis <pgsql@j-davis.com> wrote:
> Freezing is driven by a need to keep the age of the oldest
> transaction ID in a table to less than ~2B; and also the need to
> truncate the clog (and reduce lookups of really old xids). It's fine to
> give a brief explanation about why we can't track very old xids, but
> it's more of an internal detail and not the main point.

I agree that that's the conventional definition. What I am proposing
is that we revise that definition a little. We should start the
discussion of freezing in the user level docs by pointing out that
freezing also plays a role at the level of individual pages. An
all-frozen page is self-contained, now and forever (or until it gets
dirtied again, at least). Even on a standby we will reliably avoid
having to do clog lookups for a page that happens to have all of its
tuples frozen.
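Here is a minimal sketch of the page-level point. It assumes a toy model in which each tuple carries only an xmin and a frozen flag. Tuple, Clog, xmin_committed and page_all_frozen are hypothetical names, not PostgreSQL code, and the real heap visibility routines involve much more state. The sketch only shows why a page whose tuples are all frozen never needs a commit-status lookup.

```python
from dataclasses import dataclass


@dataclass
class Tuple:
    """Hypothetical heap tuple: just the inserting XID and a frozen flag."""
    xmin: int
    frozen: bool = False


class Clog:
    """Hypothetical stand-in for pg_xact that counts status lookups."""

    def __init__(self, committed):
        self.committed = set(committed)
        self.lookups = 0

    def did_commit(self, xid):
        self.lookups += 1
        return xid in self.committed


def xmin_committed(tup, clog):
    # A frozen tuple is known to be committed, so no pg_xact lookup is needed.
    if tup.frozen:
        return True
    return clog.did_commit(tup.xmin)


def page_all_frozen(page):
    # An all-frozen page no longer depends on pg_xact at all.
    return all(t.frozen for t in page)
```

In this model, checking every tuple on an all-frozen page leaves clog.lookups at zero, while each check of an unfrozen tuple costs one lookup.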

I don't want to push back too much here. I just don't think that it
makes terribly much sense for the docs to start the conversation about
freezing by talking about the worst consequences of not freezing for
an extended period of time. That's relevant, and it's probably going
to end up as the aspect of freezing that we spend most time on, but it
still doesn't seem like a useful starting point to me.

To me this seems related to the fallacy that relfrozenxid age is any
kind of indicator about how far behind we are on freezing. I think
that there is value in talking about freezing as a maintenance task
for physical heap pages, and only then talking about relfrozenxid and
the circular XID space. The 64-bit XID patch doesn't get rid of
freezing at all, because freezing is still needed to break the dependency of
tuples stored in heap pages on pg_xact and other SLRUs. That
suggests that you can talk about freezing and advancing relfrozenxid
as different (though still closely related) concepts.
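The circular XID space that relfrozenxid guards against comes from the modulo-2^32 comparison PostgreSQL applies to normal 32-bit XIDs. TransactionIdPrecedes() in transam.c casts the difference of two XIDs to a signed 32-bit integer. The Python below is a sketch of that rule, not the C source. It shows why an unfrozen XID more than about 2^31 transactions old would start to look as if it were in the future. That is the wraparound concern, which is separate from the page-level benefit of freezing.

```python
FIRST_NORMAL_XID = 3  # XIDs 0-2 are reserved special values


def to_int32(value):
    """Reinterpret the low 32 bits of value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value


def xid_precedes(id1, id2):
    """Circular comparison: does id1 logically precede id2?"""
    if id1 < FIRST_NORMAL_XID or id2 < FIRST_NORMAL_XID:
        # Special XIDs compare as plain unsigned integers.
        return id1 < id2
    return to_int32(id1 - id2) < 0
```

With 64-bit XIDs this becomes an ordinary integer comparison, but a tuple still depends on pg_xact until it is frozen.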

> * I'm still having a hard time with vacuum_freeze_strategy_threshold.
> Part of it is the name, which doesn't seem to convey the meaning.

I chose the name long ago, and never gave it terribly much thought.
I'm happy to go with whatever name you prefer.

> But the heuristic also seems off to me. What if you have lots of partitions
> in an append-only range-partitioned table? That would tend to use the
> lazy freezing strategy (because each partition is small), but that's
> not what you want. I understand heuristics aren't perfect, but it feels
> like we could do something better.

It is at least vastly superior to vacuum_freeze_min_age in cases like
this. Not that that's hard -- vacuum_freeze_min_age just doesn't ever
trigger freezing in any autovacuum given a table like pgbench_history
(barring during aggressive mode), due to how it interacts with the
visibility map. So we're practically guaranteed to do literally all
freezing for an append-only table in an aggressive mode VACUUM.
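A toy simulation shows the interaction with the visibility map. This is a hypothetical model, not VACUUM's code. It assumes each page holds tuples from a single inserting XID, XIDs are plain integers without wraparound, every tuple is immediately visible to everyone, and a page is frozen all at once when its XID age reaches freeze_min_age. A non-aggressive VACUUM marks the newest page all-visible while its XID is still too young to freeze, and every later non-aggressive VACUUM then skips that page.

```python
from dataclasses import dataclass


@dataclass
class Page:
    xid: int                 # hypothetical: one inserting XID per page
    all_visible: bool = False
    all_frozen: bool = False


def vacuum(pages, next_xid, freeze_min_age, aggressive):
    """Return how many pages this (simplified) VACUUM froze."""
    frozen = 0
    for p in pages:
        if p.all_frozen:
            continue
        if p.all_visible and not aggressive:
            continue  # non-aggressive VACUUM skips all-visible pages
        if next_xid - p.xid >= freeze_min_age:
            p.all_frozen = True
            frozen += 1
        p.all_visible = True  # append-only: everything is visible to all
    return frozen


def simulate(rounds, xids_per_round, freeze_min_age):
    pages, next_xid, frozen_lazily = [], 0, 0
    for _ in range(rounds):
        pages.append(Page(xid=next_xid))  # append-only: one new page per round
        next_xid += xids_per_round
        frozen_lazily += vacuum(pages, next_xid, freeze_min_age, aggressive=False)
    frozen_aggressively = vacuum(pages, next_xid, freeze_min_age, aggressive=True)
    return frozen_lazily, frozen_aggressively
```

With 20 VACUUMs, 10 XIDs consumed between each, and a freeze_min_age of 50, simulate(20, 10, 50) freezes no pages until the final aggressive VACUUM, which then freezes 16 pages at once.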

Worst of all, that happens on a timeline that has nothing to do with
the physical characteristics of the table itself (like the number of
unfrozen heap pages or something). In fact, it doesn't even have
anything to do with how many distinct XIDs modified that particular
table -- XID age works at the system level.

By working at the heap rel level (which means the partition level if
it's a partitioned table), and by being based on physical units (table
size), vacuum_freeze_strategy_threshold at least manages to limit the
accumulation of unfrozen heap pages in each individual relation. This
is the fundamental unit at which VACUUM operates. So even if you get
very unlucky and accumulate many unfrozen heap pages that happen to be
distributed across many different tables, you can at least vacuum each
table independently, and in parallel. The really big problems all seem
to involve a concentration of unfrozen pages in one particular table
(usually the events table, the largest table in the system by a couple
of orders of magnitude).
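The per-relation decision can be sketched as follows. The sketch assumes a hypothetical threshold expressed in heap pages, and the patch's actual GUC units and internal names may differ. Each heap relation, including each partition of a partitioned table, is judged only on its own size. That is what bounds the unfrozen pages any single lazily vacuumed relation can accumulate.

```python
def choose_freeze_strategy(rel_pages, threshold_pages):
    """Hypothetical rule: eager freezing once a relation reaches the threshold."""
    return "eager" if rel_pages >= threshold_pages else "lazy"


def plan_strategies(relations, threshold_pages):
    """Decide per heap relation; partitions count as separate relations."""
    return {
        name: choose_freeze_strategy(pages, threshold_pages)
        for name, pages in relations.items()
    }


def max_lazy_unfrozen_pages(relations, threshold_pages):
    """Upper bound on unfrozen pages in any single lazily vacuumed relation."""
    lazy = [p for p in relations.values() if p < threshold_pages]
    return max(lazy, default=0)
```

The partition entries in a plan like this also show the case Jeff raised. An append-only range-partitioned table whose partitions each stay below the threshold gets the lazy strategy for every partition.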

That said, I agree that the system-level picture of debt (the system
level view of the number of unfrozen heap pages) is relevant, and that
it isn't directly considered by the patch. I think that that can be
treated as work for a future release. In fact, I think that there is a
great deal that we could teach autovacuum.c about the system level
view of things -- this is only one.

> Also, another purpose of this seems
> to be to achieve v15 behavior (if v16 behavior causes a problem for
> some workload), which seems like a good idea, but perhaps we should
> have a more direct setting for that?

Why, though? I think that it happens to make sense to do both with one
setting. Not because it's better to have 1 setting than 2 (though it
is) -- just because it makes sense here, given these specifics.

> * The comment above lazy_scan_strategy() is phrased in terms of the
> "traditional approach". It would be more clear if you described the
> current strategies and how they're chosen. The pre-16 behavior was as
> lazy as possible, so that's easy enough to describe without referring
> to history.

Agreed. Will fix.

> * "eager skipping behavior" seems like a weird phrasing because it's
> not immediately clear if that means "skip more pages" (eager to skip
> pages and lazy to process them) or "skip fewer pages" (lazy to skip the
> pages and eager to process the pages).

I agree that that's a problem. I'll try to come up with a terminology
that doesn't have this problem ahead of the next version.

> * The skipping behavior for all-visible pages is binary: skip them
> all, or skip none. That makes sense in the context of relfrozenxid
> advancement. But how does that avoid IO spikes? It would seem perfectly
> reasonable to me, if relfrozenxid advancement is not a pressing
> problem, to process some fraction of the all-visible pages (or perhaps
> process enough of them to freeze some fraction).

That's something that v9 will do, unlike earlier versions. So I agree.

In particular, we'll now start freezing eagerly before we switch over
to preferring to advance relfrozenxid for the same table. As I said in
my summary of v9 the other day, we "stagger" the point at which these
two behaviors are first applied, with the goal of smoothing the
transition. We try to disguise the fact that there are still two
different sets of behavior. We try to get the best of both worlds
(eager and lazy behaviors), without the user ever really noticing.
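One way to picture the staggering is two size cutoffs per relation. Eager freezing switches on at the strategy threshold. Eager skipping (scanning all-visible pages so relfrozenxid can advance) switches on only at some multiple of that threshold. The names and the default multiple of 2 are hypothetical and chosen for illustration, and v9's actual cutoffs may be computed differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VacuumPlan:
    eager_freezing: bool  # freeze scanned pages proactively, not just on XID age
    eager_skipping: bool  # scan all-visible pages so relfrozenxid can advance


def plan_vacuum(rel_pages, threshold_pages, skip_multiple=2):
    """Hypothetical staggered cutoffs: freezing turns eager before skipping does."""
    eager_freezing = rel_pages >= threshold_pages
    eager_skipping = rel_pages >= threshold_pages * skip_multiple
    return VacuumPlan(eager_freezing, eager_skipping)
```

A relation between the two cutoffs freezes eagerly on the pages it scans but still skips all-visible pages, so the switch to eager skipping does not coincide with the switch to eager freezing.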

Don't forget that eager behavior with the visibility map is expected
to directly lead to freezing more pages (not a guarantee, but quite
likely). So while skipping strategy and freezing strategy are two
independent things, they're independent in name only, mechanically.
They are not independent things in any practical sense. (The
underlying reason why that is true is of course the same reason why
vacuum_freeze_min_age only really works as designed in aggressive mode
VACUUMs.)

> each VACUUM makes a payment on the deferred costs of freezing. I think
> this has already been discussed but it keeps reappearing in my mind, so
> maybe we can settle this with a comment (and/or docs)?

That said, I believe that we should always advance relfrozenxid in
tables that are already moderately sized -- a table that is already
big enough to be some small multiple of
vacuum_freeze_strategy_threshold should always take an eager approach
to advancing relfrozenxid. That is, I don't think that it makes sense
to pay the cost of freezing down incrementally given a moderately
large table.

Large tables and small tables are qualitatively different things, at
least from a VACUUM point of view. To some degree we can afford to be
wrong about small tables, because that won't cause us any serious
pain. This isn't really true with larger tables -- a VACUUM of a large
table is "too big to fail". Our working assumption for tables that are
still growing now, in the ongoing VACUUM, is that they will continue
to grow.

There is often one very large table, and by the time the next VACUUM
comes around, the table may have accumulated more unfrozen pages than
the entire rest of the database combined (I mean all of the rest of
the database, frozen and unfrozen pages alike). This may even be
common:

https://brandur.org/fragments/events

> * I'm wondering whether vacuum_freeze_min_age makes sense anymore. It
> doesn't take effect unless the page is not skipped, which is confusing
> from a usability standpoint, and we have better heuristics to decide if
> the whole page should be frozen or not anyway (i.e. if an FPI was
> already taken then freezing is cheaper).

I think that vacuum_freeze_min_age still has a role to play. The only
things that can trigger freezing during a VACUUM that opts for the
lazy strategy are the FPI-from-pruning trigger mechanism (new to
v9) and vacuum_freeze_min_age/FreezeLimit. So you cannot really have
a lazy strategy without vacuum_freeze_min_age. The original
vacuum_freeze_min_age design did make sense, at least
pre-visibility-map, because sometimes being lazy about freezing is the
best strategy, especially with small, frequently updated tables like
most of the pgbench tables.
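Under the lazy strategy, the two triggers can be sketched as one page-level predicate. This is a simplified model with hypothetical names. It assumes XIDs are plain integers with no wraparound and that FreezeLimit is oldest_xmin minus vacuum_freeze_min_age. The flag pruning_emitted_fpi stands for whether pruning already wrote a full-page image for this page during the current VACUUM.

```python
def freeze_limit(oldest_xmin, freeze_min_age):
    # Simplified: the real cutoff logic also handles wraparound and clamping.
    return oldest_xmin - freeze_min_age


def lazy_should_freeze(page_xids, oldest_xmin, freeze_min_age, pruning_emitted_fpi):
    """Hypothetical lazy-strategy test: freeze the page or leave it alone?"""
    if pruning_emitted_fpi:
        return True  # the FPI is already paid for, so freezing now is cheap
    limit = freeze_limit(oldest_xmin, freeze_min_age)
    # Otherwise freeze only if some XID on the page is older than FreezeLimit.
    return any(xid < limit for xid in page_xids)
```

Without either trigger, this sketch leaves the page unfrozen, which is the laziness vacuum_freeze_min_age is meant to provide.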

There is nothing inherently wrong with deciding to freeze (or even to
wait for a cleanup lock) on the basis of a given XID's age. My problem
isn't with that behavior in general. It's with the fact that we use it
even when it's clearly inappropriate -- wildly inappropriate. We have
plenty of information that strongly hints at whether or not laziness
is a good idea. It's a good idea whenever laziness has a decent chance
of avoiding completely unnecessary work altogether, provided we can
afford to be wrong about that without having to pay too high a cost
later on, when we have to course correct. What this mostly boils down
to is this: lazy freezing is generally a good idea in small tables
only.

-- 
Peter Geoghegan


