Thread: New strategies for freezing, advancing relfrozenxid early

New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
Attached patch series is a completely overhauled version of earlier
work on freezing. Related work from the Postgres 15 cycle became
commits 0b018fab, f3c15cbe, and 44fa8488.

Recap
=====

The main high level goal of this work is to avoid painful, disruptive
antiwraparound autovacuums (and other aggressive VACUUMs) that do way
too much "catch up" freezing, all at once, causing significant
disruption to production workloads. The patches teach VACUUM to care
about how far behind it is on freezing for each table -- the number of
unfrozen all-visible pages that have accumulated so far is directly
and explicitly kept under control over time. Unfrozen pages can be
seen as debt. There isn't necessarily anything wrong with getting into
debt (getting into debt to a small degree is all but inevitable), but
debt can be dangerous when it isn't managed carefully. Accumulating
large amounts of debt doesn't always end badly, but it does seem to
reliably create the *risk* that things will end badly.

Right now, a standard append-only table could easily do *all* freezing
in aggressive/antiwraparound VACUUM, without any earlier
non-aggressive VACUUM operations triggered by
autovacuum_vacuum_insert_threshold doing any freezing at all (unless
the user goes out of their way to tune vacuum_freeze_min_age). There
is currently no natural limit on the number of unfrozen all-visible
pages that can accumulate -- unless you count age(relfrozenxid), the
triggering condition for antiwraparound autovacuum. But relfrozenxid
age predicts almost nothing about how much freezing is required (or
will be required later on). The overall result is that it often takes
far too long for freezing to finally happen, even when the table
receives plenty of autovacuums (they all could freeze something, but
in practice just don't freeze anything). It's very hard to avoid that
through tuning, because what we really care about is something pretty
closely related to (if not exactly) the number of unfrozen heap pages
in the system. XID age is fundamentally "the wrong unit" here -- the
physical cost of freezing is the most important thing, by far.

In short, the goal of the patch series/project is to make autovacuum
scheduling much more predictable over time. Especially with very large
append-only tables. The patches improve the performance stability of
VACUUM by managing costs holistically, over time. What happens in one
single VACUUM operation is much less important than the behavior of
successive VACUUM operations over time.

What's new: freezing/skipping strategies
========================================

This newly overhauled version introduces the concept of
per-VACUUM-operation strategies, which we decide on once per VACUUM,
at the very start. There are 2 choices to be made at this point (right
after we acquire OldestXmin and similar cutoffs):

1) Do we scan all-visible pages, or do we skip instead? (Added by
second patch, involves a trade-off between eagerness and laziness.)
2) How should we freeze -- eagerly or lazily? (Added by third patch)

The strategy-based approach can be thought of as something that blurs
the distinction between aggressive and non-aggressive VACUUM, giving
VACUUM more freedom to do either more or less work, based on known
costs and benefits. This doesn't completely supersede
aggressive/antiwraparound VACUUMs, but should make them much rarer
with larger tables, where controlling freeze debt actually matters.
There is a need to keep laziness and eagerness in balance here. We try
to get the benefit of lazy behaviors/strategies, but will still course
correct when it doesn't work out.

A new GUC/reloption called vacuum_freeze_strategy_threshold is added
to control freezing strategy (also influences our choice of skipping
strategy). It defaults to 4GB, so tables smaller than that cutoff
(which are usually the majority of all tables) will continue to freeze
in much the same way as today by default. Our current lazy approach to
freezing makes sense there, and should be preserved for its own sake.
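
To make the shape of that decision concrete, here is a minimal
standalone sketch (illustrative only -- the names and the exact
comparison are mine, not the patch's; the real decision is made once
per VACUUM, alongside the skipping strategy):

    #include <stdint.h>
    #include <stdio.h>

    #define BLCKSZ 8192   /* default heap block size */

    /* Hypothetical names, for illustration only */
    typedef enum FreezeStrategy
    {
        FREEZE_STRATEGY_LAZY,   /* freeze roughly as today */
        FREEZE_STRATEGY_EAGER   /* freeze all-visible pages proactively */
    } FreezeStrategy;

    /*
     * Decide the freezing strategy once, up front, by comparing table
     * size against a vacuum_freeze_strategy_threshold-style cutoff
     * (expressed here in bytes).
     */
    static FreezeStrategy
    choose_freeze_strategy(uint64_t rel_pages, uint64_t threshold_bytes)
    {
        uint64_t    table_bytes = rel_pages * BLCKSZ;

        return table_bytes >= threshold_bytes ?
            FREEZE_STRATEGY_EAGER : FREEZE_STRATEGY_LAZY;
    }

    int
    main(void)
    {
        uint64_t    threshold = UINT64_C(4) * 1024 * 1024 * 1024; /* 4GB */

        /* 1GB table (131072 pages) stays lazy, 8GB table goes eager */
        printf("1GB: %d\n", choose_freeze_strategy(131072, threshold));
        printf("8GB: %d\n", choose_freeze_strategy(1048576, threshold));
        return 0;
    }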

Compatibility
=============

Structuring the new freezing behavior as an explicit user-configurable
strategy is also useful as a bridge between the old and new freezing
behaviors. It makes it fairly easy to get the old/current behavior
where that's preferred -- which, I must admit, is something that
wasn't well thought through last time around. The
vacuum_freeze_strategy_threshold GUC is effectively (though not
explicitly) a compatibility option. Users that want something close to
the old/current behavior can use the GUC or reloption to more or less
opt-out of the new freezing behavior, and can do so selectively. The
GUC should be easy for users to understand, too -- it's just a table
size cutoff.

Skipping pages using a snapshot of the visibility map
=====================================================

We now take a copy of the visibility map at the point that VACUUM
begins, and work off of that when skipping, instead of working off of
the mutable/authoritative VM -- this is a visibility map snapshot.
This new infrastructure helps us to decide on a skipping strategy.
Every non-aggressive VACUUM operation now has a choice to make: Which
skipping strategy should it use?  (This was introduced as
item/question #1 a moment ago.)

The decision on skipping strategy is a decision about our priorities
for this table, at this time: Is it more important to advance
relfrozenxid early (be eager), or to skip all-visible pages instead
(be lazy)? If it's the former, then we must scan every single page
that isn't all-frozen according to the VM snapshot (including every
all-visible page). If it's the latter, we'll scan exactly 0
all-visible pages. Either way, once a decision has been made, we don't
leave much to chance -- we commit. ISTM that this is the only approach
that really makes sense. Fundamentally, we advance relfrozenxid a
table at a time, and at most once per VACUUM operation. And for larger
tables it's just impossible as a practical matter to have frequent
VACUUM operations. We ought to be *somewhat* biased in the direction
of advancing relfrozenxid by *some* amount during each VACUUM, even
when relfrozenxid isn't all that old right now.

A strategy (whether for skipping or for freezing) is a big, up-front
decision -- and there are certain kinds of risks that naturally
accompany that approach. The information driving the decision had
better be fairly reliable! By using a VM snapshot, we can choose our
skipping strategy based on precise information about how many *extra*
pages we will have to scan if we go with eager scanning/relfrozenxid
advancement. Concurrent activity cannot change what we scan and what
we skip, either -- everything is locked in from the start. That seems
important to me. It justifies trying to advance relfrozenxid early,
just because the added cost of scanning any all-visible pages happens
to be low.
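
To illustrate the kind of arithmetic this enables (again just a
sketch with made-up names, not the patch's code), counting the
all-visible-but-not-all-frozen pages in the snapshot tells us exactly
how many extra pages eager scanning would cost, before we commit to
anything:

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative stand-in for the two VM bits of one heap page */
    typedef struct VMSnapshotPage
    {
        bool    all_visible;
        bool    all_frozen;
    } VMSnapshotPage;

    /*
     * Given an immutable snapshot of the VM taken when VACUUM started,
     * count how many *extra* pages an eager (relfrozenxid-advancing)
     * scan must visit compared with a lazy scan that skips every
     * all-visible page.  Concurrent activity cannot change the answer.
     */
    uint64_t
    extra_pages_for_eager_scan(const VMSnapshotPage *vmsnap,
                               uint64_t rel_pages)
    {
        uint64_t    extra = 0;

        for (uint64_t blk = 0; blk < rel_pages; blk++)
        {
            if (vmsnap[blk].all_visible && !vmsnap[blk].all_frozen)
                extra++;    /* lazy skips it, eager must scan it */
        }
        return extra;
    }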

This is quite a big shift for VACUUM, at least in some ways. The patch
adds a DETAIL to the "starting vacuuming" INFO message shown by VACUUM
VERBOSE. The VERBOSE output is already supposed to work as a
rudimentary progress indicator (at least when it is run at the
database level), so it now shows the final scanned_pages up-front,
before the physical scan of the heap even begins:

regression=# vacuum verbose tenk1;
INFO:  vacuuming "regression.public.tenk1"
DETAIL:  total table size is 486 pages, 3 pages (0.62% of total) must be scanned
INFO:  finished vacuuming "regression.public.tenk1": index scans: 0
pages: 0 removed, 486 remain, 3 scanned (0.62% of total)
*** SNIP ***
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
VACUUM

I included this VERBOSE tweak in the second patch because it became
natural with VM snapshots, and not because it felt particularly
compelling -- scanned_pages just works like this now (an assertion
verifies that our initial scanned_pages is always an exact match to
what happened during the physical scan, in fact).

There are many things that VM snapshots might also enable that aren't
particularly related to freeze debt. VM snapshotting has the potential
to enable more flexible behavior by VACUUM. I'm thinking of things
like suspend-and-resume for VACUUM/autovacuum, or even autovacuum
scheduling that coordinates autovacuum workers before and during
processing by vacuumlazy.c. Locking-in scanned_pages up-front avoids
the main downside that comes with throttling VACUUM right now: the
fact that simply taking our time during VACUUM will tend to increase
the number of concurrently modified pages that we end up scanning.
These pages are bound to mostly just contain "recently dead" tuples
that the ongoing VACUUM can't do much about anyway -- we could dirty a
lot more heap pages as a result, for little to no benefit.

New patch to avoid allocating MultiXacts
========================================

The fourth and final patch is also new. It corrects an undesirable
consequence of the work done by the earlier patches: it makes VACUUM
avoid allocating new MultiXactIds (unless it's fundamentally
impossible, like in a VACUUM FREEZE). With just the first 3 patches
applied, VACUUM will naively process xmax using a cutoff XID that
comes from OldestXmin (and not FreezeLimit, which is how it works on
HEAD). But with the fourth patch in place VACUUM applies an XID cutoff
of either OldestXmin or FreezeLimit selectively, based on the costs
and benefits for any given xmax.

Just like in lazy_scan_noprune, the low level xmax-freezing code can
pick and choose as it goes, within certain reasonable constraints. We
must accept an older final relfrozenxid/relminmxid value for the rel's
authoritative pg_class tuple as a consequence of avoiding xmax
processing, of course, but that shouldn't matter at all (it's
definitely better than the alternative).

Reducing the WAL space overhead of freezing
===========================================

Not included in this new v1 are other patches that control the
overhead of added freezing -- my focus since joining AWS has been to
get these more strategic patches in shape, and telling the right story
about what I'm trying to do here. I'm going to say a little on the
patches that I have in the pipeline here, though. Getting the
low-level/mechanical overhead of freezing under control will probably
require a few complementary techniques, not just high-level strategies
(though the strategy stuff is the most important piece).

The really interesting omitted-in-v1 patch adds deduplication of
xl_heap_freeze_page WAL records. This reduces the space overhead of
WAL records used to freeze by ~5x in most cases. It works in the
obvious way: we just store the 12 byte freeze plans that appear in
each xl_heap_freeze_page record only once, and then store an array of
item offset numbers for each entry (rather than naively storing a full
12 bytes per tuple frozen per page-level WAL record). This means that
we only need an "extra" ~2 bytes of WAL space per "extra" tuple frozen
(2 bytes for an OffsetNumber) once we decide to freeze something on
the same page. The *marginal* cost can be much lower than it is today,
which makes page-based batching of freezing much more compelling IMV.
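
For illustration, the deduplicated record layout can be pictured
roughly like this (a sketch only -- field names and exact sizes are
mine, standing in for the real xl_heap_freeze_page definition):

    #include <stdint.h>

    typedef uint16_t OffsetNumber;  /* 2 bytes per frozen tuple */

    /*
     * One deduplicated "freeze plan": the ~12 bytes of per-tuple freeze
     * state are stored just once, followed by an array of offset
     * numbers for every tuple on the page that the plan applies to.
     */
    typedef struct FreezePlanGroup
    {
        uint32_t    xmax;           /* replacement xmax value */
        uint16_t    t_infomask2;
        uint16_t    t_infomask;
        uint8_t     frzflags;
        uint16_t    ntuples;        /* how many offsets follow */
        /* OffsetNumber offsets[ntuples] follows in the record */
    } FreezePlanGroup;

For a page where 20 tuples share one plan, that's roughly 12 + 20 * 2
= 52 bytes instead of 20 * 12 = 240 bytes -- about the ~5x reduction
mentioned above.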

Thoughts?
--
Peter Geoghegan

Attachment

Re: New strategies for freezing, advancing relfrozenxid early

From
Jeremy Schneider
Date:
On 8/25/22 2:21 PM, Peter Geoghegan wrote:
> 
> New patch to avoid allocating MultiXacts
> ========================================
> 
> The fourth and final patch is also new. It corrects an undesirable
> consequence of the work done by the earlier patches: it makes VACUUM
> avoid allocating new MultiXactIds (unless it's fundamentally
> impossible, like in a VACUUM FREEZE). With just the first 3 patches
> applied, VACUUM will naively process xmax using a cutoff XID that
> comes from OldestXmin (and not FreezeLimit, which is how it works on
> HEAD). But with the fourth patch in place VACUUM applies an XID cutoff
> of either OldestXmin or FreezeLimit selectively, based on the costs
> and benefits for any given xmax.
> 
> Just like in lazy_scan_noprune, the low level xmax-freezing code can
> pick and choose as it goes, within certain reasonable constraints. We
> must accept an older final relfrozenxid/relminmxid value for the rel's
> authoritative pg_class tuple as a consequence of avoiding xmax
> processing, of course, but that shouldn't matter at all (it's
> definitely better than the alternative).

We should be careful here. IIUC, the current autovac behavior helps
bound the "spread" or range of active multixact IDs in the system, which
directly determines the number of distinct pages that contain those
multixacts. If the proposed change herein causes the spread/range of
MXIDs to significantly increase, then it will increase the number of
blocks and increase the probability of thrashing on the SLRUs for these
data structures. There may be another separate thread or two about
issues with SLRUs already?

-Jeremy


PS. see also
https://www.postgresql.org/message-id/247e3ce4-ae81-d6ad-f54d-7d3e0409a950@ardentperf.com


-- 
Jeremy Schneider
Database Engineer
Amazon Web Services




Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Thu, Aug 25, 2022 at 3:35 PM Jeremy Schneider <schnjere@amazon.com> wrote:
> We should be careful here. IIUC, the current autovac behavior helps
> bound the "spread" or range of active multixact IDs in the system, which
> directly determines the number of distinct pages that contain those
> multixacts. If the proposed change herein causes the spread/range of
> MXIDs to significantly increase, then it will increase the number of
> blocks and increase the probability of thrashing on the SLRUs for these
> data structures.

As a general rule VACUUM will tend to do more eager freezing with the
patch set compared to HEAD, though it should never do less eager
freezing. Not even in corner cases -- never.

With the patch, VACUUM pretty much uses the most aggressive possible
XID-wise/MXID-wise cutoffs in almost all cases (though only when we
actually decide to freeze a page at all, which is now a separate
question). The fourth patch in the patch series introduces a very
limited exception, where we use the same cutoffs that we'll always use
on HEAD (FreezeLimit + MultiXactCutoff) instead of the aggressive
variants (OldestXmin and OldestMxact). This isn't just *any* xmax
containing a MultiXact: it's a Multi that contains *some* XIDs that
*need* to go away during the ongoing VACUUM, and others that *cannot*
go away. Oh, and there usually has to be a need to keep two or more
XIDs for this to happen -- if there is only one XID then we can
usually swap xmax with that XID without any fuss.

> PS. see also
> https://www.postgresql.org/message-id/247e3ce4-ae81-d6ad-f54d-7d3e0409a950@ardentperf.com

I think that the problem you describe here is very real, though I
suspect that it needs to be addressed by making opportunistic cleanup
of Multis happen more reliably. Running VACUUM more often just isn't
practical once a table reaches a certain size. In general, any kind of
processing that is time sensitive probably shouldn't be happening
solely during VACUUM -- it's just too risky. VACUUM might take a
relatively long time to get to the affected page. It might not even be
that long in wall clock time or whatever -- just too long to reliably
avoid the problem.

-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Thu, Aug 25, 2022 at 4:23 PM Peter Geoghegan <pg@bowt.ie> wrote:
> As a general rule VACUUM will tend to do more eager freezing with the
> patch set compared to HEAD, though it should never do less eager
> freezing. Not even in corner cases -- never.

Come to think of it, I don't think that that's quite true. Though the
fourth patch isn't particularly related to the problem.

It *is* true that VACUUM will do at least as much freezing of XID
based tuple header fields as before. That just leaves MXIDs. It's even
true that we will do just as much freezing as before if you go pure on
MultiXact-age. But I'm the one that likes to point out that age is
altogether the wrong approach for stuff like this -- so that won't cut
it.

More concretely, I think that the patch series will fail to do certain
inexpensive eager processing of tuple xmax that happens today,
regardless of what the user has set vacuum_freeze_min_age or
vacuum_multixact_freeze_min_age to. Although we currently only care
about XID age when processing simple XIDs, we already sometimes make
trade-offs similar to the trade-off I propose to make in the fourth
patch for Multis.

In other words, on HEAD, we promise to process any MXID that precedes
MultiXactCutoff inside FreezeMultiXactId(). But we also manage to do
"eager processing of xmax" when it's cheap and easy to do so, without
caring about MultiXactCutoff at all -- this is likely the common case,
even. This preexisting eager processing of Multis is likely important
in many applications.

The problem that I think I've created is that page-level freezing as
implemented in lazy_scan_prune by the third patch doesn't know that it
might be a good idea to execute a subset of freeze plans, in order to
remove a multi from a page right away. It mostly has the right idea by
holding off on freezing until it looks like a good idea at the level
of the whole page, but I think that this is a plausible exception.
That's just because we're much more sensitive to leaving behind a
Multi, and right now the only code path that can remove a Multi that
isn't needed anymore is FreezeMultiXactId().

If xmax was an updater that aborted instead of a multi then we could
rely on hint bits being set by pruning to avoid clog lookups.
Technically nobody has violated their contract here, I think, but it
still seems like it could easily be unacceptable.

I need to come up with my own microbenchmark suite for Multis -- that
was on my TODO list already. I still believe that the fourth patch
addresses Andres' complaint about allocating new Multis during VACUUM.
But it seems like I need to think about the nuances of Multis some
more. In particular, what the performance impact might be of making a
decision on freezing at the page level, in light of the special
performance considerations for Multis.

Maybe it needs to be more granular than that, more often. Or maybe we
can comprehensively solve the problem in some other way entirely.
Maybe pruning should do this instead, in general. Something like that
might put this right, and be independently useful.

-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Jeff Davis
Date:
On Thu, 2022-08-25 at 14:21 -0700, Peter Geoghegan wrote:
> The main high level goal of this work is to avoid painful, disruptive
> antiwraparound autovacuums (and other aggressive VACUUMs) that do way
> too much "catch up" freezing, all at once, causing significant
> disruption to production workloads.

Sounds like a good goal, and loosely follows the precedent of
checkpoint targets and vacuum cost delays.

> A new GUC/reloption called vacuum_freeze_strategy_threshold is added
> to control freezing strategy (also influences our choice of skipping
> strategy). It defaults to 4GB, so tables smaller than that cutoff
> (which are usually the majority of all tables) will continue to
> freeze
> in much the same way as today by default. Our current lazy approach
> to
> freezing makes sense there, and should be preserved for its own sake.

Why is the threshold per-table? Imagine someone who has a bunch of 4GB
partitions that add up to a huge amount of deferred freezing work.

The initial problem you described is a system-level problem, so it
seems we should track the overall debt in the system in order to keep
up.

> for this table, at this time: Is it more important to advance
> relfrozenxid early (be eager), or to skip all-visible pages instead
> (be lazy)? If it's the former, then we must scan every single page
> that isn't all-frozen according to the VM snapshot (including every
> all-visible page).

This feels too absolute, to me. If the goal is to freeze more
incrementally, well in advance of wraparound limits, then why can't we
just freeze 1000 out of 10000 freezable pages on this run, and then
leave the rest for a later run?

> Thoughts?

What if we thought about this more like a "background freezer". It
would keep track of the total number of unfrozen pages in the system,
and freeze them at some kind of controlled/adaptive rate.

Regular autovacuum's job would be to keep advancing relfrozenxid for
all tables and to do other cleanup, and the background freezer's job
would be to keep the absolute number of unfrozen pages under some
limit. Conceptually those two jobs seem different to me.

Also, regarding patch v1-0001-Add-page-level-freezing, do you think
that narrows the conceptual gap between an all-visible page and an all-
frozen page?

Regards,
    Jeff Davis




Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Mon, Aug 29, 2022 at 11:47 AM Jeff Davis <pgsql@j-davis.com> wrote:
> Sounds like a good goal, and loosely follows the precedent of
> checkpoint targets and vacuum cost delays.

Right.

> Why is the threshold per-table? Imagine someone who has a bunch of 4GB
> partitions that add up to a huge amount of deferred freezing work.

I think it's possible that our cost model will eventually become very
sophisticated, and weigh all kinds of different factors, and work as
one component of a new framework that dynamically schedules autovacuum
workers. My main goal in posting this v1 was validating the *general
idea* of strategies with cost models, and the related question of how
we might use VM snapshots for that. After all, even the basic concept
is totally novel.

> The initial problem you described is a system-level problem, so it
> seems we should track the overall debt in the system in order to keep
> up.

I agree that the problem is fundamentally a system-level problem. One
reason why vacuum_freeze_strategy_threshold works at the table level
right now is to get the ball rolling. In any case the specifics of how
we trigger each strategy are far from settled. That's not the only
reason why we think about things at the table level in the patch set,
though.

There *are* some fundamental reasons why we need to care about
individual tables, rather than caring about unfrozen pages at the
system level *exclusively*. This is something that
vacuum_freeze_strategy_threshold kind of gets right already, despite
its limitations. There are 2 aspects of the design that seemingly have
to work at the whole table level:

1. Concentration matters when it comes to wraparound risk.

Fundamentally, each VACUUM still targets exactly one heap rel, and
advances relfrozenxid at most once per VACUUM operation. While the
total number of "unfrozen heap pages" across the whole database is the
single most important metric, it's not *everything*.

As a general rule, there is much less risk in having a certain fixed
number of unfrozen heap pages spread fairly evenly among several
larger tables, compared to the case where the same number of unfrozen
pages are all concentrated in one particular table -- right now it'll
often be one particular table that is far larger than any other table.
Right now the pain is generally felt with large tables only.

2. We need to think about things at the table level to manage costs
*over time* holistically. (Closely related to #1.)

The ebb and flow of VACUUM for one particular table is a big part of
the picture here -- and will be significantly affected by table size.
We can probably always afford to risk falling behind on
freezing/relfrozenxid (i.e. we should prefer laziness) if we know that
we'll almost certainly be able to catch up later when things don't
quite work out. That makes small tables much less trouble, even when
there are many more of them (at least up to a point).

As you know, my high level goal is to avoid ever having to make huge
balloon payments to catch up on freezing, which is a much bigger risk
with a large table -- this problem is mostly a per-table problem (both
now and in the future).

A large table will naturally require fewer, larger VACUUM operations
than a small table, no matter what approach is taken with the strategy
stuff. We therefore have fewer VACUUM operations in a given
week/month/year/whatever to spread out the burden -- there will
naturally be fewer opportunities. We want to create the impression
that each autovacuum does approximately the same amount of work (or at
least the same per new heap page for large append-only tables).

It also becomes much more important to only dirty each heap page
during vacuuming ~once with larger tables. With a smaller table, there
is a much higher chance that the pages we modify will already be dirty
from user queries.

> > for this table, at this time: Is it more important to advance
> > relfrozenxid early (be eager), or to skip all-visible pages instead
> > (be lazy)? If it's the former, then we must scan every single page
> > that isn't all-frozen according to the VM snapshot (including every
> > all-visible page).
>
> This feels too absolute, to me. If the goal is to freeze more
> incrementally, well in advance of wraparound limits, then why can't we
> just freeze 1000 out of 10000 freezable pages on this run, and then
> leave the rest for a later run?

My remarks here applied only to the question of relfrozenxid
advancement -- not to freezing. Skipping strategy (relfrozenxid
advancement) is a distinct though related concept to freezing
strategy. So I was making a very narrow statement about
invariants/basic correctness rules -- I wasn't arguing against
alternative approaches to freezing beyond the 2 freezing strategies
(not to be confused with skipping strategies) that appear in v1.
That's all I meant -- there is definitely no point in scanning only a
subset of the table's all-visible pages, as far as relfrozenxid
advancement is concerned (and skipping strategy is fundamentally a
choice about relfrozenxid advancement vs work avoidance, eagerness vs
laziness).

Maybe you're right that there is room for additional freezing
strategies, besides the two added by v1-0003-*patch. Definitely seems
possible. The freezing strategy concept should be usable as a
framework for adding additional strategies, including (just for
example) a strategy that decides ahead of time to freeze only so many
pages, though not others (without regard for the fact that the pages
that we are freezing may not be very different to those we won't be
freezing in the current VACUUM).

I'm definitely open to that. It's just a matter of characterizing what
set of workload characteristics this third strategy would solve, how
users might opt in or opt out, etc. Both the eager and the lazy
freezing strategies are based on some notion of what's important for
the table, based on its known characteristics, and based on what seems
likely to happen to the table in the future (the next VACUUM, at least).
I'm not completely sure how many strategies we'll end up needing.
Though it seems like the eager/lazy trade-off is a really important
part of how these strategies will need to work, in general.

(Thinks some more) I guess that such an alternative freezing strategy
would probably have to affect the skipping strategy too. It's tricky
to tease apart because it breaks the idea that skipping strategy and
freezing strategy are basically distinct questions. That is a factor
that makes it a bit more complicated to discuss. In any case, as I
said, I have an open mind about alternative freezing strategies beyond
the 2 basic lazy/eager freezing strategies from the patch.

> What if we thought about this more like a "background freezer". It
> would keep track of the total number of unfrozen pages in the system,
> and freeze them at some kind of controlled/adaptive rate.

I like the idea of storing metadata in shared memory. And scheduling
and deprioritizing running autovacuums. Being able to slow down or
even totally halt a given autovacuum worker without much consequence
is enabled by the VM snapshot concept.

That said, this seems like future work to me. Worth discussing, but
trying to keep out of scope in the first version of this that is
committed.

> Regular autovacuum's job would be to keep advancing relfrozenxid for
> all tables and to do other cleanup, and the background freezer's job
> would be to keep the absolute number of unfrozen pages under some
> limit. Conceptually those two jobs seem different to me.

The problem with making it such a sharp distinction is that it can be
very useful to manage costs by making it the job of VACUUM to do both
-- we can avoid dirtying the same page multiple times.

I think that we can accomplish the same thing by giving VACUUM more
freedom to do either more or less work, based on the observed
characteristics of the table, and some sense of how costs will tend to
work over time, across multiple distinct VACUUM operations. In
practice that might end up looking very similar to what you describe.

It seems undesirable for VACUUM to ever be too sure of itself -- the
information that triggers autovacuum may not be particularly reliable,
which can be solved to some degree by making as many decisions as
possible at runtime, dynamically, based on the most authoritative and
recent information. Delaying committing to one particular course of
action isn't always possible, but when it is possible (and not too
expensive) we should do it that way on general principle.

> Also, regarding patch v1-0001-Add-page-level-freezing, do you think
> that narrows the conceptual gap between an all-visible page and an all-
> frozen page?

Yes, definitely. However, I don't think that we can just get rid of
the distinction completely -- though I did think about it for a while.
For one thing we need to be able to handle cases like the one where
heap_lock_tuple() modifies an all-frozen page, leaving it merely
all-visible, without that making the page completely unskippable to
every future VACUUM operation.

-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Jeff Davis
Date:
On Thu, 2022-08-25 at 14:21 -0700, Peter Geoghegan wrote:
> Attached patch series is a completely overhauled version of earlier
> work on freezing. Related work from the Postgres 15 cycle became
> commits 0b018fab, f3c15cbe, and 44fa8488.
>
> Recap
> =====
>
> The main high level goal of this work is to avoid painful, disruptive
> antiwraparound autovacuums (and other aggressive VACUUMs) that do way
> too much "catch up" freezing, all at once

I agree with the motivation: that keeping around a lot of deferred work
(unfrozen pages) is risky, and that administrators would want a way to
control that risk.

The solution involves more changes to the philosophy and mechanics of
vacuum than I would expect, though. For instance, VM snapshotting,
page-level-freezing, and a cost model all might make sense, but I don't
see why they are critical for solving the problem above. I think I'm
still missing something. My mental model is closer to the bgwriter and
checkpoint_completion_target.

Allow me to make a naive counter-proposal (not a real proposal, just so
I can better understand the contrast with your proposal):

  * introduce a reloption unfrozen_pages_target (default -1, meaning
infinity, which is the current behavior)
  * introduce two fields to LVRelState: n_pages_frozen and
delay_skip_count, both initialized to zero
  * when marking a page frozen: n_pages_frozen++
  * when vacuum begins:
      if (unfrozen_pages_target >= 0 &&
          current_unfrozen_page_count > unfrozen_pages_target)
      {
        vacrel->delay_skip_count = current_unfrozen_page_count -
          unfrozen_pages_target;
        /* ?also use more aggressive freezing thresholds? */
      }
  * in lazy_scan_skip(), have a final check:
      if (vacrel->n_pages_frozen < vacrel->delay_skip_count)
      {
         break;
      }

I know there would still be some problem cases, but to me it seems like
we solve 80% of the problem in a couple dozen lines of code.

a. Can you clarify some of the problem cases, and why it's worth
spending more code to fix them?

b. How much of your effort is groundwork for related future
improvements? If it's a substantial part, can you explain in that
larger context?

c. Can some of your patches be separated into independent discussions?
For instance, patch 1 has been discussed in other threads and seems
independently useful, and I don't see the current work as dependent on
it. Patch 4 also seems largely independent.

d. Can you help give me a sense of scale of the problems solved by
visibilitymap snapshots and the cost model? Do those need to be in v1?

Regards,
    Jeff Davis




Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Tue, Aug 30, 2022 at 11:11 AM Jeff Davis <pgsql@j-davis.com> wrote:
> The solution involves more changes to the philosophy and mechanics of
> vacuum than I would expect, though. For instance, VM snapshotting,
> page-level-freezing, and a cost model all might make sense, but I don't
> see why they are critical for solving the problem above.

I certainly wouldn't say that they're critical. I tend to doubt that I
can be perfectly crisp about what the exact relationship is between
each component in isolation and how it contributes towards addressing
the problems we're concerned with.

> I think I'm
> still missing something. My mental model is closer to the bgwriter and
> checkpoint_completion_target.

That's not a bad starting point. The main thing that that mental model
is missing is how the timeframes work with VACUUM, and the fact that
there are multiple timeframes involved (maybe the system's vacuuming
work could be seen as having one timeframe at the highest level, but
it's more of a fractal picture overall). Checkpoints just don't take
that long, and checkpoint duration has a fairly low variance (barring
pathological performance problems).

You only have so many buffers that you can dirty, too -- it's a
self-limiting process. This is even true when (for whatever reason)
the checkpoint_completion_target logic just doesn't do what it's
supposed to do. There is more or less a natural floor on how bad
things can get, so you don't have to invent a synthetic floor at all.
LSM-based DB systems like the MyRocks storage engine for MySQL don't
use checkpoints at all -- the closest analog is compaction, which is
closer to a hybrid of VACUUM and checkpointing than anything else.

The LSM compaction model necessitates adding artificial throttling to
keep the system stable over time [1]. There is a disconnect between
the initial ingest of data, and the compaction process. And so
top-down modelling of costs and benefits with compaction is more
natural with an LSM [2] -- and not a million miles from the strategy
stuff I'm proposing.

> Allow me to make a naive counter-proposal (not a real proposal, just so
> I can better understand the contrast with your proposal):

> I know there would still be some problem cases, but to me it seems like
> we solve 80% of the problem in a couple dozen lines of code.

It's not that this statement is wrong, exactly. It's that I believe
that it is all but mandatory for me to ameliorate the downside that
goes with more eager freezing, for example by not doing it at all when
it doesn't seem to make sense. I want to solve the big problem of
freeze debt, without creating any new problems. And if I should also
make things in adjacent areas better too, so much the better.

Why stop at a couple dozen lines of code? Why not just change
the default of vacuum_freeze_min_age and
vacuum_multixact_freeze_min_age to 0?

> a. Can you clarify some of the problem cases, and why it's worth
> spending more code to fix them?

For one thing if we're going to do a lot of extra freezing, we really
want to "get credit" for it afterwards, by updating relfrozenxid to
reflect the new oldest extant XID, and so avoid getting an
antiwraparound VACUUM early, in the near future.

That isn't strictly true, of course. But I think that we at least
ought to have a strong bias in the direction of updating relfrozenxid,
having decided to do significantly more freezing in some particular
VACUUM operation.

> b. How much of your effort is groundwork for related future
> improvements? If it's a substantial part, can you explain in that
> larger context?

Hard to say. It's true that the idea of VM snapshots is quite general,
and could have been introduced in a number of different ways. But I
don't think that that should count against it. It's also not something
that seems contrived or artificial -- it's at least as good of a
reason to add VM snapshots as any other I can think of.

Does it really matter if this project is the freeze debt project, or
the VM snapshot project? Do we even need to decide which one it is
right now?

> c. Can some of your patches be separated into independent discussions?
> For instance, patch 1 has been discussed in other threads and seems
> independently useful, and I don't see the current work as dependent on
> it.

I simply don't know if I can usefully split it up just yet.

> Patch 4 also seems largerly independent.

Patch 4 directly compensates for a problem created by the earlier
patches. The patch series as a whole isn't supposed to ameliorate the
problem of MultiXacts being allocated in VACUUM. It only needs to
avoid making the situation any worse than it is today IMV (I suspect
that the real fix is to make the VACUUM FREEZE command not tune
vacuum_freeze_min_age).

> d. Can you help give me a sense of scale of the problems solved by
> visibilitymap snapshots and the cost model? Do those need to be in v1?

I'm not sure. I think that having certainty that we'll be able to scan
only so many pages up-front is very broadly useful, though. Plus it
removes the SKIP_PAGES_THRESHOLD stuff, which was intended to enable
relfrozenxid advancement in non-aggressive VACUUMs, but does so in a
way that results in scanning many more pages needlessly. See commit
bf136cf6, which added the SKIP_PAGES_THRESHOLD stuff back in 2009,
shortly after the visibility map first appeared.

Since relfrozenxid advancement fundamentally works at the table level,
it seems natural to make it a top-down, VACUUM-level thing -- even
within non-aggressive VACUUMs (I guess it already meets that
description in aggressive VACUUMs). And since we really want to
advance relfrozenxid when we do extra freezing (for the reasons I just
went into), it seems natural to me to view it as one problem. I accept
that it's not clear cut, though.

[1] https://docs.google.com/presentation/d/1WgP-SlKay5AnSoVDSvOIzmu7edMmtYhdywoa0oAR4JQ/edit?usp=sharing
[2] https://disc-projects.bu.edu/compactionary/research.html
-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Tue, Aug 30, 2022 at 1:45 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > d. Can you help give me a sense of scale of the problems solved by
> > visibilitymap snapshots and the cost model? Do those need to be in v1?
>
> I'm not sure. I think that having certainty that we'll be able to scan
> only so many pages up-front is very broadly useful, though. Plus it
> removes the SKIP_PAGES_THRESHOLD stuff, which was intended to enable
> relfrozenxid advancement in non-aggressive VACUUMs, but does so in a
> way that results in scanning many more pages needlessly. See commit
> bf136cf6, which added the SKIP_PAGES_THRESHOLD stuff back in 2009,
> shortly after the visibility map first appeared.

Here is a better example:

Right now the second patch adds both VM snapshots and the skipping
strategy stuff. The VM snapshot is used in the second patch, as a
source of reliable information about how we need to process the table,
in terms of the total number of scanned_pages -- which drives our
choice of strategy. Importantly, we can assess the question of which
skipping strategy to take (in non-aggressive VACUUM) based on 100%
accurate information about how many *extra* pages we'll have to scan
in the event of being eager (i.e. in the event that we prioritize
early relfrozenxid advancement over skipping some pages). Importantly,
that cannot change later on, since VM snapshots are immutable --
everything is locked in. That already seems quite valuable to me.

This general concept could be pushed a lot further without great
difficulty. Since VM snapshots are immutable, it should be relatively
easy to have the implementation make its final decision on skipping
only *after* lazy_scan_heap() returns. We could allow VACUUM to
"change its mind about skipping" in cases where it initially thought
that skipping was the best strategy, only to discover much later on
that that was the wrong choice after all.

A huge amount of new, reliable information will come to light from
scanning the heap rel. In particular, the current value of
vacrel->NewRelfrozenXid seems like it would be particularly
interesting when the time came to consider if a second scan made sense
-- if NewRelfrozenXid is a recent-ish value already, then that argues
for finishing off the all-visible pages in a second heap pass, with
the aim of setting relfrozenxid to a similarly recent value when it
happens to be cheap to do so.

The actual process of scanning precisely those all-visible pages that
were skipped the first time around during a second call to
lazy_scan_heap() can be implemented in the obvious way: by teaching
the VM snapshot infrastructure/lazy_scan_skip() so that pages that
were skipped the first time around get scanned during the second
pass over the heap instead. Also, those pages that were scanned the
first time around can/must be skipped on our second pass (excluding
all-frozen pages, which won't be scanned in either heap pass).

I've used the term "second heap pass" here, but that term is slightly
misleading. The final outcome of this whole process is that every heap
page that the vmsnap says VACUUM will need to scan in order for it to
be able to safely advance relfrozenxid will be scanned, precisely
once. The overall order that the heap pages are scanned in will of
course differ from the simple case, but I don't think that it makes
very much difference. In reality there will have only been one heap
pass, consisting of two distinct phases. No individual heap page will
ever be considered for pruning/freezing more than once, no matter
what. This is just a case of *reordering* work. Immutability makes
reordering work easy in general.

--
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Jeff Davis
Date:
On Tue, 2022-08-30 at 18:50 -0700, Peter Geoghegan wrote:
> Since VM snapshots are immutable, it should be relatively
> easy to have the implementation make its final decision on skipping
> only *after* lazy_scan_heap() returns.

I like this idea.

Regards,
    Jeff Davis




Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Tue, Aug 30, 2022 at 9:37 PM Jeff Davis <pgsql@j-davis.com> wrote:
> On Tue, 2022-08-30 at 18:50 -0700, Peter Geoghegan wrote:
> > Since VM snapshots are immutable, it should be relatively
> > easy to have the implementation make its final decision on skipping
> > only *after* lazy_scan_heap() returns.
>
> I like this idea.

I was hoping that you would. I imagine that this idea (with minor
variations) could enable an approach that's much closer to what you
were thinking of: one that mostly focuses on controlling the number of
unfrozen pages, and not so much on advancing relfrozenxid early, just
because we can and we might not get another chance for a long time. In
other words your idea of a design that can freeze more during a
non-aggressive VACUUM, while still potentially skipping all-visible
pages.

I said earlier on that we ought to at least have a strong bias in the
direction of advancing relfrozenxid in larger tables, especially when
we decide to freeze whole pages more eagerly -- we only get one chance
to advance relfrozenxid per VACUUM, and those opportunities will
naturally be few and far between. We cannot really justify all this
extra freezing if it doesn't prevent antiwraparound autovacuums. That
was more or less my objection to going in that direction.

But if we can more or less double the number of opportunities to at
least ask the question "is now a good time to advance relfrozenxid?"
without really paying much for keeping this option open (and I think
that we can), my concern about relfrozenxid advancement becomes far
less important. Just being able to ask that question is significantly
less rare and precious. Plus we'll probably be able to make
significantly better decisions about relfrozenxid overall with the
"second phase because I changed my mind about skipping" concept in
place.

-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Jeff Davis
Date:
On Tue, 2022-08-30 at 13:45 -0700, Peter Geoghegan wrote:
> It's that I believe
> that it is all but mandatory for me to ameliorate the downside that
> goes with more eager freezing, for example by not doing it at all
> when
> it doesn't seem to make sense. I want to solve the big problem of
> freeze debt, without creating any new problems. And if I should also
> make things in adjacent areas better too, so much the better.

That clarifies your point. It's still a challenge for me to reason
about which of these potential new problems really need to be solved in
v1, though.

> Why stop at a couple of dozens of lines of code? Why not just change
> the default of vacuum_freeze_min_age and
> vacuum_multixact_freeze_min_age to 0?

I don't think that would actually solve the unbounded buildup of
unfrozen pages. It would still be possible for pages to be marked all
visible before being frozen, and then end up being skipped until an
aggressive vacuum is forced, right?

Or did you mean vacuum_freeze_table_age?

Regards,
    Jeff Davis




Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Tue, Aug 30, 2022 at 11:28 PM Jeff Davis <pgsql@j-davis.com> wrote:
> That clarifies your point. It's still a challenge for me to reason
> about which of these potential new problems really need to be solved in
> v1, though.

I don't claim to understand it that well myself -- not just yet.
I feel like I have the right general idea, but the specifics
aren't all there (which is very often the case for me at this
point in the cycle). That seems like a good basis for further
discussion.

It's going to be quite a few months before some version of this
patchset is committed, at the very earliest. Obviously these are
questions that need answers, but the process of getting to those
answers is a significant part of the work itself IMV.

> > Why stop at a couple of dozens of lines of code? Why not just change
> > the default of vacuum_freeze_min_age and
> > vacuum_multixact_freeze_min_age to 0?
>
> I don't think that would actually solve the unbounded buildup of
> unfrozen pages. It would still be possible for pages to be marked all
> visible before being frozen, and then end up being skipped until an
> aggressive vacuum is forced, right?

With the 15 work in place, and with the insert-driven autovacuum
behavior from 13, it is likely that this would be enough to avoid all
antiwraparound vacuums in an append-only table. There is still one
case where we can throw away the opportunity to advance relfrozenxid
during non-aggressive VACUUMs for no good reason -- I didn't fix
every such case just yet. But the remaining case (which is in
lazy_scan_skip()) is
very narrow.

With vacuum_freeze_min_age = 0 and vacuum_multixact_freeze_min_age =
0, any page that is eligible to be set all-visible is also eligible to
have its tuples frozen and be set all-frozen instead, immediately.
When it isn't then we'll scan it in the next VACUUM anyway.

Actually I'm also ignoring some subtleties with Multis that could make
this not quite happen, but again, that's only a super obscure corner case.
The idea that just setting vacuum_freeze_min_age = 0 and
vacuum_multixact_freeze_min_age = 0 will be enough is definitely true
in spirit. You don't need to touch vacuum_freeze_table_age (if you did
then you'd get aggressive VACUUMs, and one goal here is to avoid
those whenever possible -- especially aggressive antiwraparound
autovacuums).

--
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Thu, Aug 25, 2022 at 2:21 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached patch series is a completely overhauled version of earlier
> work on freezing. Related work from the Postgres 15 cycle became
> commits 0b018fab, f3c15cbe, and 44fa8488.

Attached is v2.

This is just to keep CFTester happy, since v1 now has conflicts when
applied against HEAD. There are no notable changes in this v2 compared
to v1.

-- 
Peter Geoghegan

Attachment

Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Wed, Aug 31, 2022 at 12:03 AM Peter Geoghegan <pg@bowt.ie> wrote:
> Actually I'm also ignoring some subtleties with Multis that could make
> this not quite happen, but again, that's only a super obscure corner case.
> The idea that just setting vacuum_freeze_min_age = 0 and
> vacuum_multixact_freeze_min_age = 0 will be enough is definitely true
> in spirit. You don't need to touch vacuum_freeze_table_age (if you did
> then you'd get aggressive VACUUMs, and one goal here is to avoid
> those whenever possible -- especially aggressive antiwraparound
> autovacuums).

Attached is v3. There is a new patch included here -- v3-0004-*patch,
or "Unify aggressive VACUUM with antiwraparound VACUUM". No other
notable changes.

I decided to work on this now because it seems like it might give a
more complete picture of the high level direction that I'm pushing
towards. Perhaps this will make it easier to review the patch series
as a whole, even. The new patch unifies the concept of antiwraparound
VACUUM with the concept of aggressive VACUUM. Now there is only
antiwraparound and regular VACUUM (uh, barring VACUUM FULL). And now
antiwraparound VACUUMs are not limited to antiwraparound autovacuums
-- a manual VACUUM can also be antiwraparound (that's just the new
name for "aggressive").

We will typically only get antiwraparound vacuuming in a regular
VACUUM when the user goes out of their way to get that behavior.
VACUUM FREEZE is the best example. For the most part the
skipping/freezing strategy stuff has a good sense of what matters
already, and shouldn't need to be guided very often.

The patch relegates vacuum_freeze_table_age to a compatibility option,
making its default -1, meaning "just use autovacuum_freeze_max_age". I
always thought that having two table age based GUCs was confusing.
There was a period between 2006 and 2009 when we had
autovacuum_freeze_max_age, but did not yet have
vacuum_freeze_table_age. This change can almost be thought of as a
return to the simpler user interface that existed at that time. Of
course we must not resurrect the problems that vacuum_freeze_table_age
was intended to address (see originating commit 65878185) by mistake.
We need an improved version of the same basic concept, too.

The patch more or less replaces the table-age-aggressive-escalation
concept (previously implemented using vacuum_freeze_table_age) with
new logic that makes vacuumlazy.c's choice of skipping strategy *also*
depend upon table age -- it is now one more factor to be considered.
Both costs and benefits are weighed here. We now give just a little
weight to table age at a relatively early stage (XID-age-wise early),
and escalate from there. As the table's relfrozenxid gets older and
older, we give less and less weight to putting off the cost of
freezing. This general approach is possible because the false
dichotomy that is "aggressive vs non-aggressive" has mostly been
eliminated. This makes things less confusing for users and hackers.

The details of the skipping-strategy-choice algorithm are still
unsettled in v3 (no real change there). ISTM that the important thing
is still the high level concepts. Jeff was slightly puzzled by the
emphasis placed on the cost model/strategy stuff, at least at one
point. Hopefully my intent will be made clearer by the ideas featured
in the new patch. The skipping strategy decision making process isn't
particularly complicated, but it now looks more like an optimization
problem of some kind or other.

It might make sense to go further in the same direction by making
"regular vs aggressive/antiwraparound" into a *strict* continuum. In
other words, it might make sense to get rid of the two remaining cases
where VACUUM conditions its behavior on whether this VACUUM operation
is antiwraparound/aggressive or not. I'm referring to the cleanup lock
skipping behavior around lazy_scan_noprune(), as well as the
PROC_VACUUM_FOR_WRAPAROUND no-auto-cancellation behavior enforced in
autovacuum workers. We will still need to keep roughly the same two
behaviors, but the timelines can be totally different. We must be
reasonably sure that the cure won't be worse than the disease -- I'm
aware of quite a few individual cases that felt that way [1].
Aggressive interventions can make sense, but they need to be
proportionate to the problem that's right in front of us. "Kicking the
can down the road" is often the safest and most responsible approach
-- it all depends on the details.

[1] https://www.tritondatacenter.com/blog/manta-postmortem-7-27-2015
is the most high profile example, but I have personally been called in
to deal with similar problems in the past
--
Peter Geoghegan

Attachment

Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Thu, Sep 8, 2022 at 1:23 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached is v3. There is a new patch included here -- v3-0004-*patch,
> or "Unify aggressive VACUUM with antiwraparound VACUUM". No other
> notable changes.
>
> I decided to work on this now because it seems like it might give a
> more complete picture of the high level direction that I'm pushing
> towards. Perhaps this will make it easier to review the patch series
> as a whole, even.

This needed to be rebased over the guc.c work recently pushed to HEAD.

Attached is v4. This isn't just to fix bitrot, though; I'm also
including one new patch -- v4-0006-*.patch. This small patch teaches
VACUUM to size dead_items while capping the allocation at the space
required for "scanned_pages * MaxHeapTuplesPerPage" item pointers. In
other words, we now use scanned_pages instead of rel_pages to cap the
size of dead_items, potentially saving quite a lot of memory. There is
no possible downside to this approach, because we already know exactly
how many pages will be scanned from the VM snapshot -- there is zero
added risk of a second pass over the indexes.

This is still only scratching the surface of what is possible with
dead_items. The visibility map snapshot concept can enable a far more
sophisticated approach to resource management in vacuumlazy.c. It
could help us to replace a simple array of item pointers (the current
dead_items array) with a faster and more space-efficient data
structure. Masahiko Sawada has done a lot of work on this recently, so
this may interest him.

We don't just have up-front knowledge of the total number of
scanned_pages with VM snapshots -- we also have up-front knowledge of
which specific pages will be scanned. So we have reliable information
about the final distribution of dead_items (which specific heap blocks
might have dead_items) right from the start. While this extra
information/context is not a totally complete picture, it still seems
like it could be very useful as a way of driving how some new
dead_items data structure compresses TIDs. That will depend on the
distribution of TIDs -- the final "heap TID key space".

VM snapshots could also make it practical for the new data structure
to spill to disk to avoid multiple index scans/passes by VACUUM.
Perhaps this will result in behavior that's similar to how hash joins
spill to disk -- having 90% of the memory required to do everything
in-memory *usually* has similar performance characteristics to just
doing everything in memory. Most individual TID lookups from
ambulkdelete() will find that the TID *doesn't* need to be deleted --
a little like a hash join with low join selectivity (the common case
for hash joins). It's not like a merge join + sort, where we must
either spill everything or nothing (a merge join can be better than a
hash join with high join selectivity).

-- 
Peter Geoghegan

Attachment

Re: New strategies for freezing, advancing relfrozenxid early

From
John Naylor
Date:
On Wed, Sep 14, 2022 at 12:53 AM Peter Geoghegan <pg@bowt.ie> wrote:

> This is still only scratching the surface of what is possible with
> dead_items. The visibility map snapshot concept can enable a far more
> sophisticated approach to resource management in vacuumlazy.c. It
> could help us to replace a simple array of item pointers (the current
> dead_items array) with a faster and more space-efficient data
> structure. Masahiko Sawada has done a lot of work on this recently, so
> this may interest him.

I don't quite see how it helps "enable" that. It'd be more logical to
me to say the VM snapshot *requires* you to think harder about
resource management, since a palloc'd snapshot should surely be
counted as part of the configured memory cap that admins control.
(Commonly, it'll be less than a few dozen MB, so I'll leave that
aside.) Since Masahiko hasn't (to my knowledge) gone as far as
integrating his ideas into vacuum, I'm not sure if the current state
of affairs has some snag that a snapshot will ease, but if there is,
you haven't described what it is.

I do remember your foreshadowing in the radix tree thread a while
back, and I do think it's an intriguing idea to combine pages-to-scan
and dead TIDs in the same data structure. The devil is in the details,
of course. It's worth looking into.

> VM snapshots could also make it practical for the new data structure
> to spill to disk to avoid multiple index scans/passed by VACUUM.

I'm not sure spilling to disk is solving the right problem (as opposed
to the hash join case, or to the proposed conveyor belt system which
has a broader aim). I've found several times that a customer will ask
if raising maintenance work mem from 1GB to 10GB will make vacuum
faster. Looking at the count of index scans, it's pretty much always
"1", so even if the current approach could scale above 1GB, "no" it
wouldn't help to raise that limit.

Your mileage may vary, of course.

Continuing my customer example, searching the dead TID list faster
*will* make vacuum faster. The proposed tree structure is more memory
efficient, and IIUC could scale beyond 1GB automatically since each
node is a separate allocation, so the answer will be "yes" in the rare
case the current setting is in fact causing multiple index scans.
Furthermore, it doesn't have to anticipate the maximum size, so there
is no up front calculation assuming max-tuples-per-page, so it
automatically uses less memory for less demanding tables.

(But +1 for changing that calculation for as long as we do have the
single array.)

--
John Naylor
EDB: http://www.enterprisedb.com



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Wed, Sep 14, 2022 at 3:18 AM John Naylor
<john.naylor@enterprisedb.com> wrote:
> On Wed, Sep 14, 2022 at 12:53 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > This is still only scratching the surface of what is possible with
> > dead_items. The visibility map snapshot concept can enable a far more
> > sophisticated approach to resource management in vacuumlazy.c.

> I don't quite see how it helps "enable" that.

I have already written a simple throwaway patch that can use the
current VM snapshot data structure (which is just a local copy of the
VM's pages) to do a cheap precheck ahead of actually doing a binary
search in dead_items -- if a TID's heap page is all-visible or
all-frozen (depending on the type of VACUUM) then we're 100%
guaranteed to not visit it, and so it's 100% guaranteed to not have
any dead_items (actually it could have LP_DEAD items by the time the
index scan happens, but they won't be in our dead_items array in any
case). Since we're working off of an immutable source, this
optimization is simple to implement already. Very simple.
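
For illustration, the precheck amounts to something like this (a sketch
only -- VMSnapshot, vmsnap_is_skippable and the comparator are stand-in
names, not the actual API from the throwaway patch):

#include "postgres.h"
#include "storage/itemptr.h"

/* Hypothetical local copy of the VM, queried by heap block number */
typedef struct VMSnapshot VMSnapshot;
extern bool vmsnap_is_skippable(VMSnapshot *vmsnap, BlockNumber blkno);
extern int  vac_cmp_itemptr(const void *left, const void *right);

/*
 * Sketch of the precheck described above.  A heap page that the VM
 * snapshot marked as skippable was never scanned by this VACUUM, so it
 * cannot have contributed anything to dead_items -- skip the binary
 * search entirely for TIDs on such pages.
 */
static bool
tid_is_dead_sketch(VMSnapshot *vmsnap, ItemPointer tid,
                   ItemPointerData *dead_items, int num_dead_items)
{
    BlockNumber blkno = ItemPointerGetBlockNumber(tid);

    if (vmsnap_is_skippable(vmsnap, blkno))
        return false;           /* page was skipped: no dead_items entries */

    return bsearch(tid, dead_items, num_dead_items,
                   sizeof(ItemPointerData), vac_cmp_itemptr) != NULL;
}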

I haven't even bothered to benchmark this throwaway patch (I literally
wrote it in 5 minutes to show Masahiko what I meant). I can't see why
even that throwaway prototype wouldn't significantly improve
performance, though. After all, the VM snapshot data structure is far
denser than dead_items, and the largest tables often have most heap
pages skipped via the VM.

I'm not really interested in pursuing this simple approach because it
conflicts with Masahiko's work on the data structure, and there are
other good reasons to expect that to help. Plus I'm already very busy
with what I have here.

> It'd be more logical to
> me to say the VM snapshot *requires* you to think harder about
> resource management, since a palloc'd snapshot should surely be
> counted as part of the configured memory cap that admins control.

That's clearly true -- it creates a new problem for resource
management that will need to be solved. But that doesn't mean that it
can't ultimately make resource management better and easier.

Remember, we don't randomly visit some skippable pages for no good
reason in the patch, since the SKIP_PAGES_THRESHOLD stuff is
completely gone. The VM snapshot isn't just a data structure that
vacuumlazy.c uses as it sees fit -- it's actually more like a set of
instructions on which pages to scan, that vacuumlazy.c *must* follow.
There is no way that vacuumlazy.c can accidentally pick up a few extra
dead_items here and there due to concurrent activity that unsets VM
pages. We don't need to leave that to chance -- it is locked in from
the start.

> I do remember your foreshadowing in the radix tree thread a while
> back, and I do think it's an intriguing idea to combine pages-to-scan
> and dead TIDs in the same data structure. The devil is in the details,
> of course. It's worth looking into.

Of course.

> Looking at the count of index scans, it's pretty much always
> "1", so even if the current approach could scale above 1GB, "no" it
> wouldn't help to raise that limit.

I agree that multiple index scans are rare. But I also think that
they're disproportionately involved in really problematic cases for
VACUUM. That said, I agree that simply making lookups to dead_items as
fast as possible is the single most important way to improve VACUUM by
improving dead_items.

> Furthermore, it doesn't have to anticipate the maximum size, so there
> is no up front calculation assuming max-tuples-per-page, so it
> automatically uses less memory for less demanding tables.

The final number of TIDs doesn't seem like the most interesting
information that VM snapshots could provide us when it comes to
building the dead_items TID data structure -- the *distribution* of
TIDs across heap pages seems much more interesting. The "shape" can be
known ahead of time, at least to some degree. It can help with
compression, which will reduce cache misses.

Andres made remarks about memory usage with sparse dead TID patterns
at this point on the "Improve dead tuple storage for lazy vacuum"
thread:

https://postgr.es/m/20210710025543.37sizjvgybemkdus@alap3.anarazel.de

I haven't studied the radix tree stuff in great detail, so I am
uncertain of how much the VM snapshot concept could help, and where
exactly it would help. I'm just saying that it seems promising,
especially as a way of addressing concerns like this.

--
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
John Naylor
Date:
On Wed, Sep 14, 2022 at 11:33 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, Sep 14, 2022 at 3:18 AM John Naylor

> > Furthermore, it doesn't have to anticipate the maximum size, so there
> > is no up front calculation assuming max-tuples-per-page, so it
> > automatically uses less memory for less demanding tables.
>
> The final number of TIDs doesn't seem like the most interesting
> information that VM snapshots could provide us when it comes to
> building the dead_items TID data structure -- the *distribution* of
> TIDs across heap pages seems much more interesting. The "shape" can be
> known ahead of time, at least to some degree. It can help with
> compression, which will reduce cache misses.

My point here was simply that spilling to disk is an admission of
failure to utilize memory efficiently and thus shouldn't be a selling
point of VM snapshots. Other selling points could still be valid.

-- 
John Naylor
EDB: http://www.enterprisedb.com



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Thu, Sep 15, 2022 at 12:09 AM John Naylor
<john.naylor@enterprisedb.com> wrote:
> On Wed, Sep 14, 2022 at 11:33 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > The final number of TIDs doesn't seem like the most interesting
> > information that VM snapshots could provide us when it comes to
> > building the dead_items TID data structure -- the *distribution* of
> > TIDs across heap pages seems much more interesting. The "shape" can be
> > known ahead of time, at least to some degree. It can help with
> > compression, which will reduce cache misses.
>
> My point here was simply that spilling to disk is an admission of
> failure to utilize memory efficiently and thus shouldn't be a selling
> point of VM snapshots. Other selling points could still be valid.

I was trying to explain the goals of this work in a way that was as
accessible as possible. It's not easy to get the high level ideas
across, let alone all of the details.

It's true that I have largely ignored the question of how VM snapshots
will need to spill up until now. There are several reasons for this,
most of which you could probably guess. FWIW it wouldn't be at all
difficult to add *some* reasonable spilling behavior very soon; the
underlying access patterns are highly sequential and predictable, in
the obvious way.

-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Jeff Davis
Date:
On Thu, 2022-09-08 at 13:23 -0700, Peter Geoghegan wrote:
> The new patch unifies the concept of antiwraparound
> VACUUM with the concept of aggressive VACUUM. Now there is only
> antiwraparound and regular VACUUM (uh, barring VACUUM FULL). And now
> antiwraparound VACUUMs are not limited to antiwraparound autovacuums
> -- a manual VACUUM can also be antiwraparound (that's just the new
> name for "aggressive").

I like this general approach. The existing GUCs have evolved in a
confusing way.

> For the most part the
> skipping/freezing strategy stuff has a good sense of what matters
> already, and shouldn't need to be guided very often.

I'd like to know more clearly where manual VACUUM fits in here. Will it
use a more aggressive strategy than an autovacuum, and how so?

> The patch relegates vacuum_freeze_table_age to a compatibility
> option,
> making its default -1, meaning "just use autovacuum_freeze_max_age".

The purpose of vacuum_freeze_table_age seems to be that, if you
regularly issue VACUUM commands, it will prevent a surprise
antiwraparound vacuum. Is that still the case?

Maybe it would make more sense to have vacuum_freeze_table_age be a
fraction of autovacuum_freeze_max_age, and be treated as a maximum so
that other intelligence might kick in and freeze sooner?

>  This makes things less confusing for users and hackers.

It may take an adjustment period ;-)

> The details of the skipping-strategy-choice algorithm are still
> unsettled in v3 (no real change there). ISTM that the important thing
> is still the high level concepts. Jeff was slightly puzzled by the
> emphasis placed on the cost model/strategy stuff, at least at one
> point. Hopefully my intent will be made clearer by the ideas featured
> in the new patch.

Yes, it's clearing things up, but it's still a complex problem.
There's:

 a. xid age vs the actual amount of deferred work to be done
 b. advancing relfrozenxid vs skipping all-visible pages
 c. difficulty in controlling reasonable behavior (e.g.
    vacuum_freeze_min_age often being ignored, freezing
    individual tuples rather than pages)

Your first email described the motivation in terms of (a), but the
patches seem more focused on (b) and (c).

>  The skipping strategy decision making process isn't
> particularly complicated, but it now looks more like an optimization
> problem of some kind or other.

There's another important point here, which is that it gives an
opportunity to decide to freeze some all-visible pages in a given round
just to reduce the deferred work, without worrying about advancing
relfrozenxid.


--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Mon, Oct 3, 2022 at 5:41 PM Jeff Davis <pgsql@j-davis.com> wrote:
> I like this general approach. The existing GUCs have evolved in a
> confusing way.

Thanks for taking a look!

> > For the most part the
> > skipping/freezing strategy stuff has a good sense of what matters
> > already, and shouldn't need to be guided very often.
>
> I'd like to know more clearly where manual VACUUM fits in here. Will it
> user a more aggressive strategy than an autovacuum, and how so?

There is no change whatsoever in the relationship between manually
issued VACUUMs and autovacuums. We interpret autovacuum_freeze_max_age
in almost the same way as HEAD. The only detail that's changed is that
we almost always interpret "freeze_table_age" as "just use
autovacuum_freeze_max_age" in the patch, rather than as
"vacuum_freeze_table_age, though never more than 95% of
autovacuum_freeze_max_age", as on HEAD.

Maybe this would be less confusing if I went just a bit further, and
totally got rid of the concept that vacuumlazy.c calls aggressive
VACUUM on HEAD -- then there really would be exactly one kind of
VACUUM, just like before the visibility map was first introduced back
in 2009. This would relegate antiwraparound-ness to just another
condition that autovacuum.c used to launch VACUUMs.

Giving VACUUM the freedom to choose where and how to freeze and
advance relfrozenxid based on both costs and benefits is key here.
Anything that needlessly imposes a rigid rule on vacuumlazy.c
undermines that -- it ties VACUUM's hands. The user can still
influence many of the details using high-level GUCs that work at the
table level, rather than GUCs that can only work at the level of
individual VACUUM operations (that leaves too much to chance). Users
shouldn't want or need to micromanage VACUUM.

> > The patch relegates vacuum_freeze_table_age to a compatibility
> > option,
> > making its default -1, meaning "just use autovacuum_freeze_max_age".
>
> The purpose of vacuum_freeze_table_age seems to be that, if you
> regularly issue VACUUM commands, it will prevent a surprise
> antiwraparound vacuum. Is that still the case?

The user really shouldn't need to do anything with
vacuum_freeze_table_age at all now. It's mostly just a way for the
user to optionally insist on advancing relfrozenxid via an
antiwraparound/aggressive VACUUM -- like in a manual VACUUM FREEZE.
Even VACUUM FREEZE shouldn't be necessary very often.

> Maybe it would make more sense to have vacuum_freeze_table_age be a
> fraction of autovacuum_freeze_max_age, and be treated as a maximum so
> that other intelligence might kick in and freeze sooner?

That's kind of how the newly improved skipping strategy stuff works.
It gives some weight to table age as one additional factor (based on
how close the table's age is to autovacuum_freeze_max_age or its Multi
equivalent).

If table age is (say) 60% of autovacuum_freeze_max_age, then VACUUM
should be "60% as aggressive" as a conventional
aggressive/antiwraparound autovacuum would be. What that actually
means is that the VACUUM will tend to prefer advancing relfrozenxid
the closer we get to the cutoff, gradually giving less and less
consideration to putting off work as we get closer and closer. When we
get to 100% then we'll definitely advance relfrozenxid (via a
conventional aggressive/antiwraparound VACUUM).

The precise details are unsettled, but I'm pretty sure that the
general idea is sound. Basically we're replacing
vacuum_freeze_table_age with a dynamic, flexible version of the same
basic idea. Now we don't just care about the need to advance
relfrozenxid (benefits), though; we also care about costs.

> >  This makes things less confusing for users and hackers.
>
> It may take an adjustment period ;-)

Perhaps this is more of an aspiration at this point.  :-)

> Yes, it's clearing things up, but it's still a complex problem.
> There's:
>
>  a. xid age vs the actual amount of deferred work to be done
>  b. advancing relfrozenxid vs skipping all-visible pages
>  c. difficulty in controlling reasonable behavior (e.g.
>     vacuum_freeze_min_age often being ignored, freezing
>     individual tuples rather than pages)
>
> Your first email described the motivation in terms of (a), but the
> patches seem more focused on (b) and (c).

I think that all 3 areas are deeply and hopelessly intertwined.

For example, vacuum_freeze_min_age is effectively ignored in many
important cases right now precisely because we senselessly skip
all-visible pages with unfrozen tuples, no matter what -- the problem
actually comes from the visibility map, which vacuum_freeze_min_age
predates by quite a few years. So how can you possibly address the
vacuum_freeze_min_age issues without also significantly revising VM
skipping behavior? They're practically the same problem!

And once you've fixed vacuum_freeze_min_age (and skipping), how can
you then pass up the opportunity to advance relfrozenxid early when
doing so will require only a little extra work? I'm going to regress
some cases if I simply ignore the relfrozenxid factor. Finally, the
debt issue is itself a consequence of the other problems.

Perhaps this is an example of the inventor's paradox, where the more
ambitious plan may actually be easier and more likely to succeed than
a more limited plan that just focuses on one immediate problem. All of
these problems seem to be a result of adding accretion after accretion
over the years. A high-level rethink is well overdue. We need to
return to basics.

> >  The skipping strategy decision making process isn't
> > particularly complicated, but it now looks more like an optimization
> > problem of some kind or other.
>
> There's another important point here, which is that it gives an
> opportunity to decide to freeze some all-visible pages in a given round
> just to reduce the deferred work, without worrying about advancing
> relfrozenxid.

True. Though I think that a strong bias in the direction of advancing
relfrozenxid by some amount (not necessarily by very many XIDs) still
makes sense, especially when we're already freezing aggressively.

-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Jeff Davis
Date:
On Mon, 2022-10-03 at 20:11 -0700, Peter Geoghegan wrote:
> True. Though I think that a strong bias in the direction of advancing
> relfrozenxid by some amount (not necessarily by very many XIDs) still
> makes sense, especially when we're already freezing aggressively.

Take the case where you load a lot of data in one transaction. After
the loading transaction finishes, those new pages will soon be marked
all-visible.

In the future, vacuum runs will have to decide what to do. If a vacuum
decides to do an aggressive scan to freeze all of those pages, it may
be at some unfortunate time and disrupt the workload. But if it skips
them all, then it's just deferring the work until it runs up against
autovacuum_freeze_max_age, which might also be at an unfortunate time.

So how does your patch series handle this case? I assume there's some
mechanism to freeze a moderate number of pages without worrying about
advancing relfrozenxid?

Regards,
    Jeff Davis




Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Mon, Oct 3, 2022 at 10:13 PM Jeff Davis <pgsql@j-davis.com> wrote:
> Take the case where you load a lot of data in one transaction. After
> the loading transaction finishes, those new pages will soon be marked
> all-visible.
>
> In the future, vacuum runs will have to decide what to do. If a vacuum
> decides to do an aggressive scan to freeze all of those pages, it may
> be at some unfortunate time and disrupt the workload. But if it skips
> them all, then it's just deferring the work until it runs up against
> autovacuum_freeze_max_age, which might also be at an unfortunate time.

Predicting the future accurately is intrinsically hard. We're already
doing that today by freezing lazily. I think that we can come up with
a better overall strategy, but there is always a risk that we'll come
out worse off in some individual cases. I think it's worth it if it
avoids ever really flying off the rails.

> So how does your patch series handle this case? I assume there's some
> mechanism to freeze a moderate number of pages without worrying about
> advancing relfrozenxid?

It mostly depends on whether or not the table exceeds the new
vacuum_freeze_strategy_threshold GUC in size at the time of the
VACUUM. This is 4GB by default, at least right now.

The case where the table size doesn't exceed that threshold yet will
see each VACUUM advance relfrozenxid when it happens to be very cheap
to do so, in terms of the amount of extra scanned_pages. If the number
of extra scanned_pages is less than 5% of the total table size
(current rel_pages), then we'll advance relfrozenxid early by making
sure to scan any all-visible pages.

This scanned_pages threshold usually stays at 5%, but it will
eventually start to grow (i.e. make VACUUM opt for eager scanning, and
so advance relfrozenxid, more often) once table age exceeds 50% of
autovacuum_freeze_max_age at
the start of the VACUUM. So the skipping strategy threshold is more or
less a blend of physical units (heap pages) and logical units (XID
age).
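
As a rough sketch of how those two inputs might combine (illustrative
only, using the numbers described above; the real formula is still
unsettled):

#include "postgres.h"
#include "storage/block.h"

/*
 * Sketch only: how many extra scanned_pages VACUUM is willing to accept
 * in order to advance relfrozenxid early.  tableagefrac is table age as
 * a fraction of autovacuum_freeze_max_age (or its MultiXact equivalent).
 */
static BlockNumber
skip_strategy_threshold_sketch(BlockNumber rel_pages, double tableagefrac)
{
    double      threshold = 0.05 * rel_pages;   /* usually 5% of rel_pages */

    /* Past 50% of autovacuum_freeze_max_age, ramp up towards "scan all" */
    if (tableagefrac > 0.5)
        threshold += (tableagefrac - 0.5) * 2.0 * rel_pages;

    /* At 100% of the cutoff, relfrozenxid advancement is no longer optional */
    return (BlockNumber) Min(threshold, (double) rel_pages);
}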

Then there is the case where it's already a larger table at the point
a given VACUUM begins -- a table that ends up exceeding the same table
size threshold, vacuum_freeze_strategy_threshold. When that happens
we'll freeze all pages that are going to be marked all-visible as a
matter of policy (i.e. use eager freezing strategy), so that the same
pages can be marked all-frozen instead. We won't freeze pages that
aren't full of all-visible tuples (except for LP_DEAD items), unless
they have XIDs that are so old that vacuum_freeze_min_age triggers
freezing.

Once a table becomes larger than vacuum_freeze_strategy_threshold,
VACUUM stops marking pages all-visible in the first place,
consistently marking them all-frozen instead. So naturally there just
cannot be any all-visible pages after the first eager freezing VACUUM
(actually there are some obscure edge cases that can result in the odd
all-visible page here or there, but this should be extremely rare, and
have only negligible impact).

Bigger tables always have pages frozen eagerly, and in practice always
advance relfrozenxid early. In other words, the eager freezing strategy
implies the eager skipping strategy -- though not the other way around.
Again, these are details that may change in the future. My focus is
validating the high level concepts.

So we avoid big spikes, and try to do the work when it's cheapest.

-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Jeff Davis
Date:
On Mon, 2022-10-03 at 22:45 -0700, Peter Geoghegan wrote:
> Once a table becomes larger than vacuum_freeze_strategy_threshold,
> VACUUM stops marking pages all-visible in the first place,
> consistently marking them all-frozen instead.

What are the trade-offs here? Why does it depend on table size?

Regards,
    Jeff Davis




Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Tue, Oct 4, 2022 at 10:39 AM Jeff Davis <pgsql@j-davis.com> wrote:
> On Mon, 2022-10-03 at 22:45 -0700, Peter Geoghegan wrote:
> > Once a table becomes larger than vacuum_freeze_strategy_threshold,
> > VACUUM stops marking pages all-visible in the first place,
> > consistently marking them all-frozen instead.
>
> What are the trade-offs here? Why does it depend on table size?

That's a great question. The table-level threshold
vacuum_freeze_strategy_threshold more or less buckets every table into
one of two categories: small tables and big tables. Perhaps this seems
simplistic to you. That would be an understandable reaction, given the
central importance of this threshold. The current default of 4GB could
have easily been 8GB or perhaps even 16GB instead.

It's not so much size as the rate of growth over time that matters. We
really want to do eager freezing on "growth tables", particularly
append-only tables. On the other hand we don't want to do useless
freezing on small, frequently updated tables, like pgbench_tellers or
pgbench_branches -- those tables may well require zero freezing, and
yet each VACUUM will advance relfrozenxid to a very recent value
consistently (even on Postgres 15). But "growth" is hard to capture,
because in general we have to infer things about the future from the
past, which is difficult and messy.

Since it's hard to capture "growth table vs fixed size table"
directly, we use table size as a proxy. It's far from perfect, but I
think that it will work quite well in practice because most individual
tables simply never get very large. It's very common for a relatively
small number of tables to consistently grow, without bound (perhaps
not strictly append-only tables, but tables where nothing is ever
deleted and inserts keep happening). So a simplistic threshold
(combined with dynamic per-page decisions about freezing) should be
enough to avoid most of the downside of eager freezing. In particular,
we will still freeze lazily in tables where it's obviously very
unlikely to be worth it.

In general I think that being correct on average is overrated. It's
more important to always avoid being dramatically wrong -- especially
if there is no way to course correct in the next VACUUM. Although I
think that we have a decent chance of coming out ahead by every
available metric, that isn't really the goal. Why should performance
stability not have some cost, at least in some cases? I want to keep
the cost as low as possible (often "negative cost" relative to
Postgres 15), but overall I am consciously making a trade-off. There
are downsides.

-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Jeff Davis
Date:
On Tue, 2022-10-04 at 11:09 -0700, Peter Geoghegan wrote:
> So a simplistic threshold
> (combined with dynamic per-page decisions about freezing) should be
> enough to avoid most of the downside of eager freezing.

...

> I want to keep
> the cost as low as possible (often "negative cost" relative to
> Postgres 15), but overall I am consciously making a trade-off. There
> are downsides.

I am fine with that, but I'd like us all to understand what the
downsides are.

If I understand correctly:

1. Eager freezing (meaning to freeze at the same time as setting all-
visible) causes a modest amount of WAL traffic, hopefully before the
next checkpoint so we can avoid FPIs. Lazy freezing (meaning set all-
visible but don't freeze) defers the work, and it might never need to
be done; but if it does, it can cause spikes at unfortunate times and
is more likely to generate more FPIs.

2. You're trying to mitigate the downsides of eager freezing by:
  a. when freezing a tuple, eagerly freeze other tuples on that page
  b. optimize WAL freeze records

3. You're trying to capture the trade-off in #1 by using the table size
as a proxy. Deferred work is only really a problem for big tables, so
that's where you use eager freezing. But maybe we can just always use
eager freezing?:
  a. You're mitigating the WAL work for freezing.
  b. A lot of people run with checksums on, meaning that setting the
all-visible bit requires WAL work anyway, and often FPIs.
  c. All-visible is conceptually similar to freezing, but less
important, and it feels more and more like the design concept of all-
visible isn't carrying its weight.
  d. (tangent) I had an old patch[1] that actually removed
PD_ALL_VISIBLE (the page bit, not the VM bit), which was rejected, but
perhaps its time has come?

Regards,
    Jeff Davis


[1]
https://www.postgresql.org/message-id/1353551097.11440.128.camel%40sussancws0025




Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Tue, Oct 4, 2022 at 7:59 PM Jeff Davis <pgsql@j-davis.com> wrote:
> I am fine with that, but I'd like us all to understand what the
> downsides are.

Although I'm sure that there must be one case that loses measurably,
it's not particularly obvious where to start looking for one. I mean
it's easy to imagine individual pages that we lose on, but a practical
test case where most of the pages are reliably like that is harder to
imagine.

> If I understand correctly:
>
> 1. Eager freezing (meaning to freeze at the same time as setting all-
> visible) causes a modest amount of WAL traffic, hopefully before the
> next checkpoint so we can avoid FPIs. Lazy freezing (meaning set all-
> visible but don't freeze) defers the work, and it might never need to
> be done; but if it does, it can cause spikes at unfortunate times and
> is more likely to generate more FPIs.

Lazy freezing means to freeze every eligible tuple (every XID <
OldestXmin) when one or more XIDs are before FreezeLimit. Eager
freezing means freezing every eligible tuple when the page is about to
be set all-visible, or whenever lazy freezing would trigger freezing.

Eager freezing tends to avoid big spikes in larger tables, which is
very important. It can sometimes be cheaper and better in every way
than lazy freezing. Though lazy freezing sometimes retains an
advantage by avoiding freezing that is never going to be needed
altogether, typically only in small tables.
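
Spelled out as a sketch (illustrative names; the real decision is made
in lazy_scan_prune, based on what heap_prepare_freeze_tuple reports):

#include "postgres.h"

/*
 * Sketch of the two page-level freezing triggers described above.  Both
 * strategies freeze every eligible tuple (XID < OldestXmin) once a page
 * is triggered; they differ only in when a page gets triggered.
 */
static bool
page_freeze_triggered_sketch(bool has_xid_before_freeze_limit,
                             bool page_will_become_all_visible,
                             bool eager_freezing_strategy)
{
    /* Lazy trigger: some XID/MXID on the page is before FreezeLimit */
    if (has_xid_before_freeze_limit)
        return true;

    /*
     * Eager trigger: the page is about to be set all-visible anyway, so
     * freeze it now and mark it all-frozen instead, rather than leaving
     * the freezing for some future VACUUM to do all at once.
     */
    if (eager_freezing_strategy && page_will_become_all_visible)
        return true;

    return false;
}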

Lazy freezing is fairly similar to what we do on HEAD now -- though
it's not identical. It's still "page level freezing". It has lazy
criteria for triggering page freezing.

> 2. You're trying to mitigate the downsides of eager freezing by:
>   a. when freezing a tuple, eagerly freeze other tuples on that page
>   b. optimize WAL freeze records

Sort of.

Both of these techniques apply to eager freezing too, in fact. It's
just that eager freezing is likely to do the bulk of all freezing that
actually goes ahead, and so will be helped disproportionately by these
techniques (even when most VACUUM operations use the lazy freezing
strategy, which is probably the common case -- just because lazy
freezing freezes lazily).

> 3. You're trying to capture the trade-off in #1 by using the table size
> as a proxy. Deferred work is only really a problem for big tables, so
> that's where you use eager freezing.

Right.

> But maybe we can just always use
> eager freezing?:

That doesn't seem like a bad idea, though it might be tricky to put
into practice. It might be possible to totally unite the concept of
all-visible and all-frozen pages in the scope of this work. But there
are surprisingly many tricky details involved. I'm not surprised that
you're suggesting this -- it basically makes sense to me. It's just
the practicalities that I worry about here.

>   a. You're mitigating the WAL work for freezing.

I don't see why this would be true. Lazy vs Eager are exactly the same
for a given page at the point that freezing is triggered. We'll freeze
all eligible tuples (often though not always every tuple), or none at
all.

Lazy vs Eager describe the policy for deciding to freeze a page, but
do not affect the actual execution steps taken once we decide to
freeze.

>   b. A lot of people run with checksums on, meaning that setting the
> all-visible bit requires WAL work anyway, and often FPIs.

The idea of rolling the WAL records into one does seem appealing, but
we'd still need the original WAL record to set a page all-visible in
VACUUM's second heap pass (only setting a page all-visible in the
first heap pass could be optimized by making the FREEZE_PAGE WAL
record mark the page all-visible too). Or maybe we'd roll that into
the VACUUM WAL record at the same time.

In any case the second heap pass would have to have a totally
different WAL logging strategy to the first heap pass. Not
insurmountable, but not exactly an easy thing to do in passing either.

>   c. All-visible is conceptually similar to freezing, but less
> important, and it feels more and more like the design concept of all-
> visible isn't carrying its weight.

Well, not quite -- at least not on the VM side itself.

There are cases where heap_lock_tuple() will update a tuple's xmax,
replacing it with a new Multi. This will necessitate clearly the
page's all-frozen bit in the VM -- but the all-visible bit will stay
set. This is why it's possible for small numbers of all-visible pages
to appear even in large tables that have been eagerly frozen.

>   d. (tangent) I had an old patch[1] that actually removed
> PD_ALL_VISIBLE (the page bit, not the VM bit), which was rejected, but
> perhaps its time has come?

I remember that pgCon developer meeting well.  :-)

If anything your original argument for getting rid of PD_ALL_VISIBLE
is weakened by the proposal to merge together the WAL records for
freezing and for setting a heap page all visible. You'd know for sure
that the page will be dirtied when such a WAL record needs to be
written, so there is actually no reason to care about dirtying the
page. No?

I'm in favor of reducing the number of WAL records required in common
cases if at all possible -- purely because the generic WAL record
overhead of having an extra WAL record does probably add to the WAL
overhead for work performed in lazy_scan_prune(). But it seems like
separate work to me.

-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Thu, Sep 8, 2022 at 1:23 PM Peter Geoghegan <pg@bowt.ie> wrote:
> It might make sense to go further in the same direction by making
> "regular vs aggressive/antiwraparound" into a *strict* continuum. In
> other words, it might make sense to get rid of the two remaining cases
> where VACUUM conditions its behavior on whether this VACUUM operation
> is antiwraparound/aggressive or not.

I decided to go ahead with this in the attached revision, v5. This
revision totally gets rid of the general concept of discrete
aggressive/non-aggressive modes for each VACUUM operation (see
"v5-0004-Make-VACUUM-s-aggressive-behaviors-continuous.patch" and its
commit message). My new approach turned out to be simpler than the
previous half measures that I described as "unifying aggressive and
antiwraparound" (which itself first appeared in v3).

I now wish that I had all of these pieces in place for v1, since this
was the direction I was thinking of all along -- that might have made
life easier for reviewers like Jeff. What we have in v5 is what I had
in mind all along, which turns out to have only a little extra code
anyway. It might have been less confusing if I'd started this thread
with something like v5 -- the story I need to tell would have been
simpler that way. This is pretty much the end point I had in mind.

Note that we still retain what were previously "aggressive only"
behaviors. We only remove "aggressive" as a distinct mode of operation
that exclusively applies the aggressive behaviors. We're now selective
in how we apply each of the behaviors, based on the needs of the
table. We want to behave in a way that's proportionate to the problem
at hand, which is made easy by not tying anything to a discrete mode
of operation. It's a false dichotomy; why should we ever have only one
reason for running VACUUM, that's determined up front?

There are still antiwraparound autovacuums in v5, but that is really
just another way that autovacuum can launch an autovacuum worker (much
like it was before the introduction of the visibility map in 8.4) --
both conceptually, and in terms of how the code works in vacuumlazy.c.
In practice an antiwraparound autovacuum is guaranteed to advance
relfrozenxid in roughly the same way as on HEAD (otherwise what's the
point?), but that doesn't make the VACUUM operation itself special in
any way. Besides, antiwraparound autovacuums will naturally be rare,
because there are many more opportunities for a VACUUM to advance
relfrozenxid "early" now (only "early" relative to how it would work
on early Postgres versions). It's already clear that having
antiwraparound autovacuums and aggressive mode VACUUMs as two separate
concepts that are closely associated has some problems [1]. Formally
making antiwraparound autovacuums just another way to launch a VACUUM
via autovacuum seems quite useful to me.

For the most part users are expected to just take relfrozenxid
advancement for granted now. They should mostly be able to assume that
VACUUM will do whatever is required to keep it sufficiently current
over time. They can influence VACUUM's behavior, but that mostly works
at the level of the table (not the level of any individual VACUUM
operation). The freezing and skipping strategy stuff should do what is
necessary to keep up in the long run. We don't want to put too much
emphasis on relfrozenxid in the short run, because it isn't a reliable
proxy for how we've kept up with the physical work of freezing --
that's what really matters. It should be okay to "fall behind on table
age" in the short run, provided we don't fall behind on the physical
work of freezing. Those two things shouldn't be conflated.

We now use a separate pair of XID/MXID-based cutoffs to determine
whether or not we're willing to wait for a cleanup lock the hard way
(which can happen in any VACUUM, since of course there is no longer
any special VACUUM with special behaviors). The new pair of cutoffs
replace the use of FreezeLimit/MultiXactCutoff by lazy_scan_noprune
(those are now only used to decide on what to freeze inside
lazy_scan_prune). Same concept, but with a different, independent
timeline. This was necessary just to get an existing isolation test
(vacuum-no-cleanup-lock) to continue to work. But it just makes sense
to have a different timeline for a completely different behavior. And
it'll be more robust.

It's a really bad idea for VACUUM to try to wait indefinitely long for
a cleanup lock, since that's totally outside of its control. It
typically won't take very long at all for VACUUM to acquire a cleanup
lock, of course, but that is beside the point -- who really cares
what's true on average, for something like this? Sometimes it'll take
hours to acquire a cleanup lock, and there is no telling when that
might happen! And so pausing VACUUM/freezing of all other pages just
to freeze one page makes little sense. Waiting for a cleanup lock
before we really need to is just an overreaction, which risks making
the situation worse. The cure must not be worse than the disease.

This revision also resolves problems with freezing MultiXactIds too
lazily [2]. We now always trigger page level freezing in the event of
encountering a Multi. This is more consistent with the behavior on
HEAD, where we can easily process a Multi well before the cutoff
represented by vacuum_multixact_freeze_min_age (e.g., we notice that a
Multi has no members still running, making it safe to remove before
the cutoff is reached).

Also attaching a prebuilt copy of the "routine vacuuming" docs as of
v5. This is intended to be a convenience for reviewers, or anybody
with a general interest in the patch series. The docs certainly still
need work, but I feel that I'm making progress on that side of things
(especially in this latest revision). Making life easier for DBAs is
the single most important goal of this work, so the user docs are of
central importance. The current "Routine Vacuuming" docs have lots of
problems, but to some extent the problems are with the concepts
themselves.

[1] https://postgr.es/m/CAH2-Wz=DJAokY_GhKJchgpa8k9t_H_OVOvfPEn97jGNr9W=deg@mail.gmail.com
[2] https://postgr.es/m/CAH2-Wz=+B5f1izRDPYKw+sUgOr6=AkWXp2NikU5cub0ftbRQhA@mail.gmail.com
--
Peter Geoghegan

Attachment

Re: New strategies for freezing, advancing relfrozenxid early

From
Justin Pryzby
Date:
Note that this fails under -fsanitize=align

Subject: [PATCH v5 2/6] Teach VACUUM to use visibility map snapshot.

performing post-bootstrap initialization ...
../src/backend/access/heap/visibilitymap.c:482:38: runtime error: load of misaligned address 0x5559e1352424 for type
'uint64',which requires 8 byte alignment
 

>                         *all_visible += pg_popcount64(umap[i] & VISIBLE_MASK64);



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Thu, Nov 10, 2022 at 7:44 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> performing post-bootstrap initialization ...
> ../src/backend/access/heap/visibilitymap.c:482:38: runtime error: load of misaligned address 0x5559e1352424 for type
'uint64',which requires 8 byte alignment
 

This issue is fixed in the attached revision, v6. I now avoid breaking
alignment-picky platforms in visibilitymap.c by using PGAlignedBlock
in the vm snapshot struct (this replaces the raw char buffer used in
earlier revisions).
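
For anyone following along, the point of PGAlignedBlock is just that it
forces suitable alignment for the copied VM page, so the bitmap can be
read as uint64 words. Roughly (a sketch -- the function and the
constants shown are illustrative stand-ins, not the patch's code):

#include "postgres.h"

#include "port/pg_bitutils.h"       /* pg_popcount64 */
#include "storage/bufpage.h"        /* PageGetContents */

/* Illustrative stand-ins for visibilitymap.c's private constants */
#define SKETCH_MAPSIZE          (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
#define SKETCH_VISIBLE_MASK64   UINT64CONST(0x5555555555555555)

/*
 * Sketch: each VM page copied into the snapshot lives in a
 * PGAlignedBlock (a union of char[BLCKSZ] with int64/double members
 * that force alignment), so reading the bitmap as uint64 words is safe.
 */
static uint64
vmsnap_count_all_visible_sketch(PGAlignedBlock *vmpage_copy)
{
    uint64     *map = (uint64 *) PageGetContents((Page) vmpage_copy->data);
    uint64      count = 0;

    for (int i = 0; i < (int) (SKETCH_MAPSIZE / sizeof(uint64)); i++)
        count += pg_popcount64(map[i] & SKETCH_VISIBLE_MASK64);

    return count;
}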

Posting v6 will also keep CFTester happy. v5 no longer applies cleanly
due to conflicts caused by today's "Deduplicate freeze plans in freeze
WAL records" commit.

No other changes in v6 that are worth noting here.

Thanks
--
Peter Geoghegan

Attachment

Re: New strategies for freezing, advancing relfrozenxid early

From
Andres Freund
Date:
Hi,

On 2022-11-15 19:02:12 -0800, Peter Geoghegan wrote:
> From 352867c5027fae6194ab1c6480cd326963e201b1 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <pg@bowt.ie>
> Date: Sun, 12 Jun 2022 15:46:08 -0700
> Subject: [PATCH v6 1/6] Add page-level freezing to VACUUM.
> 
> Teach VACUUM to decide on whether or not to trigger freezing at the
> level of whole heap pages, not individual tuple fields.  OldestXmin is
> now treated as the cutoff for freezing eligibility in all cases, while
> FreezeLimit is used to trigger freezing at the level of each page (we
> now freeze all eligible XIDs on a page when freezing is triggered for
> the page).
> 
> This approach decouples the question of _how_ VACUUM could/will freeze a
> given heap page (which of its XIDs are eligible to be frozen) from the
> question of whether it actually makes sense to do so right now.
> 
> Just adding page-level freezing does not change all that much on its
> own: VACUUM will still typically freeze very lazily, since we're only
> forcing freezing of all of a page's eligible tuples when we decide to
> freeze at least one (on the basis of XID age and FreezeLimit).  For now
> VACUUM still freezes everything almost as lazily as it always has.
> Later work will teach VACUUM to apply an alternative eager freezing
> strategy that triggers page-level freezing earlier, based on additional
> criteria.
> ---
>  src/include/access/heapam.h          |  42 +++++-
>  src/backend/access/heap/heapam.c     | 199 +++++++++++++++++----------
>  src/backend/access/heap/vacuumlazy.c |  95 ++++++++-----
>  3 files changed, 222 insertions(+), 114 deletions(-)
> 
> diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
> index ebe723abb..ea709bf1b 100644
> --- a/src/include/access/heapam.h
> +++ b/src/include/access/heapam.h
> @@ -112,6 +112,38 @@ typedef struct HeapTupleFreeze
>      OffsetNumber offset;
>  } HeapTupleFreeze;
>  
> +/*
> + * State used by VACUUM to track what the oldest extant XID/MXID will become
> + * when determing whether and how to freeze a page's heap tuples via calls to
> + * heap_prepare_freeze_tuple.

Perhaps this could say something like "what the oldest extant XID/MXID
currently is and what it would be if we decide to freeze the page" or such?


> + * The relfrozenxid_out and relminmxid_out fields are the current target
> + * relfrozenxid and relminmxid for VACUUM caller's heap rel.  Any and all

"VACUUM caller's heap rel." could stand to be rephrased.


> + * unfrozen XIDs or MXIDs that remain in caller's rel after VACUUM finishes
> + * _must_ have values >= the final relfrozenxid/relminmxid values in pg_class.
> + * This includes XIDs that remain as MultiXact members from any tuple's xmax.
> + * Each heap_prepare_freeze_tuple call pushes back relfrozenxid_out and/or
> + * relminmxid_out as needed to avoid unsafe values in rel's authoritative
> + * pg_class tuple.
> + *
> + * Alternative "no freeze" variants of relfrozenxid_nofreeze_out and
> + * relminmxid_nofreeze_out must also be maintained for !freeze pages.
> + */

relfrozenxid_nofreeze_out isn't really a "no freeze variant" :)

I think it might be better to just always maintain the nofreeze state.


> +typedef struct HeapPageFreeze
> +{
> +    /* Is heap_prepare_freeze_tuple caller required to freeze page? */
> +    bool        freeze;

s/freeze/freeze_required/?


> +    /* Values used when page is to be frozen based on freeze plans */
> +    TransactionId relfrozenxid_out;
> +    MultiXactId relminmxid_out;
> +
> +    /* Used by caller for '!freeze' pages */
> +    TransactionId relfrozenxid_nofreeze_out;
> +    MultiXactId relminmxid_nofreeze_out;
> +
> +} HeapPageFreeze;
> +

Given the number of parameters to heap_prepare_freeze_tuple, why don't we pass
in more of them in via HeapPageFreeze?


>  /* ----------------
>   *        function prototypes for heap access method
>   *
> @@ -180,17 +212,17 @@ extern void heap_inplace_update(Relation relation, HeapTuple tuple);
>  extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
>                                        TransactionId relfrozenxid, TransactionId relminmxid,
>                                        TransactionId cutoff_xid, TransactionId cutoff_multi,
> +                                      TransactionId limit_xid, MultiXactId limit_multi,
>                                        HeapTupleFreeze *frz, bool *totally_frozen,
> -                                      TransactionId *relfrozenxid_out,
> -                                      MultiXactId *relminmxid_out);
> +                                      HeapPageFreeze *xtrack);

What does 'xtrack' stand for? Xid Tracking?


>   * VACUUM caller must assemble HeapFreezeTuple entries for every tuple that we
>   * returned true for when called.  A later heap_freeze_execute_prepared call
> - * will execute freezing for caller's page as a whole.
> + * will execute freezing for caller's page as a whole.  Caller should also
> + * initialize xtrack fields for page as a whole before calling here with first
> + * tuple for the page.  See page_frozenxid_tracker comments.

s/should/need to/?

page_frozenxid_tracker appears to be a dangling pointer.


> +     * VACUUM calls limit_xid "FreezeLimit", and cutoff_xid "OldestXmin".
> +     * (limit_multi is "MultiXactCutoff", and cutoff_multi "OldestMxact".)

Hm. Perhaps we should just rename them if it requires this kind of
explanation? They're really not good names.



> @@ -6524,8 +6524,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
>          else
>          {
>              /* xmin to remain unfrozen.  Could push back relfrozenxid_out. */
> -            if (TransactionIdPrecedes(xid, *relfrozenxid_out))
> -                *relfrozenxid_out = xid;
> +            if (TransactionIdPrecedes(xid, xtrack->relfrozenxid_out))
> +                xtrack->relfrozenxid_out = xid;
>          }
>      }

Could use TransactionIdOlder().


> @@ -6563,8 +6564,11 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
>               */
>              Assert(!freeze_xmax);
>              Assert(TransactionIdIsValid(newxmax));
> -            if (TransactionIdPrecedes(newxmax, *relfrozenxid_out))
> -                *relfrozenxid_out = newxmax;
> +            Assert(heap_tuple_would_freeze(tuple, limit_xid, limit_multi,
> +                                           &xtrack->relfrozenxid_nofreeze_out,
> +                                           &xtrack->relminmxid_nofreeze_out));
> +            if (TransactionIdPrecedes(newxmax, xtrack->relfrozenxid_out))
> +                xtrack->relfrozenxid_out = newxmax;

Perhaps the Assert(heap_tuple_would_freeze()) bit could be handled once at the
end of the routine, for all paths?


> @@ -6731,18 +6751,36 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
>              else
>                  frz->frzflags |= XLH_FREEZE_XVAC;
>  
> -            /*
> -             * Might as well fix the hint bits too; usually XMIN_COMMITTED
> -             * will already be set here, but there's a small chance not.
> -             */
> +            /* Set XMIN_COMMITTED defensively */
>              Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
>              frz->t_infomask |= HEAP_XMIN_COMMITTED;
> +
> +            /*
> +             * Force freezing any page with an xvac to keep things simple.
> +             * This allows totally_frozen tracking to ignore xvac.
> +             */
>              changed = true;
> +            xtrack->freeze = true;
>          }
>      }

Oh - I totally didn't realize that ->freeze is an out parameter. Seems a bit
odd to have the other fields suffied with _out but not this one?



> @@ -6786,13 +6824,13 @@ heap_execute_freeze_tuple(HeapTupleHeader tuple, HeapTupleFreeze *frz)
>   */
>  void
>  heap_freeze_execute_prepared(Relation rel, Buffer buffer,
> -                             TransactionId FreezeLimit,
> +                             TransactionId OldestXmin,
>                               HeapTupleFreeze *tuples, int ntuples)
>  {
>      Page        page = BufferGetPage(buffer);
>  
>      Assert(ntuples > 0);
> -    Assert(TransactionIdIsValid(FreezeLimit));
> +    Assert(TransactionIdIsValid(OldestXmin));
>  
>      START_CRIT_SECTION();
>  
> @@ -6822,11 +6860,10 @@ heap_freeze_execute_prepared(Relation rel, Buffer buffer,
>  
>          /*
>           * latestRemovedXid describes the latest processed XID, whereas
> -         * FreezeLimit is (approximately) the first XID not frozen by VACUUM.
> -         * Back up caller's FreezeLimit to avoid false conflicts when
> -         * FreezeLimit is precisely equal to VACUUM's OldestXmin cutoff.
> +         * OldestXmin is the first XID not frozen by VACUUM.  Back up caller's
> +         * OldestXmin to avoid false conflicts.
>           */
> -        latestRemovedXid = FreezeLimit;
> +        latestRemovedXid = OldestXmin;
>          TransactionIdRetreat(latestRemovedXid);
>  
>          xlrec.latestRemovedXid = latestRemovedXid;

Won't using OldestXmin instead of FreezeLimit potentially cause additional
conflicts? Is there any reason to not compute an accurate value?


> @@ -1634,27 +1639,23 @@ retry:
>              continue;
>          }
>  
> -        /*
> -         * LP_DEAD items are processed outside of the loop.
> -         *
> -         * Note that we deliberately don't set hastup=true in the case of an
> -         * LP_DEAD item here, which is not how count_nondeletable_pages() does
> -         * it -- it only considers pages empty/truncatable when they have no
> -         * items at all (except LP_UNUSED items).
> -         *
> -         * Our assumption is that any LP_DEAD items we encounter here will
> -         * become LP_UNUSED inside lazy_vacuum_heap_page() before we actually
> -         * call count_nondeletable_pages().  In any case our opinion of
> -         * whether or not a page 'hastup' (which is how our caller sets its
> -         * vacrel->nonempty_pages value) is inherently race-prone.  It must be
> -         * treated as advisory/unreliable, so we might as well be slightly
> -         * optimistic.
> -         */
>          if (ItemIdIsDead(itemid))
>          {
> +            /*
> +             * Delay unsetting all_visible until after we have decided on
> +             * whether this page should be frozen.  We need to test "is this
> +             * page all_visible, assuming any LP_DEAD items are set LP_UNUSED
> +             * in final heap pass?" to reach a decision.  all_visible will be
> +             * unset before we return, as required by lazy_scan_heap caller.
> +             *
> +             * Deliberately don't set hastup for LP_DEAD items.  We make the
> +             * soft assumption that any LP_DEAD items encountered here will
> +             * become LP_UNUSED later on, before count_nondeletable_pages is
> +             * reached.  Whether the page 'hastup' is inherently race-prone.
> +             * It must be treated as unreliable by caller anyway, so we might
> +             * as well be slightly optimistic about it.
> +             */
>              deadoffsets[lpdead_items++] = offnum;
> -            prunestate->all_visible = false;
> -            prunestate->has_lpdead_items = true;
>              continue;
>          }

What does this have to do with the rest of the commit? And why are we doing
this?


> @@ -1782,11 +1783,13 @@ retry:
>          if (heap_prepare_freeze_tuple(tuple.t_data,
>                                        vacrel->relfrozenxid,
>                                        vacrel->relminmxid,
> +                                      vacrel->OldestXmin,
> +                                      vacrel->OldestMxact,
>                                        vacrel->FreezeLimit,
>                                        vacrel->MultiXactCutoff,
>                                        &frozen[tuples_frozen],
>                                        &tuple_totally_frozen,
> -                                      &NewRelfrozenXid, &NewRelminMxid))
> +                                      &xtrack))
>          {
>              /* Save prepared freeze plan for later */
>              frozen[tuples_frozen++].offset = offnum;
> @@ -1807,9 +1810,33 @@ retry:
>       * that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
>       * that remains and needs to be considered for freezing now (LP_UNUSED and
>       * LP_REDIRECT items also remain, but are of no further interest to us).
> +     *
> +     * Freeze the page when heap_prepare_freeze_tuple indicates that at least
> +     * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.
>       */
> -    vacrel->NewRelfrozenXid = NewRelfrozenXid;
> -    vacrel->NewRelminMxid = NewRelminMxid;
> +    if (xtrack.freeze || tuples_frozen == 0)
> +    {
> +        /*
> +         * We're freezing the page.  Our final NewRelfrozenXid doesn't need to
> +         * be affected by the XIDs that are just about to be frozen anyway.

Seems quite confusing to enter a block described as "We're freezing the
page." when we're not freezing anything (tuples_frozen == 0).


> +         * Note: although we're freezing all eligible tuples on this page, we
> +         * might not need to freeze anything (might be zero eligible tuples).
> +         */
> +        vacrel->NewRelfrozenXid = xtrack.relfrozenxid_out;
> +        vacrel->NewRelminMxid = xtrack.relminmxid_out;
> +        freeze_all_eligible = true;

I don't really get what freeze_all_eligible is trying to do.


>  #ifdef USE_ASSERT_CHECKING
>      /* Note that all_frozen value does not matter when !all_visible */
> -    if (prunestate->all_visible)
> +    if (prunestate->all_visible && lpdead_items == 0)
>      {
>          TransactionId cutoff;
>          bool        all_frozen;
> @@ -1849,8 +1876,7 @@ retry:
>          if (!heap_page_is_all_visible(vacrel, buf, &cutoff, &all_frozen))
>              Assert(false);

Not related to this change, but why isn't this just
Assert(heap_page_is_all_visible(vacrel, buf, &cutoff, &all_frozen))?



> From 8f3b6237affda15101ffb0b88787bfd6bb92e32f Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <pg@bowt.ie>
> Date: Mon, 18 Jul 2022 14:35:44 -0700
> Subject: [PATCH v6 2/6] Teach VACUUM to use visibility map snapshot.
> 
> Acquire an in-memory immutable "snapshot" of the target rel's visibility
> map at the start of each VACUUM, and use the snapshot to determine when
> and how VACUUM will skip pages.

This should include a description of the memory usage effects.


> This has significant advantages over the previous approach of using the
> authoritative VM fork to decide on which pages to skip.  The number of
> heap pages processed will no longer increase when some other backend
> concurrently modifies a skippable page, since VACUUM will continue to
> see the page as skippable (which is correct because the page really is
> still skippable "relative to VACUUM's OldestXmin cutoff").

Why is it an advantage for the number of pages to not increase?


> It also
> gives VACUUM reliable information about how many pages will be scanned,
> before its physical heap scan even begins.  That makes it easier to
> model the costs that VACUUM incurs using a top-down, up-front approach.
> 
> Non-aggressive VACUUMs now make an up-front choice about VM skipping
> strategy: they decide whether to prioritize early advancement of
> relfrozenxid (eager behavior) over avoiding work by skipping all-visible
> pages (lazy behavior).  Nothing about the details of how lazy_scan_prune
> freezes changes just yet, though a later commit will add the concept of
> freezing strategies.
> 
> Non-aggressive VACUUMs now explicitly commit to (or decide against)
> early relfrozenxid advancement up-front.

Why?


> VACUUM will now either scan
> every all-visible page, or none at all.  This replaces lazy_scan_skip's
> SKIP_PAGES_THRESHOLD behavior, which was intended to enable early
> relfrozenxid advancement (see commit bf136cf6), but left many of the
> details to chance.

The main goal according to bf136cf6 was to avoid defeating OS readahead, so I
think it should be mentioned here.

To me this is something that ought to be changed separately from the rest of
this commit.


> TODO: We don't spill VM snapshots to disk just yet (resource management
> aspects of VM snapshots still need work).  For now a VM snapshot is just
> a copy of the VM pages stored in local buffers allocated by palloc().

HEAPBLOCKS_PER_PAGE is 32672 with the defaults. The maximum relation size is
2**32 - 1 blocks. So the max FSM size is 131458 pages, a bit more than 1GB. Is
that correct?

For large relations that are already nearly all-frozen this does add a
noticeable amount of overhead, whether spilled to disk or not. Of course
they're also not going to be vacuumed super often, but ...

Perhaps worth turning the VM into a range based description for the snapshot,
given it's a readonly datastructure in local memory? And we don't necessarily
need the all-frozen and all-visible in memory, one should suffice? We don't
even need random access, so it could easily be allocated incrementally, rather
than one large allocation.

Hard to imagine anybody having a multi-TB table without "runs" of
all-visible/all-frozen. I don't think it'd be worth worrying about patterns
that'd be inefficient in a range representation.



> +    /*
> +     * VACUUM must scan all pages that might have XIDs < OldestXmin in tuple
> +     * headers to be able to safely advance relfrozenxid later on.  There is
> +     * no good reason to scan any additional pages. (Actually we might opt to
> +     * skip all-visible pages.  Either way we won't scan pages for no reason.)
> +     *
> +     * Now that OldestXmin and rel_pages are acquired, acquire an immutable
> +     * snapshot of the visibility map as well.  lazy_scan_skip works off of
> +     * the vmsnap, not the authoritative VM, which can continue to change.
> +     * Pages that lazy_scan_heap will scan are fixed and known in advance.

Hm. It's a bit sad to compute the snapshot after determining OldestXmin.

We probably should refresh OldestXmin periodically. That won't allow us to get
a more aggressive relfrozenxid, but it'd allow to remove more gunk.


> +     *
> +     * The exact number of pages that lazy_scan_heap will scan also depends on
> +     * our choice of skipping strategy.  VACUUM can either choose to skip any
> +     * all-visible pages lazily, or choose to scan those same pages instead.

What does it mean to "skip lazily"?





> +        /*
> +         * Visibility map page copied to local buffer for caller's snapshot.
> +         * Caller requires an exact count of all-visible and all-frozen blocks
> +         * in the heap relation.  Handle that now.

This part of the comment seems like it actually belongs further down?


> +         * Must "truncate" our local copy of the VM to avoid incorrectly
> +         * counting heap pages >= rel_pages as all-visible/all-frozen.  Handle
> +         * this by clearing irrelevant bits on the last VM page copied.
> +         */

Hm - why would those bits already be set?


> +        map = PageGetContents(localvmpage);
> +        if (mapBlock == mapBlockLast)
> +        {
> +            /* byte and bit for first heap page not to be scanned by VACUUM */
> +            uint32        truncByte = HEAPBLK_TO_MAPBYTE(rel_pages);
> +            uint8        truncOffset = HEAPBLK_TO_OFFSET(rel_pages);
> +
> +            if (truncByte != 0 || truncOffset != 0)
> +            {
> +                /* Clear any bits set for heap pages >= rel_pages */
> +                MemSet(&map[truncByte + 1], 0, MAPSIZE - (truncByte + 1));
> +                map[truncByte] &= (1 << truncOffset) - 1;
> +            }
> +
> +            /* Now it's safe to tally bits from this final VM page below */
> +        }
> +
> +        /* Tally the all-visible and all-frozen counts from this page */
> +        umap = (uint64 *) map;
> +        for (int i = 0; i < MAPSIZE / sizeof(uint64); i++)
> +        {
> +            *all_visible += pg_popcount64(umap[i] & VISIBLE_MASK64);
> +            *all_frozen += pg_popcount64(umap[i] & FROZEN_MASK64);
> +        }
> +    }
> +
> +    return vmsnap;
> +}



> From 4f5969932451869f0f28295933c28de49a22fdf2 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <pg@bowt.ie>
> Date: Mon, 18 Jul 2022 15:13:27 -0700
> Subject: [PATCH v6 3/6] Add eager freezing strategy to VACUUM.
> 
> Avoid large build-ups of all-visible pages by making non-aggressive
> VACUUMs freeze pages proactively for VACUUMs/tables where eager
> vacuuming is deemed appropriate.  Use of the eager strategy (an
> alternative to the classic lazy freezing strategy) is controlled by a
> new GUC, vacuum_freeze_strategy_threshold (and an associated
> autovacuum_* reloption).  Tables whose rel_pages are >= the cutoff will
> have VACUUM use the eager freezing strategy.

What's the logic behind a hard threshold?  Suddenly freezing everything on a
huge relation seems problematic. I realize that never getting all that far
behind is part of the theory, but I don't think that's always going to work.

Wouldn't a better strategy be to freeze a percentage of the relation on every
non-aggressive vacuum? That way the amount of work for an eventual aggressive
vacuum will shrink, without causing individual vacuums to take extremely long.


> When the eager strategy is in use, lazy_scan_prune will trigger freezing
> a page's tuples at the point that it notices that it will at least
> become all-visible -- it can be made all-frozen instead.  We still
> respect FreezeLimit, though: the presence of any XID < FreezeLimit also
> triggers page-level freezing (just as it would with the lazy strategy).

The other thing that I think would be good to use is a) whether the page is
already in s_b, and b) whether the page already is dirty. The cost of freezing
shrinks significantly if it doesn't cause an additional read + write. And that
additional IO is IMO one of the major concerns with freezing much more
aggressively in OLTPish workloads where a lot of the rows won't ever get old
enough to need freezing.




> From f2066c8ca5ba1b6f31257a36bb3dd065ecb1e3d4 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <pg@bowt.ie>
> Date: Mon, 5 Sep 2022 17:46:34 -0700
> Subject: [PATCH v6 4/6] Make VACUUM's aggressive behaviors continuous.
> 
> The concept of aggressive/scan_all VACUUM dates back to the introduction
> of the visibility map in Postgres 8.4.  Before then, every lazy VACUUM
> was "equally aggressive": each operation froze whatever tuples before
> the age-wise cutoff needed to be frozen.  And each table's relfrozenxid
> was updated at the end.  In short, the previous behavior was much less
> efficient, but did at least have one thing going for it: it was much
> easier to understand at a high level.
>
> VACUUM no longer applies a separate mode of operation (aggressive mode).
> There are still antiwraparound autovacuums, but they're now little more
> than another way that autovacuum.c can launch an autovacuum worker to
> run VACUUM.

The most significant aspect of anti-wrap autovacuums right now is that they
don't auto-cancel. Is that still used? If so, what's the threshold?

IME one of the most common reasons for autovac not keeping up is that the
application occasionally acquires conflicting locks on one of the big
tables. Before reaching anti-wrap age all autovacuums on that table get
cancelled before it gets to update relfrozenxid. Once in that situation
autovac really focusses only on that relation...


> Now every VACUUM might need to wait for a cleanup lock, though few will.
> It can only happen when required to advance relfrozenxid to no less than
> half way between the existing relfrozenxid and nextXID.

Where's that "halfway" bit coming from?

Isn't "half way between the relfrozenxid and nextXID" a problem for instances
with longrunning transactions? Wouldn't this mean that we'd wait for every page if
relfrozenxid can't be advanced much because of a longrunning query or such?



> From 51a863190f70c8baa6d04e3ffd06473843f3326d Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <pg@bowt.ie>
> Date: Sun, 31 Jul 2022 13:53:19 -0700
> Subject: [PATCH v6 5/6] Avoid allocating MultiXacts during VACUUM.
> 
> Pass down vacuumlazy.c's OldestXmin cutoff to FreezeMultiXactId(), and
> teach it the difference between OldestXmin and FreezeLimit.  Use this
> high-level context to intelligently avoid allocating new MultiXactIds
> during VACUUM operations.  We should always prefer to avoid allocating
> new MultiXacts during VACUUM on general principle.  VACUUM is the only
> mechanism that can claw back MultixactId space, so allowing VACUUM to
> consume MultiXactId space (for any reason) adds to the risk that the
> system will trigger the multiStopLimit wraparound protection mechanism.

Strictly speaking that's not quite true, you can also drop/truncate tables ;)


Greetings,

Andres Freund



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Tue, Nov 15, 2022 at 9:20 PM Andres Freund <andres@anarazel.de> wrote:
> > Subject: [PATCH v6 1/6] Add page-level freezing to VACUUM.

Attached is v7, which incorporates much of your feedback. Thanks for the review!

> > +/*
> > + * State used by VACUUM to track what the oldest extant XID/MXID will become
> > + * when determing whether and how to freeze a page's heap tuples via calls to
> > + * heap_prepare_freeze_tuple.
>
> Perhaps this could say something like "what the oldest extant XID/MXID
> currently is and what it would be if we decide to freeze the page" or such?

Fixed.

> > + * The relfrozenxid_out and relminmxid_out fields are the current target
> > + * relfrozenxid and relminmxid for VACUUM caller's heap rel.  Any and all
>
> "VACUUM caller's heap rel." could stand to be rephrased.

Fixed.

> > + * unfrozen XIDs or MXIDs that remain in caller's rel after VACUUM finishes
> > + * _must_ have values >= the final relfrozenxid/relminmxid values in pg_class.
> > + * This includes XIDs that remain as MultiXact members from any tuple's xmax.
> > + * Each heap_prepare_freeze_tuple call pushes back relfrozenxid_out and/or
> > + * relminmxid_out as needed to avoid unsafe values in rel's authoritative
> > + * pg_class tuple.
> > + *
> > + * Alternative "no freeze" variants of relfrozenxid_nofreeze_out and
> > + * relminmxid_nofreeze_out must also be maintained for !freeze pages.
> > + */
>
> relfrozenxid_nofreeze_out isn't really a "no freeze variant" :)

Why not? I think that that's exactly what it is. We maintain these
alternative "oldest extant XID" values so that vacuumlazy.c's
lazy_scan_prune function can "opt out" of freezing. This is exactly
the same as what we do in lazy_scan_noprune, both conceptually and at
the implementation level.

> I think it might be better to just always maintain the nofreeze state.

Not sure. Even if there is very little to gain in cycles by not
maintaining the "nofreeze" cutoffs needlessly, it's still a pure waste
of cycles that can easily be avoided. So it just feels natural to not
waste those cycles -- it may even make the design clearer.

> > +typedef struct HeapPageFreeze
> > +{
> > +     /* Is heap_prepare_freeze_tuple caller required to freeze page? */
> > +     bool            freeze;
>
> s/freeze/freeze_required/?

Fixed.

> Given the number of parameters to heap_prepare_freeze_tuple, why don't we pass
> in more of them in via HeapPageFreeze?

HeapPageFreeze is supposed to be mutable state used for one single
page, though. Seems like we should use a separate immutable struct for
this instead.

I've already prototyped a dedicated immutable "cutoffs" struct, which
is instantiated exactly once per VACUUM. Seems like a good approach to
me. The immutable state can be shared by heapam.c's
heap_prepare_freeze_tuple(), vacuumlazy.c, and even
vacuum_set_xid_limits() -- so everybody can work off of the same
struct directly. Will try to get that into shape for the next
revision.
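
Roughly speaking, what I have in mind is a struct along these lines (the
name and fields here are just placeholders for the prototype, not final):

    struct VacuumCutoffs
    {
        /* Established once per VACUUM operation, then treated as read-only */
        TransactionId OldestXmin;
        MultiXactId   OldestMxact;
        TransactionId FreezeLimit;
        MultiXactId   MultiXactCutoff;
    };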

> What does 'xtrack' stand for? Xid Tracking?

Yes.

> >   * VACUUM caller must assemble HeapFreezeTuple entries for every tuple that we
> >   * returned true for when called.  A later heap_freeze_execute_prepared call
> > - * will execute freezing for caller's page as a whole.
> > + * will execute freezing for caller's page as a whole.  Caller should also
> > + * initialize xtrack fields for page as a whole before calling here with first
> > + * tuple for the page.  See page_frozenxid_tracker comments.
>
> s/should/need to/?

Changed it to "must".

> page_frozenxid_tracker appears to be a dangling pointer.

I think that you mean that the code comments reference an obsolete
type name -- fixed.

> > +      * VACUUM calls limit_xid "FreezeLimit", and cutoff_xid "OldestXmin".
> > +      * (limit_multi is "MultiXactCutoff", and cutoff_multi "OldestMxact".)
>
> Hm. Perhaps we should just rename them if it requires this kind of
> explanation? They're really not good names.

Agreed -- this can be taken care of as part of using a new VACUUM
operation level struct that is passed as immutable state, which I went
into a moment ago. That centralizes the definitions, which makes it
far easier to understand which cutoff is which. For now I've kept the
names as they were.

> Could use TransactionIdOlder().

I suppose, but the way I've done it feels a bit more natural to me,
and appears more often elsewhere. Not sure.

> > @@ -6563,8 +6564,11 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
> >                        */
> >                       Assert(!freeze_xmax);
> >                       Assert(TransactionIdIsValid(newxmax));
> > -                     if (TransactionIdPrecedes(newxmax, *relfrozenxid_out))
> > -                             *relfrozenxid_out = newxmax;
> > +                     Assert(heap_tuple_would_freeze(tuple, limit_xid, limit_multi,
> > +                                                    &xtrack->relfrozenxid_nofreeze_out,
> > +                                                    &xtrack->relminmxid_nofreeze_out));
> > +                     if (TransactionIdPrecedes(newxmax, xtrack->relfrozenxid_out))
> > +                             xtrack->relfrozenxid_out = newxmax;
>
> Perhaps the Assert(heap_tuple_would_freeze()) bit could be handled once at the
> end of the routine, for all paths?

The problem with that is that we cannot Assert() when we're removing a
Multi via FRM_INVALIDATE_XMAX processing in certain cases (I tried it
this way myself, and the assertion fails there). This can happen when
the call to FreezeMultiXactId() for the xmax determined that we should
do FRM_INVALIDATE_XMAX processing for the xmax due to the Multi being
"isLockOnly" and preceding "OldestVisibleMXactId[MyBackendId])". Which
is relatively common.

I fixed this by moving the assert further down, while still only
checking the FRM_RETURN_IS_XID and FRM_RETURN_IS_MULTI cases.

> Oh - I totally didn't realize that ->freeze is an out parameter. Seems a bit
> odd to have the other fields suffied with _out but not this one?

Fixed this by not having an "_out" suffix for any of these mutable
fields from HeapPageFreeze. Now everything is consistent. (The "_out"
convention is totally redundant, now that we have the HeapPageFreeze
struct, which makes it obvious that it is all mutable state.)

> Won't using OldestXmin instead of FreezeLimit potentially cause additional
> conflicts? Is there any reason to not compute an accurate value?

This is a concern that I share. I was hoping that I'd be able to get
away with using OldestXmin just for this, because it's simpler that
way. But I had my doubts about it already.

I wonder why it's correct to use FreezeLimit for this on HEAD, though.
What about those FRM_INVALIDATE_XMAX cases that I just mentioned we
couldn't Assert() on? That case effectively removes XIDs that might be
well after FreezeLimit. Granted it might be safe in practice, but it's
far from obvious why it is safe.

Perhaps we can fix this in a not-too-invasive way by reusing
LVPagePruneState.visibility_cutoff_xid for FREEZE_PAGE conflicts (not
just VISIBLE conflicts) in cases where that was possible (while still
using OldestXmin as a fallback in much rarer cases). In practice we're
only triggering freezing eagerly because the page is already expected
to be set all-visible (the whole point is that we'd prefer if it was
set all-frozen instead of all-visible).

(I've not done this in v7, but it's on my TODO list.)

Note that the patch already maintains
LVPagePruneState.visibility_cutoff_xid when there are some LP_DEAD
items on the page, because we temporarily ignore those LP_DEAD items
when considering the eager freezing stuff......

> >               if (ItemIdIsDead(itemid))
> >               {

> >                       deadoffsets[lpdead_items++] = offnum;
> > -                     prunestate->all_visible = false;
> > -                     prunestate->has_lpdead_items = true;
> >                       continue;
> >               }
>
> What does this have to do with the rest of the commit? And why are we doing
> this?

....which is what you're asking about here.

The eager freezing strategy triggers page-level freezing for any page
that is about to become all-visible, so that it can be set all-frozen
instead. But that's not entirely straightforward when there happens to
be some LP_DEAD items on the heap page. There are really two ways that
a page can become all-visible during VACUUM, and we want to account
for that here. With eager freezing we want to make the pages become
all-frozen instead of just all-visible, regardless of which heap pass
(first pass or second pass) the page is set to become all-visible (and
maybe even all-frozen).

The comments that you mention were moved around a bit in passing.

Note that we still set prunestate->all_visible to false inside
lazy_scan_prune when we see remaining LP_DEAD stub items. We just do
it later on, after we've decided on freezing stuff. (Obviously it
wouldn't be okay to return to lazy_scan_heap without unsetting
prunestate->all_visible if there are LP_DEAD items.)

> Seems quite confusing to enter a block described as "We're freezing the
> page." when we're not freezing anything (tuples_frozen == 0).

> I don't really get what freeze_all_eligible is trying to do.

freeze_all_eligible (and the "tuples_frozen == 0" behavior) are both
there because we can mark a page as all-frozen in the VM without
freezing any of its tuples first. When that happens, we must make sure
that "prunestate->all_frozen" is set to true, so that we'll actually
set the all-frozen bit. At the same time, we need to be careful about
the case where we *could* set the page all-frozen if we decided to
freeze all eligible tuples -- we need to handle the case where we
choose against freezing (and so can't set the all-frozen bit in the
VM, and so must actually set "prunestate->all_frozen" to false).

This is all kinda tricky because we're simultaneously dealing with the
actual state of the page, and the anticipated state of the page in the
near future. Closely related concepts, but distinct in important ways.
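
To sketch the intended control flow (using v7's field names -- this is just
the shape of the logic, not the exact code):

    if (xtrack.freeze_required || tuples_frozen == 0)
    {
        /*
         * Freeze the page -- possibly a no-op, when there are zero
         * eligible tuples.  Either way the page can safely be marked
         * all-frozen in the VM, if it turns out to be eligible for that.
         */
        vacrel->NewRelfrozenXid = xtrack.relfrozenxid;
        vacrel->NewRelminMxid = xtrack.relminmxid;
        freeze_all_eligible = true;
    }
    else
    {
        /*
         * Opt out of freezing this time.  Fall back on the "no freeze"
         * trackers, and remember that the all-frozen VM bit must not be
         * set, since eligible-but-unfrozen XIDs remain on the page.
         */
        vacrel->NewRelfrozenXid = xtrack.relfrozenxid_nofreeze;
        vacrel->NewRelminMxid = xtrack.relminmxid_nofreeze;
        tuples_frozen = 0;
        freeze_all_eligible = false;
    }

    /* ...later, when filling in prunestate... */
    prunestate->all_frozen = prunestate->all_frozen && freeze_all_eligible;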

> >  #ifdef USE_ASSERT_CHECKING
> >       /* Note that all_frozen value does not matter when !all_visible */
> > -     if (prunestate->all_visible)
> > +     if (prunestate->all_visible && lpdead_items == 0)
> >       {
> >               TransactionId cutoff;
> >               bool            all_frozen;
> > @@ -1849,8 +1876,7 @@ retry:
> >               if (!heap_page_is_all_visible(vacrel, buf, &cutoff, &all_frozen))
> >                       Assert(false);
>
> Not related to this change, but why isn't this just
> Assert(heap_page_is_all_visible(vacrel, buf, &cutoff, &all_frozen))?

It's just a matter of personal preference. I prefer to have a clear
block of related code that contains multiple related assertions. You
would probably have declared PG_USED_FOR_ASSERTS_ONLY variables at the
top of lazy_scan_prune instead. FWIW if you did it the other way the
assertion would actually have to include a "!prunestate->all_visible"
test that short circuits the heap_page_is_all_visible() call from the
Assert().
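
For reference, the single-assertion form would end up looking something like
this (with cutoff and all_frozen declared PG_USED_FOR_ASSERTS_ONLY near the
top of lazy_scan_prune):

    Assert(!prunestate->all_visible || lpdead_items > 0 ||
           heap_page_is_all_visible(vacrel, buf, &cutoff, &all_frozen));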

> > Subject: [PATCH v6 2/6] Teach VACUUM to use visibility map snapshot.

> This should include a description of the memory usage effects.

The visibilitymap.c side of this is the least worked out part of the
patch series, by far. I have deliberately put off work on the data
structure itself, preferring to focus on the vacuumlazy.c side of
things for the time being. But I still agree -- fixed by acknowledging
that that particular aspect of resource management is unresolved.

I did have an open TODO before in the commit message, which is now
improved based on your feedback: it now fully owns the fact that we
really ignore the impact on memory usage right now. Just because that
part is very WIP (much more so than every other part).

> > This has significant advantages over the previous approach of using the
> > authoritative VM fork to decide on which pages to skip.  The number of
> > heap pages processed will no longer increase when some other backend
> > concurrently modifies a skippable page, since VACUUM will continue to
> > see the page as skippable (which is correct because the page really is
> > still skippable "relative to VACUUM's OldestXmin cutoff").
>
> Why is it an advantage for the number of pages to not increase?

The commit message goes into that immediately after the last line that
you quoted.  :-)

Having an immutable structure will help us, both in the short term,
for this particular project, and in the long term, for other VACUUM
enhancements.

We need to have something that drives the cost model in vacuumlazy.c
for the skipping strategy stuff -- we need to have advanced
information about costs that drive the decision making process. Thanks
to VM snapshots, the cost model is able to reason about the cost of
relfrozenxid advancement precisely, in terms of "extra" scanned_pages
implied by advancing relfrozenxid during this VACUUM. That level of
precision is pretty nice IMV. It's not strictly necessary, but it's
nice to be able to make a precise accurate comparison between each of
the two skipping strategies.

Did you happen to look at the 6th and final patch? It's trivial, but
can have a big impact. It sizes dead_items while capping its size
based on scanned_pages, not based on rel_pages. That's obviously
guaranteed to be correct.  Note also that the 2nd patch teaches VACUUM
VERBOSE to report the final number of scanned_pages right at the
start, before scanning anything -- so it's a useful basis for much
better progress reporting in pg_stat_progress_vacuum. Stuff like that
also becomes very easy with VM snapshots.

Then there is the more ambitious stuff, that's not in scope for this
project. Example: Perhaps Sawada san will be able to take the concept
of visibility map snapshots, and combine it with his Radix tree design
-- which could presumably benefit from advanced knowledge of which
pages can be scanned. This is information that is reliable, by
definition. In fact I think that it would make a lot of sense for this
visibility map snapshot data structure to be exactly the same
structure used to store dead_items. They really are kind of the same
thing. The design can reason precisely about which heap pages can ever
end up having any LP_DEAD items. (It's already trivial to use the VM
snapshot infrastructure as a precheck cache for dead_items lookups.)

> > Non-aggressive VACUUMs now explicitly commit to (or decide against)
> > early relfrozenxid advancement up-front.
>
> Why?

We can advance relfrozenxid because it's cheap to, or because it's
urgent (according to autovacuum_freeze_max_age). This is kind of true
on HEAD already due to the autovacuum_freeze_max_age "escalate to
aggressive" thing -- but we can do much better than that. Why not
decide to advance relfrozenxid when (say) it's only *starting* to get
urgent when it happens to be relatively cheap (though not dirt cheap)?
We make relfrozenxid advancement a deliberate decision that weighs
*all* available information, and has a sense of the needs of the table
over time.

The user experience is important here. Going back to a model where
there is really just one kind of lazy VACUUM makes a lot of sense. We
should have much more approximate guarantees about relfrozenxid
advancement, since that's what gives us the flexibility to find a
cheaper (or more stable) way of keeping up over time. It matters that
we keep up over time, but it doesn't matter if we fall behind on
relfrozenxid advancement -- at least not if we don't also fall behind
on the work of freezing physical heap pages.

> > VACUUM will now either scan
> > every all-visible page, or none at all.  This replaces lazy_scan_skip's
> > SKIP_PAGES_THRESHOLD behavior, which was intended to enable early
> > relfrozenxid advancement (see commit bf136cf6), but left many of the
> > details to chance.
>
> The main goal according to bf136cf6 was to avoid defeating OS readahead, so I
> think it should be mentioned here.

Agreed. Fixed.

> To me this is something that ought to be changed separately from the rest of
> this commit.

Maybe, but I'd say it depends on the final approach taken -- the
visibilitymap.c aspects of the patch are the least settled. I am
seriously considering adding prefetching to the vm snapshot structure,
which would make it very much a direct replacement for
SKIP_PAGES_THRESHOLD.

Separately, I'm curious about what you think of VM snapshots from an
aio point of view. Seems like it would be ideal for prefetching for
aio?

> > TODO: We don't spill VM snapshots to disk just yet (resource management
> > aspects of VM snapshots still need work).  For now a VM snapshot is just
> > a copy of the VM pages stored in local buffers allocated by palloc().
>
> HEAPBLOCKS_PER_PAGE is 32672 with the defaults. The maximum relation size is
> 2**32 - 1 blocks. So the max FSM size is 131458 pages, a bit more than 1GB. Is
> that correct?

I think that you meant "max VM size". That sounds correct to me.
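
To spell the arithmetic out (with the default BLCKSZ of 8192): MAPSIZE is
8192 - MAXALIGN(SizeOfPageHeaderData) = 8168 bytes, and each VM byte covers
4 heap blocks, so HEAPBLOCKS_PER_PAGE = 8168 * 4 = 32672.  Covering 2^32 - 1
heap blocks therefore takes ceil((2^32 - 1) / 32672) = 131458 VM pages, and
131458 * 8192 bytes works out to a little over 1GiB.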

> For large relations that are already nearly all-frozen this does add a
> noticeable amount of overhead, whether spilled to disk or not. Of course
> they're also not going to be vacuumed super often, but ...

I wouldn't be surprised if the patch didn't work with relations that
approach 32 TiB in size. As I said, the visibilitymap.c data structure
is the least worked out piece of the project.

> Perhaps worth turning the VM into a range based description for the snapshot,
> given it's a readonly datastructure in local memory? And we don't necessarily
> need the all-frozen and all-visible in memory, one should suffice? We don't
> even need random access, so it could easily be allocated incrementally, rather
> than one large allocation.

Definitely think that we should do simple run-length encoding, stuff
like that. Just as long as it allows vacuumlazy.c to work off of a
true snapshot, with scanned_pages known right from the start. The
consumer side of things has been my focus so far.

> Hm. It's a bit sad to compute the snapshot after determining OldestXmin.
>
> We probably should refresh OldestXmin periodically. That won't allow us to get
> a more aggressive relfrozenxid, but it'd allow to remove more gunk.

That may well be a good idea, but I think that it's also a good idea
to just not scan heap pages that we know won't have XIDs < OldestXmin
(OldestXmin at the start of the VACUUM). That visibly makes the
problem of "recently dead" tuples that cannot be cleaned up a lot
better, without requiring that we do anything with OldestXmin.

I also think that there is something to be said for not updating the
FSM for pages that were all-visible at the beginning of the VACUUM
operation. VACUUM is currently quite happy to update the FSM with its
own confused idea about how much free space there really is on heap
pages with recently dead (dead but not yet removable) tuples. That's
really bad, but really subtle.

> What does it mean to "skip lazily"?

Skipping even all-visible pages, prioritizing avoiding work over
advancing relfrozenxid. This is a cost-based decision. As I mentioned
a moment ago, that's one immediate use of VM snapshots (it gives us
precise information to base our decision on, that simply *cannot*
become invalid later on).

> > +             /*
> > +              * Visibility map page copied to local buffer for caller's snapshot.
> > +              * Caller requires an exact count of all-visible and all-frozen blocks
> > +              * in the heap relation.  Handle that now.
>
> This part of the comment seems like it actually belongs further down?

No, it just looks a bit like that because of the "truncate in-memory
VM" code stanza. It's actually the right order.

> > +              * Must "truncate" our local copy of the VM to avoid incorrectly
> > +              * counting heap pages >= rel_pages as all-visible/all-frozen.  Handle
> > +              * this by clearing irrelevant bits on the last VM page copied.
> > +              */
>
> Hm - why would those bits already be set?

No real reason, we "truncate" like this defensively. This will
probably look quite different before too long.

> > Subject: [PATCH v6 3/6] Add eager freezing strategy to VACUUM.

> What's the logic behind a hard threshold?  Suddenly freezing everything on a
> huge relation seems problematic. I realize that never getting all that far
> behind is part of the theory, but I don't think that's always going to work.

It's a vast improvement on what we do currently, especially in
append-only tables.

There is simply no limit on how many physical heap pages will have to
be frozen when there is an aggressive mode VACUUM. It could be
terabytes, since table age predicts precisely nothing about costs.
With the patch we have a useful limit for the first time, that uses
physical units (the only kind of units that make any sense).

Admittedly we should really have special instrumentation that reports
when VACUUM must do "catch up freezing" when the
vacuum_freeze_strategy_threshold threshold is first crossed, to help
users to make better choices in this area. And maybe
vacuum_freeze_strategy_threshold should be lower by default, so it's
not as noticeable. (The GUC partly exists as a compatibility option, a
bridge to the old lazy behavior.)

Freezing just became approximately 5x cheaper with the freeze plan
deduplication work (commit 9e540599). To say nothing of how
vacuuming indexes became a lot cheaper in recent releases. So to some
extent we can afford to be more proactive here. There are some very
nonlinear cost profiles involved here due to write amplification
effects. So having a strong bias against write amplification seems
totally reasonable to me -- we can potentially "get it wrong" and
still come out ahead, because we at least had the right idea about
costs.

I don't deny that there are clear downsides, though. I am convinced
that it's worth it -- performance stability is what users actually
complain about in almost all cases. Why should performance stability
be 100% free?

> Wouldn't a better strategy be to freeze a percentage of the relation on every
> non-aggressive vacuum? That way the amount of work for an eventual aggressive
> vacuum will shrink, without causing individual vacuums to take extremely long.

I think that it's better to avoid aggressive mode altogether. By
committing to advancing relfrozenxid by *some* amount in ~all VACUUMs
against larger tables, we can notice when we don't actually need to do
very much freezing to keep relfrozenxid current, due to workload
characteristics. It depends on workload, of course. But if we don't
try to do this we'll never notice that it's possible to do it.

Why should we necessarily need to freeze very much, after a while? Why
shouldn't most newly frozen pages stay frozen ~forever after a little
while?

> The other thing that I think would be good to use is a) whether the page is
> already in s_b, and b) whether the page already is dirty. The cost of freezing
> shrinks significantly if it doesn't cause an additional read + write. And that
> additional IO is IMO one of the major concerns with freezing much more
> aggressively in OLTPish workloads where a lot of the rows won't ever get old
> enough to need freezing.

Maybe, but I think that systematic effects are more important. We
freeze eagerly during this VACUUM in part because it makes
relfrozenxid advancement possible in the next VACUUM.

Note that eager freezing doesn't freeze the page unless it's already
going to set it all-visible. That's another way in which we ameliorate
the problem of freezing when it makes little sense to -- even with
eager freezing strategy, we *don't* freeze heap pages where it
obviously makes little sense to. Which makes a huge difference on its
own.

There is good reason to believe that most individual heap pages are
very cold data, even in OLTP apps. To a large degree Postgres is
successful because it is good at inexpensively storing data that will
possibly never be accessed:


https://www.microsoft.com/en-us/research/video/cost-performance-in-modern-data-stores-how-data-cashing-systems-succeed/

Speaking of OLTP apps:

in many cases VACUUM will prune just to remove one or two heap-only
tuples, maybe even generating an FPI in the process. But the removed
tuple wasn't actually doing any harm -- an opportunistic prune could
have done the same thing later on, once we'd built up some more
garbage tuples. So the only reason to prune is to freeze the page. And
yet right now we don't recognize this opportunity to freeze the page and
get *some* benefit out of the arguably needless prune. This is quite
common, in fact.

> The most significant aspect of anti-wrap autovacuums right now is that they
> don't auto-cancel. Is that still used? If so, what's the threshold?

This patch set doesn't change anything about antiwraparound
autovacuums -- though it does completely eliminate aggressive mode (so
it's a little like Postgres 8.4).

There is a separate thread discussing the antiwraparound side of this, actually:

https://postgr.es/m/CAH2-Wz=S-R_2rO49Hm94Nuvhu9_twRGbTm6uwDRmRu-Sqn_t3w@mail.gmail.com

I think that I will need to invent a new type of autovacuum that's
similar to antiwraparound autovacuum, but doesn't have the
cancellation behavior -- that is more or less prerequisite to
committing this patch series. We can accept some risk of relfrozenxid
falling behind if that doesn't create any real risk of antiwraparound
autovacuums.

We can retain antiwraparound autovacuum, which should kick in only
when the new kind of autovacuum has failed to advance relfrozenxid,
having had the opportunity. Maybe antiwraparound autovacuum should be
triggered when age(relfrozenxid) is twice the value of
autovacuum_freeze_max_age. The new kind of autovacuum would trigger at
the same table age that triggers antiwraparound autovacuum with the
current design.
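
In autovacuum.c terms, the triggering logic I'm imagining is something like
this (purely illustrative -- none of the names or details are settled yet):

    /* xid_age is age(relfrozenxid) for the table being considered */
    if (xid_age > autovacuum_freeze_max_age)
        dovacuum = true;        /* new table-age-driven autovacuum, which
                                 * remains subject to auto-cancellation */
    if (xid_age > autovacuum_freeze_max_age * 2)
        antiwraparound = true;  /* true antiwraparound autovacuum, which
                                 * retains the no-auto-cancel behavior */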

So antiwraparound autovacuum would work in the same way, but would be
much less common -- even for totally static tables. We'd at least be
sure that the auto cancellation behavior was *proportionate* to the
problem at hand, because we'll always have tried and ultimately failed
to advance relfrozenxid without activating the auto cancellation
behavior. We wouldn't trigger a very disruptive behavior routinely,
without any very good reason.

> > Now every VACUUM might need to wait for a cleanup lock, though few will.
> > It can only happen when required to advance relfrozenxid to no less than
> > half way between the existing relfrozenxid and nextXID.
>
> Where's that "halfway" bit coming from?

We don't use FreezeLimit within lazy_scan_noprune in the patch that
gets rid of aggressive mode VACUUM. We use something called minXid in
its place. So a different timeline to freezing (even for tables where
we always use the lazy freezing strategy).

The new minXid cutoff (used by lazy_scan_noprune) comes from this
point in vacuum_set_xid_limits():

+ *minXid = nextXID - (freeze_table_age / 2);
+ if (!TransactionIdIsNormal(*minXid))
+     *minXid = FirstNormalTransactionId;

So that's what I meant by "half way".

(Note that minXid is guaranteed to be <= FreezeLimit, which is itself
guaranteed to be <= OldestXmin, no matter what.)

> Isn't "half way between the relfrozenxid and nextXID" a problem for instances
> with longrunning transactions?

Should we do less relfrozenxid advancement because there is a long
running transaction, though? It's obviously seriously bad when things
are blocked by a long running transaction, but I don't see the
connection between that and how we wait for cleanup locks. Waiting for
cleanup locks is always really, really bad, and can be avoided in
almost all cases.

I suspect that I still haven't been aggressive enough in how minXid is
set, BTW -- we should be avoiding waiting for a cleanup lock like the
plague. So "half way" isn't enough. Maybe we should have a LOG message
in cases where it actually proves necessary to wait, because it's just
asking for trouble (at least when we're running in autovacuum).

> Wouldn't this mean that we'd wait for every page if
> relfrozenxid can't be advanced much because of a longrunning query or such?

Old XIDs always start out as young XIDs. Which we're now quite willing
to freeze when conditions look good.

Page level freezing always freezes all eligible XIDs on the page when
triggered, no matter what the details may be. This means that the
oldest XID on a heap page is more or less always an XID that's after
whatever OldestXmin was for the last VACUUM that ran and froze the
page, whenever that happened, and regardless of what mix of XID ages
was on the page at that time.

As a consequence, lone XIDs that are far older than other XIDs on the
same page become much rarer than what you'd see with the current
design -- they have to "survive" multiple VACUUMs, not just one
VACUUM. The best predictor of XID age becomes the time that VACUUM
last froze the page as a whole -- so workload characteristics and
natural variations are much much less likely to lead to problems from
waiting for cleanup locks. (Of course it also helps that we'll try
really hard to do that, and almost always prefer lazy_scan_noprune
processing.)

There is some sense in which we're trying to create a virtuous cycle
here. If we are always in a position to advance relfrozenxid by *some*
amount each VACUUM, however small, then we will have many individual
opportunities (spaced out over multiple VACUUM operations) to freeze
tuples on any heap tuples that (for whatever reason) are harder to get
a cleanup lock on, and then catch up on relfrozenxid by a huge amount
whenever we "get lucky". We have to "keep an open mind" to ever have
any chance of "getting lucky" in this sense, though.

> > VACUUM is the only
> > mechanism that can claw back MultixactId space, so allowing VACUUM to
> > consume MultiXactId space (for any reason) adds to the risk that the
> > system will trigger the multiStopLimit wraparound protection mechanism.
>
> Strictly speaking that's not quite true, you can also drop/truncate tables ;)

Fixed.

--
Peter Geoghegan

Attachment

Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Fri, Nov 18, 2022 at 5:06 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I've already prototyped a dedicated immutable "cutoffs" struct, which
> is instantiated exactly once per VACUUM. Seems like a good approach to
> me. The immutable state can be shared by heapam.c's
> heap_prepare_freeze_tuple(), vacuumlazy.c, and even
> vacuum_set_xid_limits() -- so everybody can work off of the same
> struct directly. Will try to get that into shape for the next
> revision.

Attached is v8.

Notable improvement over v7:

* As anticipated on November 18th, this revision adds a new refactoring
commit/patch, which adds a struct that contains fields like
FreezeLimit and OldestXmin, which is used by vacuumlazy.c to pass the
information to heap_prepare_freeze_tuple().

This refactoring makes everything easier to understand -- it's a
significant structural improvement.

* The changes intended to avoid allocating a new Multi during VACUUM
no longer appear in their own commit. That was squashed/combined with
the earlier page-level freezing commit.

This is another structural improvement.

The FreezeMultiXactId() changes were never really an optimization, and
I shouldn't have explained them that way. They are only needed to
avoid MultiXactId related regressions that page-level freezing would
otherwise cause. Doing these changes in the page-level freezing patch
makes that far clearer.

* Fixes an issue with snapshotConflictHorizon values for FREEZE_PAGE
records, where earlier revisions could have more false recovery
conflicts relative to the behavior on HEAD.

In other words, v8 addresses a concern that you (Andres) had in your
review of v6, here:

> > Won't using OldestXmin instead of FreezeLimit potentially cause additional
> > conflicts? Is there any reason to not compute an accurate value?

As anticipated, it is possible to generate valid FREEZE_PAGE
snapshotConflictHorizon using LVPagePruneState.visibility_cutoff_xid
in almost all cases -- so we should avoid almost all false recovery
conflicts. Granted, my approach here only works when the page will
become eligible to mark all-frozen (otherwise we can't trust
LVPagePruneState.visibility_cutoff_xid and have to fall back on
OldestXmin), but that's not really a problem in practice. Since in
practice page-level freezing is supposed to find a way to freeze pages
as a group, or not at all (so falling back on OldestXmin should be
very rare).

I could be more precise about generating a FREEZE_PAGE
snapshotConflictHorizon than this, but that didn't seem worth the
added complexity (I'd prefer to be able to ignore MultiXacts/xmax for
this stuff). I'm pretty sure that the new v8 approach is more than
good enough. It's actually an improvement on HEAD, where
snapshotConflictHorizon is derived from FreezeLimit, an approach with
the same basic problem as deriving snapshotConflictHorizon from
OldestXmin. Namely: using FreezeLimit is a poor proxy for what we
really want to use, which is a cutoff that comes from the specific
latest XID in some specific tuple header on the page we're freezing.

There are no remaining blockers to commit for the first two patches
from v8 (the two patches that add page-level freezing). I think that
I'll be able to commit page-level freezing in a matter of weeks, in
fact. All specific outstanding concerns about page-level freezing have
been addressed.

I believe that page-level freezing is uncontroversial. Unlike later
patches in the series, it changes nothing user-facing about VACUUM --
nothing very high level. Having the freeze plan deduplication work
added by commit 9e540599 helps here. The focus is WAL overhead over
time, and page level freezing can almost be understood as a mechanical
improvement to freezing that keeps costs over time down.

--
Peter Geoghegan

Attachment

Re: New strategies for freezing, advancing relfrozenxid early

From
Andres Freund
Date:
Hi,

On 2022-11-23 15:06:52 -0800, Peter Geoghegan wrote:
> Attached is v8.

The docs don't build:
https://cirrus-ci.com/task/5456939761532928
[20:00:58.203] postgres.sgml:52: element link: validity error : IDREF attribute linkend references an unknown ID
"vacuum-for-wraparound"

Greetings,

Andres Freund



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Tue, Dec 6, 2022 at 10:42 AM Andres Freund <andres@anarazel.de> wrote:
> The docs don't build:
> https://cirrus-ci.com/task/5456939761532928
> [20:00:58.203] postgres.sgml:52: element link: validity error : IDREF attribute linkend references an unknown ID
"vacuum-for-wraparound"

Thanks for pointing this out. FWIW it is a result of Bruce's recent
addition of the transaction processing chapter to the docs.

My intention is to post v9 later in the week, which will fix the doc
build, and a lot more besides that. If you are planning on doing
another round of review, I'd suggest that you hold off until then. v9
will have structural improvements that will likely make it easier to
understand all the steps leading up to removing aggressive mode
completely. It'll be easier to relate each local step/patch to the
bigger picture for VACUUM.

v9 will also address some of the concerns you raised in your review
that weren't covered by v8, especially about the VM snapshotting
infrastructure. But also your concerns about the transition from lazy
strategies to eager strategies. The "catch up freezing" performed by
the first VACUUM operation run against a table that just exceeded the
GUC-controlled table size threshold will have far more limited impact,
because the burden of freezing will be spread out across multiple
VACUUM operations. The big idea behind the patch series is to relieve
users from having to think about a special type of VACUUM that has to
do much more freezing than other VACUUMs that ran against the same
table in the recent past, of course, so it is important to avoid
accidentally allowing any behavior that looks kind of like the ghost
of aggressive VACUUM.

-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Tue, Dec 6, 2022 at 1:45 PM Peter Geoghegan <pg@bowt.ie> wrote:
> v9 will also address some of the concerns you raised in your review
> that weren't covered by v8, especially about the VM snapshotting
> infrastructure. But also your concerns about the transition from lazy
> strategies to eager strategies.

Attached is v9. Highlights:

* VM snapshot infrastructure now spills using temp files when required
(only in larger tables).

v9 is the first version that has a credible approach to resource
management, which was something I put off until recently. We only use
a fixed amount of memory now, which should be acceptable from the
viewpoint of VACUUM resource management. The temp files use the
BufFile infrastructure in a relatively straightforward way.

* VM snapshot infrastructure now uses explicit prefetching.

Our approach is straightforward, and perhaps even obvious: we prefetch
at the point that VACUUM requests the next block in line. There is a
configurable prefetch distance, controlled by
maintenance_io_concurrency. We "stage" a couple of thousand
BlockNumbers in VACUUM's vmsnap by bulk-reading from the vmsnap's
local copy of the visibility map -- these staged blocks are returned
to VACUUM to scan, with interlaced prefetching of later blocks from
the same local BlockNumber array.
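
The consumer side amounts to something like the following sketch (the
structure and field names here are made up for illustration -- the real
interface lives in visibilitymap.c):

    BlockNumber
    vmsnap_next_block(vmsnapshot *vmsnap)
    {
        /* Refill the staging array from the local VM copy when it runs dry */
        if (vmsnap->next_return == vmsnap->nstaged)
            vmsnap_stage_more_blocks(vmsnap);
        if (vmsnap->next_return == vmsnap->nstaged)
            return InvalidBlockNumber;  /* no pages left to scan */

        /* Keep prefetching up to maintenance_io_concurrency blocks ahead */
        while (vmsnap->next_prefetch < vmsnap->nstaged &&
               vmsnap->next_prefetch < vmsnap->next_return + maintenance_io_concurrency)
            PrefetchBuffer(vmsnap->rel, MAIN_FORKNUM,
                           vmsnap->staged[vmsnap->next_prefetch++]);

        return vmsnap->staged[vmsnap->next_return++];
    }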

The addition of prefetching ought to be enough to avoid regressions
that might otherwise result from the removal of SKIP_PAGES_THRESHOLD
from vacuumlazy.c (see commit bf136cf6 from around the time the
visibility map first went in for the full context). While I definitely
need to do more performance validation work around prefetching
(especially on high latency network-attached storage), I imagine that
it won't be too hard to get into shape for commit. It's certainly not
committable yet, but it's vastly better than v8.

The visibility map snapshot interface (presented by visibilitymap.h)
also changed in v9, mostly to support prefetching. We now have an
iterator style interface (so vacuumlazy.c cannot request random
access). This iterator interface is implemented by visibilitymap.c
using logic similar to the current lazy_scan_skip() logic from
vacuumlazy.c (which is gone).

All told, visibilitymap.c knows quite a bit more than it used to about
high level requirements from vacuumlazy.c. For example it has explicit
awareness of VM skipping strategies.

* Page-level freezing commit now freezes a page whenever VACUUM
detects that pruning ran and generated an FPI.

Following a suggestion by Andres, page-level freezing is now always
triggered when pruning needs an FPI. Note that this optimization gets
applied regardless of freezing strategy (unless you turn off
full_page_writes, I suppose).

This optimization is added by the second patch
(v9-0002-Add-page-level-freezing-to-VACUUM.patch).

* Fixed the doc build.

* Much improved criteria for deciding on freezing and vmsnap skipping
strategies in vacuumlazy.c lazy_scan_strategy function -- improved
"cost model".

VACUUM should now give users a far smoother "transition" from lazy
processing to eager processing. A table that starts out small (smaller
than vacuum_freeze_strategy_threshold), but gradually grows, and
eventually becomes fairly large (perhaps to a multiple of
vacuum_freeze_strategy_threshold in size) will now experience a far
more gradual transition, with catch-up freezing spread out across multiple
VACUUM operations. We avoid big jumps in the overhead of freezing,
where one particular VACUUM operation does all required "catch-up
freezing" in one go.

My approach is to "stagger" the timeline for switching freezing
strategy and vmsnap skipping strategy. We now change over from lazy to
eager freezing strategy when the table size threshold (controlled by
vacuum_freeze_strategy_threshold) is first crossed, just like in v8.
But unlike v8, v9 will switch over to eager skipping in some later
VACUUM operation (barring edge cases). This is implemented in a fairly
simple way: we now apply a "separate" threshold that is based on
vacuum_freeze_strategy_threshold: a threshold that's *twice* the
current value of the vacuum_freeze_strategy_threshold GUC/reloption
threshold.
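
To sketch the decision (simplified -- the real lazy_scan_strategy also
factors in table age and the vmsnap-derived costs, and the threshold is
compared against rel_pages here purely for illustration):

    /* Switch to eager freezing as soon as the first threshold is crossed */
    eager_freeze_strategy = (rel_pages >= vacuum_freeze_strategy_threshold);

    /* Only switch to eager skipping (scan every all-visible page) at 2x */
    eager_skip_strategy = (rel_pages >= vacuum_freeze_strategy_threshold * 2);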

My approach of "staggering" multiple distinct behaviors to avoid
having them all kick in during the same VACUUM operation isn't new to
v9. The behavior around waiting for cleanup locks (added by
v9-0005-Finish-removing-aggressive-mode-VACUUM.patch) is another
example of the same general idea.

In general I think that VACUUM shouldn't switch to more aggressive
behaviors all at the same time, in the same VACUUM. Each distinct
aggressive behavior has totally different properties, so there is no
reason why VACUUM should start to apply each and every one of them at
the same time. Some "aggressive" behaviors have the potential to make
things quite a lot worse, in fact. The cure must not be worse than the
disease.

* Related to the previous item (about the "cost model" that chooses a
strategy), we now have a much more sophisticated approach when it
comes to when and how we decide to advance relfrozenxid in smaller
tables (tables whose size is < vacuum_freeze_strategy_threshold). This
improves things for tables that start out small, and stay small.
Tables where we're unlikely to want to advance relfrozenxid in every
single VACUUM (better to be lazy with such a table), but still want to
be clever about advancing relfrozenxid "opportunistically".

The way that VACUUM weighs both table age and the added cost of
relfrozenxid advancement is more sophisticated in v9. The goal is to
make it more likely that VACUUM will stumble upon opportunities to
advance relfrozenxid when it happens to be cheap, which can happen for
many reasons. All of which have a great deal to do with workload
characteristics.

As in v8, v9 makes VACUUM willing to advance relfrozenxid without
concern for table age, whenever it notices that the cost of doing so
happens to be very cheap (in practice this means that the number of
"extra" heap pages scanned is < 5% of rel_pages). However, in v9 we
now go further by scaling this threshold through interpolation, based
on table age.

We have the same "5% of rel_pages" threshold when table age is less
than half way towards the point that autovacuum.c will launch an
antiwraparound autovacuum -- when we still have only minimal concern
about table age. But the rel_pages-wise threshold starts to grow once
table age gets past that "half way towards antiwrap AV" point. We
interpolate the rel_pages-wise threshold using a new approach in v9.

At first the rel_pages-wise threshold grows quite slowly (relative to
the rate at which table age approaches the point of forcing an
antiwraparound AV). For example, when we're 60% of the way towards
needing an antiwraparound AV, and VACUUM runs, we'll eagerly advance
relfrozenxid provided that the "extra" cost of doing so happens to be
less than ~22% of rel_pages. It "accelerates" from there (assuming
fixed rel_pages).

VACUUM will now tend to take advantage of individual table
characteristics that make it relatively cheap to advance relfrozenxid.
Bear in mind that these characteristics are not fixed for the same
table. The "extra" cost of advancing relfrozenxid during this VACUUM
(whether measured in absolute terms, or as a proportion of the net
amount of work just to do simple vacuuming) just isn't predictable
with real workloads. Especially not with the FPI opportunistic
freezing stuff from the second patch (the "freeze when heap pruning
gets an FPI" thing) in place. We should expect significant "natural
variation" among tables, and within the same table over time -- this
is a good thing.

For example, imagine a table that experiences a bunch of random
deletes, which leads to a VACUUM that must visit most heap pages (say
85% of rel_pages). Let's suppose that those deletes are a once-off
thing. The added cost of advancing relfrozenxid in the next VACUUM
still isn't trivial (assuming the remaining 15% of pages are
all-visible). But it is probably still worth doing if table age is at
least starting to become a concern. It might actually be a lot cheaper
to advance relfrozenxid early.

* Numerous structural improvements, lots of code polishing.

The patches have been reordered in a way that should make review a bit
easier. Now the commit messages are written in a way that clearly
anticipates the removal of aggressive mode VACUUM, which the last
patch actually finishes. Most of the earlier commits are presented as
preparation for completely removing aggressive mode VACUUM.

The first patch (which refactors how VACUUM passes around cutoffs like
FreezeLimit and OldestXmin by using a dedicated struct) is much
improved. heap_prepare_freeze_tuple() now takes a more explicit
approach to tracking what needs to happen for the tuple's freeze plan.
This allowed me to pepper it with defensive assertions. It's also a
lot clearer IMV. For example, we now have separate freeze_xmax and
replace_xmax tracker variables.

The second patch in the series (the page-level freezing patch) is also
much improved. I'm much happier with the way that
heap_prepare_freeze_tuple() now explicitly delegates control of
page-level freezing to FreezeMultiXactId() in v9, for example.

Note that I squashed the patch that taught VACUUM to size dead_items
using scanned_pages into the main visibility map patch
(v9-0004-Add-eager-and-lazy-VM-strategies-to-VACUUM.patch). That's why
there are only 5 patches (down from 6) in v9.

--
Peter Geoghegan

Attachment

Re: New strategies for freezing, advancing relfrozenxid early

From
Jeff Davis
Date:
On Sat, 2022-12-10 at 18:11 -0800, Peter Geoghegan wrote:
> On Tue, Dec 6, 2022 at 1:45 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > v9 will also address some of the concerns you raised in your review
> > that weren't covered by v8, especially about the VM snapshotting
> > infrastructure. But also your concerns about the transition from
> > lazy
> > strategies to eager strategies.
>
> Attached is v9. Highlights:

Comments:

* The documentation shouldn't have a heading like "Managing the 32-bit
Transaction ID address space". We already have a concept of "age"
documented, and I think that's all that's needed in the relevant
section. Freezing is driven by a need to keep the age of the oldest
transaction ID in a table to less than ~2B; and also the need to
truncate the clog (and reduce lookups of really old xids). It's fine to
give a brief explanation about why we can't track very old xids, but
it's more of an internal detail and not the main point.

* I'm still having a hard time with vacuum_freeze_strategy_threshold.
Part of it is the name, which doesn't seem to convey the meaning. But
the heuristic also seems off to me. What if you have lots of partitions
in an append-only range-partitioned table? That would tend to use the
lazy freezing strategy (because each partition is small), but that's
not what you want. I understand heuristics aren't perfect, but it feels
like we could do something better. Also, another purpose of this seems
to be to achieve v15 behavior (if v16 behavior causes a problem for
some workload), which seems like a good idea, but perhaps we should
have a more direct setting for that?

* The comment above lazy_scan_strategy() is phrased in terms of the
"traditional approach". It would be more clear if you described the
current strategies and how they're chosen. The pre-16 behavior was as
lazy as possible, so that's easy enough to describe without referring
to history.

* "eager skipping behavior" seems like a weird phrasing because it's
not immediately clear if that means "skip more pages" (eager to skip
pages and lazy to process them) or "skip fewer pages" (lazy to skip the
pages and eager to process the pages).

* The skipping behavior for all-visible pages is binary: skip them
all, or skip none. That makes sense in the context of relfrozenxid
advancement. But how does that avoid IO spikes? It would seem perfectly
reasonable to me, if relfrozenxid advancement is not a pressing
problem, to process some fraction of the all-visible pages (or perhaps
process enough of them to freeze some fraction). That would ensure that
each VACUUM makes a payment on the deferred costs of freezing. I think
this has already been discussed but it keeps reappearing in my mind, so
maybe we can settle this with a comment (and/or docs)?

* I'm wondering whether vacuum_freeze_min_age makes sense anymore. It
doesn't take effect unless the page is not skipped, which is confusing
from a usability standpoint, and we have better heuristics to decide if
the whole page should be frozen or not anyway (i.e. if an FPI was
already taken then freezing is cheaper).


--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Mon, Dec 12, 2022 at 3:47 PM Jeff Davis <pgsql@j-davis.com> wrote:
> Freezing is driven by a need to keep the age of the oldest
> transaction ID in a table to less than ~2B; and also the need to
> truncate the clog (and reduce lookups of really old xids). It's fine to
> give a brief explanation about why we can't track very old xids, but
> it's more of an internal detail and not the main point.

I agree that that's the conventional definition. What I am proposing
is that we revise that definition a little. We should start the
discussion of freezing in the user level docs by pointing out that
freezing also plays a role at the level of individual pages. An
all-frozen page is self-contained, now and forever (or until it gets
dirtied again, at least). Even on a standby we will reliably avoid
having to do clog lookups for a page that happens to have all of its
tuples frozen.

I don't want to push back too much here. I just don't think that it
makes terribly much sense for the docs to start the conversation about
freezing by talking about the worst consequences of not freezing for
an extended period of time. That's relevant, and it's probably going
to end up as the aspect of freezing that we spend most time on, but it
still doesn't seem like a useful starting point to me.

To me this seems related to the fallacy that relfrozenxid age is any
kind of indicator about how far behind we are on freezing. I think
that there is value in talking about freezing as a maintenance task
for physical heap pages, and only then talking about relfrozenxid and
the circular XID space. The 64-bit XID patch doesn't get rid of
freezing at all, because it is still needed to break the dependency of
tuples stored in heap pages on the pg_xact, and other SLRUs -- which
suggests that you can talk about freezing and advancing relfrozenxid
as different (though still closely related) concepts.

> * I'm still having a hard time with vacuum_freeze_strategy_threshold.
> Part of it is the name, which doesn't seem to convey the meaning.

I chose the name long ago, and never gave it terribly much thought.
I'm happy to go with whatever name you prefer.

> But the heuristic also seems off to me. What if you have lots of partitions
> in an append-only range-partitioned table? That would tend to use the
> lazy freezing strategy (because each partition is small), but that's
> not what you want. I understand heuristics aren't perfect, but it feels
> like we could do something better.

It is at least vastly superior to vacuum_freeze_min_age in cases like
this. Not that that's hard -- vacuum_freeze_min_age just doesn't ever
trigger freezing in any autovacuum given a table like pgbench_history
(barring during aggressive mode), due to how it interacts with the
visibility map. So we're practically guaranteed to do literally all
freezing for an append-only table in an aggressive mode VACUUM.

Worst of all, that happens on a timeline that has nothing to do with
the physical characteristics of the table itself (like the number of
unfrozen heap pages or something). In fact, it doesn't even have
anything to do with how many distinct XIDs modified that particular
table -- XID age works at the system level.

By working at the heap rel level (which means the partition level if
it's a partitioned table), and by being based on physical units (table
size), vacuum_freeze_strategy_threshold at least manages to limit the
accumulation of unfrozen heap pages in each individual relation. This
is the fundamental unit at which VACUUM operates. So even if you get
very unlucky and accumulate many unfrozen heap pages that happen to be
distributed across many different tables, you can at least vacuum each
table independently, and in parallel. The really big problems all seem
to involve a concentration of unfrozen pages in one particular table
(usually the events table, the largest table in the system by a couple
of orders of magnitude).

That said, I agree that the system-level picture of debt (the system
level view of the number of unfrozen heap pages) is relevant, and that
it isn't directly considered by the patch. I think that that can be
treated as work for a future release. In fact, I think that there is a
great deal that we could teach autovacuum.c about the system level
view of things -- this is only one example.

> Also, another purpose of this seems
> to be to achieve v15 behavior (if v16 behavior causes a problem for
> some workload), which seems like a good idea, but perhaps we should
> have a more direct setting for that?

Why, though? I think that it happens to make sense to do both with one
setting. Not because having one setting instead of two is better in
itself (though it is) -- just because it makes sense here, given these
specifics.

> * The comment above lazy_scan_strategy() is phrased in terms of the
> "traditional approach". It would be more clear if you described the
> current strategies and how they're chosen. The pre-16 behavior was as
> lazy as possible, so that's easy enough to describe without referring
> to history.

Agreed. Will fix.

> * "eager skipping behavior" seems like a weird phrasing because it's
> not immediately clear if that means "skip more pages" (eager to skip
> pages and lazy to process them) or "skip fewer pages" (lazy to skip the
> pages and eager to process the pages).

I agree that that's a problem. I'll try to come up with a terminology
that doesn't have this problem ahead of the next version.

> * The skipping behavior for all-visible pages is binary: skip them
> all, or skip none. That makes sense in the context of relfrozenxid
> advancement. But how does that avoid IO spikes? It would seem perfectly
> reasonable to me, if relfrozenxid advancement is not a pressing
> problem, to process some fraction of the all-visible pages (or perhaps
> process enough of them to freeze some fraction).

That's something that v9 will do, unlike earlier versions. So I agree.

In particular, we'll now start freezing eagerly before we switch over
to preferring to advance relfrozenxid for the same table. As I said in
my summary of v9 the other day, we "stagger" the point at which these
two behaviors are first applied, with the goal of smoothing the
transition. We try to disguise the fact that there are still two
different sets of behavior. We try to get the best of both worlds
(eager and lazy behaviors), without the user ever really noticing.

Don't forget that eager behavior with the visibility map is expected
to directly lead to freezing more pages (not a guarantee, but quite
likely). So while skipping strategy and freezing strategy are two
independent things, they're independent in name only, mechanically.
They are not independent things in any practical sense. (The
underlying reason why that is true is of course the same reason why
vacuum_freeze_min_age only really works as designed in aggressive mode
VACUUMs.)

> each VACUUM makes a payment on the deferred costs of freezing. I think
> this has already been discussed but it keeps reappearing in my mind, so
> maybe we can settle this with a comment (and/or docs)?

That said, I believe that we should always advance relfrozenxid in
tables that are already moderately sized -- a table that is already
big enough to be some small multiple of
vacuum_freeze_strategy_threshold should always take an eager approach
to advancing relfrozenxid. That is, I don't think that it makes sense
to pay the cost of freezing down incrementally given a moderately
large table.

Large tables and small tables are qualitatively different things, at
least from a VACUUM point of view. To some degree we can afford to be
wrong about small tables, because that won't cause us any serious
pain. This isn't really true with larger tables -- a VACUUM of a large
table is "too big to fail". Our working assumption for tables that are
still growing now, in the ongoing VACUUM, is that they will continue
to grow.

There is often one very large table, and by the time the next VACUUM
comes around, the table may have accumulated more unfrozen pages than
the entire rest of the database combined (I mean all of the rest of
the database, frozen and unfrozen pages alike). This may even be
common:

https://brandur.org/fragments/events

> * I'm wondering whether vacuum_freeze_min_age makes sense anymore. It
> doesn't take effect unless the page is not skipped, which is confusing
> from a usability standpoint, and we have better heuristics to decide if
> the whole page should be frozen or not anyway (i.e. if an FPI was
> already taken then freezing is cheaper).

I think that vacuum_freeze_min_age still has a role to play. The only
things that can trigger freezing during a VACUUM that opts to use the
lazy freezing strategy are the FPI-from-pruning trigger mechanism (new
to v9) and vacuum_freeze_min_age/FreezeLimit. So you cannot really have
a lazy strategy without vacuum_freeze_min_age. The original
vacuum_freeze_min_age design did make sense, at least
pre-visibility-map, because sometimes being lazy about freezing is the
best strategy. Especially with small, frequently updated tables like
most of the pgbench tables.
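
Roughly speaking, the trigger for freezing a given page under the lazy
strategy amounts to the following sketch (illustrative names only, not
the patch's actual code):

    static bool
    lazy_strategy_will_freeze_page(bool prune_emitted_fpi,
                                   TransactionId oldest_xid_on_page,
                                   TransactionId FreezeLimit)
    {
        if (prune_emitted_fpi)
            return true;    /* pruning already wrote an FPI; freezing is cheap */

        /* otherwise only vacuum_freeze_min_age (i.e. FreezeLimit) forces it */
        return TransactionIdPrecedes(oldest_xid_on_page, FreezeLimit);
    }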

There is nothing inherently wrong with deciding to freeze (or even to
wait for a cleanup lock) on the basis of a given XID's age. My problem
isn't with that behavior in general. It's with the fact that we use it
even when it's clearly inappropriate -- wildly inappropriate. We have
plenty of information that strongly hints at whether or not laziness
is a good idea. It's a good idea whenever laziness has a decent chance
of avoiding completely unnecessary work altogether, provided we can
afford to be wrong about that without having to pay too high a cost
later on, when we have to course correct. What this mostly boils down
to is this: lazy freezing is generally a good idea in small tables
only.

-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
John Naylor
Date:

On Tue, Dec 13, 2022 at 8:00 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Dec 12, 2022 at 3:47 PM Jeff Davis <pgsql@j-davis.com> wrote:
> > But the heuristic also seems off to me. What if you have lots of partitions
> > in an append-only range-partitioned table? That would tend to use the
> > lazy freezing strategy (because each partition is small), but that's
> > not what you want. I understand heuristics aren't perfect, but it feels
> > like we could do something better.
>
> It is at least vastly superior to vacuum_freeze_min_age in cases like
> this. Not that that's hard -- vacuum_freeze_min_age just doesn't ever
> trigger freezing in any autovacuum given a table like pgbench_history
> (barring during aggressive mode), due to how it interacts with the
> visibility map. So we're practically guaranteed to do literally all
> freezing for an append-only table in an aggressive mode VACUUM.
>
> Worst of all, that happens on a timeline that has nothing to do with
> the physical characteristics of the table itself (like the number of
> unfrozen heap pages or something).

If the number of unfrozen heap pages is the thing we care about, perhaps that, and not the total size of the table, should be the parameter that drives freezing strategy?

> That said, I agree that the system-level picture of debt (the system
> level view of the number of unfrozen heap pages) is relevant, and that
> it isn't directly considered by the patch. I think that that can be
> treated as work for a future release. In fact, I think that there is a
> great deal that we could teach autovacuum.c about the system level
> view of things -- this is only one.

It seems an easier path to considering system-level of debt (as measured by unfrozen heap pages) would be to start with considering table-level debt measured the same way.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Tue, Dec 13, 2022 at 12:29 AM John Naylor
<john.naylor@enterprisedb.com> wrote:
> If the number of unfrozen heap pages is the thing we care about, perhaps
> that, and not the total size of the table, should be the parameter that
> drives freezing strategy?

That's not the only thing we care about, though. And to the extent we
care about it, we mostly care about the consequences of either
freezing or not freezing eagerly. Concentration of unfrozen pages in
one particular table is a lot more of a concern than the same number
of heap pages being spread out across multiple tables. Those tables
can all be independently vacuumed, and come with their own
relfrozenxid, that can be advanced independently, and are very likely
to be frozen as part of a vacuum that needed to happen anyway.

Pages become frozen pages because VACUUM freezes those pages. Same
with all-visible pages, which could in principle have been made
all-frozen instead, had VACUUM opted to do it that way back when it
processed the page. So VACUUM is not a passive, neutral observer here.
What happens over time and across multiple VACUUM operations is very
relevant. VACUUM needs to pick up where it left off last time, at
least with larger tables, where the time between VACUUMs is naturally
very high, and where each individual VACUUM has to process a huge
number of individual pages. It's not really practical to take a "wait
and see" approach with big tables.

At the very least, a given VACUUM operation has to choose its freezing
strategy based on how it expects the table will look when it's done
vacuuming the table, and how that will impact the next VACUUM against
the same table. Without that, vacuuming an append-only table will
fall into a pattern of setting pages all-visible in one vacuum, and
then freezing those same pages all-frozen in the very next vacuum
because there are too many. Which makes little sense; we're far better
off freezing the pages at the earliest opportunity instead.

We're going to have to write a WAL record for the visibility map
anyway, so doing everything at the same time has a lot to recommend
it. Even if it turns out to be quite wrong, we may still come out
ahead in terms of absolute volume of WAL written, and especially in
terms of performance stability. To a limited extent we need to reason
about what will happen in the near future. But we also need to reason
about which kinds of mispredictions we cannot afford to make, and
which kinds are okay. Some mistakes hurt a lot more than others.

-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Tue, Dec 13, 2022 at 9:16 AM Peter Geoghegan <pg@bowt.ie> wrote:
> That's not the only thing we care about, though. And to the extent we
> care about it, we mostly care about the consequences of either
> freezing or not freezing eagerly. Concentration of unfrozen pages in
> one particular table is a lot more of a concern than the same number
> of heap pages being spread out across multiple tables. Those tables
> can all be independently vacuumed, and come with their own
> relfrozenxid, that can be advanced independently, and are very likely
> to be frozen as part of a vacuum that needed to happen anyway.

At the suggestion of Jeff, I wrote a Wiki page that shows motivating
examples for the patch series:

https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples

These are all cases where VACUUM currently doesn't do the right thing
around freezing, in a way that is greatly ameliorated by the patch.
Perhaps this will help other hackers to understand the motivation
behind some of these mechanisms. There are plenty of details that only
make sense in the context of a certain kind of table, with certain
performance characteristics that the design is sensitive to, and seeks
to take advantage of in one way or another.

-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Matthias van de Meent
Date:
On Wed, 14 Dec 2022 at 00:07, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Tue, Dec 13, 2022 at 9:16 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > That's not the only thing we care about, though. And to the extent we
> > care about it, we mostly care about the consequences of either
> > freezing or not freezing eagerly. Concentration of unfrozen pages in
> > one particular table is a lot more of a concern than the same number
> > of heap pages being spread out across multiple tables. Those tables
> > can all be independently vacuumed, and come with their own
> > relfrozenxid, that can be advanced independently, and are very likely
> > to be frozen as part of a vacuum that needed to happen anyway.
>
> At the suggestion of Jeff, I wrote a Wiki page that shows motivating
> examples for the patch series:
>
> https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples
>
> These are all cases where VACUUM currently doesn't do the right thing
> around freezing, in a way that is greatly ameliorated by the patch.
> Perhaps this will help other hackers to understand the motivation
> behind some of these mechanisms. There are plenty of details that only
> make sense in the context of a certain kind of table, with certain
> performance characteristics that the design is sensitive to, and seeks
> to take advantage of in one way or another.

In this mentioned wiki page, section "Simple append-only", the
following is written:

> Our "transition from lazy to eager strategies" concludes with an autovacuum that actually advanced relfrozenxid
eagerly:
>> automatic vacuum of table "regression.public.pgbench_history": index scans: 0
>> pages: 0 removed, 1078444 remain, 561143 scanned (52.03% of total)
>> [...]
>> frozen: 560841 pages from table (52.00% of total) had 88051825 tuples frozen
>> [...]
>> WAL usage: 1121683 records, 557662 full page images, 4632208091 bytes

I think that this 'transition from lazy to eager' could benefit from a
limit on how many all_visible blocks each autovacuum iteration can
freeze. This first run of (auto)vacuum after the 8GB threshold seems
to appear as a significant IO event (both in WAL and relation
read/write traffic) with 50% of the table updated and WAL-logged. I
think this should be limited to some degree, such as only freeze
all_visible blocks up to 10% of the table's blocks in eager vacuum, so
that the load is spread across a larger time frame and more VACUUM
runs.


Kind regards,

Matthias van de Meent.



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Thu, Dec 15, 2022 at 6:50 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> This first run of (auto)vacuum after the 8GB threshold seems
> to appear as a significant IO event (both in WAL and relation
> read/write traffic) with 50% of the table updated and WAL-logged. I
> think this should be limited to some degree, such as only freeze
> all_visible blocks up to 10% of the table's blocks in eager vacuum, so
> that the load is spread across a larger time frame and more VACUUM
> runs.

I agree that the burden of catch-up freezing is excessive here (in
fact I already wrote something to that effect on the wiki page). The
likely solution can be simple enough.

In v9 of the patch, we switch over to eager freezing when table size
crosses 4GB (since that is the value of the
vacuum_freeze_strategy_threshold GUC). The catch up freezing that you
draw attention to here occurs when table size exceeds 8GB, which is a
separate physical table size threshold that forces eager relfrozenxid
advancement. The second threshold is hard-coded to 2x the first one.
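
In code terms the v9 arrangement is roughly this (sketch only, with
illustrative names -- the patch expresses it differently):

    /* vacuum_freeze_strategy_threshold, expressed in heap blocks */
    bool    eager_freezing = rel_pages >= strategy_threshold_pages;        /* 4GB */
    bool    forced_eager_scan = rel_pages >= 2 * strategy_threshold_pages; /* 8GB */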

I think that this issue can be addressed by making the second
threshold 4x or even 8x vacuum_freeze_strategy_threshold, not just 2x.
That would mean that we'd have to freeze just as many pages whenever
we did the catch-up freezing -- so no change in the added *absolute*
cost of freezing. But, the *relative* cost would be much lower, simply
because catch-up freezing would take place when the table was much
larger. So it would be a lot less noticeable.

Note that we might never reach the second table size threshold before
we must advance relfrozenxid, in any case. The catch-up freezing might
actually take place because table age created pressure to advance
relfrozenxid. It's useful to have a purely physical/table-size
threshold like this, especially in bulk loading scenarios. But it's
not like table age doesn't have any influence at all, anymore. The
cost model weighs physical units/costs as well as table age, and in
general the most likely trigger for advancing relfrozenxid is usually
some combination of the two, not any one factor on its own [1].

[1]
https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Opportunistically_advancing_relfrozenxid_with_bursty.2C_real-world_workloads
-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Justin Pryzby
Date:
The patches (003 and 005) are missing a word
should use to decide whether to its eager freezing strategy.

On the wiki, missing a word:
builds on related added



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Thu, Dec 15, 2022 at 11:11 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> The patches (003 and 005) are missing a word
> should use to decide whether to its eager freezing strategy.

I mangled this during rebasing for v9, which reordered the commits.
Will be fixed in v10.

> On the wiki, missing a word:
> builds on related added

Fixed.

Thanks
-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
John Naylor
Date:

On Wed, Dec 14, 2022 at 6:07 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> At the suggestion of Jeff, I wrote a Wiki page that shows motivating
> examples for the patch series:
>
> https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples
>
> These are all cases where VACUUM currently doesn't do the right thing
> around freezing, in a way that is greatly ameliorated by the patch.
> Perhaps this will help other hackers to understand the motivation
> behind some of these mechanisms. There are plenty of details that only
> make sense in the context of a certain kind of table, with certain
> performance characteristics that the design is sensitive to, and seeks
> to take advantage of in one way or another.

Thanks for this. This is the kind of concrete, data-based evidence that I find much more convincing, or at least easy to reason about. I'd actually recommend in the future to open discussion with this kind of analysis -- even before coding, it's possible to indicate what a design is *intended* to achieve. And reviewers can likewise bring up cases of their own in a concrete fashion.

On Wed, Dec 14, 2022 at 12:16 AM Peter Geoghegan <pg@bowt.ie> wrote:

> At the very least, a given VACUUM operation has to choose its freezing
> strategy based on how it expects the table will look when it's done
> vacuuming the table, and how that will impact the next VACUUM against
> the same table. Without that, then vacuuming an append-only table will
> fall into a pattern of setting pages all-visible in one vacuum, and
> then freezing those same pages all-frozen in the very next vacuum
> because there are too many. Which makes little sense; we're far better
> off freezing the pages at the earliest opportunity instead.

That makes sense, but I wonder if we can actually be more specific: One motivating example mentioned is the append-only table. If we detected that case, which I assume we can because autovacuum_vacuum_insert_* GUCs exist, we could use that information as one way to drive eager freezing independently of size. At least in theory -- it's very possible size will be a necessary part of the decision, but it's less clear that it's as useful as a user-tunable knob.

If we then ignored the append-only case when evaluating a freezing policy, maybe other ideas will fall out. I don't have a well-thought out idea about policy or knobs, but it's worth thinking about.

Aside from that, I've only given the patches a brief reading. Having seen the VM snapshot in practice (under "Scanned pages, visibility map snapshot" in the wiki page), it's neat to see fewer pages being scanned. Prefetching not only seems superior to SKIP_PAGES_THRESHOLD, but anticipates asynchronous IO. Keeping only one VM snapshot page in memory makes perfect sense.

I do have a cosmetic, but broad-reaching, nitpick about terms regarding "skipping strategy". That's phrased as a kind of negative -- what we're *not* doing. Many times I had to pause and compute in my head what we're *doing*, i.e. the "scanning strategy". For example, I wonder if the VM strategies would be easier to read as:

VMSNAP_SKIP_ALL_VISIBLE -> VMSNAP_SCAN_LAZY
VMSNAP_SKIP_ALL_FROZEN -> VMSNAP_SCAN_EAGER
VMSNAP_SKIP_NONE -> VMSNAP_SCAN_ALL

Notice here they're listed in order of increasing eagerness. 

--
John Naylor
EDB: http://www.enterprisedb.com

Re: New strategies for freezing, advancing relfrozenxid early

From
Nikita Malakhov
Date:
Hi!

I've found this discussion very interesting, since vacuuming TOAST
tables is always a problem: they tend to bloat very quickly with dead
data. Just to remind, all TOAST-able columns of a relation use the same
TOAST table (there is one per relation), and TOASTed data is never
updated -- there are only insert and delete operations.

Have you tested it with large and constantly used TOAST tables?
How would it work with the current TOAST implementation?

We propose a different approach to the TOAST mechanics [1],
and a new vacuum would be very promising.

Thank you!


--
Regards,
Nikita Malakhov
Postgres Professional 

Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Thu, Dec 15, 2022 at 11:48 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
> Thanks for this. This is the kind of concrete, data-based evidence that I
> find much more convincing, or at least easy to reason about.

I'm glad to hear that it helped. It's always difficult to judge where
other people are coming from, especially when it's not clear how much
context is shared. Face time would have helped here, too.

> One motivating example mentioned is the append-only table. If we detected
> that case, which I assume we can because autovacuum_vacuum_insert_* GUCs
> exist, we could use that information as one way to drive eager freezing
> independently of size. At least in theory -- it's very possible size will
> be a necessary part of the decision, but it's less clear that it's as
> useful as a user-tunable knob.

I am not strongly opposed to that idea, though I have my doubts about
it. I have thought about it already, and it wouldn't be hard to get
the information to vacuumlazy.c (I plan on doing it as part of related
work on antiwraparound autovacuum, in fact [1]). I'm skeptical of the
general idea that autovacuum.c has enough reliable information to give
detailed recommendations as to how vacuumlazy.c should process the
table.

I have pointed out several major flaws with the autovacuum.c dead
tuple accounting in the past [2][3], but I also think that there are
significant problems with the tuples inserted accounting. Basically, I
think that there are effects which are arguably an example of the
inspection paradox [4]. Insert-based autovacuums occur on a timeline
determined by the "inserted since last autovacuum" statistics. These
statistics are (in part) maintained by autovacuum/VACUUM itself. Which
has no specific understanding of how it might end up chasing its own
tail.

Let me be more concrete about what I mean about autovacuum chasing its
own tail. The autovacuum_vacuum_insert_threshold mechanism works by
triggering an autovacuum whenever the number of tuples inserted since
the last autovacuum/VACUUM reaches a certain threshold -- usually some
fixed proportion of pg_class.reltuples. But the
tuples-inserted-since-last-VACUUM counter gets reset at the end of
VACUUM, not at the start. Whereas VACUUM itself processes only the
subset of pages that needed to be vacuumed at the start of the VACUUM.
There is no attempt to compensate for that disparity. This *isn't*
really a measure of "unvacuumed tuples" (you'd need to compensate to
get that).
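
To put toy numbers on the accounting gap (a standalone illustration,
nothing to do with the real statistics code):

    #include <stdio.h>

    int main(void)
    {
        long inserted_since_vacuum = 1000000;  /* reached the trigger threshold */
        long inserted_during_vacuum = 800000;  /* arrives while the VACUUM runs */

        /* VACUUM only processes pages as they were at its start ... */
        long left_behind = inserted_during_vacuum;

        /* ... yet the counter is reset to zero at the end, so those
         * 800000 tuples no longer count toward the next autovacuum. */
        inserted_since_vacuum = 0;

        printf("counter after VACUUM: %ld\n", inserted_since_vacuum);
        printf("tuples left unvacuumed but uncounted: %ld\n", left_behind);
        return 0;
    }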

This "at the start vs at the end" difference won't matter at all with
smaller tables. And even in larger tables we might hope that the
effect would kind of average out. But what about cases where one
particular VACUUM operation takes an unusually long time, out of a
sequence of successive VACUUMs that run against the same table? For
example, the sequence that you see on the Wiki page, when Postgres
HEAD autovacuum does an aggressive VACUUM on one occasion, which takes
dramatically longer [5].

Notice that the sequence in [5] shows that the patch does one more
autovacuum operation in total, compared to HEAD/master. That's a lot
more -- we're talking about VACUUMs that each take 40+ minutes. That
can be explained by the fact that VACUUM (quite naturally) resets the
"tuples inserted since last VACUUM" at the end of that unusually long
running aggressive autovacuum -- just like any other VACUUM would.
That seems very weird to me. If (say) we happened to have a much
higher vacuum_freeze_table_age setting, then we wouldn't have had an
aggressive VACUUM until much later on (or never, because the benchmark
would just end). And the VACUUM that was aggressive would have been a
regular VACUUM instead, and would therefore have completed far sooner,
and would therefore have had a *totally* different cadence, compared
to what we actually saw -- it becomes distorted in a way that outlasts
the aggressive VACUUM.

With a far higher vacuum_freeze_table_age, we might have even managed
to do two regular autovacuums in the same period that it took a single
aggressive VACUUM to run in (that's not too far from what actually
happened with the patch). The *second* regular autovacuum would then
end up resetting the "inserted since last VACUUM" counter to 0 at the
same time as the long running aggressive VACUUM actually did so (same
wall clock time, same time since the start of the benchmark). Notice
that we'll have done much less useful work (on cleaning up bloat and
setting newer pages all-visible) with the "one long aggressive mode
VACUUM" setup/scenario -- we'll be way behind -- but the statistics
will nevertheless look about the same as they do in the "two fast
autovacuums instead of one slow autovacuum" counterfactual scenario.

In short, autovacuum.c fails to appreciate that a lot of stuff about
the table changes when VACUUM runs. Time hasn't stood still -- the
table was modified and extended throughout. So autovacuum.c hasn't
compensated for how VACUUM actually performed, and, in effect, forgets
how far it has fallen behind. It should be eager to start the next
autovacuum very quickly, having fallen behind, but it isn't eager.
This is all the more reason to get rid of aggressive mode, but that's
not my point -- my point is that the statistics driving things seem
quite dubious, in all sorts of ways.

> Aside from that, I've only given the patches a brief reading.

Thanks for taking a look.

> Having seen the VM snapshot in practice (under "Scanned pages, visibility
> map snapshot" in the wiki page), it's neat to see fewer pages being
> scanned. Prefetching not only seems superior to SKIP_PAGES_THRESHOLD, but
> anticipates asynchronous IO.

All of that is true, but more than anything else the VM snapshot
concept appeals to me because it seems to make VACUUMs of large tables
more similar to VACUUMs of small tables. Particularly when one
individual VACUUM happens to take an unusually long amount of time,
for whatever reason (best example right now is aggressive mode, but
there are other ways in which VACUUM can take far longer than
expected). That approach seems much more logical. I also think that
it'll make it easier to teach VACUUM to "pick up where the last VACUUM
left off" in the future.

I understand why you haven't seriously investigated using the same
information for the Radix tree dead_items project. I certainly don't
object. But I still think that having one integrated data structure
(VM snapshots + dead_items) is worth exploring in the future. It's
something that I think is quite promising.

> I do have a cosmetic, but broad-reaching, nitpick about terms regarding
> "skipping strategy". That's phrased as a kind of negative -- what we're
> *not* doing. Many times I had to pause and compute in my head what we're
> *doing*, i.e. the "scanning strategy". For example, I wonder if the VM
> strategies would be easier to read as:
>
> VMSNAP_SKIP_ALL_VISIBLE -> VMSNAP_SCAN_LAZY
> VMSNAP_SKIP_ALL_FROZEN -> VMSNAP_SCAN_EAGER
> VMSNAP_SKIP_NONE -> VMSNAP_SCAN_ALL
>
> Notice here they're listed in order of increasing eagerness.

I agree that the terminology around skipping strategies is confusing,
and plan to address that in the next version. I'll consider using this
scheme for v10.

[1] https://commitfest.postgresql.org/41/4027/
[2] https://postgr.es/m/CAH2-Wz=MGFwJEpEjVzXwEjY5yx=UuNPzA6Bt4DSMasrGLUq9YA@mail.gmail.com
[3] https://postgr.es/m/CAH2-WznrZC-oHkB+QZQS65o+8_Jtj6RXadjh+8EBqjrD1f8FQQ@mail.gmail.com
[4] https://towardsdatascience.com/the-inspection-paradox-is-everywhere-2ef1c2e9d709
[5]
https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Scanned_pages.2C_visibility_map_snapshot
--
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Thu, Dec 15, 2022 at 11:59 PM Nikita Malakhov <hukutoc@gmail.com> wrote:
> I've found this discussion very interesting, in view of vacuuming
> TOAST tables is always a problem because these tables tend to
> bloat very quickly with dead data - just to remind, all TOAST-able
> columns of the relation use the same TOAST table which is one
> for the relation, and TOASTed data are not updated - there are
> only insert and delete operations.

I don't think that it would be any different to any other table that
happened to have lots of inserts and deletes, such as the table
described here:

https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Mixed_inserts_and_deletes

In the real world, a table like this would probably consist of some
completely static data, combined with other data that is constantly
deleted and re-inserted -- probably only a small fraction of the table
at any one time. I would expect such a table to work quite well,
because the static pages would all become frozen (at least after a
while), leaving behind only the tuples that are deleted quickly, most
of the time. VACUUM would have a decent chance of noticing that it
will be cheap to advance relfrozenxid in earlier VACUUM operations, as
bloat is cleaned up -- even a VACUUM that happens long before the
point that autovacuum.c will launch an antiwraparound autovacuum has a
decent chance of it. That's not a new idea, really; the
pgbench_branches example from the Wiki page looks like that already,
and even works on Postgres 15.

Here is the part that's new: the pressure to advance relfrozenxid
grows gradually, as table age grows. If table age is still very young,
then we'll only do it if the number of "extra" scanned pages is < 5%
of rel_pages -- only when the added cost is very low (again, like the
pgbench_branches example, mostly). Once table age gets about halfway
towards the point that antiwraparound autovacuuming is required,
VACUUM then starts caring less about costs. It gradually worries less
about the costs, and more about the need to advance it. Ideally it
will happen before antiwraparound autovacuum is actually required.
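
Concretely, the rule has roughly this shape (a sketch under my own
assumptions about the curve, not the patch's exact cost model):

    /*
     * Tolerated fraction of "extra" scanned all-visible pages starts at 5%
     * of rel_pages, and grows once table age passes the halfway point
     * towards antiwraparound.
     */
    static bool
    advance_relfrozenxid_now(double rel_pages, double extra_scanned_pages,
                             double age_frac) /* age / autovacuum_freeze_max_age */
    {
        double  allowed_frac = 0.05;

        if (age_frac >= 1.0)
            return true;                /* antiwraparound: no choice left */
        if (age_frac > 0.5)             /* halfway there: care less about cost */
            allowed_frac += (age_frac - 0.5) * 2.0 * (1.0 - allowed_frac);

        return extra_scanned_pages < rel_pages * allowed_frac;
    }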

I'm not sure how much this would help with bloat. I suspect that it
could make a big difference with the right workload. If you always
need frequent autovacuums, just to deal with bloat, then there is
never a good time to run an aggressive antiwraparound autovacuum. An
aggressive AV will probably end up taking much longer than the typical
autovacuum that deals with bloat. While the aggressive AV will remove
as much bloat as any other AV, in theory, that might not help much. If
the aggressive AV takes as long as (say) 5 regular autovacuums would
have taken, and if you really needed those 5 separate autovacuums to
run, just to deal with the bloat, then that's a real problem.  The
aggressive AV effectively causes bloat with such a workload.

-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Thu, Dec 15, 2022 at 10:53 AM Peter Geoghegan <pg@bowt.ie> wrote:
> I agree that the burden of catch-up freezing is excessive here (in
> fact I already wrote something to that effect on the wiki page). The
> likely solution can be simple enough.

Attached is v10, which fixes this issue, but using a different
approach to the one I sketched here.

This revision also changes the terminology around VM skipping: we now
call the strategies there "scanning strategies", per feedback from
Jeff and John. This does seem a lot clearer.

Also cleaned up the docs a little bit, which were messed up by a
rebasing issue in v9.

I ended up fixing the aforementioned "too much catch-up freezing"
issue by just getting rid of the whole concept of a second table-size
threshold that forces the eager scanning strategy. I now believe that
it's fine to just rely on the generic logic that determines scanning
strategy based on a combination of table age and the added cost of
eager scanning. It'll work in a way that doesn't result in too much of
a freezing spike during any one VACUUM operation, without waiting
until an antiwraparound autovacuum to advance relfrozenxid (it'll
happen far earlier than that, though still quite a lot later than what
you'd see with v9, so as to avoid that big spike in freezing that was
possible in pgbench_history-like tables [1]).

This means that vacuum_freeze_strategy_threshold is now strictly
concerned with freezing. A table that is always frozen eagerly will
inevitably fall into a pattern of advancing relfrozenxid in every
VACUUM operation, but that isn't something that needs to be documented
or anything. We don't need to introduce a special case here.

The other notable change for v10 is in the final patch, which removes
aggressive mode altogether. v10 now makes lazy_scan_noprune less
willing to give up on setting relfrozenxid to a relatively recent XID.
Now lazy_scan_noprune is willing to wait a short while for a cleanup
lock on a heap page (a few tens of milliseconds) when doing so might
be all it takes to preserve VACUUM's ability to advance relfrozenxid
all the way up to FreezeLimit, which is the traditional guarantee made
by aggressive mode VACUUM.

This makes lazy_scan_noprune "under promise and over deliver". It now
only promises to advance relfrozenxid up to MinXid in the very worst
case -- even if that means waiting indefinitely long for a cleanup
lock. That's not a very strong promise, because advancing relfrozenxid
up to MinXid is only barely adequate. At the same time,
lazy_scan_noprune is willing to go to extra trouble to get a final
relfrozenxid as recent as FreezeLimit -- it'll wait for a few tens of
milliseconds.
It's just not willing to wait indefinitely. This seems likely to give us the
best of both worlds.
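
The mechanism can be as simple as this sketch (not the patch's actual
code; it assumes the buffer is already pinned and not content-locked):

    static bool
    cleanup_lock_with_short_wait(Buffer buf)
    {
        for (int i = 0; i < 10; i++)
        {
            if (ConditionalLockBufferForCleanup(buf))
                return true;        /* got the cleanup lock after all */
            pg_usleep(5000L);       /* 5ms per attempt, ~50ms in total */
        }
        return false;               /* caller falls back to the no-cleanup-lock path */
    }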

This was based in part on something that Andres said about cleanup
locks a while back. He had a concern about cases where even MinXid was
before OldestXmin. To some degree that's addressed here, because I've
also changed the way that MinXid is determined, so that it'll be a
much earlier value. That doesn't have much downside now, because of the
way that lazy_scan_noprune is now "aggressive-ish" when that happens to
make sense.

Not being able to get a cleanup lock on our first attempt is relatively
rare, and when it happens it's often something completely benign. For
example, it might just be that the checkpointer was writing out the
same page at the time, which signifies nothing about it really being
hard to get a cleanup lock -- the checkpointer will have dropped its
conflicting buffer pin almost immediately. It would be a shame to
accept a significantly older final relfrozenxid during an infrequent,
long running antiwraparound autovacuum of larger tables when that
happens -- we should be willing to wait 30 milliseconds (just not 30
minutes, or 30 days).

None of this even comes up for pages whose XIDs are >= FreezeLimit,
which is actually most pages with the patch, even in larger tables.
It's relatively rare for VACUUM to need to process any heap page in
lazy_scan_noprune, but it'll be much rarer still for it to have to do
a "short wait" like this. So "short waits" have a very small downside,
and (at least occasionally) a huge upside.

By inventing a third alternative behavior (to go along with processing
pages via standard lazy_scan_noprune skipping and processing pages in
lazy_scan_prune), VACUUM has the flexibility to respond in a way
that's proportionate to the problem at hand, in one particular heap
page. The new behavior has zero chance of mattering in most individual
tables/workloads, but it's good to have every possible eventuality
covered. I really hate the idea of getting a significantly worse
outcome just because of something that happened in one single heap
page, because the wind changed directions at the wrong time.

[1] https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch

--
Peter Geoghegan

Attachment

Re: New strategies for freezing, advancing relfrozenxid early

From
Jeff Davis
Date:
On Sun, 2022-12-18 at 14:20 -0800, Peter Geoghegan wrote:
> Attached is v10, which fixes this issue, but using a different
> approach to the one I sketched here.

In 0001, it's a fairly straightforward rearrangement and looks like an
improvement to me. I have a few complaints, but they are about pre-
existing code that you moved around, and I like that you didn't
editorialize too much while just moving code around. +1 from me.


--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: New strategies for freezing, advancing relfrozenxid early

From
Nikita Malakhov
Date:
Hi!

I'll try to apply this patch onto my branch with Pluggable TOAST to test
these mechanics with the new TOAST, and will report back with the
results. It could be difficult though, because both have a lot of
changes that affect the same code.

>I'm not sure how much this would help with bloat. I suspect that it
>could make a big difference with the right workload. If you always
>need frequent autovacuums, just to deal with bloat, then there is
>never a good time to run an aggressive antiwraparound autovacuum. An
>aggressive AV will probably end up taking much longer than the typical
>autovacuum that deals with bloat. While the aggressive AV will remove
>as much bloat as any other AV, in theory, that might not help much. If
>the aggressive AV takes as long as (say) 5 regular autovacuums would
>have taken, and if you really needed those 5 separate autovacuums to
>run, just to deal with the bloat, then that's a real problem.  The
>aggressive AV effectively causes bloat with such a workload.



--
Regards,
Nikita Malakhov
Postgres Professional 

Re: New strategies for freezing, advancing relfrozenxid early

From
Jeff Davis
Date:
On Sun, 2022-12-18 at 14:20 -0800, Peter Geoghegan wrote:
> On Thu, Dec 15, 2022 at 10:53 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > I agree that the burden of catch-up freezing is excessive here (in
> > fact I already wrote something to that effect on the wiki page).
> > The
> > likely solution can be simple enough.
>
> Attached is v10, which fixes this issue, but using a different
> approach to the one I sketched here.

Comments on 0002:

Can you explain the following portion of the diff:


  - else if (MultiXactIdPrecedes(multi, cutoffs->MultiXactCutoff))
  + else if (MultiXactIdPrecedes(multi, cutoffs->OldestMxact))

  ...

  + /* Can't violate the MultiXactCutoff invariant, either */
  + if (!need_replace)
  +     need_replace = MultiXactIdPrecedes(
  +        multi, cutoffs->MultiXactCutoff);

Regarding correctness, it seems like the basic structure and invariants
are the same, and it builds on the changes already in 9e5405993c. Patch
0002 seems *mostly* about making choices within the existing framework.
That gives me more confidence.

That being said, it does push harder against the limits on both sides.
If I understand correctly, that means pages with wider distributions of
xids are going to persist longer, which could expose pre-existing bugs
in new and interesting ways.

Next, the 'freeze_required' field suggests that it's more involved in
the control flow that causes freezing than it actually is. All it does
is communicate how the trackers need to be adjusted. The return value
of heap_prepare_freeze_tuple() (and underneath, the flags set by
FreezeMultiXactId()) are what actually control what happens. It would
be nice to make this more clear somehow.

The comment:

  /*
   * If we freeze xmax, make absolutely sure that it's not an XID that
   * is important.  (Note, a lock-only xmax can be removed independent
   * of committedness, since a committed lock holder has released the
   * lock).
   */

caused me to go down a rabbit hole looking for edge cases where we
might want to freeze an xmax but not an xmin; e.g. tup.xmax <
OldestXmin < tup.xmin or the related case where tup.xmax < RecentXmin <
tup.xmin. I didn't find a problem, so that's good news.

I also tried some pgbench activity with concurrent vacuums (and vacuum
freezes), along with periodic verify_heapam(). No problems there.
 
Did you already describe the testing you've done for 0001+0002
specifically? It's not radically new logic, but it would be good to try
to catch minor state-handling errors.


--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Tue, Dec 20, 2022 at 5:44 PM Jeff Davis <pgsql@j-davis.com> wrote:
> Comments on 0002:
>
> Can you explain the following portion of the diff:
>
>
>   - else if (MultiXactIdPrecedes(multi, cutoffs->MultiXactCutoff))
>   + else if (MultiXactIdPrecedes(multi, cutoffs->OldestMxact))
>
>   ...
>
>   + /* Can't violate the MultiXactCutoff invariant, either */
>   + if (!need_replace)
>   +     need_replace = MultiXactIdPrecedes(
>   +        multi, cutoffs->MultiXactCutoff);

Don't forget the historical context: before Postgres 15's commit
0b018fab, VACUUM's final relfrozenxid always came from FreezeLimit.
Almost all of this code predates that work. So the general idea that
you can make a "should I freeze or should I ratchet back my
relfrozenxid tracker instead?" trade-off at the level of individual
tuples and pages is still a very new one. Right now it's only applied
within lazy_scan_noprune(), but 0002 leverages the same principles
here.

Before now, these heapam.c freezing routines had cutoffs called
cutoff_xid and cutoff_multi. These had values that actually came from
vacuumlazy.c's FreezeLimit and MultiXactCutoff cutoffs (which was
rather unclear). But cutoff_xid and cutoff_multi were *also* used as
inexact proxies for OldestXmin and OldestMxact (also kind of unclear,
but true). For example, there are some sanity checks in heapam.c that
kind of pretend that cutoff_xid is OldestXmin, even though it usually
isn't the same value (it can be, but only during VACUUM FREEZE, or
when the min freeze age is 0 in some other way).

So 0002 teaches the same heapam.c code about everything -- about all
of the different cutoffs, and about the true requirements of VACUUM
around relfrozenxid advancement. In fact, 0002 makes vacuumlazy.c cede
a lot of control of "XID stuff" to the same heapam.c code, freeing it
up to think about freezing as something that works at the level of
physical pages. This is key to allowing vacuumlazy.c to reason about
freezing at the level of the whole table. It thinks about physical
blocks, leaving logical XIDs up to heapam.c code.

This business that you asked about in FreezeMultiXactId() is needed so
that we can allow vacuumlazy.c to "think in terms of physical pages",
while at the same time avoiding allocating new Multis in VACUUM --
which requires "thinking about individual xmax fields" instead -- a
somewhat conflicting goal. We're really trying to have it both ways
(we get page-level freezing, with a little tuple level freezing on the
side, sufficient to avoid allocating new Multis during VACUUMs in
roughly the same way as we do right now).

In most cases "freezing a page" removes all XIDs < OldestXmin, and all
MXIDs < OldestMxact. It doesn't quite work that way in certain rare
cases involving MultiXacts, though. It is convenient to define "freeze
the page" in a way that gives heapam.c's FreezeMultiXactId() the
leeway to put off the work of processing an individual tuple's xmax,
whenever it happens to be a MultiXactId that would require an
expensive second pass to process aggressively (allocating a new Multi
during VACUUM is especially worth avoiding here).

Our definition of "freeze the page" is a bit creative, at least if
you're used to thinking about it in terms of strict XID-wise cutoffs
like OldestXmin/FreezeLimit. But even if you do think of it in terms
of XIDs, the practical difference is extremely small in practice.

FreezeMultiXactId() effectively makes a decision on how to proceed
with processing at the level of each individual xmax field.  Its no-op
multi processing "freezes" an xmax in the event of a costly-to-process
xmax on a page when (for whatever reason) page-level freezing is
triggered. If, on the other hand, page-level freezing isn't triggered
for the page, then page-level no-op processing takes care of the multi
for us instead. Either way, the remaining Multi will ratchet back
VACUUM's relfrozenxid and/or relminmxid trackers as required, and we
won't need an expensive second pass over the multi (unless we really
have no choice, for example during a VACUUM FREEZE, where
OldestXmin==FreezeLimit).

> Regarding correctness, it seems like the basic structure and invariants
> are the same, and it builds on the changes already in 9e5405993c. Patch
> 0002 seems *mostly* about making choices within the existing framework.
> That gives me more confidence.

You're right that it's the same basic invariants as before, of course.
Turns out that those invariants can be pushed quite far.

Though note that I kind of invented a new invariant (not really, sort
of). Well, it's a postcondition, which is a sort of invariant: any
scanned heap page that can be cleanup locked must never have any
remaining XIDs < FreezeLimit, nor can any MXIDs < MultiXactCutoff
remain. But a cleanup-locked page does *not* need to get rid of all
XIDs < OldestXmin, nor MXIDs < OldestMxact.

This flexibility is mostly useful because it allows lazy_scan_prune to
just decide to not freeze. But, to a much lesser degree, it's useful
because of the edge case with multis -- in general we might just need
the same leeway when lazy_scan_prune "freezes the page".
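
Spelled out as code, the postcondition amounts to something like the
following (just my shorthand for the invariant, not code from the
patch; "xid" and "mxid" stand for whatever XIDs/MXIDs remain in tuple
headers on the page once lazy_scan_prune is done with it):

```
/* Holds for every scanned heap page that we managed to cleanup lock: */
Assert(!TransactionIdPrecedes(xid, cutoffs->FreezeLimit));
Assert(!MultiXactIdPrecedes(mxid, cutoffs->MultiXactCutoff));

/*
 * ...but nothing requires xid >= OldestXmin or mxid >= OldestMxact.
 * Values in [FreezeLimit, OldestXmin) and [MultiXactCutoff, OldestMxact)
 * may legitimately remain, whether or not we "froze the page".
 */
```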

> That being said, it does push harder against the limits on both sides.
> If I understand correctly, that means pages with wider distributions of
> xids are going to persist longer, which could expose pre-existing bugs
> in new and interesting ways.

I don't think it's fundamentally different to what we're already doing
in lazy_scan_noprune. It's just more complicated, because you have to
tease apart slightly different definitions of freezing to understand
code around FreezeMultiXactId(). This is more or less needed to
provide maximum flexibility, where we delay decisions about what to do
until the very last moment.

> Next, the 'freeze_required' field suggests that it's more involved in
> the control flow that causes freezing than it actually is. All it does
> is communicate how the trackers need to be adjusted. The return value
> of heap_prepare_freeze_tuple() (and underneath, the flags set by
> FreezeMultiXactId()) are what actually control what happens. It would
> be nice to make this more clear somehow.

I'm not sure what you mean. Page-level freezing *doesn't* have to go
ahead when freeze_required is not ever set to true for any tuple on
the page (which is most of the time, in practice). lazy_scan_prune
gets to make a choice about freezing the page, when the choice is
available.

Note also that the FRM_NOOP case happens when a call to
FreezeMultiXactId() takes place that won't leave behind a freeze plan
for the tuple (unless its xmin happens to necessitate a freeze plan
for the same tuple). And yet, it will do useful work, needed iff the
"freeze the page" path is ultimately taken by lazy_scan_prune --
FreezeMultiXactId() itself will ratchet back
FreezePageRelfrozenXid/NewRelfrozenXid as needed to make everything
safe.
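
To make that concrete, the FRM_NOOP path ends up looking roughly like
this (paraphrased from memory, not the patch verbatim):

```
need_replace = false;
FreezePageRelfrozenXid = pagefrz->FreezePageRelfrozenXid;
for (int i = 0; i < nmembers; i++)
{
    TransactionId xid = members[i].xid;

    /* Can't use FRM_NOOP if any member violates the FreezeLimit invariant */
    if (TransactionIdPrecedes(xid, cutoffs->FreezeLimit))
    {
        need_replace = true;
        break;
    }
    if (TransactionIdPrecedes(xid, FreezePageRelfrozenXid))
        FreezePageRelfrozenXid = xid;
}

if (!need_replace && !MultiXactIdPrecedes(multi, cutoffs->MultiXactCutoff))
{
    /*
     * FRM_NOOP: leave xmax alone, and produce no freeze plan for it --
     * but ratchet the freeze-path trackers back so that they still
     * cover the multi and all of its member XIDs
     */
    *flags |= FRM_NOOP;
    pagefrz->FreezePageRelfrozenXid = FreezePageRelfrozenXid;
    if (MultiXactIdPrecedes(multi, pagefrz->FreezePageRelminMxid))
        pagefrz->FreezePageRelminMxid = multi;
    return multi;
}
```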

> The comment:
>
>   /*
>    * If we freeze xmax, make absolutely sure that it's not an XID that
>    * is important.  (Note, a lock-only xmax can be removed independent
>    * of committedness, since a committed lock holder has released the
>    * lock).
>    */
>
> caused me to go down a rabbit hole looking for edge cases where we
> might want to freeze an xmax but not an xmin; e.g. tup.xmax <
> OldestXmin < tup.xmin or the related case where tup.xmax < RecentXmin <
> tup.xmin. I didn't find a problem, so that's good news.

This is an example of what I meant about the heapam.c code using a
cutoff that actually comes from FreezeLimit, when it would be more
sensible to use OldestXmin instead.

> I also tried some pgbench activity along with concurrent vacuums (and
> vacuum freezes) along with periodic verify_heapam(). No problems there.
>
> Did you already describe the testing you've done for 0001+0002
> specfiically? It's not radically new logic, but it would be good to try
> to catch minor state-handling errors.

Lots of stuff with contrib/amcheck, which, as you must already know,
will notice when an XID/MXID is contained in a table whose
relfrozenxid and/or relminmxid indicates that it shouldn't be there.
(Though VACUUM itself does the same thing, albeit not as effectively.)

Obviously the invariants haven't changed here. In many ways it's a
very small set of changes. But in one or two ways it's a significant
shift. It depends on how you think about it.

--
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Tue, Dec 20, 2022 at 7:15 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Dec 20, 2022 at 5:44 PM Jeff Davis <pgsql@j-davis.com> wrote:
> > Next, the 'freeze_required' field suggests that it's more involved in
> > the control flow that causes freezing than it actually is. All it does
> > is communicate how the trackers need to be adjusted. The return value
> > of heap_prepare_freeze_tuple() (and underneath, the flags set by
> > FreezeMultiXactId()) are what actually control what happens. It would
> > be nice to make this more clear somehow.
>
> I'm not sure what you mean. Page-level freezing *doesn't* have to go
> ahead when freeze_required is not ever set to true for any tuple on
> the page (which is most of the time, in practice). lazy_scan_prune
> gets to make a choice about freezing the page, when the choice is
> available.

Oh wait, I think I see the point of confusion now.

When freeze_required is set to true, that means that lazy_scan_prune
literally has no choice -- it simply must freeze the page as
instructed by heap_prepare_freeze_tuple/FreezeMultiXactId. It's not
just a strong suggestion -- it's crucial that lazy_scan_prune freezes
the page as instructed.

The "no freeze" trackers (HeapPageFreeze.NoFreezePageRelfrozenXid and
HeapPageFreeze.NoFreezePageRelminMxid) won't have been maintained
properly when freeze_required was set, so lazy_scan_prune can't expect
to use them -- doing so would lead to VACUUM setting incorrect values
in pg_class later on.
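
In terms of the structure of lazy_scan_prune, the rule is roughly this
(a simplified sketch, not the actual code; "wants_to_freeze" is a
stand-in for whatever criteria lazy_scan_prune applies when it does
have a choice):

```
if (pagefrz.freeze_required || wants_to_freeze)
{
    /* Freeze the page; only the freeze-path trackers are valid here */
    vacrel->NewRelfrozenXid = pagefrz.FreezePageRelfrozenXid;
    vacrel->NewRelminMxid = pagefrz.FreezePageRelminMxid;
}
else
{
    /* Don't freeze; the "no freeze" trackers were maintained, so use them */
    vacrel->NewRelfrozenXid = pagefrz.NoFreezePageRelfrozenXid;
    vacrel->NewRelminMxid = pagefrz.NoFreezePageRelminMxid;
}
```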

Avoiding the work of maintaining those "no freeze" trackers isn't just
a nice-to-have microoptimization -- it is sometimes very important. We
kind of rely on this to be able to avoid getting too many MultiXact
member SLRU buffer misses inside FreezeMultiXactId. There is a comment
above FreezeMultiXactId that advises its caller that it had better not
call heap_tuple_should_freeze when freeze_required is set to true,
because that could easily lead to multixact member SLRU buffer misses
-- misses that FreezeMultiXactId set out to avoid itself.

It could actually be cheaper to freeze than to not freeze, in the case
of a Multi -- member space misses can sometimes be really expensive.
And so FreezeMultiXactId sometimes freezes a Multi even though it's
not strictly required to do so.

Note also that this isn't a new behavior -- it's actually an old one,
for the most part. It kinda doesn't look that way, because we haven't
passed down separate FreezeLimit/OldestXmin cutoffs (and separate
OldestMxact/MultiXactCutoff cutoffs) until now. But we often don't
need that granular information to be able to process a Multi early,
before its value has even crossed MultiXactCutoff.

If you look at how FreezeMultiXactId works, in detail, you'll see that
even on Postgres HEAD it can (say) set a tuple's xmax to
InvalidTransactionId long before the multi value is < MultiXactCutoff.
It just needs to detect that the multi is not still running, and
notice that it's HEAP_XMAX_IS_LOCKED_ONLY(). Stuff like that happens
quite a bit. So for the most part "eager processing of Multis as a
special case" is an old behavior, that has only been enhanced a little
bit (the really important, new change in FreezeMultiXactId is how the
FRM_NOOP case works with FreezeLimit, even though OldestXmin is used
nearby -- this is extra confusing because 0002 doesn't change how we
use FreezeLimit -- it actually changes every other use of FreezeLimit
nearby, making it OldestXmin).
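
To illustrate the existing behavior (simplified, from memory -- not an
exact quote of heapam.c):

```
/*
 * A multi that is no longer running, and that only ever locked the
 * tuple (never updated it), can have xmax removed outright -- no
 * matter how the multi compares against MultiXactCutoff.
 */
if (HEAP_XMAX_IS_LOCKED_ONLY(t_infomask) &&
    !MultiXactIdIsRunning(multi, true))
    *flags |= FRM_INVALIDATE_XMAX;  /* xmax becomes InvalidTransactionId */
```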

-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Jeff Davis
Date:
On Tue, 2022-12-20 at 21:26 -0800, Peter Geoghegan wrote:
> When freeze_required is set to true, that means that lazy_scan_prune
> literally has no choice -- it simply must freeze the page as
> instructed by heap_prepare_freeze_tuple/FreezeMultiXactId. It's not
> just a strong suggestion -- it's crucial that lazy_scan_prune freezes
> the page as instructed.

The confusing thing to me is perhaps just the name -- to me,
"freeze_required" suggests that if it were set to true, it would cause
freezing to happen. But as far as I can tell, it does not cause
freezing to happen, it causes some other things to happen that are
necessary when freezing happens (updating and using the right
trackers).

A minor point, no need to take action here. Perhaps rename the
variable.

I think 0001+0002 are about ready.


--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Wed, Dec 21, 2022 at 4:30 PM Jeff Davis <pgsql@j-davis.com> wrote:
> The confusing thing to me is perhaps just the name -- to me,
> "freeze_required" suggests that if it were set to true, it would cause
> freezing to happen. But as far as I can tell, it does not cause
> freezing to happen, it causes some other things to happen that are
> necessary when freezing happens (updating and using the right
> trackers).

freeze_required is about what's required, which tells us nothing about
what will happen when it's not required (could go either way,
depending on how lazy_scan_prune feels about it).

Setting freeze_required=true implies that heap_prepare_freeze_tuple
has stopped doing maintenance of the "no freeze" trackers. When it
sets freeze_required=true, it really *does* force freezing to happen,
in every practical sense. This happens because lazy_scan_prune does
what it's told to do when it's told that freezing is required. Because
of course it does, why wouldn't it?

So...I still don't get what you mean. Why would lazy_scan_prune ever
break its contract with heap_prepare_freeze_tuple? And in what sense
would you say that heap_prepare_freeze_tuple's setting
freeze_required=true doesn't quite amount to "forcing freezing"? Are
you worried about the possibility that lazy_scan_prune will decide to
rebel at some point, and fail to honor its contract with
heap_prepare_freeze_tuple?  :-)

> A minor point, no need to take action here. Perhaps rename the
> variable.

Andres was the one that suggested this name, actually. I initially
just called it "freeze", but I think that Andres had it right.

> I think 0001+0002 are about ready.

Great. I plan on committing 0001 in the next few days. Committing 0002
might take a bit longer.

Thanks
-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Wed, Dec 21, 2022 at 4:53 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Great. I plan on committing 0001 in the next few days. Committing 0002
> might take a bit longer.

I pushed the VACUUM cutoffs patch (previously 0001) this morning -
thanks for your help with that one.

Attached is v11, which is mostly just to fix the bitrot caused by
today's commits. Though I did adjust some of the commit messages a
bit. There is also one minor functional change in v11: we now always
use the eager freezing strategy in unlogged and temp tables, since it's
virtually guaranteed to be a win there.

With an unlogged or temp table, most of the cost of freezing is just
the cycles spent preparing to freeze, since, of course, there isn't
any WAL overhead to have to worry about (which is the dominant concern
with freezing costs, in general). Deciding *not* to freeze pages that
we can freeze and make all-frozen in the VM from unlogged/temp tables
seems like a case of wasting the cycles spent preparing freeze plans.
Why not just do the tiny additional work of executing the freeze plans
at that point?

It's not like eager freezing strategy comes with an added risk that
VACUUM will allocate new multis that it wouldn't otherwise have to
allocate. Nor does it change cleanup-lock-wait behavior. Clearly this
optimization isn't equivalent to interpreting vacuum_freeze_min_age as
0 in unlogged/temp tables. The whole design of freezing strategies is
supposed to abstract away details like that, freeing up high level
code like lazy_scan_strategy to think about freezing at the level of
the whole table -- the cost model stuff really benefits from being
able to measure debt at the table level, measuring things in terms of
units like total all-frozen pages, rel_pages, etc.

--
Peter Geoghegan

Attachment

Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Thu, Dec 22, 2022 at 11:39 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Dec 21, 2022 at 4:53 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Great. I plan on committing 0001 in the next few days. Committing 0002
> > might take a bit longer.
>
> I pushed the VACUUM cutoffs patch (previously 0001) this morning -
> thanks for your help with that one.

Attached is v12. I think that the page-level freezing patch is now
commitable, and plan on committing it in the next 2-4 days barring any
objections.

Notable changes in v12:

* Simplified some of the logic in FreezeMultiXactId(), which now
doesn't have any needless handling of NewRelfrozenXid style cutoffs
except in the one case that still needs it (its no-op processing
case).

We don't need most of the handling on HEAD anymore because every
possible approach to processing a Multi other than FRM_NOOP will
reliably leave behind a new xmax that is either InvalidTransactionId,
or an XID/MXID >= OldestXmin/OldestMxact. Such values cannot possibly
need to be tracked by the NewRelfrozenXid trackers, since the trackers
are initialized using OldestXmin/OldestMxact to begin with.

* v12 merges together the code for the "freeze the page"
lazy_scan_prune path with the block that actually calls
heap_freeze_execute_prepared().

This should make it clear that pagefrz.freeze_required really does
mean that freezing is required. Hopefully that addresses Jeff's recent
concern. It's certainly an improvement, in any case.

* On a related note, comments around the same point in lazy_scan_prune
as well as comments above the HeapPageFreeze struct now explain a
concept I decided to call "nominal freezing". This is the case where
we "freeze a page" without having any freeze plans to execute.

"nominal freezing" is the new name for a concept I invented many
months ago, which helps to resolve subtle problems with the way that
heap_prepare_freeze_tuple is tasked with doing two different things
for its lazy_scan_prune caller: 1. telling lazy_scan_prune how it
would freeze each tuple (were it to freeze the page), and 2. helping
lazy_scan_prune to determine if the page should become all-frozen in
the VM. The latter is always conditioned on page-level freezing
actually going ahead, since everything else in
heap_prepare_freeze_tuple has to work that way.

We always freeze a page with zero freeze plans (or "nominally freeze"
the page) in lazy_scan_prune (which is nothing new in itself). We
thereby avoid breaking heap_prepare_freeze_tuple's working assumption
that all it needs to focus on is what the page will look like after
freezing executes, while also avoiding senselessly throwing away the
ability to set a page all-frozen in the VM in lazy_scan_prune when
it'll cost us nothing extra. That is, by always freezing in the event
of zero freeze plans, we won't senselessly miss out on setting a page
all-frozen in cases where we don't actually have to execute any freeze
plans to make that safe, while the "freeze the page path versus don't
freeze the page path" dichotomy still works as a high level conceptual
abstraction.
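
Concretely, the structure in lazy_scan_prune ends up looking something
like this (sketch only, not the patch verbatim; "wants_to_freeze"
stands in for the page-level decision, and snapshotConflictHorizon is
whatever cutoff applies there):

```
if (pagefrz.freeze_required || wants_to_freeze)
{
    if (tuples_frozen == 0)
    {
        /*
         * Nominal freezing: we "freeze the page" without any freeze
         * plans to execute.  No WAL, no page modification -- we just
         * retain the option of setting the page all-frozen in the VM.
         */
    }
    else
    {
        /* Execute all prepared freeze plans, emitting one FREEZE_PAGE record */
        heap_freeze_execute_prepared(vacrel->rel, buf, snapshotConflictHorizon,
                                     frozen, tuples_frozen);
    }
}
else
{
    /* The "don't freeze the page" path */
    tuples_frozen = 0;
}
```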

-- 
Peter Geoghegan

Attachment

RE: New strategies for freezing, advancing relfrozenxid early

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Peter, Jeff,

While reviewing other patches, I found that cfbot raised an ERROR during VACUUM FREEZE [1] on the FreeBSD instance.
It seems that the same error has occurred in other threads.

```
2022-12-23 08:50:20.175 UTC [34653][postmaster] LOG:  server process (PID 37171) was terminated by signal 6: Abort trap
2022-12-23 08:50:20.175 UTC [34653][postmaster] DETAIL:  Failed process was running: VACUUM FREEZE tab_freeze;
2022-12-23 08:50:20.175 UTC [34653][postmaster] LOG:  terminating any other active server processes
```

I guess that this assertion failure is caused by commit 4ce3af [2],
because the Assert() seems to have been added by that commit.

```
[08:51:31.189] #3  0x00000000009b88d7 in ExceptionalCondition (conditionName=<optimized out>,
    fileName=0x2fd9df "../src/backend/access/heap/heapam.c",
    lineNumber=lineNumber@entry=6618) at ../src/backend/utils/error/assert.c:66
[08:51:31.189] No locals.
[08:51:31.189] #4  0x0000000000564205 in heap_prepare_freeze_tuple (tuple=0x8070f0bb0,
    cutoffs=cutoffs@entry=0x80222e768, frz=0x7fffffffb2d0,
    totally_frozen=totally_frozen@entry=0x7fffffffc478,
    relfrozenxid_out=<optimized out>, relfrozenxid_out@entry=0x7fffffffc4a8,
    relminmxid_out=<optimized out>, relminmxid_out@entry=0x7fffffffc474)
    at ../src/backend/access/heap/heapam.c:6618
 
```

Sorry for the noise if you already knew about this, or if it is not related to this thread.

[1]: https://cirrus-ci.com/task/4580705867399168
[2]: https://github.com/postgres/postgres/commit/4ce3afb82ecfbf64d4f6247e725004e1da30f47c

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Mon, Dec 26, 2022 at 10:57 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
> I guess that this assertion failure is caused by commit 4ce3af [2],
> because the Assert() seems to have been added by that commit.

I agree that the problem is with this assertion, which is on the
master branch (not in recent versions of the patch series itself)
following commit 4ce3af:

else
{
    /*
    * Freeze plan for tuple "freezes xmax" in the strictest sense:
    * it'll leave nothing in xmax (neither an Xid nor a MultiXactId).
    */
    ....
    Assert(MultiXactIdPrecedes(xid, cutoffs->OldestMxact));
    ...
}

The problem is that FRM_INVALIDATE_XMAX multi processing can occur
both in Multis from before OldestMxact and Multis >= OldestMxact. The
latter case (the >= case) is far less common, but still quite
possible. Not sure how I missed that.

Anyway, this assertion is wrong, and simply needs to be removed.
Thanks for the report
-- 
Peter Geoghegan



RE: New strategies for freezing, advancing relfrozenxid early

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Peter,

> Anyway, this assertion is wrong, and simply needs to be removed.
> Thanks for the report

Thanks for fixing this so quickly! I found your commit in the remote repository.
I will keep watching, and will report again if there is another issue.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Mon, Dec 26, 2022 at 12:53 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached is v12. I think that the page-level freezing patch is now
> commitable, and plan on committing it in the next 2-4 days barring any
> objections.

I've pushed the page-level freezing patch, so now I need to produce a
new revision, just to keep CFTester happy.

Attached is v13. No notable changes since v12.

-- 
Peter Geoghegan

Attachment

Re: New strategies for freezing, advancing relfrozenxid early

From
Jeff Davis
Date:
On Mon, 2022-12-26 at 12:53 -0800, Peter Geoghegan wrote:
> * v12 merges together the code for the "freeze the page"
> lazy_scan_prune path with the block that actually calls
> heap_freeze_execute_prepared().
>
> This should make it clear that pagefrz.freeze_required really does
> mean that freezing is required. Hopefully that addresses Jeff's
> recent
> concern. It's certainly an improvement, in any case.

Better, thank you.

> * On a related note, comments around the same point in
> lazy_scan_prune
> as well as comments above the HeapPageFreeze struct now explain a
> concept I decided to call "nominal freezing". This is the case where
> we "freeze a page" without having any freeze plans to execute.
>
> "nominal freezing" is the new name for a concept I invented many
> months ago, which helps to resolve subtle problems with the way that
> heap_prepare_freeze_tuple is tasked with doing two different things
> for its lazy_scan_prune caller: 1. telling lazy_scan_prune how it
> would freeze each tuple (were it to freeze the page), and 2. helping
> lazy_scan_prune to determine if the page should become all-frozen in
> the VM. The latter is always conditioned on page-level freezing
> actually going ahead, since everything else in
> heap_prepare_freeze_tuple has to work that way.
>
> We always freeze a page with zero freeze plans (or "nominally freeze"
> the page) in lazy_scan_prune (which is nothing new in itself). We
> thereby avoid breaking heap_prepare_freeze_tuple's working assumption
> that all it needs to focus on is what the page will look like after
> freezing executes, while also avoiding senselessly throwing away the
> ability to set a page all-frozen in the VM in lazy_scan_prune when
> it'll cost us nothing extra. That is, by always freezing in the event
> of zero freeze plans, we won't senselessly miss out on setting a page
> all-frozen in cases where we don't actually have to execute any
> freeze
> plans to make that safe, while the "freeze the page path versus don't
> freeze the page path" dichotomy still works as a high level
> conceptual
> abstraction.

I always understood "freezing" to mean that a concrete action was
taken, and associated WAL generated.

"Nominal freezing" is happening when there are no freeze plans at all.
I get that it's to manage control flow so that the right thing happens
later. But I think it should be defined in terms of what state the page
is in so that we know that following a given path is valid. Defining
"nominal freezing" as a case where there are no freeze plans is just
confusing to me.


--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Fri, Dec 30, 2022 at 12:43 PM Jeff Davis <pgsql@j-davis.com> wrote:
> I always understood "freezing" to mean that a concrete action was
> taken, and associated WAL generated.

"When I use a word… it means just what I choose it to mean -- neither
more nor less".

I have also always understood freezing that way too. In fact, I still
do understand it that way -- I don't think that it has been undermined
by any of this. I've just invented this esoteric concept of nominal
freezing that can be ignored approximately all the time, to solve one
narrow problem that needed to be solved, that isn't that interesting
anywhere else.

> "Nominal freezing" is happening when there are no freeze plans at all.
> I get that it's to manage control flow so that the right thing happens
> later. But I think it should be defined in terms of what state the page
> is in so that we know that following a given path is valid. Defining
> "nominal freezing" as a case where there are no freeze plans is just
> confusing to me.

What would you prefer? The state that the page is in is not something
that I want to draw much attention to, because it's confusing in a way
that mostly isn't worth talking about. When we do nominal freezing, we
don't necessarily go on to set the page all-frozen. In fact, it's not
particularly likely that that will end up happening!

Bear in mind that the exact definition of "freeze the page" is
somewhat creative, even without bringing nominal freezing into it. It
just has to be in order to support the requirements we have for
MultiXacts (in particular for FRM_NOOP processing). The new concepts
don't quite map directly on to the old ones. At the same time, it
really is very often the case that "freezing the page" will perform
maximally aggressive freezing, in the sense that it does precisely
what a VACUUM FREEZE would do given the same page (in any Postgres
version).

--
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Fri, Dec 30, 2022 at 1:12 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > "Nominal freezing" is happening when there are no freeze plans at all.
> > I get that it's to manage control flow so that the right thing happens
> > later. But I think it should be defined in terms of what state the page
> > is in so that we know that following a given path is valid. Defining
> > "nominal freezing" as a case where there are no freeze plans is just
> > confusing to me.
>
> What would you prefer? The state that the page is in is not something
> that I want to draw much attention to, because it's confusing in a way
> that mostly isn't worth talking about.

I probably should have addressed what you said more directly. Here goes:

Following the path of freezing a page is *always* valid, by
definition. Including when there are zero freeze plans to execute, or
even zero tuples to examine in the first place -- we'll at least be
able to perform nominal freezing, no matter what. OTOH, following the
"no freeze" path is permissible whenever the freeze_required flag
hasn't been set during any call to heap_prepare_freeze_tuple(). It is
never actually mandatory for lazy_scan_prune() to *not* freeze.

It's a bit like how a simple point can be understood as a degenerate
circle of radius 0. It's an abstract definition, which is just a tool
for describing things precisely -- hopefully a useful tool. I welcome
the opportunity to be able to describe things in a way that is clearer
or more useful, in whatever way. But it's not like I haven't already
put in significant effort to this exact question of what "freezing the
page" really means to lazy_scan_prune(). Naming things is hard.

-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Jeff Davis
Date:
On Fri, 2022-12-30 at 16:58 -0800, Peter Geoghegan wrote:
> Following the path of freezing a page is *always* valid, by
> definition. Including when there are zero freeze plans to execute, or
> even zero tuples to examine in the first place -- we'll at least be
> able to perform nominal freezing, no matter what.

This is a much clearer description, in my opinion. Do you think this is
already reflected in the comments (and I missed it)?

Perhaps the comment in the "if (tuples_frozen == 0)" branch could be
something more like:

"We have no freeze plans to execute, so there's no cost to following
the freeze path. This is important in the case where the page is
entirely frozen already, so that the page will be marked as such in the
VM."

I'm not even sure we really want a new concept of "nominal freezing". I
think you are right to just call it a degenerate case where it can be
interpreted as either freezing zero things or not freezing; and the
former is convenient for us because we want to follow that code path.
That would be another good way of writing the comment, in my opinion.

Of course, I'm sure there are some nuances that I'm still missing.


--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Sat, Dec 31, 2022 at 11:46 AM Jeff Davis <pgsql@j-davis.com> wrote:
> On Fri, 2022-12-30 at 16:58 -0800, Peter Geoghegan wrote:
> > Following the path of freezing a page is *always* valid, by
> > definition. Including when there are zero freeze plans to execute, or
> > even zero tuples to examine in the first place -- we'll at least be
> > able to perform nominal freezing, no matter what.
>
> This is a much clearer description, in my opinion. Do you think this is
> already reflected in the comments (and I missed it)?

I am arguably the person least qualified to answer this question.   :-)

> Perhaps the comment in the "if (tuples_frozen == 0)" branch could be
> something more like:
>
> "We have no freeze plans to execute, so there's no cost to following
> the freeze path. This is important in the case where the page is
> entirely frozen already, so that the page will be marked as such in the
> VM."

I'm happy to use your wording instead -- I'll come up with a patch for that.

In my mind it's just a restatement of what's there already. I assume
that you're right about it being clearer this way.

> Of course, I'm sure there are some nuances that I'm still missing.

I don't think that there is, actually. I now believe that you totally
understand the mechanics involved here. I'm glad that I was able to
ascertain that that's all it was. It's worth going to the trouble of
getting something like this exactly right.

-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Sat, Dec 31, 2022 at 12:45 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Sat, Dec 31, 2022 at 11:46 AM Jeff Davis <pgsql@j-davis.com> wrote:
> > "We have no freeze plans to execute, so there's no cost to following
> > the freeze path. This is important in the case where the page is
> > entirely frozen already, so that the page will be marked as such in the
> > VM."
>
> I'm happy to use your wording instead -- I'll come up with a patch for that.

What do you think of the wording adjustments in the attached patch?
It's based on your suggested wording.

--
Peter Geoghegan

Attachment

Re: New strategies for freezing, advancing relfrozenxid early

From
Jeff Davis
Date:
On Mon, 2023-01-02 at 11:45 -0800, Peter Geoghegan wrote:
> What do you think of the wording adjustments in the attached patch?
> It's based on your suggested wording.

Great, thank you.


--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Mon, Jan 2, 2023 at 6:26 PM Jeff Davis <pgsql@j-davis.com> wrote:
> On Mon, 2023-01-02 at 11:45 -0800, Peter Geoghegan wrote:
> > What do you think of the wording adjustments in the attached patch?
> > It's based on your suggested wording.
>
> Great, thank you.

Pushed that today.

Attached is v14.

v14 simplifies the handling of setting the visibility map at the end
of the blkno-wise loop in lazy_scan_heap(). And,
visibilitymap_snap_next() doesn't tell its caller (lazy_scan_heap)
anything about the visibility status of each returned block -- we no
longer need a all_visible_according_to_vm local variable to help with
setting the visibility map.

This new approach to setting the VM is related to hardening that I
plan on adding, which makes the visibility map robust against certain
race conditions that can lead to setting a page all-frozen but not
all-visible. I go into that here:

https://postgr.es/m/CAH2-WznuNGSzF8v6OsgjaC5aYsb3cZ6HW6MLm30X0d65cmSH6A@mail.gmail.com

(It's the second patch -- the first patch already became yesterday's
commit 6daeeb1f.)

In general I don't think that we should be using
all_visible_according_to_vm for anything, especially not anything
critical -- it is just information about how the page used to be in
the past, after all. This will be more of a problem with visibility
map snapshots, since all_visible_according_to_vm could be information
that is hours old by the time it's actually used by lazy_scan_heap().
But it is an existing issue.

BTW, it would be helpful if I could get a +1 to the visibility map
patch posted on that other thread. It's practically a bug fix -- the
VM shouldn't be able to show contradictory information about any given
heap page (i.e. "page is all-frozen but not all-visible"), no matter
what. Just on general principle.

--
Peter Geoghegan

Attachment

Re: New strategies for freezing, advancing relfrozenxid early

From
Matthias van de Meent
Date:
On Tue, 3 Jan 2023 at 21:30, Peter Geoghegan <pg@bowt.ie> wrote:
>
> Attached is v14.

Some reviews (untested; only code review so far) on these versions of
the patches:

> [PATCH v14 1/3] Add eager and lazy freezing strategies to VACUUM.

> +    /*
> +     * Threshold cutoff point (expressed in # of physical heap rel blocks in
> +     * rel's main fork) that triggers VACUUM's eager freezing strategy
> +     */

I don't think the mention of 'cutoff point' is necessary when it has
'Threshold'.

> +    int            freeze_strategy_threshold;    /* threshold to use eager
> [...]
> +    BlockNumber freeze_strategy_threshold;

Is there a way to disable the 'eager' freezing strategy? `int` cannot
hold the maximum BlockNumber...

> +    lazy_scan_strategy(vacrel);
>      if (verbose)

I'm slightly surprised you didn't update the message for verbose vacuum
to indicate whether we used the eager strategy: there are several GUCs
for tuning this behaviour, so you'd expect to want direct confirmation
that the configuration is effective.
(looks at further patches) I see that the message for verbose vacuum
sees significant changes in patch 2 instead.

---

> [PATCH v14 2/3] Add eager and lazy VM strategies to VACUUM.

General comments:

I don't see anything regarding scan synchronization in the vmsnap scan
system. I understand that VACUUM is a load that is significantly
different from normal SEQSCANs, but are there good reasons to _not_
synchronize the start of VACUUM?

Right now, we don't use syncscan to determine a startpoint. I can't
find the reason why in the available documentation: [0] discusses the
issue but without clearly describing an issue why it wouldn't be
interesting from a 'nothing lost' perspective.

In addition, I noticed that progress reporting of blocks scanned
("heap_blocks_scanned", duh) includes skipped pages. Now that we have
a solid grasp of how many blocks we're planning to scan, we can update
the reported stats to how many blocks we're planning to scan (and have
scanned), increasing the user value of that progress view.

[0] https://www.postgresql.org/message-id/flat/19398.1212328662%40sss.pgh.pa.us#17b2feb0fde6a41779290632d8c70ef1

> +    double        tableagefrac;

I think this can use some extra info on the field itself, that it is
the fraction of how "old" the relfrozenxid and relminmxid fields are,
as a fraction between 0 (latest values; nextXID and nextMXID), and 1
(values that are old by at least freeze_table_age and
multixact_freeze_table_age (multi)transaction ids, respectively).


> -#define VACOPT_DISABLE_PAGE_SKIPPING 0x80    /* don't skip any pages */
> +#define VACOPT_DISABLE_PAGE_SKIPPING 0x80    /* don't skip using VM */

I'm not super happy with this change. I don't think we should touch
the VM using snapshots _at all_ when disable_page_skipping is set:

> +     * Decide vmsnap scanning strategy.
>       *
> -     * This test also enables more frequent relfrozenxid advancement during
> -     * non-aggressive VACUUMs.  If the range has any all-visible pages then
> -     * skipping makes updating relfrozenxid unsafe, which is a real downside.
> +     * First acquire a visibility map snapshot, which determines the number of
> +     * pages that each vmsnap scanning strategy is required to scan for us in
> +     * passing.

I think we should not take disk-backed vm snapshots when
force_scan_all is set. We need VACUUM to be able to run on very
resource-constrained environments, and this does not do that - it adds
a disk space requirement for the VM snapshot for all but the smallest
relation sizes, which is bad when you realize that we need VACUUM when
we want to clean up things like CLOG.

Additionally, it took me several reads of the code and comments to
understand what the decision-making process for lazy vs eager is, and
why. The comments are interspersed with the code, with no single place
that describes it from a bird's-eye view. I think something like the
following would be appreciated by other readers of the code:

+ We determine whether we choose the eager or lazy scanning strategy
based on how many extra pages the eager strategy would take over the
lazy strategy, and how "old" the table is (as determined in
tableagefrac):
+ When a table is still "young" (tableagefrac <
TABLEAGEFRAC_MIDPOINT), the eager strategy is accepted if we need to
scan 5% (MAX_PAGES_YOUNG_TABLEAGE) more of the table.
+ As the table gets "older" (tableagefrac between MIDPOINT and
HIGHPOINT), the threshold for eager scanning is relaxed linearly from
this 5% to 70% (MAX_PAGES_OLD_TABLEAGE) of the table being scanned
extra (over what would be scanned by the lazy strategy).
+ Once the tableagefrac passes HIGHPOINT, we stop considering the lazy
strategy, and always eagerly scan the table.
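
Or, roughly in code form (my own toy model of the above; the 5%/70%
constants follow the patch, but the MIDPOINT/HIGHPOINT values here are
made up for illustration):

```
static double
eager_scan_threshold(double tableagefrac)
{
    const double young_max = 0.05;   /* MAX_PAGES_YOUNG_TABLEAGE */
    const double old_max = 0.70;     /* MAX_PAGES_OLD_TABLEAGE */
    const double midpoint = 0.5;     /* TABLEAGEFRAC_MIDPOINT (made up) */
    const double highpoint = 0.9;    /* TABLEAGEFRAC_HIGHPOINT (made up) */

    if (tableagefrac < midpoint)
        return young_max;            /* eager only if <= 5% extra pages */
    if (tableagefrac < highpoint)    /* relax linearly from 5% to 70% */
        return young_max + (old_max - young_max) *
            (tableagefrac - midpoint) / (highpoint - midpoint);
    return 1.0;                      /* always scan eagerly */
}
```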

> @@ -1885,6 +1902,30 @@ retry:
>         tuples_frozen = 0;        /* avoid miscounts in instrumentation */
>      }
>
>     /*
> +     * There should never be dead or deleted tuples when PD_ALL_VISIBLE is
> +     * set.  Check that here in passing.
> +     *
> [...]

I'm not sure this patch is the appropriate place for this added check.
I don't disagree with the change, I just think that it's unrelated to
the rest of the patch. Same with some of the changes in
lazy_scan_heap.

> +vm_snap_stage_blocks

Doesn't this waste a lot of cycles on skipping frozen blocks if most
of the relation is frozen? I'd expected something more like a byte- or
word-wise processing of skippable blocks, as opposed to this per-block
loop. I don't think it's strictly necessary to patch, but I think it
would be a very useful addition for those with larger tables.

> +    XIDFrac = (double) (nextXID - cutoffs->relfrozenxid) /
> +        ((double) freeze_table_age + 0.5);

I don't quite understand what this `+ 0.5` is used for, could you explain?

> + [...] Freezing and advancing
> +         <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
> +         now take place more proactively, in every
> +         <command>VACUUM</command> operation.

This claim that it happens more proactively in "every" VACUUM
operation is false, so I think the removal of "every" would be better.

---

> [PATCH v14 3/3] Finish removing aggressive mode VACUUM.

I've not completed a review for this patch - I'll continue on that
tomorrow - but here's a first look:

I don't quite enjoy the refactoring+rewriting of the docs section;
it's difficult to determine what changed when so many things changed
line lengths and were moved around. Tomorrow I'll take a closer look,
but a separation of changed vs. moved content would be useful for review.

> +    /*
> +     * Earliest permissible NewRelfrozenXid/NewRelminMxid values that can be
> +     * set in pg_class at the end of VACUUM.
> +     */
> +    TransactionId MinXid;
> +    MultiXactId MinMulti;

I don't quite like this wording, but I'm not sure what would be better.

> +    cutoffs->MinXid = nextXID - (freeze_table_age * 0.95);
> [...]
> +    cutoffs->MinMulti = nextMXID - (multixact_freeze_table_age * 0.95);

Why are these values adjusted down (up?) by 5%? If I configure this
GUC, I'd expect this to be used effectively verbatim; not adjusted by
an arbitrary factor.

---

That's it for now; thanks for working on this,

Kind regards,

Matthias van de Meent



Re: New strategies for freezing, advancing relfrozenxid early

From
Matthias van de Meent
Date:
On Thu, 5 Jan 2023 at 02:21, I wrote:
>
> On Tue, 3 Jan 2023 at 21:30, Peter Geoghegan <pg@bowt.ie> wrote:
> >
> > Attached is v14.
> > [PATCH v14 3/3] Finish removing aggressive mode VACUUM.
>
> I've not completed a review for this patch - I'll continue on that
> tomorrow:

This is that.

> @@ -2152,10 +2109,98 @@ lazy_scan_noprune(LVRelState *vacrel,
> [...]
> +            /* wait 10ms, then 20ms, then 30ms, then give up */
> [...]
> +                pg_usleep(1000L * 10L * i);

Could this use something like autovacuum_cost_delay? I don't quite
like the use of arbitrary hardcoded millisecond delays - it can slow a
system down by a significant fraction, especially on high-contention
systems, and this potential of 60ms delay per scanned page can limit
the throughput of this new vacuum strategy to < 17 pages/second
(<136kB/sec) for highly contended sections, which is not great.

It is also not unlikely that in the time it was waiting, the page
contents were updated significantly (concurrent prune, DELETEs
committed), which could result in improved bounds. I think we should
redo the dead items check if we waited, but failed to get a lock - any
tuples removed now reduce work we'll have to do later.

> +++ b/doc/src/sgml/ref/vacuum.sgml
> [...] Pages where
> +      all tuples are known to be frozen are always skipped.

"...are always skipped, unless the >DISABLE_PAGE_SKIPPING< option is used."

> +++ b/doc/src/sgml/maintenance.sgml

There are a lot of details being lost from the previous version of
that document. Some of the details are obsolete (mentions of
aggressive VACUUM and freezing behavior), but others are not
(FrozenTransactionId in rows from a pre-9.4 system, the need for
vacuum for prevention of issues surrounding XID wraparound).

I also am not sure this is the best place to store most of these
mentions, but I can't find a different place where these details on
certain interesting parts of the system are documented, and plain
removal of the information does not sit right with me.

Specifically, I don't like the removal of the following information
from our documentation:

- Size of pg_xact and pg_commit_ts data in relation to autovacuum_freeze_max_age
   Although it is less likely with the new behaviour that we'll hit
these limits due to more eager freezing of transactions, it is still
important for users to have easy access to this information, and
tuning this for storage size is not useless information.

- The reason why VACUUM is essential to the long-term consistency of
Postgres' MVCC system
    Informing the user about our use of 32-bit transaction IDs and
that we update an epoch when this XID wraps around does not
automatically make the user aware of the issues that surface around
XID wraparound. Retaining the explainer for XID wraparound in the docs
seems like a decent idea - it may be moved, but please don't delete
it.

- Special transaction IDs, their meaning and where they can occur
   I can't seem to find any other information in the docs section, and
it is useful to have users understand that certain values are
considered special: FrozenTransactionId and BootstrapTransactionId.


Kind regards,

Matthias van de Meent



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Wed, Jan 4, 2023 at 5:21 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> Some reviews (untested; only code review so far) on these versions of
> the patches:

Thanks for the review!

> > [PATCH v14 1/3] Add eager and lazy freezing strategies to VACUUM.

> I don't think the mention of 'cutoff point' is necessary when it has
> 'Threshold'.

Fair. Will fix.

> > +    int            freeze_strategy_threshold;    /* threshold to use eager
> > [...]
> > +    BlockNumber freeze_strategy_threshold;
>
> Is there a way to disable the 'eager' freezing strategy? `int` cannot
> hold the maximum BlockNumber...

I'm going to fix this by switching over to making the GUC (and the
reloption) GUC_UNIT_MB, while keeping it in ConfigureNamesInt[]. That
approach is a little bit more cumbersome, but not by much. That'll
solve this problem.

> > +    lazy_scan_strategy(vacrel);
> >      if (verbose)
>
> I'm slightly suprised you didn't update the message for verbose vacuum
> to indicate whether we used the eager strategy: there are several GUCs
> for tuning this behaviour, so you'd expect to want direct confirmation
> that the configuration is effective.

Perhaps that would be worth doing, but I don't think that it's all
that useful in the grand scheme of things. I wouldn't mind including
it, but I think that it shouldn't be given much prominence. It's
certainly far less important than "aggressive vs non-aggressive" is
right now.

Eagerness is not just a synonym of aggressiveness. For example, every
VACUUM of a table like pgbench_tellers or pgbench_branches will use
eager scanning strategy. More generally, you have to bear in mind that
the actual state of the table is just as important as the GUCs
themselves. We try to avoid obligations that could be very hard or
even impossible for vacuumlazy.c to fulfill.

There are far weaker constraints on things like the final relfrozenxid
value we'll set in pg_class (more on this below, when I talk about
MinXid/MinMulti). It will advance far more frequently and by many more
XIDs than it would today, on average. But occasionally it will allow a
far earlier relfrozenxid than aggressive mode would ever allow, since
making some small amount of progress now is almost always much better
than making no progress at all.

> (looks at further patches) I see that the message for verbose vacuum
> sees significant changes in patch 2 instead.

It just works out to be slightly simpler that way. I want to add the
scanned_pages stuff to VERBOSE in the vmsnap/scanning strategies
commit, so I need to make significant changes to the initial VERBOSE
message in that commit. There is little point in preserving
information about aggressive mode if it's removed in the very next
commit anyway.

> > [PATCH v14 2/3] Add eager and lazy VM strategies to VACUUM.

> Right now, we don't use syncscan to determine a startpoint. I can't
> find the reason why in the available documentation: [0] discusses the
> issue but without clearly describing an issue why it wouldn't be
> interesting from a 'nothing lost' perspective.

That's not something I've given much thought to. It's a separate issue, I think.

Though I will say that one reason why I think that the vm snapshot
concept will become important is that working off an immutable
structure makes various things much easier, in fairly obvious ways. It
makes it straightforward to reorder work. So things like parallel heap
vacuuming are a lot more straightforward.

I also think that it would be useful to teach VACUUM to speculatively
scan a random sample of pages, just like a normal VACUUM. We start out
doing a normal VACUUM that just processes scanned_pages in a random
order. At some point we look at the state of pages so far. If it looks
like the table really doesn't urgently need to be vacuumed, then we
can give up before paying much of a cost. If it looks like the table
really needs to be VACUUM'd, we can press on almost like any other
VACUUM would.

This is related to the problem of bad statistics that drive
autovacuum. Deciding as much as possible at runtime, dynamically,
seems promising to me.

> In addition, I noticed that progress reporting of blocks scanned
> ("heap_blocks_scanned", duh) includes skipped pages. Now that we have
> a solid grasp of how many blocks we're planning to scan, we can update
> the reported stats to how many blocks we're planning to scan (and have
> scanned), increasing the user value of that progress view.

Yeah, that's definitely a natural direction to go with this. Knowing
scanned_pages from the start is a basis for much more useful progress
reporting.

> > +    double        tableagefrac;
>
> I think this can use some extra info on the field itself, that it is
> the fraction of how "old" the relfrozenxid and relminmxid fields are,
> as a fraction between 0 (latest values; nextXID and nextMXID), and 1
> (values that are old by at least freeze_table_age and
> multixact_freeze_table_age (multi)transaction ids, respectively).

Agreed that that needs more than that in comments above the
"tableagefrac" struct field.

> > +     * Decide vmsnap scanning strategy.
> >       *
> > -     * This test also enables more frequent relfrozenxid advancement during
> > -     * non-aggressive VACUUMs.  If the range has any all-visible pages then
> > -     * skipping makes updating relfrozenxid unsafe, which is a real downside.
> > +     * First acquire a visibility map snapshot, which determines the number of
> > +     * pages that each vmsnap scanning strategy is required to scan for us in
> > +     * passing.
>
> I think we should not take disk-backed vm snapshots when
> force_scan_all is set. We need VACUUM to be able to run on very
> resource-constrained environments, and this does not do that - it adds
> a disk space requirement for the VM snapshot for all but the smallest
> relation sizes, which is bad when you realize that we need VACUUM when
> we want to clean up things like CLOG.

I agree that I still have work to do to make visibility map snapshots
as robust as possible in resource constrained environments, including
in cases where there is simply no disk space at all. They should
gracefully degrade even when there isn't space on disk to store a copy
of the VM in temp files, or even a single page.

> Additionally, it took me several reads of the code and comments to
> understand what the decision-making process for lazy vs eager is, and
> why. The comments are interspersed with the code, with no single place
> that describes it from a bird's-eye view.

You probably have a good point there. I'll try to come up with
something, possibly based on your suggested wording.

> > @@ -1885,6 +1902,30 @@ retry:
> >         tuples_frozen = 0;        /* avoid miscounts in instrumentation */
> >      }
> >
> >     /*
> > +     * There should never be dead or deleted tuples when PD_ALL_VISIBLE is
> > +     * set.  Check that here in passing.
> > +     *
> > [...]
>
> I'm not sure this patch is the appropriate place for this added check.
> I don't disagree with the change, I just think that it's unrelated to
> the rest of the patch. Same with some of the changes in
> lazy_scan_heap.

This issue is hard to explain. I kind of need to do this in the VM
snapshot/scanning strategies commit, because it removes the
all_visible_according_to_vm local variable used inside lazy_scan_heap.

The change that you highlight is part of that: it detects cases where
PD_ALL_VISIBLE is set incorrectly earlier, in lazy_scan_prune, and then
unsets it, so that once lazy_scan_prune returns and lazy_scan_heap
needs to consider setting the VM, it can trust PD_ALL_VISIBLE -- it is
definitely up to date at that point, even in cases involving
corruption. So the steps where we consider setting the VM now always
starts from a clean slate.

Now we won't just unset both PD_ALL_VISIBLE and the VM bits in the
event of corruption like this. We'll complain about it in
lazy_scan_prune, then fully fix the issue in the most appropriate way
in lazy_scan_heap (could be setting the page all-visible now, even
though it shouldn't have been set but was set when we first arrived).
We also won't fail to complain about PD_ALL_VISIBLE corruption because
lazy_scan_prune "destroyed the evidence" before lazy_scan_heap had the
chance to notice the problem. PD_ALL_VISIBLE corruption should never
happen, obviously, so we should make a point of complaining about it
whenever it can be detected. Which is much more often than what you
see on HEAD today.

> > +vm_snap_stage_blocks
>
> Doesn't this waste a lot of cycles on skipping frozen blocks if most
> of the relation is frozen? I'd expected something more like a byte- or
> word-wise processing of skippable blocks, as opposed to this per-block
> loop. I don't think it's strictly necessary to patch, but I think it
> would be a very useful addition for those with larger tables.

I agree that the visibility map snapshot stuff could stand to be a bit
more frugal with memory. It's certainly not critical, but it is
probably fairly easy to do better here, and so I should do better.

> > +    XIDFrac = (double) (nextXID - cutoffs->relfrozenxid) /
> > +        ((double) freeze_table_age + 0.5);
>
> I don't quite understand what this `+ 0.5` is used for, could you explain?

It avoids division by zero when freeze_table_age happens to be 0,
without meaningfully affecting the result otherwise.

> > + [...] Freezing and advancing
> > +         <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
> > +         now take place more proactively, in every
> > +         <command>VACUUM</command> operation.
>
> This claim that it happens more proactively in "every" VACUUM
> operation is false, so I think the removal of "every" would be better.

Good catch. Will fix.

> > [PATCH v14 3/3] Finish removing aggressive mode VACUUM.

> I don't quite enjoy the refactoring+rewriting of the docs section;
> it's difficult to determine what changed when so many things changed
> line lengths and were moved around. Tomorrow I'll take a closer look,
> but a separation of changes vs moved would be useful for review.

I think that I should break out the doc changes some more. The docs
are likely the least worked out thing at this point.

> > +    cutoffs->MinXid = nextXID - (freeze_table_age * 0.95);
> > [...]
> > +    cutoffs->MinMulti = nextMXID - (multixact_freeze_table_age * 0.95);
>
> Why are these values adjusted down (up?) by 5%? If I configure this
> GUC, I'd expect this to be used effectively verbatim; not adjusted by
> an arbitrary factor.

It is kind of arbitrary, but not in the way that you suggest. This
isn't documented in the user docs, and shouldn't really need to be. It
should have very little if any noticeable impact on our final
relfrozenxid/relminmxid in practice. If it does have any noticeable
impact, I strongly suspect it'll be a useful, positive impact.

MinXid/MinMulti control the behavior around whether or not
lazy_scan_noprune is willing to wait the hard way for a cleanup lock,
no matter how long it takes. We do still need something like that, but
it can be far looser than it is right now. The problem with aggressive
mode is that it absolutely insists on a certain outcome, no matter the
cost, and regardless of whether or not a slightly inferior outcome is
acceptable. It's extremely rigid. Rigid things tend to break. Loose,
springy things much less so.

I think that it's an extremely bad idea to wait indefinitely for a
cleanup lock. Sure, it'll work out the vast majority of the time --
it's *very* likely to work. But when it doesn't work right away, there
is no telling how long the wait will be -- all bets are off. Could be
a day, a week, a month -- who knows? The application itself is the
crucial factor here, and in general the application can do whatever it
wants to do -- that is the reality. So we should be willing to kick
the can down the road in almost all cases -- that is actually the
responsible thing to do under the circumstances. We need to get on
with freezing every other page in the table!

There just cannot be very many pages that can't be cleanup locked at
any given time, so waiting indefinitely is a very drastic measure in
response to a problem that is quite likely to go away on its own. A
problem that waiting doesn't really solve anyway. Maybe the only thing
that will work is waiting for a very long time, but we have nothing to
lose (and everything to gain) by waiting to wait.

--
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Thu, Jan 5, 2023 at 10:19 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> Could this use something like autovacuum_cost_delay? I don't quite
> like the use of arbitrary hardcoded millisecond delays

It's not unlike (say) the way that there can sometimes be hardcoded
waits inside GetMultiXactIdMembers(), which does run during VACUUM.

It's not supposed to be noticeable at all. If it is noticeable in any
practical sense, then the design is flawed, and should be fixed.

> it can slow a
> system down by a significant fraction, especially on high-contention
> systems, and this potential of 60ms delay per scanned page can limit
> the throughput of this new vacuum strategy to < 17 pages/second
> (<136kB/sec) for highly contended sections, which is not great.

We're only willing to wait the full 60ms when smaller waits don't work
out. And when 60ms doesn't do it, we'll then accept an older final
NewRelfrozenXid value. Our willingness to wait at all is conditioned
on the existing NewRelfrozenXid tracker being affected at all by
whether or not we accept reduced lazy_scan_noprune processing for the
page. So the waits are naturally self-limiting.

You may be right that I need to do more about the possibility of
something like that happening -- it's a legitimate concern. But I
think that this may be enough on its own. I've never seen a workload
where more than a small fraction of all pages couldn't be cleanup
locked right away. But I *have* seen workloads where VACUUM vainly
waited forever for a cleanup lock on one single heap page.

> It is also not unlikely that in the time it was waiting, the page
> contents were updated significantly (concurrent prune, DELETEs
> committed), which could result in improved bounds. I think we should
> redo the dead items check if we waited, but failed to get a lock - any
> tuples removed now reduce work we'll have to do later.

I don't think that it matters very much. That's always true. It seems
very unlikely that we'll get better bounds here, unless it happens by
getting a full cleanup lock and then doing full lazy_scan_prune
processing after all.

Sure, it's possible that a concurrent opportunistic prune could make
the crucial difference, even though we ourselves couldn't get a
cleanup lock despite going to considerable trouble. I just don't think
that it's worth doing anything about.

> > +++ b/doc/src/sgml/ref/vacuum.sgml
> > [...] Pages where
> > +      all tuples are known to be frozen are always skipped.
>
> "...are always skipped, unless the >DISABLE_PAGE_SKIPPING< option is used."

I'll look into changing this.

> > +++ b/doc/src/sgml/maintenance.sgml
>
> There are a lot of details being lost from the previous version of
> that document. Some of the details are obsolete (mentions of
> aggressive VACUUM and freezing behavior), but others are not
> (FrozenTransactionId in rows from a pre-9.4 system, the need for
> vacuum for prevention of issues surrounding XID wraparound).

I will admit that I really hate the "Routine Vacuuming" docs, and
think that they explain things in just about the worst possible way.

I also think that this needs to be broken up into pieces. As I said
recently, the docs are the part of the patch series that is the least
worked out.

> I also am not sure this is the best place to store most of these
> mentions, but I can't find a different place where these details on
> certain interesting parts of the system are documented, and plain
> removal of the information does not sit right with me.

I'm usually the person that argues for describing more implementation
details in the docs. But starting with low-level details here is
deeply confusing. At most these are things that should be discussed in
the context of internals, as part of some completely different
chapter.

I'll see about moving details of things like FrozenTransactionId somewhere else.

> Specifically, I don't like the removal of the following information
> from our documentation:
>
> - Size of pg_xact and pg_commit_ts data in relation to autovacuum_freeze_max_age
>    Although it is less likely with the new behaviour that we'll hit
> these limits due to more eager freezing of transactions, it is still
> important for users to have easy access to this information, and
> tuning this for storage size is not useless information.

That is a fair point. Though note that these things have weaker
relationships with settings like autovacuum_freeze_max_age now. Mostly
this is a positive improvement (in the sense that we can truncate
SLRUs much more aggressively on average), but not always.

> - The reason why VACUUM is essential to the long-term consistency of
> Postgres' MVCC system
>     Informing the user about our use of 32-bit transaction IDs and
> that we update an epoch when this XID wraps around does not
> automatically make the user aware of the issues that surface around
> XID wraparound. Retaining the explainer for XID wraparound in the docs
> seems like a decent idea - it may be moved, but please don't delete
> it.

We do need to stop telling users to enter single user mode. It's quite
simply obsolete, bad advice, and has been since Postgres 14. It's the
worst thing that you could do, in fact.

--
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Tue, Jan 3, 2023 at 12:30 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached is v14.

This has stopped applying due to conflicts with nearby work on VACUUM
from Tom. So I attached a new revision, v15, just to make CFTester
green again.

I didn't have time to incorporate any of the feedback from Matthias
just yet. That will have to wait until v16.

--
Peter Geoghegan

Attachment

Re: New strategies for freezing, advancing relfrozenxid early

From
Dilip Kumar
Date:
On Mon, Jan 9, 2023 at 7:16 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Tue, Jan 3, 2023 at 12:30 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Attached is v14.
>
> This has stopped applying due to conflicts with nearby work on VACUUM
> from Tom. So I attached a new revision, v15, just to make CFTester
> green again.
>
> I didn't have time to incorporate any of the feedback from Matthias
> just yet. That will have to wait until v16.
>
I have looked into the patch set. I think 0001 looks good to me; about
0002 I have a few questions; 0003 I haven't looked at yet.

1.
+    /*
+     * Finally, set tableagefrac for VACUUM.  This can come from either XID or
+     * XMID table age (whichever is greater currently).
+     */
+    XIDFrac = (double) (nextXID - cutoffs->relfrozenxid) /
+        ((double) freeze_table_age + 0.5);

I think '(nextXID - cutoffs->relfrozenxid) / freeze_table_age' should
be the actual fraction right?  What is the point of adding 0.5 to the
divisor?  If there is a logical reason, maybe we can explain in the
comments.

2.
While looking into the logic of 'lazy_scan_strategy', I think the idea
looks very good but the only thing is that
we have kept eager freeze and eager scan completely independent.
Don't you think that if a table is chosen for an eager scan
then we should force the eager freezing as well?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Sun, Jan 15, 2023 at 9:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I have looked into the patch set. I think 0001 looks good to me; about
> 0002 I have a few questions; 0003 I haven't looked at yet.

Thanks for taking a look.

> I think '(nextXID - cutoffs->relfrozenxid) / freeze_table_age' should
> be the actual fraction right?  What is the point of adding 0.5 to the
> divisor?  If there is a logical reason, maybe we can explain in the
> comments.

It's just a way of avoiding division by zero.
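
To illustrate with a standalone sketch of the arithmetic (not the
patch's code; the example values are made up):

#include <stdio.h>
#include <stdint.h>

/* Mirrors the quoted fragment: the +0.5 keeps the division well defined
 * when freeze_table_age is 0, without meaningfully changing the result
 * for realistic settings. */
static double
table_age_frac(uint32_t nextXID, uint32_t relfrozenxid, uint32_t freeze_table_age)
{
    return (double) (nextXID - relfrozenxid) /
        ((double) freeze_table_age + 0.5);
}

int
main(void)
{
    printf("%g\n", table_age_frac(1000, 990, 0));             /* 20, not a divide-by-zero */
    printf("%g\n", table_age_frac(150000000, 0, 150000000));  /* ~1.0: table age forces eagerness */
    return 0;
}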

> While looking into the logic of 'lazy_scan_strategy', I think the idea
> looks very good but the only thing is that
> we have kept eager freeze and eager scan completely independent.
> Don't you think that if a table is chosen for an eager scan
> then we should force the eager freezing as well?

Earlier versions of the patch kind of worked that way.
lazy_scan_strategy would actually use twice the GUC setting to
determine scanning strategy. That approach could make our "transition
from lazy to eager strategies" involve an excessive amount of
"catch-up freezing" in the VACUUM operation that advanced relfrozenxid
for the first time, which you see an example of here:

https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch

Now we treat the scanning and freezing strategies as two independent
choices. Of course they're not independent in any practical sense, but
I think it's slightly simpler and more elegant that way -- it makes
the GUC vacuum_freeze_strategy_threshold strictly about freezing
strategy, while still leading to VACUUM advancing relfrozenxid in a
way that makes sense. It just happens as a second order effect. Why
add a special case?

In principle, under this scheme the break-even point for the eager
scanning strategy (i.e. advancing relfrozenxid) is based only on the
added cost. There is no reason for lazy_scan_strategy to care about
what happened in the past to make the eager scanning strategy look like
a good idea. Similarly, there isn't any practical reason why
lazy_scan_strategy needs to anticipate what will happen in the near
future with freezing.

-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Sun, Jan 8, 2023 at 5:45 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I didn't have time to incorporate any of the feedback from Matthias
> just yet. That will have to wait until v16.

Attached is v16, which incorporates some of Matthias' feedback.

I've rolled back the major restructuring to the "Routine Vacuuming"
docs that previously appeared in 0003, preferring to take a much more
incremental approach. I do still think that somebody needs to do some
major reworking of that, just in general. That can be done by a
separate patch. There are now only fairly mechanical doc updates in
all 3 patches.

Other changes:

* vacuum_freeze_strategy_threshold is now MB-based, and can be set up to 512TB.

* Various refinements to comments.

--
Peter Geoghegan

Attachment

Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Mon, Jan 16, 2023 at 10:10 AM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached is v16, which incorporates some of Matthias' feedback.

0001 (the freezing strategies patch) is now committable IMV. Or at
least will be once I polish the docs a bit more. I plan on committing
0001 some time next week, barring any objections.

I should point out that 0001 is far shorter and simpler than the
page-level freezing commit that already went in (commit 1de58df4). The
only thing in 0001 that seems like it might be a bit controversial
(when considered on its own) is the addition of the
vacuum_freeze_strategy_threshold GUC/reloption. Note in particular
that vacuum_freeze_strategy_threshold doesn't look like any other
existing GUC; it gets applied as a threshold on the size of the rel's
main fork at the beginning of vacuumlazy.c processing. As far as I
know there are no objections to that approach at this time, but it
does still seem worth drawing attention to now.

0001 also makes unlogged tables and temp tables always use eager
freezing strategy, no matter how the GUC/reloption are set. This seems
*very* easy to justify, since the potential downside of such a policy
is obviously extremely low, even when we make very pessimistic
assumptions. The usual cost we need to worry about when it comes to
freezing is the added WAL overhead -- that clearly won't apply when
we're vacuuming non-permanent tables. That really just leaves the cost
of dirtying extra pages, which in general could have a noticeable
system-level impact in the case of unlogged tables.
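
Schematically, the policy amounts to something like this (a standalone
sketch with made-up names; the real check compares the main fork size
against the GUC/reloption at the start of vacuumlazy.c processing):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Sketch of the 0001 policy described above (names are mine, not the
 * patch's): non-permanent relations always freeze eagerly; permanent
 * relations compare the main fork size against the threshold. */
static bool
use_eager_freeze_strategy(bool rel_is_permanent,
                          uint64_t main_fork_bytes,
                          uint64_t freeze_strategy_threshold_bytes)
{
    if (!rel_is_permanent)
        return true;            /* unlogged/temp: no WAL downside to freezing */

    return main_fork_bytes >= freeze_strategy_threshold_bytes;
}

int
main(void)
{
    uint64_t    threshold = (uint64_t) 4 * 1024 * 1024 * 1024;  /* e.g. 4GB */

    printf("%d\n", use_eager_freeze_strategy(true, (uint64_t) 10 << 30, threshold));   /* 1: eager */
    printf("%d\n", use_eager_freeze_strategy(true, (uint64_t) 1 << 30, threshold));    /* 0: lazy */
    printf("%d\n", use_eager_freeze_strategy(false, (uint64_t) 1 << 30, threshold));   /* 1: eager */
    return 0;
}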

Dirtying extra pages when vacuuming an unlogged table is also a
non-issue. Even the eager freezing strategy only freezes "extra" pages
("extra" relative to the lazy strategy behavior) given a page that
will be set all-visible in any case [1]. Such a page will need to have
its page-level PD_ALL_VISIBLE bit set in any case -- which is already
enough to dirty the page. And so there can never be any additional
pages dirtied as a result of the special policy 0001 adds for
non-permanent relations.

[1] https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch_2
--
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Mon, Jan 16, 2023 at 10:00 AM Peter Geoghegan <pg@bowt.ie> wrote:
> Now we treat the scanning and freezing strategies as two independent
> choices. Of course they're not independent in any practical sense, but
> I think it's slightly simpler and more elegant that way -- it makes
> the GUC vacuum_freeze_strategy_threshold strictly about freezing
> strategy, while still leading to VACUUM advancing relfrozenxid in a
> way that makes sense. It just happens as a second order effect. Why
> add a special case?

This might be a better way to explain it:

The main page-level freezing commit (commit 1de58df4) already added an
optimization that triggers page-level freezing "early" (early relative
to vacuum_freeze_min_age). This happens whenever a page already needs
to have an FPI logged inside lazy_scan_prune -- even when we're using
the lazy freezing strategy. The optimization isn't configurable, and
gets applied regardless of freezing strategy (technically there is no
such thing as freezing strategies on HEAD just yet, though HEAD still
has this optimization).
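
Roughly speaking, the trigger has this shape (a sketch only -- the
parameter names are mine, and the real logic lives inside
lazy_scan_prune):

#include <stdbool.h>
#include <stdio.h>

/* Rough shape of the page-level freezing trigger described above.  A page's
 * tuples are frozen when vacuum_freeze_min_age forces it, or -- provided
 * freezing would leave the page all-frozen -- when either the eager strategy
 * is in use or pruning already had to log a full-page image anyway. */
static bool
should_freeze_page(bool min_age_requires_freezing,
                   bool page_would_become_all_frozen,
                   bool eager_freeze_strategy,
                   bool prune_logged_fpi)
{
    if (min_age_requires_freezing)
        return true;
    if (!page_would_become_all_frozen)
        return false;
    return eager_freeze_strategy || prune_logged_fpi;
}

int
main(void)
{
    /* lazy strategy, but an FPI was already logged: freeze opportunistically */
    printf("%d\n", should_freeze_page(false, true, false, true));   /* 1 */
    /* lazy strategy, no FPI: leave the page unfrozen for now */
    printf("%d\n", should_freeze_page(false, true, false, false));  /* 0 */
    return 0;
}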

There will be workloads where the FPI optimization will result in
freezing many more pages -- especially when data checksums are in use
(since then we could easily need to log an FPI just so pruning can set
a hint bit). As a result, certain VACUUMs that use the lazy freezing
strategy will freeze in almost the same way as an equivalent VACUUM
using the eager freezing strategy. Such a "nominally lazy but actually
quite eager" VACUUM operation should get the same benefit in terms of
relfrozenxid advancement as it would if it really had used the eager
freezing strategy instead. It's fairly obvious that we'll get the same
benefit in relfrozenxid advancement (comparable relfrozenxid results
for comparable freezing work), since the way that VACUUM decides on
its scanning strategy is not conditioned on freezing strategy (whether
by the ongoing VACUUM or any other VACUUM against the same table).

All that matters is the conditions in the table (in particular the
added cost of opting for eager scanning over lazy scanning) as
indicated by the visibility map at the start of each VACUUM -- how
those conditions came about really isn't interesting at that point.
And so lazy_scan_strategy doesn't care about them when it chooses
VACUUM's scanning strategy.

There are even tables/workloads where relfrozenxid will be able to
jump forward by a huge amount whenever VACUUM chooses the eager
scanning strategy, despite the fact that VACUUM generally does little
or no freezing to make that possible:

https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch_3

--
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Dilip Kumar
Date:
On Mon, Jan 16, 2023 at 11:31 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> > I think '(nextXID - cutoffs->relfrozenxid) / freeze_table_age' should
> > be the actual fraction right?  What is the point of adding 0.5 to the
> > divisor?  If there is a logical reason, maybe we can explain in the
> > comments.
>
> It's just a way of avoiding division by zero.

oh, correct :)

> > While looking into the logic of 'lazy_scan_strategy', I think the idea
> > looks very good but the only thing is that
> > we have kept eager freeze and eager scan completely independent.
> > Don't you think that if a table is chosen for an eager scan
> > then we should force the eager freezing as well?
>
> Earlier versions of the patch kind of worked that way.
> lazy_scan_strategy would actually use twice the GUC setting to
> determine scanning strategy. That approach could make our "transition
> from lazy to eager strategies" involve an excessive amount of
> "catch-up freezing" in the VACUUM operation that advanced relfrozenxid
> for the first time, which you see an example of here:
>
> https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch
>
> Now we treat the scanning and freezing strategies as two independent
> choices. Of course they're not independent in any practical sense, but
> I think it's slightly simpler and more elegant that way -- it makes
> the GUC vacuum_freeze_strategy_threshold strictly about freezing
> strategy, while still leading to VACUUM advancing relfrozenxid in a
> way that makes sense. It just happens as a second order effect. Why
> add a special case?

I think that it makes sense to keep 'vacuum_freeze_strategy_threshold'
strictly for freezing.  But the point is that the eager scanning
strategy is driven by the table's freezing needs (tableagefrac), which
makes sense; so if we have selected eager scanning based on the table's
age and its freezing needs, why don't we force eager freezing as well,
given that eager scanning was selected to satisfy that freezing need?
But OTOH, eager scanning might also get selected because it appears
that we wouldn't have to scan too many extra pages compared to a lazy
scan, and in those cases forcing eager freezing might not be wise.  So
maybe it is a good idea to keep them the way you have in your patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Mon, Jan 16, 2023 at 8:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I think that it makes sense to keep 'vacuum_freeze_strategy_threshold'
> strictly for freezing.  But the point is that the eager scanning
> strategy is driven by the table's freezing needs (tableagefrac), which
> makes sense; so if we have selected eager scanning based on the table's
> age and its freezing needs, why don't we force eager freezing as well,
> given that eager scanning was selected to satisfy that freezing need?

Don't think of eager scanning as the new name for aggressive mode --
it's a fairly different concept, because we care about costs now.
Eager scanning can be chosen just because it's very cheap relative to
the alternative of lazy scanning, even when relfrozenxid is still very
recent. (This kind of behavior isn't really new [1], but the exact
implementation from the patch is new.)

Tables such as pgbench_branches and pgbench_tellers will reliably use
eager scanning strategy, no matter how any GUC has been set -- just
because the added cost is always zero (relative to lazy scanning). It
really doesn't matter how far along tableagefrac is here, ever. These
same tables will never use eager freezing strategy, unless the
vacuum_freeze_strategy_threshold GUC is misconfigured. (This is
another example of how scanning strategy and freezing strategy may
differ for the same table.)

You do have a good point, though. I think that I know what you mean.
Note that antiwraparound autovacuums (or VACUUMs of tables very near
to that point) *will* always use both the eager freezing strategy and
the eager scanning strategy -- which is probably close to what you
meant.

The important point is that there can be more than one reason to
prefer one strategy to another -- and the reasons can be rather
different. Occasionally it'll be a combination of two factors together
that push things in favor of one strategy over the other -- even
though either factor on its own would not have resulted in the same
choice.

[1]
https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Constantly_updated_tables_.28usually_smaller_tables.29
-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Dilip Kumar
Date:
On Tue, Jan 17, 2023 at 10:05 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Jan 16, 2023 at 8:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I think that it makes sense to keep 'vacuum_freeze_strategy_threshold'
> > strictly for freezing.  But the point is that the eager scanning
> > strategy is driven by the table's freezing needs (tableagefrac), which
> > makes sense; so if we have selected eager scanning based on the table's
> > age and its freezing needs, why don't we force eager freezing as well,
> > given that eager scanning was selected to satisfy that freezing need?
>
> Don't think of eager scanning as the new name for aggressive mode --
> it's a fairly different concept, because we care about costs now.
> Eager scanning can be chosen just because it's very cheap relative to
> the alternative of lazy scanning, even when relfrozenxid is still very
> recent. (This kind of behavior isn't really new [1], but the exact
> implementation from the patch is new.)
>
> Tables such as pgbench_branches and pgbench_tellers will reliably use
> eager scanning strategy, no matter how any GUC has been set -- just
> because the added cost is always zero (relative to lazy scanning). It
> really doesn't matter how far along tableagefrac is here, ever. These
> same tables will never use eager freezing strategy, unless the
> vacuum_freeze_strategy_threshold GUC is misconfigured. (This is
> another example of how scanning strategy and freezing strategy may
> differ for the same table.)

Yes, I agree with that.  Thanks for explaining in detail.

> You do have a good point, though. I think that I know what you mean.
> Note that antiwraparound autovacuums (or VACUUMs of tables very near
> to that point) *will* always use both the eager freezing strategy and
> the eager scanning strategy -- which is probably close to what you
> meant.

Right

> The important point is that there can be more than one reason to
> prefer one strategy to another -- and the reasons can be rather
> different. Occasionally it'll be a combination of two factors together
> that push things in favor of one strategy over the other -- even
> though either factor on its own would not have resulted in the same
> choice.

Yes, that makes sense to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: New strategies for freezing, advancing relfrozenxid early

From
Dilip Kumar
Date:
On Wed, Jan 18, 2023 at 1:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Jan 17, 2023 at 10:05 AM Peter Geoghegan <pg@bowt.ie> wrote:

My final set of comments for 0002

1.
+struct vmsnapshot
+{
+    /* Target heap rel */
+    Relation    rel;
+    /* Scanning strategy used by VACUUM operation */
+    vmstrategy    strat;
+    /* Per-strategy final scanned_pages */
+    BlockNumber rel_pages;
+    BlockNumber scanned_pages_lazy;
+    BlockNumber scanned_pages_eager;

I do not see much use in maintaining these two variables,
'scanned_pages_lazy' and 'scanned_pages_eager'.  I think just
maintaining 'scanned_pages' should be sufficient.  I also do not see
them really being used in the patches.  lazy_scan_strategy() uses
these variables, but it gets their values as out parameters from
visibilitymap_snap_acquire().  And visibilitymap_snap_strategy() also
uses them, but there it seems we just need the final 'scanned_pages'
result instead of these two variables.

2.

+#define MAX_PAGES_YOUNG_TABLEAGE    0.05    /* 5% of rel_pages */
+#define MAX_PAGES_OLD_TABLEAGE        0.70    /* 70% of rel_pages */

What is the logic behind 5% and 70% -- are those based on some
experiments?  Should those be tuning parameters, so that if we realise
with real-world use cases that it would be better for the eager scan to
be selected more or less frequently, we can tune them?

3.
+    /*
+     * VACUUM's DISABLE_PAGE_SKIPPING option overrides our decision by forcing
+     * VACUUM to scan every page (VACUUM effectively distrusts rel's VM)
+     */
+    if (force_scan_all)
+        vacrel->vmstrat = VMSNAP_SCAN_ALL;

I think this should be moved to be the first if case -- I mean, why do
all the calculations based on 'tableagefrac' and 'TABLEAGEFRAC_XXPOINT'
if we are forced to scan everything anyway?  I agree the extra
computation might not really matter compared to the vacuum work we are
going to perform, but it still seems logical to me to do the simple
check first.

4. Should we move prefetching as a separate patch, instead of merging
with the scanning strategy?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Mon, Jan 23, 2023 at 3:17 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> My final set of comments for 0002

Thanks for the review!

> I do not see much use in maintaining these two variables,
> 'scanned_pages_lazy' and 'scanned_pages_eager'.  I think just
> maintaining 'scanned_pages' should be sufficient.  I also do not see
> them really being used in the patches.

I agree that the visibility map snapshot struct could stand to be
cleaned up -- some of that state may not be needed, and it wouldn't be
that hard to use memory a little more economically, particularly with
very small tables. It's on my TODO list already.

> +#define MAX_PAGES_YOUNG_TABLEAGE    0.05    /* 5% of rel_pages */
> +#define MAX_PAGES_OLD_TABLEAGE        0.70    /* 70% of rel_pages */
>
> What is the logic behind 5% and 70% -- are those based on some
> experiments?  Should those be tuning parameters, so that if we realise
> with real-world use cases that it would be better for the eager scan to
> be selected more or less frequently, we can tune them?

The specific multiplier constants chosen (for
MAX_PAGES_YOUNG_TABLEAGE and MAX_PAGES_OLD_TABLEAGE) were based on
both experiments and intuition. The precise values could be somewhat
different without it really mattering, though. For example, with a
table like pgbench_history (which is a really important case for the
patch in general), there won't be any all-visible pages at all (at
least after a short while), so it won't matter what these constants
are -- eager scanning will always be chosen.

I don't think that they should be parameters. The useful parameter for
users remains vacuum_freeze_table_age/autovacuum_freeze_max_age (note
that vacuum_freeze_table_age usually gets its value from
autovacuum_freeze_max_age due to changes in 0002). Like today,
vacuum_freeze_table_age forces VACUUM to scan all not-all-frozen pages
so that relfrozenxid can be advanced. Unlike today, it forces eager
scanning (not aggressive mode). But even long before eager scanning is
*forced*, pressure to use eager scanning gradually builds. That
pressure will usually cause some VACUUM to use eager scanning before
it's strictly necessary. Overall,
vacuum_freeze_table_age/autovacuum_freeze_max_age now provide loose
guidance.

It really has to be loose in this sense in order for
lazy_scan_strategy() to have the freedom to do the right thing based
on the characteristics of the table as a whole, according to its
visibility map snapshot. This allows lazy_scan_strategy() to stumble
upon once-off opportunities to advance relfrozenxid inexpensively,
including cases where it could never happen with the current model.
These opportunities are side-effects of workload characteristics that
can be hard to predict [1][2].
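
Schematically, the effect is something like the following (a standalone
sketch: the variable names and the simple linear ramp between the two
constants are my own assumptions, not the patch's actual code):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Constants quoted from the patch */
#define MAX_PAGES_YOUNG_TABLEAGE    0.05    /* 5% of rel_pages */
#define MAX_PAGES_OLD_TABLEAGE      0.70    /* 70% of rel_pages */

/* Sketch: the extra scanning VACUUM will accept in order to advance
 * relfrozenxid grows with tableagefrac, and eager scanning is forced once
 * tableagefrac reaches 1.0 (vacuum_freeze_table_age).  The point is only
 * that the guidance is "loose", not a hard cliff. */
static bool
choose_eager_scan(uint64_t rel_pages,
                  uint64_t scanned_pages_eager,
                  uint64_t scanned_pages_lazy,
                  double tableagefrac)
{
    double      budget_frac;
    uint64_t    extra_pages = scanned_pages_eager - scanned_pages_lazy;

    if (tableagefrac >= 1.0)
        return true;            /* table age forces eager scanning */

    budget_frac = MAX_PAGES_YOUNG_TABLEAGE +
        tableagefrac * (MAX_PAGES_OLD_TABLEAGE - MAX_PAGES_YOUNG_TABLEAGE);

    return extra_pages <= (uint64_t) (budget_frac * rel_pages);
}

int
main(void)
{
    /* pgbench_history-like: no all-visible pages, so zero added cost */
    printf("%d\n", choose_eager_scan(1000000, 800000, 800000, 0.1));  /* 1 */
    /* many all-visible pages, young table age: stay lazy for now */
    printf("%d\n", choose_eager_scan(1000000, 900000, 300000, 0.1));  /* 0 */
    return 0;
}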

> I think this should be moved to be the first if case -- I mean, why do
> all the calculations based on 'tableagefrac' and 'TABLEAGEFRAC_XXPOINT'
> if we are forced to scan everything anyway?  I agree the extra
> computation might not really matter compared to the vacuum work we are
> going to perform, but it still seems logical to me to do the simple
> check first.

This is only needed for DISABLE_PAGE_SKIPPING, which is an escape
hatch option that is never supposed to be needed. I don't think that
it's worth going to the trouble of indenting the code more just so
this is avoided -- it really is an afterthought. Besides, the compiler
might well be doing this for us.

> 4. Should we move prefetching as a separate patch, instead of merging
> with the scanning strategy?

I don't think that breaking that out would be an improvement. A lot of
the prefetching stuff informs how the visibility map code is
structured.

[1] https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch_3
[2]
https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Opportunistically_advancing_relfrozenxid_with_bursty.2C_real-world_workloads
--
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Mon, Jan 16, 2023 at 5:55 PM Peter Geoghegan <pg@bowt.ie> wrote:
> 0001 (the freezing strategies patch) is now committable IMV. Or at
> least will be once I polish the docs a bit more. I plan on committing
> 0001 some time next week, barring any objections.

I plan on committing 0001 (the freezing strategies commit) tomorrow
morning, US Pacific time.

Attached is v17. There are no significant differences compared to v16.
I decided to post a new version now, ahead of commit, to show how I've
cleaned up the docs in 0001 -- docs describing the new GUC, freeze
strategies, and so on.

--
Peter Geoghegan

Attachment

Re: New strategies for freezing, advancing relfrozenxid early

From
Matthias van de Meent
Date:
On Tue, 24 Jan 2023 at 23:50, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Jan 16, 2023 at 5:55 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > 0001 (the freezing strategies patch) is now committable IMV. Or at
> > least will be once I polish the docs a bit more. I plan on committing
> > 0001 some time next week, barring any objections.
>
> I plan on committing 0001 (the freezing strategies commit) tomorrow
> morning, US Pacific time.
>
> Attached is v17. There are no significant differences compared to v16.
> I decided to post a new version now, ahead of commit, to show how I've
> cleaned up the docs in 0001 -- docs describing the new GUC, freeze
> strategies, and so on.

LGTM, +1 on 0001

Some more comments on 0002:

> +lazy_scan_strategy(LVRelState *vacrel, bool force_scan_all)
> scanned_pages_lazy & scanned_pages_eager

We have not yet scanned the pages, so I suggest plan/scan_pages_eager
and *_lazy as variable names instead, to minimize confusion about the
naming.

I'll await the next iteration of 0002 in which you've completed more
TODOs before I'll dig deeper into that patch.


Kind regards,

Matthias van de Meent



Re: New strategies for freezing, advancing relfrozenxid early

From
Andres Freund
Date:
Hi,

On 2023-01-24 14:49:38 -0800, Peter Geoghegan wrote:
> On Mon, Jan 16, 2023 at 5:55 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > 0001 (the freezing strategies patch) is now committable IMV. Or at
> > least will be once I polish the docs a bit more. I plan on committing
> > 0001 some time next week, barring any objections.
>
> I plan on committing 0001 (the freezing strategies commit) tomorrow
> morning, US Pacific time.

I unfortunately haven't been able to keep up with the thread and saw this just
now. But I've expressed the concern below several times before, so it
shouldn't come as a surprise.

I think, as committed, this will cause serious issues for some reasonably
common workloads, due to substantially increased WAL traffic.


The most common problematic scenario I see are tables full of rows with
limited lifetime. E.g. because rows get aggregated up after a while. Before
those rows practically never got frozen - but now we'll freeze them all the
time.


I whipped up a quick test: 15 pgbench threads insert rows, 1 psql \watch loop
deletes older rows.

Workload fits in s_b:

Autovacuum on average generates between 1.5x-7x as much WAL as before,
depending on how things interact with checkpoints. And not just that, each
autovac cycle also takes substantially longer than before - the average time
for an autovacuum roughly doubled.  Which of course increases the amount of
bloat.


When workload doesn't fit in s_b:

Time for vacuuming goes up to ~5x. WAL volume to ~9x. Autovacuum can't keep up
with bloat, every vacuum takes longer than the prior one:
65s->78s->139s->176s
And that's with autovac cost limits removed! Relation size nearly doubles due
to bloat.


After I disabled the new strategy autovac started to catch up again:
124s->101s->103->46s->20s->28s->24s


This is significantly worse than I predicted. This was my first attempt at
coming up with a problematic workload. There'll likely be way worse in
production.



I think as-is this logic will cause massive issues.

Andres



Re: New strategies for freezing, advancing relfrozenxid early

From
Andres Freund
Date:
Hi,

On 2023-01-24 14:49:38 -0800, Peter Geoghegan wrote:
> From e41d3f45fcd6f639b768c22139006ad11422575f Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <pg@bowt.ie>
> Date: Thu, 24 Nov 2022 18:20:36 -0800
> Subject: [PATCH v17 1/3] Add eager and lazy freezing strategies to VACUUM.
> 
> Eager freezing strategy avoids large build-ups of all-visible pages.  It
> makes VACUUM trigger page-level freezing whenever doing so will enable
> the page to become all-frozen in the visibility map.  This is useful for
> tables that experience continual growth, particularly strict append-only
> tables such as pgbench's history table.  Eager freezing significantly
> improves performance stability by spreading out the cost of freezing
> over time, rather than doing most freezing during aggressive VACUUMs.
> It complements the insert autovacuum mechanism added by commit b07642db.

However, it significantly increases the overall work when rows have a somewhat
limited lifetime. That is the documented reason why vacuum_freeze_min_age
exists -- although I think it doesn't really achieve its documented goal
anymore, after the recent page-level freezing changes.


> VACUUM determines its freezing strategy based on the value of the new
> vacuum_freeze_strategy_threshold GUC (or reloption) with logged tables;
> tables that exceed the size threshold use the eager freezing strategy.

I think that's not a sufficient guard at all. The size of a table doesn't say
much about how a table is used.


> Unlogged tables and temp tables will always use eager freezing strategy,
> since there is essentially no downside.

I somewhat doubt that that is true, but certainly the cost is lower.


> Eager freezing is strictly more aggressive than lazy freezing.  Settings
> like vacuum_freeze_min_age still get applied in just the same way in
> every VACUUM, independent of the strategy in use.  The only mechanical
> difference between eager and lazy freezing strategies is that only the
> former applies its own additional criteria to trigger freezing pages.

That's only true because vacuum_freeze_min_age has been fairly radically
redefined recently.

Greetings,

Andres Freund



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Wed, Jan 25, 2023 at 4:43 PM Andres Freund <andres@anarazel.de> wrote:
> I unfortunately haven't been able to keep up with the thread and saw this just
> now. But I've expressed the concern below several times before, so it
> shouldn't come as a surprise.

You missed the announcement 9 days ago, and the similar clear
signalling of a commit from yesterday. I guess I'll need to start
personally reaching out to you any time I commit anything in this area
in the future. I almost considered doing that here, in fact.

> The most common problematic scenario I see are tables full of rows with
> limited lifetime. E.g. because rows get aggregated up after a while. Before
> those rows practically never got frozen - but now we'll freeze them all the
> time.

Fundamentally, the choice to freeze or not freeze is driven by
speculation about the needs of the table, with some guidance from the
user. That isn't new. It seems to me that it will always be possible
for you to come up with an adversarial case that makes any given
approach look bad, no matter how good it is. Of course that doesn't
mean that this particular complaint has no validity; but it does mean
that you need to be willing to draw the line somewhere.

In particular, it would be very useful to know what the parameters of
the discussion are. Obviously I cannot come up with an algorithm that
can literally predict the future. But I may be able to handle specific
cases of concern better, or to better help users cope in whatever way.

> I whipped up a quick test: 15 pgbench threads insert rows, 1 psql \watch loop
> deletes older rows.

Can you post the script? And what setting did you use?

> Workload fits in s_b:
>
> Autovacuum on average generates between 1.5x-7x as much WAL as before,
> depending on how things interact with checkpoints. And not just that, each
> autovac cycle also takes substantially longer than before - the average time
> for an autovacuum roughly doubled.  Which of course increases the amount of
> bloat.

Anything that causes an autovacuum to take longer will effectively
make autovacuum think that it has removed more bloat than it really
has, which will then make autovacuum less aggressive when it really
should be more aggressive. That's a preexisting issue, that needs to
be accounted for in the context of this discussion.

> This is significantly worse than I predicted. This was my first attempt at
> coming up with a problematic workload. There'll likely be way worse in
> production.

As I said in the commit message, the current default for
vacuum_freeze_strategy_threshold is considered low, and was always
intended to be provisional. Something that I explicitly noted would be
reviewed after the beta period is over, once we gained more experience
with the setting.

I think that a far higher setting could be almost as effective. 32GB,
or even 64GB could work quite well, since you'll still have the FPI
optimization.

-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Andres Freund
Date:
Hi,

On 2023-01-25 16:43:47 -0800, Andres Freund wrote:
> I think, as committed, this will cause serious issues for some reasonably
> common workloads, due to substantially increased WAL traffic.
> 
> 
> The most common problematic scenario I see are tables full of rows with
> limited lifetime. E.g. because rows get aggregated up after a while. Before
> those rows practically never got frozen - but now we'll freeze them all the
> time.

Another bad scenario: Some longrunning / hung transaction caused us to get
close to the xid wraparound. Problem was resolved, autovacuum runs. Previously
we wouldn't have frozen the portion of the table that was actively changing,
now we will. Consequence: We get closer to the "no write" limit / the outage
lasts longer.

I don't see an alternative to reverting this for now.

Greetings,

Andres Freund



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Wed, Jan 25, 2023 at 5:15 PM Andres Freund <andres@anarazel.de> wrote:
> However, it significantly increases the overall work when rows have a somewhat
> limited lifetime. That is the documented reason why vacuum_freeze_min_age
> exists -- although I think it doesn't really achieve its documented goal
> anymore, after the recent page-level freezing changes.

Huh? vacuum_freeze_min_age hasn't done that, at all. At least not
since the visibility map went in back in 8.4:

https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Today.2C_on_Postgres_HEAD_2

That's why we literally do ~100% of all freezing in aggressive mode
VACUUM with append-only or append-mostly tables.

> > VACUUM determines its freezing strategy based on the value of the new
> > vacuum_freeze_strategy_threshold GUC (or reloption) with logged tables;
> > tables that exceed the size threshold use the eager freezing strategy.
>
> I think that's not a sufficient guard at all. The size of a table doesn't say
> much about how a table is used.

Sufficient for what purpose?

> > Eager freezing is strictly more aggressive than lazy freezing.  Settings
> > like vacuum_freeze_min_age still get applied in just the same way in
> > every VACUUM, independent of the strategy in use.  The only mechanical
> > difference between eager and lazy freezing strategies is that only the
> > former applies its own additional criteria to trigger freezing pages.
>
> That's only true because vacuum_freeze_min_age has been fairly radically
> redefined recently.

So? This part of the commit message is a simple statement of fact.

-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Wed, Jan 25, 2023 at 5:26 PM Andres Freund <andres@anarazel.de> wrote:
> Another bad scenario: Some longrunning / hung transaction caused us to get
> close to the xid wraparound. Problem was resolved, autovacuum runs. Previously
> we wouldn't have frozen the portion of the table that was actively changing,
> now we will. Consequence: We get closer to the "no write" limit / the outage
> lasts longer.

Obviously it isn't difficult to just invent a new rule that gets
applied by lazy_scan_strategy. For example, it would take me less than
5 minutes to write a patch that disables eager freezing when the
failsafe is in effect.

> I don't see an alternative to reverting this for now.

I want to see your test case before acting.

-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Andres Freund
Date:
Hi,

On 2023-01-25 17:22:32 -0800, Peter Geoghegan wrote:
> On Wed, Jan 25, 2023 at 4:43 PM Andres Freund <andres@anarazel.de> wrote:
> > I unfortunately haven't been able to keep up with the thread and saw this just
> > now. But I've expressed the concern below several times before, so it
> > shouldn't come as a surprise.
> 
> You missed the announcement 9 days ago, and the similar clear
> signalling of a commit from yesterday. I guess I'll need to start
> personally reaching out to you any time I commit anything in this area
> in the future. I almost considered doing that here, in fact.

There's just too much email on -hackers to keep up with, if I ever want to do
any development of my own. I raised this concern before though, so it's not
like it's a surprise.


> > The most common problematic scenario I see are tables full of rows with
> > limited lifetime. E.g. because rows get aggregated up after a while. Before
> > those rows practically never got frozen - but now we'll freeze them all the
> > time.
> 
> Fundamentally, the choice to freeze or not freeze is driven by
> speculation about the needs of the table, with some guidance from the
> user. That isn't new. It seems to me that it will always be possible
> for you to come up with an adversarial case that makes any given
> approach look bad, no matter how good it is. Of course that doesn't
> mean that this particular complaint has no validity; but it does mean
> that you need to be willing to draw the line somewhere.

Sure. But significantly regressing plausible if not common workloads is
different than knowing that there'll be some edge case where we'll do
something worse.


> > I whipped up a quick test: 15 pgbench threads insert rows, 1 psql \watch loop
> > deletes older rows.
> 
> Can you post the script? And what setting did you use?

prep:
CREATE TABLE pgbench_time_data(client_id int8 NOT NULL, ts timestamptz NOT NULL,
    filla int8 NOT NULL, fillb int8 NOT NULL, fillc int8 NOT NULL);
CREATE INDEX ON pgbench_time_data(ts);
ALTER SYSTEM SET autovacuum_naptime = '10s';
ALTER SYSTEM SET autovacuum_vacuum_cost_delay TO -1;
ALTER SYSTEM SET synchronous_commit = off; -- otherwise more clients are needed

pgbench script, with 15 clients:
INSERT INTO pgbench_time_data(client_id, ts, filla, fillb, fillc) VALUES (:client_id, now(), 0, 0, 0);

psql session deleting old data:
EXPLAIN ANALYZE DELETE FROM pgbench_time_data WHERE ts < now() - '120s'::interval \watch 1

Realistically the time should be longer, but I didn't want to wait that long
for the deletions to actually start.


I reproduced both with checkpoint_timeout=5min and 1min. 1min is easier for
impatient me.


I switched between vacuum_freeze_strategy_threshold=0 and
vacuum_freeze_strategy_threshold=too-high, because it's quicker/takes less
warmup to set up something with smaller tables.

shared_buffers=32GB for fits in s_b, 1GB otherwise.

max_wal_size=150GB, log_autovacuum_min_duration=0, and a bunch of logging
settings.


> > Workload fits in s_b:
> >
> > Autovacuum on average generates between 1.5x-7x as much WAL as before,
> > depending on how things interact with checkpoints. And not just that, each
> > autovac cycle also takes substantially longer than before - the average time
> > for an autovacuum roughly doubled.  Which of course increases the amount of
> > bloat.
> 
> Anything that causes an autovacuum to take longer will effectively
> make autovacuum think that it has removed more bloat than it really
> has, which will then make autovacuum less aggressive when it really
> should be more aggressive. That's a preexisting issue, that needs to
> be accounted for in the context of this discussion.

That's not the problem here - on my system autovac starts again very
quickly. The problem is that we accumulate bloat while autovacuum is
running. Wasting time/WAL volume on freezing pages that don't need to be
frozen is an issue.



> In particular, it would be very useful to know what the parameters of
> the discussion are. Obviously I cannot come up with an algorithm that
> can literally predict the future. But I may be able to handle specific
> cases of concern better, or to better help users cope in whatever way.

> > This is significantly worse than I predicted. This was my first attempt at
> > coming up with a problematic workload. There'll likely be way worse in
> > production.
> 
> As I said in the commit message, the current default for
> vacuum_freeze_strategy_threshold is considered low, and was always
> intended to be provisional. Something that I explicitly noted would be
> reviewed after the beta period is over, once we gained more experience
> with the setting.

> I think that a far higher setting could be almost as effective. 32GB,
> or even 64GB could work quite well, since you'll still have the FPI
> optimization.

The concrete setting of vacuum_freeze_strategy_threshold doesn't matter.
Table size simply isn't a usable proxy for whether eager freezing is a good
idea or not.

You can have a 1TB table full of transient data, or you can have a 1TB table
where part of the data is transient and only settles after a time. In neither
case is eager freezing ok.

Or you can have an append-only table. In which case eager freezing is great.

Greetings,

Andres Freund



Re: New strategies for freezing, advancing relfrozenxid early

From
Andres Freund
Date:
Hi,

On 2023-01-25 17:37:17 -0800, Peter Geoghegan wrote:
> On Wed, Jan 25, 2023 at 5:26 PM Andres Freund <andres@anarazel.de> wrote:
> > Another bad scenario: Some longrunning / hung transaction caused us to get
> > close to the xid wraparound. Problem was resolved, autovacuum runs. Previously
> > we wouldn't have frozen the portion of the table that was actively changing,
> > now we will. Consequence: We get closer to the "no write" limit / the outage
> > lasts longer.
> 
> Obviously it isn't difficult to just invent a new rule that gets
> applied by lazy_scan_strategy. For example, it would take me less than
> 5 minutes to write a patch that disables eager freezing when the
> failsafe is in effect.

Sure. I'm not saying that these issues cannot be addressed. Of course no patch
of a meaningful size is perfect and we all can't predict the future. But this
is a very significant behavioural change to vacuum, and there are pretty
simple scenarios in which it causes significant regressions. And at least some
of the issues have been pointed out before.

Greetings,

Andres Freund



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Wed, Jan 25, 2023 at 5:49 PM Andres Freund <andres@anarazel.de> wrote:
> Sure. But significantly regressing plausible if not common workloads is
> different than knowing that there'll be some edge case where we'll do
> something worse.

That's very vague. Significant to whom, for what purpose?

> prep:
> CREATE TABLE pgbench_time_data(client_id int8 NOT NULL, ts timestamptz NOT NULL,
>     filla int8 NOT NULL, fillb int8 NOT NULL, fillc int8 NOT NULL);
> CREATE INDEX ON pgbench_time_data(ts);
> ALTER SYSTEM SET autovacuum_naptime = '10s';
> ALTER SYSTEM SET autovacuum_vacuum_cost_delay TO -1;
> ALTER SYSTEM SET synchronous_commit = off; -- otherwise more clients are needed
>
> pgbench script, with 15 clients:
> INSERT INTO pgbench_time_data(client_id, ts, filla, fillb, fillc) VALUES (:client_id, now(), 0, 0, 0);
>
> psql session deleting old data:
> EXPLAIN ANALYZE DELETE FROM pgbench_time_data WHERE ts < now() - '120s'::interval \watch 1
>
> Realistically the time should be longer, but I didn't want to wait that long
> for the deletions to actually start.

I'll review this tomorrow.

> I reproduced both with checkpoint_timeout=5min and 1min. 1min is easier for
> impatient me.

You said "Autovacuum on average generates between 1.5x-7x as much WAL
as before". Why stop there, though? There's a *big* multiplicative
effect in play here from FPIs, obviously, so the sky's the limit. Why
not set checkpoint_timeout to 30s?

> I switched between vacuum_freeze_strategy_threshold=0 and
> vacuum_freeze_strategy_threshold=too-high, because it's quicker/takes less
> warmup to set up something with smaller tables.

This makes no sense to me, at all.

> The concrete setting of vacuum_freeze_strategy_threshold doesn't matter.
> Table size simply isn't a usable proxy for whether eager freezing is a good
> idea or not.

It's not supposed to be - you have it backwards. It's intended to work
as a proxy for whether lazy freezing is a bad idea, particularly in
the worst case.

There is also an effect that likely would have been protective with
your test case had you used a larger table with the same test case
(and had you not lowered vacuum_freeze_strategy_threshold from its
already low default). In general there'd be a much better chance of
concurrent reuse of space by new inserts discouraging page-level
freezing, since VACUUM would take much longer relative to everything
else, as compared to a small table.

> You can have a 1TB table full of transient data, or you can have a 1TB table
> where part of the data is transient and only settles after a time. In neither
> case is eager freezing ok.

It sounds like you're not willing to accept any kind of trade-off.
How, in general, can we detect what kind of 1TB table it will be, in
the absence of user input? And in the absence of user input, why would
we prefer to default to a behavior that is highly destabilizing when
we get it wrong?

-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Andres Freund
Date:
Hi,

On 2023-01-25 17:28:48 -0800, Peter Geoghegan wrote:
> On Wed, Jan 25, 2023 at 5:15 PM Andres Freund <andres@anarazel.de> wrote:
> > However, it significantly increases the overall work when rows have a somewhat
> > limited lifetime. That is the documented reason why vacuum_freeze_min_age
> > exists -- although I think it doesn't really achieve its documented goal
> > anymore, after the recent page-level freezing changes.
> 
> Huh? vacuum_freeze_min_age hasn't done that, at all. At least not
> since the visibility map went in back in 8.4:

My point was the other way round. That vacuum_freeze_min_age *prevented* us
from freezing rows "too soon" - obviously a very blunt instrument.

Since page level freezing, it only partially does that, because we'll freeze
even newer rows, if pruning triggered an FPI (I don't think that's quite the
right check, but that's a separate discussion).

As far as I can tell, with the eager strategy, the only thing
vacuum_freeze_min_age really influences is whether we'll block waiting for a
cleanup lock.  IOW, VACUUM on a table > vacuum_freeze_strategy_threshold is
now a slightly less-blocking version of VACUUM FREEZE.


The paragraph I was referencing:
   <para>
    One disadvantage of decreasing <varname>vacuum_freeze_min_age</varname> is that
    it might cause <command>VACUUM</command> to do useless work: freezing a row
    version is a waste of time if the row is modified
    soon thereafter (causing it to acquire a new XID).  So the setting should
    be large enough that rows are not frozen until they are unlikely to change
    any more.
   </para>

But now vacuum_freeze_min_age doesn't reliably influence whether we'll freeze
rows anymore.

Am I missing something here?



> > > VACUUM determines its freezing strategy based on the value of the new
> > > vacuum_freeze_strategy_threshold GUC (or reloption) with logged tables;
> > > tables that exceed the size threshold use the eager freezing strategy.
> >
> > I think that's not a sufficient guard at all. The size of a table doesn't say
> > much about how a table is used.
> 
> Sufficient for what purpose?

To not regress a substantial portion of our userbase.

Greetings,

Andres Freund



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Wed, Jan 25, 2023 at 6:33 PM Andres Freund <andres@anarazel.de> wrote:
> My point was the other way round. That vacuum_freeze_min_age *prevented* us
> from freezing rows "too soon" - obviously a very blunt instrument.

Yes, not freezing at all until aggressive vacuum is definitely good
when you don't really need to freeze at all.

> Since page level freezing, it only partially does that, because we'll freeze
> even newer rows, if pruning triggered an FPI (I don't think that's quite the
> right check, but that's a separate discussion).

But the added cost is very low, and it might well make all the difference.

> As far as I can tell, with the eager strategy, the only thing
> vacuum_freeze_min_age really influences is whether we'll block waiting for a
> cleanup lock.  IOW, VACUUM on a table > vacuum_freeze_strategy_threshold is
> now a slightly less-blocking version of VACUUM FREEZE.

That's simply not true, at all. I'm very surprised that you think
that. The commit message very clearly addresses this. You know, the
part that you specifically quoted to complain about today!

Once again I'll refer you to my Wiki page on this:

https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch_2

The difference between this and VACUUM FREEZE is described here:

"Note how we freeze most pages, but still leave a significant number
unfrozen each time, despite using an eager approach to freezing
(2981204 scanned - 2355230 frozen = 625974 pages scanned but left
unfrozen). Again, this is because we don't freeze pages unless they're
already eligible to be set all-visible. We saw the same effect with
the first pgbench_history example, but it was hardly noticeable at all
there. Whereas here we see that even eager freezing opts to hold off
on freezing relatively many individual heap pages, due to the observed
conditions on those particular heap pages."

If it was true that eager freezing strategy behaved just the same as
VACUUM FREEZE (at least as far as freezing is concerned) then
scenarios like this one would show that VACUUM froze practically all
of the pages it scanned -- maybe fully 100% of all scanned pages would
be frozen. This effect is absent from small tables, and I suspect that
it's absent from your test case in part because you used a table that
was too small.

Obviously the way that eager freezing strategy avoids freezing
concurrently modified pages isn't perfect. It's one approach to
limiting the downside from eager freezing, in tables (or even
individual pages) where it's inappropriate. Of course that isn't
perfect, but it's a significant factor.

-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Andres Freund
Date:
Hi,

On 2023-01-25 18:31:16 -0800, Peter Geoghegan wrote:
> On Wed, Jan 25, 2023 at 5:49 PM Andres Freund <andres@anarazel.de> wrote:
> > Sure. But significantly regressing plausible if not common workloads is
> > different than knowing that there'll be some edge case where we'll do
> > something worse.
> 
> That's very vague. Significant to whom, for what purpose?

Sure, it's vague. But you can't tell me that it's uncommon to use postgres to
store rows that aren't retained for > 50 million xids.



> > I reproduced both with checkpoint_timeout=5min and 1min. 1min is easier for
> > impatient me.
> 
> You said "Autovacuum on average generates between 1.5x-7x as much WAL
> as before". Why stop there, though? There's a *big* multiplicative
> effect in play here from FPIs, obviously, so the sky's the limit. Why
> not set checkpoint_timeout to 30s?

The amount of WAL increases substantially even with 5min, though the degree
of the increase varies more. But that largely vanishes if you increase the
time after which rows are deleted a bit. I just am not patient enough to wait
for that.


> > I switched between vacuum_freeze_strategy_threshold=0 and
> > vacuum_freeze_strategy_threshold=too-high, because it's quicker/takes less
> > warmup to set up something with smaller tables.
> 
> This makes no sense to me, at all.

It's quicker to run the workload with a table that initially is below 4GB,
while still being able to test the eager strategy. It wouldn't change anything
fundamental to just make the rows a bit wider, or to have a static portion of
the table.

And changing between vacuum_freeze_strategy_threshold=0/very-large (or I
assume -1, didn't check) while the workload is running avoids having to wait
out the 120s delay before deletions start again.


> > The concrete setting of vacuum_freeze_strategy_threshold doesn't matter.
> > Table size simply isn't a usable proxy for whether eager freezing is a good
> > idea or not.
> 
> It's not supposed to be - you have it backwards. It's intended to work
> as a proxy for whether lazy freezing is a bad idea, particularly in
> the worst case.

That's a distinction without a difference.


> There is also an effect that likely would have been protective with
> your test case had you used a larger table with the same test case
> (and had you not lowered vacuum_freeze_strategy_threshold from its
> already low default).

Again, you just need a less heavily changing portion of the table or a
slightly larger "deletion delay" and you end up with a table well over
4GB. Even as stated I end up with > 4GB after a bit of running.

It's just a shortcut to make testing this easier.



> > You can have a 1TB table full of transient data, or you can have a 1TB table
> > where part of the data is transient and only settles after a time. In neither
> > case is eager freezing ok.
> 
> It sounds like you're not willing to accept any kind of trade-off.

I am. Just not every tradeoff. I just don't see any useful tradeoffs purely
based on the relation size.


> How, in general, can we detect what kind of 1TB table it will be, in the
> absence of user input?

I suspect we'll need some form of heuristics to differentiate between tables
that are more append heavy and tables that are changing more heavily. I think
it might be preferable to not have a hard cliff but a gradual changeover -
hard cliffs tend to lead to issues one can't see coming.

I think several of the heuristics below become easier once we introduce "xid
age" vacuums.


One idea is to start tracking the number of all-frozen pages in pg_class. If
there's a significant percentage of all-visible but not all-frozen pages,
vacuum should be more eager. If only a small portion of the table is not
frozen, there's no need to be eager. If only a small portion of the table is
all-visible, there similarly is no need to freeze eagerly.


IIRC I previously was handwaving about keeping track of the average age of
tuples on all-visible pages. That could extend the prior heuristic. A heavily
changing table will have a relatively young average; a more append-only table
will have an increasing average age.


It might also make sense to look at the age of relfrozenxid - there's really
no point in being overly eager if the relation is quite young. And a very
heavily changing table will tend to be younger. But likely the approach of
tracking the age of all-visible pages will be more accurate.



The heuristics don't have to be perfect. If we get progressively more eager,
an occasional somewhat eager vacuum isn't a huge issue, as long as it then
leads to the next few vacuums becoming less eager.
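
To make the "gradual changeover" idea a bit more concrete, here is a minimal
standalone sketch (plain C, compiles on its own, definitely not a patch): the
relallfrozen counter does not exist in pg_class today (only relallvisible
does), and the shape of the formula is invented purely for illustration.

#include <stdint.h>

/*
 * Illustration only: derive a 0.0 - 1.0 freezing "eagerness" from
 * per-table counters, so eagerness ramps up gradually instead of at a
 * hard cliff.  Assumes relallfrozen <= relallvisible <= relpages.
 */
static double
freeze_eagerness(uint32_t relpages,        /* pg_class.relpages */
                 uint32_t relallvisible,   /* pg_class.relallvisible */
                 uint32_t relallfrozen)    /* hypothetical new counter */
{
    double      visible_frac;
    double      unfrozen_frac;

    if (relpages == 0 || relallvisible == 0)
        return 0.0;             /* nothing has settled yet: stay lazy */

    /* Share of the table that is all-visible at all. */
    visible_frac = (double) relallvisible / relpages;

    /* Share of those all-visible pages that still need freezing. */
    unfrozen_frac = (double) (relallvisible - relallfrozen) / relallvisible;

    /*
     * Be increasingly eager as more of the table is all-visible and as
     * more of that all-visible portion remains unfrozen.  0.0 means
     * fully lazy, 1.0 means freeze everything we can.
     */
    return visible_frac * unfrozen_frac;
}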



> And in the absence of user input, why would we prefer to default to a
> behavior that is highly destabilizing when we get it wrong?

Users know the current behaviour. Introducing significant regressions that
didn't previously exist will cause new issues and new frustrations.

Greetings,

Andres Freund



Re: New strategies for freezing, advancing relfrozenxid early

From
Robert Haas
Date:
On Wed, Jan 25, 2023 at 8:49 PM Andres Freund <andres@anarazel.de> wrote:
> The concrete setting of vacuum_freeze_strategy_threshold doesn't matter.
> Table size simply isn't a usable proxy for whether eager freezing is a good
> idea or not.

I strongly agree. I can't imagine how a size-based threshold can make
any sense at all.

Both Andres and I have repeatedly expressed concern about how much is
being changed in the behavior of vacuum, and how quickly, and IMHO on
the basis of very limited evidence that the changes are improvements.
The fact that Andres was very quickly able to find cases where the
patch produces large regressions is just more evidence of that. It's
also hard to even understand what has been changed, because the
descriptions are so theoretical.

I think we're on a very dangerous path here. I want VACUUM to be
better as much as the next person, but I really don't believe that's the
direction we're headed. I think if we release like this, we're going
to experience more VACUUM pain, not less. And worse still, I don't
think anyone other than Peter and Andres is going to understand why
it's happening.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Wed, Jan 25, 2023 at 7:11 PM Andres Freund <andres@anarazel.de> wrote:
> > > I switched between vacuum_freeze_strategy_threshold=0 and
> > > vacuum_freeze_strategy_threshold=too-high, because it's quicker/takes less
> > > warmup to set up something with smaller tables.
> >
> > This makes no sense to me, at all.
>
> It's quicker to run the workload with a table that initially is below 4GB, but
> still be able to test the eager strategy. It wouldn't change anything
> fundamental to just make the rows a bit wider, or to have a static portion of
> the table.

What does that actually mean? Wouldn't change anything fundamental?

What it would do is significantly reduce the write amplification
effect that you encountered. You came up with numbers of up to 7x, a
number that you used without any mention of checkpoint_timeout being
lowered to only 1 minute (I had to push to get that information). Had
you done things differently (larger table, larger setting) then that
would have made the regression far smaller. So yeah, "nothing
fundamental".

> > How, in general, can we detect what kind of 1TB table it will be, in the
> > absence of user input?
>
> I suspect we'll need some form of heuristics to differentiate between tables
> that are more append heavy and tables that are changing more heavily.

The TPC-C tables are actually a perfect adversarial case for this,
because they're both at once. What then?

> I think
> it might be preferrable to not have a hard cliff but a gradual changeover -
> hard cliffs tend to lead to issue one can't see coming.

As soon as you change your behavior you have to account for the fact
that you behaved lazily in all prior VACUUMs. I think that
you're better off just being eager with new pages and modified pages,
while not specifically going back to freeze older all-visible pages.

> I IIRC previously was handwaving at keeping track of the average age of tuples
> on all-visible pages. That could extend the prior heuristic. A heavily
> changing table will have a relatively young average, a more append only table
> will have an increasing average age.
>
>
> It might also make sense to look at the age of relfrozenxid - there's really
> no point in being overly eager if the relation is quite young.

I don't think that's true. What about bulk loading? It's a totally
valid and common requirement.

--
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Andres Freund
Date:
Hi,

On 2023-01-25 18:43:10 -0800, Peter Geoghegan wrote:
> On Wed, Jan 25, 2023 at 6:33 PM Andres Freund <andres@anarazel.de> wrote:
> > As far as I can tell, with the eager strategy, the only thing
> > vacuum_freeze_min_age really influences is whether we'll block waiting for a
> > cleanup lock.  IOW, VACUUM on a table > vacuum_freeze_strategy_threshold is
> > now a slightly less-blocking version of VACUUM FREEZE.
>
> That's simply not true, at all. I'm very surprised that you think
> that. The commit message very clearly addresses this.

It says something like that, but it's not really true:

Looking at the results of
  DROP TABLE IF EXISTS frak;
  -- autovac disabled so we see just the result of the vacuum below
  CREATE TABLE frak WITH (autovacuum_enabled=0) AS SELECT generate_series(1, 10000000);
  VACUUM frak;
  SELECT pg_relation_size('frak') / 8192 AS relsize_pages,
         SUM(all_visible::int) AS all_vis_pages,
         SUM(all_frozen::int) AS all_frozen_pages
  FROM pg_visibility('frak');

across releases.

In < 16 you'll get:
┌───────────────┬───────────────┬──────────────────┐
│ relsize_pages │ all_vis_pages │ all_frozen_pages │
├───────────────┼───────────────┼──────────────────┤
│         44248 │         44248 │                0 │
└───────────────┴───────────────┴──────────────────┘

You simply can't freeze these rows, because they're not vacuum_freeze_min_age
xids old.

With 16 and the default vacuum_freeze_strategy_threshold you'll get the same
(even though we wouldn't actually trigger an FPW).

With 16 and vacuum_freeze_strategy_threshold=0, you'll get:
┌───────────────┬───────────────┬──────────────────┐
│ relsize_pages │ all_vis_pages │ all_frozen_pages │
├───────────────┼───────────────┼──────────────────┤
│         44248 │         44248 │            44248 │
└───────────────┴───────────────┴──────────────────┘

IOW, basically what you get with VACUUM FREEZE.


That's actually what I was complaining about. The commit message in a way is
right that
    Settings like vacuum_freeze_min_age still get applied in just the same
    way in every VACUUM, independent of the strategy in use.  The only
    mechanical difference between eager and lazy freezing strategies is
    that only the former applies its own additional criteria to trigger
    freezing pages.

but that's only true because page level freezing neutered
vacuum_freeze_min_age. Compared to <16, it's a *huge* change.


Yes, it's true that VACUUM still is less aggressive than VACUUM FREEZE, even
disregarding cleanup locks, because it won't freeze if there are non-removable
rows on the page. But more often than not that's a pretty small difference.



> Once again I'll refer you to my Wiki page on this:
>
> https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch_2
>
> The difference between this and VACUUM FREEZE is described here:
>
> "Note how we freeze most pages, but still leave a significant number
> unfrozen each time, despite using an eager approach to freezing
> (2981204 scanned - 2355230 frozen = 625974 pages scanned but left
> unfrozen). Again, this is because we don't freeze pages unless they're
> already eligible to be set all-visible.

The only reason there is a substantial difference is because of pgbench's
uniform access pattern. Most real-world applications don't have that.

Greetings,

Andres Freund



Re: New strategies for freezing, advancing relfrozenxid early

From
John Naylor
Date:

On Thu, Jan 26, 2023 at 10:11 AM Andres Freund <andres@anarazel.de> wrote:

> I am. Just not every tradeoff. I just don't see any useful tradeoffs purely
> based on the relation size.

I expressed reservations about relation size six weeks ago:

On Wed, Dec 14, 2022 at 12:16 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Tue, Dec 13, 2022 at 12:29 AM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> > If the number of unfrozen heap pages is the thing we care about, perhaps that, and not the total size of the table, should be the parameter that drives freezing strategy?
>
> That's not the only thing we care about, though.

That was followed by several paragraphs that never got around to explaining why table size should drive freezing strategy. Review is a feedback mechanism alerting the patch author to possible problems. Listening to feedback is like vacuum, in a way: If it hurts, you're not doing it enough. 

--
John Naylor
EDB: http://www.enterprisedb.com

Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Wed, Jan 25, 2023 at 7:41 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Both Andres and I have repeatedly expressed concern about how much is
> being changed in the behavior of vacuum, and how quickly, and IMHO on
> the basis of very limited evidence that the changes are improvements.
> The fact that Andres was very quickly able to find cases where the
> patch produces large regressions is just more evidence of that. It's
> also hard to even understand what has been changed, because the
> descriptions are so theoretical.

Did you actually read the motivating examples Wiki page?

> I think we're on a very dangerous path here. I want VACUUM to be
> better as much as the next person, but I really don't believe that's the
> direction we're headed. I think if we release like this, we're going
> to experience more VACUUM pain, not less. And worse still, I don't
> think anyone other than Peter and Andres is going to understand why
> it's happening.

I think that the only sensible course of action at this point is for
me to revert the page-level freezing commit from today, and abandon
all outstanding work on VACUUM. I will still stand by the basic
page-level freezing work, at least to the extent that I am able to.
Honestly, just typing that makes me feel a big sense of relief.

I am a proud, stubborn man. While the experience of working on the
earlier related stuff for Postgres 15 was itself enough to make me
seriously reassess my choice to work on VACUUM in general, I still
wanted to finish off what I'd started. I don't see how that'll be
possible now -- I'm just not in a position to be in the center of
another controversy, and I just don't seem to be able to avoid them
here, as a practical matter. I will resolve to be a less stubborn
person. I don't have the constitution for it anymore.

-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Wed, Jan 25, 2023 at 8:12 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
> That was followed by several paragraphs that never got around to explaining why table size should drive freezing strategy.

You were talking about the system level view of freeze debt, and how
the table view might not be a sufficient proxy for that. What does
that have to do with anything that we've discussed on this thread
recently?

> Review is a feedback mechanism alerting the patch author to possible problems. Listening to feedback is like vacuum, in a way: If it hurts, you're not doing it enough.

An elegant analogy.

-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Wed, Jan 25, 2023 at 8:24 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > I think we're on a very dangerous path here. I want VACUUM to be
> > better as the next person, but I really don't believe that's the
> > direction we're headed. I think if we release like this, we're going
> > to experience more VACUUM pain, not less. And worse still, I don't
> > think anyone other than Peter and Andres is going to understand why
> > it's happening.
>
> I think that the only sensible course of action at this point is for
> me to revert the page-level freezing commit from today, and abandon
> all outstanding work on VACUUM. I will still stand by the basic
> page-level freezing work, at least to the extent that I am able to.

I have now reverted today's commit. I have also withdrawn all
remaining work from the patch series as a whole, which is reflected in
the CF app. Separately, I have withdrawn 2 other VACUUM related
patches of mine via the CF app: the antiwraparound autovacuum patch
series, plus a patch that did some further work on freezing
MultiXacts.

I have no intention of picking any of these patches back up again. I
also intend to completely avoid new work on both VACUUM and
autovacuum, not including ambulkdelete() code run by index access
methods. I will continue to do maintenance and bugfix work when it
happens to involve VACUUM, though.

For the record, in case it matters: I certainly have no objection to
anybody else picking up any of this unfinished work for themselves, in
part or in full.

--
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Robert Haas
Date:
On Wed, Jan 25, 2023 at 11:25 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Jan 25, 2023 at 7:41 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > Both Andres and I have repeatedly expressed concern about how much is
> > being changed in the behavior of vacuum, and how quickly, and IMHO on
> > the basis of very limited evidence that the changes are improvements.
> > The fact that Andres was very quickly able to find cases where the
> > patch produces large regressions is just more evidence of that. It's
> > also hard to even understand what has been changed, because the
> > descriptions are so theoretical.
>
> Did you actually read the motivating examples Wiki page?

I don't know. I've read a lot of stuff that you've written on this
topic, which has taken a significant amount of time, and I still don't
understand a lot of what you're changing, and I don't agree with all
of the things that I do understand. I can't state with confidence that
the motivating examples wiki page was or was not among the things that
I read. But, you know, when people start running PostgreSQL 16, and
have some problem, they're not going to read the motivating examples
wiki page. They're going to read the documentation. If they can't find
the answer there, they (or some hacker that they contact) will
probably read the code comments and the relevant commit messages.
Those either clearly explain what was changed in a way that somebody
can understand, or they don't. If they don't, *the commits are not
good enough*, regardless of what other information may exist in any
other place.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: New strategies for freezing, advancing relfrozenxid early

From
Robert Haas
Date:
On Wed, Jan 25, 2023 at 10:56 PM Andres Freund <andres@anarazel.de> wrote:
> but that's only true because page level freezing neutered
> vacuum_freeze_min_age. Compared to <16, it's a *huge* change.

Do you think that page-level freezing
(1de58df4fec7325d91f5a8345757314be7ac05da) was improvidently
committed?

I have always been a bit skeptical of vacuum_freeze_min_age as a
mechanism. It's certainly true that it is a waste of energy to freeze
tuples that will soon be removed anyway, but on the other hand,
repeatedly dirtying the same page for various different freezing and
visibility related reasons *really sucks*, and even repeatedly reading
the page because we kept deciding not to do anything yet isn't great.
It seems possible that the page-level freezing mechanism could help
with that quite a bit, and I think that the heuristic that patch
proposes is basically reasonable: if there's at least one tuple on the
page that is old enough to justify freezing, it doesn't seem like a
bad bet to freeze all the others that can be frozen at the same time,
at least if it means that we can mark the page all-visible or
all-frozen. If it doesn't, then I'm not so sure; maybe we're best off
deferring as much work as possible to a time when we *can* mark the
page all-visible or all-frozen.
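
Restating that trade-off as a tiny standalone sketch (this is just my reading
of the bet, with invented names -- it is not the committed lazy_scan_prune()
logic, which is quoted elsewhere in this thread):

#include <stdbool.h>

/*
 * Illustration only: decide whether to freeze every freezable tuple on a
 * page rather than only what the age cutoffs strictly require.
 */
static bool
freeze_all_eligible_tuples(bool some_tuple_must_be_frozen,
                           bool page_would_become_all_frozen)
{
    /* Nothing forces our hand: defer the freezing work on this page. */
    if (!some_tuple_must_be_frozen)
        return false;

    /*
     * We have to dirty the page and write a freeze record anyway, so
     * freezing the rest is clearly worthwhile when the page can then be
     * marked all-frozen; otherwise the extra work may just be redone later.
     */
    return page_would_become_all_frozen;
}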

In short, I think that neutering vacuum_freeze_min_age at least to
some degree might be a good thing, but that's not to say that I'm
altogether confident in that patch, either.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Wed, Jan 25, 2023 at 7:56 PM Andres Freund <andres@anarazel.de> wrote:
> > https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch_2
> >
> > The difference between this and VACUUM FREEZE is described here:
> >
> > "Note how we freeze most pages, but still leave a significant number
> > unfrozen each time, despite using an eager approach to freezing
> > (2981204 scanned - 2355230 frozen = 625974 pages scanned but left
> > unfrozen). Again, this is because we don't freeze pages unless they're
> > already eligible to be set all-visible.
>
> The only reason there is a substantial difference is because of pgbench's
> uniform access pattern. Most real-world applications don't have that.

It's not pgbench! It's TPC-C. It's actually an adversarial case for
the patch series.

-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Thu, Jan 26, 2023 at 5:41 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jan 25, 2023 at 11:25 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > On Wed, Jan 25, 2023 at 7:41 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > > Both Andres and I have repeatedly expressed concern about how much is
> > > being changed in the behavior of vacuum, and how quickly, and IMHO on
> > > the basis of very limited evidence that the changes are improvements.
> > > The fact that Andres was very quickly able to find cases where the
> > > patch produces large regressions is just more evidence of that. It's
> > > also hard to even understand what has been changed, because the
> > > descriptions are so theoretical.
> >
> > Did you actually read the motivating examples Wiki page?
>
> I don't know. I've read a lot of stuff that you've written on this
> topic, which has taken a significant amount of time, and I still don't
> understand a lot of what you're changing, and I don't agree with all
> of the things that I do understand.

You complained about the descriptions being theoretical. But there's
nothing theoretical about the fact that we more or less do *all*
freezing in an eventual aggressive VACUUM in many important cases,
including very simple cases like pgbench_history -- the simplest
possible append-only table case. We'll merrily rewrite the entire
table, all at once, for no good reason at all. Consistently, reliably.
It's so incredibly obvious that this makes zero sense! And yet I don't
think you've ever engaged with such basic points as that one.

-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Andres Freund
Date:
Hi,

On 2023-01-26 09:20:57 -0500, Robert Haas wrote:
> On Wed, Jan 25, 2023 at 10:56 PM Andres Freund <andres@anarazel.de> wrote:
> > but that's only true because page level freezing neutered
> > vacuum_freeze_min_age. Compared to <16, it's a *huge* change.
> 
> Do you think that page-level freezing
> (1de58df4fec7325d91f5a8345757314be7ac05da) was improvidently
> committed?

I think it's probably ok, but perhaps deserves a bit more thought about when
to "opportunistically" freeze. Perhaps to make it *more* aggressive than it's
now.

With "opportunistic freezing" I mean freezing the page, even though we don't
*have* to freeze any of the tuples.

The overall condition gating freezing is:
    if (pagefrz.freeze_required || tuples_frozen == 0 ||
        (prunestate->all_visible && prunestate->all_frozen &&
         fpi_before != pgWalUsage.wal_fpi))

fpi_before is set before the heap_page_prune() call.

To me the
  "fpi_before != pgWalUsage.wal_fpi"
part doesn't make a whole lot of sense. For one, it won't work at all if
full_page_writes=off. But more importantly, it also means we'll not freeze
when VACUUMing a recently modified page, even if pruning already emitted a WAL
record and we'd not emit an FPI if we froze the page now.


To me a condition that checked if the buffer is already dirty and if another
XLogInsert() would be likely to generate an FPI would make more sense. The
rare race case of a checkpoint starting concurrently doesn't matter IMO.
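
As a compilable sketch of that alternative condition (illustration only, not a
patch -- the two buffer-state inputs stand in for "is the buffer already
dirty?" and "would another WAL record touching this page have to emit an
FPI?", the latter being roughly what XLogCheckBufferNeedsBackup() answers in
the server):

#include <stdbool.h>

/* Illustration only: trigger opportunistic freezing from buffer state. */
static bool
opportunistic_freeze_ok(bool page_would_become_all_frozen,
                        bool buffer_already_dirty,
                        bool next_record_would_emit_fpi)
{
    /* Only interesting if the whole page can be marked all-frozen. */
    if (!page_would_become_all_frozen)
        return false;

    /* Don't newly dirty a clean page just to freeze it... */
    if (!buffer_already_dirty)
        return false;

    /* ...and don't freeze if doing so would cost a full-page image. */
    return !next_record_would_emit_fpi;
}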


A minor complaint I have about the code is that the "tuples_frozen == 0" path
imo is confusing. We go into the "freeze" path, which then inside has another
if for the tuples_frozen == 0 part. I get that this deduplicates the
NewRelFrozenXid handling, but it still looks odd.


> I have always been a bit skeptical of vacuum_freeze_min_age as a
> mechanism. It's certainly true that it is a waste of energy to freeze
> tuples that will soon be removed anyway, but on the other hand,
> repeatedly dirtying the same page for various different freezing and
> visibility related reasons *really sucks*, and even repeatedly reading
> the page because we kept deciding not to do anything yet isn't great.
> It seems possible that the page-level freezing mechanism could help
> with that quite a bit, and I think that the heuristic that patch
> proposes is basically reasonable: if there's at least one tuple on the
> page that is old enough to justify freezing, it doesn't seem like a
> bad bet to freeze all the others that can be frozen at the same time,
> at least if it means that we can mark the page all-visible or
> all-frozen. If it doesn't, then I'm not so sure; maybe we're best off
> deferring as much work as possible to a time when we *can* mark the
> page all-visible or all-frozen.

Agreed. Freezing everything if we need to freeze some things seems quite safe
to me.


> In short, I think that neutering vacuum_freeze_min_age at least to
> some degree might be a good thing, but that's not to say that I'm
> altogether confident in that patch, either.

I am not too worried about the neutering in the page level freezing patch.

The combination of the page level work with the eager strategy is where the
sensibly-more-aggressive freeze_min_age got turbocharged to an imo dangerous
degree.

Greetings,

Andres Freund



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Thu, Jan 26, 2023 at 8:35 AM Andres Freund <andres@anarazel.de> wrote:
> I think it's probably ok, but perhaps deserves a bit more thought about when
> to "opportunistically" freeze. Perhaps to make it *more* aggressive than it's
> now.
>
> With "opportunistic freezing" I mean freezing the page, even though we don't
> *have* to freeze any of the tuples.
>
> The overall condition gating freezing is:
>         if (pagefrz.freeze_required || tuples_frozen == 0 ||
>                 (prunestate->all_visible && prunestate->all_frozen &&
>                  fpi_before != pgWalUsage.wal_fpi))
>
> fpi_before is set before the heap_page_prune() call.

Have you considered page-level checksums, and how the impact on hint
bits needs to be accounted for here?

All RDS customers use page-level checksums. And I've noticed that it's
very common for the number of FPIs to only be very slightly less than
the number of pages dirtied. Much of which is just hint bits. The
"fpi_before != pgWalUsage.wal_fpi" test catches that.

> To me a condition that checked if the buffer is already dirty and if another
> XLogInsert() would be likely to generate an FPI would make more sense. The
> rare race case of a checkpoint starting concurrently doesn't matter IMO.

That's going to be very significantly more aggressive. For example
it'll impact small tables very differently.

--
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Andres Freund
Date:
Hi,

On 2023-01-26 08:54:55 -0800, Peter Geoghegan wrote:
> On Thu, Jan 26, 2023 at 8:35 AM Andres Freund <andres@anarazel.de> wrote:
> > I think it's probably ok, but perhaps deserves a bit more thought about when
> > to "opportunistically" freeze. Perhaps to make it *more* aggressive than it's
> > now.
> >
> > With "opportunistic freezing" I mean freezing the page, even though we don't
> > *have* to freeze any of the tuples.
> >
> > The overall condition gating freezing is:
> >         if (pagefrz.freeze_required || tuples_frozen == 0 ||
> >                 (prunestate->all_visible && prunestate->all_frozen &&
> >                  fpi_before != pgWalUsage.wal_fpi))
> >
> > fpi_before is set before the heap_page_prune() call.
> 
> Have you considered page-level checksums, and how the impact on hint
> bits needs to be accounted for here?
> 
> All RDS customers use page-level checksums. And I've noticed that it's
> very common for the number of FPIs to only be very slightly less than
> the number of pages dirtied. Much of which is just hint bits. The
> "fpi_before != pgWalUsage.wal_fpi" test catches that.

I assume the case you're thinking of is that pruning did *not* do any changes,
but in the process of figuring out that nothing needed to be pruned, we did a
MarkBufferDirtyHint(), and as part of that emitted an FPI?


> > To me a condition that checked if the buffer is already dirty and if another
> > XLogInsert() would be likely to generate an FPI would make more sense. The
> > rare race case of a checkpoint starting concurrently doesn't matter IMO.
> 
> That's going to be very significantly more aggressive. For example
> it'll impact small tables very differently.

Maybe it would be too aggressive, not sure. The cost of a freeze WAL record is
relatively small, with one important exception below, if we are 99.99% sure
that it's not going to require an FPI and isn't going to dirty the page.

The exception is that a newer LSN on the page can cause the ringbuffer
replacement to trigger more aggressive WAL flushing. No meaningful
difference if we modified the page during pruning, or if the page was already
in s_b (since it likely won't be written out via the ringbuffer in that case),
but if checksums are off and we just hint-dirtied the page, it could be a
significant issue.

Thus a modification of the above logic could be to opportunistically freeze if
a ) it won't cause an FPI and either
b1) the page was already dirty before pruning, as we'll not do a ringbuffer
    replacement in that case
or
b2) We wrote a WAL record during pruning, as the difference in flush position
    is marginal
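
As a compilable sketch of a) combined with b1)/b2) (illustration only; the
parameter names are invented and nothing here is server code):

#include <stdbool.h>

/* Illustration only: opportunistic page freezing per the a)/b1)/b2) rules. */
static bool
opportunistic_freeze_page(bool would_emit_fpi,          /* a)  */
                          bool dirty_before_pruning,    /* b1) */
                          bool pruning_wrote_wal)       /* b2) */
{
    /* a) never pay for a full-page image just to freeze opportunistically. */
    if (would_emit_fpi)
        return false;

    /*
     * b1) the page was already dirty, so bumping its LSN causes no extra
     *     ring buffer replacement / XLogFlush(), or
     * b2) pruning already wrote WAL, so the difference in flush position
     *     is marginal.
     */
    return dirty_before_pruning || pruning_wrote_wal;
}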

An even more aggressive version would be to replace b1) with logic that'd
allow newly dirtying the page if it wasn't read through the ringbuffer. But
newly dirtying the page feels like it'd be more dangerous.


A less aggressive version would be to check if any WAL records were emitted
during heap_page_prune() (instead of FPIs) and whether we'd emit an FPI if we
modified the page again. Similar to what we do now, except not requiring an
FPI to have been emitted.

But to me it seems a bit odd that VACUUM now is more aggressive if checksums /
wal_log_hint bits is on, than without them. Which I think is how using either
of pgWalUsage.wal_fpi, pgWalUsage.wal_records ends up working?

Greetings,

Andres Freund



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Thu, Jan 26, 2023 at 9:53 AM Andres Freund <andres@anarazel.de> wrote:
> I assume the case you're thinking of is that pruning did *not* do any changes,
> but in the process of figuring out that nothing needed to be pruned, we did a
> MarkBufferDirtyHint(), and as part of that emitted an FPI?

Yes.

> > That's going to be very significantly more aggressive. For example
> > it'll impact small tables very differently.
>
> Maybe it would be too aggressive, not sure. The cost of a freeze WAL record is
> relatively small, with one important exception below, if we are 99.99% sure
> that it's not going to require an FPI and isn't going to dirty the page.
>
> The exception is that a newer LSN on the page can cause the ringbuffer
> replacement to trigger more aggressive WAL flushing. No meaningful
> difference if we modified the page during pruning, or if the page was already
> in s_b (since it likely won't be written out via the ringbuffer in that case),
> but if checksums are off and we just hint-dirtied the page, it could be a
> significant issue.

Most of the overhead of FREEZE WAL records (with freeze plan
deduplication and page-level freezing in) is generic WAL record header
overhead. Your recent adversarial test case is going to choke on that,
too. At least if you set checkpoint_timeout to 1 minute again.

> Thus a modification of the above logic could be to opportunistically freeze if
> a ) it won't cause an FPI and either
> b1) the page was already dirty before pruning, as we'll not do a ringbuffer
>     replacement in that case
> or
> b2) We wrote a WAL record during pruning, as the difference in flush position
>     is marginal
>
> An even more aggressive version would be to replace b1) with logic that'd
> allow newly dirtying the page if it wasn't read through the ringbuffer. But
> newly dirtying the page feels like it'd be more dangerous.

In many cases we'll have to dirty the page anyway, just to set
PD_ALL_VISIBLE. The whole way the logic works is conditioned (whether
triggered by an FPI or triggered by my now-reverted GUC) on being able
to set the whole page all-frozen in the VM.

> A less aggressive version would be to check if any WAL records were emitted
> during heap_page_prune() (instead of FPIs) and whether we'd emit an FPI if we
> modified the page again. Similar to what we do now, except not requiring an
> FPI to have been emitted.

Also way more aggressive. Not nearly enough on its own.

> But to me it seems a bit odd that VACUUM now is more aggressive if checksums /
> wal_log_hint bits is on, than without them. Which I think is how using either
> of pgWalUsage.wal_fpi, pgWalUsage.wal_records ends up working?

Which part is the odd part? Is it odd that page-level freezing works
that way, or is it odd that page-level checksums work that way?

In any case this seems like an odd thing for you to say, having
eviscerated a patch that really just made the same behavior trigger
independently of FPIs in some tables, controlled via a GUC.

-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Matthias van de Meent
Date:
On Thu, 26 Jan 2023 at 19:45, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Thu, Jan 26, 2023 at 9:53 AM Andres Freund <andres@anarazel.de> wrote:
> > I assume the case you're thinking of is that pruning did *not* do any changes,
> > but in the process of figuring out that nothing needed to be pruned, we did a
> > MarkBufferDirtyHint(), and as part of that emitted an FPI?
>
> Yes.
>
> > > That's going to be very significantly more aggressive. For example
> > > it'll impact small tables very differently.
> >
> > Maybe it would be too aggressive, not sure. The cost of a freeze WAL record is
> > relatively small, with one important exception below, if we are 99.99% sure
> > that it's not going to require an FPI and isn't going to dirty the page.
> >
> > The exception is that a newer LSN on the page can cause the ringbuffer
> > replacement to trigger more aggressive WAL flushing. No meaningful
> > difference if we modified the page during pruning, or if the page was already
> > in s_b (since it likely won't be written out via the ringbuffer in that case),
> > but if checksums are off and we just hint-dirtied the page, it could be a
> > significant issue.
>
> Most of the overhead of FREEZE WAL records (with freeze plan
> deduplication and page-level freezing in) is generic WAL record header
> overhead. Your recent adversarial test case is going to choke on that,
> too. At least if you set checkpoint_timeout to 1 minute again.

Could someone explain to me why we don't currently (optionally)
include the functionality of page freezing in the PRUNE records? I
think they're quite closely related (in that they both execute in
VACUUM and are required for long-term system stability), and are even
more related now that we have opportunistic page-level freezing. I
think adding a "freeze this page as well"-flag in PRUNE records would
go a long way to reducing the WAL overhead of aggressive and more
opportunistic freezing.

-Matthias



Re: New strategies for freezing, advancing relfrozenxid early

From
Robert Haas
Date:
On Thu, Jan 26, 2023 at 11:35 AM Peter Geoghegan <pg@bowt.ie> wrote:
> You complained about the descriptions being theoretical. But there's
> nothing theoretical about the fact that we more or less do *all*
> freezing in an eventual aggressive VACUUM in many important cases,
> including very simple cases like pgbench_history -- the simplest
> possible append-only table case. We'll merrily rewrite the entire
> table, all at once, for no good reason at all. Consistently, reliably.
> It's so incredibly obvious that this makes zero sense! And yet I don't
> think you've ever engaged with such basic points as that one.

I'm aware that that's a problem, and I agree that it sucks. I think
that what this patch does is make vacuum more aggressive, and I
expect that would help this problem. I haven't said much about that
because I don't think it's controversial. However, the patch also has
a cost, and that's what I think is controversial.

I think it's pretty much impossible to freeze more aggressively
without losing in some scenario or other. If waiting longer to freeze
would have resulted in the data getting updated again or deleted
before we froze it, then waiting longer reduces the total amount of
freezing work that ever has to be done. Freezing more aggressively
inevitably gives up some amount of that potential benefit in order to
try to secure some other benefit. It's a trade-off.

I think that the goal of a patch that makes vacuum more (or less)
aggressive should be to make the cases where we lose as obscure as
possible, and the cases where we win as broad as possible. I think
that, in order to be a good patch, it needs to be relatively difficult
to find cases where we incur a big loss. If it's easy to find a big
loss, then I think it's better to stick with the current behavior,
even if it's also easy to find a big gain. There's nothing wonderful
about the current behavior, but (to paraphrase what I think Andres has
already said several times) it's better to keep shipping code with the
same bad behavior than to put out a new major release with behaviors
that are just as bad, but different.

I feel like your emails sometimes seem to suppose that I think that
you're a bad person, or a bad developer, or that you have no good
ideas, or that you have no good ideas about this topic, or that this
topic is not important, or that we don't need to do better than we are
currently doing. I think none of those things. However, I'm also not
prepared to go all the way to the other end of the spectrum and say
that all of your ideas and everything in this patch are great. I don't
think either of those things, either.

I certainly think that freezing more aggressively in some scenarios
could be a great idea, but it seems like the patch's theory is to be
very nearly maximally aggressive in every vacuum run if the table size
is greater than some threshold, and I don't think that's right at all.
I'm not exactly sure what information we should use to decide how
aggressive to be, but I am pretty sure that the size of the table is
not it.  It's true that, for a small table, the cost of having to
eventually vacuum the whole table at once isn't going to be very high,
whereas for a large table, it will be. That line of reasoning makes a
size threshold sound reasonable. However, the amount of extra work
that we can potentially do by vacuuming more aggressively *also*
increases with the table size, which to me means using that as a
criterion actually isn't sensible at all.

One idea that I've had about how to solve this problem is to try to
make vacuum try to aggressively freeze some portion of the table on
each pass, and to behave less aggressively on the rest of the table so
that, hopefully, no single vacuum does too much work. Unfortunately, I
don't really know how to do that effectively. If we knew that the
table was going to see 10 vacuums before we hit
autovacuum_freeze_max_age, we could try to have each one do 10% of the
amount of freezing that was going to need to be done rather than
letting any single vacuum do all of it, but we don't have that sort of
information. Also, even if we did have that sort of information, the
idea only works if the pages that we freeze sooner are ones that we're
not about to update or delete again, and we don't have any idea what
is likely there. In theory we could have some system that tracks how
recently each page range in a table has been modified, and direct our
freezing activity toward the ones less-recently modified on the theory
that they're not so likely to be modified again in the near future,
but in reality we have no such system. So I don't really feel like I
know what the right answer is here, yet.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Thu, Jan 26, 2023 at 11:28 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I think it's pretty much impossible to freeze more aggressively
> without losing in some scenario or other. If waiting longer to freeze
> would have resulted in the data getting updated again or deleted
> before we froze it, then waiting longer reduces the total amount of
> freezing work that ever has to be done. Freezing more aggressively
> inevitably gives up some amount of that potential benefit in order to
> try to secure some other benefit. It's a trade-off.

There is no question about that.

> I think that the goal of a patch that makes vacuum more (or less)
> aggressive should be to make the cases where we lose as obscure as
> possible, and the cases where we win as broad as possible. I think
> that, in order to be a good patch, it needs to be relatively difficult
> to find cases where we incur a big loss. If it's easy to find a big
> loss, then I think it's better to stick with the current behavior,
> even if it's also easy to find a big gain.

Again, this seems totally uncontroversial. It's just incredibly vague,
and not at all actionable.

Relatively difficult for Andres, or for somebody else? What are the
real parameters here? Obviously there are no clear answers available.

> However, I'm also not
> prepared to go all the way to the other end of the spectrum and say
> that all of your ideas and everything in this patch are great. I don't
> think either of those things, either.

It doesn't matter. I'm done with it. This is not a negotiation about
what gets in and what doesn't get in.

All that I aim to do now is to draw some kind of line under the basic
page-level freezing work, since of course I'm still responsible for
that. And perhaps to defend my personal reputation.

> I certainly think that freezing more aggressively in some scenarios
> could be a great idea, but it seems like the patch's theory is to be
> very nearly maximally aggressive in every vacuum run if the table size
> is greater than some threshold, and I don't think that's right at all.

We'll systematically avoid accumulating debt past a certain point --
that's its purpose. That is, we'll avoid accumulating all-visible
pages that eventually need to be frozen.

> I'm not exactly sure what information we should use to decide how
> aggressive to be, but I am pretty sure that the size of the table is
> not it.  It's true that, for a small table, the cost of having to
> eventually vacuum the whole table at once isn't going to be very high,
> whereas for a large table, it will be. That line of reasoning makes a
> size threshold sound reasonable. However, the amount of extra work
> that we can potentially do by vacuuming more aggressively *also*
> increases with the table size, which to me means using that as a
> criterion actually isn't sensible at all.

The overwhelming cost is usually FPIs in any case. If you're not
mostly focussing on that, you're focussing on the wrong thing. At
least with larger tables. You just have to focus on the picture over
time, across multiple VACUUM operations.

> One idea that I've had about how to solve this problem is to try to
> make vacuum try to aggressively freeze some portion of the table on
> each pass, and to behave less aggressively on the rest of the table so
> that, hopefully, no single vacuum does too much work. Unfortunately, I
> don't really know how to do that effectively.

That has been proposed a couple of times in the context of this
thread. It won't work, because the way autovacuum works in general
(and likely always will work) doesn't allow it. With an append-only
table, each VACUUM will naturally have to scan significantly more
pages than the last one, forever (barring antiwraparound vacuums). Why
wouldn't it continue that way? I mean it might not (the table might
stop growing altogether), but then it doesn't matter much what we do.

If you're not behaving very proactively at the level of each VACUUM
operation, then the picture over time is that you're *already* falling
behind. At least with an append-only table. You have to think of the
sequence of operations, not just one.

> In theory we could have some system that tracks how
> recently each page range in a table has been modified, and direct our
> freezing activity toward the ones less-recently modified on the theory
> that they're not so likely to be modified again in the near future,
> but in reality we have no such system. So I don't really feel like I
> know what the right answer is here, yet.

So we need to come up with a way of getting reliable information from
the future, about an application that we have no particular
understanding of. As opposed to just eating the cost to some degree,
and making it configurable.

-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Thu, Jan 26, 2023 at 11:26 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> Could someone explain to me why we don't currently (optionally)
> include the functionality of page freezing in the PRUNE records? I
> think they're quite closely related (in that they both execute in
> VACUUM and are required for long-term system stability), and are even
> more related now that we have opportunistic page-level freezing. I
> think adding a "freeze this page as well"-flag in PRUNE records would
> go a long way to reducing the WAL overhead of aggressive and more
> opportunistic freezing.

Yeah, we've talked about doing that in the past year. It's quite
possible. It would make quite a lot of sense, because the actual
overhead of the WAL record for freezing tends to come from the generic
WAL record header stuff itself. If there was only one record for both,
then you'd only need to include the relfilenode and block number (and
so on) once.

It would be tricky to handle Multis, so what you'd probably do is just
freeze xmin, and possibly aborted and locker XIDs in xmax. So you
wouldn't completely get rid of the main freeze record, but would be
able to avoid it in many important cases.

-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Andres Freund
Date:
Hi,

On 2023-01-26 10:44:45 -0800, Peter Geoghegan wrote:
> On Thu, Jan 26, 2023 at 9:53 AM Andres Freund <andres@anarazel.de> wrote:
> > > That's going to be very significantly more aggressive. For example
> > > it'll impact small tables very differently.
> >
> > Maybe it would be too aggressive, not sure. The cost of a freeze WAL record is
> > relatively small, with one important exception below, if we are 99.99% sure
> > that it's not going to require an FPI and isn't going to dirty the page.
> >
> > The exception is that a newer LSN on the page can cause the ringbuffer
> > replacement to trigger more aggressive WAL flushing. No meaningful
> > difference if we modified the page during pruning, or if the page was already
> > in s_b (since it likely won't be written out via the ringbuffer in that case),
> > but if checksums are off and we just hint-dirtied the page, it could be a
> > significant issue.
> 
> Most of the overhead of FREEZE WAL records (with freeze plan
> deduplication and page-level freezing in) is generic WAL record header
> overhead. Your recent adversarial test case is going to choke on that,
> too. At least if you set checkpoint_timeout to 1 minute again.

I don't quite follow. What do you mean by "record header overhead"? Unless
that includes FPIs, I don't think that's that commonly true?

The problematic case I am talking about is when we do *not* emit a WAL record
during pruning (because there's nothing to prune), but want to freeze the
table. If you don't log an FPI, the remaining big overhead is that increasing
the LSN on the page will often cause an XLogFlush() when writing out the
buffer.

I don't see what your reference to checkpoint timeout is about here?

Also, as I mentioned before, the problem isn't specific to checkpoint_timeout
= 1min. It just makes it cheaper to reproduce.


> > Thus a modification of the above logic could be to opportunistically freeze if
> > a ) it won't cause an FPI and either
> > b1) the page was already dirty before pruning, as we'll not do a ringbuffer
> >     replacement in that case
> > or
> > b2) We wrote a WAL record during pruning, as the difference in flush position
> >     is marginal
> >
> > An even more aggressive version would be to replace b1) with logic that'd
> > allow newly dirtying the page if it wasn't read through the ringbuffer. But
> > newly dirtying the page feels like it'd be more dangerous.
> 
> In many cases we'll have to dirty the page anyway, just to set
> PD_ALL_VISIBLE. The whole way the logic works is conditioned (whether
> triggered by an FPI or triggered by my now-reverted GUC) on being able
> to set the whole page all-frozen in the VM.

IIRC setting PD_ALL_VISIBLE doesn't trigger an FPI unless we need to log hint
bits. But freezing does trigger one even without wal_log_hint_bits.

You're right, it makes sense to consider whether we'll emit a
XLOG_HEAP2_VISIBLE anyway.


> > A less aggressive version would be to check if any WAL records were emitted
> > during heap_page_prune() (instead of FPIs) and whether we'd emit an FPI if we
> > modified the page again. Similar to what we do now, except not requiring an
> > FPI to have been emitted.
> 
> Also way more aggressive. Not nearly enough on its own.

In which cases will it be problematically more aggressive?

If we emitted a WAL record during pruning we've already set the LSN of the
page to a very recent LSN. We know the page is dirty. Thus we'll already
trigger an XLogFlush() during ringbuffer replacement. We won't emit an FPI.



> > But to me it seems a bit odd that VACUUM now is more aggressive if checksums /
> > wal_log_hint bits is on, than without them. Which I think is how using either
> > of pgWalUsage.wal_fpi, pgWalUsage.wal_records ends up working?
> 
> Which part is the odd part? Is it odd that page-level freezing works
> that way, or is it odd that page-level checksums work that way?

That page-level freezing works that way.


> In any case this seems like an odd thing for you to say, having
> eviscerated a patch that really just made the same behavior trigger
> independently of FPIs in some tables, controlled via a GUC.

jdksjfkjdlkajsd;lfkjasd;lkfj;alskdfj

That behaviour I criticized was causing a torrent of FPIs and additional
dirtying of pages. My proposed replacement for the current FPI check doesn't,
because a) it only triggers when we wrote a WAL record b) It doesn't trigger
if we would write an FPI.

Greetings,

Andres Freund



Re: New strategies for freezing, advancing relfrozenxid early

From
Robert Haas
Date:
On Thu, Jan 26, 2023 at 2:57 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Relatively difficult for Andres, or for somebody else? What are the
> real parameters here? Obviously there are no clear answers available.

Andres is certainly smarter than the average guy, but practically any
scenario that someone can create in a few lines of SQL is something that
code will be exposed to on some real-world system. If Andres
came along and said, hey, well I found a way to make this patch suck,
and proceeded to describe a scenario that involved a complex set of
tables and multiple workloads running simultaneously and using a
debugger to trigger some race condition and whatever, I'd be like "OK,
but is that really going to happen?". The actual scenario he came up
with is three lines of SQL, and it's nothing remotely obscure. That
kind of thing is going to happen *all the time*.

> The overwhelming cost is usually FPIs in any case. If you're not
> mostly focussing on that, you're focussing on the wrong thing. At
> least with larger tables. You just have to focus on the picture over
> time, across multiple VACUUM operations.

I think that's all mostly true, but the cases where being more
aggressive can cause *extra* FPIs are worthy of just as much attention
as the cases where we can reduce them.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: New strategies for freezing, advancing relfrozenxid early

From
Andres Freund
Date:
Hi,

On 2023-01-26 20:26:00 +0100, Matthias van de Meent wrote:
> Could someone explain to me why we don't currently (optionally)
> include the functionality of page freezing in the PRUNE records?

I think we definitely should (and have argued for it a couple times). It's not
just about reducing WAL overhead, it's also about reducing redundant
visibility checks - which are where a very significant portion of the CPU time
for VACUUMing goes.

Besides performance considerations, it's also just plain weird that
lazy_scan_prune() can end up with a different visibility than
heap_page_prune() (mostly due to concurrent aborts).


The number of WAL records we often end up emitting for a processing a single
page in vacuum is just plain absurd:
- PRUNE
- FREEZE_PAGE
- VISIBLE

There's afaict no justification whatsoever for these to be separate records.


> I think they're quite closely related (in that they both execute in VACUUM
> and are required for long-term system stability), and are even more related
> now that we have opportunistic page-level freezing. I think adding a "freeze
> this page as well"-flag in PRUNE records would go a long way to reducing the
> WAL overhead of aggressive and more opportunistic freezing.

Yep.

I think we should also seriously consider setting all-visible during on-access
pruning, and freezing rows during on-access pruning.

Greetings,

Andres Freund



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Thu, Jan 26, 2023 at 12:54 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > The overwhelming cost is usually FPIs in any case. If you're not
> > mostly focussing on that, you're focussing on the wrong thing. At
> > least with larger tables. You just have to focus on the picture over
> > time, across multiple VACUUM operations.
>
> I think that's all mostly true, but the cases where being more
> aggressive can cause *extra* FPIs are worthy of just as much attention
> as the cases where we can reduce them.

It's a question of our exposure to real problems, in no small part.
What can we afford to be wrong about? What problem can be fixed by the
user more or less as it emerges, and what problem doesn't have that
quality?

There is very good reason to believe that the large majority of all
data that people store in a system like Postgres is extremely cold
data:


https://www.microsoft.com/en-us/research/video/cost-performance-in-modern-data-stores-how-data-cashing-systems-succeed/
https://brandur.org/fragments/events

Having a separate aggressive step that rewrites an entire large table,
apparently at random, is just a huge burden to users. You've said that
you agree that it sucks, but somehow I still can't shake the feeling
that you don't fully understand just how much it sucks.

-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Robert Haas
Date:
On Thu, Jan 26, 2023 at 4:06 PM Peter Geoghegan <pg@bowt.ie> wrote:
> There is very good reason to believe that the large majority of all
> data that people store in a system like Postgres is extremely cold
> data:

The systems where I end up troubleshooting problems seem to be, most
typically, busy OLTP systems. I'm not in a position to say whether
that's more or less common than systems with extremely cold data, but
I am in a position to say that my employer will have a lot fewer happy
customers if we regress that use case. Naturally I'm keen to avoid
that.

> Having a separate aggressive step that rewrites an entire large table,
> apparently at random, is just a huge burden to users. You've said that
> you agree that it sucks, but somehow I still can't shake the feeling
> that you don't fully understand just how much it sucks.

Ha!

Well, that's possible. But maybe you don't understand how much your
patch makes other things suck.

I don't think we can really get anywhere here by postulating that the
problem is the other person's lack of understanding, even if such a
postulate should happen to be correct.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Thu, Jan 26, 2023 at 1:22 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jan 26, 2023 at 4:06 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > There is very good reason to believe that the large majority of all
> > data that people store in a system like Postgres is extremely cold
> > data:
>
> The systems where I end up troubleshooting problems seem to be, most
> typically, busy OLTP systems. I'm not in a position to say whether
> that's more or less common than systems with extremely cold data, but
> I am in a position to say that my employer will have a lot fewer happy
> customers if we regress that use case. Naturally I'm keen to avoid
> that.

This is the kind of remark that makes me think that you don't get it.

The most influential OLTP benchmark of all time is TPC-C, which has
exactly this problem. In spades -- it's enormously disruptive. Which
is one reason why I used it as a showcase for a lot of this work. Plus
practical experience (like the Heroku database in the blog post I
linked to) fully agrees with that benchmark, as far as this stuff goes
-- that was also a busy OLTP database.

Online transaction processing involves transactions. Right? There is presumably
some kind of ledger, some kind of orders table. Naturally these have
entries that age out fairly predictably. After a while, almost all the
data is cold data. It is usually about that simple.

One of the key strengths of systems like Postgres is the ability to
inexpensively store a relatively large amount of data that has just
about zero chance of being read, let alone modified. While at the same
time having decent OLTP performance for the hot data. Not nearly as
good as an in-memory system, mind you -- and yet in-memory systems
remain largely a niche thing.

-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Thu, Jan 26, 2023 at 12:45 PM Andres Freund <andres@anarazel.de> wrote:
> > Most of the overhead of FREEZE WAL records (with freeze plan
> > deduplication and page-level freezing in) is generic WAL record header
> > overhead. Your recent adversarial test case is going to choke on that,
> > too. At least if you set checkpoint_timeout to 1 minute again.
>
> I don't quite follow. What do you mean with "record header overhead"? Unless
> that includes FPIs, I don't think that's that commonly true?

Even if there are no directly observable FPIs, there is still extra
WAL, which can cause FPIs indirectly, just by making checkpoints more
frequent. I feel ridiculous even having to explain this to you.

> The problematic case I am talking about is when we do *not* emit a WAL record
> during pruning (because there's nothing to prune), but want to freeze the
> table. If you don't log an FPI, the remaining big overhead is that increasing
> the LSN on the page will often cause an XLogFlush() when writing out the
> buffer.
>
> I don't see what your reference to checkpoint timeout is about here?
>
> Also, as I mentioned before, the problem isn't specific to checkpoint_timeout
> = 1min. It just makes it cheaper to reproduce.

That's flagrantly intellectually dishonest. Sure, it made it easier to
reproduce. But that's not all it did!

You had *lots* of specific numbers and technical details in your first
email, such as "Time for vacuuming goes up to ~5x. WAL volume to
~9x.". But you did not feel that it was worth bothering with details
like having set checkpoint_timeout to 1 minute, which is a setting
that nobody uses, and obviously had a multiplicative effect. That
detail was unimportant. I had to drag it out of you!

You basically found a way to add WAL overhead to a system/workload
that is already in a write amplification vicious cycle, with latent
tipping point type behavior.

There is a practical point here, that is equally obvious, and yet
somehow still needs to be said: benchmarks like that one are basically
completely free of useful information. If we can't agree on how to
assess such things in general, then what can we agree on when it comes
to what should be done about it, what trade-off to make, when it comes
to any similar question?

> > In many cases we'll have to dirty the page anyway, just to set
> > PD_ALL_VISIBLE. The whole way the logic works is conditioned (whether
> > triggered by an FPI or triggered by my now-reverted GUC) on being able
> > to set the whole page all-frozen in the VM.
>
> IIRC setting PD_ALL_VISIBLE doesn't trigger an FPI unless we need to log hint
> bits. But freezing does trigger one even without wal_log_hints.

That is correct.

> You're right, it makes sense to consider whether we'll emit a
> XLOG_HEAP2_VISIBLE anyway.

As written the page-level freezing FPI mechanism probably doesn't
really stand to benefit much from doing that. Either checksums are
disabled and it's just a hint, or they're enabled and there is a very
high chance that we'll get an FPI inside lazy_scan_prune rather than
right after it is called, when PD_ALL_VISIBLE is set.

That's not perfect, of course, but it doesn't have to be. Perhaps it
should still be improved, just on general principle.

> > > A less aggressive version would be to check if any WAL records were emitted
> > > during heap_page_prune() (instead of FPIs) and whether we'd emit an FPI if we
> > > modified the page again. Similar to what we do now, except not requiring an
> > > FPI to have been emitted.
> >
> > Also way more aggressive. Not nearly enough on its own.
>
> In which cases will it be problematically more aggressive?
>
> If we emitted a WAL record during pruning we've already set the LSN of the
> page to a very recent LSN. We know the page is dirty. Thus we'll already
> trigger an XLogFlush() during ringbuffer replacement. We won't emit an FPI.

You seem to be talking about this as if the only thing that could
matter is the immediate FPI -- the first order effects -- and not any
second order effects. You certainly didn't get to 9x extra WAL
overhead by controlling for that before. Should I take it that you've
decided to assess these things more sensibly now? Out of curiosity:
why the change of heart?

> > > But to me it seems a bit odd that VACUUM now is more aggressive if checksums /
> > > wal_log_hint bits is on, than without them. Which I think is how using either
> > > of pgWalUsage.wal_fpi, pgWalUsage.wal_records ends up working?
> >
> > Which part is the odd part? Is it odd that page-level freezing works
> > that way, or is it odd that page-level checksums work that way?
>
> That page-level freezing works that way.

I think that it will probably cause a little confusion, and should be
specifically documented. But other than that, it seems reasonable
enough to me. I mean, should I not do something that's going to be of
significant help to users with checksums, just because it'll be
somewhat confusing to a small minority of them?

> > In any case this seems like an odd thing for you to say, having
> > eviscerated a patch that really just made the same behavior trigger
> > independently of FPIs in some tables, controlled via a GUC.
>
> jdksjfkjdlkajsd;lfkjasd;lkfj;alskdfj
>
> That behaviour I criticized was causing a torrent of FPIs and additional
> dirtying of pages. My proposed replacement for the current FPI check doesn't,
> because a) it only triggers when we wrote a WAL record b) It doesn't trigger
> if we would write an FPI.

It increases the WAL written in many important cases that
vacuum_freeze_strategy_threshold avoided. Sure, it did have some
problems, but the general idea of adding some high level
context/strategies seems essential to me.

You also seem to be suggesting that your proposed change to how basic
page-level freezing works will make freezing of pages on databases
with page-level checksums similar to an equivalent case without
checksums enabled. Even assuming that that's an important goal, you
won't be much closer to achieving it under your scheme, since hint
bits being set during VACUUM and requiring an FPI still make a huge
difference. Tables like pgbench_history have pages that generally
aren't pruned, that don't need to log an FPI just to set
PD_ALL_VISIBLE once checksums are disabled.

That's the difference that users are going to notice between checksums
enabled vs disabled, if they notice any -- it's the most important one
by far.

--
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Andres Freund
Date:
Hi,

On 2023-01-26 14:27:53 -0500, Robert Haas wrote:
> One idea that I've had about how to solve this problem is to try to
> make vacuum try to aggressively freeze some portion of the table on
> each pass, and to behave less aggressively on the rest of the table so
> that, hopefully, no single vacuum does too much work.

I agree that this rough direction is worthwhile to pursue.


> Unfortunately, I don't really know how to do that effectively. If we knew
> that the table was going to see 10 vacuums before we hit
> autovacuum_freeze_max_age, we could try to have each one do 10% of the
> amount of freezing that was going to need to be done rather than letting any
> single vacuum do all of it, but we don't have that sort of information.

I think, quite fundamentally, it's not possible to bound the amount of work an
anti-wraparound vacuum has to do if we don't have an age-based autovacuum
trigger kicking in before autovacuum_freeze_max_age. After all, there might be
no autovacuum at all before autovacuum_freeze_max_age is reached.

But there's just no reason to not have a trigger below
autovacuum_freeze_max_age. That's why I think Peter's patch to split age and
anti-"auto-cancel" autovacuums is an strictly necessary change if we want to
make autovacuum fundamentally suck less. There's a few boring details to
figure out how to set/compute those limits, but I don't think there's anything
fundamentally hard.


I think we also need the number of all-frozen pages in pg_class if we want to
make better scheduling decisions. As we already compute the number of
all-visible pages at the end of vacuuming, we can compute the number of
all-frozen pages as well. The space for another integer in pg_class doesn't
bother me one bit.


Let's say we had an autovacuum_vacuum_age trigger of 100m, and
autovacuum_freeze_max_age=500m. We know that we're roughly going to be
vacuuming 5 times before reaching autovacuum_freeze_max_age (very slow
autovacuums are an issue, but if one autovacuum takes 100m+ xids long, there's
not much we can do).

With that we could determine the number of pages to freeze eagerly along the lines of:
  frozen_target = Min(age(relfrozenxid), autovacuum_freeze_max_age) / autovacuum_freeze_max_age
  eager_pages = Max(0, frozen_target * relpages - pg_class.relallfrozen)
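
To make the arithmetic concrete, here's a standalone sketch (relallfrozen is
the hypothetical new pg_class column; the names and exact shape are purely
illustrative, not Postgres code):

#include <stdint.h>

/*
 * Standalone sketch: given the table's XID age, the freeze ceiling, and a
 * hypothetical count of already-frozen pages, work out how many pages this
 * VACUUM should try to freeze eagerly.
 */
static uint32_t
eager_freeze_budget(uint32_t relfrozenxid_age, uint32_t freeze_max_age,
                    uint32_t relpages, uint32_t relallfrozen)
{
    double      frozen_target;
    double      target_pages;

    if (freeze_max_age == 0)
        return 0;

    /* Fraction of the table we'd like to have frozen by now (0.0 .. 1.0) */
    frozen_target = (double) (relfrozenxid_age < freeze_max_age ?
                              relfrozenxid_age : freeze_max_age) / freeze_max_age;
    target_pages = frozen_target * relpages;

    /* Already at or ahead of the target: nothing extra to do this time */
    if ((double) relallfrozen >= target_pages)
        return 0;

    return (uint32_t) (target_pages - relallfrozen);
}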

One thing I don't know fully how to handle is how to ensure that we try to
freeze a different part of the table each vacuum. I guess we could store a
page number in pgstats?


This would help address the "cliff" issue of reaching
autovacuum_freeze_max_age. What it would *not* help with, on its own, is the
number of times we rewrite pages.

I can guess at a few ways to heuristically identify when tables are "append
mostly" from vacuum's view (a table can be update heavy, but very localized to
recent rows, and still be append mostly from vacuum's view).  There's obvious
cases, e.g. when there are way more inserts than dead rows.  But other cases
are harder.



> Also, even if we did have that sort of information, the idea only works if
> the pages that we freeze sooner are ones that we're not about to update or
> delete again, and we don't have any idea what is likely there.

Perhaps we could use something like
  (age(relfrozenxid) - age(newest_xid_on_page)) / age(relfrozenxid)
as a heuristic?
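
In other words, something like this standalone sketch (newest_xid_on_page is
not something we track today; it would have to come from somewhere):

#include <stdint.h>

/*
 * Standalone sketch: what fraction of the table's total XID "lifetime" has
 * passed since the newest XID on the page?  Values close to 1.0 mean the
 * page has been static for a long time and looks like a good freezing
 * candidate; values close to 0.0 mean it was modified recently.
 */
static double
page_staleness(uint32_t relfrozenxid_age, uint32_t newest_xid_on_page_age)
{
    if (relfrozenxid_age == 0)
        return 0.0;

    /* newest_xid_on_page_age is assumed to be <= relfrozenxid_age */
    return (double) (relfrozenxid_age - newest_xid_on_page_age) /
           (double) relfrozenxid_age;
}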

I have a gut feeling that we should somehow collect/use statistics about the
number of frozen pages, marked as such by the last (or recent?) vacuum, that
had to be "defrosted" by backends. But I don't quite know how to yet.  I think
we could collect statistics about that by storing the LSN of the last vacuum
in the shared stats, and incrementing that counter when defrosting.

A lot of things like that would work a whole lot better if we had statistics
that take older data into account, but weigh it less than more recent
data. But that's hard/expensive to collect.


Greetings,

Andres Freund



Re: New strategies for freezing, advancing relfrozenxid early

From
Andres Freund
Date:
Hi,

On 2023-01-26 15:36:52 -0800, Peter Geoghegan wrote:
> On Thu, Jan 26, 2023 at 12:45 PM Andres Freund <andres@anarazel.de> wrote:
> > > Most of the overhead of FREEZE WAL records (with freeze plan
> > > deduplication and page-level freezing in) is generic WAL record header
> > > overhead. Your recent adversarial test case is going to choke on that,
> > > too. At least if you set checkpoint_timeout to 1 minute again.
> >
> > I don't quite follow. What do you mean with "record header overhead"? Unless
> > that includes FPIs, I don't think that's that commonly true?
>
> Even if there are no directly observable FPIs, there is still extra
> WAL, which can cause FPIs indirectly, just by making checkpoints more
> frequent. I feel ridiculous even having to explain this to you.

What does that have to do with "generic WAL record overhead"?


I also don't really see how that is responsive to anything else in my
email. That's just as true for the current gating condition (the issuance of
an FPI during heap_page_prune() / HTSV()).

What I was wondering about is whether we should replace the
  fpi_before != pgWalUsage.wal_fpi
with
  records_before != pgWalUsage.wal_records && !WouldIssueFpi(page)


> > The problematic case I am talking about is when we do *not* emit a WAL record
> > during pruning (because there's nothing to prune), but want to freeze the
> > table. If you don't log an FPI, the remaining big overhead is that increasing
> > the LSN on the page will often cause an XLogFlush() when writing out the
> > buffer.
> >
> > I don't see what your reference to checkpoint timeout is about here?
> >
> > Also, as I mentioned before, the problem isn't specific to checkpoint_timeout
> > = 1min. It just makes it cheaper to reproduce.
>
> That's flagrantly intellectually dishonest. Sure, it made it easier to
> reproduce. But that's not all it did!
>
> You had *lots* of specific numbers and technical details in your first
> email, such as "Time for vacuuming goes up to ~5x. WAL volume to
> ~9x.". But you did not feel that it was worth bothering with details
> like having set checkpoint_timeout to 1 minute, which is a setting
> that nobody uses, and obviously had a multiplicative effect. That
> detail was unimportant. I had to drag it out of you!

The multiples were for checkpoint_timeout=5min, with
WHERE ts < now() - '250s' instead of WHERE ts < now() - '120s'

I started out with checkpoint_timeout=1min, as I wanted to quickly test my
theory. Then I increased checkpoint_timeout back to 5min, adjusted the query
to some randomly guessed value. Happened to get nearly the same results.

I then experimented more with '1min', because it's less annoying to have to
wait for 120s until deletions start, than to wait for 250s. Because it's
quicker to run I thought I'd share the less resource intensive version. A
mistake as I now realize.


This wasn't intended as a carefully designed benchmark, or anything. It was a
quick proof for a problem that I found obvious. And it's not something worth
testing carefully - e.g. the constants in the test are actually quite hardware
specific, because the insert/seconds rate is very machine specific, and it's
completely unnecessarily hardware intensive due to the use of single-row
inserts, instead of batched operations.  It's just a POC.



> You basically found a way to add WAL overhead to a system/workload
> that is already in a write amplification vicious cycle, with latent
> tipping point type behavior.
>
> There is a practical point here, that is equally obvious, and yet
> somehow still needs to be said: benchmarks like that one are basically
> completely free of useful information. If we can't agree on how to
> assess such things in general, then what can we agree on when it comes
> to what should be done about it, what trade-off to make, when it comes
> to any similar question?

It's not at all free of useful information. It reproduces a problem I
predicted repeatedly, that others in the discussion also wondered about, that
you refused to acknowledge or address.

It's not a good benchmark - I completely agree with that much. It was not
designed to carefully benchmark different settings or such. It was designed to
show a problem. And it does that.



> > You're right, it makes sense to consider whether we'll emit a
> > XLOG_HEAP2_VISIBLE anyway.
>
> As written the page-level freezing FPI mechanism probably doesn't
> really stand to benefit much from doing that. Either checksums are
> disabled and it's just a hint, or they're enabled and there is a very
> high chance that we'll get an FPI inside lazy_scan_prune rather than
> right after it is called, when PD_ALL_VISIBLE is set.

I think it might be useful with logged hint bits, consider cases where all the
tuples on the page were already fully hinted. That's not uncommon, I think?


> > > > A less aggressive version would be to check if any WAL records were emitted
> > > > during heap_page_prune() (instead of FPIs) and whether we'd emit an FPI if we
> > > > modified the page again. Similar to what we do now, except not requiring an
> > > > FPI to have been emitted.
> > >
> > > Also way more aggressive. Not nearly enough on its own.
> >
> > In which cases will it be problematically more aggressive?
> >
> > If we emitted a WAL record during pruning we've already set the LSN of the
> > page to a very recent LSN. We know the page is dirty. Thus we'll already
> > trigger an XLogFlush() during ringbuffer replacement. We won't emit an FPI.
>
> You seem to be talking about this as if the only thing that could
> matter is the immediate FPI -- the first order effects -- and not any
> second order effects.

    /*
     * Freeze the page when heap_prepare_freeze_tuple indicates that at least
     * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.  Also
     * freeze when pruning generated an FPI, if doing so means that we set the
     * page all-frozen afterwards (might not happen until final heap pass).
     */
    if (pagefrz.freeze_required || tuples_frozen == 0 ||
        (prunestate->all_visible && prunestate->all_frozen &&
         fpi_before != pgWalUsage.wal_fpi))

That's just as true for this.

What I'd like to know is why the second order effects of the above are lesser
than for
    if (pagefrz.freeze_required || tuples_frozen == 0 ||
        (prunestate->all_visible && prunestate->all_frozen &&
         records_before != pgWalUsage.wal_records && !WouldIssueFpi(page)))




> You certainly didn't get to 9x extra WAL
> overhead by controlling for that before. Should I take it that you've
> decided to assess these things more sensibly now? Out of curiosity:
> why the change of heart?

Dude.

What would the point have been to invest a lot of time in a repro for a
predicted problem? It's a problem repro, not a carefully designed benchmark.



> > > In any case this seems like an odd thing for you to say, having
> > > eviscerated a patch that really just made the same behavior trigger
> > > independently of FPIs in some tables, controlled via a GUC.
> >
> > jdksjfkjdlkajsd;lfkjasd;lkfj;alskdfj
> >
> > That behaviour I criticized was causing a torrent of FPIs and additional
> > dirtying of pages. My proposed replacement for the current FPI check doesn't,
> > because a) it only triggers when we wrote a WAL record b) It doesn't trigger
> > if we would write an FPI.
>
> It increases the WAL written in many important cases that
> vacuum_freeze_strategy_threshold avoided. Sure, it did have some
> problems, but the general idea of adding some high level
> context/strategies seems essential to me.

I was discussing changing the conditions for the "opportunistic pruning"
logic, not about a replacement for the eager freezing strategy.


Greetings,

Andres Freund



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Thu, Jan 26, 2023 at 6:37 PM Andres Freund <andres@anarazel.de> wrote:
> I also don't really see how that is responsive to anything else in my
> email. That's just as true for the current gating condition (the issuance of
> an FPI during heap_page_prune() / HTSV()).
>
> What I was wondering about is whether we should replace the
>   fpi_before != pgWalUsage.wal_fpi
> with
>   records_before != pgWalUsage.wal_records && !WouldIssueFpi(page)

I understand that. What I'm saying is that that's going to create a
huge problem of its own, unless you separately account for that
problem.

The simplest and obvious example is something like a pgbench_tellers
table. VACUUM will generally run fast enough relative to the workload
that it will set some of those pages all-visible. Now it's going to
freeze them, too. Arguably it shouldn't even be setting the pages
all-visible, but now you make that existing problem much worse.

The important point is that there doesn't seem to be any good way
around thinking about the table as a whole if you're going to freeze
speculatively. This is not the same dynamic as we see with the FPI
thing IMV -- that's not nearly so speculative as what you're talking
about, since it is speculative in roughly the same sense that eager
freezing was speculative (hence the suggestion that something like
vacuum_freeze_strategy_threshold could have a role to play).

The FPI thing is mostly about the cost now versus the cost later on.
You're gambling that you won't get another FPI later on if you freeze
now. But the cost of a second FPI later on is so much higher than the
added cost of freezing now that that's a very favorable bet, that we
can afford to "lose" many times while still coming out ahead overall.
And even when we lose, you generally still won't have been completely
wrong -- even then there generally will indeed be a second FPI later
on for the same page, to go with everything else. This makes the
wasted freezing even less significant, on a comparative basis!

It's also likely true that an FPI in lazy_scan_prune is a much
stronger signal, but I think that the important dynamic is that we're
reasoning about "costs now vs costs later on". The asymmetry is really
important.

-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Andres Freund
Date:
Hi,

On 2023-01-26 19:01:03 -0800, Peter Geoghegan wrote:
> On Thu, Jan 26, 2023 at 6:37 PM Andres Freund <andres@anarazel.de> wrote:
> > I also don't really see how that is responsive to anything else in my
> > email. That's just as true for the current gating condition (the issuance of
> > an FPI during heap_page_prune() / HTSV()).
> >
> > What I was wondering about is whether we should replace the
> >   fpi_before != pgWalUsage.wal_fpi
> > with
> >   records_before != pgWalUsage.wal_records && !WouldIssueFpi(page)
>
> I understand that. What I'm saying is that that's going to create a
> huge problem of its own, unless you separately account for that
> problem.

> The simplest and obvious example is something like a pgbench_tellers
> table. VACUUM will generally run fast enough relative to the workload
> that it will set some of those pages all-visible. Now it's going to
> freeze them, too. Arguably it shouldn't even be setting the pages
> all-visible, but now you make that existing problem much worse.

So the benefit of the FPI condition is that it indicates that the page hasn't
been updated all that recently, because, after all, a checkpoint has happened
since?  If that's the intention, it needs a huge honking comment - at least I
can't read that out of:

  Also freeze when pruning generated an FPI, if doing so means that we set the
  page all-frozen afterwards (might not happen until final heap pass).


It doesn't seem like a great proxy to me. ISTM that this means that how
aggressive vacuum is about opportunistically freezing pages depends on config
variables like checkpoint_timeout & max_wal_size (less common opportunistic
freezing), full_page_writes & use of unlogged tables (no opportunistic
freezing), and the largely random scheduling of autovac workers.


I can see it making a difference for pgbench_tellers, but it's a pretty small
difference in overall WAL volume. I can think of more adverse workloads though
- but even there the difference seems not huge, and not predictably
reached. Due to the freeze plan stuff you added, the amount of WAL for
freezing a page is pretty darn small compared to the amount of WAL needed to
fill a page with non-frozen tuples.

That's not to say we shouldn't reduce the risk - I agree that both the "any
fpi" and the "any record" condition can have adverse effects!


However, an already dirty page getting frozen is also the one case where
freezing won't have a meaningful write amplification effect. So I think it's
worth spending effort figuring out how we can make the downsides of freezing
in that situation unlikely and small.


The cases with downsides are tables that are very heavily updated throughout,
where the page is going to be defrosted again almost immediately. As you say,
the all-visible marking has a similar problem.


Essentially the "any fpi" logic is a very coarse grained way of using the page
LSN as a measurement. As I said, I don't think "has a checkpoint occurred
since the last write" is a good metric to avoid unnecessary freezing - it's
too coarse. But I think using the LSN is the right thought. What about
something like

  lsn_threshold =  insert_lsn - (insert_lsn - lsn_of_last_vacuum) * 0.1
  if (/* other conds */ && PageGetLSN(page) <= lsn_threshold)
     FreezeMe();

I probably got some details wrong, what I am going for with lsn_threshold is
that we'd freeze an already dirty page if it's not been updated within 10% of
the LSN distance to the last VACUUM.
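
Spelled out as a standalone sketch (lsn_of_last_vacuum is metadata we don't
track anywhere today, and the 10% factor is an arbitrary placeholder):

#include <stdbool.h>
#include <stdint.h>

/*
 * Standalone sketch of the recency test described above.  LSNs are treated
 * as plain 64-bit WAL positions; insert_lsn is assumed to be at least
 * lsn_of_last_vacuum.
 */
static bool
page_cold_enough_to_freeze(uint64_t page_lsn, uint64_t insert_lsn,
                           uint64_t lsn_of_last_vacuum)
{
    /* 10% of the WAL distance covered since this table was last vacuumed */
    uint64_t    recent_window = (insert_lsn - lsn_of_last_vacuum) / 10;
    uint64_t    lsn_threshold = insert_lsn - recent_window;

    /* Freeze (an already dirty page) only if it wasn't touched recently */
    return page_lsn <= lsn_threshold;
}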



> The important point is that there doesn't seem to be any good way
> around thinking about the table as a whole if you're going to freeze
> speculatively. This is not the same dynamic as we see with the FPI
> thing IMV -- that's not nearly so speculative as what you're talking
> about, since it is speculative in roughly the same sense that eager
> freezing was speculative (hence the suggestion that something like
> vacuum_freeze_strategy_threshold could have a role to play).

I don't think the speculation is that fundamentally different - a heavily
updated table with a bit of a historic, non-changing portion, makes
vacuum_freeze_strategy_threshold freeze way more aggressively than either "any
record" or "any fpi".


> The FPI thing is mostly about the cost now versus the cost later on.
> You're gambling that you won't get another FPI later on if you freeze
> now. But the cost of a second FPI later on is so much higher than the
> added cost of freezing now that that's a very favorable bet, that we
> can afford to "lose" many times while still coming out ahead overall.

Agreed. And not just avoiding FPIs, avoiding another dirtying of the page! The
latter part is especially huge IMO. Depending on s_b size it can also avoid
another *read* of the page...


> And even when we lose, you generally still won't have been completely
> wrong -- even then there generally will indeed be a second FPI later
> on for the same page, to go with everything else. This makes the
> wasted freezing even less significant, on a comparative basis!

This is precisely why I think that we can afford to be quite aggressive about
freezing already dirty pages...


Greetings,

Andres Freund



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Thu, Jan 26, 2023 at 9:58 PM Andres Freund <andres@anarazel.de> wrote:
> It doesn't seem like a great proxy to me. ISTM that this means that how
> aggressive vacuum is about opportunistically freezing pages depends on config
> variables like checkpoint_timeout & max_wal_size (less common opportunistic
> freezing), full_page_writes & use of unlogged tables (no opportunistic
> freezing), and the largely random scheduling of autovac workers.

The FPI thing was originally supposed to complement the freezing
strategies stuff, and possibly other rules that live in
lazy_scan_prune. Obviously you can freeze a page by following any rule
that you care to invent -- you can decide by calling random(). Two
rules can coexist during the same VACUUM (actually, they do already).

> Essentially the "any fpi" logic is a very coarse grained way of using the page
> LSN as a measurement. As I said, I don't think "has a checkpoint occurred
> since the last write" is a good metric to avoid unnecessary freezing - it's
> too coarse. But I think using the LSN is the right thought. What about
> something like
>
>   lsn_threshold =  insert_lsn - (insert_lsn - lsn_of_last_vacuum) * 0.1
>   if (/* other conds */ && PageGetLSN(page) <= lsn_threshold)
>      FreezeMe();
>
> I probably got some details wrong, what I am going for with lsn_threshold is
> that we'd freeze an already dirty page if it's not been updated within 10% of
> the LSN distance to the last VACUUM.

It seems to me that you're reinventing something akin to eager
freezing strategy here. At least that's how I define it, since now
you're bringing the high level context into it; what happens with the
table, with VACUUM operations, and so on. Obviously this requires
tracking the metadata that you suppose will be available in some way
or other, in particular things like lsn_of_last_vacuum.

What about unlogged/temporary tables? The obvious thing to do there is
what I did in the patch that was reverted (freeze whenever the page
will thereby become all-frozen), and forget about LSNs. But you have
already objected to that part, specifically.

BTW, you still haven't changed the fact that you get rather different
behavior with checksums/wal_log_hints. I think that that's good, but
you didn't seem to think so.

> I don't think the speculation is that fundamentally different - a heavily
> updated table with a bit of a historic, non-changing portion, makes
> vacuum_freeze_strategy_threshold freeze way more aggressively than either "any
> record" or "any fpi".

That's true. The point I was making is that both this proposal and
eager freezing are based on some kind of high level picture of the
needs of the table, based on high level metadata. To me that's the
defining characteristic.

> > And even when we lose, you generally still won't have been completely
> > wrong -- even then there generally will indeed be a second FPI later
> > on for the same page, to go with everything else. This makes the
> > wasted freezing even less significant, on a comparative basis!
>
> This is precisely why I think that we can afford to be quite aggressive about
> freezing already dirty pages...

I'm beginning to warm to this idea, now that I understand it a little better.

-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Andres Freund
Date:
Hi,

On 2023-01-26 23:11:41 -0800, Peter Geoghegan wrote:
> > Essentially the "any fpi" logic is a very coarse grained way of using the page
> > LSN as a measurement. As I said, I don't think "has a checkpoint occurred
> > since the last write" is a good metric to avoid unnecessary freezing - it's
> > too coarse. But I think using the LSN is the right thought. What about
> > something like
> >
> >   lsn_threshold =  insert_lsn - (insert_lsn - lsn_of_last_vacuum) * 0.1
> >   if (/* other conds */ && PageGetLSN(page) <= lsn_threshold)
> >      FreezeMe();
> >
> > I probably got some details wrong, what I am going for with lsn_threshold is
> > that we'd freeze an already dirty page if it's not been updated within 10% of
> > the LSN distance to the last VACUUM.
> 
> It seems to me that you're reinventing something akin to eager
> freezing strategy here. At least that's how I define it, since now
> you're bringing the high level context into it; what happens with the
> table, with VACUUM operations, and so on. Obviously this requires
> tracking the metadata that you suppose will be available in some way
> or other, in particular things like lsn_of_last_vacuum.

I agree with bringing high-level context into the decision about whether to
freeze aggressively - my problem with the eager freezing strategy patch isn't
that it did that too much, it's that it didn't do it enough.


But I also don't think what I describe above is really comparable to "table
level" eager freezing though - the potential worst case overhead is a small
fraction of the WAL volume, and there's zero increase in data write volume. I
suspect the absolute worst case of "always freeze dirty pages" is when a
single tuple on the page gets updated immediately after every time we freeze
the page - a single tuple is where the freeze record is the least space
efficient. The smallest update is about the same size as the smallest freeze
record.  For that to amount to a large WAL increase you'd need a crazy rate of
such updates interspersed with vacuums. In slightly more realistic cases
(i.e. not column-less tuples that constantly get updated and freezing happening
all the time) you end up with a reasonably small WAL rate overhead.

That worst case of "freeze dirty" is bad enough to spend some brain and
compute cycles to prevent. But if we don't always get it right in some
workload, it's not *awful*.


The worst case of the "eager freeze strategy" is a lot larger - it's probably
something like updating one narrow tuple every page, once per checkpoint, so
that each freeze generates an FPI. I think that results in a max overhead of
2x for data writes, and about 150x for WAL volume (ratio of one update record
with an FPI). Obviously that's a pointless workload, but I do think that
analyzing the "outer boundaries" of the regression something can cause, can be
helpful.


I think one way forward with the eager strategy approach would be to have a
very narrow gating condition for now, and then incrementally expand it in
later releases.

One use-case where the eager strategy is particularly useful is
[nearly-]append-only tables - and it's also the one workload that's reasonably
easy to detect using stats. Maybe something like
(dead_tuples_since_last_vacuum / inserts_since_last_vacuum) < 0.05
or so.
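
As a standalone sketch of that gating condition (the counter names are made
up; the real pgstat fields would still need to be wired up):

#include <stdbool.h>
#include <stdint.h>

/*
 * Standalone sketch: does the table look append-mostly from VACUUM's point
 * of view?  The 0.05 cutoff is the arbitrary threshold discussed above.
 */
static bool
table_looks_append_mostly(uint64_t dead_tuples_since_last_vacuum,
                          uint64_t inserts_since_last_vacuum)
{
    if (inserts_since_last_vacuum == 0)
        return false;           /* no inserts, nothing to be eager about */

    return (double) dead_tuples_since_last_vacuum /
           (double) inserts_since_last_vacuum < 0.05;
}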

That'll definitely leave out loads of workloads where eager freezing would be
useful - but are there semi-reasonable workloads where it'll hurt badly? I
don't *think* so.


> What about unlogged/temporary tables? The obvious thing to do there is
> what I did in the patch that was reverted (freeze whenever the page
> will thereby become all-frozen), and forget about LSNs. But you have
> already objected to that part, specifically.

My main concern about that is the data write amplification it could cause when
page is clean when we start freezing.  But I can't see a large potential
downside to always freezing unlogged/temp tables when the page is already
dirty.


> BTW, you still haven't changed the fact that you get rather different
> behavior with checksums/wal_log_hints. I think that that's good, but
> you didn't seem to think so.

I think that, if we had something like the recency test I was talking about,
we could afford to always freeze when the page is already dirty and not very
recently modified. I.e. not even insist on a WAL record having been generated
during pruning/HTSV.  But I need to think through the dangers of that more.

Greetings,

Andres Freund



Re: New strategies for freezing, advancing relfrozenxid early

From
Andres Freund
Date:
Hi,

On 2023-01-27 00:51:59 -0800, Andres Freund wrote:
> One use-case where the eager strategy is particularly useful is
> [nearly-]append-only tables - and it's also the one workload that's reasonably
> easy to detect using stats. Maybe something like
> (dead_tuples_since_last_vacuum / inserts_since_last_vacuum) < 0.05
> or so.
> 
> That'll definitely leave out loads of workloads where eager freezing would be
> useful - but are there semi-reasonable workloads where it'll hurt badly? I
> don't *think* so.

That 0.05 could be a GUC + relopt combo, which'd allow users to opt in tables
with known usage pattern into always using eager freezing.

Greetings,

Andres Freund



Re: New strategies for freezing, advancing relfrozenxid early

From
Robert Haas
Date:
On Thu, Jan 26, 2023 at 4:51 PM Peter Geoghegan <pg@bowt.ie> wrote:
> This is the kind of remark that makes me think that you don't get it.
>
> The most influential OLTP benchmark of all time is TPC-C, which has
> exactly this problem. In spades -- it's enormously disruptive. Which
> is one reason why I used it as a showcase for a lot of this work. Plus
> practical experience (like the Heroku database in the blog post I
> linked to) fully agrees with that benchmark, as far as this stuff goes
> -- that was also a busy OLTP database.
>
> Online transaction processing involves transactions. Right? There is presumably
> some kind of ledger, some kind of orders table. Naturally these have
> entries that age out fairly predictably. After a while, almost all the
> data is cold data. It is usually about that simple.
>
> One of the key strengths of systems like Postgres is the ability to
> inexpensively store a relatively large amount of data that has just
> about zero chance of being read, let alone modified. While at the same
> time having decent OLTP performance for the hot data. Not nearly as
> good as an in-memory system, mind you -- and yet in-memory systems
> remain largely a niche thing.

I think it's interesting that TPC-C suffers from the kind of problem
that your patch was intended to address. I hadn't considered that. But
I do not think it detracts from the basic point I was making, which is
that you need to think about the downsides of your patch, not just the
upsides.

If you want to argue that there is *no* OLTP workload that will be
harmed by freezing as aggressively as possible, then that would be a
good argument in favor of your patch, because it would be arguing that
the downside simply doesn't exist, at least for OLTP workloads. The
fact that you can think of *one particular* OLTP workload that can
benefit from the patch is just doubling down on the "my patch has an
upside" argument, which literally no one is disputing.

I don't think you can make such an argument stick, though. OLTP
workloads come in all shapes and sizes. It's pretty common to have
tables where the application inserts a bunch of data, updates it over
and over again, truncates the table, and starts over. In such a
case, aggressive freezing has to be a loss, because no freezing is
ever needed. It's also surprisingly common to have tables where a
bunch of data is inserted and then, after a bit of processing, a bunch
of rows are updated exactly once, after which the data is not modified
any further. In those kinds of cases, aggressive freezing is a great
idea if it happens after that round of updates but a poor idea if it
happens before that round of updates.

It's also pretty common to have cases where portions of the table
become very hot, get a lot of updates for a while, and then that part
of the table becomes cool and some other part of the table becomes
very hot for a while. I think it's possible that aggressive freezing
might do OK in such environments, actually. It will be a negative if
we aggressively freeze the part of the table that's currently hot, but
I think typically tables that have this access pattern are quite big,
so VACUUM isn't going to sweep through the table all that often. It
will probably freeze a lot more data-that-was-hot-a-bit-ago than it
will freeze data-that-is-hot-this-very-minute. Then again, maybe that
would happen without the patch, too. Maybe this kind of case is a wash
for your patch? I don't know.

Whatever you think of these examples, I don't see how it can be right
to suppose that *in general* freezing very aggressively has no
downsides. If that were true, then we probably wouldn't have
vacuum_freeze_min_age at all. We would always just freeze everything
ASAP. I mean, you could theorize that whoever invented that GUC is an
idiot and that they had absolutely no good reason for introducing it,
but that seems pretty ridiculous. Someone put guards against
overly-aggressive freezing into the system *for a reason* and if you
just go rip them all out, you're going to reintroduce the problems
against which they were intended to guard.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: New strategies for freezing, advancing relfrozenxid early

From
Robert Haas
Date:
On Thu, Jan 26, 2023 at 6:37 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > I don't see what your reference to checkpoint timeout is about here?
> >
> > Also, as I mentioned before, the problem isn't specific to checkpoint_timeout
> > = 1min. It just makes it cheaper to reproduce.
>
> That's flagrantly intellectually dishonest.

This kind of ad hominem attack has no place on this mailing list, or
anywhere in the PostgreSQL community.

If you think there's a problem with Andres's test case, or his
analysis of it, you can talk about those problems without accusing him
of intellectual dishonesty.

I don't see anything to indicate that he was being intentionally
dishonest, either. At most he was mistaken. More than likely, not even
that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Fri, Jan 27, 2023 at 6:48 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > One of the key strengths of systems like Postgres is the ability to
> > inexpensively store a relatively large amount of data that has just
> > about zero chance of being read, let alone modified. While at the same
> > time having decent OLTP performance for the hot data. Not nearly as
> > good as an in-memory system, mind you -- and yet in-memory systems
> > remain largely a niche thing.
>
> I think it's interesting that TPC-C suffers from the kind of problem
> that your patch was intended to address. I hadn't considered that. But
> I do not think it detracts from the basic point I was making, which is
> that you need to think about the downsides of your patch, not just the
> upsides.
>
> If you want to argue that there is *no* OLTP workload that will be
> harmed by freezing as aggressively as possible, then that would be a
> good argument in favor of your patch, because it would be arguing that
> the downside simply doesn't exist, at least for OLTP workloads. The
> fact that you can think of *one particular* OLTP workload that can
> benefit from the patch is just doubling down on the "my patch has an
> upside" argument, which literally no one is disputing.

You've treated me to another multi-paragraph talking-down, as if I was
still clinging to my original position, which is of course not the
case. I've literally said I'm done with VACUUM for good, and that I
just want to put a line under this. Yet you still persist in doing
this sort of thing. I'm not fighting you, I'm not fighting Andres.

I was making a point about the need to do something in this area in
general. That's all.

-- 
Peter Geoghegan



Re: New strategies for freezing, advancing relfrozenxid early

From
Robert Haas
Date:
On Fri, Jan 27, 2023 at 12:58 AM Andres Freund <andres@anarazel.de> wrote:
> Essentially the "any fpi" logic is a very coarse grained way of using the page
> LSN as a measurement. As I said, I don't think "has a checkpoint occurred
> since the last write" is a good metric to avoid unnecessary freezing - it's
> too coarse. But I think using the LSN is the right thought. What about
> something like
>
>   lsn_threshold =  insert_lsn - (insert_lsn - lsn_of_last_vacuum) * 0.1
>   if (/* other conds */ && PageGetLSN(page) <= lsn_threshold)
>      FreezeMe();
>
> I probably got some details wrong, what I am going for with lsn_threshold is
> that we'd freeze an already dirty page if it's not been updated within 10% of
> the LSN distance to the last VACUUM.

I think this might not be quite the right idea for a couple of reasons.

First, suppose that the table is being processed just by autovacuum
(no manual VACUUM operations) and that the rate of WAL generation is
pretty even, so that LSN age is a good proxy for time. If autovacuum
processes the table once per hour, this will freeze if it hasn't been
updated in the last six minutes. That sounds good. But if autovacuum
processes the table once per day, then this will freeze if it hasn't
been updated in 2.4 hours. That might be OK, but it sounds a little on
the long side. If autovacuum processes the table once per week, then
this will freeze if it hasn't been updated in 16.8 hours. That sounds
too conservative. Conversely, if autovacuum processes the table every
3 minutes, then this will freeze the data if it hasn't been updated in
the last 18 seconds, which sounds awfully aggressive. Maybe I'm wrong
here, but I feel like the absolute amount of wall-clock time we're
talking about here probably matters to some degree. I'm not sure
whether a strict time-based threshold like, say, 10 minutes would be a
good idea, leaving aside the difficulties of implementation. It might
be right to think that if the table is being vacuumed a lot, freezing
more aggressively is smart, and if it's being vacuumed infrequently,
freezing less aggressively is smart, because if the table has enough
activity that it's being vacuumed frequently, that might also be a
sign that we need to freeze more aggressively in order to avoid having
things go sideways. However, I'm not completely sure about that, and I
think it's possible that we need some guardrails to avoid going too
far in either direction.

Second, and more seriously, I think this would, in some circumstances,
lead to tremendously unstable behavior. Suppose somebody does a bunch
of work on a table and then they're like "oh, we should clean up,
VACUUM" and it completes quickly because it's been a while since the
last vacuum and so it doesn't freeze much. Then, for whatever reason,
they decide to run it one more time, and it goes bananas and starts
freezing all kinds of stuff because the LSN distance since the last
vacuum is basically zero. Or equally, you run a manual VACUUM, and you
get completely different behavior depending on how long it's been
since the last autovacuum ran.

In some ways, I think this proposal has many of the same problems as
vacuum_freeze_min_age. In both cases, the instinct is that we should
use something on the page to let us know how long it's been since the
page was modified, and proceed on the theory that if the page has not
been modified recently, it probably isn't about to be modified again.
That's a reasonable instinct, but the rate of XID advancement and the
rate of LSN advancement are both highly variable, even on a system
that's always under some load.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: New strategies for freezing, advancing relfrozenxid early

From
Andres Freund
Date:
Hi,

On 2023-01-27 12:53:58 -0500, Robert Haas wrote:
> On Fri, Jan 27, 2023 at 12:58 AM Andres Freund <andres@anarazel.de> wrote:
> > Essentially the "any fpi" logic is a very coarse grained way of using the page
> > LSN as a measurement. As I said, I don't think "has a checkpoint occurred
> > since the last write" is a good metric to avoid unnecessary freezing - it's
> > too coarse. But I think using the LSN is the right thought. What about
> > something like
> >
> >   lsn_threshold =  insert_lsn - (insert_lsn - lsn_of_last_vacuum) * 0.1
> >   if (/* other conds */ && PageGetLSN(page) <= lsn_threshold)
> >      FreezeMe();
> >
> > I probably got some details wrong, what I am going for with lsn_threshold is
> > that we'd freeze an already dirty page if it's not been updated within 10% of
> > the LSN distance to the last VACUUM.
>
> I think this might not be quite the right idea for a couple of reasons.

It's definitely not perfect.  If we had an approximate LSN->time map as
general infrastructure, we could do a lot better. I think it'd be reasonably
easy to maintain that in the autovacuum launcher, for example.
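
Something as small as a ring buffer of (LSN, timestamp) samples, appended
periodically by the launcher, might be enough. Purely a sketch -- nothing
like this exists today:

#include <stdint.h>
#include <time.h>

/* Standalone sketch of an approximate LSN->time map. */
#define LSN_TIME_SAMPLES 128

typedef struct LsnTimeSample
{
    uint64_t    lsn;            /* WAL position when the sample was taken */
    time_t      when;           /* wall-clock time of the sample */
} LsnTimeSample;

typedef struct LsnTimeMap
{
    LsnTimeSample samples[LSN_TIME_SAMPLES];
    int         next;           /* next slot to overwrite */
    int         count;          /* number of valid samples */
} LsnTimeMap;

/* Called periodically (say, once a minute) with the current insert LSN. */
static void
lsn_time_map_record(LsnTimeMap *map, uint64_t lsn, time_t now)
{
    map->samples[map->next].lsn = lsn;
    map->samples[map->next].when = now;
    map->next = (map->next + 1) % LSN_TIME_SAMPLES;
    if (map->count < LSN_TIME_SAMPLES)
        map->count++;
}

/*
 * Estimate when a given LSN was current: the timestamp of the newest sample
 * taken at or before that LSN, or 0 if the LSN predates all samples.
 */
static time_t
lsn_time_map_lookup(const LsnTimeMap *map, uint64_t lsn)
{
    time_t      best_when = 0;
    uint64_t    best_lsn = 0;

    for (int i = 0; i < map->count; i++)
    {
        if (map->samples[i].lsn <= lsn && map->samples[i].lsn >= best_lsn)
        {
            best_lsn = map->samples[i].lsn;
            best_when = map->samples[i].when;
        }
    }
    return best_when;
}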


One thing worth calling out here, because it's not obvious from the code
quoted above in isolation, is that what I was trying to refine here was the
decision when to perform opportunistic freezing *of already dirty pages that
do not require an FPI*.

So all that we need to prevent here is freezing very hotly updated data, where
the WAL overhead of the freeze records would be noticable, because we
constantly VACUUM, due to the high turnover.


> First, suppose that the table is being processed just by autovacuum
> (no manual VACUUM operations) and that the rate of WAL generation is
> pretty even, so that LSN age is a good proxy for time. If autovacuum
> processes the table once per hour, this will freeze if it hasn't been
> updated in the last six minutes. That sounds good. But if autovacuum
> processes the table once per day, then this will freeze if it hasn't
> been updated in 2.4 hours. That might be OK, but it sounds a little on
> the long side.

You're right. I was thinking of the "lsn_since_last_vacuum" because I was
postulating it being useful elsewhere in the thread (but for eager strategy
logic) - but here that's really not very relevant.

Given that we're dealing with already dirty pages not requiring an FPI, I
think a much better "reference LSN" would be the LSN of the last checkpoint
(LSN of the last checkpoint record, not the current REDO pointer).


> Second, and more seriously, I think this would, in some circumstances,
> lead to tremendously unstable behavior. Suppose somebody does a bunch
> of work on a table and then they're like "oh, we should clean up,
> VACUUM" and it completes quickly because it's been a while since the
> last vacuum and so it doesn't freeze much. Then, for whatever reason,
> they decide to run it one more time, and it goes bananas and starts
> freezing all kinds of stuff because the LSN distance since the last
> vacuum is basically zero. Or equally, you run a manual VACUUM, and you
> get completely different behavior depending on how long it's been
> since the last autovacuum ran.

I don't think this quite applies to the scenario at hand, because it's
restricted to already dirty pages. And the max increased overhead is also
small due to that - so occasionally getting it wrong isn't that impactful.

Greetings,

Andres Freund



Re: New strategies for freezing, advancing relfrozenxid early

From
Peter Geoghegan
Date:
On Fri, Jan 27, 2023 at 12:52 AM Andres Freund <andres@anarazel.de> wrote:
> I agree with bringing high-level context into the decision about whether to
> freeze aggressively - my problem with the eager freezing strategy patch isn't
> that it did that too much, it's that it didn't do it enough.
>
>
> But I also don't think what I describe above is really comparable to "table
> level" eager freezing though - the potential worst case overhead is a small
> fraction of the WAL volume, and there's zero increase in data write volume.

All I meant was that I initially thought that you were trying to
replace the FPI thing with something at the same level of ambition,
that could work in a low context way. But I now see that you're
actually talking about something quite a bit more ambitious for
Postgres 16, which is structurally similar to a freezing strategy,
from a code point of view -- it relies on high-level context for the
VACUUM/table as a whole. I wasn't equating it with the eager freezing
strategy in any other way.

It might also be true that this other thing happens to render the FPI
mechanism redundant. I'm actually not completely sure that it will
just yet. Let me verify my understanding of your proposal:

You mean that we'd take the page LSN before doing anything with the
page, right at the top of lazy_scan_prune, at the same point that
"fpi_before" is initialized currently. Then, if we subsequently
dirtied the page (as determined by its LSN, so as to focus on "dirtied
via WAL logged operation") during pruning, *and* if the "lsn_before"
of the page was from before our cutoff (derived via "  lsn_threshold =
 insert_lsn - (insert_lsn - lsn_of_last_vacuum) * 0.1" or similar),
*and* if the page is eligible to become all-frozen, then we'd freeze
the page.

That's it, right? It's about pages that *we* (VACUUM) dirtied, and
wrote records and/or FPIs for already?

> I suspect the absolute worst case of "always freeze dirty pages" is when a
> single tuple on the page gets updated immediately after every time we freeze
> the page - a single tuple is where the freeze record is the least space
> efficient. The smallest update is about the same size as the smallest freeze
> record.  For that to amount to a large WAL increase you'd need a crazy rate of such
> updates interspersed with vacuums. In slightly more realistic cases (i.e. not
> column-less tuples that constantly get updated and freezing happening all the
> time) you end up with a reasonably small WAL rate overhead.

The other thing is that we'd be doing this in situations where we already
know that a VISIBLE record is required, which is comparable in size to
a FREEZE_PAGE record with one tuple/plan (around 64 bytes). The
smallest WAL records are mostly just generic WAL record header
overhead.

> Obviously that's a pointless workload, but I do think that
> analyzing the "outer boundaries" of the regression something can cause, can be
> helpful.

I agree about the "outer boundaries" being a useful guide.

> I think one way forward with the eager strategy approach would be to have a
> very narrow gating condition for now, and then incrementally expand it in
> later releases.
>
> One use-case where the eager strategy is particularly useful is
> [nearly-]append-only tables - and it's also the one workload that's reasonably
> easy to detect using stats. Maybe something like
> (dead_tuples_since_last_vacuum / inserts_since_last_vacuum) < 0.05
> or so.
>
> That'll definitely leave out loads of workloads where eager freezing would be
> useful - but are there semi-reasonable workloads where it'll hurt badly? I
> don't *think* so.

I have no further plans to work on eager freezing strategy, or
anything of the sort, in light of recent developments. My goal at this
point is very unambitious: to get the basic page-level freezing work
into a form that makes sense as a standalone thing for Postgres 16. To
put things on a good footing, so that I can permanently bow out of all
work on VACUUM having left everything in good order. That's all.

Now, that might still mean that I'd facilitate future work of this
sort, by getting the right basic structure in place. But my
involvement in any work on freezing or anything of the sort ends here,
both as a patch author and a committer of anybody else's work. I'm
proud of the work I've done on VACUUM, but I'm keen to move on from
it.

> > What about unlogged/temporary tables? The obvious thing to do there is
> > what I did in the patch that was reverted (freeze whenever the page
> > will thereby become all-frozen), and forget about LSNs. But you have
> > already objected to that part, specifically.
>
> My main concern about that is the data write amplification it could cause when
> page is clean when we start freezing.  But I can't see a large potential
> downside to always freezing unlogged/temp tables when the page is already
> dirty.

But we have to dirty the page anyway, just to set PD_ALL_VISIBLE. That
was always a gating condition. Actually, that may have depended on not
having SKIP_PAGES_THRESHOLD, which the vm snapshot infrastructure
would have removed. That's not happening now, so I may need to
reassess. But even with SKIP_PAGES_THRESHOLD, it should be fine.

> > BTW, you still haven't changed the fact that you get rather different
> > behavior with checksums/wal_log_hints. I think that that's good, but
> > you didn't seem to.
>
> I think that, if we had something like the recency test I was talking about,
> we could afford to alway freeze when the page is already dirty and not very
> recently modified. I.e. not even insist on a WAL record having been generated
> during pruning/HTSV.  But I need to think through the dangers of that more.

Now I'm confused. I thought that the recency test you talked about was
purely to be used to do something a bit like the FPI thing, but using
some high level context. Now I don't know what to think.

--
Peter Geoghegan