Thread: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Attached WIP patch series significantly simplifies the definition of scanned_pages inside vacuumlazy.c. Apart from making several very tricky things a lot simpler, and moving more complex code outside of the big "blkno" loop inside lazy_scan_heap (building on the Postgres 14 work), this refactoring directly facilitates 2 new optimizations (also in the patch): 1. We now collect LP_DEAD items into the dead_tuples array for all scanned pages -- even when we cannot get a cleanup lock. 2. We now don't give up on advancing relfrozenxid during a non-aggressive VACUUM when we happen to be unable to get a cleanup lock on a heap page. Both optimizations are much more natural with the refactoring in place. Especially #2, which can be thought of as making aggressive and non-aggressive VACUUM behave similarly. Sure, we shouldn't wait for a cleanup lock in a non-aggressive VACUUM (by definition) -- and we still don't in the patch (obviously). But why wouldn't we at least *check* if the page has tuples that need to be frozen in order for us to advance relfrozenxid? Why give up on advancing relfrozenxid in a non-aggressive VACUUM when there's no good reason to? See the draft commit messages from the patch series for many more details on the simplifications I am proposing. I'm not sure how much value the second optimization has on its own. But I am sure that the general idea of teaching non-aggressive VACUUM to be conscious of the value of advancing relfrozenxid is a good one -- and so #2 is a good start on that work, at least. I've discussed this idea with Andres (CC'd) a few times before now. Maybe we'll need another patch that makes VACUUM avoid setting heap pages to all-visible without also setting them to all-frozen (and freezing as necessary) in order to really get a benefit. Since, of course, a non-aggressive VACUUM still won't be able to advance relfrozenxid when it skipped over all-visible pages that are not also known to be all-frozen. Masahiko (CC'd) has expressed interest in working on opportunistic freezing. This refactoring patch seems related to that general area, too. At a high level, to me, this seems like the tuple freezing equivalent of the Postgres 14 work on bypassing index vacuuming when there are very few LP_DEAD items (interpret that as 0 LP_DEAD items, which is close to the truth anyway). There are probably quite a few interesting opportunities to make VACUUM better by not having such a sharp distinction between aggressive and non-aggressive VACUUM. Why should they be so different? A good medium term goal might be to completely eliminate aggressive VACUUMs. I have heard many stories about anti-wraparound/aggressive VACUUMs where the cure (which suddenly made autovacuum workers non-cancellable) was worse than the disease (not actually much danger of wraparound failure). For example: https://www.joyent.com/blog/manta-postmortem-7-27-2015 Yes, this problem report is from 2015, which is before we even had the freeze map stuff. I still think that the point about aggressive VACUUMs blocking DDL (leading to chaos) remains valid. There is another interesting area of future optimization within VACUUM, that also seems relevant to this patch: the general idea of *avoiding* pruning during VACUUM, when it just doesn't make sense to do so -- better to avoid dirtying the page for now. 
Needlessly pruning inside lazy_scan_prune is hardly rare -- standard pgbench (maybe only with heap fill factor reduced to 95) will have autovacuums that *constantly* do it (granted, it may not matter so much there because VACUUM is unlikely to re-dirty the page anyway). This patch seems relevant to that area because it recognizes that pruning during VACUUM is not necessarily special -- a new function called lazy_scan_noprune may be used instead of lazy_scan_prune (though only when a cleanup lock cannot be acquired). These pages are nevertheless considered fully processed by VACUUM (this is perhaps 99% true, so it seems reasonable to round up to 100% true). I find it easy to imagine generalizing the same basic idea -- recognizing more ways in which pruning by VACUUM isn't necessarily better than opportunistic pruning, at the level of each heap page. Of course we *need* to prune sometimes (e.g., might be necessary to do so to set the page all-visible in the visibility map), but why bother when we don't, and when there is no reason to think that it'll help anyway? Something to think about, at least. -- Peter Geoghegan
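To make the shape of that change concrete, the cleanup-lock-failure path of the per-block loop ends up looking roughly like this (a condensed excerpt based on the patch hunks quoted downthread; error handling, progress reporting, and the details of the fallback when lazy_scan_noprune can't finish the page are omitted):

    buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno,
                             RBM_NORMAL, vacrel->bstrategy);
    page = BufferGetPage(buf);
    vacrel->scanned_pages++;

    if (!ConditionalLockBufferForCleanup(buf))
    {
        bool    hastup;

        /* Settle for reduced processing with only a share lock */
        LockBuffer(buf, BUFFER_LOCK_SHARE);

        /* Check for new or empty pages before lazy_scan_noprune call */
        if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, true, vmbuffer))
        {
            /* Lock and pin released for us */
            continue;
        }

        if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup))
        {
            /*
             * Page fully processed without a cleanup lock: preexisting
             * LP_DEAD items were still collected into dead_tuples, and
             * nothing on the page stands in the way of advancing
             * relfrozenxid later (optimizations 1 and 2 above).
             */
            UnlockReleaseBuffer(buf);
            if (hastup)
                vacrel->nonempty_pages = blkno + 1;
            continue;
        }

        /*
         * lazy_scan_noprune couldn't do all required processing for this
         * page; fall through to the usual cleanup-lock path.
         */
    }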
Hi, On 2021-11-21 18:13:51 -0800, Peter Geoghegan wrote: > I have heard many stories about anti-wraparound/aggressive VACUUMs > where the cure (which suddenly made autovacuum workers > non-cancellable) was worse than the disease (not actually much danger > of wraparound failure). For example: > > https://www.joyent.com/blog/manta-postmortem-7-27-2015 > > Yes, this problem report is from 2015, which is before we even had the > freeze map stuff. I still think that the point about aggressive > VACUUMs blocking DDL (leading to chaos) remains valid. As I noted below, I think this is a bit of a separate issue from what your changes address in this patch. > There is another interesting area of future optimization within > VACUUM, that also seems relevant to this patch: the general idea of > *avoiding* pruning during VACUUM, when it just doesn't make sense to > do so -- better to avoid dirtying the page for now. Needlessly pruning > inside lazy_scan_prune is hardly rare -- standard pgbench (maybe only > with heap fill factor reduced to 95) will have autovacuums that > *constantly* do it (granted, it may not matter so much there because > VACUUM is unlikely to re-dirty the page anyway). Hm. I'm a bit doubtful that there's all that many cases where it's worth not pruning during vacuum. However, it seems much more common for opportunistic pruning during non-write accesses. Perhaps checking whether we'd log an FPW would be a better criterion for deciding whether to prune or not compared to whether we're dirtying the page? IME the WAL volume impact of FPWs is a considerably bigger deal than unnecessarily dirtying a page that has previously been dirtied in the same checkpoint "cycle". > This patch seems relevant to that area because it recognizes that pruning > during VACUUM is not necessarily special -- a new function called > lazy_scan_noprune may be used instead of lazy_scan_prune (though only when a > cleanup lock cannot be acquired). These pages are nevertheless considered > fully processed by VACUUM (this is perhaps 99% true, so it seems reasonable > to round up to 100% true). IDK, the potential of not having usable space on an overly fragmented page doesn't seem that low. We can't just mark such pages as all-visible because then we'll potentially never reclaim that space. > Since any VACUUM (not just an aggressive VACUUM) can sometimes advance > relfrozenxid, we now make non-aggressive VACUUMs work just a little > harder in order to make that desirable outcome more likely in practice. > Aggressive VACUUMs have long checked contended pages with only a shared > lock, to avoid needlessly waiting on a cleanup lock (in the common case > where the contended page has no tuples that need to be frozen anyway). > We still don't make non-aggressive VACUUMs wait for a cleanup lock, of > course -- if we did that they'd no longer be non-aggressive. IMO the big difference between aggressive / non-aggressive isn't whether we wait for a cleanup lock, but that we don't skip all-visible pages... > But we now make the non-aggressive case notice that a failure to acquire a > cleanup lock on one particular heap page does not in itself make it unsafe > to advance relfrozenxid for the whole relation (which is what we usually see > in the aggressive case already). > > This new relfrozenxid optimization might not be all that valuable on its > own, but it may still facilitate future work that makes non-aggressive > VACUUMs more conscious of the benefit of advancing relfrozenxid sooner > rather than later. 
In general it would be useful for non-aggressive > VACUUMs to be "more aggressive" opportunistically (e.g., by waiting for > a cleanup lock once or twice if needed). What do you mean by "waiting once or twice"? A single wait may simply never end on a busy page that's constantly pinned by a lot of backends... > It would also be generally useful if aggressive VACUUMs were "less > aggressive" opportunistically (e.g. by being responsive to query > cancellations when the risk of wraparound failure is still very low). Being cancelable is already a different concept than anti-wraparound vacuums. We start aggressive autovacuums at vacuum_freeze_table_age, but anti-wrap only at autovacuum_freeze_max_age. The problem is that the autovacuum scheduling is way too naive for that to be a significant benefit - nothing tries to schedule autovacuums so that they have a chance to complete before anti-wrap autovacuums kick in. All that vacuum_freeze_table_age does is to promote an otherwise-scheduled (auto-)vacuum to an aggressive vacuum. This is one of the most embarrassing issues around the whole anti-wrap topic. We kind of define it as an emergency that there's an anti-wraparound vacuum. But we have *absolutely no mechanism* to prevent them from occurring. > We now also collect LP_DEAD items in the dead_tuples array in the case > where we cannot immediately get a cleanup lock on the buffer. We cannot > prune without a cleanup lock, but opportunistic pruning may well have > left some LP_DEAD items behind in the past -- no reason to miss those. This has become *much* more important with the changes around deciding when to index vacuum. It's not just that opportunistic pruning could have left LP_DEAD items, it's that a previous vacuum is quite likely to have left them there, because the previous vacuum decided not to perform index cleanup. > Only VACUUM can mark these LP_DEAD items LP_UNUSED (no opportunistic > technique is independently capable of cleaning up line pointer bloat), One thing we could do around this, btw, would be to aggressively replace LP_REDIRECT items with their target item. We can't do that in all situations (somebody might be following a ctid chain), but I think we have all the information needed to do so. Probably would require a new HTSV RECENTLY_LIVE state or something like that. I think that'd be quite a win - we right now often "migrate" to other pages for modifications not because we're out of space on a page, but because we run out of itemids (for debatable reasons MaxHeapTuplesPerPage constrains the number of line pointers, not just the number of actual tuples). Effectively doubling the number of available line items in common cases in a number of realistic / common scenarios would be quite the win. > Note that we no longer report on "pin skipped pages" in VACUUM VERBOSE, > since there is barely any real practical sense in which we actually > miss doing useful work for these pages. Besides, this information > always seemed to have little practical value, even to Postgres hackers. -0.5. I think it provides some value, and I don't see why the removal of the information should be tied to this change. It's hard to diagnose why some dead tuples aren't cleaned up - a common cause for that on smaller tables is that nearly all pages are pinned nearly all the time. I wonder if we could have a more restrained version of heap_page_prune() that doesn't require a cleanup lock? 
Obviously we couldn't defragment the page, but it's not immediately obvious that we need it if we constrain ourselves to only modify tuple versions that cannot be visible to anybody. Random note: I really dislike that we talk about cleanup locks in some parts of the code, and super-exclusive locks in others :(. > + /* > + * Aggressive VACUUM (which is the same thing as anti-wraparound > + * autovacuum for most practical purposes) exists so that we'll reliably > + * advance relfrozenxid and relminmxid sooner or later. But we can often > + * opportunistically advance them even in a non-aggressive VACUUM. > + * Consider if that's possible now. I don't agree with the "most practical purposes" bit. There's a huge difference because manual VACUUMs end up aggressive but not anti-wrap once older than vacuum_freeze_table_age. > + * NB: We must use orig_rel_pages, not vacrel->rel_pages, since we want > + * the rel_pages used by lazy_scan_prune, from before a possible relation > + * truncation took place. (vacrel->rel_pages is now new_rel_pages.) > + */ I think it should be doable to add an isolation test for this path. There have been quite a few bugs around the wider topic... > + if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages || > + !vacrel->freeze_cutoffs_valid) > + { > + /* Cannot advance relfrozenxid/relminmxid -- just update pg_class */ > + Assert(!aggressive); > + vac_update_relstats(rel, new_rel_pages, new_live_tuples, > + new_rel_allvisible, vacrel->nindexes > 0, > + InvalidTransactionId, InvalidMultiXactId, false); > + } > + else > + { > + /* Can safely advance relfrozen and relminmxid, too */ > + Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages == > + orig_rel_pages); > + vac_update_relstats(rel, new_rel_pages, new_live_tuples, > + new_rel_allvisible, vacrel->nindexes > 0, > + FreezeLimit, MultiXactCutoff, false); > + } I wonder if this whole logic wouldn't become easier and less fragile if we just went for maintaining the "actually observed" horizon while scanning the relation. If we skip a page via VM set the horizon to invalid. Otherwise we can keep track of the accurate horizon and use that. No need to count pages and stuff. > @@ -1050,18 +1046,14 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive) > bool all_visible_according_to_vm = false; > LVPagePruneState prunestate; > > - /* > - * Consider need to skip blocks. See note above about forcing > - * scanning of last page. > - */ > -#define FORCE_CHECK_PAGE() \ > - (blkno == nblocks - 1 && should_attempt_truncation(vacrel)) > - > pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno); > > update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP, > blkno, InvalidOffsetNumber); > > + /* > + * Consider need to skip blocks > + */ > if (blkno == next_unskippable_block) > { > /* Time to advance next_unskippable_block */ > @@ -1110,13 +1102,19 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive) > else > { > /* > - * The current block is potentially skippable; if we've seen a > - * long enough run of skippable blocks to justify skipping it, and > - * we're not forced to check it, then go ahead and skip. > - * Otherwise, the page must be at least all-visible if not > - * all-frozen, so we can set all_visible_according_to_vm = true. > + * The current block can be skipped if we've seen a long enough > + * run of skippable blocks to justify skipping it. 
> + * > + * There is an exception: we will scan the table's last page to > + * determine whether it has tuples or not, even if it would > + * otherwise be skipped (unless it's clearly not worth trying to > + * truncate the table). This avoids having lazy_truncate_heap() > + * take access-exclusive lock on the table to attempt a truncation > + * that just fails immediately because there are tuples in the > + * last page. > */ > - if (skipping_blocks && !FORCE_CHECK_PAGE()) > + if (skipping_blocks && > + !(blkno == nblocks - 1 && should_attempt_truncation(vacrel))) > { > /* > * Tricky, tricky. If this is in aggressive vacuum, the page I find the FORCE_CHECK_PAGE macro decidedly unhelpful. But I don't like mixing such changes within a larger change doing many other things. > @@ -1204,156 +1214,52 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive) > > buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, > RBM_NORMAL, vacrel->bstrategy); > + page = BufferGetPage(buf); > + vacrel->scanned_pages++; I don't particularly like doing BufferGetPage() before holding a lock on the page. Perhaps I'm too influenced by rust etc, but ISTM that at some point it'd be good to have a crosscheck that BufferGetPage() is only allowed when holding a page level lock. > /* > - * We need buffer cleanup lock so that we can prune HOT chains and > - * defragment the page. > + * We need a buffer cleanup lock to prune HOT chains and defragment > + * the page in lazy_scan_prune. But when it's not possible to acquire > + * a cleanup lock right away, we may be able to settle for reduced > + * processing in lazy_scan_noprune. > */ s/in lazy_scan_noprune/via lazy_scan_noprune/? > if (!ConditionalLockBufferForCleanup(buf)) > { > bool hastup; > > - /* > - * If we're not performing an aggressive scan to guard against XID > - * wraparound, and we don't want to forcibly check the page, then > - * it's OK to skip vacuuming pages we get a lock conflict on. They > - * will be dealt with in some future vacuum. > - */ > - if (!aggressive && !FORCE_CHECK_PAGE()) > + LockBuffer(buf, BUFFER_LOCK_SHARE); > + > + /* Check for new or empty pages before lazy_scan_noprune call */ > + if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, true, > + vmbuffer)) > { > - ReleaseBuffer(buf); > - vacrel->pinskipped_pages++; > + /* Lock and pin released for us */ > + continue; > + } Why isn't this done in lazy_scan_noprune()? > + if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup)) > + { > + /* No need to wait for cleanup lock for this page */ > + UnlockReleaseBuffer(buf); > + if (hastup) > + vacrel->nonempty_pages = blkno + 1; > continue; > } Do we really need all of buf, blkno, page for both of these functions? Quite possible that yes, if so, could we add an assertion that BufferGetBlockNumber(buf) == blkno? > + /* Check for new or empty pages before lazy_scan_prune call */ > + if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, false, vmbuffer)) > { Maybe worth a note mentioning that we need to redo this even in the aggressive case, because we didn't continually hold a lock on the page? > +/* > + * Empty pages are not really a special case -- they're just heap pages that > + * have no allocated tuples (including even LP_UNUSED items). You might > + * wonder why we need to handle them here all the same. It's only necessary > + * because of a rare corner-case involving a hard crash during heap relation > + * extension. 
If we ever make relation-extension crash safe, then it should > + * no longer be necessary to deal with empty pages here (or new pages, for > + * that matter). I don't think it's actually that rare - the window for this is huge. You just need to crash / immediate shutdown at any time between the relation having been extended and the new page contents being written out (checkpoint or buffer replacement / ring writeout). That's often many minutes. I don't really see that as a realistic thing to ever reliably avoid, FWIW. I think the overhead would be prohibitive. We'd need to do synchronous WAL logging while holding the extension lock I think. Um, not fun. > + * Caller can either hold a buffer cleanup lock on the buffer, or a simple > + * shared lock. > + */ Kinda sounds like it'd be incorrect to call this with an exclusive lock, which made me wonder why that could be true. Perhaps just say that it needs to be called with at least a shared lock? > +static bool > +lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno, > + Page page, bool sharelock, Buffer vmbuffer) It'd be good to document the return value - for me it's not a case where it's so obvious that it's not worth it. > +/* > + * lazy_scan_noprune() -- lazy_scan_prune() variant without pruning > + * > + * Caller need only hold a pin and share lock on the buffer, unlike > + * lazy_scan_prune, which requires a full cleanup lock. I'd add something like "returns whether a cleanup lock is required". Having to read multiple paragraphs to understand the basic meaning of the return value isn't great. > + if (ItemIdIsRedirected(itemid)) > + { > + *hastup = true; /* page won't be truncatable */ > + continue; > + } It's not really new, but this comment is now a bit confusing, because it can be understood to be about PageTruncateLinePointerArray(). > + case HEAPTUPLE_DEAD: > + case HEAPTUPLE_RECENTLY_DEAD: > + > + /* > + * We count DEAD and RECENTLY_DEAD tuples in new_dead_tuples. > + * > + * lazy_scan_prune only does this for RECENTLY_DEAD tuples, > + * and never has to deal with DEAD tuples directly (they > + * reliably become LP_DEAD items through pruning). Our > + * approach to DEAD tuples is a bit arbitrary, but it seems > + * better than totally ignoring them. > + */ > + new_dead_tuples++; > + break; Why does it make sense to track DEAD tuples this way? Isn't that going to lead to counting them over-and-over again? I think it's quite misleading to include them in "dead but not yet removable". > + /* > + * Now save details of the LP_DEAD items from the page in the dead_tuples > + * array iff VACUUM uses two-pass strategy case > + */ Do we really need to have separate code for this in lazy_scan_prune() and lazy_scan_noprune()? > + } > + else > + { > + /* > + * We opt to skip FSM processing for the page on the grounds that it > + * is probably being modified by concurrent DML operations. Seems > + * best to assume that the space is best left behind for future > + * updates of existing tuples. This matches what opportunistic > + * pruning does. Why can we assume that there is concurrent DML rather than concurrent read-only operations? IME it's much more common for read-only operations to block cleanup locks than read-write ones (partially because the frequency makes it easier, partially because cursors allow long-held pins, partially because the EXCLUSIVE lock of a r/w operation wouldn't let us get here). I think this is a change mostly in the right direction. But as formulated this commit does *WAY* too much at once. 
Greetings, Andres Freund
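As a concrete illustration of the "actually observed horizon" suggestion above, the bookkeeping could look something like the following hypothetical sketch (struct, field, and helper names are illustrative, not from any posted patch; the tracker would start out valid, with oldest_xid initialized to OldestXmin, the most recent value that could possibly be claimed, and then be lowered as unfrozen XIDs are seen):

    typedef struct LVObservedHorizon
    {
        bool            valid;          /* false once an unsafe page is skipped */
        TransactionId   oldest_xid;     /* oldest unfrozen XID left behind */
        MultiXactId     oldest_mxid;    /* oldest unfrozen MultiXactId left behind */
    } LVObservedHorizon;

    /* A block was skipped based on the visibility map */
    static inline void
    observed_horizon_skipped_page(LVObservedHorizon *hor, bool all_frozen)
    {
        /* Skipping an all-frozen page cannot hold relfrozenxid back */
        if (!all_frozen)
            hor->valid = false;
    }

    /* An unfrozen xmin/xmax was left behind on a page we did process */
    static inline void
    observed_horizon_note_xid(LVObservedHorizon *hor, TransactionId xid)
    {
        if (TransactionIdIsNormal(xid) &&
            TransactionIdPrecedes(xid, hor->oldest_xid))
            hor->oldest_xid = xid;
    }

    /*
     * At the end of the heap pass there is no page counting -- either the
     * tracked horizon is still valid and becomes the new relfrozenxid /
     * relminmxid, or pg_class is updated without advancing them.  (The
     * MultiXactId side would be tracked analogously.)
     */
    vac_update_relstats(rel, new_rel_pages, new_live_tuples,
                        new_rel_allvisible, vacrel->nindexes > 0,
                        hor.valid ? hor.oldest_xid : InvalidTransactionId,
                        hor.valid ? hor.oldest_mxid : InvalidMultiXactId,
                        false);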
On Mon, Nov 22, 2021 at 11:29 AM Andres Freund <andres@anarazel.de> wrote: > Hm. I'm a bit doubtful that there's all that many cases where it's worth not > pruning during vacuum. However, it seems much more common for opportunistic > pruning during non-write accesses. Fair enough. I just wanted to suggest an exploratory conversation about pruning (among several other things). I'm mostly saying: hey, pruning during VACUUM isn't actually that special, at least not with this refactoring patch in place. So maybe it makes sense to go further, in light of that general observation about pruning in VACUUM. Maybe it wasn't useful to even mention this aspect now. I would rather focus on freezing optimizations for now -- that's much more promising. > Perhaps checking whether we'd log an FPW would be a better criteria for > deciding whether to prune or not compared to whether we're dirtying the page? > IME the WAL volume impact of FPWs is a considerably bigger deal than > unnecessarily dirtying a page that has previously been dirtied in the same > checkpoint "cycle". Agreed. (I tend to say the former when I really mean the latter, which I should try to avoid.) > IDK, the potential of not having usable space on an overfly fragmented page > doesn't seem that low. We can't just mark such pages as all-visible because > then we'll potentially never reclaim that space. Don't get me started on this - because I'll never stop. It makes zero sense that we don't think about free space holistically, using the whole context of what changed in the recent past. As I think you know already, a higher level concept (like open and closed pages) seems like the right direction to me -- because it isn't sensible to treat X bytes of free space in one heap page as essentially interchangeable with any other space on any other heap page. That misses an enormous amount of things that matter. The all-visible status of a page is just one such thing. > IMO the big difference between aggressive / non-aggressive isn't whether we > wait for a cleanup lock, but that we don't skip all-visible pages... I know what you mean by that, of course. But FWIW that definition seems too focused on what actually happens today, rather than what is essential given the invariants we have for VACUUM. And so I personally prefer to define it as "a VACUUM that *reliably* advances relfrozenxid". This looser definition will probably "age" well (ahem). > > This new relfrozenxid optimization might not be all that valuable on its > > own, but it may still facilitate future work that makes non-aggressive > > VACUUMs more conscious of the benefit of advancing relfrozenxid sooner > > rather than later. In general it would be useful for non-aggressive > > VACUUMs to be "more aggressive" opportunistically (e.g., by waiting for > > a cleanup lock once or twice if needed). > > What do you mean by "waiting once or twice"? A single wait may simply never > end on a busy page that's constantly pinned by a lot of backends... I was speculating about future work again. I think that you've taken my words too literally. This is just a draft commit message, just a way of framing what I'm really trying to do. Sure, it wouldn't be okay to wait *indefinitely* for any one pin in a non-aggressive VACUUM -- so "at least waiting for one or two pins during non-aggressive VACUUM" might not have been the best way of expressing the idea that I wanted to express. 
The important point is that _we can make a choice_ about stuff like this dynamically, based on the observed characteristics of the table, and some general ideas about the costs and benefits (of waiting or not waiting, or of how long we want to wait in total, whatever might be important). This probably just means adding some heuristics that are pretty sensitive to any reason to not do more work in a non-aggressive VACUUM, without *completely* balking at doing even a tiny bit more work. For example, we can definitely afford to wait a few more milliseconds to get a cleanup lock just once, especially if we're already pretty sure that that's all the extra work that it would take to ultimately be able to advance relfrozenxid in the ongoing (non-aggressive) VACUUM -- it's easy to make that case. Once you agree that it makes sense under these favorable circumstances, you've already made "aggressiveness" a continuous thing conceptually, at a high level. The current binary definition of "aggressive" is needlessly restrictive -- that much seems clear to me. I'm much less sure of what specific alternative should replace it. I've already prototyped advancing relfrozenxid using a dynamically determined value, so that our final relfrozenxid is just about the most recent safe value (not the original FreezeLimit). That's been interesting. Consider this log output from an autovacuum with the prototype patch (also uses my new instrumentation), based on standard pgbench (just tuned heap fill factor a bit):

LOG: automatic vacuum of table "regression.public.pgbench_accounts": index scans: 0
pages: 0 removed, 909091 remain, 33559 skipped using visibility map (3.69% of total)
tuples: 297113 removed, 50090880 remain, 90880 are dead but not yet removable
removal cutoff: oldest xmin was 29296744, which is now 203341 xact IDs behind
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
I/O timings: read: 55.574 ms, write: 0.000 ms
avg read rate: 17.805 MB/s, avg write rate: 4.389 MB/s
buffer usage: 1728273 hits, 23150 misses, 5706 dirtied
WAL usage: 594211 records, 0 full page images, 35065032 bytes
system usage: CPU: user: 6.85 s, system: 0.08 s, elapsed: 10.15 s

All of the autovacuums against the accounts table look similar to this one -- you don't see anything about relfrozenxid being advanced (because it isn't). Whereas for the smaller pgbench tables, every single VACUUM successfully advances relfrozenxid to a fairly recent XID (without there ever being an aggressive VACUUM) -- just because VACUUM needs to visit every page for the smaller tables. While the accounts table doesn't generally need to have 100% of all pages touched by VACUUM -- it's more like 95% there. Does that really make sense, though? I'm pretty sure that less aggressive VACUUMing (e.g. higher scale_factor setting) would lead to more aggressive setting of relfrozenxid here. I'm always suspicious when I see insignificant differences that lead to significant behavioral differences. Am I worried over nothing here? Perhaps -- we don't really need to advance relfrozenxid early with this table/workload anyway. But I'm not so sure. Again, my point is that there is a good chance that redefining aggressiveness in some way will be helpful. A more creative, flexible definition might be just what we need. The details are very much up in the air, though. > > It would also be generally useful if aggressive VACUUMs were "less > > aggressive" opportunistically (e.g. 
by being responsive to query > > cancellations when the risk of wraparound failure is still very low). > > Being canceleable is already a different concept than anti-wraparound > vacuums. We start aggressive autovacuums at vacuum_freeze_table_age, but > anti-wrap only at autovacuum_freeze_max_age. You know what I meant. Also, did *you* mean "being canceleable is already a different concept to *aggressive* vacuums"? :-) > The problem is that the > autovacuum scheduling is way too naive for that to be a significant benefit - > nothing tries to schedule autovacuums so that they have a chance to complete > before anti-wrap autovacuums kick in. All that vacuum_freeze_table_age does is > to promote an otherwise-scheduled (auto-)vacuum to an aggressive vacuum. Not sure what you mean about scheduling, since vacuum_freeze_table_age is only in place to make overnight (off hours low activity scripted VACUUMs) freeze tuples before any autovacuum worker gets the chance (since the latter may run at a much less convenient time). Sure, vacuum_freeze_table_age might also force a regular autovacuum worker to do an aggressive VACUUM -- but I think it's mostly intended for a manual overnight VACUUM. Not usually very helpful, but also not harmful. Oh, wait. I think that you're talking about how autovacuum workers in particular tend to be affected by this. We launch an av worker that wants to clean up bloat, but it ends up being aggressive (and maybe taking way longer), perhaps quite randomly, only due to vacuum_freeze_table_age (not due to autovacuum_freeze_max_age). Is that it? > This is one of the most embarassing issues around the whole anti-wrap > topic. We kind of define it as an emergency that there's an anti-wraparound > vacuum. But we have *absolutely no mechanism* to prevent them from occurring. What do you mean? Only an autovacuum worker can do an anti-wraparound VACUUM (which is not quite the same thing as an aggressive VACUUM). I agree that anti-wraparound autovacuum is way too unfriendly, though. > > We now also collect LP_DEAD items in the dead_tuples array in the case > > where we cannot immediately get a cleanup lock on the buffer. We cannot > > prune without a cleanup lock, but opportunistic pruning may well have > > left some LP_DEAD items behind in the past -- no reason to miss those. > > This has become *much* more important with the changes around deciding when to > index vacuum. It's not just that opportunistic pruning could have left LP_DEAD > items, it's that a previous vacuum is quite likely to have left them there, > because the previous vacuum decided not to perform index cleanup. I haven't seen any evidence of that myself (with the optimization added to Postgres 14 by commit 5100010ee4). I still don't understand why you doubted that work so much. I'm not saying that you're wrong to; I'm saying that I don't think that I understand your perspective on it. What I have seen in my own tests (particularly with BenchmarkSQL) is that most individual tables either never apply the optimization even once (because the table reliably has heap pages with many more LP_DEAD items than the 2%-of-relpages threshold), or will never need to (because there are precisely zero LP_DEAD items anyway). Remaining tables that *might* use the optimization tend to not go very long without actually getting a round of index vacuuming. It's just too easy for updates (and even aborted xact inserts) to introduce new LP_DEAD items for us to go long without doing index vacuuming. 
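For reference, the bypass test that the 2%-of-relpages figure comes from (added to lazy_vacuum() by commit 5100010ee4) is approximately the following; the constants and the absolute cap on remembered dead item TIDs are paraphrased from memory, so treat the details as approximate:

    /* BYPASS_THRESHOLD_PAGES is 0.02 -- the 2%-of-relpages threshold */
    bool        bypass = false;

    if (vacrel->consider_bypass_optimization && vacrel->rel_pages > 0)
    {
        BlockNumber threshold;

        threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
        bypass = (vacrel->lpdead_item_pages < threshold &&
                  vacrel->lpdead_items < dead_items_cap);  /* cap: TIDs fitting in ~32MB */
    }

    if (bypass)
    {
        /*
         * Skip index vacuuming this time; the LP_DEAD items stay behind,
         * to be picked up by some later VACUUM (possibly via the
         * lazy_scan_noprune path discussed in this thread).
         */
        vacrel->do_index_vacuuming = false;
    }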
If you can be more concrete about a problem you've seen, then I might be able to help. It's not like there are no options here already. I already thought about introducing a small degree of randomness into the process of deciding to skip or to not skip (in the consider_bypass_optimization path of lazy_vacuum() on Postgres 14). The optimization is mostly valuable because it allows us to do more useful work in VACUUM -- not because it allows us to do less useless work in VACUUM. In particular, it allows us to tune autovacuum_vacuum_insert_scale_factor very aggressively with an append-only table, without useless index vacuuming making it all but impossible for autovacuum to get to the useful work. > > Only VACUUM can mark these LP_DEAD items LP_UNUSED (no opportunistic > > technique is independently capable of cleaning up line pointer bloat), > > One thing we could do around this, btw, would be to aggressively replace > LP_REDIRECT items with their target item. We can't do that in all situations > (somebody might be following a ctid chain), but I think we have all the > information needed to do so. Probably would require a new HTSV RECENTLY_LIVE > state or something like that. Another idea is to truncate the line pointer during pruning (including opportunistic pruning). Matthias van de Meent has a patch for that. I am not aware of a specific workload where the patch helps, but that doesn't mean that there isn't one, or that it doesn't matter. It's subtle enough that I might have just missed something. I *expect* the true damage over time to be very hard to model or understand -- I imagine the potential for weird feedback loops is there. > I think that'd be quite a win - we right now often "migrate" to other pages > for modifications not because we're out of space on a page, but because we run > out of itemids (for debatable reasons MaxHeapTuplesPerPage constrains the > number of line pointers, not just the number of actual tuples). Effectively > doubling the number of available line items in common cases in a number of > realistic / common scenarios would be quite the win. I believe Masahiko is working on this in the current cycle. It would be easier if we had a better sense of how increasing MaxHeapTuplesPerPage will affect tidbitmap.c. But the idea of increasing that seems sound to me. > > Note that we no longer report on "pin skipped pages" in VACUUM VERBOSE, > > since there is barely any real practical sense in which we actually > > miss doing useful work for these pages. Besides, this information > > always seemed to have little practical value, even to Postgres hackers. > > -0.5. I think it provides some value, and I don't see why the removal of the > information should be tied to this change. It's hard to diagnose why some dead > tuples aren't cleaned up - a common cause for that on smaller tables is that > nearly all pages are pinned nearly all the time. Is that still true, though? If it turns out that we need to leave it in, then I can do that. But I'd prefer to wait until we have more information before making a final decision. Remember, the high level idea of this whole patch is that we do as much work as possible for any scanned_pages, which now includes pages that we never successfully acquired a cleanup lock on. And so we're justified in assuming that they're exactly equivalent to pages that we did get a cleanup on -- that's now the working assumption. 
I know that that's not literally true, but that doesn't mean it's not a useful fiction -- it should be very close to the truth. Also, I would like to put more information (much more useful information) in the same log output. Perhaps that will be less controversial if I take something useless away first. > I wonder if we could have a more restrained version of heap_page_prune() that > doesn't require a cleanup lock? Obviously we couldn't defragment the page, but > it's not immediately obvious that we need it if we constrain ourselves to only > modify tuple versions that cannot be visible to anybody. > > Random note: I really dislike that we talk about cleanup locks in some parts > of the code, and super-exclusive locks in others :(. Somebody should normalize that. > > + /* > > + * Aggressive VACUUM (which is the same thing as anti-wraparound > > + * autovacuum for most practical purposes) exists so that we'll reliably > > + * advance relfrozenxid and relminmxid sooner or later. But we can often > > + * opportunistically advance them even in a non-aggressive VACUUM. > > + * Consider if that's possible now. > > I don't agree with the "most practical purposes" bit. There's a huge > difference because manual VACUUMs end up aggressive but not anti-wrap once > older than vacuum_freeze_table_age. Okay. > > + * NB: We must use orig_rel_pages, not vacrel->rel_pages, since we want > > + * the rel_pages used by lazy_scan_prune, from before a possible relation > > + * truncation took place. (vacrel->rel_pages is now new_rel_pages.) > > + */ > > I think it should be doable to add an isolation test for this path. There have > been quite a few bugs around the wider topic... I would argue that we already have one -- vacuum-reltuples.spec. I had to update its expected output in the patch. I would argue that the behavioral change (count tuples on a pinned-by-cursor heap page) that necessitated updating the expected output for the test is an improvement overall. > > + { > > + /* Can safely advance relfrozen and relminmxid, too */ > > + Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages == > > + orig_rel_pages); > > + vac_update_relstats(rel, new_rel_pages, new_live_tuples, > > + new_rel_allvisible, vacrel->nindexes > 0, > > + FreezeLimit, MultiXactCutoff, false); > > + } > > I wonder if this whole logic wouldn't become easier and less fragile if we > just went for maintaining the "actually observed" horizon while scanning the > relation. If we skip a page via VM set the horizon to invalid. Otherwise we > can keep track of the accurate horizon and use that. No need to count pages > and stuff. There is no question that that makes sense as an optimization -- my prototype convinced me of that already. But I don't think that it can simplify anything (not even the call to vac_update_relstats itself, to actually update relfrozenxid at the end). Fundamentally, this will only work if we decide to only skip all-frozen pages, which (by definition) only happens within aggressive VACUUMs. Isn't it that simple? You recently said (on the heap-pruning-14-bug thread) that you don't think it would be practical to always set a page all-frozen when we see that we're going to set it all-visible -- apparently you feel that we could never opportunistically freeze early such that all-visible but not all-frozen pages practically cease to exist. I'm still not sure why you believe that (though you may be right, or I might have misunderstood, since it's complicated). 
It would certainly benefit this dynamic relfrozenxid business if it was possible, though. If we could somehow make that work, then almost every VACUUM would be able to advance relfrozenxid, independently of aggressive-ness -- because we wouldn't have any all-visible-but-not-all-frozen pages to skip (that important detail wouldn't be left to chance). > > - if (skipping_blocks && !FORCE_CHECK_PAGE()) > > + if (skipping_blocks && > > + !(blkno == nblocks - 1 && should_attempt_truncation(vacrel))) > > { > > /* > > * Tricky, tricky. If this is in aggressive vacuum, the page > > I find the FORCE_CHECK_PAGE macro decidedly unhelpful. But I don't like > mixing such changes within a larger change doing many other things. I got rid of FORCE_CHECK_PAGE() itself in this patch (not a later patch) because the patch also removes the only other FORCE_CHECK_PAGE() call -- and the latter change is very much in scope for the big patch (can't be broken down into smaller changes, I think). And so this felt natural to me. But if you prefer, I can break it out into a separate commit. > > @@ -1204,156 +1214,52 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive) > > > > buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, > > RBM_NORMAL, vacrel->bstrategy); > > + page = BufferGetPage(buf); > > + vacrel->scanned_pages++; > > I don't particularly like doing BufferGetPage() before holding a lock on the > page. Perhaps I'm too influenced by rust etc, but ISTM that at some point it'd > be good to have a crosscheck that BufferGetPage() is only allowed when holding > a page level lock. I have occasionally wondered if the whole idea of reading heap pages with only a pin (and having cleanup locks in VACUUM) is really worth it -- alternative designs seem possible. Obviously that's a BIG discussion, and not one to have right now. But it seems kind of relevant. Since it is often legit to read a heap page without a buffer lock (only a pin), I can't see why BufferGetPage() without a buffer lock shouldn't also be okay -- if anything it seems safer. I think that I would agree with you if it wasn't for that inconsistency (which is rather a big "if", to be sure -- even for me). > > + /* Check for new or empty pages before lazy_scan_noprune call */ > > + if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, true, > > + vmbuffer)) > > { > > - ReleaseBuffer(buf); > > - vacrel->pinskipped_pages++; > > + /* Lock and pin released for us */ > > + continue; > > + } > > Why isn't this done in lazy_scan_noprune()? No reason, really -- could be done that way (we'd then also give lazy_scan_prune the same treatment). I thought that it made a certain amount of sense to keep some of this in the main loop, but I can change it if you want. > > + if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup)) > > + { > > + /* No need to wait for cleanup lock for this page */ > > + UnlockReleaseBuffer(buf); > > + if (hastup) > > + vacrel->nonempty_pages = blkno + 1; > > continue; > > } > > Do we really need all of buf, blkno, page for both of these functions? Quite > possible that yes, if so, could we add an assertion that > BufferGetBockNumber(buf) == blkno? This just matches the existing lazy_scan_prune function (which doesn't mean all that much, since it was only added in Postgres 14). Will add the assertion to both. 
> > + /* Check for new or empty pages before lazy_scan_prune call */ > > + if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, false, vmbuffer)) > > { > > Maybe worth a note mentioning that we need to redo this even in the aggressive > case, because we didn't continually hold a lock on the page? Isn't that obvious? Either way it isn't the kind of thing that I'd try to optimize away. It's such a narrow issue. > > +/* > > + * Empty pages are not really a special case -- they're just heap pages that > > + * have no allocated tuples (including even LP_UNUSED items). You might > > + * wonder why we need to handle them here all the same. It's only necessary > > + * because of a rare corner-case involving a hard crash during heap relation > > + * extension. If we ever make relation-extension crash safe, then it should > > + * no longer be necessary to deal with empty pages here (or new pages, for > > + * that matter). > > I don't think it's actually that rare - the window for this is huge. I can just remove the comment, though it still makes sense to me. > I don't really see that as a realistic thing to ever reliably avoid, FWIW. I > think the overhead would be prohibitive. We'd need to do synchronous WAL > logging while holding the extension lock I think. Um, not fun. My long term goal for the FSM (the lease based design I talked about earlier this year) includes soft ownership of free space from preallocated pages by individual xacts -- the smgr layer itself becomes transactional and crash safe (at least to a limited degree). This includes bulk extension of relations, to make up for the new overhead implied by crash safe rel extension. I don't think that we should require VACUUM (or anything else) to be cool with random uninitialized pages -- to me that just seems backwards. We can't do true bulk extension right now (just an inferior version that doesn't give specific pages to specific backends) because the risk of losing a bunch of empty pages for way too long is not acceptable. But that doesn't seem fundamental to me -- that's one of the things we'd be fixing at the same time (through what I call soft ownership semantics). I think we'd come out ahead on performance, and *also* have a more robust approach to relation extension. > > + * Caller can either hold a buffer cleanup lock on the buffer, or a simple > > + * shared lock. > > + */ > > Kinda sounds like it'd be incorrect to call this with an exclusive lock, which > made me wonder why that could be true. Perhaps just say that it needs to be > called with at least a shared lock? Okay. > > +static bool > > +lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno, > > + Page page, bool sharelock, Buffer vmbuffer) > > It'd be good to document the return value - for me it's not a case where it's > so obvious that it's not worth it. Okay. > > +/* > > + * lazy_scan_noprune() -- lazy_scan_prune() variant without pruning > > + * > > + * Caller need only hold a pin and share lock on the buffer, unlike > > + * lazy_scan_prune, which requires a full cleanup lock. > > I'd add somethign like "returns whether a cleanup lock is required". Having to > read multiple paragraphs to understand the basic meaning of the return value > isn't great. Will fix. > > + if (ItemIdIsRedirected(itemid)) > > + { > > + *hastup = true; /* page won't be truncatable */ > > + continue; > > + } > > It's not really new, but this comment is now a bit confusing, because it can > be understood to be about PageTruncateLinePointerArray(). I didn't think of that. 
Will address it in the next version. > Why does it make sense to track DEAD tuples this way? Isn't that going to lead > to counting them over-and-over again? I think it's quite misleading to include > them in "dead bot not yet removable". Compared to what? Do we really want to invent a new kind of DEAD tuple (e.g., to report on), just to handle this rare case? I accept that this code is lying about the tuples being RECENTLY_DEAD, kind of. But isn't it still strictly closer to the truth, compared to HEAD? Counting it as RECENTLY_DEAD is far closer to the truth than not counting it at all. Note that we don't remember LP_DEAD items here, either (not here, in lazy_scan_noprune, and not in lazy_scan_prune on HEAD). Because we pretty much interpret LP_DEAD items as "future LP_UNUSED items" instead -- we make a soft assumption that we're going to go on to mark the same items LP_UNUSED during a second pass over the heap. My point is that there is no natural way to count "fully DEAD tuple that autovacuum didn't deal with" -- and so I picked RECENTLY_DEAD. > > + /* > > + * Now save details of the LP_DEAD items from the page in the dead_tuples > > + * array iff VACUUM uses two-pass strategy case > > + */ > > Do we really need to have separate code for this in lazy_scan_prune() and > lazy_scan_noprune()? There is hardly any repetition, though. > > + } > > + else > > + { > > + /* > > + * We opt to skip FSM processing for the page on the grounds that it > > + * is probably being modified by concurrent DML operations. Seems > > + * best to assume that the space is best left behind for future > > + * updates of existing tuples. This matches what opportunistic > > + * pruning does. > > Why can we assume that there concurrent DML rather than concurrent read-only > operations? IME it's much more common for read-only operations to block > cleanup locks than read-write ones (partially because the frequency makes it > easier, partially because cursors allow long-held pins, partially because the > EXCLUSIVE lock of a r/w operation wouldn't let us get here) I actually agree. It still probably isn't worth dealing with the FSM here, though. It's just too much mechanism for too little benefit in a very rare case. What do you think? -- Peter Geoghegan
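Putting the pieces of this exchange together, the per-item loop in lazy_scan_noprune has roughly the following shape (a condensed sketch based on the hunks and descriptions above; live-tuple accounting is omitted, the freeze check relies on the existing heap_tuple_needs_freeze() helper, and the local variable names are illustrative):

    OffsetNumber offnum,
                maxoff;
    OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
    int         lpdead_items = 0,
                new_dead_tuples = 0;
    bool        missed_freeze = false;

    maxoff = PageGetMaxOffsetNumber(page);
    for (offnum = FirstOffsetNumber;
         offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
        ItemId      itemid = PageGetItemId(page, offnum);
        HeapTupleData tuple;

        if (!ItemIdIsUsed(itemid))
            continue;

        if (ItemIdIsRedirected(itemid))
        {
            *hastup = true;     /* page has something, won't be truncated away */
            continue;
        }

        if (ItemIdIsDead(itemid))
        {
            /*
             * Optimization 1: remember LP_DEAD items left behind by earlier
             * opportunistic pruning, or by a VACUUM that bypassed index
             * vacuuming, even though this VACUUM cannot prune the page.
             */
            deadoffsets[lpdead_items++] = offnum;
            continue;
        }

        *hastup = true;
        ItemPointerSet(&(tuple.t_self), blkno, offnum);
        tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
        tuple.t_len = ItemIdGetLength(itemid);
        tuple.t_tableOid = RelationGetRelid(vacrel->rel);

        /*
         * Optimization 2: only give up on advancing relfrozenxid when the
         * page actually has tuples that need freezing under the current
         * cutoffs.
         */
        if (heap_tuple_needs_freeze(tuple.t_data, vacrel->FreezeLimit,
                                    vacrel->MultiXactCutoff, buf))
            missed_freeze = true;

        switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf))
        {
            case HEAPTUPLE_DEAD:
            case HEAPTUPLE_RECENTLY_DEAD:
                /* both counted as "dead, but not yet removable" here */
                new_dead_tuples++;
                break;
            default:
                break;
        }
    }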
Hi, On 2021-11-22 17:07:46 -0800, Peter Geoghegan wrote: > Sure, it wouldn't be okay to wait *indefinitely* for any one pin in a > non-aggressive VACUUM -- so "at least waiting for one or two pins > during non-aggressive VACUUM" might not have been the best way of > expressing the idea that I wanted to express. The important point is > that _we can make a choice_ about stuff like this dynamically, based > on the observed characteristics of the table, and some general ideas > about the costs and benefits (of waiting or not waiting, or of how > long we want to wait in total, whatever might be important). This > probably just means adding some heuristics that are pretty sensitive > to any reason to not do more work in a non-aggressive VACUUM, without > *completely* balking at doing even a tiny bit more work. > For example, we can definitely afford to wait a few more milliseconds > to get a cleanup lock just once We currently have no infrastructure to wait for an lwlock or pincount for a limited time. And at least for the former it'd not be easy to add. It may be worth adding that at some point, but I'm doubtful this is sufficient reason for nontrivial new infrastructure in very performance sensitive areas. > All of the autovacuums against the accounts table look similar to this > one -- you don't see anything about relfrozenxid being advanced > (because it isn't). Whereas for the smaller pgbench tables, every > single VACUUM successfully advances relfrozenxid to a fairly recent > XID (without there ever being an aggressive VACUUM) -- just because > VACUUM needs to visit every page for the smaller tables. While the > accounts table doesn't generally need to have 100% of all pages > touched by VACUUM -- it's more like 95% there. Does that really make > sense, though? Does what really make sense? > I'm pretty sure that less aggressive VACUUMing (e.g. higher > scale_factor setting) would lead to more aggressive setting of > relfrozenxid here. I'm always suspicious when I see insignificant > differences that lead to significant behavioral differences. Am I > worried over nothing here? Perhaps -- we don't really need to advance > relfrozenxid early with this table/workload anyway. But I'm not so > sure. I think pgbench_accounts is just a really poor showcase. Most importantly there are no even slightly longer running transactions that hold down the xid horizon. But in real workloads that's incredibly common IME. It's also quite uncommon in real workloads to have huge tables in which all records are updated. It's more common to have value ranges that are nearly static, and a more heavily changing range. I think the most interesting cases where using the "measured" horizon will be advantageous are anti-wrap vacuums. Those obviously have to happen for rarely modified tables, including completely static ones, too. Using the "measured" horizon will allow us to reduce the frequency of anti-wrap autovacuums on old tables, because we'll be able to set a much more recent relfrozenxid. This is becoming more common with the increased use of partitioning. > > The problem is that the > > autovacuum scheduling is way too naive for that to be a significant benefit - > > nothing tries to schedule autovacuums so that they have a chance to complete > > before anti-wrap autovacuums kick in. All that vacuum_freeze_table_age does is > > to promote an otherwise-scheduled (auto-)vacuum to an aggressive vacuum. 
> > Not sure what you mean about scheduling, since vacuum_freeze_table_age > is only in place to make overnight (off hours low activity scripted > VACUUMs) freeze tuples before any autovacuum worker gets the chance > (since the latter may run at a much less convenient time). Sure, > vacuum_freeze_table_age might also force a regular autovacuum worker > to do an aggressive VACUUM -- but I think it's mostly intended for a > manual overnight VACUUM. Not usually very helpful, but also not > harmful. > Oh, wait. I think that you're talking about how autovacuum workers in > particular tend to be affected by this. We launch an av worker that > wants to clean up bloat, but it ends up being aggressive (and maybe > taking way longer), perhaps quite randomly, only due to > vacuum_freeze_table_age (not due to autovacuum_freeze_max_age). Is > that it? No, not quite. We treat anti-wraparound vacuums as an emergency (including logging messages, not cancelling). But the only mechanism we have against anti-wrap vacuums happening is vacuum_freeze_table_age. But as you say, that's not really a "real" mechanism, because it requires an "independent" reason to vacuum a table. I've seen cases where anti-wraparound vacuums weren't a problem / never happened for important tables for a long time, because there always was an "independent" reason for autovacuum to start doing its thing before the table got to be autovacuum_freeze_max_age old. But at some point the important tables started to be big enough that autovacuum didn't schedule vacuums that got promoted to aggressive via vacuum_freeze_table_age before the anti-wrap vacuums. Then things started to burn, because of the unpaced anti-wrap vacuums clogging up all IO, or maybe it was the vacuums not cancelling - I don't quite remember the details. Behaviours that lead to a "sudden" falling over, rather than getting gradually worse, are bad - they somehow tend to happen on Friday evenings :). > > This is one of the most embarrassing issues around the whole anti-wrap > > topic. We kind of define it as an emergency that there's an anti-wraparound > > vacuum. But we have *absolutely no mechanism* to prevent them from occurring. > > What do you mean? Only an autovacuum worker can do an anti-wraparound > VACUUM (which is not quite the same thing as an aggressive VACUUM). Just that autovacuum should have a mechanism to trigger aggressive vacuums (i.e. ones that are guaranteed to be able to increase relfrozenxid unless cancelled) before getting to the "emergency"-ish anti-wraparound state. Or alternatively that we should have a separate threshold for the "harsher" anti-wraparound measures. > > > We now also collect LP_DEAD items in the dead_tuples array in the case > > > where we cannot immediately get a cleanup lock on the buffer. We cannot > > > prune without a cleanup lock, but opportunistic pruning may well have > > > left some LP_DEAD items behind in the past -- no reason to miss those. > > > > This has become *much* more important with the changes around deciding when to > > index vacuum. It's not just that opportunistic pruning could have left LP_DEAD > > items, it's that a previous vacuum is quite likely to have left them there, > > because the previous vacuum decided not to perform index cleanup. > > I haven't seen any evidence of that myself (with the optimization > added to Postgres 14 by commit 5100010ee4). I still don't understand > why you doubted that work so much. 
I'm not saying that you're wrong > to; I'm saying that I don't think that I understand your perspective > on it. I didn't (nor do) doubt that it can be useful - to the contrary, I think the unconditional index pass was a huge practical issue. I do however think that there are cases where it can cause trouble. The comment above wasn't meant as a criticism - just that it seems worth pointing out that one reason we might encounter a lot of LP_DEAD items is previous vacuums that didn't perform index cleanup. > What I have seen in my own tests (particularly with BenchmarkSQL) is > that most individual tables either never apply the optimization even > once (because the table reliably has heap pages with many more LP_DEAD > items than the 2%-of-relpages threshold), or will never need to > (because there are precisely zero LP_DEAD items anyway). Remaining > tables that *might* use the optimization tend to not go very long > without actually getting a round of index vacuuming. It's just too > easy for updates (and even aborted xact inserts) to introduce new > LP_DEAD items for us to go long without doing index vacuuming. I think workloads are a bit more varied than a realistic set of benchmarks that one person can run themselves. I gave you examples of cases that I see as likely being bitten by this, e.g. when the skipped index cleanup prevents IOS scans. When both the likely-to-be-modified and likely-to-be-queried value ranges are a small subset of the entire data, the 2% threshold can prevent vacuum from cleaning up LP_DEAD entries for a long time. Or when all index scans are bitmap index scans, and nothing ends up cleaning up the dead index entries in certain ranges, and even an explicit vacuum doesn't fix the issue. Even a relatively small rollback / non-HOT update rate can start to be really painful. > > > Only VACUUM can mark these LP_DEAD items LP_UNUSED (no opportunistic > > > technique is independently capable of cleaning up line pointer bloat), > > > > One thing we could do around this, btw, would be to aggressively replace > > LP_REDIRECT items with their target item. We can't do that in all situations > > (somebody might be following a ctid chain), but I think we have all the > > information needed to do so. Probably would require a new HTSV RECENTLY_LIVE > > state or something like that. > > Another idea is to truncate the line pointer during pruning (including > opportunistic pruning). Matthias van de Meent has a patch for that. I'm a bit doubtful that's as important (which is not to say that it's not worth doing). For a heavily updated table the max space usage of the line pointer array just isn't as big a factor as ending up with only half the usable line pointers. > > > Note that we no longer report on "pin skipped pages" in VACUUM VERBOSE, > > > since there is barely any real practical sense in which we actually > > > miss doing useful work for these pages. Besides, this information > > > always seemed to have little practical value, even to Postgres hackers. > > > > -0.5. I think it provides some value, and I don't see why the removal of the > > information should be tied to this change. It's hard to diagnose why some dead > > tuples aren't cleaned up - a common cause for that on smaller tables is that > > nearly all pages are pinned nearly all the time. > > Is that still true, though? If it turns out that we need to leave it > in, then I can do that. But I'd prefer to wait until we have more > information before making a final decision. 
Remember, the high level > idea of this whole patch is that we do as much work as possible for > any scanned_pages, which now includes pages that we never successfully > acquired a cleanup lock on. And so we're justified in assuming that > they're exactly equivalent to pages that we did get a cleanup on -- > that's now the working assumption. I know that that's not literally > true, but that doesn't mean it's not a useful fiction -- it should be > very close to the truth. IDK, it seems misleading to me. Small tables with a lot of churn - quite common - are highly reliant on LP_DEAD entries getting removed or the tiny table suddenly isn't so tiny anymore. And it's harder to diagnose why the cleanup isn't happening without knowledge that pages needing cleanup couldn't be cleaned up due to pins. If you want to improve the logic so that we only count pages that would have something to clean up, I'd be happy as well. It doesn't have to mean exactly what it means today. > > > + * NB: We must use orig_rel_pages, not vacrel->rel_pages, since we want > > > + * the rel_pages used by lazy_scan_prune, from before a possible relation > > > + * truncation took place. (vacrel->rel_pages is now new_rel_pages.) > > > + */ > > > > I think it should be doable to add an isolation test for this path. There have > > been quite a few bugs around the wider topic... > > I would argue that we already have one -- vacuum-reltuples.spec. I had > to update its expected output in the patch. I would argue that the > behavioral change (count tuples on a pinned-by-cursor heap page) that > necessitated updating the expected output for the test is an > improvement overall. I was thinking of truncations, which I don't think vacuum-reltuples.spec tests. > > > + { > > > + /* Can safely advance relfrozen and relminmxid, too */ > > > + Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages == > > > + orig_rel_pages); > > > + vac_update_relstats(rel, new_rel_pages, new_live_tuples, > > > + new_rel_allvisible, vacrel->nindexes > 0, > > > + FreezeLimit, MultiXactCutoff, false); > > > + } > > > > I wonder if this whole logic wouldn't become easier and less fragile if we > > just went for maintaining the "actually observed" horizon while scanning the > > relation. If we skip a page via VM set the horizon to invalid. Otherwise we > > can keep track of the accurate horizon and use that. No need to count pages > > and stuff. > > There is no question that that makes sense as an optimization -- my > prototype convinced me of that already. But I don't think that it can > simplify anything (not even the call to vac_update_relstats itself, to > actually update relfrozenxid at the end). Maybe. But we've had quite a few bugs because we ended up changing some detail of what is excluded in one of the counters, leading to wrong determination about whether we scanned everything or not. > Fundamentally, this will only work if we decide to only skip all-frozen > pages, which (by definition) only happens within aggressive VACUUMs. Hm? Or if there's just no runs of all-visible pages of sufficient length, so we don't end up skipping at all. > You recently said (on the heap-pruning-14-bug thread) that you don't > think it would be practical to always set a page all-frozen when we > see that we're going to set it all-visible -- apparently you feel that > we could never opportunistically freeze early such that all-visible > but not all-frozen pages practically cease to exist. 
I'm still not > sure why you believe that (though you may be right, or I might have > misunderstood, since it's complicated). Yes, I think it may not work out to do that. But it's not a very strongly held opinion. One reason for my doubt is the following: We can set all-visible on a page without a FPW image (well, as long as hint bits aren't logged). There's a significant difference between needing to WAL log FPIs for every heap page or not, and it's not that rare for data to live shorter than autovacuum_freeze_max_age or that limit never being reached. On a table with 40 million individually inserted rows, fully hintbitted via reads, I see a first VACUUM taking 1.6s and generating 11MB of WAL. A subsequent VACUUM FREEZE takes 5s and generates 500MB of WAL. That's quite a large multiplier... If we ever managed to not have a per-page all-visible flag this'd get even more extreme, because we'd then not even need to dirty the page for insert-only pages. But if we want to freeze, we'd need to (unless we just got rid of freezing). > It would certainly benefit this dynamic relfrozenxid business if it was > possible, though. If we could somehow make that work, then almost every > VACUUM would be able to advance relfrozenxid, independently of > aggressive-ness -- because we wouldn't have any > all-visible-but-not-all-frozen pages to skip (that important detail wouldn't > be left to chance). Perhaps we can have most of the benefit even without that. If we were to freeze whenever it didn't cause an additional FPWing, and perhaps didn't skip all-visible but not all-frozen pages if they were less than x% of the to-be-scanned data, we should be able to still increase relfrozenxid in a lot of cases? > > I don't particularly like doing BufferGetPage() before holding a lock on the > > page. Perhaps I'm too influenced by rust etc, but ISTM that at some point it'd > > be good to have a crosscheck that BufferGetPage() is only allowed when holding > > a page level lock. > > I have occasionally wondered if the whole idea of reading heap pages > with only a pin (and having cleanup locks in VACUUM) is really worth > it -- alternative designs seem possible. Obviously that's a BIG > discussion, and not one to have right now. But it seems kind of > relevant. With 'reading' do you mean reads-from-os, or just references to buffer contents? > Since it is often legit to read a heap page without a buffer lock > (only a pin), I can't see why BufferGetPage() without a buffer lock > shouldn't also be okay -- if anything it seems safer. I think that I > would agree with you if it wasn't for that inconsistency (which is > rather a big "if", to be sure -- even for me). At least for heap it's rarely legit to read buffer contents via BufferGetPage() without a lock. It's legit to read data at already-determined offsets, but you can't look at much other than the tuple contents. > > Why does it make sense to track DEAD tuples this way? Isn't that going to lead > > to counting them over-and-over again? I think it's quite misleading to include > > them in "dead but not yet removable". > > Compared to what? Do we really want to invent a new kind of DEAD tuple > (e.g., to report on), just to handle this rare case? When looking at logs I use the "tuples: %lld removed, %lld remain, %lld are dead but not yet removable, oldest xmin: %u\n" line to see whether the user is likely to have issues around an old transaction / slot / prepared xact preventing cleanup. 
If new_dead_tuples no longer identifies those cases, that line is no longer reliable. > I accept that this code is lying about the tuples being RECENTLY_DEAD, > kind of. But isn't it still strictly closer to the truth, compared to > HEAD? Counting it as RECENTLY_DEAD is far closer to the truth than not > counting it at all. I don't see how it's closer at all. There's imo a significant difference between not being able to remove tuples because of the xmin horizon, and not being able to remove them because we couldn't get a cleanup lock. Greetings, Andres Freund
On Mon, Nov 22, 2021 at 9:49 PM Andres Freund <andres@anarazel.de> wrote: > > For example, we can definitely afford to wait a few more milliseconds > > to get a cleanup lock just once > > We currently have no infrastructure to wait for an lwlock or pincount for a > limited time. And at least for the former it'd not be easy to add. It may be > worth adding that at some point, but I'm doubtful this is sufficient reason > for nontrivial new infrastructure in very performance sensitive areas. It was a hypothetical example. To be more practical about it: it seems likely that we won't really benefit from waiting some amount of time (not forever) for a cleanup lock in non-aggressive VACUUM, once we have some of the relfrozenxid stuff we've talked about in place. In a world where we're smarter about advancing relfrozenxid in non-aggressive VACUUMs, the choice between waiting for a cleanup lock, and not waiting (but also not advancing relfrozenxid at all) matters less -- it's no longer a binary choice. It's no longer a binary choice because we will have done away with the current rigid way in which our new relfrozenxid for the relation is either FreezeLimit, or nothing at all. So far we've only talked about the case where we can update relfrozenxid with a value that happens to be much newer than FreezeLimit. If we can do that, that's great. But what about setting relfrozenxid to an *older* value than FreezeLimit instead (in a non-aggressive VACUUM)? That's also pretty good! There is still a decent chance that the final "suboptimal" relfrozenxid that we determine can be safely set in pg_class at the end of our VACUUM will still be far more recent than the preexisting relfrozenxid. Especially with larger tables. Advancing relfrozenxid should be thought of as a totally independent thing to freezing tuples, at least in vacuumlazy.c itself. That's kinda the case today, even, but *explicitly* decoupling advancing relfrozenxid from actually freezing tuples seems like a good high level goal for this project. Remember, FreezeLimit is derived from vacuum_freeze_min_age in the obvious way: OldestXmin for the VACUUM, minus vacuum_freeze_min_age GUC/reloption setting. I'm pretty sure that this means that making autovacuum freeze tuples more aggressively (by reducing vacuum_freeze_min_age) could have the perverse effect of making non-aggressive VACUUMs less likely to advance relfrozenxid -- which is exactly backwards. This effect could easily be missed, even by expert users, since there is no convenient instrumentation that shows how and when relfrozenxid is advanced. > > All of the autovacuums against the accounts table look similar to this > > one -- you don't see anything about relfrozenxid being advanced > > (because it isn't). >> Does that really make > > sense, though? > > Does what make really sense? Well, my accounts table example wasn't a particularly good one (it was a conveniently available example). I am now sure that you got the point I was trying to make here already, based on what you go on to say about non-aggressive VACUUMs optionally *not* skipping all-visible-not-all-frozen heap pages in the hopes of advancing relfrozenxid earlier (more on that idea below, in my response). On reflection, the simplest way of expressing the same idea is what I just said about decoupling (decoupling advancing relfrozenxid from freezing). > I think pgbench_accounts is just a really poor showcase. Most importantly > there's no even slightly longer running transactions that hold down the xid > horizon. 
But in real workloads that's incredibly common IME. It's also quite > uncommon in real workloads to have huge tables in which all records are > updated. It's more common to have value ranges that are nearly static, and a > more heavily changing range. I agree. > I think the most interesting cases where using the "measured" horizon will be > advantageous are anti-wrap vacuums. Those obviously have to happen for rarely > modified tables, including completely static ones, too. Using the "measured" > horizon will allow us to reduce the frequency of anti-wrap autovacuums on old > tables, because we'll be able to set a much more recent relfrozenxid. That's probably true in practice -- but who knows these days, with the autovacuum_vacuum_insert_scale_factor stuff? Either way I see no reason to emphasize that case in the design itself. The "decoupling" concept now seems like the key design-level concept -- everything else follows naturally from that. > This is becoming more common with the increased use of partitioning. Also with bulk loading. There could easily be a tiny number of distinct XIDs that are close together in time, for many many rows -- practically one XID, or even exactly one XID. > No, not quite. We treat anti-wraparound vacuums as an emergency (including > logging messages, not cancelling). But the only mechanism we have against > anti-wrap vacuums happening is vacuum_freeze_table_age. But as you say, that's > not really a "real" mechanism, because it requires an "independent" reason to > vacuum a table. Got it. > I've seen cases where anti-wraparound vacuums weren't a problem / never > happened for important tables for a long time, because there always was an > "independent" reason for autovacuum to start doing its thing before the table > got to be autovacuum_freeze_max_age old. But at some point the important > tables started to be big enough that autovacuum didn't schedule vacuums that > got promoted to aggressive via vacuum_freeze_table_age before the anti-wrap > vacuums. Right. Not just because they were big; also because autovacuum runs at geometric intervals -- the final reltuples from last time is used to determine the point at which av runs this time. This might make sense, or it might not make any sense -- it all depends (mostly on index stuff). > Then things started to burn, because of the unpaced anti-wrap vacuums > clogging up all IO, or maybe it was the vacuums not cancelling - I don't quite > remember the details. Non-cancelling anti-wraparound VACUUMs that (all of a sudden) cause chaos because they interact badly with automated DDL are something I've seen several times -- I'm sure you have too. That was what the Manta/Joyent blogpost I referenced upthread went into. > Behaviours that lead to a "sudden" falling over, rather than getting gradually > worse, are bad - they somehow tend to happen on Friday evenings :). These are among our most important challenges IMV. > Just that autovacuum should have a mechanism to trigger aggressive vacuums > (i.e. ones that are guaranteed to be able to increase relfrozenxid unless > cancelled) before getting to the "emergency"-ish anti-wraparound state. Maybe, but that runs into the problem of needing another GUC that nobody will ever be able to remember the name of. I consider the idea of adding a variety of measures that make non-aggressive VACUUM much more likely to advance relfrozenxid in practice to be far more promising. > Or alternatively that we should have a separate threshold for the "harsher" > anti-wraparound measures. 
Or maybe just raise the default of autovacuum_freeze_max_age, which many people don't change? That might be a lot safer than it once was. Or will be, once we manage to teach VACUUM to advance relfrozenxid more often in non-aggressive VACUUMs on Postgres 15. Imagine a world in which we have that stuff in place, as well as related enhancements added in earlier releases: autovacuum_vacuum_insert_scale_factor, the freezemap, and the wraparound failsafe. These add up to a lot; with all of that in place, the risk we'd be introducing by increasing the default value of autovacuum_freeze_max_age would be *far* lower than the risk of making the same change back in 2006. I bring up 2006 because it was the year that commit 48188e1621 added autovacuum_freeze_max_age -- the default hasn't changed since that time. > I think workloads are a bit more varied than a realistic set of benchmarks > that one person can run themselves. No question. I absolutely accept that I only have to miss one important detail with something like this -- that just goes with the territory. Just saying that I have yet to see any evidence that the bypass-indexes behavior really hurt anything. I do take the idea that I might have missed something very seriously, despite all this. > I gave you examples of cases that I see as likely being bitten by this, > e.g. when the skipped index cleanup prevents IOS scans. When both the > likely-to-be-modified and likely-to-be-queried value ranges are a small subset > of the entire data, the 2% threshold can prevent vacuum from cleaning up > LP_DEAD entries for a long time. Or when all index scans are bitmap index > scans, and nothing ends up cleaning up the dead index entries in certain > ranges, and even an explicit vacuum doesn't fix the issue. Even a relatively > small rollback / non-HOT update rate can start to be really painful. That does seem possible. But I consider it very unlikely to appear as a regression caused by the bypass mechanism itself -- not in any way that was consistent over time. As far as I can tell, autovacuum scheduling just doesn't operate at that level of precision, and never has. I have personally observed that ANALYZE does a very bad job at noticing LP_DEAD items in tables/workloads where LP_DEAD items (not DEAD tuples) tend to concentrate [1]. The whole idea that ANALYZE should count these items as if they were normal tuples seems pretty bad to me. Put it this way: imagine you run into trouble with the bypass thing, and then you opt to disable it on that table (using the INDEX_CLEANUP reloption). Why should this step solve the problem on its own? In order for that to work, VACUUM would have to know to be very aggressive about these LP_DEAD items. But there is good reason to believe that it just won't ever notice them, as long as ANALYZE is expected to provide reliable statistics that drive autovacuum -- they're just too concentrated for the block-based approach to truly work. I'm not minimizing the risk. Just telling you my thoughts on this. > I'm a bit doubtful that's as important (which is not to say that it's not > worth doing). For a heavily updated table the max space usage of the line > pointer array just isn't as big a factor as ending up with only half the > usable line pointers. Agreed; by far the best chance we have of improving the line pointer bloat situation is preventing it in the first place, by increasing MaxHeapTuplesPerPage. 
Once we actually do that, our remaining options are going to be much less helpful -- then it really is mostly just up to VACUUM. > And it's harder to diagnose why the > cleanup isn't happening without knowledge that pages needing cleanup couldn't > be cleaned up due to pins. > > If you want to improve the logic so that we only count pages that would have > something to clean up, I'd be happy as well. It doesn't have to mean exactly > what it means today. It seems like what you really care about here are remaining cases where our inability to acquire a cleanup lock has real consequences -- you want to hear about it when it happens, however unlikely it may be. In other words, you want to keep something in log_autovacuum_* that indicates that "less than the expected amount of work was completed" due to an inability to acquire a cleanup lock. And so for you, this is a question of keeping instrumentation that might still be useful, not a question of how we define things fundamentally, at the design level. Sound right? If so, then this proposal might be acceptable to you: * Remaining DEAD tuples with storage (though not LP_DEAD items from previous opportunistic pruning) will get counted separately in the lazy_scan_noprune (no cleanup lock) path. Also count the total number of distinct pages that were found to contain one or more such DEAD tuples. * These two new counters will be reported on their own line in the log output, though only in the cases where we actually have any such tuples -- which will presumably be much rarer than simply failing to get a cleanup lock (that's now no big deal at all, because we now consistently do certain cleanup steps, and because FreezeLimit isn't the only viable thing that we can set relfrozenxid to, at least in the non-aggressive case). * There is still a limited sense in which the same items get counted as RECENTLY_DEAD -- though just those aspects that make the overall design simpler. So the helpful aspects of this are still preserved. We only need to tell pgstat_report_vacuum() that these items are "deadtuples" (remaining dead tuples). That can work by having its caller add a new int64 counter (same new tuple-based counter used for the new log line) to vacrel->new_dead_tuples. We'd also add the same new tuple counter in about the same way at the point where we determine a final vacrel->new_rel_tuples. So we wouldn't really be treating anything as RECENTLY_DEAD anymore -- pgstat_report_vacuum() and vacrel->new_dead_tuples don't specifically expect anything about RECENTLY_DEAD-ness already. > I was thinking of truncations, which I don't think vacuum-reltuples.spec > tests. Got it. I'll look into that for v2. > Maybe. But we've had quite a few bugs because we ended up changing some detail > of what is excluded in one of the counters, leading to wrong determination > about whether we scanned everything or not. Right. But let me just point out that my whole approach is to make that impossible, by not needing to count pages, except in scanned_pages (and in frozenskipped_pages + rel_pages). The processing performed for any page that we actually read during VACUUM should be uniform (or practically uniform), by definition. With minimal fudging in the cleanup lock case (because we mostly do the same work there too). There should be no reason for any more page counters now, except for non-critical instrumentation. 
For example, if you want to get the total number of pages skipped via the visibility map (not just all-frozen pages), then you simply subtract scanned_pages from rel_pages. > > Fundamentally, this will only work if we decide to only skip all-frozen > > pages, which (by definition) only happens within aggressive VACUUMs. > > Hm? Or if there's just no runs of all-visible pages of sufficient length, so > we don't end up skipping at all. Of course. But my point was: who knows when that'll happen? > On reason for my doubt is the following: > > We can set all-visible on a page without a FPW image (well, as long as hint > bits aren't logged). There's a significant difference between needing to WAL > log FPIs for every heap page or not, and it's not that rare for data to live > shorter than autovacuum_freeze_max_age or that limit never being reached. This sounds like an objection to one specific heuristic, and not an objection to the general idea. The only essential part is "opportunistic freezing during vacuum, when the cost is clearly very low, and the benefit is probably high". And so it now seems you were making a far more limited statement than I first believed. Obviously many variations are possible -- there is a spectrum. Example: a heuristic that makes VACUUM notice when it is going to freeze at least one tuple on a page, iff the page will be marked all-visible in any case -- we should instead freeze every tuple on the page, and mark the page all-frozen, batching work (could account for LP_DEAD items here too, not counting them on the assumption that they'll become LP_UNUSED during the second heap pass later on). If we see these conditions, then the likely explanation is that the tuples on the heap page happen to have XIDs that are "split" by the not-actually-important FreezeLimit cutoff, despite being essentially similar in any way that matters. If you want to make the same heuristic more conservative: only do this when no existing tuples are frozen, since that could be taken as a sign of the original heuristic not quite working on the same heap page at an earlier stage. I suspect that even very conservative versions of the same basic idea would still help a lot. > Perhaps we can have most of the benefit even without that. If we were to > freeze whenever it didn't cause an additional FPWing, and perhaps didn't skip > all-visible but not !all-frozen pages if they were less than x% of the > to-be-scanned data, we should be able to to still increase relfrozenxid in a > lot of cases? I bet that's true. I like that idea. If we had this policy, then the number of "extra" visited-in-non-aggressive-vacuum pages (all-visible but not yet all-frozen pages) could be managed over time through more opportunistic freezing. This might make it work even better. These all-visible (but not all-frozen) heap pages could be considered "tenured", since they have survived at least one full VACUUM cycle without being unset. So why not also freeze them based on the assumption that they'll probably stay that way forever? There won't be so many of the pages when we do this anyway, by definition -- since we'd have a heuristic that limited the total number (say to no more than 10% of the total relation size, something like that). We're smoothing out the work that currently takes place all together during an aggressive VACUUM this way. 
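To make that heuristic concrete, here is a minimal standalone C sketch -- hypothetical names and parameters, not the patch's actual lazy_scan_prune code:

#include <stdbool.h>
#include <stdio.h>

/*
 * Sketch of the "batch the freezing" heuristic: if the page is about to be
 * set all-visible anyway, and at least one tuple on it must be frozen in any
 * case, freeze every tuple and set the page all-frozen instead. The
 * "conservative" flag is the stricter variant: back off when some tuples are
 * already frozen, taking that as a sign the heuristic fired here before.
 */
static bool
freeze_whole_page(bool will_set_all_visible,
                  int ntuples_needing_freeze,
                  int ntuples_already_frozen,
                  bool conservative)
{
    if (!will_set_all_visible)
        return false;           /* only batch work we would start anyway */
    if (ntuples_needing_freeze == 0)
        return false;           /* nothing forces freezing on this page yet */
    if (conservative && ntuples_already_frozen > 0)
        return false;           /* stricter variant backs off here */
    return true;
}

int
main(void)
{
    /* one tuple crosses FreezeLimit on a soon-to-be-all-visible page */
    printf("%d\n", freeze_whole_page(true, 1, 0, true));   /* prints 1 */
    /* all-visible page with nothing that must be frozen yet */
    printf("%d\n", freeze_whole_page(true, 0, 0, true));   /* prints 0 */
    return 0;
}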
Moreover, there is perhaps a good chance that the total number of all-visible-not all-frozen heap pages will *stay* low over time, as a result of this policy actually working -- there may be a virtuous cycle that totally prevents us from getting an aggressive VACUUM even once. > > I have occasionally wondered if the whole idea of reading heap pages > > with only a pin (and having cleanup locks in VACUUM) is really worth > > it -- alternative designs seem possible. Obviously that's a BIG > > discussion, and not one to have right now. But it seems kind of > > relevant. > > With 'reading' do you mean reads-from-os, or just references to buffer > contents? The latter. [1] https://postgr.es/m/CAH2-Wz=9R83wcwZcPUH4FVPeDM4znzbzMvp3rt21+XhQWMU8+g@mail.gmail.com -- Peter Geoghegan
Hi, On 2021-11-23 17:01:20 -0800, Peter Geoghegan wrote: > > One reason for my doubt is the following: > > > > We can set all-visible on a page without a FPW image (well, as long as hint > > bits aren't logged). There's a significant difference between needing to WAL > > log FPIs for every heap page or not, and it's not that rare for data to live > > shorter than autovacuum_freeze_max_age or that limit never being reached. > > This sounds like an objection to one specific heuristic, and not an > objection to the general idea. I understood you to propose that we do not have separate frozen and all-visible states. Which I think will be problematic, because of scenarios like the above. > The only essential part is "opportunistic freezing during vacuum, when the > cost is clearly very low, and the benefit is probably high". And so it now > seems you were making a far more limited statement than I first believed. I'm on board with freezing when we already dirty out the page, and when doing so doesn't cause an additional FPI. And I don't think I've argued against that in the past. > These all-visible (but not all-frozen) heap pages could be considered > "tenured", since they have survived at least one full VACUUM cycle > without being unset. So why not also freeze them based on the > assumption that they'll probably stay that way forever? Because it's a potentially massive increase in write volume? E.g. if you have an insert-only workload, and you discard old data by dropping old partitions, this will often add yet another rewrite, despite your data likely never getting old enough to need to be frozen. Given that we often immediately need to start another vacuum just when one finished, because the vacuum took long enough to reach thresholds of vacuuming again, I don't think the (auto-)vacuum count is a good proxy. Maybe you meant this as a more limited concept, i.e. only doing so when the percentage of all-visible but not all-frozen pages is small? We could perhaps do better if we had information about the system-wide rate of xid throughput and how often / how long past vacuums of a table took. Greetings, Andres Freund
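Stated as code, the narrower policy endorsed above amounts to a single cheap test. This is a hypothetical standalone sketch, not PostgreSQL's actual WAL or buffer machinery; in the real system the FPI question would be answered by the WAL insertion code rather than a pre-computed flag:

#include <stdbool.h>
#include <stdio.h>

/*
 * Opportunistically freeze a page's tuples only when pruning has already
 * dirtied the page and freezing would not force an extra full-page image
 * into WAL -- i.e. only when the marginal cost of freezing is near zero.
 */
static bool
opportunistic_freeze_is_cheap(bool page_already_dirtied,
                              bool freeze_forces_new_fpi)
{
    return page_already_dirtied && !freeze_forces_new_fpi;
}

int
main(void)
{
    printf("%d\n", opportunistic_freeze_is_cheap(true, false));    /* 1: freeze */
    printf("%d\n", opportunistic_freeze_is_cheap(false, true));    /* 0: skip */
    return 0;
}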
On Tue, Nov 23, 2021 at 5:01 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Behaviours that lead to a "sudden" falling over, rather than getting gradually > > worse, are bad - they somehow tend to happen on Friday evenings :). > > These are among our most important challenges IMV. I haven't had time to work through any of your feedback just yet -- though it's certainly a priority for me. I won't get to it until I return home from PGConf NYC next week. Even still, here is a rebased v2, just to fix the bitrot. This is just a courtesy to anybody interested in the patch. -- Peter Geoghegan
Attachment
On Tue, Nov 30, 2021 at 11:52 AM Peter Geoghegan <pg@bowt.ie> wrote: > I haven't had time to work through any of your feedback just yet -- > though it's certainly a priority for. I won't get to it until I return > home from PGConf NYC next week. Attached is v3, which works through most of your (Andres') feedback. Changes in v3: * While the first patch still gets rid of the "pinskipped_pages" instrumentation, the second patch adds back a replacement that's better targeted: it tracks and reports "missed_dead_tuples". This means that log output will show the number of fully DEAD tuples with storage that could not be pruned away due to the fact that that would have required waiting for a cleanup lock. But we *don't* generally report the number of pages that we couldn't get a cleanup lock on, because that in itself doesn't mean that we skipped any useful work (which is very much the point of all of the refactoring in the first patch). * We now have FSM processing in the lazy_scan_noprune case, which more or less matches the standard lazy_scan_prune case. * Many small tweaks, based on suggestions from Andres, and other things that I noticed. * Further simplification of the "consider skipping pages using visibility map" logic -- now we always don't skip the last block in the relation, without calling should_attempt_truncation() to make sure we have a reason. Note that this means that we'll always read the final page during VACUUM, even when doing so is provably unhelpful. I'd prefer to keep the code that deals with skipping pages using the visibility map as simple as possible. There isn't much downside to always doing that once my refactoring is in place: there is no risk that we'll wait for a cleanup lock (on the final page in the rel) for no good reason. We're only wasting one page access, at most. (I'm not 100% sure that this is the right trade-off, actually, but it's at least worth considering.) Not included in v3: * Still haven't added the isolation test for rel truncation, though it's on my TODO list. * I'm still working on the optimization that we discussed on this thread: the optimization that allows the final relfrozenxid (that we set in pg_class) to be determined dynamically, based on the actual XIDs we observed in the table (we don't just naively use FreezeLimit). I'm not ready to post that today, but it shouldn't take too much longer to be good enough to review. Thanks -- Peter Geoghegan
Attachment
On Fri, Dec 10, 2021 at 1:48 PM Peter Geoghegan <pg@bowt.ie> wrote: > * I'm still working on the optimization that we discussed on this > thread: the optimization that allows the final relfrozenxid (that we > set in pg_class) to be determined dynamically, based on the actual > XIDs we observed in the table (we don't just naively use FreezeLimit). Attached is v4 of the patch series, which now includes this optimization, broken out into its own patch. In addition, it includes a prototype of opportunistic freezing. My emphasis here has been on making non-aggressive VACUUMs *always* advance relfrozenxid, outside of certain obvious edge cases. And so with all the patches applied, up to and including the opportunistic freezing patch, every autovacuum of every table manages to advance relfrozenxid during benchmarking -- usually to a fairly recent value. I've focussed on making aggressive VACUUMs (especially anti-wraparound autovacuums) a rare occurrence, for truly exceptional cases (e.g., user keeps canceling autovacuums, maybe due to automated script that performs DDL). That has taken priority over other goals, for now. There is a kind of virtuous circle here, where successive non-aggressive autovacuums never fall behind on freezing, and so never fail to advance relfrozenxid (there are never any all_visible-but-not-all_frozen pages, and we can cope with not acquiring a cleanup lock quite well). When VACUUM chooses to freeze a tuple opportunistically, the frozen XIDs naturally cannot hold back the final safe relfrozenxid for the relation. Opportunistic freezing avoids setting all_visible (without setting all_frozen) in the visibility map. It's impossible for VACUUM to just set a page to all_visible now, which seems like an essential part of making a decent amount of relfrozenxid advancement take place in almost every VACUUM operation. Here is an example of what I'm calling a virtuous circle -- all pgbench_history autovacuums look like this with the patch applied:

LOG: automatic vacuum of table "regression.public.pgbench_history": index scans: 0
pages: 0 removed, 35503 remain, 31930 skipped using visibility map (89.94% of total)
tuples: 0 removed, 5568687 remain (547976 newly frozen), 0 are dead but not yet removable
removal cutoff: oldest xmin was 5570281, which is now 1177 xact IDs behind
relfrozenxid: advanced by 546618 xact IDs, new value: 5565226
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
I/O timings: read: 0.003 ms, write: 0.000 ms
avg read rate: 0.068 MB/s, avg write rate: 0.068 MB/s
buffer usage: 7169 hits, 1 misses, 1 dirtied
WAL usage: 7043 records, 1 full page images, 6974928 bytes
system usage: CPU: user: 0.10 s, system: 0.00 s, elapsed: 0.11 s

Note that relfrozenxid is almost the same as oldest xmin here. Note also that the log output shows the number of tuples newly frozen. I see the same general trends with *every* pgbench_history autovacuum. Actually, with every autovacuum. The history table tends to have ultra-recent relfrozenxid values, which isn't always what we see, but that difference may not matter. As far as I can tell, we can expect practically every table to have a relfrozenxid that would (at least traditionally) be considered very safe/recent. Barring weird application issues that make it totally impossible to advance relfrozenxid (e.g., idle cursors that hold onto a buffer pin forever), it seems as if relfrozenxid will now steadily march forward. 
Sure, relfrozenxid advancement might be held by the occasional inability to acquire a cleanup lock, but the effect isn't noticeable over time; what are the chances that a cleanup lock won't be available on the same page (with the same old XID) more than once or twice? The odds of that happening become astronomically tiny, long before there is any real danger (barring pathological cases). In the past, we've always talked about opportunistic freezing as a way of avoiding re-dirtying heap pages during successive VACUUM operations -- especially as a way of lowering the total volume of WAL. While I agree that that's important, I have deliberately ignored it for now, preferring to focus on the relfrozenxid stuff, and smoothing out the cost of freezing (avoiding big shocks from aggressive/anti-wraparound autovacuums). I care more about stable performance than absolute throughput, but even still I believe that the approach I've taken to opportunistic freezing is probably too aggressive. But it's dead simple, which will make it easier to understand and discuss the issue of central importance. It may be possible to optimize the WAL-logging used during freezing, getting the cost down to the point where freezing early just isn't a concern. The current prototype adds extra WAL overhead, to be sure, but even that's not wildly unreasonable (you make some of it back on FPIs, depending on the workload -- especially with tables like pgbench_history, where delaying freezing is a total loss). -- Peter Geoghegan
Attachment
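The "determined dynamically" relfrozenxid described above comes down to maintaining a running minimum of the XIDs that remain unfrozen while the table is scanned. Here is a rough standalone sketch of that bookkeeping -- hypothetical names rather than the patch's actual code, with the example numbers borrowed from the pgbench_history log output above:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

/* modulo-2^32 ordering, in the style of TransactionIdPrecedes() */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}

typedef struct FrozenXidTracker
{
    TransactionId oldest_unfrozen;  /* running minimum, starts at OldestXmin */
    bool        valid;              /* false once a not-all-frozen page is skipped */
} FrozenXidTracker;

/* called for every XID that VACUUM decides to leave unfrozen */
static void
observe_unfrozen_xid(FrozenXidTracker *t, TransactionId xid)
{
    if (xid_precedes(xid, t->oldest_unfrozen))
        t->oldest_unfrozen = xid;
}

int
main(void)
{
    /* "oldest xmin was 5570281" in the log output above */
    FrozenXidTracker t = { .oldest_unfrozen = 5570281, .valid = true };

    observe_unfrozen_xid(&t, 5565226);
    observe_unfrozen_xid(&t, 5569000);

    if (t.valid)                        /* every page scanned or all-frozen */
        printf("candidate relfrozenxid: %u\n", (unsigned) t.oldest_unfrozen);
    return 0;
}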
On Thu, Dec 16, 2021 at 5:27 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Fri, Dec 10, 2021 at 1:48 PM Peter Geoghegan <pg@bowt.ie> wrote: > > * I'm still working on the optimization that we discussed on this > > thread: the optimization that allows the final relfrozenxid (that we > > set in pg_class) to be determined dynamically, based on the actual > > XIDs we observed in the table (we don't just naively use FreezeLimit). > > Attached is v4 of the patch series, which now includes this > optimization, broken out into its own patch. In addition, it includes > a prototype of opportunistic freezing. > > My emphasis here has been on making non-aggressive VACUUMs *always* > advance relfrozenxid, outside of certain obvious edge cases. And so > with all the patches applied, up to and including the opportunistic > freezing patch, every autovacuum of every table manages to advance > relfrozenxid during benchmarking -- usually to a fairly recent value. > I've focussed on making aggressive VACUUMs (especially anti-wraparound > autovacuums) a rare occurrence, for truly exceptional cases (e.g., > user keeps canceling autovacuums, maybe due to automated script that > performs DDL). That has taken priority over other goals, for now. Great! I've looked at the 0001 patch and here are some comments:

@@ -535,8 +540,16 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
                                             xidFullScanLimit);
     aggressive |= MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
                                               mxactFullScanLimit);
+    skipwithvm = true;
     if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
+    {
+        /*
+         * Force aggressive mode, and disable skipping blocks using the
+         * visibility map (even those set all-frozen)
+         */
         aggressive = true;
+        skipwithvm = false;
+    }

     vacrel = (LVRelState *) palloc0(sizeof(LVRelState));
@@ -544,6 +557,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
     vacrel->rel = rel;
     vac_open_indexes(vacrel->rel, RowExclusiveLock, &vacrel->nindexes,
                      &vacrel->indrels);
+    vacrel->aggressive = aggressive;
     vacrel->failsafe_active = false;
     vacrel->consider_bypass_optimization = true;

How about adding skipwithvm to LVRelState too?

---

         /*
-         * The current block is potentially skippable; if we've seen a
-         * long enough run of skippable blocks to justify skipping it, and
-         * we're not forced to check it, then go ahead and skip.
-         * Otherwise, the page must be at least all-visible if not
-         * all-frozen, so we can set all_visible_according_to_vm = true.
+         * The current page can be skipped if we've seen a long enough run
+         * of skippable blocks to justify skipping it -- provided it's not
+         * the last page in the relation (according to rel_pages/nblocks).
+         *
+         * We always scan the table's last page to determine whether it
+         * has tuples or not, even if it would otherwise be skipped
+         * (unless we're skipping every single page in the relation). This
+         * avoids having lazy_truncate_heap() take access-exclusive lock
+         * on the table to attempt a truncation that just fails
+         * immediately because there are tuples on the last page.
          */
-        if (skipping_blocks && !FORCE_CHECK_PAGE())
+        if (skipping_blocks && blkno < nblocks - 1)

Why do we always need to scan the last page even if heap truncation is disabled (or in the failsafe mode)? Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Thu, Dec 16, 2021 at 10:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > My emphasis here has been on making non-aggressive VACUUMs *always* > > advance relfrozenxid, outside of certain obvious edge cases. And so > > with all the patches applied, up to and including the opportunistic > > freezing patch, every autovacuum of every table manages to advance > > relfrozenxid during benchmarking -- usually to a fairly recent value. > > I've focussed on making aggressive VACUUMs (especially anti-wraparound > > autovacuums) a rare occurrence, for truly exceptional cases (e.g., > > user keeps canceling autovacuums, maybe due to automated script that > > performs DDL). That has taken priority over other goals, for now. > > Great! Maybe this is a good time to revisit basic questions about VACUUM. I wonder if we can get rid of some of the GUCs for VACUUM now. Can we fully get rid of vacuum_freeze_table_age? Maybe even get rid of vacuum_freeze_min_age, too? Freezing tuples is a maintenance task for physical blocks, but we use logical units (XIDs). We probably shouldn't be using any units, but using XIDs "feels wrong" to me. Even with my patch, it is theoretically possible that we won't be able to advance relfrozenxid very much, because we cannot get a cleanup lock on one single heap page with one old XID. But even in this extreme case, how relevant is the "age" of this old XID, really? What really matters is whether or not we can advance relfrozenxid in time (with time to spare). And so the wraparound risk of the system is not affected all that much by the age of the single oldest XID. The risk mostly comes from how much total work we still need to do to advance relfrozenxid. If the single old XID is quite old indeed (~1.5 billion XIDs), but there is only one, then we just have to freeze one tuple to be able to safely advance relfrozenxid (maybe advance it by a huge amount!). How long can it take to freeze one tuple, with the freeze map, etc? On the other hand, the risk may be far greater if we have *many* tuples that are still unfrozen, whose XIDs are only "middle aged" right now. The idea behind vacuum_freeze_min_age seems to be to be lazy about work (tuple freezing) in the hope that we'll never need to do it, but that seems obsolete now. (It probably made a little more sense before the visibility map.) Using XIDs makes sense for things like autovacuum_freeze_max_age, because there we have to worry about wraparound and relfrozenxid (whether or not we like it). But with this patch, and with everything else (the failsafe, insert-driven autovacuums, everything we've done over the last several years) I think that it might be time to increase the autovacuum_freeze_max_age default. Maybe even to something as high as 800 million transaction IDs, but certainly to 400 million. What do you think? (Maybe don't answer just yet, something to think about.) > + vacrel->aggressive = aggressive; > vacrel->failsafe_active = false; > vacrel->consider_bypass_optimization = true; > > How about adding skipwithvm to LVRelState too? Agreed -- it's slightly better that way. Will change this. > */ > - if (skipping_blocks && !FORCE_CHECK_PAGE()) > + if (skipping_blocks && blkno < nblocks - 1) > > Why do we always need to scan the last page even if heap truncation is > disabled (or in the failsafe mode)? 
My goal here was to keep the behavior from commit e8429082, "Avoid useless truncation attempts during VACUUM", while simplifying things around skipping heap pages via the visibility map (including removing the FORCE_CHECK_PAGE() macro). Of course you're right that this particular change that you have highlighted does change the behavior a little -- now we will always treat the final page as a "scanned page", except perhaps when 100% of all pages in the relation are skipped using the visibility map. This was a deliberate choice (and perhaps even a good choice!). I think that avoiding accessing the last heap page like this isn't worth the complexity. Note that we may already access heap pages (making them "scanned pages") despite the fact that we know it's unnecessary: the SKIP_PAGES_THRESHOLD test leads to this behavior (and we don't even try to avoid wasting CPU cycles on these not-skipped-but-skippable pages). So I think that the performance cost for the last page isn't going to be noticeable. However, now that I think about it, I wonder...what do you think of SKIP_PAGES_THRESHOLD, in general? Is the optimal value still 32 today? SKIP_PAGES_THRESHOLD hasn't changed since commit bf136cf6e3, shortly after the original visibility map implementation was committed in 2009. The idea that it helps us to advance relfrozenxid outside of aggressive VACUUMs (per commit message from bf136cf6e3) seems like it might no longer matter with the patch -- because now we won't ever set a page all-visible but not all-frozen. Plus the idea that we need to do all this work just to get readahead from the OS seems...questionable. -- Peter Geoghegan
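As a point of reference, the skipping rule under discussion can be reduced to something like the following standalone sketch -- a hypothetical function, not the actual vacuumlazy.c code:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNumber;

/* vacuumlazy.c value, unchanged since commit bf136cf6e3 in 2009 */
#define SKIP_PAGES_THRESHOLD ((BlockNumber) 32)

/*
 * Skip a block only when it sits inside a run of skippable (all-visible or
 * all-frozen) blocks at least SKIP_PAGES_THRESHOLD long, and never skip the
 * final block, so that the truncation decision doesn't require taking
 * AccessExclusiveLock just to find out that the last page has tuples.
 */
static bool
skip_this_block(BlockNumber blkno, BlockNumber rel_pages,
                BlockNumber skippable_run_len)
{
    if (blkno >= rel_pages - 1)
        return false;           /* always scan the last page */
    return skippable_run_len >= SKIP_PAGES_THRESHOLD;
}

int
main(void)
{
    printf("%d\n", skip_this_block(10, 1000, 64));  /* 1: inside a long run */
    printf("%d\n", skip_this_block(10, 1000, 8));   /* 0: run shorter than threshold */
    printf("%d\n", skip_this_block(999, 1000, 64)); /* 0: final block is always scanned */
    return 0;
}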
On Sat, Dec 18, 2021 at 11:29 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Thu, Dec 16, 2021 at 10:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > My emphasis here has been on making non-aggressive VACUUMs *always* > > > advance relfrozenxid, outside of certain obvious edge cases. And so > > > with all the patches applied, up to and including the opportunistic > > > freezing patch, every autovacuum of every table manages to advance > > > relfrozenxid during benchmarking -- usually to a fairly recent value. > > > I've focussed on making aggressive VACUUMs (especially anti-wraparound > > > autovacuums) a rare occurrence, for truly exceptional cases (e.g., > > > user keeps canceling autovacuums, maybe due to automated script that > > > performs DDL). That has taken priority over other goals, for now. > > > > Great! > > Maybe this is a good time to revisit basic questions about VACUUM. I > wonder if we can get rid of some of the GUCs for VACUUM now. > > Can we fully get rid of vacuum_freeze_table_age? Does it mean that a vacuum always is an aggressive vacuum? If opportunistic freezing works well on all tables, we might no longer need vacuum_freeze_table_age. But I’m not sure that’s true since the cost of freezing tuples is not 0. > We probably shouldn't be using any units, but using XIDs "feels wrong" > to me. Even with my patch, it is theoretically possible that we won't > be able to advance relfrozenxid very much, because we cannot get a > cleanup lock on one single heap page with one old XID. But even in > this extreme case, how relevant is the "age" of this old XID, really? > What really matters is whether or not we can advance relfrozenxid in > time (with time to spare). And so the wraparound risk of the system is > not affected all that much by the age of the single oldest XID. The > risk mostly comes from how much total work we still need to do to > advance relfrozenxid. If the single old XID is quite old indeed (~1.5 > billion XIDs), but there is only one, then we just have to freeze one > tuple to be able to safely advance relfrozenxid (maybe advance it by a > huge amount!). How long can it take to freeze one tuple, with the > freeze map, etc? I think that that's true for (mostly) static tables. But regarding constantly-updated tables, since autovacuum runs based on the number of garbage tuples (or inserted tuples) and how old the relfrozenxid is if an autovacuum could not advance the relfrozenxid because it could not get a cleanup lock on the page that has the single oldest XID, it's likely that when autovacuum runs next time it will have to process other pages too since the page will get dirty enough. It might be a good idea that we remember pages where we could not get a cleanup lock somewhere and revisit them after index cleanup. While revisiting the pages, we don’t prune the page but only freeze tuples. > > On the other hand, the risk may be far greater if we have *many* > tuples that are still unfrozen, whose XIDs are only "middle aged" > right now. The idea behind vacuum_freeze_min_age seems to be to be > lazy about work (tuple freezing) in the hope that we'll never need to > do it, but that seems obsolete now. (It probably made a little more > sense before the visibility map.) Why is it obsolete now? I guess that it's still valid depending on the cases, for example, heavily updated tables. > > Using XIDs makes sense for things like autovacuum_freeze_max_age, > because there we have to worry about wraparound and relfrozenxid > (whether or not we like it). 
But with this patch, and with everything > else (the failsafe, insert-driven autovacuums, everything we've done > over the last several years) I think that it might be time to increase > the autovacuum_freeze_max_age default. Maybe even to something as high > as 800 million transaction IDs, but certainly to 400 million. What do > you think? (Maybe don't answer just yet, something to think about.) I don’t have an objection to increasing autovacuum_freeze_max_age for now. One of my concerns with anti-wraparound vacuums is that too many tables (or several large tables) will reach autovacuum_freeze_max_age at once, using up autovacuum slots and preventing autovacuums from being launched on tables that are heavily being updated. Given these works, expanding the gap between vacuum_freeze_table_age and autovacuum_freeze_max_age would have better chances for the tables to advance its relfrozenxid by an aggressive vacuum instead of an anti-wraparound-aggressive vacuum. 400 million seems to be a good start. > > > + vacrel->aggressive = aggressive; > > vacrel->failsafe_active = false; > > vacrel->consider_bypass_optimization = true; > > > > How about adding skipwithvm to LVRelState too? > > Agreed -- it's slightly better that way. Will change this. > > > */ > > - if (skipping_blocks && !FORCE_CHECK_PAGE()) > > + if (skipping_blocks && blkno < nblocks - 1) > > > > Why do we always need to scan the last page even if heap truncation is > > disabled (or in the failsafe mode)? > > My goal here was to keep the behavior from commit e8429082, "Avoid > useless truncation attempts during VACUUM", while simplifying things > around skipping heap pages via the visibility map (including removing > the FORCE_CHECK_PAGE() macro). Of course you're right that this > particular change that you have highlighted does change the behavior a > little -- now we will always treat the final page as a "scanned page", > except perhaps when 100% of all pages in the relation are skipped > using the visibility map. > > This was a deliberate choice (and perhaps even a good choice!). I > think that avoiding accessing the last heap page like this isn't worth > the complexity. Note that we may already access heap pages (making > them "scanned pages") despite the fact that we know it's unnecessary: > the SKIP_PAGES_THRESHOLD test leads to this behavior (and we don't > even try to avoid wasting CPU cycles on these > not-skipped-but-skippable pages). So I think that the performance cost > for the last page isn't going to be noticeable. Agreed. > > However, now that I think about it, I wonder...what do you think of > SKIP_PAGES_THRESHOLD, in general? Is the optimal value still 32 today? > SKIP_PAGES_THRESHOLD hasn't changed since commit bf136cf6e3, shortly > after the original visibility map implementation was committed in > 2009. The idea that it helps us to advance relfrozenxid outside of > aggressive VACUUMs (per commit message from bf136cf6e3) seems like it > might no longer matter with the patch -- because now we won't ever set > a page all-visible but not all-frozen. Plus the idea that we need to > do all this work just to get readahead from the OS > seems...questionable. Given the opportunistic freezing, that's true but I'm concerned whether opportunistic freezing always works well on all tables since freezing tuples is not 0 cost. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
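For reference, the relationship between the two thresholds being discussed can be sketched as follows -- hypothetical standalone code that simplifies what autovacuum's scheduling and VACUUM's aggressiveness test actually do, using the stock 150 million vacuum_freeze_table_age default and the 400 million autovacuum_freeze_max_age value floated above:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;
#define FirstNormalTransactionId ((TransactionId) 3)

/* modulo-2^32 ordering, in the style of TransactionIdPrecedes() */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}

/* cutoff that is "age" XIDs behind the next XID (reserved XIDs skipped) */
static TransactionId
limit_xid(TransactionId next_xid, uint32_t age)
{
    TransactionId limit = next_xid - age;

    if (limit < FirstNormalTransactionId)
        limit -= FirstNormalTransactionId;
    return limit;
}

int
main(void)
{
    TransactionId relfrozenxid = 200000000;
    TransactionId next_xid = 500000000;

    /* escalate an already-scheduled VACUUM to aggressive */
    bool aggressive = xid_precedes(relfrozenxid,
                                   limit_xid(next_xid, 150000000));
    /* force an anti-wraparound autovacuum even with no other reason to run */
    bool antiwraparound = xid_precedes(relfrozenxid,
                                       limit_xid(next_xid, 400000000));

    printf("aggressive=%d antiwraparound=%d\n", aggressive, antiwraparound);
    return 0;
}

With a wide gap between the two settings, a table in this state gets an ordinary, cancellable aggressive VACUUM long before the forced anti-wraparound kind -- which is the scheduling property being asked for above.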
On Mon, Dec 20, 2021 at 8:29 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Can we fully get rid of vacuum_freeze_table_age? > > Does it mean that a vacuum always is an aggressive vacuum? No. Just somewhat more like one. Still no waiting for cleanup locks, though. Also, autovacuum is still cancelable (that's technically from anti-wraparound VACUUM, but you know what I mean). And there shouldn't be a noticeable difference in terms of how many blocks can be skipped using the VM. > If opportunistic freezing works well on all tables, we might no longer > need vacuum_freeze_table_age. But I’m not sure that’s true since the > cost of freezing tuples is not 0. That's true, of course, but right now the only goal of opportunistic freezing is to advance relfrozenxid in every VACUUM. It needs to be shown to be worth it, of course. But let's assume that it is worth it, for a moment (perhaps only because we optimize freezing itself in passing) -- then there is little use for vacuum_freeze_table_age, that I can see. > I think that that's true for (mostly) static tables. But regarding > constantly-updated tables, since autovacuum runs based on the number > of garbage tuples (or inserted tuples) and how old the relfrozenxid is > if an autovacuum could not advance the relfrozenxid because it could > not get a cleanup lock on the page that has the single oldest XID, > it's likely that when autovacuum runs next time it will have to > process other pages too since the page will get dirty enough. I'm not arguing that the age of the single oldest XID is *totally* irrelevant. Just that it's typically much less important than the total amount of work we'd have to do (freezing) to be able to advance relfrozenxid. In any case, the extreme case where we just cannot get a cleanup lock on one particular page with an old XID is probably very rare. > It might be a good idea that we remember pages where we could not get > a cleanup lock somewhere and revisit them after index cleanup. While > revisiting the pages, we don’t prune the page but only freeze tuples. Maybe, but I think that it would make more sense to not use FreezeLimit for that at all. In an aggressive VACUUM (where we might actually have to wait for a cleanup lock), why should we wait once the age is over vacuum_freeze_min_age (usually 50 million XIDs)? The official answer is "because we need to advance relfrozenxid". But why not accept a much older relfrozenxid that is still sufficiently young/safe, in order to avoid waiting for a cleanup lock? In other words, what if our approach of "being diligent about advancing relfrozenxid" makes the relfrozenxid problem worse, not better? The problem with "being diligent" is that it is defined by FreezeLimit (which is more or less the same thing as vacuum_freeze_min_age), which is supposed to be about which tuples we will freeze. That's a very different thing to how old relfrozenxid should be or can be (after an aggressive VACUUM finishes). > > On the other hand, the risk may be far greater if we have *many* > > tuples that are still unfrozen, whose XIDs are only "middle aged" > > right now. The idea behind vacuum_freeze_min_age seems to be to be > > lazy about work (tuple freezing) in the hope that we'll never need to > > do it, but that seems obsolete now. (It probably made a little more > > sense before the visibility map.) > > Why is it obsolete now? I guess that it's still valid depending on the > cases, for example, heavily updated tables. 
Because after the 9.6 freezemap work we'll often set the all-visible bit in the VM, but not the all-frozen bit (unless we have the opportunistic freezing patch applied, which specifically avoids that). When that happens, affected heap pages will still have older-than-vacuum_freeze_min_age-XIDs after VACUUM runs, until we get to an aggressive VACUUM. There could be many VACUUMs before the aggressive VACUUM. This "freezing cliff" seems like it might be a big problem, in general. That's what I'm trying to address here. Either way, the system doesn't really respect vacuum_freeze_min_age in the way that it did before 9.6 -- which is what I meant by "obsolete". > I don’t have an objection to increasing autovacuum_freeze_max_age for > now. One of my concerns with anti-wraparound vacuums is that too many > tables (or several large tables) will reach autovacuum_freeze_max_age > at once, using up autovacuum slots and preventing autovacuums from > being launched on tables that are heavily being updated. I think that the patch helps with that, actually -- there tends to be "natural variation" in the relfrozenxid age of each table, which comes from per-table workload characteristics. > Given these > works, expanding the gap between vacuum_freeze_table_age and > autovacuum_freeze_max_age would have better chances for the tables to > advance its relfrozenxid by an aggressive vacuum instead of an > anti-wraparound-aggressive vacuum. 400 million seems to be a good > start. The idea behind getting rid of vacuum_freeze_table_age (not to be confused by the other idea about getting rid of vacuum_freeze_min_age) is this: with the patch series, we only tend to get an anti-wraparound VACUUM in extreme and relatively rare cases. For example, we will get aggressive anti-wraparound VACUUMs on tables that *never* grow, but constantly get HOT updates (e.g. the pgbench_accounts table with heap fill factor reduced to 90). We won't really be able to use the VM when this happens, either. With tables like this -- tables that still get aggressive VACUUMs -- maybe the patch doesn't make a huge difference. But that's truly the extreme case -- that is true only because there is already zero chance of there being a non-aggressive VACUUM. We'll get aggressive anti-wraparound VACUUMs every time we reach autovacuum_freeze_max_age, again and again -- no change, really. But since it's only these extreme cases that continue to get aggressive VACUUMs, why do we still need vacuum_freeze_table_age? It helps right now (without the patch) by "escalating" a regular VACUUM to an aggressive one. But the cases that we still expect an aggressive VACUUM (with the patch) are the cases where there is zero chance of that happening. Almost by definition. > Given the opportunistic freezing, that's true but I'm concerned > whether opportunistic freezing always works well on all tables since > freezing tuples is not 0 cost. That is the big question for this patch. -- Peter Geoghegan
On Mon, Dec 20, 2021 at 9:35 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Given the opportunistic freezing, that's true but I'm concerned > > whether opportunistic freezing always works well on all tables since > > freezing tuples is not 0 cost. > > That is the big question for this patch. Attached is a mechanical rebase of the patch series. This new version just fixes bitrot caused by Masahiko's recent vacuumlazy.c refactoring work. In other words, this revision has no significant changes compared to the v4 that I posted back in late December -- just want to keep CFTester green. I still have plenty of work to do here. Especially with the final patch (the v5-0005-* "freeze early" patch), which is generally more speculative than the other patches. I'm playing catch-up now, since I just returned from vacation. -- Peter Geoghegan
On Fri, Dec 17, 2021 at 9:30 PM Peter Geoghegan <pg@bowt.ie> wrote: > Can we fully get rid of vacuum_freeze_table_age? Maybe even get rid of > vacuum_freeze_min_age, too? Freezing tuples is a maintenance task for > physical blocks, but we use logical units (XIDs). I don't see how we can get rid of these. We know that catastrophe will ensue if we fail to freeze old XIDs for a sufficiently long time --- where sufficiently long has to do with the number of XIDs that have been subsequently consumed. So it's natural to decide whether or not we're going to wait for cleanup locks on pages on the basis of how old the XIDs they contain actually are. Admittedly, that decision doesn't need to be made at the start of the vacuum, as we do today. We could happily skip waiting for a cleanup lock on pages that contain only newer XIDs, but if there is a page that both contains an old XID and stays pinned for a long time, we eventually have to sit there and wait for that pin to be released. And the best way to decide when to switch to that strategy is really based on the age of that XID, at least as I see it, because it is the age of that XID reaching 2 billion that is going to kill us. I think vacuum_freeze_min_age also serves a useful purpose: it prevents us from freezing data that's going to be modified again or even deleted in the near future. Since we can't know the future, we must base our decision on the assumption that the future will be like the past: if the page hasn't been modified for a while, then we should assume it's not likely to be modified again soon; otherwise not. If we knew the time at which the page had last been modified, it would be very reasonable to use that here - say, freeze the XIDs if the page hasn't been touched in an hour, or whatever. But since we lack such timestamps the XID age is the closest proxy we have. > The > risk mostly comes from how much total work we still need to do to > advance relfrozenxid. If the single old XID is quite old indeed (~1.5 > billion XIDs), but there is only one, then we just have to freeze one > tuple to be able to safely advance relfrozenxid (maybe advance it by a > huge amount!). How long can it take to freeze one tuple, with the > freeze map, etc? I don't really see any reason for optimism here. There could be a lot of unfrozen pages in the relation, and we'd have to troll through all of those in order to find that single old XID. Moreover, there is nothing whatsoever to focus autovacuum's attention on that single old XID rather than anything else. Nothing in the autovacuum algorithm will cause it to focus its efforts on that single old XID at a time when there's no pin on the page, or at a time when that XID becomes the thing that's holding back vacuuming throughout the cluster. A lot of vacuum problems that users experience today would be avoided if autovacuum had perfect knowledge of what it ought to be prioritizing at any given time, or even some knowledge. But it doesn't, and is often busy fiddling while Rome burns. IOW, the time that it takes to freeze that one tuple *in theory* might be small. But in practice it may be very large, because we won't necessarily get around to it on any meaningful time frame. -- Robert Haas EDB: http://www.enterprisedb.com
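For readers without vacuum.c open: the cutoff being defended here is derived today roughly as in the sketch below. This is modelled on vacuum_set_xid_limits() but simplified (warning paths and multixact handling omitted), and the function name is invented:

#include "postgres.h"
#include "access/transam.h"

/*
 * Simplified sketch of how FreezeLimit currently falls out of
 * vacuum_freeze_min_age: subtract the GUC from the oldest XID still
 * considered running, then clamp so the limit stays a normal XID and
 * never exceeds OldestXmin.
 */
static TransactionId
compute_freeze_limit(TransactionId oldestXmin, int freeze_min_age)
{
    TransactionId limit;

    limit = oldestXmin - freeze_min_age;    /* unsigned XID arithmetic wraps */
    if (!TransactionIdIsNormal(limit))
        limit = FirstNormalTransactionId;
    if (TransactionIdPrecedes(oldestXmin, limit))
        limit = oldestXmin;     /* e.g. the freeze_min_age = 0 case */

    return limit;
}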
On Thu, Jan 6, 2022 at 12:54 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Dec 17, 2021 at 9:30 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Can we fully get rid of vacuum_freeze_table_age? Maybe even get rid of > > vacuum_freeze_min_age, too? Freezing tuples is a maintenance task for > > physical blocks, but we use logical units (XIDs). > > I don't see how we can get rid of these. We know that catastrophe will > ensue if we fail to freeze old XIDs for a sufficiently long time --- > where sufficiently long has to do with the number of XIDs that have > been subsequently consumed. I don't really disagree with anything you've said, I think. There are a few subtleties here. I'll try to tease them apart. I agree that we cannot do without something like vacrel->FreezeLimit for the foreseeable future -- but the closely related GUC (vacuum_freeze_min_age) is another matter. Although everything you've said in favor of the GUC seems true, the GUC is not a particularly effective (or natural) way of constraining the problem. It just doesn't make sense as a tunable. One obvious reason for this is that the opportunistic freezing stuff is expected to be the thing that usually forces freezing -- not vacuum_freeze_min_age, nor FreezeLimit, nor any other XID-based cutoff. As you more or less pointed out yourself, we still need FreezeLimit as a backstop mechanism. But the value of FreezeLimit can just come from autovacuum_freeze_max_age/2 in all cases (no separate GUC), or something along those lines. We don't particularly expect the value of FreezeLimit to matter, at least most of the time. It should only noticeably affect our behavior during anti-wraparound VACUUMs, which become rare with the patch (e.g. my pgbench_accounts example upthread). Most individual tables will never get even one anti-wraparound VACUUM -- it just doesn't ever come for most tables in practice. My big issue with vacuum_freeze_min_age is that it doesn't really work with the freeze map work in 9.6, which creates problems that I'm trying to address by freezing early and so on. After all, HEAD (and all stable branches) can easily set a page to all-visible (but not all-frozen) in the VM, meaning that the page's tuples won't be considered for freezing until the next aggressive VACUUM. This means that vacuum_freeze_min_age is already frequently ignored by the implementation -- it's conditioned on other things that are practically impossible to predict. Curious about your thoughts on this existing issue with vacuum_freeze_min_age. I am concerned about the "freezing cliff" that it creates. > So it's natural to decide whether or not > we're going to wait for cleanup locks on pages on the basis of how old > the XIDs they contain actually are. I agree, but again, it's only a backstop. With the patch we'd have to be rather unlucky to ever need to wait like this. What are the chances that we keep failing to freeze an old XID from one particular page, again and again? My testing indicates that it's a negligible concern in practice (barring pathological cases with idle cursors, etc). > I think vacuum_freeze_min_age also serves a useful purpose: it > prevents us from freezing data that's going to be modified again or > even deleted in the near future. Since we can't know the future, we > must base our decision on the assumption that the future will be like > the past: if the page hasn't been modified for a while, then we should > assume it's not likely to be modified again soon; otherwise not. 
But the "freeze early" heuristics work a bit like that anyway. We won't freeze all the tuples on a whole heap page early if we won't otherwise set the heap page to all-visible (not all-frozen) in the VM anyway. > If we > knew the time at which the page had last been modified, it would be > very reasonable to use that here - say, freeze the XIDs if the page > hasn't been touched in an hour, or whatever. But since we lack such > timestamps the XID age is the closest proxy we have. XID age is a *terrible* proxy. The age of an XID in a tuple header may advance quickly, even when nobody modifies the same table at all. I concede that it is true that we are (in some sense) "gambling" by freezing early -- we may end up freezing a tuple that we subsequently update anyway. But aren't we also "gambling" by *not* freezing early? By not freezing, we risk getting into "freezing debt" that will have to be paid off in one ruinously large installment. I would much rather "gamble" on something where we can tolerate consistently "losing" than gamble on something where I cannot ever afford to lose (even if it's much less likely that I'll lose during any given VACUUM operation). Besides all this, I think that we have a rather decent chance of coming out ahead in practice by freezing early. In practice the marginal cost of freezing early is consistently pretty low. Cost-control-driven (as opposed to need-driven) freezing is *supposed* to be cheaper, of course. And like it or not, freezing is really just part of the cost of storing data using Postgres (for the time being, at least). > > The > > risk mostly comes from how much total work we still need to do to > > advance relfrozenxid. If the single old XID is quite old indeed (~1.5 > > billion XIDs), but there is only one, then we just have to freeze one > > tuple to be able to safely advance relfrozenxid (maybe advance it by a > > huge amount!). How long can it take to freeze one tuple, with the > > freeze map, etc? > > I don't really see any reason for optimism here. > IOW, the time that it takes to freeze that one tuple *in theory* might > be small. But in practice it may be very large, because we won't > necessarily get around to it on any meaningful time frame. On second thought I agree that my specific example of 1.5 billion XIDs was a little too optimistic of me. But 50 million XIDs (i.e. the vacuum_freeze_min_age default) is too pessimistic. The important point is that FreezeLimit could plausibly become nothing more than a backstop mechanism, with the design from the patch series -- something that typically has no effect on what tuples actually get frozen. -- Peter Geoghegan
On Thu, Jan 6, 2022 at 2:45 PM Peter Geoghegan <pg@bowt.ie> wrote: > But the "freeze early" heuristics work a bit like that anyway. We > won't freeze all the tuples on a whole heap page early if we won't > otherwise set the heap page to all-visible (not all-frozen) in the VM > anyway. I believe that applications tend to update rows according to predictable patterns. Andy Pavlo made an observation about this at one point: https://youtu.be/AD1HW9mLlrg?t=3202 I think that we don't do a good enough job of keeping logically related tuples (tuples inserted around the same time) together, on the same original heap page, which motivated a lot of my experiments with the FSM from last year. Even still, it seems like a good idea for us to err in the direction of assuming that tuples on the same heap page are logically related. The tuples should all be frozen together when possible. And *not* frozen early when the heap page as a whole can't be frozen (barring cases with one *much* older XID before FreezeLimit). -- Peter Geoghegan
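The page-level framing might be easier to see as a tiny predicate. This is only a restatement of the heuristic being described, with made-up names -- not anything lifted from the v5-0005-* patch:

#include "postgres.h"
#include "access/transam.h"

/*
 * "Freeze the page together, or not at all": freeze every eligible tuple
 * when doing so lets us set the all-frozen VM bit, or when the backstop
 * limit forces our hand; otherwise leave the page's tuples unfrozen for
 * now, on the theory that tuples that were inserted together will likely
 * be modified (or frozen) together later.
 */
static bool
freeze_whole_page_early(bool page_will_be_all_visible,
                        bool all_tuples_freezable,
                        TransactionId oldest_tuple_xid,
                        TransactionId backstop_freeze_limit)
{
    if (TransactionIdPrecedes(oldest_tuple_xid, backstop_freeze_limit))
        return true;            /* backstop: an XID is too old to leave */

    return page_will_be_all_visible && all_tuples_freezable;
}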
On Thu, Jan 6, 2022 at 5:46 PM Peter Geoghegan <pg@bowt.ie> wrote: > One obvious reason for this is that the opportunistic freezing stuff > is expected to be the thing that usually forces freezing -- not > vacuum_freeze_min_age, nor FreezeLimit, nor any other XID-based > cutoff. As you more or less pointed out yourself, we still need > FreezeLimit as a backstop mechanism. But the value of FreezeLimit can > just come from autovacuum_freeze_max_age/2 in all cases (no separate > GUC), or something along those lines. We don't particularly expect the > value of FreezeLimit to matter, at least most of the time. It should > only noticeably affect our behavior during anti-wraparound VACUUMs, > which become rare with the patch (e.g. my pgbench_accounts example > upthread). Most individual tables will never get even one > anti-wraparound VACUUM -- it just doesn't ever come for most tables in > practice. This seems like a weak argument. Sure, you COULD hard-code the limit to be autovacuum_freeze_max_age/2 rather than making it a separate tunable, but I don't think it's better. I am generally very skeptical about the idea of using the same GUC value for multiple purposes, because it often turns out that the optimal value for one purpose is different than the optimal value for some other purpose. For example, the optimal amount of memory for a hash table is likely different than the optimal amount for a sort, which is why we now have hash_mem_multiplier. When it's not even the same value that's being used in both places, but the original value in one place and a value derived from some formula in the other, the chances of things working out are even less. I feel generally that a lot of the argument you're making here supposes that tables are going to get vacuumed regularly. I agree that IF tables are being vacuumed on a regular basis, and if as part of that we always push relfrozenxid forward as far as we can, we will rarely have a situation where aggressive strategies to avoid wraparound are required. However, I disagree strongly with the idea that we can assume that tables will get vacuumed regularly. That can fail to happen for all sorts of reasons. One of the common ones is a poor choice of autovacuum configuration. The most common problem in my experience is a cost limit that is too low to permit the amount of vacuuming that is actually required, but other kinds of problems like not enough workers (so tables get starved), too many workers (so the cost limit is being shared between many processes), autovacuum=off either globally or on one table (because of ... reasons), autovacuum_vacuum_insert_threshold = -1 plus not many updates (so nothing ever triggers the vacuum), autovacuum_naptime=1d (actually seen in the real world! ... and, no, it didn't work well), or stats collector problems are all possible. We can *hope* that there are going to be regular vacuums of the table long before wraparound becomes a danger, but realistically, we had better not assume that in our choice of algorithms, because the real world is a messy place where all sorts of crazy things happen. Now, I agree with you in part: I don't think it's obvious that it's useful to tune vacuum_freeze_table_age. When I advise customers on how to fix vacuum problems, I am usually telling them to increase autovacuum_vacuum_cost_limit, possibly also with an increase in autovacuum_max_workers; or to increase or decrease autovacuum_freeze_max_age depending on which problem they have; or occasionally to adjust settings like autovacuum_naptime.
It doesn't often seem to be necessary to change vacuum_freeze_table_age or, for that matter, vacuum_freeze_min_age. But if we remove them and then discover scenarios where tuning them would have been useful, we'll have no options for fixing PostgreSQL systems in the field. Waiting for the next major release in such a scenario, or even the next minor release, is not good. We should be VERY conservative about removing existing settings if there's any chance that somebody could use them to tune their way out of trouble. > My big issue with vacuum_freeze_min_age is that it doesn't really work > with the freeze map work in 9.6, which creates problems that I'm > trying to address by freezing early and so on. After all, HEAD (and > all stable branches) can easily set a page to all-visible (but not > all-frozen) in the VM, meaning that the page's tuples won't be > considered for freezing until the next aggressive VACUUM. This means > that vacuum_freeze_min_age is already frequently ignored by the > implementation -- it's conditioned on other things that are practically > impossible to predict. > > Curious about your thoughts on this existing issue with > vacuum_freeze_min_age. I am concerned about the "freezing cliff" that > it creates. So, let's see: if we see a page where the tuples are all-visible and we seize the opportunity to freeze it, we can spare ourselves the need to ever visit that page again (unless it gets modified). But if we only mark it all-visible and leave the freezing for later, the next aggressive vacuum will have to scan and dirty the page. I'm prepared to believe that it's worth the cost of freezing the page in that scenario. We've already dirtied the page and written some WAL and maybe generated an FPW, so doing the rest of the work now rather than saving it until later seems likely to be a win. I think it's OK to behave, in this situation, as if vacuum_freeze_min_age=0. There's another situation in which vacuum_freeze_min_age could apply, though: suppose the page isn't all-visible yet. I'd argue that in that case we don't want to run around freezing stuff unless it's quite old - like older than vacuum_freeze_table_age, say. Because we know we're going to have to revisit this page in the next vacuum anyway, and expending effort to freeze tuples that may be about to be modified again doesn't seem prudent. So, hmm, on further reflection, maybe it's OK to remove vacuum_freeze_min_age. But if we do, then I think we had better carefully distinguish between the case where the page can thereby be marked all-frozen and the case where it cannot. I guess you say the same, further down. > > So it's natural to decide whether or not > > we're going to wait for cleanup locks on pages on the basis of how old > > the XIDs they contain actually are. > > I agree, but again, it's only a backstop. With the patch we'd have to > be rather unlucky to ever need to wait like this. > > What are the chances that we keep failing to freeze an old XID from > one particular page, again and again? My testing indicates that it's a > negligible concern in practice (barring pathological cases with idle > cursors, etc). I mean, those kinds of pathological cases happen *all the time*. Sure, there are plenty of users who don't leave cursors open. But the ones who do don't leave them around for short periods of time on randomly selected pages of the table. They are disproportionately likely to leave them on the same table pages over and over, just like data can't in general be assumed to be uniformly accessed. 
And not uncommonly, they leave them around until the snow melts. And we need to worry about those kinds of users, actually much more than we need to worry about users doing normal things. Honestly, autovacuum on a system where things are mostly "normal" - no long-running transactions, adequate resources for autovacuum to do its job, reasonable configuration settings - isn't that bad. It's true that there are people who get surprised by an aggressive autovacuum kicking off unexpectedly, but it's usually the first one during the cluster lifetime (which is typically the biggest, since the initial load tends to be bigger than later ones) and it's usually annoying but survivable. The places where autovacuum becomes incredibly frustrating are the pathological cases. When insufficient resources are available to complete the work in a timely fashion, or difficult trade-offs have to be made, autovacuum is too dumb to make the right choices. And even if you call your favorite PostgreSQL support provider and they provide an expert, once it gets behind, autovacuum isn't very tractable: it will insist on vacuuming everything, right now, in an order that it chooses, and it's not going to listen to any nonsense from some human being who thinks they might have some useful advice to provide! > But the "freeze early" heuristics work a bit like that anyway. We > won't freeze all the tuples on a whole heap page early if we won't > otherwise set the heap page to all-visible (not all-frozen) in the VM > anyway. Hmm, I didn't realize that we had that. Is that an existing thing or something new you're proposing to do? If existing, where is it? > > IOW, the time that it takes to freeze that one tuple *in theory* might > > be small. But in practice it may be very large, because we won't > > necessarily get around to it on any meaningful time frame. > > On second thought I agree that my specific example of 1.5 billion XIDs > was a little too optimistic of me. But 50 million XIDs (i.e. the > vacuum_freeze_min_age default) is too pessimistic. The important point > is that FreezeLimit could plausibly become nothing more than a > backstop mechanism, with the design from the patch series -- something > that typically has no effect on what tuples actually get frozen. I agree that it's OK for this to become a purely backstop mechanism ... but again, I think that the design of such backstop mechanisms should be done as carefully as we know how, because users seem to hit the backstop all the time. We want it to be made of, you know, nylon twine, rather than, say, sharp nails. :-) -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Jan 7, 2022 at 12:24 PM Robert Haas <robertmhaas@gmail.com> wrote: > This seems like a weak argument. Sure, you COULD hard-code the limit > to be autovacuum_freeze_max_age/2 rather than making it a separate > tunable, but I don't think it's better. I am generally very skeptical > about the idea of using the same GUC value for multiple purposes, > because it often turns out that the optimal value for one purpose is > different than the optimal value for some other purpose. I thought I was being conservative by suggesting autovacuum_freeze_max_age/2. My first thought was to teach VACUUM to make its FreezeLimit "OldestXmin - autovacuum_freeze_max_age". To me these two concepts really *are* the same thing: vacrel->FreezeLimit becomes a backstop, just as anti-wraparound autovacuum (the autovacuum_freeze_max_age cutoff) becomes a backstop. Of course, an anti-wraparound VACUUM will do early freezing in the same way as any other VACUUM will (with the patch series). So even when the FreezeLimit backstop XID cutoff actually affects the behavior of a given VACUUM operation, it may well not be the reason why most individual tuples that we freeze get frozen. That is, most individual heap pages will probably have tuples frozen for some other reason. Though it depends on workload characteristics, most individual heap pages will typically be frozen as a group, even here. This is a logical consequence of the fact that tuple freezing and advancing relfrozenxid are now only loosely coupled -- it's about as loose as the current relfrozenxid invariant will allow. > I feel generally that a lot of the argument you're making here > supposes that tables are going to get vacuumed regularly. > I agree that > IF tables are being vacuumed on a regular basis, and if as part of > that we always push relfrozenxid forward as far as we can, we will > rarely have a situation where aggressive strategies to avoid > wraparound are required. It's all relative. We hope that (with the patch) cases that only ever get anti-wraparound VACUUMs are limited to tables where nothing else drives VACUUM, for sensible reasons related to workload characteristics (like the pgbench_accounts example upthread). It's inevitable that some users will misconfigure the system, though -- no question about that. I don't see why users that misconfigure the system in this way should be any worse off than they would be today. They probably won't do substantially less freezing (usually somewhat more), and will advance pg_class.relfrozenxid in exactly the same way as today (usually a bit better, actually). What have I missed? Admittedly the design of the "Freeze tuples early to advance relfrozenxid" patch (i.e. v5-0005-*patch) is still unsettled; I need to verify that my claims about it are really robust. But as far as I know they are. Reviewers should certainly look at that with a critical eye. > Now, I agree with you in part: I don't think it's obvious that it's > useful to tune vacuum_freeze_table_age. That's definitely the easier argument to make. After all, vacuum_freeze_table_age will do nothing unless VACUUM runs before the anti-wraparound threshold (autovacuum_freeze_max_age) is reached. The patch series should be strictly better than that. Primarily because it's "continuous", and so isn't limited to cases where the table age falls within the "vacuum_freeze_table_age - autovacuum_freeze_max_age" goldilocks age range. 
> We should be VERY conservative about removing > existing settings if there's any chance that somebody could use them > to tune their way out of trouble. I agree, I suppose, but right now I honestly can't think of a reason why they would be useful. If I am wrong about this then I'm probably also wrong about some basic facet of the high-level design, in which case I should change course altogether. In other words, removing the GUCs is not an incidental thing. It's possible that I would never have pursued this project if I didn't first notice how wrong-headed the GUCs are. > So, let's see: if we see a page where the tuples are all-visible and > we seize the opportunity to freeze it, we can spare ourselves the need > to ever visit that page again (unless it gets modified). But if we > only mark it all-visible and leave the freezing for later, the next > aggressive vacuum will have to scan and dirty the page. I'm prepared > to believe that it's worth the cost of freezing the page in that > scenario. That's certainly the most compelling reason to perform early freezing. It's not completely free of downsides, but it's pretty close. > There's another situation in which vacuum_freeze_min_age could apply, > though: suppose the page isn't all-visible yet. I'd argue that in that > case we don't want to run around freezing stuff unless it's quite old > - like older than vacuum_freeze_table_age, say. Because we know we're > going to have to revisit this page in the next vacuum anyway, and > expending effort to freeze tuples that may be about to be modified > again doesn't seem prudent. So, hmm, on further reflection, maybe it's > OK to remove vacuum_freeze_min_age. But if we do, then I think we had > better carefully distinguish between the case where the page can > thereby be marked all-frozen and the case where it cannot. I guess you > say the same, further down. I do. Although the v5-0005-* patch still freezes early when the page is dirtied by pruning, I have my doubts about that particular "freeze early" criterion. I believe that everything I just said about misconfigured autovacuums doesn't rely on anything more than the "most compelling scenario for early freezing" mechanism that arranges to make us set the all-frozen bit (not just the all-visible bit). > I mean, those kinds of pathological cases happen *all the time*. Sure, > there are plenty of users who don't leave cursors open. But the ones > who do don't leave them around for short periods of time on randomly > selected pages of the table. They are disproportionately likely to > leave them on the same table pages over and over, just like data can't > in general be assumed to be uniformly accessed. And not uncommonly, > they leave them around until the snow melts. > And we need to worry about those kinds of users, actually much more > than we need to worry about users doing normal things. I couldn't agree more. In fact, I was mostly thinking about how to *help* these users. Insisting on waiting for a cleanup lock before it becomes strictly necessary (when the table age is only 50 million/vacuum_freeze_min_age) is actually a big part of the problem for these users. vacuum_freeze_min_age enforces a false dichotomy on aggressive VACUUMs that just isn't helpful. Why should waiting on a cleanup lock fix anything? Even in the extreme case where we are guaranteed to eventually have a wraparound failure in the end (due to an idle cursor in an unsupervised database), the user is still much better off, I think.
We will have at least managed to advance relfrozenxid to the exact oldest XID on the one heap page that somebody holds an idle cursor (conflicting buffer pin) on. And we'll usually have frozen most of the tuples that need to be frozen. Sure, the user may need to use single-user mode to run a manual VACUUM, but at least this process only needs to freeze approximately one tuple to get the system back online again. If the DBA notices the problem before the database starts to refuse to allocate XIDs, then they'll have a much better chance of avoiding a wraparound failure through simple intervention (like killing the backend with the idle cursor). We can pay down 99.9% of the "freeze debt" independently of this intractable problem of something holding onto an idle cursor. > Honestly, > autovacuum on a system where things are mostly "normal" - no > long-running transactions, adequate resources for autovacuum to do its > job, reasonable configuration settings - isn't that bad. Right. Autovacuum is "too big to fail". > > But the "freeze early" heuristics work a bit like that anyway. We > > won't freeze all the tuples on a whole heap page early if we won't > > otherwise set the heap page to all-visible (not all-frozen) in the VM > > anyway. > > Hmm, I didn't realize that we had that. Is that an existing thing or > something new you're proposing to do? If existing, where is it? It's part of v5-0005-*patch. Still in flux to some degree, because it's necessary to balance a few things. That shouldn't undermine the arguments I've made here. > I agree that it's OK for this to become a purely backstop mechanism > ... but again, I think that the design of such backstop mechanisms > should be done as carefully as we know how, because users seem to hit > the backstop all the time. We want it to be made of, you know, nylon > twine, rather than, say, sharp nails. :-) Absolutely. But if autovacuum can only ever run due to age(relfrozenxid) reaching autovacuum_freeze_max_age, then I can't see a downside. Again, the v5-0005-*patch needs to meet the standard that I've laid out. If it doesn't then I've messed up already. -- Peter Geoghegan
On Fri, Jan 7, 2022 at 5:20 PM Peter Geoghegan <pg@bowt.ie> wrote: > I thought I was being conservative by suggesting > autovacuum_freeze_max_age/2. My first thought was to teach VACUUM to > make its FreezeLimit "OldestXmin - autovacuum_freeze_max_age". To me > these two concepts really *are* the same thing: vacrel->FreezeLimit > becomes a backstop, just as anti-wraparound autovacuum (the > autovacuum_freeze_max_age cutoff) becomes a backstop. I can't follow this. If the idea is that we're going to opportunistically freeze a page whenever that allows us to mark it all-visible, then the remaining question is what XID age we should use to force freezing when that rule doesn't apply. It seems to me that there is a rebuttable presumption that that case ought to work just as it does today - and I think I hear you saying that it should NOT work as it does today, but should use some other threshold. Yet I can't understand why you think that. > I couldn't agree more. In fact, I was mostly thinking about how to > *help* these users. Insisting on waiting for a cleanup lock before it > becomes strictly necessary (when the table age is only 50 > million/vacuum_freeze_min_age) is actually a big part of the problem > for these users. vacuum_freeze_min_age enforces a false dichotomy on > aggressive VACUUMs, that just isn't unhelpful. Why should waiting on a > cleanup lock fix anything? Because waiting on a lock means that we'll acquire it as soon as it's available. If you repeatedly call your local Pizzeria Uno's and ask whether there is a wait, and head to the restaurant only when the answer is in the negative, you may never get there, because they may be busy every time you call - especially if you always call around lunch or dinner time. Even if you eventually get there, it may take multiple days before you find a time when a table is immediately available, whereas if you had just gone over there and stood in line, you likely would have been seated in under an hour and savoring the goodness of quality deep-dish pizza not too long thereafter. The same principle applies here. I do think that waiting for a cleanup lock when the age of the page is only vacuum_freeze_min_age seems like it might be too aggressive, but I don't think that's how it works. AFAICS, it's based on whether the vacuum is marked as aggressive, which has to do with vacuum_freeze_table_age, not vacuum_freeze_min_age. Let's turn the question around: if the age of the oldest XID on the page is >150 million transactions and the buffer cleanup lock is not available now, what makes you think that it's any more likely to be available when the XID age reaches 200 million or 300 million or 700 million? There is perhaps an argument for some kind of tunable that eventually shoots the other session in the head (if we can identify it, anyway) but it seems to me that regardless of what threshold we pick, polling is strictly less likely to find a time when the page is available than waiting for the cleanup lock. It has the counterbalancing advantage of allowing the autovacuum worker to do other useful work in the meantime and that is indeed a significant upside, but at some point you're going to have to give up and admit that polling is a failed strategy, and it's unclear why 150 million XIDs - or probably even 50 million XIDs - isn't long enough to say that we're not getting the job done with half measures. -- Robert Haas EDB: http://www.enterprisedb.com
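For anyone following along, the two behaviors being weighed correspond to two real bufmgr.c entry points. A simplified sketch is below; the policy flag is a stand-in, since the real decision in lazy_scan_heap() is tied to whether the VACUUM is aggressive:

#include "postgres.h"
#include "storage/bufmgr.h"

/*
 * Sketch of the choice under discussion.  Today an aggressive VACUUM waits
 * via LockBufferForCleanup(); a non-aggressive one only tries the
 * conditional variant and otherwise processes the page without freezing.
 */
static bool
acquire_cleanup_lock(Buffer buf, bool wait_for_cleanup_lock)
{
    if (ConditionalLockBufferForCleanup(buf))
        return true;            /* no one else has the page pinned */

    if (!wait_for_cleanup_lock)
        return false;           /* caller falls back to no-freeze processing */

    LockBufferForCleanup(buf);  /* may sleep behind an idle cursor's pin */
    return true;
}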
On Thu, Jan 13, 2022 at 12:19 PM Robert Haas <robertmhaas@gmail.com> wrote: > I can't follow this. If the idea is that we're going to > opportunistically freeze a page whenever that allows us to mark it > all-visible, then the remaining question is what XID age we should use > to force freezing when that rule doesn't apply. That is the idea, yes. > It seems to me that > there is a rebuttable presumption that that case ought to work just as > it does today - and I think I hear you saying that it should NOT work > as it does today, but should use some other threshold. Yet I can't > understand why you think that. Cases where we can not get a cleanup lock fall into 2 sharply distinct categories in my mind: 1. Cases where our inability to get a cleanup lock signifies nothing at all about the page in question, or any page in the same table, with the same workload. 2. Pathological cases. Cases where we're at least at the mercy of the application to do something about an idle cursor, where the situation may be entirely hopeless on a long enough timeline. (Whether or not it actually happens in the end is less significant.) As far as I can tell, based on testing, category 1 cases are fixed by the patch series: while a small number of pages from tables in category 1 cannot be cleanup-locked during each VACUUM, even with the patch series, it happens at random, with no discernable pattern. The overall result is that our ability to advance relfrozenxid is really not impacted *over time*. It's reasonable to suppose that lightning will not strike in the same place twice -- and it would really have to strike several times to invalidate this assumption. It's not impossible, but the chances over time are infinitesimal -- and the aggregate effect over time (not any one VACUUM operation) is what matters. There are seldom more than 5 or so of these pages, even on large tables. What are the chances that some random not-yet-all-frozen block (that we cannot freeze tuples on) will also have the oldest couldn't-be-frozen XID, even once? And when it is the oldest, why should it be the oldest by very many XIDs? And what are the chances that the same page has the same problem, again and again, without that being due to some pathological workload thing? Admittedly you may see a blip from this -- you might notice that the final relfrozenxid value for that one single VACUUM isn't quite as new as you'd like. But then the next VACUUM should catch up with the stable long term average again. It's hard to describe exactly why this effect is robust, but as I said, empirically, in practice, it appears to be robust. That might not be good enough as an explanation that justifies committing the patch series, but that's what I see. And I think I will be able to nail it down. AFAICT that just leaves concern for cases in category 2. More on that below. > Even if you eventually get there, it may take > multiple days before you find a time when a table is immediately > available, whereas if you had just gone over there and stood in line, > you likely would have been seated in under an hour and savoring the > goodness of quality deep-dish pizza not too long thereafter. The same > principle applies here. I think that you're focussing on individual VACUUM operations, whereas I'm more concerned about the aggregate effect of a particular policy over time. Let's assume for a moment that the only thing that we really care about is reliably keeping relfrozenxid reasonably recent. 
Even then, waiting for a cleanup lock (to freeze some tuples) might be the wrong thing to do. Waiting in line means that we're not freezing other tuples (nobody else can either). So we're allowing ourselves to fall behind on necessary, routine maintenance work that allows us to advance relfrozenxid....in order to advance relfrozenxid. > I do think that waiting for a cleanup lock when the age of the page is > only vacuum_freeze_min_age seems like it might be too aggressive, but > I don't think that's how it works. AFAICS, it's based on whether the > vacuum is marked as aggressive, which has to do with > vacuum_freeze_table_age, not vacuum_freeze_min_age. Let's turn the > question around: if the age of the oldest XID on the page is >150 > million transactions and the buffer cleanup lock is not available now, > what makes you think that it's any more likely to be available when > the XID age reaches 200 million or 300 million or 700 million? This is my concern -- what I've called category 2 cases have this exact quality. So given that, why not freeze what you can, elsewhere, on other pages that don't have the same issue (presumably the vast vast majority in the table)? That way you have the best possible chance of recovering once the DBA gets a clue and fixes the issue. > There > is perhaps an argument for some kind of tunable that eventually shoots > the other session in the head (if we can identify it, anyway) but it > seems to me that regardless of what threshold we pick, polling is > strictly less likely to find a time when the page is available than > waiting for the cleanup lock. It has the counterbalancing advantage of > allowing the autovacuum worker to do other useful work in the meantime > and that is indeed a significant upside, but at some point you're > going to have to give up and admit that polling is a failed strategy, > and it's unclear why 150 million XIDs - or probably even 50 million > XIDs - isn't long enough to say that we're not getting the job done > with half measures. That's kind of what I meant. The difference between 50 million and 150 million is rather unclear indeed. So having accepted that that might be true, why not be open to the possibility that it won't turn out to be true in the long run, for any given table? With the enhancements from the patch series in place (particularly the early freezing stuff), what do we have to lose by making the FreezeLimit XID cutoff for freezing much higher than your typical vacuum_freeze_min_age? Maybe the same as autovacuum_freeze_max_age or vacuum_freeze_table_age (it can't be higher than that without also making these other settings become meaningless, of course). Taking a wait-and-see approach like this (not being too quick to decide that a table is in category 1 or category 2) doesn't seem to make wraparound failure any more likely in any particular scenario, but makes it less likely in other scenarios. It also gives us early visibility into the problem, because we'll see that autovacuum can no longer advance relfrozenxid (using the enhanced log output) where that's generally expected. -- Peter Geoghegan
On Thu, Jan 13, 2022 at 1:27 PM Peter Geoghegan <pg@bowt.ie> wrote: > Admittedly you may see a blip from this -- you might notice that the > final relfrozenxid value for that one single VACUUM isn't quite as new > as you'd like. But then the next VACUUM should catch up with the > stable long term average again. It's hard to describe exactly why this > effect is robust, but as I said, empirically, in practice, it appears > to be robust. That might not be good enough as an explanation that > justifies committing the patch series, but that's what I see. And I > think I will be able to nail it down. Attached is v6, which like v5 is a rebased version that I'm posting to keep CFTester happy. I pushed a commit that consolidates VACUUM VERBOSE and autovacuum logging earlier (commit 49c9d9fc), which bitrot v5. So no real changes, nothing to note. Although it technically has nothing to do with this patch series, I will point out that it's now a lot easier to debug using VACUUM VERBOSE, which will directly display information about how we've advanced relfrozenxid, tuples frozen, etc:

pg@regression:5432 =# delete from mytenk2 where hundred < 15;
DELETE 1500
pg@regression:5432 =# vacuum VERBOSE mytenk2;
INFO: vacuuming "regression.public.mytenk2"
INFO: finished vacuuming "regression.public.mytenk2": index scans: 1
pages: 0 removed, 345 remain, 0 skipped using visibility map (0.00% of total)
tuples: 1500 removed, 8500 remain (8500 newly frozen), 0 are dead but not yet removable
removable cutoff: 17411, which is 0 xids behind next
new relfrozenxid: 17411, which is 3 xids ahead of previous value
index scan needed: 341 pages from table (98.84% of total) had 1500 dead item identifiers removed
index "mytenk2_unique1_idx": pages: 39 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "mytenk2_unique2_idx": pages: 30 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "mytenk2_hundred_idx": pages: 11 in total, 1 newly deleted, 1 currently deleted, 0 reusable
I/O timings: read: 0.011 ms, write: 0.000 ms
avg read rate: 1.428 MB/s, avg write rate: 2.141 MB/s
buffer usage: 1133 hits, 2 misses, 3 dirtied
WAL usage: 1446 records, 1 full page images, 199702 bytes
system usage: CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.01 s
VACUUM

-- Peter Geoghegan
On Thu, Jan 13, 2022 at 4:27 PM Peter Geoghegan <pg@bowt.ie> wrote: > 1. Cases where our inability to get a cleanup lock signifies nothing > at all about the page in question, or any page in the same table, with > the same workload. > > 2. Pathological cases. Cases where we're at least at the mercy of the > application to do something about an idle cursor, where the situation > may be entirely hopeless on a long enough timeline. (Whether or not it > actually happens in the end is less significant.) Sure. I'm worrying about case (2). I agree that in case (1) waiting for the lock is almost always the wrong idea. > I think that you're focussing on individual VACUUM operations, whereas > I'm more concerned about the aggregate effect of a particular policy > over time. I don't think so. I think I'm worrying about the aggregate effect of a particular policy over time *in the pathological cases* i.e. (2). > This is my concern -- what I've called category 2 cases have this > exact quality. So given that, why not freeze what you can, elsewhere, > on other pages that don't have the same issue (presumably the vast > vast majority in the table)? That way you have the best possible > chance of recovering once the DBA gets a clue and fixes the issue. That's the part I'm not sure I believe. Imagine a table with a gigantic number of pages that are not yet all-visible, a small number of all-visible pages, and one page containing very old XIDs on which a cursor holds a pin. I don't think it's obvious that not waiting is best. Maybe you're going to end up vacuuming the table repeatedly and doing nothing useful. If you avoid vacuuming it repeatedly, you still have a lot of work to do once the DBA locates a clue. I think there's probably an important principle buried in here: the XID threshold that forces a vacuum had better also force waiting for pins. If it doesn't, you can tight-loop on that table without getting anything done. > That's kind of what I meant. The difference between 50 million and 150 > million is rather unclear indeed. So having accepted that that might > be true, why not be open to the possibility that it won't turn out to > be true in the long run, for any given table? With the enhancements > from the patch series in place (particularly the early freezing > stuff), what do we have to lose by making the FreezeLimit XID cutoff > for freezing much higher than your typical vacuum_freeze_min_age? > Maybe the same as autovacuum_freeze_max_age or vacuum_freeze_table_age > (it can't be higher than that without also making these other settings > become meaningless, of course). We should probably distinguish between the situation where (a) an adverse pin is held continuously and effectively forever and (b) adverse pins are held frequently but for short periods of time. I think it's possible to imagine a small, very hot table (or portion of a table) where very high concurrency means there are often pins. In case (a), it's not obvious that waiting will ever resolve anything, although it might prevent other problems like infinite looping. In case (b), a brief wait will do a lot of good. But maybe that doesn't even matter. I think part of your argument is that if we fail to update relfrozenxid for a while, that really isn't that bad. I think I agree, up to a point. One consequence of failing to immediately advance relfrozenxid might be that pg_clog and friends are bigger, but that's pretty minor. Another consequence might be that we might vacuum the table more times, which is more serious. 
I'm not really sure that can happen to a degree that is meaningful, apart from the infinite loop case already described, but I'm also not entirely sure that it can't. -- Robert Haas EDB: http://www.enterprisedb.com
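Robert's principle can be written down as a small predicate: whatever XID age is old enough to force the vacuum in the first place must also be old enough to make that vacuum wait for the pin, or else the table can be vacuumed over and over without the old XID ever being dealt with. Illustrative only, with invented names:

#include "postgres.h"
#include "access/transam.h"

/*
 * Illustration of the "no tight-looping" invariant: once a page's oldest
 * XID has crossed the same limit that forces (anti-wraparound) vacuuming
 * of the table, the vacuum must be willing to wait for the cleanup lock
 * rather than skip the page yet again.
 */
static bool
must_wait_for_cleanup_lock(TransactionId page_oldest_xid,
                           TransactionId next_xid, int force_vacuum_age)
{
    TransactionId force_limit = next_xid - (TransactionId) force_vacuum_age;

    if (!TransactionIdIsNormal(force_limit))
        force_limit = FirstNormalTransactionId;

    return TransactionIdPrecedes(page_oldest_xid, force_limit);
}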
On Mon, Jan 17, 2022 at 7:12 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Jan 13, 2022 at 4:27 PM Peter Geoghegan <pg@bowt.ie> wrote: > > 1. Cases where our inability to get a cleanup lock signifies nothing > > at all about the page in question, or any page in the same table, with > > the same workload. > > > > 2. Pathological cases. Cases where we're at least at the mercy of the > > application to do something about an idle cursor, where the situation > > may be entirely hopeless on a long enough timeline. (Whether or not it > > actually happens in the end is less significant.) > > Sure. I'm worrying about case (2). I agree that in case (1) waiting > for the lock is almost always the wrong idea. I don't doubt that we'd each have little difficulty determining which category (1 or 2) a given real world case should be placed in, using a variety of methods that put the issue in context (e.g., looking at the application code, talking to the developers or the DBA). Of course, it doesn't follow that it would be easy to teach vacuumlazy.c how to determine which category the same "can't get cleanup lock" falls under, since (just for starters) there is no practical way for VACUUM to see all that context. That's what I'm effectively trying to work around with this "wait and see approach" that demotes FreezeLimit to a backstop (and so justifies removing the vacuum_freeze_min_age GUC that directly dictates our FreezeLimit today). The cure may be worse than the disease, and the cure isn't actually all that great at the best of times, so we should wait until the disease visibly gets pretty bad before being "interventionist" by waiting for a cleanup lock. I've already said plenty about why I don't like vacuum_freeze_min_age (or FreezeLimit) due to XIDs being fundamentally the wrong unit. But that's not the only fundamental problem that I see. The other problem is this: vacuum_freeze_min_age also dictates when an aggressive VACUUM will start to wait for a cleanup lock. But why should the first thing be the same as the second thing? I see absolutely no reason for it. (Hence the idea of making FreezeLimit a backstop, and getting rid of the GUC itself.) > > This is my concern -- what I've called category 2 cases have this > > exact quality. So given that, why not freeze what you can, elsewhere, > > on other pages that don't have the same issue (presumably the vast > > vast majority in the table)? That way you have the best possible > > chance of recovering once the DBA gets a clue and fixes the issue. > > That's the part I'm not sure I believe. To be clear, I think that I have yet to adequately demonstrate that this is true. It's a bit tricky to do so -- absence of evidence isn't evidence of absence. I think that your principled skepticism makes sense right now. Fortunately the early refactoring patches should be uncontroversial. The controversial parts are all in the last patch in the patch series, which isn't too much code. (Plus another patch to at least get rid of vacuum_freeze_min_age, and maybe vacuum_freeze_table_age too, that hasn't been written just yet.) > Imagine a table with a > gigantic number of pages that are not yet all-visible, a small number > of all-visible pages, and one page containing very old XIDs on which a > cursor holds a pin. I don't think it's obvious that not waiting is > best. Maybe you're going to end up vacuuming the table repeatedly and > doing nothing useful. If you avoid vacuuming it repeatedly, you still > have a lot of work to do once the DBA locates a clue. 
Maybe this is a simpler way of putting it: I want to delay waiting on a pin until it's pretty clear that we truly have a pathological case, which should in practice be limited to an anti-wraparound VACUUM, which will now be naturally rare -- most individual tables will literally never have even one anti-wraparound VACUUM. We don't need to reason about the vacuuming schedule this way, since anti-wraparound VACUUMs are driven by age(relfrozenxid) -- we don't really have to predict anything. Maybe we'll need to do an anti-wraparound VACUUM immediately after a non-aggressive autovacuum runs, without getting a cleanup lock (due to an idle cursor pathological case). We won't be able to advance relfrozenxid until the anti-wraparound VACUUM runs (at the earliest) in this scenario, but it makes no difference. Rather than predicting the future, we're covering every possible outcome (at least to the extent that that's possible). > I think there's probably an important principle buried in here: the > XID threshold that forces a vacuum had better also force waiting for > pins. If it doesn't, you can tight-loop on that table without getting > anything done. I absolutely agree -- that's why I think that we still need FreezeLimit. Just as a backstop, that in practice very rarely influences our behavior. Probably just in those remaining cases that are never vacuumed except for the occasional anti-wraparound VACUUM (even then it might not be very important). > We should probably distinguish between the situation where (a) an > adverse pin is held continuously and effectively forever and (b) > adverse pins are held frequently but for short periods of time. I agree. It's just hard to do that from vacuumlazy.c, during a routine non-aggressive VACUUM operation. > I think it's possible to imagine a small, very hot table (or portion of > a table) where very high concurrency means there are often pins. In > case (a), it's not obvious that waiting will ever resolve anything, > although it might prevent other problems like infinite looping. In > case (b), a brief wait will do a lot of good. But maybe that doesn't > even matter. I think part of your argument is that if we fail to > update relfrozenxid for a while, that really isn't that bad. Yeah, that is a part of it -- it doesn't matter (until it really matters), and we should be careful to avoid making the situation worse by waiting for a cleanup lock unnecessarily. That's actually a very drastic thing to do, at least in a world where freezing has been decoupled from advancing relfrozenxid. Updating relfrozenxid should now be thought of as a continuous thing, not a discrete thing. And so it's highly unlikely that any given VACUUM will ever *completely* fail to advance relfrozenxid -- that fact alone signals a pathological case (things that are supposed to be continuous should not ever appear to be discrete). But you need multiple VACUUMs to see this "signal". It is only revealed over time. It seems wise to make the most modest possible assumptions about what's going on here. We might well "get lucky" before the next VACUUM comes around when we encounter what at first appears to be a problematic case involving an idle cursor -- for all kinds of reasons. Like maybe an opportunistic prune gets rid of the old XID for us, without any freezing, during some brief window where the application doesn't have a cursor. We're only talking about one or two heap pages here. We might also *not* "get lucky" with the application and its use of idle cursors, of course. 
But in that case we must have been doomed all along. And we'll at least have put things on a much better footing in this disaster scenario -- there is relatively little freezing left to do in single-user mode, and relfrozenxid should already be the same as the exact oldest XID in that one page. > I think I agree, up to a point. One consequence of failing to > immediately advance relfrozenxid might be that pg_clog and friends are > bigger, but that's pretty minor. My arguments are probabilistic (sort of), which makes it tricky. Actual test cases/benchmarks should bear out the claims that I've made. If anything fully convinces you, it'll be that, I think. > Another consequence might be that we > might vacuum the table more times, which is more serious. I'm not > really sure that can happen to a degree that is meaningful, apart from > the infinite loop case already described, but I'm also not entirely > sure that it can't. It's definitely true that this overall strategy could result in there being more individual VACUUM operations. But that naturally follows from teaching VACUUM to avoid waiting indefinitely. Obviously the important question is whether we'll do meaningfully more work for less benefit (in Postgres 15, relative to Postgres 14). Your concern is very reasonable. I just can't imagine how we could lose out to any notable degree. Which is a start. -- Peter Geoghegan
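The "continuous" advancement being described is easiest to picture as a running minimum carried across the whole scan. Something in this spirit (the names are placeholders, not taken from the patch):

#include "postgres.h"
#include "access/transam.h"

/*
 * Sketch: the VACUUM starts with an optimistic candidate relfrozenxid
 * (e.g. OldestXmin) and ratchets it down to the oldest XID it had to
 * leave behind, including XIDs on pages it could not cleanup-lock.  The
 * final candidate is therefore the exact oldest remaining XID, even in
 * the idle-cursor case.
 */
static void
note_unfrozen_xid(TransactionId unfrozen_xid,
                  TransactionId *candidate_relfrozenxid)
{
    if (TransactionIdIsNormal(unfrozen_xid) &&
        TransactionIdPrecedes(unfrozen_xid, *candidate_relfrozenxid))
        *candidate_relfrozenxid = unfrozen_xid;
}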
On Mon, Jan 17, 2022 at 4:28 PM Peter Geoghegan <pg@bowt.ie> wrote: > Updating relfrozenxid should now be thought of as a continuous thing, > not a discrete thing. I think that's pretty nearly 100% wrong. The most simplistic way of expressing that is to say - clearly it can only happen when VACUUM runs, which is not all the time. That's a bit facile, though; let me try to say something a little smarter. There are real production systems that exist today where essentially all vacuums are anti-wraparound vacuums. And there are also real production systems that exist today where virtually none of the vacuums are anti-wraparound vacuums. So if we ship your proposed patches, the frequency with which relfrozenxid gets updated is going to increase by a large multiple, perhaps 100x, for the second group of people, who will then perceive the movement of relfrozenxid to be much closer to continuous than it is today even though, technically, it's still a step function. But the people in the first category are not going to see any difference at all. And therefore the reasoning that says - anti-wraparound vacuums just aren't going to happen any more - or - relfrozenxid will advance continuously seems like dangerous wishful thinking to me. It's only true if (# of vacuums) / (# of wraparound vacuums) >> 1. And that need not be true in any particular environment, which to me means that all conclusions based on the idea that it has to be true are pretty dubious. There's no doubt in my mind that advancing relfrozenxid opportunistically is a good idea. However, I'm not sure how reasonable it is to change any other behavior on the basis of the fact that we're doing it, because we don't know how often it really happens. If someone says "every time I travel to Europe on business, I will use the opportunity to bring you back a nice present," you can't evaluate how much impact that will have on your life without knowing how often they travel to Europe on business. And that varies radically from "never" to "a lot" based on the person. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Jan 17, 2022 at 2:13 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Jan 17, 2022 at 4:28 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Updating relfrozenxid should now be thought of as a continuous thing, > > not a discrete thing. > > I think that's pretty nearly 100% wrong. The most simplistic way of > expressing that is to say - clearly it can only happen when VACUUM > runs, which is not all the time. That just seems like semantics to me. The very next sentence after the one you quoted in your reply was "And so it's highly unlikely that any given VACUUM will ever *completely* fail to advance relfrozenxid". It's continuous *within* each VACUUM. As far as I can tell there is pretty much no way that the patch series will ever fail to advance relfrozenxid *by at least a little bit*, barring pathological cases with cursors and whatnot. > That's a bit facile, though; let me > try to say something a little smarter. There are real production > systems that exist today where essentially all vacuums are > anti-wraparound vacuums. And there are also real production systems > that exist today where virtually none of the vacuums are > anti-wraparound vacuums. So if we ship your proposed patches, the > frequency with which relfrozenxid gets updated is going to increase by > a large multiple, perhaps 100x, for the second group of people, who > will then perceive the movement of relfrozenxid to be much closer to > continuous than it is today even though, technically, it's still a > step function. But the people in the first category are not going to > see any difference at all. Actually, I think that even the people in the first category might well have about the same improved experience. Not just because of this patch series, mind you. It would also have a lot to do with the autovacuum_vacuum_insert_scale_factor stuff in Postgres 13. Not to mention the freeze map. What version are these users on? I have actually seen this for myself. With BenchmarkSQL, the largest table (the order lines table) starts out having its autovacuums driven entirely by autovacuum_vacuum_insert_scale_factor, even though there is a fair amount of bloat from updates. It stays like that for hours on HEAD. But even with my reasonably tuned setup, there is eventually a switchover point. Eventually all autovacuums end up as aggressive anti-wraparound VACUUMs -- this happens once the table gets sufficiently large (this is one of the two that is append-only, with one update to every inserted row from the delivery transaction, which happens hours after the initial insert). With the patch series, we have a kind of virtuous circle with freezing and with advancing relfrozenxid with the same order lines table. As far as I can tell, we fix the problem with the patch series. Because there are about 10 tuples inserted per new order transaction, the actual "XID consumption rate of the table" is much lower than the "worst case XID consumption" for such a table. It's also true that even with the patch we still get anti-wraparound VACUUMs for two fixed-size, hot-update-only tables: the stock table, and the customers table. But that's no big deal. It only happens because nothing else will ever trigger an autovacuum, no matter the autovacuum_freeze_max_age setting. > And therefore the reasoning that says - anti-wraparound vacuums just > aren't going to happen any more - or - relfrozenxid will advance > continuously seems like dangerous wishful thinking to me. I never said that anti-wraparound vacuums just won't happen anymore. 
I said that they'll be limited to cases like the stock table or customers table. I was very clear on that point. With pgbench, whether or not you ever see any anti-wraparound VACUUMs will depend on how heap fillfactor is set for the accounts table -- set it low enough (maybe to 90) and you will still get them, since there won't be any other reason to VACUUM. As for the branches table, and the tellers table, they'll get VACUUMs in any case, regardless of heap fillfactor. And so they'll always advance relfrozenxid during each VACUUM, and never have even one anti-wraparound VACUUM. > It's only > true if (# of vacuums) / (# of wraparound vacuums) >> 1. And that need > not be true in any particular environment, which to me means that all > conclusions based on the idea that it has to be true are pretty > dubious. There's no doubt in my mind that advancing relfrozenxid > opportunistically is a good idea. However, I'm not sure how reasonable > it is to change any other behavior on the basis of the fact that we're > doing it, because we don't know how often it really happens. It isn't that hard to see that the cases where we continue to get any anti-wraparound VACUUMs with the patch seem to be limited to cases like the stock/customers table, or cases like the pathological idle cursor cases we've been discussing. Pretty narrow cases, overall. Don't take my word for it - see for yourself. -- Peter Geoghegan
On Mon, Jan 17, 2022 at 5:41 PM Peter Geoghegan <pg@bowt.ie> wrote: > That just seems like semantics to me. The very next sentence after the > one you quoted in your reply was "And so it's highly unlikely that any > given VACUUM will ever *completely* fail to advance relfrozenxid". > It's continuous *within* each VACUUM. As far as I can tell there is > pretty much no way that the patch series will ever fail to advance > relfrozenxid *by at least a little bit*, barring pathological cases > with cursors and whatnot. I mean this boils down to saying that VACUUM will advance relfrozenxid except when it doesn't. > Actually, I think that even the people in the first category might > well have about the same improved experience. Not just because of this > patch series, mind you. It would also have a lot to do with the > autovacuum_vacuum_insert_scale_factor stuff in Postgres 13. Not to > mention the freeze map. What version are these users on? I think it varies. I expect the increase in the default cost limit to have had a much more salutary effect than autovacuum_vacuum_insert_scale_factor, but I don't know for sure. At any rate, if you make the database big enough and generate dirty data fast enough, it doesn't matter what the default limits are. > I never said that anti-wraparound vacuums just won't happen anymore. I > said that they'll be limited to cases like the stock table or > customers table case. I was very clear on that point. I don't know how I'm supposed to sensibly respond to a statement like this. If you were very clear, then I'm being deliberately obtuse if I fail to understand. If I say you weren't very clear, then we're just contradicting each other. > It isn't that hard to see that the cases where we continue to get any > anti-wraparound VACUUMs with the patch seem to be limited to cases > like the stock/customers table, or cases like the pathological idle > cursor cases we've been discussing. Pretty narrow cases, overall. > Don't take my word for it - see for yourself. I don't think that's really possible. Words like "narrow" and "pathological" are value judgments, not factual statements. If I do an experiment where no wraparound autovacuums happen, as I'm sure I can, then those are the normal cases where the patch helps. If I do an experiment where they do happen, as I'm sure that I also can, you'll probably say either that the case in question is like the stock/customers table, or that it's pathological. What will any of this prove? I think we're reaching the point of diminishing returns in this conversation. What I want to know is that users aren't going to be harmed - even in cases where they have behavior that is like the stock/customers table, or that you consider pathological, or whatever other words we want to use to describe the weird things that happen to people. And I think we've made perhaps a bit of modest progress in exploring that issue, but certainly less than I'd like. I don't want to spend the next several days going around in circles about it though. That does not seem likely to make anyone happy. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Jan 17, 2022 at 8:13 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Jan 17, 2022 at 5:41 PM Peter Geoghegan <pg@bowt.ie> wrote: > > That just seems like semantics to me. The very next sentence after the > > one you quoted in your reply was "And so it's highly unlikely that any > > given VACUUM will ever *completely* fail to advance relfrozenxid". > > It's continuous *within* each VACUUM. As far as I can tell there is > > pretty much no way that the patch series will ever fail to advance > > relfrozenxid *by at least a little bit*, barring pathological cases > > with cursors and whatnot. > > I mean this boils down to saying that VACUUM will advance relfrozenxid > except when it doesn't. It actually doesn't boil down, at all. The world is complicated and messy, whether we like it or not. > > I never said that anti-wraparound vacuums just won't happen anymore. I > > said that they'll be limited to cases like the stock table or > > customers table case. I was very clear on that point. > > I don't know how I'm supposed to sensibly respond to a statement like > this. If you were very clear, then I'm being deliberately obtuse if I > fail to understand. I don't know if I'd accuse you of being obtuse, exactly. Mostly I just think it's strange that you don't seem to take what I say seriously when it cannot be proven very easily. I don't think that you intend this to be disrespectful, and I don't take it personally. I just don't understand it. > > It isn't that hard to see that the cases where we continue to get any > > anti-wraparound VACUUMs with the patch seem to be limited to cases > > like the stock/customers table, or cases like the pathological idle > > cursor cases we've been discussing. Pretty narrow cases, overall. > > Don't take my word for it - see for yourself. > > I don't think that's really possible. Words like "narrow" and > "pathological" are value judgments, not factual statements. If I do an > experiment where no wraparound autovacuums happen, as I'm sure I can, > then those are the normal cases where the patch helps. If I do an > experiment where they do happen, as I'm sure that I also can, you'll > probably say either that the case in question is like the > stock/customers table, or that it's pathological. What will any of > this prove? You seem to be suggesting that I used words like "pathological" in some kind of highly informal, totally subjective way, when I did no such thing. I quite clearly said that you'll only get an anti-wraparound VACUUM with the patch applied when the only factor that *ever* causes *any* autovacuum worker to VACUUM the table (assuming the workload is stable) is the anti-wraparound/autovacuum_freeze_max_age cutoff. With a table like this, even increasing autovacuum_freeze_max_age to its absolute maximum of 2 billion would not make it any more likely that we'd get a non-aggressive VACUUM -- it would merely make the anti-wraparound VACUUMs less frequent. No big change should be expected with a table like that. Also, since the patch is not magic, and doesn't even change the basic invariants for relfrozenxid, it's still true that any scenario in which it's fundamentally impossible for VACUUM to keep up will also have anti-wraparound VACUUMs. But that's the least of the user's trouble -- in the long run we're going to have the system refuse to allocate new XIDs with such a workload. The claim that I have made is 100% testable. Even if it was flat out incorrect, not getting anti-wraparound VACUUMs per se is not the important part. 
The important part is that the work is managed intelligently, and the burden is spread out over time. I am particularly concerned about the "freezing cliff" we get when many pages are all-visible but not also all-frozen. Consistently avoiding an anti-wraparound VACUUM (except with very particular workload characteristics) is really just a side effect -- it's something that makes the overall benefit relatively obvious, and relatively easy to measure. I thought that you'd appreciate that. -- Peter Geoghegan
On Tue, Jan 18, 2022 at 12:14 AM Peter Geoghegan <pg@bowt.ie> wrote: > I quite clearly said that you'll only get an anti-wraparound VACUUM > with the patch applied when the only factor that *ever* causes *any* > autovacuum worker to VACUUM the table (assuming the workload is > stable) is the anti-wraparound/autovacuum_freeze_max_age cutoff. With > a table like this, even increasing autovacuum_freeze_max_age to its > absolute maximum of 2 billion would not make it any more likely that > we'd get a non-aggressive VACUUM -- it would merely make the > anti-wraparound VACUUMs less frequent. No big change should be > expected with a table like that. Sure, I don't disagree with any of that. I don't see how I could. But I don't see how it detracts from the points I was trying to make either. > Also, since the patch is not magic, and doesn't even change the basic > invariants for relfrozenxid, it's still true that any scenario in > which it's fundamentally impossible for VACUUM to keep up will also > have anti-wraparound VACUUMs. But that's the least of the user's > trouble -- in the long run we're going to have the system refuse to > allocate new XIDs with such a workload. Also true. But again, it's just about making sure that the patch doesn't make other decisions that make things worse for people in that situation. That's what I was expressing uncertainty about. > The claim that I have made is 100% testable. Even if it was flat out > incorrect, not getting anti-wraparound VACUUMs per se is not the > important part. The important part is that the work is managed > intelligently, and the burden is spread out over time. I am > particularly concerned about the "freezing cliff" we get when many > pages are all-visible but not also all-frozen. Consistently avoiding > an anti-wraparound VACUUM (except with very particular workload > characteristics) is really just a side effect -- it's something that > makes the overall benefit relatively obvious, and relatively easy to > measure. I thought that you'd appreciate that. I do. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Jan 18, 2022 at 6:11 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Jan 18, 2022 at 12:14 AM Peter Geoghegan <pg@bowt.ie> wrote: > > I quite clearly said that you'll only get an anti-wraparound VACUUM > > with the patch applied when the only factor that *ever* causes *any* > > autovacuum worker to VACUUM the table (assuming the workload is > > stable) is the anti-wraparound/autovacuum_freeze_max_age cutoff. With > > a table like this, even increasing autovacuum_freeze_max_age to its > > absolute maximum of 2 billion would not make it any more likely that > > we'd get a non-aggressive VACUUM -- it would merely make the > > anti-wraparound VACUUMs less frequent. No big change should be > > expected with a table like that. > > Sure, I don't disagree with any of that. I don't see how I could. But > I don't see how it detracts from the points I was trying to make > either. You said "...the reasoning that says - anti-wraparound vacuums just aren't going to happen any more - or - relfrozenxid will advance continuously seems like dangerous wishful thinking to me". You then proceeded to attack a straw man -- a view that I couldn't possibly hold. This certainly surprised me, because my actual claims seemed well within the bounds of what is possible, and in any case can be verified with a fairly modest effort. That's what I was reacting to -- it had nothing to do with any concerns you may have had. I wasn't thinking about long-idle cursors at all. I was defending myself, because I was put in a position where I had to defend myself. > > Also, since the patch is not magic, and doesn't even change the basic > > invariants for relfrozenxid, it's still true that any scenario in > > which it's fundamentally impossible for VACUUM to keep up will also > > have anti-wraparound VACUUMs. But that's the least of the user's > > trouble -- in the long run we're going to have the system refuse to > > allocate new XIDs with such a workload. > > Also true. But again, it's just about making sure that the patch > doesn't make other decisions that make things worse for people in that > situation. That's what I was expressing uncertainty about. I am not just trying to avoid making things worse when users are in this situation. I actually want to give users every chance to avoid being in this situation in the first place. In fact, almost everything I've said about this aspect of things was about improving things for these users. It was not about covering myself -- not at all. It would be easy for me to throw up my hands, and change nothing here (keep the behavior that makes FreezeLimit derived from the vacuum_freeze_min_age GUC), since it's all incidental to the main goals of this patch series. I still don't understand why you think that my idea (not yet implemented) of making FreezeLimit into a backstop (making it autovacuum_freeze_max_age/2 or something) and relying on the new "early freezing" criteria for almost everything is going to make the situation worse in this scenario with long idle cursors. It's intended to make it better. Why do you think that the current vacuum_freeze_min_age-based FreezeLimit isn't actually the main problem in these scenarios? I think that the way that that works right now (in particular during aggressive VACUUMs) is just an accident of history. It's all path dependence -- each incremental step may have made sense, but what we have now doesn't seem to. Waiting for a cleanup lock might feel like the diligent thing to do, but that doesn't make it so.
My sense is that there are very few apps that are hopelessly incapable of advancing relfrozenxid from day one. I find it much easier to believe that users that had this experience got away with it for a very long time, until their luck ran out, somehow. I would like to minimize the chance of that ever happening, to the extent that that's possible within the confines of the basic heapam/vacuumlazy.c invariants. -- Peter Geoghegan
On Tue, Jan 18, 2022 at 1:48 PM Peter Geoghegan <pg@bowt.ie> wrote: > That's what I was reacting to -- it had nothing to do with any > concerns you may have had. I wasn't thinking about long-idle cursors > at all. I was defending myself, because I was put in a position where > I had to defend myself. I don't think I've said anything on this thread that is an attack on you. I am getting pretty frustrated with the tenor of the discussion, though. I feel like you're the one attacking me, and I don't like it. > I still don't understand why you think that my idea (not yet > implemented) of making FreezeLimit into a backstop (making it > autovacuum_freeze_max_age/2 or something) and relying on the new > "early freezing" criteria for almost everything is going to make the > situation worse in this scenario with long idle cursors. It's intended > to make it better. I just don't understand how I haven't been able to convey my concern here by now. I've already written multiple emails about it. If none of them were clear enough for you to understand, I'm not sure how saying the same thing over again can help. When I say I've already written about this, I'm referring specifically to the following: - https://postgr.es/m/CA+TgmobKJm9BsZR3ETeb6MJdLKWxKK5ZXx0XhLf-W9kUgvOcNA@mail.gmail.com in the second-to-last paragraph, beginning with "I don't really see" - https://www.postgresql.org/message-id/CA%2BTgmoaGoZ2wX6T4sj0eL5YAOQKW3tS8ViMuN%2BtcqWJqFPKFaA%40mail.gmail.com in the second paragraph beginning with "Because waiting on a lock" - https://www.postgresql.org/message-id/CA%2BTgmoZYri_LUp4od_aea%3DA8RtjC%2B-Z1YmTc7ABzTf%2BtRD2Opw%40mail.gmail.com in the paragraph beginning with "That's the part I'm not sure I believe." For all of that, I'm not even convinced that you're wrong. I just think you might be wrong. I don't really know. It seems to me however that you're understating the value of waiting, which I've tried to explain in the above places. Waiting does have the very real disadvantage of starving the rest of the system of the work that autovacuum worker would have been doing, and that's why I think you might be right. However, there are cases where waiting, and only waiting, gets the job done. If you're not willing to admit that those cases exist, or you think they don't matter, then we disagree. If you admit that they exist and think they matter but believe that there's some reason why increasing FreezeLimit can't cause any damage, then either (a) you have a good reason for that belief which I have thus far been unable to understand or (b) you're more optimistic about the proposed change than can be entirely justified. > My sense is that there are very few apps that are hopelessly incapable > of advancing relfrozenxid from day one. I find it much easier to > believe that users that had this experience got away with it for a > very long time, until their luck ran out, somehow. I would like to > minimize the chance of that ever happening, to the extent that that's > possible within the confines of the basic heapam/vacuumlazy.c > invariants. I agree with the idea that most people are OK at the beginning and then at some point their luck runs out and catastrophe strikes. I think there are a couple of different kinds of catastrophe that can happen. For instance, somebody could park a cursor in the middle of a table someplace and leave it there until the snow melts. Or, somebody could take a table lock and sit on it forever. 
Or, there could be a corrupted page in the table that causes VACUUM to error out every time it's reached. In the second and third situations, it doesn't matter a bit what we do with FreezeLimit, but in the first one it might. If the user is going to leave that cursor sitting there literally forever, the best solution is to raise FreezeLimit as high as we possibly can. The system is bound to shut down due to wraparound at some point, but we at least might as well vacuum other stuff while we're waiting for that to happen. On the other hand if that user is going to close that cursor after 10 minutes and open a new one in the same place 10 seconds later, the best thing to do is to keep FreezeLimit as low as possible, because the first time we wait for the pin to be released we're guaranteed to advance relfrozenxid within 10 minutes, whereas if we don't do that we may keep missing the brief windows in which no cursor is held for a very long time. But we have absolutely no way of knowing which of those things is going to happen on any particular system, or of estimating which one is more common in general. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Jan 19, 2022 at 6:56 AM Robert Haas <robertmhaas@gmail.com> wrote: > I don't think I've said anything on this thread that is an attack on > you. I am getting pretty frustrated with the tenor of the discussion, > though. I feel like you're the one attacking me, and I don't like it. "Attack" is a strong word (much stronger than "defend"), and I don't think I'd use it to describe anything that has happened on this thread. All I said was that you misrepresented my views when you pounced on my use of the word "continuous". Which, honestly, I was very surprised by. > For all of that, I'm not even convinced that you're wrong. I just > think you might be wrong. I don't really know. I agree that I might be wrong, though of course I think that I'm probably correct. I value your input as a critical voice -- that's generally how we get really good designs. > However, there are cases where waiting, and only > waiting, gets the job done. If you're not willing to admit that those > cases exist, or you think they don't matter, then we disagree. They exist, of course. That's why I don't want to completely eliminate the idea of waiting for a cleanup lock. Rather, I want to change the design to recognize that that's an extreme measure, that should be delayed for as long as possible. There are many ways that the problem could naturally resolve itself. Waiting for a cleanup lock after only 50 million XIDs (the vacuum_freeze_min_age default) is like performing brain surgery to treat somebody with a headache (at least with the infrastructure from the earlier patches in place). It's not impossible that "surgery" could help, in theory (could be a tumor, better to catch these things early!), but that fact alone can hardly justify such a drastic measure. That doesn't mean that brain surgery isn't ever appropriate, of course. It should be delayed until it starts to become obvious that it's really necessary (but before it really is too late). > If you > admit that they exist and think they matter but believe that there's > some reason why increasing FreezeLimit can't cause any damage, then > either (a) you have a good reason for that belief which I have thus > far been unable to understand or (b) you're more optimistic about the > proposed change than can be entirely justified. I don't deny that it's just about possible that the changes that I'm thinking of could make the situation worse in some cases, but I think that the overwhelming likelihood is that things will be improved across the board. Consider the age of the tables from BenchmarkSQL, with the patch series:

     relname      │     age     │ mxid_age
──────────────────┼─────────────┼──────────
 bmsql_district   │         657 │        0
 bmsql_warehouse  │         696 │        0
 bmsql_item       │   1,371,978 │        0
 bmsql_config     │   1,372,061 │        0
 bmsql_new_order  │   3,754,163 │        0
 bmsql_history    │  11,545,940 │        0
 bmsql_order_line │  23,095,678 │        0
 bmsql_oorder     │  40,653,743 │        0
 bmsql_customer   │  51,371,610 │        0
 bmsql_stock      │  51,371,610 │        0
(10 rows)

We see significant "natural variation" here, unlike HEAD, where the age of all tables is exactly the same at all times, or close to it (incidentally, this leads to the largest tables all being anti-wraparound VACUUMed at the same time). There is a kind of natural ebb and flow for each table over time, as relfrozenxid is advanced, due in part to workload characteristics.
Less than half of all XIDs will ever modify the two largest tables, for example, and so autovacuum should probably never be launched because of the age of either table (barring some change in workload conditions, perhaps). As I've said a few times now, XIDs are generally "the wrong unit", except when needed as a backstop against wraparound failure. The natural variation that I see contributes to my optimism. A situation where we cannot get a cleanup lock may well resolve itself, for many reasons, that are hard to precisely nail down but are nevertheless very real. The vacuum_freeze_min_age design (particularly within an aggressive VACUUM) is needlessly rigid, probably just because the assumption before now has always been that we can only advance relfrozenxid in an aggressive VACUUM (it might happen in a non-aggressive VACUUM if we get very lucky, which cannot be accounted for). Because it is rigid, it is brittle. Because it is brittle, it will (on a long enough timeline, for a susceptible workload) actually break. > On the other hand if that user is going to close that > cursor after 10 minutes and open a new one in the same place 10 > seconds later, the best thing to do is to keep FreezeLimit as low as > possible, because the first time we wait for the pin to be released > we're guaranteed to advance relfrozenxid within 10 minutes, whereas if > we don't do that we may keep missing the brief windows in which no > cursor is held for a very long time. But we have absolutely no way of > knowing which of those things is going to happen on any particular > system, or of estimating which one is more common in general. I agree with all that, and I think that this particular scenario is the crux of the issue. The first time this happens (and we don't get a cleanup lock), then we will at least be able to set relfrozenxid to the exact oldest unfrozen XID. So that'll already have bought us some wallclock time -- often a great deal (why should the oldest XID on such a page be particularly old?). Furthermore, there will often be many more VACUUMs before we need to do an aggressive VACUUM -- each of these VACUUM operations is an opportunity to freeze the oldest tuple that holds up cleanup. Or maybe this XID is in a dead tuple, and so somebody's opportunistic pruning operation does the right thing for us. Never underestimate the power of dumb luck, especially in a situation where there are many individual "trials", and we only have to get lucky once. If and when that doesn't work out, and we actually have to do an anti-wraparound VACUUM, then something will have to give. Since anti-wraparound VACUUMs are naturally confined to certain kinds of tables/workloads with the patch series, we can now be pretty confident that the problem really is with this one problematic heap page, with the idle cursor. We could even verify this directly if we wanted to, by noticing that the preexisting relfrozenxid is an exact match for one XID on some can't-cleanup-lock page -- we could emit a WARNING about the page/tuple if we wanted to. To return to my colorful analogy from earlier, we now know that the patient almost certainly has a brain tumor. What new risk is implied by delaying the wait like this? Very little, I believe. Let's say we derive FreezeLimit from autovacuum_freeze_max_age/2 (instead of vacuum_freeze_min_age). We still ought to have the opportunity to wait for the cleanup lock for rather a long time -- if the XID consumption rate is so high that that isn't true, then we're doomed anyway.
All told, there seems to be a huge net reduction in risk with this design. -- Peter Geoghegan
On Wed, Jan 19, 2022 at 2:54 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On the other hand if that user is going to close that > > cursor after 10 minutes and open a new one in the same place 10 > > seconds later, the best thing to do is to keep FreezeLimit as low as > > possible, because the first time we wait for the pin to be released > > we're guaranteed to advance relfrozenxid within 10 minutes, whereas if > > we don't do that we may keep missing the brief windows in which no > > cursor is held for a very long time. But we have absolutely no way of > > knowing which of those things is going to happen on any particular > > system, or of estimating which one is more common in general. > > I agree with all that, and I think that this particular scenario is > the crux of the issue. Great, I'm glad we agree on that much. I would be interested in hearing what other people think about this scenario. > The first time this happens (and we don't get a cleanup lock), then we > will at least be able to set relfrozenxid to the exact oldest unfrozen > XID. So that'll already have bought us some wallclock time -- often a > great deal (why should the oldest XID on such a page be particularly > old?). Furthermore, there will often be many more VACUUMs before we > need to do an aggressive VACUUM -- each of these VACUUM operations is > an opportunity to freeze the oldest tuple that holds up cleanup. Or > maybe this XID is in a dead tuple, and so somebody's opportunistic > pruning operation does the right thing for us. Never underestimate the > power of dumb luck, especially in a situation where there are many > individual "trials", and we only have to get lucky once. > > If and when that doesn't work out, and we actually have to do an > anti-wraparound VACUUM, then something will have to give. Since > anti-wraparound VACUUMs are naturally confined to certain kinds of > tables/workloads with the patch series, we can now be pretty confident > that the problem really is with this one problematic heap page, with > the idle cursor. We could even verify this directly if we wanted to, > by noticing that the preexisting relfrozenxid is an exact match for > one XID on some can't-cleanup-lock page -- we could emit a WARNING > about the page/tuple if we wanted to. To return to my colorful analogy > from earlier, we now know that the patient almost certainly has a > brain tumor. > > What new risk is implied by delaying the wait like this? Very little, > I believe. Lets say we derive FreezeLimit from > autovacuum_freeze_max_age/2 (instead of vacuum_freeze_min_age). We > still ought to have the opportunity to wait for the cleanup lock for > rather a long time -- if the XID consumption rate is so high that that > isn't true, then we're doomed anyway. All told, there seems to be a > huge net reduction in risk with this design. I'm just being honest here when I say that I can't see any huge reduction in risk. Nor a huge increase in risk. It just seems speculative to me. If I knew something about the system or the workload, then I could say what would likely work out best on that system, but in the abstract I neither know nor understand how it's possible to know. My gut feeling is that it's going to make very little difference either way. People who never release their cursors or locks or whatever are going to be sad either way, and people who usually do will be happy either way. There's some in-between category of people who release sometimes but not too often for whom it may matter, possibly quite a lot. 
It also seems possible that one decision rather than another will make the happy people MORE happy, or the sad people MORE sad. For most people, though, I think it's going to be irrelevant. The fact that you seem to view the situation quite differently is a big part of what worries me here. At least one of us is missing something. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Jan 20, 2022 at 6:55 AM Robert Haas <robertmhaas@gmail.com> wrote: > Great, I'm glad we agree on that much. I would be interested in > hearing what other people think about this scenario. Agreed. > I'm just being honest here when I say that I can't see any huge > reduction in risk. Nor a huge increase in risk. It just seems > speculative to me. If I knew something about the system or the > workload, then I could say what would likely work out best on that > system, but in the abstract I neither know nor understand how it's > possible to know. I think that it's very hard to predict the timeline with a scenario like this -- no question. But I often imagine idealized scenarios like the one you brought up with cursors, with the intention of lowering the overall exposure to problems to the extent that that's possible; if it was obvious, we'd have fixed it by now already. I cannot think of any reason why making FreezeLimit into what I've been calling a backstop introduces any new risk, but I can think of ways in which it avoids risk. We shouldn't be waiting indefinitely for something totally outside our control or understanding, and so blocking all freezing and other maintenance on the table, until it's provably necessary. More fundamentally, freezing should be thought of as an overhead of storing tuples in heap blocks, as opposed to an overhead of transactions (that allocate XIDs). Meaning that FreezeLimit becomes almost an emergency thing, closely associated with aggressive anti-wraparound VACUUMs. > My gut feeling is that it's going to make very little difference > either way. People who never release their cursors or locks or > whatever are going to be sad either way, and people who usually do > will be happy either way. In a real world scenario, the rate at which XIDs are used could be very low. Buying a few hundred million more XIDs until the pain begins could amount to buying weeks or months for the user in practice. Plus they have visibility into the issue, in that they can potentially see exactly when they stopped being able to advance relfrozenxid by looking at the autovacuum logs. My thinking on vacuum_freeze_min_age has shifted very slightly. I now think that I'll probably need to keep it around, just so things like VACUUM FREEZE (which sets vacuum_freeze_min_age to 0 internally) continue to work. So maybe its default should be changed to -1, which is interpreted as "whatever autovacuum_freeze_max_age/2 is". But it should still be greatly deemphasized in user docs. -- Peter Geoghegan
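To illustrate the proposed -1 default concretely, here is a minimal standalone C sketch; resolve_freeze_min_age() is a hypothetical helper invented for illustration only, not code from the patch series, and the exact clamping behavior is an assumption:

#include <stdint.h>
#include <stdio.h>

/*
 * Illustrative only: a vacuum_freeze_min_age of -1 is taken to mean "derive
 * the value from autovacuum_freeze_max_age / 2"; explicit settings (such as
 * the 0 used internally by VACUUM FREEZE) are honored, but never allowed to
 * exceed that derived ceiling.
 */
static int64_t
resolve_freeze_min_age(int64_t vacuum_freeze_min_age,
                       int64_t autovacuum_freeze_max_age)
{
    int64_t derived = autovacuum_freeze_max_age / 2;

    if (vacuum_freeze_min_age < 0)
        return derived;
    return (vacuum_freeze_min_age < derived) ? vacuum_freeze_min_age : derived;
}

int
main(void)
{
    /* with the stock autovacuum_freeze_max_age of 200 million ... */
    printf("default (-1): %lld\n",
           (long long) resolve_freeze_min_age(-1, 200000000));   /* 100000000 */
    /* ... while VACUUM FREEZE still freezes everything eligible */
    printf("VACUUM FREEZE (0): %lld\n",
           (long long) resolve_freeze_min_age(0, 200000000));    /* 0 */
    return 0;
}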
On Thu, Jan 20, 2022 at 11:45 AM Peter Geoghegan <pg@bowt.ie> wrote: > My thinking on vacuum_freeze_min_age has shifted very slightly. I now > think that I'll probably need to keep it around, just so things like > VACUUM FREEZE (which sets vacuum_freeze_min_age to 0 internally) > continue to work. So maybe its default should be changed to -1, which > is interpreted as "whatever autovacuum_freeze_max_age/2 is". But it > should still be greatly deemphasized in user docs. I like that better, because it lets us retain an escape valve in case we should need it. I suggest that the documentation should say things like "The default is believed to be suitable for most use cases" or "We are not aware of a reason to change the default" rather than something like "There is almost certainly no good reason to change this" or "What kind of idiot are you, anyway?" :-) -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Jan 20, 2022 at 11:33 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Jan 20, 2022 at 11:45 AM Peter Geoghegan <pg@bowt.ie> wrote: > > My thinking on vacuum_freeze_min_age has shifted very slightly. I now > > think that I'll probably need to keep it around, just so things like > > VACUUM FREEZE (which sets vacuum_freeze_min_age to 0 internally) > > continue to work. So maybe its default should be changed to -1, which > > is interpreted as "whatever autovacuum_freeze_max_age/2 is". But it > > should still be greatly deemphasized in user docs. > > I like that better, because it lets us retain an escape valve in case > we should need it. I do see some value in that, too. Though it's not going to be a way of turning off the early freezing stuff, which seems unnecessary (though I do still have work to do on getting the overhead for that down). > I suggest that the documentation should say things > like "The default is believed to be suitable for most use cases" or > "We are not aware of a reason to change the default" rather than > something like "There is almost certainly no good reason to change > this" or "What kind of idiot are you, anyway?" :-) I will admit to having a big bias here: I absolutely *loathe* these GUCs. I really, really hate them. Consider how we have to include messy caveats about autovacuum_freeze_min_age when talking about autovacuum_vacuum_insert_scale_factor. Then there's the fact that you really cannot think about the rate of XID consumption intuitively -- it has at best a weak, unpredictable relationship with anything that users can understand, such as data stored or wall clock time. Then there are the problems with the equivalent MultiXact GUCs, which somehow, against all odds, are even worse: https://buttondown.email/nelhage/archive/notes-on-some-postgresql-implementation-details/ -- Peter Geoghegan
On Thu, 20 Jan 2022 at 17:01, Peter Geoghegan <pg@bowt.ie> wrote: > > Then there's the fact that you > really cannot think about the rate of XID consumption intuitively -- > it has at best a weak, unpredictable relationship with anything that > users can understand, such as data stored or wall clock time. This confuses me. "Transactions per second" is a headline database metric that lots of users actually focus on quite heavily -- rather too heavily imho. Ok, XID consumption is only a subset of transactions that are not read-only but that's a detail that's pretty easy to explain and users get pretty quickly. There are corner cases like transactions that look read-only but are actually read-write or transactions that consume multiple xids but complex systems are full of corner cases and people don't seem too surprised about these things. What I find confuses people much more is the concept of the oldestxmin. I think most of the autovacuum problems I've seen come from cases where autovacuum is happily kicking off useless vacuums because the oldestxmin hasn't actually advanced enough for them to do any useful work. -- greg
On Fri, Jan 21, 2022 at 12:07 PM Greg Stark <stark@mit.edu> wrote: > This confuses me. "Transactions per second" is a headline database > metric that lots of users actually focus on quite heavily -- rather > too heavily imho. But transactions per second is for the whole database, not for individual tables. It's also really a benchmarking thing, where the size and variety of transactions is fixed. With something like pgbench it actually is exactly the same thing, but such a workload is not at all realistic. Even BenchmarkSQL/TPC-C isn't like that, despite the fact that it is a fairly synthetic workload (it's just not super synthetic). > Ok, XID consumption is only a subset of transactions > that are not read-only but that's a detail that's pretty easy to > explain and users get pretty quickly. My point was mostly this: the number of distinct extant unfrozen tuple headers (and the range of the relevant XIDs) is generally highly unpredictable today. And the number of tuples we'll have to freeze to be able to advance relfrozenxid by a good amount is quite variable, in general. For example, if we bulk extend a relation as part of an ETL process, then the number of distinct XIDs could be as low as 1, even though we can expect a great deal of "freeze debt" that will have to be paid off at some point (with the current design, in the common case where the user doesn't account for this effect because they're not already an expert). There are other common cases that are not quite as extreme as that, that still have the same effect -- even an expert will find it hard or impossible to tune autovacuum_freeze_min_age for that. Another case of interest (that illustrates the general principle) is something like pgbench_tellers. We'll never have an aggressive VACUUM of the table with the patch, and we shouldn't ever need to freeze any tuples. But, owing to workload characteristics, we'll constantly be able to keep its relfrozenxid very current, because (even if we introduce skew) each individual row cannot go very long without being updated, allowing old XIDs to age out that way. There is also an interesting middle ground, where you get a mixture of both tendencies due to skew. The tuple that's most likely to get updated was the one that was just updated. How are you as a DBA ever supposed to tune autovacuum_freeze_min_age if tuples happen to be qualitatively different in this way? > What I find confuses people much more is the concept of the > oldestxmin. I think most of the autovacuum problems I've seen come > from cases where autovacuum is happily kicking off useless vacuums > because the oldestxmin hasn't actually advanced enough for them to do > any useful work. As it happens, the proposed log output won't use the term oldestxmin anymore -- I think that it makes sense to rename it to "removable cutoff". 
Here's an example:

LOG: automatic vacuum of table "regression.public.bmsql_oorder": index scans: 1
    pages: 0 removed, 317308 remain, 250258 skipped using visibility map (78.87% of total)
    tuples: 70 removed, 34105925 remain (6830471 newly frozen), 2528 are dead but not yet removable
    removable cutoff: 37574752, which is 230115 xids behind next
    new relfrozenxid: 35221275, which is 5219310 xids ahead of previous value
    index scan needed: 55540 pages from table (17.50% of total) had 3339809 dead item identifiers removed
    index "bmsql_oorder_pkey": pages: 144257 in total, 0 newly deleted, 0 currently deleted, 0 reusable
    index "bmsql_oorder_idx2": pages: 330083 in total, 0 newly deleted, 0 currently deleted, 0 reusable
    I/O timings: read: 7928.207 ms, write: 1386.662 ms
    avg read rate: 33.107 MB/s, avg write rate: 26.218 MB/s
    buffer usage: 220825 hits, 443331 misses, 351084 dirtied
    WAL usage: 576110 records, 364797 full page images, 2046767817 bytes
    system usage: CPU: user: 10.62 s, system: 7.56 s, elapsed: 104.61 s

Note also that I deliberately made the "new relfrozenxid" line that immediately follows (information that we haven't shown before now) similar, to highlight that they're now closely related concepts. Now if you VACUUM a table that is either empty or has only frozen tuples, VACUUM will set relfrozenxid to oldestxmin/removable cutoff. Internally, oldestxmin is the "starting point" for our final/target relfrozenxid for the table. We ratchet it back dynamically, whenever we see an older-than-current-target XID that cannot be immediately frozen (e.g., when we can't easily get a cleanup lock on the page). -- Peter Geoghegan
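A rough standalone C sketch of that ratcheting rule (all names are hypothetical, and wraparound-aware XID comparisons are deliberately ignored to keep the illustration short -- the real tracking would happen during per-page processing in vacuumlazy.c):

#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;   /* simplified; real code must handle XID wraparound */

/*
 * Start with the most optimistic target (OldestXmin, the "removable cutoff")
 * and ratchet it back for every unfrozen XID that will remain in the table
 * after this VACUUM -- for example an XID on a page whose cleanup lock could
 * not be acquired.  The final value is what relfrozenxid can safely become.
 */
static TransactionId
ratchet_relfrozenxid_target(TransactionId target, TransactionId unfrozen_xid)
{
    return (unfrozen_xid < target) ? unfrozen_xid : target;
}

int
main(void)
{
    TransactionId target = 37574752;    /* OldestXmin / removable cutoff */

    /* an older XID that had to be left behind unfrozen */
    target = ratchet_relfrozenxid_target(target, 35221275);
    printf("final relfrozenxid target: %u\n", target);   /* prints 35221275 */
    return 0;
}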
On Thu, Jan 20, 2022 at 2:00 PM Peter Geoghegan <pg@bowt.ie> wrote: > I do see some value in that, too. Though it's not going to be a way of > turning off the early freezing stuff, which seems unnecessary (though > I do still have work to do on getting the overhead for that down). Attached is v7, a revision that overhauls the algorithm that decides what to freeze. I'm now calling it block-driven freezing in the commit message. Also included is a new patch that makes VACUUM record zero free space in the FSM for an all-visible page, unless the total amount of free space happens to be greater than one half of BLCKSZ. The fact that I am now including this new FSM patch (v7-0006-*patch) may seem like a case of expanding the scope of something that could well do without it. But hear me out! It's true that the new FSM patch isn't essential. I'm including it now because it seems relevant to the approach taken with block-driven freezing -- it may even make my general approach easier to understand. The new approach to freezing is to freeze every tuple on a block that is about to be set all-visible (and thus set it all-frozen too), or to not freeze anything on the page at all (at least until one XID gets really old, which should be rare). This approach has all the benefits that I described upthread, and a new benefit: it effectively encourages the application to allow pages to "become settled". The main difference in how we freeze here (relative to v6 of the patch) is that I'm *not* freezing a page just because it was dirtied/pruned. I now think about freezing as an essentially page-level thing, barring edge cases where we have to freeze individual tuples, just because the XIDs really are getting old (it's an edge case when we can't freeze all the tuples together due to a mix of new and old, which is something we specifically set out to avoid now).

Freezing whole pages
====================

When VACUUM sees that all remaining/unpruned tuples on a page are all-visible, it isn't just important because of cost control considerations. It's deeper than that. It's also treated as a tentative signal from the application itself, about the data itself. Which is: this page looks "settled" -- it may never be updated again, but if there is an update it likely won't change too much about the whole page. Also, if the page is ever updated in the future, it's likely that that will happen at a much later time than you should expect for those *other* nearby pages, that *don't* appear to be settled. And so VACUUM infers that the page is *qualitatively* different to these other nearby pages. VACUUM therefore makes it hard (though not impossible) for future inserts or updates to disturb these settled pages, via this FSM behavior -- it is short sighted to just see the space remaining on the page as free space, equivalent to any other. This holistic approach seems to work well for TPC-C/BenchmarkSQL, and perhaps even in general. More on TPC-C below. This is not unlike the approach taken by other DB systems, where free space management is baked into concurrency control, and the concept of physical data independence as we know it from Postgres never really existed. My approach also seems related to the concept of a "tenured generation", which is key to generational garbage collection. The whole basis of generational garbage collection is the generational hypothesis: "most objects die young".
This is an empirical observation about how applications written in GC'd programming languages actually behave, not a rigorous principle, and yet in practice it appears to always hold. Intuitively, it seems to me like the hypothesis must work in practice because if it didn't then a counterexample nemesis application's behavior would be totally chaotic, in every way. Theoretically possible, but of no real concern, since the program makes zero practical sense *as an actual program*. A Java program must make sense to *somebody* (at least the person that wrote it), which, it turns out, helpfully constrains the space of possibilities that any industrial strength GC implementation needs to handle well. The same principles seem to apply here, with VACUUM. Grouping logical rows into pages that become their "permanent home until further notice" may be somewhat arbitrary, at first, but that doesn't mean it won't end up sticking. Just like with generational garbage collection, where the application isn't expected to instruct the GC about its plans for memory that it allocates, that can nevertheless be usefully organized into distinct generations through an adaptive process.

Second order effects
====================

Relating the FSM to page freezing/all-visible setting makes much more sense if you consider the second order effects. There is bound to be competition for free space among backends that access the free space map. By *not* freezing a page during VACUUM because it looks unsettled, we make its free space available in the traditional way instead. It follows that unsettled pages (in tables with lots of updates) are now the only place that backends that need more free space from the FSM can look -- unsettled pages therefore become a hot commodity, freespace-wise. A page that initially appeared "unsettled", that went on to become settled in this newly competitive environment might have that happen by pure chance -- but probably not. It *could* happen by chance, of course -- in which case the page will get dirtied again, and the cycle continues, for now. There will be further opportunities to figure it out, and freezing the tuples on the page "prematurely" still has plenty of benefits. Locality matters a lot, obviously. The goal with the FSM stuff is merely to make it *possible* for pages to settle naturally, to the extent that we can. We really just want to avoid hindering a naturally occurring process -- we want to avoid destroying naturally occurring locality. We must be willing to accept some cost for that. Even if it takes a few attempts for certain pages, constraining the application's choice of where to get free space from (can't be a page marked all-visible) allows pages to *systematically* become settled over time. The application is in charge, really -- not VACUUM. This is already the case, whether we like it or not. VACUUM needs to learn to live in that reality, rather than fighting it. When VACUUM considers a page settled, and the physical page still has a relatively large amount of free space (say 45% of BLCKSZ, a borderline case in the new FSM patch), "losing" so much free space certainly is unappealing. We set the free space to 0 in the free space map all the same, because we're cutting our losses at that point. While the exact threshold I've proposed is tentative, the underlying theory seems pretty sound to me.
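To make the two page-level policies concrete, here is a rough C sketch; the struct and function names are illustrative assumptions rather than code from the v7 patches, where the real logic would live in and around lazy_scan_prune and the FSM update path:

#include <stdbool.h>
#include <stddef.h>

#define BLCKSZ 8192                 /* default Postgres block size */

typedef struct PageVacInfo
{
    bool   all_visible;             /* every remaining tuple visible to all? */
    bool   has_very_old_xid;        /* some XID old enough to force freezing? */
    size_t free_space;              /* usable free space on the page, bytes */
} PageVacInfo;

/*
 * Block-driven freezing: freeze all of a page's tuples when the page is
 * about to be set all-visible (so it can be set all-frozen at the same
 * time); otherwise freeze nothing, unless some XID has become old enough
 * that it must be frozen anyway (expected to be rare).
 */
bool
page_triggers_freezing(const PageVacInfo *page)
{
    return page->all_visible || page->has_very_old_xid;
}

/*
 * FSM heuristic: advertise zero free space for a "settled" (all-visible)
 * page, unless it still has more than half a block free -- in which case
 * it remains fair game for new tuples.
 */
size_t
free_space_to_record(const PageVacInfo *page)
{
    if (page->all_visible && page->free_space <= BLCKSZ / 2)
        return 0;
    return page->free_space;
}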
The BLCKSZ/2 cutoff (and the way that it extends the general rules for whole-page freezing) is intended to catch pages that are qualitatively different, as well as quantitatively different. It is a balancing act, between not wasting space, and the risk of systemic problems involving excessive amounts of non-HOT updates that must move a successor version to another page. It's possible that a higher cutoff (for example a cutoff of 80% of BLCKSZ, not 50%) will actually lead to *worse* space utilization, in addition to the downsides from fragmentation -- it's far from a simple trade-off. (Not that you should believe that 50% is special, it's just a starting point for me.)

TPC-C
=====

I'm going to talk about a benchmark that ran throughout the week, starting on Monday. Each run lasted 24 hours, and there were 2 runs in total, for both the patch and for master/baseline. So this benchmark lasted 4 days, not including the initial bulk loading, with databases that were over 450GB in size by the time I was done (that's 450GB+ for both the patch and master). Benchmarking for days at a time is pretty inconvenient, but it seems necessary to see certain effects in play. We need to wait until the baseline/master case starts to have anti-wraparound VACUUMs with default, realistic settings, which just takes days and days. I make available all of my data for the benchmark in question, which is way more information than anybody is likely to want -- I dump anything that even might be useful from the system views in an automated way. There are html reports for all four of the 24 hour long runs. Google drive link: https://drive.google.com/drive/folders/1A1g0YGLzluaIpv-d_4o4thgmWbVx3LuR?usp=sharing While the patch did well overall, and I will get to the particulars towards the end of the email, I want to start with what I consider to be the important part: the user/admin experience with VACUUM, and VACUUM's performance stability. This is about making VACUUM less scary. As I've said several times now, with an append-only table like pgbench_history we see a consistent pattern where relfrozenxid is set to a value very close to the same VACUUM's OldestXmin value (even precisely equal to OldestXmin) during each VACUUM operation, again and again, forever -- that case is easy to understand and appreciate, and has already been discussed. Now (with v7's new approach to freezing), a related pattern can be seen in the case of the two big, troublesome TPC-C tables, the orders and order lines tables. To recap, these tables are somewhat like the history table, in that new orders insert into both tables, again and again, forever. But they also have one huge difference from simple append-only tables, which is the source of most of our problems with TPC-C. The difference is: there are also delayed, correlated updates of each row from each table. Exactly one such update per row for both tables, which takes place hours after each order's insert, when the earlier order is processed by TPC-C's delivery transaction. In the long run we need the data to age out and not get re-dirtied, as the table grows and grows indefinitely, much like with a simple append-only table. At the same time, we don't want to have poor free space management for these deferred updates. It's adversarial, sort of, but in a way that is grounded in reality. With the order and order lines tables, relfrozenxid tends to be advanced up to the OldestXmin used by the *previous* VACUUM operation -- an unmistakable pattern.
I'll show you all of the autovacuum log output for the orders table during the second 24 hour long benchmark run:

2022-01-27 01:46:27 PST LOG: automatic vacuum of table "regression.public.bmsql_oorder": index scans: 1
    pages: 0 removed, 1205349 remain, 887225 skipped using visibility map (73.61% of total)
    tuples: 253872 removed, 134182902 remain (26482225 newly frozen), 27193 are dead but not yet removable
    removable cutoff: 243783407, older by 728844 xids when operation ended
    new relfrozenxid: 215400514, which is 26840669 xids ahead of previous value
    ...

2022-01-27 05:54:39 PST LOG: automatic vacuum of table "regression.public.bmsql_oorder": index scans: 1
    pages: 0 removed, 1345302 remain, 993924 skipped using visibility map (73.88% of total)
    tuples: 261656 removed, 150022816 remain (29757570 newly frozen), 29216 are dead but not yet removable
    removable cutoff: 276319403, older by 826850 xids when operation ended
    new relfrozenxid: 243838706, which is 28438192 xids ahead of previous value
    ...

2022-01-27 10:37:24 PST LOG: automatic vacuum of table "regression.public.bmsql_oorder": index scans: 1
    pages: 0 removed, 1504707 remain, 1110002 skipped using visibility map (73.77% of total)
    tuples: 316086 removed, 167990124 remain (33754949 newly frozen), 33326 are dead but not yet removable
    removable cutoff: 313328445, older by 987732 xids when operation ended
    new relfrozenxid: 276309397, which is 32470691 xids ahead of previous value
    ...

2022-01-27 15:49:51 PST LOG: automatic vacuum of table "regression.public.bmsql_oorder": index scans: 1
    pages: 0 removed, 1680649 remain, 1250525 skipped using visibility map (74.41% of total)
    tuples: 343946 removed, 187739072 remain (37346315 newly frozen), 38037 are dead but not yet removable
    removable cutoff: 354149019, older by 1222160 xids when operation ended
    new relfrozenxid: 313332249, which is 37022852 xids ahead of previous value
    ...

2022-01-27 21:55:34 PST LOG: automatic vacuum of table "regression.public.bmsql_oorder": index scans: 1
    pages: 0 removed, 1886336 remain, 1403800 skipped using visibility map (74.42% of total)
    tuples: 389748 removed, 210899148 remain (43453900 newly frozen), 45802 are dead but not yet removable
    removable cutoff: 401955979, older by 1458514 xids when operation ended
    new relfrozenxid: 354134615, which is 40802366 xids ahead of previous value

This mostly speaks for itself, I think. (Anybody that's interested can drill down to the logs for order lines, which look similar.) The effect we see with the order/order lines tables isn't perfectly reliable. Actually, it depends on how you define it. It's possible that we won't be able to acquire a cleanup lock on the wrong page at the wrong time, and as a result fail to advance relfrozenxid by the usual amount, once. But that effect appears to be both rare and of no real consequence. One could reasonably argue that we never fell behind, because we still did 99.9%+ of the required freezing -- we just didn't immediately get to advance relfrozenxid, because of a temporary hiccup on one page. We will still advance relfrozenxid by a small amount. Sometimes it'll be by only hundreds of XIDs when millions or tens of millions of XIDs were expected. Once we advance it by some amount, we can reasonably suppose that the issue was just a hiccup.
On the master branch, the first 24 hour period has no anti-wraparound VACUUMs, and so looking at that first 24 hour period gives you some idea of how much worse off we are in the short term -- the freezing stuff won't really start to pay for itself until the second 24 hour run with these mostly-default freeze related settings. The second 24 hour run on master almost exclusively has anti-wraparound VACUUMs for all the largest tables, though -- all at the same time. And not just the first time, either! This causes big spikes that the patch totally avoids, simply by avoiding anti-wraparound VACUUMs. With the patch, there are no anti-wraparound VACUUMs, barring tables that will never be vacuumed for any other reason, where it's still inevitable, limited to the stock table and customers table. It was a mistake for me to emphasize "no anti-wraparound VACUUMs outside pathological cases" before now. I stand by those statements as accurate, but anti-wraparound VACUUMs should not have been given so much emphasis. Let's assume that somehow we really were to get an anti-wraparound VACUUM against one of the tables where that's just not expected, like this orders table -- let's suppose that I got that part wrong, in some way. It would hardly matter at all! We'd still have avoided the freezing cliff during this anti-wraparound VACUUM, which is the real benefit. Chances are good that we needed to VACUUM anyway, just to clean any very old garbage tuples up -- relfrozenxid is now predictive of the age of the oldest garbage tuples, which might have been a good enough reason to VACUUM anyway. The stampede of anti-wraparound VACUUMs against multiple tables seems like it would still be fixed, since relfrozenxid now actually tells us something about the table (as opposed to telling us only about what the user set vacuum_freeze_min_age to). The only concerns that this leaves for me are all usability related, and not of primary importance (e.g. do we really need to make anti-wraparound VACUUMs non-cancelable now?).

TPC-C raw numbers
=================

The single most important number for the patch might be the decrease in both buffer misses and buffer hits, which I believe is caused by the patch being able to use index-only scans much more effectively (with modifications to BenchmarkSQL to improve the indexing strategy [1]). This is quite clear from pg_stat_database state at the end.

Patch:

 xact_commit   | 440,515,133
 xact_rollback | 1,871,142
 blks_read     | 3,754,614,188
 blks_hit      | 174,551,067,731
 tup_returned  | 341,222,714,073
 tup_fetched   | 124,797,772,450
 tup_inserted  | 2,900,197,655
 tup_updated   | 4,549,948,092
 tup_deleted   | 165,222,130

Here is the same pg_stat_database info for master:

 xact_commit   | 440,402,505
 xact_rollback | 1,871,536
 blks_read     | 4,002,682,052
 blks_hit      | 283,015,966,386
 tup_returned  | 346,448,070,798
 tup_fetched   | 237,052,965,901
 tup_inserted  | 2,899,735,420
 tup_updated   | 4,547,220,642
 tup_deleted   | 165,103,426

The blks_read is x0.938 of master/baseline for the patch -- not bad. More importantly, blks_hit is x0.616 for the patch -- quite a significant reduction in a key cost. Note that we start to get this particular benefit for individual read queries pretty early on -- avoiding unsetting visibility map bits like this matters right from the start. In TPC-C terms, the ORDER_STATUS transaction will have much lower latency, particularly tail latency, since it uses index-only scans to good effect.
There are 5 distinct transaction types from the benchmark, and it isn't unusual for an improvement to show up in only one particular transaction type -- so you often have to drill down and look at the full html report. The latency situation is improved across the board with the patch, by quite a bit, especially after the second run. This server can sustain much more throughput than the TPC-C spec formally permits -- I've increased the benchmark's TPM rate to 10x the spec-legal limit -- so query latency is the main TPC-C metric of interest here.

WAL
===

Then there's the WAL overhead. Like practically any workload, the WAL consumption for this workload is dominated by FPIs, despite the fact that I've tuned checkpoints reasonably well. The patch *does* write more WAL in the first set of runs -- it writes a total of ~3.991 TiB, versus ~3.834 TiB for master. In other words, during the first 24 hour run (before the trouble with the anti-wraparound freeze cliff even begins for the master branch), the patch writes x1.040 as much WAL in total. The good news is that the patch comes out ahead by the end, after the second set of 24 hour runs. By the time the second run finishes, it's 8.332 TiB of WAL total for the patch, versus 8.409 TiB for master, putting the patch at x0.990 in the end -- a small improvement. I believe that most of the WAL doesn't get generated by VACUUM here anyway -- opportunistic pruning works well for this workload.

I expect to be able to commit the first 2 patches in a couple of weeks, since that won't need to block on making the case for the final 3 or 4 patches from the patch series. The early stuff is mostly just refactoring work that removes needless differences between aggressive and non-aggressive VACUUM operations. It makes a lot of sense on its own.

[1] https://github.com/pgsql-io/benchmarksql/pull/16

--
Peter Geoghegan
Attachment
- v7-0004-Loosen-coupling-between-relfrozenxid-and-tuple-fr.patch
- v7-0005-Make-block-level-characteristics-drive-freezing.patch
- v7-0006-Add-all-visible-FSM-heuristic.patch
- v7-0003-Consolidate-VACUUM-xid-cutoff-logic.patch
- v7-0002-Add-VACUUM-instrumentation-for-scanned-pages-relf.patch
- v7-0001-Simplify-lazy_scan_heap-s-handling-of-scanned-pag.patch
On Sat, Jan 29, 2022 at 11:43 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Thu, Jan 20, 2022 at 2:00 PM Peter Geoghegan <pg@bowt.ie> wrote: > > I do see some value in that, too. Though it's not going to be a way of > > turning off the early freezing stuff, which seems unnecessary (though > > I do still have work to do on getting the overhead for that down). > > Attached is v7, a revision that overhauls the algorithm that decides > what to freeze. I'm now calling it block-driven freezing in the commit > message. Also included is a new patch, that makes VACUUM record zero > free space in the FSM for an all-visible page, unless the total amount > of free space happens to be greater than one half of BLCKSZ. > > The fact that I am now including this new FSM patch (v7-0006-*patch) > may seem like a case of expanding the scope of something that could > well do without it. But hear me out! It's true that the new FSM patch > isn't essential. I'm including it now because it seems relevant to the > approach taken with block-driven freezing -- it may even make my > general approach easier to understand. Without having looked at the latest patches, there was something in the back of my mind while following the discussion upthread -- the proposed opportunistic freezing made a lot more sense if the earlier-proposed open/closed pages concept was already available. > Freezing whole pages > ==================== > It's possible that a higher cutoff (for example a cutoff of 80% of > BLCKSZ, not 50%) will actually lead to *worse* space utilization, in > addition to the downsides from fragmentation -- it's far from a simple > trade-off. (Not that you should believe that 50% is special, it's just > a starting point for me.) How was the space utilization with the 50% cutoff in the TPC-C test? > TPC-C raw numbers > ================= > > The single most important number for the patch might be the decrease > in both buffer misses and buffer hits, which I believe is caused by > the patch being able to use index-only scans much more effectively > (with modifications to BenchmarkSQL to improve the indexing strategy > [1]). This is quite clear from pg_stat_database state at the end. > > Patch: > blks_hit | 174,551,067,731 > tup_fetched | 124,797,772,450 > Here is the same pg_stat_database info for master: > blks_hit | 283,015,966,386 > tup_fetched | 237,052,965,901 That's impressive. -- John Naylor EDB: http://www.enterprisedb.com
On Fri, Feb 4, 2022 at 2:00 PM John Naylor <john.naylor@enterprisedb.com> wrote: > Without having looked at the latest patches, there was something in > the back of my mind while following the discussion upthread -- the > proposed opportunistic freezing made a lot more sense if the > earlier-proposed open/closed pages concept was already available. Yeah, sorry about that. The open/closed pages concept is still something I plan on working on. My prototype (which I never posted to the list) will be rebased, and I'll try to target Postgres 16. > > Freezing whole pages > > ==================== > > > It's possible that a higher cutoff (for example a cutoff of 80% of > > BLCKSZ, not 50%) will actually lead to *worse* space utilization, in > > addition to the downsides from fragmentation -- it's far from a simple > > trade-off. (Not that you should believe that 50% is special, it's just > > a starting point for me.) > > How was the space utilization with the 50% cutoff in the TPC-C test? The picture was mixed. To get the raw numbers, compare pg-relation-sizes-after-patch-2.out and pg-relation-sizes-after-master-2.out files from the drive link I provided (to repeat, get them from https://drive.google.com/drive/u/1/folders/1A1g0YGLzluaIpv-d_4o4thgmWbVx3LuR) Highlights: the largest table (the bmsql_order_line table) had a total size of x1.006 relative to master, meaning that we did slightly worse there. However, the index on the same table was slightly smaller instead, probably because reducing heap fragmentation tends to make the index deletion stuff work a bit better than before. Certain small tables (bmsql_district and bmsql_warehouse) were actually significantly smaller (less than half their size on master), probably just because the patch can reliably remove LP_DEAD items from heap pages, even when a cleanup lock isn't available. The bmsql_new_order table was quite a bit larger, but it's not that large anyway (1250 MB on master at the very end, versus 1433 MB with the patch). This is a clear trade-off, since we get much less fragmentation in the same table (as evidenced by the VACUUM output, where there are fewer pages with any LP_DEAD items per VACUUM with the patch). The workload for that table is characterized by inserting new orders together, and deleting the same orders as a group later on. So we're bound to pay a cost in space utilization to lower the fragmentation. > > blks_hit | 174,551,067,731 > > tup_fetched | 124,797,772,450 > > > Here is the same pg_stat_database info for master: > > > blks_hit | 283,015,966,386 > > tup_fetched | 237,052,965,901 > > That's impressive. Thanks! It's still possible to get a big improvement like that with something like TPC-C because there are certain behaviors that are clearly suboptimal -- once you look at the details of the workload, and compare an imaginary ideal to the actual behavior of the system. In particular, there is really only one way that the free space management can work for the two big tables that will perform acceptably -- the orders have to be stored in the same place to begin with, and stay in the same place forever (at least to the extent that that's possible). -- Peter Geoghegan
On Sat, Jan 29, 2022 at 11:43 PM Peter Geoghegan <pg@bowt.ie> wrote: > When VACUUM sees that all remaining/unpruned tuples on a page are > all-visible, it isn't just important because of cost control > considerations. It's deeper than that. It's also treated as a > tentative signal from the application itself, about the data itself. > Which is: this page looks "settled" -- it may never be updated again, > but if there is an update it likely won't change too much about the > whole page. While I agree that there's some case to be made for leaving settled pages well enough alone, your criterion for settled seems pretty much accidental. Imagine a system where there are two applications running, A and B. Application A runs all the time and all the transactions which it performs are short. Therefore, when a certain page is not modified by transaction A for a short period of time, the page will become all-visible and will be considered settled. Application B runs once a month and performs various transactions all of which are long, perhaps on a completely separate set of tables. While application B is running, pages take longer to settle not only for application B but also for application A. It doesn't make sense to say that the application is in control of the behavior when, in reality, it may be some completely separate application that is controlling the behavior. > The application is in charge, really -- not VACUUM. This is already > the case, whether we like it or not. VACUUM needs to learn to live in > that reality, rather than fighting it. When VACUUM considers a page > settled, and the physical page still has a relatively large amount of > free space (say 45% of BLCKSZ, a borderline case in the new FSM > patch), "losing" so much free space certainly is unappealing. We set > the free space to 0 in the free space map all the same, because we're > cutting our losses at that point. While the exact threshold I've > proposed is tentative, the underlying theory seems pretty sound to me. > The BLCKSZ/2 cutoff (and the way that it extends the general rules for > whole-page freezing) is intended to catch pages that are qualitatively > different, as well as quantitatively different. It is a balancing act, > between not wasting space, and the risk of systemic problems involving > excessive amounts of non-HOT updates that must move a successor > version to another page. I can see that this could have significant advantages under some circumstances. But I think it could easily be far worse under other circumstances. I mean, you can have workloads where you do some amount of read-write work on a table and then go read only and sequential scan it an infinite number of times. An algorithm that causes the table to be smaller at the point where we switch to read-only operations, even by a modest amount, wins infinitely over anything else. But even if you have no change in the access pattern, is it a good idea to allow the table to be, say, 5% larger if it means that correlated data is colocated? In general, probably yes. If that means that the table fails to fit in shared_buffers instead of fitting, no. If that means that the table fails to fit in the OS cache instead of fitting, definitely no. And to me, that kind of effect is why it's hard to gain much confidence in regards to stuff like this via laboratory testing. I mean, I'm glad you're doing such tests. 
But in a laboratory test, you tend not to have things like a sudden and complete change in the workload, or a random other application sometimes sharing the machine, or only being on the edge of running out of memory. I think in general people tend to avoid such things in benchmarking scenarios, but even if you include stuff like this, it's hard to know what to include that would be representative of real life, because just about anything *could* happen in real life. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Feb 4, 2022 at 2:45 PM Robert Haas <robertmhaas@gmail.com> wrote: > While I agree that there's some case to be made for leaving settled > pages well enough alone, your criterion for settled seems pretty much > accidental. I fully admit that I came up with the FSM heuristic with TPC-C in mind. But you have to start somewhere. Fortunately, the main benefits of this patch series (avoiding the freeze cliff during anti-wraparound VACUUMs, often avoiding anti-wraparound VACUUMs altogether) don't depend on the experimental FSM patch at all. I chose to post that now because it seemed to help with my more general point about qualitatively different pages, and freezing at the page level. > Imagine a system where there are two applications running, > A and B. Application A runs all the time and all the transactions > which it performs are short. Therefore, when a certain page is not > modified by transaction A for a short period of time, the page will > become all-visible and will be considered settled. Application B runs > once a month and performs various transactions all of which are long, > perhaps on a completely separate set of tables. While application B is > running, pages take longer to settle not only for application B but > also for application A. It doesn't make sense to say that the > application is in control of the behavior when, in reality, it may be > some completely separate application that is controlling the behavior. Application B will already block pruning by VACUUM operations against application A's table, and so effectively blocks recording of the resultant free space in the FSM in your scenario. And so application A and application B should be considered the same application already. That's just how VACUUM works. VACUUM isn't a passive observer of the system -- it's another participant. It both influences and is influenced by almost everything else in the system. > I can see that this could have significant advantages under some > circumstances. But I think it could easily be far worse under other > circumstances. I mean, you can have workloads where you do some amount > of read-write work on a table and then go read only and sequential > scan it an infinite number of times. An algorithm that causes the > table to be smaller at the point where we switch to read-only > operations, even by a modest amount, wins infinitely over anything > else. But even if you have no change in the access pattern, is it a > good idea to allow the table to be, say, 5% larger if it means that > correlated data is colocated? In general, probably yes. If that means > that the table fails to fit in shared_buffers instead of fitting, no. > If that means that the table fails to fit in the OS cache instead of > fitting, definitely no. 5% larger seems like a lot more than would be typical, based on what I've seen. I don't think that the regression in this scenario can be characterized as "infinitely worse", or anything like it. On a long enough timeline, the potential upside of something like this is nearly unlimited -- it could avoid a huge amount of write amplification. But the potential downside seems to be small and fixed -- which is the point (bounding the downside). The mere possibility of getting that big benefit (avoiding the costs from heap fragmentation) is itself a benefit, even when it turns out not to pay off in your particular case. It can be seen as insurance.
> And to me, that kind of effect is why it's hard to gain much > confidence in regards to stuff like this via laboratory testing. I > mean, I'm glad you're doing such tests. But in a laboratory test, you > tend not to have things like a sudden and complete change in the > workload, or a random other application sometimes sharing the machine, > or only being on the edge of running out of memory. I think in general > people tend to avoid such things in benchmarking scenarios, but even > if include stuff like this, it's hard to know what to include that > would be representative of real life, because just about anything > *could* happen in real life. Then what could you have confidence in? -- Peter Geoghegan
On Fri, Feb 4, 2022 at 3:31 PM Peter Geoghegan <pg@bowt.ie> wrote: > Application B will already block pruning by VACUUM operations against > application A's table, and so effectively blocks recording of the > resultant free space in the FSM in your scenario. And so application A > and application B should be considered the same application already. > That's just how VACUUM works. Sure ... but that also sucks. If we consider application A and application B to be the same application, then we're basing our decision about what to do on information that is inaccurate. > 5% larger seems like a lot more than would be typical, based on what > I've seen. I don't think that the regression in this scenario can be > characterized as "infinitely worse", or anything like it. On a long > enough timeline, the potential upside of something like this is nearly > unlimited -- it could avoid a huge amount of write amplification. But > the potential downside seems to be small and fixed -- which is the > point (bounding the downside). The mere possibility of getting that > big benefit (avoiding the costs from heap fragmentation) is itself a > benefit, even when it turns out not to pay off in your particular > case. It can be seen as insurance. I don't see it that way. There are cases where avoiding writes is better, and cases where trying to cram everything into the fewest possible pages is better. With the right test case you can make either strategy look superior. What I think your test case has going for it is that it is similar to something that a lot of people, really a ton of people, actually do with PostgreSQL. However, it's not going to be an accurate model of what everybody does, and therein lies some element of danger. > Then what could you have confidence in? Real-world experience. Which is hard to get if we don't ever commit any patches, but a good argument for (a) having them tested by multiple different hackers who invent test cases independently and (b) some configurability where we can reasonably include it, so that if anyone does experience problems they have an escape. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Feb 4, 2022 at 4:18 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Feb 4, 2022 at 3:31 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Application B will already block pruning by VACUUM operations against > > application A's table, and so effectively blocks recording of the > > resultant free space in the FSM in your scenario. And so application A > > and application B should be considered the same application already. > > That's just how VACUUM works. > > Sure ... but that also sucks. If we consider application A and > application B to be the same application, then we're basing our > decision about what to do on information that is inaccurate. I agree that it sucks, but I don't think that it's particularly relevant to the FSM prototype patch that I included with v7 of the patch series. A heap page cannot be considered "closed" (either in the specific sense from the patch, or in any informal sense) when it has recently dead tuples. At some point we should invent a fallback path for pruning, that migrates recently dead tuples to some other subsidiary structure, retaining only forwarding information in the heap page. But even that won't change what I just said about closed pages (it'll just make it easier to return and fix things up later on). > I don't see it that way. There are cases where avoiding writes is > better, and cases where trying to cram everything into the fewest > possible ages is better. With the right test case you can make either > strategy look superior. The cost of reads is effectively much lower than writes with modern SSDs, in TCO terms. Plus when a FSM strategy like the one from the patch does badly according to a naive measure such as total table size, that in itself doesn't mean that we do worse with reads. In fact, it's quite the opposite. The benchmark showed that v7 of the patch did very slightly worse on overall space utilization, but far, far better on reads. In fact, the benefits for reads were far in excess of any efficiency gains for writes/with WAL. The greatest bottleneck is almost always latency on modern hardware [1]. It follows that keeping logically related data grouped together is crucial. Far more important than potentially using very slightly more space. The story I wanted to tell with the FSM patch was about open and closed pages being the right long term direction. More generally, we should emphasize managing page-level costs, and deemphasize managing tuple-level costs, which are much less meaningful. > What I think your test case has going for it > is that it is similar to something that a lot of people, really a ton > of people, actually do with PostgreSQL. However, it's not going to be > an accurate model of what everybody does, and therein lies some > element of danger. No question -- agreed. > > Then what could you have confidence in? > > Real-world experience. Which is hard to get if we don't ever commit > any patches, but a good argument for (a) having them tested by > multiple different hackers who invent test cases independently and (b) > some configurability where we can reasonably include it, so that if > anyone does experience problems they have an escape. I agree. [1] https://dl.acm.org/doi/10.1145/1022594.1022596 -- Peter Geoghegan
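To make the heuristic under discussion concrete, here is a minimal sketch of the v7-0006 rule as described in this thread (illustrative names and a hard-coded 8KB block size -- not the actual patch code): an all-visible page advertises zero free space in the FSM, unless the page is still more than half empty.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define TOY_BLCKSZ 8192

/* Free space that VACUUM would advertise in the FSM for a heap page */
static size_t
advertised_free_space(bool all_visible, size_t actual_free)
{
    if (all_visible && actual_free <= TOY_BLCKSZ / 2)
        return 0;               /* "settled" page: keep new tuples away */

    return actual_free;
}

int
main(void)
{
    /* ~45% of the block free on an all-visible page: treated as closed */
    printf("%zu\n", advertised_free_space(true, 3686));
    /* ~60% free: still advertised, even though the page is all-visible */
    printf("%zu\n", advertised_free_space(true, 4915));
    /* not all-visible: unchanged behavior */
    printf("%zu\n", advertised_free_space(false, 3686));
    return 0;
}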
On Wed, 15 Dec 2021 at 15:30, Peter Geoghegan <pg@bowt.ie> wrote: > > My emphasis here has been on making non-aggressive VACUUMs *always* > advance relfrozenxid, outside of certain obvious edge cases. And so > with all the patches applied, up to and including the opportunistic > freezing patch, every autovacuum of every table manages to advance > relfrozenxid during benchmarking -- usually to a fairly recent value. > I've focussed on making aggressive VACUUMs (especially anti-wraparound > autovacuums) a rare occurrence, for truly exceptional cases (e.g., > user keeps canceling autovacuums, maybe due to automated script that > performs DDL). That has taken priority over other goals, for now. While I've seen all the above cases triggering anti-wraparound vacuums, by far the majority of the cases are not of these pathological forms. By far the majority of anti-wraparound vacuums are triggered by tables that are very large and so don't trigger regular vacuums for "long periods" of time and consistently hit the anti-wraparound threshold first. There's nothing limiting how long "long periods" is and nothing tying it to the rate of xid consumption. It's quite common to have some *very* large mostly static tables in databases that have other tables that are *very* busy. The worst I've seen is a table that took 36 hours to vacuum in a database that consumed about a billion transactions per day... That's extreme but these days it's quite common to see tables that get anti-wraparound vacuums every week or so despite having < 1% modified tuples. And databases are only getting bigger and transaction rates faster... -- greg
On Fri, Feb 4, 2022 at 10:21 PM Greg Stark <stark@mit.edu> wrote: > On Wed, 15 Dec 2021 at 15:30, Peter Geoghegan <pg@bowt.ie> wrote: > > My emphasis here has been on making non-aggressive VACUUMs *always* > > advance relfrozenxid, outside of certain obvious edge cases. And so > > with all the patches applied, up to and including the opportunistic > > freezing patch, every autovacuum of every table manages to advance > > relfrozenxid during benchmarking -- usually to a fairly recent value. > > I've focussed on making aggressive VACUUMs (especially anti-wraparound > > autovacuums) a rare occurrence, for truly exceptional cases (e.g., > > user keeps canceling autovacuums, maybe due to automated script that > > performs DDL). That has taken priority over other goals, for now. > > While I've seen all the above cases triggering anti-wraparound cases > by far the majority of the cases are not of these pathological forms. Right - it's practically inevitable that you'll need an anti-wraparound VACUUM to advance relfrozenxid right now. Technically it's possible to advance relfrozenxid in any VACUUM, but in practice it just never happens on a large table. You only need to get unlucky with one heap page, either by failing to get a cleanup lock, or (more likely) by setting even one single page all-visible but not all-frozen just once (once in any VACUUM that takes place between anti-wraparound VACUUMs). > By far the majority of anti-wraparound vacuums are triggered by tables > that are very large and so don't trigger regular vacuums for "long > periods" of time and consistently hit the anti-wraparound threshold > first. autovacuum_vacuum_insert_scale_factor can help with this on 13 and 14, but only if you tune autovacuum_freeze_min_age with that goal in mind. Which probably doesn't happen very often. > There's nothing limiting how long "long periods" is and nothing tying > it to the rate of xid consumption. It's quite common to have some > *very* large mostly static tables in databases that have other tables > that are *very* busy. > > The worst I've seen is a table that took 36 hours to vacuum in a > database that consumed about a billion transactions per day... That's > extreme but these days it's quite common to see tables that get > anti-wraparound vacuums every week or so despite having < 1% modified > tuples. And databases are only getting bigger and transaction rates > faster... Sounds very much like what I've been calling the freezing cliff. An anti-wraparound VACUUM throws things off by suddenly dirtying many more pages than the expected amount for a VACUUM against the table, despite there being no change in workload characteristics. If you just had to remove the dead tuples in such a table, then it probably wouldn't matter if it happened earlier than expected. -- Peter Geoghegan
On Fri, Feb 4, 2022 at 10:44 PM Peter Geoghegan <pg@bowt.ie> wrote: > Right - it's practically inevitable that you'll need an > anti-wraparound VACUUM to advance relfrozenxid right now. Technically > it's possible to advance relfrozenxid in any VACUUM, but in practice > it just never happens on a large table. You only need to get unlucky > with one heap page, either by failing to get a cleanup lock, or (more > likely) by setting even one single page all-visible but not all-frozen > just once (once in any VACUUM that takes place between anti-wraparound > VACUUMs). Minor correction: That's a slight exaggeration, since we won't skip groups of all-visible pages that don't exceed SKIP_PAGES_THRESHOLD blocks (32 blocks). -- Peter Geoghegan
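For anyone unfamiliar with SKIP_PAGES_THRESHOLD, the skipping behavior being referenced works roughly as follows (a simplified toy model, not the vacuumlazy.c implementation, which also distinguishes all-frozen pages and aggressive VACUUMs): a run of consecutive all-visible blocks is only skipped when the run is at least 32 blocks long, so short runs get scanned anyway.

#include <stdbool.h>
#include <stdio.h>

#define SKIP_PAGES_THRESHOLD 32

static int
count_scanned_blocks(const bool *all_visible, int nblocks)
{
    int scanned = 0;

    for (int b = 0; b < nblocks;)
    {
        if (!all_visible[b])
        {
            scanned++;
            b++;
            continue;
        }

        /* measure the run of consecutive all-visible blocks */
        int run = 0;
        while (b + run < nblocks && all_visible[b + run])
            run++;

        if (run < SKIP_PAGES_THRESHOLD)
            scanned += run;     /* too short to be worth skipping */

        b += run;
    }
    return scanned;
}

int
main(void)
{
    bool vm[100] = {false};

    /* a 20 block all-visible run (below the threshold): scanned anyway */
    for (int b = 10; b < 30; b++)
        vm[b] = true;
    /* a 46 block all-visible run (above the threshold): skipped */
    for (int b = 50; b < 96; b++)
        vm[b] = true;

    printf("scanned %d of 100 blocks\n", count_scanned_blocks(vm, 100));
    return 0;
}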
On Fri, Feb 4, 2022 at 10:21 PM Greg Stark <stark@mit.edu> wrote: > By far the majority of anti-wraparound vacuums are triggered by tables > that are very large and so don't trigger regular vacuums for "long > periods" of time and consistently hit the anti-wraparound threshold > first. That's interesting, because my experience is different. Most of the time when I get asked to look at a system, it turns out that there is a prepared transaction or a forgotten replication slot and nobody noticed until the system hit the wraparound threshold. Or occasionally a long-running transaction or a failing/stuck vacuum that has the same effect. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Feb 4, 2022 at 10:45 PM Peter Geoghegan <pg@bowt.ie> wrote: > > While I've seen all the above cases triggering anti-wraparound cases > > by far the majority of the cases are not of these pathological forms. > > Right - it's practically inevitable that you'll need an > anti-wraparound VACUUM to advance relfrozenxid right now. Technically > it's possible to advance relfrozenxid in any VACUUM, but in practice > it just never happens on a large table. You only need to get unlucky > with one heap page, either by failing to get a cleanup lock, or (more > likely) by setting even one single page all-visible but not all-frozen > just once (once in any VACUUM that takes place between anti-wraparound > VACUUMs). But ... if I'm not mistaken, in the kind of case that Greg is describing, relfrozenxid will be advanced exactly as often as it is today. That's because, if VACUUM is only ever getting triggered by XID age advancement and not by bloat, there's no opportunity for your patch set to advance relfrozenxid any sooner than we're doing now. So I think that people in this kind of situation will potentially be helped or hurt by other things the patch set does, but the eager relfrozenxid stuff won't make any difference for them. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Feb 7, 2022 at 10:08 AM Robert Haas <robertmhaas@gmail.com> wrote: > But ... if I'm not mistaken, in the kind of case that Greg is > describing, relfrozenxid will be advanced exactly as often as it is > today. But what happens today in a scenario like Greg's is pathological, despite being fairly common (common in large DBs). It doesn't seem informative to extrapolate too much from current experience for that reason. > That's because, if VACUUM is only ever getting triggered by XID > age advancement and not by bloat, there's no opportunity for your > patch set to advance relfrozenxid any sooner than we're doing now. We must distinguish between: 1. "VACUUM is fundamentally never going to need to run unless it is forced to, just to advance relfrozenxid" -- this applies to tables like the stock and customers tables from the benchmark. and: 2. "VACUUM must sometimes run to mark newly appended heap pages all-visible, and maybe to also remove dead tuples, but not that often -- and yet we currently only get expensive and inconveniently timed anti-wraparound VACUUMs, no matter what" -- this applies to all the other big tables in the benchmark, in particular to the orders and order lines tables, but also to simpler cases like pgbench_history. As I've said a few times now, the patch doesn't change anything for 1. But Greg's problem tables very much sound like they're from category 2. And what we see with the master branch for such tables is that they always get anti-wraparound VACUUMs, past a certain size (depends on things like exact XID rate and VACUUM settings, the insert-driven autovacuum scheduling stuff matters). The patch, meanwhile, never reaches that point in practice during my testing -- it doesn't even come close. It is true that in theory, as the size of one of these "category 2" tables tends to infinity, the patch ends up behaving the same as master anyway. But I'm pretty sure that that usually doesn't matter at all, or matters less than you'd think. As I emphasized when presenting the recent v7 TPC-C benchmark, neither of the two "TPC-C big problem tables" (which are particularly interesting/tricky examples of tables from category 2) comes close to getting an anti-wraparound VACUUM (plus, as I said in the same email, it wouldn't matter if they did). > So I think that people in this kind of situation will potentially be > helped or hurt by other things the patch set does, but the eager > relfrozenxid stuff won't make any difference for them. To be clear, I think it would if everything was in place, including the basic relfrozenxid advancement thing, plus the new freezing stuff (though you wouldn't need the experimental FSM thing to get this benefit). Here is a thought experiment that may make the general idea a bit clearer: Imagine I reran the same benchmark as before, with the same settings, and the expectation that everything would be the same as first time around for the patch series. But to make things more interesting, this time I add an adversarial element: I add an adversarial gizmo that burns XIDs steadily, without doing any useful work. This gizmo doubles the rate of XID consumption for the database as a whole, perhaps by calling "SELECT txid_current()" in a loop, followed by a timed sleep (with a delay chosen with the goal of doubling XID consumption).
I imagine that this would also burn CPU cycles, but probably not enough to make more than a noise level impact -- so we're severely stressing the implementation by adding this gizmo, but the stress is precisely targeted at XID consumption and related implementation details. It's a pretty clean experiment. What happens now? I believe (though haven't checked for myself) that nothing important would change. We'd still see the same VACUUM operations occur at approximately the same times (relative to the start of the benchmark) that we saw with the original benchmark, and each VACUUM operation would do approximately the same amount of physical work on each occasion. Of course, the autovacuum log output would show that the OldestXmin for each individual VACUUM operation had larger values than first time around for this newly initdb'd TPC-C database (purely as a consequence of the XID burning gizmo), but it would *also* show *concomitant* increases for our newly set relfrozenxid. The system should therefore hardly behave differently at all compared to the original benchmark run, despite this adversarial gizmo. It's fair to wonder: okay, but what if it was 4x, 8x, 16x? What then? That does get a bit more complicated, and we should get into why that is. But for now I'll just say that I think that even that kind of extreme would make much less difference than you might think -- since relfrozenxid advancement has been qualitatively improved by the patch series. It is especially likely that nothing would change if you were willing to increase autovacuum_freeze_max_age to get a bit more breathing room -- room to allow the autovacuums to run at their "natural" times. You wouldn't necessarily have to go too far -- the extra breathing room from increasing autovacuum_freeze_max_age buys more wall clock time *between* any two successive "naturally timed autovacuums". Again, a virtuous cycle. Does that make sense? It's pretty subtle, admittedly, and you no doubt have (very reasonable) concerns about the extremes, even if you accept all that. I just want to get the general idea across here, as a starting point for further discussion. -- Peter Geoghegan
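To make the thought experiment concrete, the XID-burning gizmo could be as simple as a libpq loop along these lines (a hypothetical test harness, not part of the patch series; the sleep would need tuning against the target XID rate):

/*
 * Hypothetical XID-burning gizmo: consume one XID per iteration by calling
 * txid_current(), sleeping in between.  Build with something like:
 *   cc xidburn.c -o xidburn -I$(pg_config --includedir) \
 *      -L$(pg_config --libdir) -lpq
 */
#include <stdio.h>
#include <unistd.h>

#include <libpq-fe.h>

int
main(void)
{
    PGconn *conn = PQconnectdb("");     /* use PG* environment variables */

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    for (;;)
    {
        PGresult *res = PQexec(conn, "SELECT txid_current()");

        if (PQresultStatus(res) != PGRES_TUPLES_OK)
            fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
        PQclear(res);

        usleep(10 * 1000);      /* tune the delay to hit the target XID rate */
    }

    PQfinish(conn);             /* not reached */
    return 0;
}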
On Mon, Feb 7, 2022 at 11:43 AM Peter Geoghegan <pg@bowt.ie> wrote: > > That's because, if VACUUM is only ever getting triggered by XID > > age advancement and not by bloat, there's no opportunity for your > > patch set to advance relfrozenxid any sooner than we're doing now. > > We must distinguish between: > > 1. "VACUUM is fundamentally never going to need to run unless it is > forced to, just to advance relfrozenxid" -- this applies to tables > like the stock and customers tables from the benchmark. > > and: > > 2. "VACUUM must sometimes run to mark newly appended heap pages > all-visible, and maybe to also remove dead tuples, but not that often > -- and yet we current only get expensive and inconveniently timed > anti-wraparound VACUUMs, no matter what" -- this applies to all the > other big tables in the benchmark, in particular to the orders and > order lines tables, but also to simpler cases like pgbench_history. It's not really very understandable for me when you refer to the way table X behaves in Y benchmark, because I haven't studied that in enough detail to know. If you say things like insert-only table, or a continuous-random-updates table, or whatever the case is, it's a lot easier to wrap my head around it. > Does that make sense? It's pretty subtle, admittedly, and you no doubt > have (very reasonable) concerns about the extremes, even if you accept > all that. I just want to get the general idea across here, as a > starting point for further discussion. Not really. I think you *might* be saying tables which currently get only wraparound vacuums will end up getting other kinds of vacuums with your patch because things will improve enough for other tables in the system that they will be able to get more attention than they do currently. But I'm not sure I am understanding you correctly, and even if I am I don't understand why that would be so, and even if it is I think it doesn't help if essentially all the tables in the system are suffering from the problem. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Feb 7, 2022 at 12:21 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Feb 7, 2022 at 11:43 AM Peter Geoghegan <pg@bowt.ie> wrote: > > > That's because, if VACUUM is only ever getting triggered by XID > > > age advancement and not by bloat, there's no opportunity for your > > > patch set to advance relfrozenxid any sooner than we're doing now. > > > > We must distinguish between: > > > > 1. "VACUUM is fundamentally never going to need to run unless it is > > forced to, just to advance relfrozenxid" -- this applies to tables > > like the stock and customers tables from the benchmark. > > > > and: > > > > 2. "VACUUM must sometimes run to mark newly appended heap pages > > all-visible, and maybe to also remove dead tuples, but not that often > > -- and yet we current only get expensive and inconveniently timed > > anti-wraparound VACUUMs, no matter what" -- this applies to all the > > other big tables in the benchmark, in particular to the orders and > > order lines tables, but also to simpler cases like pgbench_history. > > It's not really very understandable for me when you refer to the way > table X behaves in Y benchmark, because I haven't studied that in > enough detail to know. If you say things like insert-only table, or a > continuous-random-updates table, or whatever the case is, it's a lot > easier to wrap my head around it. What I've called category 2 tables are the vast majority of big tables in practice. They include pure append-only tables, but also tables that grow and grow from inserts, but also have some updates. The point of the TPC-C order + order lines examples was to show how broad the category really is. And how mixtures of inserts and bloat from updates on one single table confuse the implementation in general. > > Does that make sense? It's pretty subtle, admittedly, and you no doubt > > have (very reasonable) concerns about the extremes, even if you accept > > all that. I just want to get the general idea across here, as a > > starting point for further discussion. > > Not really. I think you *might* be saying tables which currently get > only wraparound vacuums will end up getting other kinds of vacuums > with your patch because things will improve enough for other tables in > the system that they will be able to get more attention than they do > currently. Yes, I am. > But I'm not sure I am understanding you correctly, and even > if I am I don't understand why that would be so, and even if it is I > think it doesn't help if essentially all the tables in the system are > suffering from the problem. When I say "relfrozenxid advancement has been qualitatively improved by the patch", what I mean is that the rate of relfrozenxid advancement is now far closer to the theoretically optimal rate for our current design, with freezing and with 32-bit XIDs, and with the invariants for freezing. Consider the extreme case, and generalize. In the simple append-only table case, it is most obvious. The final relfrozenxid is very close to OldestXmin (only tiny noise level differences appear), regardless of XID consumption by the system in general, and even within the append-only table in particular. Other cases are somewhat trickier, but have roughly the same quality, to a surprising degree. Lots of things that never really should have affected relfrozenxid to begin with do not, for the first time. -- Peter Geoghegan
On Sat, Jan 29, 2022 at 8:42 PM Peter Geoghegan <pg@bowt.ie> wrote: > Attached is v7, a revision that overhauls the algorithm that decides > what to freeze. I'm now calling it block-driven freezing in the commit > message. Also included is a new patch, that makes VACUUM record zero > free space in the FSM for an all-visible page, unless the total amount > of free space happens to be greater than one half of BLCKSZ. I pushed the earlier refactoring and instrumentation patches today. Attached is v8. No real changes -- just a rebased version. It will be easier to benchmark and test the page-driven freezing stuff now, since the master/baseline case will now output instrumentation showing how relfrozenxid has been advanced (if at all) -- whether (and to what extent) each VACUUM operation advances relfrozenxid can now be directly compared, just by monitoring the log_autovacuum_min_duration output for a given table over time. -- Peter Geoghegan
Attachment
On Fri, Feb 11, 2022 at 8:30 PM Peter Geoghegan <pg@bowt.ie> wrote: > Attached is v8. No real changes -- just a rebased version. Concerns about my general approach to this project (and even the Postgres 14 VACUUM work) were expressed by Robert and Andres over on the "Nonrandom scanned_pages distorts pg_class.reltuples set by VACUUM" thread. Some of what was said honestly shocked me. It now seems unwise to pursue this project on my original timeline. I even thought about shelving it indefinitely (which is still on the table). I propose the following compromise: the least contentious patch alone will be in scope for Postgres 15, while the other patches will not be. I'm referring to the first patch from v8, which adds dynamic tracking of the oldest extant XID in each heap table, in order to be able to use it as our new relfrozenxid. I can't imagine that I'll have difficulty convincing Andres of the merits of this idea, for one, since it was his idea in the first place. It makes a lot of sense, independent of any change to how and when we freeze. The first patch is tricky, but at least it won't require elaborate performance validation. It doesn't change any of the basic performance characteristics of VACUUM. It sometimes allows us to advance relfrozenxid to a value beyond FreezeLimit (typically only possible in an aggressive VACUUM), which is an intrinsic good. If it isn't effective then the overhead seems very unlikely to be noticeable. It's pretty much a strictly additive improvement. Are there any objections to this plan? -- Peter Geoghegan
On Fri, Feb 18, 2022 at 3:41 PM Peter Geoghegan <pg@bowt.ie> wrote: > Concerns about my general approach to this project (and even the > Postgres 14 VACUUM work) were expressed by Robert and Andres over on > the "Nonrandom scanned_pages distorts pg_class.reltuples set by > VACUUM" thread. Some of what was said honestly shocked me. It now > seems unwise to pursue this project on my original timeline. I even > thought about shelving it indefinitely (which is still on the table). > > I propose the following compromise: the least contentious patch alone > will be in scope for Postgres 15, while the other patches will not be. > I'm referring to the first patch from v8, which adds dynamic tracking > of the oldest extant XID in each heap table, in order to be able to > use it as our new relfrozenxid. I can't imagine that I'll have > difficulty convincing Andres of the merits of this idea, for one, > since it was his idea in the first place. It makes a lot of sense, > independent of any change to how and when we freeze. > > The first patch is tricky, but at least it won't require elaborate > performance validation. It doesn't change any of the basic performance > characteristics of VACUUM. It sometimes allows us to advance > relfrozenxid to a value beyond FreezeLimit (typically only possible in > an aggressive VACUUM), which is an intrinsic good. If it isn't > effective then the overhead seems very unlikely to be noticeable. It's > pretty much a strictly additive improvement. > > Are there any objections to this plan? I really like the idea of reducing the scope of what is being changed here, and I agree that eagerly advancing relfrozenxid carries much less risk than the other changes. I'd like to have a clearer idea of exactly what is in each of the remaining patches before forming a final opinion. What's tricky about 0001? Does it change any other behavior, either as a necessary component of advancing relfrozenxid more eagerly, or otherwise? If there's a way you can make the precise contents of 0002 and 0003 more clear, I would like that, too. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Feb 18, 2022 at 12:54 PM Robert Haas <robertmhaas@gmail.com> wrote: > I'd like to have a clearer idea of exactly what is in each of the > remaining patches before forming a final opinion. Great. > What's tricky about 0001? Does it change any other behavior, either as > a necessary component of advancing relfrozenxid more eagerly, or > otherwise? It does not change any other behavior. It's totally mechanical. 0001 is tricky in the sense that there are a lot of fine details, and if you get any one of them wrong the result might be a subtle bug. For example, the heap_tuple_needs_freeze() code path is only used when we cannot get a cleanup lock, which is rare -- and some of the branches within the function are relatively rare themselves. The obvious concern is: What if some detail of how we track the new relfrozenxid value (and new relminmxid value) in this seldom-hit codepath is just wrong, in whatever way we didn't think of? On the other hand, we must already be precise in almost the same way within heap_tuple_needs_freeze() today -- it's not all that different (we currently need to avoid leaving any XIDs < FreezeLimit behind, which isn't made that less complicated by the fact that it's a static XID cutoff). Plus, we have experience with bugs like this. There was hardening added to catch stuff like this back in 2017, following the "freeze the dead" bug. > If there's a way you can make the precise contents of 0002 and 0003 > more clear, I would like that, too. The really big one is 0002 -- even 0003 (the FSM PageIsAllVisible() thing) wasn't on the table before now. 0002 is the patch that changes the basic criteria for freezing, making it block-based rather than based on the FreezeLimit cutoff (barring edge cases that are important for correctness, but shouldn't noticeably affect freezing overhead). The single biggest practical improvement from 0002 is that it eliminates what I've called the freeze cliff, which is where many old tuples (much older than FreezeLimit/vacuum_freeze_min_age) must be frozen all at once, in a balloon payment during an eventual aggressive VACUUM. Although it's easy to see that that could be useful, it is harder to justify (much harder) than anything else. Because we're freezing more eagerly overall, we're also bound to do more freezing without benefit in certain cases. Although I think that this can be justified as the cost of doing business, that's a hard argument to make. In short, 0001 is mechanically tricky, but easy to understand at a high level. Whereas 0002 is mechanically simple, but tricky to understand at a high level (and therefore far trickier than 0001 overall). -- Peter Geoghegan
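To spell out the bookkeeping that makes the no-cleanup-lock path tricky, here is a toy model of the contract (illustration only -- the real heap_tuple_needs_freeze() also handles MultiXacts, and real XID comparisons are circular): the function reports whether an aggressive VACUUM would be forced to freeze the tuple, while ratcheting back the caller's running target relfrozenxid on the assumption that nothing will actually be frozen.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t ToyXid;

/* xmax == 0 stands in for "no xmax"; '<' ignores wraparound */
static bool
toy_tuple_needs_freeze(ToyXid xmin, ToyXid xmax, ToyXid freeze_limit,
                       ToyXid *new_relfrozenxid)
{
    bool needs_freeze = false;

    /* xmin will be left behind unfrozen, so it bounds new_relfrozenxid */
    if (xmin < *new_relfrozenxid)
        *new_relfrozenxid = xmin;
    if (xmin < freeze_limit)
        needs_freeze = true;

    /* xmax must be considered independently -- xmin may already be frozen */
    if (xmax != 0)
    {
        if (xmax < *new_relfrozenxid)
            *new_relfrozenxid = xmax;
        if (xmax < freeze_limit)
            needs_freeze = true;
    }

    return needs_freeze;
}

int
main(void)
{
    ToyXid new_relfrozenxid = 500000;   /* start at OldestXmin */

    /* old xmin: an aggressive VACUUM would have to wait for a cleanup lock */
    bool needs = toy_tuple_needs_freeze(120000, 0, 300000, &new_relfrozenxid);

    printf("needs_freeze=%d new_relfrozenxid=%u\n",
           needs, (unsigned) new_relfrozenxid);
    return 0;
}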
On Fri, Feb 18, 2022 at 4:10 PM Peter Geoghegan <pg@bowt.ie> wrote: > It does not change any other behavior. It's totally mechanical. > > 0001 is tricky in the sense that there are a lot of fine details, and > if you get any one of them wrong the result might be a subtle bug. For > example, the heap_tuple_needs_freeze() code path is only used when we > cannot get a cleanup lock, which is rare -- and some of the branches > within the function are relatively rare themselves. The obvious > concern is: What if some detail of how we track the new relfrozenxid > value (and new relminmxid value) in this seldom-hit codepath is just > wrong, in whatever way we didn't think of? Right. I think we have no choice but to accept such risks if we want to make any progress here, and every patch carries them to some degree. I hope that someone else will review this patch in more depth than I have just now, but what I notice reading through it is that some of the comments seem pretty opaque. For instance: + * Also maintains *NewRelfrozenxid and *NewRelminmxid, which are the current + * target relfrozenxid and relminmxid for the relation. Assumption is that "maintains" is fuzzy. I think you should be saying something much more explicit, and the thing you are saying should make it clear that these arguments are input-output arguments: i.e. the caller must set them correctly before calling this function, and they will be updated by the function. I don't think you have to spell all of that out in every place where this comes up in the patch, but it needs to be clear from what you do say. For example, I would be happier with a comment that said something like "Every call to this function will either set HEAP_XMIN_FROZEN in the xl_heap_freeze_tuple struct passed as an argument, or else reduce *NewRelfrozenxid to the xmin of the tuple if it is currently newer than that. Thus, after a series of calls to this function, *NewRelfrozenxid represents a lower bound on unfrozen xmin values in the tuples examined. Before calling this function, caller should initialize *NewRelfrozenxid to <something>." + * Changing nothing, so might have to ratchet back NewRelminmxid, + * NewRelfrozenxid, or both together This comment I like. + * New multixact might have remaining XID older than + * NewRelfrozenxid This one's good, too. + * Also maintains *NewRelfrozenxid and *NewRelminmxid, which are the current + * target relfrozenxid and relminmxid for the relation. Assumption is that + * caller will never freeze any of the XIDs from the tuple, even when we say + * that they should. If caller opts to go with our recommendation to freeze, + * then it must account for the fact that it shouldn't trust how we've set + * NewRelfrozenxid/NewRelminmxid. (In practice aggressive VACUUMs always take + * our recommendation because they must, and non-aggressive VACUUMs always opt + * to not freeze, preferring to ratchet back NewRelfrozenxid instead). I don't understand this one. + * (Actually, we maintain NewRelminmxid differently here, because we + * assume that XIDs that should be frozen according to cutoff_xid won't + * be, whereas heap_prepare_freeze_tuple makes the opposite assumption.) This one either. I haven't really grokked exactly what is happening in heap_tuple_needs_freeze yet, and may not have time to study it further in the near future. Not saying it's wrong, although improving the comments above would likely help me out. > > If there's a way you can make the precise contents of 0002 and 0003 > > more clear, I would like that, too. 
> > The really big one is 0002 -- even 0003 (the FSM PageIsAllVisible() > thing) wasn't on the table before now. 0002 is the patch that changes > the basic criteria for freezing, making it block-based rather than > based on the FreezeLimit cutoff (barring edge cases that are important > for correctness, but shouldn't noticeably affect freezing overhead). > > The single biggest practical improvement from 0002 is that it > eliminates what I've called the freeze cliff, which is where many old > tuples (much older than FreezeLimit/vacuum_freeze_min_age) must be > frozen all at once, in a balloon payment during an eventual aggressive > VACUUM. Although it's easy to see that that could be useful, it is > harder to justify (much harder) than anything else. Because we're > freezing more eagerly overall, we're also bound to do more freezing > without benefit in certain cases. Although I think that this can be > justified as the cost of doing business, that's a hard argument to > make. You've used the term "freezing cliff" repeatedly in earlier emails, and this is the first time I've been able to understand what you meant. I'm glad I do, now. But can you describe the algorithm that 0002 uses to accomplish this improvement? Like "if it sees that the page meets criteria X, then it freezes all tuples on the page, else if it sees that individual tuples on the page meet criteria Y, then it freezes just those." And like explain what of that is same/different vs. now. Thanks, -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2022-02-18 13:09:45 -0800, Peter Geoghegan wrote: > 0001 is tricky in the sense that there are a lot of fine details, and > if you get any one of them wrong the result might be a subtle bug. For > example, the heap_tuple_needs_freeze() code path is only used when we > cannot get a cleanup lock, which is rare -- and some of the branches > within the function are relatively rare themselves. The obvious > concern is: What if some detail of how we track the new relfrozenxid > value (and new relminmxid value) in this seldom-hit codepath is just > wrong, in whatever way we didn't think of? I think it'd be good to add a few isolationtest cases for the can't-get-cleanup-lock paths. I think it shouldn't be hard using cursors. The slightly harder part is verifying that VACUUM did something reasonable, but that still should be doable? Greetings, Andres Freund
Hi, On 2022-02-18 15:54:19 -0500, Robert Haas wrote: > > Are there any objections to this plan? > > I really like the idea of reducing the scope of what is being changed > here, and I agree that eagerly advancing relfrozenxid carries much > less risk than the other changes. Sounds good to me too! Greetings, Andres Freund
On Fri, Feb 18, 2022 at 1:56 PM Robert Haas <robertmhaas@gmail.com> wrote: > + * Also maintains *NewRelfrozenxid and *NewRelminmxid, which are the current > + * target relfrozenxid and relminmxid for the relation. Assumption is that > > "maintains" is fuzzy. I think you should be saying something much more > explicit, and the thing you are saying should make it clear that these > arguments are input-output arguments: i.e. the caller must set them > correctly before calling this function, and they will be updated by > the function. Makes sense. > I don't think you have to spell all of that out in every > place where this comes up in the patch, but it needs to be clear from > what you do say. For example, I would be happier with a comment that > said something like "Every call to this function will either set > HEAP_XMIN_FROZEN in the xl_heap_freeze_tuple struct passed as an > argument, or else reduce *NewRelfrozenxid to the xmin of the tuple if > it is currently newer than that. Thus, after a series of calls to this > function, *NewRelfrozenxid represents a lower bound on unfrozen xmin > values in the tuples examined. Before calling this function, caller > should initialize *NewRelfrozenxid to <something>." We have to worry about XIDs from MultiXacts (and xmax values more generally). And we have to worry about the case where we start out with only xmin frozen (by an earlier VACUUM), and then have to freeze xmax too. I believe that we have to generally consider xmin and xmax independently. For example, we cannot ignore xmax, just because we looked at xmin, since in general xmin alone might have already been frozen. > + * Also maintains *NewRelfrozenxid and *NewRelminmxid, which are the current > + * target relfrozenxid and relminmxid for the relation. Assumption is that > + * caller will never freeze any of the XIDs from the tuple, even when we say > + * that they should. If caller opts to go with our recommendation to freeze, > + * then it must account for the fact that it shouldn't trust how we've set > + * NewRelfrozenxid/NewRelminmxid. (In practice aggressive VACUUMs always take > + * our recommendation because they must, and non-aggressive VACUUMs always opt > + * to not freeze, preferring to ratchet back NewRelfrozenxid instead). > > I don't understand this one. > > + * (Actually, we maintain NewRelminmxid differently here, because we > + * assume that XIDs that should be frozen according to cutoff_xid won't > + * be, whereas heap_prepare_freeze_tuple makes the opposite assumption.) > > This one either. The difference between the cleanup lock path (in lazy_scan_prune/heap_prepare_freeze_tuple) and the share lock path (in lazy_scan_noprune/heap_tuple_needs_freeze) is what is at issue in both of these confusing comment blocks, really. Note that cutoff_xid is the name that both heap_prepare_freeze_tuple and heap_tuple_needs_freeze have for FreezeLimit (maybe we should rename every occurrence of cutoff_xid in heapam.c to FreezeLimit). At a high level, we aren't changing the fundamental definition of an aggressive VACUUM in any of the patches -- we still need to advance relfrozenxid up to FreezeLimit in an aggressive VACUUM, just like on HEAD, today (we may be able to advance it *past* FreezeLimit, but that's just a bonus).
But in a non-aggressive VACUUM, where there is still no strict requirement to advance relfrozenxid (by any amount), the code added by 0001 can set relfrozenxid to any known safe value, which could either be from before FreezeLimit, or after FreezeLimit -- almost anything is possible (provided we respect the relfrozenxid invariant, and provided we see that we didn't skip any all-visible-not-all-frozen pages). Since we still need to "respect FreezeLimit" in an aggressive VACUUM, the aggressive case might need to wait for a full cleanup lock the hard way, having tried and failed to do it the easy way within lazy_scan_noprune (lazy_scan_noprune will still return false when any call to heap_tuple_needs_freeze for any tuple returns true) -- same as on HEAD, today. And so the difference at issue here is: FreezeLimit/cutoff_xid only needs to affect the new NewRelfrozenxid value we use for relfrozenxid in heap_prepare_freeze_tuple, which is involved in real freezing -- not in heap_tuple_needs_freeze, whose main purpose is still to help us avoid freezing where a cleanup lock isn't immediately available. Meanwhile, the purpose of FreezeLimit/cutoff_xid within heap_tuple_needs_freeze is to determine its bool return value, which will only be of interest to the aggressive case (which might have to get a cleanup lock and do it the hard way), not the non-aggressive case (where ratcheting back NewRelfrozenxid is generally possible, and generally leaves us with almost as good of a value). In other words: the calls to heap_tuple_needs_freeze made from lazy_scan_noprune are simply concerned with the page as it actually is, whereas the similar/corresponding calls to heap_prepare_freeze_tuple from lazy_scan_prune are concerned with *what the page will actually become*, after freezing finishes, and after lazy_scan_prune is done with the page entirely (ultimately the final NewRelfrozenxid value set in pg_class.relfrozenxid only has to be <= the oldest extant XID *at the time the VACUUM operation is just about to end*, not some earlier time, so "being versus becoming" is an interesting distinction for us). Maybe the way that FreezeLimit/cutoff_xid is overloaded can be fixed here, to make all of this less confusing. I only now fully realized how confusing all of this stuff is -- very. > I haven't really grokked exactly what is happening in > heap_tuple_needs_freeze yet, and may not have time to study it further > in the near future. Not saying it's wrong, although improving the > comments above would likely help me out. Definitely needs more polishing. > You've used the term "freezing cliff" repeatedly in earlier emails, > and this is the first time I've been able to understand what you > meant. I'm glad I do, now. Ugh. I thought that a snappy term like that would catch on quickly. Guess not! > But can you describe the algorithm that 0002 uses to accomplish this > improvement? Like "if it sees that the page meets criteria X, then it > freezes all tuples on the page, else if it sees that that individual > tuples on the page meet criteria Y, then it freezes just those." And > like explain what of that is same/different vs. now. The mechanics themselves are quite simple (again, understanding the implications is the hard part). The approach taken within 0002 is still rough, to be honest, but wouldn't take long to clean up (there are XXX/FIXME comments about this in 0002). As a general rule, we try to freeze all of the remaining live tuples on a page (following pruning) together, as a group, or none at all.
Most of the time this is triggered by our noticing that the page is about to be set all-visible (but not all-frozen), and doing work sufficient to mark it fully all-frozen instead. Occasionally there is FreezeLimit to consider, which is now more of a backstop thing, used to make sure that we never get too far behind in terms of unfrozen XIDs. This is useful in part because it avoids any future non-aggressive VACUUM that is fundamentally unable to advance relfrozenxid (you can't skip all-visible pages if there are only all-frozen pages in the VM in practice). We're generally doing a lot more freezing with 0002, but we still manage to avoid freezing too much in tables like pgbench_tellers or pgbench_branches -- tables where it makes the least sense. Such tables will be updated so frequently that VACUUM is relatively unlikely to ever mark any page all-visible, avoiding the main criterion for freezing implicitly. It's also unlikely that they'll ever have an XID that is old enough to trigger the fallback FreezeLimit-style criterion for freezing. In practice, freezing tuples like this is generally not that expensive in most tables where VACUUM freezes the majority of pages immediately (tables that aren't like pgbench_tellers or pgbench_branches), because they're generally big tables, where the overhead of FPIs tends to dominate anyway (gambling that we can avoid more FPIs later on is not a bad gamble, as gambles go). This seems to make the overhead acceptable, on balance. Granted, you might be able to poke holes in that argument, and reasonable people might disagree on what acceptable should mean. There are many value judgements here, which makes it complicated. (On the other hand we might be able to do better if there was a particularly bad case for the 0002 work, if one came to light.) -- Peter Geoghegan
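[Editor's note: for concreteness, the all-or-nothing page-level policy described in the message above might be sketched roughly as follows. The function and parameter names are illustrative assumptions, not names from the actual 0002 patch.]

    /*
     * Sketch only: freeze every remaining live tuple on the page, or none at
     * all.  The caller is assumed to have already worked out whether the page
     * would become all-visible (and all-frozen) after this round of
     * pruning/freezing, and whether any XID/MXID on the page is older than
     * the FreezeLimit/MultiXactCutoff backstop.
     */
    static bool
    should_freeze_page(bool will_become_all_visible,
                       bool will_become_all_frozen,
                       bool has_xids_past_backstop_cutoff)
    {
        /* Freeze when that lets the page be set all-frozen, not just all-visible */
        if (will_become_all_visible && !will_become_all_frozen)
            return true;

        /* Backstop: never fall too far behind FreezeLimit/MultiXactCutoff */
        if (has_xids_past_backstop_cutoff)
            return true;

        /* Otherwise, freeze nothing at all on this page */
        return false;
    }

In this reading, the first test does almost all of the work in practice; the backstop test only matters for pages that keep failing to become all-visible.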
On Fri, Feb 18, 2022 at 2:11 PM Andres Freund <andres@anarazel.de> wrote: > I think it'd be good to add a few isolationtest cases for the > can't-get-cleanup-lock paths. I think it shouldn't be hard using cursors. The > slightly harder part is verifying that VACUUM did something reasonable, but > that still should be doable? We could even just extend existing, related tests, from vacuum-reltuples.spec. Another testing strategy occurs to me: we could stress-test the implementation by simulating an environment where the no-cleanup-lock path is hit an unusually large number of times, possibly a fixed percentage of the time (like 1%, 5%), say by making vacuumlazy.c's ConditionalLockBufferForCleanup() call return false randomly. Now that we have lazy_scan_noprune for the no-cleanup-lock path (which is as similar to the regular lazy_scan_prune path as possible), I wouldn't expect this ConditionalLockBufferForCleanup() testing gizmo to be too disruptive. -- Peter Geoghegan
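[Editor's note: the kind of testing gizmo being proposed here could presumably look something like the following; this is only a sketch of the idea with an assumed wrapper name and failure rate, not the patch that gets attached later in the thread.]

    #include "postgres.h"

    #include "storage/bufmgr.h"

    /*
     * Adversarial wrapper for vacuumlazy.c (assumed name): report that the
     * conditional cleanup lock attempt failed some fixed percentage of the
     * time, so that the lazy_scan_noprune path is exercised far more often
     * than it would be naturally.
     */
    static bool
    AdversarialConditionalLockBufferForCleanup(Buffer buffer)
    {
        /* Simulate contention: pretend the cleanup lock wasn't free 2% of the time */
        if (random() % 100 < 2)
            return false;

        return ConditionalLockBufferForCleanup(buffer);
    }

lazy_scan_heap would then call the wrapper in place of its existing ConditionalLockBufferForCleanup() call while the stress test runs.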
On Fri, Feb 18, 2022 at 5:00 PM Peter Geoghegan <pg@bowt.ie> wrote: > Another testing strategy occurs to me: we could stress-test the > implementation by simulating an environment where the no-cleanup-lock > path is hit an unusually large number of times, possibly a fixed > percentage of the time (like 1%, 5%), say by making vacuumlazy.c's > ConditionalLockBufferForCleanup() call return false randomly. Now that > we have lazy_scan_noprune for the no-cleanup-lock path (which is as > similar to the regular lazy_scan_prune path as possible), I wouldn't > expect this ConditionalLockBufferForCleanup() testing gizmo to be too > disruptive. I tried this out, using the attached patch. It was quite interesting, even when run against HEAD. I think that I might have found a bug on HEAD, though I'm not really sure. If you modify the patch to simulate conditions under which ConditionalLockBufferForCleanup() fails about 2% of the time, you get much better coverage of lazy_scan_noprune/heap_tuple_needs_freeze, without it being so aggressive as to make "make check-world" fail -- which is exactly what I expected. If you are much more aggressive about it, and make it 50% instead (which you can get just by using the patch as written), then some tests will fail, mostly for reasons that aren't surprising or interesting (e.g. plan changes). This is also what I'd have guessed would happen. However, it gets more interesting. One thing that I did not expect to happen at all also happened (with the current 50% rate of simulated ConditionalLockBufferForCleanup() failure from the patch): if I run "make check" from the pg_surgery directory, then the Postgres backend gets stuck in an infinite loop inside lazy_scan_prune, which has been a symptom of several tricky bugs in the past year (not every time, but usually). Specifically, the VACUUM statement launched by the SQL command "vacuum freeze htab2;" from the file contrib/pg_surgery/sql/heap_surgery.sql, at line 54 leads to this misbehavior. This is a temp table, which is a choice made by the tests specifically because they need to "use a temp table so that vacuum behavior doesn't depend on global xmin". This is a convenient way of avoiding spurious regression test failures (e.g. from autoanalyze), and relies on the GlobalVisTempRels behavior established by Andres' 2020 bugfix commit 94bc27b5. It's quite possible that this is nothing more than a bug in my adversarial gizmo patch -- since I don't think that ConditionalLockBufferForCleanup() can ever fail with a temp buffer (though even that's not completely clear right now). Even if the behavior that I saw does not indicate a bug on HEAD, it still seems informative. At the very least, it wouldn't hurt to Assert() that the target table isn't a temp table inside lazy_scan_noprune, documenting our assumptions around temp tables and ConditionalLockBufferForCleanup(). I haven't actually tried to debug the issue just yet, so take all this with a grain of salt. -- Peter Geoghegan
Attachment
Hi, (On phone, so crappy formatting and no source access) On February 19, 2022 3:08:41 PM PST, Peter Geoghegan <pg@bowt.ie> wrote: >On Fri, Feb 18, 2022 at 5:00 PM Peter Geoghegan <pg@bowt.ie> wrote: >> Another testing strategy occurs to me: we could stress-test the >> implementation by simulating an environment where the no-cleanup-lock >> path is hit an unusually large number of times, possibly a fixed >> percentage of the time (like 1%, 5%), say by making vacuumlazy.c's >> ConditionalLockBufferForCleanup() call return false randomly. Now that >> we have lazy_scan_noprune for the no-cleanup-lock path (which is as >> similar to the regular lazy_scan_prune path as possible), I wouldn't >> expect this ConditionalLockBufferForCleanup() testing gizmo to be too >> disruptive. > >I tried this out, using the attached patch. It was quite interesting, >even when run against HEAD. I think that I might have found a bug on >HEAD, though I'm not really sure. > >If you modify the patch to simulate conditions under which >ConditionalLockBufferForCleanup() fails about 2% of the time, you get >much better coverage of lazy_scan_noprune/heap_tuple_needs_freeze, >without it being so aggressive as to make "make check-world" fail -- >which is exactly what I expected. If you are much more aggressive >about it, and make it 50% instead (which you can get just by using the >patch as written), then some tests will fail, mostly for reasons that >aren't surprising or interesting (e.g. plan changes). This is also >what I'd have guessed would happen. > >However, it gets more interesting. One thing that I did not expect to >happen at all also happened (with the current 50% rate of simulated >ConditionalLockBufferForCleanup() failure from the patch): if I run >"make check" from the pg_surgery directory, then the Postgres backend >gets stuck in an infinite loop inside lazy_scan_prune, which has been >a symptom of several tricky bugs in the past year (not every time, but >usually). Specifically, the VACUUM statement launched by the SQL >command "vacuum freeze htab2;" from the file >contrib/pg_surgery/sql/heap_surgery.sql, at line 54 leads to this >misbehavior. >This is a temp table, which is a choice made by the tests specifically >because they need to "use a temp table so that vacuum behavior doesn't >depend on global xmin". This is a convenient way of avoiding spurious >regression test failures (e.g. from autoanalyze), and relies on the >GlobalVisTempRels behavior established by Andres' 2020 bugfix commit >94bc27b5. We don't have a blocking path for cleanup locks of temporary buffers IIRC (normally not reachable). So I wouldn't be surprised if a cleanup lock failing would cause some odd behavior. >It's quite possible that this is nothing more than a bug in my >adversarial gizmo patch -- since I don't think that >ConditionalLockBufferForCleanup() can ever fail with a temp buffer >(though even that's not completely clear right now). Even if the >behavior that I saw does not indicate a bug on HEAD, it still seems >informative. At the very least, it wouldn't hurt to Assert() that the >target table isn't a temp table inside lazy_scan_noprune, documenting >our assumptions around temp tables and >ConditionalLockBufferForCleanup(). Definitely worth looking into more. This reminds me of a recent thing I noticed in the aio patch. Spgist can end up busy looping when buffers are locked, instead of blocking. Not actually related, of course. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Sat, Feb 19, 2022 at 3:08 PM Peter Geoghegan <pg@bowt.ie> wrote: > It's quite possible that this is nothing more than a bug in my > adversarial gizmo patch -- since I don't think that > ConditionalLockBufferForCleanup() can ever fail with a temp buffer > (though even that's not completely clear right now). Even if the > behavior that I saw does not indicate a bug on HEAD, it still seems > informative. This very much looks like a bug in pg_surgery itself now -- attached is a draft fix. The temp table thing was a red herring. I found I could get exactly the same kind of failure when htab2 was a permanent table (which was how it originally appeared, before commit 0811f766fd made it into a temp table due to test flappiness issues). The relevant "vacuum freeze htab2" happens at a point after the test has already deliberately corrupted one of its tuples using heap_force_kill(). It's not that we aren't careful enough about the corruption at some point in vacuumlazy.c, which was my second theory. But I quickly discarded that idea, and came up with a third theory: the relevant heap_surgery.c path does the relevant ItemIdSetDead() to kill items, without also defragmenting the page to remove the tuples with storage, which is wrong. This meant that we depended on pruning happening (in this case during VACUUM) and defragmenting the page in passing. But there is no reason to not defragment the page within pg_surgery (at least no obvious reason), since we have a cleanup lock anyway. Theoretically you could blame this on lazy_scan_noprune instead, since it thinks it can collect LP_DEAD items while assuming that they have no storage, but that doesn't make much sense to me. There has never been any way of setting a heap item to LP_DEAD without also defragmenting the page. Since that's exactly what it means to prune a heap page. (Actually, the same used to be true about heap vacuuming, which worked more like heap pruning before Postgres 14, but that doesn't seem important.) -- Peter Geoghegan
Attachment
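[Editor's note: the draft fix described in the message above might amount to something like the following inside pg_surgery's heap_force_common(); this is only a sketch of the idea, not the attached patch, and "did_modify_page", "page" and "buf" are assumed names for the function's existing local state.]

        if (did_modify_page)
        {
            /*
             * We just set one or more items LP_DEAD while holding a cleanup
             * lock, so also remove the storage of the affected tuples here,
             * rather than leaving that to a later prune of the page.
             */
            PageRepairFragmentation(page);

            MarkBufferDirty(buf);
            /* the existing code already WAL-logs the page change after this */
        }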
On Sat, Feb 19, 2022 at 4:22 PM Peter Geoghegan <pg@bowt.ie> wrote: > This very much looks like a bug in pg_surgery itself now -- attached > is a draft fix. Wait, that's not it either. I jumped the gun -- this isn't sufficient (though the patch I posted might not be a bad idea anyway). Looks like pg_surgery isn't processing HOT chains as whole units, which it really should (at least in the context of killing items via the heap_force_kill() function). Killing a root item in a HOT chain is just hazardous -- disconnected/orphaned heap-only tuples are liable to cause chaos, and should be avoided everywhere (including during pruning, and within pg_surgery). It's likely that the hardening I already planned on adding to pruning [1] (as follow-up work to recent bugfix commit 18b87b201f) will prevent lazy_scan_prune from getting stuck like this, whatever the cause happens to be. The actual page image I see lazy_scan_prune choke on (i.e. exhibit the same infinite loop unpleasantness we've seen before on) is not in a consistent state at all (its tuples consist of tuples from a single HOT chain, and the HOT chain is totally inconsistent on account of having an LP_DEAD line pointer root item). pg_surgery could in principle do the right thing here by always treating HOT chains as whole units. Leaving behind disconnected/orphaned heap-only tuples is pretty much pointless anyway, since they'll never be accessible by index scans. Even after a REINDEX, since there is no root item from the heap page to go in the index. (A dump and restore might work better, though.) [1] https://postgr.es/m/CAH2-WzmNk6V6tqzuuabxoxM8HJRaWU6h12toaS-bqYcLiht16A@mail.gmail.com -- Peter Geoghegan
Hi, On 2022-02-19 17:22:33 -0800, Peter Geoghegan wrote: > Looks like pg_surgery isn't processing HOT chains as whole units, > which it really should (at least in the context of killing items via > the heap_force_kill() function). Killing a root item in a HOT chain is > just hazardous -- disconnected/orphaned heap-only tuples are liable to > cause chaos, and should be avoided everywhere (including during > pruning, and within pg_surgery). How does that cause the endless loop? It doesn't do so on HEAD + 0001-Add-adversarial-ConditionalLockBuff[...] for me. So something must have changed with your patch? > It's likely that the hardening I already planned on adding to pruning > [1] (as follow-up work to recent bugfix commit 18b87b201f) will > prevent lazy_scan_prune from getting stuck like this, whatever the > cause happens to be. Yea, we should pick that up again. Not just for robustness or performance. Also because it's just a lot easier to understand. > Leaving behind disconnected/orphaned heap-only tuples is pretty much > pointless anyway, since they'll never be accessible by index scans. > Even after a REINDEX, since there is no root item from the heap page > to go in the index. (A dump and restore might work better, though.) Given that heap_surgery's raison d'etre is correcting corruption etc, I think it makes sense for it to do as minimal work as possible. Iterating through a HOT chain would be a problem if you e.g. tried to repair a page with HOT corruption. Greetings, Andres Freund
On Sat, Feb 19, 2022 at 5:54 PM Andres Freund <andres@anarazel.de> wrote: > How does that cause the endless loop? Attached is the page image itself, dumped via gdb (and gzip'd). This was on recent HEAD (commit 8f388f6f, actually), plus 0001-Add-adversarial-ConditionalLockBuff[...]. No other changes. No defragmenting in pg_surgery, nothing like that. > It doesn't do so on HEAD + 0001-Add-adversarial-ConditionalLockBuff[...] for > me. So something needs have changed with your patch? It doesn't always happen -- only about half the time on my machine. Maybe it's timing sensitive? We hit the "goto retry" on offnum 2, which is the first tuple with storage (you can see "the ghost" of the tuple from the LP_DEAD item at offnum 1, since the page isn't defragmented in pg_surgery). I think that this happens because the heap-only tuple at offnum 2 is fully DEAD to lazy_scan_prune, but hasn't been recognized as such by heap_page_prune. There is no way that they'll ever "agree" on the tuple being DEAD right now, because pruning still doesn't assume that an orphaned heap-only tuple is fully DEAD. We can either do that, or we can throw an error concerning corruption when heap_page_prune notices orphaned tuples. Neither seems particularly appealing. But it definitely makes no sense to allow lazy_scan_prune to spin in a futile attempt to reach agreement with heap_page_prune about a DEAD tuple really being DEAD. > Given that heap_surgery's raison d'etre is correcting corruption etc, I think > it makes sense for it to do as minimal work as possible. Iterating through a > HOT chain would be a problem if you e.g. tried to repair a page with HOT > corruption. I guess that's also true. There is at least a legitimate argument to be made for not leaving behind any orphaned heap-only tuples. The interface is a TID, and so the user may already believe that they're killing the heap-only, not just the root item (since ctid suggests that the TID of a heap-only tuple is the TID of the root item, which is kind of misleading). Anyway, we can decide on what to do in heap_surgery later, once the main issue is under control. My point was mostly just that orphaned heap-only tuples are definitely not okay, in general. They are the least worst option when corruption has already happened, maybe -- but maybe not. -- Peter Geoghegan
Attachment
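[Editor's note: for readers who haven't seen it, the loop that gets stuck has roughly the shape below. This is a simplified paraphrase of lazy_scan_prune, not the exact source; "prune_page" and "tuple_is_dead_to_vacuum" are stand-ins for heap_page_prune and the HeapTupleSatisfiesVacuum check against OldestXmin, and items without storage are skipped in the real code.]

    #include "postgres.h"

    #include "storage/bufpage.h"
    #include "storage/off.h"

    /* stand-ins for heap_page_prune and the HTSV check, declared elsewhere */
    static void prune_page(Page page);
    static bool tuple_is_dead_to_vacuum(Page page, OffsetNumber offnum);

    static void
    scan_prune_paraphrase(Page page)
    {
        OffsetNumber offnum;
        OffsetNumber maxoff;

    retry:
        /* Pruning is expected to remove every tuple that is DEAD to VACUUM */
        prune_page(page);

        maxoff = PageGetMaxOffsetNumber(page);
        for (offnum = FirstOffsetNumber; offnum <= maxoff;
             offnum = OffsetNumberNext(offnum))
        {
            /*
             * If a tuple still looks DEAD after pruning, assume the situation
             * changed underneath us and start over.  With an orphaned DEAD
             * heap-only tuple that pruning can never reach, this retries
             * forever.
             */
            if (tuple_is_dead_to_vacuum(page, offnum))
                goto retry;
        }
    }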
Hi, On 2022-02-19 18:16:54 -0800, Peter Geoghegan wrote: > On Sat, Feb 19, 2022 at 5:54 PM Andres Freund <andres@anarazel.de> wrote: > > How does that cause the endless loop? > > Attached is the page image itself, dumped via gdb (and gzip'd). This > was on recent HEAD (commit 8f388f6f, actually), plus > 0001-Add-adversarial-ConditionalLockBuff[...]. No other changes. No > defragmenting in pg_surgery, nothing like that. > > It doesn't do so on HEAD + 0001-Add-adversarial-ConditionalLockBuff[...] for > > me. So something needs have changed with your patch? > > It doesn't always happen -- only about half the time on my machine. > Maybe it's timing sensitive? Ah, I'd only run the tests three times or so, without it happening. Trying a few more times repro'd it. It's kind of surprising that this needs this 0001-Add-adversarial-ConditionalLockBuff to break. I suspect it's a question of hint bits changing due to lazy_scan_noprune(), which then makes HeapTupleHeaderIsHotUpdated() have a different return value, preventing the "If the tuple is DEAD and doesn't chain to anything else" path from being taken. > We hit the "goto retry" on offnum 2, which is the first tuple with > storage (you can see "the ghost" of the tuple from the LP_DEAD item at > offnum 1, since the page isn't defragmented in pg_surgery). I think > that this happens because the heap-only tuple at offnum 2 is fully > DEAD to lazy_scan_prune, but hasn't been recognized as such by > heap_page_prune. There is no way that they'll ever "agree" on the > tuple being DEAD right now, because pruning still doesn't assume that > an orphaned heap-only tuple is fully DEAD. > We can either do that, or we can throw an error concerning corruption > when heap_page_prune notices orphaned tuples. Neither seems > particularly appealing. But it definitely makes no sense to allow > lazy_scan_prune to spin in a futile attempt to reach agreement with > heap_page_prune about a DEAD tuple really being DEAD. Yea, this sucks. I think we should go for the rewrite of the heap_prune_chain() logic. The current approach is just never going to be robust. Greetings, Andres Freund
On Sat, Feb 19, 2022 at 7:01 PM Andres Freund <andres@anarazel.de> wrote: > > We can either do that, or we can throw an error concerning corruption > > when heap_page_prune notices orphaned tuples. Neither seems > > particularly appealing. But it definitely makes no sense to allow > > lazy_scan_prune to spin in a futile attempt to reach agreement with > > heap_page_prune about a DEAD tuple really being DEAD. > > Yea, this sucks. I think we should go for the rewrite of the > heap_prune_chain() logic. The current approach is just never going to be > robust. No, it just isn't robust enough. But it's not that hard to fix. My patch really wasn't invasive. I confirmed that HeapTupleSatisfiesVacuum() and heap_prune_satisfies_vacuum() agree that the heap-only tuple at offnum 2 is HEAPTUPLE_DEAD -- they are in agreement, as expected (so no reason to think that there is a new bug involved). The problem here is indeed just that heap_prune_chain() can't "get to" the tuple, given its current design. For anybody else that doesn't follow what we're talking about: The "doesn't chain to anything else" code at the start of heap_prune_chain() won't get to the heap-only tuple at offnum 2, since the tuple is itself HeapTupleHeaderIsHotUpdated() -- the expectation is that it'll be processed later on, once we locate the HOT chain's root item. Since, of course, the "root item" was already LP_DEAD before we even reached heap_page_prune() (on account of the pg_surgery corruption), there is no possible way that that can happen later on. And so we cannot find the same heap-only tuple and mark it LP_UNUSED (which is how we always deal with HEAPTUPLE_DEAD heap-only tuples) during pruning. -- Peter Geoghegan
On Sat, Feb 19, 2022 at 7:01 PM Andres Freund <andres@anarazel.de> wrote: > It's kind of surprising that this needs this > 0001-Add-adversarial-ConditionalLockBuff to break. I suspect it's a question > of hint bits changing due to lazy_scan_noprune(), which then makes > HeapTupleHeaderIsHotUpdated() have a different return value, preventing the > "If the tuple is DEAD and doesn't chain to anything else" > path from being taken. That makes sense as an explanation. Goes to show just how fragile the "DEAD and doesn't chain to anything else" logic at the top of heap_prune_chain really is. -- Peter Geoghegan
Hi, On 2022-02-19 19:07:39 -0800, Peter Geoghegan wrote: > On Sat, Feb 19, 2022 at 7:01 PM Andres Freund <andres@anarazel.de> wrote: > > > We can either do that, or we can throw an error concerning corruption > > > when heap_page_prune notices orphaned tuples. Neither seems > > > particularly appealing. But it definitely makes no sense to allow > > > lazy_scan_prune to spin in a futile attempt to reach agreement with > > > heap_page_prune about a DEAD tuple really being DEAD. > > > > Yea, this sucks. I think we should go for the rewrite of the > > heap_prune_chain() logic. The current approach is just never going to be > > robust. > > No, it just isn't robust enough. But it's not that hard to fix. My > patch really wasn't invasive. I think we're in agreement there. We might think at some point about backpatching too, but I'd rather have it stew in HEAD for a bit first. > I confirmed that HeapTupleSatisfiesVacuum() and > heap_prune_satisfies_vacuum() agree that the heap-only tuple at offnum > 2 is HEAPTUPLE_DEAD -- they are in agreement, as expected (so no > reason to think that there is a new bug involved). The problem here is > indeed just that heap_prune_chain() can't "get to" the tuple, given > its current design. Right. The reason that the "adversarial" patch makes a difference is solely that it changes the heap_surgery test to actually kill an item, which it doesn't intend:

create temp table htab2(a int);
insert into htab2 values (100);
update htab2 set a = 200;
vacuum htab2;
-- redirected TIDs should be skipped
select heap_force_kill('htab2'::regclass, ARRAY['(0, 1)']::tid[]);

If the vacuum can get the cleanup lock despite the adversarial patch, the heap_force_kill() doesn't do anything, because the first item is a redirect. However if it *can't* get a cleanup lock, heap_force_kill() instead targets the root item, triggering the endless loop. Hm. I think this might be a mild regression in 14. In < 14 we'd just skip the tuple in lazy_scan_heap(), but now we have an uninterruptible endless loop. We'd do completely bogus stuff later in < 14 though, I think we'd just leave it in place despite being older than relfrozenxid, which obviously has its own set of issues. Greetings, Andres Freund
On Sat, Feb 19, 2022 at 6:16 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Given that heap_surgery's raison d'etre is correcting corruption etc, I think > > it makes sense for it to do as minimal work as possible. Iterating through a > > HOT chain would be a problem if you e.g. tried to repair a page with HOT > > corruption. > > I guess that's also true. There is at least a legitimate argument to > be made for not leaving behind any orphaned heap-only tuples. The > interface is a TID, and so the user may already believe that they're > killing the heap-only, not just the root item (since ctid suggests > that the TID of a heap-only tuple is the TID of the root item, which > is kind of misleading). Actually, I would say that heap_surgery's raison d'etre is making weird errors related to corruption of this or that TID go away, so that the user can cut their losses. That's how it's advertised. Let's assume that we don't want to make VACUUM/pruning just treat orphaned heap-only tuples as DEAD, regardless of their true HTSV-wise status -- let's say that we want to err in the direction of doing nothing at all with the page. Now we have to have a weird error in VACUUM instead (not great, but better than just spinning between lazy_scan_prune and heap_prune_page). And we've just created natural demand for heap_surgery to deal with the problem by deleting whole HOT chains (not just root items). If we allow VACUUM to treat orphaned heap-only tuples as DEAD right away, then we might as well do the same thing in heap_surgery, since there is little chance that the user will get to the heap-only tuples before VACUUM does (not something to rely on, at any rate). Either way, I think we probably end up needing to teach heap_surgery to kill entire HOT chains as a group, given a TID. -- Peter Geoghegan
On Sat, Feb 19, 2022 at 7:28 PM Andres Freund <andres@anarazel.de> wrote: > If the vacuum can get the cleanup lock due to the adversarial patch, the > heap_force_kill() doesn't do anything, because the first item is a > redirect. However if it *can't* get a cleanup lock, heap_force_kill() instead > targets the root item. Triggering the endless loop. But it shouldn't matter if the root item is an LP_REDIRECT or a normal (not heap-only) tuple with storage. Either way it's the root of a HOT chain. The fact that pg_surgery treats LP_REDIRECT items differently from the other kind of root items is just arbitrary. It seems to have more to do with freezing tuples than killing tuples. -- Peter Geoghegan
Hi, On 2022-02-19 19:31:21 -0800, Peter Geoghegan wrote: > On Sat, Feb 19, 2022 at 6:16 PM Peter Geoghegan <pg@bowt.ie> wrote: > > > Given that heap_surgery's raison d'etre is correcting corruption etc, I think > > > it makes sense for it to do as minimal work as possible. Iterating through a > > > HOT chain would be a problem if you e.g. tried to repair a page with HOT > > > corruption. > > > > I guess that's also true. There is at least a legitimate argument to > > be made for not leaving behind any orphaned heap-only tuples. The > > interface is a TID, and so the user may already believe that they're > > killing the heap-only, not just the root item (since ctid suggests > > that the TID of a heap-only tuple is the TID of the root item, which > > is kind of misleading). > > Actually, I would say that heap_surgery's raison d'etre is making > weird errors related to corruption of this or that TID go away, so > that the user can cut their losses. That's how it's advertised. I'm not that sure those are that different... Imagine some corruption leading to two hot chains ending in the same tid, which our fancy new secure pruning algorithm might detect. Either way, I'm a bit surprised about the logic to not allow killing redirect items? What if you have a redirect pointing to an unused item? > Let's assume that we don't want to make VACUUM/pruning just treat > orphaned heap-only tuples as DEAD, regardless of their true HTSV-wise > status I don't think that'd ever be a good idea. Those tuples are visible to a seqscan after all. > -- let's say that we want to err in the direction of doing > nothing at all with the page. Now we have to have a weird error in > VACUUM instead (not great, but better than just spinning between > lazy_scan_prune and heap_prune_page). Non DEAD orphaned versions shouldn't cause a problem in lazy_scan_prune(). The problem here is a DEAD orphaned HOT tuples, and those we should be able to delete with the new page pruning logic, right? I think it might be worth getting rid of the need for the retry approach by reusing the same HTSV status array between heap_prune_page and lazy_scan_prune. Then the only legitimate reason for seeing a DEAD item in lazy_scan_prune() would be some form of corruption. And it'd be a pretty decent performance boost, HTSV ain't cheap. Greetings, Andres Freund
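[Editor's note: the array-sharing idea floated at the end of the message above might look something like this; the struct and field names are assumptions for illustration, not anything from an actual patch.]

    #include "postgres.h"

    #include "access/htup_details.h"

    /*
     * Sketch: pruning records the HTSV verdict it computes for each offset
     * number, so that lazy_scan_prune can reuse it instead of calling
     * HeapTupleSatisfiesVacuum a second time for every tuple.
     */
    typedef struct PruneResultSketch
    {
        int         ndeleted;       /* what pruning already returns today */

        /*
         * One entry per offset number (1-based); entries for items without
         * storage are left as -1, others hold an HTSV_Result value.
         */
        int8        htsv[MaxHeapTuplesPerPage + 1];
    } PruneResultSketch;

With something along these lines, lazy_scan_prune could treat a HEAPTUPLE_DEAD entry that survived pruning as corruption (and WARN or ERROR), instead of retrying in the hope that pruning will eventually agree.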
On Sat, Feb 19, 2022 at 7:47 PM Andres Freund <andres@anarazel.de> wrote: > I'm not that sure those are that different... Imagine some corruption leading > to two hot chains ending in the same tid, which our fancy new secure pruning > algorithm might detect. I suppose that's possible, but it doesn't seem all that likely to ever happen, what with the xmin -> xmax cross-tuple matching stuff. > Either way, I'm a bit surprised about the logic to not allow killing redirect > items? What if you have a redirect pointing to an unused item? Again, I simply think it boils down to having to treat HOT chains as a whole unit when killing TIDs. > > Let's assume that we don't want to make VACUUM/pruning just treat > > orphaned heap-only tuples as DEAD, regardless of their true HTSV-wise > > status > > I don't think that'd ever be a good idea. Those tuples are visible to a > seqscan after all. I agree (I don't hate it completely, but it seems mostly bad). This is what leads me to the conclusion that pg_surgery has to be able to do this instead. Surely it's not okay to have something that makes VACUUM always end in error, that cannot even be fixed by pg_surgery. > > -- let's say that we want to err in the direction of doing > > nothing at all with the page. Now we have to have a weird error in > > VACUUM instead (not great, but better than just spinning between > > lazy_scan_prune and heap_prune_page). > > Non DEAD orphaned versions shouldn't cause a problem in lazy_scan_prune(). The > problem here is a DEAD orphaned HOT tuples, and those we should be able to > delete with the new page pruning logic, right? Right. But what good does that really do? The problematic page had a third tuple (at offnum 3) that was LIVE. If we could have done something about the problematic tuple at offnum 2 (which is where we got stuck), then we'd still be left with a very unpleasant choice about what happens to the third tuple. > I think it might be worth getting rid of the need for the retry approach by > reusing the same HTSV status array between heap_prune_page and > lazy_scan_prune. Then the only legitimate reason for seeing a DEAD item in > lazy_scan_prune() would be some form of corruption. And it'd be a pretty > decent performance boost, HTSV ain't cheap. I guess it doesn't actually matter if we leave an aborted DEAD tuple behind, that we could have pruned away, but didn't. The important thing is to be consistent at the level of the page. -- Peter Geoghegan
Hi, On February 19, 2022 7:56:53 PM PST, Peter Geoghegan <pg@bowt.ie> wrote: >On Sat, Feb 19, 2022 at 7:47 PM Andres Freund <andres@anarazel.de> wrote: >> Non DEAD orphaned versions shouldn't cause a problem in lazy_scan_prune(). The >> problem here is a DEAD orphaned HOT tuples, and those we should be able to >> delete with the new page pruning logic, right? > >Right. But what good does that really do? The problematic page had a >third tuple (at offnum 3) that was LIVE. If we could have done >something about the problematic tuple at offnum 2 (which is where we >got stuck), then we'd still be left with a very unpleasant choice >about what happens to the third tuple. Why does anything need to happen to it from vacuum's POV? It'll not be a problem for freezing etc. Until it's deleted vacuum doesn't need to care. Probably worth a WARNING, and amcheck definitely needs to detect it, but otherwise I think it's fine to just continue. >> I think it might be worth getting rid of the need for the retry approach by >> reusing the same HTSV status array between heap_prune_page and >> lazy_scan_prune. Then the only legitimate reason for seeing a DEAD item in >> lazy_scan_prune() would be some form of corruption. And it'd be a pretty >> decent performance boost, HTSV ain't cheap. > >I guess it doesn't actually matter if we leave an aborted DEAD tuple >behind, that we could have pruned away, but didn't. The important >thing is to be consistent at the level of the page. That's not ok, because it opens up dangers of being interpreted differently after wraparound etc. But I don't see any cases where it would happen with the new pruning logic in your patch and sharing the HTSV status array? Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Sat, Feb 19, 2022 at 8:21 PM Andres Freund <andres@anarazel.de> wrote: > Why does anything need to happen to it from vacuum's POV? It'll not be a problem for freezing etc. Until it's deleted vacuum doesn't need to care. > > Probably worth a WARNING, and amcheck definitely needs to detect it, but otherwise I think it's fine to just continue. Maybe that's true, but it's just really weird to imagine not having an LP_REDIRECT that points to the LIVE item here, without throwing an error. Seems kind of iffy, to say the least. > >I guess it doesn't actually matter if we leave an aborted DEAD tuple > >behind, that we could have pruned away, but didn't. The important > >thing is to be consistent at the level of the page. > > That's not ok, because it opens up dangers of being interpreted differently after wraparound etc. > > But I don't see any cases where it would happen with the new pruning logic in your patch and sharing the HTSV status array? Right. Fundamentally, there isn't any reason why it should matter that VACUUM reached the heap page just before (rather than concurrent with or just after) some xact that inserted or updated on the page aborts. Just as long as we have a consistent idea about what's going on at the level of the whole page (or maybe the level of each HOT chain, but the whole page level seems simpler to me). -- Peter Geoghegan
On Sat, Feb 19, 2022 at 8:54 PM Andres Freund <andres@anarazel.de> wrote: > > Leaving behind disconnected/orphaned heap-only tuples is pretty much > > pointless anyway, since they'll never be accessible by index scans. > > Even after a REINDEX, since there is no root item from the heap page > > to go in the index. (A dump and restore might work better, though.) > > Given that heap_surgery's raison d'etre is correcting corruption etc, I think > it makes sense for it to do as minimal work as possible. Iterating through a > HOT chain would be a problem if you e.g. tried to repair a page with HOT > corruption. Yeah, I agree. I don't have time to respond to all of these emails thoroughly right now, but I think it's really important that pg_surgery do the exact surgery the user requested, and not any other work. I don't think that page defragmentation should EVER be REQUIRED as a condition of other work. If other code is relying on that, I'd say it's busted. I'm a little more uncertain about the case where we kill the root tuple of a HOT chain, because I can see that this might leave the page in a state where sequential scans see the tuple at the end of the chain and index scans don't. I'm not sure whether that should be the responsibility of pg_surgery itself to avoid, or whether that's your problem as a user of it -- although I lean mildly toward the latter view, at the moment. But in any case surely the pruning code can't just decide to go into an infinite loop if that happens. Code that manipulates the states of data pages needs to be as robust against arbitrary on-disk states as we can reasonably make it, because pages get garbled on disk all the time. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Feb 18, 2022 at 7:12 PM Peter Geoghegan <pg@bowt.ie> wrote: > We have to worry about XIDs from MultiXacts (and xmax values more > generally). And we have to worry about the case where we start out > with only xmin frozen (by an earlier VACUUM), and then have to freeze > xmax too. I believe that we have to generally consider xmin and xmax > independently. For example, we cannot ignore xmax, just because we > looked at xmin, since in general xmin alone might have already been > frozen. Right, so we at least need to add a similar comment to what I proposed for MXIDs, and maybe other changes are needed, too. > The difference between the cleanup lock path (in > lazy_scan_prune/heap_prepare_freeze_tuple) and the share lock path (in > lazy_scan_noprune/heap_tuple_needs_freeze) is what is at issue in both > of these confusing comment blocks, really. Note that cutoff_xid is the > name that both heap_prepare_freeze_tuple and heap_tuple_needs_freeze > have for FreezeLimit (maybe we should rename every occurence of > cutoff_xid in heapam.c to FreezeLimit). > > At a high level, we aren't changing the fundamental definition of an > aggressive VACUUM in any of the patches -- we still need to advance > relfrozenxid up to FreezeLimit in an aggressive VACUUM, just like on > HEAD, today (we may be able to advance it *past* FreezeLimit, but > that's just a bonus). But in a non-aggressive VACUUM, where there is > still no strict requirement to advance relfrozenxid (by any amount), > the code added by 0001 can set relfrozenxid to any known safe value, > which could either be from before FreezeLimit, or after FreezeLimit -- > almost anything is possible (provided we respect the relfrozenxid > invariant, and provided we see that we didn't skip any > all-visible-not-all-frozen pages). > > Since we still need to "respect FreezeLimit" in an aggressive VACUUM, > the aggressive case might need to wait for a full cleanup lock the > hard way, having tried and failed to do it the easy way within > lazy_scan_noprune (lazy_scan_noprune will still return false when any > call to heap_tuple_needs_freeze for any tuple returns false) -- same > as on HEAD, today. > > And so the difference at issue here is: FreezeLimit/cutoff_xid only > needs to affect the new NewRelfrozenxid value we use for relfrozenxid in > heap_prepare_freeze_tuple, which is involved in real freezing -- not > in heap_tuple_needs_freeze, whose main purpose is still to help us > avoid freezing where a cleanup lock isn't immediately available. While > the purpose of FreezeLimit/cutoff_xid within heap_tuple_needs_freeze > is to determine its bool return value, which will only be of interest > to the aggressive case (which might have to get a cleanup lock and do > it the hard way), not the non-aggressive case (where ratcheting back > NewRelfrozenxid is generally possible, and generally leaves us with > almost as good of a value). 
> > In other words: the calls to heap_tuple_needs_freeze made from > lazy_scan_noprune are simply concerned with the page as it actually > is, whereas the similar/corresponding calls to > heap_prepare_freeze_tuple from lazy_scan_prune are concerned with > *what the page will actually become*, after freezing finishes, and > after lazy_scan_prune is done with the page entirely (ultimately > the final NewRelfrozenxid value set in pg_class.relfrozenxid only has > to be <= the oldest extant XID *at the time the VACUUM operation is > just about to end*, not some earlier time, so "being versus becoming" > is an interesting distinction for us). > > Maybe the way that FreezeLimit/cutoff_xid is overloaded can be fixed > here, to make all of this less confusing. I only now fully realized > how confusing all of this stuff is -- very. Right. I think I understand all of this, or at least most of it -- but not from the comment. The question is how the comment can be more clear. My general suggestion is that function header comments should have more to do with the behavior of the function than how it fits into the bigger picture. If it's clear to the reader what conditions must hold before calling the function and which must hold on return, it helps a lot. IMHO, it's the job of the comments in the calling function to clarify why we then choose to call that function at the place and in the way that we do. > As a general rule, we try to freeze all of the remaining live tuples > on a page (following pruning) together, as a group, or none at all. > Most of the time this is triggered by our noticing that the page is > about to be set all-visible (but not all-frozen), and doing work > sufficient to mark it fully all-frozen instead. Occasionally there is > FreezeLimit to consider, which is now more of a backstop thing, used > to make sure that we never get too far behind in terms of unfrozen > XIDs. This is useful in part because it avoids any future > non-aggressive VACUUM that is fundamentally unable to advance > relfrozenxid (you can't skip all-visible pages if there are only > all-frozen pages in the VM in practice). > > We're generally doing a lot more freezing with 0002, but we still > manage to avoid freezing too much in tables like pgbench_tellers or > pgbench_branches -- tables where it makes the least sense. Such tables > will be updated so frequently that VACUUM is relatively unlikely to > ever mark any page all-visible, avoiding the main criteria for > freezing implicitly. It's also unlikely that they'll ever have an XID that is so > old to trigger the fallback FreezeLimit-style criteria for freezing. > > In practice, freezing tuples like this is generally not that expensive in > most tables where VACUUM freezes the majority of pages immediately > (tables that aren't like pgbench_tellers or pgbench_branches), because > they're generally big tables, where the overhead of FPIs tends > to dominate anyway (gambling that we can avoid more FPIs later on is not a > bad gamble, as gambles go). This seems to make the overhead > acceptable, on balance. Granted, you might be able to poke holes in > that argument, and reasonable people might disagree on what acceptable > should mean. There are many value judgements here, which makes it > complicated. (On the other hand we might be able to do better if there > was a particularly bad case for the 0002 work, if one came to light.) I think that the idea has potential, but I don't think that I understand yet what the *exact* algorithm is. 
Maybe I need to read the code, when I have some time for that. I can't form an intelligent opinion at this stage about whether this is likely to be a net positive. -- Robert Haas EDB: http://www.enterprisedb.com
On Sun, Feb 20, 2022 at 7:30 AM Robert Haas <robertmhaas@gmail.com> wrote: > Right, so we at least need to add a similar comment to what I proposed > for MXIDs, and maybe other changes are needed, too. Agreed. > > Maybe the way that FreezeLimit/cutoff_xid is overloaded can be fixed > > here, to make all of this less confusing. I only now fully realized > > how confusing all of this stuff is -- very. > > Right. I think I understand all of this, or at least most of it -- but > not from the comment. The question is how the comment can be more > clear. My general suggestion is that function header comments should > have more to do with the behavior of the function than how it fits > into the bigger picture. If it's clear to the reader what conditions > must hold before calling the function and which must hold on return, > it helps a lot. IMHO, it's the job of the comments in the calling > function to clarify why we then choose to call that function at the > place and in the way that we do. You've given me a lot of high quality feedback on all of this, which I'll work through soon. It's hard to get the balance right here, but it's made much easier by this kind of feedback. > I think that the idea has potential, but I don't think that I > understand yet what the *exact* algorithm is. The algorithm seems to exploit a natural tendency that Andres once described in a blog post about his snapshot scalability work [1]. To a surprising extent, we can usefully bucket all tuples/pages into two simple categories:

1. Very, very old ("infinitely old" for all practical purposes).

2. Very, very new.

There doesn't seem to be much need for a third "in-between" category in practice. This seems to be at least approximately true all of the time. Perhaps Andres wouldn't agree with this very general statement -- he actually said something more specific. I for one believe that the point he made generalizes surprisingly well, though. I have my own theories about why this appears to be true. (Executive summary: power laws are weird, and it seems as if the sparsity-of-effects principle makes it easy to bucket things at the highest level, in a way that generalizes well across disparate workloads.) > Maybe I need to read the > code, when I have some time for that. I can't form an intelligent > opinion at this stage about whether this is likely to be a net > positive. The code in the v8-0002 patch is a bit sloppy right now. I didn't quite get around to cleaning it up -- I was focussed on performance validation of the algorithm itself. So bear that in mind if you do look at v8-0002 (might want to wait for v9-0002 before looking). I believe that the only essential thing about the algorithm itself is that it freezes all the tuples on a page when it anticipates setting the page all-visible, or (barring edge cases) freezes none at all. (Note that setting the page all-visible/all-frozen may happen just after lazy_scan_prune returns, or in the second pass over the heap, after LP_DEAD items are set to LP_UNUSED -- lazy_scan_prune doesn't care which way it will happen.) There are one or two other design choices that we need to make, like what exact tuples we freeze in the edge case where FreezeLimit/XID age forces us to freeze in lazy_scan_prune. These other design choices don't seem relevant to the issue of central importance, which is whether or not we come out ahead overall with this new algorithm. 
FreezeLimit will seldom affect our choice to freeze or not freeze now, and so AFAICT the exact way that FreezeLimit affects which precise freezing-eligible tuples we freeze doesn't complicate performance validation. Remember when I got excited about how my big TPC-C benchmark run showed a predictable, tick/tock style pattern across VACUUM operations against the order and order lines table [2]? It seemed very significant to me that the OldestXmin of VACUUM operation n consistently went on to become the new relfrozenxid for the same table in VACUUM operation n + 1. It wasn't exactly the same XID, but very close to it (within the range of noise). This pattern was clearly present, even though VACUUM operation n + 1 might happen as long as 4 or 5 hours after VACUUM operation n (this was a big table). This pattern was encouraging to me because it showed (at least for the workload and tables in question) that the amount of unnecessary extra freezing can't have been too bad -- the fact that we can always advance relfrozenxid in the same way is evidence of that. Note that the vacuum_freeze_min_age setting can't have affected our choice of what to freeze (given what we see in the logs), and yet there is a clear pattern where the pages (it's really pages, not tuples) that the new algorithm doesn't freeze in VACUUM operation n will reliably get frozen in VACUUM operation n + 1 instead. And so this pattern seems to lend support to the general idea of letting the workload itself be the primary driver of what pages we freeze (not FreezeLimit, and not anything based on XIDs). That's really the underlying principle behind the new algorithm -- freezing is driven by workload characteristics (or page/block characteristics, if you prefer). ISTM that vacuum_freeze_min_age is almost impossible to tune -- XID age is just too squishy a concept for that to ever work. [1] https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/improving-postgres-connection-scalability-snapshots/ba-p/1806462#interlude-removing-the-need-for-recentglobalxminhorizon [2] https://postgr.es/m/CAH2-Wz=iLnf+0CsaB37efXCGMRJO1DyJw5HMzm7tp1AxG1NR2g@mail.gmail.com -- scroll down to "TPC-C", which has the relevant autovacuum log output for the orders table, covering a 24 hour period -- Peter Geoghegan
On Sun, Feb 20, 2022 at 12:27 PM Peter Geoghegan <pg@bowt.ie> wrote: > You've given me a lot of high quality feedback on all of this, which > I'll work through soon. It's hard to get the balance right here, but > it's made much easier by this kind of feedback. Attached is v9. Lots of changes. Highlights:

* Much improved 0001 ("loosen coupling" dynamic relfrozenxid tracking patch). Some of the improvements are due to recent feedback from Robert.

* Much improved 0002 ("Make page-level characteristics drive freezing" patch). Whole new approach to the implementation, though the same algorithm as before.

* No more FSM patch -- that was totally separate work, that I shouldn't have attached to this project.

* There are 2 new patches (these are now 0003 and 0004), both of which are concerned with allowing non-aggressive VACUUM to consistently advance relfrozenxid. I think that 0003 makes sense on general principle, but I'm much less sure about 0004. These aren't too important.

While working on the new approach to freezing taken by v9-0002, I had some insight about the issues that Robert raised around 0001, too. I wasn't expecting that to happen. 0002 makes page-level freezing a first class thing. heap_prepare_freeze_tuple now has some (limited) knowledge of how this works. heap_prepare_freeze_tuple's cutoff_xid argument is now always the VACUUM caller's OldestXmin (not its FreezeLimit, as before). We still have to pass FreezeLimit to heap_prepare_freeze_tuple, which helps us to respect FreezeLimit as a backstop, and so now it's passed via the new backstop_cutoff_xid argument instead. Whenever we opt to "freeze a page", the new page-level algorithm *always* uses the most recent possible XID and MXID values (OldestXmin and oldestMxact) to decide what XIDs/XMIDs need to be replaced. That might sound like it'd be too much, but it only applies to those pages that we actually decide to freeze (since page-level characteristics drive everything now). FreezeLimit is only one way of triggering that now (and one of the least interesting and rarest). 0002 also adds an alternative set of relfrozenxid/relminmxid tracker variables, to make the "don't freeze the page" path within lazy_scan_prune simpler (if you don't want to freeze the page, then use the set of tracker variables that go with that choice, which heap_prepare_freeze_tuple knows about and helps with). With page-level freezing, lazy_scan_prune wants to make a decision about the page as a whole, at the last minute, after all heap_prepare_freeze_tuple calls have already been made. So I think that heap_prepare_freeze_tuple needs to know about that aspect of lazy_scan_prune's behavior. When we *don't* want to freeze the page, we more or less need everything related to freezing inside lazy_scan_prune to behave like lazy_scan_noprune, which never freezes the page (that's mostly the point of lazy_scan_noprune). And that's almost what we actually do -- heap_prepare_freeze_tuple now outsources maintenance of this alternative set of "don't freeze the page" relfrozenxid/relminmxid tracker variables to its sibling function, heap_tuple_needs_freeze. That is the same function that lazy_scan_noprune itself actually calls. Now back to Robert's feedback on 0001, which had very complicated comments in the last version. This approach seems to make the "being versus becoming" or "going to freeze versus not going to freeze" distinctions much clearer. This is less true if you assume that 0002 won't be committed but 0001 will be. 
Even if that happens with Postgres 15, I have to imagine that adding something like 0002 must be the real goal, long term. Without 0002, the value from 0001 is far more limited. You need both together to get the virtuous cycle I've described. The approach with always using OldestXmin as cutoff_xid and oldestMxact as our cutoff_multi makes a lot of sense to me, in part because I think that it might well cut down on the tendency of VACUUM to allocate new MultiXacts in order to be able to freeze old ones. AFAICT the only reason that heap_prepare_freeze_tuple does that is because it has no flexibility on FreezeLimit and MultiXactCutoff. These are derived from vacuum_freeze_min_age and vacuum_multixact_freeze_min_age, respectively, and so they're two independent though fairly meaningless cutoffs. On the other hand, OldestXmin and OldestMxact are not independent in the same way. We get both of them at the same time and the same place, in vacuum_set_xid_limits. OldestMxact really is very close to OldestXmin -- only the units differ. It seems that heap_prepare_freeze_tuple allocates new MXIDs (when freezing old ones) in large part so it can NOT freeze XIDs that it would have been useful (and much cheaper) to remove anyway. On HEAD, FreezeMultiXactId() doesn't get passed down the VACUUM operation's OldestXmin at all (it actually just gets FreezeLimit passed as its cutoff_xid argument). It cannot possibly recognize any of this for itself. Does that theory about MultiXacts sound plausible? I'm not claiming that the patch makes it impossible that FreezeMultiXactId() will have to allocate a new MultiXact to freeze during VACUUM -- the freeze-the-dead isolation tests already show that that's not true. I just think that page-level freezing based on page characteristics with oldestXmin and oldestMxact (not FreezeLimit and MultiXactCutoff) cutoffs might make it a lot less likely in practice. oldestXmin and oldestMxact map to the same wall clock time, more or less -- that seems like it might be an important distinction, independent of everything else. Thanks -- Peter Geoghegan
Attachment
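[Editor's note: to make the "two tracker sets" idea from the message above a bit more concrete, the end of lazy_scan_prune might look roughly like this under 0002. "freeze_this_page" and the frozen_*/nofreeze_* locals are assumed names, while NewRelfrozenXid/NewRelminMxid are the tracker fields added by 0001; both candidate pairs would start out from the running values for the scan at the top of the page.]

        if (freeze_this_page)
        {
            /*
             * Freezing: the trackers reflect the page as it will be once all
             * of the prepared freeze plans have been executed, using
             * OldestXmin/oldestMxact as the cutoffs.
             */
            vacrel->NewRelfrozenXid = frozen_newrelfrozenxid;
            vacrel->NewRelminMxid = frozen_newrelminmxid;

            /* execute the prepared freeze plans for the page here */
        }
        else
        {
            /*
             * Not freezing: the trackers reflect the page exactly as it is,
             * maintained by heap_tuple_needs_freeze-style bookkeeping, just
             * as in lazy_scan_noprune.
             */
            vacrel->NewRelfrozenXid = nofreeze_newrelfrozenxid;
            vacrel->NewRelminMxid = nofreeze_newrelminmxid;
        }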
Hi, On 2022-02-24 20:53:08 -0800, Peter Geoghegan wrote: > 0002 makes page-level freezing a first class thing. > heap_prepare_freeze_tuple now has some (limited) knowledge of how this > works. heap_prepare_freeze_tuple's cutoff_xid argument is now always > the VACUUM caller's OldestXmin (not its FreezeLimit, as before). We > still have to pass FreezeLimit to heap_prepare_freeze_tuple, which > helps us to respect FreezeLimit as a backstop, and so now it's passed > via the new backstop_cutoff_xid argument instead. I am not a fan of the backstop terminology. It's still the reason we need to do freezing for correctness reasons. It'd make more sense to me to turn it around and call the "non-backstop" freezing opportunistic freezing or such. > Whenever we opt to > "freeze a page", the new page-level algorithm *always* uses the most > recent possible XID and MXID values (OldestXmin and oldestMxact) to > decide what XIDs/XMIDs need to be replaced. That might sound like it'd > be too much, but it only applies to those pages that we actually > decide to freeze (since page-level characteristics drive everything > now). FreezeLimit is only one way of triggering that now (and one of > the least interesting and rarest). That largely makes sense to me and doesn't seem weird. I'm a tad concerned about replacing mxids that have some members that are older than OldestXmin but not older than FreezeLimit. It's not too hard to imagine that accelerating mxid consumption considerably. But we can probably, if not already done, special case that. > It seems that heap_prepare_freeze_tuple allocates new MXIDs (when > freezing old ones) in large part so it can NOT freeze XIDs that it > would have been useful (and much cheaper) to remove anyway. Well, we may have to allocate a new mxid because some members are older than FreezeLimit but others are still running. When do we not remove xids that would have been cheaper to remove once we decide to actually do work? > On HEAD, FreezeMultiXactId() doesn't get passed down the VACUUM operation's > OldestXmin at all (it actually just gets FreezeLimit passed as its > cutoff_xid argument). It cannot possibly recognize any of this for itself. It does recognize something like OldestXmin in a more precise and expensive way - MultiXactIdIsRunning() and TransactionIdIsCurrentTransactionId(). > Does that theory about MultiXacts sound plausible? I'm not claiming > that the patch makes it impossible that FreezeMultiXactId() will have > to allocate a new MultiXact to freeze during VACUUM -- the > freeze-the-dead isolation tests already show that that's not true. I > just think that page-level freezing based on page characteristics with > oldestXmin and oldestMxact (not FreezeLimit and MultiXactCutoff) > cutoffs might make it a lot less likely in practice. Hm. I guess I'll have to look at the code for it. It doesn't immediately "feel" quite right. > oldestXmin and oldestMxact map to the same wall clock time, more or less -- > that seems like it might be an important distinction, independent of > everything else. Hm. Multis can be kept alive by fairly "young" member xids. So it may not be removable (without creating a newer multi) until much later than its creation time. So I don't think that's really true. > From 483bc8df203f9df058fcb53e7972e3912e223b30 Mon Sep 17 00:00:00 2001 > From: Peter Geoghegan <pg@bowt.ie> > Date: Mon, 22 Nov 2021 10:02:30 -0800 > Subject: [PATCH v9 1/4] Loosen coupling between relfrozenxid and freezing. 
> > When VACUUM set relfrozenxid before now, it set it to whatever value was > used to determine which tuples to freeze -- the FreezeLimit cutoff. > This approach was very naive: the relfrozenxid invariant only requires > that new relfrozenxid values be <= the oldest extant XID remaining in > the table (at the point that the VACUUM operation ends), which in > general might be much more recent than FreezeLimit. There is no fixed > relationship between the amount of physical work performed by VACUUM to > make it safe to advance relfrozenxid (freezing and pruning), and the > actual number of XIDs that relfrozenxid can be advanced by (at least in > principle) as a result. VACUUM might have to freeze all of the tuples > from a hundred million heap pages just to enable relfrozenxid to be > advanced by no more than one or two XIDs. On the other hand, VACUUM > might end up doing little or no work, and yet still be capable of > advancing relfrozenxid by hundreds of millions of XIDs as a result. > > VACUUM now sets relfrozenxid (and relminmxid) using the exact oldest > extant XID (and oldest extant MultiXactId) from the table, including > XIDs from the table's remaining/unfrozen MultiXacts. This requires that > VACUUM carefully track the oldest unfrozen XID/MultiXactId as it goes. > This optimization doesn't require any changes to the definition of > relfrozenxid, nor does it require changes to the core design of > freezing. > Final relfrozenxid values must still be >= FreezeLimit in an aggressive > VACUUM (FreezeLimit is still used as an XID-age based backstop there). > In non-aggressive VACUUMs (where there is still no strict guarantee that > relfrozenxid will be advanced at all), we now advance relfrozenxid by as > much as we possibly can. This exploits workload conditions that make it > easy to advance relfrozenxid by many more XIDs (for the same amount of > freezing/pruning work). Don't we now always advance relfrozenxid as much as we can, particularly also during aggressive vacuums? > * FRM_RETURN_IS_MULTI > * The return value is a new MultiXactId to set as new Xmax. > * (caller must obtain proper infomask bits using GetMultiXactIdHintBits) > + * > + * "relfrozenxid_out" is an output value; it's used to maintain target new > + * relfrozenxid for the relation. It can be ignored unless "flags" contains > + * either FRM_NOOP or FRM_RETURN_IS_MULTI, because we only handle multiXacts > + * here. This follows the general convention: only track XIDs that will still > + * be in the table after the ongoing VACUUM finishes. Note that it's up to > + * caller to maintain this when the Xid return value is itself an Xid. > + * > + * Note that we cannot depend on xmin to maintain relfrozenxid_out. What does it mean for xmin to maintain something? > + * See heap_prepare_freeze_tuple for information about the basic rules for the > + * cutoffs used here. > + * > + * Maintains *relfrozenxid_nofreeze_out and *relminmxid_nofreeze_out, which > + * are the current target relfrozenxid and relminmxid for the relation. We > + * assume that caller will never want to freeze its tuple, even when the tuple > + * "needs freezing" according to our return value. I don't understand the "will never want to" bit? > Caller should make temp > + * copies of global tracking variables before starting to process a page, so > + * that we can only scribble on copies. That way caller can just discard the > + * temp copies if it isn't okay with that assumption. 
> + * > + * Only aggressive VACUUM callers are expected to really care when a tuple > + * "needs freezing" according to us. It follows that non-aggressive VACUUMs > + * can use *relfrozenxid_nofreeze_out and *relminmxid_nofreeze_out in all > + * cases. Could it make sense to track can_freeze and need_freeze separately? > @@ -7158,57 +7256,59 @@ heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid, > if (tuple->t_infomask & HEAP_XMAX_IS_MULTI) > { > MultiXactId multi; > + MultiXactMember *members; > + int nmembers; > > multi = HeapTupleHeaderGetRawXmax(tuple); > - if (!MultiXactIdIsValid(multi)) > - { > - /* no xmax set, ignore */ > - ; > - } > - else if (HEAP_LOCKED_UPGRADED(tuple->t_infomask)) > + if (MultiXactIdIsValid(multi) && > + MultiXactIdPrecedes(multi, *relminmxid_nofreeze_out)) > + *relminmxid_nofreeze_out = multi; I may be misreading the diff, but aren't we know continuing to use multi down below even if !MultiXactIdIsValid()? > + if (HEAP_LOCKED_UPGRADED(tuple->t_infomask)) > return true; > - else if (MultiXactIdPrecedes(multi, cutoff_multi)) > - return true; > - else > + else if (MultiXactIdPrecedes(multi, backstop_cutoff_multi)) > + needs_freeze = true; > + > + /* need to check whether any member of the mxact is too old */ > + nmembers = GetMultiXactIdMembers(multi, &members, false, > + HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask)); Doesn't this mean we unpack the members even if the multi is old enough to need freezing? Just to then do it again during freezing? Accessing multis isn't cheap... > + if (TransactionIdPrecedes(members[i].xid, backstop_cutoff_xid)) > + needs_freeze = true; > + if (TransactionIdPrecedes(members[i].xid, > + *relfrozenxid_nofreeze_out)) > + *relfrozenxid_nofreeze_out = xid; > } > + if (nmembers > 0) > + pfree(members); > } > else > { > xid = HeapTupleHeaderGetRawXmax(tuple); > - if (TransactionIdIsNormal(xid) && > - TransactionIdPrecedes(xid, cutoff_xid)) > - return true; > + if (TransactionIdIsNormal(xid)) > + { > + if (TransactionIdPrecedes(xid, *relfrozenxid_nofreeze_out)) > + *relfrozenxid_nofreeze_out = xid; > + if (TransactionIdPrecedes(xid, backstop_cutoff_xid)) > + needs_freeze = true; > + } > } > > if (tuple->t_infomask & HEAP_MOVED) > { > xid = HeapTupleHeaderGetXvac(tuple); > - if (TransactionIdIsNormal(xid) && > - TransactionIdPrecedes(xid, cutoff_xid)) > - return true; > + if (TransactionIdIsNormal(xid)) > + { > + if (TransactionIdPrecedes(xid, *relfrozenxid_nofreeze_out)) > + *relfrozenxid_nofreeze_out = xid; > + if (TransactionIdPrecedes(xid, backstop_cutoff_xid)) > + needs_freeze = true; > + } > } This stanza is repeated a bunch. Perhaps put it in a small static inline helper? > /* VACUUM operation's cutoff for freezing XIDs and MultiXactIds */ > TransactionId FreezeLimit; > MultiXactId MultiXactCutoff; > - /* Are FreezeLimit/MultiXactCutoff still valid? */ > - bool freeze_cutoffs_valid; > + /* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */ > + TransactionId NewRelfrozenXid; > + MultiXactId NewRelminMxid; Struct member names starting with an upper case look profoundly ugly to me... But this isn't the first one, so I guess... :( > From d10f42a1c091b4dc52670fca80a63fee4e73e20c Mon Sep 17 00:00:00 2001 > From: Peter Geoghegan <pg@bowt.ie> > Date: Mon, 13 Dec 2021 15:00:49 -0800 > Subject: [PATCH v9 2/4] Make page-level characteristics drive freezing. 
> > Teach VACUUM to freeze all of the tuples on a page whenever it notices > that it would otherwise mark the page all-visible, without also marking > it all-frozen. VACUUM typically won't freeze _any_ tuples on the page > unless _all_ tuples (that remain after pruning) are all-visible. This > makes the overhead of vacuuming much more predictable over time. We > avoid the need for large balloon payments during aggressive VACUUMs > (typically anti-wraparound autovacuums). Freezing is proactive, so > we're much less likely to get into "freezing debt". I still suspect this will cause a very substantial increase in WAL traffic in realistic workloads. It's common to have workloads where tuples are inserted once, and deleted once/ partition dropped. Freezing all the tuples is a lot more expensive than just marking the page all visible. It's not uncommon to be bound by WAL traffic rather than buffer dirtying rate (since the latter may be ameliorated by s_b and local storage, whereas WAL needs to be streamed/archived). This is particularly true because log_heap_visible() doesn't need an FPW if checksums aren't enabled. A small record vs an FPI is a *huge* difference. I think we'll have to make this less aggressive or tunable. Random ideas for heuristics: - Is it likely that freezing would not require an FPI or conversely that log_heap_visible() will also need an fpi? If the page already was recently modified / checksums are enabled the WAL overhead of the freezing doesn't play much of a role. - #dead items / #force-frozen items on the page - if we already need to do more than just setting all-visible, we can probably afford the WAL traffic. - relfrozenxid vs max_freeze_age / FreezeLimit. The closer they get, the more aggressively we should freeze all-visible pages. Might even make sense to start vacuuming an increasing percentage of all-visible pages during non-aggressive vacuums, the closer we get to FreezeLimit. - Keep stats about the age of dead and frozen tuples over time. If all tuples are removed within a reasonable fraction of freeze_max_age, there's no point in freezing them. > The new approach to freezing also enables relfrozenxid advancement in > non-aggressive VACUUMs, which might be enough to avoid aggressive > VACUUMs altogether (with many individual tables/workloads). While the > non-aggressive case continues to skip all-visible (but not all-frozen) > pages (thereby making relfrozenxid advancement impossible), that in > itself will no longer hinder relfrozenxid advancement (outside of > pg_upgrade scenarios). I don't know how to parse "thereby making relfrozenxid advancement impossible ... will no longer hinder relfrozenxid advancement"? > We now consistently avoid leaving behind all-visible (not all-frozen) pages. > This (as well as work from commit 44fa84881f) makes relfrozenxid advancement > in non-aggressive VACUUMs commonplace. s/consistently/try to/? > The system accumulates freezing debt in proportion to the number of > physical heap pages with unfrozen tuples, more or less. Anything based > on XID age is likely to be a poor proxy for the eventual cost of > freezing (during the inevitable anti-wraparound autovacuum). At a high > level, freezing is now treated as one of the costs of storing tuples in > physical heap pages -- not a cost of transactions that allocate XIDs. > Although vacuum_freeze_min_age and vacuum_multixact_freeze_min_age still > influence what we freeze, and when, they effectively become backstops. 
> It may still be necessary to "freeze a page" due to the presence of a > particularly old XID, from before VACUUM's FreezeLimit cutoff, though > that will be rare in practice -- FreezeLimit is just a backstop now. I don't really like the "rare in practice" bit. It'll be rare in some workloads but others will likely be much less affected. > + * Although this interface is primarily tuple-based, vacuumlazy.c caller > + * cooperates with us to decide on whether or not to freeze whole pages, > + * together as a single group. We prepare for freezing at the level of each > + * tuple, but the final decision is made for the page as a whole. All pages > + * that are frozen within a given VACUUM operation are frozen according to > + * cutoff_xid and cutoff_multi. Caller _must_ freeze the whole page when > + * we've set *force_freeze to true! > + * > + * cutoff_xid must be caller's oldest xmin to ensure that any XID older than > + * it could neither be running nor seen as running by any open transaction. > + * This ensures that the replacement will not change anyone's idea of the > + * tuple state. Similarly, cutoff_multi must be the smallest MultiXactId used > + * by any open transaction (at the time that the oldest xmin was acquired). I think this means my concern above about increasing mxid creation rate substantially may be warranted. > + * backstop_cutoff_xid must be <= cutoff_xid, and backstop_cutoff_multi must > + * be <= cutoff_multi. When any XID/XMID from before these backstop cutoffs > + * is encountered, we set *force_freeze to true, making caller freeze the page > + * (freezing-eligible XIDs/XMIDs will be frozen, at least). "Backstop > + * freezing" ensures that VACUUM won't allow XIDs/XMIDs to ever get too old. > + * This shouldn't be necessary very often. VACUUM should prefer to freeze > + * when it's cheap (not when it's urgent). Hm. Does this mean that we might call heap_prepare_freeze_tuple and then decide not to freeze? Doesn't that mean we might create new multis over and over, because we don't end up pulling the trigger on freezing the page? > + > + /* > + * We allocated a MultiXact for this, so force freezing to avoid > + * wasting it > + */ > + *force_freeze = true; Ah, I guess not. But it'd be nicer if I didn't have to scroll down to the body of the function to figure it out... > From d2190abf366f148bae5307442e8a6245c6922e78 Mon Sep 17 00:00:00 2001 > From: Peter Geoghegan <pg@bowt.ie> > Date: Mon, 21 Feb 2022 12:46:44 -0800 > Subject: [PATCH v9 3/4] Remove aggressive VACUUM skipping special case. > > Since it's simply never okay to miss out on advancing relfrozenxid > during an aggressive VACUUM (that's the whole point), the aggressive > case treated any page from a next_unskippable_block-wise skippable block > range as an all-frozen page (not a merely all-visible page) during > skipping. Such a page might not be all-visible/all-frozen at the point > that it actually gets skipped, but it could nevertheless be safely > skipped, and then counted in frozenskipped_pages (the page must have > been all-frozen back when we determined the extent of the range of > blocks to skip, since aggressive VACUUMs _must_ scan all-visible pages). > This is necessary to ensure that aggressive VACUUMs are always capable > of advancing relfrozenxid. > The non-aggressive case behaved slightly differently: it rechecked the > visibility map for each page at the point of skipping, and only counted > pages in frozenskipped_pages when they were still all-frozen at that > time. 
But it skipped the page either way (since we already committed to > skipping the page at the point of the recheck). This was correct, but > sometimes resulted in non-aggressive VACUUMs needlessly wasting an > opportunity to advance relfrozenxid (when a page was modified in just > the wrong way, at just the wrong time). It also resulted in a needless > recheck of the visibility map for each and every page skipped during > non-aggressive VACUUMs. > > Avoid these problems by conditioning the "skippable page was definitely > all-frozen when range of skippable pages was first determined" behavior > on what the visibility map _actually said_ about the range as a whole > back when we first determined the extent of the range (don't deduce what > must have happened at that time on the basis of aggressive-ness). This > allows us to reliably count skipped pages in frozenskipped_pages when > they were initially all-frozen. In particular, when a page's visibility > map bit is unset after the point where a skippable range of pages is > initially determined, but before the point where the page is actually > skipped, non-aggressive VACUUMs now count it in frozenskipped_pages, > just like aggressive VACUUMs always have [1]. It's not critical for the > non-aggressive case to get this right, but there is no reason not to. > > [1] Actually, it might not work that way when there happens to be a mix > of all-visible and all-frozen pages in a range of skippable pages. > There is no chance of VACUUM advancing relfrozenxid in this scenario > either way, though, so it doesn't matter. I think this commit message needs a good amount of polishing - it's very convoluted. It's late and I didn't sleep well, but I've tried to read it several times without really getting a sense of what this precisely does. > From 15dec1e572ac4da0540251253c3c219eadf46a83 Mon Sep 17 00:00:00 2001 > From: Peter Geoghegan <pg@bowt.ie> > Date: Thu, 24 Feb 2022 17:21:45 -0800 > Subject: [PATCH v9 4/4] Avoid setting a page all-visible but not all-frozen. To me the commit message body doesn't actually describe what this is doing... > This is pretty much an addendum to the work in the "Make page-level > characteristics drive freezing" commit. It has been broken out like > this because I'm not even sure if it's necessary. It seems like we > might want to be paranoid about losing out on the chance to advance > relfrozenxid in non-aggressive VACUUMs, though. > The only test that will trigger this case is the "freeze-the-dead" > isolation test. It's incredibly narrow. On the other hand, why take a > chance? All it takes is one heap page that's all-visible (and not also > all-frozen) nestled between some all-frozen heap pages to lose out on > relfrozenxid advancement. The SKIP_PAGES_THRESHOLD stuff won't save us > then [1]. FWIW, I'd really like to get rid of SKIP_PAGES_THRESHOLD. It often ends up spending a lot of time doing IO that we never need, completely trashing all CPU caches, while not actually causing decent readahead IO from what I've seen. Greetings, Andres Freund
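The "small static inline helper" suggestion is easy to picture from the xmax/xvac stanzas in the quoted diff. A sketch of what it might look like (the helper name and exact shape are invented here, not taken from the actual patch; only the TransactionId macros are real):

    /*
     * Track one still-extant XID against the caller's running minimum, and
     * report whether it is old enough to force freezing.  Sketch only.
     */
    static inline bool
    TrackExtantXid(TransactionId xid, TransactionId backstop_cutoff_xid,
                   TransactionId *relfrozenxid_nofreeze_out)
    {
        if (!TransactionIdIsNormal(xid))
            return false;

        if (TransactionIdPrecedes(xid, *relfrozenxid_nofreeze_out))
            *relfrozenxid_nofreeze_out = xid;

        return TransactionIdPrecedes(xid, backstop_cutoff_xid);
    }

With something along those lines, the repeated xmax and HEAP_MOVED stanzas in the diff each reduce to a single "needs_freeze |= TrackExtantXid(...)" call.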
On Thu, Feb 24, 2022 at 11:14 PM Andres Freund <andres@anarazel.de> wrote: > I am not a fan of the backstop terminology. It's still the reason we need to > do freezing for correctness reasons. Thanks for the review! I'm not wedded to that particular terminology, but I think that we need something like it. Open to suggestions. How about limit-based? Something like that? > It'd make more sense to me to turn it > around and call the "non-backstop" freezing opportunistic freezing or such. The problem with that scheme is that it leads to a world where "standard freezing" is incredibly rare (it often literally never happens), whereas "opportunistic freezing" is incredibly common. That doesn't make much sense to me. We tend to think of 50 million XIDs (the vacuum_freeze_min_age default) as being not that many. But I think that it can be a huge number, too. Even then, it's unpredictable -- I suspect that it can change without very much changing in the application, from the point of view of users. That's a big part of the problem I'm trying to address -- freezing outside of aggressive VACUUMs is way too rare (it might barely happen at all). FreezeLimit/vacuum_freeze_min_age was designed at a time when there was no visibility map at all, when it made somewhat more sense as the thing that drives freezing. Incidentally, this is part of the problem with anti-wraparound vacuums and freezing debt -- the fact that some quite busy databases take weeks or months to go through 50 million XIDs (or 200 million) increases the pain of the eventual aggressive VACUUM. It's not completely unbounded -- autovacuum_freeze_max_age is not 100% useless here. But the extent to which that stuff bounds the debt can vary enormously, for not-very-good reasons. > > Whenever we opt to > > "freeze a page", the new page-level algorithm *always* uses the most > > recent possible XID and MXID values (OldestXmin and oldestMxact) to > > decide what XIDs/XMIDs need to be replaced. That might sound like it'd > > be too much, but it only applies to those pages that we actually > > decide to freeze (since page-level characteristics drive everything > > now). FreezeLimit is only one way of triggering that now (and one of > > the least interesting and rarest). > > That largely makes sense to me and doesn't seem weird. I'm very pleased that the main intuition behind 0002 makes sense to you. That's a start, at least. > I'm a tad concerned about replacing mxids that have some members that are > older than OldestXmin but not older than FreezeLimit. It's not too hard to > imagine that accelerating mxid consumption considerably. But we can probably, > if not already done, special case that. Let's assume for a moment that this is a real problem. I'm not sure if it is or not myself (it's complicated), but let's say that it is. The problem may be more than offset by the positive impact on relminmxid advancement. I have placed a large emphasis on enabling relfrozenxid/relminmxid advancement in every non-aggressive VACUUM, for a number of reasons -- this is one of the reasons. Finding a way for every VACUUM operation to be "vacrel->scanned_pages + vacrel->frozenskipped_pages == orig_rel_pages" (i.e. making *some* amount of relfrozenxid/relminmxid advancement possible in every VACUUM) has a great deal of value. As I said recently on the "do only critical work during single-user vacuum?" thread, why should databases that consume too many MXIDs do so evenly, across all their tables? 
There are usually one or two large tables, and many more smaller tables. I think it's much more likely that the largest tables consume approximately zero MultiXactIds in these databases -- actual MultiXactId consumption is probably concentrated in just one or two smaller tables (even when we burn through MultiXacts very quickly). But we don't recognize these kinds of distinctions at all right now. Under these conditions, we will have many more opportunities to advance relminmxid for most of the tables (including the larger tables) all the way up to current-oldestMxact with the patch series. Without needing to freeze *any* MultiXacts early (just freezing some XIDs early) to get that benefit. The patch series is not just about spreading the burden of freezing, so that non-aggressive VACUUMs freeze more -- it's also making relfrozenxid and relminmxid more recent and therefore *reliable* indicators of where any wraparound problems *really* are. Does that make sense to you? This kind of "virtuous cycle" seems really important to me. It's a subtle point, so I have to ask. > > It seems that heap_prepare_freeze_tuple allocates new MXIDs (when > > freezing old ones) in large part so it can NOT freeze XIDs that it > > would have been useful (and much cheaper) to remove anyway. > > Well, we may have to allocate a new mxid because some members are older than > FreezeLimit but others are still running. When do we not remove xids that > would have been cheaper to remove once we decide to actually do work? My point was that today, on HEAD, there is nothing fundamentally special about FreezeLimit (aka cutoff_xid) as far as heap_prepare_freeze_tuple is concerned -- and yet that's the only cutoff it knows about, really. Why can't we do better, by "exploiting the difference" between FreezeLimit and OldestXmin? > > On HEAD, FreezeMultiXactId() doesn't get passed down the VACUUM operation's > > OldestXmin at all (it actually just gets FreezeLimit passed as its > > cutoff_xid argument). It cannot possibly recognize any of this for itself. > > It does recognize something like OldestXmin in a more precise and expensive > way - MultiXactIdIsRunning() and TransactionIdIsCurrentTransactionId(). It doesn't look that way to me. While it's true that FreezeMultiXactId() will call MultiXactIdIsRunning(), that's only a cross-check. This cross-check is made at a point where we've already determined that the MultiXact in question is < cutoff_multi. In other words, it catches cases where a "MultiXactId < cutoff_multi" Multi contains an XID *that's still running* -- a correctness issue. Nothing to do with being smart about avoiding allocating new MultiXacts during freezing, or exploiting the fact that "FreezeLimit < OldestXmin" (which is almost always true, very true). This correctness issue is the same issue discussed in "NB: cutoff_xid *must* be <= the current global xmin..." comments that appear at the top of heap_prepare_freeze_tuple. That's all. > Hm. I guess I'll have to look at the code for it. It doesn't immediately > "feel" quite right. I kinda think it might be. Please let me know if you see a problem with what I've said. > > oldestXmin and oldestMxact map to the same wall clock time, more or less -- > > that seems like it might be an important distinction, independent of > > everything else. > > Hm. Multis can be kept alive by fairly "young" member xids. So it may not be > removable (without creating a newer multi) until much later than its creation > time. So I don't think that's really true. 
Maybe what I said above is true, even though (at the same time) I have *also* created new problems with "young" member xids. I really don't know right now, though. > > Final relfrozenxid values must still be >= FreezeLimit in an aggressive > > VACUUM (FreezeLimit is still used as an XID-age based backstop there). > > In non-aggressive VACUUMs (where there is still no strict guarantee that > > relfrozenxid will be advanced at all), we now advance relfrozenxid by as > > much as we possibly can. This exploits workload conditions that make it > > easy to advance relfrozenxid by many more XIDs (for the same amount of > > freezing/pruning work). > > Don't we now always advance relfrozenxid as much as we can, particularly also > during aggressive vacuums? I just meant "we hope for the best and accept what we can get". Will fix. > > * FRM_RETURN_IS_MULTI > > * The return value is a new MultiXactId to set as new Xmax. > > * (caller must obtain proper infomask bits using GetMultiXactIdHintBits) > > + * > > + * "relfrozenxid_out" is an output value; it's used to maintain target new > > + * relfrozenxid for the relation. It can be ignored unless "flags" contains > > + * either FRM_NOOP or FRM_RETURN_IS_MULTI, because we only handle multiXacts > > + * here. This follows the general convention: only track XIDs that will still > > + * be in the table after the ongoing VACUUM finishes. Note that it's up to > > + * caller to maintain this when the Xid return value is itself an Xid. > > + * > > + * Note that we cannot depend on xmin to maintain relfrozenxid_out. > > What does it mean for xmin to maintain something? Will fix. > > + * See heap_prepare_freeze_tuple for information about the basic rules for the > > + * cutoffs used here. > > + * > > + * Maintains *relfrozenxid_nofreeze_out and *relminmxid_nofreeze_out, which > > + * are the current target relfrozenxid and relminmxid for the relation. We > > + * assume that caller will never want to freeze its tuple, even when the tuple > > + * "needs freezing" according to our return value. > > I don't understand the "will never want to" bit? I meant "even when it's a non-aggressive VACUUM, which will never want to wait for a cleanup lock the hard way, and will therefore always settle for these relfrozenxid_nofreeze_out and *relminmxid_nofreeze_out values". Note the convention here, which is that relfrozenxid_nofreeze_out is not the same thing as relfrozenxid_out -- the former variable name is used for values in cases where we *don't* freeze, the latter for values in the cases where we do. Will try to clear that up. > > Caller should make temp > > + * copies of global tracking variables before starting to process a page, so > > + * that we can only scribble on copies. That way caller can just discard the > > + * temp copies if it isn't okay with that assumption. > > + * > > + * Only aggressive VACUUM callers are expected to really care when a tuple > > + * "needs freezing" according to us. It follows that non-aggressive VACUUMs > > + * can use *relfrozenxid_nofreeze_out and *relminmxid_nofreeze_out in all > > + * cases. > > Could it make sense to track can_freeze and need_freeze separately? You mean to change the signature of heap_tuple_needs_freeze, so it doesn't return a bool anymore? It just has two bool pointers as arguments, can_freeze and need_freeze? I suppose that could make sense. Don't feel strongly either way. > I may be misreading the diff, but aren't we know continuing to use multi down > below even if !MultiXactIdIsValid()? Will investigate. 
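For context, the concern here is that the rewritten block in the quoted diff drops the old early exit for an invalid multi, yet still unpacks members further down. Restoring the guard might look roughly like this (a sketch written against the quoted diff, not the eventual fix):

    multi = HeapTupleHeaderGetRawXmax(tuple);
    if (!MultiXactIdIsValid(multi))
    {
        /* no xmax set -- nothing to track here, and nothing to freeze */
    }
    else
    {
        if (MultiXactIdPrecedes(multi, *relminmxid_nofreeze_out))
            *relminmxid_nofreeze_out = multi;

        if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
            return true;
        if (MultiXactIdPrecedes(multi, backstop_cutoff_multi))
            needs_freeze = true;

        /* only now is it safe to unpack the members */
        nmembers = GetMultiXactIdMembers(multi, &members, false,
                                         HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask));
        /* ... member loop as in the diff above ... */
    }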
> Doesn't this mean we unpack the members even if the multi is old enough to > need freezing? Just to then do it again during freezing? Accessing multis > isn't cheap... Will investigate. > This stanza is repeated a bunch. Perhaps put it in a small static inline > helper? Will fix. > Struct member names starting with an upper case look profoundly ugly to > me... But this isn't the first one, so I guess... :( I am in 100% agreement, actually. But you know how it goes... > I still suspect this will cause a very substantial increase in WAL traffic in > realistic workloads. It's common to have workloads where tuples are inserted > once, and deleted once/ partition dropped. I agree with the principle that this kind of use case should be accommodated in some way. > I think we'll have to make this less aggressive or tunable. Random ideas for > heuristics: The problem that all of these heuristics have is that they will tend to make it impossible for future non-aggressive VACUUMs to be able to advance relfrozenxid. All that it takes is one single all-visible page to make that impossible. As I said upthread, I think that being able to advance relfrozenxid (and especially relminmxid) by *some* amount in every VACUUM has non-obvious value. Maybe you can address that by changing the behavior of non-aggressive VACUUMs, so that they are directly sensitive to this. Maybe they don't skip any all-visible pages when there aren't too many, that kind of thing. That needs to be in scope IMV. > I don't know how to parse "thereby making relfrozenxid advancement impossible > ... will no longer hinder relfrozenxid advancement"? Will fix. > > We now consistently avoid leaving behind all-visible (not all-frozen) pages. > > This (as well as work from commit 44fa84881f) makes relfrozenxid advancement > > in non-aggressive VACUUMs commonplace. > > s/consistently/try to/? Will fix. > > The system accumulates freezing debt in proportion to the number of > > physical heap pages with unfrozen tuples, more or less. Anything based > > on XID age is likely to be a poor proxy for the eventual cost of > > freezing (during the inevitable anti-wraparound autovacuum). At a high > > level, freezing is now treated as one of the costs of storing tuples in > > physical heap pages -- not a cost of transactions that allocate XIDs. > > Although vacuum_freeze_min_age and vacuum_multixact_freeze_min_age still > > influence what we freeze, and when, they effectively become backstops. > > It may still be necessary to "freeze a page" due to the presence of a > > particularly old XID, from before VACUUM's FreezeLimit cutoff, though > > that will be rare in practice -- FreezeLimit is just a backstop now. > > I don't really like the "rare in practice" bit. It'll be rare in some > workloads but others will likely be much less affected. Maybe. The first time one XID crosses FreezeLimit now will be enough to trigger freezing the page. So it's still very different to today. I'll change this, though. It's not important. > I think this means my concern above about increasing mxid creation rate > substantially may be warranted. Can you think of an adversarial workload, to get a sense of the extent of the problem? > > + * backstop_cutoff_xid must be <= cutoff_xid, and backstop_cutoff_multi must > > + * be <= cutoff_multi. When any XID/XMID from before these backstop cutoffs > > + * is encountered, we set *force_freeze to true, making caller freeze the page > > + * (freezing-eligible XIDs/XMIDs will be frozen, at least). 
"Backstop > > + * freezing" ensures that VACUUM won't allow XIDs/XMIDs to ever get too old. > > + * This shouldn't be necessary very often. VACUUM should prefer to freeze > > + * when it's cheap (not when it's urgent). > > Hm. Does this mean that we might call heap_prepare_freeze_tuple and then > decide not to freeze? Yes. And so heap_prepare_freeze_tuple is now a little more like its sibling function, heap_tuple_needs_freeze. > Doesn't that mean we might create new multis over and > over, because we don't end up pulling the trigger on freezing the page? > Ah, I guess not. But it'd be nicer if I didn't have to scroll down to the body > of the function to figure it out... Will fix. > I think this commit message needs a good amount of polishing - it's very > convoluted. It's late and I didn't sleep well, but I've tried to read it > several times without really getting a sense of what this precisely does. It received much less polishing than the others. Think of 0003 like this: The logic for skipping a range of blocks using the visibility map works by deciding the next_unskippable_block-wise range of skippable blocks up front. Later, we actually execute the skipping of this range of blocks (assuming it exceeds SKIP_PAGES_THRESHOLD). These are two separate steps. Right now, we do this: if (skipping_blocks && blkno < nblocks - 1) { /* * Tricky, tricky. If this is in aggressive vacuum, the page * must have been all-frozen at the time we checked whether it * was skippable, but it might not be any more. We must be * careful to count it as a skipped all-frozen page in that * case, or else we'll think we can't update relfrozenxid and * relminmxid. If it's not an aggressive vacuum, we don't * know whether it was initially all-frozen, so we have to * recheck. */ if (vacrel->aggressive || VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer)) vacrel->frozenskipped_pages++; continue; } The fact that this is conditioned in part on "vacrel->aggressive" concerns me here. Why should we have a special case for this, where we condition something on aggressive-ness that isn't actually strictly related to that? Why not just remember that the range that we're skipping was all-frozen up-front? That way non-aggressive VACUUMs are not unnecessarily at a disadvantage, when it comes to being able to advance relfrozenxid. What if we end up not incrementing vacrel->frozenskipped_pages when we easily could have, just because this is a non-aggressive VACUUM? I think that it's worth avoiding stuff like that whenever possible. Maybe this particular example isn't the most important one. For example it probably isn't as bad as the one was fixed by the lazy_scan_noprune work. But why even take a chance? Seems easier to remove the special case -- which is what this really is. > FWIW, I'd really like to get rid of SKIP_PAGES_THRESHOLD. It often ends up > causing a lot of time doing IO that we never need, completely trashing all CPU > caches, while not actually causing decent readaead IO from what I've seen. I am also suspicious of SKIP_PAGES_THRESHOLD. But if we want to get rid of it, we'll need to be sensitive to how that affects relfrozenxid advancement in non-aggressive VACUUMs IMV. Thanks again for the review! -- Peter Geoghegan
Hi, On 2022-02-25 14:00:12 -0800, Peter Geoghegan wrote: > On Thu, Feb 24, 2022 at 11:14 PM Andres Freund <andres@anarazel.de> wrote: > > I am not a fan of the backstop terminology. It's still the reason we need to > > do freezing for correctness reasons. > > Thanks for the review! > > I'm not wedded to that particular terminology, but I think that we > need something like it. Open to suggestions. > > How about limit-based? Something like that? freeze_required_limit, freeze_desired_limit? Or s/limit/cutoff/? Or s/limit/below/? I kind of like below because that answers < vs <= which I find hard to remember around freezing. > > I'm a tad concerned about replacing mxids that have some members that are > > older than OldestXmin but not older than FreezeLimit. It's not too hard to > > imagine that accelerating mxid consumption considerably. But we can probably, > > if not already done, special case that. > > Let's assume for a moment that this is a real problem. I'm not sure if > it is or not myself (it's complicated), but let's say that it is. The > problem may be more than offset by the positive impact on relminmxid > advancement. I have placed a large emphasis on enabling > relfrozenxid/relminmxid advancement in every non-aggressive VACUUM, > for a number of reasons -- this is one of the reasons. Finding a way > for every VACUUM operation to be "vacrel->scanned_pages + > vacrel->frozenskipped_pages == orig_rel_pages" (i.e. making *some* > amount of relfrozenxid/relminmxid advancement possible in every > VACUUM) has a great deal of value. That may be true, but I think working more incrementally is better in this area. I'd rather have a smaller improvement for a release, collect some data, get another improvement in the next, than see a bunch of reports of larger wins and large regressions. > As I said recently on the "do only critical work during single-user > vacuum?" thread, why should databases that > consume too many MXIDs do so evenly, across all their tables? There > are usually one or two large tables, and many more smaller tables. I > think it's much more likely that the largest tables consume > approximately zero MultiXactIds in these databases -- actual > MultiXactId consumption is probably concentrated in just one or two > smaller tables (even when we burn through MultiXacts very quickly). > But we don't recognize these kinds of distinctions at all right now. Recognizing those distinctions seems independent of freezing multixacts with live members. I am happy with freezing them more aggressively if they don't have live members. It's freezing mxids with live members that has me concerned. The limits you're proposing are quite aggressive and can advance quickly. I've seen large tables with plenty of multixacts. Typically concentrated over a value range (often changing over time). > Under these conditions, we will have many more opportunities to > advance relminmxid for most of the tables (including the larger > tables) all the way up to current-oldestMxact with the patch series. > Without needing to freeze *any* MultiXacts early (just freezing some > XIDs early) to get that benefit. The patch series is not just about > spreading the burden of freezing, so that non-aggressive VACUUMs > freeze more -- it's also making relfrozenxid and relminmxid more > recent and therefore *reliable* indicators of where any > wraparound problems *really* are. My concern was explicitly about the case where we have to create new multixacts... > Does that make sense to you? Yes. 
> > > On HEAD, FreezeMultiXactId() doesn't get passed down the VACUUM operation's > > > OldestXmin at all (it actually just gets FreezeLimit passed as its > > > cutoff_xid argument). It cannot possibly recognize any of this for itself. > > > > It does recognize something like OldestXmin in a more precise and expensive > > way - MultiXactIdIsRunning() and TransactionIdIsCurrentTransactionId(). > > It doesn't look that way to me. > > While it's true that FreezeMultiXactId() will call MultiXactIdIsRunning(), > that's only a cross-check. > This cross-check is made at a point where we've already determined that the > MultiXact in question is < cutoff_multi. In other words, it catches cases > where a "MultiXactId < cutoff_multi" Multi contains an XID *that's still > running* -- a correctness issue. Nothing to do with being smart about > avoiding allocating new MultiXacts during freezing, or exploiting the fact > that "FreezeLimit < OldestXmin" (which is almost always true, very true). If there's <= 1 live members in a mxact, we replace it with a plain xid iff the xid also would get frozen. With the current freezing logic I don't see what passing down OldestXmin would change. Or how it differs to a meaningful degree from heap_prepare_freeze_tuple()'s logic. I don't see how it'd avoid a single new mxact from being allocated. > > > Caller should make temp > > > + * copies of global tracking variables before starting to process a page, so > > > + * that we can only scribble on copies. That way caller can just discard the > > > + * temp copies if it isn't okay with that assumption. > > > + * > > > + * Only aggressive VACUUM callers are expected to really care when a tuple > > > + * "needs freezing" according to us. It follows that non-aggressive VACUUMs > > > + * can use *relfrozenxid_nofreeze_out and *relminmxid_nofreeze_out in all > > > + * cases. > > > > Could it make sense to track can_freeze and need_freeze separately? > > > > You mean to change the signature of heap_tuple_needs_freeze, so it > > doesn't return a bool anymore? It just has two bool pointers as > > arguments, can_freeze and need_freeze? Something like that. Or return true if there's anything to do, and then rely on can_freeze and need_freeze for finer details. But it doesn't matter that much. > > I still suspect this will cause a very substantial increase in WAL traffic in > > realistic workloads. It's common to have workloads where tuples are inserted > > once, and deleted once/ partition dropped. > > I agree with the principle that this kind of use case should be > accommodated in some way. > > > I think we'll have to make this less aggressive or tunable. Random ideas for > > heuristics: > > The problem that all of these heuristics have is that they will tend > to make it impossible for future non-aggressive VACUUMs to be able to > advance relfrozenxid. All that it takes is one single all-visible page > to make that impossible. As I said upthread, I think that being able > to advance relfrozenxid (and especially relminmxid) by *some* amount > in every VACUUM has non-obvious value. I think that's a laudable goal. But I don't think we should go there unless we are quite confident we've mitigated the potential downsides. Observed horizons for "never vacuumed before" tables and for aggressive vacuums alone would be a huge win. > Maybe you can address that by changing the behavior of non-aggressive > VACUUMs, so that they are directly sensitive to this. 
Maybe they don't > skip any all-visible pages when there aren't too many, that kind of > thing. That needs to be in scope IMV. Yea. I still like my idea to have vacuum process some all-visible pages every time and to increase that percentage based on how old the relfrozenxid is. We could slowly "refill" the number of all-visible pages VACUUM is allowed to process whenever dirtying a page for other reasons. > > I think this means my concern above about increasing mxid creation rate > > substantially may be warranted. > > Can you think of an adversarial workload, to get a sense of the extent > > of the problem? I'll try to come up with something. > > FWIW, I'd really like to get rid of SKIP_PAGES_THRESHOLD. It often ends up > > spending a lot of time doing IO that we never need, completely trashing all CPU > > caches, while not actually causing decent readahead IO from what I've seen. > > I am also suspicious of SKIP_PAGES_THRESHOLD. But if we want to get > > rid of it, we'll need to be sensitive to how that affects relfrozenxid > > advancement in non-aggressive VACUUMs IMV. It might make sense to separate the purposes of SKIP_PAGES_THRESHOLD. The relfrozenxid advancement doesn't benefit from visiting all-frozen pages, just because there are only 30 of them in a row. > Thanks again for the review! NP, I think we need a lot of improvements in this area. I wish somebody would tackle merging heap_page_prune() with vacuuming. Primarily so we only do a single WAL record. But also because the separation has caused a *lot* of complexity. I already have more projects than I should, otherwise I'd start on it... Greetings, Andres Freund
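The "process an increasing percentage of all-visible pages" idea could be sketched as something like the following -- purely illustrative, with an invented function name and formula, just to show the shape of the heuristic being floated:

    /*
     * Hypothetical heuristic: what fraction of the otherwise-skippable
     * all-visible (but not all-frozen) pages should this VACUUM visit
     * anyway?  Scale up from 0 to 1 as relfrozenxid ages toward
     * autovacuum_freeze_max_age.  Invented for illustration only.
     */
    static double
    allvisible_scan_fraction(TransactionId relfrozenxid)
    {
        double      age = (double) (ReadNextTransactionId() - relfrozenxid);
        double      frac = age / (double) autovacuum_freeze_max_age;

        return Min(Max(frac, 0.0), 1.0);
    }

The "refill" variant mentioned above would presumably replenish that budget whenever a page has to be dirtied for some other reason anyway.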
On Fri, Feb 25, 2022 at 2:00 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Hm. I guess I'll have to look at the code for it. It doesn't immediately > > "feel" quite right. > > I kinda think it might be. Please let me know if you see a problem > with what I've said. Oh, wait. I have a better idea of what you meant now. The loop towards the end of FreezeMultiXactId() will indeed "Determine whether to keep this member or ignore it." when we need a new MultiXactId. The loop is exact in the sense that it will only include those XIDs that are truly needed -- those that are still running. But why should we ever get to the FreezeMultiXactId() loop with the stuff from 0002 in place? The whole purpose of the loop is to handle cases where we have to remove *some* (not all) XIDs from before cutoff_xid that appear in a MultiXact, which requires careful checking of each XID (this is only possible when the MultiXactId is < cutoff_multi to begin with, which is OldestMxact in the patch, which is presumably very recent). It's not impossible that we'll get some number of "skewed MultiXacts" with the patch -- cases that really do necessitate allocating a new MultiXact, just to "freeze some XIDs from a MultiXact". That is, there will sometimes be some number of XIDs that are < OldestXmin, but nevertheless appear in some MultiXactIds >= OldestMxact. This seems likely to be rare with the patch, though, since VACUUM calculates its OldestXmin and OldestMxact (which are what cutoff_xid and cutoff_multi really are in the patch) at the same point in time. Which was the point I made in my email yesterday. How many of these "skewed MultiXacts" can we really expect? Seems like there might be very few in practice. But I'm really not sure about that. -- Peter Geoghegan
Hi, On 2022-02-25 15:28:17 -0800, Peter Geoghegan wrote: > But why should we ever get to the FreezeMultiXactId() loop with the > stuff from 0002 in place? The whole purpose of the loop is to handle > cases where we have to remove *some* (not all) XIDs from before > cutoff_xid that appear in a MultiXact, which requires careful checking > of each XID (this is only possible when the MultiXactId is < > cutoff_multi to begin with, which is OldestMxact in the patch, which > is presumably very recent). > > It's not impossible that we'll get some number of "skewed MultiXacts" > with the patch -- cases that really do necessitate allocating a new > MultiXact, just to "freeze some XIDs from a MultiXact". That is, there > will sometimes be some number of XIDs that are < OldestXmin, but > nevertheless appear in some MultiXactIds >= OldestMxact. This seems > likely to be rare with the patch, though, since VACUUM calculates its > OldestXmin and OldestMxact (which are what cutoff_xid and cutoff_multi > really are in the patch) at the same point in time. Which was the > point I made in my email yesterday. I don't see why it matters that OldestXmin and OldestMxact are computed at the same time? It's a question of the workload, not vacuum algorithm. OldestMxact inherently lags OldestXmin. OldestMxact can only advance after all members are older than OldestXmin (not quite true, but that's the bound), and they have always more than one member. > How many of these "skewed MultiXacts" can we really expect? I don't think they're skewed in any way. It's a fundamental aspect of multixacts. Greetings, Andres Freund
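A concrete example of that lag, with invented numbers: suppose XID 1000 and XID 5000 both acquire a row lock on the same tuple, creating MultiXactId 700 with members {1000, 5000}. XID 1000 commits right away, but XID 5000 stays open. While XID 5000 runs, OldestXmin can be no newer than 5000 and OldestMxact can be no newer than 700 -- a multi whose other member, XID 1000, may already be far behind OldestXmin. The MXID horizon can only move past a multi once every member is behind the XID horizon, so OldestMxact always corresponds to an earlier wall-clock moment than OldestXmin, and a multi at or after OldestMxact can still carry XIDs from well before OldestXmin.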
On Fri, Feb 25, 2022 at 3:48 PM Andres Freund <andres@anarazel.de> wrote: > I don't see why it matters that OldestXmin and OldestMxact are computed at the > same time? It's a question of the workload, not vacuum algorithm. I think it's both. > OldestMxact inherently lags OldestXmin. OldestMxact can only advance after all > members are older than OldestXmin (not quite true, but that's the bound), and > they have always more than one member. > > > > How many of these "skewed MultiXacts" can we really expect? > > I don't think they're skewed in any way. It's a fundamental aspect of > multixacts. Having this happen to some degree is fundamental to MultiXacts, sure. But also seems like the approach of using FreezeLimit and MultiXactCutoff in the way that we do right now seems like it might make the problem a lot worse. Because they're completely meaningless cutoffs. They are magic numbers that have no relationship whatsoever to each other. There are problems with assuming that OldestXmin and OldestMxact "align" -- no question. But at least it's approximately true -- which is a start. They are at least not arbitrarily, unpredictably different, like FreezeLimit and MultiXactCutoff are, and always will be. I think that that's a meaningful and useful distinction. I am okay with making the most pessimistic possible assumptions about how any changes to how we freeze might cause FreezeMultiXactId() to allocate more MultiXacts than before. And I accept that the patch series shouldn't "get credit" for "offsetting" any problem like that by making relminmxid advancement occur much more frequently (even though that does seem very valuable). All I'm really saying is this: in general, there are probably quite a few opportunities for FreezeMultiXactId() to avoid allocating new XMIDs (just to freeze XIDs) by having the full context. And maybe by making the dialog between lazy_scan_prune and heap_prepare_freeze_tuple a bit more nuanced. -- Peter Geoghegan
On Fri, Feb 25, 2022 at 3:26 PM Andres Freund <andres@anarazel.de> wrote: > freeze_required_limit, freeze_desired_limit? Or s/limit/cutoff/? Or > s/limit/below/? I kind of like below because that answers < vs <= which I find > hard to remember around freezing. I like freeze_required_limit the most. > That may be true, but I think working more incrementally is better in this > area. I'd rather have a smaller improvement for a release, collect some data, > get another improvement in the next, than see a bunch of reports of larger > wins and large regressions. I agree. There is an important practical way in which it makes sense to treat 0001 as separate to 0002. It is true that 0001 is independently quite useful. In practical terms, I'd be quite happy to just get 0001 into Postgres 15, without 0002. I think that that's what you meant here, in concrete terms, and we can agree on that now. However, it is *also* true that there is an important practical sense in which they *are* related. I don't want to ignore that either -- it does matter. Most of the value to be had here comes from the synergy between 0001 and 0002 -- or what I've been calling a "virtuous cycle", the thing that makes it possible to advance relfrozenxid/relminmxid in almost every VACUUM. Having both 0001 and 0002 together (or something along the same lines) is way more valuable than having just one. Perhaps we can even agree on this second point. I am encouraged by the fact that you at least recognize the general validity of the key ideas from 0002. If I am going to commit 0001 (and not 0002) ahead of feature freeze for 15, I better be pretty sure that I have at least roughly the right idea with 0002, too -- since that's the direction that 0001 is going in. It almost seems dishonest to pretend that I wasn't thinking of 0002 when I wrote 0001. I'm glad that you seem to agree that this business of accumulating freezing debt without any natural limit is just not okay. That is really fundamental to me. I mean, vacuum_freeze_min_age kind of doesn't work as designed. This is a huge problem for us. > > Under these conditions, we will have many more opportunities to > > advance relminmxid for most of the tables (including the larger > > tables) all the way up to current-oldestMxact with the patch series. > > Without needing to freeze *any* MultiXacts early (just freezing some > > XIDs early) to get that benefit. The patch series is not just about > > spreading the burden of freezing, so that non-aggressive VACUUMs > > freeze more -- it's also making relfrozenxid and relminmxid more > > recent and therefore *reliable* indicators of where any > > wraparound problems *really* are. > > My concern was explicitly about the case where we have to create new > multixacts... It was a mistake on my part to counter your point about that with this other point about eager relminmxid advancement. As I said in the last email, while that is very valuable, it's not something that needs to be brought into this. > > Does that make sense to you? > > Yes. Okay, great. The fact that you recognize the value in that comes as a relief. > > You mean to change the signature of heap_tuple_needs_freeze, so it > > doesn't return a bool anymore? It just has two bool pointers as > > arguments, can_freeze and need_freeze? > > Something like that. Or return true if there's anything to do, and then rely > on can_freeze and need_freeze for finer details. But it doesn't matter that much. Got it. 
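Spelled out, the signature being agreed to here might look something like this -- the parameter names are carried over from the discussion above purely for illustration, not from any committed code:

    /* return value: is there anything here worth doing at all? */
    extern bool heap_tuple_needs_freeze(HeapTupleHeader tuple,
                                        TransactionId backstop_cutoff_xid,
                                        MultiXactId backstop_cutoff_multi,
                                        TransactionId *relfrozenxid_nofreeze_out,
                                        MultiXactId *relminmxid_nofreeze_out,
                                        bool *can_freeze,
                                        bool *need_freeze);

The two flags would carry the finer distinction: can_freeze meaning "freezing here would accomplish something", need_freeze meaning "something is already past the backstop cutoffs".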
> > The problem that all of these heuristics have is that they will tend > > to make it impossible for future non-aggressive VACUUMs to be able to > > advance relfrozenxid. All that it takes is one single all-visible page > > to make that impossible. As I said upthread, I think that being able > > to advance relfrozenxid (and especially relminmxid) by *some* amount > > in every VACUUM has non-obvious value. > > I think that's a laudable goal. But I don't think we should go there unless we > are quite confident we've mitigated the potential downsides. True. But that works both ways. We also shouldn't err in the direction of adding these kinds of heuristics (which have real downsides) until the idea of mostly swallowing the cost of freezing whole pages (while making it possible to disable it) has lost, fairly. Overall, it looks like the cost is acceptable in most cases. I think that users will find it very reassuring to regularly and reliably see confirmation that wraparound is being kept at bay, by every VACUUM operation, with details that they can relate to their workload. That has real value IMV -- even when it's theoretically unnecessary for us to be so eager with advancing relfrozenxid. I really don't like the idea of falling behind on freezing systematically. You always run the "risk" of freezing being wasted. But that way of looking at it can be penny wise, pound foolish -- maybe we should just accept that trying to predict what will happen in the future (whether or not freezing will be worth it) is mostly not helpful. Our users mostly complain about performance stability these days. Big shocks are really something we ought to avoid. That does have a cost. Why wouldn't it? > > Maybe you can address that by changing the behavior of non-aggressive > > VACUUMs, so that they are directly sensitive to this. Maybe they don't > > skip any all-visible pages when there aren't too many, that kind of > > thing. That needs to be in scope IMV. > > Yea. I still like my idea to have vacuum process some all-visible pages > every time and to increase that percentage based on how old the relfrozenxid > is. You can quite easily construct cases where the patch does much better than that, though -- very believable cases. Any table like pgbench_history. And so I lean towards quantifying the cost of page-level freezing carefully, making sure there is nothing pathological, and then just accepting it (with a GUC to disable). The reality is that freezing is really a cost of storing data in Postgres, and will be for the foreseeable future. > > Can you think of an adversarial workload, to get a sense of the extent > > of the problem? > > I'll try to come up with something. That would be very helpful. Thanks! > It might make sense to separate the purposes of SKIP_PAGES_THRESHOLD. The > relfrozenxid advancement doesn't benefit from visiting all-frozen pages, just > because there are only 30 of them in a row. Right. I imagine that SKIP_PAGES_THRESHOLD actually does help with this, but if we actually tried we'd find a much better way. > I wish somebody would tackle merging heap_page_prune() with > vacuuming. Primarily so we only do a single WAL record. But also because the > separation has caused a *lot* of complexity. I already have more projects than > I should, otherwise I'd start on it... That has value, but it doesn't feel as urgent. -- Peter Geoghegan
On Sun, Feb 20, 2022 at 3:27 PM Peter Geoghegan <pg@bowt.ie> wrote: > > I think that the idea has potential, but I don't think that I > > understand yet what the *exact* algorithm is. > > The algorithm seems to exploit a natural tendency that Andres once > described in a blog post about his snapshot scalability work [1]. To a > surprising extent, we can usefully bucket all tuples/pages into two > simple categories: > > 1. Very, very old ("infinitely old" for all practical purposes). > > 2. Very very new. > > There doesn't seem to be much need for a third "in-between" category > in practice. This seems to be at least approximately true all of the > time. > > Perhaps Andres wouldn't agree with this very general statement -- he > actually said something more specific. I for one believe that the > point he made generalizes surprisingly well, though. I have my own > theories about why this appears to be true. (Executive summary: power > laws are weird, and it seems as if the sparsity-of-effects principle > makes it easy to bucket things at the highest level, in a way that > generalizes well across disparate workloads.) I think that this is not really a description of an algorithm -- and I think that it is far from clear that the third "in-between" category does not need to exist. > Remember when I got excited about how my big TPC-C benchmark run > showed a predictable, tick/tock style pattern across VACUUM operations > against the order and order lines table [2]? It seemed very > significant to me that the OldestXmin of VACUUM operation n > consistently went on to become the new relfrozenxid for the same table > in VACUUM operation n + 1. It wasn't exactly the same XID, but very > close to it (within the range of noise). This pattern was clearly > present, even though VACUUM operation n + 1 might happen as long as 4 > or 5 hours after VACUUM operation n (this was a big table). I think findings like this are very unconvincing. TPC-C (or any benchmark really) is so simple as to be a terrible proxy for what vacuuming is going to look like on real-world systems. Like, it's nice that it works, and it shows that something's working, but it doesn't demonstrate that the patch is making the right trade-offs overall. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Mar 1, 2022 at 1:46 PM Robert Haas <robertmhaas@gmail.com> wrote: > I think that this is not really a description of an algorithm -- and I > think that it is far from clear that the third "in-between" category > does not need to exist. But I already described the algorithm. It is very simple mechanistically -- though that in itself means very little. As I have said multiple times now, the hard part is assessing what the implications are. And the even harder part is making a judgement about whether or not those implications are what we generally want. > I think findings like this are very unconvincing. TPC-C may be unrealistic in certain ways, but it is nevertheless vastly more realistic than pgbench. pgbench is really more of a stress test than a benchmark. The main reasons why TPC-C is interesting here are *very* simple, and would likely be equally true with TPC-E (just for example) -- even though TPC-E is a very different benchmark kind of OLTP workload overall. TPC-C (like TPC-E) features a diversity of transaction types, some of which are more complicated than others -- which is strictly more realistic than having only one highly synthetic OLTP transaction type. Each transaction type doesn't necessarily modify the same tables in the same way. This leads to natural diversity among tables and among transactions, including: * The typical or average number of distinct XIDs per heap page varies significantly among each table. There are way fewer distinct XIDs per "order line" table heap page than there are per "order" table heap page, for the obvious reason. * Roughly speaking, there are various different ways that free space management ought to work in a system like Postgres. For example it is necessary to make a "fragmentations vs space utilization" trade-off with the new orders table. * There are joins in some of the transactions! Maybe TPC-C is a crude approximation of reality, but it nevertheless exercises relevant parts of the system to a significant degree. What else would you expect me to use, for a project like this? To a significant degree the relfrozenxid tracking stuff is interesting because tables tend to have natural differences like the ones I have highlighted on this thread. How could that not be the case? Why wouldn't we want to take advantage of that? There might be some danger in over-optimizing for this particular benchmark, but right now that is so far from being the main problem that the idea seems strange to me. pgbench doesn't need the FSM, at all. In fact pgbench doesn't even really need VACUUM (except for antiwraparound), once heap fillfactor is lowered to 95 or so. pgbench simply isn't relevant, *at all*, except perhaps as a way of measuring regressions in certain synthetic cases that don't benefit. > TPC-C (or any > benchmark really) is so simple as to be a terrible proxy for what > vacuuming is going to look like on real-world systems. Doesn't that amount to "no amount of any kind of testing or benchmarking will convince me of anything, ever"? There is more than one type of real-world system. I think that TPC-C is representative of some real world systems in some regards. But even that's not the important point for me. I find TPC-C generally interesting for one reason: I can clearly see that Postgres does things in a way that just doesn't make much sense, which isn't particularly fundamental to how VACUUM works. 
My only long term goal is to teach Postgres to *avoid* various pathological cases exhibited by TPC-C (e.g., the B-Tree "split after new tuple" mechanism from commit f21668f328 *avoids* a pathological case from TPC-C). We don't necessarily have to agree on how important each individual case is "in the real world" (which is impossible to know anyway). We only have to agree that what we see is a pathological case (because some reasonable expectation is dramatically violated), and then work out a fix. I don't want to teach Postgres to be clever -- I want to teach it to avoid being stupid in cases where it exhibits behavior that really cannot be described any other way. You seem to talk about some of this work as if it was just as likely to have a detrimental effect elsewhere, for some equally plausible workload, which will have a downside that is roughly as bad as the advertised upside. I consider that very unlikely, though. Sure, regressions are quite possible, and a real concern -- but regressions *like that* are unlikely. Avoiding doing what is clearly the wrong thing just seems to work out that way, in general. -- Peter Geoghegan
On Fri, Feb 25, 2022 at 5:52 PM Peter Geoghegan <pg@bowt.ie> wrote: > There is an important practical way in which it makes sense to treat > 0001 as separate to 0002. It is true that 0001 is independently quite > useful. In practical terms, I'd be quite happy to just get 0001 into > Postgres 15, without 0002. I think that that's what you meant here, in > concrete terms, and we can agree on that now.

Attached is v10. While this does still include the freezing patch, it's not in scope for Postgres 15. As I've said, I still think that it makes sense to maintain the patch series with the freezing stuff, since it's structurally related. So, to be clear, the first two patches from the patch series are in scope for Postgres 15. But not the third.

Highlights:

* Changes to terminology and commit messages along the lines suggested by Andres.

* Bug fixes to heap_tuple_needs_freeze()'s MultiXact handling. My testing strategy here still needs work.

* Expanded refactoring in the v10-0002 patch.

The v10-0002 patch (which appeared for the first time in v9) was originally all about fixing a case where non-aggressive VACUUMs were at a gratuitous disadvantage (relative to aggressive VACUUMs) around advancing relfrozenxid -- very much like the lazy_scan_noprune work from commit 44fa8488. And that is still its main purpose. But the refactoring now seems related to Andres' idea of making non-aggressive VACUUMs decide to scan a few extra all-visible pages in order to be able to advance relfrozenxid.

The code that sets up skipping with the visibility map is made a lot clearer by v10-0002. That patch moves a significant amount of code from lazy_scan_heap() into a new helper routine (so it continues the trend started by the Postgres 14 work that added lazy_scan_prune()). Now skipping a range of pages with the visibility map is fundamentally based on setting up the range up front, and then using the same saved details about the range thereafter -- we no longer have ad-hoc VM_ALL_VISIBLE()/VM_ALL_FROZEN() calls for pages from a range that we already decided to skip (so no calls to those routines from lazy_scan_heap(), at least not until after we finish processing in lazy_scan_prune()).

This is more or less what we were doing all along for one special case: aggressive VACUUMs. We had to make sure to either increment frozenskipped_pages or increment scanned_pages for every page from rel_pages -- this issue is described by lazy_scan_heap() comments on HEAD that begin with "Tricky, tricky." (these date back to the freeze map work from 2016). Anyway, there is no reason not to go further with that: we should make whole ranges the basic unit that we deal with when skipping. It's a lot simpler to think in terms of entire ranges (not individual pages) that are determined to be all-visible or all-frozen up-front, without needing to recheck anything (regardless of whether it's an aggressive VACUUM). We don't need to track frozenskipped_pages this way. And it's much more obvious that it's safe for more complicated cases, in particular for aggressive VACUUMs.

This kind of approach seems necessary to make non-aggressive VACUUMs do a little more work opportunistically, when they realize that they can advance relfrozenxid relatively easily that way (which I believe Andres favors as part of overhauling freezing). That becomes a lot more natural when you have a clear and unambiguous separation between deciding what range of blocks to skip, and then actually skipping.
I can imagine the new helper function added by v10-0002 (which I've called lazy_scan_skip_range()) eventually being taught to do these kinds of tricks. In general I think that all of the details of what to skip need to be decided up front. The loop in lazy_scan_heap() should execute skipping based on the instructions it receives from the new helper function, in the simplest way possible. The helper function can become more intelligent about the costs and benefits of skipping in the future, without that impacting lazy_scan_heap(). -- Peter Geoghegan
Attachment
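A rough sketch of the range-based approach described above, with hypothetical names (SkipRange, setup_skip_range, and the callback are illustrative only -- they are not the code in the v10 patch): the details of a whole range are decided once, up front, and never rechecked.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t BlockNumber;

    typedef struct SkipRange
    {
        BlockNumber start;          /* first block in the range */
        BlockNumber end;            /* first block *after* the range */
        bool        skip;           /* skip the whole range? */
        bool        all_frozen;     /* every page in the range was
                                     * all-frozen when it was set up */
    } SkipRange;

    /* assumption: vm_status() wraps the visibility map lookups */
    typedef enum { VM_NONE, VM_ALL_VISIBLE, VM_ALL_FROZEN } VMStatus;
    typedef VMStatus (*vm_status_fn) (BlockNumber blkno);

    static SkipRange
    setup_skip_range(BlockNumber next, BlockNumber rel_pages,
                     vm_status_fn vm_status, bool aggressive)
    {
        SkipRange   range = {next, next, false, true};

        /* Extend the range while pages remain skippable */
        while (range.end < rel_pages)
        {
            VMStatus    st = vm_status(range.end);

            if (st == VM_NONE)
                break;              /* must scan this page */
            if (st == VM_ALL_VISIBLE)
            {
                if (aggressive)
                    break;          /* aggressive VACUUM must scan it */
                range.all_frozen = false;
            }
            range.end++;
        }

        range.skip = (range.end - range.start) >= 32;   /* SKIP_PAGES_THRESHOLD */
        return range;
    }

With something like this, frozenskipped_pages bookkeeping becomes unnecessary: the range itself remembers whether everything it covers was all-frozen when the decision was made.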
On Sun, Mar 13, 2022 at 9:05 PM Peter Geoghegan <pg@bowt.ie> wrote: > Attached is v10. While this does still include the freezing patch, > it's not in scope for Postgres 15. As I've said, I still think that it > makes sense to maintain the patch series with the freezing stuff, > since it's structurally related.

Attached is v11. Changes:

* No longer includes the patch that adds page-level freezing. It was making it harder to assess code coverage for the patches that I'm targeting Postgres 15 with. And so including it with each new revision no longer seems useful. I'll pick it up for Postgres 16.

* Extensive isolation tests added to v11-0001-*, exercising a lot of hard-to-hit code paths that are reached when VACUUM is unable to immediately acquire a cleanup lock on some heap page. In particular, we now have test coverage for the code in heapam.c that handles tracking the oldest extant XID and MXID in the presence of MultiXacts (on a no-cleanup-lock heap page).

* v11-0002-* (which is the patch that avoids missing out on advancing relfrozenxid in non-aggressive VACUUMs due to a race condition on HEAD) now moves even more of the logic for deciding how VACUUM will skip using the visibility map into its own helper routine. Now lazy_scan_heap just follows what the state returned by the helper routine tells it about the current skippable range -- it doesn't make any decisions itself anymore. This is far simpler than what we do currently, on HEAD. There are no behavioral changes here, but this approach could be pushed further to improve performance. We could easily determine *every* page that we're going to scan (not skip) up-front in even the largest tables, very early, before we've even scanned one page. This could enable things like I/O prefetching, or capping the size of the dead_items array based on our final scanned_pages (not on rel_pages).

* A new patch (v11-0003-*) alters the behavior of VACUUM's DISABLE_PAGE_SKIPPING option. DISABLE_PAGE_SKIPPING no longer forces aggressive VACUUM -- now it only disables use of the visibility map for skipping (forcing every page to be scanned), since that behavior is totally independent of aggressiveness.

I don't feel too strongly about the DISABLE_PAGE_SKIPPING change. It just seems logical to decouple no-vm-skipping from aggressiveness -- it might actually be helpful in testing the work from the patch series in the future.

Any page counted in scanned_pages has essentially been processed by VACUUM with this work in place -- that was the idea behind the lazy_scan_noprune stuff from commit 44fa8488. Bear in mind that the relfrozenxid tracking stuff from v11-0001-* makes it almost certain that a DISABLE_PAGE_SKIPPING-without-aggressiveness VACUUM will still manage to advance relfrozenxid -- usually by the same amount as an equivalent aggressive VACUUM would anyway. (Failing to acquire a cleanup lock on some heap page might result in the final relfrozenxid being appreciably older, but probably not, and we'd still almost certainly manage to advance relfrozenxid by *some* small amount.)

Of course, anybody that wants both an aggressive VACUUM and a VACUUM that never skips even all-frozen pages in the visibility map will still be able to get that behavior quite easily. For example, VACUUM(DISABLE_PAGE_SKIPPING, FREEZE) will do that.
Several of our existing tests must already use both of these options together, because the tests require an effective vacuum_freeze_min_age of 0 (and vacuum_multixact_freeze_min_age of 0) -- DISABLE_PAGE_SKIPPING alone won't do that on HEAD, which seems to confuse the issue (see commit b700f96c for an example of that). In other words, since DISABLE_PAGE_SKIPPING doesn't *consistently* force lazy_scan_noprune to refuse to process a page on HEAD (it all depends on FreezeLimit/vacuum_freeze_min_age), it is logical for DISABLE_PAGE_SKIPPING to totally get out of the business of caring about that -- better to limit it to caring only about the visibility map (by no longer making it force aggressiveness). -- Peter Geoghegan
Attachment
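As a way of restating the v11-0003 idea, here is an illustrative sketch (hypothetical names, not the actual patch) of what decoupling "never skip pages using the VM" from "aggressive" might look like, as two independent behaviors derived from the options:

    #include <stdbool.h>

    typedef struct VacOptions
    {
        bool    disable_page_skipping;          /* VACUUM (DISABLE_PAGE_SKIPPING) */
        bool    freeze;                         /* VACUUM (FREEZE) */
        bool    table_age_forces_aggressive;    /* vacuum_freeze_table_age crossed */
    } VacOptions;

    typedef struct VacBehavior
    {
        bool    skipwithvm;     /* may skip pages based on the VM? */
        bool    aggressive;     /* must advance relfrozenxid to >= FreezeLimit */
    } VacBehavior;

    static VacBehavior
    decide_behavior(VacOptions opts)
    {
        VacBehavior b;

        /* DISABLE_PAGE_SKIPPING only forces every VM-skippable page to be scanned */
        b.skipwithvm = !opts.disable_page_skipping;

        /* aggressiveness now comes only from FREEZE or table age */
        b.aggressive = opts.freeze || opts.table_age_forces_aggressive;

        return b;
    }

Under this split, DISABLE_PAGE_SKIPPING only pins down skipwithvm, while FREEZE (or table age) is what forces aggressiveness -- hence the VACUUM(DISABLE_PAGE_SKIPPING, FREEZE) combination to get both.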
On Wed, Mar 23, 2022 at 3:59 PM Peter Geoghegan <pg@bowt.ie> wrote: > In other words, since DISABLE_PAGE_SKIPPING doesn't *consistently* > force lazy_scan_noprune to refuse to process a page on HEAD (it all > depends on FreezeLimit/vacuum_freeze_min_age), it is logical for > DISABLE_PAGE_SKIPPING to totally get out of the business of caring > about that -- better to limit it to caring only about the visibility > map (by no longer making it force aggressiveness). It seems to me that if DISABLE_PAGE_SKIPPING doesn't completely disable skipping pages, we have a problem. The option isn't named CARE_ABOUT_VISIBILITY_MAP. It's named DISABLE_PAGE_SKIPPING. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Mar 23, 2022 at 1:41 PM Robert Haas <robertmhaas@gmail.com> wrote: > It seems to me that if DISABLE_PAGE_SKIPPING doesn't completely > disable skipping pages, we have a problem.

It depends on how you define skipping. DISABLE_PAGE_SKIPPING was created at a time when a broader definition of skipping made a lot more sense.

> The option isn't named CARE_ABOUT_VISIBILITY_MAP. It's named > DISABLE_PAGE_SKIPPING.

VACUUM(DISABLE_PAGE_SKIPPING, VERBOSE) will still consistently show that 100% of the pages from rel_pages are scanned. A page that is "skipped" by lazy_scan_noprune isn't pruned, and won't have any of its tuples frozen. But every other aspect of processing the page happens in just the same way as it would in the cleanup lock/lazy_scan_prune path. We'll even still VACUUM the page if it happens to have some existing LP_DEAD items left behind by opportunistic pruning. We don't need a cleanup lock in lazy_scan_noprune (a share lock is all we need), nor do we need one in lazy_vacuum_heap_page (a regular exclusive lock is all we need).

-- Peter Geoghegan
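A simplified, standalone sketch of the control flow being described (the stubs stand in for ConditionalLockBufferForCleanup(), LockBuffer(), lazy_scan_noprune(), and lazy_scan_prune() -- this is not the actual vacuumlazy.c code):

    #include <stdbool.h>
    #include <stdio.h>

    typedef int Buffer;

    /* Trivial stand-ins for the buffer manager and vacuumlazy.c routines */
    static bool try_cleanup_lock(Buffer buf) { (void) buf; return false; }
    static void share_lock(Buffer buf) { (void) buf; }
    static void unlock(Buffer buf) { (void) buf; }
    static void cleanup_lock(Buffer buf) { (void) buf; }
    static bool scan_noprune(Buffer buf) { (void) buf; return true; }
    static void scan_prune(Buffer buf) { (void) buf; }

    /*
     * Shape of the fallback: when a cleanup lock isn't immediately
     * available, a share lock is enough to read tuple headers, remember
     * existing LP_DEAD items, and maintain the relfrozenxid/relminmxid
     * trackers.  Only when an aggressive VACUUM finds XIDs that must be
     * frozen does it wait for a real cleanup lock after all.
     */
    static void
    process_page(Buffer buf)
    {
        if (!try_cleanup_lock(buf))
        {
            share_lock(buf);
            if (scan_noprune(buf))
                return;             /* page still counts as scanned */
            unlock(buf);
            cleanup_lock(buf);      /* rare: must freeze, so wait */
        }
        scan_prune(buf);
    }

    int
    main(void)
    {
        process_page(1);
        printf("page processed\n");
        return 0;
    }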
On Wed, Mar 23, 2022 at 4:49 PM Peter Geoghegan <pg@bowt.ie> wrote: > On Wed, Mar 23, 2022 at 1:41 PM Robert Haas <robertmhaas@gmail.com> wrote: > > It seems to me that if DISABLE_PAGE_SKIPPING doesn't completely > > disable skipping pages, we have a problem. > > It depends on how you define skipping. DISABLE_PAGE_SKIPPING was > created at a time when a broader definition of skipping made a lot > more sense. > > > The option isn't named CARE_ABOUT_VISIBILITY_MAP. It's named > > DISABLE_PAGE_SKIPPING. > > VACUUM(DISABLE_PAGE_SKIPPING, VERBOSE) will still consistently show > that 100% of all of the pages from rel_pages are scanned. A page that > is "skipped" by lazy_scan_noprune isn't pruned, and won't have any of > its tuples frozen. But every other aspect of processing the page > happens in just the same way as it would in the cleanup > lock/lazy_scan_prune path. I see what you mean about it depending on how you define "skipping". But I think that DISABLE_PAGE_SKIPPING is intended as a sort of emergency safeguard when you really, really don't want to leave anything out. And therefore I favor defining it to mean that we don't skip any work at all. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Mar 23, 2022 at 1:53 PM Robert Haas <robertmhaas@gmail.com> wrote: > I see what you mean about it depending on how you define "skipping". > But I think that DISABLE_PAGE_SKIPPING is intended as a sort of > emergency safeguard when you really, really don't want to leave > anything out. I agree. > And therefore I favor defining it to mean that we don't > skip any work at all. But even today DISABLE_PAGE_SKIPPING won't do pruning when we cannot acquire a cleanup lock on a page, unless it happens to have XIDs from before FreezeLimit (which is probably 50 million XIDs behind OldestXmin, the vacuum_freeze_min_age default). I don't see much difference. Anyway, this isn't important. I'll just drop the third patch. -- Peter Geoghegan
On Thu, Mar 24, 2022 at 9:59 AM Peter Geoghegan <pg@bowt.ie> wrote: > On Wed, Mar 23, 2022 at 1:53 PM Robert Haas <robertmhaas@gmail.com> wrote: > > And therefore I favor defining it to mean that we don't > > skip any work at all. > > But even today DISABLE_PAGE_SKIPPING won't do pruning when we cannot > acquire a cleanup lock on a page, unless it happens to have XIDs from > before FreezeLimit (which is probably 50 million XIDs behind > OldestXmin, the vacuum_freeze_min_age default). I don't see much > difference. Yeah, I found it confusing that DISABLE_PAGE_SKIPPING doesn't disable all page skipping, so 3414099c turned out to be not enough.
On Wed, Mar 23, 2022 at 2:03 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Yeah, I found it confusing that DISABLE_PAGE_SKIPPING doesn't disable > all page skipping, so 3414099c turned out to be not enough. The proposed change to DISABLE_PAGE_SKIPPING is partly driven by that, and partly driven by a similar concern about aggressive VACUUM. It seems worth emphasizing the idea that an aggressive VACUUM is now just the same as any other VACUUM except for one detail: we're guaranteed to advance relfrozenxid to a value >= FreezeLimit at the end. The non-aggressive case has the choice to do things that make that impossible. But there are only two places where this can happen now: 1. Non-aggressive VACUUMs might decide to skip some all-visible pages in the new lazy_scan_skip() helper routine for skipping with the VM (see v11-0002-*). 2. A non-aggressive VACUUM can *always* decide to ratchet back its target relfrozenxid in lazy_scan_noprune, to avoid waiting for a cleanup lock -- a final value from before FreezeLimit is usually still pretty good. The first scenario is the only one where it becomes impossible for non-aggressive VACUUM to be able to advance relfrozenxid (with v11-0001-* in place) by any amount. Even that's a choice, made by weighing costs against benefits. There is no behavioral change in v11-0002-* (we're still using the old SKIP_PAGES_THRESHOLD strategy), but the lazy_scan_skip() helper routine could fairly easily be taught a lot more about the downside of skipping all-visible pages (namely how that makes it impossible to advance relfrozenxid). Maybe it's worth skipping all-visible pages (there are lots of them and age(relfrozenxid) is still low), and maybe it isn't worth it. We should get to decide, without implementation details making relfrozenxid advancement unsafe. It would be great if you could take a look v11-0002-*, Robert. Does it make sense to you? Thanks -- Peter Geoghegan
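The "ratchet back its target relfrozenxid" behavior in scenario 2 can be illustrated with a minimal standalone sketch (plain uint32 comparison stands in for the wraparound-aware TransactionIdPrecedes(), and the names are illustrative rather than taken from the patch):

    #include <stdint.h>

    typedef uint32_t TransactionId;

    typedef struct RelfrozenXidTracker
    {
        TransactionId NewRelfrozenXid;  /* starts out at OldestXmin */
    } RelfrozenXidTracker;

    static void
    observe_unfrozen_xid(RelfrozenXidTracker *t, TransactionId xid)
    {
        /* An extant, unfrozen XID caps how far relfrozenxid can advance */
        if (xid < t->NewRelfrozenXid)   /* assumption: no wraparound */
            t->NewRelfrozenXid = xid;
    }

The tracker starts out at OldestXmin, and every unfrozen XID that VACUUM leaves behind can only pull it backwards -- so the final value is simply the oldest extant XID, however each page happened to be processed.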
On Wed, Mar 23, 2022 at 6:28 PM Peter Geoghegan <pg@bowt.ie> wrote: > It would be great if you could take a look v11-0002-*, Robert. Does it > make sense to you? You're probably not going to love hearing this, but I think you're still explaining things here in ways that are too baroque and hard to follow. I do think it's probably better. But, for example, in the commit message for 0001, I think you could change the subject line to "Allow non-aggressive vacuums to advance relfrozenxid" and it would be clearer. And then I think you could eliminate about half of the first paragraph, starting with "There is no fixed relationship", and all of the third paragraph (which starts with "Later work..."), and I think removing all that material would make it strictly more clear than it is currently. I don't think it's the place of a commit message to speculate too much on future directions or to wax eloquent on theoretical points. If that belongs anywhere, it's in a mailing list discussion. It seems to me that 0002 mixes code movement with functional changes. I'm completely on board with moving the code that decides how much to skip into a function. That seems like a great idea, and probably overdue. But it is not easy for me to see what has changed functionally between the old and new code organization, and I bet it would be possible to split this into two patches, one of which creates a function, and the other of which fixes the problem, and I think that would be a useful service to future readers of the code. I have a hard time believing that if someone in the future bisects a problem back to this commit, they're going to have an easy time finding the behavior change in here. In fact I can't see it myself. I think the actual functional change is to fix what is described in the second paragraph of the commit message, but I haven't been able to figure out where the logic is actually changing to address that. Note that I would be happy with the behavior change happening either before or after the code reorganization. I also think that the commit message for 0002 is probably longer and more complex than is really helpful, and that the subject line is too vague, but since I don't yet understand exactly what's happening here, I cannot comment on how I think it should be revised at this point, except to say that the second paragraph of that commit message looks like the most useful part. I would also like to mention a few things that I do like about 0002. One is that it seems to collapse two different pieces of logic for page skipping into one. That seems good. As mentioned, it's especially good because that logic is abstracted into a function. Also, it looks like it is making a pretty localized change to one (1) aspect of what VACUUM does -- and I definitely prefer patches that change only one thing at a time. Hope that's helpful. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Mar 24, 2022 at 10:21 AM Robert Haas <robertmhaas@gmail.com> wrote: > You're probably not going to love hearing this, but I think you're > still explaining things here in ways that are too baroque and hard to > follow. I do think it's probably better. There are a lot of dimensions to this work. It's hard to know which to emphasize here. > But, for example, in the > commit message for 0001, I think you could change the subject line to > "Allow non-aggressive vacuums to advance relfrozenxid" and it would be > clearer. But non-aggressive VACUUMs have always been able to do that. How about: "Set relfrozenxid to oldest extant XID seen by VACUUM" > And then I think you could eliminate about half of the first > paragraph, starting with "There is no fixed relationship", and all of > the third paragraph (which starts with "Later work..."), and I think > removing all that material would make it strictly more clear than it > is currently. I don't think it's the place of a commit message to > speculate too much on future directions or to wax eloquent on > theoretical points. If that belongs anywhere, it's in a mailing list > discussion. Okay, I'll do that. > It seems to me that 0002 mixes code movement with functional changes. Believe it or not, I avoided functional changes in 0002 -- at least in one important sense. That's why you had difficulty spotting any. This must sound peculiar, since the commit message very clearly says that the commit avoids a problem seen only in the non-aggressive case. It's really quite subtle. You wrote this comment and code block (which I propose to remove in 0002), so clearly you already understand the race condition that I'm concerned with here: - if (skipping_blocks && blkno < rel_pages - 1) - { - /* - * Tricky, tricky. If this is in aggressive vacuum, the page - * must have been all-frozen at the time we checked whether it - * was skippable, but it might not be any more. We must be - * careful to count it as a skipped all-frozen page in that - * case, or else we'll think we can't update relfrozenxid and - * relminmxid. If it's not an aggressive vacuum, we don't - * know whether it was initially all-frozen, so we have to - * recheck. - */ - if (vacrel->aggressive || - VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer)) - vacrel->frozenskipped_pages++; - continue; - } What you're saying here boils down to this: it doesn't matter what the visibility map would say right this microsecond (in the aggressive case) were we to call VM_ALL_FROZEN(): we know for sure that the VM said that this page was all-frozen *in the recent past*. That's good enough; we will never fail to scan a page that might have an XID < OldestXmin (ditto for XMIDs) this way, which is all that really matters. This is absolutely mandatory in the aggressive case, because otherwise relfrozenxid advancement might be seen as unsafe. My observation is: Why should we accept the same race in the non-aggressive case? Why not do essentially the same thing in every VACUUM? In 0002 we now track if each range that we actually chose to skip had any all-visible (not all-frozen) pages -- if that happens then relfrozenxid advancement becomes unsafe. The existing code uses "vacrel->aggressive" as a proxy for the same condition -- the existing code reasons based on what the visibility map must have said about the page in the recent past. Which makes sense, but only works in the aggressive case. 
The approach taken in 0002 also makes the code simpler, which is what enabled putting the VM skipping code into its own helper function, but that was just a bonus. And so you could almost say that there is now behavioral change at all. We're skipping pages in the same way, based on the same information (from the visibility map) as before. We're just being a bit more careful than before about how that information is tracked, to avoid this race. A race that we always avoided in the aggressive case is now consistently avoided. > I'm completely on board with moving the code that decides how much to > skip into a function. That seems like a great idea, and probably > overdue. But it is not easy for me to see what has changed > functionally between the old and new code organization, and I bet it > would be possible to split this into two patches, one of which creates > a function, and the other of which fixes the problem, and I think that > would be a useful service to future readers of the code. It seems kinda tricky to split up 0002 like that. It's possible, but I'm not sure if it's possible to split it in a way that highlights the issue that I just described. Because we already avoided the race in the aggressive case. > I also think that the commit message for 0002 is probably longer and > more complex than is really helpful, and that the subject line is too > vague, but since I don't yet understand exactly what's happening here, > I cannot comment on how I think it should be revised at this point, > except to say that the second paragraph of that commit message looks > like the most useful part. I'll work on that. > I would also like to mention a few things that I do like about 0002. > One is that it seems to collapse two different pieces of logic for > page skipping into one. That seems good. As mentioned, it's especially > good because that logic is abstracted into a function. Also, it looks > like it is making a pretty localized change to one (1) aspect of what > VACUUM does -- and I definitely prefer patches that change only one > thing at a time. Totally embracing the idea that we don't necessarily need very recent information from the visibility map (it just has to be after OldestXmin was established) has a lot of advantages, architecturally. It could in principle be hours out of date in the longest VACUUM operations -- that should be fine. This is exactly the same principle that makes it okay to stick with our original rel_pages, even when the table has grown during a VACUUM operation (I documented this in commit 73f6ec3d3c recently). We could build on the approach taken by 0002 to create a totally comprehensive picture of the ranges we're skipping up-front, before we actually scan any pages, even with very large tables. We could in principle cache a very large number of skippable ranges up-front, without ever going back to the visibility map again later (unless we need to set a bit). It really doesn't matter if somebody else unsets a page's VM bit concurrently, at all. I see a lot of advantage to knowing our final scanned_pages almost immediately. Things like prefetching, capping the size of the dead_items array more intelligently (use final scanned_pages instead of rel_pages in dead_items_max_items()), improvements to progress reporting...not to mention more intelligent choices about whether we should try to advance relfrozenxid a bit earlier during non-aggressive VACUUMs. > Hope that's helpful. Very helpful -- thanks! -- Peter Geoghegan
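A sketch of the bookkeeping described above (hypothetical names, not the actual 0002 patch): the decision to skip a range is made from one read of the visibility map, and the only thing remembered afterwards is whether any skipped page was merely all-visible rather than all-frozen.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t BlockNumber;
    typedef enum { PAGE_MUST_SCAN, PAGE_ALL_VISIBLE, PAGE_ALL_FROZEN } VMState;

    typedef struct SkipState
    {
        bool    skippedallvis;      /* any skipped page only all-visible? */
    } SkipState;

    static void
    skip_range(SkipState *state, const VMState *vm, BlockNumber start,
               BlockNumber end)
    {
        for (BlockNumber blkno = start; blkno < end; blkno++)
        {
            /*
             * The decision was made from a snapshot of the VM taken when
             * the range was set up; no recheck later, aggressive or not.
             */
            if (vm[blkno] == PAGE_ALL_VISIBLE)
                state->skippedallvis = true;
        }
    }

    /*
     * Later: relfrozenxid can only be advanced if !state->skippedallvis.
     */

An aggressive VACUUM never skips all-visible-only pages in the first place, so under this scheme it ends up with skippedallvis == false automatically -- the same rule covers both cases.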
On Thu, Mar 24, 2022 at 3:28 PM Peter Geoghegan <pg@bowt.ie> wrote: > But non-aggressive VACUUMs have always been able to do that. > > How about: "Set relfrozenxid to oldest extant XID seen by VACUUM" Sure, that sounds nice. > Believe it or not, I avoided functional changes in 0002 -- at least in > one important sense. That's why you had difficulty spotting any. This > must sound peculiar, since the commit message very clearly says that > the commit avoids a problem seen only in the non-aggressive case. It's > really quite subtle. Well, I think the goal in revising the code is to be as un-subtle as possible. Commits that people can't easily understand breed future bugs. > What you're saying here boils down to this: it doesn't matter what the > visibility map would say right this microsecond (in the aggressive > case) were we to call VM_ALL_FROZEN(): we know for sure that the VM > said that this page was all-frozen *in the recent past*. That's good > enough; we will never fail to scan a page that might have an XID < > OldestXmin (ditto for XMIDs) this way, which is all that really > matters. Makes sense. So maybe the commit message should try to emphasize this point e.g. "If a page is all-frozen at the time we check whether it can be skipped, don't allow it to affect the relfrozenxmin and relminmxid which we set for the relation. This was previously true for aggressive vacuums, but not for non-aggressive vacuums, which was inconsistent. (The reason this is a safe thing to do is that any new XIDs or MXIDs that appear on the page after we initially observe it to be frozen must be newer than any relfrozenxid or relminmxid the current vacuum could possibly consider storing into pg_class.)" > This is absolutely mandatory in the aggressive case, because otherwise > relfrozenxid advancement might be seen as unsafe. My observation is: > Why should we accept the same race in the non-aggressive case? Why not > do essentially the same thing in every VACUUM? Sure, that seems like a good idea. I think I basically agree with the goals of the patch. My concern is just about making the changes understandable to future readers. This area is notoriously subtle, and people are going to introduce more bugs even if the comments and code organization are fantastic. > And so you could almost say that there is now behavioral change at > all. I vigorously object to this part, though. We should always err on the side of saying that commits *do* have behavioral changes. We should go out of our way to call out in the commit message any possible way that someone might notice the difference between the post-commit situation and the pre-commit situation. It is fine, even good, to also be clear about how we're maintaining continuity and why we don't think it's a problem, but the only commits that should be described as not having any behavioral change are ones that do mechanical code movement, or are just changing comments, or something like that. > It seems kinda tricky to split up 0002 like that. It's possible, but > I'm not sure if it's possible to split it in a way that highlights the > issue that I just described. Because we already avoided the race in > the aggressive case. I do see that there are some difficulties there. I'm not sure what to do about that. I think a sufficiently clear commit message could possibly be enough, rather than trying to split the patch. But I also think splitting the patch should be considered, if that can reasonably be done. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Mar 24, 2022 at 1:21 PM Robert Haas <robertmhaas@gmail.com> wrote: > > How about: "Set relfrozenxid to oldest extant XID seen by VACUUM" > > Sure, that sounds nice. Cool. > > What you're saying here boils down to this: it doesn't matter what the > > visibility map would say right this microsecond (in the aggressive > > case) were we to call VM_ALL_FROZEN(): we know for sure that the VM > > said that this page was all-frozen *in the recent past*. That's good > > enough; we will never fail to scan a page that might have an XID < > > OldestXmin (ditto for XMIDs) this way, which is all that really > > matters. > > Makes sense. So maybe the commit message should try to emphasize this > point e.g. "If a page is all-frozen at the time we check whether it > can be skipped, don't allow it to affect the relfrozenxmin and > relminmxid which we set for the relation. This was previously true for > aggressive vacuums, but not for non-aggressive vacuums, which was > inconsistent. (The reason this is a safe thing to do is that any new > XIDs or MXIDs that appear on the page after we initially observe it to > be frozen must be newer than any relfrozenxid or relminmxid the > current vacuum could possibly consider storing into pg_class.)" Okay, I'll add something more like that. Almost every aspect of relfrozenxid advancement by VACUUM seems simpler when thought about in these terms IMV. Every VACUUM now scans all pages that might have XIDs < OldestXmin, and so every VACUUM can advance relfrozenxid to the oldest extant XID (barring non-aggressive VACUUMs that *choose* to skip some all-visible pages). There are a lot more important details, of course. My "Every VACUUM..." statement works well as an axiom because all of those other details don't create any awkward exceptions. > > This is absolutely mandatory in the aggressive case, because otherwise > > relfrozenxid advancement might be seen as unsafe. My observation is: > > Why should we accept the same race in the non-aggressive case? Why not > > do essentially the same thing in every VACUUM? > > Sure, that seems like a good idea. I think I basically agree with the > goals of the patch. Great. > My concern is just about making the changes > understandable to future readers. This area is notoriously subtle, and > people are going to introduce more bugs even if the comments and code > organization are fantastic. Makes sense. > > And so you could almost say that there is now behavioral change at > > all. > > I vigorously object to this part, though. We should always err on the > side of saying that commits *do* have behavioral changes. I think that you've taken my words too literally here. I would never conceal the intent of a piece of work like that. I thought that it would clarify matters to point out that I could in theory "get away with it if I wanted to" in this instance. This was only a means of conveying a subtle point about the behavioral changes from 0002 -- since you couldn't initially see them yourself (even with my commit message). Kind of like Tom Lane's 2011 talk on the query planner. The one where he lied to the audience several times. > > It seems kinda tricky to split up 0002 like that. It's possible, but > > I'm not sure if it's possible to split it in a way that highlights the > > issue that I just described. Because we already avoided the race in > > the aggressive case. > > I do see that there are some difficulties there. I'm not sure what to > do about that. 
I think a sufficiently clear commit message could > possibly be enough, rather than trying to split the patch. But I also > think splitting the patch should be considered, if that can reasonably > be done. I'll see if I can come up with something. It's hard to be sure about that kind of thing when you're this close to the code. -- Peter Geoghegan
On Thu, Mar 24, 2022 at 2:40 PM Peter Geoghegan <pg@bowt.ie> wrote: > > > This is absolutely mandatory in the aggressive case, because otherwise > > > relfrozenxid advancement might be seen as unsafe. My observation is: > > > Why should we accept the same race in the non-aggressive case? Why not > > > do essentially the same thing in every VACUUM? > > > > Sure, that seems like a good idea. I think I basically agree with the > > goals of the patch. > > Great. Attached is v12. My current goal is to commit all 3 patches before feature freeze. Note that this does not include the more complicated patch including with previous revisions of the patch series (the page-level freezing work that appeared in versions before v11). Changes that appear in this new revision, v12: * Reworking of the commit messages based on feedback from Robert. * General cleanup of the changes to heapam.c from 0001 (the changes to heap_prepare_freeze_tuple and related functions). New and existing code now fits together a bit better. I also added a couple of new documenting assertions, to make the flow a bit easier to understand. * Added new assertions that document OldestXmin/FreezeLimit/relfrozenxid invariants, right at the point we update pg_class within vacuumlazy.c. These assertions would have a decent chance of failing if there were any bugs in the code. * Removed patch that made DISABLE_PAGE_SKIPPING not force aggressive VACUUM, limiting the underlying mechanism to forcing scanning of all pages in lazy_scan_heap (v11 was the first and last revision that included this patch). * Adds a new small patch 0003. This just moves the last piece of resource allocation that still took place at the top of lazy_scan_heap() back into its caller, heap_vacuum_rel(). The work in 0003 probably should have happened as part of the patch that became commit 73f6ec3d -- same idea. It's totally mechanical stuff. With 0002 and 0003, there is hardly any lazy_scan_heap code before the main loop that iterates through blocks in rel_pages (and the code that's still there is obviously related to the loop in a direct and obvious way). This seems like a big overall improvement in maintainability. Didn't see a way to split up 0002, per Robert's suggestion 3 days ago. As I said at the time, it's possible to split it up, but not in a way that highlights the underlying issue (since the issue 0002 fixes was always limited to non-aggressive VACUUMs). The commit message may have to suffice. -- Peter Geoghegan
Attachment
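For reference, the invariants that those new assertions are meant to document can be restated compactly (simplified: plain integer comparisons that ignore XID wraparound, so this is only a restatement of the idea, not the assertions from the patch):

    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    static void
    check_relfrozenxid_invariants(bool aggressive,
                                  TransactionId old_relfrozenxid,
                                  TransactionId NewRelfrozenXid,
                                  TransactionId FreezeLimit,
                                  TransactionId OldestXmin)
    {
        /* relfrozenxid never moves backwards, and never past OldestXmin */
        assert(old_relfrozenxid <= NewRelfrozenXid);
        assert(NewRelfrozenXid <= OldestXmin);

        /*
         * An aggressive VACUUM must reach at least FreezeLimit.  The
         * equality escape hatch is for OldestXmin itself "going
         * backwards" across VACUUMs, which the real code has to tolerate.
         */
        if (aggressive)
            assert(NewRelfrozenXid == OldestXmin ||
                   FreezeLimit <= NewRelfrozenXid);
    }

    int
    main(void)
    {
        /* e.g. an aggressive VACUUM that advanced all the way to OldestXmin */
        check_relfrozenxid_invariants(true, 100, 200, 150, 200);
        return 0;
    }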
On Sun, Mar 27, 2022 at 11:24 PM Peter Geoghegan <pg@bowt.ie> wrote: > Attached is v12. My current goal is to commit all 3 patches before > feature freeze. Note that this does not include the more complicated > patch including with previous revisions of the patch series (the > page-level freezing work that appeared in versions before v11). Reviewing 0001, focusing on the words in the patch file much more than the code: I can understand this version of the commit message. Woohoo! I like understanding things. I think the header comments for FreezeMultiXactId() focus way too much on what the caller is supposed to do and not nearly enough on what FreezeMultiXactId() itself does. I think to some extent this also applies to the comments within the function body. On the other hand, the header comments for heap_prepare_freeze_tuple() seem good to me. If I were thinking of calling this function, I would know how to use the new arguments. If I were looking for bugs in it, I could compare the logic in the function to what these comments say it should be doing. Yay. I think I understand what the first paragraph of the header comment for heap_tuple_needs_freeze() is trying to say, but the second one is quite confusing. I think this is again because it veers into talking about what the caller should do rather than explaining what the function itself does. I don't like the statement-free else block in lazy_scan_noprune(). I think you could delete the else{} and just put that same comment there with one less level of indentation. There's a clear "return false" just above so it shouldn't be confusing what's happening. The comment hunk at the end of lazy_scan_noprune() would probably be better if it said something more specific than "caller can tolerate reduced processing." My guess is that it would be something like "caller does not need to do something or other." I have my doubts about whether the overwrite-a-future-relfrozenxid behavior is any good, but that's a topic for another day. I suggest keeping the words "it seems best to", though, because they convey a level of tentativeness, which seems appropriate. I am surprised to see you write in maintenance.sgml that the VACUUM which most recently advanced relfrozenxid will typically be the most recent aggressive VACUUM. I would have expected something like "(often the most recent VACUUM)". -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Mar 29, 2022 at 10:03 AM Robert Haas <robertmhaas@gmail.com> wrote: > I can understand this version of the commit message. Woohoo! I like > understanding things. That's good news. > I think the header comments for FreezeMultiXactId() focus way too much > on what the caller is supposed to do and not nearly enough on what > FreezeMultiXactId() itself does. I think to some extent this also > applies to the comments within the function body. To some extent this is a legitimate difference in style. I myself don't think that it's intrinsically good to have these sorts of comments. I just think that it can be the least worst thing when a function is intrinsically written with one caller and one very specific set of requirements in mind. That is pretty much a matter of taste, though. > I think I understand what the first paragraph of the header comment > for heap_tuple_needs_freeze() is trying to say, but the second one is > quite confusing. I think this is again because it veers into talking > about what the caller should do rather than explaining what the > function itself does. I wouldn't have done it that way if the function wasn't called heap_tuple_needs_freeze(). I would be okay with removing this paragraph if the function was renamed to reflect the fact it now tells the caller something about the tuple having an old XID/MXID relative to the caller's own XID/MXID cutoffs. Maybe the function name should be heap_tuple_would_freeze(), making it clear that the function merely tells caller what heap_prepare_freeze_tuple() *would* do, without presuming to tell the vacuumlazy.c caller what it *should* do about any of the information it is provided. Then it becomes natural to see the boolean return value and the changes the function makes to caller's relfrozenxid/relminmxid tracker variables as independent. > I don't like the statement-free else block in lazy_scan_noprune(). I > think you could delete the else{} and just put that same comment there > with one less level of indentation. There's a clear "return false" > just above so it shouldn't be confusing what's happening. Okay, will fix. > The comment hunk at the end of lazy_scan_noprune() would probably be > better if it said something more specific than "caller can tolerate > reduced processing." My guess is that it would be something like > "caller does not need to do something or other." I meant "caller can tolerate not pruning or freezing this particular page". Will fix. > I have my doubts about whether the overwrite-a-future-relfrozenxid > behavior is any good, but that's a topic for another day. I suggest > keeping the words "it seems best to", though, because they convey a > level of tentativeness, which seems appropriate. I agree that it's best to keep a tentative tone here. That code was written following a very specific bug in pg_upgrade several years back. There was a very recent bug fixed only last year, by commit 74cf7d46. FWIW I tend to think that we'd have a much better chance of catching that sort of thing if we'd had better relfrozenxid instrumentation before now. Now you'd see a negative value in the "new relfrozenxid: %u, which is %d xids ahead of previous value" part of the autovacuum log message in the event of such a bug. That's weird enough that I bet somebody would notice and report it. > I am surprised to see you write in maintenance.sgml that the VACUUM > which most recently advanced relfrozenxid will typically be the most > recent aggressive VACUUM. 
I would have expected something like "(often > the most recent VACUUM)". That's always been true, and will only be slightly less true in Postgres 15 -- the fact is that we only need to skip one all-visible page to lose out, and that's not unlikely with tables that aren't quite small with all the patches from v12 applied (we're still much too naive). The work that I'll get into Postgres 15 on VACUUM is very valuable as a basis for future improvements, but not all that valuable to users (improved instrumentation might be the biggest benefit in 15, or maybe relminmxid advancement for certain types of applications). I still think that we need to do more proactive page-level freezing to make relfrozenxid advancement happen in almost every VACUUM, but even that won't quite be enough. There are still cases where we need to make a choice about giving up on relfrozenxid advancement in a non-aggressive VACUUM -- all-visible pages won't completely go away with page-level freezing. At a minimum we'll still have edge cases like the case where heap_lock_tuple() unsets the all-frozen bit. And pg_upgrade'd databases, too. 0002 structures the logic for skipping using the VM in a way that will make the choice to skip or not skip all-visible pages in non-aggressive VACUUMs quite natural. I suspect that SKIP_PAGES_THRESHOLD was always mostly just about relfrozenxid advancement in non-aggressive VACUUM, all along. We can do much better than SKIP_PAGES_THRESHOLD, especially if we preprocess the entire visibility map up-front -- we'll know the costs and benefits up-front, before committing to early relfrozenxid advancement. Overall, aggressive vs non-aggressive VACUUM seems like a false dichotomy to me. ISTM that it should be a totally dynamic set of behaviors. There should probably be several different "aggressive gradations''. Most VACUUMs start out completely non-aggressive (including even anti-wraparound autovacuums), but can escalate from there. The non-cancellable autovacuum behavior (technically an anti-wraparound thing, but really an aggressiveness thing) should be something we escalate to, as with the failsafe. Dynamic behavior works a lot better. And it makes scheduling of autovacuum workers a lot more straightforward -- the discontinuities seem to make that much harder, which is one more reason to avoid them altogether. -- Peter Geoghegan
On Tue, Mar 29, 2022 at 11:58 AM Peter Geoghegan <pg@bowt.ie> wrote: > > I think I understand what the first paragraph of the header comment > > for heap_tuple_needs_freeze() is trying to say, but the second one is > > quite confusing. I think this is again because it veers into talking > > about what the caller should do rather than explaining what the > > function itself does. > > I wouldn't have done it that way if the function wasn't called > heap_tuple_needs_freeze(). > > I would be okay with removing this paragraph if the function was > renamed to reflect the fact it now tells the caller something about > the tuple having an old XID/MXID relative to the caller's own XID/MXID > cutoffs. Maybe the function name should be heap_tuple_would_freeze(), > making it clear that the function merely tells caller what > heap_prepare_freeze_tuple() *would* do, without presuming to tell the > vacuumlazy.c caller what it *should* do about any of the information > it is provided. Attached is v13, which does it that way. This does seem like a real increase in clarity, albeit one that comes at the cost of renaming heap_tuple_needs_freeze(). v13 also addresses all of the other items from Robert's most recent round of feedback. I would like to commit something close to v13 on Friday or Saturday. Thanks -- Peter Geoghegan
Attachment
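A standalone sketch of the heap_tuple_would_freeze() contract as described above (simplified types, xmin/xmax only, no MultiXact handling -- not the actual heapam.c code): the return value reports what heap_prepare_freeze_tuple() would do given the caller's cutoff, while the caller's tracker is ratcheted independently of that answer.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    typedef struct TupleXids
    {
        TransactionId xmin;
        TransactionId xmax;         /* 0 when unset */
    } TupleXids;

    static bool
    tuple_would_freeze(const TupleXids *tup,
                       TransactionId freeze_cutoff,     /* FreezeLimit */
                       TransactionId *relfrozenxid_out)
    {
        bool        would_freeze = false;

        /* Ratchet the caller's tracker for every extant XID... */
        if (tup->xmin != 0 && tup->xmin < *relfrozenxid_out)
            *relfrozenxid_out = tup->xmin;
        if (tup->xmax != 0 && tup->xmax < *relfrozenxid_out)
            *relfrozenxid_out = tup->xmax;

        /* ...and separately report whether the cutoff forces freezing */
        if (tup->xmin != 0 && tup->xmin < freeze_cutoff)
            would_freeze = true;
        if (tup->xmax != 0 && tup->xmax < freeze_cutoff)
            would_freeze = true;

        return would_freeze;
    }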
+ diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid); + Assert(diff > 0); Did you see that this crashed on windows cfbot? https://api.cirrus-ci.com/v1/artifact/task/4592929254670336/log/tmp_check/postmaster.log TRAP: FailedAssertion("diff > 0", File: "c:\cirrus\src\backend\access\heap\vacuumlazy.c", Line: 724, PID: 5984) abort() has been called2022-03-30 03:48:30.267 GMT [5316][client backend] [pg_regress/tablefunc][3/15389:0] ERROR: infiniterecursion detected 2022-03-30 03:48:38.031 GMT [5592][postmaster] LOG: server process (PID 5984) was terminated by exception 0xC0000354 2022-03-30 03:48:38.031 GMT [5592][postmaster] DETAIL: Failed process was running: autovacuum: VACUUM ANALYZE pg_catalog.pg_database 2022-03-30 03:48:38.031 GMT [5592][postmaster] HINT: See C include file "ntstatus.h" for a description of the hexadecimalvalue. https://cirrus-ci.com/task/4592929254670336 00000000`007ff130 00000001`400b4ef8 postgres!ExceptionalCondition( char * conditionName = 0x00000001`40a915d8 "diff > 0", char * errorType = 0x00000001`40a915c8 "FailedAssertion", char * fileName = 0x00000001`40a91598 "c:\cirrus\src\backend\access\heap\vacuumlazy.c", int lineNumber = 0n724)+0x8d [c:\cirrus\src\backend\utils\error\assert.c @ 70] 00000000`007ff170 00000001`402a0914 postgres!heap_vacuum_rel( struct RelationData * rel = 0x00000000`00a51088, struct VacuumParams * params = 0x00000000`00a8420c, struct BufferAccessStrategyData * bstrategy = 0x00000000`00a842a0)+0x1038 [c:\cirrus\src\backend\access\heap\vacuumlazy.c@ 724] 00000000`007ff350 00000001`402a4686 postgres!table_relation_vacuum( struct RelationData * rel = 0x00000000`00a51088, struct VacuumParams * params = 0x00000000`00a8420c, struct BufferAccessStrategyData * bstrategy = 0x00000000`00a842a0)+0x34 [c:\cirrus\src\include\access\tableam.h@ 1681] 00000000`007ff380 00000001`402a1a2d postgres!vacuum_rel( unsigned int relid = 0x4ee, struct RangeVar * relation = 0x00000000`01799ae0, struct VacuumParams * params = 0x00000000`00a8420c)+0x5a6 [c:\cirrus\src\backend\commands\vacuum.c @ 2068] 00000000`007ff400 00000001`4050f1ef postgres!vacuum( struct List * relations = 0x00000000`0179df58, struct VacuumParams * params = 0x00000000`00a8420c, struct BufferAccessStrategyData * bstrategy = 0x00000000`00a842a0, bool isTopLevel = true)+0x69d [c:\cirrus\src\backend\commands\vacuum.c @ 482] 00000000`007ff5f0 00000001`4050dc95 postgres!autovacuum_do_vac_analyze( struct autovac_table * tab = 0x00000000`00a84208, struct BufferAccessStrategyData * bstrategy = 0x00000000`00a842a0)+0x8f [c:\cirrus\src\backend\postmaster\autovacuum.c@ 3248] 00000000`007ff640 00000001`4050b4e3 postgres!do_autovacuum(void)+0xef5 [c:\cirrus\src\backend\postmaster\autovacuum.c@ 2503] It seems like there should be even more logs, especially since it says: [03:48:43.119] Uploading 3 artifacts for c:\cirrus\**\*.diffs [03:48:43.122] Uploaded c:\cirrus\contrib\tsm_system_rows\regression.diffs [03:48:43.125] Uploaded c:\cirrus\contrib\tsm_system_time\regression.diffs
On Tue, Mar 29, 2022 at 11:10 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > > + diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid); > + Assert(diff > 0); > > Did you see that this crashed on windows cfbot? > > https://api.cirrus-ci.com/v1/artifact/task/4592929254670336/log/tmp_check/postmaster.log > TRAP: FailedAssertion("diff > 0", File: "c:\cirrus\src\backend\access\heap\vacuumlazy.c", Line: 724, PID: 5984) That's weird. There are very similar assertions a little earlier, that must have *not* failed here, before the call to vac_update_relstats(). I was actually thinking of removing this assertion for that reason -- I thought that it was redundant. Perhaps something is amiss inside vac_update_relstats(), where the boolean flag that indicates that pg_class.relfrozenxid was advanced is set: if (frozenxid_updated) *frozenxid_updated = false; if (TransactionIdIsNormal(frozenxid) && pgcform->relfrozenxid != frozenxid && (TransactionIdPrecedes(pgcform->relfrozenxid, frozenxid) || TransactionIdPrecedes(ReadNextTransactionId(), pgcform->relfrozenxid))) { if (frozenxid_updated) *frozenxid_updated = true; pgcform->relfrozenxid = frozenxid; dirty = true; } Maybe the "existing relfrozenxid is in the future, silently update relfrozenxid" part of the condition (which involves ReadNextTransactionId()) somehow does the wrong thing here. But how? The other assertions take into account the fact that OldestXmin can itself "go backwards" across VACUUM operations against the same table: Assert(!aggressive || vacrel->NewRelfrozenXid == OldestXmin || TransactionIdPrecedesOrEquals(FreezeLimit, vacrel->NewRelfrozenXid)); Note the "vacrel->NewRelfrozenXid == OldestXmin", without which the assertion will fail pretty easily when the regression tests are run. Perhaps I need to do something like that with the other assertion as well (or more likely just get rid of it). Will figure it out tomorrow. -- Peter Geoghegan
On Wed, Mar 30, 2022 at 12:01 AM Peter Geoghegan <pg@bowt.ie> wrote: > Perhaps something is amiss inside vac_update_relstats(), where the > boolean flag that indicates that pg_class.relfrozenxid was advanced is > set: > > if (frozenxid_updated) > *frozenxid_updated = false; > if (TransactionIdIsNormal(frozenxid) && > pgcform->relfrozenxid != frozenxid && > (TransactionIdPrecedes(pgcform->relfrozenxid, frozenxid) || > TransactionIdPrecedes(ReadNextTransactionId(), > pgcform->relfrozenxid))) > { > if (frozenxid_updated) > *frozenxid_updated = true; > pgcform->relfrozenxid = frozenxid; > dirty = true; > } > > Maybe the "existing relfrozenxid is in the future, silently update > relfrozenxid" part of the condition (which involves > ReadNextTransactionId()) somehow does the wrong thing here. But how? I tried several times to recreate this issue on CI. No luck with that, though -- can't get it to fail again after 4 attempts. This was a VACUUM of pg_database, run from an autovacuum worker. I am vaguely reminded of the two bugs fixed by Andres in commit a54e1f15. Both were issues with the shared relcache init file affecting shared and nailed catalog relations. Those bugs had symptoms like " ERROR: found xmin ... from before relfrozenxid ..." for various system catalogs. We know that this particular assertion did not fail during the same VACUUM: Assert(vacrel->NewRelfrozenXid == OldestXmin || TransactionIdPrecedesOrEquals(vacrel->relfrozenxid, vacrel->NewRelfrozenXid)); So it's hard to see how this could be a bug in the patch -- the final new relfrozenxid is presumably equal to VACUUM's OldestXmin in the problem scenario seen on the CI Windows instance yesterday (that's why this earlier assertion didn't fail). The assertion I'm showing here needs the "vacrel->NewRelfrozenXid == OldestXmin" part of the condition to account for the fact that OldestXmin/GetOldestNonRemovableTransactionId() is known to "go backwards". Without that the regression tests will fail quite easily. The surprising part of the CI failure must have taken place just after this assertion, when VACUUM's call to vacuum_set_xid_limits() actually updates pg_class.relfrozenxid with vacrel->NewRelfrozenXid -- presumably because the existing relfrozenxid appeared to be "in the future" when we examine it in pg_class again. We see evidence that this must have happened afterwards, when the closely related assertion (used only in instrumentation code) fails: From my patch: > if (frozenxid_updated) > { > - diff = (int32) (FreezeLimit - vacrel->relfrozenxid); > + diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid); > + Assert(diff > 0); > appendStringInfo(&buf, > _("new relfrozenxid: %u, which is %d xids ahead of previous value\n"), > - FreezeLimit, diff); > + vacrel->NewRelfrozenXid, diff); > } Does anybody have any ideas about what might be going on here? -- Peter Geoghegan
Hi, On 2022-03-30 17:50:42 -0700, Peter Geoghegan wrote: > I tried several times to recreate this issue on CI. No luck with that, > though -- can't get it to fail again after 4 attempts. It's really annoying that we don't have Assert variants that show the compared values, that might make it easier to intepret what's going on. Something vaguely like EXPECT_EQ_U32 in regress.c. Maybe AssertCmp(type, a, op, b), Then the assertion could have been something like AssertCmp(int32, diff, >, 0) Does the line number in the failed run actually correspond to the xid, rather than the mxid case? I didn't check. You could try to increase the likelihood of reproducing the failure by duplicating the invocation that lead to the crash a few times in the .cirrus.yml file in your dev branch. That might allow hitting the problem more quickly. Maybe reduce autovacuum_naptime in src/tools/ci/pg_ci_base.conf? Or locally - one thing that windows CI does different from the other platforms is that it runs isolation, contrib and a bunch of other tests using the same cluster. Which of course increases the likelihood of autovacuum having stuff to do, *particularly* on shared relations - normally there's probably not enough changes for that. You can do something similar locally on linux with make -Otarget -C contrib/ -j48 -s USE_MODULE_DB=1 installcheck prove_installcheck=true (the prove_installcheck=true to prevent tap tests from running, we don't seem to have another way for that) I don't think windows uses USE_MODULE_DB=1, but it allows to cause a lot more load concurrently than running tests serially... > We know that this particular assertion did not fail during the same VACUUM: > > Assert(vacrel->NewRelfrozenXid == OldestXmin || > TransactionIdPrecedesOrEquals(vacrel->relfrozenxid, > vacrel->NewRelfrozenXid)); The comment in your patch says "is either older or newer than FreezeLimit" - I assume that's some rephrasing damage? > So it's hard to see how this could be a bug in the patch -- the final > new relfrozenxid is presumably equal to VACUUM's OldestXmin in the > problem scenario seen on the CI Windows instance yesterday (that's why > this earlier assertion didn't fail). Perhaps it's worth commiting improved assertions on master? If this is indeed a pre-existing bug, and we're just missing due to slightly less stringent asserts, we could rectify that separately. > The surprising part of the CI failure must have taken place just after > this assertion, when VACUUM's call to vacuum_set_xid_limits() actually > updates pg_class.relfrozenxid with vacrel->NewRelfrozenXid -- > presumably because the existing relfrozenxid appeared to be "in the > future" when we examine it in pg_class again. We see evidence that > this must have happened afterwards, when the closely related assertion > (used only in instrumentation code) fails: Hm. This triggers some vague memories. There's some oddities around shared relations being vacuumed separately in all the databases and thus having separate horizons. After "remembering" that, I looked in the cirrus log for the failed run, and the worker was processing shared a shared relation last: 2022-03-30 03:48:30.238 GMT [5984][autovacuum worker] LOG: automatic analyze of table "contrib_regression.pg_catalog.pg_authid" Obviously that's not a guarantee that the next table processed also is a shared catalog, but ... Oh, the relid is actually in the stack trace. 0x4ee = 1262 = pg_database. 
Which makes sense, the test ends up with a high percentage of dead rows in pg_database, due to all the different contrib tests creating/dropping a database.

> From my patch:
>
> >  if (frozenxid_updated)
> >  {
> > -     diff = (int32) (FreezeLimit - vacrel->relfrozenxid);
> > +     diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid);
> > +     Assert(diff > 0);
> >      appendStringInfo(&buf,
> >                       _("new relfrozenxid: %u, which is %d xids ahead of previous value\n"),
> > -                     FreezeLimit, diff);
> > +                     vacrel->NewRelfrozenXid, diff);
> >  }

Perhaps this ought to be an elog() instead of an Assert()? Something has gone pear shaped if we get here... It's a bit annoying though, because it'd have to be a PANIC to be visible on the bf / CI :(.

Greetings,

Andres Freund
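To make the AssertCmp() idea above concrete, here is a rough sketch of what such a macro could look like -- purely hypothetical, since nothing like it exists in c.h or regress.c today, and the INT64_FORMAT-based reporting is just one way it might be done:

#ifdef USE_ASSERT_CHECKING
#define AssertCmp(type, a, op, b) \
    do { \
        type    a_ = (a); \
        type    b_ = (b); \
        if (!(a_ op b_)) \
            elog(PANIC, "assertion failed: " #a " " #op " " #b \
                 ", left: " INT64_FORMAT ", right: " INT64_FORMAT, \
                 (int64) a_, (int64) b_); \
    } while (0)
#else
#define AssertCmp(type, a, op, b) ((void) 0)
#endif

/* the failing check could then have been spelled */
AssertCmp(int32, diff, >, 0);

Unlike a bare Assert(), a failure would report both operands, which would have made the diff == 0 case here visible from the CI log alone.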
On Wed, Mar 30, 2022 at 7:00 PM Andres Freund <andres@anarazel.de> wrote:
> Something vaguely like EXPECT_EQ_U32 in regress.c. Maybe
> AssertCmp(type, a, op, b)?
>
> Then the assertion could have been something like
> AssertCmp(int32, diff, >, 0)

I'd definitely use them if they were there.

> Does the line number in the failed run actually correspond to the xid, rather
> than the mxid case? I didn't check.

Yes, I verified -- definitely relfrozenxid.

> You can do something similar locally on linux with
> make -Otarget -C contrib/ -j48 -s USE_MODULE_DB=1 installcheck prove_installcheck=true
> (the prove_installcheck=true is to prevent tap tests from running; we don't seem
> to have another way to do that)
>
> I don't think windows uses USE_MODULE_DB=1, but it allows you to generate a lot more
> concurrent load than running tests serially...

Can't get it to fail locally with that recipe.

> > Assert(vacrel->NewRelfrozenXid == OldestXmin ||
> >        TransactionIdPrecedesOrEquals(vacrel->relfrozenxid,
> >                                      vacrel->NewRelfrozenXid));
>
> The comment in your patch says "is either older or newer than FreezeLimit" - I
> assume that's some rephrasing damage?

Both the comment and the assertion are correct. I see what you mean, though.

> Perhaps it's worth committing improved assertions on master? If this is indeed
> a pre-existing bug, and we're just missing it due to slightly less stringent
> asserts, we could rectify that separately.

I don't think there's much chance of the assertion actually hitting without the rest of the patch series. The new relfrozenxid value is always going to be OldestXmin - vacuum_freeze_min_age on HEAD, while with the patch it's sometimes close to OldestXmin. Especially when you have lots of dead tuples that you churn through constantly (like pgbench_tellers, or like these system catalogs on the CI test machine).

> Hm. This triggers some vague memories. There are some oddities around shared
> relations being vacuumed separately in all the databases and thus having
> separate horizons.

That's what I was thinking of, obviously.

> After "remembering" that, I looked in the cirrus log for the failed run, and
> the worker was processing a shared relation last:
>
> 2022-03-30 03:48:30.238 GMT [5984][autovacuum worker] LOG: automatic analyze of table "contrib_regression.pg_catalog.pg_authid"

I noticed the same thing myself. Should have said sooner.

> Perhaps this ought to be an elog() instead of an Assert()? Something has gone
> pear shaped if we get here... It's a bit annoying though, because it'd have to
> be a PANIC to be visible on the bf / CI :(.

Yeah, a WARNING would be good here. I can write a new version of my patch series with a separate patch for that this evening. Actually, better make it a PANIC for now...
--
Peter Geoghegan
On Wed, Mar 30, 2022 at 7:37 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Yeah, a WARNING would be good here. I can write a new version of my
> patch series with a separate patch for that this evening. Actually,
> better make it a PANIC for now...

Attached is v14, which includes a new patch that PANICs like that in vac_update_relstats() --- 0003. This approach also covers manual VACUUMs, unlike the failing assertion, which is in instrumentation code (actually VACUUM VERBOSE might hit it).

I definitely think that something like this should be committed. Silently ignoring system catalog corruption isn't okay.
--
Peter Geoghegan
Hi, I was able to trigger the crash. cat ~/tmp/pgbench-createdb.sql CREATE DATABASE pgb_:client_id; DROP DATABASE pgb_:client_id; pgbench -n -P1 -c 10 -j10 -T100 -f ~/tmp/pgbench-createdb.sql while I was also running for i in $(seq 1 100); do echo iteration $i; make -Otarget -C contrib/ -s installcheck -j48 -s prove_installcheck=true USE_MODULE_DB=1> /tmp/ci-$i.log 2>&1; done I triggered twice now, but it took a while longer the second time. (gdb) bt full #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49 set = {__val = {4194304, 0, 0, 0, 0, 0, 216172782113783808, 2, 2377909399344644096, 18446497967838863616, 0, 0, 0,0, 0, 0}} pid = <optimized out> tid = <optimized out> ret = <optimized out> #1 0x00007fe49a2db546 in __GI_abort () at abort.c:79 save_stage = 1 act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, sa_mask = {__val = {0, 0, 0, 0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 0, 0}}, sa_flags = 0, sa_restorer = 0x107e0} sigs = {__val = {32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}} #2 0x00007fe49b9706f1 in ExceptionalCondition (conditionName=0x7fe49ba0618d "diff > 0", errorType=0x7fe49ba05bd1 "FailedAssertion", fileName=0x7fe49ba05b90 "/home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c", lineNumber=724) at /home/andres/src/postgresql/src/backend/utils/error/assert.c:69 No locals. #3 0x00007fe49b2fc739 in heap_vacuum_rel (rel=0x7fe497a8d148, params=0x7fe49c130d7c, bstrategy=0x7fe49c130e10) at /home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:724 buf = { data = 0x7fe49c17e238 "automatic vacuum of table \"contrib_regression_dict_int.pg_catalog.pg_database\": indexscans: 1\npages: 0 removed, 3 remain, 3 scanned (100.00% of total)\ntuples: 49 removed, 53 remain, 9 are dead but no"...,len = 279, maxlen = 1024, cursor = 0} msgfmt = 0x7fe49ba06038 "automatic vacuum of table \"%s.%s.%s\": index scans: %d\n" diff = 0 endtime = 702011687982080 vacrel = 0x7fe49c19b5b8 verbose = false instrument = true ru0 = {tv = {tv_sec = 1648696487, tv_usec = 975963}, ru = {ru_utime = {tv_sec = 0, tv_usec = 0}, ru_stime = {tv_sec= 0, tv_usec = 3086}, { --Type <RET> for more, q to quit, c to continue without paging--c ru_maxrss = 10824, __ru_maxrss_word = 10824}, {ru_ixrss = 0, __ru_ixrss_word = 0}, {ru_idrss = 0, __ru_idrss_word= 0}, {ru_isrss = 0, __ru_isrss_word = 0}, {ru_minflt = 449, __ru_minflt_word = 449}, {ru_majflt = 0, __ru_majflt_word= 0}, {ru_nswap = 0, __ru_nswap_word = 0}, {ru_inblock = 0, __ru_inblock_word = 0}, {ru_oublock = 0, __ru_oublock_word= 0}, {ru_msgsnd = 0, __ru_msgsnd_word = 0}, {ru_msgrcv = 0, __ru_msgrcv_word = 0}, {ru_nsignals = 0, __ru_nsignals_word= 0}, {ru_nvcsw = 2, __ru_nvcsw_word = 2}, {ru_nivcsw = 0, __ru_nivcsw_word = 0}}} starttime = 702011687975964 walusage_start = {wal_records = 0, wal_fpi = 0, wal_bytes = 0} walusage = {wal_records = 11, wal_fpi = 7, wal_bytes = 30847} secs = 0 usecs = 6116 read_rate = 16.606033355134073 write_rate = 7.6643230869849575 aggressive = false skipwithvm = true frozenxid_updated = true minmulti_updated = true orig_rel_pages = 3 new_rel_pages = 3 new_rel_allvisible = 0 indnames = 0x7fe49c19bb28 errcallback = {previous = 0x0, callback = 0x7fe49b3012fd <vacuum_error_callback>, arg = 0x7fe49c19b5b8} startreadtime = 180 startwritetime = 0 OldestXmin = 67552 FreezeLimit = 4245034848 OldestMxact = 224 MultiXactCutoff = 4289967520 __func__ = "heap_vacuum_rel" #4 0x00007fe49b523d92 in table_relation_vacuum (rel=0x7fe497a8d148, params=0x7fe49c130d7c, bstrategy=0x7fe49c130e10) 
at/home/andres/src/postgresql/src/include/access/tableam.h:1680 No locals. #5 0x00007fe49b527032 in vacuum_rel (relid=1262, relation=0x7fe49c1ae360, params=0x7fe49c130d7c) at /home/andres/src/postgresql/src/backend/commands/vacuum.c:2065 lmode = 4 rel = 0x7fe497a8d148 lockrelid = {relId = 1262, dbId = 0} toast_relid = 0 save_userid = 10 save_sec_context = 0 save_nestlevel = 2 __func__ = "vacuum_rel" #6 0x00007fe49b524c3b in vacuum (relations=0x7fe49c1b03a8, params=0x7fe49c130d7c, bstrategy=0x7fe49c130e10, isTopLevel=true)at /home/andres/src/postgresql/src/backend/commands/vacuum.c:482 vrel = 0x7fe49c1ae3b8 cur__state = {l = 0x7fe49c1b03a8, i = 0} cur = 0x7fe49c1b03c0 _save_exception_stack = 0x7fff97e35a10 _save_context_stack = 0x0 _local_sigjmp_buf = {{__jmpbuf = {140735741652128, 6126579318940970843, 9223372036854775747, 0, 0, 0, 6126579318957748059,6139499258682879835}, __mask_was_saved = 0, __saved_mask = {__val = {32, 140619848279000, 8590910454,140619848278592, 32, 140619848278944, 7784, 140619848278592, 140619848278816, 140735741647200, 140619839915137,8458711686435861857, 32, 4869, 140619848278592, 140619848279024}}}} _do_rethrow = false in_vacuum = true stmttype = 0x7fe49baff1a7 "VACUUM" in_outer_xact = false use_own_xacts = true __func__ = "vacuum" #7 0x00007fe49b6d483d in autovacuum_do_vac_analyze (tab=0x7fe49c130d78, bstrategy=0x7fe49c130e10) at /home/andres/src/postgresql/src/backend/postmaster/autovacuum.c:3247 rangevar = 0x7fe49c1ae360 rel = 0x7fe49c1ae3b8 rel_list = 0x7fe49c1ae3f0 #8 0x00007fe49b6d34bc in do_autovacuum () at /home/andres/src/postgresql/src/backend/postmaster/autovacuum.c:2495 _save_exception_stack = 0x7fff97e35d70 _save_context_stack = 0x0 _local_sigjmp_buf = {{__jmpbuf = {140735741652128, 6126579318779490139, 9223372036854775747, 0, 0, 0, 6126579319014371163,6139499700101525339}, __mask_was_saved = 0, __saved_mask = {__val = {140619840139982, 140735741647712,140619841923928, 957, 140619847223443, 140735741647656, 140619847312112, 140619847223451, 140619847223443,140619847224399, 0, 139637976727552, 140619817480714, 140735741647616, 140619839856340, 1024}}}} _do_rethrow = false tab = 0x7fe49c130d78 skipit = false stdVacuumCostDelay = 0 stdVacuumCostLimit = 200 iter = {cur = 0x7fe497668da0, end = 0x7fe497668da0} relid = 1262 classTup = 0x7fe497a6c568 isshared = true cell__state = {l = 0x7fe49c130d40, i = 0} classRel = 0x7fe497a5ae18 tuple = 0x0 relScan = 0x7fe49c130928 dbForm = 0x7fe497a64fb8 table_oids = 0x7fe49c130d40 orphan_oids = 0x0 ctl = {num_partitions = 0, ssize = 0, dsize = 1296236544, max_dsize = 140619847224424, keysize = 4, entrysize = 96,hash = 0x0, match = 0x0, keycopy = 0x0, alloc = 0x0, hcxt = 0x7fff97e35c50, hctl = 0x7fe49b9a787e <AllocSetFree+670>} table_toast_map = 0x7fe49c19d2f0 cell = 0x7fe49c130d58 shared = 0x7fe49c17c360 dbentry = 0x7fe49c18d7a0 bstrategy = 0x7fe49c130e10 key = {sk_flags = 0, sk_attno = 17, sk_strategy = 3, sk_subtype = 0, sk_collation = 950, sk_func = {fn_addr = 0x7fe49b809a6a<chareq>, fn_oid = 61, fn_nargs = 2, fn_strict = true, fn_retset = false, fn_stats = 2 '\002', fn_extra = 0x0,fn_mcxt = 0x7fe49c12f7f0, fn_expr = 0x0}, sk_argument = 116} pg_class_desc = 0x7fe49c12f910 effective_multixact_freeze_max_age = 400000000 did_vacuum = false found_concurrent_worker = false i = 32740 __func__ = "do_autovacuum" #9 0x00007fe49b6d21c4 in AutoVacWorkerMain (argc=0, argv=0x0) at /home/andres/src/postgresql/src/backend/postmaster/autovacuum.c:1719 dbname = 
"contrib_regression_dict_int\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000" local_sigjmp_buf = {{__jmpbuf = {140735741652128, 6126579318890639195, 9223372036854775747, 0, 0, 0, 6126579318785781595,6139499699353759579}, __mask_was_saved = 1, __saved_mask = {__val = {18446744066192964099, 8, 140735741648416,140735741648352, 3156423108750738944, 0, 30, 140735741647888, 140619835812981, 140735741648080, 32666874400,140735741648448, 140619836964693, 140735741652128, 2586778441, 140735741648448}}}} dbid = 205328 __func__ = "AutoVacWorkerMain" #10 0x00007fe49b6d1d5b in StartAutoVacWorker () at /home/andres/src/postgresql/src/backend/postmaster/autovacuum.c:1504 worker_pid = 0 __func__ = "StartAutoVacWorker" #11 0x00007fe49b6e79af in StartAutovacuumWorker () at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:5635 bn = 0x7fe49c0da920 __func__ = "StartAutovacuumWorker" #12 0x00007fe49b6e745d in sigusr1_handler (postgres_signal_arg=10) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:5340 save_errno = 4 __func__ = "sigusr1_handler" #13 <signal handler called> No locals. #14 0x00007fe49a3a9fc4 in __GI___select (nfds=8, readfds=0x7fff97e36c20, writefds=0x0, exceptfds=0x0, timeout=0x7fff97e36ca0)at ../sysdeps/unix/sysv/linux/select.c:71 sc_ret = -4 sc_ret = <optimized out> s = <optimized out> us = <optimized out> ns = <optimized out> ts64 = {tv_sec = 59, tv_nsec = 765565741} pts64 = <optimized out> r = <optimized out> #15 0x00007fe49b6e26c7 in ServerLoop () at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1765 timeout = {tv_sec = 60, tv_usec = 0} rmask = {fds_bits = {224, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}} selres = -1 now = 1648696487 readmask = {fds_bits = {224, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}} nSockets = 8 last_lockfile_recheck_time = 1648696432 last_touch_time = 1648696072 __func__ = "ServerLoop" #16 0x00007fe49b6e2031 in PostmasterMain (argc=55, argv=0x7fe49c0aa2d0) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1473 opt = -1 status = 0 userDoption = 0x7fe49c0951d0 "/srv/dev/pgdev-dev/" listen_addr_saved = true i = 64 output_config_variable = 0x0 __func__ = "PostmasterMain" #17 0x00007fe49b5d2808 in main (argc=55, argv=0x7fe49c0aa2d0) at /home/andres/src/postgresql/src/backend/main/main.c:202 do_check_root = true Greetings, Andres Freund
On Wed, Mar 30, 2022 at 8:28 PM Andres Freund <andres@anarazel.de> wrote: > I triggered twice now, but it took a while longer the second time. Great. I wonder if you can get an RR recording... -- Peter Geoghegan
Hi,

On 2022-03-30 20:28:44 -0700, Andres Freund wrote:
> I was able to trigger the crash.
>
> cat ~/tmp/pgbench-createdb.sql
> CREATE DATABASE pgb_:client_id;
> DROP DATABASE pgb_:client_id;
>
> pgbench -n -P1 -c 10 -j10 -T100 -f ~/tmp/pgbench-createdb.sql
>
> while I was also running
>
> for i in $(seq 1 100); do echo iteration $i; make -Otarget -C contrib/ -s installcheck -j48 -s prove_installcheck=true USE_MODULE_DB=1 > /tmp/ci-$i.log 2>&1; done
>
> I triggered twice now, but it took a while longer the second time.

Forgot to say how postgres was started. Via my usual devenv script, which results in:

+ /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres -c hba_file=/home/andres/tmp/pgdev/pg_hba.conf -D /srv/dev/pgdev-dev/ -p 5440 -c shared_buffers=2GB -c wal_level=hot_standby -c max_wal_senders=10 -c track_io_timing=on -c restart_after_crash=false -c max_prepared_transactions=20 -c log_checkpoints=on -c min_wal_size=48MB -c max_wal_size=150GB -c 'cluster_name=dev assert' -c ssl_cert_file=/home/andres/tmp/pgdev/ssl-cert-snakeoil.pem -c ssl_key_file=/home/andres/tmp/pgdev/ssl-cert-snakeoil.key -c 'log_line_prefix=%m [%p][%b][%v:%x][%a] ' -c shared_buffers=16MB -c log_min_messages=debug1 -c log_connections=on -c allow_in_place_tablespaces=1 -c log_autovacuum_min_duration=0 -c log_lock_waits=true -c autovacuum_naptime=10s -c fsync=off

Greetings,

Andres Freund
Hi,

On 2022-03-30 20:35:25 -0700, Peter Geoghegan wrote:
> On Wed, Mar 30, 2022 at 8:28 PM Andres Freund <andres@anarazel.de> wrote:
> > I triggered twice now, but it took a while longer the second time.
>
> Great.
>
> I wonder if you can get an RR recording...

Started it, but looks like it's too slow.

(gdb) p MyProcPid
$1 = 2172500

(gdb) p vacrel->NewRelfrozenXid
$3 = 717
(gdb) p vacrel->relfrozenxid
$4 = 717
(gdb) p OldestXmin
$5 = 5112
(gdb) p aggressive
$6 = false

There was another autovacuum of pg_database 10s before:

2022-03-30 20:35:17.622 PDT [2165344][autovacuum worker][5/3:0][] LOG: automatic vacuum of table "postgres.pg_catalog.pg_database": index scans: 1
pages: 0 removed, 3 remain, 3 scanned (100.00% of total)
tuples: 61 removed, 4 remain, 1 are dead but not yet removable
removable cutoff: 1921, older by 3 xids when operation ended
new relfrozenxid: 717, which is 3 xids ahead of previous value
index scan needed: 3 pages from table (100.00% of total) had 599 dead item identifiers removed
index "pg_database_datname_index": pages: 2 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "pg_database_oid_index": pages: 4 in total, 0 newly deleted, 0 currently deleted, 0 reusable
I/O timings: read: 0.029 ms, write: 0.034 ms
avg read rate: 134.120 MB/s, avg write rate: 89.413 MB/s
buffer usage: 35 hits, 12 misses, 8 dirtied
WAL usage: 12 records, 5 full page images, 27218 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s

The dying backend:

2022-03-30 20:35:27.668 PDT [2172500][autovacuum worker][7/0:0][] DEBUG: autovacuum: processing database "contrib_regression_hstore"
...
2022-03-30 20:35:27.690 PDT [2172500][autovacuum worker][7/674:0][] CONTEXT: while cleaning up index "pg_database_oid_index" of relation "pg_catalog.pg_database"

Greetings,

Andres Freund
On Wed, Mar 30, 2022 at 9:04 PM Andres Freund <andres@anarazel.de> wrote: > (gdb) p vacrel->NewRelfrozenXid > $3 = 717 > (gdb) p vacrel->relfrozenxid > $4 = 717 > (gdb) p OldestXmin > $5 = 5112 > (gdb) p aggressive > $6 = false Does this OldestXmin seem reasonable at this point in execution, based on context? Does it look too high? Something else? -- Peter Geoghegan
Hi,

On 2022-03-30 21:04:07 -0700, Andres Freund wrote:
> On 2022-03-30 20:35:25 -0700, Peter Geoghegan wrote:
> > On Wed, Mar 30, 2022 at 8:28 PM Andres Freund <andres@anarazel.de> wrote:
> > > I triggered twice now, but it took a while longer the second time.
> >
> > Great.
> >
> > I wonder if you can get an RR recording...
>
> Started it, but looks like it's too slow.
>
> (gdb) p MyProcPid
> $1 = 2172500
>
> (gdb) p vacrel->NewRelfrozenXid
> $3 = 717
> (gdb) p vacrel->relfrozenxid
> $4 = 717
> (gdb) p OldestXmin
> $5 = 5112
> (gdb) p aggressive
> $6 = false

I added a bunch of debug elogs to see what sets *frozenxid_updated to true.

(gdb) p *vacrel
$1 = {rel = 0x7fe24f3e0148, indrels = 0x7fe255c17ef8, nindexes = 2, aggressive = false, skipwithvm = true, failsafe_active = false, consider_bypass_optimization = true, do_index_vacuuming = true, do_index_cleanup = true, do_rel_truncate = true, bstrategy = 0x7fe255bb0e28, pvs = 0x0, relfrozenxid = 717, relminmxid = 6, old_live_tuples = 42, OldestXmin = 20751, vistest = 0x7fe255058970 <GlobalVisSharedRels>, FreezeLimit = 4244988047, MultiXactCutoff = 4289967302, NewRelfrozenXid = 717, NewRelminMxid = 6, skippedallvis = false, relnamespace = 0x7fe255c17bf8 "pg_catalog", relname = 0x7fe255c17cb8 "pg_database", indname = 0x0, blkno = 4294967295, offnum = 0, phase = VACUUM_ERRCB_PHASE_SCAN_HEAP, verbose = false, dead_items = 0x7fe255c131d0, rel_pages = 8, scanned_pages = 8, removed_pages = 0, lpdead_item_pages = 0, missed_dead_pages = 0, nonempty_pages = 8, new_rel_tuples = 124, new_live_tuples = 42, indstats = 0x7fe255c18320, num_index_scans = 0, tuples_deleted = 0, lpdead_items = 0, live_tuples = 42, recently_dead_tuples = 82, missed_dead_tuples = 0}

But the debug elog reports that

relfrozenxid updated 714 -> 717
relminmxid updated 1 -> 6

The problem is that the crashing backend reads the relfrozenxid/relminmxid from the shared relcache init file written by another backend:

2022-03-30 21:10:47.626 PDT [2625038][autovacuum worker][6/433:0][] LOG: automatic vacuum of table "contrib_regression_postgres_fdw.pg_catalog.pg_database": index scans: 1
pages: 0 removed, 8 remain, 8 scanned (100.00% of total)
tuples: 4 removed, 114 remain, 72 are dead but not yet removable
removable cutoff: 20751, older by 596 xids when operation ended
new relfrozenxid: 717, which is 3 xids ahead of previous value
new relminmxid: 6, which is 5 mxids ahead of previous value
index scan needed: 3 pages from table (37.50% of total) had 8 dead item identifiers removed
index "pg_database_datname_index": pages: 2 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "pg_database_oid_index": pages: 6 in total, 0 newly deleted, 2 currently deleted, 2 reusable
I/O timings: read: 0.050 ms, write: 0.102 ms
avg read rate: 209.860 MB/s, avg write rate: 76.313 MB/s
buffer usage: 42 hits, 22 misses, 8 dirtied
WAL usage: 13 records, 5 full page images, 33950 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
...
2022-03-30 21:10:47.772 PDT [2625043][autovacuum worker][:0][] DEBUG: InitPostgres
2022-03-30 21:10:47.772 PDT [2625043][autovacuum worker][6/0:0][] DEBUG: my backend ID is 6
2022-03-30 21:10:47.772 PDT [2625043][autovacuum worker][6/0:0][] LOG: reading shared init file
2022-03-30 21:10:47.772 PDT [2625043][autovacuum worker][6/443:0][] DEBUG: StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/sub>
2022-03-30 21:10:47.772 PDT [2625043][autovacuum worker][6/443:0][] LOG: reading non-shared init file

This is basically the inverse of a54e1f15 - we read a *newer* horizon. That's normally fairly harmless - I think.

Perhaps we should just fetch the horizons from the "local" catalog for shared rels?

Greetings,

Andres Freund
Hi,

On 2022-03-30 21:11:48 -0700, Peter Geoghegan wrote:
> On Wed, Mar 30, 2022 at 9:04 PM Andres Freund <andres@anarazel.de> wrote:
> > (gdb) p vacrel->NewRelfrozenXid
> > $3 = 717
> > (gdb) p vacrel->relfrozenxid
> > $4 = 717
> > (gdb) p OldestXmin
> > $5 = 5112
> > (gdb) p aggressive
> > $6 = false
>
> Does this OldestXmin seem reasonable at this point in execution, based
> on context? Does it look too high? Something else?

Reasonable:

(gdb) p *ShmemVariableCache
$1 = {nextOid = 78969, oidCount = 2951, nextXid = {value = 21411}, oldestXid = 714, xidVacLimit = 200000714, xidWarnLimit = 2107484361, xidStopLimit = 2144484361, xidWrapLimit = 2147484361, oldestXidDB = 1, oldestCommitTsXid = 0, newestCommitTsXid = 0, latestCompletedXid = {value = 21408}, xactCompletionCount = 1635, oldestClogXid = 714}

I think the explanation I just sent explains the problem, without "in-memory" confusion about what's running and what's not.

Greetings,

Andres Freund
On Wed, Mar 30, 2022 at 9:20 PM Andres Freund <andres@anarazel.de> wrote:
> But the debug elog reports that
>
> relfrozenxid updated 714 -> 717
> relminmxid updated 1 -> 6
>
> The problem is that the crashing backend reads the relfrozenxid/relminmxid
> from the shared relcache init file written by another backend:

We should have added logging of relfrozenxid and relminmxid a long time ago.

> This is basically the inverse of a54e1f15 - we read a *newer* horizon. That's
> normally fairly harmless - I think.

Is this one pretty old?

> Perhaps we should just fetch the horizons from the "local" catalog for shared
> rels?

Not sure what you mean.
--
Peter Geoghegan
On Wed, Mar 30, 2022 at 9:29 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Perhaps we should just fetch the horizons from the "local" catalog for shared > > rels? > > Not sure what you mean. Wait, you mean use vacrel->relfrozenxid directly? Seems kind of ugly... -- Peter Geoghegan
Hi,

On 2022-03-30 21:29:16 -0700, Peter Geoghegan wrote:
> On Wed, Mar 30, 2022 at 9:20 PM Andres Freund <andres@anarazel.de> wrote:
> > But the debug elog reports that
> >
> > relfrozenxid updated 714 -> 717
> > relminmxid updated 1 -> 6
> >
> > The problem is that the crashing backend reads the relfrozenxid/relminmxid
> > from the shared relcache init file written by another backend:
>
> We should have added logging of relfrozenxid and relminmxid a long time ago.

At least at DEBUG1 or such.

> > This is basically the inverse of a54e1f15 - we read a *newer* horizon. That's
> > normally fairly harmless - I think.
>
> Is this one pretty old?

What do you mean by "this one"? The cause for the assert failure?

I'm not sure there's a proper bug on HEAD here. I think at worst it can delay the horizon increasing a bunch, by falsely not using an aggressive vacuum when we should have - might even be limited to a single autovacuum cycle.

> > Perhaps we should just fetch the horizons from the "local" catalog for shared
> > rels?
>
> Not sure what you mean.

Basically, instead of relying on the relcache, which for shared relations is vulnerable to seeing "too new" horizons due to the shared relcache init file, explicitly load relfrozenxid / relminmxid from the catalog / syscache.

I.e. fetch the relevant pg_class row in heap_vacuum_rel() (using SearchSysCache[Copy1](RELID)). And use that to set vacrel->relfrozenxid etc. Whereas right now we only fetch the pg_class row in vac_update_relstats(), but use the relcache before.

Greetings,

Andres Freund
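A minimal sketch of what that could look like near the top of heap_vacuum_rel() (illustrative only -- the exact placement and the error message wording are assumptions, not a tested patch):

    HeapTuple   ctup;
    Form_pg_class pgcform;

    /*
     * Read relfrozenxid/relminmxid for the target rel from its pg_class row
     * via the syscache, rather than trusting rel->rd_rel, which for a shared
     * catalog may have come from another database's shared relcache init
     * file.
     */
    ctup = SearchSysCache1(RELOID, ObjectIdGetDatum(RelationGetRelid(rel)));
    if (!HeapTupleIsValid(ctup))
        elog(ERROR, "pg_class entry for relid %u vanished during vacuuming",
             RelationGetRelid(rel));
    pgcform = (Form_pg_class) GETSTRUCT(ctup);

    vacrel->relfrozenxid = pgcform->relfrozenxid;
    vacrel->relminmxid = pgcform->relminmxid;

    ReleaseSysCache(ctup);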
Hi,

On 2022-03-30 21:59:15 -0700, Andres Freund wrote:
> On 2022-03-30 21:29:16 -0700, Peter Geoghegan wrote:
> > On Wed, Mar 30, 2022 at 9:20 PM Andres Freund <andres@anarazel.de> wrote:
> > > Perhaps we should just fetch the horizons from the "local" catalog for shared
> > > rels?
> >
> > Not sure what you mean.
>
> Basically, instead of relying on the relcache, which for shared relations is
> vulnerable to seeing "too new" horizons due to the shared relcache init file,
> explicitly load relfrozenxid / relminmxid from the catalog / syscache.
>
> I.e. fetch the relevant pg_class row in heap_vacuum_rel() (using
> SearchSysCache[Copy1](RELID)). And use that to set vacrel->relfrozenxid
> etc. Whereas right now we only fetch the pg_class row in
> vac_update_relstats(), but use the relcache before.

Perhaps we should explicitly mask out parts of relcache entries in the shared init file that we know to be unreliable. I.e. set relfrozenxid, relminmxid to Invalid* or such.

I even wonder if we should just generally move those out of the fields we have in the relcache, not just for shared rels loaded from the shared init file. Presumably by just moving them into the CATALOG_VARLEN ifdef. The only place that appears to access rd_rel->relfrozenxid outside of DDL is heap_abort_speculative().

Greetings,

Andres Freund
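The masking idea could be as small as the following, somewhere in the code that restores entries from the *shared* relcache init file (a hand-wavy sketch; the placement, and the assumption that a per-entry rel pointer and a shared flag are in scope, are mine):

    /*
     * Don't trust horizons restored from the shared relcache init file:
     * another database's VACUUM may have advanced them after the file was
     * written.  Clear them so nothing relies on them by accident.
     */
    if (shared)
    {
        rel->rd_rel->relfrozenxid = InvalidTransactionId;
        rel->rd_rel->relminmxid = InvalidMultiXactId;
    }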
On Thu, Mar 31, 2022 at 9:37 AM Andres Freund <andres@anarazel.de> wrote: > Perhaps we should explicitly mask out parts of relcache entries in the shared > init file that we know to be unreliable. I.e. set relfrozenxid, relminmxid to > Invalid* or such. That has the advantage of being more honest. If you're going to break the abstraction, then it seems best to break it in an obvious way, that leaves no doubts about what you're supposed to be relying on. This bug doesn't seem like the kind of thing that should be left as-is. If only because it makes it hard to add something like a WARNING when we make relfrozenxid go backwards (on the basis of the existing value apparently being in the future), which we really should have been doing all along. The whole reason why we overwrite pg_class.relfrozenxid values from the future is to ameliorate the effects of more serious bugs like the pg_upgrade/pg_resetwal one fixed in commit 74cf7d46 not so long ago (mid last year). We had essentially the same pg_upgrade "from the future" bug twice (once for relminmxid in the MultiXact bug era, another more recent version affecting relfrozenxid). > The only place that appears to access rd_rel->relfrozenxid outside of DDL is > heap_abort_speculative(). I wonder how necessary that really is. Even if the XID is before relfrozenxid, does that in itself really make it "in the future"? Obviously it's often necessary to make the assumption that allowing wraparound amounts to allowing XIDs "from the future" to exist, which is dangerous. But why here? Won't pruning by VACUUM eventually correct the issue anyway? -- Peter Geoghegan
Hi,

On 2022-03-31 09:58:18 -0700, Peter Geoghegan wrote:
> On Thu, Mar 31, 2022 at 9:37 AM Andres Freund <andres@anarazel.de> wrote:
> > The only place that appears to access rd_rel->relfrozenxid outside of DDL is
> > heap_abort_speculative().
>
> I wonder how necessary that really is. Even if the XID is before
> relfrozenxid, does that in itself really make it "in the future"?
> Obviously it's often necessary to make the assumption that allowing
> wraparound amounts to allowing XIDs "from the future" to exist, which
> is dangerous. But why here? Won't pruning by VACUUM eventually correct
> the issue anyway?

I don't think we should weaken defenses against xids from before relfrozenxid in vacuum / amcheck / .... If anything we should strengthen them.

Isn't it also just plainly required for correctness? We'd not necessarily trigger a vacuum in time to remove the xid before approaching wraparound if we put in an xid before relfrozenxid? That happening in prune_xid is obviously less bad than on actual data, but still.

ISTM we should just use our own xid. Yes, it might delay cleanup a bit longer. But unless there's already crud on the page (with prune_xid already set), the abort of the speculative insertion isn't likely to make much of a difference?

Greetings,

Andres Freund
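For concreteness, the "use our own xid" idea would seemingly boil down to a one-liner in heap_abort_speculative(), along these lines (a sketch only; the surrounding code is paraphrased rather than quoted):

    /*
     * The aborted tuple is dead, so hint that the page is prunable.  Use our
     * own XID for the hint instead of consulting rd_rel->relfrozenxid: we
     * inserted the tuple ourselves, so our XID is certainly not older than
     * the rel's relfrozenxid.
     */
    PageSetPrunable(page, GetCurrentTransactionId());

That would also remove the one non-DDL reader of rd_rel->relfrozenxid, at the cost of a slightly later prune hint.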
On Wed, Mar 30, 2022 at 9:59 PM Andres Freund <andres@anarazel.de> wrote: > I'm not sure there's a proper bug on HEAD here. I think at worst it can delay > the horizon increasing a bunch, by falsely not using an aggressive vacuum when > we should have - might even be limited to a single autovacuum cycle. So, to be clear: vac_update_relstats() never actually considered the new relfrozenxid value from its vacuumlazy.c caller to be "in the future"? It just looked that way to the failing assertion in vacuumlazy.c, because its own version of the original relfrozenxid was stale from the beginning? And so the worst problem is probably just that we don't use aggressive VACUUM when we really should in rare cases? -- Peter Geoghegan
On Thu, Mar 31, 2022 at 10:11 AM Andres Freund <andres@anarazel.de> wrote:
> I don't think we should weaken defenses against xids from before relfrozenxid
> in vacuum / amcheck / .... If anything we should strengthen them.
>
> Isn't it also just plainly required for correctness? We'd not necessarily
> trigger a vacuum in time to remove the xid before approaching wraparound if we
> put in an xid before relfrozenxid? That happening in prune_xid is obviously
> less bad than on actual data, but still.

Yeah, you're right. Ambiguity about stuff like this should be avoided on general principle.

> ISTM we should just use our own xid. Yes, it might delay cleanup a bit
> longer. But unless there's already crud on the page (with prune_xid already
> set), the abort of the speculative insertion isn't likely to make much of a
> difference?

Speculative insertion abort is pretty rare in the real world, I bet. The speculative insertion precheck is very likely to succeed almost every time with real workloads.
--
Peter Geoghegan
Hi, On 2022-03-31 10:12:49 -0700, Peter Geoghegan wrote: > On Wed, Mar 30, 2022 at 9:59 PM Andres Freund <andres@anarazel.de> wrote: > > I'm not sure there's a proper bug on HEAD here. I think at worst it can delay > > the horizon increasing a bunch, by falsely not using an aggressive vacuum when > > we should have - might even be limited to a single autovacuum cycle. > > So, to be clear: vac_update_relstats() never actually considered the > new relfrozenxid value from its vacuumlazy.c caller to be "in the > future"? No, I added separate debug messages for those, and also applied your patch, and it didn't trigger. I don't immediately see how we could end up computing a frozenxid value that would be problematic? The pgcform->relfrozenxid value will always be the "local" value, which afaics can be behind the other database's value (and thus behind the value from the relcache init file). But it can't be ahead, we have the proper invalidations for that (I think). I do think we should apply a version of the warnings you have (with a WARNING instead of PANIC obviously). I think it's bordering on insanity that we have so many paths to just silently fix stuff up around vacuum. It's like we want things to be undebuggable, and to give users no warnings about something being up. > It just looked that way to the failing assertion in > vacuumlazy.c, because its own version of the original relfrozenxid was > stale from the beginning? And so the worst problem is probably just > that we don't use aggressive VACUUM when we really should in rare > cases? Yes, I think that's right. Can you repro the issue with my recipe? FWIW, adding log_min_messages=debug5 and fsync=off made the crash trigger more quickly. Greetings, Andres Freund
On Thu, Mar 31, 2022 at 10:50 AM Andres Freund <andres@anarazel.de> wrote: > > So, to be clear: vac_update_relstats() never actually considered the > > new relfrozenxid value from its vacuumlazy.c caller to be "in the > > future"? > > No, I added separate debug messages for those, and also applied your patch, > and it didn't trigger. The assert is "Assert(diff > 0)", and not "Assert(diff >= 0)". Plus the other related assert I mentioned did not trigger. So when this "diff" assert did trigger, the value of "diff" must have been 0 (not a negative value). While this state does technically indicate that the "existing" relfrozenxid value (actually a stale version) appears to be "in the future" (because the OldestXmin XID might still never have been allocated), it won't ever be in the future according to vac_update_relstats() (even if it used that version). I suppose that I might be wrong about that, somehow -- anything is possible. The important point is that there is currently no evidence that this bug (or any very recent bug) could ever allow vac_update_relstats() to actually believe that it needs to update relfrozenxid/relminmxid, purely because the existing value is in the future. The fact that vac_update_relstats() doesn't log/warn when this happens is very unfortunate, but there is nevertheless no evidence that that would have informed us of any bug on HEAD, even including the actual bug here, which is a bug in vacuumlazy.c (not in vac_update_relstats). > I do think we should apply a version of the warnings you have (with a WARNING > instead of PANIC obviously). I think it's bordering on insanity that we have > so many paths to just silently fix stuff up around vacuum. It's like we want > things to be undebuggable, and to give users no warnings about something being > up. Yeah, it's just totally self defeating to not at least log it. I mean this is a code path that is only hit once per VACUUM, so there is practically no risk of that causing any new problems. > Can you repro the issue with my recipe? FWIW, adding log_min_messages=debug5 > and fsync=off made the crash trigger more quickly. I'll try to do that today. I'm not feeling the most energetic right now, to be honest. -- Peter Geoghegan
On Thu, Mar 31, 2022 at 11:19 AM Peter Geoghegan <pg@bowt.ie> wrote:
> The assert is "Assert(diff > 0)", and not "Assert(diff >= 0)".

Attached is v15. I plan to commit the first two patches (the most substantial two patches by far) in the next couple of days, barring objections.

v15 removes this "Assert(diff > 0)" assertion from 0001. It's not adding any value, now that the underlying issue that it accidentally brought to light is well understood (there are still more robust assertions covering the relfrozenxid/relminmxid invariants). "Assert(diff > 0)" is liable to fail until the underlying bug on HEAD is fixed, which can be treated as separate work.

I also refined the WARNING patch in v15. It now actually issues WARNINGs (rather than PANICs, which were just a temporary debugging measure in v14). Also fixed a compiler warning in this patch, based on a complaint from CFBot's CompilerWarnings task. I can delay committing this WARNING patch until right before feature freeze. Seems best to give others more opportunity for comments.
--
Peter Geoghegan
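For reference, the check that the WARNING patch adds to vac_update_relstats() presumably has roughly this shape (a reconstruction, not the actual patch text; the message wording is approximate):

    /* the existing relfrozenxid should never appear to be in the future */
    if (TransactionIdIsNormal(frozenxid) &&
        TransactionIdPrecedes(ReadNextTransactionId(), pgcform->relfrozenxid))
        ereport(WARNING,
                (errmsg_internal("overwriting invalid relfrozenxid %u of relation \"%s\" with new value %u",
                                 pgcform->relfrozenxid,
                                 RelationGetRelationName(relation),
                                 frozenxid)));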
Hi, On 2022-04-01 10:54:14 -0700, Peter Geoghegan wrote: > On Thu, Mar 31, 2022 at 11:19 AM Peter Geoghegan <pg@bowt.ie> wrote: > > The assert is "Assert(diff > 0)", and not "Assert(diff >= 0)". > > Attached is v15. I plan to commit the first two patches (the most > substantial two patches by far) in the next couple of days, barring > objections. Just saw that you committed: Wee! I think this will be a substantial improvement for our users. While I was writing the above I, again, realized that it'd be awfully nice to have some accumulated stats about (auto-)vacuum's effectiveness. For us to get feedback about improvements more easily and for users to know what aspects they need to tune. Knowing how many times a table was vacuumed doesn't really tell that much, and requiring to enable log_autovacuum_min_duration and then aggregating those results is pretty painful (and version dependent). If we just collected something like: - number of heap passes - time spent heap vacuuming - number of index scans - time spent index vacuuming - time spent delaying - percentage of non-yet-removable vs removable tuples it'd start to be a heck of a lot easier to judge how well autovacuum is coping. If we tracked the related pieces above in the index stats (or perhaps additionally there), it'd also make it easier to judge the cost of different indexes. - Andres
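As a strawman, the accumulated counters being described might look something like this (every name here is invented for illustration; this is not part of any existing or proposed pgstat struct):

typedef struct VacuumActivityCounters
{
    int64       heap_vacuum_passes;     /* passes over the heap removing LP_DEAD items */
    int64       index_scans;            /* index vacuuming cycles */
    double      heap_vacuum_ms;         /* time spent vacuuming heap pages */
    double      index_vacuum_ms;        /* time spent vacuuming indexes */
    double      delay_ms;               /* time spent sleeping in cost-based delays */
    int64       tuples_removed;         /* removable dead tuples */
    int64       tuples_not_removable;   /* dead but not-yet-removable tuples */
} VacuumActivityCounters;

The removable vs. not-yet-removable percentage falls out of the last two counters, and dividing the accumulated times by the number of vacuums gives a rough per-run profile.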
On Sun, Apr 3, 2022 at 12:05 PM Andres Freund <andres@anarazel.de> wrote: > Just saw that you committed: Wee! I think this will be a substantial > improvement for our users. I hope so! I think that it's much more useful as the basis for future work than as a standalone thing. Users of Postgres 15 might not notice a huge difference. But it opens up a lot of new directions to take VACUUM in. I would like to get rid of anti-wraparound VACUUMs and aggressive VACUUMs in Postgres 16. This isn't as radical as it sounds. It seems quite possible to find a way for *every* VACUUM to become aggressive progressively and dynamically. We'll still need to have autovacuum.c know about wraparound, but it should just be just another threshold, not fundamentally different to the other thresholds (except that it's still used when autovacuum is nominally disabled). The behavior around autovacuum cancellations is probably still going to be necessary when age(relfrozenxid) gets too high, but it shouldn't be conditioned on what age(relfrozenxid) *used to be*, when the autovacuum started. That could have been a long time ago. It should be based on what's happening *right now*. > While I was writing the above I, again, realized that it'd be awfully nice to > have some accumulated stats about (auto-)vacuum's effectiveness. For us to get > feedback about improvements more easily and for users to know what aspects > they need to tune. Strongly agree. And I'm excited about the potential of the shared memory stats patch to enable more thorough instrumentation, which allows us to improve things with feedback that we just can't get right now. VACUUM is still too complicated -- that makes this kind of analysis much harder, even for experts. You need more continuous behavior to get value from this kind of analysis. There are too many things that might end up mattering, that really shouldn't ever matter. Too much potential for strange illogical discontinuities in performance over time. Having only one type of VACUUM (excluding VACUUM FULL) will be much easier for users to reason about. But I also think that it'll be much easier for us to reason about. For example, better autovacuum scheduling will be made much easier if autovacuum.c can just assume that every VACUUM operation will do the same amount of work. (Another problem with the scheduling is that it uses ANALYZE statistics (sampling) in a way that just doesn't make any sense for something like VACUUM, which is an inherently dynamic and cyclic process.) None of this stuff has to rely on my patch for freezing. We don't necessarily have to make every VACUUM advance relfrozenxid to do all this. The important point is that we definitely shouldn't be putting off *all* freezing of all-visible pages in non-aggressive VACUUMs (or in VACUUMs that are not expected to advance relfrozenxid). Even a very conservative implementation could achieve all this; we need only spread out the burden of freezing all-visible pages over time, across multiple VACUUM operations. Make the behavior continuous. > Knowing how many times a table was vacuumed doesn't really tell that much, and > requiring to enable log_autovacuum_min_duration and then aggregating those > results is pretty painful (and version dependent). Yeah. Ideally we could avoid making the output of log_autovacuum_min_duration into an API, by having a real API instead. The output probably needs to evolve some more. A lot of very basic information wasn't there until recently. 
> If we just collected something like: > - number of heap passes > - time spent heap vacuuming > - number of index scans > - time spent index vacuuming > - time spent delaying You forgot FPIs. > - percentage of non-yet-removable vs removable tuples I think that we should address this directly too. By "taking a snapshot of the visibility map", so we at least don't scan/vacuum heap pages that don't really need it. This is also valuable because it makes slowing down VACUUM (maybe slowing it down a lot) have fewer downsides. At least we'll have "locked in" our scanned_pages, which we can figure out in full before we really scan even one page. > it'd start to be a heck of a lot easier to judge how well autovacuum is > coping. What about the potential of the shared memory stats stuff to totally replace the use of ANALYZE stats in autovacuum.c? Possibly with help from vacuumlazy.c, and the visibility map? I see a lot of potential for exploiting the visibility map more, both within vacuumlazy.c itself, and for autovacuum.c scheduling [1]. I'd probably start with the scheduling stuff, and only then work out how to show users more actionable information. [1] https://postgr.es/m/CAH2-Wzkt9Ey9NNm7q9nSaw5jdBjVsAq3yvb4UT4M93UaJVd_xg@mail.gmail.com -- Peter Geoghegan
On Fri, Apr 1, 2022 at 10:54 AM Peter Geoghegan <pg@bowt.ie> wrote: > I also refined the WARNING patch in v15. It now actually issues > WARNINGs (rather than PANICs, which were just a temporary debugging > measure in v14). Going to commit this remaining patch tomorrow, barring objections. -- Peter Geoghegan
Hi,

On 2022-04-04 19:32:13 -0700, Peter Geoghegan wrote:
> On Fri, Apr 1, 2022 at 10:54 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > I also refined the WARNING patch in v15. It now actually issues
> > WARNINGs (rather than PANICs, which were just a temporary debugging
> > measure in v14).
>
> Going to commit this remaining patch tomorrow, barring objections.

The remaining patch is the warnings in vac_update_relstats(), correct? I guess one could argue they should be LOG rather than WARNING, but I find the project stance on that pretty impractical. So warning's ok with me.

Not sure why you used errmsg_internal()? Otherwise LGTM.

Greetings,

Andres Freund
On Mon, Apr 4, 2022 at 8:18 PM Andres Freund <andres@anarazel.de> wrote:
> The remaining patch is the warnings in vac_update_relstats(), correct? I
> guess one could argue they should be LOG rather than WARNING, but I find the
> project stance on that pretty impractical. So warning's ok with me.

Right. The reason I used WARNINGs was because it matches vaguely related WARNINGs in vac_update_relstats()'s sibling function, vacuum_set_xid_limits().

> Not sure why you used errmsg_internal()?

The usual reason for using errmsg_internal(), I suppose. I tend to do that with corruption-related messages on the grounds that they're usually highly obscure issues that are (by definition) never supposed to happen. The only thing that a user can be expected to do with the information from the message is to report it to -bugs, or find some other similar report.
--
Peter Geoghegan
On Mon, Apr 4, 2022 at 8:25 PM Peter Geoghegan <pg@bowt.ie> wrote: > Right. The reason I used WARNINGs was because it matches vaguely > related WARNINGs in vac_update_relstats()'s sibling function, > vacuum_set_xid_limits(). Okay, pushed the relfrozenxid warning patch. Thanks -- Peter Geoghegan
On 4/3/22 12:05 PM, Andres Freund wrote:
> While I was writing the above I, again, realized that it'd be awfully nice to
> have some accumulated stats about (auto-)vacuum's effectiveness. For us to get
> feedback about improvements more easily and for users to know what aspects
> they need to tune.
>
> Knowing how many times a table was vacuumed doesn't really tell that much, and
> requiring to enable log_autovacuum_min_duration and then aggregating those
> results is pretty painful (and version dependent).
>
> If we just collected something like:
> - number of heap passes
> - time spent heap vacuuming
> - number of index scans
> - time spent index vacuuming
> - time spent delaying

The number of passes would let you know if maintenance_work_mem is too small (or that you should stop killing 187M+ tuples in one go). The timing info would give you an idea of the impact of throttling.

> - percentage of non-yet-removable vs removable tuples

This'd give you an idea how bad your long-running-transaction problem is.

Another metric I think would be useful is the average utilization of your autovac workers. No spare workers means you almost certainly have tables that need vacuuming but have to wait. As a single number, it'd also be much easier for users to understand. I'm no stats expert, but one way to handle that cheaply would be to maintain an exponentially weighted mean of the percentage of autovac workers that are in use at the end of each autovac launcher cycle (though that would probably not work great for people that have extreme values for launcher delay, or constantly muck with launcher_delay).

> it'd start to be a heck of a lot easier to judge how well autovacuum is
> coping.
>
> If we tracked the related pieces above in the index stats (or perhaps
> additionally there), it'd also make it easier to judge the cost of different
> indexes.
>
> - Andres
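A sketch of that weighted mean, updated once per launcher cycle (the names and the smoothing constant are invented for illustration):

#define AV_UTIL_SMOOTHING   0.1     /* weight given to the newest sample */

static double av_worker_utilization = 0.0;  /* smoothed busy fraction, 0.0 - 1.0 */

static void
update_av_worker_utilization(int workers_in_use, int max_workers)
{
    double      sample = (double) workers_in_use / max_workers;

    /* exponentially weighted moving average: recent cycles dominate */
    av_worker_utilization += AV_UTIL_SMOOTHING * (sample - av_worker_utilization);
}

A value that stays pinned near 1.0 would be a fairly clear signal that tables are routinely waiting for a free worker.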
On Thu, Apr 14, 2022 at 4:19 PM Jim Nasby <nasbyj@amazon.com> wrote:
> > - percentage of non-yet-removable vs removable tuples
>
> This'd give you an idea how bad your long-running-transaction problem is.

VACUUM fundamentally works by removing those tuples that are considered dead according to an XID-based cutoff established when the operation begins. And so many very long running VACUUM operations will see dead-but-not-removable tuples even when there are absolutely no long running transactions (nor any other VACUUM operations). The only long running thing involved might be our own long running VACUUM operation.

I would like to reduce the number of non-removable dead tuples encountered by VACUUM by "locking in" heap pages that we'd like to scan up front. This would work by having VACUUM create its own local in-memory copy of the visibility map before it even starts scanning heap pages. That way VACUUM won't end up visiting heap pages just because they were concurrently modified halfway through our VACUUM (by some other transactions). We don't really need to scan these pages at all -- they have dead tuples, but not tuples that are "dead to VACUUM".

The key idea here is to remove a big unnatural downside to slowing VACUUM down. The cutoff would almost work like an MVCC snapshot that describes precisely the work VACUUM needs to do (which pages to scan), established up front. Once that's locked in, the amount of work we're required to do cannot go up as we're doing it (or it'll be less of an issue, at least).

It would also help if VACUUM didn't scan pages that it already knows don't have any dead tuples. The current SKIP_PAGES_THRESHOLD rule could easily be improved. That's almost the same problem.
--
Peter Geoghegan
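To make the visibility map "snapshot" idea concrete, a very rough sketch (the helper name and where it would be called from are assumptions, not anything that exists today):

static uint8 *
vm_snapshot_acquire(Relation rel, BlockNumber rel_pages)
{
    uint8      *vmsnap = (uint8 *) palloc0(rel_pages * sizeof(uint8));
    Buffer      vmbuffer = InvalidBuffer;
    BlockNumber blkno;

    /* copy each heap page's VM bits as of the start of this VACUUM */
    for (blkno = 0; blkno < rel_pages; blkno++)
        vmsnap[blkno] = visibilitymap_get_status(rel, blkno, &vmbuffer);

    if (BufferIsValid(vmbuffer))
        ReleaseBuffer(vmbuffer);

    return vmsnap;
}

/* later, when deciding whether a given block needs to be scanned at all */
if ((vmsnap[blkno] & VISIBILITYMAP_ALL_VISIBLE) != 0)
    continue;   /* was all-visible when we started; nothing "dead to VACUUM" */

Concurrent clearing of VM bits after the snapshot is taken would no longer add pages to this VACUUM's workload, which is what "locking in" scanned_pages would mean in practice.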