
From Peter Geoghegan
Subject Re: New IndexAM API controlling index vacuum strategies
Msg-id CAH2-WznEkZT6mFSphn-8KfLhQFK+xEpV9a0mhLkvfvGbf2+t4g@mail.gmail.com
In response to Re: New IndexAM API controlling index vacuum strategies  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Thu, Mar 18, 2021 at 2:05 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Mar 17, 2021 at 11:23 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Most anti-wraparound VACUUMs are really not emergencies, though.
>
> That's true, but it's equally true that most of the time it's not
> necessary to wear a seatbelt to avoid personal injury. The difficulty
> is that it's hard to predict on which occasions it is necessary, and
> therefore it is advisable to do it all the time.

Just to be clear: This was pretty much the point I was making here --
although I guess you're making the broader point about autovacuum and
freezing in general.

The fact that we can *continually* reevaluate if an ongoing VACUUM is
at risk of taking too long is entirely the point here. We can in
principle end index vacuuming dynamically, whenever we feel like it
and for whatever reasons occur to us (hopefully these are good reasons
-- the point is that we get to pick and choose). We can afford to be
pretty aggressive about not giving up, while still retaining the option
of giving up when it *proves* necessary. Because: what are the
chances of the emergency mechanism ending index vacuuming being the
wrong thing to do if we only do that when the system clearly and
measurably has no more than about 10% of the possible XID space to go
before the system becomes unavailable for writes?

What could possibly matter more than that?
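
To make the shape of that check concrete, here's a minimal standalone
sketch (not the patch's code -- the names and the 2 billion/1.8 billion
numbers are purely illustrative) of the kind of threshold test that can
be re-evaluated periodically while a VACUUM is still running:

/*
 * Illustrative only -- not PostgreSQL source.  Trigger the failsafe once
 * no more than ~10% of the usable XID space remains.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_USABLE_XID_AGE  2000000000U    /* ~2 billion XIDs to wraparound */
#define FAILSAFE_XID_AGE    1800000000U    /* ~90% of MAX_USABLE_XID_AGE */

static bool
failsafe_should_trigger(uint32_t relfrozenxid_age)
{
    /* Re-checked during an ongoing VACUUM, not just when it starts */
    return relfrozenxid_age >= FAILSAFE_XID_AGE;
}

int
main(void)
{
    printf("age 1.5 billion -> %d\n", failsafe_should_trigger(1500000000U));
    printf("age 1.9 billion -> %d\n", failsafe_should_trigger(1900000000U));
    return 0;
}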

By making the decision dynamic, the chances of our
threshold/heuristics causing the wrong behavior become negligible --
even though we're making the decision based on a tiny amount of
(current, authoritative) information. The only novel risk I can think
of is that somebody comes to rely on the mechanism saving the day,
over and over again, rather than fixing a fixable problem.

> autovacuum decides
> whether an emergency exists, in the first instance, by comparing
> age(relfrozenxid) to autovacuum_freeze_max_age, but that's problematic
> for at least two reasons. First, what matters is not when the vacuum
> starts, but when the vacuum finishes.

To be fair, the vacuum_set_xid_limits() mechanism that you refer to
makes perfect sense. It's just totally insufficient for the reasons
you say.

> A user who has no tables larger
> than 100MB can set autovacuum_freeze_max_age a lot closer to the high
> limit without risk of hitting it than a user who has a 10TB table. The
> time to run vacuum is dependent on both the size of the table and the
> applicable cost delay settings, none of which autovacuum knows
> anything about. It also knows nothing about the XID consumption rate.
> It's relying on the user to set autovacuum_freeze_max_age low enough
> that all the anti-wraparound vacuums will finish before the system
> crashes into a wall.

Literally nobody on earth knows what their XID burn rate is when it
really matters. It might be totally out of control on the one day of
your life when it truly matters (e.g., due to a recent buggy code
deployment, which I've seen up close). That's how emergencies work.

A dynamic approach is not merely preferable. It seems essential. No
top-down plan is going to be smart enough to predict that it'll take a
really long time to get that one super-exclusive lock on relatively
few pages.

> Second, what happens to one table affects what
> happens to other tables. Even if you have perfect knowledge of your
> XID consumption rate and the speed at which vacuum will complete, you
> can't just configure autovacuum_freeze_max_age to allow exactly enough
> time for the vacuum to complete once it hits the threshold, unless you
> have one autovacuum worker per table so that the work for that table
> never has to wait for work on any other tables. And even then, as you
> mention, you have to worry about the possibility that a vacuum was
> already in progress on that table itself. Here again, we rely on the
> user to know empirically how high they can set
> autovacuum_freeze_max_age without cutting it too close.

But the VM is a lot more useful when you effectively eliminate index
vacuuming from the picture. And VACUUM has a pretty good understanding
of how that works. Index vacuuming remains the Achilles' heel, and I
think that avoiding it in some cases has tremendous value. It has
outsized importance now because we've significantly ameliorated the
problems in the heap, by having the visibility map. What other factor
can make VACUUM take 10x longer than usual on occasion?
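
As a rough illustration of that asymmetry (made-up numbers, not
PostgreSQL code): the visibility map lets the heap pass skip all-visible
pages entirely, while index vacuuming has nothing comparable and has to
read every index page:

/* Illustrative standalone sketch -- the page counts are hypothetical */
#include <stdio.h>

int
main(void)
{
    long    heap_pages = 10L * 1000 * 1000;     /* ~80GB heap */
    double  frac_all_visible = 0.95;            /* append-mostly table */
    long    index_pages = 3L * 1000 * 1000;     /* all indexes combined */

    long    heap_reads = (long) (heap_pages * (1.0 - frac_all_visible));

    printf("heap pass reads:  %ld pages (VM skips the rest)\n", heap_reads);
    printf("index pass reads: %ld pages (no VM for indexes)\n", index_pages);
    return 0;
}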

Autovacuum scheduling is essentially a top-down model of the needs of
the system -- and one with a lot of flaws. IMV we can make the model's
simplistic view of reality better by making the reality better (i.e.
simpler, more tolerant of stressors) instead of making the model
better.

> Now, that's not actually a good thing, because most users aren't smart
> enough to do that, and will either leave a gigantic safety margin that
> they don't need, or will leave an inadequate safety margin and take
> the system down. However, it means we need to be very, very careful
> about hard-coded thresholds like 90% of the available XID space. I do
> think that there is a case for triggering emergency extra safety
> measures when things are looking scary. One that I think would help a
> tremendous amount is to start ignoring the vacuum cost delay when
> wraparound danger (and maybe even bloat danger) starts to loom.

We've done a lot to ameliorate that problem in recent releases, simply
by updating the defaults.

> Perhaps skipping index vacuuming is another such measure, though I
> suspect it would help fewer people, because in most of the cases I
> see, the system is throttled to use a tiny percentage of its actual
> hardware capability. If you're running at 1/5 of the speed of which
> the hardware is capable, you can only do better by skipping index
> cleanup if that skips more than 80% of page accesses, which could be
> true but probably isn't.

The proper thing for VACUUM to be throttled on these days is dirtying
pages. Skipping index vacuuming and skipping the second pass over the
heap will both make an enormous difference in many cases, precisely
because they'll avoid dirtying nearly so many pages. Especially in the
really bad cases, which are precisely where we see problems. Think
about how many pages you'll dirty with a UUID-based index with regular
churn from updates. Plus indexes don't have a visibility map. Whereas
an append-mostly pattern is common with the largest tables.
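
Here's a crude standalone model of cost-based throttling (not
PostgreSQL code -- the hit/miss/dirty costs and the page counts are just
assumptions in the rough shape of the vacuum_cost_* defaults) showing
why the dirtied-page term dominates the total sleep time, and therefore
why skipping index vacuuming buys so much under throttling:

#include <stdio.h>

struct page_counts { long hits, misses, dirtied; };

/* Total sleep time (in milliseconds) implied by a set of page accesses */
static double
throttle_ms(struct page_counts c, int cost_hit, int cost_miss,
            int cost_dirty, int cost_limit, double delay_ms)
{
    double total_cost = (double) c.hits * cost_hit +
                        (double) c.misses * cost_miss +
                        (double) c.dirtied * cost_dirty;

    return total_cost / cost_limit * delay_ms;
}

int
main(void)
{
    /* Hypothetical: vacuuming churned indexes dirties far more pages */
    struct page_counts with_indexes = {1000000, 500000, 400000};
    struct page_counts pruning_only = {1000000, 500000, 50000};

    printf("with index vacuuming: %.0f ms sleeping\n",
           throttle_ms(with_indexes, 1, 2, 20, 200, 2.0));
    printf("pruning only:         %.0f ms sleeping\n",
           throttle_ms(pruning_only, 1, 2, 20, 200, 2.0));
    return 0;
}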

Perhaps it doesn't matter, but FWIW I think that you're drastically
underestimating the extent to which index vacuuming is now the
problem, in a certain important sense. I think that skipping index
vacuuming and heap vacuuming (i.e. just doing the bare minimum,
pruning) will in fact reduce the number of page accesses by 80% in
many, many cases. But I suspect it makes an even bigger difference in
the cases where users are most at risk of wraparound-related outages
to begin with. ISTM that you're focusing too much on the everyday
cases, the majority, which are not the cases where everything truly
falls apart. The extremes really matter.

Index vacuuming gets really slow when we're low on
maintenance_work_mem -- horribly slow. Whereas that doesn't matter at
all if you skip indexes. What do you think are the chances that that
was a major factor at the sites that actually had an outage in the
end? My intuition is that eliminating worst-case variability is the
really important thing here. Heap vacuuming just doesn't have that
multiplicative quality. Its costs tend to be proportionate to the
workload, and stable over time.
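
To put a number on that multiplicative quality, here's a standalone
back-of-the-envelope sketch (not PostgreSQL code; it only borrows the
6-byte size of a heap TID) of how the number of full passes over every
index balloons once the dead-TID array stops fitting in
maintenance_work_mem:

#include <stdio.h>

#define TID_BYTES 6     /* sizeof(ItemPointerData) */

static long
index_scan_passes(long dead_tuples, long maintenance_work_mem_bytes)
{
    long    tids_per_pass = maintenance_work_mem_bytes / TID_BYTES;

    return (dead_tuples + tids_per_pass - 1) / tids_per_pass;  /* ceil */
}

int
main(void)
{
    long    dead = 500L * 1000 * 1000;      /* 500 million dead tuples */

    printf("1GB:  %ld full passes over every index\n",
           index_scan_passes(dead, 1024L * 1024 * 1024));
    printf("64MB: %ld full passes over every index\n",
           index_scan_passes(dead, 64L * 1024 * 1024));
    return 0;
}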

> But ... should the thresholds for triggering these kinds of mechanisms
> really be hard-coded with no possibility of being configured in the
> field? What if we find out after the release is shipped that the
> mechanism works better if you make it kick in sooner, or later, or if
> it depends on other things about the system, which I think it almost
> certainly does? Thresholds that can't be changed without a recompile
> are bad news. That's why we have GUCs.

I'm fine with a GUC, though only for the emergency mechanism. The
default really matters, though -- it shouldn't be necessary to tune it
(since we're trying to address a problem that many people don't know
they have until it's too late). I still like 1.8 billion XIDs as the
value -- I propose we make that the default.

> On another note, I cannot say enough bad things about the function
> name two_pass_strategy(). I sincerely hope that you're not planning to
> create a function which is a major point of control for VACUUM whose
> name gives no hint that it has anything to do with vacuum.

You always hate my names for things. But that's fine by me -- I'm
usually not very attached to them. I'm happy to change it to whatever
you prefer.

FWIW, that name was intended to highlight that VACUUMs with indexes
will now always use the two-pass strategy. This is not to be confused
with the one-pass strategy, which is now strictly used on tables with
no indexes -- with the patch, even the INDEX_CLEANUP=off case goes
through the two-pass path when the table has indexes.
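
Here's a standalone sketch (a simplification for discussion, not the
patch itself) of the control flow that the name was trying to describe:

#include <stdbool.h>
#include <stdio.h>

static void
vacuum_rel_sketch(bool has_indexes, bool index_cleanup, bool failsafe_hit)
{
    if (!has_indexes)
    {
        /* One-pass strategy: prune and mark LP_UNUSED in a single heap pass */
        printf("one-pass strategy\n");
        return;
    }

    /* Two-pass strategy: always used when the table has indexes */
    printf("pass 1: prune heap, collect dead TIDs\n");

    if (index_cleanup && !failsafe_hit)
    {
        printf("vacuum every index\n");
        printf("pass 2: mark the collected heap TIDs LP_UNUSED\n");
    }
    else
        printf("skip index vacuuming and the second heap pass\n");
}

int
main(void)
{
    vacuum_rel_sketch(true, true, false);   /* ordinary table with indexes */
    vacuum_rel_sketch(true, false, false);  /* INDEX_CLEANUP = off */
    vacuum_rel_sketch(false, true, false);  /* table with no indexes */
    return 0;
}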

--
Peter Geoghegan


