Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
Date
Msg-id CAH2-WznRj+tUBeqzEWaxpGQHY2QxVcmje9Qm3oAVm0vzvQXfdw@mail.gmail.com
In response to Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Tue, Jan 17, 2023 at 10:02 AM Andres Freund <andres@anarazel.de> wrote:
> Both you and Robert said this, and I have seen it be true, but typically not
> for large high-throughput OLTP databases, where I found increasing
> relfrozenxid to be important. Sure, there's probably some up/down through the
> day / week, but it's likely to be pretty predictable.
>
> I think the problem is that an old relfrozenxid doesn't tell you how much
> outstanding work there is. Perhaps that's what both of you meant...

That's what I meant, yes.

> I think that's not the fault of relfrozenxid as a trigger, but that we simply
> don't keep enough other stats. We should imo at least keep track of:

If you assume that there is chronic undercounting of dead tuples
(which I think is very common), then of course anything that triggers
vacuuming is going to help with that problem -- it might be totally
inadequate, but it can still make the critical difference by not
allowing the system to become completely destabilized. I absolutely
accept that users who rely on that exist, and that those users ought
not to have things get even worse -- I'm pragmatic. But overall, what
we should be doing is fixing the real problem, which is that the dead
tuples accounting is deeply flawed. Actually, it's not just that the
statistics are flat-out wrong; the whole model is flat-out wrong.

The assumptions that work well for optimizer statistics quite simply
do not apply here. Random sampling for this is just wrong, because
we're not dealing with something that follows a distribution that can
be characterized with a sufficiently large sample. With optimizer
statistics, the entire contents of the table is itself a sample taken
from the wider world -- so even very stale statistics can work quite
well (assuming that the schema is well normalized). Whereas the
autovacuum dead tuples stuff is characterized by constant change. I
mean of course it is -- that's the whole point! The central limit
theorem obviously just doesn't work for something like this -- we
cannot generalize from a sample, at all.

I strongly suspect that it still wouldn't be a good model even if the
information were magically always correct. It might actually be worse
in some ways! Most of my arguments against the model are not arguments
against the accuracy of the statistics as such. They're arguments
against the fundamental relevance of the information itself to the
actual problem at hand. We are not interested in information for its
own sake; we're interested in making better decisions about autovacuum
scheduling. The two may have only a very loose relationship.

How many dead heap-only tuples are equivalent to one LP_DEAD item?
What about page-level concentrations, and the implication for
line-pointer bloat? I don't have a good answer to any of these
questions myself. And I have my doubts that there are *any* good
answers. Even these questions are the wrong questions (they're just
less wrong). Fundamentally, we're deciding when the next autovacuum
should run against each table. Presumably it's going to have to happen
at some point, and when it does happen it happens to the table as a
whole. And with a larger table it probably doesn't matter if it
happens a few hours before or after some theoretically optimal time.
Doesn't it boil down to that?

If we taught the system to do the autovacuum work early because it's a
relatively good time for it from a system-level point of view (e.g.
it's going to be less disruptive right now), that would be useful and
easy to justify on its own terms. But it would also tend to make the
system much less vulnerable to undercounting dead tuples, since in
practice there'd be a decent chance of getting to the dead tuples
early enough that no single pass was extremely painful. It's much
easier to understand that the system is quiescent than it is to
understand bloat.

BTW, I think that the insert-driven autovacuum stuff added to 13 (the
autovacuum_vacuum_insert_threshold and
autovacuum_vacuum_insert_scale_factor GUCs) has made the situation
with bloat significantly better. Of course it wasn't really designed
to do that at all, but it still has, kind of by accident, in roughly
the same way that antiwraparound autovacuums help with bloat by
accident. So maybe we should embrace "happy accidents" like that a bit
more. It doesn't necessarily matter if we do the right thing for a
reason that turns out not to have been the best reason. I'm certainly
not opposed to it, despite my complaints about relying on
age(relfrozenxid).

> In pgstats:
> (Various stats)

Overall, what I like about your ideas here is the emphasis on bounding
the worst case, and the emphasis on the picture at the page level over
the tuple level.

I'd like to use the visibility map more for stuff here, too. It is
totally accurate about all-visible/all-frozen pages, so many of my
complaints about statistics don't really apply. Or need not apply, at
least. If 95% of a table's pages are all-frozen in the VM, then it's
pretty unlikely to be the right time to VACUUM the table to clean up
bloat -- this is just about the most reliable information we have
access to.
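
To make that concrete, here's the kind of check I'm imagining -- a
sketch only, where the helper name and the 95% cutoff are made up,
but visibilitymap_count() and RelationGetNumberOfBlocks() are the
existing APIs:

    /*
     * Sketch: treat a table as a poor bloat-cleanup candidate when
     * nearly all of its pages are marked all-frozen in the VM.
     */
    static bool
    table_mostly_frozen(Relation rel)
    {
        BlockNumber all_visible;
        BlockNumber all_frozen;
        BlockNumber rel_pages = RelationGetNumberOfBlocks(rel);

        visibilitymap_count(rel, &all_visible, &all_frozen);

        /* >= 95% all-frozen: vacuuming for bloat is unlikely to pay off */
        return rel_pages > 0 && all_frozen >= rel_pages * 0.95;
    }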

I think that the only way that more stats can help is by allowing us
to avoid doing the completely wrong thing more often. Just avoiding
disaster is a great goal for us here.

> > This sounds like a great argument in favor of suspend-and-resume as a
> > way of handling autocancellation -- no useful work needs to be thrown
> > away for AV to yield for a minute or two.

> Hm, that seems a lot of work. Without having held a lock you don't even know
> whether your old dead items still apply. Of course it'd improve the situation
> substantially, if we could get it.

I don't think it's all that much work, once the visibility map
snapshot infrastructure is there.

Why wouldn't your old dead items still apply? The TIDs must always
reference LP_DEAD stubs. Those can only be set LP_UNUSED by VACUUM,
and presumably VACUUM can only run in a way that either resumes the
suspended VACUUM session, or discards it altogether. So they're not
going to be invalidated during the period that a VACUUM is suspended,
in any way. Even if CREATE INDEX runs against the table while the
VACUUM is suspended, we know that the existing LP_DEAD dead_items
won't have been indexed, so they'll be safe to mark LP_UNUSED in any
case.

What am I leaving out? I can't think of anything. The only minor
caveat is that we'd probably have to discard the progress from any
individual ambulkdelete() call that happened to be running at the time
that VACUUM was interrupted.
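
For concreteness, the suspended state might not need to be much more
than this (all names hypothetical -- nothing like this exists today,
and it assumes the VM snapshot infrastructure):

    typedef struct VacuumSuspendState
    {
        Oid         relid;          /* table the suspended VACUUM was processing */
        BlockNumber next_block;     /* where the heap scan left off */
        int         num_dead_items; /* TIDs of LP_DEAD stubs collected so far */
        ItemPointerData dead_items[FLEXIBLE_ARRAY_MEMBER];
    } VacuumSuspendState;

On resume, the dead_items array is still usable, for exactly the
reasons given above: nothing but VACUUM could have set those items
LP_UNUSED in the meantime.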

> > Yeah, that's pretty bad. Maybe DROP TABLE and TRUNCATE should be
> > special cases? Maybe they should always be able to auto cancel an
> > autovacuum?
>
> Yea, I think so. It's not obvious how to best pass down that knowledge into
> ProcSleep(). It'd have to be in the LOCALLOCK, I think. Looks like the best
> way would be to change LockAcquireExtended() to get a flags argument instead
> of reportMemoryError, and then we could add LOCK_ACQUIRE_INTENT_DROP &
> LOCK_ACQUIRE_INTENT_TRUNCATE or such. Then the same for
> RangeVarGetRelidExtended(). It already "customizes" how to lock based on RVR*
> flags.

It would be tricky, but still relatively straightforward compared to
other things. It is often a TRUNCATE or a DROP TABLE, and we have
nothing to lose and everything to gain by changing the rules for
those.
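
Just to make the shape of your idea concrete, something along these
lines (the flag names are from your sketch upthread; the values and
the rest are illustrative):

    #define LOCK_ACQUIRE_REPORT_MEMORY_ERROR   0x01
    #define LOCK_ACQUIRE_INTENT_DROP           0x02
    #define LOCK_ACQUIRE_INTENT_TRUNCATE       0x04

    extern LockAcquireResult LockAcquireExtended(const LOCKTAG *locktag,
                                                 LOCKMODE lockmode,
                                                 bool sessionLock,
                                                 bool dontWait,
                                                 int flags,
                                                 LOCALLOCK **locallockp);

ProcSleep() could then look at the intent bits remembered in the
LOCALLOCK and allow the waiter to cancel the autovacuum even when it's
running in no-auto-cancel mode.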

> ISTM that some of what you write below would be addressed, at least partially,
> by the stats I proposed above. Particularly keeping some "page granularity"
> instead of "tuple granularity" stats seems helpful.

That definitely could be true, but I think that my main concern is
that we completely rely on randomly sampled statistics (except with
antiwraparound autovacuums, which happen on a schedule that has
problems of its own).

> > It's quite possible to get approximately the desired outcome with an
> > algorithm that is completely wrong -- the way that we sometimes need
> > autovacuum_freeze_max_age to deal with bloat is a great example of
> > that.
>
> Yea. I think this is part of why I like my idea about tracking more observations
> made by the last vacuum - they're quite easy to get right, and they
> self-correct, rather than potentially ending up causing ever-wronger stats.

I definitely think that there is a place for that. It has the huge
advantage of lessening our reliance on random sampling.

> Right. I think it's fundamental that we get a lot better estimates about the
> amount of work needed. Without that we have no chance of finishing autovacuums
> before problems become too big.

I like the emphasis on bounding the work required, so that it can be
spread out, rather than trying to predict dead tuples. Again, we
should focus on avoiding disaster.

> And serialized vacuum state won't help, because that still requires
> vacuum to scan all the !all-visible pages to discover them. Most of which
> won't contain dead tuples in a lot of workloads.

The main advantage of that model is that it decides what to do and
when to do it based on the actual state of the table (or the state in
the recent past). If we see a concentration of LP_DEAD items, then we
can hurry up index vacuuming. If not, maybe we'll take our time.
Again, less reliance on random sampling is a very good thing. More
dynamic decisions, made at runtime and delayed for as long as
possible, just seem much more promising than having better stats that
are randomly sampled.
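
As a sketch of what "hurry up index vacuuming" could look like as a
runtime rule (the function and the threshold here are hypothetical;
vacuumlazy.c already applies a related heuristic in the opposite
direction, skipping index vacuuming when almost no pages have LP_DEAD
items):

    /* Hypothetical: act once LP_DEAD stubs concentrate on >= 2% of pages */
    static bool
    index_vacuuming_needed_soon(BlockNumber rel_pages,
                                BlockNumber lpdead_pages)
    {
        return lpdead_pages >= rel_pages * 0.02;
    }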

> I don't feel like I have a good handle on what could work for 16 and what
> couldn't. Personally I think something like autovacuum_no_auto_cancel_age
> would be an improvement, but I also don't quite feel satisfied with it.

I don't either, but it should be strictly less unsatisfactory.

> Tracking the number of autovac failures seems uncontroversial and quite
> beneficial, even if the rest doesn't make it in. It'd at least let users
> monitor for tables where autovac is likely to swoop in in anti-wraparound
> mode.

I'll see what I can come up with.
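
Probably the simplest form is one more per-table counter, along these
lines (the field name is hypothetical; PgStat_StatTabEntry is the
existing struct in pgstat.h):

    typedef struct PgStat_StatTabEntry
    {
        /* ... existing per-table counters (vacuum_count, etc.) ... */

        /* hypothetical: autovacuums cancelled before they could finish */
        PgStat_Counter autovac_cancel_count;
    } PgStat_StatTabEntry;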

> Perhaps it's worth separately tracking the number of times a backend would have
> liked to cancel autovac, but couldn't due to anti-wrap? If changes to the
> no-auto-cancel behaviour don't make it in, it'd at least allow us to collect
> more data about the prevalence of the problem and in what situations it
> occurs?  Even just adding some logging for that case seems like it'd be an
> improvement.

Hmm, maybe.

> I think with a bit of polish "Add autovacuum trigger instrumentation." ought
> to be quickly mergeable.

Yeah, I'll try to get that part out of the way quickly.

--
Peter Geoghegan


