Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
Date:
Msg-id: CA+TgmoZ84CNpWojX=JYedkaHCfjz-tRuYr0XcXLbkMPd4f0NSw@mail.gmail.com
In response to: Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers
On Tue, Jan 24, 2023 at 3:33 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Sure, it's possible that such a cancellable aggressive autovacuum was
> indeed cancelled, and that that factor made the crucial difference.
> But I find it far easier to believe that there simply was no such
> aggressive autovacuum in the first place (not this time), since it
> could have only happened when autovacuum thinks that there are
> sufficiently many dead tuples to justify launching an autovacuum in
> the first place. Which, as we now all accept, is based on highly
> dubious sampling by ANALYZE. So I think it's much more likely to be
> that factor (dead tuple accounting is bad), as well as the absurd
> false dichotomy between aggressive and non-aggressive -- plus the
> issue at hand, the auto-cancellation behavior.

In my opinion, this is too speculative to justify making changes to
the behavior. I'm not here to defend the way we do dead tuple
accounting. I think it's a hot mess. But whether or not it played any
role in this catastrophe is hard to say. The problems with dead tuple
accounting are, as I understand it, all about statistical
independence. That is, we make assumptions that what's true of the
sample is likely to be true of the whole table when in reality it may
not be true at all. Perhaps it's even unlikely to be true. But the
kinds of problems you get from assuming statistical independence tend
to hit users very unevenly. We make similar assumptions about
selectivity estimation: unless there are extended statistics, we take
P(a=1 and b=1) = P(a=1) * P(b=1), which can be dramatically
wrong. People can and do get extremely bad query plans as a result of
that assumption. However, other people run PostgreSQL for years and
years and never really have big problems with it. I'd say that it's
completely fair to describe this as a big problem, but we can't
therefore conclude that some particular user has this problem, not
even if we know that they have a slow query and we know that it's due
to a bad plan. And similarly here, I don't see a particular reason to
think that your theory about what happened is more likely than mine. I
freely admit that yours could be right, just as you admitted that mine
could be right. But I think we really just don't know.
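
To make the independence point concrete, here is a contrived sketch
(table, data, and row counts invented purely for illustration) of how
the planner's estimate goes wrong for perfectly correlated columns,
and how extended statistics repair it:

CREATE TABLE t (a int, b int);
INSERT INTO t SELECT i % 10, i % 10 FROM generate_series(1, 100000) i;
ANALYZE t;
-- Assuming independence, the planner estimates 0.1 * 0.1 * 100000,
-- i.e. about 1000 rows, but a = 1 AND b = 1 actually matches 10000.
EXPLAIN SELECT * FROM t WHERE a = 1 AND b = 1;
-- Functional-dependency statistics let it see the correlation:
CREATE STATISTICS t_ab (dependencies) ON a, b FROM t;
ANALYZE t;
EXPLAIN SELECT * FROM t WHERE a = 1 AND b = 1;  -- now close to 10000 rows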

It feels unlikely to me that there was ONE cancellable aggressive
autovacuum and that it got cancelled. I think that it's probably
either ZERO or LOTS, depending on whether the dead tuple threshold was
ever reached. If it wasn't, then it must have been zero. But if it
was, and the first autovacuum worker to visit that table got
cancelled, then the next one would try again. And with
autovacuum_naptime=1m, if we're not out of autovacuum workers, we're
going to retry that table every minute. If we do run out of autovacuum
workers, which is pretty likely, we'll still launch new workers in that
database as often as we can, as existing workers exit. If the system is
very tight on autovacuum capacity, probably
because the cost limit is too low, then you could have a situation
where only one try gets made before we hit autovacuum_freeze_max_age.
Otherwise, though, a single failed try would probably lead to trying a
whole lot more times after that, and you only hit
autovacuum_freeze_max_age if all those attempts fail.
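
For reference, and purely as a diagnostic sketch, these are the knobs
in play here (comments show the stock defaults), plus a quick way to
see how close each table in the current database is to the forced
antiwraparound threshold:

SHOW autovacuum_naptime;            -- 1min: minimum delay between autovacuum runs on a database
SHOW autovacuum_max_workers;        -- 3
SHOW autovacuum_vacuum_cost_limit;  -- -1, i.e. fall back to vacuum_cost_limit (200)
SHOW autovacuum_freeze_max_age;     -- 200000000: past this, an antiwraparound autovacuum is forced

SELECT relname, age(relfrozenxid) AS xid_age
FROM pg_class
WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC
LIMIT 10;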

At the risk of repeating myself, here's what bugs me. If we suppose
that your intuition is right and no aggressive autovacuum happened
before autovacuum_freeze_max_age was reached, then what you are
proposing will make things better. But if we suppose that my intuition
is right and many aggressive autovacuums happened before
autovacuum_freeze_max_age was reached, then what you are proposing
will make things worse, because if we've been auto-cancelling
repeatedly, we're probably going to keep doing so until we shut that
behavior off, and we want a vacuum to succeed sooner rather than
later. So it doesn't feel obvious to me that we should change
anything. Even if we knew what had happened for certain in this
particular case, I don't know how we can possibly know what is typical
in similar cases.

My personal support experience has been that cases where autovacuum
runs a lot but doesn't solve the problem for some reason are a lot
more common than cases where it doesn't run when it should have done.
That probably accounts for my intuition about what is likely to have
happened here. But as my colleagues are often at pains to point out to
me, my experiences aren't representative of what happens to PostgreSQL
users generally, for all kinds of reasons, and therefore sometimes my
intuition is wrong. Still, since I have nobody else's experiences to use
in lieu of my own, I don't know what else I can use to judge anything.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


