Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
Date:
Msg-id: CA+TgmoZ84CNpWojX=JYedkaHCfjz-tRuYr0XcXLbkMPd4f0NSw@mail.gmail.com
In response to: Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers
On Tue, Jan 24, 2023 at 3:33 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Sure, it's possible that such a cancellable aggressive autovacuum was
> indeed cancelled, and that that factor made the crucial difference.
> But I find it far easier to believe that there simply was no such
> aggressive autovacuum in the first place (not this time), since it
> could have only happened when autovacuum thinks that there are
> sufficiently many dead tuples to justify launching an autovacuum in
> the first place. Which, as we now all accept, is based on highly
> dubious sampling by ANALYZE. So I think it's much more likely to be
> that factor (dead tuple accounting is bad), as well as the absurd
> false dichotomy between aggressive and non-aggressive -- plus the
> issue at hand, the auto-cancellation behavior.

In my opinion, this is too speculative to justify making changes to the behavior. I'm not here to defend the way we do dead tuple accounting. I think it's a hot mess. But whether or not it played any role in this catastrophe is hard to say.

The problems with dead tuple accounting are, as I understand it, all about statistical independence. That is, we make assumptions that what's true of the sample is likely to be true of the whole table, when in reality it may not be true at all. Perhaps it's even unlikely to be true. But the kinds of problems you get from assuming statistical independence tend to hit users very unevenly. We make similar assumptions about selectivity estimation: unless there are extended statistics, we take P(a=1 and b=1) = P(a=1) * P(b=1), which can be vastly and dramatically wrong. People can and do get extremely bad query plans as a result of that assumption. However, other people run PostgreSQL for years and years and never really have big problems with it. I'd say that it's completely fair to describe this as a big problem, but we can't therefore conclude that some particular user has this problem, not even if we know that they have a slow query and we know that it's due to a bad plan.

And similarly here, I don't see a particular reason to think that your theory about what happened is more likely than mine. I freely admit that yours could be right, just as you admitted that mine could be right. But I think we really just don't know.

It feels unlikely to me that there was ONE cancellable aggressive autovacuum and that it got cancelled. I think that it's probably either ZERO or LOTS, depending on whether the dead tuple threshold was ever reached. If it wasn't, then it must have been zero. But if it was, and the first autovacuum worker to visit that table got cancelled, then the next one would try again. And autovacuum_naptime=1m, so if we're not out of autovacuum workers, we're going to retry that table every minute. If we do run out of autovacuum workers, which is pretty likely, we'll still launch new workers in that database as often as we can given when other workers exit. If the system is very tight on autovacuum capacity, probably because the cost limit is too low, then you could have a situation where only one try gets made before we hit autovacuum_freeze_max_age. Otherwise, though, a single failed try would probably lead to trying a whole lot more times after that, and you only hit autovacuum_freeze_max_age if all those attempts fail.
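To make that launch condition concrete: autovacuum only considers a table once pgstat's dead-tuple estimate crosses autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples. A minimal sketch of that check, assuming the global GUC values apply (per-table reloptions, and the sampling error baked into n_dead_tup itself, are ignored):

    -- Illustrative sketch only: compare pgstat's dead-tuple estimate
    -- against autovacuum's launch threshold,
    --   vacthresh = autovacuum_vacuum_threshold
    --             + autovacuum_vacuum_scale_factor * reltuples.
    -- Reads the global GUCs; per-table autovacuum reloptions are ignored.
    SELECT s.relname,
           s.n_dead_tup,
           current_setting('autovacuum_vacuum_threshold')::float8
             + current_setting('autovacuum_vacuum_scale_factor')::float8
               * c.reltuples AS vacthresh,
           s.n_dead_tup > current_setting('autovacuum_vacuum_threshold')::float8
             + current_setting('autovacuum_vacuum_scale_factor')::float8
               * c.reltuples AS would_trigger
    FROM pg_stat_user_tables AS s
    JOIN pg_class AS c ON c.oid = s.relid
    ORDER BY s.n_dead_tup DESC;

If n_dead_tup never crosses vacthresh, that's the ZERO case; once it does, the retry-every-naptime dynamic described above takes over.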
At the risk of repeating myself, here's what bugs me. If we suppose that your intuition is right and no aggressive autovacuum happened before autovacuum_freeze_max_age was reached, then what you are proposing will make things better. But if we suppose that my intuition is right and many aggressive autovacuums happened before autovacuum_freeze_max_age was reached, then what you are proposing will make things worse, because if we've been auto-cancelling repeatedly, we're probably going to keep doing so until we shut that behavior off, and we want a vacuum to succeed sooner rather than later. So it doesn't feel obvious to me that we should change anything.

Even if we knew what had happened for certain in this particular case, I don't know how we can possibly know what is typical in similar cases. My personal support experience has been that cases where autovacuum runs a lot but doesn't solve the problem for some reason are a lot more common than cases where it doesn't run when it should have done. That probably accounts for my intuition about what is likely to have happened here. But as my colleagues are often at pains to point out to me, my experiences aren't representative of what happens to PostgreSQL users generally, for all kinds of reasons, and therefore sometimes my intuition is wrong. But since I have nobody else's experiences to use in lieu of my own, I don't know what else I can use to judge anything.
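For the other half of the race discussed above, a similar sketch (again reading the global GUC and ignoring per-table reloptions) shows how far each relation's relfrozenxid has advanced toward autovacuum_freeze_max_age, the point at which an antiwraparound autovacuum is forced regardless of the dead-tuple statistics:

    -- Illustrative sketch only: distance of each heap relation's
    -- pg_class.relfrozenxid from the antiwraparound cutoff. Crossing
    -- autovacuum_freeze_max_age forces the non-cancellable autovacuum
    -- under discussion. Per-table reloptions are ignored.
    SELECT c.oid::regclass AS table_name,
           age(c.relfrozenxid) AS xid_age,
           current_setting('autovacuum_freeze_max_age')::bigint AS freeze_max_age
    FROM pg_class AS c
    WHERE c.relkind IN ('r', 'm', 't')   -- tables, matviews, TOAST tables
    ORDER BY age(c.relfrozenxid) DESC
    LIMIT 10;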
--
Robert Haas
EDB: http://www.enterprisedb.com