Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation - Mailing list pgsql-hackers
From: Peter Geoghegan
Subject: Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
Msg-id: CAH2-WznRj+tUBeqzEWaxpGQHY2QxVcmje9Qm3oAVm0vzvQXfdw@mail.gmail.com
In response to: Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On Tue, Jan 17, 2023 at 10:02 AM Andres Freund <andres@anarazel.de> wrote:
> Both you and Robert said this, and I have seen it be true, but typically
> not for large high-throughput OLTP databases, where I found increasing
> relfrozenxid to be important. Sure, there's probably some up/down through
> the day / week, but it's likely to be pretty predictable.
>
> I think the problem is that an old relfrozenxid doesn't tell you how much
> outstanding work there is. Perhaps that's what both of you meant...

That's what I meant, yes.

> I think that's not the fault of relfrozenxid as a trigger, but that we
> simply don't keep enough other stats. We should imo at least keep track of:

If you assume that there is chronic undercounting of dead tuples (which I think is very common), then of course anything that triggers vacuuming is going to help with that problem -- it might be totally inadequate, but still make the critical difference by not allowing the system to become completely destabilized. I absolutely accept that users who rely on that exist, and that those users ought not to have things get even worse -- I'm pragmatic. But overall, what we should be doing is fixing the real problem, which is that the dead tuples accounting is deeply flawed.

Actually, it's not just that the statistics are flat-out wrong; the whole model is flat-out wrong. The assumptions that work well for optimizer statistics simply do not apply here. Random sampling is just wrong for this, because we're not dealing with something that follows a distribution that can be characterized with a sufficiently large sample. With optimizer statistics, the entire contents of the table are themselves a sample taken from the wider world -- so even very stale statistics can work quite well (assuming that the schema is well normalized). The autovacuum dead tuples stuff, by contrast, is characterized by constant change. Of course it is -- that's the whole point!
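To make the sampling point concrete, here's a toy simulation (mine, not anything resembling PostgreSQL's actual ANALYZE code) of extrapolating a dead-tuple count from a uniform random page sample when the dead tuples are clustered in a small part of the table:

```python
import random

def estimate_dead_tuples(dead_per_page, sample_size, seed):
    """Extrapolate a table-wide dead tuple count from a uniform random
    page sample -- a toy stand-in for sampling-based stats."""
    rng = random.Random(seed)
    sampled = rng.sample(range(len(dead_per_page)), sample_size)
    dead_in_sample = sum(dead_per_page[i] for i in sampled)
    return dead_in_sample * len(dead_per_page) // sample_size

# 10,000-page table; all 5,000 dead tuples sit in the first 100 pages
# (say, a recently updated "hot" range) -- the rest of the table is clean.
pages = [50] * 100 + [0] * 9900
actual = sum(pages)  # 5000

# Each 100-page sample catches 0, 1, or occasionally more hot pages, so
# the extrapolated estimate is always a multiple of 5000: it lurches
# between zero and gross overestimates while the table itself is unchanged.
estimates = [estimate_dead_tuples(pages, 100, seed) for seed in range(20)]
```

Because the 1%-of-pages hot spot is often missed entirely by a small sample, no amount of averaging over such samples characterizes a table whose state is dominated by localized, constantly changing concentrations.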
The central limit theorem just doesn't work for something like this -- we cannot generalize from a sample at all. I strongly suspect that it still wouldn't be a good model even if the information were magically always correct. It might actually be worse in some ways! Most of my arguments against the model are not arguments against the accuracy of the statistics as such. They're arguments against the fundamental relevance of the information itself to the actual problem at hand. We are not interested in information for its own sake; we're interested in making better decisions about autovacuum scheduling. The two may only have a very loose relationship.

How many dead heap-only tuples are equivalent to one LP_DEAD item? What about page-level concentrations, and the implications for line-pointer bloat? I don't have a good answer to any of these questions myself, and I have my doubts that there are *any* good answers. Even these questions are the wrong questions (they're just less wrong). Fundamentally, we're deciding when the next autovacuum should run against each table. Presumably it's going to have to happen some time, and when it does happen it happens to the table as a whole. With a larger table it probably doesn't matter if it happens +/- a few hours from some theoretical optimal time. Doesn't it boil down to that?

If we taught the system to do the autovacuum work early because it's a relatively good time for it from a system-level point of view (e.g. it's going to be less disruptive right now), that would be useful and easy to justify on its own terms. But it would also tend to make the system much less vulnerable to undercounting dead tuples, since in practice there'd be a decent chance of getting to them early enough that it at least wasn't extremely painful any one time. It's much easier to understand that the system is quiescent than it is to understand bloat.
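A scheduling policy along those lines could be as simple as the following sketch (every name and threshold here is made up for illustration; nothing like this exists in PostgreSQL today):

```python
def should_vacuum_early(dead_frac, trigger_frac, system_busy_frac,
                        quiet_threshold=0.25, eagerness=0.5):
    """Toy policy: if the system looks relatively quiet, be willing to
    vacuum once a table is partway (eagerness) toward the fraction of
    dead tuples that would normally trigger an autovacuum.

    dead_frac        -- believed dead-tuple fraction for the table
    trigger_frac     -- fraction that fires a normal autovacuum
    system_busy_frac -- crude 0..1 measure of current system load
    """
    if dead_frac >= trigger_frac:
        return True  # the normal trigger fires regardless of load
    if system_busy_frac <= quiet_threshold:
        # Quiet period: accept tables that are only partway there.
        return dead_frac >= trigger_frac * eagerness
    return False
```

The point of the sketch is only the shape of the rule: even if `dead_frac` is a chronic undercount, opportunistically vacuuming during quiet periods means the error has fewer chances to compound before the table is visited.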
BTW, I think that the insert-driven autovacuum stuff added to 13 has made the situation with bloat significantly better. Of course it wasn't really designed to do that at all, but it still has, kind of by accident, in roughly the same way that antiwraparound autovacuums help with bloat by accident. So maybe we should embrace "happy accidents" like that a bit more. It doesn't necessarily matter if we do the right thing for a reason that turns out to have not been the best reason. I'm certainly not opposed to it, despite my complaints about relying on age(relfrozenxid).

> In pgstats:
> (Various stats)

Overall, what I like about your ideas here is the emphasis on bounding the worst case, and the emphasis on the picture at the page level over the tuple level.

I'd like to use the visibility map more for stuff here, too. It is totally accurate about all-visible/all-frozen pages, so many of my complaints about statistics don't really apply -- or need not apply, at least. If 95% of a table's pages are all-frozen in the VM, then of course it's pretty unlikely to be the right time to VACUUM the table to clean up bloat -- this is just about the most reliable information we have access to. I think that the only way that more stats can help is by allowing us to avoid doing completely the wrong thing more often. Just avoiding disaster is a great goal for us here.

> > This sounds like a great argument in favor of suspend-and-resume as a
> > way of handling autocancellation -- no useful work needs to be thrown
> > away for AV to yield for a minute or two.
>
> Hm, that seems a lot of work. Without having held a lock you don't even
> know whether your old dead items still apply. Of course it'd improve the
> situation substantially, if we could get it.

I don't think it's all that much work, once the visibility map snapshot infrastructure is there. Why wouldn't your old dead items still apply? The TIDs must always reference LP_DEAD stubs.
Those can only be set LP_UNUSED by VACUUM, and presumably VACUUM can only run in a way that either resumes the suspended VACUUM session or discards it altogether. So they're not going to be invalidated in any way during the period that a VACUUM is suspended. Even if CREATE INDEX runs against the table during a suspended VACUUM, we know that the existing LP_DEAD dead_items won't have been indexed, so they'll be safe to mark LP_UNUSED in any case.

What am I leaving out? I can't think of anything. The only minor caveat is that we'd probably have to discard the progress from any individual ambulkdelete() call that happened to be running at the time that VACUUM was interrupted.

> > Yeah, that's pretty bad. Maybe DROP TABLE and TRUNCATE should be
> > special cases? Maybe they should always be able to auto cancel an
> > autovacuum?
>
> Yea, I think so. It's not obvious how to best pass down that knowledge
> into ProcSleep(). It'd have to be in the LOCALLOCK, I think. Looks like
> the best way would be to change LockAcquireExtended() to get a flags
> argument instead of reportMemoryError, and then we could add
> LOCK_ACQUIRE_INTENT_DROP & LOCK_ACQUIRE_INTENT_TRUNCATE or such. Then
> the same for RangeVarGetRelidExtended(). It already "customizes" how to
> lock based on RVR* flags.

It would be tricky, but still relatively straightforward compared to other things. It is often a TRUNCATE or a DROP TABLE, and we have nothing to lose and everything to gain by changing the rules for those.

> ISTM that some of what you write below would be addressed, at least
> partially, by the stats I proposed above. Particularly keeping some "page
> granularity" instead of "tuple granularity" stats seems helpful.

That could well be true, but my main concern is that we completely rely on randomly sampled statistics (except with antiwraparound autovacuums, which happen on a schedule that has problems of its own).
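The invariant behind the suspend-and-resume argument can be modeled in a few lines (a toy model of line pointer state transitions with hypothetical names; real heap pruning and vacuuming are far more involved): only VACUUM performs the LP_DEAD -> LP_UNUSED transition, so concurrent activity during a suspension can add new LP_DEAD stubs but never invalidate the ones already collected.

```python
NORMAL, LP_DEAD, LP_UNUSED = "normal", "dead", "unused"

class ToyHeap:
    """Line pointers only; ignores pages, tuples, and WAL entirely."""
    def __init__(self, n):
        self.lp = [NORMAL] * n

    def prune(self, tid):
        # Opportunistic pruning by any backend: NORMAL -> LP_DEAD stub.
        if self.lp[tid] == NORMAL:
            self.lp[tid] = LP_DEAD

    def vacuum_mark_unused(self, dead_items):
        # Only VACUUM ever performs LP_DEAD -> LP_UNUSED.
        for tid in dead_items:
            assert self.lp[tid] == LP_DEAD  # saved TIDs are still stubs
            self.lp[tid] = LP_UNUSED

heap = ToyHeap(10)
for tid in (1, 3, 5):
    heap.prune(tid)

# VACUUM collects its dead_items, then is suspended here.
dead_items = [tid for tid, s in enumerate(heap.lp) if s == LP_DEAD]

# Concurrent activity while suspended can only create *new* stubs...
heap.prune(7)

# ...so on resume every saved TID still references an LP_DEAD stub,
# and marking them LP_UNUSED remains safe.
heap.vacuum_mark_unused(dead_items)
```

Since no actor other than VACUUM clears a stub, and there is only one (suspended) VACUUM per table, the saved dead_items list cannot go stale during the suspension.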
> > It's quite possible to get approximately the desired outcome with an
> > algorithm that is completely wrong -- the way that we sometimes need
> > autovacuum_freeze_max_age to deal with bloat is a great example of
> > that.
>
> Yea. I think this is part of I like my idea about tracking more
> observations made by the last vacuum - they're quite easy to get right,
> and they self-correct, rather than potentially ending up causing
> ever-wronger stats.

I definitely think that there is a place for that. It has the huge advantage of lessening our reliance on random sampling.

> Right. I think it's fundamental that we get a lot better estimates about
> the amount of work needed. Without that we have no chance of finishing
> autovacuums before problems become too big.

I like the emphasis on bounding the work required, so that it can be spread out, rather than trying to predict dead tuples. Again, we should focus on avoiding disaster.

> And serialized vacuum state won't help, because that still requires
> vacuum to scan all the !all-visible pages to discover them. Most of which
> won't contain dead tuples in a lot of workloads.

The main advantage of that model is that it decides what to do and when to do it based on the actual state of the table (or its state in the recent past). If we see a concentration of LP_DEAD items, then we can hurry up index vacuuming. If not, maybe we'll take our time. Again, less reliance on random sampling is a very good thing. Making more dynamic decisions at runtime, delayed for as long as possible, just seems much more promising than having better stats that are randomly sampled.

> I don't feel like I have a good handle on what could work for 16 and what
> couldn't. Personally I think something like autovacuum_no_auto_cancel_age
> would be an improvement, but I also don't quite feel satisfied with it.

I don't either; but it should be strictly less unsatisfactory.
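A dynamic rule like "hurry up index vacuuming when LP_DEAD items are concentrated" could key off the observed fraction of LP_DEAD-bearing pages rather than any sampled estimate. A sketch, with a made-up cutoff (PostgreSQL 14's index-vacuuming bypass heuristic is in this general spirit, but the details below are illustrative only):

```python
def should_vacuum_indexes_now(lp_dead_pages, scanned_pages, cutoff=0.02):
    """Toy dynamic trigger: start a round of index vacuuming once more
    than `cutoff` of the heap pages scanned so far carry LP_DEAD stubs.
    Driven by what VACUUM actually observed, not by sampled stats."""
    if scanned_pages == 0:
        return False
    return lp_dead_pages / scanned_pages > cutoff
```

The decision is made at runtime, from ground truth VACUUM just gathered, and is delayed for as long as possible: if stubs turn out to be rare, the expensive index scans can simply wait.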
> Tracking the number of autovac failures seems uncontroverial and quite
> beneficial, even if the rest doesn't make it in. It'd at least let users
> monitor for tables where autovac is likely to swoop in in anti-wraparound
> mode.

I'll see what I can come up with.

> Perhaps its worth to separately track the number of times a backend would
> have liked to cancel autovac, but couldn't due to anti-wrap? If changes
> to the no-auto-cancel behaviour don't make it in, it'd at least allow us
> to collect more data about the prevalence of the problem and in what
> situations it occurs? Even just adding some logging for that case seems
> like it'd be an improvement.

Hmm, maybe.

> I think with a bit of polish "Add autovacuum trigger instrumentation."
> ought to be quickly mergeable.

Yeah, I'll try to get that part out of the way quickly.

--
Peter Geoghegan