Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
Msg-id: 20230117180251.tmucqn5xxz7jj6jt@awork3.anarazel.de
In response to: Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation (Peter Geoghegan <pg@bowt.ie>)
Responses: Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
List: pgsql-hackers
Hi,

On 2023-01-16 13:58:21 -0800, Peter Geoghegan wrote:
> On Fri, Jan 13, 2023 at 9:55 PM Andres Freund <andres@anarazel.de> wrote:
> When I express skepticism of very high autovacuum_freeze_max_age
> settings, it's mostly just that I don't think that age(relfrozenxid)
> is at all informative in the way that something that triggers
> autovacuum ought to be. Worst of all, the older relfrozenxid gets, the
> less informative it becomes. I'm sure that it can be safe to use very
> high autovacuum_freeze_max_age values, but it seems like a case of
> using a very complicated and unreliable thing to decide when to
> VACUUM.

Both you and Robert said this, and I have seen it be true, but typically
not for large high-throughput OLTP databases, where I found increasing
relfrozenxid to be important. Sure, there's probably some up/down through
the day / week, but it's likely to be pretty predictable.

I think the problem is that an old relfrozenxid doesn't tell you how much
outstanding work there is. Perhaps that's what both of you meant...

I think that's not the fault of relfrozenxid as a trigger, but that we
simply don't keep enough other stats. We should imo at least keep track of
the following (a rough sketch follows below the list):

In pg_class:

- The number of all-frozen pages, like we do for all-visible ones

  That'd give us a decent upper bound for the amount of work we need to do
  to increase relfrozenxid. It's important that this is crash safe (thus
  not pgstats), and it only needs to be written when we'd likely make
  other changes to the pg_class row anyway.

In pgstats:

- The number of dead items, incremented both by the heap scan and by
  opportunistic pruning

  This would let us estimate how urgently we need to clean up indexes.

- The xid/mxid horizons during the last vacuum

- The number of pages with tuples that couldn't be removed due to the
  horizon during the last vacuum

  Together with the horizon, this would let us avoid repeated vacuums that
  won't help. Tracking the number of pages instead of tuples allows a lot
  better cost/benefit estimation of another vacuum.

- The number of pages with tuples that couldn't be frozen

  Similar to the dead-tuple one, except that it'd help avoid repeated
  vacuums to increase relfrozenxid when they won't be able to help.
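To make that a bit more concrete, here is roughly what I have in mind. To
be clear, this is just a sketch - none of these fields exist today, and the
struct and field names are made up purely for illustration:

    /* in FormData_pg_class, next to relallvisible (maintained by VACUUM): */
    int32		relallfrozen;	/* # of all-frozen pages; upper bound on
    								 * the pages that need to be scanned to
    								 * advance relfrozenxid */

    /* pgstats, per table (not crash safe, purely advisory): */
    typedef struct PgStat_LastVacuumObservations
    {
    	int64		dead_items;			/* dead item pointers, counted by heap
    									 * scans and by on-access pruning */
    	TransactionId last_vacuum_xid_horizon;	/* OldestXmin used by the
    											 * last vacuum */
    	MultiXactId last_vacuum_mxid_horizon;
    	int64		pages_not_removable;	/* pages with tuples the horizon
    										 * didn't allow us to remove */
    	int64		pages_not_frozen;		/* pages with tuples we could not
    										 * freeze */
    } PgStat_LastVacuumObservations;

With the horizons plus the page counts, autovacuum could e.g. skip
re-vacuuming a table while the current horizon hasn't advanced past
whatever blocked the previous run, and make a much better cost/benefit
estimate once it has.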
> > I've seen a ~2TB table grow to ~20TB due to dead tuples, at which point
> > the server crash-restarted due to WAL ENOSPC. I think in that case there
> > wasn't even DDL, they just needed to make the table readonly, for a brief
> > moment, a few times a day. The problem started once the indexes grew to
> > be large enough that the time for (re-)finding dead tuples + an index
> > scan phase got large enough that they were unlucky to be killed a few
> > times in a row. After that autovac never got through the index scan
> > phase. Not doing any "new" work, just collecting the same dead tids over
> > and over, scanning the indexes for those tids, never quite finishing.
>
> That makes sense to me, though I wonder if there would have been
> another kind of outage (not the one you actually saw) had the
> autocancellation behavior somehow been disabled after the first few
> cancellations.

I suspect so, but it's hard to know.

> This sounds like a great argument in favor of suspend-and-resume as a
> way of handling autocancellation -- no useful work needs to be thrown
> away for AV to yield for a minute or two. One ambition of mine for the
> visibility map snapshot infrastructure was to be able to support
> suspend-and-resume. It wouldn't be that hard for autovacuum to
> serialize everything, and use that to pick up right where an
> autovacuum worker left off when it was autocancelled. Same
> OldestXmin starting point as before, same dead_items array, same
> number of scanned_pages (and pages left to scan).

Hm, that seems a lot of work. Without having held a lock you don't even
know whether your old dead items still apply. Of course it'd improve the
situation substantially, if we could get it.

> > If you know postgres internals, it's easy. You know that autovac stops
> > auto-cancelling after freeze_max_age and you know that the lock queue is
> > ordered. So you issue DROP TABLE / TRUNCATE / whatever and immediately
> > afterwards cancel the autovac worker. But if you don't, it'll take a
> > while to figure out that the DROP TABLE isn't progressing due to a lock
> > conflict, at which point you'll cancel the statement (possibly having
> > wreaked havoc with all other accesses). Then you figure out that you need
> > to cancel autovacuum. After that you try the DROP TABLE again - but the
> > next worker has gotten to work on the table.
>
> Yeah, that's pretty bad. Maybe DROP TABLE and TRUNCATE should be
> special cases? Maybe they should always be able to auto cancel an
> autovacuum?

Yea, I think so. It's not obvious how to best pass down that knowledge
into ProcSleep(). It'd have to be in the LOCALLOCK, I think.

Looks like the best way would be to change LockAcquireExtended() to get a
flags argument instead of reportMemoryError, and then we could add
LOCK_ACQUIRE_INTENT_DROP & LOCK_ACQUIRE_INTENT_TRUNCATE or such. Then the
same for RangeVarGetRelidExtended(). It already "customizes" how to lock
based on RVR* flags.
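For the record, that signature change would look roughly like the below.
The LOCK_ACQUIRE_INTENT_* flag names are invented here and how exactly
ProcSleep() would consult them is hand-waved - a sketch, not a worked-out
patch:

    /* would replace the bare reportMemoryError boolean */
    #define LOCK_ACQUIRE_REPORT_MEMORY_ERROR	0x01
    #define LOCK_ACQUIRE_INTENT_DROP			0x02	/* caller is about to DROP */
    #define LOCK_ACQUIRE_INTENT_TRUNCATE		0x04	/* caller is about to TRUNCATE */

    extern LockAcquireResult LockAcquireExtended(const LOCKTAG *locktag,
    											 LOCKMODE lockmode,
    											 bool sessionLock,
    											 bool dontWait,
    											 int flags,	/* instead of bool
    														 * reportMemoryError */
    											 LOCALLOCK **locallockp);

The flags would get remembered in the waiter's LOCALLOCK, so that the
ProcSleep() / deadlock-check path can see that the blocked statement
intends to drop or truncate the relation, and cancel a conflicting
autovacuum even once it's in the no-auto-cancel regime.
RangeVarGetRelidExtended() would grow matching RVR_* flags so the DDL code
paths can pass the intent down.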
> > > > I have two ideas around 3):
> > > >
> > > > First, we could introduce thresholds for the tuple thresholds, after
> > > > which autovacuum isn't cancelable anymore.
> > >
> > > Do you think that's a good idea? I just don't trust those statistics,
> > > at all. As in, I think they're often complete garbage.
> >
> > I have seen them be reasonably accurate in plenty of busy systems. The
> > most common problem I've seen is that the amount of assumed dead /
> > inserted tuples is way too high, because the server crash-restarted at
> > some point. We now have most of the infrastructure for using a slightly
> > older version of stats after a crash, which'd make them less inaccurate.
>
> Did you mean way too low?

Err, yes.

> > I haven't recently seen crazy over-estimates of the number of dead
> > tuples, at least if you count dead tids as tuples. It might be worth
> > splitting the dead tuple count into dead tuples and dead items.
> >
> > Are you mostly seeing over- or under-estimates?
>
> It's not so much what I've seen. It's that the actual approach has
> lots of problems.
>
> Referring to my notes, here are what seem to me to be serious problems:

ISTM that some of what you write below would be addressed, at least
partially, by the stats I proposed above. Particularly keeping some "page
granularity" instead of "tuple granularity" stats seems helpful.

It'd be great if we could update the page-granularity stats in
heap_{insert,multi_insert,update,delete,heapgetpage}. But without page
level flags like "has any dead tuples" that's probably too expensive.

> * We are very naive about what a dead tuple even means, and we totally
> fail to account for the fact that only the subset of heap pages that
> are PageIsAllVisible() are interesting to VACUUM -- focusing on the
> whole table just seems wrong.
>
> Per https://postgr.es/m/CAH2-Wz=MGFwJEpEjVzXwEjY5yx=UuNPzA6Bt4DSMasrGLUq9YA@mail.gmail.com
>
> * Stub LP_DEAD items in heap pages are extremely different to dead
> heap-only tuples in heap pages, which we ignore.
>
> Per https://postgr.es/m/CAH2-WznrZC-oHkB+QZQS65o+8_Jtj6RXadjh+8EBqjrD1f8FQQ@mail.gmail.com
>
> * Problem where insert-driven autovacuums
> (autovacuum_vacuum_insert_threshold/autovacuum_vacuum_insert_scale_factor
> triggers AVs) become further spaced apart as a consequence of one
> VACUUM operation taking far longer than usual (generally because it's
> an aggressive VACUUM that follows several non-aggressive VACUUMs).
>
> Per https://postgr.es/m/CAH2-Wzn=bZ4wynYB0hBAeF4kGXGoqC=PZVKHeerBU-je9AQF=g@mail.gmail.com

Yea, that's not great. But seems fairly addressable.

> It's quite possible to get approximately the desired outcome with an
> algorithm that is completely wrong -- the way that we sometimes need
> autovacuum_freeze_max_age to deal with bloat is a great example of
> that.

Yea. I think this is part of why I like my idea about tracking more
observations made by the last vacuum - they're quite easy to get right,
and they self-correct, rather than potentially ending up causing
ever-wronger stats.

> Even then, there may still be serious problems that are well
> worth solving.

Right. I think it's fundamental that we get a lot better estimates about
the amount of work needed. Without that we have no chance of finishing
autovacuums before problems become too big.

> > Somehow this made me think of a somewhat crazy, and largely unrelated,
> > idea: Why don't we use the currently unused VM bit combination to
> > indicate pages with dead tids? We could have an [auto]vacuum mode where
> > it scans just pages with the dead tids bit set. Particularly when
> > on-access heap pruning is doing most of the work, that could be quite
> > useful to more cheaply get rid of the dead tids. Obviously we'd also set
> > them when vacuum decides / was told not to do index cleanup. Yes, it'd
> > obviously be less effective than lots of the things we discussed in this
> > area (needing to re-collect the dead tids on the indicated pages), but
> > it'd have the advantage of not needing a lot of new infrastructure.
>
> I wonder if it would be possible to split up the work of VACUUM into
> multiple phases that can be processed independently. The dead_items
> array could be serialized and stored in a temp file. That's not the
> same as some of the more complicated stuff we talked about in the last
> couple of years, such as a dedicated fork for Dead TIDs. It's more
> like an extremely flexible version of the same basic design for
> VACUUM, with the ability to slow down and speed up based on a system
> level view of things (e.g., checkpointing information). And with index
> vacuuming happening on a highly deferred timeline in some cases.
> Possibly we could make each slice of work processed by any available
> autovacuum worker. Autovacuum workers could become "headless".

I don't know. I think the more basic idea I describe has significant
advantages - most importantly being able to target autovacuum on the work
that on-access pruning couldn't deal with. As you know, it's very common
that most row versions are HOT and that the remaining dead tuples also get
removed by on-access pruning. That often leads to running out of usable
items, but autovacuum won't be triggered because there's not all that much
garbage overall.

Without the ability for autovacuum to target such pages aggressively, I
don't think we're going to improve such common workloads a whole lot. And
serialized vacuum state won't help, because that still requires vacuum to
scan all the !all-visible pages to discover them. Most of those won't
contain dead tuples in a lot of workloads.
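To spell out the "unused VM bit combination": the visibility map keeps two
bits per heap page, VISIBILITYMAP_ALL_VISIBLE (0x01) and
VISIBILITYMAP_ALL_FROZEN (0x02), and today ALL_FROZEN is only ever set
together with ALL_VISIBLE. A sketch of reusing the spare state - the
HAS_DEAD_ITEMS name is, again, invented purely for illustration:

    /*
     * Hypothetical: treat the combination that is never written today
     * (ALL_FROZEN set while ALL_VISIBLE is clear) as "this page is known
     * to contain dead item pointers".
     */
    #define VISIBILITYMAP_HAS_DEAD_ITEMS	VISIBILITYMAP_ALL_FROZEN	/* only
    										 * meaningful while ALL_VISIBLE
    										 * is not set */

    /*
     * On-access pruning, and vacuums that skip index cleanup, would set
     * the bit; a targeted [auto]vacuum mode would visit only pages that
     * have it set (re-collecting the dead tids from those pages) rather
     * than every !all-visible page, and clear it once the dead items have
     * been removed.
     */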
> As for this patch of mine: do you think that it would be acceptable to
> pursue a version based on your autovacuum_no_auto_cancel_age design
> for 16? Perhaps this can include something like
> pgstat_report_autovac_failure(). It's not even the work of
> implementing pgstat_report_autovac_failure() that creates risk that
> it'll miss the 16 feature freeze deadline. I'm more concerned that
> introducing a more complicated design will lead to the patch being
> bikeshedded to death.

I don't feel like I have a good handle on what could work for 16 and what
couldn't. Personally I think something like autovacuum_no_auto_cancel_age
would be an improvement, but I also don't quite feel satisfied with it.

Tracking the number of autovac failures seems uncontroversial and quite
beneficial, even if the rest doesn't make it in. It'd at least let users
monitor for tables where autovac is likely to swoop in, in anti-wraparound
mode.

Perhaps it's worth separately tracking the number of times a backend would
have liked to cancel autovac, but couldn't due to anti-wrap? If changes to
the no-auto-cancel behaviour don't make it in, it'd at least allow us to
collect more data about the prevalence of the problem and the situations
in which it occurs. Even just adding some logging for that case seems like
it'd be an improvement.

I think with a bit of polish "Add autovacuum trigger instrumentation."
ought to be quickly mergeable.

Greetings,

Andres Freund