Thread: Decoupling antiwraparound autovacuum from special rules around auto cancellation
Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
I think that we should decouple the PROC_VACUUM_FOR_WRAPAROUND autocancellation behavior in ProcSleep() from antiwraparound autovacuum itself. In other words I think that it should be possible to cancel an autovacuum that happens to be an antiwraparound autovacuum, just as if it were any other autovacuum -- because it usually is no different in any real practical sense. Or at least it shouldn't be seen as fundamentally different to other autovacuums at first, before relfrozenxid attains an appreciably greater age (definition of "appreciably greater" is TBD). Why should the PROC_VACUUM_FOR_WRAPAROUND behavior happen on *exactly* the same timeline as the one used to launch an antiwraparound autovacuum, though? There is no inherent reason why we have to do both things at exactly the same XID-age-wise time. But there is reason to think that doing so could make matters worse rather than better [1]. More generally I think that it'll be useful to perform "aggressive behaviors" on their own timeline, with no two distinct aggressive behaviors applied at exactly the same time. In general we ought to give a less aggressive approach some room to succeed before escalating to a more aggressive approach -- we should see if a less aggressive approach will work on its own. The failsafe is the most aggressive intervention of all. The PROC_VACUUM_FOR_WRAPAROUND behavior is almost as aggressive, and should happen sooner. Antiwraparound autovacuum itself (which is really a separate thing to PROC_VACUUM_FOR_WRAPAROUND) is less aggressive still. Then you have things like the cutoffs in vacuumlazy.c that control things like freezing. In short, I'm proposing an "escalatory" approach that applies each behavior at a different time. The exact timelines we'd want are of course debatable, but the value of having multiple distinct timelines (one per aggressive behavior) is far less debatable. We should give problems a chance to "resolve themselves", at least up to a point. The latest version of my in-progress VACUUM patch series [2] completely removes the concept of aggressive VACUUM as a discrete mode of operation inside vacuumlazy.c. Every existing "aggressive-ish behavior" will be retained in some form or other, but they'll be applied on separate timelines, in proportion to the problem at hand. For example, we'll have a separate XID cutoff for waiting for a cleanup lock the hard way -- we will no longer use FreezeLimit for that, since that doesn't give freezing a chance to happen in the next VACUUM. The same VACUUM operation that is the first one that is capable of freezing should ideally not *also* be the first one that has to wait for a cleanup lock. We should be willing to put off waiting for a cleanup lock for much longer than we're willing to put off freezing. Reusing the same cutoff just makes life harder. Clearly the idea of decoupling the PROC_VACUUM_FOR_WRAPAROUND behavior from antiwraparound autovacuum is conceptually related to my patch series, but it can be treated as separate work. That's why I'm starting another thread now. There is another idea in that patch series that also seems worth mentioning as relevant (but not essential) to this discussion on this thread: it would be better if antiwraparound autovacuum was simply another way to launch an autovacuum, which isn't fundamentally different to any other.
I believe that users will find this conceptual model a lot easier, especially in a world where antiwraparound autovacuums naturally became rare (which is the world that the big patch series seeks to bring about). It'll make antiwraparound autovacuum "the threshold of last resort", only needed when conventional tuple-based thresholds don't trigger at all for an extended period of time (e.g., for static tables). Perhaps it won't be trivial to fix autovacuum.c in the way I have in mind (which is to split PROC_VACUUM_FOR_WRAPAROUND into two flags that serve two separate purposes). I haven't considered if we're accidentally relying on the coupling to avoid confusion within autovacuum.c. That doesn't seem important right now, though. [1] https://www.tritondatacenter.com/blog/manta-postmortem-7-27-2015 [2] https://postgr.es/m/CAH2-WzkU42GzrsHhL2BiC1QMhaVGmVdb5HR0_qczz0Gu2aSn=A@mail.gmail.com -- Peter Geoghegan
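To put the shared timeline in concrete terms: a rough monitoring query along the following lines (purely illustrative; real scheduling also considers per-table reloptions and MultiXactId age, neither of which is shown here) lists each table's XID age next to the single autovacuum_freeze_max_age threshold that currently drives both the launch of antiwraparound autovacuum and the PROC_VACUUM_FOR_WRAPAROUND no-auto-cancellation behavior:

    SELECT c.oid::regclass AS table_name,
           age(c.relfrozenxid) AS xid_age,
           current_setting('autovacuum_freeze_max_age')::bigint AS freeze_max_age,
           age(c.relfrozenxid) - current_setting('autovacuum_freeze_max_age')::bigint
               AS xids_past_threshold
    FROM pg_class AS c
    WHERE c.relkind = 'r'
    ORDER BY age(c.relfrozenxid) DESC
    LIMIT 20;

A positive xids_past_threshold currently means both things at once, which is exactly the coupling being questioned in this thread.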
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Jeff Davis
Date:
On Wed, 2022-10-19 at 14:58 -0700, Peter Geoghegan wrote: > Why should the PROC_VACUUM_FOR_WRAPAROUND behavior happen on > *exactly* > the same timeline as the one used to launch an antiwraparound > autovacuum, though? The terminology is getting slightly confusing here: by "antiwraparound", you mean that it's not skipping unfrozen pages, and therefore is able to advance relfrozenxid. Whereas the PROC_VACUUM_FOR_WRAPAROUND is the same thing, except done with greater urgency because wraparound is imminent. Right? > There is no inherent reason why we have to do both > things at exactly the same XID-age-wise time. But there is reason to > think that doing so could make matters worse rather than better [1]. Can you explain? Regards, Jeff Davis
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Thu, Oct 20, 2022 at 11:09 AM Jeff Davis <pgsql@j-davis.com> wrote: > The terminology is getting slightly confusing here: by > "antiwraparound", you mean that it's not skipping unfrozen pages, and > therefore is able to advance relfrozenxid. Whereas the > PROC_VACUUM_FOR_WRAPAROUND is the same thing, except done with greater > urgency because wraparound is imminent. Right? Not really. I started this thread to discuss a behavior in autovacuum.c and proc.c (the autocancellation behavior), which is, strictly speaking, not related to the current vacuumlazy.c behavior we call aggressive mode VACUUM. Various hackers have in the past described antiwraparound autovacuum as "implying aggressive", which makes sense; what's the point in doing an antiwraparound autovacuum that can almost never advance relfrozenxid? It is nevertheless true that antiwraparound autovacuum is an independent behavior to aggressive VACUUM. The former is an autovacuum thing, and the latter is a VACUUM thing. That's just how it works, mechanically. If this division seems artificial or pedantic to you, then consider the fact that you can quite easily get a non-aggressive antiwraparound autovacuum by using the storage option called autovacuum_freeze_max_age (instead of the GUC): https://postgr.es/m/CAH2-Wz=DJAokY_GhKJchgpa8k9t_H_OVOvfPEn97jGNr9W=deg@mail.gmail.com This is even a case where we'll output a distinct description in the server log when autovacuum logging is enabled and gets triggered. So while there may be no point in an antiwraparound autovacuum that is non-aggressive, that doesn't stop them from happening. Regardless of whether or not that's an intended behavior, that's just how the mechanism has been constructed. > > There is no inherent reason why we have to do both > > things at exactly the same XID-age-wise time. But there is reason to > > think that doing so could make matters worse rather than better [1]. > > Can you explain? Why should the special autocancellation behavior for antiwraparound autovacuums kick in at exactly the same point that we first launch an antiwraparound autovacuum? Maybe that aggressive intervention will be needed, in the end, but why start there? With my patch series, antiwraparound autovacuums still occur, but they're confined to things like static tables -- things that are pretty much edge cases. They still need to behave sensibly (i.e. reliably advance relfrozenxid based on some principled approach), but now they're more like "an autovacuum that happens because no other condition triggered an autovacuum". To some degree this is already the case, but I'd like to be more deliberate about it. Leaving my patch series aside, I still don't think that it makes sense to make it impossible to auto-cancel antiwraparound autovacuums, across the board, regardless of the underlying table age. We still need something like that, but why not give a still-cancellable autovacuum worker a chance to resolve the problem? Why take a risk of causing much bigger problems (e.g., blocking automated DDL that blocks simple SELECT queries) before the point that that starts to look like the lesser risk (compared to hitting xidStopLimit)? -- Peter Geoghegan
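As a concrete sketch of the storage-option scenario described above (the table name here is made up, and this is illustrative only), the per-table reloption can be set well below the GUC of the same name:

    -- Per-table reloption, not the GUC; table-age autovacuums for this
    -- table are triggered much earlier than the system-wide setting.
    ALTER TABLE some_table SET (autovacuum_freeze_max_age = 100000000);

    -- The GUC itself is unchanged:
    SHOW autovacuum_freeze_max_age;   -- e.g. 200000000

The linked message walks through how an antiwraparound autovacuum triggered this way can end up being non-aggressive.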
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Thu, Oct 20, 2022 at 11:52 AM Peter Geoghegan <pg@bowt.ie> wrote: > Leaving my patch series aside, I still don't think that it makes sense > to make it impossible to auto-cancel antiwraparound autovacuums, > across the board, regardless of the underlying table age. One small thought on the presentation/docs side of this: maybe it would be better to invent a new kind of autovacuum that has the same purpose as antiwraparound autovacuum, but goes by a different name, and doesn't have the special behavior around cancellations. We wouldn't have to change anything about the behavior of antiwraparound autovacuum once we reached the point of needing one. Maybe we wouldn't even need to invent a new user-visible name for this other kind of autovacuum. While even this so-called "new kind of autovacuum" will be rare once my main patch series gets in, it'll still be a totally normal occurrence. Whereas antiwraparound autovacuums are sometimes described as an emergency mechanism. That way we wouldn't be fighting against the widely held perception that antiwraparound autovacuums are scary. In fact that reputation would be fully deserved, for the first time. Right now the idea that antiwraparound autovacuum is kind of an emergency thing has lots of problems, because in practice it's a routine occurrence; this scheme would make reality fit the perception, "fixing" the perception. Antiwraparound autovacuums would become far, far rarer under this scheme, but when they did happen they'd be clear cause for concern. A useful signal for users, who should ideally aim to never see *any* antiwraparound autovacuums. -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Jeff Davis
Date:
On Fri, 2022-10-21 at 17:39 -0700, Peter Geoghegan wrote: > One small thought on the presentation/docs side of this: maybe it > would be better to invent a new kind of autovacuum It's possible this would be easier for users to understand: one process that does cleanup work over time in a way that minimizes interference; and another process that activates in more urgent situations (perhaps due to misconfiguration of the first process). But we should be careful that we don't end up with more confusion. For something like that to work, we'd probably want the second process to not be configurable at all, and we'd want it to be issuing WARNINGs pointing to what might be misconfigured, and otherwise just be invisible. > That way we wouldn't be fighting against the widely held perception > that antiwraparound autovacuums are scary. There's certainly a terminology problem there. Just to brainstorm on some new names, we might want to call it something like "xid reclamation" or "xid horizon advancement". When it starts to run out, we can use words like "wraparound" or "exhaustion". -- Jeff Davis PostgreSQL Contributor Team - AWS
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Sun, Oct 23, 2022 at 9:32 PM Jeff Davis <pgsql@j-davis.com> wrote: > It's possible this would be easier for users to understand: one process > that does cleanup work over time in a way that minimizes interference; > and another process that activates in more urgent situations (perhaps > due to misconfiguration of the first process). I think that the new "early" version of antiwraparound autovacuum (that can still be autocancelled) would simply be called autovacuum. It wouldn't appear as "autovacuum to prevent wraparound" in places like pg_stat_activity. For the most part users wouldn't have to care about the difference between these autovacuums and traditional non-antiwraparound autovacuums. They really would be exactly the same thing, so it would make sense if users typically noticed no difference whatsoever (at least in contexts like pg_stat_activity). > But we should be careful that we don't end up with more confusion. For > something like that to work, we'd probably want the second process to > not be configurable at all, and we'd want it to be issuing WARNINGs > pointing to what might be misconfigured, and otherwise just be > invisible. There should be some simple scheme for determining when an antiwraparound autovacuum (non-cancellable autovacuum to advance relfrozenxid/relminmxid) should run (applied by the autovacuum.c scheduling logic). Something like "table has attained an age that's now 2x autovacuum_freeze_max_age, or 1/2 of vacuum_failsafe_age, whichever is less". The really important thing is giving a regular/early autocancellable autovacuum triggered by age(relfrozenxid) *some* opportunity to run. I strongly suspect that the exact details won't matter too much, provided we manage to launch at least one such autovacuum before escalating to traditional antiwraparound autovacuum (which cannot be autocancelled). Even if regular/early autovacuum had just one opportunity to run to completion, we'd already be much better off. The hazards from blocking automated DDL in a way that leads to a very disruptive traffic jam (like in the Joyent Manta postmortem) would go way down. > > That way we wouldn't be fighting against the widely held perception > > that antiwraparound autovacuums are scary. > > There's certainly a terminology problem there. Just to brainstorm on > some new names, we might want to call it something like "xid > reclamation" or "xid horizon advancement". I think that we should simply call it autovacuum. Under this scheme, antiwraparound autovacuum would be a qualitatively different kind of operation to users (though not to vacuumlazy.c), because it would not be autocancellable in the standard way. And because users should take it as a signal that things aren't really working well (otherwise we wouldn't have reached the point of requiring a scary antiwraparound autovacuum in the first place). Right now antiwraparound autovacuums are both an emergency thing (or at least described as such in one or two areas of the source code), and a completely routine occurrence. This is deeply confusing. Separately, I plan on breaking out insert-triggered autovacuums from traditional dead tuple triggered autovacuums [1], which creates a need to invent some kind of name to differentiate the new table age triggering criteria from both insert-driven and dead tuple driven autovacuums. These are all fundamentally the same operations with the same urgency to users, though. 
We'd only need to describe the *criteria* that *triggered* the autovacuum in our autovacuum log report (actually we'd still report autovacuums as antiwraparound autovacuum in cases where that still happened, which won't be presented as just another triggering criteria in the report). [1] https://www.postgresql.org/message-id/flat/CAH2-WznEqmkmry8feuDK8XdpH37-4anyGF7a04bWXOc1GKd0Yg@mail.gmail.com -- Peter Geoghegan
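To make the suggested escalation point concrete, the "whichever is less" rule can be spelled out against the current settings roughly like this (illustrative only; any real implementation would live in autovacuum.c, not SQL):

    SELECT current_setting('autovacuum_freeze_max_age')::bigint AS table_age_av_launched_at,
           least(2 * current_setting('autovacuum_freeze_max_age')::bigint,
                 current_setting('vacuum_failsafe_age')::bigint / 2)
               AS no_auto_cancel_at;

With the default settings (200 million and 1.6 billion) that works out to launching a still-cancellable table-age autovacuum at 200 million, and only withholding auto-cancellation once table age reaches 400 million.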
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Jeff Davis
Date:
On Mon, 2022-10-24 at 07:25 -0700, Peter Geoghegan wrote: > The really important thing is giving a regular/early autocancellable > autovacuum triggered by age(relfrozenxid) *some* opportunity to run. +1. That principle seems both reasonable from a system standpoint and understandable to a user. > Even if regular/early autovacuum had just one > opportunity to run to completion, we'd already be much better off. By "opportunity", you mean that, regardless of configuration, the cancellable autovacuum would at least start; though it still might be cancelled by DDL. Right? > These are all fundamentally the same operations with the > same urgency to users, though. We'd only need to describe the > *criteria* that *triggered* the autovacuum in our autovacuum log > report Hmm... I'm worried that could be a bit confusing depending on how it's done. Let's be clear that it was merely the triggering criteria and doesn't necessarily represent the work that is being done. There are enough cases that it would be good to start a document and outline the end behavior that your patch series is designed to accomplish. In other words, a before/after of the interesting cases. -- Jeff Davis PostgreSQL Contributor Team - AWS
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Mon, Oct 24, 2022 at 8:43 AM Jeff Davis <pgsql@j-davis.com> wrote: > > Even if regular/early autovacuum had just one > > opportunity to run to completion, we'd already be much better off. > > By "opportunity", you mean that, regardless of configuration, the > cancellable autovacuum would at least start; though it still might be > cancelled by DDL. Right? Yes, exactly. It might be difficult as a practical matter to make sure that we *reliably* give autovacuum.c the opportunity to launch a "standard" autovacuum tasked with advancing relfrozenxid (just after autovacuum_freeze_max_age is first crossed) before the point that a scary antiwraparound autovacuum is launched. So we might end up giving it more XID slack than it's likely to ever need (say by only launching a traditional antiwraparound autovacuum against tables that attain an age that is twice the value of autovacuum_freeze_max_age). These are all just details, though -- the important principle is that we try our utmost to give the less disruptive strategy a chance to succeed before concluding that it has failed, and then "escalating" to a traditional antiwraparound autovacuum. > > These are all fundamentally the same operations with the > > same urgency to users, though. We'd only need to describe the > > *criteria* that *triggered* the autovacuum in our autovacuum log > > report > > Hmm... I'm worried that could be a bit confusing depending on how it's > done. Let's be clear that it was merely the triggering criteria and > doesn't necessarily represent the work that is being done. Maybe it could be broken out into a separate "autovacuum triggered by: " line, seen only in the autovacuum log instrumentation (and not in the similar report output by a manual VACUUM VERBOSE). When we still end up "escalating" to an antiwraparound autovacuum, the "triggered by:" line would match whatever we'd show in the benign, still-cancellable-but-must-advance-relfrozenxid autovacuum case. The difference would be that we'd now be reporting on a different operation entirely (not just a regular autovacuum, a scary antiwraparound autovacuum). (Again, even these distinctions wouldn't be meaningful to vacuumlazy.c itself -- it would just need to handle the details around logging in a way that gave users the right idea. There wouldn't be any special discrete aggressive mode of operation anymore, assuming my big patch set gets into Postgres 16 too.) > There are enough cases that it would be good to start a document and > outline the end behavior that your patch series is designed to > accomplish. In other words, a before/after of the interesting cases. That's on my TODO list. Mostly it's an independent thing to this (antiwraparound) autovacuum stuff, despite the fact that both projects share the same underlying philosophy. -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Mon, Oct 24, 2022 at 9:00 AM Peter Geoghegan <pg@bowt.ie> wrote: > Maybe it could be broken out into a separate "autovacuum triggered by: > " line, seen only in the autovacuum log instrumentation (and not in > the similar report output by a manual VACUUM VERBOSE). When we still > end up "escalating" to an antiwraparound autovacuum, the "triggered > by:" line would match whatever we'd show in the benign, > still-cancellable-but-must-advance-relfrozenxid autovacuum case. The > difference would be that we'd now be reporting on a different > operation entirely (not just a regular autovacuum, a scary > antiwraparound autovacuum). Attached WIP patch invents the idea of a regular autovacuum that is tasked with advancing relfrozenxid -- which is really just another trigger criteria, reported on in the server log in its autovacuum reports. Of course we retain the idea of antiwraparound autovacuums. The only difference is that they are triggered when table age has advanced by twice the usual amount, which is presumably only possible because a regular autovacuum couldn't start or couldn't complete in time (most likely due to continually being auto-cancelled). As I said before, I think that the most important thing is to give regular autovacuuming a chance to succeed. The exact approach taken has a relatively large amount of slack, but that probably isn't needed. So the new antiwraparound cutoffs were chosen because they're easy to understand and remember, which is fairly arbitrary. Adding this to the upcoming CF. -- Peter Geoghegan
Attachment
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Laurenz Albe
Date:
On Fri, 2022-11-25 at 14:47 -0800, Peter Geoghegan wrote: > Attached WIP patch invents the idea of a regular autovacuum that is > tasked with advancing relfrozenxid -- which is really just another > trigger criteria, reported on in the server log in its autovacuum > reports. Of course we retain the idea of antiwraparound autovacuums. > The only difference is that they are triggered when table age has > advanced by twice the usual amount, which is presumably only possible > because a regular autovacuum couldn't start or couldn't complete in > time (most likely due to continually being auto-cancelled). > > As I said before, I think that the most important thing is to give > regular autovacuuming a chance to succeed. The exact approach taken > has a relatively large amount of slack, but that probably isn't > needed. So the new antiwraparound cutoffs were chosen because they're > easy to understand and remember, which is fairly arbitrary. The target is a table that receives no DML at all, right? I think that is a good idea. Wouldn't it make sense to trigger that at *half* "autovacuum_freeze_max_age"? Yours, Laurenz Albe
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Sat, Nov 26, 2022 at 9:58 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote: > The target is a table that receives no DML at all, right? Sort of, but not really. The target is a table that doesn't get vacuumed for any other reason -- I don't make any claims beyond that one. It seems a little too optimistic to suppose that such a table really didn't need any vacuuming to deal with bloat just because autovacuum.c didn't seem to think that it did. > I think that is a good idea. > Wouldn't it make sense to trigger that at *half* "autovacuum_freeze_max_age"? That would be equivalent to what I've done here, provided you also double the autovacuum_freeze_max_age setting. I did it this way because I believe that it has fewer problems. The approach I took makes the general perception that antiwraparound autovacuum are a scary thing (really just needed for emergencies) become true, for the first time. We should expect to see very few antiwraparound autovacuums with the patch, but when we do see even one it'll be after a less aggressive approach was given the opportunity to succeed, but (for whatever reason) failed. Just seeing any antiwraparound autovacuums will become a useful signal of something being amiss in a way that it just isn't at the moment. The docs can be updated to talk about antiwraparound autovacuum as a thing that you should encounter approximately never. This is possible even though the patch isn't invasive at all. -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Laurenz Albe
Date:
On Sat, 2022-11-26 at 11:00 -0800, Peter Geoghegan wrote: > > I think that is a good idea. > > Wouldn't it make sense to trigger that at *half* "autovacuum_freeze_max_age"? > > That would be equivalent to what I've done here, provided you also > double the autovacuum_freeze_max_age setting. That's exactly what I was trying to debate. Wouldn't it make sense to trigger VACUUM earlier so that it has a chance of being less heavy? On the other hand, if there are not sufficiently many modifications on the table to trigger autovacuum, perhaps it doesn't matter in many cases. > I did it this way > because I believe that it has fewer problems. The approach I took > makes the general perception that antiwraparound autovacuum are a > scary thing (really just needed for emergencies) become true, for the > first time. > > We should expect to see very few antiwraparound autovacuums with the > patch, but when we do see even one it'll be after a less aggressive > approach was given the opportunity to succeed, but (for whatever > reason) failed. Is that really so much less aggressive? Will that autovacuum run want to process all pages that are not all-frozen? If not, it probably won't do much good. If yes, it will be just as heavy as an anti-wraparound autovacuum (except that it won't block other sessions). > Just seeing any antiwraparound autovacuums will become > a useful signal of something being amiss in a way that it just isn't > at the moment. The docs can be updated to talk about antiwraparound > autovacuum as a thing that you should encounter approximately never. > This is possible even though the patch isn't invasive at all. True. On the other hand, it might happen that after this, people start worrying about normal autovacuum runs because they occasionally experience a table age autovacuum that is much heavier than the other ones. And they can no longer tell the reason, because it doesn't show up anywhere. Yours, Laurenz Albe
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Sun, Nov 27, 2022 at 8:54 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote: > That's exactly what I was trying to debate. Wouldn't it make sense to > trigger VACUUM earlier so that it has a chance of being less heavy? > On the other hand, if there are not sufficiently many modifications > on the table to trigger autovacuum, perhaps it doesn't matter in many > cases. Maybe. There is a deeper problem here, though: table age is a really terrible proxy for whether or not it's appropriate for VACUUM to freeze preexisting all-visible pages. It's not obvious that half autovacuum_freeze_max_age is much better than autovacuum_freeze_max_age if your concern is avoiding getting too far into debt on freezing. After all, this is debt that must be paid back by freezing some number of physical heap pages, which in general has approximately zero relationship with table age (we need physical units for this, not logical units). This is a long-standing problem that I hope and expect will be fixed in 16, by my ongoing work to completely remove the concept of aggressive mode VACUUM: https://commitfest.postgresql.org/40/3843/ This makes VACUUM care about both table age and the number of unfrozen heap pages (mostly the latter). It weighs everything at the start of each VACUUM, and decides on how it must advance relfrozenxid based on the conditions in the table and the picture over time. Note that performance stability is the main goal; we will not just keep accumulating unfrozen pages for no good reason. All of the behaviors previously associated with aggressive mode are retained, but are individually applied on a timeline that is attuned to the needs of the table (we can still wait for a cleanup lock, but that happens much later than the point that the same page first becomes eligible for freezing, not at exactly the same time). In short, "aggressiveness" becomes a continuous thing, rather than a discrete mode of operation, improving performance stability. We go back to having only one kind of lazy vacuum, which is how things worked prior to the introduction of the visibility map. (We did have antiwraparound autovacuums in 8.3, but we did not have aggressive/scan_all VACUUMs at the time.) > Is that really so much less aggressive? Will that autovacuum run want > to process all pages that are not all-frozen? If not, it probably won't > do much good. If yes, it will be just as heavy as an anti-wraparound > autovacuum (except that it won't block other sessions). Even if we assume that my much bigger patch set won't make it into 16, it'll probably still be a good idea to do this in 16. I admit that I haven't really given that question enough thought to be sure of that, though. Naturally my goal is to get everything in. Hopefully I'll never have to make that call. It is definitely true that this patch is "the autovacuum side" of the work from the other much larger patchset (which handles "the VACUUM side" of things). This antiwraparound patch should probably be considered in that context, even though it's theoretically independent work. It just worked out that way. > True. On the other hand, it might happen that after this, people start > worrying about normal autovacuum runs because they occasionally experience > a table age autovacuum that is much heavier than the other ones. And > they can no longer tell the reason, because it doesn't show up anywhere. But you can tell the reason, just by looking at the autovacuum log reports. The only thing you can't do is see "(to prevent wraparound)" in pg_stat_activity.
That (and the autocancellation behavioral change) are the only differences. The big picture is that users really will have no good reason to care very much about autovacuums that were triggered to advance relfrozenxid (at least in the common case where we haven't needed to make them antiwraparound autovacuums). They could almost (though not quite) now be explained as "an autovacuum that takes place because it's been a while since we did an autovacuum to deal with bloat and/or tuple inserts". That will at least be reasonable if you assume all of the patches get in. -- Peter Geoghegan
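For reference, that pg_stat_activity detail is easy to check directly. A query like the following (illustrative) lists only those autovacuum workers currently running in the non-cancellable antiwraparound mode:

    SELECT pid, query_start, query
    FROM pg_stat_activity
    WHERE backend_type = 'autovacuum worker'
      AND query LIKE '%(to prevent wraparound)%';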
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Fri, Nov 25, 2022 at 2:47 PM Peter Geoghegan <pg@bowt.ie> wrote: > Attached WIP patch invents the idea of a regular autovacuum that is > tasked with advancing relfrozenxid -- which is really just another > trigger criteria, reported on in the server log in its autovacuum > reports. Attached is v2, which is just to fix bitrot. Well, mostly. I did make one functional change in v2: the autovacuum server log reports now separately report on table XID age and table MultiXactId age, each as its own distinct triggering condition. I've heard informal reports that the difference between antiwraparound autovacuums triggered by table XID age versus table MXID age can matter a great deal. It isn't difficult to break out that detail anyway, so even if the distinction isn't interesting all that often we might as well surface it to users. I still haven't made a start on the docs for this. I'm still not sure how much work I should do on the docs in the scope of this project versus my project that deals with related issues in VACUUM itself. The existing material from the "Routine Vacuuming" docs has lots of problems, and figuring out how to approach fixing those problems seems kind of daunting right now. -- Peter Geoghegan
Attachment
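A rough way to look at the two ages side by side, mirroring the way v2 now reports XID age and MultiXactId age as separate triggering conditions (illustrative only; not part of the patch):

    SELECT c.oid::regclass AS table_name,
           age(c.relfrozenxid) AS xid_age,
           mxid_age(c.relminmxid) AS multixact_age,
           current_setting('autovacuum_freeze_max_age')::bigint AS xid_trigger_age,
           current_setting('autovacuum_multixact_freeze_max_age')::bigint AS mxid_trigger_age
    FROM pg_class AS c
    WHERE c.relkind = 'r'
    ORDER BY greatest(age(c.relfrozenxid), mxid_age(c.relminmxid)) DESC
    LIMIT 20;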
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Thu, Dec 29, 2022 at 7:01 PM Peter Geoghegan <pg@bowt.ie> wrote: > Attached is v2, which is just to fix bitrot. Attached is v3. We no longer apply vacuum_failsafe_age when determining the cutoff for antiwraparound autovacuuming -- the new approach is a bit simpler. This is a fairly small change overall. Now any "table age driven" autovacuum will also be antiwraparound when its relfrozenxid/relminmxid attains an age that's either double the relevant setting (either autovacuum_freeze_max_age or effective_multixact_freeze_max_age), or 1 billion XIDs/MXIDs -- whichever is less. That makes it completely impossible to disable antiwraparound protections (the special antiwrap autocancellation behavior) for table-age-driven autovacuums once table age exceeds 1 billion XIDs/MXIDs. It's still possible to increase autovacuum_freeze_max_age to well over a billion, of course. It just won't be possible to do that while also avoiding the no-auto-cancellation behavior for those autovacuums that are triggered due to table age crossing the autovacuum_freeze_max_age/effective_multixact_freeze_max_age threshold. -- Peter Geoghegan
Attachment
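Evaluating v3's rule against the current settings looks roughly like this (illustrative arithmetic only; the patch computes these cutoffs inside autovacuum.c):

    SELECT least(2 * current_setting('autovacuum_freeze_max_age')::bigint,
                 1000000000) AS xid_no_auto_cancel_age,
           least(2 * current_setting('autovacuum_multixact_freeze_max_age')::bigint,
                 1000000000) AS mxid_no_auto_cancel_age;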
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Andres Freund
Date:
Hi, On 2023-01-08 17:49:20 -0800, Peter Geoghegan wrote: > Teach autovacuum.c to launch "table age" autovacuums at the same point > that it previously triggered antiwraparound autovacuums. Antiwraparound > autovacuums are retained, but are only used as a true option of last > resort, when regular autovacuum has presumably tried and failed to > advance relfrozenxid (likely because the auto-cancel behavior kept > cancelling regular autovacuums triggered based on table age). I've also seen the inverse, with recent versions of postgres: Autovacuum can only ever make progress if it's an anti-wraparound vacuum, because it'll always get cancelled otherwise. I'm worried that substantially increasing the time until an anti-wraparound autovacuum happens will lead to more users running into out-of-xid shutdowns. I don't think it's safe to just increase the time at which anti-wrap vacuums happen to a hardcoded 1 billion. I'm also doubtful that it's ok to just make all autovacuums on relations with an age > 1 billion anti-wraparound ones. For people that use a large autovacuum_freeze_max_age that will be a rude awakening. I am all in favor of adding logic to trigger autovacuum based on the table age, without needing to reach autovacuum_freeze_max_age. It never made sense to me that we get to the "emergency mode" in entirely normal operation. But I'm not in favor of just entirely reinterpreting existing GUCs and adding important thresholds as hardcoded numbers. Greetings, Andres Freund
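For context on the pgstats suggestion: the closest thing available today is the cumulative per-table counters, which record how often autovacuum ran but say nothing about why it was triggered or whether it was interrupted. For example (exact columns vary a little by release):

    SELECT relname, last_autovacuum, autovacuum_count, n_dead_tup, n_ins_since_vacuum
    FROM pg_stat_all_tables
    ORDER BY autovacuum_count DESC
    LIMIT 20;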
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Mon, Jan 9, 2023 at 5:22 PM Andres Freund <andres@anarazel.de> wrote: > I've also seen the inverse, with recent versions of postgres: Autovacuum can > only ever make progress if it's an anti-wraparound vacuum, because it'll > always get cancelled otherwise. I'm worried that substantially increasing the > time until an anti-wraparound autovacuum happens will lead to more users > running into out-of-xid shutdowns. > > I don't think it's safe to just increase the time at which anti-wrap vacuums > happen to a hardcoded 1 billion. That's not what the patch does. It doubles the time that the anti-wrap no-autocancellation behaviors kick in, up to a maximum of 1 billion XIDs/MXIDs. So it goes from autovacuum_freeze_max_age to autovacuum_freeze_max_age x 2, without changing the basic fact that we initially launch autovacuums that advance relfrozenxid/relminmxid when the autovacuum_freeze_max_age threshold is first crossed. These heuristics are totally negotiable -- and likely should be thought out in more detail. It's likely that most of the benefit of the patch comes from simply trying to advance relfrozenxid without the special auto-cancellation behavior one single time. The main problem right now is that the threshold that launches most antiwraparound autovacuums is exactly the same as the threshold that activates the auto-cancellation protections. Even doing the latter very slightly later than the former could easily make things much better, while adding essentially no risk of the kind you're concerned about. > I'm also doubtful that it's ok to just make all autovacuums on relations with > an age > 1 billion anti-wraparound ones. For people that use a large > autovacuum_freeze_max_age that will be a rude awakening. Actually, users that have autovacuum_freeze_max_age set to over 1 billion will get exactly the same behavior as before (except that the instrumentation of autovacuum will be better). It'll be identical. If you set autovacuum_freeze_max_age to 2 billion, and a "standard" autovacuum is launched on a table whose relfrozenxid age is 1.5 billion, it'll just be a regular dead tuples/inserted tuples autovacuum, with the same old familiar locking characteristics as today. -- Peter Geoghegan
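A few worked examples of how the two thresholds relate under the patch, for some hypothetical autovacuum_freeze_max_age values (illustrative arithmetic only):

    SELECT setting AS autovacuum_freeze_max_age,
           setting AS table_age_av_launched_at,
           least(2 * setting, 1000000000) AS no_auto_cancel_at
    FROM (VALUES (200000000::bigint), (800000000), (1500000000)) AS v(setting);

For the 1.5 billion row the no-auto-cancel point comes before a table-age autovacuum is even launched, so any such autovacuum is non-cancellable from the start, which is the same behavior those settings get today.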
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Robert Haas
Date:
On Mon, Jan 9, 2023 at 8:40 PM Peter Geoghegan <pg@bowt.ie> wrote: > That's not what the patch does. It doubles the time that the anti-wrap > no-autocancellation behaviors kick in, up to a maximum of 1 billion > XIDs/MXIDs. So it goes from autovacuum_freeze_max_age to > autovacuum_freeze_max_age x 2, without changing the basic fact that we > initially launch autovacuums that advance relfrozenxid/relminmxid when > the autovacuum_freeze_max_age threshold is first crossed. I'm skeptical about this kind of approach. I do agree that it's good to slowly increase the aggressiveness of VACUUM as we get further behind, rather than having big behavior changes all at once, but I think that should happen by smoothly varying various parameters rather than by making discrete behavior changes at a whole bunch of different times. For instance, when VACUUM goes into emergency mode, it stops respecting the vacuum delay. I think that's great, but it happens all at once, and maybe it would be better if it didn't. We could consider gradually ramping the vacuum delay from 100% down to 0% instead of having it happen all at once. Maybe that's not the right idea, I don't know, and a naive implementation might be worse than nothing, but I think it has some chance of being worth consideration. But what the kind of change you're proposing here does is create another threshold where the behavior changes suddenly, and I think that's challenging from the point of view of understanding the behavior of the system. The behavior already changes when you hit vacuum_freeze_min_age and then again when you hit vacuum_freeze_table_age and then there's also autovacuum_freeze_max_age and xidWarnLimit and xidStopLimit and a few others, and these settings all interact in pretty complex ways. The more conditional logic we add to that, the harder it becomes to understand what's actually happening. You see a system where age(relfrozenxid) = 673m and you need a calculator and a spreadsheet to figure out what the vacuum behavior is at that point. Honestly, I think we already have a problem with the behaviors here being too complex for normal human beings to understand them, and I think that the kinds of changes you are proposing here could make that quite a bit worse. Now, you might reply to the above by saying, well, some behaviors can't vary continuously. vacuum_cost_limit can perhaps be phased out gradually, but autocancellation seems like something that you must either do, or not do. I would agree with that. But what I'm saying is that we ought to favor having those kinds of behaviors all engage at the same point rather than at different times. I'm not saying that there can't ever be good reasons to separate out different behaviors and have them engage at different times, but I think we will end up better off if we minimize that sort of thing as much as we reasonably can. In your opening email you write "Why should the PROC_VACUUM_FOR_WRAPAROUND behavior happen on *exactly* the same timeline as the one used to launch an antiwraparound autovacuum, though?" and my answer is "because that's easier to understand and I don't see that it has much of a downside." I did take a look at the post-mortem to which you linked, but I am not quite sure how that bears on the behavior change under discussion. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Thu, Jan 12, 2023 at 9:12 AM Robert Haas <robertmhaas@gmail.com> wrote: > I do agree that it's good to slowly increase the aggressiveness of > VACUUM as we get further behind, rather than having big behavior > changes all at once, but I think that should happen by smoothly > varying various parameters rather than by making discrete behavior > changes at a whole bunch of different times. In general I tend to agree, but, as you go on to acknowledge yourself, this particular behavior is inherently discrete. Either the PROC_VACUUM_FOR_WRAPAROUND behavior is in effect, or it isn't. In many important cases the only kind of autovacuum that ever runs against a certain big table is antiwraparound autovacuum. And therefore every autovacuum that runs against the table must necessarily not be auto cancellable. These are the cases where we see disastrous interactions with automated DDL, such as a TRUNCATE run by a cron job (to stop those annoying antiwraparound autovacuums) -- a heavyweight lock traffic jam that causes the application to lock up. All that I really want to do here is give an autovacuum that *can* be auto cancelled *some* non-zero chance to succeed with these kinds of tables. TRUNCATE completes immediately, so the AEL is no big deal. Except when it's blocked behind an antiwraparound autovacuum. That kind of interaction is occasionally just disastrous. Even just the tiniest bit of wiggle room could avoid it in most cases, possibly even almost all cases. > Maybe that's not the right idea, I don't know, and a naive > implementation might be worse than nothing, but I think it has some > chance of being worth consideration. It's a question of priorities. The failsafe isn't supposed to be used (when it is, it is a kind of failure), and so presumably only kicks in on very rare occasions, where nobody was paying attention anyway. So far I've heard no complaints about this, but I've heard lots of complaints about the antiwrap autocancellation behavior. > The behavior already changes when you hit > vacuum_freeze_min_age and then again when you hit > vacuum_freeze_table_age and then there's also > autovacuum_freeze_max_age and xidWarnLimit and xidStopLimit and a few > others, and these settings all interact in pretty complex ways. The > more conditional logic we add to that, the harder it becomes to > understand what's actually happening. In general I strongly agree. In fact that's a big part of what motivates my ongoing work on VACUUM. The user experience is central. As Andres pointed out, presenting antiwraparound autovacuums as kind of an emergency thing but also somehow a routine thing is just horribly confusing. I want to make them into an emergency thing in every sense -- something that you as a user can reasonably expect to never see (like the failsafe). But if you do see one, then that's a useful signal of an underlying problem with contention, say from automated DDL that pathologically cancels autovacuums again and again. > Now, you might reply to the above by saying, well, some behaviors > can't vary continuously. vacuum_cost_limit can perhaps be phased out > gradually, but autocancellation seems like something that you must > either do, or not do. I would agree with that. But what I'm saying is > that we ought to favor having those kinds of behaviors all engage at > the same point rather than at different times.
Right now aggressive VACUUMs do just about all freezing at the same time, to the extent that many users seem to think that it's a totally different thing with totally different responsibilities to regular VACUUM. Doing everything at the same time like that causes huge practical problems, and is very confusing. I think that users will really appreciate having only one kind of VACUUM/autovacuum (since the other patch gets rid of discrete aggressive mode VACUUMs). I want "table age autovacuuming" (as I propose to call it) to come to be seen as not any different to any other autovacuum, such as an "insert tuples" autovacuum or a "dead tuples" autovacuum. The difference is only in how autovacuum.c triggers the VACUUM, not in any runtime behavior. That's an important goal here. > I did take a look at the post-mortem to which you linked, but I am not > quite sure how that bears on the behavior change under discussion. The post-mortem involved a single "DROP TRIGGER" that caused chaos when it interacted with the auto cancellation behavior. It would usually complete instantly, so the AEL wasn't actually disruptive, but one day antiwraparound autovacuum made the cron job effectively block all reads and writes for hours. The similar outages I was called in to help with personally had either an automated TRUNCATE or an automated CREATE INDEX. Had autovacuum only been willing to yield once or twice, then it probably would have been fine -- the situation probably would have worked itself out naturally. That's the best outcome you can hope for. -- Peter Geoghegan
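When one of these lock-queue pile-ups is suspected, something along these lines (illustrative) shows who is waiting on whom, including any autovacuum worker sitting at the head of the queue:

    SELECT a.pid,
           a.backend_type,
           a.wait_event_type,
           pg_blocking_pids(a.pid) AS blocked_by,
           left(a.query, 60) AS query
    FROM pg_stat_activity AS a
    WHERE cardinality(pg_blocking_pids(a.pid)) > 0
       OR a.backend_type = 'autovacuum worker';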
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Robert Haas
Date:
On Thu, Jan 12, 2023 at 2:22 PM Peter Geoghegan <pg@bowt.ie> wrote: > All that I really want to do here is give an autovacuum that *can* be > auto cancelled *some* non-zero chance to succeed with these kinds of > tables. TRUNCATE completes immediately, so the AEL is no big deal. > Except when it's blocked behind an antiwraparound autovacuum. That > kind of interaction is occasionally just disastrous. Even just the > tiniest bit of wiggle room could avoid it in most cases, possibly even > almost all cases. I doubt it. Wiggle room that's based on the XID threshold being different for one behavior vs. another can easily fail to produce any benefit, because there's no guarantee that the autovacuum launcher will ever try to launch a worker against that table while the XID is in the range where you'd get one behavior and not the other. I've long thought that the fact that vacuum_freeze_table_age is documented as capped at 0.95 * autovacuum_freeze_max_age is silly for just this reason. The interval that you're proposing is much wider so the chances of getting a benefit are greater, but supposing that it's going to solve it in most cases seems like an exercise in unwarranted optimism. In fact, I would guess that it will very rarely solve the problem. Normally, the XID age of a table never reaches autovacuum_freeze_max_age in the first place. If it does, there's some reason. Maybe there's a really old open transaction or an abandoned replication slot or an unresolved 2PC transaction. Maybe the autovacuum system is overloaded and no table is getting visited regularly because the system just can't keep up. Or maybe there are regular AELs being taken on the table at issue. If there's only an AEL taken against a table once in a blue moon, some autovacuum attempt ought to succeed before we reach autovacuum_freeze_max_age. Flipping that around, if we reach autovacuum_freeze_max_age without advancing relfrozenxid, and an AEL shows up behind us in the lock queue, it's really likely that the reason *why* we've reached autovacuum_freeze_max_age is that this same thing has happened to every previous autovacuum attempt and they all cancelled themselves. If we cancel ourselves too, we're just postponing resolution of the problem to some future point when we decide to stop cancelling ourselves. That's not a win. > I think that users will really appreciate having only one kind of > VACUUM/autovacuum (since the other patch gets rid of discrete > aggressive mode VACUUMs). I want "table age autovacuuming" (as I > propose to call it) to come to be seen as not any different to any other > autovacuum, such as an "insert tuples" autovacuum or a "dead tuples" > autovacuum. The difference is only in how autovacuum.c triggers the > VACUUM, not in any runtime behavior. That's an important goal here. I don't agree with that goal. I think that having different kinds of autovacuums with different, identifiable names and corresponding, easily-identifiable behaviors is really important for troubleshooting. Trying to remove those distinctions and make everything look the same will not keep autovacuum from getting itself into trouble. It will just make it harder to understand what's happening when it does. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Thu, Jan 12, 2023 at 1:08 PM Robert Haas <robertmhaas@gmail.com> wrote: > I doubt it. Wiggle room that's based on the XID threshold being > different for one behavior vs. another can easily fail to produce any > benefit, because there's no guarantee that the autovacuum launcher > will ever try to launch a worker against that table while the XID is > in the range where you'd get one behavior and not the other. Of course it's true that in general it might not succeed in forestalling the auto cancellation behavior. You can say something similar about approximately anything like this. For example, there is no absolute guarantee that any autovacuum will ever complete. But we still try! > I've long thought that the fact that vacuum_freeze_table_age is documented > as capped at 0.95 * autovacuum_freeze_max_age is silly for just this > reason. The interval that you're proposing is much wider so the > chances of getting a benefit are greater, but supposing that it's > going to solve it in most cases seems like an exercise in unwarranted > optimism. I don't claim to be dealing in certainties, especially about the final outcome. Whether or not you accept my precise claim is perhaps not important, in the end. What is important is that we give things a chance to succeed, based on the information that we have available, with a constant eye towards avoiding disaster scenarios. Some of the problems with VACUUM seem to be cases where VACUUM takes on a potentially ruinous obligation, that it cannot possibly meet in some rare cases that do come up sometimes -- like the cleanup lock behavior. Is a check for $1000 written by me really worth less than a check written by me for a billion dollars? They're both nominally equivalent guarantees about an outcome, after all, though one has a far greater monetary value. Which would you value more, subjectively? Nothing is guaranteed -- even (and perhaps especially) strong guarantees. > In fact, I would guess that in fact it will very rarely solve the > problem. Normally, the XID age of a table never reaches > autovacuum_freeze_max_age in the first place. If it does, there's some > reason. Probably, but none of this matters at all if the table age never reaches autovacuum_freeze_max_age in the first place. We're only talking about tables where that isn't the case, by definition. Everything else is out of scope here. > Maybe there's a really old open transaction or an abandon > replication slot or an unresolved 2PC transaction. Maybe the > autovacuum system is overloaded and no table is getting visited > regularly because the system just can't keep up. Or maybe there are > regular AELs being taken on the table at issue. Maybe an asteroid hits the datacenter, making all of these considerations irrelevant. But perhaps it won't! > If there's only an AEL > taken against a table once in blue moon, some autovacuum attempt ought > to succeed before we reach autovacuum_freeze_max_age. Flipping that > around, if we reach autovacuum_freeze_max_age without advancing > relfrozenxid, and an AEL shows up behind us in the lock queue, it's > really likely that the reason *why* we've reached > autovacuum_freeze_max_age is that this same thing has happened to > every previous autovacuum attempt and they all cancelled themselves. Why do you assume that a previous autovacuum ever got launched in the first place? There is always going to be a certain kind of table that can only get an autovacuum when its table age crosses autovacuum_freeze_max_age. 
And it's not just static tables -- there is very good reason to have doubts about the statistics that drive autovacuum. Plus vacuum_freeze_table_age works very unreliably (which is why my big VACUUM patch more or less relegates it to a compatibility option, while retaining a more sophisticated notion of table age creating pressure to advance relfrozenxid). Under the scheme from this autovacuum patch, it really does become reasonable to make a working assumption that there was a previous autovacuum, that failed (likely due to the autocancellation behavior, as you said). We must have tried and failed in an earlier autovacuum, once we reach the point of needing an antiwraparound autovacuum (meaning a table age autovacuum which cannot be autocancelled) -- which is not the case today at all. If nothing else, table age autovacuums will have been scheduled much earlier on -- they will have at least started up, barring pathological cases. That's a huge difference in the strength of the signal, compared to today. The super aggressive autocancellation behavior is actually proportionate to the problem at hand. Kind of like how if you go to the doctor and tell them you have a headache, they don't schedule you for emergency brain surgery. What they do is tell you to take an aspirin, and make sure that you stay well hydrated -- if the problem doesn't go away after a few days, then call back, reassess. Perhaps it really will be a brain tumor, but there is nothing to gain and everything to lose by taking such drastic action at the first sign of trouble. > If we cancel ourselves too, we're just postponing resolution of the > problem to some future point when we decide to stop cancelling > ourselves. That's not a win. It's also only a very minor loss, relative to what would have happened without any of this. This is something that we can be relatively sure of (unlike anything about final outcomes). It's clear that we have a lot to gain. What do we have to lose, really? > > I think that users will really appreciate having only one kind of > > VACUUM/autovacuum (since the other patch gets rid of discrete > > aggressive mode VACUUMs). I want "table age autovacuuming" (as I > > propose to call it) come to be seen as not any different to any other > > autovacuum, such as an "insert tuples" autovacuum or a "dead tuples" > > autovacuum. The difference is only in how autovacuum.c triggers the > > VACUUM, not in any runtime behavior. That's an important goal here. > > I don't agree with that goal. I think that having different kinds of > autovacuums with different, identifiable names and corresponding, > easily-identifiable behaviors is really important for troubleshooting. You need to distinguish between different types of autovacuums and different types of VACUUMs here. Sure, it's valuable to have information about why autovacuum launched a VACUUM, and the patch greatly improves that. But runtime behavior is another story. It's not really generic behavior -- more like generic policies that produce different behavior under different runtime conditions. VACUUM has always had generic policies about how to do things, at least up until the introduction of the visibility map, which added scan_all/aggressive VACUUMs, and the vacuum_freeze_table_age GUC. The policy should be the same in every VACUUM, which the behavior itself emerges from. > Trying to remove those distinctions and make everything look the same > will not keep autovacuum from getting itself into trouble. 
> It will just make it harder to understand what's happening when it does. The point isn't to have every VACUUM behave in the same way. The point is to make decisions dynamically, based on the observed conditions in the table. And to delay committing to things until there really is no alternative, to maximize our opportunities to avoid disaster. In short: loose, springy behavior. Imposing absolute obligations on VACUUM has the potential to create lots of problems. It is sometimes necessary, but can easily be overused, making a bad situation much worse. -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Andres Freund
Date:
Hi, On 2023-01-12 16:08:06 -0500, Robert Haas wrote: > Normally, the XID age of a table never reaches autovacuum_freeze_max_age in > the first place. That's not at all my experience. I often see it being the primary reason for autovacuum vacuuming large tables on busy OLTP systems. Even without any longrunning transactions or such, with available autovac workers and without earlier autovacuums getting interrupted by locks. Once a table is large, reasonable scale factors require a lot of changes to accumulate to trigger an autovacuum, and during a vacuum a lot of transactions complete, leading to large tables having a significant age by the time autovac finishes. The most common "bad" reason for reaching autovacuum_freeze_max_age that I see is cost limits not allowing vacuum to complete on time. Perhaps we should track how often autovacuum was triggered by what reason in a relation's pgstats? That'd make it a lot easier to collect data, both for tuning the thresholds on a system, and for hacking on postgres. Tracking the number of times autovacuum was interrupted due to a lock request might be a good idea as well? I think it'd be a good idea to split off the part of the patch that introduces AutoVacType / adds logging for what triggered. That's independently useful, likely uncontroversial and makes the remaining patch smaller. I'd also add the trigger to the pg_stat_activity entry for the autovac worker. Additionally I think we should add information about using failsafe mode to the p_s_a entry. I've wished for a non-wraparound, xid age based, "autovacuum trigger" many times, FWIW. And I've seen plenty of places write their own userspace version of it, because without it they run into trouble. However, I don't like that the patch infers the various thresholds using magic constants / multipliers. autovacuum_freeze_max_age is really a fairly random collection of things:
1) triggers autovacuum on tables based on age, in addition to the dead tuple / inserted tuples triggers
2) prevents auto-cancellation of autovacuum
3) starts autovacuum, even if autovacuum is disabled
IME hitting 1) isn't a reason for concern, it's perfectly normal. Needing 2) to make progress is a bit more concerning. 3) should rarely be needed, but is a good safety mechanism. I doubt that controlling all of them via one GUC is sensible. If I understand the patch correctly, we now have the following age based thresholds for av:
- force-enable autovacuum: oldest_datfrozenxid + autovacuum_freeze_max_age < nextXid
- autovacuum based on age: freeze_max_age = Min(autovacuum_freeze_max_age, table_freeze_max_age); tableagevac = relfrozenxid < recentXid - freeze_max_age
- prevent auto-cancellation: freeze_max_age = Min(autovacuum_freeze_max_age, table_freeze_max_age); prevent_auto_cancel_age = Min(freeze_max_age * 2, 1 billion); prevent_auto_cancel = relfrozenxid < recentXid - prevent_auto_cancel_age
Is that right? One thing I just noticed: Isn't it completely bonkers that we compute recentXid/recentMulti once at the start of a worker in relation_needs_vacanalyze()? That's fine for the calls in do_autovacuum()'s initial loops over all tables. But seems completely wrong for the later calls via table_recheck_autovac() -> recheck_relation_needs_vacanalyze() -> relation_needs_vacanalyze()? These variables really shouldn't be globals. It makes sense to cache them locally in do_autovacuum(), but reusing them in recheck_relation_needs_vacanalyze() and sharing them between launcher and worker is bad. Greetings, Andres Freund
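To make those three age-based triggers easier to follow, here is a minimal C-style sketch of the same logic, written out as one function. This is purely illustrative (my own transliteration of the summary above, not code from the patch; real code would use TransactionIdPrecedes() and friends rather than plain comparisons):

    /* Sketch only: the three age-based triggers summarized above */
    static void
    age_based_triggers(TransactionId relfrozenxid,
                       TransactionId oldest_datfrozenxid,
                       TransactionId recentXid,
                       TransactionId nextXid,
                       int autovacuum_freeze_max_age,
                       int table_freeze_max_age,
                       bool *force_enable_av,
                       bool *tableagevac,
                       bool *prevent_auto_cancel)
    {
        int     freeze_max_age = Min(autovacuum_freeze_max_age,
                                     table_freeze_max_age);
        int     prevent_auto_cancel_age = Min(freeze_max_age * 2,
                                              1000000000);  /* 1 billion */

        /* start autovacuum for the database, even if autovacuum is disabled */
        *force_enable_av = (oldest_datfrozenxid + autovacuum_freeze_max_age < nextXid);
        /* trigger an autovacuum of the table based on its age */
        *tableagevac = (relfrozenxid < recentXid - freeze_max_age);
        /* additionally prevent that autovacuum from being auto-cancelled */
        *prevent_auto_cancel = (relfrozenxid < recentXid - prevent_auto_cancel_age);
    }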
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Fri, Jan 13, 2023 at 2:00 PM Andres Freund <andres@anarazel.de> wrote: > On 2023-01-12 16:08:06 -0500, Robert Haas wrote: > > Normally, the XID age of a table never reaches autovacuum_freeze_max_age in > > the first place. > > That's not at all my experience. I often see it being the primary reason for > autovacuum vacuuming large tables on busy OLTP systems. Even without any > longrunning transactions or such, with available autovac workers and without > earlier autovacuums getting interrupted by locks. Once a table is large, > reasonable scale factors require a lot of changes to accumulate to trigger an > autovacuum, and during a vacuum a lot of transactions complete, leading to > large tables having a significant age by the time autovac finishes. I've definitely seen this. I've also noticed that TPC-C's stock and customer tables (both of which are subject to many HOT updates) only ever receive antiwraparound autovacuums, even with my very aggressive autovacuum settings. Overall, I think that it's quite common among the largest tables, even when things are running normally. > The most common "bad" reason for reaching autovacuum_freeze_max_age that I see > is cost limits not allowing vacuum to complete on time. There are many problems with the statistics driving this whole process, which I won't rehash right now. I actually think that the whole idea of relying on statistical sampling for dead tuples is fundamentally just bogus (that's not how statistics work in general), though I have quite a few less fundamental and more concrete complaints about the statistics just being wrong on their own terms. > Perhaps we should track how often autovacuum was triggered by what reason in > a relation's pgstats? That'd make it a lot easier to collect data, both for > tuning the thresholds on a system, and for hacking on postgres. That would definitely be useful. > Tracking the number of times autovacuum was interrupted due to a lock request > might be a good idea as well? Also useful information worth having. > I think it'd be a good idea to split off the part of the patch that introduces > AutoVacType / adds logging for what triggered. That's independently useful, > likely uncontroversial and makes the remaining patch smaller. I like that idea. Attached revision v4 breaks things up mechanically, along those lines (no real changes to the code itself, though). The controversial parts of the patch are indeed a fairly small proportion of the total changes. > I'd also add the trigger to the pg_stat_activity entry for the autovac > worker. Additionally I think we should add information about using failsafe > mode to the p_s_a entry. I agree that that's all useful, but it seems like it can be treated as later work. > I've wished for a non-wraparound, xid age based, "autovacuum trigger" many > times, FWIW. And I've seen plenty of places write their own userspace version > of it, because without it they run into trouble. However, I don't like that > the patch infers the various thresholds using magic constants / multipliers. As I said, these details are totally negotiable, and likely could be a lot better without much effort. What are your concerns about the thresholds? For example, is it that you can't configure the behavior directly at all? Something else?
> autovacuum_freeze_max_age is really a fairly random collection of things: > 1) triggers autovacuum on tables based on age, in addition to the dead tuple / > inserted tuples triggers > 2) prevents auto-cancellation of autovacuum > 3) starts autovacuum, even if autovacuum is disabled > > IME hitting 1) isn't a reason for concern, it's perfectly normal. Needing 2) > to make progress is a bit more concerning. 3) should rarely be needed, but is > a good safety mechanism. > > I doubt that controlling all of them via one GUC is sensible. I agree, of course, but just to be clear: I don't think it matters that we couple together 1 and 3. In fact it's good that we do that, because the point to the user is that they cannot disable table-age (i.e. what we current call antiwraparound) autovacuums -- that just makes sense. The only problem that I see is that item 2 is tied to the other items from your list. > If I understand the patch correctly, we now have the following age based > thresholds for av: > > - force-enable autovacuum: > oldest_datfrozenxid + autovacuum_freeze_max_age < nextXid > - autovacuum based on age: > freeze_max_age = Min(autovacuum_freeze_max_age, table_freeze_max_age) > tableagevac = relfrozenxid < recentXid - freeze_max_age > - prevent auto-cancellation: > freeze_max_age = Min(autovacuum_freeze_max_age, table_freeze_max_age) > prevent_auto_cancel_age = Min(freeze_max_age * 2, 1 billion) > prevent_auto_cancel = reflrozenxid < recentXid - prevent_auto_cancel_age > > Is that right? That summary looks accurate, but I'm a bit confused about why you're asking the question this way. I thought that it was obvious that the patch doesn't change most of these things. The only mechanism that the patch changes is related to "prevent auto-cancellation" behaviors -- which is now what the term "antiwraparound" refers to. It does change the name of "autovacuum based on age", though -- the name is now "table age autovacuum" (the old name was antiwraparound autovacuum, of course). As I pointed out to you already, it's mechanically impossible for any autovacuum to be antiwraparound unless it's an XID table age/MXID table age autovacuum. The naming convention I propose here makes it a little confusing for us to discuss, but it seems like the best thing for users. Users' basic intuitions about antiwraparound autovacuums (that they're scary things needed because wraparound is starting to become a real concern) don't need to change. If anything they become more accurate, because antiwraparound autovacuums become non-routine -- which is really how it should have been when autovacuum was first added IMV. Users have rather good reasons to find antiwraparound autovacuums scary, even though that's kind of wrong (it's really our fault for making it so confusing for them, not their fault for being wrong). > One thing I just noticed: Isn't it completely bonkers that we compute > recentXid/recentMulti once at the start of a worker in > relation_needs_vacanalyze()? That's fine for the calls in do_autovacuum()'s > initial loops over all tables. But seems completely wrong for the later calls > via table_recheck_autovac() -> recheck_relation_needs_vacanalyze() -> > relation_needs_vacanalyze()? > > These variables really shouldn't be globals. It makes sense to cache them > locally in do_autovacuum(), but reusing them > recheck_relation_needs_vacanalyze() and sharing it between launcher and worker > is bad. I am not sure. I do hope that there isn't some subtle way in which the design relies on that. 
It seems obviously weird, and so I have to wonder if there is a reason behind it that isn't immediately apparent. -- Peter Geoghegan
Attachment
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Andres Freund
Date:
Hi, On 2023-01-13 16:13:45 -0800, Peter Geoghegan wrote: > On Fri, Jan 13, 2023 at 2:00 PM Andres Freund <andres@anarazel.de> wrote: > > I think it'd be a good idea to split off the part of the patch that introduces > > AutoVacType / adds logging for what triggered. That's independently useful, > > likely uncontroversial and makes the remaining patch smaller. > > I like that idea. Cool. > Attached revision v4 breaks things up mechanically, along those lines > (no real changes to the code itself, though). The controversial parts > of the patch are indeed a fairly small proportion of the total > changes. I don't think the split is right. There's too much in 0001 - it's basically introducing the terminology of 0002 already. Could you make it a much more minimal change? > > I'd also add the trigger to the pg_stat_activity entry for the autovac > > worker. Additionally I think we should add information about using failsafe > > mode to the p_s_a entry. > > I agree that that's all useful, but it seems like it can be treated as > later work. IDK, it splits up anti-wraparound vacuums into different sub-kinds but doesn't allow us to distinguish most of them from a plain autovacuum. Seems pretty easy to display the trigger from 0001 in autovac_report_activity()? You'd have to move the AutoVacType -> translated string mapping into a separate function. That seems like a good idea anyway, the current coding makes translators translate several largely identical strings that just differ in one part. > > I've wished for a non-wraparound, xid age based, "autovacuum trigger" many > > times, FWIW. And I've seen plenty of places write their own userspace version > > of it, because without it they run into trouble. However, I don't like that > > the patch infers the various thresholds using magic constants / multipliers. > > As I said, these details are totally negotiable, and likely could be a > lot better without much effort. > > What are your concerns about the thresholds? For example, is it that > you can't configure the behavior directly at all? Something else? The above, but mainly that
> > freeze_max_age = Min(autovacuum_freeze_max_age, table_freeze_max_age)
> > prevent_auto_cancel_age = Min(freeze_max_age * 2, 1 billion)
> > prevent_auto_cancel = relfrozenxid < recentXid - prevent_auto_cancel_age
seems quite confusing / non-principled. What's the logic behind the auto cancel threshold being 2 x freeze_max_age, except that when freeze_max_age is 1 billion, the cutoff is set to 1 billion? That just makes no sense to me. Maybe I'm partially confused by the Min(freeze_max_age * 2, 1 billion). As you pointed out in [1], that doesn't actually lower the threshold for "table age" vacuums, because we don't even get to that portion of the code if we didn't already cross freeze_max_age.
if (freeze_max_age < ANTIWRAPAROUND_MAX_AGE)
    freeze_max_age *= 2;
freeze_max_age = Min(freeze_max_age, ANTIWRAPAROUND_MAX_AGE);
You're lowering a large freeze_max_age to ANTIWRAPAROUND_MAX_AGE - but it only happens after all other checks of freeze_max_age, so it won't influence those. That's confusing code. I think this'd be a lot more readable if you introduced a separate variable for the "no-cancel" threshold, rather than overloading freeze_max_age with different meanings. And you should remove the confusing "lowering" of the cutoff.
Maybe something like
no_cancel_age = freeze_max_age;
if (no_cancel_age < ANTIWRAPAROUND_MAX_AGE)
{
    /* multiply by two, but make sure to not exceed ANTIWRAPAROUND_MAX_AGE */
    no_cancel_age = Min((uint32) ANTIWRAPAROUND_MAX_AGE, (uint32) no_cancel_age * 2);
}
The uint32 bit isn't needed with ANTIWRAPAROUND_MAX_AGE at 1 billion, but at 1.2 billion it would be needed, so it seems better to have it. That still doesn't explain why the cancel_age = freeze_max_age * 2 behaviour should be clamped at 1 billion, though. > > If I understand the patch correctly, we now have the following age based > > thresholds for av: > > > > - force-enable autovacuum: > > oldest_datfrozenxid + autovacuum_freeze_max_age < nextXid > > - autovacuum based on age: > > freeze_max_age = Min(autovacuum_freeze_max_age, table_freeze_max_age) > > tableagevac = relfrozenxid < recentXid - freeze_max_age > > - prevent auto-cancellation: > > freeze_max_age = Min(autovacuum_freeze_max_age, table_freeze_max_age) > > prevent_auto_cancel_age = Min(freeze_max_age * 2, 1 billion) > > prevent_auto_cancel = relfrozenxid < recentXid - prevent_auto_cancel_age > > > > Is that right? > > That summary looks accurate, but I'm a bit confused about why you're > asking the question this way. I thought that it was obvious that the > patch doesn't change most of these things. For me it was helpful to clearly list the triggers when thinking about the issue. I found the diff hard to read and, as noted above, the logic for the auto cancel threshold quite confusing, so ... > The only mechanism that the patch changes is related to "prevent > auto-cancellation" behaviors -- which is now what the term > "antiwraparound" refers to. Not sure that redefining what a long-standing name refers to is helpful. It might be best to retire it and come up with new names. > It does change the name of "autovacuum based on age", though -- the name is > now "table age autovacuum" (the old name was antiwraparound autovacuum, of > course). As I pointed out to you already, it's mechanically impossible for > any autovacuum to be antiwraparound unless it's an XID table age/MXID table > age autovacuum. Thinking about it a bit more, my problems with the current anti-wraparound logic boil down to a few different aspects:
1) It regularly scares the crap out of users, even though it's normal. This is further confounded by failsafe autovacuums, where a scared reaction is appropriate, not being visible in pg_stat_activity. I suspect that learning that "vacuum to prevent wraparound" isn't a problem contributes to people later ignoring "must be vacuumed within ..." WARNINGs, which I've seen plenty of times.
2) It makes it hard to DROP, TRUNCATE, VACUUM FULL or even manually VACUUM tables being anti-wraparound vacuumed, even though those manual actions will often resolve the issue much more quickly.
3) Autovacuums triggered by tuple thresholds persistently getting cancelled also regularly cause outages, and they make it more likely that an eventual age-based vacuum will take forever.
Aspect 1) is addressed to a good degree by the proposed split of anti-wrap into an age and anti-cancel triggers. And could be improved by reporting failsafe autovacuums in pg_stat_activity. Perhaps 2) could be improved a bit by emitting a WARNING message when we didn't cancel AV because it was anti-wraparound? But eventually I think we somehow need to signal the "intent" of the lock drop down into ProcSleep() or wherever it'd be.
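Going back to the no_cancel_age suggestion above: to make it concrete, here is a purely illustrative sketch (my own, with simplified names -- not the patch's code) of how the separate variable could then be used, so that freeze_max_age keeps driving the table age trigger while no_cancel_age independently drives the no-auto-cancel decision:

    /* Sketch only: two separate cutoffs, two separate decisions */
    tableagevac = TransactionIdPrecedes(relfrozenxid,
                                        recentXid - freeze_max_age);
    *wraparound = TransactionIdPrecedes(relfrozenxid,
                                        recentXid - no_cancel_age);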
I have two ideas around 3):
First, we could introduce thresholds for the tuple thresholds, after which autovacuum isn't cancellable anymore.
Second, we could track the number of cancellations since the last [auto]vacuum in pgstat, and only trigger the anti-cancel behaviour when autovacuum has been cancelled a number of times.
Greetings, Andres Freund
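As a purely illustrative sketch of the second idea, with every name hypothetical (pgstat_report_autovac_failure() and the per-table counter do not exist today), the moving parts could look roughly like this:

    /* in the autovacuum worker, when it notices that it has been cancelled */
    pgstat_report_autovac_failure(tableoid);    /* bump per-table cancel count */

    /* later, when deciding whether this autovacuum may be auto-cancelled */
    if (tabentry->autovac_cancel_count >= autovacuum_cancel_threshold)
        *wraparound = true;     /* too many cancellations: stop auto-cancelling */

The counter would presumably be reset by the next [auto]vacuum that completes, so that the escalation only reflects failures since the last successful pass.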
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Fri, Jan 13, 2023 at 6:09 PM Andres Freund <andres@anarazel.de> wrote: > I don't think the split is right. There's too much in 0001 - it's basically > introducing the terminology of 0002 already. Could you make it a much more > minimal change? Okay. I thought that you might say that. I really just wanted to show you how small the code footprint was for the more controversial part. > IDK, it splits up anti-wraparound vacuums into different sub-kinds but doesn't > allow to distinguish most of them from a plain autovacuum. > > Seems pretty easy to display the trigger from 0001 in > autovac_report_activity()? You'd have to move the AutoVacType -> translated > string mapping into a separate function. That seems like a good idea anyway, > the current coding makes translators translate several largely identical > strings that just differ in one part. I'll look into it. It's on my TODO list now. > seems quite confusing / non-principled. What's the logic behind the auto > cancel threshold being 2 x freeze_max_age, except that when freeze_max_age is > 1 billion, the cutoff is set to 1 billion? That just makes no sense to me. Why is the default for autovacuum_freeze_max_age 200 million? I think that it's because 200 million is 10% of 2 billion. And so it is for this cap. 1 billion is 50% of 2 billion. It's just numerology. It doesn't make sense to me either. Of course it's not *completely* arbitrary. Obviously the final auto cancel threshold needs to be at least somewhat greater than freeze_max_age for any of this to make any sense at all. And, we should ideally have a comfortable amount of slack to work with, so that things like moderate (not pathological) autovacuum worker starvation isn't likely to defeat our attempts at avoiding the no-auto-cancel behavior for no good reason. Finally, once we get to a certain table age it starts to seem like a bad idea to not be conservative about auto cancellations. I believe that we agree on most things when it comes to VACUUM, but I'm pretty sure that we still disagree about the value in setting autovacuum_freeze_max_age very high. I continue to believe that setting it over a billion or so is just a bad idea. I'm mentioning this only because it might give you some idea of where I'm coming from -- in general I believe that age-based settings often have only very weak relationships with what actually matters. It might make sense to always give a small fixed amount of headroom when autovacuum_freeze_max_age is set to very high values. Maybe just 5 million XIDs/MXIDs. That would probably be just as effective as (say) 500 million in almost all cases. But then you have to accept another magic number. > I think this'd be a lot more readable if you introduced a separate variable > for the "no-cancel" threshold, rather than overloading freeze_max_age with > different meanings. And you should remove the confusing "lowering" of the > cutoff. Maybe something like > > no_cancel_age = freeze_max_age; > if (no_cancel_age < ANTIWRAPAROUND_MAX_AGE) > { > /* multiply by two, but make sure to not exceed ANTIWRAPAROUND_MAX_AGE */ > no_cancel_age = Min((uint32)ANTIWRAPAROUND_MAX_AGE, (uint32)no_cancel_age * 2); > } I'm happy to do it that way, but let's decide what the algorithm itself should be first. Or let's explore it a bit more, at least. > > The only mechanism that the patch changes is related to "prevent > > auto-cancellation" behaviors -- which is now what the term > > "antiwraparound" refers to. 
> > Not sure that redefining what a long-standing name refers to is helpful. It > might be best to retire it and come up with new names. I generally try to avoid bike-shedding, and naming things is the ultimate source of bike shedding. I dread having to update the docs for this stuff, too. The docs in this area (particularly "Routine Vacuuming") are such a mess already. But perhaps you're right. > 1) It regularly scares the crap out of users, even though it's normal. This > is further confounded by failsafe autovacuums, where a scared reaction is > appropriate, not being visible in pg_stat_activity. The docs actually imply that when the system reaches the point of entering xidStopLimit mode, you might get data corruption. Of course that's not true (the entire point of xidStopLimit is to avoid that), but apparently we like to keep users on their toes. > I suspect that learning that "vacuum to prevent wraparound" isn't a > problem contributes to people later ignoring "must be vacuumed within ..." > WARNINGs, which I've seen plenty of times. That point never occurred to me, but it makes perfect intuitive sense that users would behave that way. This phenomenon is sometimes called alarm fatigue. It can be quite dangerous to warn people about non-issues "out of an abundance of caution". > 3) Autovacuums triggered by tuple thresholds persistently getting cancelled > also regularly cause outages, and they make it more likely that an > eventual age-based vacuum will take forever. Really? Outages? I imagine that you'd have to be constantly hammering the table with DDL before it could happen. That's possible, but it seems relatively obvious that doing that is asking for trouble. Whereas the Manta postmortem (and every similar case that I've personally seen) involved very nasty interactions that happened due to the way components interacted with a workload that wasn't like that. Running DDL from a cron job or from the application may not be a great idea, but it's also quite common. > Aspect 1) is addressed to a good degree by the proposed split of anti-wrap > into an age and anti-cancel triggers. And could be improved by reporting > failsafe autovacuums in pg_stat_activity. What you call aspect 2 (the issue with disastrous HW lock traffic jams involving TRUNCATE being run from a cron job, etc) is a big goal of mine for this patch series. You seem unsure of how effective my approach (or an equally simple approach based on table age heuristics) will be. Is that the case? > Perhaps 2) could be improved a bit by emitting a WARNING message when we > didn't cancel AV because it was anti-wraparound? But eventually I think we > somehow need to signal the "intent" of the lock drop down into ProcSleep() or > wherever it'd be. That's doable, but definitely seems like separate work. > I have two ideas around 3): > > First, we could introduce thresholds for the tuple thresholds, after which > autovacuum isn't cancellable anymore. Do you think that's a good idea? I just don't trust those statistics, at all. As in, I think they're often complete garbage. > Second, we could track the number of cancellations since the last [auto]vacuum > in pgstat, and only trigger the anti-cancel behaviour when autovacuum has been > cancelled a number of times. In theory that would be better than an approach along the lines I've proposed, because it is directly based on a less aggressive approach being tried a few times, and failing a few times. That part I like. However, I also don't like several things about this approach.
First of all it relies on relatively complicated infrastructure, for something that can be critical. Second, it will be hard to test. Third, perhaps it would make sense to give the earlier/less aggressive approach (a table age av that is still autocancellable) quite a large number of attempts before giving up. If the table age isn't really growing too fast, why not continue to be patient, possibly for quite a long time? Perhaps a hybrid strategy could be useful? Something like what I came up with already, *plus* a mechanism that gives up after (say) 1000 cancellations, and escalates to no-auto-cancel, regardless of table age. It seems sensible to assume that a less aggressive approach is just hopeless relatively quickly (in wall clock time and logical XID time) once we see sufficiently many cancellations against the same table. -- Peter Geoghegan
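A minimal sketch of that hybrid rule, assuming a hypothetical per-table cancellation counter and a no-cancel age cutoff computed elsewhere (nothing here is actual patch code):

    /*
     * Sketch only: escalate to no-auto-cancel either once table age
     * crosses the no-cancel cutoff, or once autovacuum against this
     * table has been cancelled very many times -- whichever comes first.
     */
    static bool
    should_prevent_auto_cancel(TransactionId relfrozenxid,
                               TransactionId recentXid,
                               int no_cancel_age,
                               int ncancels_since_last_vacuum)
    {
        if (TransactionIdPrecedes(relfrozenxid, recentXid - no_cancel_age))
            return true;        /* table age alone says escalate */
        if (ncancels_since_last_vacuum >= 1000)     /* the "(say) 1000" above */
            return true;        /* repeated cancellations say escalate */
        return false;
    }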
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Andres Freund
Date:
Hi, On 2023-01-13 19:39:41 -0800, Peter Geoghegan wrote: > On Fri, Jan 13, 2023 at 6:09 PM Andres Freund <andres@anarazel.de> wrote: > > I don't think the split is right. There's too much in 0001 - it's basically > > introducing the terminology of 0002 already. Could you make it a much more > > minimal change? > > Okay. > > I thought that you might say that. I really just wanted to show you > how small the code footprint was for the more controversial part. I don't mind you splitting this into three parts ;) > I believe that we agree on most things when it comes to VACUUM, but > I'm pretty sure that we still disagree about the value in setting > autovacuum_freeze_max_age very high. I continue to believe that > setting it over a billion or so is just a bad idea. I'm mentioning > this only because it might give you some idea of where I'm coming from > -- in general I believe that age-based settings often have only very > weak relationships with what actually matters. I think part of our difference around a high autovacuum_freeze_max_age are due to things you're trying to address here - if no-auto-cancel is a separate threshold from autovacuum_freeze_max_age, it's less problematic to set autovacuum_freeze_max_age to something lower. But yes, there's a remaining difference of opinion / experience. > It might make sense to always give a small fixed amount of headroom > when autovacuum_freeze_max_age is set to very high values. Maybe just > 5 million XIDs/MXIDs. That would probably be just as effective as > (say) 500 million in almost all cases. But then you have to accept > another magic number. I suspect that most systems with a high autovacuum_freeze_max_age use it because 200m leads to too frequent autovacuums - a few million won't do much for those. How about a float autovacuum_no_auto_cancel_age where positive values are treated as absolute values, and negative values are a multiple of autovacuum_freeze_max_age? And where the "computed" age is capped at vacuum_failsafe_age? A "failsafe" autovacuum clearly shouldn't be cancelled. And maybe a default setting of -1.8 or so? If a user chooses to set autovacuum_no_auto_cancel_age and vacuum_failsafe_age to 2.1 billion, oh well, that's really not our problem. > > 1) It regularly scares the crap out of users, even though it's normal. This > > is further confounded by failsafe autovacuums, where a scared reaction is > > appropriate, not being visible in pg_stat_activity. > > The docs actually imply that when the system reaches the point of > entering xidStopLimit mode, you might get data corruption. Of course > that's not true (the entire point of xidStopLimit is to avoid that), > but apparently we like to keep users on their toes. Well, historically there wasn't all that much protection. And I suspect there still might be some dragons. We really need to get rid of the remaining places that cache 32bit xids across transactions. > > 3) Autovacuums triggered by tuple thresholds persistently getting cancelled > > also regularly causes outages, and they make it more likely that an > > eventual age-based vacuum will take forever. > > Really? Outages? I imagine that you'd have to be constantly hammering > the table with DDL before it could happen. That's possible, but it > seems relatively obvious that doing that is asking for trouble. Yes, due to queries slowing down due to the bloat, or due to running out of space. I've seen a ~2TB table grow to ~20TB due to dead tuples, at which point the server crash-restarted due to WAL ENOSPC. 
I think in that case there wasn't even DDL, they just needed to make the table readonly, for a brief moment, a few times a day. The problem started once the indexes grew to be large enough that the time for (re-)finding dead tuples + an index scan phase got large enough that they were unlucky to be killed a few times in a row. After that autovac never got through the index scan phase. Not doing any "new" work, just collecting the same dead tids over and over, scanning the indexes for those tids, never quite finishing. You really don't need lock conflicts to be that frequent. They just need to be more frequent than what the index scans portion of vacuuming takes, which can be a long time with large indexes. > > Aspect 1) is addressed to a good degree by the proposed split of anti-wrap > > into an age and anti-cancel triggers. And could be improved by reporting > > failsafe autovacuums in pg_stat_activity. > > What you call aspect 2 (the issue with disastrous HW lock traffic jams > involving TRUNCATE being run from a cron job, etc) is a big goal of > mine for this patch series. You seem unsure of how effective my > approach (or an equally simple approach based on table age heuristics) > will be. Is that the case? I was primarily concerned about situations where an admin interactively was trying to get rid of the table holding back the xid horizon, after some problem caused a lot of work to pile up. If you know postgres internals, it's easy. You know that autovac stops auto-cancelling after freeze_max_age and you know that the lock queue is ordered. So you issue DROP TABLE / TRUNCATE / whatever and immediately afterwards cancel the autovac worker. But if you don't, it'll take a while to figure out that the DROP TABLE isn't progressing due to a lock conflict, at which point you'll cancel the statement (possibly having wreaked havoc with all other accesses). Then you figure out that you need to cancel autovacuum. After that you try the DROP TABLE again - but the next worker has gotten to work on the table. Brr. The bad cousin of this is when you can't even drop the table due to "not accepting commands". I don't know if you've seen people try to start postgres in single user mode, under pressure, for the first time. It ain't pretty. > > > I have two ideas around 3): > > > > > > First, we could introduce thresholds for the tuple thresholds, after which > > > autovacuum isn't cancellable anymore. > > > > Do you think that's a good idea? I just don't trust those statistics, > > at all. As in, I think they're often complete garbage. I have seen them be reasonably accurate in plenty of busy systems. The most common problem I've seen is that the amount of assumed dead / inserted tuples is way too high, because the server crash-restarted at some point. We now have most of the infrastructure for using a slightly older version of stats after a crash, which'd make them less inaccurate. I haven't recently seen crazy over-estimates of the number of dead tuples, at least if you count dead tids as tuples. It might be worth splitting the dead tuple count into dead tuples and dead items. Are you mostly seeing over- or under-estimates?
> That part I like. > > However, I also don't like several things about this approach. First > of all it relies on relatively complicated infrastructure, for > something that can be critical. I don't think it relies on much more machinery than we already rely on? The dead/inserted tuple estimates come from pgstat already. The only really new piece would be that the worker would need to do a pgstat_report_autovac_failure() after getting cancelled, which doesn't seem too hard? > Second, it will be hard to test. It doesn't seem too bad. A short naptime, cancelling autovac in a loop, and the threshold should quickly be reached? We also could add a helper function to artificially increase the failure count. > Third, perhaps it would make sense to give the earlier/less aggressive > approach (a table age av that is still autocancellable) quite a large > number of attempts before giving up. If the table age isn't really > growing too fast, why not continue to be patient, possibly for quite a > long time? Yea, that's true. But it's also easy to get to the point that you collect so much "debt" that it'll be hard to get out of. Particularly once the dead items space doesn't fit all the dead tids, the subsequent parts of the table won't get processed by vacuum if it frequently is cancelled before the index scans end, leading to a huge amount of work later. > Perhaps a hybrid strategy could be useful? Something like what I came > up with already, *plus* a mechanism that gives up after (say) 1000 > cancellations, and escalates to no-auto-cancel, regardless of table > age. It seems sensible to assume that a less aggressive approach is > just hopeless relatively quickly (in wall clock time and logical XID > time) once we see sufficiently many cancellations against the same > table. Maybe. Making it explainable is presumably the hard part. We've historically failed to make this area understandable, so maybe we don't need to try :) Somehow this made me think of a somewhat crazy, and largely unrelated, idea: Why don't we use the currently unused VM bit combination to indicate pages with dead tids? We could have an [auto]vacuum mode where it scans just pages with the dead tids bit set. Particularly when on-access heap pruning is doing most of the work, that could be quite useful to more cheaply get rid of the dead tids. Obviously we'd also set them when vacuum decides / was told not to do index cleanup. Yes, it'd obviously be less effective than lots of the things we discussed in this area (needing to re-collect the dead tids on the indicated pages), but it'd have the advantage of not needing a lot of new infrastructure. Greetings, Andres Freund
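Circling back to the autovacuum_no_auto_cancel_age idea floated earlier in this message: purely as an illustration of the proposed semantics (positive values are an absolute age, negative values are a multiple of autovacuum_freeze_max_age, capped at vacuum_failsafe_age), the computation could be as simple as this sketch of mine:

    /* Sketch only: effective no-auto-cancel age from the proposed GUC */
    static int
    effective_no_auto_cancel_age(double autovacuum_no_auto_cancel_age,
                                 int autovacuum_freeze_max_age,
                                 int vacuum_failsafe_age)
    {
        double  age;

        if (autovacuum_no_auto_cancel_age >= 0)
            age = autovacuum_no_auto_cancel_age;
        else
            age = -autovacuum_no_auto_cancel_age * autovacuum_freeze_max_age;

        /* a "failsafe" autovacuum clearly shouldn't be cancelled */
        return (int) Min(age, (double) vacuum_failsafe_age);
    }

With the suggested default of -1.8 and autovacuum_freeze_max_age at its default of 200 million, that works out to a no-auto-cancel age of 360 million XIDs.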
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Robert Haas
Date:
On Fri, Jan 13, 2023 at 9:09 PM Andres Freund <andres@anarazel.de> wrote: > > > If I understand the patch correctly, we now have the following age based > > > thresholds for av: > > > > > > - force-enable autovacuum: > > > oldest_datfrozenxid + autovacuum_freeze_max_age < nextXid > > > - autovacuum based on age: > > > freeze_max_age = Min(autovacuum_freeze_max_age, table_freeze_max_age) > > > tableagevac = relfrozenxid < recentXid - freeze_max_age > > > - prevent auto-cancellation: > > > freeze_max_age = Min(autovacuum_freeze_max_age, table_freeze_max_age) > > > prevent_auto_cancel_age = Min(freeze_max_age * 2, 1 billion) > > > prevent_auto_cancel = reflrozenxid < recentXid - prevent_auto_cancel_age > > > > > > Is that right? > > > > That summary looks accurate, but I'm a bit confused about why you're > > asking the question this way. I thought that it was obvious that the > > patch doesn't change most of these things. > > For me it was helpful to clearly list the triggers when thinking about the > issue. I found the diff hard to read and, as noted above, the logic for the > auto cancel threshold quite confusing, so ... I really dislike formulas like Min(freeze_max_age * 2, 1 billion). That looks completely magical from a user perspective. Some users aren't going to understand autovacuum behavior at all. Some will, and will be able to compare age(relfrozenxid) against autovacuum_freeze_max_age. Very few people are going to think to compare age(relfrozenxid) against some formula based on autovacuum_freeze_max_age. I guess if we document it, maybe they will. But even then, what's the logic behind that formula? I am not entirely convinced that we need to separate the force-a-vacuum threshold from the don't-cancel threshold, but if we do separate them, what's the purpose of having the clearance between them increase as you increase autovacuum_freeze_max_age from 0 to 500 million, and thereafter decrease until it reaches 0 at 1 billion? I can't explain the logic behind that except by saying "well, somebody came up with an arbitrary formula". I do like the idea of driving the auto-cancel behavior off of the results of previous attempts to vacuum the table. That could be done independently of the XID age of the table. If we've failed to vacuum the table, say, 10 times, because we kept auto-cancelling, it's probably appropriate to force the issue. It doesn't really matter whether the autovacuum triggered because of bloat or because of XID age. Letting either of those things get out of control is bad. What I think happens fairly commonly right now is that the vacuums just keep getting cancelled until the table's XID age gets too old, and then we finally force the issue. But at that point a lot of harm has already been done. In a frequently updated table, waiting 300 million XIDs to stop cancelling the vacuum is basically condemning the user to have to run VACUUM FULL. The table can easily be ten or a hundred times bigger than it should be by that point. And that's a big reason why I am skeptical about the patch as proposed. It raises the threshold for auto-cancellation in cases where it's sometimes already far too high. -- Robert Haas EDB: http://www.enterprisedb.com
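To spell out the arithmetic behind that observation: with prevent_auto_cancel_age = Min(freeze_max_age * 2, 1 billion), the clearance is Min(freeze_max_age * 2, 1 billion) - freeze_max_age. At freeze_max_age = 200 million the clearance is 200 million; at 500 million it peaks at 500 million; at 800 million it is back down to 200 million; and at 1 billion it reaches 0.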
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Fri, Jan 13, 2023 at 9:55 PM Andres Freund <andres@anarazel.de> wrote: > I think part of our difference around a high autovacuum_freeze_max_age are due > to things you're trying to address here - if no-auto-cancel is a separate > threshold from autovacuum_freeze_max_age, it's less problematic to set > autovacuum_freeze_max_age to something lower. That seems like a much smaller difference of opinion than I'd imagined it was before now. > But yes, there's a remaining difference of opinion / experience. Perhaps it's also a communication issue. I don't disagree with pragmatic decisions that made sense given very particular limitations in Postgres, which this is starting to sound like now. When I express skepticism of very high autovacuum_freeze_max_age settings, it's mostly just that I don't think that age(relfrozenxid) is at all informative in the way that something that triggers autovacuum ought to be. Worst of all, the older relfrozenxid gets, the less informative it becomes. I'm sure that it can be safe to use very high autovacuum_freeze_max_age values, but it seems like a case of using a very complicated and unreliable thing to decide when to VACUUM. > > It might make sense to always give a small fixed amount of headroom > > when autovacuum_freeze_max_age is set to very high values. Maybe just > > 5 million XIDs/MXIDs. That would probably be just as effective as > > (say) 500 million in almost all cases. But then you have to accept > > another magic number. > > I suspect that most systems with a high autovacuum_freeze_max_age use it > because 200m leads to too frequent autovacuums - a few million won't do much > for those. My point was mostly that it really doesn't cost us anything, and it could easily help, so it might be worthwhile. I wonder if these users specifically get too many aggressive autovacuums with lower autovacuum_freeze_max_age settings? Maybe they even fail to get enough non-aggressive autovacuums? If that's what it is, then that makes perfect sense to me. However, I tend to think of this as an argument against aggressive VACUUMs, not an argument in favor of high autovacuum_freeze_max_age settings (as I said, communication is hard). > How about a float autovacuum_no_auto_cancel_age where positive values are > treated as absolute values, and negative values are a multiple of > autovacuum_freeze_max_age? And where the "computed" age is capped at > vacuum_failsafe_age? A "failsafe" autovacuum clearly shouldn't be cancelled. > > And maybe a default setting of -1.8 or so? I think that would work fine, as a GUC (with no reloption). > Well, historically there wasn't all that much protection. And I suspect there > still might be some dragons. We really need to get rid of the remaining places > that cache 32bit xids across transactions. Maybe, but our position is that it's supported, it's safe -- there will be no data loss (or else it's a bug, and one that we'd take very seriously at that). Obviously it's a bad idea to allow this to happen, but surely having the system enter xidStopLimit is sufficient disincentive for users. > I've seen a ~2TB table grow to ~20TB due to dead tuples, at which point the > server crash-restarted due to WAL ENOSPC. I think in that case there wasn't > even DDL, they just needed to make the table readonly, for a brief moment, a > few times a day. The problem started once the indexes grew to be large enough > that the time for (re-)finding dead tuples + and an index scan phase got large > enough that they were unlucky to be killed a few times in a row. 
> After that autovac never got through the index scan phase. Not doing any "new" work, just > collecting the same dead tids over and over, scanning the indexes for those > tids, never quite finishing. That makes sense to me, though I wonder if there would have been another kind of outage (not the one you actually saw) had the autocancellation behavior somehow been disabled after the first few cancellations. This sounds like a great argument in favor of suspend-and-resume as a way of handling autocancellation -- no useful work needs to be thrown away for AV to yield for a minute or two. One ambition of mine for the visibility map snapshot infrastructure was to be able to support suspend-and-resume. It wouldn't be that hard for autovacuum to serialize everything, and use that to pick up right where an autovacuum worker left off when it was autocancelled. Same OldestXmin starting point as before, same dead_items array, same number of scanned_pages (and pages left to scan). > > What you call aspect 2 (the issue with disastrous HW lock traffic jams > > involving TRUNCATE being run from a cron job, etc) is a big goal of > > mine for this patch series. You seem unsure of how effective my > > approach (or an equally simple approach based on table age heuristics) > > will be. Is that the case? > > I was primarily concerned about situations where an admin interactively was > trying to get rid of the table holding back the xid horizon, after some > problem caused a lot of work to pile up. Fair enough, but the outages that I'm mostly thinking about here weren't really like that. There wasn't anything holding back the horizon, at any point. It was just that the autocancellation behavior was disruptive in some critical way. > If you know postgres internals, it's easy. You know that autovac stops > auto-cancelling after freeze_max_age and you know that the lock queue is > ordered. So you issue DROP TABLE / TRUNCATE / whatever and immediately > afterwards cancel the autovac worker. But if you don't, it'll take a while to > figure out that the DROP TABLE isn't progressing due to a lock conflict, at > which point you'll cancel the statement (possibly having wreaked havoc with > all other accesses). Then you figure out that you need to cancel > autovacuum. After that you try the DROP TABLE again - but the next worker has > gotten to work on the table. Yeah, that's pretty bad. Maybe DROP TABLE and TRUNCATE should be special cases? Maybe they should always be able to auto cancel an autovacuum? > > > I have two ideas around 3): > > > > > > First, we could introduce thresholds for the tuple thresholds, after which > > > autovacuum isn't cancellable anymore. > > > > Do you think that's a good idea? I just don't trust those statistics, > > at all. As in, I think they're often complete garbage. > > I have seen them be reasonably accurate in plenty of busy systems. The most > common problem I've seen is that the amount of assumed dead / inserted tuples > is way too high, because the server crash-restarted at some point. We now > have most of the infrastructure for using a slightly older version of stats > after a crash, which'd make them less inaccurate. Did you mean way too low? > I haven't recently seen crazy over-estimates of the number of dead tuples, at > least if you count dead tids as tuples. It might be worth splitting the dead > tuple count into dead tuples and dead items. > > Are you mostly seeing over- or under-estimates? It's not so much what I've seen. It's that the actual approach has lots of problems.
Referring to my notes, here are what seem to me to be serious problems:
* We are very naive about what a dead tuple even means, and we totally fail to account for the fact that only the subset of heap pages that are not PageIsAllVisible() are interesting to VACUUM -- focusing on the whole table just seems wrong. Per https://postgr.es/m/CAH2-Wz=MGFwJEpEjVzXwEjY5yx=UuNPzA6Bt4DSMasrGLUq9YA@mail.gmail.com
* Stub LP_DEAD items in heap pages are extremely different to dead heap-only tuples in heap pages, a distinction that we ignore. Per https://postgr.es/m/CAH2-WznrZC-oHkB+QZQS65o+8_Jtj6RXadjh+8EBqjrD1f8FQQ@mail.gmail.com
* Problem where insert-driven autovacuums (autovacuum_vacuum_insert_threshold/autovacuum_vacuum_insert_scale_factor triggers AVs) become further spaced apart as a consequence of one VACUUM operation taking far longer than usual (generally because it's an aggressive VACUUM that follows several non-aggressive VACUUMs). Per https://postgr.es/m/CAH2-Wzn=bZ4wynYB0hBAeF4kGXGoqC=PZVKHeerBU-je9AQF=g@mail.gmail.com
It's quite possible to get approximately the desired outcome with an algorithm that is completely wrong -- the way that we sometimes need autovacuum_freeze_max_age to deal with bloat is a great example of that. Even then, there may still be serious problems that are well worth solving. > > However, I also don't like several things about this approach. First > > of all it relies on relatively complicated infrastructure, for > > something that can be critical. > > I don't think it relies on much more machinery than we already rely on? The > dead/inserted tuple estimates come from pgstat already. The only really > new piece would be that the worker would need to do a > pgstat_report_autovac_failure() after getting cancelled, which doesn't seem > too hard? Yeah, but it is a new dependency on stored state. Certainly doable, but hard enough that it might be better to add that part later on. > Somehow this made me think of a somewhat crazy, and largely unrelated, idea: > Why don't we use the currently unused VM bit combination to indicate pages > with dead tids? We could have an [auto]vacuum mode where it scans just pages > with the dead tids bit set. Particularly when on-access heap pruning is doing > most of the work, that could be quite useful to more cheaply get rid of the > dead tids. Obviously we'd also set them when vacuum decides / was told not to > do index cleanup. Yes, it'd obviously be less effective than lots of the > things we discussed in this area (needing to re-collect the dead tids on the > indicated pages), but it'd have the advantage of not needing a lot of new > infrastructure. I wonder if it would be possible to split up the work of VACUUM into multiple phases that can be processed independently. The dead_items array could be serialized and stored in a temp file. That's not the same as some of the more complicated stuff we talked about in the last couple of years, such as a dedicated fork for Dead TIDs. It's more like an extremely flexible version of the same basic design for VACUUM, with the ability to slow down and speed up based on a system level view of things (e.g., checkpointing information). And with index vacuuming happening on a highly deferred timeline in some cases. Possibly we could have each slice of work processed by any available autovacuum worker. Autovacuum workers could become "headless".
You would need some kind of state machine to make sure that critical dependencies were respected (e.g., always do the heap vacuuming step after all indexes are vacuumed), but that possibly isn't that hard, and still gives you a lot. As for this patch of mine: do you think that it would be acceptable to pursue a version based on your autovacuum_no_auto_cancel_age design for 16? Perhaps this can include something like pgstat_report_autovac_failure(). It's not even the work of implementing pgstat_report_autovac_failure() that creates risk that it'll miss the 16 feature freeze deadline. I'm more concerned that introducing a more complicated design will lead to the patch being bikeshedded to death. -- Peter Geoghegan
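As a purely illustrative aside on the suspend-and-resume/serialization idea discussed above: the serialized per-table state might contain little more than the following (every name here is hypothetical; nothing like this exists today):

    /*
     * Hypothetical serialized state for suspend-and-resume VACUUM.  A worker
     * that gets autocancelled (or deliberately yields) could write this out
     * to a temp file, together with its dead_items TID array, and a later
     * worker could pick up from exactly the same point.
     */
    typedef struct SuspendedVacuumState
    {
        Oid             relid;          /* table being vacuumed */
        TransactionId   OldestXmin;     /* same cutoff as the original run */
        BlockNumber     scanned_pages;  /* progress through the heap */
        BlockNumber     rel_pages;      /* total pages at the start */
        int             num_dead_items; /* length of the serialized TID array */
        /* the dead_items TIDs themselves follow on disk */
    } SuspendedVacuumState;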
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Mon, Jan 16, 2023 at 8:25 AM Robert Haas <robertmhaas@gmail.com> wrote: > I really dislike formulas like Min(freeze_max_age * 2, 1 billion). > That looks completely magical from a user perspective. Some users > aren't going to understand autovacuum behavior at all. Some will, and > will be able to compare age(relfrozenxid) against > autovacuum_freeze_max_age. Very few people are going to think to > compare age(relfrozenxid) against some formula based on > autovacuum_freeze_max_age. I guess if we document it, maybe they will. What do you think of Andres' autovacuum_no_auto_cancel_age proposal? As I've said several times already, I am by no means attached to the current formula. > I do like the idea of driving the auto-cancel behavior off of the > results of previous attempts to vacuum the table. That could be done > independently of the XID age of the table. Even when the XID age of the table has already significantly surpassed autovacuum_freeze_max_age, say due to autovacuum worker starvation? > If we've failed to vacuum > the table, say, 10 times, because we kept auto-cancelling, it's > probably appropriate to force the issue. I suggested 1000 times upthread. 10 times seems very low, at least if "number of times cancelled" is the sole criterion, without any attention paid to relfrozenxid age or some other tiebreaker. > It doesn't really matter > whether the autovacuum triggered because of bloat or because of XID > age. Letting either of those things get out of control is bad. While inventing a new no-auto-cancel behavior that prevents bloat from getting completely out of control may well have merit, I don't see why it needs to be attached to this other effort. I think that the vast majority of individual tables have autovacuums cancelled approximately never, and so my immediate concern is ameliorating cases where not being able to auto-cancel once in a blue moon causes an outage. Sure, the opposite problem also exists, and I think that it would be really bad if it was made significantly worse as an unintended consequence of a patch that addressed just the first problem. But that doesn't mean we have to solve both problems together at the same time. > But at that point a lot of harm has already > been done. In a frequently updated table, waiting 300 million XIDs to > stop cancelling the vacuum is basically condemning the user to have to > run VACUUM FULL. The table can easily be ten or a hundred times bigger > than it should be by that point. The rate at which relfrozenxid ages is just about useless as a proxy for how much wall clock time has passed with a given workload -- workloads are usually very bursty. It's much worse still as a proxy for what has changed in the table; completely static tables have their relfrozenxid age at exactly the same rate as the most frequently updated table in the same database (the table that "consumes the most XIDs"). So while antiwraparound autovacuum no-auto-cancel behavior may indeed save the user from problems with serious bloat, it will happen pretty much by mistake. Not that it doesn't happen all the same -- of course it does. That factor (the mistake factor) doesn't mean I take the point any less seriously. What I don't take seriously is the idea that the precise XID age was ever crucially important. More generally, I just don't accept that this leaves no room for something along the lines of my proposal, such as Andres' autovacuum_no_auto_cancel_age concept.
As I've said already, there will usually be a very asymmetric quality to the problem in cases like the Joyent outage. Even a modest amount of additional XID-space-headroom will very likely be all that will be needed at the critical juncture. It may not be perfect, but it still has every potential to make things safer for some users, without making things any less safe for other users. -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Robert Haas
Date:
On Mon, Jan 16, 2023 at 11:11 PM Peter Geoghegan <pg@bowt.ie> wrote: > On Mon, Jan 16, 2023 at 8:25 AM Robert Haas <robertmhaas@gmail.com> wrote: > > I really dislike formulas like Min(freeze_max_age * 2, 1 billion). > > That looks completely magical from a user perspective. Some users > > aren't going to understand autovacuum behavior at all. Some will, and > > will be able to compare age(relfrozenxid) against > > autovacuum_freeze_max_age. Very few people are going to think to > > compare age(relfrozenxid) against some formula based on > > autovacuum_freeze_max_age. I guess if we document it, maybe they will. > > What do you think of Andres' autovacuum_no_auto_cancel_age proposal? I like it better than your proposal. I don't think it's a fundamental improvement and I would rather see a fundamental improvement, but I can see it being better than nothing. > > I do like the idea of driving the auto-cancel behavior off of the > > results of previous attempts to vacuum the table. That could be done > > independently of the XID age of the table. > > Even when the XID age of the table has already significantly surpassed > autovacuum_freeze_max_age, say due to autovacuum worker starvation? > > > If we've failed to vacuum > > the table, say, 10 times, because we kept auto-cancelling, it's > > probably appropriate to force the issue. > > I suggested 1000 times upthread. 10 times seems very low, at least if > "number of times cancelled" is the sole criterion, without any > attention paid to relfrozenxid age or some other tiebreaker. Hmm, I think that a threshold of 1000 is far too high to do much good. By the time we've tried to vacuum a table 1000 times and failed every time, I anticipate that the situation will be pretty dire, regardless of why we thought the table needed to be vacuumed in the first place. In the best case, with autovacuum_naptime = 1 minute, failing 1000 times means that we've delayed vacuuming the table for at least 16 hours. That's assuming that there's a worker available to retry every minute and that we fail quickly. If it's a 2 hour vacuum operation and we typically fail about halfway through, it could take us over a month to hit 1000 failures. There are many tables out there that get enough inserts, updates, and deletes that a 16-hour delay will result in irreversible bloat, never mind a 41-day delay. After even a few days, wraparound could become critical even if bloat isn't. I'm not sure why a threshold of 10 would be too low. It seems to me that if we fail ten times in a row to vacuum a table and fail for the same reason every time, we're probably going to keep failing for that reason. If that is true, we will be better off if we force the issue sooner rather than later. There's no value in letting the table bloat out the wazoo and the cluster approach a wraparound shutdown before we insist. Consider a more mundane example. If I try to start my car or my dishwasher or my computer or my toaster oven ten times and it fails ten times in a row, and the failure mode appears to be the same each time, I am not going to sit there and try 990 more times hoping things get better, because that seems very unlikely to help. Honestly, depending on the situation, I might not even get to ten times before I switch to doing some form of troubleshooting and/or calling someone who could repair the device. In fact I think there's a decent argument that a threshold of ten is possibly too high here, too.
If you wait until the tenth try before you try not auto-cancelling, then a table with a workload that makes auto-cancelling 100% probable will get vacuumed 10% as often as it would otherwise. I think there are cases where that would be OK, but probably on the whole it's not going to go very well. The only problem I see with lowering the threshold below ~10 is that the signal starts to get weak. If something fails for the same reason ten times in a row you can be pretty sure it's a chronic problem. If you made the threshold say three you'd probably start making bad decisions sometimes -- you'd think that you had a chronic problem when really you just got a bit unlucky. To get back to the earlier question above, I think that if the retries-before-not-auto-cancelling threshold were low enough to be effective, you wouldn't necessarily need to consider XID age as a second reason for not auto-cancelling. You would want to force the behavior anyway when you hit emergency mode, because that should force all the mitigations we have, but I don't know that you need to do anything before that. > > It doesn't really matter > > whether the autovacuum triggered because of bloat or because of XID > > age. Letting either of those things get out of control is bad. > > While inventing a new no-auto-cancel behavior that prevents bloat from > getting completely out of control may well have merit, I don't see why > it needs to be attached to this other effort. It doesn't, but I think it would be a lot more beneficial than just adding a new GUC. A lot of the fundamental stupidity of autovacuum comes from its inability to consider the context. I've had various ideas over the years about how to fix that, but this is far simpler than some things I've thought about and I think it would help a lot of people. > That factor (the mistake factor) doesn't mean I take the point any > less seriously. What I don't take seriously is the idea that the > precise XID age was ever crucially important. I agree. That's why I think driving this off of number of previous failures would be better than driving it off of an XID age. -- Robert Haas EDB: http://www.enterprisedb.com
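To make the retry arithmetic above concrete, here is a rough standalone sketch; the function and constant names are hypothetical and not taken from any posted patch. It assumes one retry per autovacuum_naptime plus however long each failed attempt runs before being cancelled.

    #include <stdbool.h>

    /* Hypothetical sketch: decide whether an autovacuum worker should stop
     * allowing itself to be auto-cancelled, based only on how many
     * consecutive attempts against this table were cancelled. */
    #define AUTOVAC_CANCEL_ESCALATION_THRESHOLD 10

    static bool
    should_disable_auto_cancel(int consecutive_cancellations)
    {
        return consecutive_cancellations >= AUTOVAC_CANCEL_ESCALATION_THRESHOLD;
    }

    /* Minimum delay implied by N cancelled attempts: each attempt costs one
     * naptime of waiting plus the time spent vacuuming before the cancel.
     * 1000 attempts at a 60s naptime with instant failures is ~16.7 hours;
     * 1000 attempts at (60s + 3600s) for a 2-hour VACUUM cancelled halfway
     * through is ~42 days. */
    static double
    min_delay_hours(int attempts, double naptime_secs, double secs_per_failed_attempt)
    {
        return attempts * (naptime_secs + secs_per_failed_attempt) / 3600.0;
    }

Plugging in a 60 second naptime and a one-hour cost per failed attempt reproduces the ~41-day figure mentioned above.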
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Andres Freund
Date:
Hi, On 2023-01-16 13:58:21 -0800, Peter Geoghegan wrote: > On Fri, Jan 13, 2023 at 9:55 PM Andres Freund <andres@anarazel.de> wrote: > When I express skepticism of very high autovacuum_freeze_max_age > settings, it's mostly just that I don't think that age(relfrozenxid) > is at all informative in the way that something that triggers > autovacuum ought to be. Worst of all, the older relfrozenxid gets, the > less informative it becomes. I'm sure that it can be safe to use very > high autovacuum_freeze_max_age values, but it seems like a case of > using a very complicated and unreliable thing to decide when to > VACUUM. Both you and Robert said this, and I have seen it be true, but typically not for large high-throughput OLTP databases, where I found increasing relfrozenxid to be important. Sure, there's probably some up/down through the day / week, but it's likely to be pretty predictable. I think the problem is that an old relfrozenxid doesn't tell you how much outstanding work there is. Perhaps that's what both of you meant... I think that's not the fault of relfrozenxid as a trigger, but that we simply don't keep enough other stats. We should imo at least keep track of: In pg_class: - The number of all frozen pages, like we do for all-visible That'd give us a decent upper bound for the amount of work we need to do to increase relfrozenxid. It's important that this is crash safe (thus no pg_stats), and it only needs to be written when we'd likely make other changes to the pg_class row anyway. In pgstats: - The number of dead items, incremented both by the heap scan and opportunistic pruning This would let us estimate how urgently we need to clean up indexes. - The xid/mxid horizons during the last vacuum - The number of pages with tuples that couldn't be removed due to the horizon during the last vacuum Together with the horizon, this would let us avoid repeated vacuums that won't help. Tracking the number of pages instead of tuples allows a lot better cost/benefit estimation of another vacuum. - The number of pages with tuples that couldn't be frozen Similar to the dead tuple one, except that it'd help avoid repeated vacuums to increase relfrozenxid, when it won't be able to help. > > I've seen a ~2TB table grow to ~20TB due to dead tuples, at which point the > > server crash-restarted due to WAL ENOSPC. I think in that case there wasn't > > even DDL, they just needed to make the table readonly, for a brief moment, a > > few times a day. The problem started once the indexes grew to be large enough > > that the time for (re-)finding dead tuples + an index scan phase got large > > enough that they were unlucky to be killed a few times in a row. After that > > autovac never got through the index scan phase. Not doing any "new" work, just > > collecting the same dead tids over and over, scanning the indexes for those > > tids, never quite finishing. > > That makes sense to me, though I wonder if there would have been > another kind of outage (not the one you actually saw) had the > autocancellation behavior somehow been disabled after the first few > cancellations. I suspect so, but it's hard to know. > This sounds like a great argument in favor of suspend-and-resume as a > way of handling autocancellation -- no useful work needs to be thrown > away for AV to yield for a minute or two. One ambition of mine for the > visibility map snapshot infrastructure was to be able to support > suspend-and-resume.
It wouldn't be that hard for autovacuum to > serialize everything, and use that to pick up right where an > autovacuum worker left off when it was autocancelled. Same > OldestXmin starting point as before, same dead_items array, same > number of scanned_pages (and pages left to scan). Hm, that seems a lot of work. Without having held a lock you don't even know whether your old dead items still apply. Of course it'd improve the situation substantially, if we could get it. > > If you know postgres internals, it's easy. You know that autovac stops > > auto-cancelling after freeze_max_age and you know that the lock queue is > > ordered. So you issue DROP TABLE / TRUNCATE / whatever and immediately > > afterwards cancel the autovac worker. But if you aren't, it'll take a while to > > figure out that the DROP TABLE isn't progressing due to a lock conflict, at > > which point you'll cancel the statement (possibly having wreaked havoc with > > all other accesses). Then you figure out that you need to cancel > > autovacuum. After you try the DROP TABLE again - but the next worker has > > gotten to work on the table. > > Yeah, that's pretty bad. Maybe DROP TABLE and TRUNCATE should be > special cases? Maybe they should always be able to auto cancel an > autovacuum? Yea, I think so. It's not obvious how to best pass down that knowledge into ProcSleep(). It'd have to be in the LOCALLOCK, I think. Looks like the best way would be to change LockAcquireExtended() to get a flags argument instead of reportMemoryError, and then we could add LOCK_ACQUIRE_INTENT_DROP & LOCK_ACQUIRE_INTENT_TRUNCATE or such. Then the same for RangeVarGetRelidExtended(). It already "customizes" how to lock based on RVR* flags. > > > > I have two ideas around 3): > > > > > > > > First, we could introduce thresholds for the tuple thresholds, after which > > > > autovacuum isn't cancelable anymore. > > > > > > Do you think that's a good idea? I just don't trust those statistics, > > > at all. As in, I think they're often complete garbage. > > > > I have seen them be reasonably accurate in plenty busy systems. The most > > common problem I've seen is that the amount of assumed dead / inserted tuples > > is way too high, because the server crash-restarted at some point. We now > > have most of the infrastructure for using a slightly older version of stats > > after a crash, which'd make them less inaccurate. > > Did you mean way too low? Err, yes. > > I haven't recently seen crazy over-estimates of the number of dead tuples, at > > least if you count dead tids as tuples. It might be worth to split the dead > > tuple count into dead tuples and dead items. > > > > Are you mostly seeing over or under estimates? > > It's not so much what I've seen. It's that the actual approach has > lots of problems. > > Referring to my notes, here are what seem to me to be serious problems: ISTM that some of what you write below would be addressed, at least partially, by the stats I proposed above. Particularly keeping some "page granularity" instead of "tuple granularity" stats seems helpful. It'd be great if we could update the page-granularity stats in heap_{insert,multi_insert,update,delete,heapgetpage}. But without page level flags like "has any dead tuples" that's probably too expensive. > * We are very naive about what a dead tuple even means, and we totally > fail to account for the fact that only the subset of heap pages that > are PageIsAllVisible() are interesting to VACUUM -- focusing on the > whole table just seems wrong.
> > Per https://postgr.es/m/CAH2-Wz=MGFwJEpEjVzXwEjY5yx=UuNPzA6Bt4DSMasrGLUq9YA@mail.gmail.com > * Stub LP_DEAD items in heap pages are extremely different to dead > heap-only tuples in heap pages, which we ignore. > > Per https://postgr.es/m/CAH2-WznrZC-oHkB+QZQS65o+8_Jtj6RXadjh+8EBqjrD1f8FQQ@mail.gmail.com > * Problem where insert-driven autovacuums > (autovacuum_vacuum_insert_threshold/autovacuum_vacuum_insert_scale_factor > triggers AVs) become further spaced apart as a consequence of one > VACUUM operation taking far longer than usual (generally because it's > an aggressive VACUUM that follows several non-aggressive VACUUMs). > Per https://postgr.es/m/CAH2-Wzn=bZ4wynYB0hBAeF4kGXGoqC=PZVKHeerBU-je9AQF=g@mail.gmail.com Yea, that's not great. But seems fairly addressable. > It's quite possible to get approximately the desired outcome with an > algorithm that is completely wrong -- the way that we sometimes need > autovacuum_freeze_max_age to deal with bloat is a great example of > that. Yea. I think this is part of why I like my idea about tracking more observations made by the last vacuum - they're quite easy to get right, and they self-correct, rather than potentially ending up causing ever-wronger stats. > Even then, there may still be serious problems that are well > worth solving. Right. I think it's fundamental that we get a lot better estimates about the amount of work needed. Without that we have no chance of finishing autovacuums before problems become too big. > > Somehow this made me think of a somewhat crazy, and largely unrelated, idea: > > Why don't we use the currently unused VM bit combination to indicate pages > > with dead tids? We could have an [auto]vacuum mode where it scans just pages > > with the dead tids bit set. Particularly when on-access heap pruning is doing > > most of the work, that could be quite useful to more cheaply get rid of the > > dead tids. Obviously we'd also set them when vacuum decides / was told not to > > do index cleanup. Yes, it'd obviously be less effective than lots of the > > things we discussed in this area (needing to re-collect the dead tids on the > > indicated pages), but it'd have the advantage of not needing a lot of new > > infrastructure. > > I wonder if it would be possible to split up the work of VACUUM into > multiple phases that can be processed independently. The dead_items > array could be serialized and stored in a temp file. That's not the > same as some of the more complicated stuff we talked about in the last > couple of years, such as a dedicated fork for Dead TIDs. It's more > like an extremely flexible version of the same basic design for > VACUUM, with the ability to slow down and speed up based on a system > level view of things (e.g., checkpointing information). And with index > vacuuming happening on a highly deferred timeline in some cases. > Possibly we could make each slice of work processed by any available > autovacuum worker. Autovacuum workers could become "headless". I don't know. I think the more basic idea I describe has significant advantages - most importantly being able to target autovacuum on the work that on-access pruning couldn't deal with. As you know it's very common that most row versions are HOT and that the remaining dead tuples also get removed by on-access pruning. Which often leads to running out of usable items, but autovacuum won't be triggered because there's not all that much garbage overall.
Without the ability for autovacuum to target such pages aggressively, I don't think we're going to improve such common workloads a whole lot. And serialized vacuum state won't help, because that still requires vacuum to scan all the !all-visible pages to discover them. Most of which won't contain dead tuples in a lot of workloads. > As for this patch of mine: do you think that it would be acceptable to > pursue a version based on your autovacuum_no_auto_cancel_age design > for 16? Perhaps this can include something like > pgstat_report_autovac_failure(). It's not even the work of > implementing pgstat_report_autovac_failure() that creates risk that > it'll miss the 16 feature freeze deadline. I'm more concerned that > introducing a more complicated design will lead to the patch being > bikeshedded to death. I don't feel like I have a good handle on what could work for 16 and what couldn't. Personally I think something like autovacuum_no_auto_cancel_age would be an improvement, but I also don't quite feel satisfied with it. Tracking the number of autovac failures seems uncontroversial and quite beneficial, even if the rest doesn't make it in. It'd at least let users monitor for tables where autovac is likely to swoop in in anti-wraparound mode. Perhaps it's worth to separately track the number of times a backend would have liked to cancel autovac, but couldn't due to anti-wrap? If changes to the no-auto-cancel behaviour don't make it in, it'd at least allow us to collect more data about the prevalence of the problem and in what situations it occurs? Even just adding some logging for that case seems like it'd be an improvement. I think with a bit of polish "Add autovacuum trigger instrumentation." ought to be quickly mergeable. Greetings, Andres Freund
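As a rough sketch only, the counters proposed in the message above might look something like this; the struct and field names are invented for illustration and do not correspond to actual pg_class columns or pgstats fields:

    /* Hypothetical, crash-safe counter kept in pg_class (only written when
     * the row would be updated anyway), mirroring relallvisible: */
    typedef struct RelFreezeStats
    {
        long    relallfrozen;   /* number of all-frozen pages */
    } RelFreezeStats;

    /* Hypothetical, non-crash-safe counters kept in pgstats: */
    typedef struct RelVacuumStats
    {
        long         dead_items;                /* LP_DEAD stubs, from heap scans and pruning */
        unsigned int last_vacuum_xid_horizon;   /* OldestXmin used by the last VACUUM */
        unsigned int last_vacuum_mxid_horizon;
        long         pages_blocked_by_horizon;  /* pages with tuples the horizon kept alive */
        long         pages_with_unfrozen_tuples;/* pages with tuples that couldn't be frozen */
    } RelVacuumStats;

    /* Intended use: skip a vacuum that cannot accomplish anything yet because
     * the same horizon would block the same tuples again.  (Real code would
     * need wraparound-aware XID comparison, not plain '>'.) */
    static int
    vacuum_could_make_progress(const RelVacuumStats *s, unsigned int current_oldest_xmin)
    {
        return current_oldest_xmin > s->last_vacuum_xid_horizon;
    }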
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Andres Freund
Date:
Hi, On 2023-01-17 10:26:52 -0500, Robert Haas wrote: > On Mon, Jan 16, 2023 at 11:11 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Mon, Jan 16, 2023 at 8:25 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > I really dislike formulas like Min(freeze_max_age * 2, 1 billion). > > > That looks completely magical from a user perspective. Some users > > > aren't going to understand autovacuum behavior at all. Some will, and > > > will be able to compare age(relfrozenxid) against > > > autovacuum_freeze_max_age. Very few people are going to think to > > > compare age(relfrozenxid) against some formula based on > > > autovacuum_freeze_max_age. I guess if we document it, maybe they will. > > > > What do you think of Andres' autovacuum_no_auto_cancel_age proposal? > > I like it better than your proposal. I don't think it's a fundamental > improvement and I would rather see a fundamental improvement, but I > can see it being better than nothing. That's similar to my feelings about it. I do think it'll be operationally nice to have at least some window where an autovacuum is triggered due to age and where it won't prevent cancels. In many situations it'll likely suffice if that window is autovacuum_naptime * xids_per_sec large, but of course that's easily enough exceeded. > > > I do like the idea of driving the auto-cancel behavior off of the > > > results of previous attempts to vacuum the table. That could be done > > > independently of the XID age of the table. > > > > Even when the XID age of the table has already significantly surpassed > > autovacuum_freeze_max_age, say due to autovacuum worker starvation? > > > > > If we've failed to vacuum > > > the table, say, 10 times, because we kept auto-cancelling, it's > > > probably appropriate to force the issue. > > > > I suggested 1000 times upthread. 10 times seems very low, at least if > > "number of times cancelled" is the sole criterion, without any > > attention paid to relfrozenxid age or some other tiebreaker. > > Hmm, I think that a threshold of 1000 is far too high to do much good. Agreed. > By the time we've tried to vacuum a table 1000 times and failed every > time, I anticipate that the situation will be pretty dire, regardless > of why we thought the table needed to be vacuumed in the first place. Agreed. > In the best case, with autovacum_naptime=1minute, failing 1000 times > means that we've delayed vacuuming the table for at least 16 hours. Perhaps it'd make sense for an auto-cancelled worker to signal the launcher to do a cycle of vacuuming? Or even to just try to vacuum the table again immediately? After all, we know that the table is going to be on the schedule of the next worker immediately. Of course we shouldn't retry indefinitely, but ... > In fact I think there's a decent argument that a threshold of ten is > possibly too high here, too. If you wait until the tenth try before > you try not auto-cancelling, then a table with a workload that makes > auto-cancelling 100% probable will get vacuumed 10% as often as it > would otherwise. I think there are cases where that would be OK, but > probably on the whole it's not going to go very well. That's already kind of the case - we'll only block auto-cancelling when exceeding autovacuum_freeze_max_age, all the other autovacuums will be cancelable. > The only problem I see with lowering the threshold below ~10 is that the > signal starts to get weak. If something fails for the same reason ten times > in a row you can be pretty sure it's a chronic problem. 
If you made the > thereshold say three you'd probably start making bad decisions sometimes -- > you'd think that you had a chronic problem when really you just got a bit > unlucky. Yea. Schema migrations in prod databases typically have to run in single-statement or very small transactions, for obvious reasons. Needing to lock the same table exclusively a few times during a schema migration is pretty normal, particularly when foreign keys are involved. Getting blocked by autovacuum in the middle of a schema migration is NASTY. This is why I'm a bit worried that 10 might be too low... It's not absurd for a schema migration to create 10 new tables referencing an existing table in need of vacuuming. Perhaps we should track when the first failure was, and take that into account? Clearly having all 10 autovacuums on the same table cancelled is different when those 10 cancellations happened in the last 10 * autovacuum_naptime minutes, than when the last successful autovacuum was hours ago. > To get back to the earlier question above, I think that if the > retries-before-not-auto-cancelling threshold were low enough to be > effective, you wouldn't necessarily need to consider XID age as a > second reason for not auto-cancelling. You would want to force the > behavior anyway when you hit emergency mode, because that should force > all the mitigations we have, but I don't know that you need to do > anything before that. Hm, without further restrictions, that has me worried. It's not crazy to have a LOCK TABLE on a small-ish table be part of your workload - I've certainly seen it plenty of times. Suddenly blocking on that for a few minutes, just because a bit of bloat has collected, seems likely to cause havoc. Greetings, Andres Freund
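A minimal sketch of the "when was the first failure" refinement, with made-up names and thresholds: escalate only when the cancellations are both numerous and spread over a long enough window, so that a burst of cancels during one short schema migration doesn't trip it.

    #include <stdbool.h>
    #include <time.h>

    /* Hypothetical: require both a minimum number of cancellations and a
     * minimum amount of time since the last successful autovacuum of the
     * table before refusing to auto-cancel. */
    static bool
    should_refuse_auto_cancel(int cancels_since_last_success,
                              time_t last_successful_autovacuum,
                              time_t now)
    {
        const int    min_cancels = 10;
        const double min_window_secs = 4.0 * 3600.0;    /* several hours, say */

        if (cancels_since_last_success < min_cancels)
            return false;

        return difftime(now, last_successful_autovacuum) >= min_window_secs;
    }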
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Robert Haas
Date:
On Tue, Jan 17, 2023 at 1:02 PM Andres Freund <andres@anarazel.de> wrote: > On 2023-01-16 13:58:21 -0800, Peter Geoghegan wrote: > > On Fri, Jan 13, 2023 at 9:55 PM Andres Freund <andres@anarazel.de> wrote: > > When I express skepticism of very high autovacuum_freeze_max_age > > settings, it's mostly just that I don't think that age(relfrozenxid) > > is at all informative in the way that something that triggers > > autovacuum ought to be. Worst of all, the older relfrozenxid gets, the > > less informative it becomes. I'm sure that it can be safe to use very > > high autovacuum_freeze_max_age values, but it seems like a case of > > using a very complicated and unreliable thing to decide when to > > VACUUM. > > Both you and Robert said this, and I have seen it be true, but typically not > for large high-throughput OLTP databases, where I found increasing > relfrozenxid to be important. Sure, there's probably some up/down through the > day / week, but it's likely to be pretty predictable. > > I think the problem is that an old relfrozenxid doesn't tell you how much > outstanding work there is. Perhaps that's what both of you meant... I think so, at least in my case. The reason why the default autovacuum_freeze_max_age is 300m is not because going over 300m is a problem, but because we don't know how long it's going to take to run all of the vacuums, and we're still going to be consuming XIDs in the meantime, and we have no idea how fast we're consuming XIDs either. For all we know it might take days, and we might be burning through XIDs really quickly, so we might need a ton of headroom. That's not to say that raising the setting might not be a sensible thing to do in a context where you know what the workload is. If you know that the vacuums will complete quickly OR that the XID burn rate is low, you can raise it drastically and be fine. We just can't assume that will be true everywhere. > I think that's not the fault of relfrozenxid as a trigger, but that we simply > don't keep enough other stats. We should imo at least keep track of: > > In pg_class: > > - The number of all frozen pages, like we do for all-visible > > That'd give us a decent upper bound for the amount of work we need to do to > increase relfrozenxid. It's important that this is crash safe (thus no > pg_stats), and it only needs to be written when we'd likely make other > changes to the pg_class row anyway. I'm not sure how useful this is because a lot of the work is from scanning the indexes. > In pgstats: > > - The number of dead items, incremented both by the heap scan and > opportunistic pruning > > This would let us estimate how urgently we need to clean up indexes. I don't think this is true because btree indexes are self-cleaning in some scenarios and not in others. > - The xid/mxid horizons during the last vacuum > > - The number of pages with tuples that couldn't be removed due to the horizon > during the last vacuum > > - The number of pages with tuples that couldn't be frozen Not bad to know, but if the horizon we could use advances by 1, we can't tell whether that allows pruning nothing additional or another billion tuples. I'm not trying to take the position that XID age is a totally useless metric. I don't think it is. If XID age is high, you know you have a problem, and the higher it is, the more urgent that problem is. Furthermore, if XID age is low, you know that you don't have that particular problem.
You might have some other one, but that's OK: this one metric doesn't have to answer every question to be useful. However, where XID age really falls down as a metric is that it doesn't tell you what it's going to take to solve the problem. The answer, at ten thousand feet, is always vacuum. But how long will that vacuum run? We don't know. Do we need the XID horizon to advance first, and if so, how far? We don't know. Do we need auto-cancellation to be disabled? We don't know. That's where we get into a lot of trouble here. -- Robert Haas EDB: http://www.enterprisedb.com
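The headroom argument is easy to make concrete with a back-of-the-envelope calculation; the numbers here are purely illustrative.

    /* Illustrative only: XIDs consumed while the pending vacuums run.  At
     * 5,000 XIDs/second, a 24-hour backlog burns ~432 million XIDs, which is
     * already past the 300 million default -- and by default we know neither
     * the burn rate nor how long the vacuums will take. */
    static double
    xids_consumed_while_vacuuming(double xids_per_second, double seconds_to_finish_vacuums)
    {
        return xids_per_second * seconds_to_finish_vacuums;
    }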
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Tue, Jan 17, 2023 at 10:02 AM Andres Freund <andres@anarazel.de> wrote: > Both you and Robert said this, and I have seen it be true, but typically not > for large high-throughput OLTP databases, where I found increasing > relfrozenxid to be important. Sure, there's probably some up/down through the > day / week, but it's likely to be pretty predictable. > > I think the problem is that an old relfrozenxid doesn't tell you how much > outstanding work there is. Perhaps that's what both of you meant... That's what I meant, yes. > I think that's not the fault of relfrozenxid as a trigger, but that we simply > don't keep enough other stats. We should imo at least keep track of: If you assume that there is chronic undercounting of dead tuples (which I think is very common), then of course anything that triggers vacuuming is going to help with that problem -- it might be totally inadequate, but still make the critical difference by not allowing the system to become completely destabilized. I absolutely accept that users that are relying on that exist, and that those users ought to not have things get even worse -- I'm pragmatic. But overall, what we should be doing is fixing the real problem, which is that the dead tuples accounting is deeply flawed. Actually, it's not just that the statistics are flat out wrong; the whole model is flat-out wrong. The assumptions that work well for optimizer statistics quite simply do not apply here. Random sampling for this is just wrong, because we're not dealing with something that follows a distribution that can be characterized with a sufficiently large sample. With optimizer statistics, the entire contents of the table is itself a sample taken from the wider world -- so even very stale statistics can work quite well (assuming that the schema is well normalized). Whereas the autovacuum dead tuples stuff is characterized by constant change. I mean of course it is -- that's the whole point! The central limit theorem obviously just doesn't work for something like this -- we cannot generalize from a sample, at all. I strongly suspect that it still wouldn't be a good model even if the information was magically always correct. It might actually be worse in some ways! Most of my arguments against the model are not arguments against the accuracy of the statistics as such. They're actually arguments against the fundamental relevance of the information itself, to the actual problem at hand. We are not interested in information for its own sake; we're interested in making better decisions about autovacuum scheduling. Those may only have a very loose relationship. How many dead heap-only tuples are equivalent to one LP_DEAD item? What about page-level concentrations, and the implication for line-pointer bloat? I don't have a good answer to any of these questions myself. And I have my doubts that there are *any* good answers. Even these questions are the wrong questions (they're just less wrong). Fundamentally, we're deciding when the next autovacuum should run against each table. Presumably it's going to have to happen some time, and when it does happen it happens to the table as a whole. And with a larger table it probably doesn't matter if it happens +/- a few hours from some theoretical optimal time. Doesn't it boil down to that? If we taught the system to do the autovacuum work early because it's a relatively good time for it from a system level point of view (e.g. 
it's going to be less disruptive right now), that would be useful and easy to justify on its own terms. But it would also tend to make the system much less vulnerable to undercounting dead tuples, since in practice there'd be a decent chance of getting to them early enough that it at least wasn't extremely painful any one time. It's much easier to understand that the system is quiescent than it is to understand bloat. BTW, I think that the insert-driven autovacuum stuff added to 13 has made the situation with bloat significantly better. Of course it wasn't really designed to do that at all, but it still has, kind of by accident, in roughly the same way that antiwraparound autovacuums help with bloat by accident. So maybe we should embrace "happy accidents" like that a bit more. It doesn't necessarily matter if we do the right thing for a reason that turns out to have not been the best reason. I'm certainly not opposed to it, despite my complaints about relying on age(relfrozenxid). > In pgstats: > (Various stats) Overall, what I like about your ideas here is the emphasis on bounding the worst case, and the emphasis on the picture at the page level over the tuple level. I'd like to use the visibility map more for stuff here, too. It is totally accurate about all-visible/all-frozen pages, so many of my complaints about statistics don't really apply. Or need not apply, at least. If 95% of a table's pages are all-frozen in the VM, then of course it's pretty unlikely to be the right time to VACUUM the table if it's to clean up bloat -- this is just about the most reliable information we have access to. I think that the only way that more stats can help is by allowing us to avoid doing completely the wrong thing more often. Just avoiding disaster is a great goal for us here. > > This sounds like a great argument in favor of suspend-and-resume as a > > way of handling autocancellation -- no useful work needs to be thrown > > away for AV to yield for a minute or two. > Hm, that seems a lot of work. Without having held a lock you don't even know > whether your old dead items still apply. Of course it'd improve the situation > substantially, if we could get it. I don't think it's all that much work, once the visibility map snapshot infrastructure is there. Why wouldn't your old dead items still apply? The TIDs must always reference LP_DEAD stubs. Those can only be set LP_UNUSED by VACUUM, and presumably VACUUM can only run in a way that either resumes the suspended VACUUM session, or discards it altogether. So they're not going to be invalidated during the period that a VACUUM is suspended, in any way. Even if CREATE INDEX runs against the table during a suspending VACUUM, we know that the existing LP_DEAD dead_items won't have been indexed, so they'll be safe to mark LP_UNUSED in any case. What am I leaving out? I can't think of anything. The only minor caveat is that we'd probably have to discard the progress from any individual ambulkdelete() call that happened to be running at the time that VACUUM was interrupted. > > Yeah, that's pretty bad. Maybe DROP TABLE and TRUNCATE should be > > special cases? Maybe they should always be able to auto cancel an > > autovacuum? > > Yea, I think so. It's not obvious how to best pass down that knowledge into > ProcSleep(). It'd have to be in the LOCALLOCK, I think. 
Looks like the best > way would be to change LockAcquireExtended() to get a flags argument instead > of reportMemoryError, and then we could add LOCK_ACQUIRE_INTENT_DROP & > LOCK_ACQUIRE_INTENT_TRUNCATE or such. Then the same for > RangeVarGetRelidExtended(). It already "customizes" how to lock based on RVR* > flags. It would be tricky, but still relatively straightforward compared to other things. It is often a TRUNCATE or a DROP TABLE, and we have nothing to lose and everything to gain by changing the rules for those. > ISTM that some of what you write below would be addressed, at least partially, > by the stats I proposed above. Particularly keeping some "page granularity" > instead of "tuple granularity" stats seems helpful. That definitely could be true, but I think that my main concern is that we completely rely on randomly sampled statistics (except with antiwraparound autovacuums, which happen on a schedule that has problems of its own). > > It's quite possible to get approximately the desired outcome with an > > algorithm that is completely wrong -- the way that we sometimes need > > autovacuum_freeze_max_age to deal with bloat is a great example of > > that. > > Yea. I think this is part of I like my idea about tracking more observations > made by the last vacuum - they're quite easy to get right, and they > self-correct, rather than potentially ending up causing ever-wronger stats. I definitely think that there is a place for that. It has the huge advantage of lessening our reliance on random sampling. > Right. I think it's fundamental that we get a lot better estimates about the > amount of work needed. Without that we have no chance of finishing autovacuums > before problems become too big. I like the emphasis on bounding the work required, so that it can be spread out, rather than trying to predict dead tuples. Again, we should focus on avoiding disaster. > And serialized vacuum state won't help, because that still requires > vacuum to scan all the !all-visible pages to discover them. Most of which > won't contain dead tuples in a lot of workloads. The main advantage of that model is that it decides what to do and when to do it based on the actual state of the table (or the state in the recent past). If we see a concentration of LP_DEAD items, then we can hurry up index vacuuming. If not, maybe we'll take our time. Again, less reliance on random sampling is a very good thing. More dynamic decisions that are made at runtime, and delayed for as long as possible, just seems much more promising than having better stats that are randomly sampled. > I don't feel like I have a good handle on what could work for 16 and what > couldn't. Personally I think something like autovacuum_no_auto_cancel_age > would be an improvement, but I also don't quite feel satisfied with it. I don't either; but it should be strictly less unsatisfactory. > Tracking the number of autovac failures seems uncontroverial and quite > beneficial, even if the rest doesn't make it in. It'd at least let users > monitor for tables where autovac is likely to swoop in in anti-wraparound > mode. I'll see what I can come up with. > Perhaps its worth to separately track the number of times a backend would have > liked to cancel autovac, but couldn't due to anti-wrap? If changes to the > no-auto-cancel behaviour don't make it in, it'd at least allow us to collect > more data about the prevalence of the problem and in what situations it > occurs? 
Even just adding some logging for that case seems like it'd be an > improvement. Hmm, maybe. > I think with a bit of polish "Add autovacuum trigger instrumentation." ought > to be quickly mergeable. Yeah, I'll try to get that part out of the way quickly. -- Peter Geoghegan
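To pin down what would have to be carried across a suspend and resume, here is a hypothetical sketch of the serialized state; every name is invented, and this is not part of any posted patch.

    /* Hypothetical on-disk resume state for one suspended autovacuum. */
    typedef struct SuspendedVacuumState
    {
        unsigned int oldest_xmin;    /* same OldestXmin as the interrupted run */
        long         scanned_pages;  /* heap pages already processed */
        long         total_pages;    /* heap pages when the scan began */
        long         num_dead_items; /* entries in the serialized TID array */
        /* Followed by num_dead_items (block, offset) pairs.  Each points at an
         * LP_DEAD stub that only VACUUM itself can set LP_UNUSED, which is why
         * the array stays valid while the worker is suspended. */
    } SuspendedVacuumState;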
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Andres Freund
Date:
Hi, On 2023-01-17 14:57:27 -0500, Robert Haas wrote: > > In pg_class: > > > > - The number of all frozen pages, like we do for all-visible > > > > That'd give us a decent upper bound for the amount of work we need to do to > > increase relfrozenxid. It's important that this is crash safe (thus no > > pg_stats), and it only needs to be written when we'd likely make other > > changes to the pg_class row anyway. > > I'm not sure how useful this is because a lot of the work is from > scanning the indexes. We can increase relfrozenxid before the index scans, unless we ran out of dead tuple space. We already have code for that in failsafe mode, in some way. But we really also should increase > > In pgstats: > > > > - The number of dead items, incremented both by the heap scan and > > opportunistic pruning > > > > This would let us estimate how urgently we need to clean up indexes. > > I don't think this is true because btree indexes are self-cleaning in > some scenarios and not in others. I mainly meant it from the angle of whether we need to clean up dead items in the heap to keep the table from bloating because we stop using those pages - which requires index scans. But even for the index scan portion, it'd give us a better bound than we have today. We probably should track the number of killed tuples in indexes. > > - The xid/mxid horizons during the last vacuum > > > > - The number of pages with tuples that couldn't be removed due to the horizon > > during the last vacuum > > > > - The number of pages with tuples that couldn't be frozen > > Not bad to know, but if the horizon we could use advances by 1, we > can't tell whether that allows pruning nothing additional or another > billion tuples. Sure. But it'd be a lot better than scanning it again and again when nothing has changed because something still holds back the horizon. We could improve upon it later by tracking the average or even bins of ages. > However, where XID age really falls down as a metric is that it > doesn't tell you what it's going to take to solve the problem. The > answer, at ten thousand feet, is always vacuum. But how long will that > vacuum run? We don't know. Do we need the XID horizon to advance > first, and if so, how far? We don't know. Do we need auto-cancellation > to be disabled? We don't know. That's where we get into a lot of > trouble here. Agreed. I think the metrics I proposed would help some, by at least providing sensible upper boundaries (for work) and minimal requirements (horizon during last vacuum). Greetings, Andres Freund
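A sketch of the upper bound the proposed all-frozen page count would provide, with hypothetical names; it only bounds the heap scan needed to advance relfrozenxid, not the index work:

    /* Hypothetical: worst-case number of heap pages a VACUUM that wants to
     * advance relfrozenxid would have to read, given a crash-safe count of
     * all-frozen pages maintained alongside relallvisible. */
    static long
    max_heap_pages_to_scan(long relpages, long relallfrozen)
    {
        long remaining = relpages - relallfrozen;

        return (remaining > 0) ? remaining : 0;
    }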
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Robert Haas
Date:
On Tue, Jan 17, 2023 at 3:08 PM Peter Geoghegan <pg@bowt.ie> wrote: > If you assume that there is chronic undercounting of dead tuples > (which I think is very common), ... Why do you think that? > How many dead heap-only tuples are equivalent to one LP_DEAD item? > What about page-level concentrations, and the implication for > line-pointer bloat? I don't have a good answer to any of these > questions myself. Seems a bit pessimistic. If we had unlimited resources and all operations were infinitely fast, the optimal strategy would be to vacuum after every insert, update, or delete. But in reality, that would be prohibitively expensive, so we're making a trade-off. Batching together cleanup for many table modifications reduces the amortized cost of cleaning up after one such operation very considerably. That's critical. But if we batch too much together, then the actual cleanup doesn't happen soon enough to keep us out of trouble. If we had an oracle that could provide us with perfect information, we'd ask it, among other things, how much work will be required to vacuum right now, and how much benefit would we get out of doing so. The dead tuple count is related to the first question. It's not a direct, linear relationship, but it's not completely unrelated, either. Maybe we could refine the estimates by gathering more or different statistics than we do now, but ultimately it's always going to be a trade-off between getting the work done sooner (and thus maybe preventing table growth or a wraparound shutdown) and being able to do more work at once (and thus being more efficient). The current system of counters predates HOT and the visibility map, so it's not surprising if it needs updating, but if you're arguing that the whole concept is just garbage, I think that's an overreaction. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Tue, Jan 17, 2023 at 2:11 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Jan 17, 2023 at 3:08 PM Peter Geoghegan <pg@bowt.ie> wrote: > > If you assume that there is chronic undercounting of dead tuples > > (which I think is very common), ... > > Why do you think that? For the reasons I gave about statistics, random sampling, the central limit theorem. All that stuff. This matches the experience of Andres. And is obviously the only explanation behind the reliance on antiwraparound autovacuums for cleaning up bloat in larger OLTP databases. It just fits: the dead tuples approach can sometimes be so completely wrong that even an alternative triggering condition based on something that is virtually unrelated to the thing we actually care about can do much better in practice. Consistently, reliably, for a given table/workload. > > How many dead heap-only tuples are equivalent to one LP_DEAD item? > > What about page-level concentrations, and the implication for > > line-pointer bloat? I don't have a good answer to any of these > > questions myself. > > Seems a bit pessimistic. If we had unlimited resources and all > operations were infinitely fast, the optimal strategy would be to > vacuum after every insert, update, or delete. But in reality, that > would be prohibitively expensive, so we're making a trade-off. To a large degree, that's my point. I don't know how to apply this information, so having detailed information doesn't seem like the main problem. > If we had an oracle that could provide us with perfect information, > we'd ask it, among other things, how much work will be required to > vacuum right now, and how much benefit would we get out of doing so. And then what would we do? What about costs? Even if we were omniscient, we still wouldn't be omnipotent. We're still subject to the laws of physics. VACUUM would still be something that more or less works at the level of the whole table, or not at all. So being omniscient seems kinda overrated to me. Adding more information does not in general lead to better outcomes. > The dead tuple count is related to the first question. It's not a > direct, linear relationship, but it's not completely unrelated, > either. Maybe we could refine the estimates by gathering more or > different statistics than we do now, but ultimately it's always going > to be a trade-off between getting the work done sooner (and thus maybe > preventing table growth or a wraparound shutdown) and being able to do > more work at once (and thus being more efficient). The current system > set of counters predates HOT and the visibility map, so it's not > surprising if needs updating, but if you're argue that the whole > concept is just garbage, I think that's an overreaction. What I'm arguing is that principally relying on any one thing is garbage. If you have only one thing that creates pressure to VACUUM then there can be a big impact whenever it turns out to be completely wrong. Whereas if VACUUM can run because of (say) 3 moderate signals taken together, then it's much less likely that we'll be completely wrong. In general my emphasis is on avoiding disaster in all its forms. Vacuuming somewhat early more often is perhaps suboptimal, but far from a disaster. It's the kind of thing that we can manage. By all means, let's make the dead tuples/dead items stuff less naive (e.g. make it distinguish between LP_DEAD items and dead heap-only tuples). But even then, we shouldn't continue to completely rely on it in the way that we do right now. 
In other words, I'm fine with adding more information that is more accurate, as long as we treat it as somewhat suspect, and certainly not as something to rely on completely if at all possible. In particular, we need to think about both costs and benefits at all times. -- Peter Geoghegan
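One way to read "three moderate signals taken together" is a simple voting rule; the thresholds below are made up purely for illustration.

    #include <stdbool.h>

    /* Hypothetical: trigger autovacuum when enough individually moderate,
     * largely independent signals agree, instead of waiting for any single
     * signal to reach an extreme value. */
    static bool
    should_autovacuum(double dead_items_fraction,      /* estimated dead items / total items */
                      double not_all_visible_fraction, /* pages not all-visible per the VM */
                      double freeze_age_fraction)      /* age(relfrozenxid) / autovacuum_freeze_max_age */
    {
        int votes = 0;

        votes += (dead_items_fraction > 0.05);
        votes += (not_all_visible_fraction > 0.20);
        votes += (freeze_age_fraction > 0.50);

        return votes >= 2;
    }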
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Robert Haas
Date:
On Tue, Jan 17, 2023 at 5:56 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Why do you think that? > > For the reasons I gave about statistics, random sampling, the central > limit theorem. All that stuff. This matches the experience of Andres. > And is obviously the only explanation behind the reliance on > antiwraparound autovacuums for cleaning up bloat in larger OLTP > databases. It just fits: the dead tuples approach can sometimes be so > completely wrong that even an alternative triggering condition based > on something that is virtually unrelated to the thing we actually care > about can do much better in practice. Consistently, reliably, for a > given table/workload. Hmm, I don't know. I have no intuition one way or the other for whether we're undercounting dead tuples, and I don't understand what would cause us to do that. I thought that we tracked that accurately, as part of the statistics system, not by sampling (pg_stat_all_tables.n_dead_tup). But, I think there are a number of other explanations for why we tend to rely on antiwraparound vacuums more than we should. Auto-cancellation. Skipping tables that are locked, or pages that are pinned. A cost limit that is too low relative to the size of the database, so that eventually all tables are in wraparound danger all the time. The fact that we can vacuum tables uselessly, without accomplishing anything, because the XID horizon is too new, but we don't know that so we just try to vacuum anyway. And then we repeat that useless work in an infinite loop. The fact that the system's idea of when a vacuum needs to happen grows with autovacuum_vacuum_scale_factor, but that actually gets too big too fast, so that eventually it never triggers vacuuming at all, or at least not before XID age does. I think we ought to fire autovacuum_vacuum_scale_factor out of an airlock. It's not the right model, and I think many people have been aware that it's not the right model for a decade, and we haven't been creative enough to come up with anything better. We *know* that you have to lower this value for large tables or they just don't get vacuumed often enough. That means we have some idea how often they ought to be vacuumed. I'm sure I'm not the person who has the best intuition on that point, but I bet people who have been responsible for large production systems have some decent ideas in that area. We should find out what that intuition is and come up with a new formula that matches the intuition of people with experience doing this sort of thing. e.g. 1. When computing autovacuum_vacuum_threshold + table_size * autovacuum_vacuum_scale_factor, if the result exceeds the value of a new parameter autovacuum_vacuum_maximum_threshold, then clamp the result to that value. 2. When computing autovacuum_vacuum_threshold + table_size * autovacuum_vacuum_scale_factor, if the result exceeds 80% of the number of dead TIDs we could store, clamp it to that number. 3. Change the formula altogether to use a square root or a cube root or a logarithm somewhere. I think we also ought to invent some sort of better cost limit system that doesn't shoot you in the foot automatically as the database grows. Nobody actually wants to limit the rate at which the database vacuums stuff to a constant. What they really want to do is limit it to a rate that is somewhat faster than the minimum rate needed to avoid disaster. We should try to develop metrics for whether vacuum is keeping up. 
I think one good one would be duty cycle -- if we have N vacuum workers, then over a period of K seconds we could have done as much as N*K process-seconds of vacuum work, and as little as 0. So figure out how many seconds of vacuum work we actually did, and divide that by N*K to get a percentage. If it's over, say, 90%, then we are not keeping up. We should dynamically raise the cost limit until we do. And drop it back down later when things look better. I don't actually see any reason why dead tuples, even counted in a relatively stupid way, isn't fundamentally good enough to get all tables vacuumed before we hit the XID age cutoff. It doesn't actually do that right now, but I feel like that must be because we're doing other stupid things, not because there's anything that terrible about the metric as such. Maybe that's wrong, but I find it hard to imagine. If I imagine a world where vacuum always gets started when the number of dead tuples hits some reasonable bound (rather than the unreasonable bound that the scale factor stuff computes) and it always cleans up those dead tuples (instead of doing a lot of work to clean up nothing at all, or doing a lot of work to clean up only a small fraction of those dead tuples, or cancelling itself, or skipping the table that has the problem because it's locked, or running with an unreasonably low cost limit, or otherwise being unable to GET THE JOB DONE) then how do we ever reach autovacuum_freeze_max_age? I think it would still be possible, but only if the XID consumption rate of the server is so high that we chunk through 300 million XIDs in the time it takes to perform an un-throttled vacuum of the table. I think that's a real threat and will probably be a bigger one in ten years, but it's only one of many things that are going wrong right now. > Even if we were omniscient, we still wouldn't be omnipotent. A sound theological point! > We're > still subject to the laws of physics. VACUUM would still be something > that more or less works at the level of the whole table, or not at > all. So being omniscient seems kinda overrated to me. Adding more > information does not in general lead to better outcomes. Yeah, I think that's true. In particular, it's not much use being omniscient but stupid. It would be better to have limited information and be smart about what you did with it. > What I'm arguing is that principally relying on any one thing is > garbage. If you have only one thing that creates pressure to VACUUM > then there can be a big impact whenever it turns out to be completely > wrong. Whereas if VACUUM can run because of (say) 3 moderate signals > taken together, then it's much less likely that we'll be completely > wrong. In general my emphasis is on avoiding disaster in all its > forms. Vacuuming somewhat early more often is perhaps suboptimal, but > far from a disaster. It's the kind of thing that we can manage. True, although it can be overdone. An extra vacuum on a big table with some large indexes that end up getting scanned can be very expensive even if the table itself is almost entirely all-visible. We can't afford to make too many mistakes in the direction of vacuuming early in such cases. -- Robert Haas EDB: http://www.enterprisedb.com
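Written out as code, the clamping idea (option 1) and the duty-cycle metric look roughly like this; the proposed autovacuum_vacuum_maximum_threshold GUC and the 90% figure come from the message above, while the function names and structure are illustrative only.

    #include <stdbool.h>

    /* Option 1 above: clamp the usual threshold formula to a new cap. */
    static double
    vacuum_trigger_threshold(double vacuum_threshold,   /* autovacuum_vacuum_threshold */
                             double scale_factor,       /* autovacuum_vacuum_scale_factor */
                             double reltuples,
                             double maximum_threshold)  /* proposed autovacuum_vacuum_maximum_threshold */
    {
        double t = vacuum_threshold + scale_factor * reltuples;

        return (t < maximum_threshold) ? t : maximum_threshold;
    }

    /* Duty-cycle metric: with nworkers workers over interval_secs seconds,
     * at most nworkers * interval_secs worker-seconds of vacuuming were
     * possible; if we used more than ~90% of that, we're not keeping up and
     * the cost limit should be raised (and lowered again later). */
    static bool
    vacuum_not_keeping_up(double busy_worker_seconds, int nworkers, double interval_secs)
    {
        double duty_cycle = busy_worker_seconds / (nworkers * interval_secs);

        return duty_cycle > 0.90;
    }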
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
. On Wed, Jan 18, 2023 at 7:54 AM Robert Haas <robertmhaas@gmail.com> wrote: > > It just fits: the dead tuples approach can sometimes be so > > completely wrong that even an alternative triggering condition based > > on something that is virtually unrelated to the thing we actually care > > about can do much better in practice. Consistently, reliably, for a > > given table/workload. > > Hmm, I don't know. I have no intuition one way or the other for > whether we're undercounting dead tuples, and I don't understand what > would cause us to do that. I thought that we tracked that accurately, > as part of the statistics system, not by sampling > (pg_stat_all_tables.n_dead_tup). It's both, kind of. pgstat_report_analyze() will totally override the tabentry->dead_tuples information that drives autovacuum.c, based on an estimate derived from a random sample -- which seems to me to be an approach that just doesn't have any sound theoretical basis. So while there is a sense in which we track dead tuples incrementally and accurately using the statistics system, we occasionally call pgstat_report_analyze (and pgstat_report_vacuum) like this, so AFAICT we might as well not even bother tracking things reliably the rest of the time. Random sampling works because the things that you don't sample are very well represented by the things that you do sample. That's why even very stale optimizer statistics can work quite well (and why the EAV anti-pattern makes query optimization impossible) -- the distribution is often fixed, more or less. The statistics generalize very well because the data meets certain underlying assumptions that all data stored in a relational database is theoretically supposed to meet. Whereas with dead tuples, the whole point is to observe and count dead tuples so that autovacuum can then go remove the dead tuples -- which then utterly changes the situation! That's a huge difference. ISTM that you need a *totally* different approach for something that's fundamentally dynamic, which is what this really is. Think about how the random sampling will work in a very large table with concentrated updates. The modified pages need to outweigh the large majority of pages in the table that can be skipped by VACUUM anyway. I wonder how workable it would be to just teach pgstat_report_analyze and pgstat_report_vacuum to keep out of this, or to not update the stats unless it's to increase the number of dead_tuples... > I think we ought to fire autovacuum_vacuum_scale_factor out of an > airlock. Couldn't agree more. I think that this and the underlying statistics are the really big problem as far as under-vacuuming is concerned. > I think we also ought to invent some sort of better cost limit system > that doesn't shoot you in the foot automatically as the database > grows. Nobody actually wants to limit the rate at which the database > vacuums stuff to a constant. What they really want to do is limit it > to a rate that is somewhat faster than the minimum rate needed to > avoid disaster. We should try to develop metrics for whether vacuum is > keeping up. Definitely agree that doing some kind of dynamic updating is promising. What we thought at the start versus what actually happened. Something cyclic, just like autovacuum itself. > I don't actually see any reason why dead tuples, even counted in a > relatively stupid way, isn't fundamentally good enough to get all > tables vacuumed before we hit the XID age cutoff. 
It doesn't actually > do that right now, but I feel like that must be because we're doing > other stupid things, not because there's anything that terrible about > the metric as such. Maybe that's wrong, but I find it hard to imagine. On reflection, maybe you're right here. Maybe it's true that the bigger problem is just that the implementation is bad, even on its own terms -- since it's pretty bad! Hard to say at this point. Depends on how you define it, too. Statistical sampling is just not fit for purpose here. But is that a problem with autovacuum_vacuum_scale_factor? I may have said words that could reasonably be interpreted that way, but I'm not prepared to blame it on the underlying autovacuum_vacuum_scale_factor model now. It's fuzzy. > > We're > > still subject to the laws of physics. VACUUM would still be something > > that more or less works at the level of the whole table, or not at > > all. So being omniscient seems kinda overrated to me. Adding more > > information does not in general lead to better outcomes. > > Yeah, I think that's true. In particular, it's not much use being > omniscient but stupid. It would be better to have limited information > and be smart about what you did with it. I would put it like this: autovacuum shouldn't ever be a sucker. It should pay attention to disconfirmatory signals. The information that drives its decision making process should be treated as provisional. Even if the information was correct at one point, the contents of the table are constantly changing in a way that could matter enormously. So we should be paying attention to where the table is going -- and even where it might be going -- not just where it is, or was. > True, although it can be overdone. An extra vacuum on a big table with > some large indexes that end up getting scanned can be very expensive > even if the table itself is almost entirely all-visible. We can't > afford to make too many mistakes in the direction of vacuuming early > in such cases. No, but we can afford to make some -- and can detect when it happened after the fact. I would rather err on the side of over-vacuuming, especially if the system is smart enough to self-correct when that turns out to be the wrong approach. One of the advantages of running VACUUM sooner is that it provides us with relatively reliable information about the needs of the table. We can also cheat, sort of. If we find another justification for autovacuuming (e.g., it's a quiet time for the system as a whole), and it works out to help with this other problem, it may be just as good for users. -- Peter Geoghegan
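As a sketch of the "only let ANALYZE increase dead_tuples" idea floated above (hypothetical and standalone; real pgstat code looks nothing like this):

    /* Hypothetical: instead of letting the ANALYZE-derived estimate overwrite
     * the incrementally tracked counter, only accept it when it would raise
     * the counter; a concentrated-update region that the random sample missed
     * then can't wipe out what we already knew. */
    static double
    merge_analyze_dead_tuples(double tracked_dead_tuples, double sampled_dead_tuples)
    {
        return (sampled_dead_tuples > tracked_dead_tuples)
            ? sampled_dead_tuples
            : tracked_dead_tuples;
    }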
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Robert Haas
Date:
On Wed, Jan 18, 2023 at 1:31 PM Peter Geoghegan <pg@bowt.ie> wrote: > pgstat_report_analyze() will totally override the > tabentry->dead_tuples information that drives autovacuum.c, based on > an estimate derived from a random sample -- which seems to me to be an > approach that just doesn't have any sound theoretical basis. Yikes. I think we don't have a choice but to have a method to correct the information somehow, because AFAIK the statistics system is not crash-safe. But that approach does seem to carry significant risk of overwriting correct information with wrong information. > On reflection, maybe you're right here. Maybe it's true that the > bigger problem is just that the implementation is bad, even on its own > terms -- since it's pretty bad! Hard to say at this point. > > Depends on how you define it, too. Statistically sampling is just not > fit for purpose here. But is that a problem with > autovacuum_vacuum_scale_factor? I may have said words that could > reasonably be interpreted that way, but I'm not prepared to blame it > on the underlying autovacuum_vacuum_scale_factor model now. It's > fuzzy. Yep. I think what we should try to evaluate is which number is furthest from the truth. My guess is that the threshold is so high relative to what a reasonable value would be that you can't get any benefit out of making the dead tuple count more accurate. Like, if the threshold is 100x too high, or something, then who cares how accurate the dead tuples number is? It's going to be insufficient to trigger vacuuming whether it's right or wrong. We should try substituting a less-bogus threshold calculation and see what happens then. An alternative theory is that the threshold is fine and we're only failing to reach it because the dead tuple calculation is so inaccurate. Maybe that's even true in some scenarios, but I bet that it's never the issue when people have really big tables. The fact that I'm OK with 10MB of bloat in my 100MB table doesn't mean I'm OK with 1TB of bloat in my 10TB table. Among other problems, I can't even vacuum away that much bloat in one index pass, because autovacuum can't use enough work memory for that. Also, the absolute space wastage matters. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Wed, Jan 18, 2023 at 11:02 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Jan 18, 2023 at 1:31 PM Peter Geoghegan <pg@bowt.ie> wrote: > > pgstat_report_analyze() will totally override the > > tabentry->dead_tuples information that drives autovacuum.c, based on > > an estimate derived from a random sample -- which seems to me to be an > > approach that just doesn't have any sound theoretical basis. > > Yikes. I think we don't have a choice but to have a method to correct > the information somehow, because AFAIK the statistics system is not > crash-safe. But that approach does seem to carry significant risk of > overwriting correct information with wrong information. This situation is really quite awful, so maybe we should do something about it soon, in the scope of the Postgres 16 work on autovacuum that is already underway. In fact I think that the problem here is so bad that even something slightly less naive would be far more effective. You're right to point out that pgstat_report_analyze needs to update the stats in case there is a hard crash, of course. But there is plenty of context with which to make better decisions close at hand. For example, I bet that pgstat_report_analyze already does a pretty good job of estimating live_tuples -- my spiel about statistics mostly doesn't apply to live_tuples. Suppose that we notice that its new estimate for live_tuples approximately matches what the stats subsystem already thought about live_tuples, while dead_tuples is far far lower. We shouldn't be so credulous as to believe the new dead_tuples estimate at that point. Perhaps we can change nothing about dead_tuples at all when this happens. Or perhaps we can set dead_tuples to a value that is scaled from the old estimate. The new dead_tuples value could be derived by taking the ratio of the old live_tuples to the old dead_tuples, and then using that to scale from the new live_tuples. This is just a first pass, to see what you and others think. Even very simple heuristics seem like they could make things much better. Another angle of attack is the PD_ALL_VISIBLE page-level bit, which acquire_sample_rows() could pay attention to -- especially in larger tables, where the difference between all pages and just the all-visible subset of pages is most likely to matter. The more sampled pages that had PD_ALL_VISIBLE set, the less credible the new dead_tuples estimate will be (relative to existing information), and so pgstat_report_analyze() should prefer the new estimate over the old one in proportion to that. We probably shouldn't change anything about pgstat_report_vacuum as part of this effort to make pgstat_report_analyze less terrible in the near term. It certainly has its problems (what is true for pages that VACUUM scanned at the end of VACUUM is far from representative for new pages!), it's probably much less of a contributor to issues like those that Andres reports seeing. BTW, one of the nice things about the insert-driven autovacuum stats is that pgstat_report_analyze doesn't have an opinion about how many tuples were inserted since the last VACUUM ran. It does have other problems, but they seem less serious to me. > Yep. I think what we should try to evaluate is which number is > furthest from the truth. My guess is that the threshold is so high > relative to what a reasonable value would be that you can't get any > benefit out of making the dead tuple count more accurate. 
Like, if the > threshold is 100x too high, or something, then who cares how accurate > the dead tuples number is? Right. Or if we don't make any reasonable distinction between LP_DEAD items and dead heap-only tuples, then the total number of both things together may matter very little. Better to be approximately correct than exactly wrong. Deliberately introducing a bias to lower the variance is a perfectly valid approach. > Maybe that's even true in some scenarios, but I bet that > it's never the issue when people have really big tables. The fact that > I'm OK with 10MB of bloat in my 100MB table doesn't mean I'm OK with > 1TB of bloat in my 10TB table. Among other problems, I can't even > vacuum away that much bloat in one index pass, because autovacuum > can't use enough work memory for that. Also, the absolute space > wastage matters. I certainly agree with all that. FWIW, part of my mental model with VACUUM is that the rules kind of change in the case of a big table. We're far more vulnerable to issues such as (say) waiting for cleanup locks because the overall cadence used by autovacuum is so infrequent relative to everything else. There are more opportunities for things to go wrong, worse consequences when they do go wrong, and greater potential for the problems to compound. -- Peter Geoghegan
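[To make the live_tuples/dead_tuples scaling idea from earlier in this message concrete, a function-level sketch; the struct, the 10% and 50% cutoffs, and the function name are all illustrative assumptions, not anything proposed as-is:]

#include <math.h>
#include <stdbool.h>

typedef struct
{
    double live_tuples;
    double dead_tuples;
} TabStatsSketch;

/*
 * Decide what dead_tuples value to keep after ANALYZE reports new
 * estimates.  If ANALYZE's live_tuples roughly agrees with what we
 * already believed, but its dead_tuples is dramatically lower, don't
 * take the sampled dead_tuples at face value -- scale the old
 * dead/live ratio onto the new live_tuples estimate instead.
 */
static double
analyze_dead_tuples_to_use(const TabStatsSketch *old, double new_live, double new_dead)
{
    bool live_agrees = old->live_tuples > 0 &&
        fabs(new_live - old->live_tuples) < 0.10 * old->live_tuples;
    bool dead_collapsed = new_dead < 0.50 * old->dead_tuples;

    if (live_agrees && dead_collapsed)
    {
        double old_ratio = old->dead_tuples / old->live_tuples;

        return old_ratio * new_live;    /* keep the old ratio, rescaled */
    }

    return new_dead;                    /* otherwise take ANALYZE at its word */
}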
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Andres Freund
Date:
Hi, On 2023-01-17 12:08:01 -0800, Peter Geoghegan wrote: > > I think that's not the fault of relfrozenxid as a trigger, but that we simply > > don't keep enough other stats. We should imo at least keep track of: > > If you assume that there is chronic undercounting of dead tuples > (which I think is very common), then of course anything that triggers > vacuuming is going to help with that problem -- it might be totally > inadequate, but still make the critical difference by not allowing the > system to become completely destabilized. I absolutely accept that > users that are relying on that exist, and that those users ought to > not have things get even worse -- I'm pragmatic. But overall, what we > should be doing is fixing the real problem, which is that the dead > tuples accounting is deeply flawed. Actually, it's not just that the > statistics are flat out wrong; the whole model is flat-out wrong. I think that depends on what "whole model" encompasses... > The assumptions that work well for optimizer statistics quite simply > do not apply here. Random sampling for this is just wrong, because > we're not dealing with something that follows a distribution that can > be characterized with a sufficiently large sample. With optimizer > statistics, the entire contents of the table is itself a sample taken > from the wider world -- so even very stale statistics can work quite > well (assuming that the schema is well normalized). Whereas the > autovacuum dead tuples stuff is characterized by constant change. I > mean of course it is -- that's the whole point! The central limit > theorem obviously just doesn't work for something like this -- we > cannot generalize from a sample, at all. If we were to stop dropping stats after crashes, I think we likely could afford to stop messing with dead tuple stats during analyze. Right now it's valuable to some degree because it's a way to reasonably quickly recover from lost stats. The main way to collect inserted / dead tuple info for autovacuum's benefit is via the stats collected when making changes. We probably ought to simply not update dead tuples after analyze if the stats entry has information about a prior [auto]vacuum. Or at least split the fields. > How many dead heap-only tuples are equivalent to one LP_DEAD item? > What about page-level concentrations, and the implication for > line-pointer bloat? I don't have a good answer to any of these > questions myself. And I have my doubts that there are *any* good > answers. Hence my suggestion to track several of these via page level stats. In the big picture it doesn't really matter that much whether there's 10 or 100 (dead tuples|items) on a page that needs to be removed. It matters that the page needs to be processed. > Even these questions are the wrong questions (they're just less wrong). I don't agree. Nothing is going to be perfect, but you're not going to be able to do sensible vacuum scheduling without some stats, and it's fine if those are an approximation, as long as the approximation makes some sense. > I'd like to use the visibility map more for stuff here, too. It is > totally accurate about all-visible/all-frozen pages, so many of my > complaints about statistics don't really apply. Or need not apply, at > least. If 95% of a table's pages are all-frozen in the VM, then of > course it's pretty unlikely to be the right time to VACUUM the table > if it's to clean up bloat -- this is just about the most reliable > information we have access to.
I think querying that from stats is too expensive for most things. I suggested tracking all-frozen in pg_class. Perhaps we should also track when pages are *removed* from the VM in pgstats, I don't think we do today. That should give a decent picture? > > > This sounds like a great argument in favor of suspend-and-resume as a > > > way of handling autocancellation -- no useful work needs to be thrown > > > away for AV to yield for a minute or two. > > > Hm, that seems a lot of work. Without having held a lock you don't even know > > whether your old dead items still apply. Of course it'd improve the situation > > substantially, if we could get it. > > I don't think it's all that much work, once the visibility map > snapshot infrastructure is there. > > Why wouldn't your old dead items still apply? Well, for one the table could have been rewritten. Of course we can add the code to deal with that, but it is definitely something to be aware of. There might also be some oddities around indexes getting added / removed. > > > Yeah, that's pretty bad. Maybe DROP TABLE and TRUNCATE should be > > > special cases? Maybe they should always be able to auto cancel an > > > autovacuum? > > > > Yea, I think so. It's not obvious how to best pass down that knowledge into > > ProcSleep(). It'd have to be in the LOCALLOCK, I think. Looks like the best > > way would be to change LockAcquireExtended() to get a flags argument instead > > of reportMemoryError, and then we could add LOCK_ACQUIRE_INTENT_DROP & > > LOCK_ACQUIRE_INTENT_TRUNCATE or such. Then the same for > > RangeVarGetRelidExtended(). It already "customizes" how to lock based on RVR* > > flags. > > It would be tricky, but still relatively straightforward compared to > other things. It is often a TRUNCATE or a DROP TABLE, and we have > nothing to lose and everything to gain by changing the rules for > those. Probably should also change the rules for VACUUM and VACUUM FULL / CLUSTER, if we do it. Manual VACUUM will often be faster due to the cost limits, and VACUUM FULL can be *considerably* faster than VACUUM once you hit bad bloat. Greetings, Andres Freund
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Robert Haas
Date:
On Wed, Jan 18, 2023 at 3:15 PM Peter Geoghegan <pg@bowt.ie> wrote: > Suppose that we notice that its new > estimate for live_tuples approximately matches what the stats > subsystem already thought about live_tuples, while dead_tuples is far > far lower. We shouldn't be so credulous as to believe the new > dead_tuples estimate at that point. > > Another angle of attack is the PD_ALL_VISIBLE page-level bit, which > acquire_sample_rows() could pay attention to -- especially in larger > tables, where the difference between all pages and just the > all-visible subset of pages is most likely to matter. The more sampled > pages that had PD_ALL_VISIBLE set, the less credible the new > dead_tuples estimate will be (relative to existing information), and > so pgstat_report_analyze() should prefer the new estimate over the old > one in proportion to that. I don't know enough about the specifics of how this works to have an intelligent opinion about how likely these particular ideas are to work out. However, I think it's risky to look at estimates and try to infer whether they are reliable. It's too easy to be wrong. What we really want to do is anchor our estimates to some data source that we know we can trust absolutely. If you trust possibly-bad data less, it screws up your estimates more slowly, but it still screws them up. If Andres is correct that what really matters is the number of pages we're going to have to dirty, we could abandon counting dead tuples altogether and just count not-all-visible pages in the VM map. That would be cheap enough to recompute periodically. However, it would also be a big behavior change from the way we do things now, so I'm not sure it's a good idea. Still, a quantity that we can be certain we're measuring accurately is better than one we can't measure accurately even if it's a somewhat worse proxy for the thing we really care about. There's a ton of value in not being completely and totally wrong. > FWIW, part of my mental model with VACUUM is that the rules kind of > change in the case of a big table. We're far more vulnerable to issues > such as (say) waiting for cleanup locks because the overall cadence > used by autovacuum is so infrequently relative to everything else. > There are more opportunities for things to go wrong, worse > consequences when they do go wrong, and greater potential for the > problems to compound. Yes. A lot of parts of PostgreSQL, including this one, were developed a long time ago when PostgreSQL databases were a lot smaller than they often are today. -- Robert Haas EDB: http://www.enterprisedb.com
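[One way to picture the page-based trigger floated above, as a function-level sketch; the struct, the source of the page counts, and the 5% cutoff are assumptions for illustration only:]

#include <stdbool.h>
#include <stdint.h>

typedef struct
{
    uint64_t rel_pages;            /* total heap pages in the table */
    uint64_t all_visible_pages;    /* pages marked all-visible in the VM */
} RelVMStatsSketch;

/*
 * Instead of counting dead tuples, look at how many heap pages are not
 * all-visible according to the visibility map -- a rough proxy for how
 * many pages VACUUM would have to visit (and likely dirty).
 */
static bool
needs_vacuum_by_pages(const RelVMStatsSketch *s, double page_scale_factor)
{
    uint64_t not_all_visible = s->rel_pages - s->all_visible_pages;

    /* e.g. trigger once 5% of the table is not all-visible */
    return not_all_visible > (uint64_t) (page_scale_factor * s->rel_pages);
}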
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Wed, Jan 18, 2023 at 12:44 PM Robert Haas <robertmhaas@gmail.com> wrote: > I don't know enough about the specifics of how this works to have an > intelligent opinion about how likely these particular ideas are to > work out. However, I think it's risky to look at estimates and try to > infer whether they are reliable. It's too easy to be wrong. What we > really want to do is anchor our estimates to some data source that we > know we can trust absolutely. If you trust possibly-bad data less, it > screws up your estimates more slowly, but it still screws them up. Some of what I'm proposing arguably amounts to deliberately adding a bias. But that's not an unreasonable thing in itself. I think of it as related to the bias-variance tradeoff, which is a concept that comes up a lot in machine learning and statistical inference. We can afford to be quite imprecise at times, especially if we choose a bias that we know has much less potential to do us harm -- some mistakes hurt much more than others. We cannot afford to ever be dramatically wrong, though -- especially in the direction of vacuuming less often. Besides, there is something that we *can* place a relatively high degree of trust in that will still be in the loop here: VACUUM itself. If VACUUM runs then it'll call pgstat_report_vacuum(), which will set the record straight in the event of over estimating dead tuples. To some degree the problem of over estimating dead tuples is self-limiting. > If Andres is correct that what really matter is the number of pages > we're going to have to dirty, we could abandon counting dead tuples > altogether and just count not-all-visible pages in the VM map. That's what matters most from a cost point of view IMV. So it's a big part of the overall picture, but not everything. It tells us relatively little about the benefits, except perhaps when most pages are all-visible. -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Andres Freund
Date:
Hi, On 2023-01-18 12:15:17 -0800, Peter Geoghegan wrote: > On Wed, Jan 18, 2023 at 11:02 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Jan 18, 2023 at 1:31 PM Peter Geoghegan <pg@bowt.ie> wrote: > > > pgstat_report_analyze() will totally override the > > > tabentry->dead_tuples information that drives autovacuum.c, based on > > > an estimate derived from a random sample -- which seems to me to be an > > > approach that just doesn't have any sound theoretical basis. > > > > Yikes. I think we don't have a choice but to have a method to correct > > the information somehow, because AFAIK the statistics system is not > > crash-safe. But that approach does seem to carry significant risk of > > overwriting correct information with wrong information. I suggested nearby to only have ANALYZE update dead_tuples if there's been no [auto]vacuum since the stats entry was created. That allows recovering from stats resets, be it via crashes or explicitly. What do you think? To add insult to injury, we overwrite accurate information gathered by VACUUM with bad information gathered by ANALYZE if you do VACUUM ANALYZE. One complicating factor is that VACUUM sometimes computes an incrementally more bogus n_live_tup when it skips pages due to the VM, whereas ANALYZE computes something sane. I unintentionally encountered one when I was trying something while writing this email, reproducer attached.
VACUUM (DISABLE_PAGE_SKIPPING) foo;
SELECT n_live_tup, n_dead_tup FROM pg_stat_user_tables WHERE relid = 'foo'::regclass;
┌────────────┬────────────┐
│ n_live_tup │ n_dead_tup │
├────────────┼────────────┤
│    9000001 │     500000 │
└────────────┴────────────┘
after one VACUUM:
┌────────────┬────────────┐
│ n_live_tup │ n_dead_tup │
├────────────┼────────────┤
│    8549905 │     500000 │
└────────────┴────────────┘
after 9 more VACUUMs:
┌────────────┬────────────┐
│ n_live_tup │ n_dead_tup │
├────────────┼────────────┤
│    5388421 │     500000 │
└────────────┴────────────┘
(1 row)
I briefly tried it out, and it does *not* reproduce in 11, but does in 12. Haven't dug into what the cause is, but we probably use the wrong denominator somewhere... Greetings, Andres Freund
Attachment
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Wed, Jan 18, 2023 at 1:02 PM Peter Geoghegan <pg@bowt.ie> wrote: > Some of what I'm proposing arguably amounts to deliberately adding a > bias. But that's not an unreasonable thing in itself. I think of it as > related to the bias-variance tradeoff, which is a concept that comes > up a lot in machine learning and statistical inference. To be clear, I was thinking of unreservedly trusting what pgstat_report_analyze() says about dead_tuples in the event of its estimate increasing our dead_tuples estimate, while being skeptical (to a varying degree) when it's the other way around. But now I need to go think about what Andres just brought up... -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Andres Freund
Date:
Hi, On 2023-01-18 13:08:44 -0800, Andres Freund wrote: > One complicating factor is that VACUUM sometimes computes an incrementally > more bogus n_live_tup when it skips pages due to the VM, whereas ANALYZE > computes something sane. I unintentionally encountered one when I was trying > something while writing this email, reproducer attached. > > > VACUUM (DISABLE_PAGE_SKIPPING) foo; > SELECT n_live_tup, n_dead_tup FROM pg_stat_user_tables WHERE relid = 'foo'::regclass; > ┌────────────┬────────────┐ > │ n_live_tup │ n_dead_tup │ > ├────────────┼────────────┤ > │ 9000001 │ 500000 │ > └────────────┴────────────┘ > > after one VACUUM: > ┌────────────┬────────────┐ > │ n_live_tup │ n_dead_tup │ > ├────────────┼────────────┤ > │ 8549905 │ 500000 │ > └────────────┴────────────┘ > > after 9 more VACUUMs: > ┌────────────┬────────────┐ > │ n_live_tup │ n_dead_tup │ > ├────────────┼────────────┤ > │ 5388421 │ 500000 │ > └────────────┴────────────┘ > (1 row) > > > I briefly tried it out, and it does *not* reproduce in 11, but does in > 12. Haven't dug into what the cause is, but we probably use the wrong > denominator somewhere... Oh, it does actually reproduce in 11 too - my script just didn't see it because it was "too fast". For some reason < 12 it takes longer for the new pgstat snapshot to be available. If I add a few sleeps, it shows in 11. The real point of change appears to be 10->11. There's a relevant looking difference in the vac_estimate_reltuples call: 10: /* now we can compute the new value for pg_class.reltuples */ vacrelstats->new_rel_tuples = vac_estimate_reltuples(onerel, false, nblocks, vacrelstats->tupcount_pages, num_tuples); 11: /* now we can compute the new value for pg_class.reltuples */ vacrelstats->new_live_tuples = vac_estimate_reltuples(onerel, nblocks, vacrelstats->tupcount_pages, live_tuples); which points to: commit 7c91a0364fcf5d739a09cc87e7adb1d4a33ed112 Author: Tom Lane <tgl@sss.pgh.pa.us> Date: 2018-03-22 15:47:29 -0400 Sync up our various ways of estimating pg_class.reltuples. Greetings, Andres Freund
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Wed, Jan 18, 2023 at 1:08 PM Andres Freund <andres@anarazel.de> wrote: > I suggested nearby to only have ANALYZE update dead_tuples if there's been no > [auto]vacuum since the stats entry was created. That allows recovering from > stats resets, be it via crashes or explicitly. What do you think? I like that idea. It's far simpler than the kind of stuff I was thinking about, and probably just as effective. Even if it introduces some unforeseen problem (which seems unlikely), we can still rely on pgstat_report_vacuum() to set things straight before too long. Are you planning on writing a patch for this? I'd be very interested in seeing this through. Could definitely review it. -- Peter Geoghegan
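[For readers following along, a minimal sketch of what the guard discussed above could look like, in the spirit of pgstat_report_analyze(); the struct and field names below are stand-ins, not the real PgStat_StatTabEntry definition or the actual patch:]

#include <stdint.h>

typedef struct
{
    int64_t live_tuples;
    int64_t dead_tuples;
    int64_t vacuum_count;           /* manual VACUUMs since the entry was created */
    int64_t autovac_vacuum_count;   /* autovacuums since the entry was created */
} TabEntrySketch;

static void
report_analyze_sketch(TabEntrySketch *tabentry,
                      int64_t livetuples, int64_t deadtuples)
{
    tabentry->live_tuples = livetuples;

    /*
     * Trust ANALYZE's sampled dead_tuples only when no [auto]vacuum has run
     * since this entry was created (e.g. after a crash or an explicit stats
     * reset).  Otherwise keep what VACUUM and the DML counters accumulated.
     */
    if (tabentry->vacuum_count == 0 && tabentry->autovac_vacuum_count == 0)
        tabentry->dead_tuples = deadtuples;
}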
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Andres Freund
Date:
Hi, On 2023-01-18 13:42:40 -0800, Andres Freund wrote: > The real point of change appears to be 10->11. > > There's a relevant looking difference in the vac_estimate_reltuples call: > 10: > /* now we can compute the new value for pg_class.reltuples */ > vacrelstats->new_rel_tuples = vac_estimate_reltuples(onerel, false, > nblocks, > vacrelstats->tupcount_pages, > num_tuples); > > 11: > /* now we can compute the new value for pg_class.reltuples */ > vacrelstats->new_live_tuples = vac_estimate_reltuples(onerel, > nblocks, > vacrelstats->tupcount_pages, > live_tuples); > which points to: > > commit 7c91a0364fcf5d739a09cc87e7adb1d4a33ed112 > Author: Tom Lane <tgl@sss.pgh.pa.us> > Date: 2018-03-22 15:47:29 -0400 > > Sync up our various ways of estimating pg_class.reltuples. The problem with the change is here: /* * Okay, we've covered the corner cases. The normal calculation is to * convert the old measurement to a density (tuples per page), then * estimate the number of tuples in the unscanned pages using that figure, * and finally add on the number of tuples in the scanned pages. */ old_density = old_rel_tuples / old_rel_pages; unscanned_pages = (double) total_pages - (double) scanned_pages; total_tuples = old_density * unscanned_pages + scanned_tuples; return floor(total_tuples + 0.5); Because we'll re-scan the pages for not-yet-removable rows in subsequent vacuums, the next vacuum will process the same pages again. By using scanned_tuples = live_tuples, we basically remove not-yet-removable tuples from reltuples, each time. The commit *did* try to account for that to some degree: + /* also compute total number of surviving heap entries */ + vacrelstats->new_rel_tuples = + vacrelstats->new_live_tuples + vacrelstats->new_dead_tuples; but new_rel_tuples isn't used for pg_class.reltuples or pgstat. This is pretty nasty. We use reltuples for a lot of things. And while analyze might fix it sometimes, that won't reliably be the case, particularly when there are repeated autovacuums due to a longrunning transaction - there's no cause for auto-analyze to trigger again soon, while autovacuum will go at it again and again. Greetings, Andres Freund
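[A standalone simulation -- not PostgreSQL code -- of the effect described above. Assume a 10M-row table where half the rows are deleted but not yet removable (say, held back by a long-running transaction). Every VACUUM scans just those pages, finds essentially no live tuples there, and the "normal calculation" quoted above (corner cases omitted) extrapolates the old density onto the pages it skipped, so the estimate halves each time:]

#include <stdio.h>

static double
estimate_reltuples(double old_rel_tuples, double total_pages,
                   double scanned_pages, double scanned_tuples)
{
    /* the "normal calculation" from vac_estimate_reltuples, as quoted above */
    double old_density = old_rel_tuples / total_pages;
    double unscanned_pages = total_pages - scanned_pages;

    return old_density * unscanned_pages + scanned_tuples;
}

int
main(void)
{
    double total_pages = 44248;     /* roughly 10M narrow rows; number is illustrative */
    double reltuples = 10e6;
    double dead_fraction = 0.5;     /* half the table is recently dead, concentrated */

    for (int vacuum_no = 1; vacuum_no <= 10; vacuum_no++)
    {
        double scanned_pages = dead_fraction * total_pages;
        double scanned_live = 0;    /* the scanned pages hold only recently-dead rows */

        reltuples = estimate_reltuples(reltuples, total_pages,
                                       scanned_pages, scanned_live);
        printf("vacuum %2d: reltuples ~= %.0f\n", vacuum_no, reltuples);
    }
    return 0;
}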
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Wed, Jan 18, 2023 at 2:22 PM Andres Freund <andres@anarazel.de> wrote: > The problem with the change is here: > > /* > * Okay, we've covered the corner cases. The normal calculation is to > * convert the old measurement to a density (tuples per page), then > * estimate the number of tuples in the unscanned pages using that figure, > * and finally add on the number of tuples in the scanned pages. > */ > old_density = old_rel_tuples / old_rel_pages; > unscanned_pages = (double) total_pages - (double) scanned_pages; > total_tuples = old_density * unscanned_pages + scanned_tuples; > return floor(total_tuples + 0.5); My assumption has always been that vac_estimate_reltuples() is prone to issues like this because it just doesn't have access to very much information each time it runs. It can only see the delta between what VACUUM just saw, and what the last VACUUM (or possibly the last ANALYZE) saw according to pg_class. You're always going to find weaknesses in such a model if you go looking for them. You're always going to find a way to salami slice your way from good information to total nonsense, if you pick the right/wrong test case, which runs VACUUM in a way that allows whatever bias there may be to accumulate. It's sort of like the way floating point values can become very inaccurate through a process that allows many small inaccuracies to accumulate over time. Maybe you're right to be concerned to the degree that you're concerned -- I'm not sure. I'm just adding what I see as important context. -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Wed, Jan 18, 2023 at 2:37 PM Peter Geoghegan <pg@bowt.ie> wrote: > Maybe you're right to be concerned to the degree that you're concerned > -- I'm not sure. I'm just adding what I see as important context. The problems in this area tend to be that vac_estimate_reltuples() behaves as if it sees a random sample, when in fact it's far from random -- it's the same scanned_pages as last time, and the ten other times before that. That's a feature of your test case, and other similar vac_estimate_reltuples test cases that I came up with in the past. Both scenarios involved skipping using the visibility map in multiple successive VACUUM operations. Perhaps we should make vac_estimate_reltuples focus on the pages that VACUUM newly set all-visible each time (not including all-visible pages that got scanned despite being all-visible) -- only that subset of scanned_pages seems to be truly relevant. That way you wouldn't be able to do stuff like this. We'd have to start explicitly tracking the number of pages that were newly set in the VM in vacuumlazy.c to be able to do that, but that seems like a good idea anyway. This probably has consequences elsewhere, but maybe that's okay. We know when the existing pg_class has no information, since that is explicitly encoded by a reltuples of -1. Obviously we'd need to be careful about stuff like that. Overall, the danger from assuming that "unsettled" pages (pages that couldn't be newly set all-visible by VACUUM) have a generic tuple density seems less than the danger of assuming that they're representative. We know that we're bound to scan these same pages in the next VACUUM anyway, so they'll have another chance to impact our view of the table's tuple density at that time. -- Peter Geoghegan
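[A hedged sketch of the variant suggested above: pages that were scanned but could not be newly set all-visible keep the old density, just like unscanned pages, and only the newly-all-visible pages contribute their observed tuple counts. The signature and names are illustrative, not an actual patch:]

static double
estimate_reltuples_newly_allvisible(double old_rel_tuples,
                                    double total_pages,
                                    double newly_allvisible_pages,
                                    double tuples_on_those_pages)
{
    double old_density = old_rel_tuples / total_pages;
    double other_pages = total_pages - newly_allvisible_pages;

    /*
     * "Unsettled" pages get the generic density from the existing estimate;
     * they are bound to be scanned again by the next VACUUM anyway, so
     * they'll get another chance to influence the tuple density then.
     */
    return old_density * other_pages + tuples_on_those_pages;
}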
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Wed, Jan 18, 2023 at 3:28 PM Peter Geoghegan <pg@bowt.ie> wrote: > The problems in this area tend to be that vac_estimate_reltuples() > behaves as if it sees a random sample, when in fact it's far from > random -- it's the same scanned_pages as last time, and the ten other > times before that. That's a feature of your test case, and other > similar vac_estimate_reltuples test cases that I came up with in the > past. Both scenarios involved skipping using the visibility map in > multiple successive VACUUM operations. FWIW, the problem in your test case just goes away if you just change this line:
DELETE FROM foo WHERE i < (10000000 * 0.1)
To this:
DELETE FROM foo WHERE i < (10000000 * 0.065)
Which is not a huge difference, overall. This effect is a consequence of the heuristics I added in commit 74388a1a, so it's only present on Postgres 15+. Whether or not this is sufficient protection is of course open to interpretation. One problem with those heuristics (as far as your test case is concerned) is that they either work, or they don't work. For example they're conditioned on "old_rel_pages == total_pages". -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Andres Freund
Date:
Hi, On 2023-01-18 14:37:20 -0800, Peter Geoghegan wrote: > On Wed, Jan 18, 2023 at 2:22 PM Andres Freund <andres@anarazel.de> wrote: > > The problem with the change is here: > > > > /* > > * Okay, we've covered the corner cases. The normal calculation is to > > * convert the old measurement to a density (tuples per page), then > > * estimate the number of tuples in the unscanned pages using that figure, > > * and finally add on the number of tuples in the scanned pages. > > */ > > old_density = old_rel_tuples / old_rel_pages; > > unscanned_pages = (double) total_pages - (double) scanned_pages; > > total_tuples = old_density * unscanned_pages + scanned_tuples; > > return floor(total_tuples + 0.5); > > My assumption has always been that vac_estimate_reltuples() is prone > to issues like this because it just doesn't have access to very much > information each time it runs. It can only see the delta between what > VACUUM just saw, and what the last VACUUM (or possibly the last > ANALYZE) saw according to pg_class. You're always going to find > weaknesses in such a model if you go looking for them. You're always > going to find a way to salami slice your way from good information to > total nonsense, if you pick the right/wrong test case, which runs > VACUUM in a way that allows whatever bias there may be to accumulate. > It's sort of like the way floating point values can become very > inaccurate through a process that allows many small inaccuracies to > accumulate over time. Sure. To start with, there's always going to be some inaccuracies when you assume an even distribution across a table. But I think this goes beyond that. This problem occurs with a completely even distribution, exactly the same inputs to the estimation function every time. My example under-sold the severity, because I had only 5% non-deletable tuples. Here it is with 50% non-removable tuples (I've seen way worse than 50% in many real-world cases), and a bunch of complexity removed (attached).
vacuum-no    reltuples/n_live_tup    n_dead_tup
1            4999976                 5000000
2            2500077                 5000000
3            1250184                 5000000
4            625266                  5000000
5            312821                  5000000
10           10165                   5000000
Each vacuum halves reltuples. That's going to screw badly with all kinds of things. Planner costs completely out of whack etc. I wonder if this is part of the reason for the distortion you addressed with 74388a1a / 3097bde7dd1d. I am somewhat doubtful they're right as is. For a large relation 2% of blocks is a significant number of rows, and simply never adjusting reltuples seems quite problematic. At the very least we ought to account for dead tids we removed or such, instead of just freezing reltuples. Greetings, Andres Freund
Attachment
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Wed, Jan 18, 2023 at 4:02 PM Andres Freund <andres@anarazel.de> wrote: > vacuum-no reltuples/n_live_tup n_dead_tup > 1 4999976 5000000 > 2 2500077 5000000 > 3 1250184 5000000 > 4 625266 5000000 > 5 312821 5000000 > 10 10165 5000000 > > Each vacuum halves reltuples. That's going to screw badly with all kinds of > things. Planner costs completely out of whack etc. I get that that could be a big problem, even relative to the more immediate problem of VACUUM just spinning like it does in your test case. What do you think we should do about it? What do you think about my idea of focussing on the subset of pages newly set all-visible in the VM? > I wonder if this is part of the reason for the distortion you addressed with > 74388a1a / 3097bde7dd1d. I am somewhat doubtful they're right as is. For a > large relation 2% of blocks is a significant number of rows, and simply never > adjusting reltuples seems quite problematic. At the very least we ought to > account for dead tids we removed or such, instead of just freezing reltuples. As I mentioned, it only kicks in when relpages is *precisely* the same as last time (not one block more or one block less), *and* we only scanned less than 2% of rel_pages. It's quite possible that that's insufficient, but I can't imagine it causing any new problems. I think that we need to be realistic about what's possible while storing a small, fixed amount of information. There is always going to be some distortion of this kind. We can do something about the obviously pathological cases, where errors can grow without bound. But you have to draw the line somewhere, unless you're willing to replace the whole approach with something that stores historic metadata. What kind of tradeoff do you want to make here? I think that you have to make one. -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Andres Freund
Date:
Hi, On 2023-01-18 13:45:19 -0800, Peter Geoghegan wrote: > On Wed, Jan 18, 2023 at 1:08 PM Andres Freund <andres@anarazel.de> wrote: > > I suggested nearby to only have ANALYZE update dead_tuples if there's been no > > [auto]vacuum since the stats entry was created. That allows recovering from > > stats resets, be it via crashes or explicitly. What do you think? > > I like that idea. It's far simpler than the kind of stuff I was > thinking about, and probably just as effective. Even if it introduces > some unforeseen problem (which seems unlikely), we can still rely on > pgstat_report_vacuum() to set things straight before too long. > > Are you planning on writing a patch for this? I'd be very interested > in seeing this through. Could definitely review it. I can, it should be just about trivial code-wise. A bit queasy about trying to foresee the potential consequences. A somewhat related issue is that pgstat_report_vacuum() sets dead_tuples to what VACUUM itself observed, ignoring any concurrently reported dead tuples. As far as I can tell, when vacuum takes a long time, that can lead to severely under-accounting dead tuples. We probably lose track of a bit more than 50% of the dead tuples reported since vacuum started. During the heap scan phase we don't notice all the tuples reported before the current scan point, and then we don't notice them at all during the index/heap vacuuming. The only saving grace is that it'll be corrected at the next VACUUM. But the next vacuum might very well be delayed noticeably due to this. This is similar to the issue around ins_since_vacuum that Peter pointed out. I wonder if we ought to remember the dead_tuples value at the start of the heap scan and use that to adjust the final dead_tuples value. I'd lean towards over-counting rather than under-counting and thus probably would go for something like
tabentry->dead_tuples = livetuples + Min(0, tabentry->dead_tuples - deadtuples_at_start);
i.e. assuming we might have missed all concurrently reported dead tuples. Of course we could instead move to something like ins_since_vacuum and reset it at the *start* of the vacuum. But that'd make the error case harder, without giving us more accuracy, I think? I do think this is an argument for splitting up dead_tuples into separate "components" that we track differently. I.e. tracking the number of dead items, not-yet-removable rows, and the number of dead tuples reported from DML statements via pgstats. Greetings, Andres Freund
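[One reading of the mechanism being discussed above, as a sketch: snapshot the stats counter when VACUUM starts, and at the end fold the concurrently reported dead tuples back onto what VACUUM itself observed. The names are mine, and the exact expression -- including which way to clamp and which direction to err in -- is precisely the open question in the mail above:]

#include <stdint.h>

typedef struct
{
    int64_t dead_tuples;    /* as tracked by pgstats for the table */
} TabEntryDeadSketch;

static void
report_vacuum_sketch(TabEntryDeadSketch *tabentry,
                     int64_t vacuum_observed_dead,   /* dead tuples VACUUM itself left behind */
                     int64_t dead_at_vacuum_start)   /* snapshot taken when VACUUM began */
{
    /* dead tuples reported by DML statements while VACUUM was running */
    int64_t reported_during = tabentry->dead_tuples - dead_at_vacuum_start;

    if (reported_during < 0)
        reported_during = 0;    /* e.g. a stats reset happened mid-vacuum */

    /* err towards over-counting: assume VACUUM missed all of them */
    tabentry->dead_tuples = vacuum_observed_dead + reported_during;
}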
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Wed, Jan 18, 2023 at 4:37 PM Andres Freund <andres@anarazel.de> wrote: > I can, it should be just about trivial code-wise. A bit queasy about trying to > forsee the potential consequences. That's always going to be true, though. > A somewhat related issue is that pgstat_report_vacuum() sets dead_tuples to > what VACUUM itself observed, ignoring any concurrently reported dead > tuples. As far as I can tell, when vacuum takes a long time, that can lead to > severely under-accounting dead tuples. Did I not mention that one? There are so many that it can be hard to keep track! That's why I catalog them. As you point out, it's the dead tuples equivalent of my ins_since_vacuum complaint. The problem is exactly analogous to my recent complaint about insert-driven autovacuums. > I wonder if we ought to remember the dead_tuples value at the start of the > heap scan and use that to adjust the final dead_tuples value. I'd lean towards > over-counting rather than under-counting and thus probably would go for > something like > > tabentry->dead_tuples = livetuples + Min(0, tabentry->dead_tuples - deadtuples_at_start); > > i.e. assuming we might have missed all concurrently reported dead tuples. This is exactly what I was thinking of doing for both issues (the ins_since_vacuum one and this similar dead tuples one). It's completely logical. This creates an awkward but logical question, though: what if dead_tuples doesn't go down at all? What if VACUUM actually has to increase it, because VACUUM runs so slowly relative to the workload? Of course the whole point is to make it more likely that VACUUM will keep up with the workload. I'm just not quite sure that the consequences of doing it that way are strictly a good thing. Bearing in mind that we don't differentiate between recently dead and dead here. Fun fact: autovacuum can spin with pgbench because of recently dead tuples, even absent an old snapshot/long running xact, if you set things aggressively enough: https://postgr.es/m/CAH2-Wz=sJm3tm+FpXbyBhEhX5tbz1trQrhG6eOhYk4-+5uL=ww@mail.gmail.com I think that we probably need to do something like always make sure that dead_items goes down by a small amount at the end of each VACUUM, even when that's a lie. Maybe we also have a log message about autovacuum not keeping up, so as to not feel too guilty about it. You know, to give the user a chance to reconfigure autovacuum so that it stops happening. > Of course we could instead move to something like ins_since_vacuum and reset > it at the *start* of the vacuum. But that'd make the error case harder, > without giving us more accuracy, I think? It would. It seems illogical to me. > I do think this is an argument for splitting up dead_tuples into separate > "components" that we track differently. I.e. tracking the number of dead > items, not-yet-removable rows, and the number of dead tuples reported from DML > statements via pgstats. Is it? Why? I'm all in favor of doing that, of course. I just don't particularly think that it's related to this other problem. One problem is that we count dead tuples incorrectly because we don't account for the fact that things change while VACUUM runs. The other problem is that the thing that is counted isn't broken down into distinct subcategories of things -- things are bunched together that shouldn't be. Oh wait, you were thinking of what I said before -- my "awkward but logical question". Is that it? -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Andres Freund
Date:
Hi, On 2023-01-18 16:19:02 -0800, Peter Geoghegan wrote: > On Wed, Jan 18, 2023 at 4:02 PM Andres Freund <andres@anarazel.de> wrote: > > vacuum-no reltuples/n_live_tup n_dead_tup > > 1 4999976 5000000 > > 2 2500077 5000000 > > 3 1250184 5000000 > > 4 625266 5000000 > > 5 312821 5000000 > > 10 10165 5000000 > > > > Each vacuum halves reltuples. That's going to screw badly with all kinds of > > things. Planner costs completely out of whack etc. > > I get that that could be a big problem, even relative to the more > immediate problem of VACUUM just spinning like it does in your test > case. What do you think we should do about it? The change made in 7c91a0364fc imo isn't right. We need to fix it. I think it's correct that now pgstat_report_vacuum() doesn't include recently dead tuples in livetuples - they're still tracked via deadtuples. But it's wrong for vacuum's call to vac_update_relstats() to not include recently dead tuples, at least when we only scanned part of the relation. I think the right thing would be to not undo the semantic change of 7c91a0364fc as a whole, but instead take recently-dead tuples into account only in the "Okay, we've covered the corner cases." part, to avoid the spiraling seen above. Not super clean, but it seems somewhat fundamental that we'll re-scan pages full of recently-dead tuples in the near future. If we, in a way, subtract the recently dead tuples from reltuples in this cycle, we shouldn't do so again in the next - but not taking recently dead into account, does so. It's a bit complicated because of the APIs involved. vac_estimate_reltuples() computes vacrel->new_live_tuples and contains the logic for how to compute the new reltuples. But we use the ->new_live_tuples both vac_update_relstats(), where we, imo, should take recently-dead into account for partial scans and pgstat_report_vacuum where we shouldn't. I guess we would need to add an output paramter both for "reltuples" and "new_live_tuples". > What do you think about my idea of focussing on the subset of pages newly > set all-visible in the VM? I don't understand it yet. On 2023-01-18 15:28:19 -0800, Peter Geoghegan wrote: > Perhaps we should make vac_estimate_reltuples focus on the pages that > VACUUM newly set all-visible each time (not including all-visible > pages that got scanned despite being all-visible) -- only that subset > of scanned_pages seems to be truly relevant. That way you wouldn't be > able to do stuff like this. We'd have to start explicitly tracking the > number of pages that were newly set in the VM in vacuumlazy.c to be > able to do that, but that seems like a good idea anyway. Can you explain a bit more what you mean with "focus on the pages" means? > > I wonder if this is part of the reason for the distortion you addressed with > > 74388a1a / 3097bde7dd1d. I am somewhat doubtful they're right as is. For a > > large relation 2% of blocks is a significant number of rows, and simply never > > adjusting reltuples seems quite problematic. At the very least we ought to > > account for dead tids we removed or such, instead of just freezing reltuples. > > As I mentioned, it only kicks in when relpages is *precisely* the same > as last time (not one block more or one block less), *and* we only > scanned less than 2% of rel_pages. It's quite possible that that's > insufficient, but I can't imagine it causing any new problems. 
In OLTP workloads relpages will often not change, even if there's lots of write activity, because there's plenty free space in the relation, and there's something not-removable on the last page. relpages also won't change if data is deleted anywhere but the end. I don't think it's hard to see this causing problems. Set autovacuum_vacuum_scale_factor to something smaller than 2% or somewhat frequently vacuum manually. Incrementally delete old data. Unless analyze saves you - which might not be run or might have a different scale factor or not be run manually - reltuples will stay exactly the same, despite data changing substantially. Greetings, Andres Freund
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Fri, Jan 13, 2023 at 9:55 PM Andres Freund <andres@anarazel.de> wrote: > How about a float autovacuum_no_auto_cancel_age where positive values are > treated as absolute values, and negative values are a multiple of > autovacuum_freeze_max_age? And where the "computed" age is capped at > vacuum_failsafe_age? A "failsafe" autovacuum clearly shouldn't be cancelled. > > And maybe a default setting of -1.8 or so? Attached is a new revision, v5. I'm not happy with this, but thought it would be useful to show you where I am with it. It's a bit awkward that we have a GUC (autovacuum_no_auto_cancel_age) that can sometimes work as a cutoff that works similarly to both freeze_max_age and multixact_freeze_max_age, but usually works as a multiplier. It's both an XID age value, an MXID age value, and a multiplier on XID/MXID age values. What if it was just a simple multiplier on freeze_max_age/multixact_freeze_max_age, without changing any other detail? -- Peter Geoghegan
Attachment
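[For reference, a minimal sketch of the GUC semantics quoted at the top of this message -- a positive value is an absolute XID age, a negative value is a multiplier on autovacuum_freeze_max_age, and the result is capped at vacuum_failsafe_age. This is only the proposal under discussion, not what the attached v5 patch does; the function and parameter names are illustrative:]

#include <stdint.h>

static uint32_t
effective_no_auto_cancel_age(double no_auto_cancel_age,    /* the GUC, e.g. -1.8 */
                             uint32_t freeze_max_age,      /* autovacuum_freeze_max_age */
                             uint32_t failsafe_age)        /* vacuum_failsafe_age */
{
    double age;

    if (no_auto_cancel_age >= 0)
        age = no_auto_cancel_age;                       /* absolute XID age cutoff */
    else
        age = -no_auto_cancel_age * freeze_max_age;     /* multiplier, e.g. 1.8x */

    /* a failsafe-level autovacuum clearly shouldn't be auto-cancelled */
    if (age > failsafe_age)
        age = failsafe_age;

    return (uint32_t) age;
}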
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Andres Freund
Date:
Hi, On 2023-01-18 17:00:48 -0800, Peter Geoghegan wrote: > On Wed, Jan 18, 2023 at 4:37 PM Andres Freund <andres@anarazel.de> wrote: > > I can, it should be just about trivial code-wise. A bit queasy about trying to > > forsee the potential consequences. > > That's always going to be true, though. > > > A somewhat related issue is that pgstat_report_vacuum() sets dead_tuples to > > what VACUUM itself observed, ignoring any concurrently reported dead > > tuples. As far as I can tell, when vacuum takes a long time, that can lead to > > severely under-accounting dead tuples. > > Did I not mention that one? There are so many that it can be hard to > keep track! That's why I catalog them. I don't recall you doing, but there's lot of emails and holes in my head. > This creates an awkward but logical question, though: what if > dead_tuples doesn't go down at all? What if VACUUM actually has to > increase it, because VACUUM runs so slowly relative to the workload? Sure, that can happen - but it's not made better by having wrong stats :) > > I do think this is an argument for splitting up dead_tuples into separate > > "components" that we track differently. I.e. tracking the number of dead > > items, not-yet-removable rows, and the number of dead tuples reported from DML > > statements via pgstats. > > Is it? Why? We have reasonably sophisticated accounting in pgstats what newly live/dead rows a transaction "creates". So an obvious (and wrong) idea is just decrement reltuples by the number of tuples removed by autovacuum. But we can't do that, because inserted/deleted tuples reported by backends can be removed by on-access pruning and vacuumlazy doesn't know about all changes made by its call to heap_page_prune(). But I think that if we add a pgstat_count_heap_prune(nredirected, ndead, nunused) around heap_page_prune() and a pgstat_count_heap_vacuum(nunused) in lazy_vacuum_heap_page(), we'd likely end up with a better approximation than what vac_estimate_reltuples() does, in the "partially scanned" case. > I'm all in favor of doing that, of course. I just don't particularly > think that it's related to this other problem. One problem is that we > count dead tuples incorrectly because we don't account for the fact > that things change while VACUUM runs. The other problem is that the > thing that is counted isn't broken down into distinct subcategories of > things -- things are bunched together that shouldn't be. If we only adjust the counters incrementally, as we go, we'd not update them at the end of vacuum. I think it'd be a lot easier to only update the counters incrementally if we split ->dead_tuples into sub-counters. So I don't think it's entirely unrelated. You probably could get close without splitting the counters, by just pushing down the counting, and only counting redirected and unused during heap pruning. But I think it's likely to be more accurate with the split counter. > Oh wait, you were thinking of what I said before -- my "awkward but > logical question". Is that it? I'm not quite following? The "awkward but logical" bit is in the email I'm just replying to, right? Greetings, Andres Freund
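[A rough sketch of the split counters and incremental updates described above. The function names echo the hypothetical pgstat_count_heap_prune()/pgstat_count_heap_vacuum() calls mentioned in the mail; none of this is existing PostgreSQL API, and how pruning's nredirected/ndead/nunused map onto the counters is an assumption:]

#include <stdint.h>

typedef struct
{
    int64_t dead_items;            /* LP_DEAD line pointers awaiting index vacuuming */
    int64_t recently_dead;         /* not-yet-removable rows */
    int64_t dml_reported_dead;     /* dead tuples reported by DML via pgstats */
} DeadTupleCountersSketch;

static void
count_heap_prune_sketch(DeadTupleCountersSketch *c,
                        int64_t nredirected, int64_t ndead, int64_t nunused)
{
    int64_t roughly_pruned = nredirected + ndead + nunused;

    /* pruning leaves LP_DEAD items behind ... */
    c->dead_items += ndead;

    /* ... and roughly consumes dead tuples that DML statements had reported */
    c->dml_reported_dead -= roughly_pruned;
    if (c->dml_reported_dead < 0)
        c->dml_reported_dead = 0;
}

static void
count_heap_vacuum_sketch(DeadTupleCountersSketch *c, int64_t nunused)
{
    /* heap vacuuming reclaims LP_DEAD items once the indexes are done */
    c->dead_items -= nunused;
    if (c->dead_items < 0)
        c->dead_items = 0;
}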
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Wed, Jan 18, 2023 at 5:49 PM Andres Freund <andres@anarazel.de> wrote: > On 2023-01-18 15:28:19 -0800, Peter Geoghegan wrote: > > Perhaps we should make vac_estimate_reltuples focus on the pages that > > VACUUM newly set all-visible each time (not including all-visible > > pages that got scanned despite being all-visible) -- only that subset > > of scanned_pages seems to be truly relevant. That way you wouldn't be > > able to do stuff like this. We'd have to start explicitly tracking the > > number of pages that were newly set in the VM in vacuumlazy.c to be > > able to do that, but that seems like a good idea anyway. > > Can you explain a bit more what you mean with "focus on the pages" means? We don't say anything about pages we didn't scan right now -- only scanned_pages have new information, so we just extrapolate. Why not go even further than that, by only saying something about the pages that were both scanned and newly set all-visible? Under this scheme, the pages that were scanned but couldn't be newly set all-visible are treated just like the pages that weren't scanned at all -- they get a generic estimate from the existing reltuples. > I don't think it's hard to see this causing problems. Set > autovacuum_vacuum_scale_factor to something smaller than 2% or somewhat > frequently vacuum manually. Incrementally delete old data. Unless analyze > saves you - which might not be run or might have a different scale factor or > not be run manually - reltuples will stay exactly the same, despite data > changing substantially. You seem to be saying that it's a problem if we don't update reltuples -- an estimate -- when less than 2% of the table is scanned by VACUUM. But why? Why can't we just do nothing sometimes? I mean in general, leaving aside the heuristics I came up with for a moment? It will cause problems if we remove the heuristics. Much less theoretical problems. What about those? How often does VACUUM scan so few pages, anyway? We've been talking about how ineffective autovacuum_vacuum_scale_factor is, at great length, but now you're saying that it *can* meaningfully trigger not just one VACUUM, but many VACUUMs, where no more than 2% of rel_pages are not all-visible (pages, not tuples)? Not just once, mind you, but many times? And in the presence of some kind of highly variable tuple size, where it actually could matter to the planner at some point? I would be willing to just avoid even these theoretical problems if there was some way to do so, that didn't also create new problems. I have my doubts that that is possible, within the constraints of updating pg_class. Or the present constraints, at least. I am not a miracle worker -- I can only do so much with the information that's available to vac_update_relstats (and/or the information that can easily be made available). -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Wed, Jan 18, 2023 at 6:10 PM Andres Freund <andres@anarazel.de> wrote: > > This creates an awkward but logical question, though: what if > > dead_tuples doesn't go down at all? What if VACUUM actually has to > > increase it, because VACUUM runs so slowly relative to the workload? > > Sure, that can happen - but it's not made better by having wrong stats :) Maybe it's that simple. It's reasonable to wonder how far we want to go with letting dead tuples grow and grow, even when VACUUM is running constantly. It's not a question that has an immediate and obvious answer IMV. Maybe the real question is: is this an opportunity to signal to the user (say via a LOG message) that VACUUM cannot keep up? That might be very useful, in a general sort of way (not just to avoid new problems). > We have reasonably sophisticated accounting in pgstats what newly live/dead > rows a transaction "creates". So an obvious (and wrong) idea is just decrement > reltuples by the number of tuples removed by autovacuum. Did you mean dead_tuples? > But we can't do that, > because inserted/deleted tuples reported by backends can be removed by > on-access pruning and vacuumlazy doesn't know about all changes made by its > call to heap_page_prune(). I'm not following here. Perhaps this is a good sign that I should stop working for the day. :-) > But I think that if we add a > pgstat_count_heap_prune(nredirected, ndead, nunused) > around heap_page_prune() and a > pgstat_count_heap_vacuum(nunused) > in lazy_vacuum_heap_page(), we'd likely end up with a better approximation > than what vac_estimate_reltuples() does, in the "partially scanned" case. What does vac_estimate_reltuples() have to do with dead tuples? -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Andres Freund
Date:
Hi, On 2023-01-18 18:21:33 -0800, Peter Geoghegan wrote: > On Wed, Jan 18, 2023 at 5:49 PM Andres Freund <andres@anarazel.de> wrote: > > On 2023-01-18 15:28:19 -0800, Peter Geoghegan wrote: > > > Perhaps we should make vac_estimate_reltuples focus on the pages that > > > VACUUM newly set all-visible each time (not including all-visible > > > pages that got scanned despite being all-visible) -- only that subset > > > of scanned_pages seems to be truly relevant. That way you wouldn't be > > > able to do stuff like this. We'd have to start explicitly tracking the > > > number of pages that were newly set in the VM in vacuumlazy.c to be > > > able to do that, but that seems like a good idea anyway. > > > > Can you explain a bit more what you mean with "focus on the pages" means? > > We don't say anything about pages we didn't scan right now -- only > scanned_pages have new information, so we just extrapolate. Why not go > even further than that, by only saying something about the pages that > were both scanned and newly set all-visible? > > Under this scheme, the pages that were scanned but couldn't be newly > set all-visible are treated just like the pages that weren't scanned > at all -- they get a generic estimate from the existing reltuples. I don't think that'd work well either. If we actually removed a large chunk of the tuples in the table it should be reflected in reltuples, otherwise costing and autovac scheduling suffers. And you might not be able to set all those page to all-visible, because of more recent rows. > > I don't think it's hard to see this causing problems. Set > > autovacuum_vacuum_scale_factor to something smaller than 2% or somewhat > > frequently vacuum manually. Incrementally delete old data. Unless analyze > > saves you - which might not be run or might have a different scale factor or > > not be run manually - reltuples will stay exactly the same, despite data > > changing substantially. > > You seem to be saying that it's a problem if we don't update reltuples > -- an estimate -- when less than 2% of the table is scanned by VACUUM. > But why? Why can't we just do nothing sometimes? I mean in general, > leaving aside the heuristics I came up with for a moment? The problem isn't that we might apply the heuristic once, that'd be fine. But that there's nothing preventing it from applying until there basically are no tuples left, as long as the vacuum is frequent enough. As a demo: The attached sql script ends up with a table containing 10k rows, but relpages being set 1 million. VACUUM foo; EXPLAIN (ANALYZE) SELECT * FROM foo; ┌───────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │ QUERY PLAN │ ├───────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Seq Scan on foo (cost=0.00..14425.00 rows=1000000 width=4) (actual time=3.251..4.693 rows=10000 loops=1) │ │ Planning Time: 0.056 ms │ │ Execution Time: 5.491 ms │ └───────────────────────────────────────────────────────────────────────────────────────────────────────────┘ (3 rows) > It will cause problems if we remove the heuristics. Much less > theoretical problems. What about those? I don't think what I describe is a theoretical problem. > How often does VACUUM scan so few pages, anyway? 
We've been talking > about how ineffective autovacuum_vacuum_scale_factor is, at great > length, but now you're saying that it *can* meaningfully trigger not > just one VACUUM, but many VACUUMs, where no more than 2% of rel_pages > are not all-visible (pages, not tuples)? Not just once, mind you, but > many times? I've seen autovacuum_vacuum_scale_factor set to 0.01 repeatedly. But that's not even needed - you just need a longrunning transaction preventing at least one dead row from getting removed and hit autovacuum_freeze_max_age. There'll be continuous VACUUMs of the table, all only processing a small fraction of the table. And I have many times seen bulk loading / deletion scripts that do VACUUM on a regular schedule, which also could easily trigger this. > And in the presence of some kind of highly variable tuple > size, where it actually could matter to the planner at some point? I don't see how a variable tuple size needs to be involved? As the EXPLAIN ANALYZE above shows, we'll end up with wrong row count estimates etc. > I would be willing to just avoid even these theoretical problems if > there was some way to do so, that didn't also create new problems. I > have my doubts that that is possible, within the constraints of > updating pg_class. Or the present constraints, at least. I am not a > miracle worker -- I can only do so much with the information that's > available to vac_update_relstats (and/or the information that can > easily be made available). I'm worried they might cause new problems. Greetings, Andres Freund
Attachment
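(The attached script itself is not reproduced in this archive. What follows is a rough sketch of the kind of script that shows the effect; it is a guess at the attachment's shape, not its contents, and the table name, batch sizes, and the disabled autovacuum are all illustrative.)

    DROP TABLE IF EXISTS foo;
    CREATE TABLE foo AS SELECT i FROM generate_series(1, 1000000) AS i;
    ALTER TABLE foo SET (autovacuum_enabled = off);  -- keep autoanalyze from correcting reltuples
    VACUUM ANALYZE foo;                              -- reltuples is now ~1,000,000

    -- Emit 99 DELETE/VACUUM pairs with psql's \gexec; each DELETE removes
    -- 10,000 rows, roughly 1% of the table, so every VACUUM scans well under
    -- 2% of rel_pages and the keep-the-old-reltuples heuristic can apply.
    SELECT format('DELETE FROM foo WHERE i > %s AND i <= %s',
                  10000 + (n - 1) * 10000, 10000 + n * 10000) AS del,
           'VACUUM foo' AS vac
    FROM generate_series(1, 99) AS n
    \gexec

    -- Only 10,000 live rows remain, but the stored estimate never moved.
    SELECT relpages, reltuples FROM pg_class WHERE oid = 'foo'::regclass;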
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Wed, Jan 18, 2023 at 7:04 PM Andres Freund <andres@anarazel.de> wrote: > > You seem to be saying that it's a problem if we don't update reltuples > > -- an estimate -- when less than 2% of the table is scanned by VACUUM. > > But why? Why can't we just do nothing sometimes? I mean in general, > > leaving aside the heuristics I came up with for a moment? > > The problem isn't that we might apply the heuristic once, that'd be fine. But > that there's nothing preventing it from applying until there basically are no > tuples left, as long as the vacuum is frequent enough. > > As a demo: The attached sql script ends up with a table containing 10k rows, > but reltuples being set to 1 million. I saw that too. But then I checked again a few seconds later, and autoanalyze had run, so reltuples was 10k. Just like it would have if there were no VACUUM statements in your script. -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Robert Haas
Date:
On Wed, Jan 18, 2023 at 1:31 PM Peter Geoghegan <pg@bowt.ie> wrote: > pgstat_report_analyze() will totally override the > tabentry->dead_tuples information that drives autovacuum.c, based on > an estimate derived from a random sample -- which seems to me to be an > approach that just doesn't have any sound theoretical basis. In other words, ANALYZE sometimes (but not always) produces wrong answers. On Wed, Jan 18, 2023 at 4:08 PM Andres Freund <andres@anarazel.de> wrote: > One complicating factor is that VACUUM sometimes computes an incrementally > more bogus n_live_tup when it skips pages due to the VM, whereas ANALYZE > computes something sane. I unintentionally encountered one when I was trying > something while writing this email, reproducer attached. In other words, VACUUM sometimes (but not always) produces wrong answers. TL;DR: We're screwed. I refuse to believe that any amount of math you can do on numbers that can be arbitrarily inaccurate will result in an accurate answer popping out the other end. Trying to update the reltuples estimate incrementally based on an estimate derived from a non-random, likely-to-be-skewed subset of the table is always going to produce distortion that gets worse and worse the more times you do it. If we could say, well, the existing estimate of let's say 100 tuples per page is based on the density being 200 tuples per page in the pages I just scanned and 50 tuples per page in the rest of the table, then you could calculate a new estimate that keeps the value of 50 tuples per page for the remainder of the table intact and just replaces the estimate for the part you just scanned. But we have no way of doing that, so we just make some linear combination of the old estimate with the new one. That overweights the repeatedly-sampled portions of the table more and more, making the estimate wronger and wronger. Now, that is already quite bad. But if we accept the premise that neither VACUUM nor ANALYZE is guaranteed to ever produce a new actually-reliable estimate, then not only will we go progressively more wrong as time goes by, but we have no way of ever fixing anything. If you get a series of unreliable data points followed by a reliable data point, you can at least get back on track when the reliable data shows up. But it sounds like you guys are saying that there's no guarantee that will ever happen, which is a bit like discovering that not only do you have a hole in your gas tank but there is no guarantee that you will arrive at a gas station ever again regardless of distance travelled. -- Robert Haas EDB: http://www.enterprisedb.com
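(To make the "overweights the repeatedly-sampled portions" point concrete, here is a small worked example. It assumes the blend has roughly the shape new_tuples = old_density * unscanned_pages + tuples_found_in_scanned_pages, which approximates what vac_estimate_reltuples does in the general case; treat the formula as a sketch, not the exact code. Take a 1,000,000-page table whose density really is 100 tuples/page, except for a 20,000-page slice at 200 tuples/page, and suppose VACUUM scans that same slice on every run.)

    WITH RECURSIVE est AS (
        SELECT 0 AS pass, 100.0::numeric AS tuples_per_page
        UNION ALL
        SELECT pass + 1,
               -- blend the density of the same 2% slice into the whole-table estimate
               (tuples_per_page * 980000 + 200 * 20000) / 1000000
        FROM est
        WHERE pass < 20
    )
    SELECT pass, round(tuples_per_page, 2) AS estimated_density FROM est;

After 20 such passes the estimate has drifted from 100 to roughly 133 tuples/page, and it keeps converging toward 200 even though 98% of the table never changed.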
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Andres Freund
Date:
Hi, On 2023-01-19 15:12:12 -0500, Robert Haas wrote: > On Wed, Jan 18, 2023 at 1:31 PM Peter Geoghegan <pg@bowt.ie> wrote: > > pgstat_report_analyze() will totally override the > > tabentry->dead_tuples information that drives autovacuum.c, based on > > an estimate derived from a random sample -- which seems to me to be an > > approach that just doesn't have any sound theoretical basis. > > In other words, ANALYZE sometimes (but not always) produces wrong answers. For dead tuples, but not live tuples. > On Wed, Jan 18, 2023 at 4:08 PM Andres Freund <andres@anarazel.de> wrote: > > One complicating factor is that VACUUM sometimes computes an incrementally > > more bogus n_live_tup when it skips pages due to the VM, whereas ANALYZE > > computes something sane. I unintentionally encountered one when I was trying > > something while writing this email, reproducer attached. > > In other words, VACUUM sometimes (but not always) produces wrong answers. For live tuples, but not badly so for dead tuples. > TL;DR: We're screwed. We are, but perhaps not too badly so, because we can choose to believe analyze more for live tuples, and vacuum for dead tuples. Analyze doesn't compute reltuples incrementally and vacuum doesn't compute deadtuples incrementally. > I refuse to believe that any amount of math you can do on numbers that > can be arbitrarily inaccurate will result in an accurate answer > popping out the other end. Trying to update the reltuples estimate > incrementally based on an estimate derived from a non-random, > likely-to-be-skewed subset of the table is always going to produce > distortion that gets worse and worse the more times you do it. If > could say, well, the existing estimate of let's say 100 tuples per > page is based on the density being 200 tuples per page in the pages I > just scanned and 50 tuples per page in the rest of the table, then you > could calculate a new estimate that keeps the value of 50 tuples per > page for the remainder of the table intact and just replaces the > estimate for the part you just scanned. But we have no way of doing > that, so we just make some linear combination of the old estimate with > the new one. That overweights the repeatedly-sampled portions of the > table more and more, making the estimate wronger and wronger. Perhaps we should, at least occasionally, make vacuum do a cheaper version of analyze's sampling to compute an updated reltuples. This could even happen during the heap scan phase. I don't like relying on analyze to fix vacuum's bogus reltuples, because there's nothing forcing an analyze run soon after vacuum [incrementally] screwed it up. Vacuum can be forced to run a lot of times due to xid horizons preventing cleanup, after which there isn't anything forcing analyze to run again. But in contrast to dead_tuples, where I think we can just stop analyze from updating it unless we crashed recently, I do think we need to update reltuples in vacuum. So computing an accurate value seems like the least unreasonable thing I can see. Greetings, Andres Freund
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Andres Freund
Date:
Hi, On 2023-01-18 19:26:22 -0800, Peter Geoghegan wrote: > On Wed, Jan 18, 2023 at 7:04 PM Andres Freund <andres@anarazel.de> wrote: > > > You seem to be saying that it's a problem if we don't update reltuples > > > -- an estimate -- when less than 2% of the table is scanned by VACUUM. > > > But why? Why can't we just do nothing sometimes? I mean in general, > > > leaving aside the heuristics I came up with for a moment? > > > > The problem isn't that we might apply the heuristic once, that'd be fine. But > > that there's nothing preventing it from applying until there basically are no > > tuples left, as long as the vacuum is frequent enough. > > > > As a demo: The attached sql script ends up with a table containing 10k rows, > > but reltuples being set to 1 million. > > I saw that too. But then I checked again a few seconds later, and > autoanalyze had run, so reltuples was 10k. Just like it would have if > there were no VACUUM statements in your script. There's absolutely no guarantee that autoanalyze is triggered there. Particularly with repeated vacuums triggered due to a relfrozenxid age that can't be advanced, that very well might not happen within days on a large relation. Greetings, Andres Freund
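(As an aside for anyone trying to judge exposure on a live system: the stored estimate and the cumulative statistics, along with when vacuum and analyze last ran, can be compared directly using only standard catalogs; substitute the table of interest.)

    SELECT c.oid::regclass     AS table_name,
           c.relpages,
           c.reltuples::bigint AS reltuples,
           s.n_live_tup,
           s.n_dead_tup,
           s.last_autovacuum,
           s.last_autoanalyze
    FROM pg_class c
    JOIN pg_stat_user_tables s ON s.relid = c.oid
    WHERE c.oid = 'foo'::regclass;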
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Thu, Jan 19, 2023 at 12:56 PM Andres Freund <andres@anarazel.de> wrote: > > In other words, ANALYZE sometimes (but not always) produces wrong answers. > > For dead tuples, but not live tuples. > > In other words, VACUUM sometimes (but not always) produces wrong answers. > > For live tuples, but not badly so for dead tuples. Agreed. More generally, there is a need to think about the whole table in some cases (like for regular optimizer statistics including reltuples/live tuples), and the subset of pages that will be scanned by VACUUM in other cases (for dead tuples accounting). Random sampling at the table level is appropriate and works well enough for the former, though not for the latter. > We are, but perhaps not too badly so, because we can choose to believe analyze > more for live tuples, and vacuum for dead tuples. Analyze doesn't compute > reltuples incrementally and vacuum doesn't compute deadtuples incrementally. Good points. > But in contrast to dead_tuples, where I think we can just stop analyze from > updating it unless we crashed recently, I do think we need to update reltuples > in vacuum. So computing an accurate value seems like the least unreasonable > thing I can see. I agree, but there is no reasonable basis for treating scanned_pages as a random sample, especially if it's only a small fraction of all of rel_pages -- treating it as a random sample is completely unjustifiable. And so it seems to me that the only thing that can be done is to either make VACUUM behave somewhat like ANALYZE in at least some cases, or to have it invoke ANALYZE directly (or indirectly) in those same cases. -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Thu, Jan 19, 2023 at 12:58 PM Andres Freund <andres@anarazel.de> wrote: > There's absolutely no guarantee that autoanalyze is triggered > there. Particularly with repeated vacuums triggered due to a relfrozenxid age > that can't be advanced, that very well might not happen within days on a large > relation. Arguments like that work far better as arguments in favor of the vac_estimate_reltuples heuristics. That doesn't mean that the heuristics are good in any absolute sense, of course. They were just a band-aid intended to ameliorate some of the negative impact that came from treating scanned_pages as a random sample. I think that we both agree that the real problem is that scanned_pages just isn't a random sample, at least not as far as reltuples/live tuples is concerned (for dead tuples it kinda isn't a sample, but is rather something close to an exact count). I now understand that you're in favor of addressing the root problem directly. I am also in favor of that approach. I'd be more than happy to get rid of the band-aid as part of that whole effort. -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Andres Freund
Date:
Hi, On 2023-01-19 13:36:41 -0800, Peter Geoghegan wrote: > On Thu, Jan 19, 2023 at 12:58 PM Andres Freund <andres@anarazel.de> wrote: > > There's absolutely no guarantee that autoanalyze is triggered > > there. Particularly with repeated vacuums triggered due to a relfrozenxid age > > that can't be advanced, that very well might not happen within days on a large > > relation. > > Arguments like that work far better as arguments in favor of the > vac_estimate_reltuples heuristics. I don't agree. But mainly my issue is that the devil you know (how this has worked for a while) is preferable to introducing an unknown quantity (your patch that hasn't yet seen real world exposure). Greetings, Andres Freund
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Andres Freund
Date:
Hi, On 2023-01-19 13:22:28 -0800, Peter Geoghegan wrote: > On Thu, Jan 19, 2023 at 12:56 PM Andres Freund <andres@anarazel.de> wrote: > > But in contrast to dead_tuples, where I think we can just stop analyze from > > updating it unless we crashed recently, I do think we need to update reltuples > > in vacuum. So computing an accurate value seems like the least unreasonable > > thing I can see. > > I agree, but there is no reasonable basis for treating scanned_pages > as a random sample, especially if it's only a small fraction of all of > rel_pages -- treating it as a random sample is completely > unjustifiable. Agreed. > And so it seems to me that the only thing that can be done is to either make > VACUUM behave somewhat like ANALYZE in at least some cases, or to have it > invoke ANALYZE directly (or indirectly) in those same cases. Yea. Hence my musing about potentially addressing this by choosing to visit additional blocks during the heap scan using vacuum's block sampling logic. IME most of the time in analyze isn't spent doing IO for the sample blocks themselves, but CPU time and IO for toasted columns. A trimmed down version that just computes relallvisible should be a good bit faster. Greetings, Andres Freund
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Thu, Jan 19, 2023 at 2:54 PM Andres Freund <andres@anarazel.de> wrote: > Yea. Hence my musing about potentially addressing this by choosing to visit > additional blocks during the heap scan using vacuum's block sampling logic. I'd rather just invent a way for vacuumlazy.c to tell the top-level vacuum.c caller "I didn't update reltuples, but you ought to go ANALYZE the table now that I'm done, even if you weren't already planning to do so". This wouldn't have to happen every time, but it would happen fairly often. > IME most of the time in analyze isn't spent doing IO for the sample blocks > themselves, but CPU time and IO for toasted columns. A trimmed down version > that just computes relallvisible should be a good bit faster. I worry about that from a code maintainability point of view. I'm concerned that it won't be very cut down at all, in the end. Presumably you'll want to add the same I/O prefetching logic to this cut-down version, just for example. Since without that there will be no competition between it and ANALYZE proper. Besides which, isn't it kinda wasteful to not just do a full ANALYZE? Sure, you can avoid detoasting overhead that way. But even still. -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Andres Freund
Date:
Hi, On 2023-01-19 15:10:38 -0800, Peter Geoghegan wrote: > On Thu, Jan 19, 2023 at 2:54 PM Andres Freund <andres@anarazel.de> wrote: > > Yea. Hence my musing about potentially addressing this by choosing to visit > > additional blocks during the heap scan using vacuum's block sampling logic. > > I'd rather just invent a way for vacuumlazy.c to tell the top-level > vacuum.c caller "I didn't update reltuples, but you ought to go > ANALYZE the table now that I'm done, even if you weren't already > planning to do so". I'm worried about increasing the number of analyzes that much - on a subset of workloads it's really quite slow. Another version of this could be to integrate analyze.c's scan more closely with vacuum all the time. It's a bit bonkers that we often sequentially read blocks, evict them from shared buffers if we read them, just to then afterwards do random IO for blocks we've already read. That's imo what we eventually should do, but clearly it's not a small project. > This wouldn't have to happen every time, but it would happen fairly often. Do you have a mechanism for that in mind? Just something vacuum_count % 10 == 0 like? Or remember scanned_pages in pgstats and re-computing > > IME most of the time in analyze isn't spent doing IO for the sample blocks > > themselves, but CPU time and IO for toasted columns. A trimmed down version > > that just computes relallvisible should be a good bit faster. > > I worry about that from a code maintainability point of view. I'm > concerned that it won't be very cut down at all, in the end. I think it'd be fine to just use analyze.c and pass in an option to not compute column and inheritance stats. > Presumably you'll want to add the same I/O prefetching logic to this > cut-down version, just for example. Since without that there will be > no competition between it and ANALYZE proper. Besides which, isn't it > kinda wasteful to not just do a full ANALYZE? Sure, you can avoid > detoasting overhead that way. But even still. It's not just that analyze is expensive, I think it'll also be confusing if the column stats change after a manual VACUUM without ANALYZE. It shouldn't be too hard to figure out whether we're going to do an analyze anyway and not do the rowcount-estimate version when doing VACUUM ANALYZE or if autovacuum scheduled an analyze as well. Greetings, Andres Freund
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Thu, Jan 19, 2023 at 3:38 PM Andres Freund <andres@anarazel.de> wrote: > Another version of this could be to integrate analyze.c's scan more closely > with vacuum all the time. It's a bit bonkers that we often sequentially read > blocks, evict them from shared buffers if we read them, just to then > afterwards do random IO for blocks we've already read. That's imo what we > eventually should do, but clearly it's not a small project. Very often, what we're really missing in VACUUM is high level context. That's true of what you say here, about analyze.c, as well as complaints like your vac_estimate_reltuples complaint. The problem scenarios involving vac_estimate_reltuples all involve repeating the same silly action again and again, never realizing that that's what's going on. I've found it very useful to think of one VACUUM as picking up where the last one left off for my work on freezing. This seems related to pre-autovacuum historical details. VACUUM shouldn't really be a command in the same way that CREATE INDEX is a command. I do think that we need to retain a VACUUM command in some form, but it should be something pretty close to a command that just enqueues off-schedule autovacuums. That can do things like coalesce duplicate requests into one. Anyway, I am generally in favor of a design that makes VACUUM and ANALYZE things that are more or less owned by autovacuum. It should be less and less of a problem to blur the distinction between VACUUM and ANALYZE under this model, in each successive release. These distinctions are quite unhelpful, overall, because they make it hard for autovacuum scheduling to work at the whole-system level. > > This wouldn't have to happen every time, but it would happen fairly often. > > Do you have a mechanism for that in mind? Just something vacuum_count % 10 == > 0 like? Or remember scanned_pages in pgstats and re-computing I was thinking of something very simple like that, yes. > I think it'd be fine to just use analyze.c and pass in an option to not > compute column and inheritance stats. That could be fine. Just as long as it's not duplicative in an obvious way. > > Presumably you'll want to add the same I/O prefetching logic to this > > cut-down version, just for example. Since without that there will be > > no competition between it and ANALYZE proper. Besides which, isn't it > > kinda wasteful to not just do a full ANALYZE? Sure, you can avoid > > detoasting overhead that way. But even still. > > It's not just that analyze is expensive, I think it'll also be confusing if > the column stats change after a manual VACUUM without ANALYZE. Possibly, but it doesn't have to happen there. It's not like the rules aren't a bit different compared to autovacuum already. For example, the way TOAST tables are handled by the VACUUM command versus autovacuum. Even if it's valuable to maintain this kind of VACUUM/autovacuum parity (which I tend to doubt), doesn't the same argument work almost as well with whatever stripped down version you come up with? It's also confusing that a manual VACUUM command will be doing an ANALYZE-like thing. Especially in cases where it's really expensive relative to the work of VACUUM, because VACUUM scanned so few pages. You just have to make some kind of trade-off. -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Robert Haas
Date:
On Thu, Jan 19, 2023 at 5:51 PM Andres Freund <andres@anarazel.de> wrote: > I don't agree. But mainly my issue is that the devil you know (how this has > worked for a while) is preferrable to introducing an unknown quantity (your > patch that hasn't yet seen real world exposure). Yeah, this is a major reason why I'm very leery about changes in this area. A lot of autovacuum behavior is emergent, in the sense that it wasn't directly intended by whoever wrote the code. It's just a consequence of other decisions that probably seemed very reasonable at the time they were made but turned out to have surprising and unpleasant consequences. In this particular case, I think that there is a large risk that postponing auto-cancellation will make things significantly worse, possibly drastically worse, for a certain class of users - specifically, those whose vacuums often get auto-cancelled. I think that it's actually pretty common for people to have workloads where something pretty close to all of the autovacuums get auto-cancelled on certain tables, and those people are always hard up against autovacuum_freeze_max_age because they *have* to hit that in order to get any vacuuming done on the affected tables. If the default threshold for auto-cancellation goes up, those people will be vacuuming even less often than they are now. That's why I really liked your idea of decoupling auto-cancellation from XID age. Such an approach can still avoid disabling auto-cancellation just because autovacuum_freeze_max_age has been hit, but it can also disable it much earlier when it detects that doing so is necessary to make progress. -- Robert Haas EDB: http://www.enterprisedb.com
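(One way to see how often this is actually happening on a given system is to let autovacuum log everything and then look for the cancellation messages in the server log; the exact message wording below is from memory and may vary somewhat between versions.)

    ALTER SYSTEM SET log_autovacuum_min_duration = 0;
    SELECT pg_reload_conf();
    -- Cancelled workers then show up in the server log as something like
    --   ERROR:  canceling autovacuum task
    --   CONTEXT:  automatic vacuum of table "db.schema.tbl"
    -- and the backend requesting the cancellation logs a line along the
    -- lines of "sending cancel to blocking autovacuum PID ...".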
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Fri, Jan 20, 2023 at 5:47 AM Robert Haas <robertmhaas@gmail.com> wrote: > Yeah, this is a major reason why I'm very leery about changes in this > area. A lot of autovacuum behavior is emergent, in the sense that it > wasn't directly intended by whoever wrote the code. It's just a > consequence of other decisions that probably seemed very reasonable at > the time they were made but turned out to have surprising and > unpleasant consequences. I certainly agree with your general description of the ways things are. To a large degree we're living in a world where DBAs have already compensated for some of the autovacuum shortcomings discussed on this thread. For example, by setting autovacuum_vacuum_scale_factor (and even autovacuum_vacuum_insert_scale_factor) to very low values, to compensate for the issues with random sampling of dead tuples by analyze, and to compensate for the way that VACUUM doesn't reason correctly about how the number of dead tuples changes as VACUUM runs. They might not have thought of it that way -- it could have happened as a byproduct of tuning a production system through trial and error -- but it still counts as compensating for a defect in autovacuum scheduling IMV. It's actually quite likely that even a strict improvement to (say) autovacuum scheduling will cause some number of regressions, since now what were effectively mitigations become unnecessary. This is somewhat similar to the dynamic with optimizer improvements, where (say) a selectivity estimate function that's better by every available metric can still easily cause regressions that really cannot be blamed on the improvement itself. I personally believe that it's a price worth paying when it comes to the issues with autovacuum statistics, particularly the dead tuple count issues. Since much of the behavior that we sometimes see is just absurdly bad. We have both water tight theoretical arguments and practical experiences pointing in that direction. > In this particular case, I think that there is a large risk that > postponing auto-cancellation will make things significantly worse, > possibly drastically worse, for a certain class of users - > specifically, those whose vacuums often get auto-cancelled. I agree that that's a real concern for the autocancellation side of things. That seems quite different to the dead tuples accounting issues, in that nobody would claim that the existing behavior is flagrantly wrong (just that it sometimes causes problems instead of preventing them). > That's why I really liked your idea of decoupling auto-cancellation > from XID age. Such an approach can still avoid disabling > auto-cancellation just because autovacuum_freeze_max_age has been hit, > but it can also disable it much earlier when it detects that doing so > is necessary to make progress. To be clear, I didn't think that that's what Andres was proposing, and my recent v5 doesn't actually do that. Even in v5, it's still fundamentally impossible for autovacuums that are triggered by the tuples inserted or dead tuples thresholds to not be autocancellable. ISTM that it doesn't make sense to condition the autocancellation behavior on table XID age in the context of dead tuple VACUUMs. It could either be way too early or way too late at that point. I was rather hoping to not have to build the infrastructure required for fully decoupling the autocancellation behavior from the triggering condition (dead tuples vs table XID age) in the scope of this thread/patch, though I can see the appeal of that. 
The only reason why I'm using table age at all is because that's how it works already, rightly or wrongly. If nothing else, it's pretty clear that there is no theoretical or practical reason why it has to be exactly the same table age as the one for launching autovacuums to advance relfrozenxid/relminmxid. In v5 of the patch, the default is to use 1.8x of the threshold that initially makes autovacuum.c want to launch autovacuums to deal with table age. That's based on a suggestion from Andres, but I'd be almost as happy with a default as low as 1.1x or even 1.05x. That's going to make very little difference to those users that really rely on the no-auto-cancellation behavior, while at the same time making things a lot safer for scenarios like the Joyent/Manta "DROP TRIGGER" outage (not perfectly safe, by any means, but meaningfully safer). -- Peter Geoghegan
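(For reference, the ages involved are easy to check against autovacuum_freeze_max_age, and therefore against any multiplier such as the 1.8x above, with a catalog query along these lines.)

    SELECT c.oid::regclass                                   AS table_name,
           age(c.relfrozenxid)                               AS xid_age,
           mxid_age(c.relminmxid)                            AS mxid_age,
           current_setting('autovacuum_freeze_max_age')::int AS freeze_max_age
    FROM pg_class c
    WHERE c.relkind IN ('r', 'm', 't')
    ORDER BY age(c.relfrozenxid) DESC
    LIMIT 10;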
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Robert Haas
Date:
On Fri, Jan 20, 2023 at 3:43 PM Peter Geoghegan <pg@bowt.ie> wrote: > The only reason why I'm using table age at all is because that's how > it works already, rightly or wrongly. If nothing else, it's pretty > clear that there is no theoretical or practical reason why it has to > be exactly the same table age as the one for launching autovacuums to > advance relfrozenxid/relminmxid. In v5 of the patch, the default is to > use 1.8x of the threshold that initially makes autovacuum.c want to > launch autovacuums to deal with table age. That's based on a > suggestion from Andres, but I'd be almost as happy with a default as > low as 1.1x or even 1.05x. That's going to make very little difference > to those users that really rely on the no-auto-cancellation behavior, > while at the same time making things a lot safer for scenarios like > the Joyent/Manta "DROP TRIGGER" outage (not perfectly safe, by any > means, but meaningfully safer). It doesn't seem that way to me. What am I missing? In that case, the problem was a DROP TRIGGER command waiting behind autovacuum's lock and thus causing all new table locks to wait behind DROP TRIGGER's lock request. But it does not sound like that was a one-off event. It sounds like they used DROP TRIGGER pretty regularly. So I think this sounds like exactly the kind of case I was talking about, where autovacuums keep getting cancelled until we decide to stop cancelling them. If so, then they were going to have a problem whenever that happened. Delaying the point at which we stop cancelling them, as your patch would do, would not help at all. What about stopping cancelling them sooner, as with the proposal to switch to that behavior after a certain number of auto-cancels? That doesn't prevent the problem either. If it's aggressive enough, it has some chance of making the problem visible in a well-run test environment, which could conceivably prevent you from hitting it in production, but certainly there are no guarantees. -- Robert Haas EDB: http://www.enterprisedb.com
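(The pile-up being described, where every new locker queues behind the DDL, which itself queues behind the autovacuum worker, is visible while it is happening with standard monitoring functions, e.g.:)

    SELECT pid,
           pg_blocking_pids(pid) AS blocked_by,
           wait_event_type,
           wait_event,
           state,
           left(query, 60)       AS query
    FROM pg_stat_activity
    WHERE cardinality(pg_blocking_pids(pid)) > 0;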
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Fri, Jan 20, 2023 at 12:57 PM Robert Haas <robertmhaas@gmail.com> wrote: > It doesn't seem that way to me. What am I missing? In that case, the > problem was a DROP TRIGGER command waiting behind autovacuum's lock > and thus causing all new table locks to wait behind DROP TRIGGER's > lock request. But it does not sound like that was a one-off event. It's true that I cannot categorically state that it would have made the crucial difference in this particular case. It comes down to two factors: 1. How many attempts would any given amount of additional XID space head room have bought them in practice? We can be all but certain that the smallest possible number is 1, which is something. 2. Would that have been enough for relfrozenxid to be advanced in practice? I think that it's likely that the answer to 2 is yes, since there was no mention of bloat as a relevant factor at any point in the postmortem. It was all about locking characteristics of antiwraparound autovacuuming in particular, and its interaction with their application. I think that they were perfectly okay with the autovacuum cancellation behavior most of the time. In fact, I don't think that there was any bloat in the table at all -- it was a really huge table (likely an events table), and those tend to be append-only. Even if I'm wrong about this specific case (we'll never know for sure), the patch as written would be virtually guaranteed to make the crucial differences in cases that I have seen up close. For example, a case with TRUNCATE. > It sounds like they used DROP TRIGGER pretty regularly. So I think this > sounds like exactly the kind of case I was talking about, where > autovacuums keep getting cancelled until we decide to stop cancelling > them. I don't know how you can reach that conclusion. The chances of a non-aggressive VACUUM advancing relfrozenxid right now are virtually zero, at least for a big table like this one. It seems quite likely that plenty of non-aggressive autovacuums completed, or would have had the insert-driven autovacuum feature been available. The whole article was about how this DROP TRIGGER pattern worked just fine most of the time, because most of the time autovacuum was just autocancelled. They say this at one point: "The normal autovacuum mechanism is skipped when locks are held in order to minimize service disruption. However, because transaction wraparound is such a severe problem, if the system gets too close to wraparound, an autovacuum is launched that does not back off under lock contention." At another point: "When the outage was resolved, we still had a number of questions: is a wraparound autovacuum always so disruptive? Given that it was blocking all table operations, why does it throttle itself?" ISTM that it was a combination of aggressive vacuuming taking far longer than usual (especially likely because this was pre freeze map), and the no-auto-cancel behavior. Aggressive/antiwraparound VACUUMs are naturally much more likely to coincide with periodic DDL, just because they take so much longer. That is a dangerous combination. -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Tue, Jan 17, 2023 at 10:02 AM Andres Freund <andres@anarazel.de> wrote: > I think with a bit of polish "Add autovacuum trigger instrumentation." ought > to be quickly mergeable. Any thoughts on v5-0001-*? It would be nice to get the uncontentious part of all this (which is the instrumentation patch) out of the way soon. -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Andres Freund
Date:
Hi, On 2023-01-21 18:37:59 -0800, Peter Geoghegan wrote: > On Tue, Jan 17, 2023 at 10:02 AM Andres Freund <andres@anarazel.de> wrote: > > I think with a bit of polish "Add autovacuum trigger instrumentation." ought > > to be quickly mergeable. > > Any thoughts on v5-0001-*? > > It would be nice to get the uncontentious part of all this (which is > the instrumentation patch) out of the way soon. Is https://www.postgresql.org/message-id/CAH2-WzmytCuSpaMEhv8H-jt8x_9whTi0T5bjNbH2gvaR0an2Pw%40mail.gmail.com the last / relevant version of the patch to look at? Greetings, Andres Freund
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Sat, Jan 21, 2023 at 6:54 PM Andres Freund <andres@anarazel.de> wrote: > Is https://www.postgresql.org/message-id/CAH2-WzmytCuSpaMEhv8H-jt8x_9whTi0T5bjNbH2gvaR0an2Pw%40mail.gmail.com > the last / relevant version of the patch to look at? Yes. I'm mostly just asking about v5-0001-* right now. -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Andres Freund
Date:
Hi, On 2023-01-18 17:54:27 -0800, Peter Geoghegan wrote: > From 0afaf310255a068d3c1ca9d2ce6f00118cbff890 Mon Sep 17 00:00:00 2001 > From: Peter Geoghegan <pg@bowt.ie> > Date: Fri, 25 Nov 2022 11:23:20 -0800 > Subject: [PATCH v5 1/2] Add autovacuum trigger instrumentation. > > Add new instrumentation that lists a triggering condition in the server > log whenever an autovacuum is logged. This reports "table age" as the > triggering criteria when antiwraparound autovacuum runs (the XID age > trigger case and the MXID age trigger case are represented separately). > Other cases are reported as autovacuum trigger when the tuple insert > thresholds or the dead tuple thresholds were crossed. > > Author: Peter Geoghegan <pg@bowt.ie> > Reviewed-By: Andres Freund <andres@anarazel.de> > Reviewed-By: Jeff Davis <pgsql@j-davis.com> > Discussion: https://postgr.es/m/CAH2-Wz=S-R_2rO49Hm94Nuvhu9_twRGbTm6uwDRmRu-Sqn_t3w@mail.gmail.com > --- > src/include/commands/vacuum.h | 19 +++- > src/backend/access/heap/vacuumlazy.c | 5 ++ > src/backend/commands/vacuum.c | 31 ++++++- > src/backend/postmaster/autovacuum.c | 124 ++++++++++++++++++--------- > 4 files changed, 137 insertions(+), 42 deletions(-) > > diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h > index 689dbb770..13f70a1f6 100644 > --- a/src/include/commands/vacuum.h > +++ b/src/include/commands/vacuum.h > @@ -191,6 +191,21 @@ typedef struct VacAttrStats > #define VACOPT_SKIP_DATABASE_STATS 0x100 /* skip vac_update_datfrozenxid() */ > #define VACOPT_ONLY_DATABASE_STATS 0x200 /* only vac_update_datfrozenxid() */ > > +/* > + * Values used by autovacuum.c to tell vacuumlazy.c about the specific > + * threshold type that triggered an autovacuum worker. > + * > + * AUTOVACUUM_NONE is used when VACUUM isn't running in an autovacuum worker. > + */ > +typedef enum AutoVacType > +{ > + AUTOVACUUM_NONE = 0, > + AUTOVACUUM_TABLE_XID_AGE, > + AUTOVACUUM_TABLE_MXID_AGE, > + AUTOVACUUM_DEAD_TUPLES, > + AUTOVACUUM_INSERTED_TUPLES, > +} AutoVacType; Why is there TABLE_ in AUTOVACUUM_TABLE_XID_AGE but not AUTOVACUUM_DEAD_TUPLES? Both are on tables. What do you think about naming this VacuumTriggerType and adding an VAC_TRIG_MANUAL or such? > /* > * Values used by index_cleanup and truncate params. > * > @@ -222,7 +237,8 @@ typedef struct VacuumParams > * use default */ > int multixact_freeze_table_age; /* multixact age at which to scan > * whole table */ > - bool is_wraparound; /* force a for-wraparound vacuum */ > + bool is_wraparound; /* antiwraparound autovacuum? */ > + AutoVacType trigger; /* autovacuum trigger condition, if any */ The comment change for is_wraparound seems a bit pointless, but whatever. > @@ -2978,7 +2995,10 @@ relation_needs_vacanalyze(Oid relid, > bool *doanalyze, > bool *wraparound) > { The changes here are still bigger than I'd like, but ... > - bool force_vacuum; > + TransactionId relfrozenxid = classForm->relfrozenxid; > + MultiXactId relminmxid = classForm->relminmxid; > + AutoVacType trigger = AUTOVACUUM_NONE; > + bool tableagevac; Here + below we end up with three booleans that just represent the choices in our fancy new enum. That seems awkward to me. 
> @@ -3169,14 +3212,15 @@ autovacuum_do_vac_analyze(autovac_table *tab, BufferAccessStrategy bstrategy) > static void > autovac_report_activity(autovac_table *tab) > { > -#define MAX_AUTOVAC_ACTIV_LEN (NAMEDATALEN * 2 + 56) > +#define MAX_AUTOVAC_ACTIV_LEN (NAMEDATALEN * 2 + 100) > char activity[MAX_AUTOVAC_ACTIV_LEN]; > int len; > > /* Report the command and possible options */ > if (tab->at_params.options & VACOPT_VACUUM) > snprintf(activity, MAX_AUTOVAC_ACTIV_LEN, > - "autovacuum: VACUUM%s", > + "autovacuum for %s: VACUUM%s", > + vac_autovacuum_trigger_msg(tab->at_params.trigger), > tab->at_params.options & VACOPT_ANALYZE ? " ANALYZE" : ""); > else > snprintf(activity, MAX_AUTOVAC_ACTIV_LEN, Somehow the added "for ..." sounds a bit awkward. "autovacuum for table XID age". Maybe "autovacuum due to ..."? Greetings, Andres Freund
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Mon, Jan 23, 2023 at 6:56 PM Andres Freund <andres@anarazel.de> wrote: > Why is there TABLE_ in AUTOVACUUM_TABLE_XID_AGE but not > AUTOVACUUM_DEAD_TUPLES? Both are on tables. Why does vacuum_freeze_table_age contain the word "table", while autovacuum_vacuum_scale_factor does not? To me, "table XID age" is a less awkward term for "relfrozenxid advancing", useful in contexts where it's probably more important to be understood by non-experts than it is to be unambiguous. Besides, relfrozenxid works at the level of the pg_class metadata. Nothing whatsoever needs to have changed about the table itself, nor will anything necessarily be changed by VACUUM (except the relfrozenxid field from pg_class). > What do you think about naming this VacuumTriggerType and adding an > VAC_TRIG_MANUAL or such? But we're not doing anything with it in the context of manual VACUUMs. I'd prefer to keep this about what autovacuum.c thought needed to happen, at least for as long as manual VACUUMs are something that autovacuum.c knows nothing about. > > - bool force_vacuum; > > + TransactionId relfrozenxid = classForm->relfrozenxid; > > + MultiXactId relminmxid = classForm->relminmxid; > > + AutoVacType trigger = AUTOVACUUM_NONE; > > + bool tableagevac; > > Here + below we end up with three booleans that just represent the choices in > our fancy new enum. That seems awkward to me. I don't follow. It's true that "wraparound" is still a synonym of "tableagevac" in 0001, but that changes in 0002. And even if you assume that 0002 won't get in, I think that it still makes sense to structure it in a way that shows that in principle the "wraparound" behaviors don't necessarily have to be used whenever "tableagevac" is in use. > > @@ -3169,14 +3212,15 @@ autovacuum_do_vac_analyze(autovac_table *tab, BufferAccessStrategy bstrategy) > > static void > > autovac_report_activity(autovac_table *tab) > Somehow the added "for ..." sounds a bit awkward. "autovacuum for table XID > age". Maybe "autovacuum due to ..."? That works just as well IMV. I'll change it to that. Anything else for 0001? Would be nice to get it committed tomorrow. -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Robert Haas
Date:
On Fri, Jan 20, 2023 at 4:24 PM Peter Geoghegan <pg@bowt.ie> wrote: > > It sounds like they used DROP TRIGGER pretty regularly. So I think this > > sounds like exactly the kind of case I was talking about, where > > autovacuums keep getting cancelled until we decide to stop cancelling > > them. > > I don't know how you can reach that conclusion. I can accept that there might be some way I'm wrong about this in theory, but your email then seems to go on to say that I'm right just a few sentences later: > The whole article was about how this DROP TRIGGER pattern worked just > fine most of the time, because most of the time autovacuum was just > autocancelled. They say this at one point: > > "The normal autovacuum mechanism is skipped when locks are held in > order to minimize service disruption. However, because transaction > wraparound is such a severe problem, if the system gets too close to > wraparound, an autovacuum is launched that does not back off under > lock contention." If this isn't arguing in favor of exactly what I'm saying, I don't know what that would look like. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Tue, Jan 24, 2023 at 11:21 AM Robert Haas <robertmhaas@gmail.com> wrote: > > The whole article was about how this DROP TRIGGER pattern worked just > > fine most of the time, because most of the time autovacuum was just > > autocancelled. They say this at one point: > > > > "The normal autovacuum mechanism is skipped when locks are held in > > order to minimize service disruption. However, because transaction > > wraparound is such a severe problem, if the system gets too close to > > wraparound, an autovacuum is launched that does not back off under > > lock contention." > > If this isn't arguing in favor of exactly what I'm saying, I don't > know what that would look like. I'm happy to clear that up. What you said was: "So I think this sounds like exactly the kind of case I was talking about, where autovacuums keep getting cancelled until we decide to stop cancelling them. If so, then they were going to have a problem whenever that happened." Just because *some* autovacuums get cancelled doesn't mean they *all* get cancelled. And, even if the rate is quite high, that may not be much of a problem in itself (especially now that we have the freeze map). 200 million XIDs usually amounts to a lot of wall clock time. Even if it is rather difficult to finish up, we only have to get lucky once. The fact that autovacuum eventually got to the point of requiring an antiwraparound autovacuum on the problematic table does indeed strongly suggest that any other, earlier autovacuums were relatively unlikely to have advanced relfrozenxid in the end -- or at least couldn't on this one occasion. But that in itself is just not relevant to our current discussion, since even the tiniest perturbation would have been enough to prevent a non-aggressive VACUUM from being able to advance relfrozenxid. Before 15, non-aggressive VACUUMs would throw away the opportunity to do so just because they couldn't immediately get a cleanup lock on one single heap page. It's quite possible that most or all prior aggressive VACUUMs were not antiwraparound autovacuums, because the dead tuples accounting was enough to launch an autovacuum at some point after age(relfrozenxid) exceeded vacuum_freeze_table_age that was still before it could reach autovacuum_freeze_max_age. That would give you a cancellable aggressive VACUUM -- a VACUUM that actually has a non-zero chance of advancing relfrozenxid. Sure, it's possible that such a cancellable aggressive autovacuum was indeed cancelled, and that that factor made the crucial difference. But I find it far easier to believe that there simply was no such aggressive autovacuum in the first place (not this time), since it could have only happened when autovacuum thinks that there are sufficiently many dead tuples to justify launching an autovacuum in the first place. Which, as we now all accept, is based on highly dubious sampling by ANALYZE. So I think it's much more likely to be that factor (dead tuple accounting is bad), as well as the absurd false dichotomy between aggressive and non-aggressive -- plus the issue at hand, the auto-cancellation behavior. I don't claim to know what is inevitable, or what is guaranteed to work or not work. I only claim that we can meaningfully reduce the absolute risk by using a fairly simple approach, principally by not needlessly coupling the auto-cancellation behavior to *all* autovacuums that are specifically triggered by age(relfrozenxid). As Andres said at one point, doing those two things at exactly the same time is just arbitrary. 
-- Peter Geoghegan
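(The thresholds being contrasted in this exchange, the point at which a VACUUM becomes aggressive versus the point at which an antiwraparound autovacuum is forced, are ordinary settings, so the amount of room between them on any given system can simply be read out:)

    SELECT name, setting, boot_val
    FROM pg_settings
    WHERE name IN ('vacuum_freeze_min_age',
                   'vacuum_freeze_table_age',
                   'autovacuum_freeze_max_age',
                   'vacuum_multixact_freeze_table_age',
                   'autovacuum_multixact_freeze_max_age');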
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Andres Freund
Date:
Hi, On 2023-01-23 19:22:18 -0800, Peter Geoghegan wrote: > On Mon, Jan 23, 2023 at 6:56 PM Andres Freund <andres@anarazel.de> wrote: > > Why is there TABLE_ in AUTOVACUUM_TABLE_XID_AGE but not > > AUTOVACUUM_DEAD_TUPLES? Both are on tables. > > Why does vacuum_freeze_table_age contain the word "table", while > autovacuum_vacuum_scale_factor does not? I don't know. But that's not really a reason to introduce more oddities. > To me, "table XID age" is a less awkward term for "relfrozenxid > advancing", useful in contexts where it's probably more important to > be understood by non-experts than it is to be unambiguous. Besides, > relfrozenxid works at the level of the pg_class metadata. Nothing > whatsoever needs to have changed about the table itself, nor will > anything necessarily be changed by VACUUM (except the relfrozenxid > field from pg_class). I'd just go for "xid age", I don't see a point in adding 'table', particularly when you don't for dead tuples. > > What do you think about naming this VacuumTriggerType and adding an > > VAC_TRIG_MANUAL or such? > > But we're not doing anything with it in the context of manual VACUUMs. It's a member of a struct passed to the routines handling both manual and interactive vacuum. And we could e.g. eventually start replace IsAutoVacuumWorkerProcess() checks with it - which aren't e.g. going to work well if we add parallel index vacuuming support to autovacuum. > I'd prefer to keep this about what autovacuum.c thought needed to > happen, at least for as long as manual VACUUMs are something that > autovacuum.c knows nothing about. It's an enum defined in a general header, not something in autovacuum.c - so I don't really buy this. > > > - bool force_vacuum; > > > + TransactionId relfrozenxid = classForm->relfrozenxid; > > > + MultiXactId relminmxid = classForm->relminmxid; > > > + AutoVacType trigger = AUTOVACUUM_NONE; > > > + bool tableagevac; > > > > Here + below we end up with three booleans that just represent the choices in > > our fancy new enum. That seems awkward to me. > > I don't follow. It's true that "wraparound" is still a synonym of > "tableagevac" in 0001, but that changes in 0002. And even if you > assume that 0002 won't get in, I think that it still makes sense to > structure it in a way that shows that in principle the "wraparound" > behaviors don't necessarily have to be used whenever "tableagevac" is > in use. You have booleans tableagevac, deadtupvac, inserttupvac. Particularly the latter ones really are just a rephrasing of the trigger: + tableagevac = true; + *wraparound = false; + /* See header comments about trigger precedence */ + if (TransactionIdIsNormal(relfrozenxid) && + TransactionIdPrecedes(relfrozenxid, xidForceLimit)) + trigger = AUTOVACUUM_TABLE_XID_AGE; + else if (MultiXactIdIsValid(relminmxid) && + MultiXactIdPrecedes(relminmxid, multiForceLimit)) + trigger = AUTOVACUUM_TABLE_MXID_AGE; + else + tableagevac = false; + + /* User disabled non-table-age autovacuums in pg_class.reloptions? */ + if (!av_enabled && !tableagevac) ... + deadtupvac = (vactuples > vacthresh); + inserttupvac = (vac_ins_base_thresh >= 0 && instuples > vacinsthresh); + /* See header comments about trigger precedence */ + if (!tableagevac) + { + if (deadtupvac) + trigger = AUTOVACUUM_DEAD_TUPLES; + else if (inserttupvac) + trigger = AUTOVACUUM_INSERTED_TUPLES; + } + /* Determine if this table needs vacuum or analyze. 
*/ - *dovacuum = force_vacuum || (vactuples > vacthresh) || - (vac_ins_base_thresh >= 0 && instuples > vacinsthresh); + *dovacuum = (tableagevac || deadtupvac || inserttupvac); I find this to be awkward code. The booleans are kinda pointless, and the tableagevac case is hard to follow because trigger is set elsewhere. I can give reformulating it a go. Need to make some food first. I suspect that the code would look better if we didn't continue to have "bool *dovacuum" and the trigger. They're redundant. > Anything else for 0001? Would be nice to get it committed tomorrow. Sorry, today was busy with meetings and bashing my head against AIX. Greetings, Andres Freund
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Andres Freund
Date:
Hi, On 2023-01-24 20:59:04 -0800, Andres Freund wrote: > I find this to be awkward code. The booleans are kinda pointless, and the > tableagevac case is hard to follow because trigger is set elsewhere. > > I can give reformulating it a go. Need to make some food first. Here's a draft of what I am thinking of. Not perfect yet, but I think it looks better. The pg_stat_activity output looks like this right now: autovacuum due to table XID age: VACUUM public.pgbench_accounts (to prevent wraparound) Why don't we drop the "(to prevent wraparound)" now? And I still think removing the "table " bit would be an improvement. Greetings, Andres Freund
Attachment
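(For context, the string under discussion is what autovacuum workers report as their current activity, so it is what appears in pg_stat_activity.query, e.g. viewable with:)

    SELECT pid, state, query
    FROM pg_stat_activity
    WHERE backend_type = 'autovacuum worker';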
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Robert Haas
Date:
On Tue, Jan 24, 2023 at 3:33 PM Peter Geoghegan <pg@bowt.ie> wrote: > Sure, it's possible that such a cancellable aggressive autovacuum was > indeed cancelled, and that that factor made the crucial difference. > But I find it far easier to believe that there simply was no such > aggressive autovacuum in the first place (not this time), since it > could have only happened when autovacuum thinks that there are > sufficiently many dead tuples to justify launching an autovacuum in > the first place. Which, as we now all accept, is based on highly > dubious sampling by ANALYZE. So I think it's much more likely to be > that factor (dead tuple accounting is bad), as well as the absurd > false dichotomy between aggressive and non-aggressive -- plus the > issue at hand, the auto-cancellation behavior. In my opinion, this is too speculative to justify making changes to the behavior. I'm not here to defend the way we do dead tuple accounting. I think it's a hot mess. But whether or not it played any role in this catastrophe is hard to say. The problems with dead tuple accounting are, as I understand it, all about statistical independence. That is, we make assumptions that what's true of the sample is likely to be true of the whole table when in reality it may not be true at all. Perhaps it's even unlikely to be true. But the kinds of problems you get from assuming statistical independence tend to hit users very unevenly. We make similar assumptions about selectivity estimation: unless there are extended statistics, we take P(a=1 and b=1) = P(a=1)*P(b = 1), which can be vastly and dramatically wrong. People can and do get extremely bad query plans as a result of that assumption. However, other people run PostgreSQL for years and years and never really have big problems with it. I'd say that it's completely fair to describe this as a big problem, but we can't therefore conclude that some particular user has this problem, not even if we know that they have a slow query and we know that it's due to a bad plan. And similarly here, I don't see a particular reason to think that your theory about what happened is more likely than mine. I freely admit that yours could be right, just as you admitted that mine could be right. But I think we really just don't know. It feels unlikely to me that there was ONE cancellable aggressive autovacuum and that it got cancelled. I think that it's probably either ZERO or LOTS, depending on whether the dead tuple threshold was ever reached. If it wasn't, then it must have been zero. But if it was, and the first autovacuum worker to visit that table got cancelled, then the next one would try again. And autovacuum_naptime=1m, so if we're not out of autovacuum workers, we're going to retry that table every minute. If we do run out of autovacuum workers, which is pretty likely, we'll still launch new workers in that database as often as we can given when other workers exit. If the system is very tight on autovacuum capacity, probably because the cost limit is too low, then you could have a situation where only one try gets made before we hit autovacuum_freeze_max_age. Otherwise, though, a single failed try would probably lead to trying a whole lot more times after that, and you only hit autovacuum_freeze_max_age if all those attempts fail. At the risk of repeating myself, here's what bugs me. If we suppose that your intuition is right and no aggressive autovacuum happened before autovacuum_freeze_max_age was reached, then what you are proposing will make things better. 
But if we suppose that my intuition is right and many aggressive autovacuums happened before autovacuum_freeze_max_age was reached, then what you are proposing will make things worse, because if we've been auto-cancelling repeatedly we're probably going to keep doing so until we shut that behavior off, and we want a vacuum to succeed sooner rather than later. So it doesn't feel obvious to me that we should change anything. Even if we knew what had happened for certain in this particular case, I don't know how we can possibly know what is typical in similar cases. My personal support experience has been that cases where autovacuum runs a lot but doesn't solve the problem for some reason are a lot more common than cases where it doesn't run when it should have done. That probably accounts for my intuition about what is likely to have happened here. But as my colleagues are often at pains to point out to me, my experiences aren't representative of what happens to PostgreSQL users generally for all kinds of reasons, and therefore sometimes my intuition is wrong. But since I have nobody else's experiences to use in lieu of my own, I don't know what else I can use to judge anything. -- Robert Haas EDB: http://www.enterprisedb.com
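(The independence assumption Robert refers to is easy to see first-hand, and extended statistics are the existing remedy on the planner side; the exact estimates will vary, and the table and statistics names here are illustrative.)

    -- a and b are perfectly correlated, but the planner assumes independence
    CREATE TABLE corr_demo AS
        SELECT i % 100 AS a, i % 100 AS b FROM generate_series(1, 100000) AS i;
    ANALYZE corr_demo;
    EXPLAIN SELECT * FROM corr_demo WHERE a = 1 AND b = 1;
    -- estimated ~10 rows (100000 * 1/100 * 1/100), actual 1000

    CREATE STATISTICS corr_demo_a_b (dependencies) ON a, b FROM corr_demo;
    ANALYZE corr_demo;
    EXPLAIN SELECT * FROM corr_demo WHERE a = 1 AND b = 1;
    -- now estimated close to the actual 1000 rows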
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
From
Peter Geoghegan
Date:
On Wed, Jan 25, 2023 at 1:19 AM Andres Freund <andres@anarazel.de> wrote: > Here's a draft of what I am thinking of. Not perfect yet, but I think it looks > better. I'm afraid that I will be unable to do any more work on this project. I have withdrawn it from the CF app. If you would like to complete some or all of the patches yourself, in part or in full, I have no objections. -- Peter Geoghegan