Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation |
Date | |
Msg-id | CAH2-Wzm8UxiYB4U+YboAZwXm40unp7cXVNzpBr233YH2o5PVKw@mail.gmail.com |
In response to | Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
|
List | pgsql-hackers |
On Thu, Jan 12, 2023 at 9:12 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I do agree that it's good to slowly increase the aggressiveness of
> VACUUM as we get further behind, rather than having big behavior
> changes all at once, but I think that should happen by smoothly
> varying various parameters rather than by making discrete behavior
> changes at a whole bunch of different times.

In general I tend to agree, but, as you go on to acknowledge yourself, this particular behavior is inherently discrete. Either the PROC_VACUUM_FOR_WRAPAROUND behavior is in effect, or it isn't.

In many important cases the only kind of autovacuum that ever runs against a certain big table is antiwraparound autovacuum. And therefore every autovacuum that runs against the table must necessarily not be auto cancellable. These are the cases where we see disastrous interactions with automated DDL, such as a TRUNCATE run by a cron job (to stop those annoying antiwraparound autovacuums) -- a heavyweight lock traffic jam that causes the application to lock up.

All that I really want to do here is give an autovacuum that *can* be auto cancelled *some* non-zero chance to succeed with these kinds of tables. TRUNCATE completes immediately, so the AEL is no big deal. Except when it's blocked behind an antiwraparound autovacuum. That kind of interaction is occasionally just disastrous. Even just the tiniest bit of wiggle room could avoid it in most cases, possibly even almost all cases.

> Maybe that's not the right idea, I don't know, and a naive
> implementation might be worse than nothing, but I think it has some
> chance of being worth consideration.

It's a question of priorities. The failsafe isn't supposed to be used (when it is it is a kind of a failure), and so presumably only kicks in on very rare occasions, where nobody was paying attention anyway. So far I've heard no complaints about this, but I've heard lots of complaints about the antiwrap autocancellation behavior.
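
The lock traffic jam described above can be illustrated with a toy model. This is a sketch under simplifying assumptions, not PostgreSQL code: the lock mode names and their conflicts mirror PostgreSQL's table-level locks, but `Session`, `LockQueue`, and `simulate` are hypothetical names, cancellation is modeled as instantaneous (the real mechanism only fires after a lock wait timeout), and a cancelled autovacuum is never relaunched.

```python
from dataclasses import dataclass

# Conflicts among the three lock modes involved (a subset of PostgreSQL's
# table-level lock conflict matrix).
CONFLICTS = {
    "AccessShare": {"AccessExclusive"},                        # SELECT
    "ShareUpdateExclusive": {"ShareUpdateExclusive",
                             "AccessExclusive"},               # VACUUM
    "AccessExclusive": {"AccessShare", "ShareUpdateExclusive",
                        "AccessExclusive"},                    # TRUNCATE
}


@dataclass
class Session:
    name: str
    mode: str
    cancellable: bool = False  # True only for a non-antiwraparound autovacuum


class LockQueue:
    """Hypothetical, simplified lock queue for a single table."""

    def __init__(self):
        self.holders = []
        self.waiters = []

    def _blocked(self, s, ahead):
        # A request waits if it conflicts with a held lock or with a
        # request that is queued ahead of it.
        return any(o.mode in CONFLICTS[s.mode] for o in self.holders + ahead)

    def _wake(self):
        still_waiting = []
        for w in self.waiters:
            if self._blocked(w, still_waiting):
                still_waiting.append(w)
            else:
                self.holders.append(w)
        self.waiters = still_waiting

    def request(self, s):
        # Assumption: a cancellable autovacuum yields at once to a
        # conflicting lock request.
        self.holders = [h for h in self.holders
                        if not (h.cancellable and h.mode in CONFLICTS[s.mode])]
        self.waiters.append(s)
        self._wake()

    def release(self, name):
        self.holders = [h for h in self.holders if h.name != name]
        self._wake()


def simulate(autovacuum_cancellable):
    """Return the names of sessions left waiting on the table."""
    table = LockQueue()
    table.request(Session("autovacuum", "ShareUpdateExclusive",
                          autovacuum_cancellable))
    table.request(Session("cron TRUNCATE", "AccessExclusive"))
    table.release("cron TRUNCATE")  # no-op if TRUNCATE is still waiting
    table.request(Session("app SELECT", "AccessShare"))
    return [w.name for w in table.waiters]


if __name__ == "__main__":
    print("cancellable autovacuum:   ", simulate(True))
    print("antiwraparound autovacuum:", simulate(False))
```

In this model the SELECT does not conflict with the autovacuum itself. It gets stuck only because it queues behind the waiting TRUNCATE, whose AccessExclusive request conflicts with everything.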

> The behavior already changes when you hit
> vacuum_freeze_min_age and then again when you hit
> vacuum_freeze_table_age and then there's also
> autovacuum_freeze_max_age and xidWarnLimit and xidStopLimit and a few
> others, and these settings all interact in pretty complex ways. The
> more conditional logic we add to that, the harder it becomes to
> understand what's actually happening.

In general I strongly agree. In fact that's a big part of what motivates my ongoing work on VACUUM. The user experience is central.

As Andres pointed out, presenting antiwraparound autovacuums as kind of an emergency thing but also somehow a routine thing is just horribly confusing. I want to make them into an emergency thing in every sense -- something that you as a user can reasonably expect to never see (like the failsafe). But if you do see one, then that's a useful signal of an underlying problem with contention, say from automated DDL that pathologically cancels autovacuums again and again.

> Now, you might reply to the above by saying, well, some behaviors
> can't vary continuously. vacuum_cost_limit can perhaps be phased out
> gradually, but autocancellation seems like something that you must
> either do, or not do. I would agree with that. But what I'm saying is
> that we ought to favor having those kinds of behaviors all engage at
> the same point rather than at different times.

Right now aggressive VACUUMs do just about all freezing at the same time, to the extent that many users seem to think that it's a totally different thing with totally different responsibilities to regular VACUUM. Doing everything at the same time like that causes huge practical problems, and is very confusing. I think that users will really appreciate having only one kind of VACUUM/autovacuum (since the other patch gets rid of discrete aggressive mode VACUUMs).
I want "table age autovacuuming" (as I propose to call it) to come to be seen as not any different to any other autovacuum, such as an "insert tuples" autovacuum or a "dead tuples" autovacuum. The difference is only in how autovacuum.c triggers the VACUUM, not in any runtime behavior. That's an important goal here.

> I did take a look at the post-mortem to which you linked, but I am not
> quite sure how that bears on the behavior change under discussion.

The post-mortem involved a single "DROP TRIGGER" that caused chaos when it interacted with the auto cancellation behavior. It would usually complete instantly, so the AEL wasn't actually disruptive, but one day antiwraparound autovacuum made the cron job effectively block all reads and writes for hours.

The similar outages I was called in to help with personally had either an automated TRUNCATE or an automated CREATE INDEX. Had autovacuum only been willing to yield once or twice, then it probably would have been fine -- the situation probably would have worked itself out naturally. That's the best outcome you can hope for.

--
Peter Geoghegan