Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation |
Date | |
Msg-id | 20230114020950.3adnp5rd3cbdg4kb@awork3.anarazel.de Whole thread Raw |
In response to | Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation (Peter Geoghegan <pg@bowt.ie>) |
Responses |
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation |
List | pgsql-hackers |
Hi, On 2023-01-13 16:13:45 -0800, Peter Geoghegan wrote: > On Fri, Jan 13, 2023 at 2:00 PM Andres Freund <andres@anarazel.de> wrote: > > I think it'd be a good idea to split off the part of the patch that introduces > > AutoVacType / adds logging for what triggered. That's independently useful, > > likely uncontroversial and makes the remaining patch smaller. > > I like that idea. Cool. > Attached revision v4 breaks things up mechanically, along those lines > (no real changes to the code inself, though). The controversial parts > of the patch are indeed a fairly small proportion of the total > changes. I don't think the split is right. There's too much in 0001 - it's basically introducing the terminology of 0002 already. Could you make it a much more minimal change? > > I'd also add the trigger to the pg_stat_activity entry for the autovac > > worker. Additionally I think we should add information about using failsafe > > mode to the p_s_a entry. > > I agree that that's all useful, but it seems like it can be treated as > later work. IDK, it splits up anti-wraparound vacuums into different sub-kinds but doesn't allow to distinguish most of them from a plain autovacuum. Seems pretty easy to display the trigger from 0001 in autovac_report_activity()? You'd have to move the AutoVacType -> translated string mapping into a separate function. That seems like a good idea anyway, the current coding makes translators translate several largely identical strings that just differ in one part. > > I've wished for a non-wraparound, xid age based, "autovacuum trigger" many > > times, FWIW. And I've seen plenty of places write their own userspace version > > of it, because without it they run into trouble. However, I don't like that > > the patch infers the various thresholds using magic constants / multipliers. > > As I said, these details are totally negotiable, and likely could be a > lot better without much effort. > > What are your concerns about the thresholds? For example, is it that > you can't configure the behavior directly at all? Something else? The above, but mainly that > > freeze_max_age = Min(autovacuum_freeze_max_age, table_freeze_max_age) > > prevent_auto_cancel_age = Min(freeze_max_age * 2, 1 billion) > > prevent_auto_cancel = reflrozenxid < recentXid - prevent_auto_cancel_age seems quite confusing / non-principled. What's the logic behind the auto cancel threshold being 2 x freeze_max_age, except that when freeze_max_age is 1 billion, the cutoff is set to 1 billion? That just makes no sense to me. Maye I'm partially confused by the Min(freeze_max_age * 2, 1 billion). As you pointed out in [1], that doesn't actually lower the threshold for "table age" vacuums, because we don't even get to that portion of the code if we didn't already cross freeze_max_age. if (freeze_max_age < ANTIWRAPAROUND_MAX_AGE) freeze_max_age *= 2; freeze_max_age = Min(freeze_max_age, ANTIWRAPAROUND_MAX_AGE); You're lowering a large freeze_max_age to ANTIWRAPAROUND_MAX_AGE - but it only happens after all other checks of freeze_max_age, so it won't influence those. That's confusing code. I think this'd be a lot more readable if you introduced a separate variable for the "no-cancel" threshold, rather than overloading freeze_max_age with different meanings. And you should remove the confusing "lowering" of the cutoff. Maybe something like no_cancel_age = freeze_max_age; if (no_cancel_age < ANTIWRAPAROUND_MAX_AGE) { /* multiply by two, but make sure to not exceed ANTIWRAPAROUND_MAX_AGE */ no_cancel_age = Min((uint32)ANTIWRAPAROUND_MAX_AGE, (uint32)no_cancel_age * 2); } The uint32 bit isn't needed with ANTIWRAPAROUND_MAX_AGE at 1 billion, but at 1.2 it would be needed, so it seems better to have it. That still doesn't explain why we the cancel_age = freeze_max_age * 2 behaviour should be clamped at 1 billion, though. > > If I understand the patch correctly, we now have the following age based > > thresholds for av: > > > > - force-enable autovacuum: > > oldest_datfrozenxid + autovacuum_freeze_max_age < nextXid > > - autovacuum based on age: > > freeze_max_age = Min(autovacuum_freeze_max_age, table_freeze_max_age) > > tableagevac = relfrozenxid < recentXid - freeze_max_age > > - prevent auto-cancellation: > > freeze_max_age = Min(autovacuum_freeze_max_age, table_freeze_max_age) > > prevent_auto_cancel_age = Min(freeze_max_age * 2, 1 billion) > > prevent_auto_cancel = reflrozenxid < recentXid - prevent_auto_cancel_age > > > > Is that right? > > That summary looks accurate, but I'm a bit confused about why you're > asking the question this way. I thought that it was obvious that the > patch doesn't change most of these things. For me it was helpful to clearly list the triggers when thinking about the issue. I found the diff hard to read and, as noted above, the logic for the auto cancel threshold quite confusing, so ... > The only mechanism that the patch changes is related to "prevent > auto-cancellation" behaviors -- which is now what the term > "antiwraparound" refers to. Not sure that redefining what a long-standing name refers to is helpful. It might be best to retire it and come up with new names. > It does change the name of "autovacuum based on age", though -- the name is > now "table age autovacuum" (the old name was antiwraparound autovacuum, of > course). As I pointed out to you already, it's mechanically impossible for > any autovacuum to be antiwraparound unless it's an XID table age/MXID table > age autovacuum. Thinking about it a bit more, my problem with the current anti-wraparound logic boil down to a few different aspects: 1) It regularly scares the crap out of users, even though it's normal. This is further confounded by failsafe autovacuums, where a scared reaction is appropriate, not being visible in pg_stat_activity. I suspect that learning that "vacuum to prevent wraparound" isn't a problem, contributes to people later ignoring "must be vacuumed within ..." WARNINGS, which I've seen plenty times. 2) It makes it hard to DROP, TRUNCATE, VACUUM FULL or even manually VACUUM tables being anti-wraparound vacuumed, even though those manual actions will often resolve the issue much more quickly. 3) Autovacuums triggered by tuple thresholds persistently getting cancelled also regularly causes outages, and they make it more likely that an eventual age-based vacuum will take forever. Aspect 1) is addressed to a good degree by the proposed split of anti-wrap into an age and anti-cancel triggers. And could be improved by reporting failsafe autovacuums in pg_stat_activity. Perhaps 2) could be improved a bit by emitting a WARNING message when we didn't cancel AV because it was anti-wraparound? But eventually I think we somehow need to signal the "intent" of the lock drop down into ProcSleep() or wherever it'd be. I have two ideas around 3): First, we could introduce thresholds for the tuple thresholds, after which autovacuum isn't concealable anymore. Second, we could track the number of cancellations since the last [auto]vacuum in pgstat, and only trigger the anti-cancel behaviour when autovacuum has been cancelled a number of times. Greetings, Andres Freund
pgsql-hackers by date: