Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
Date
Msg-id CAH2-Wzm8UxiYB4U+YboAZwXm40unp7cXVNzpBr233YH2o5PVKw@mail.gmail.com
In response to Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
List pgsql-hackers
On Thu, Jan 12, 2023 at 9:12 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I do agree that it's good to slowly increase the aggressiveness of
> VACUUM as we get further behind, rather than having big behavior
> changes all at once, but I think that should happen by smoothly
> varying various parameters rather than by making discrete behavior
> changes at a whole bunch of different times.

In general I tend to agree, but, as you go on to acknowledge yourself,
this particular behavior is inherently discrete. Either the
PROC_VACUUM_FOR_WRAPAROUND behavior is in effect, or it isn't.

In many important cases the only kind of autovacuum that ever runs
against a certain big table is antiwraparound autovacuum. And
therefore every autovacuum that runs against the table must
necessarily not be auto cancellable. These are the cases where we see
disastrous interactions with automated DDL, such as a TRUNCATE run by
a cron job (to stop those annoying antiwraparound autovacuums) -- a
heavyweight lock traffic jam that causes the application to lock up.

All that I really want to do here is give an autovacuum that *can* be
auto cancelled *some* non-zero chance to succeed with these kinds of
tables. TRUNCATE completes immediately, so the AEL is no big deal.
Except when it's blocked behind an antiwraparound autovacuum. That
kind of interaction is occasionally just disastrous. Even just the
tiniest bit of wiggle room could avoid it in most cases, possibly even
almost all cases.
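
A minimal sketch of that traffic jam, assuming a greatly simplified
model of heavyweight lock queuing. This is not PostgreSQL's lock
manager: `must_wait` and `simulate` are hypothetical names, only three
lock modes are modeled, and the real auto cancellation only happens
after the waiter has been blocked for deadlock_timeout.

```python
# Simplified conflict table for three table-level lock modes.
CONFLICTS = {
    "AccessShare": {"AccessExclusive"},
    "ShareUpdateExclusive": {"ShareUpdateExclusive", "AccessExclusive"},
    "AccessExclusive": {"AccessShare", "ShareUpdateExclusive", "AccessExclusive"},
}


def must_wait(requested, held, waiting):
    """A request waits if it conflicts with a held lock or an earlier waiter."""
    return any(mode in CONFLICTS[requested] for mode in held + waiting)


def simulate(autovacuum_cancellable):
    """Return True if an ordinary SELECT ends up blocked on the table."""
    held = ["ShareUpdateExclusive"]  # lock mode taken by (auto)vacuum
    waiting = []

    # TRUNCATE arrives and needs an AccessExclusiveLock (AEL).
    if must_wait("AccessExclusive", held, waiting):
        if autovacuum_cancellable:
            # Autovacuum is cancelled; TRUNCATE runs and finishes at once.
            held.remove("ShareUpdateExclusive")
        else:
            # Antiwraparound case: TRUNCATE queues behind autovacuum.
            waiting.append("AccessExclusive")

    # A plain SELECT now asks for AccessShareLock.
    return must_wait("AccessShare", held, waiting)


print(simulate(autovacuum_cancellable=True))
print(simulate(autovacuum_cancellable=False))
```

In the non-cancellable case the SELECT does not conflict with
autovacuum itself; it queues behind the waiting AEL request, which is
what turns one blocked TRUNCATE into an application-wide stall.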

> Maybe that's not the right idea, I don't know, and a naive
> implementation might be worse than nothing, but I think it has some
> chance of being worth consideration.

It's a question of priorities. The failsafe isn't supposed to be used
(when it is, that is itself a kind of failure), and so presumably only
kicks in on very rare occasions, where nobody was paying attention anyway.
So far I've heard no complaints about this, but I've heard lots of
complaints about the antiwrap autocancellation behavior.

> The behavior already changes when you hit
> vacuum_freeze_min_age and then again when you hit
> vacuum_freeze_table_age and then there's also
> autovacuum_freeze_max_age and xidWarnLimit and xidStopLimit and a few
> others, and these settings all interact in pretty complex ways. The
> more conditional logic we add to that, the harder it becomes to
> understand what's actually happening.

In general I strongly agree. In fact that's a big part of what
motivates my ongoing work on VACUUM. The user experience is central.

As Andres pointed out, presenting antiwraparound autovacuums as kind
of an emergency thing but also somehow a routine thing is just
horribly confusing. I want to make them into an emergency thing in
every sense -- something that you as a user can reasonably expect to
never see (like the failsafe). But if you do see one, then that's a
useful signal of an underlying problem with contention, say from
automated DDL that pathologically cancels autovacuums again and again.

> Now, you might reply to the above by saying, well, some behaviors
> can't vary continuously. vacuum_cost_limit can perhaps be phased out
> gradually, but autocancellation seems like something that you must
> either do, or not do. I would agree with that. But what I'm saying is
> that we ought to favor having those kinds of behaviors all engage at
> the same point rather than at different times.

Right now aggressive VACUUMs do just about all freezing at the same
time, to the extent that many users seem to think that it's a totally
different thing with totally different responsibilities to regular
VACUUM. Doing everything at the same time like that causes huge
practical problems, and is very confusing.

I think that users will really appreciate having only one kind of
VACUUM/autovacuum (since the other patch gets rid of discrete
aggressive mode VACUUMs). I want "table age autovacuuming" (as I
propose to call it) to come to be seen as no different from any other
autovacuum, such as an "insert tuples" autovacuum or a "dead tuples"
autovacuum. The difference is only in how autovacuum.c triggers the
VACUUM, not in any runtime behavior. That's an important goal here.
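
To illustrate "differs only in the trigger", here is a rough sketch of
the three trigger conditions using the documented threshold formulas
and default settings. `autovacuum_triggers` is a hypothetical helper,
not code from autovacuum.c, and it ignores multixact age and per-table
storage parameters.

```python
def autovacuum_triggers(reltuples, dead_tuples, inserted_tuples, table_age,
                        vac_threshold=50, vac_scale=0.2,
                        ins_threshold=1000, ins_scale=0.2,
                        freeze_max_age=200_000_000):
    """Return the reasons autovacuum would pick a table, under default GUCs.

    Defaults mirror autovacuum_vacuum_threshold, autovacuum_vacuum_scale_factor,
    autovacuum_vacuum_insert_threshold, autovacuum_vacuum_insert_scale_factor
    and autovacuum_freeze_max_age.
    """
    reasons = []
    if dead_tuples > vac_threshold + vac_scale * reltuples:
        reasons.append("dead tuples")
    if inserted_tuples > ins_threshold + ins_scale * reltuples:
        reasons.append("insert tuples")
    if table_age > freeze_max_age:
        reasons.append("table age")
    return reasons


# A large, mostly static table: only its age ever triggers autovacuum.
print(autovacuum_triggers(1_000_000, 10, 10, 250_000_000))
```

Under the proposal, whichever reason fires, the VACUUM that runs would
behave the same way; the reason is only bookkeeping on the scheduling side.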

> I did take a look at the post-mortem to which you linked, but I am not
> quite sure how that bears on the behavior change under discussion.

The post-mortem involved a single "DROP TRIGGER" that caused chaos
when it interacted with the auto cancellation behavior. It would
usually complete instantly, so the AEL wasn't actually disruptive,
but one day antiwraparound autovacuum made the cron job effectively
block all reads and writes for hours.

The similar outages I was called in to help with personally had either
an automated TRUNCATE or an automated CREATE INDEX. Had autovacuum
only been willing to yield once or twice, then it probably would have
been fine -- the situation probably would have worked itself out
naturally. That's the best outcome you can hope for.
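
A toy model of "willing to yield once or twice", purely for
illustration. The `max_yields` budget and the `run_autovacuum_attempts`
function are assumptions made up for this sketch; they do not describe
how the patch (or any PostgreSQL release) decides when to stop yielding.

```python
def run_autovacuum_attempts(ddl_arrivals, max_yields):
    """Simulate successive autovacuum attempts against one table.

    ddl_arrivals[i] is True if conflicting DDL shows up during attempt i.
    Returns (index of the attempt that completed, or None; number of DDL
    statements that had to wait behind a non-cancellable autovacuum).
    """
    yields = 0
    blocked_ddl = 0
    for attempt, ddl_arrives in enumerate(ddl_arrivals):
        if not ddl_arrives:
            return attempt, blocked_ddl   # nothing in the way; VACUUM finishes
        if yields < max_yields:
            yields += 1                   # autovacuum is cancelled, DDL proceeds
        else:
            blocked_ddl += 1              # DDL waits until VACUUM finishes
            return attempt, blocked_ddl
    return None, blocked_ddl


# Occasional DDL: a small yield budget lets things work themselves out.
print(run_autovacuum_attempts([True, True, False], max_yields=2))
# No budget at all (today's antiwraparound behavior): the first DDL blocks.
print(run_autovacuum_attempts([True, True, False], max_yields=0))
```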

--
Peter Geoghegan


