Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
Msg-id CAH2-Wzk3GShs96LdBU=raZiGtH1safUWvsQ2GskpDf8tLS4VAQ@mail.gmail.com
In response to Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
List pgsql-hackers
On Fri, Jan 20, 2023 at 12:57 PM Robert Haas <robertmhaas@gmail.com> wrote:
> It doesn't seem that way to me. What am I missing? In that case, the
> problem was a DROP TRIGGER command waiting behind autovacuum's lock
> and thus causing all new table locks to wait behind DROP TRIGGER's
> lock request. But it does not sound like that was a one-off event.

It's true that I cannot categorically state that the patch would have
made the crucial difference in this particular case. It comes down to
two factors:

1. How many autovacuum attempts would any given amount of additional
XID space headroom have bought them in practice? We can be all but
certain that the smallest possible number is 1, which is something.

2. Would that have been enough for relfrozenxid to be advanced in practice?
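
As a rough illustration of factor 1 (a back-of-the-envelope sketch, not anything taken from the patch or the postmortem), the number of attempts can be modeled as one attempt at the point where autovacuum is first triggered, plus one more each time the table's XID age advances by however many XIDs are consumed between a cancellation and the next launch. The function name, the model itself, and both numbers in the example are assumptions made up for illustration.

```python
def attempts_within_headroom(headroom_xids: int, xids_between_attempts: int) -> int:
    """Estimate how many cancellable autovacuum attempts fit into some XID headroom.

    Hypothetical model: the first attempt happens as soon as autovacuum is
    triggered, and each later attempt starts after `xids_between_attempts`
    more XIDs have been consumed. Once `headroom_xids` are used up, the
    no-auto-cancel behavior is assumed to take over.
    """
    if headroom_xids < 0:
        raise ValueError("headroom_xids must be non-negative")
    if xids_between_attempts <= 0:
        raise ValueError("xids_between_attempts must be positive")
    # One guaranteed attempt, plus one per full interval that fits in the headroom.
    return 1 + headroom_xids // xids_between_attempts


# Made-up numbers: 100 million XIDs of headroom, with 25 million XIDs
# consumed between one cancelled attempt and the next.
print(attempts_within_headroom(100_000_000, 25_000_000))  # prints 5
```

Even with zero headroom the model returns 1, which matches the lower bound stated in factor 1.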

I think that it's likely that the answer to 2 is yes, since there was
no mention of bloat as a relevant factor at any point in the
postmortem. It was all about locking characteristics of antiwraparound
autovacuuming in particular, and its interaction with their
application. I think that they were perfectly okay with the autovacuum
cancellation behavior most of the time. In fact, I don't think that
there was any bloat in the table at all -- it was a really huge table
(likely an events table), and those tend to be append-only.

Even if I'm wrong about this specific case (we'll never know for
sure), the patch as written would be virtually guaranteed to make the
crucial difference in cases that I have seen up close. For example, a
case with TRUNCATE.

> It sounds like they used DROP TRIGGER pretty regularly. So I think this
> sounds like exactly the kind of case I was talking about, where
> autovacuums keep getting cancelled until we decide to stop cancelling
> them.

I don't know how you can reach that conclusion. The chances of a
non-aggressive VACUUM advancing relfrozenxid right now are virtually
zero, at least for a big table like this one. It seems quite likely
that plenty of non-aggressive autovacuums completed, or would have
completed, had the insert-driven autovacuum feature been available.

The whole article was about how this DROP TRIGGER pattern worked just
fine most of the time, because most of the time autovacuum was just
autocancelled. They say this at one point:

"The normal autovacuum mechanism is skipped when locks are held in
order to minimize service disruption. However, because transaction
wraparound is such a severe problem, if the system gets too close to
wraparound, an autovacuum is launched that does not back off under
lock contention."

At another point:

"When the outage was resolved, we still had a number of questions: is
a wraparound autovacuum always so disruptive? Given that it was
blocking all table operations, why does it throttle itself?"

ISTM that it was a combination of aggressive vacuuming taking far
longer than usual (especially likely because this was pre-freeze-map),
and the no-auto-cancel behavior. Aggressive/antiwraparound VACUUMs are
naturally much more likely to coincide with periodic DDL, just because
they take so much longer. That is a dangerous combination.
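
The duration argument can be illustrated with a toy model, assumed here purely for illustration and not taken from the article: suppose the DDL fires once every fixed period, at a phase that is independent of when the VACUUM starts. Then the chance that at least one DDL statement arrives while the VACUUM is still running is the ratio of the VACUUM's duration to the DDL period, capped at 1. The function name and the example durations are hypothetical.

```python
def overlap_probability(vacuum_minutes: float, ddl_period_minutes: float) -> float:
    """Chance that periodic DDL fires at least once during a VACUUM.

    Hypothetical model: DDL runs every `ddl_period_minutes`, and the VACUUM
    starts at a uniformly random point within that cycle.
    """
    if vacuum_minutes < 0:
        raise ValueError("vacuum_minutes must be non-negative")
    if ddl_period_minutes <= 0:
        raise ValueError("ddl_period_minutes must be positive")
    # A VACUUM at least as long as the period always overlaps a DDL run.
    return min(1.0, vacuum_minutes / ddl_period_minutes)


# Made-up durations against hourly DDL: a short non-aggressive VACUUM
# versus a multi-hour aggressive one.
print(overlap_probability(5, 60))
print(overlap_probability(360, 60))
```

Under this model a VACUUM that outlasts the DDL period is certain to run into the DDL, and with auto-cancellation disabled the DDL then has to wait behind it.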

-- 
Peter Geoghegan


