Re: pgsql: Improve autovacuum logging for aggressive and anti-wraparound ru - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: pgsql: Improve autovacuum logging for aggressive and anti-wraparound ru
Msg-id 20200324044016.GD1836@paquier.xyz
In response to Re: pgsql: Improve autovacuum logging for aggressive and anti-wraparound ru  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Mon, Mar 23, 2020 at 06:41:50PM -0700, Andres Freund wrote:
> Which valid scenario can lead to this? Neither the comment, nor commit
> message explain it.

The commit message mentions that concurrent autovacuum jobs can lead
to the creation of non-aggressive and anti-wraparound jobs, which make
no sense because an aggressive and anti-wraparound job was already
done in parallel by a different worker, and that this was possible
because of inconsistent relcache lookups across concurrent jobs.  This
was mentioned upthread.

> Unless you're thinking of scenarios where autovacuum
> and manual vacuum are mixed, I don't really see valid reasons? Normally
> autovacuum's locking + the table_recheck_autovac() check should prevent
> problematic scenarios.
>
> I do see a few scenarios that can trigger this - but they all more or
> less are bugs.

Hmm.  OK.

> It doesn't strike me as a good idea to work around such bugs by silently
> neutering heap_vacuum_rel(). The likelihood of that temporarily covering
> up more severe problems seems significant - they're likely to then later
> bite you with a cluster shutdown.

That said, I have been thinking about this one for a couple of days
now and it seems to me that this is a factor contributing to what we
are seeing in [1], and I agree that this is just an incorrect approach
that makes it easier to trigger the real underlying issues, while
table_recheck_autovac() ought to be the only code path doing the skip
job.  Note that I have failed to reproduce the behavior of the other
thread though, always finishing with a non-aggressive anti-wraparound
job skipped because an aggressive and anti-wraparound job had happened
just before in parallel, and autovacuum was always able to continue
triggering new jobs, keeping the relfrozenxid age at bay.

So I would like to first revert that part, to have a cleaner state to
work on the underlying issues.  A pure revert also means adding back
the log message for non-aggressive and anti-wraparound jobs, which
should never exist; that message should be replaced by an assertion
once all the holes are fixed.  What do you think?
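
To illustrate the two stages described above (log the unexpected
combination for now, assert on it later), here is a minimal standalone
sketch.  It is not the PostgreSQL code: VacuumJobSketch and the three
functions are hypothetical names invented for this example, and the
label strings are assumptions rather than the actual log wording.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the two properties of a vacuum job. */
typedef struct VacuumJobSketch
{
    bool aggressive;      /* job is an aggressive vacuum */
    bool is_wraparound;   /* job was launched to prevent XID wraparound */
} VacuumJobSketch;

/*
 * An anti-wraparound job is expected to be aggressive as well.  Any
 * other combination is fine.
 */
static bool
vacuum_job_is_expected(const VacuumJobSketch *job)
{
    return !job->is_wraparound || job->aggressive;
}

/*
 * Stage 1 (after the revert): the unexpected combination is not
 * skipped silently; it gets its own label so that it shows up in logs.
 */
static const char *
vacuum_job_label(const VacuumJobSketch *job)
{
    if (job->is_wraparound)
        return job->aggressive
            ? "aggressive, anti-wraparound"
            : "non-aggressive, anti-wraparound (unexpected)";
    return job->aggressive ? "aggressive" : "regular";
}

/*
 * Stage 2 (once the underlying holes are fixed): the same condition
 * becomes an assertion instead of a log label.
 */
static void
vacuum_job_assert_expected(const VacuumJobSketch *job)
{
    assert(vacuum_job_is_expected(job));
}
```

With this sketch, a job with aggressive = false and is_wraparound =
true is labelled as unexpected in stage 1 and would trip the assertion
in stage 2.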

[1]: https://www.postgresql.org/message-id/CAE39h23RTX1jkYjWc5tccv34HwwraizaCUxOmdQdPM+Zt5-2Qg@mail.gmail.com
--
Michael
