I frequently hear about scenarios where users with thousands upon thousands
of tables realize that autovacuum is struggling to keep up. When they
inevitably go to bump up autovacuum_max_workers, they discover that it
requires a server restart (i.e., downtime) to take effect, causing further
frustration. For this reason, I think $SUBJECT is a desirable improvement.
I spent some time looking for past discussions about this, and I was
surprised to not find any, so I thought I'd give it a try.

The attached proof-of-concept patch demonstrates what I have in mind.
Instead of trying to dynamically change the global process table, etc., I'm
proposing that we introduce a new GUC that sets the effective maximum
number of autovacuum workers that can be started at any time. This means
there would be two GUCs for the number of autovacuum workers: one for the
number of slots reserved for autovacuum workers, and another that restricts
the number of those slots that can be used. The former would continue to
require a restart to change its value, and users would typically want to
set it relatively high. The latter could be changed at any time and would
allow for raising or lowering the maximum number of active autovacuum
workers, up to the limit set by the other parameter.

The proof-of-concept patch keeps autovacuum_max_workers as the maximum
number of slots to reserve for workers, but I think we should instead
rename this parameter to something else and then reintroduce
autovacuum_max_workers as the new parameter that can be adjusted without
restarting. That way, autovacuum_max_workers continues to work much the
same way as in previous versions.

There are a couple of weird cases with this approach. One is when the
restart-only limit is set lower than the PGC_SIGHUP limit. In that case, I
think we should just use the restart-only limit. The other is when there
are already N active autovacuum workers and the PGC_SIGHUP parameter is
changed to something less than N. For that case, I think we should just
block starting additional workers until the number of workers drops below
the new parameter's value. I don't think we should kill existing workers,
or anything else like that.
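
To make those two cases concrete, here is a minimal sketch of the rule I have in mind. It is not code from the attached patch: the struct, its field names, and both function names are hypothetical and exist only for illustration. It assumes the effective limit is the smaller of the two settings, and that lowering the reloadable setting only blocks new launches.

```c
#include <stdbool.h>

/* Hypothetical names for illustration only; not taken from the patch. */
typedef struct AvLimits
{
    int reserved_slots;  /* restart-only: worker slots reserved at startup */
    int max_workers;     /* reloadable: cap on how many slots may be in use */
} AvLimits;

/*
 * If the reloadable cap is set higher than the number of reserved slots,
 * the restart-only value wins.
 */
static int
effective_max_workers(const AvLimits *limits)
{
    return (limits->max_workers < limits->reserved_slots)
        ? limits->max_workers
        : limits->reserved_slots;
}

/*
 * Running workers are never cancelled.  A new worker may start only once
 * the active count has dropped below the effective limit.
 */
static bool
can_start_worker(const AvLimits *limits, int active_workers)
{
    return active_workers < effective_max_workers(limits);
}
```

With 8 reserved slots and the reloadable cap lowered from 3 to 2 while 3 workers are running, can_start_worker() stays false until at least two of them finish, and nothing is terminated in the meantime.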

TBH I've been sitting on this idea for a while now, only because I think it
has a slim chance of acceptance, but IMHO this is a simple change that
could help many users.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com