Re: another autovacuum scheduling thread - Mailing list pgsql-hackers

From Robert Haas
Subject Re: another autovacuum scheduling thread
Msg-id CA+TgmoY27S+nbgdCrVrc8S4p38NwTAC8_Uyq5ZaX6zxYToebXA@mail.gmail.com
In response to Re: another autovacuum scheduling thread  (Nathan Bossart <nathandbossart@gmail.com>)
List pgsql-hackers
On Wed, Nov 12, 2025 at 3:10 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> I do think re-prioritization is worth considering, but IMHO we should leave
> it out of phase 1.  I think it's pretty easy to reason about one round of
> prioritization being okay.  The order is completely arbitrary today, so how
> could ordering by vacuum-related criteria make things any worse?  In my
> view, changing the list contents in fancier ways (e.g., adding
> just-processed tables back to the list) is a step further that requires
> more discussion and testing.

I agree with your view around reprioritization. To answer your
rhetorical question, the way that reordering the list could hurt is if
the current ordering (pg_class scan order) happened to be a
near-optimal choice. For example, suppose the last table in pg_class
order is in a state where vacuuming appears to be necessary but will be
painful and/or useless (VACUUM will error out, xmin will prevent all or
most tuple removal, the table is located on an incredibly slow disk with
nothing cached, whatever). Re-sorting the list figures to move that table
earlier, which will not work out for the best. I suspect that
reprioritization actually increases the danger of this kind of failure
mode. The more aggressive you are about making sure that the
highest-priority tables actually get handled first, the more important
it is to be correct about the real order of priority.

I do think in the long term a really good system is probably going to
accumulate a bunch of extra logic to deal with cases like this. For
example, if the first table in the queue causes VACUUM to spend an
hour chugging away and then fail with an I/O error, we would ideally
want to make sure to wait a while before retrying that table, so that
others don't get starved. But like you say, there's no need to solve
every problem at once.
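To illustrate the kind of extra logic I mean, here's a rough sketch of
per-table retry backoff, where a table that just failed gets skipped for
a while and the delay grows with consecutive failures. All of the names
here are made up for illustration; nothing like this exists in
autovacuum today:

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Illustrative failure-tracking entry; none of these names are real
 * PostgreSQL structures. */
typedef struct
{
    unsigned int relid;
    time_t       last_failure;   /* time of last failed VACUUM attempt */
    int          failure_count;  /* consecutive failures; 0 = healthy */
} RetryState;

#define BASE_BACKOFF_SECS 60

/* Skip a table whose last VACUUM attempt failed recently; the delay
 * doubles with each consecutive failure (simple exponential backoff),
 * so a persistently failing table can't starve everything behind it. */
static bool
should_skip_table(const RetryState *rs, time_t now)
{
    time_t backoff;

    if (rs->failure_count == 0)
        return false;
    backoff = (time_t) BASE_BACKOFF_SECS << (rs->failure_count - 1);
    return (now - rs->last_failure) < backoff;
}
```

The real policy questions (how long to back off, whether to cap it,
whether an eventual anti-wraparound deadline must override the skip)
are exactly the things that would need the discussion and testing
Nathan mentions.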

What seems important to me for this patch is that we don't choose an
actively bad sort order. For instance, if we don't get the balance
between prioritizing anti-wraparound activity and controlling runaway
bloat correct, and especially if there's no way to recover by tweaking
settings, to me that's a scary scenario. I do think it's fairly
realistic for a bad choice of sort order to end up being a regression
over the current lack of a sort order. You might just be getting lucky
right now -- say, because the catalog tables all occur first in
pg_class order and vacuuming those tends to be important, and among user
tables, the ones you created first are actually the ones that are most
important. That's not a particularly crazy scenario, IMHO.

Point being: I think we need to avoid the mindset that we can't be
stupider than we are now. I don't think there's any way we would
commit something that is GENERALLY stupider than we are now, but it's
not about averages. It's about whether there are specific cases that
are common enough to worry about which end up getting regressed. I'm
honestly not sure how much of a risk that is, and, again, I'm not
trying to kill the patch. It might well be that the patch is already
good enough that such scenarios will be extremely rare. However, it's
easy to get overconfident when replacing a completely unintelligent
system with a smarter one. The risk of something backfiring can
sometimes be higher than one anticipates.

One idea that might be worth considering is adding a reloption of some
kind that lets the user exert positive control over the sort order. I
know that's scope creep, so maybe it's a bad idea for that reason. But
I think it would be a better idea than Sami's proposal to score system
catalogs more highly, not so much because his idea is necessarily
wrong-headed as because it doesn't help with what I see as the
principal danger here, namely, that whatever we do will sometimes turn
out to be wrong. Trying to be right 100% of the time is not going to
work out as well as having a backup plan for the cases where we are
wrong.
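As a sketch of what I have in mind, a comparator along these lines
would let an explicit per-table setting override whatever heuristic
score we compute. Every name here is hypothetical -- in particular,
no such reloption exists today:

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative queue entry: "user_priority" stands in for a
 * hypothetical reloption (imagine something like
 * ALTER TABLE ... SET (autovacuum_priority = N)). */
typedef struct
{
    unsigned int relid;
    int          user_priority;  /* higher = vacuum sooner; 0 = unset */
    double       score;          /* heuristic score from the scheduler */
} QueueEntry;

/* qsort comparator: an explicit user priority always beats the
 * heuristic score, giving a recovery knob for the cases where the
 * heuristic turns out to be wrong. */
static int
cmp_queue(const void *a, const void *b)
{
    const QueueEntry *qa = (const QueueEntry *) a;
    const QueueEntry *qb = (const QueueEntry *) b;

    if (qa->user_priority != qb->user_priority)
        return (qa->user_priority < qb->user_priority) ? 1 : -1;
    if (qa->score != qb->score)
        return (qa->score < qb->score) ? 1 : -1;
    return 0;
}
```

The point isn't this particular shape, just that the user has some
lever that beats the heuristic when the heuristic guesses wrong.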

--
Robert Haas
EDB: http://www.enterprisedb.com


