cost delay brainstorming - Mailing list pgsql-hackers

From Robert Haas
Subject cost delay brainstorming
Date
Msg-id CA+Tgmoah8fUda2DFbTkqxqv6o4jjomQ+tr=snRfBVxVi4oYG3w@mail.gmail.com
Whole thread Raw
Responses Re: cost delay brainstorming
Re: cost delay brainstorming
Re: cost delay brainstorming
Re: cost delay brainstorming
List pgsql-hackers
Hi,

As I mentioned in my talk at 2024.pgconf.dev, I think that the biggest
problem with autovacuum as it exists today is that the cost delay is
sometimes too low to keep up with the amount of vacuuming that needs
to be done. I sketched a solution during the talk, but it was very
complicated, so I started to try to think of simpler ideas that might
still solve the problem, or at least be better than what we have
today.

I think we might able to get fairly far by observing that if the
number of running autovacuum workers is equal to the maximum allowable
number of running autovacuum workers, that may be a sign of trouble,
and the longer that situation persists, the more likely it is that
we're in trouble. So, a very simple algorithm would be: If the maximum
number of workers have been running continuously for more than, say,
10 minutes, assume we're falling behind and exempt all workers from
the cost limit for as long as the situation persists. One could
criticize this approach on the grounds that it causes a very sudden
behavior change instead of, say, allowing the rate of vacuuming to
gradually increase. I'm curious to know whether other people think
that would be a problem.

I think it might be OK, for a couple of reasons:

1. I'm unconvinced that the vacuum_cost_delay system actually prevents
very many problems. I've fixed a lot of problems by telling users to
raise the cost limit, but virtually never by lowering it. When we
lowered the delay by an order of magnitude a few releases ago -
equivalent to increasing the cost limit by an order of magnitude - I
didn't personally hear any complaints about that causing problems. So
disabling the delay completely some of the time might just be fine.

1a. Incidentally, when I have seen problems because of vacuum running
"too fast", it's not been because it was using up too much I/O
bandwidth, but because it's pushed too much data out of cache too
quickly. A long overnight vacuum can evict a lot of pages from the
system page cache by morning - the ring buffer only protects our
shared_buffers, not the OS cache. I don't think this can be fixed by
rate-limiting vacuum, though: to keep the cache eviction at a level
low enough that you could be certain of not causing trouble, you'd
have to limit it to an extremely low rate which would just cause
vacuuming not to keep up. The cure would be worse than the disease at
that point.

2. If we decided to gradually increase the rate of vacuuming instead
of just removing the throttling all at once, what formula would we use
and why would that be the right idea? We'd need a lot of state to
really do a calculation of how fast we would need to go in order to
keep up, and that starts to rapidly turn into a very complicated
project along the lines of what I mooted in Vancouver. Absent that,
the only other idea I have is to gradually ramp up the cost limit
higher and higher, which we could do, but we would have no idea how
fast to ramp it up, so anything we do here feels like it's just
picking random numbers and calling them an algorithm.

If you like this idea, I'd like to know that, and hear any further
thoughts you have about how to improve or refine it. If you don't, I'd
like to know that, too, and any alternatives you can propose,
especially alternatives that don't require crazy amounts of new
infrastructure to implement.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Tomasz Rybak
Date:
Subject: Re: Virtual generated columns
Next
From: "David E. Wheeler"
Date:
Subject: Re: FYI: LLVM Runtime Crash