cost delay brainstorming - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | cost delay brainstorming |
Date | |
Msg-id | CA+Tgmoah8fUda2DFbTkqxqv6o4jjomQ+tr=snRfBVxVi4oYG3w@mail.gmail.com |
Responses | Re: cost delay brainstorming (×5) |
List | pgsql-hackers |
Hi,

As I mentioned in my talk at 2024.pgconf.dev, I think that the biggest problem with autovacuum as it exists today is that the cost delay is sometimes too low to keep up with the amount of vacuuming that needs to be done. I sketched a solution during the talk, but it was very complicated, so I started to try to think of simpler ideas that might still solve the problem, or at least be better than what we have today.

I think we might be able to get fairly far by observing that if the number of running autovacuum workers is equal to the maximum allowable number of running autovacuum workers, that may be a sign of trouble, and the longer that situation persists, the more likely it is that we're in trouble. So, a very simple algorithm would be: if the maximum number of workers have been running continuously for more than, say, 10 minutes, assume we're falling behind and exempt all workers from the cost limit for as long as the situation persists.

One could criticize this approach on the grounds that it causes a very sudden behavior change instead of, say, allowing the rate of vacuuming to gradually increase. I'm curious to know whether other people think that would be a problem. I think it might be OK, for a couple of reasons:

1. I'm unconvinced that the vacuum_cost_delay system actually prevents very many problems. I've fixed a lot of problems by telling users to raise the cost limit, but virtually never by lowering it. When we lowered the delay by an order of magnitude a few releases ago - equivalent to increasing the cost limit by an order of magnitude - I didn't personally hear any complaints about that causing problems. So disabling the delay completely some of the time might just be fine.

1a. Incidentally, when I have seen problems because of vacuum running "too fast", it's not been because it was using up too much I/O bandwidth, but because it's pushed too much data out of cache too quickly. A long overnight vacuum can evict a lot of pages from the system page cache by morning - the ring buffer only protects our shared_buffers, not the OS cache. I don't think this can be fixed by rate-limiting vacuum, though: to keep the cache eviction at a level low enough that you could be certain of not causing trouble, you'd have to limit it to an extremely low rate, which would just cause vacuuming not to keep up. The cure would be worse than the disease at that point.

2. If we decided to gradually increase the rate of vacuuming instead of just removing the throttling all at once, what formula would we use, and why would that be the right idea? We'd need a lot of state to really do a calculation of how fast we would need to go in order to keep up, and that starts to rapidly turn into a very complicated project along the lines of what I mooted in Vancouver. Absent that, the only other idea I have is to gradually ramp the cost limit higher and higher, which we could do, but we would have no idea how fast to ramp it up, so anything we do here feels like it's just picking random numbers and calling them an algorithm.

If you like this idea, I'd like to know that, and hear any further thoughts you have about how to improve or refine it. If you don't, I'd like to know that, too, and any alternatives you can propose, especially alternatives that don't require crazy amounts of new infrastructure to implement.

--
Robert Haas
EDB: http://www.enterprisedb.com
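[Editor's note] To make the proposed trigger concrete, here is a minimal, self-contained C sketch of the bookkeeping it would need: record the moment all autovacuum worker slots become occupied, and bypass the cost delay once that condition has persisted for more than ten minutes. All names here (av_update_worker_count, av_cost_delay_exempt, AV_SATURATION_GRACE_SECS, and so on) are hypothetical, not PostgreSQL internals; a real patch would keep this state in shared memory maintained by the launcher and have workers consult it at their cost-delay checkpoints rather than use plain globals.

```c
/*
 * Standalone sketch of the "all workers busy for 10 minutes" escape hatch
 * described in the mail above.  Hypothetical names throughout; not the
 * actual autovacuum code.
 */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define AV_MAX_WORKERS           3            /* stand-in for autovacuum_max_workers */
#define AV_SATURATION_GRACE_SECS (10 * 60)    /* 10 minutes, as proposed */

/* Hypothetical bookkeeping; in PostgreSQL this would be shared state. */
static int    av_nworkers_running = 0;
static time_t av_saturated_since = 0;         /* 0 => not currently saturated */

/*
 * Call whenever the set of running workers changes.  Notes when every
 * worker slot became occupied, and resets when any slot frees up.
 */
static void
av_update_worker_count(int nworkers, time_t now)
{
    av_nworkers_running = nworkers;

    if (nworkers >= AV_MAX_WORKERS)
    {
        if (av_saturated_since == 0)
            av_saturated_since = now;         /* saturation just began */
    }
    else
        av_saturated_since = 0;               /* saturation ended; restart the clock */
}

/*
 * Should workers ignore the cost delay?  True once all worker slots have
 * been continuously busy for longer than the grace period.
 */
static bool
av_cost_delay_exempt(time_t now)
{
    return av_saturated_since != 0 &&
           (now - av_saturated_since) > AV_SATURATION_GRACE_SECS;
}

int
main(void)
{
    time_t t0 = time(NULL);

    /* All three workers start; 5 minutes later throttling still applies. */
    av_update_worker_count(3, t0);
    printf("after 5 min:  exempt = %d\n", av_cost_delay_exempt(t0 + 5 * 60));

    /* After 11 minutes of continuous saturation, throttling is lifted. */
    printf("after 11 min: exempt = %d\n", av_cost_delay_exempt(t0 + 11 * 60));

    /* A worker finishes; when saturation recurs, the clock starts over. */
    av_update_worker_count(2, t0 + 12 * 60);
    av_update_worker_count(3, t0 + 13 * 60);
    printf("resaturated:  exempt = %d\n", av_cost_delay_exempt(t0 + 13 * 60));

    return 0;
}
```

The detail this sketch makes visible is the reset: as soon as any worker slot frees up, the timer starts over, so brief bursts in which all workers happen to be busy do not disable throttling; only sustained saturation does.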