Thread: cost delay brainstorming
Hi,

As I mentioned in my talk at 2024.pgconf.dev, I think that the biggest problem with autovacuum as it exists today is that the cost delay is sometimes too low to keep up with the amount of vacuuming that needs to be done. I sketched a solution during the talk, but it was very complicated, so I started to try to think of simpler ideas that might still solve the problem, or at least be better than what we have today.

I think we might be able to get fairly far by observing that if the number of running autovacuum workers is equal to the maximum allowable number of running autovacuum workers, that may be a sign of trouble, and the longer that situation persists, the more likely it is that we're in trouble. So, a very simple algorithm would be: If the maximum number of workers have been running continuously for more than, say, 10 minutes, assume we're falling behind and exempt all workers from the cost limit for as long as the situation persists. One could criticize this approach on the grounds that it causes a very sudden behavior change instead of, say, allowing the rate of vacuuming to gradually increase. I'm curious to know whether other people think that would be a problem.

I think it might be OK, for a couple of reasons:

1. I'm unconvinced that the vacuum_cost_delay system actually prevents very many problems. I've fixed a lot of problems by telling users to raise the cost limit, but virtually never by lowering it. When we lowered the delay by an order of magnitude a few releases ago - equivalent to increasing the cost limit by an order of magnitude - I didn't personally hear any complaints about that causing problems. So disabling the delay completely some of the time might just be fine.

1a. Incidentally, when I have seen problems because of vacuum running "too fast", it's not been because it was using up too much I/O bandwidth, but because it's pushed too much data out of cache too quickly. A long overnight vacuum can evict a lot of pages from the system page cache by morning - the ring buffer only protects our shared_buffers, not the OS cache. I don't think this can be fixed by rate-limiting vacuum, though: to keep the cache eviction at a level low enough that you could be certain of not causing trouble, you'd have to limit it to an extremely low rate, which would just cause vacuuming not to keep up. The cure would be worse than the disease at that point.

2. If we decided to gradually increase the rate of vacuuming instead of just removing the throttling all at once, what formula would we use and why would that be the right idea? We'd need a lot of state to really do a calculation of how fast we would need to go in order to keep up, and that starts to rapidly turn into a very complicated project along the lines of what I mooted in Vancouver. Absent that, the only other idea I have is to gradually ramp up the cost limit higher and higher, which we could do, but we would have no idea how fast to ramp it up, so anything we do here feels like it's just picking random numbers and calling them an algorithm.

If you like this idea, I'd like to know that, and hear any further thoughts you have about how to improve or refine it. If you don't, I'd like to know that, too, and any alternatives you can propose, especially alternatives that don't require crazy amounts of new infrastructure to implement.

--
Robert Haas
EDB: http://www.enterprisedb.com
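For concreteness, here is a minimal standalone sketch of the trigger rule Robert describes. It is not PostgreSQL code; the type, field, and constant names (AvSaturationState, SATURATION_GRACE_SECS, cost_delay_exempt) are invented for illustration only.

/*
 * Sketch of the rule: if every autovacuum worker slot has been continuously
 * busy for longer than a grace period, stop applying the cost delay.
 */
#include <stdbool.h>
#include <time.h>

#define SATURATION_GRACE_SECS (10 * 60)     /* the "say, 10 minutes" */

typedef struct AvSaturationState
{
    int     workers_running;    /* autovacuum workers currently active */
    int     workers_max;        /* autovacuum_max_workers */
    time_t  saturated_since;    /* 0 while at least one slot is free */
} AvSaturationState;

/*
 * Called periodically (in a real implementation, presumably from the
 * launcher).  Returns true while workers should ignore the cost delay.
 */
bool
cost_delay_exempt(AvSaturationState *st, time_t now)
{
    if (st->workers_running < st->workers_max)
    {
        st->saturated_since = 0;        /* free slot: reset the clock */
        return false;
    }

    if (st->saturated_since == 0)
        st->saturated_since = now;      /* saturation just began */

    return (now - st->saturated_since) > SATURATION_GRACE_SECS;
}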
Hi,

On 2024-06-17 15:39:27 -0400, Robert Haas wrote:
> As I mentioned in my talk at 2024.pgconf.dev, I think that the biggest
> problem with autovacuum as it exists today is that the cost delay is
> sometimes too low to keep up with the amount of vacuuming that needs
> to be done.

I agree it's a big problem, not sure it's *the* problem. But I'm happy to see it improved anyway, so it doesn't really matter.

One issue around all of this is that we pretty much don't have the tools to analyze autovacuum behaviour across a larger number of systems in a realistic way :/. I find my own view of what precisely the problem is being heavily swayed by the last few problematic cases I've looked at.

> I think we might be able to get fairly far by observing that if the
> number of running autovacuum workers is equal to the maximum allowable
> number of running autovacuum workers, that may be a sign of trouble,
> and the longer that situation persists, the more likely it is that
> we're in trouble. So, a very simple algorithm would be: If the maximum
> number of workers have been running continuously for more than, say,
> 10 minutes, assume we're falling behind and exempt all workers from
> the cost limit for as long as the situation persists. One could
> criticize this approach on the grounds that it causes a very sudden
> behavior change instead of, say, allowing the rate of vacuuming to
> gradually increase. I'm curious to know whether other people think
> that would be a problem.

Another issue is that it's easy to fall behind due to cost limits on systems where autovacuum_max_workers is smaller than the number of busy tables.

IME one common situation is to have a single table that's being vacuumed too slowly due to cost limits, with everything else keeping up easily.

> I think it might be OK, for a couple of reasons:
>
> 1. I'm unconvinced that the vacuum_cost_delay system actually prevents
> very many problems. I've fixed a lot of problems by telling users to
> raise the cost limit, but virtually never by lowering it. When we
> lowered the delay by an order of magnitude a few releases ago -
> equivalent to increasing the cost limit by an order of magnitude - I
> didn't personally hear any complaints about that causing problems. So
> disabling the delay completely some of the time might just be fine.

I have seen disabling cost limits cause replication setups to fall over because the amount of WAL increases beyond what can be replicated/archived/replayed. It's very easy to reach the issue when syncrep is enabled.

> 1a. Incidentally, when I have seen problems because of vacuum running
> "too fast", it's not been because it was using up too much I/O
> bandwidth, but because it's pushed too much data out of cache too
> quickly. A long overnight vacuum can evict a lot of pages from the
> system page cache by morning - the ring buffer only protects our
> shared_buffers, not the OS cache. I don't think this can be fixed by
> rate-limiting vacuum, though: to keep the cache eviction at a level
> low enough that you could be certain of not causing trouble, you'd
> have to limit it to an extremely low rate which would just cause
> vacuuming not to keep up. The cure would be worse than the disease at
> that point.

I've seen versions of this too. Ironically it's often made way worse by ringbuffers, because even if there is space in shared buffers, we'll not move buffers there, instead putting a lot of pressure on the OS page cache.
> If you like this idea, I'd like to know that, and hear any further
> thoughts you have about how to improve or refine it. If you don't, I'd
> like to know that, too, and any alternatives you can propose,
> especially alternatives that don't require crazy amounts of new
> infrastructure to implement.

I unfortunately don't know what to do about all of this with just the existing set of metrics :/.

One reason the cost limit can be important is that we often schedule autovacuums on tables in useless ways, over and over. Without cost limits providing *some* protection against using up all IO bandwidth / WAL volume, we could end up doing even worse.

Common causes of such useless vacuums I've seen:

- A long-running transaction prevents increasing relfrozenxid, so we run autovacuum over and over on the same relation, using up the whole cost budget. This is particularly bad because often we'll not autovacuum anything else, building up a larger and larger backlog of actual work.

- Tables where on-access pruning works very well end up being vacuumed far too frequently, because our autovacuum scheduling doesn't know about tuples having been cleaned up by on-access pruning.

- Larger tables with occasional lock conflicts cause autovacuum to be cancelled and restarted from scratch over and over. If that happens before the second table scan, this can easily eat up the whole cost budget without making forward progress.

Greetings,

Andres Freund
On Mon, Jun 17, 2024 at 3:39 PM Robert Haas <robertmhaas@gmail.com> wrote:
> So, a very simple algorithm would be: If the maximum number of workers
> have been running continuously for more than, say, 10 minutes, assume
> we're falling behind
Hmm, I don't know about the validity of this. I've seen plenty of cases where we hit the max workers but all is just fine. On the other hand, I don't have an alternative trigger point yet. But I do overall like the idea of dynamically changing the delay, and I agree it is pretty conservative.
> 2. If we decided to gradually increase the rate of vacuuming instead
> of just removing the throttling all at once, what formula would we use
> and why would that be the right idea?
Well, since the idea of disabling the delay is on the table, we could raise the cost limit every minute by X% until we effectively reach an infinite cost limit / zero delay situation. I presume this would only affect currently running vacs, and future ones would get the default cost limit until things get triggered again?
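Just to illustrate the shape of such a ramp (the 20%/minute figure and the names below are made up, not a concrete proposal):

/* Toy illustration of raising the effective cost limit by a fixed
 * percentage each minute while the trigger condition holds. */
#include <stdio.h>

#define BASE_COST_LIMIT 200     /* vacuum_cost_limit default */
#define RAMP_PCT        20      /* hypothetical: +20% per minute */

int
main(void)
{
    double  limit = BASE_COST_LIMIT;

    for (int minute = 1; minute <= 15; minute++)
    {
        limit *= 1.0 + RAMP_PCT / 100.0;
        printf("minute %2d: effective cost limit ~%.0f\n", minute, limit);
    }

    /* At +20%/minute the limit grows roughly 10x in ~13 minutes, i.e. the
     * same order-of-magnitude change as the earlier delay reduction. */
    return 0;
}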
Cheers,
Greg
On Tue, 18 Jun 2024 at 07:39, Robert Haas <robertmhaas@gmail.com> wrote:
> I think we might be able to get fairly far by observing that if the
> number of running autovacuum workers is equal to the maximum allowable
> number of running autovacuum workers, that may be a sign of trouble,
> and the longer that situation persists, the more likely it is that
> we're in trouble. So, a very simple algorithm would be: If the maximum
> number of workers have been running continuously for more than, say,
> 10 minutes, assume we're falling behind and exempt all workers from
> the cost limit for as long as the situation persists. One could
> criticize this approach on the grounds that it causes a very sudden
> behavior change instead of, say, allowing the rate of vacuuming to
> gradually increase. I'm curious to know whether other people think
> that would be a problem.

I think a nicer solution would implement some sort of unified "urgency level" and slowly ramp up the vacuum_cost_limit according to that urgency rather than effectively switching to an infinite vacuum_cost_limit when all workers have been going for N mins.

If there is nothing else that requires a vacuum while all 3 workers have been busy for an hour or two, it seems strange to hurry them up so they can more quickly start their next task -- being idle.

An additional feature that having this unified "urgency level" would provide is the ability to prioritise auto-vacuum so that it works on the most urgent tables first. I outlined some ideas in [1] of how this might be done.

David

[1] https://postgr.es/m/CAApHDvo8DWyt4CWhF=NPeRstz_78SteEuuNDfYO7cjp=7YTK4g@mail.gmail.com
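To sketch the shape of that idea (the formula, type, and field names below are invented for illustration and are not taken from the proposal in [1]):

/* Purely illustrative per-table "urgency level". */
#include <stdlib.h>

typedef struct TableVacNeed
{
    const char *relname;
    double      dead_frac;      /* dead tuples / reltuples */
    double      age_frac;       /* relfrozenxid age / autovacuum_freeze_max_age */
} TableVacNeed;

/* 0 = no urgency, 1 = must vacuum now; a real version would use more inputs. */
double
urgency(const TableVacNeed *t)
{
    return t->dead_frac > t->age_frac ? t->dead_frac : t->age_frac;
}

/* qsort() comparator: most urgent tables first, for worker scheduling. */
int
cmp_urgency(const void *a, const void *b)
{
    double  ua = urgency((const TableVacNeed *) a);
    double  ub = urgency((const TableVacNeed *) b);

    return (ua < ub) - (ua > ub);
}

/*
 * The same score could drive the throttle: interpolate the effective cost
 * limit between the configured value (urgency near 0) and effectively
 * unlimited (urgency near 1), instead of an all-or-nothing switch.
 */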
On Mon, Jun 17, 2024 at 03:39:27PM -0400, Robert Haas wrote:
> I think we might be able to get fairly far by observing that if the
> number of running autovacuum workers is equal to the maximum allowable
> number of running autovacuum workers, that may be a sign of trouble,
> and the longer that situation persists, the more likely it is that
> we're in trouble. So, a very simple algorithm would be: If the maximum
> number of workers have been running continuously for more than, say,
> 10 minutes, assume we're falling behind and exempt all workers from
> the cost limit for as long as the situation persists. One could
> criticize this approach on the grounds that it causes a very sudden
> behavior change instead of, say, allowing the rate of vacuuming to
> gradually increase. I'm curious to know whether other people think
> that would be a problem.
>
> I think it might be OK, for a couple of reasons:
>
> 1. I'm unconvinced that the vacuum_cost_delay system actually prevents
> very many problems. I've fixed a lot of problems by telling users to
> raise the cost limit, but virtually never by lowering it. When we
> lowered the delay by an order of magnitude a few releases ago -
> equivalent to increasing the cost limit by an order of magnitude - I
> didn't personally hear any complaints about that causing problems. So
> disabling the delay completely some of the time might just be fine.

Have we ruled out further adjustments to the cost parameters as a first step? If you are still recommending that folks raise it and never recommending that folks lower it, ISTM that our defaults might still not be in the right ballpark. The autovacuum_vacuum_cost_delay adjustment you reference (commit cbccac3) is already 5 years old, so maybe it's worth another look.

Perhaps only tangentially related, but I feel like the default of 3 for autovacuum_max_workers is a bit low, especially for systems with many tables. Changing the default for that likely requires changing the default for the delay/limit, too.

--
nathan
Hi,

On 2024-06-18 13:50:46 -0500, Nathan Bossart wrote:
> Have we ruled out further adjustments to the cost parameters as a first
> step?

I'm not against that, but it doesn't address the issue that with the current logic one set of values just isn't going to fit both a 60MB database that's allowed to burst to 100 iops and a 60TB database that has multiple 1M iops NVMe drives.

That said, the fact that vacuum_cost_page_hit is 1 and vacuum_cost_page_miss is 2 just doesn't make much sense aesthetically. There's a far bigger multiplier in actual costs than that...

> If you are still recommending that folks raise it and never recommending
> that folks lower it, ISTM that our defaults might still not be in the right
> ballpark. The autovacuum_vacuum_cost_delay adjustment you reference (commit
> cbccac3) is already 5 years old, so maybe it's worth another look.

Adjusting the cost delay much lower doesn't make much sense imo. It's already only 2ms on a 1ms granularity variable. We could increase the resolution, but sleeping for much shorter often isn't that cheap (you need to set up hardware timers all the time, and due to the short time they can't be combined with other timers) and/or barely gives time to switch to other tasks.

So we'd have to increase the cost limit.

Greetings,

Andres Freund
On Tue, Jun 18, 2024 at 01:32:38PM -0700, Andres Freund wrote:
> On 2024-06-18 13:50:46 -0500, Nathan Bossart wrote:
>> Have we ruled out further adjustments to the cost parameters as a first
>> step?
>
> I'm not against that, but it doesn't address the issue that with the
> current logic one set of values just isn't going to fit both a 60MB
> database that's allowed to burst to 100 iops and a 60TB database that
> has multiple 1M iops NVMe drives.

True.

>> If you are still recommending that folks raise it and never recommending
>> that folks lower it, ISTM that our defaults might still not be in the right
>> ballpark. The autovacuum_vacuum_cost_delay adjustment you reference (commit
>> cbccac3) is already 5 years old, so maybe it's worth another look.
>
> Adjusting the cost delay much lower doesn't make much sense imo. It's
> already only 2ms on a 1ms granularity variable. We could increase the
> resolution, but sleeping for much shorter often isn't that cheap (you
> need to set up hardware timers all the time, and due to the short time
> they can't be combined with other timers) and/or barely gives time to
> switch to other tasks.
>
> So we'd have to increase the cost limit.

Agreed.

--
nathan
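For reference, a quick back-of-the-envelope calculation of the ceilings implied by the current defaults (vacuum_cost_limit = 200, autovacuum_vacuum_cost_delay = 2ms, page costs 1/2/20 for hit/miss/dirty, 8kB blocks). This ignores the time the work itself takes, so real rates are lower:

/* Back-of-the-envelope ceilings implied by the default cost settings. */
#include <stdio.h>

int
main(void)
{
    const double    cost_limit = 200;   /* vacuum_cost_limit */
    const double    delay_s = 0.002;    /* autovacuum_vacuum_cost_delay, 2ms */
    const double    block_mb = 8192.0 / (1024 * 1024);
    const struct { const char *what; double cost; } ops[] = {
        {"hit   (vacuum_cost_page_hit   = 1) ", 1},
        {"miss  (vacuum_cost_page_miss  = 2) ", 2},
        {"dirty (vacuum_cost_page_dirty = 20)", 20},
    };

    for (int i = 0; i < 3; i++)
    {
        /* pages allowed per sleep cycle, divided by the cycle length */
        double  pages_per_sec = (cost_limit / ops[i].cost) / delay_s;

        printf("%s: %.0f pages/s ~ %.0f MB/s\n",
               ops[i].what, pages_per_sec, pages_per_sec * block_mb);
    }
    return 0;
}
/* Prints roughly 780 MB/s of hits, 390 MB/s of misses, 39 MB/s of dirtying. */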
Hi,

> Hi,
>
> On 2024-06-17 15:39:27 -0400, Robert Haas wrote:
>> As I mentioned in my talk at 2024.pgconf.dev, I think that the biggest
>> problem with autovacuum as it exists today is that the cost delay is
>> sometimes too low to keep up with the amount of vacuuming that needs
>> to be done.
>
> I agree it's a big problem, not sure it's *the* problem. But I'm happy to see
> it improved anyway, so it doesn't really matter.

In my experience, another big problem is the way we trigger an autovacuum on a relation. With the current strategy, if we have lots of writes between 9:00 AM and 5:00 PM, an autovacuum is more likely to be triggered during that window, which is also the application's peak time. If we could trigger vacuum at off-peak times instead, like 00:00 AM to 05:00 AM, then even if it used lots of resources it would be unlikely to cause any issue.

> One issue around all of this is that we pretty much don't have the tools to
> analyze autovacuum behaviour across a larger number of systems in a realistic
> way :/. I find my own view of what precisely the problem is being heavily
> swayed by the last few problematic cases I've looked at.
>
>> I think we might be able to get fairly far by observing that if the
>> number of running autovacuum workers is equal to the maximum allowable
>> number of running autovacuum workers, that may be a sign of trouble,
>> and the longer that situation persists, the more likely it is that
>> we're in trouble. So, a very simple algorithm would be: If the maximum
>> number of workers have been running continuously for more than, say,
>> 10 minutes, assume we're falling behind and exempt all workers from
>> the cost limit for as long as the situation persists. One could
>> criticize this approach on the grounds that it causes a very sudden
>> behavior change instead of, say, allowing the rate of vacuuming to
>> gradually increase. I'm curious to know whether other people think
>> that would be a problem.
>
> Another issue is that it's easy to fall behind due to cost limits on systems
> where autovacuum_max_workers is smaller than the number of busy tables.
>
> IME one common situation is to have a single table that's being vacuumed too
> slowly due to cost limits, with everything else keeping up easily.
>
>> I think it might be OK, for a couple of reasons:
>>
>> 1. I'm unconvinced that the vacuum_cost_delay system actually prevents
>> very many problems. I've fixed a lot of problems by telling users to
>> raise the cost limit, but virtually never by lowering it. When we
>> lowered the delay by an order of magnitude a few releases ago -
>> equivalent to increasing the cost limit by an order of magnitude - I
>> didn't personally hear any complaints about that causing problems. So
>> disabling the delay completely some of the time might just be fine.
>
> I have seen disabling cost limits cause replication setups to fall over
> because the amount of WAL increases beyond what can be
> replicated/archived/replayed. It's very easy to reach the issue when syncrep
> is enabled.

Usually applications have off-peak times; if we could take advantage of that, we might get good results. But I know it is hard to do in PostgreSQL core. I once tried it in an external system (an external monitor plus something crontab-like), and I could see the CPU / memory usage of autovacuum reduced a lot during the daytime (the application's peak time).

>> 1a. Incidentally, when I have seen problems because of vacuum running
>> "too fast", it's not been because it was using up too much I/O
>> bandwidth, but because it's pushed too much data out of cache too
>> quickly. A long overnight vacuum can evict a lot of pages from the
>> system page cache by morning - the ring buffer only protects our
>> shared_buffers, not the OS cache. I don't think this can be fixed by
>> rate-limiting vacuum, though: to keep the cache eviction at a level
>> low enough that you could be certain of not causing trouble, you'd
>> have to limit it to an extremely low rate which would just cause
>> vacuuming not to keep up. The cure would be worse than the disease at
>> that point.
>
> I've seen versions of this too. Ironically it's often made way worse by
> ringbuffers, because even if there is space in shared buffers, we'll not move
> buffers there, instead putting a lot of pressure on the OS page cache.

I can understand the pressure on the OS page cache, but I thought the OS page cache could be reused easily for other purposes. I'm not sure what outstanding issue it can cause.

> - A long-running transaction prevents increasing relfrozenxid, so we run
> autovacuum over and over on the same relation, using up the whole cost
> budget. This is particularly bad because often we'll not autovacuum
> anything else, building up a larger and larger backlog of actual work.

Could we maintain a pg_class.last_autovacuum_min_xid during vacuum, so that we could compare the OldestXminXid with pg_class.last_autovacuum_min_xid before doing the real work? I think we could use an in-place update on it to avoid creating too many versions of pg_class tuples when updating pg_class.last_autovacuum_min_xid.

> - Tables where on-access pruning works very well end up being vacuumed far
> too frequently, because our autovacuum scheduling doesn't know about tuples
> having been cleaned up by on-access pruning.

Good to know about this case. If we updated the pg_stats_xx metrics during on-access pruning, would that help with this?

> - Larger tables with occasional lock conflicts cause autovacuum to be
> cancelled and restarted from scratch over and over. If that happens before
> the second table scan, this can easily eat up the whole cost budget without
> making forward progress.

Off-peak time + manual vacuum should be helpful here, I think.

--
Best Regards
Andy Fan
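To illustrate the shape of the skip test Andy proposes above: last_autovacuum_min_xid is the proposed, not an existing, pg_class column, and plain integers stand in for real TransactionId arithmetic.

/* Illustration only: skip a freeze-motivated autovacuum of a relation if
 * the horizon has not advanced since the previous (wasted) attempt. */
#include <stdbool.h>
#include <stdint.h>

typedef struct RelVacInfo
{
    uint32_t    relfrozenxid;               /* existing pg_class column */
    uint32_t    last_autovacuum_min_xid;    /* proposed: horizon seen last time */
} RelVacInfo;

bool
worth_vacuuming_for_freeze(const RelVacInfo *rel, uint32_t oldest_xmin)
{
    /*
     * If the oldest running xmin is no newer than what the previous
     * autovacuum already saw, another pass cannot advance relfrozenxid
     * any further, so don't spend the cost budget on it again.
     */
    return oldest_xmin > rel->last_autovacuum_min_xid;
}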
Andy Fan <zhihuifan1213@163.com> writes:

>> - A long-running transaction prevents increasing relfrozenxid, so we run
>> autovacuum over and over on the same relation, using up the whole cost
>> budget. This is particularly bad because often we'll not autovacuum
>> anything else, building up a larger and larger backlog of actual work.
>
> Could we maintain a pg_class.last_autovacuum_min_xid during vacuum, so
> that we could compare the OldestXminXid with pg_class.last_autovacuum_min_xid
> before doing the real work?

Maintaining the OldestXminXid on this relation might be expensive.

>> - Tables where on-access pruning works very well end up being vacuumed far
>> too frequently, because our autovacuum scheduling doesn't know about tuples
>> having been cleaned up by on-access pruning.
>
> Good to know about this case. If we updated the pg_stats_xx metrics during
> on-access pruning, would that help with this?

I got the answer myself: it doesn't work, because on-access pruning works at the per-index level and one relation may have many indexes, while pg_stats_xx works at the relation level. So the above proposal doesn't work.

--
Best Regards
Andy Fan
I had suggested something more than just cost-limit throttling, which would be re-startable vacuum - https://www.postgresql.org/message-id/CAPdcCKpvZiRCoDxQoo9mXxXAK8w=bX5NQdTTgzvHV2sUXp0ihA@mail.gmail.com.
With cloud infrastructure and monitoring nowadays, it may not be difficult to predict patterns of idle periods. If we keep manual vacuum going in those idle periods, there would be much less chance of autovacuum becoming disruptive. This could be built with extensions or inside the engine.
However, this change is a bit bigger than just a config parameter. It didn't get much traction.
- Jay Sudrik