Thread: cost delay brainstorming

cost delay brainstorming

From: Robert Haas
Hi,

As I mentioned in my talk at 2024.pgconf.dev, I think that the biggest
problem with autovacuum as it exists today is that the cost delay is
sometimes too low to keep up with the amount of vacuuming that needs
to be done. I sketched a solution during the talk, but it was very
complicated, so I started to try to think of simpler ideas that might
still solve the problem, or at least be better than what we have
today.

I think we might be able to get fairly far by observing that if the
number of running autovacuum workers is equal to the maximum allowable
number of running autovacuum workers, that may be a sign of trouble,
and the longer that situation persists, the more likely it is that
we're in trouble. So, a very simple algorithm would be: If the maximum
number of workers have been running continuously for more than, say,
10 minutes, assume we're falling behind and exempt all workers from
the cost limit for as long as the situation persists. One could
criticize this approach on the grounds that it causes a very sudden
behavior change instead of, say, allowing the rate of vacuuming to
gradually increase. I'm curious to know whether other people think
that would be a problem.
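
To make that concrete, here is a rough sketch of the test I have in
mind (all names invented; this is not a patch against the real
autovacuum code, just the shape of it):

    /*
     * Sketch only, invented names: decide whether to stop enforcing the
     * cost delay because every autovacuum worker slot has been occupied
     * continuously for longer than some threshold.
     */
    #include <stdbool.h>
    #include <time.h>

    #define AV_SATURATION_THRESHOLD_SECS    (10 * 60)

    static bool
    av_should_ignore_cost_limit(int active_workers, int max_workers,
                                time_t saturated_since, time_t now)
    {
        /* If any worker slot is free, assume we're keeping up. */
        if (active_workers < max_workers)
            return false;

        /* All slots busy: only react once that's been true for a while. */
        return (now - saturated_since) >= AV_SATURATION_THRESHOLD_SECS;
    }

Presumably the launcher would track saturated_since, resetting it
whenever a worker slot frees up, and workers would recheck the result at
each cost-delay point so that the exemption ends as soon as the
situation clears.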

I think it might be OK, for a couple of reasons:

1. I'm unconvinced that the vacuum_cost_delay system actually prevents
very many problems. I've fixed a lot of problems by telling users to
raise the cost limit, but virtually never by lowering it. When we
lowered the delay by an order of magnitude a few releases ago -
equivalent to increasing the cost limit by an order of magnitude - I
didn't personally hear any complaints about that causing problems. So
disabling the delay completely some of the time might just be fine.

1a. Incidentally, when I have seen problems because of vacuum running
"too fast", it's not been because it was using up too much I/O
bandwidth, but because it's pushed too much data out of cache too
quickly. A long overnight vacuum can evict a lot of pages from the
system page cache by morning - the ring buffer only protects our
shared_buffers, not the OS cache. I don't think this can be fixed by
rate-limiting vacuum, though: to keep the cache eviction at a level
low enough that you could be certain of not causing trouble, you'd
have to limit it to an extremely low rate which would just cause
vacuuming not to keep up. The cure would be worse than the disease at
that point.

2. If we decided to gradually increase the rate of vacuuming instead
of just removing the throttling all at once, what formula would we use
and why would that be the right idea? We'd need a lot of state to
really do a calculation of how fast we would need to go in order to
keep up, and that starts to rapidly turn into a very complicated
project along the lines of what I mooted in Vancouver. Absent that,
the only other idea I have is to gradually ramp up the cost limit
higher and higher, which we could do, but we would have no idea how
fast to ramp it up, so anything we do here feels like it's just
picking random numbers and calling them an algorithm.

If you like this idea, I'd like to know that, and hear any further
thoughts you have about how to improve or refine it. If you don't, I'd
like to know that, too, and any alternatives you can propose,
especially alternatives that don't require crazy amounts of new
infrastructure to implement.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: cost delay brainstorming

From: Andres Freund
Hi,

On 2024-06-17 15:39:27 -0400, Robert Haas wrote:
> As I mentioned in my talk at 2024.pgconf.dev, I think that the biggest
> problem with autovacuum as it exists today is that the cost delay is
> sometimes too low to keep up with the amount of vacuuming that needs
> to be done.

I agree it's a big problem, not sure it's *the* problem. But I'm happy to see
it improved anyway, so it doesn't really matter.

One issue around all of this is that we pretty much don't have the tools to
analyze autovacuum behaviour across a larger number of systems in a realistic
way :/.  I find my own view of what precisely the problem is being heavily
swayed by the last few problematic cases I've looked at.



> I think we might be able to get fairly far by observing that if the
> number of running autovacuum workers is equal to the maximum allowable
> number of running autovacuum workers, that may be a sign of trouble,
> and the longer that situation persists, the more likely it is that
> we're in trouble. So, a very simple algorithm would be: If the maximum
> number of workers have been running continuously for more than, say,
> 10 minutes, assume we're falling behind and exempt all workers from
> the cost limit for as long as the situation persists. One could
> criticize this approach on the grounds that it causes a very sudden
> behavior change instead of, say, allowing the rate of vacuuming to
> gradually increase. I'm curious to know whether other people think
> that would be a problem.

Another issue is that it's easy to fall behind due to cost limits on systems
where autovacuum_max_workers is smaller than the number of busy tables.

IME one common situation is to have a single table that's being vacuumed too
slowly due to cost limits, with everything else keeping up easily.


> I think it might be OK, for a couple of reasons:
>
> 1. I'm unconvinced that the vacuum_cost_delay system actually prevents
> very many problems. I've fixed a lot of problems by telling users to
> raise the cost limit, but virtually never by lowering it. When we
> lowered the delay by an order of magnitude a few releases ago -
> equivalent to increasing the cost limit by an order of magnitude - I
> didn't personally hear any complaints about that causing problems. So
> disabling the delay completely some of the time might just be fine.

I have seen disabling cost limits cause replication setups to fall over
because the amount of WAL increases beyond what can be
replicated/archived/replayed.  It's very easy to reach the issue when syncrep
is enabled.



> 1a. Incidentally, when I have seen problems because of vacuum running
> "too fast", it's not been because it was using up too much I/O
> bandwidth, but because it's pushed too much data out of cache too
> quickly. A long overnight vacuum can evict a lot of pages from the
> system page cache by morning - the ring buffer only protects our
> shared_buffers, not the OS cache. I don't think this can be fixed by
> rate-limiting vacuum, though: to keep the cache eviction at a level
> low enough that you could be certain of not causing trouble, you'd
> have to limit it to an extremely low rate which would just cause
> vacuuming not to keep up. The cure would be worse than the disease at
> that point.

I've seen versions of this too. Ironically it's often made way worse by
ringbuffers, because even if there is space in shared buffers, we'll not move
buffers there, instead putting a lot of pressure on the OS page cache.


> If you like this idea, I'd like to know that, and hear any further
> thoughts you have about how to improve or refine it. If you don't, I'd
> like to know that, too, and any alternatives you can propose,
> especially alternatives that don't require crazy amounts of new
> infrastructure to implement.

I unfortunately don't know what to do about all of this with just the existing
set of metrics :/.


One reason that cost limit can be important is that we often schedule
autovacuums on tables in useless ways, over and over. Without cost limits
providing *some* protection against using up all IO bandwidth / WAL volume, we
could end up doing even worse.


Common causes of such useless vacuums I've seen:

- Longrunning transaction prevents increasing relfrozenxid, we run autovacuum
  over and over on the same relation, using up the whole cost budget. This is
  particularly bad because often we'll not autovacuum anything else, building
  up a larger and larger backlog of actual work.

- Tables, where on-access pruning works very well, end up being vacuumed far
  too frequently, because our autovacuum scheduling doesn't know about tuples
  having been cleaned up by on-access pruning.

- Larger tables with occasional lock conflicts cause autovacuum to be
  cancelled and restarted from scratch over and over. If that happens before
  the second table scan, this can easily eat up the whole cost budget without
  making forward progress.


Greetings,

Andres Freund



Re: cost delay brainstorming

From: Greg Sabino Mullane
On Mon, Jun 17, 2024 at 3:39 PM Robert Haas <robertmhaas@gmail.com> wrote:
> So, a very simple algorithm would be: If the maximum number of workers
> have been running continuously for more than, say, 10 minutes, assume
> we're falling behind

Hmm, I don't know about the validity of this. I've seen plenty of cases where we hit the max workers but all is just fine. On the other hand, I don't have an alternative trigger point yet. But I do overall like the idea of dynamically changing the delay. And agree it is pretty conservative.
 
> 2. If we decided to gradually increase the rate of vacuuming instead of
> just removing the throttling all at once, what formula would we use and
> why would that be the right idea?

Well, since the idea of disabling the delay is on the table, we could raise the cost every minute by X% until we effectively reach an infinite cost / zero delay situation. I presume this would only affect currently running vacs, and future ones would get the default cost until things get triggered again?
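
For example (the numbers are arbitrary, just to show the shape of it):
starting from the default cost limit of 200 and raising it by 20% per
minute gives roughly

    200 * 1.2^10 ~ 1,240    after 10 minutes
    200 * 1.2^20 ~ 7,670    after 20 minutes
    200 * 1.2^22 > 10,000   after about 22 minutes

so within half an hour the throttle is effectively gone, but without the
sudden all-at-once jump.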

Cheers,
Greg

Re: cost delay brainstorming

From: David Rowley
On Tue, 18 Jun 2024 at 07:39, Robert Haas <robertmhaas@gmail.com> wrote:
> I think we might be able to get fairly far by observing that if the
> number of running autovacuum workers is equal to the maximum allowable
> number of running autovacuum workers, that may be a sign of trouble,
> and the longer that situation persists, the more likely it is that
> we're in trouble. So, a very simple algorithm would be: If the maximum
> number of workers have been running continuously for more than, say,
> 10 minutes, assume we're falling behind and exempt all workers from
> the cost limit for as long as the situation persists. One could
> criticize this approach on the grounds that it causes a very sudden
> behavior change instead of, say, allowing the rate of vacuuming to
> gradually increase. I'm curious to know whether other people think
> that would be a problem.

I think a nicer solution would implement some sort of unified "urgency
level" and slowly ramp up the vacuum_cost_limit according to that
urgency rather than effectively switching to an infinite
vacuum_cost_limit when all workers have been going for N mins. If
there is nothing else that requires a vacuum while all 3 workers have
been busy for an hour or two, it seems strange to hurry them up so
they can more quickly start their next task -- being idle.

An additional feature that having this unified "urgency level" will
provide is the ability to prioritise auto-vacuum so that it works on
the most urgent tables first.
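
Hand-waving a bit, and with entirely made-up names, the kind of thing I
imagine is a per-table score that the launcher could both sort the
worklist by and feed into the cost limit. A minimal sketch:

    /*
     * Sketch only: a per-table "urgency", where 1.0 means the table has
     * just crossed its trigger threshold.  Whichever pressure is worse,
     * bloat or wraparound, dominates.
     */
    static double
    table_vacuum_urgency(double dead_tuples, double dead_tuple_threshold,
                         double relfrozenxid_age, double freeze_max_age)
    {
        double  bloat_urgency = dead_tuples / dead_tuple_threshold;
        double  wraparound_urgency = relfrozenxid_age / freeze_max_age;

        return (bloat_urgency > wraparound_urgency) ?
            bloat_urgency : wraparound_urgency;
    }

The effective cost limit could then scale with the highest urgency among
the tables still waiting, so that three busy-but-healthy workers don't
get sped up just so they can reach idleness sooner.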

I outlined some ideas in [1] of how this might be done.

David

[1] https://postgr.es/m/CAApHDvo8DWyt4CWhF=NPeRstz_78SteEuuNDfYO7cjp=7YTK4g@mail.gmail.com



Re: cost delay brainstorming

From: Nathan Bossart
On Mon, Jun 17, 2024 at 03:39:27PM -0400, Robert Haas wrote:
> I think we might be able to get fairly far by observing that if the
> number of running autovacuum workers is equal to the maximum allowable
> number of running autovacuum workers, that may be a sign of trouble,
> and the longer that situation persists, the more likely it is that
> we're in trouble. So, a very simple algorithm would be: If the maximum
> number of workers have been running continuously for more than, say,
> 10 minutes, assume we're falling behind and exempt all workers from
> the cost limit for as long as the situation persists. One could
> criticize this approach on the grounds that it causes a very sudden
> behavior change instead of, say, allowing the rate of vacuuming to
> gradually increase. I'm curious to know whether other people think
> that would be a problem.
> 
> I think it might be OK, for a couple of reasons:
> 
> 1. I'm unconvinced that the vacuum_cost_delay system actually prevents
> very many problems. I've fixed a lot of problems by telling users to
> raise the cost limit, but virtually never by lowering it. When we
> lowered the delay by an order of magnitude a few releases ago -
> equivalent to increasing the cost limit by an order of magnitude - I
> didn't personally hear any complaints about that causing problems. So
> disabling the delay completely some of the time might just be fine.

Have we ruled out further adjustments to the cost parameters as a first
step?  If you are still recommending that folks raise it and never
recommending that folks lower it, ISTM that our defaults might still not be
in the right ballpark.  The autovacuum_vacuum_cost_delay adjustment you
reference (commit cbccac3) is already 5 years old, so maybe it's worth
another look.

Perhaps only tangentially related, but I feel like the default of 3 for
autovacuum_max_workers is a bit low, especially for systems with many
tables.  Changing the default for that likely requires changing the default
for the delay/limit, too.

-- 
nathan



Re: cost delay brainstorming

From: Andres Freund
Hi,

On 2024-06-18 13:50:46 -0500, Nathan Bossart wrote:
> Have we ruled out further adjustments to the cost parameters as a first
> step?

I'm not against that, but it doesn't address the issue that with the current
logic one set of values just isn't going to fit a 60MB database that's allowed
to burst to 100 iops and a 60TB database that has multiple 1M iops NVMe drives.


That said, the fact that vacuum_cost_page_hit is 1 and vacuum_cost_page_miss
is 2 just doesn't make much sense aesthetically. There's a far bigger
multiplier in actual costs than that...



> If you are still recommending that folks raise it and never recommending
> that folks lower it, ISTM that our defaults might still not be in the right
> ballpark.  The autovacuum_vacuum_cost_delay adjustment you reference (commit
> cbccac3) is already 5 years old, so maybe it's worth another look.

Adjusting cost delay much lower doesn't make much sense imo. It's already only
2ms on a 1ms granularity variable.  We could increase the resolution, but
sleeping for much shorter often isn't that cheap (you need to set up hardware
timers all the time and due to the short time they can't be combined with
other timers) and/or barely gives time to switch to other tasks.


So we'd have to increase the cost limit.
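
To put rough numbers on the current ceiling: with the default cost limit
of 200 and the 2ms delay, vacuum gets about 200 / 0.002s = 100,000 cost
units per second (shared across the workers). At vacuum_cost_page_miss =
2 that is 50,000 page reads/s, on the order of 400MB/s of 8kB pages, and
at vacuum_cost_page_dirty = 20 it is 5,000 dirtied pages/s, roughly
40MB/s. That's already far beyond the burstable-100-iops case and a
rounding error for the NVMe case.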


Greetings,

Andres Freund



Re: cost delay brainstorming

From: Nathan Bossart
On Tue, Jun 18, 2024 at 01:32:38PM -0700, Andres Freund wrote:
> On 2024-06-18 13:50:46 -0500, Nathan Bossart wrote:
>> Have we ruled out further adjustments to the cost parameters as a first
>> step?
> 
> I'm not against that, but it doesn't address the issue that with the current
> logic one set of values just isn't going to fit a 60MB database that's allowed
> to burst to 100 iops and a 60TB database that has multiple 1M iops NVMe drives.

True.

>> If you are still recommending that folks raise it and never recommending
>> that folks lower it, ISTM that our defaults might still not be in the right
>> ballpark.  The autovacuum_vacuum_cost_delay adjustment you reference (commit
>> cbccac3) is already 5 years old, so maybe it's worth another look.
> 
> Adjusting cost delay much lower doesn't make much sense imo. It's already only
> 2ms on a 1ms granularity variable.  We could increase the resolution, but
> sleeping for much shorter often isn't that cheap (you need to set up hardware
> timers all the time and due to the short time they can't be combined with
> other timers) and/or barely gives time to switch to other tasks.
> 
> 
> So we'd have to increase the cost limit.

Agreed.

-- 
nathan



Re: cost delay brainstorming

From: Andy Fan
Hi, 
> Hi,
>
> On 2024-06-17 15:39:27 -0400, Robert Haas wrote:
>> As I mentioned in my talk at 2024.pgconf.dev, I think that the biggest
>> problem with autovacuum as it exists today is that the cost delay is
>> sometimes too low to keep up with the amount of vacuuming that needs
>> to be done.
>
> I agree it's a big problem, not sure it's *the* problem. But I'm happy to see
> it improved anyway, so it doesn't really matter.

In my past experience, another big problem is the way we trigger an
autovacuum on a relation. With the current strategy, if we have lots of
writes between 9:00 AM and 5:00 PM, an autovacuum is more likely to be
triggered during that window, which is also the application's peak time.

If we could trigger vacuum at off-peak times instead, say 00:00 to
05:00, then even if it used a lot of resources it would be unlikely to
cause any issues.

> One issue around all of this is that we pretty much don't have the tools to
> analyze autovacuum behaviour across a larger number of systems in a realistic
> way :/.  I find my own view of what precisely the problem is being heavily
> swayed by the last few problematic cases I've looked at.
>
>
>> I think we might be able to get fairly far by observing that if the
>> number of running autovacuum workers is equal to the maximum allowable
>> number of running autovacuum workers, that may be a sign of trouble,
>> and the longer that situation persists, the more likely it is that
>> we're in trouble. So, a very simple algorithm would be: If the maximum
>> number of workers have been running continuously for more than, say,
>> 10 minutes, assume we're falling behind and exempt all workers from
>> the cost limit for as long as the situation persists. One could
>> criticize this approach on the grounds that it causes a very sudden
>> behavior change instead of, say, allowing the rate of vacuuming to
>> gradually increase. I'm curious to know whether other people think
>> that would be a problem.
>
> Another issue is that it's easy to fall behind due to cost limits on systems
> where autovacuum_max_workers is smaller than the number of busy tables.
>
> IME one common situation is to have a single table that's being vacuumed too
> slowly due to cost limits, with everything else keeping up easily.
>
>
>> I think it might be OK, for a couple of reasons:
>>
>> 1. I'm unconvinced that the vacuum_cost_delay system actually prevents
>> very many problems. I've fixed a lot of problems by telling users to
>> raise the cost limit, but virtually never by lowering it. When we
>> lowered the delay by an order of magnitude a few releases ago -
>> equivalent to increasing the cost limit by an order of magnitude - I
>> didn't personally hear any complaints about that causing problems. So
>> disabling the delay completely some of the time might just be fine.
>
> I have seen disabling cost limits cause replication setups to fall over
> because the amount of WAL increases beyond what can be
> replicated/archived/replayed.  It's very easy to reach the issue when syncrep
> is enabled.

Usually applications have off-peak times; if we can take advantage of
that, we might get good results. I know it is hard to do in PostgreSQL
core, but I have tried it in an external system (an external monitor
plus something crontab-like), and the CPU / memory usage of autovacuum
during the daytime (the application's peak time) was reduced a lot.


>> 1a. Incidentally, when I have seen problems because of vacuum running
>> "too fast", it's not been because it was using up too much I/O
>> bandwidth, but because it's pushed too much data out of cache too
>> quickly. A long overnight vacuum can evict a lot of pages from the
>> system page cache by morning - the ring buffer only protects our
>> shared_buffers, not the OS cache. I don't think this can be fixed by
>> rate-limiting vacuum, though: to keep the cache eviction at a level
>> low enough that you could be certain of not causing trouble, you'd
>> have to limit it to an extremely low rate which would just cause
>> vacuuming not to keep up. The cure would be worse than the disease at
>> that point.
>
> I've seen versions of this too. Ironically it's often made way worse by
> ringbuffers, because even if there is space in shared buffers, we'll not move
> buffers there, instead putting a lot of pressure on the OS page cache.

I can understand the pressure on the OS page cache, but I thought OS
page cache memory could be reclaimed easily for other purposes. I'm not
sure what outstanding issue it can cause.

> - Longrunning transaction prevents increasing relfrozenxid, we run autovacuum
>   over and over on the same relation, using up the whole cost budget. This is
>   particularly bad because often we'll not autovacuum anything else, building
>   up a larger and larger backlog of actual work.

Could we maintain a pg_class.last_autovacuum_min_xid during vacuum, so
that we could compare the OldestXminXid with
pg_class.last_autovacuum_min_xid before doing the real work? I think we
could use an in-place update for it, to avoid creating too many versions
of pg_class tuples when updating pg_class.last_autovacuum_min_xid.
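
Roughly the check I am thinking of (names invented, just to show the
shape; the pg_class column does not exist today):

    /*
     * Sketch only: skip another round of vacuuming when the oldest
     * running xmin has not advanced past the horizon the previous
     * autovacuum already used, since a new run could not remove or
     * freeze anything more.
     */
    #include "postgres.h"
    #include "access/transam.h"

    static bool
    vacuum_would_be_useless(TransactionId oldest_running_xmin,
                            TransactionId last_autovacuum_min_xid)
    {
        return !TransactionIdPrecedes(last_autovacuum_min_xid,
                                      oldest_running_xmin);
    }

If that returned true we could skip (or at least deprioritize) the table
instead of spending the cost budget on it again.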

>
> - Tables, where on-access pruning works very well, end up being vacuumed far
>   too frequently, because our autovacuum scheduling doesn't know about tuples
>   having been cleaned up by on-access pruning.

Good to know about this case. If we updated the pg_stats_xx metrics
during on-access pruning, would that help here?

> - Larger tables with occasional lock conflicts cause autovacuum to be
>   cancelled and restarted from scratch over and over. If that happens before
>   the second table scan, this can easily eat up the whole cost budget without
>   making forward progress.

Off-peak time + manual vacuum should be helpful I think.

-- 
Best Regards
Andy Fan




Re: cost delay brainstorming

From: Andy Fan
Andy Fan <zhihuifan1213@163.com> writes:

>
>> - Longrunning transaction prevents increasing relfrozenxid, we run autovacuum
>>   over and over on the same relation, using up the whole cost budget. This is
>>   particularly bad because often we'll not autovacuum anything else, building
>>   up a larger and larger backlog of actual work.
>
> Could we maintain a pg_class.last_autovacuum_min_xid during vacuum, so
> that we could compare the OldestXminXid with
> pg_class.last_autovacuum_min_xid before doing the real work?

Maintaining the oldestXminXid on this relation might be expensive. 

>>
>> - Tables, where on-access pruning works very well, end up being vacuumed far
>>   too frequently, because our autovacuum scheduling doesn't know about tuples
>>   having been cleaned up by on-access pruning.
>
> Good to know about this case. If we updated the pg_stats_xx metrics
> during on-access pruning, would that help here?

I answered my own question: it doesn't work, because on-access pruning
works at the per-index level and one relation may have many indexes,
while pg_stats_xx works at the relation level. So the above proposal
doesn't work.

-- 
Best Regards
Andy Fan




Re: cost delay brainstorming

From: Jay
I had suggested something more than just cost-limit throttling, namely
re-startable vacuum - https://www.postgresql.org/message-id/CAPdcCKpvZiRCoDxQoo9mXxXAK8w=bX5NQdTTgzvHV2sUXp0ihA@mail.gmail.com

It may not be difficult to predict patterns of idle periods with cloud
infrastructure and monitoring nowadays. If we keep manual vacuum going
during those idle periods, there is much less chance of autovacuum
becoming disruptive. This could be built with extensions or inside the
engine.

However, this change is a bit bigger than just a config parameter. It didn't get much traction. 

- Jay Sudrik
