Thread: autovacuum prioritization
autovacuum, as it exists today, has some reasonable - though imperfect - heuristics for deciding which tables need to be vacuumed or analyzed. However, it has nothing to help it decide which tables need to be vacuumed or analyzed first. As a result, when it's configured aggressively enough, and has sufficient hardware resources, to do all the things that need to get done, the system holds together reasonably well. When it's not configured sufficiently aggressively, or when hardware resources are lacking, the situation deteriorates. Some deterioration is inevitable in that kind of scenario, because if you can't do all of the things that need to be done, something will have to get postponed, and that can cause problems. However, autovacuum makes it worse than it needs to be by deciding which things to do first more or less at random. When autovacuum chooses to vacuum a table that is a little bit bloated in preference to one that is severely bloated, or one that is bloated in preference to one that is about to cause a wraparound shutdown, users are understandably upset, and some of them post on this mailing list about it.

I am not going to try to review in detail the history of previous threads looking for action in this area, but there are quite a few. In my view, previous efforts in this area have been too simplistic. For example, it's been proposed that a table that is perceived to be in any kind of wraparound danger ought to get top priority, but I find that implausible. A mostly-quiescent table that is one XID past autovacuum_freeze_max_age is not likely to be a more urgent problem than a table that is absorbing 20,000 row updates per second. There is a long way between the default value of autovacuum_freeze_max_age and a wraparound shutdown, but if you don't vacuum a hotly-updated table promptly, you will get irreversible bloat and be forced into an application shutdown to run VACUUM FULL.

It's also been proposed that we provide some way to let the user set priorities, which was rightly criticized as foisting on the user what the system ought to be figuring out for itself. Perhaps a user-controllable prioritization system is a good idea, but only as a way of overriding some built-in system in cases where that goes wrong. A few people have proposed scoring systems, which I think is closer to the right idea, because our basic goal is to start vacuuming any given table soon enough that we finish vacuuming it before some catastrophe strikes. The more imminent the catastrophe, the more urgent it is to start vacuuming right away. Also, and I think this is very important, the longer vacuum is going to take, the more urgent it is to start vacuuming right away. If table A will cause wraparound in 2 hours and take 2 hours to vacuum, and table B will cause wraparound in 1 hour and take 10 minutes to vacuum, table A is more urgent even though the catastrophe is further out.

So at a high level, I think that what we ought to do is, first, for each table, estimate the time at which we think something bad will occur. There are several bad events: too much bloat, XID wraparound, MXID wraparound. We need to estimate the time at which we think each of those things will occur, and then take the earliest of those estimates. That's the time by which we need to have finished vacuuming the table. Then, make an estimate of how long it will take to complete a vacuum of the table, subtract that from the time at which we need to be done, and that's the time by which we need to start. The table with the earliest need-to-start time is the highest-priority table.
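In rough pseudocode, the calculation I have in mind looks something like this (a Python sketch; the field names are invented, and all of the real difficulty hides inside producing the four estimates):

    def need_to_start(table):
        # table: dict with three projected catastrophe times (datetimes)
        # and an estimated vacuum duration (timedelta), all of which we
        # would have to learn to estimate from statistics we don't
        # currently track.
        need_to_finish = min(table["bloat_deadline"],
                             table["xid_wraparound_deadline"],
                             table["mxid_wraparound_deadline"])
        # Work backward: start early enough to be done by the deadline.
        return need_to_finish - table["estimated_vacuum_duration"]

    def pick_next(tables):
        # Most urgent table = earliest need-to-start time.
        return min(tables, key=need_to_start)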
There are a number of problems here. One is that we actually need to be able to estimate all the things that I just described, which will probably require tracking statistics that we don't capture today, such as the rate at which the system is consuming XIDs, and the rate at which a table is bloating, rather than just the current state of things. Unfortunately, proposing that the statistics system should store more per-table information is a well-known way to get your patch forcefully rejected. Maybe if we were storing and updating the statistics data in a better way it wouldn't be an issue - so perhaps the shared-memory stats collector stuff will resolve this issue somehow. Or maybe it's not an issue anyway, since the big problem with the statistics files is that they have to be constantly rewritten. If we took snapshots of the values in even a relatively dumb way, they'd be kinda big, but they'd also be write-once. Maybe that would keep the expense reasonable.

A second problem is that, if the earliest need-to-start time is in the past, then we definitely are in trouble and had better get to work at once, but if it's in the future, that doesn't necessarily mean we're safe. If there are three tables with a need-to-finish time that is 12 hours in the future and each of them will take 11 hours to vacuum, then every need-to-start time computed according to the algorithm above is in the future, but in fact we're in a lot of trouble. If the estimates are accurate, we need 3 autovacuum workers to be available to start within 1 hour, or we're doomed. The very last thing we want to do is wait another hour before doing anything. It's not impossible to factor this into the calculation of need-to-start times, assuming we know how many workers we have. For instance, if we've got tables whose need-to-finish times are 30, 50, and 70 minutes in the future, we can see that if each one takes 20 minutes or less to vacuum, then the need-to-start times can just be computed by subtraction. But if the tables with 50- or 70-minute deadlines are going to take more than 20 minutes to vacuum, then we've got to back up the need-to-start times so that we finish each table in time to start on the next one. I haven't looked into what algorithms exist for this kind of scheduling problem, but I feel like a literature search, or pulling out my college textbooks, would probably turn up some useful ideas.

As soon as you see that you can't decide when you need to start on a particular table without knowing what's going on with all the other tables on the system, a third problem becomes apparent: you can't figure anything out with confidence by looking at a single database, but must rather gather information from all databases and decide to which databases the workers should connect and what they ought to do when they get there. And then, as tasks finish and system activity progresses, you need to continuously update your notion of what needs to be done next and move workers around to accommodate it. This gets quite complicated, but that doesn't mean that it's unimportant. There's a pretty well-known "thundering herd" type effect when every table in the system crosses autovacuum_freeze_max_age around the same time, and suddenly we go from not much vacuuming to a whole ton of vacuuming all at once. A system like this could give us enough information to spread that out in an intelligent way: we could see the storm coming and start a single autovacuum worker working on the problem well in advance, and then ramp up to multiple workers only if it looks like that's not going to be enough to get the job done. I'm not sure I want to deal with all that complexity on day one, but I think it's important to do something about it eventually.

In the meantime, I think a sensible place to start would be to figure out some system that makes sensible estimates of how soon we need to address bloat, XID wraparound, and MXID wraparound for each table, and some system that estimates how long each one will take to vacuum. Then, even if the workers aren't talking to each other, each individual worker can make an effort to deal with the most urgent tasks first. I think that estimating how long it will take to vacuum a table shouldn't be too bad: examining the visibility map and the index sizes and thinking about the autovacuum cost limits should give us at least some notion of what's likely to happen.

Also, I think estimating the time until XID age becomes critical isn't too bad either. First, fix a threshold (perhaps autovacuum_freeze_max_age, maybe with a higher-than-current value, or maybe some other threshold entirely) that represents the target below which we always want to remain. Second, know something about how fast the system is consuming XIDs. Then just divide. I thought for a while that it would be too hard to understand the XID consumption rate, because it might be very uneven, but I think we can mitigate that problem somewhat by averaging over relatively long time intervals. For instance, we might measure the number of XIDs consumed in each 24 hour period, keep 7 days of data, and then take the average of those values, or maybe better, the highest or second-highest. That's a very small amount of data to store, and in combination with the relfrozenxid for each table, it's all we need. We could also give the user a way to override our estimate of the XID consumption rate, in case they have very brief, very severe spikes. All of this can also be done for MXID age.
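As a sketch (applied to XIDs here; MXIDs would work the same way; the names and arithmetic are invented, and the bucketing is the part we'd actually have to build):

    def xid_consumption_rate(daily_xids):
        # daily_xids: XIDs consumed in each of the last 7 days, one
        # bucket per 24-hour period. Taking the second-highest bucket
        # is conservative without letting a single freak spike
        # dominate; assumes at least 2 buckets exist.
        return sorted(daily_xids)[-2] / 86400.0  # XIDs per second

    def seconds_until_xid_trouble(current_xid, relfrozenxid, threshold,
                                  daily_xids):
        # XID arithmetic simplified to plain integers for the sketch.
        headroom = threshold - (current_xid - relfrozenxid)
        return max(headroom, 0) / xid_consumption_rate(daily_xids)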
It's estimating the time at which table bloat will exceed some threshold that seems most difficult, because that seems to require measuring trends on a per-table basis, as opposed to the XID consumption rate, which is global.

I know that this email is kind of a giant wall of text, so my thanks if you've read this far, and even more if you feel inspired to write back with your own thoughts.

Thanks,

--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Jan 20, 2022 at 11:24 AM Robert Haas <robertmhaas@gmail.com> wrote:
> In my view, previous efforts in this area have been too simplistic. For example, it's been proposed that a table that is perceived to be in any kind of wraparound danger ought to get top priority, but I find that implausible.

I agree that it doesn't follow that table A should be more of a priority than table B, either because it has a greater age, or because its age happens to exceed some actually-arbitrary threshold. But I will point out that my ongoing work on freezing does make something along these lines much more plausible. As I said over on that thread, there is now a kind of "natural variation" among tables, in terms of relfrozenxid, as a result of tracking the actual oldest XID, and using that (plus the emphasis on advancing relfrozenxid wherever possible). And so we'll have a much better idea of what's going on with each table -- it's typically a precise XID value from the table, from the recent past.

As of today, on HEAD, the picture is rather fuzzy. If a table has a really old relminmxid, then which is more likely:

1. there are lots of remaining MultiXactId references in the table, or

2. it has been a while since that table was last aggressively vacuumed, and it actually has exactly zero MultiXactId references?

I would guess 2 myself, but right now I could never be too sure. But, in a world where we consistently advance relfrozenxid and relminmxid, then *not* advancing them (or advancing either by relatively little in one particular table) becomes a strong signal, in a way that it just isn't currently. This is a negative signal, not a positive signal. And as you yourself go on to say, that's what any new heuristics for this stuff ought to be exclusively concerned with -- what not to allow to happen, ever. There is a great deal of diversity among healthy databases; they're hard to make generalizations about that work. But unhealthy (really very unhealthy) states are *far* easier to recognize and understand, without really needing to understand the workload itself at all.

Since we now have the failsafe, the scheduling algorithm can afford to not give too much special attention to table age until we're maybe over the 1 billion age mark -- or even 1.5 billion+. But once the scheduling stuff starts to give table age special attention, it should probably become the dominant consideration, by far, completely drowning out any signals about bloat. It's kinda never really supposed to get that high, so when we do end up there it is reasonable to fully freak out. Unlike the bloat criteria, the wraparound safety criteria don't seem to have much recognizable space between not worrying at all, and freaking out.

> A second problem is that, if the earliest need-to-start time is in the past, then we definitely are in trouble and had better get to work at once, but if it's in the future, that doesn't necessarily mean we're safe.

There is a related problem that you didn't mention: autovacuum_max_workers controls how many autovacuum workers can run at once, but there is no particular concern for whether or not running that many workers actually makes sense, in any given scenario. As a general rule, the system should probably be *capable* of running a large number of autovacuums at the same time, but never actually do that (because it just doesn't ever prove necessary). Better to have the option and never use it than need it and not have it.
> In the meantime, I think a sensible place to start would be to figure out some system that makes sensible estimates of how soon we need to address bloat, XID wraparound, and MXID wraparound for each table, and some system that estimates how long each one will take to vacuum.

I think that it's going to be hard to model how long index vacuuming will take accurately. And harder still to model which indexes will adversely impact the user in some way if we delay vacuuming some more.

Might be more useful to start off by addressing how to spread out the burden of vacuuming over time. The needs of queries matter, but controlling costs matters too. One of the most effective techniques is to manually VACUUM when the system is naturally idle, like at night time. If that could be quasi-automated, or if the criteria used by autovacuum scheduling gave just a little weight to how busy the system is right now, then we would have more slack when the system becomes very busy.

--
Peter Geoghegan
On Thu, Jan 20, 2022 at 6:54 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I agree that it doesn't follow that table A should be more of a priority than table B, either because it has a greater age, or because its age happens to exceed some actually-arbitrary threshold. But I will point out that my ongoing work on freezing does make something along these lines much more plausible. As I said over on that thread, there is now a kind of "natural variation" among tables, in terms of relfrozenxid, as a result of tracking the actual oldest XID, and using that (plus the emphasis on advancing relfrozenxid wherever possible). And so we'll have a much better idea of what's going on with each table -- it's typically a precise XID value from the table, from the recent past.

I agree.

> Since we now have the failsafe, the scheduling algorithm can afford to not give too much special attention to table age until we're maybe over the 1 billion age mark -- or even 1.5 billion+. But once the scheduling stuff starts to give table age special attention, it should probably become the dominant consideration, by far, completely drowning out any signals about bloat. It's kinda never really supposed to get that high, so when we do end up there it is reasonable to fully freak out. Unlike the bloat criteria, the wraparound safety criteria don't seem to have much recognizable space between not worrying at all, and freaking out.

I do not agree with all of this. First, on general principle, I think sharp edges are bad. If a table had priority 0 for autovacuum 10 minutes ago, it can't now have priority one million bazillion. If you're saying that the priority of wraparound needs to, in the limit, become higher than any bloat-based priority, that is reasonable. Bloat never causes a hard stop in the way that wraparound does, even if the practical effects are not much different. However, if you're saying that the priority should shoot up to the maximum all at once, I don't agree with that at all.

Second, I think it is good and appropriate to leave a lot of slop in the mechanism. As you point out later, we don't really know whether any of our estimates for how long things will take are accurate, and therefore we don't know whether the time we've budgeted will be sufficient. We need to leave lots of slop so that even if we turn out to be quite wrong, we don't hit a wall.

Also, it's worth keeping in mind that waiting longer to freak out is not necessarily an advantage. It may well be that the only way the problem will ever get resolved is by human intervention - going in and fixing whatever dumb thing somebody did - e.g. resolving the pending prepared transaction. In that sense, we might be best off freaking out after a relatively small number of transactions, because that might get some human being's attention. In a very real sense, if old prepared transactions shut down the system after 100 million transactions, users would probably be better off on average, because the problems would get fixed before so much damage is done. I'm not seriously proposing that as a design, but I think it's a mistake to think that pushing off the day of reckoning is necessarily better.

All that being said, I do agree that trying to keep the table age below 300 million is too conservative.
I think we need to be conservative now because we don't take the time that the table will take to vacuum into account, and I think if we start thinking about it as a target to finish vacuuming rather than a target to start vacuuming, it can go significantly higher. But I would be disinclined to go to, say, 1.5 billion. If the user hasn't taken any action when we hit the 1 billion transaction mark, or really probably a lot sooner, they're unlikely to wake up any time soon. I don't think there are many systems out there where vacuum ages >1b are the result of the system trying frantically to keep up and not having enough juice. There are probably some, but most such cases are the result of misconfiguration, user error, software failure, etc.

> There is a related problem that you didn't mention: autovacuum_max_workers controls how many autovacuum workers can run at once, but there is no particular concern for whether or not running that many workers actually makes sense, in any given scenario. As a general rule, the system should probably be *capable* of running a large number of autovacuums at the same time, but never actually do that (because it just doesn't ever prove necessary). Better to have the option and never use it than need it and not have it.

I agree. And related to that, the more workers we have, the slower each one goes, which I think is often counterintuitive for people, and also often counterproductive. I'm sure there are cases where table A is really big and needs to be vacuumed but not terribly urgently, and table B is really small but needs to be vacuumed right now, and I/O bandwidth is really tight. In that case, slowing down the vacuum on table A so that the vacuum on table B can do its thing is the right call. But what I think is more common is that we get more workers because the first one is not getting the job done. And if they all get slower then we're still not getting the job done, but at greater expense.

> > In the meantime, I think a sensible place to start would be to figure out some system that makes sensible estimates of how soon we need to address bloat, XID wraparound, and MXID wraparound for each table, and some system that estimates how long each one will take to vacuum.
>
> I think that it's going to be hard to model how long index vacuuming will take accurately. And harder still to model which indexes will adversely impact the user in some way if we delay vacuuming some more.

Those are fair concerns. I assumed that if we knew the number of pages in the index, which we do, it wouldn't be too hard to make an estimate like this ... but you know more about this than I do, so tell me why you think that won't work. It's perhaps worth noting that even a somewhat poor estimate could be a big improvement over what we have now.

> Might be more useful to start off by addressing how to spread out the burden of vacuuming over time. The needs of queries matter, but controlling costs matters too.
>
> One of the most effective techniques is to manually VACUUM when the system is naturally idle, like at night time. If that could be quasi-automated, or if the criteria used by autovacuum scheduling gave just a little weight to how busy the system is right now, then we would have more slack when the system becomes very busy.

I have thought about this approach but I'm not very hopeful about it as a development direction. One problem is that we don't necessarily know when the quiet times are, and another is that there might not even be any quiet times. Still, neither of those problems by itself would discourage me from attempting something in this area. The thing that does discourage me is: if you have a quiet period, you can take advantage of that to do vacuuming without any code changes at all. You can just crontab a vacuum that runs with a reduced setting for vacuum_freeze_table_age and vacuum_freeze_min_age during your nightly quiet period and call it good.
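For instance, something along these lines works today (a sketch only: the database name and schedule are placeholders, and setting both GUCs to zero is the most aggressive possible choice):

    # crontab entry: eager freezing vacuum during a nightly quiet period.
    # PGOPTIONS applies the settings to the session vacuumdb opens.
    0 3 * * * PGOPTIONS='-c vacuum_freeze_table_age=0 -c vacuum_freeze_min_age=0' vacuumdb --quiet mydb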
The problem that I'm principally concerned about here is the case where somebody had a system that was basically OK and then at some point, bad things started to happen. At some point they realize they're in trouble and try to get back on track. Very often, autovacuum is actually the enemy in that situation: it insists on consuming resources to vacuum the wrong stuff. Whatever we can do to avoid such disastrous situations is all to the good, but since we can't realistically expect to avoid them entirely, we need to improve the behavior in the cases where they do happen.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Jan 20, 2022 at 4:43 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > Since we now have the failsafe, the scheduling algorithm can afford to not give too much special attention to table age until we're maybe over the 1 billion age mark -- or even 1.5 billion+. But once the scheduling stuff starts to give table age special attention, it should probably become the dominant consideration, by far, completely drowning out any signals about bloat. It's kinda never really supposed to get that high, so when we do end up there it is reasonable to fully freak out. Unlike the bloat criteria, the wraparound safety criteria don't seem to have much recognizable space between not worrying at all, and freaking out.
>
> I do not agree with all of this. First, on general principle, I think sharp edges are bad. If a table had priority 0 for autovacuum 10 minutes ago, it can't now have priority one million bazillion. If you're saying that the priority of wraparound needs to, in the limit, become higher than any bloat-based priority, that is reasonable.

I'm definitely saying considerations about wraparound need to swamp everything else out at the limit. But I'm also making the point that (at least with the ongoing relfrozenxid/freezing work) the system does remarkably well at avoiding all aggressive anti-wraparound VACUUMs in most individual tables, with most workloads. And so having an aggressive anti-wraparound VACUUM at all now becomes a pretty strong signal.

As we discussed on the other thread recently, you're still only going to get anti-wraparound VACUUMs in a minority of tables with the patches in place -- for tables that won't ever get an autovacuum for any other reason. And so having an anti-wraparound VACUUM probably just signals that we have such a table, which is totally inconsequential. But what about when there is an anti-wraparound VACUUM (or a need for one) on a table whose age is already (say) 2x the value of autovacuum_freeze_max_age? That really is an incredibly strong signal that something is very much amiss.

Since the relfrozenxid/freezing patch series actually makes each VACUUM able to advance relfrozenxid in a way that's really robust when the system is not under great pressure, the failure of that strategy becomes a really strong signal. So it's not that table age signals something that we can generalize about too much, without context. The context is important. The relationship between table age and autovacuum_freeze_max_age with the new strategy from my patch series becomes an important negative signal, about something that we reasonably expected to be quite stable not actually being stable. (Sorry to keep going on about my work, but it really seems relevant.)

> Also, it's worth keeping in mind that waiting longer to freak out is not necessarily an advantage. It may well be that the only way the problem will ever get resolved is by human intervention - going in and fixing whatever dumb thing somebody did - e.g. resolving the pending prepared transaction.

In that case we ought to try to alert the user earlier.

> Those are fair concerns. I assumed that if we knew the number of pages in the index, which we do, it wouldn't be too hard to make an estimate like this ... but you know more about this than I do, so tell me why you think that won't work. It's perhaps worth noting that even a somewhat poor estimate could be a big improvement over what we have now.
I can construct a plausible, totally realistic counter-example that breaks a heuristic like that, unless it focuses on extremes only, like no index growth at all since the last VACUUM (which didn't leave behind any deleted pages). I think that such a model can work well, but only if it's designed to matter less and less as our uncertainty grows. It seems as if the uncertainty grows very sharply, once you begin to generalize past the extremes. We have to be totally prepared for the model to be wrong, except perhaps as a way of prioritizing things when there is real urgency, and we don't have a choice about choosing. All models are wrong, some are useful.

> The problem that I'm principally concerned about here is the case where somebody had a system that was basically OK and then at some point, bad things started to happen.

It seems necessary to distinguish between the case where things really were okay for a time, and the case where they merely appeared to be okay to somebody whose understanding of the system isn't impossibly deep and sophisticated. You'd have to be an all-knowing oracle to be able to tell the difference, because the system itself has no sophisticated notion of how far it is into debt. There are things that we can do to address this gap directly (that's what I have been doing myself), but that can only go so far.

ISTM that the higher the amount of debt that the system is actually in, the greater the uncertainty about the total amount of debt. In other words, the advantage of paying down debt isn't limited to the obvious stuff; there is also the advantage of gaining confidence about how far into debt the system really is. The longer it's been since the last real VACUUM, the more your model of debt/bloat is likely to have diverged from reality. And that's why I bring costs into it. Vacuuming at night because you know that the cost will be relatively low, even if the benefits might not be quite as high as you'd usually expect, makes sense on its own terms, and also has the advantage of making the overall picture clearer to the system/your model.

> At some point they realize they're in trouble and try to get back on track. Very often, autovacuum is actually the enemy in that situation: it insists on consuming resources to vacuum the wrong stuff.

To some degree this is because the statistics that autovacuum has access to are flat out wrong, even though we could do better. For example, the issue that I highlighted a while back about ANALYZE's dead tuples accounting. Or the issue that I pointed out on this thread already, about relfrozenxid being a very bad indicator of what's actually going on with XIDs in the table (at least without my relfrozenxid patches in place).

Another idea centered on costs: with my freezing/relfrozenxid patch series, strict append-only tables like pgbench_history will only ever need to have VACUUM process each heap page once. That's good, but it could be even better if we didn't have to rely on the autovacuum scheduling and autovacuum_vacuum_insert_scale_factor to drive everything. This is technically a special case, but it's a rather important one -- it's both very common and not that hard to do a lot better on. We ought to be aiming to only dirty each page exactly once, by *dynamically* deciding to VACUUM much more often than the current model supposes makes sense. I think that this would require a two-way dialog between autovacuum.c and vacuumlazy.c. At a high level, vacuumlazy.c would report back "turns out that that table looks very much like an append-only table". That feedback would cause the autovacuum.c scheduling to eagerly launch another autovacuum worker, ignoring the usual criteria -- just wait (say) another 60 seconds, and then launch a new autovacuum worker on the same table if it became larger by some smallish fixed amount (stop caring about percentage table growth). Constant mini-vacuums against such a table make sense, since costs are almost exactly proportional to the number of heap pages appended since the last VACUUM.
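In sketch form, the autovacuum.c side of that dialog might look roughly like this (Python-ish, with placeholder names; the 60 seconds and the fixed page-growth trigger are the knobs I mean):

    import time
    from dataclasses import dataclass

    RECHECK_SECONDS = 60     # placeholder cadence
    GROWTH_PAGES = 1024      # fixed growth trigger, not a percentage

    @dataclass
    class Table:
        relpages: int
        looks_append_only: bool  # feedback reported by the last VACUUM

    def launch_vacuum(table: Table) -> None:
        # stand-in for starting an autovacuum worker on the table
        print(f"vacuuming table with {table.relpages} pages")

    def mini_vacuum_loop(table: Table) -> None:
        baseline = table.relpages
        while table.looks_append_only:
            time.sleep(RECHECK_SECONDS)
            if table.relpages - baseline >= GROWTH_PAGES:
                launch_vacuum(table)  # cheap: mostly the appended pages
                baseline = table.relpages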
--
Peter Geoghegan
On Fri, Jan 21, 2022 at 12:54 AM Robert Haas <robertmhaas@gmail.com> wrote:
> So at a high level, I think that what we ought to do is, first, for each table, estimate the time at which we think something bad will occur. There are several bad events: too much bloat, XID wraparound, MXID wraparound. We need to estimate the time at which we think each of those things will occur, and then take the earliest of those estimates. That's the time by which we need to have finished vacuuming the table. Then, make an estimate of how long it will take to complete a vacuum of the table, subtract that from the time at which we need to be done, and that's the time by which we need to start. The earliest need-to-start time is the highest priority table.

I think we need some more parameters to compare bloat vs. wraparound. In one of your examples in the 2nd paragraph, we can say that the need-to-start of table A is earlier than table B's, so that case is simple. But when it comes to wraparound vs. bloat, we need to add some weighting to compute how much bloat is considered as bad as wraparound. I think the amount of bloat can not be an absolute number; it should be relative, w.r.t. the total database size or so. I don't think it can be computed w.r.t. the table size, because if the table is, e.g., just 1 GB in size and it is 5 times bloated, then it is not as bad as another 1 TB table which is just 2 times bloated.
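To show what I mean, something like this (just a sketch of the idea; all names are made up):

    GB = 1024 ** 3

    def bloat_score(table_size_bytes, bloat_factor, database_size_bytes):
        # Score by absolute dead space relative to the whole database,
        # rather than by the per-table bloat factor alone.
        # table_size_bytes is the un-bloated (logical) size here.
        dead_bytes = table_size_bytes * (bloat_factor - 1)
        return dead_bytes / database_size_bytes

    db_size = 2048 * GB
    print(bloat_score(1 * GB, 5, db_size))     # ~0.002: 4 GB of dead space
    print(bloat_score(1024 * GB, 2, db_size))  # 0.5: far worse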
> A second problem is that, if the earliest need-to-start time is in the past, then we definitely are in trouble and had better get to work at once, but if it's in the future, that doesn't necessarily mean we're safe. If there are three tables with a need-to-finish time that is 12 hours in the future and each of them will take 11 hours to vacuum, then every need-to-start time computed according to the algorithm above is in the future, but in fact we're in a lot of trouble. If the estimates are accurate, we need 3 autovacuum workers to be available to start within 1 hour, or we're doomed. The very last thing we want to do is wait another hour before doing anything. It's not impossible to factor this into the calculation of need-to-start times, assuming we know how many workers we have. For instance, if we've got tables whose need-to-finish times are 30, 50, and 70 minutes in the future, we can see that if each one takes 20 minutes or less to vacuum, then the need-to-start times can just be computed by subtraction. But if the tables with 50- or 70-minute deadlines are going to take more than 20 minutes to vacuum, then we've got to back up the need-to-start times so that we finish each table in time to start on the next one. I haven't looked into what algorithms exist for this kind of scheduling problem, but I feel like a literature search, or pulling out my college textbooks, would probably turn up some useful ideas.

I think we should be thinking of dynamically adjusting priority as well. It is possible that when autovacuum started, we prioritized the tables based on some statistics and estimation, but the vacuuming process can take a long time, and during that time some priorities might change. So if, at the start of autovacuum, we push all the tables into some priority queue and simply vacuum in that order, then we might go wrong somewhere. I think we need to make different priority queues based on different factors, for example 1 queue for wraparound risk and another for bloat risk.

Even though there would be multiple queues, we would have the need_to_start time with each item, so that we know exactly from which queue to pick the next item; and whenever we pick an item, we can dynamically recheck the priority of the item at the head of each queue, always assuming that each queue is arranged in order of need_to_start time. So now what if an item further back in the queue becomes more important than the item at the queue head, based on some statistics? I don't think it is wise to recompute the need_to_start time for all the items before picking any new item. But I think we need to have multiple queues based on different factors (not only just wraparound and bloat) to reduce the risk of items at the back of the queue becoming higher priority than items at the front. I mean, this can not completely be avoided, but it can be reduced by creating multiple work queues based on more factors which can dynamically change.

> I know that this email is kind of a giant wall of text, so my thanks if you've read this far, and even more if you feel inspired to write back with your own thoughts.

Yeah, it is a long email, but quite interesting.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Mon, Jan 24, 2022 at 11:14 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I think we need some more parameters to compare bloat vs. wraparound. In one of your examples in the 2nd paragraph, we can say that the need-to-start of table A is earlier than table B's, so that case is simple. But when it comes to wraparound vs. bloat, we need to add some weighting to compute how much bloat is considered as bad as wraparound. I think the amount of bloat can not be an absolute number; it should be relative, w.r.t. the total database size or so. I don't think it can be computed w.r.t. the table size, because if the table is, e.g., just 1 GB in size and it is 5 times bloated, then it is not as bad as another 1 TB table which is just 2 times bloated.

Thanks for writing back. I don't think that I believe the last part of this argument, because it seems to suppose that the big problem with bloat is that it might use up disk space, whereas in my experience the big problem with bloat is that it slows down access to your data. Yet the dead space in some other table will not have much impact on the speed of access to the current table. In fact, if most accesses to the table are index scans, even dead space in the current table may not have much effect, but sequential scans are bound to notice. It's true that, on a cluster-wide basis, every dead page is one more page that can potentially take up space in cache, so in that sense the performance consequences are global to the whole cluster. However, that effect is more indirect and takes a long time to become a big problem. The direct effect of having to read more pages to execute the same query plan causes problems a lot sooner.

But your broader point that we need to consider how much bloat represents a problem is a really good one. In the past, one rule that I've thought about is: if we're vacuuming a table and we're not going to finish before it needs to be vacuumed again, then we should vacuum faster (i.e. in effect, increase the cost limit on the fly). That might still not result in good behavior, but it would at least result in behavior that is less bad. However, it doesn't really answer the question of how we decide when to start the very first VACUUM. I don't really know the answer to that question. The current heuristics result in estimates of acceptable bloat that are too high in some cases and too low in others. I've seen tables that got bloated vastly beyond what autovacuum is configured to tolerate before they caused any real difficulty, and I know there are other cases where users start to suffer long before those thresholds are reached. At the moment, the best idea I have is to use something like the current algorithm, but treat it as a deadline (keep bloat below this amount) rather than an initiation criterion (start when you reach this amount). But I think that idea is a bit weak; maybe there's something better out there.
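In sketch form, the vacuum-faster rule might look like this (invented names; the two rate estimates are, as usual, the hard part):

    def adjust_cost_limit(cost_limit, pages_remaining, pages_per_second,
                          seconds_until_next_vacuum_due):
        # If, at the current pace, this vacuum won't finish before the
        # table is due to be vacuumed again, raise the cost limit
        # proportionally so that it can.
        seconds_to_finish = pages_remaining / pages_per_second
        if seconds_to_finish > seconds_until_next_vacuum_due:
            return int(cost_limit * seconds_to_finish
                       / seconds_until_next_vacuum_due)
        return cost_limit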
> I think we should be thinking of dynamically adjusting priority as well. It is possible that when autovacuum started, we prioritized the tables based on some statistics and estimation, but the vacuuming process can take a long time, and during that time some priorities might change. So if, at the start of autovacuum, we push all the tables into some priority queue and simply vacuum in that order, then we might go wrong somewhere.

Yep. I think we should reassess what to do next after each table, possibly making some exception for really small tables - e.g. if we last recomputed priorities less than 1 minute ago, don't do it again.

> I think we need to make different priority queues based on different factors, for example 1 queue for wraparound risk and another for bloat risk.

I don't see why we want multiple queues. We have to answer the question "what should we do next?" which requires us, in some way, to funnel everything into a single prioritization.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Jan 25, 2022 at 11:30 AM Robert Haas <robertmhaas@gmail.com> wrote:
> But your broader point that we need to consider how much bloat represents a problem is a really good one. In the past, one rule that I've thought about is: if we're vacuuming a table and we're not going to finish before it needs to be vacuumed again, then we should vacuum faster (i.e. in effect, increase the cost limit on the fly).

That seems reasonable, but I doubt that that's a huge issue in practice, now that the default cost limits are more sensible.

> That might still not result in good behavior, but it would at least result in behavior that is less bad. However, it doesn't really answer the question of how we decide when to start the very first VACUUM. I don't really know the answer to that question. The current heuristics result in estimates of acceptable bloat that are too high in some cases and too low in others. I've seen tables that got bloated vastly beyond what autovacuum is configured to tolerate before they caused any real difficulty, and I know there are other cases where users start to suffer long before those thresholds are reached.

ISTM that the easiest thing that could be done to improve this is to give some consideration to page-level characteristics. For example, a page that has 5 dead heap-only tuples is vastly different to a similar page that has 5 LP_DEAD items instead -- and yet our current approach makes no distinction. Chances are very high that if the only dead tuples are heap-only tuples, then things are going just fine on that page -- opportunistic pruning is actually keeping up.

Page-level stability over time seems to be the thing that matters most -- we must make sure that the same "logical rows" that were inserted around the same time remain on the same block for as long as possible, without mixing in other unrelated tuples needlessly. In other words, preserve natural locality.

This is related to the direction of things, and the certain knowledge that VACUUM alone can deal with line pointer bloat. The current state of individual pages hints at the direction of things even without tracking how things change directly. But tracking the change over time in ANALYZE seems better still: if successive ANALYZE operations notice a consistent pattern where pages that had a non-zero number of LP_DEAD items last time now have a significantly higher number, then it's a good idea to err in the direction of more aggressive vacuuming. *Growing* concentrations of LP_DEAD items signal chaos. I think that placing a particular emphasis on pages with non-zero LP_DEAD items as a qualitatively distinct category of page might well make sense -- relatively few blocks with a growing number of LP_DEAD items seems like it should be enough to make autovacuum run aggressively.
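The shape of the signal I have in mind is roughly this (a sketch only; the samples would have to come from something ANALYZE-like, and the threshold is invented):

    def growing_lp_dead_blocks(prev_sample, curr_sample):
        # Each sample maps block number -> count of LP_DEAD items, for
        # sampled blocks with a non-zero count. Blocks whose counts grow
        # between samples are the ones where pruning isn't keeping up.
        return sum(1 for blk, n in curr_sample.items()
                   if n > prev_sample.get(blk, 0) > 0)

    def should_vacuum_aggressively(prev_sample, curr_sample, threshold=16):
        # Relatively few such blocks should be enough to trigger.
        return growing_lp_dead_blocks(prev_sample, curr_sample) >= threshold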
As I pointed out not long ago, ANALYZE does a terrible job of accurately counting dead tuples/LP_DEAD items when they aren't uniformly distributed in the table -- which is often a hugely important factor, with a table that is append-mostly with updates and deletes. That's why I suggested bringing the visibility map into it. In general I think that the statistics that drive autovacuum are currently often quite wrong, even on their own simplistic, quantitative terms.

> I don't see why we want multiple queues. We have to answer the question "what should we do next?" which requires us, in some way, to funnel everything into a single prioritization.

Even busy production DBs should usually only be vacuuming one large table at a time. Also might make sense to strategically align the work with the beginning of a new checkpoint.

--
Peter Geoghegan
On Tue, Jan 25, 2022 at 2:30 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Jan 24, 2022 at 11:14 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I think we need to make different priority queues based on different factors, for example 1 queue for wraparound risk and another for bloat risk.
>
> I don't see why we want multiple queues. We have to answer the question "what should we do next?" which requires us, in some way, to funnel everything into a single prioritization.

I was thinking along the same lines as Dilip: If the anti-wraparound risk is really far in the future, there might not be much eligible freezing work to do. Dead tuples can be removed as soon as visibility rules allow it. With a separate bloat queue, there might always be some work to do. Maybe "bloat queue" is too specific, because insert-only tables can use more vacuuming for the VM even if they have not reached the configured threshold.

So a worker would check the wraparound queue, and if nothing's there, grab something from the other queue. Maybe low-priority work would have a low cost limit.

Probably the best way to schedule, at least at first, is whatever is least complex. I'm not yet sure what that is...

--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, Jan 25, 2022 at 3:32 PM Peter Geoghegan <pg@bowt.ie> wrote:
> For example, a page that has 5 dead heap-only tuples is vastly different to a similar page that has 5 LP_DEAD items instead -- and yet our current approach makes no distinction. Chances are very high that if the only dead tuples are heap-only tuples, then things are going just fine on that page -- opportunistic pruning is actually keeping up.

Hmm, I think that's a really good insight. Perhaps we ought to forget about counting updates and deletes and instead count dead line pointers. Or maybe we can estimate the number of dead line pointers by knowing how many updates and deletes there were, as long as we can distinguish HOT updates from non-HOT updates, which I think we can.

> if successive ANALYZE operations notice a consistent pattern where pages that had a non-zero number of LP_DEAD items last time now have a significantly higher number, then it's a good idea to err in the direction of more aggressive vacuuming. *Growing* concentrations of LP_DEAD items signal chaos. I think that placing a particular emphasis on pages with non-zero LP_DEAD items as a qualitatively distinct category of page might well make sense -- relatively few blocks with a growing number of LP_DEAD items seems like it should be enough to make autovacuum run aggressively.

I think measuring the change over time here might be fraught with peril. If vacuum makes a single pass over the indexes, it can retire as many dead line pointers as we have, or as will fit in memory, and the effort doesn't really depend too much on exactly how many dead line pointers we're trying to find. (I hear that it does depend more than you'd think ... but I still don't think that should be the dominant consideration here.) So to be efficient, we want to do that pass over the indexes when we have a suitably large batch of dead line pointers. I don't think it really depends on how long it took the batch to get to that size. I don't want to vacuum a terabyte of indexes with a much-smaller-than-normal batch of dead TIDs just because the number of dead TIDs seems to be increasing quickly at the moment: it's hard to imagine that the results will be worth the resources I'll have to expend to get there. On the other hand I also don't think I want to postpone vacuuming the indexes because the number is really big but not growing that fast.

I feel like my threshold for the number of dead TIDs that ought to trigger a vacuum grows as the table gets bigger, capped by how much memory I've got. But I don't feel like the rate at which it's changing necessarily matters. Like if I create a million dead line pointers really quickly, wait a month, and then create another million dead line pointers, I feel like I want the system to respond just as aggressively as if the month-long delay were omitted. Maybe my feelings are wrong here. I'm just saying that, to me, it doesn't feel like the rate of change is all that relevant.
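Something like this captures that feeling (a sketch: the 2% is pulled out of the air, and 6 bytes reflects the size of one entry in vacuum's dead-TID array today):

    def dead_tid_vacuum_threshold(table_pages, tuples_per_page,
                                  autovacuum_work_mem_bytes,
                                  fraction=0.02):
        # The trigger threshold grows with table size, but is capped by
        # how many dead TIDs fit in memory for a single index pass.
        size_based = int(table_pages * tuples_per_page * fraction)
        memory_cap = autovacuum_work_mem_bytes // 6  # per ItemPointerData
        return min(size_based, memory_cap)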
> Even busy production DBs should usually only be vacuuming one large table at a time. Also might make sense to strategically align the work with the beginning of a new checkpoint.

I'm not sure that either of those statements is correct. But on the other hand, I am also not sure that either of those statements is incorrect.

--
Robert Haas
EDB: http://www.enterprisedb.com

On Tue, Jan 25, 2022 at 3:34 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> I was thinking along the same lines as Dilip: If the anti-wraparound risk is really far in the future, there might not be much eligible freezing work to do. Dead tuples can be removed as soon as visibility rules allow it. With a separate bloat queue, there might always be some work to do.

Isn't the same thing true of bloat, though? If the XID threshold hasn't advanced that much, then there may be nothing that's worth doing about XID wraparound in the short term. If there aren't many dead tuples in any of the tables, then there may be nothing that's worth doing about bloat. Then we should just do nothing. On the other hand, we may have a relatively urgent problem in one of those areas but not the other. Then we should work on that one. Or we may have problems in both areas, and then we need to somehow decide which one is more urgent -- that's the situation in which I feel like we need to unify the prioritization or ordering in some way.

It is an interesting point that we could have low-priority work with a low cost limit and high-priority work with a higher cost limit, or, as I think Peter suggested, just use a single process for low-priority stuff but allow multiple processes when there's high-priority stuff.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Jan 26, 2022 at 10:55 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jan 25, 2022 at 3:32 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > For example, a page that has 5 dead heap-only tuples is vastly different to a similar page that has 5 LP_DEAD items instead -- and yet our current approach makes no distinction. Chances are very high that if the only dead tuples are heap-only tuples, then things are going just fine on that page -- opportunistic pruning is actually keeping up.
>
> Hmm, I think that's a really good insight. Perhaps we ought to forget about counting updates and deletes and instead count dead line pointers. Or maybe we can estimate the number of dead line pointers by knowing how many updates and deletes there were, as long as we can distinguish HOT updates from non-HOT updates, which I think we can.

All that we have to go on is a bunch of observations in any case, though -- the map is not the territory. And so it seems to me that the sensible thing to do is just embrace that we won't ever really exactly know what's going on in a given database, at any given time. Fortunately, we don't really have to know. We should be able to get away with only having roughly the right idea, by focussing on the few things that we are sure of -- things like the difference between LP_DEAD items and dead heap-only tuples, which are invariant to workload characteristics.

I recently said (on the ANALYZE related thread) that we should be thinking probabilistically here [1]. Our understanding of the amount of bloat could very usefully be framed that way. Maybe the model we use is a probability density function (maybe not formally, not sure). A PDF has an exact expectation, which for us might be the most probable number of dead tuples in total in a given table right now (let's just assume it's still dead tuples, ignoring the problems with that metric for now). This is a useful basis for making better decisions by weighing competing considerations -- which might themselves be another PDF.

Example: For a given table that is approaching the point where the model says "time to VACUUM", we may very well spend hours, days, or even weeks approaching the crossover point. The exact expectation isn't truly special here -- there is actually zero practical reason to have special reverence for that precise point (with a good model, within certain reasonable bounds). If our model says that there is only a noise-level difference between doing a VACUUM on a given table today, tomorrow, or next week, why not take advantage? For example, why not do the VACUUM when the system appears to not be busy at all (typically in the dead of night), just because it'll definitely be both cheaper in absolute terms (FPIs can be avoided by spreading things out over multiple checkpoints), and less disruptive?

There are many opportunities like that, I believe. It's hard for me to suppress the urge to blurt out 17 more ideas like that. What are the chances that you won't have at least a few real winners among all of the ideas that everybody will come up with, in the end?

> > if successive ANALYZE operations notice a consistent pattern where pages that had a non-zero number of LP_DEAD items last time now have a significantly higher number, then it's a good idea to err in the direction of more aggressive vacuuming. *Growing* concentrations of LP_DEAD items signal chaos.
> > I think that placing a particular emphasis on pages with non-zero LP_DEAD items as a qualitatively distinct category of page might well make sense -- relatively few blocks with a growing number of LP_DEAD items seems like it should be enough to make autovacuum run aggressively.
>
> I think measuring the change over time here might be fraught with peril.

I'd say that that depends on how you define the problem we're trying to solve. If you define the problem as coming up with a significantly improved statistical model that determines (say) how many dead tuples there are in the table right now, given a set of observations made by ANALYZE in the past, then yes, it's fraught with peril. But why would you define it that way? It seems far easier to improve things by putting model error and *actual* exposure to real known issues (e.g. line pointer bloat) front and center.

It doesn't necessarily matter if we're *usually* wrong with a good model -- and a bad model may consistently get the correct answer. And so the model that is the most accurate quantitatively is probably *not* the best available model, all things considered. Most of the time we shouldn't VACUUM right this second, and so a model that consists of "return false" is very frequently correct. But that doesn't mean it's a good model. You get the idea.

> If vacuum makes a single pass over the indexes, it can retire as many dead line pointers as we have, or as will fit in memory, and the effort doesn't really depend too much on exactly how many dead line pointers we're trying to find.

Line pointer bloat is something that displays hysteresis; once it happens (past some significant threshold) then there is no reversing the damage. This makes the behavior very non-linear. In other words, it makes it incredibly hard to model mathematically [2] -- once you cross a certain hard-to-define threshold, it's total chaos, even in a closed, well-specified system (i.e. a highly constrained workload), because you have all these feedback loops.

On top of all that, even with a perfect model we're still forced to make a go/no-go decision for the entire table, moment to moment. So even a mythical perfect model runs into the problem that it is simultaneously much too early and much too late at the level of the table. Which is even more reason to just focus on not going totally off the rails, in any particular direction. Note that this includes going off the rails by vacuuming in a way that's unsustainably aggressive -- sometimes you have to cut your losses at that level as well.

There is usually some bigger picture to consider when things do go wrong -- there is usually some much worse fate that must be avoided. Like with VACUUM's failsafe. Sure, controlling index bloat is extremely important. But it's also much less important than keeping the system online and responsive. That's another level up. (The level up *after that* is "at least we didn't lose data", or maybe something about limiting the amount of downtime, not going out of business, whatever.)

> I feel like my threshold for the number of dead TIDs that ought to trigger a vacuum grows as the table gets bigger, capped by how much memory I've got.
I thought of another PDF-related idea when I read this, without even trying: we could account for the discontinuity from multiple index scans in a single VACUUM operation (instead of just one) by erring in the direction of doing the VACUUM sooner rather than later, when the model says that doing so will make very little difference in terms of extra costs incurred (extra costs from vacuuming sooner rather than later, conservatively assuming that our concern about TIDs not fitting in memory is basically unfounded).

> But I don't feel like the rate at which it's changing necessarily matters. Like if I create a million dead line pointers really quickly, wait a month, and then create another million dead line pointers, I feel like I want the system to respond just as aggressively as if the month-long delay were omitted.
>
> Maybe my feelings are wrong here. I'm just saying that, to me, it doesn't feel like the rate of change is all that relevant.

It's not that they're wrong, exactly -- I wouldn't say that. It's more like this: you as a Postgres user actually care about a great many things, not just one thing. Some of these things might be somewhat in tension, from time to time. And so it seems wise to find a way to live with any tension that may crop up -- by acknowledging the tension, we get the chance to honor the preferences of the user to the greatest extent possible.

[1] https://postgr.es/m/CAH2-WzmvXXEKtEph7U360umZ5pN3d18RBfu=nyPg9neBLDUWdw@mail.gmail.com
[2] https://en.wikipedia.org/wiki/Hysteretic_model

--
Peter Geoghegan
On Thu, 20 Jan 2022 at 14:31, Robert Haas <robertmhaas@gmail.com> wrote:
>
> In my view, previous efforts in this area have been too simplistic.
>

One thing I've been wanting to do something about: I think autovacuum needs to be a little cleverer about when *not* to vacuum a table, because it won't do any good.

I've seen a lot of cases where autovacuum kicks off a vacuum of a table even though the globalxmin hasn't really advanced significantly over the oldest frozen xid. When it's a large table this really hurts, because it could be hours or days before it finishes, and at that point there's quite a bit of bloat. This isn't a common occurrence; it happens when the system is broken in some way. Either there's an idle-in-transaction session or something else keeping the global xmin held back. What it does though is make things *much* worse and *much* harder for a non-expert to hit on the right remediation. It's easy enough to tell them to look for these idle-in-transaction sessions or set timeouts. It's much harder to determine whether it's a good idea for them to go and kill the vacuum that's been running for days. And it's not a great thing for people to be getting in the habit of doing either.

I want to be able to stop telling people to kill vacuums kicked off by autovacuum. I feel like it's a bad thing for someone to ever have to do, and I know some fraction of the time I'm telling them to do it, it'll have been a terrible thing to have done (but we'll never know which times those were). Determining whether a running vacuum is actually doing any good is pretty hard, and on older versions probably impossible.

I was thinking of just putting a check in before kicking off a vacuum, and if the globalxmin is already a significant fraction of the distance back to the relfrozenxid, then log a warning instead. Basically it means "we can't keep the bloat below the threshold due to the idle transactions et al, not because there's insufficient i/o bandwidth".
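Roughly this kind of check (a sketch; XID comparisons reduced to plain integers, and the 90% cutoff is arbitrary):

    def vacuum_worth_starting(global_xmin, relfrozenxid, next_xid,
                              futility_fraction=0.90):
        # Compare how far back globalxmin is with how far back
        # relfrozenxid is. If globalxmin has regressed most of the way
        # to relfrozenxid, vacuuming can't remove much and can't
        # advance relfrozenxid much, so warn instead of vacuuming.
        xmin_age = next_xid - global_xmin
        frozen_age = next_xid - relfrozenxid
        return xmin_age < futility_fraction * frozen_age

    # if not vacuum_worth_starting(...): log a warning, skip the vacuum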
On Wed, Jan 26, 2022 at 3:46 PM Greg Stark <stark@mit.edu> wrote:
> One thing I've been wanting to do something about is I think
> autovacuum needs to be a little cleverer about when *not* to vacuum
> a table because it won't do any good.

There was a thread about this exact thing not too long ago:

https://postgr.es/m/CAH2-Wzmx6+PrfpmmFw8JZbxD+kkwhQWPOhE5RUBy6S4_Jwty=Q@mail.gmail.com

If everything goes according to plan, then Postgres 15 will have my
work on freezing and dynamically advancing relfrozenxid, meaning that
you'll be able to see (in autovacuum log output and in VACUUM VERBOSE
output) how much relfrozenxid has been advanced by, if at all. You'll
also directly see how far behind the VACUUM operation's OldestXmin
that is (and how far behind the OldestXmin is at the end of the
VACUUM operation).

It seems as if this offers you exactly what you need. You'll be able
to notice the inherent futility of an anti-wraparound VACUUM that
runs against a table whose relfrozenxid is already exactly equal to
the VACUUM's OldestXmin (say because of a leaked replication slot --
anything that makes vacuuming fundamentally unable to advance
relfrozenxid, really).

--
Peter Geoghegan
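For what it's worth, the leaked-replication-slot case Peter mentions
is already easy to check for from SQL; a query along these lines
(illustrative, not from the thread) shows any slot pinning an xmin
horizon:

    -- Slots that advertise an xmin (or catalog_xmin) prevent the
    -- global horizon from advancing, which can make vacuuming futile.
    SELECT slot_name, slot_type, active,
           age(xmin) AS xmin_age,
           age(catalog_xmin) AS catalog_xmin_age
      FROM pg_replication_slots
     WHERE xmin IS NOT NULL OR catalog_xmin IS NOT NULL;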
On Wed, 26 Jan 2022 at 18:46, Greg Stark <stark@mit.edu> wrote:
>
> On Thu, 20 Jan 2022 at 14:31, Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > In my view, previous efforts in this area have been too simplistic.
>
> One thing I've been wanting to do something about is I think
> autovacuum needs to be a little cleverer about when *not* to vacuum
> a table because it won't do any good.
>
> I've seen a lot of cases where autovacuum kicks off a vacuum of a
> table even though the globalxmin hasn't really advanced
> significantly over the oldest frozen xid. When it's a large table
> this really hurts because it could be hours or days before it
> finishes and at that point there's quite a bit of bloat.

Another case I would like to see autovacuum get clever about is when
there is a wide disparity in the size of tables. If you have a few
large tables and a few small tables, there could be enough bandwidth
for everyone, but you can still get in trouble if the workers are all
tied up vacuuming the large tables.

This is a case where autovacuum scheduling can create a problem where
there shouldn't be one. It often happens when you have a set of large
tables that were all loaded with data around the same time, alongside
busy, well-designed small tables receiving lots of updates. The small
tables can happily be getting vacuumed every 15-30 min, finishing
promptly, and maintaining a nice steady state -- until one day all
the large tables suddenly hit the freeze threshold, all your workers
are suddenly busy vacuuming huge tables that take hours or days to
vacuum, and your small tables bloat by orders of magnitude.

I was thinking of dividing the eligible tables up into ntiles based
on size and then making sure one worker was responsible for each
ntile. I'm not sure that would actually be quite right though.

--
greg
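Greg's ntile idea is easy to sketch in SQL, if only to illustrate the
bucketing; the choice of four tiers is arbitrary, and a real
implementation would presumably live in the autovacuum scheduler
rather than in a query:

    -- Bucket plain tables into four size tiers; a scheduler could
    -- then dedicate one worker per tier, so the largest tables can
    -- never starve the smallest ones of attention.
    SELECT oid::regclass AS table_name,
           pg_total_relation_size(oid) AS total_bytes,
           ntile(4) OVER (ORDER BY pg_total_relation_size(oid))
               AS size_tier
      FROM pg_class
     WHERE relkind = 'r'
     ORDER BY size_tier, total_bytes;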
On Wed, Jan 26, 2022 at 6:56 PM Greg Stark <stark@mit.edu> wrote:
>
> On Wed, 26 Jan 2022 at 18:46, Greg Stark <stark@mit.edu> wrote:
> >
> > On Thu, 20 Jan 2022 at 14:31, Robert Haas <robertmhaas@gmail.com> wrote:
> > >
> > > In my view, previous efforts in this area have been too
> > > simplistic.
> >
> > One thing I've been wanting to do something about is I think
> > autovacuum needs to be a little cleverer about when *not* to
> > vacuum a table because it won't do any good.
> >
> > I've seen a lot of cases where autovacuum kicks off a vacuum of a
> > table even though the globalxmin hasn't really advanced
> > significantly over the oldest frozen xid. When it's a large table
> > this really hurts because it could be hours or days before it
> > finishes and at that point there's quite a bit of bloat.
>
> Another case I would like to see autovacuum get clever about is when
> there is a wide disparity in the size of tables. If you have a few
> large tables and a few small tables there could be enough bandwidth
> for everyone but you can get in trouble if the workers are all tied
> up vacuuming the large tables.
>
> This is a case where autovacuum scheduling can create a problem
> where there shouldn't be one. It often happens when you have a set
> of large tables that were all loaded with data around the same time
> and you have your busy tables that are well designed small tables
> receiving lots of updates. They can happily be getting vacuumed
> every 15-30min and finishing promptly maintaining a nice steady
> state until one day all the large tables suddenly hit the freeze
> threshold and suddenly all your workers are busy vacuuming huge
> tables that take hours or days to vacuum and your small tables bloat
> by orders of magnitude.
>
> I was thinking of dividing the eligible tables up into ntiles based
> on size and then making sure one worker was responsible for each
> ntile. I'm not sure that would actually be quite right though.

I've been working off and on on some external vacuum-scheduling tools
for the past year or so, and one thing that seems to be an issue is a
lack of observability into the various cost delay/limit mechanisms --
like how much a given vacuum contributes towards the limit, or how
much it was delayed during a given run. One theory was that if we are
seeing a lot of slowdown due to cost limiting, we should more heavily
weight smaller tables in our priority list of tables to vacuum,
versus larger tables, which we expect to exacerbate the situation.

I've also thought it'd be nice for users to have an easy way to
guesstimate what percentage of a table is frozen (analogous to live
vs. dead tuples in pg_stat_all_tables), but this seems difficult to
maintain accurately. We had a similar thing with tracking the clock
time of vacuums; just keeping the duration of the last vacuum ended
up being insufficient for some cases, so we ended up tracking it
historically... we haven't quite yet designed a pg_stat_vacuums a la
pg_stat_statements, but it has crossed our minds.

Robert Treat
https://xzilla.net
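One way to make the "weight smaller tables more heavily" theory
concrete, using only statistics that exist today (the scoring formula
below is a guess for illustration, not something from the tools
Robert describes):

    -- Rank vacuum candidates by dead tuples per byte, so that under
    -- a tight I/O budget the cheap, high-payoff tables sort first.
    SELECT s.relid::regclass AS table_name,
           s.n_dead_tup,
           pg_total_relation_size(s.relid) AS total_bytes,
           s.n_dead_tup::float8
               / greatest(pg_total_relation_size(s.relid), 1)
               AS priority_score
      FROM pg_stat_all_tables s
     ORDER BY priority_score DESC;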
On Wed, Jan 26, 2022 at 6:46 PM Greg Stark <stark@mit.edu> wrote:
> One thing I've been wanting to do something about is I think
> autovacuum needs to be a little cleverer about when *not* to vacuum
> a table because it won't do any good.

I agree.

> I was thinking of just putting a check in before kicking off a
> vacuum and if the globalxmin is a significant fraction of the
> distance to the relfrozenxid then instead log a warning. Basically
> it means "we can't keep the bloat below the threshold due to the
> idle transactions et al, not because there's insufficient i/o
> bandwidth".

Unfortunately, XID distances don't tell us much, because the dead
tuples need not be uniformly distributed across the XID space. In
fact, it seems highly likely that they will be very non-uniformly
distributed, with a few transactions having created a lot of dead
tuples and most having created none. Therefore, it's pretty plausible
that even a vacuum that only permits relfrozenxid to advance by one
could solve every problem we have -- say, if nearly all of the dead
tuples were created by transactions already older than the held-back
globalxmin, and are thus removable anyway.

If we knew something about the distribution of dead XIDs in the
table, then we could make an intelligent judgement about whether
vacuuming would be useful. But otherwise I feel like we're just
guessing, so instead of really fixing the problem we'll just be
making it happen in a set of cases that's even harder to grasp.

--
Robert Haas
EDB: http://www.enterprisedb.com