Re: autovacuum prioritization - Mailing list pgsql-hackers

From: Dilip Kumar
Subject: Re: autovacuum prioritization
Msg-id: CAFiTN-tw9uo+pdKmkdotc_rKXYbL=gDz7Aw00_m2+iSE4Dt10A@mail.gmail.com
In response to: autovacuum prioritization (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On Fri, Jan 21, 2022 at 12:54 AM Robert Haas <robertmhaas@gmail.com> wrote:

> So at a high level, I think that what we ought to do is, first, for
> each table, estimate the time at which we think something bad will
> occur. There are several bad events: too much bloat, XID wraparound,
> MXID wraparound. We need to estimate the time at which we think each
> of those things will occur, and then take the earliest of those
> estimates. That's the time by which we need to have finished vacuuming
> the table. Then, make an estimate of how long it will take to complete
> a vacuum of the table, subtract that from the time at which we need to
> be done, and that's the time by which we need to start. The earliest
> need-to-start time is the highest priority table.
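
Just to restate that in code form, here is a minimal sketch of the
need-to-start computation in Python (the table names, deadlines, and
duration estimates below are all made up; the real thing would derive
them from statistics):

# Hypothetical inputs: for each table, the times (hours from now) at
# which bloat, XID wraparound, and MXID wraparound become "bad events",
# plus an estimate of how long a vacuum of that table would take.
tables = {
    "orders":   {"bad_events": [30.0, 100.0, 120.0], "vacuum_hours": 11.0},
    "accounts": {"bad_events": [50.0,  90.0, 200.0], "vacuum_hours":  4.0},
}

def need_to_start(info):
    # Need-to-finish is the earliest bad event; back off by the
    # estimated vacuum duration to get the need-to-start time.
    return min(info["bad_events"]) - info["vacuum_hours"]

# The table with the earliest need-to-start time is the highest priority.
for name in sorted(tables, key=lambda t: need_to_start(tables[t])):
    print(name, need_to_start(tables[name]))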


I think we need some additional parameters to compare bloat against
wraparound.  In one of your examples in the second paragraph, the
need-to-start time of table A is earlier than that of table B, so that
case is simple.  But when it comes to wraparound vs. bloat, we need
some weighting that says how much bloat is considered as bad as
wraparound.  I also think the amount of bloat cannot be an absolute
number; it should be relative to the total database size or something
similar.  I don't think it can be computed relative to the table size,
because a 1 GB table that is 5 times bloated is not as bad as a 1 TB
table that is only 2 times bloated.
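
To make the weighting idea concrete, here is a rough sketch that
scores bloat as wasted space relative to the whole database (the
numbers and the scoring function are just for illustration):

# Rough sketch: score bloat as wasted bytes relative to the whole
# database, rather than as a per-table multiple.  Numbers are made up.
def bloat_urgency(table_bytes, bloat_factor, db_bytes):
    wasted = table_bytes * (bloat_factor - 1.0)
    return wasted / db_bytes          # fraction of the database that is bloat

DB = 2 * 1024**4                      # a 2 TB database
small = bloat_urgency(1 * 1024**3, 5.0, DB)   # 1 GB table, 5x bloated
large = bloat_urgency(1 * 1024**4, 2.0, DB)   # 1 TB table, 2x bloated
print(small < large)                  # True: the 1 TB table is the bigger problem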

>
> A second problem is that, if the earliest need-to-start time is in the
> past, then we definitely are in trouble and had better get to work at
> once, but if it's in the future, that doesn't necessarily mean we're
> safe. If there are three tables with a need-to-finish time that is 12
> hours in the future and each of them will take 11 hours to vacuum,
> then every need-to-start time computed according to the algorithm
> above is in the future, but in fact we're in a lot of trouble. If the
> estimates are accurate, we need 3 autovacuum workers to be available
> to start within 1 hour, or we're doomed. The very last thing we want
> to do is wait another hour before doing anything. It's not impossible
> to factor this into the calculation of need-to-start times, assuming
> we know how many workers we have. For instance, if we've got tables
> whose need-to-finish times are 30, 50, and 70 minutes in the future,
> we can see that if each one takes 20 minutes or less to vacuum, then
> the need-to-start times can just be computed by subtraction. But the
> tables with 50 or 70 minute deadlines are going to take more than 20
> minutes to vacuum, then we've got to back up the need-to-start times
> so that we finish each table in time to start on the next one. I
> haven't looked into what algorithms exist for this kind of scheduling
> problem, but I feel like a literature search, or pulling out my
> college textbooks, would probably turn up some useful ideas.
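
FWIW, the "back up the need-to-start times" step looks like classic
deadline scheduling.  Here is a small sketch of a backward pass over
the deadlines with a fixed pool of workers; it is only an illustration
of the idea, not a proposal for the actual algorithm:

import heapq

def latest_start_times(jobs, workers, now=0.0):
    """jobs: list of (table, deadline, duration); all times in minutes.
    Returns {table: latest_start}; a start earlier than `now` means we
    are already in trouble."""
    starts = {}
    # Max-heap (negated) of the latest moment each worker may still be busy.
    busy_until = [float("-inf")] * workers     # -(-inf) == free forever
    heapq.heapify(busy_until)
    for table, deadline, duration in sorted(jobs, key=lambda j: -j[1]):
        avail = -heapq.heappop(busy_until)     # worker that is free the latest
        start = min(deadline, avail) - duration
        starts[table] = start
        heapq.heappush(busy_until, -start)     # this worker must be idle by `start`
    return starts

# Deadlines 30/50/70 minutes, each vacuum takes 22 minutes, one worker:
# the 50- and 70-minute tables force the earlier start times backwards.
print(latest_start_times([("a", 30, 22), ("b", 50, 22), ("c", 70, 22)], workers=1))
# {'c': 48, 'b': 26, 'a': 4} -- with workers=3 it is plain subtraction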


I think we should also be thinking about adjusting priorities
dynamically.  When autovacuum starts, we prioritize the tables based
on the statistics and estimates available at that moment, but
vacuuming can take a long time, and some priorities may change while
it runs.  So if, at the start, we push every table into a single
priority queue and simply vacuum in that order, we might go wrong
somewhere.  I think we need separate priority queues based on
different factors, for example one queue for wraparound risk and
another for bloat risk.  Even with multiple queues, each item would
carry its need_to_start time, so we know exactly which queue to pick
the next item from, and whenever we pick an item we can recheck the
priority of the item at the head of that queue, while still assuming
each queue is ordered by need_to_start time.  But what if an item
farther back in the queue has become more important than the item at
the head?  I don't think it is wise to recompute the need_to_start
time for every item before each pick.  Instead, having multiple queues
based on more factors (not only wraparound and bloat) reduces the risk
of items at the back of a queue becoming higher priority than items at
the front.  That risk cannot be avoided completely, but it can be
reduced by creating more work queues based on the factors that can
change dynamically.
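
Something like the following is what I have in mind for the per-queue
recheck: each queue is a heap ordered by the need_to_start value
cached at enqueue time, and before vacuuming a head item we recompute
its priority with fresh statistics (recompute() below is a stand-in
for that estimator; table names are made up):

import heapq

def pop_next(queues, recompute):
    """queues: dict of queue name -> heap of (cached_need_to_start, table).
    recompute(table) returns a fresh need_to_start estimate."""
    while True:
        heads = [(q[0][0], name) for name, q in queues.items() if q]
        if not heads:
            return None
        _, qname = min(heads)                  # queue with the earliest cached head
        cached, table = queues[qname][0]
        fresh = recompute(table)
        if fresh <= cached:
            heapq.heappop(queues[qname])       # still as urgent as we thought
            return table
        # The head's priority decayed: re-insert with the fresh estimate
        # and look again.  Items deeper in the queue are only corrected
        # when they surface, which is why more, narrower queues help.
        heapq.heapreplace(queues[qname], (fresh, table))

wraparound = [(10.0, "pgbench_accounts"), (40.0, "orders")]
bloat = [(25.0, "event_log")]
heapq.heapify(wraparound); heapq.heapify(bloat)
print(pop_next({"wraparound": wraparound, "bloat": bloat},
               recompute=lambda table: 12.0))  # -> pgbench_accounts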

>
> I know that this email is kind of a giant wall of text, so my thanks
> if you've read this far, and even more if you feel inspired to write
> back with your own thoughts.


Yeah, it is a long email, but quite an interesting one.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


