Re: decoupling table and index vacuum - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: decoupling table and index vacuum |
Date | |
Msg-id | CA+TgmoZU6xAZ8hypQk9YJZt4WQFuouyBfSOuL+J7R4WkWGsQog@mail.gmail.com Whole thread Raw |
In response to | Re: decoupling table and index vacuum (Dilip Kumar <dilipbalaut@gmail.com>) |
List | pgsql-hackers |
On Thu, Apr 22, 2021 at 7:51 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > How do we decide this target, I mean at a given point how do we decide > that what is the limit of dead TID's at which we want to trigger the > index vacuuming? It's tricky. Essentially, it's a cost-benefit analysis. On the "cost" side, the expense associated with an index vacuum is basically the number of pages that we're going to visit, and the number of those that we're going to dirty. We can know the former with certainty but can only estimate the latter. On the "benefit" side, setting dead TIDs unused helps us in two ways. First, it lets us mark heap pages all-visible, which makes index-only scans work better and reduces the cost of future vacuuming. These benefits are mitigated by future DML unsetting those bits again; there's no point in marking a page all-visible if it's about to be modified again. Second, it avoids line pointer bloat. Dead line pointers still take up space on the page, and potentially force the line pointer array to be extended, which can eventually cause tuples that would have fit into the page to spill into a different page, possibly a newly-allocated one that forces a table extension. It's hard to compare the costs to the benefits because we don't know the frequency of access. Suppose it costs $1000 to vacuum relevant indexes and set dead line pointers unused. And suppose that if you do it, you thereafter will save $1 every time someone does an index-only scan. If there will be >1000 index-only scans in a meaningful time frame, it's a good trade-off, but if not, it's a bad one, but it's difficult to predict the future, and we have limited information even about the past. My intuition is that two things that we want to consider are the total number of dead line pointers in the heap, and the number of pages across which they are spread. It is also my intuition that the latter is the more important number, possibly to the extent that we could ignore the former number completely. But exactly what the thresholds should be is very unclear to me. > Is it a good idea to always perform an I/O after collecting the dead > TID's or there should be an option where the user can configure so > that it aggressively vacuum all the indexes and this I/O overhead can > be avoided completely. It's my view that there should definitely be such an option. As I also mentioned on another thread recently, suppose we pick words for each phase of vacuum. For the sake of argument, suppose we refer to the first heap phase as PRUNE, the index phase as SANITIZE, and the second heap phase as RECYCLE. Then you can imagine syntax like this: VACUUM (PRUNE) my_table; VACUUM (SANITIZE) my_table; -- all indexes VACUUM my_index; -- must be sanitize only VACUUM (PRUNE, SANITIZE, RECYCLE) my_table; -- do everything Now in the last case is clearly possible for the system to do everything in memory since all phases are being performed, but depending on what we decide, maybe it will choose to use the dead-TID fork in some cases for some reason or other. If so, we might have explicit syntax to override that behavior, e.g. VACUUM (PRUNE, SANITIZE, RECYCLE, TID_STORE 0) my_table; which might be able to be abbreviated, depending on how we set the defaults, to just: VACUUM (TID_STORE 0) my_table; This is all just hypothetical syntax and probably needs a good deal of polish and bike-shedding. But it would be really nice to standardize on some set of terms like prune/sanitize/recycle or whatever we pick, because then we could document what they mean, use them in autovacuum log messages, use them internally for function names, use them for VACUUM option names when we get to that point, etc. and the whole thing would be a good deal more comprehensible than at present. -- Robert Haas EDB: http://www.enterprisedb.com
pgsql-hackers by date: