Re: decoupling table and index vacuum - Mailing list pgsql-hackers

From Robert Haas
Subject Re: decoupling table and index vacuum
Date
Msg-id CA+TgmoYswMDp610KJVpXEs=qA8aJ-WPuBgqyMhZoMzsdgjdcXg@mail.gmail.com
Whole thread Raw
In response to Re: decoupling table and index vacuum  (Dilip Kumar <dilipbalaut@gmail.com>)
Responses Re: decoupling table and index vacuum  (Peter Geoghegan <pg@bowt.ie>)
Re: decoupling table and index vacuum  (Dilip Kumar <dilipbalaut@gmail.com>)
List pgsql-hackers
On Wed, Feb 9, 2022 at 1:18 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I agree with the point that we should be focusing more on index size
> growth compared to dead tuples.  But I don't think that we can
> completely ignore the number of dead tuples.  Although we have the
> bottom-up index deletion but whether the index structure will be
> preserved or not will depend upon what keys we are inserting next.  So
> for example if there are 80% dead tuples but so far index size is fine
> then can we avoid vacuum? If we avoid vacuuming then it is very much
> possible that in some cases we will create a huge bloat e.g. if we are
> inserting some keys which can not take advantage of bottom up
> deletion.  So IMHO the decision should be a combination of index size
> bloat and % dead tuples.  Maybe we can add more weight to the size
> bloat and less weight to % dead tuple but we should not completely
> ignore it.

I think that dead index tuples really don't matter if they're going to
get removed anyway before a page split happens. In particular, if
we're going to do a bottom-up index deletion pass before splitting the
page, then who cares if there are going to be dead tuples around until
then? You might think that they'd have the unfortunate effect of
slowing down scans, and they could slow down ONE scan, but if they do,
then I think kill_prior_tuple will hint them dead and they won't
matter any more. Now, if we have a page that is going to split,
because it's going to receive inserts but neither kill_prior_tuple nor
bottom-up index deletion are going to keep us out of trouble, then the
dead tuples matter. And if we have a page where all the tuples are
dead and no further inserts are ever going to happen, those dead
tuples also matter, because getting rid of them would let us recycle
the page.

Just to be clear, when I say that the dead index tuples don't matter
here, I mean from the point of view of the index. From the point of
view of the table, the presence of dead index tuples (or even the
potential presence of dead tuples) pointing to dead line pointers is
an issue that can drive heap bloat. But from the point of view of the
index, because we don't ever merge sibling index pages, and because we
have kill_prior_tuple, there's not much value in freeing up space in
index pages unless it either prevents a split or lets us free the
whole page. So I agree with Peter that index growth is what really
matters.

However, I have a concern that Peter's idea to use the previous index
growth to drive future index vacuuming distinction is retrospective
rather than prospective. If the index is growing more than it should
based on the data volume, then evidently we didn't do enough vacuuming
at some point in the past. It's reasonable to step up our efforts in
the present to make sure that the problem doesn't continue, but in
some sense it's already too late. What we would really like is a
measure that answers the question: is the index going to bloat in the
relatively near future if we don't vacuum it now? I think that the
dead tuple count is trying, however imperfectly, to figure that out.
All other things being equal, the more dead tuples there are in the
index, the more bloat we're going to have later if we don't clean them
out now.

The problem is not with that core idea, which IMHO is actually good,
but that all other things are NOT equal. Peter has shown pretty
convincingly that in some workloads, essentially 100% of dead tuples
are going to get removed without causing a page split and the index
growth will be 0, whereas in other workloads 0% of dead tuples are
going to get removed without causing index growth. If you knew that
you had the second case, then counting dead index tuples to decide
when to vacuum would, in my opinion, be a very sensible thing to do.
It would still not be perfect, because dead tuples in pages that are
going to get split are a lot worse than dead tuples in pages that
aren't going to be split, but it doesn't seem meaningless. However, if
all of the index tuples are going to get removed in a timely fashion
anyway, then it's as useful as a stopped clock: it will be right
whenever it says the index doesn't need to be vacuumed, and wrong when
it says anything else.

In a certain sense, bottom-up index deletion may have exacerbated the
problems in this area. The more ways we add to remove dead tuples from
indexes without vacuum, the less useful dead tuples will become as a
predictor of index growth. Maybe #-of-dead-tuples and
future-index-growth weren't that tightly coupled even before bottom-up
index deletion, but it must be worse now.

I'm not hung up on using the # of dead tuples specifically as the
metric for index vacuuming, and it may be best to pick some other
measure. But I am a little suspicious that if the only measure is past
index growth, we will let some situations go too far before we wake up
and do something about them. My intuition is that it would be a good
idea to come up with something we could measure, even if it's
imperfect, that would give us some clue that trouble is brewing before
pages actually start splitting. Now maybe my intuition is wrong and
there is nothing better, but I think it's worth a thought.

--
Robert Haas
EDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Daniel Gustafsson
Date:
Subject: Typo in archive modules docs
Next
From: Daniel Gustafsson
Date:
Subject: Re: Add tag/category to the commitfest app