Re: decoupling table and index vacuum - Mailing list pgsql-hackers

From Robert Haas
Subject Re: decoupling table and index vacuum
Date
Msg-id CA+TgmobGfPFtQD2yP-OVDNrnsM81Ma_YEUnY-Q2fctjTeceSqA@mail.gmail.com
Whole thread Raw
In response to Re: decoupling table and index vacuum  (Dilip Kumar <dilipbalaut@gmail.com>)
Responses Re: decoupling table and index vacuum  (Peter Geoghegan <pg@bowt.ie>)
Re: decoupling table and index vacuum  (Dilip Kumar <dilipbalaut@gmail.com>)
List pgsql-hackers
On Wed, Jan 26, 2022 at 8:58 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> TODO:
> - This is just a POC patch to discuss the design idea and needs a lot
> of improvement and testing.
> - We are using a slightly different format for storing the dead tids
> into the conveyor belt which is explained in the patch but the
> traditional multi-pass vacuum is still using the same format (array of
> ItemPointeData), so we need to unify that format.
> - Performance testing.
> - Cleaner interfaces so that we can easily be integrated with auto
> vacuum, currently, this is not provided for the manual vacuum.
> - Add test cases.

I think this is a pretty interesting piece of work. I appreciate the
effort you've obviously put into the comments, although I do think
some of them are going to need some additional clarification. But I
think the bigger questions here at the moment are things like (1) Is
this the right idea? and if not (2) How could we change it to make it
better? and (3) Is there any way that we can make it simpler? It was
the last of these questions that prompted me to post
http://postgr.es/m/CA+TgmoY18RzQqDm2jE2WDkiA8ngTEDHp7uLtHb3a-ABs+wbY_g@mail.gmail.com
because, if that thought were to work out, then we could have more
things in common between the conveyor-belt and non-conveyor-belt
cases, and we might be able to start with some preliminary work to
jigger more things in to the second phase, and then look to integrate
the conveyor belt stuff separately.

I think what we ought to do at this point is try to figure out some
tests that might show how well this approach actually works in
practice. Now one motivation for this work was the desire to someday
have global indexes, but those don't exist yet, so it makes sense to
consider other scenarios in which the patch might (or might not) be
beneficial. And it seems to me that we should be looking for a
scenario where we have multiple indexes with different vacuuming
needs. How could that happen? Well, the first thing that occurred to
me was a table with a partial index. If we have a column t whose
values are randomly distributed between 1 and 10, and a partial index
on some other column WHERE t = 1, then the partial index should only
accumulate dead tuples about 10% as fast as a non-partial index on the
same column. On the other hand, the partial index also has a much
smaller number of total rows, so after a fixed number of updates, the
partial index should have the same *percentage* of dead tuples as the
non-partial index even though the absolute number is smaller. So maybe
that's not a great idea.

My second thought was that perhaps we can create a test scenario
where, in one index, the deduplication and bottom-up index deletion
and kill_prior_tuple mechanisms are very effective, and in another
index, it's not effective at all. For example, maybe index A is an
index on the primary key, and index B is a non-unique index on some
column that we're updating with ever-increasing values (so that we
never put new tuples into a page that could be productively cleaned
up). I think what should happen in this case is that A should not grow
in size even if it's never vacuumed, while B will require vacuuming to
keep the size down. If this example isn't exactly right, maybe we can
construct one where that does happen. Then we could try to demonstrate
that with this patch we can do less vacuuming work and still keep up
than what would be possible without the patch. We'll either be able to
show that this is true, or we will see that it's false, or we won't be
able to really see much difference. Any of those would be interesting
findings.

One thing we could try doing in order to make that easier would be:
tweak things so that when autovacuum vacuums the table, it only
vacuums the indexes if they meet some threshold for bloat. I'm not
sure exactly what happens with the heap vacuuming then - do we do
phases 1 and 2 always, or a combined heap pass, or what? But if we
pick some criteria that vacuums indexes sometimes and not other times,
we can probably start doing some meaningful measurement of whether
this patch is making bloat better or worse, and whether it's using
fewer or more resources to do it.

Do you have a git branch for this work?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: Support for NSS as a libpq TLS backend
Next
From: Bruce Momjian
Date:
Subject: Re: Support for NSS as a libpq TLS backend