From: Robert Haas
Subject: decoupling table and index vacuum
Hi,

We are used to thinking about table vacuum and index vacuum as parts
of a single, indivisible operation. You vacuum the table -- among
other things by performing HOT pruning and remembering dead TIDs --
and then you vacuum the indexes -- removing the remembered TIDs from
the index -- and then you vacuum the table some more, setting those
dead TIDs unused -- and then you're done. Along the way you also do
some other things, like considering truncation, that aren't relevant
to the point I want to make here. Now, the problem with this is that
every index has its own needs, which are separate from the needs of
the tables, as I think Peter Geoghegan and Masahiko Sawada were also
discussing recently. Opportunistic index cleanup strategies like
kill_prior_tuple and bottom-up deletion may work much better for some
indexes than others, meaning that you could have some indexes that
badly need to be vacuumed because they are full of garbage, and other
indexes on the same table where the opportunistic cleanup has worked
perfectly and there is no need for vacuuming at all. Separately, the
table may or may not need to get some dead pointers set back to unused
to avoid table bloat.
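
To make the coupling concrete, here is a rough sketch in Python (purely
illustrative, with made-up structures rather than the actual lazy vacuum
code) of how all three phases hang off one in-memory list of dead TIDs:

# Illustrative sketch only, not PostgreSQL code: today the three phases
# share a single in-memory list of dead TIDs, so they must run as a unit.
def lazy_vacuum(heap_pages, indexes):
    """heap_pages: list of per-page lists of dead (block, offset) TIDs.
    indexes: dict mapping index name -> set of (block, offset) entries."""
    # Phase 1: scan the heap and remember dead TIDs, in memory only.
    dead_tids = [tid for page in heap_pages for tid in page]

    # Phase 2: every index gets scanned for that same TID list, whether
    # or not this particular index actually needed any cleanup.
    for entries in indexes.values():
        entries.difference_update(dead_tids)

    # Phase 3: only now can those heap line pointers be marked unused.
    reclaimed = set(dead_tids)

    # Stopping before phase 3 means throwing dead_tids away; the next
    # VACUUM has to rediscover them by scanning the heap again.
    return reclaimed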

But, as things stand today, strategy options to deal with such
situations are limited. Leaving aside what the code actually does
right now, let's talk about what options we have in theory with the
technology as it exists now. They basically all boil down to stopping
early and then redoing the work later. We must always start with a
pass over the heap to collect dead TIDs, because otherwise there's
nothing else to do. Now we can choose to stop, but then the next
VACUUM will have to collect all those TIDs again. It may get to skip
more all-visible pages than the current vacuum did, but the pages that
still have dead TIDs will all have to be visited again. If we don't
stop, then we can choose to vacuum all of the indexes or just some of
them, and then afterwards stop. But if we do this, the next VACUUM
will have to scan all indexes again for the same TIDs. Here, we don't
even have the visibility map to allow skipping known-clean pages, so
it's *really* a lot of work we have to redo. Thus what we normally do
is press on to the third step, where we mark dead line pointers unused
after scanning every index in its entirety, and now they're gone and
we don't have to worry about them again. Barring emergency escape
valves, as things stand today, the frequency of table vacuuming is the
same as the frequency of index vacuuming, even though the *required*
frequency of vacuuming is not the same, and also varies from index to
index.

Now, the reason for this is that when we discover dead TIDs, we only
record them in memory, not on disk. So, as soon as VACUUM ends, we
lose all knowledge of what those TIDs were and must rediscover
them. Suppose we didn't do this, and instead had a "dead TID" fork for
each table. Suppose further that this worked like a conveyor belt,
similar to WAL, where every dead TID we store into the fork is
assigned an identifying 64-bit number that is never reused. Then,
suppose that for each index, we store the number of the oldest entry
that might still need to be vacuumed from the index. Every time you
perform what we now call the first heap pass of a VACUUM, you add the
new TIDs you find to the dead TID fork. Every time you vacuum an
index, the TIDs that need to be removed are those between the
oldest-entry pointer for that index and the current end of the TID
fork. You remove all of those and then advance your oldest-entry
pointer accordingly. If that's too many TIDs to fit in
maintenance_work_mem, you can just read as many as will fit and
advance your oldest-entry pointer less far. Every time you perform
what we now call the second heap pass of a VACUUM, you find all the
TIDs that precede every index's oldest-entry pointer and set them
unused. You then throw away the associated storage at the OS level.
This requires a scheme where relations can be efficiently truncated
from the beginning rather than only at the end, which is why I said "a
conveyor belt" and "similar to WAL". Details deliberately vague since
I am just brainstorming here.
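
Here is a minimal sketch of the bookkeeping I have in mind, in Python
just to illustrate the idea; the names are invented, and the real thing
would obviously live in the storage layer:

# Illustrative sketch of the "conveyor belt" bookkeeping described above.
# All names (DeadTidConveyor, etc.) are made up for this example.
class DeadTidConveyor:
    def __init__(self, index_names):
        self.entries = []      # dead TIDs still stored on the belt
        self.start = 0         # 64-bit position of entries[0]; never reused
        self.next_pos = 0      # position the next appended TID will get
        # Per-index pointer: oldest belt position not yet removed from it.
        self.oldest_needed = {name: 0 for name in index_names}

    def append(self, tids):
        """First heap pass: add newly found dead TIDs to the belt."""
        self.entries.extend(tids)
        self.next_pos += len(tids)

    def vacuum_index(self, name, limit=None):
        """Return the TIDs this index still needs removed, up to 'limit'
        (e.g. what fits in maintenance_work_mem), and advance its pointer."""
        lo = self.oldest_needed[name]
        hi = self.next_pos if limit is None else min(self.next_pos, lo + limit)
        tids = self.entries[lo - self.start : hi - self.start]
        self.oldest_needed[name] = hi
        return tids

    def reclaimable(self):
        """Second heap pass: TIDs that every index has already removed
        can now be set unused in the heap."""
        lo = min(self.oldest_needed.values())
        return self.entries[: lo - self.start]

    def truncate_front(self):
        """Discard belt storage that precedes every index's pointer,
        i.e. truncate the fork from the beginning."""
        lo = min(self.oldest_needed.values())
        self.entries = self.entries[lo - self.start:]
        self.start = lo

In this sketch the first heap pass only ever calls append(), each index
can call vacuum_index() at its own pace and with its own limit, and the
second heap pass calls reclaimable() and then truncate_front().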

This scheme adds a lot of complexity, which is a concern, but it seems
to me that it might have several benefits. One is concurrency. You
could have one process gathering dead TIDs and adding them to the
dead-TID fork while another process is vacuuming previously-gathered
TIDs from some index. In fact, every index could be getting vacuumed
at the same time, and different indexes could be removing different
TID ranges. At the same time, you could have another process setting
dead TIDs that all indexes have previously removed to unused.
Furthermore, all of these operations can start in any order, and any
of them can be repeated any number of times during a single run of any
particular other one, or indeed, without that particular one ever
being run at all. Both heap phases can also now be done in smaller
chunks, if desired. You can gather TIDs from a portion of the table
and remember where you left off, and come back and pick up from that
point later, if you wish. You can likewise pick a subset of
dead-TIDs-retired-from-all-indexes to set unused, and do just that
many, and then at a later time come back and do some more. You can
also now make mostly-independent decisions about how to perform each of
these operations. It's not completely independent: if you need to
set some dead TIDs in the table to unused, you may have to force index
vacuuming that isn't needed for bloat control. However, you only need
to force it for indexes that haven't been vacuumed recently enough for
some other reason, rather than every index. If you have a target of
reclaiming 30,000 TIDs, you can just pick the indexes where there are
fewer than 30,000 dead TIDs behind their oldest-entry pointers and
force vacuuming only of those. By the time that's done, there will be
at least 30,000 dead line pointers you can mark unused, and maybe
more, minus whatever reclamation someone else did concurrently.
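
Continuing the sketch from above (again with invented names), picking
the indexes to force for a given reclaim target could be as simple as:

# Illustrative sketch: decide which indexes must be force-vacuumed so
# that at least 'target' dead line pointers at the front of the conveyor
# belt become reclaimable in the heap.
def indexes_to_force(oldest_needed, belt_start, target):
    """oldest_needed maps index name -> oldest belt position that index has
    not yet removed; belt_start is the position of the oldest stored TID."""
    goal = belt_start + target
    # An index whose pointer is already at or past 'goal' has removed the
    # first 'target' TIDs and needs no vacuuming for this purpose.
    return [name for name, pos in oldest_needed.items() if pos < goal]

# Example: to reclaim 30,000 TIDs, only lagging index "b" must be forced;
# "a" and "c" were vacuumed recently enough for some other reason.
print(indexes_to_force({"a": 45_000, "b": 12_000, "c": 60_000},
                       belt_start=0, target=30_000))   # -> ['b']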

But is this worthwhile? I think it depends a lot on what you think the
comparative required frequencies are for the various operations. If
index A needs to be vacuumed every 40 minutes and index B needs to be
vacuumed every 55 minutes and the table that owns both of them needs
to be vacuumed every 70 minutes, I am not sure there is a whole lot
here. I think you will be pretty much equally well off if you just do
what we do today every 40 minutes and call it good. Also, you will not
benefit very much if the driving force is reclaiming dead line
pointers in the table itself. If that has to happen frequently, then
the indexes have to be scanned frequently, and this whole thing is a
lot of work for not much. But, maybe that's not the case. Suppose
index A needs to be vacuumed every hour to avoid bloat, index B needs
to be vacuumed every 4 hours to avoid bloat, and the table needs dead
line pointers reclaimed every 5.5 hours. Well, now you can gain a lot.
You can vacuum index A frequently while vacuuming index B only as
often as it needs, and you can reclaim dead line pointers on their own
schedule based on whatever index vacuuming was already done for bloat
avoidance. Without this scheme, there's just no way to give everybody
what they need without some of the participants being "dragged along
for the ride" and forced into work that they don't actually need done
simply because "that's how it works."
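
Putting rough numbers on that second case, here is a back-of-the-envelope
sketch over a 24-hour window, using the figures above:

# Back-of-the-envelope comparison for the example above, over 24 hours.
HOURS = 24
need = {"index A": 1.0, "index B": 4.0, "heap line pointers": 5.5}  # hours

# Coupled (today): everything runs as often as the most demanding part.
coupled_interval = min(need.values())
coupled = {part: HOURS / coupled_interval for part in need}

# Decoupled: each part runs only as often as it actually needs.
decoupled = {part: HOURS / interval for part, interval in need.items()}

for part in need:
    print(f"{part}: {coupled[part]:.0f} passes coupled vs "
          f"{decoupled[part]:.1f} passes decoupled")
# index A: 24 either way; index B: 24 vs 6; heap: 24 vs ~4.4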

One thing I don't know is whether the kind of scenario that I describe
above is common, i.e. is the main reason we need to vacuum to control
index bloat, where this kind of approach seems likely to help, or is
it to reclaim dead line pointers in the heap, where it's not? I'd be
interested in hearing from people who have some experience in this
area, or at least better intuition than I do.

I'm interested in this idea partly because I think it would be much
more likely to help in a hypothetical world where we had global
indexes. Imagine a partitioned table where each partition has a local
index and then there is also a global index which indexes tuples
from every partition. Waving away the difficulty of making such a
thing work, there's a vacuuming problem here, which has been discussed
before. In short, if you tried to handle this in the naive way, you'd
end up having to vacuum the global index every time you vacuumed any
partition. That sucks. Suppose that there are 1000 partitions, each
partition is 1GB, and each local index is 50MB. All things being
equal, the global index should end up being about as big as all of the
local indexes put together, which in this case would be 50GB. Clearly,
we do not want to vacuum each partition by scanning the 1GB partition
+ the 50MB local index + the 50GB global index. That's insane. With
the above system, since everything's decoupled, you can vacuum the
partition tables individually as often as required, and similarly for
their local indexes, but put off vacuuming the global index until
you've vacuumed a bunch, maybe all, of the partitions, so that the
work of cleaning up the global index cleans up dead TIDs from many or
all partitions instead of just one at a time.
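
The rough arithmetic for that example, using just the sizes given above:

# Rough arithmetic for the partitioned-table example above (sizes in GB).
partitions = 1000
partition_size = 1.0        # GB per partition heap
local_index_size = 0.050    # 50MB per local index
global_index_size = 50.0    # roughly the sum of the local indexes

# Naive coupling: every per-partition vacuum also scans the global index.
naive = partitions * (partition_size + local_index_size + global_index_size)

# Decoupled: vacuum partitions and local indexes individually, but clean
# the global index only once after all partitions have contributed TIDs.
decoupled = partitions * (partition_size + local_index_size) + global_index_size

print(f"naive:     ~{naive:,.0f} GB scanned")      # ~51,050 GB
print(f"decoupled: ~{decoupled:,.0f} GB scanned")  # ~1,100 GB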

Now, the fly in the ointment here is that this supposes that we don't
get forced into vacuuming the global index too quickly because of dead
line pointer accumulation. But, I think if that does happen, with
careful scheduling, we might not really be worse off than we would
have been without partitioning. If we scan the table for just one
partition and, say, exhaust maintenance_work_mem, we have some
expensive index vacuuming to do immediately, but that would've also
happened in pretty much the same way with an unpartitioned table. If
we don't fill maintenance_work_mem but we do notice that the table for
this partition is full of dead line pointers that we need to reclaim,
we can still choose to scan some other partitions and collect some
more dead TIDs before cleaning the global index. That could delay
actually getting those line pointers reclaimed, but an unpartitioned
table would have suffered from at least as much delay, because it
wouldn't even consider the possibility of stopping before scanning
every available table page, and we could choose to stop after dealing
with only some partitions but not all. It's probably tricky to get the
autovacuum algorithm right here, but there seems to be some room for
optimism.
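
Very roughly, and just to illustrate the shape of that decision rather
than to propose actual autovacuum logic (all names and thresholds here
are invented):

# Purely illustrative sketch of the scheduling idea above.
def after_scanning_partition(dead_tid_bytes, maintenance_work_mem,
                             dead_lp_fraction, lp_threshold=0.02,
                             partitions_remaining=0):
    if dead_tid_bytes >= maintenance_work_mem:
        # Out of memory for TIDs: the expensive global-index vacuuming has
        # to happen now, much as it would for an unpartitioned table.
        return "vacuum the global index now"
    if dead_lp_fraction >= lp_threshold and partitions_remaining > 0:
        # Plenty of dead line pointers, but we can still batch more work:
        # collect TIDs from other partitions before touching the global index.
        return "scan more partitions first, then clean the global index"
    return "keep collecting; no global index work needed yet"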

Even if global indexes never happened, though, I think this could have
other benefits. For example, the wraparound failsafe mechanism
recently added by Masahiko Sawada and Peter Geoghegan bypasses index
vacuuming when wraparound danger is imminent. The only problem is that
making that decision means throwing away the accumulated list of dead
TIDs, which then need to be rediscovered whenever we get around to
vacuuming the indexes. But that's avoidable, if they're stored on disk
rather than in RAM.

One rather serious objection to this whole line of attack is that we'd
ideally like VACUUM to reclaim disk space without using any more, in
case running low on disk space is the motivation for running VACUUM in
the first place. A related objection is that if it's sometimes
agreeable to do everything all at
once as we currently do, the I/O overhead could be avoided. I think
we'd probably have to retain a code path that buffers the dead TIDs in
memory to account, at least, for the low-on-disk-space case, and maybe
that can also be used to avoid I/O in some other cases, too. I haven't
thought through all the details here. It seems to me that the actual
I/O avoidance is probably not all that much - each dead TID is much
smaller than the deleted tuple that gave rise to it, and typically you
don't delete all the tuples at once - but it might be material in some
cases, and it's definitely material if you don't have enough disk
space left for it to complete without error.
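
For a sense of scale on the I/O point (a rough calculation; 6 bytes is
the size of an ItemPointerData, while the average tuple size is just an
illustrative guess):

# Rough sense of scale for the on-disk dead-TID overhead.
TID_BYTES = 6            # ItemPointerData: 4-byte block + 2-byte offset
tuple_bytes = 100        # illustrative average heap tuple, header included
dead_tuples = 10_000_000

tid_fork_bytes = dead_tuples * TID_BYTES          # about 57 MB on the belt
space_being_reclaimed = dead_tuples * tuple_bytes # nearly 1 GB of dead tuples

print(f"dead-TID fork: ~{tid_fork_bytes / 1024**2:.0f} MB "
      f"to reclaim ~{space_being_reclaimed / 1024**2:.0f} MB")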

Thoughts?

-- 
Robert Haas
EDB: http://www.enterprisedb.com


