Thread: decoupling table and index vacuum
Hi,

We are used to thinking about table vacuum and index vacuum as parts of a single, indivisible operation. You vacuum the table -- among other things by performing HOT pruning and remembering dead TIDs -- and then you vacuum the indexes -- removing the remembered TIDs from the index -- and then you vacuum the table some more, setting those dead TIDs unused -- and then you're done. And along the way you do some other things too like considering truncation that aren't relevant to the point I want to make here. Now, the problem with this is that every index has its own needs, which are separate from the needs of the table, as I think Peter Geoghegan and Masahiko Sawada were also discussing recently.

Opportunistic index cleanup strategies like kill_prior_tuple and bottom-up deletion may work much better for some indexes than others, meaning that you could have some indexes that badly need to be vacuumed because they are full of garbage, and other indexes on the same table where the opportunistic cleanup has worked perfectly and there is no need for vacuuming at all. Separately, the table may or may not need to get some dead pointers set back to unused to avoid table bloat.

But, as things stand today, strategy options to deal with such situations are limited. Leaving aside what the code actually does right now, let's talk about what options we have in theory with the technology as it exists now. They basically all boil down to stopping early and then redoing the work later. We must always start with a pass over the heap to collect dead TIDs, because otherwise there's nothing else to do. Now we can choose to stop, but then the next VACUUM will have to collect all those TIDs again. It may get to skip more all-visible pages than the current vacuum did, but the pages that still have dead TIDs will all have to be visited again. If we don't stop, then we can choose to vacuum all of the indexes or just some of them, and then afterwards stop. But if we do this, the next VACUUM will have to scan all indexes again for the same TIDs. Here, we don't even have the visibility map to allow skipping known-clean pages, so it's *really* a lot of work we have to redo. Thus what we normally do is press on to the third step, where we mark dead line pointers unused after scanning every index in its entirety, and now they're gone and we don't have to worry about them again. Barring emergency escape valves, as things stand today, the frequency of table vacuuming is the same as the frequency of index vacuuming, even though the *required* frequency of vacuuming is not the same, and also varies from index to index.

Now, the reason for this is that when we discover dead TIDs, we only record them in memory, not on disk. So, as soon as VACUUM ends, we lose all knowledge of what those TIDs were and must rediscover them. Suppose we didn't do this, and instead had a "dead TID" fork for each table.

Suppose further that this worked like a conveyor belt, similar to WAL, where every dead TID we store into the fork is assigned an identifying 64-bit number that is never reused. Then, suppose that for each index, we store the number of the oldest entry that might still need to be vacuumed from the index. Every time you perform what we now call the first heap pass of a VACUUM, you add the new TIDs you find to the dead TID fork. Every time you vacuum an index, the TIDs that need to be removed are those between the oldest-entry pointer for that index and the current end of the TID fork.
You remove all of those and then advance your oldest-entry pointer accordingly. If that's too many TIDs to fit in maintenance_work_mem, you can just read as many as will fit and advance your oldest-entry pointer less far. Every time you perform what we now call the second heap pass of a VACUUM, you find all the TIDs that precede every index's oldest-entry pointer and set them unused. You then throw away the associated storage at the OS level. This requires a scheme where relations can be efficiently truncated from the beginning rather than only at the end, which is why I said "a conveyor belt" and "similar to WAL". Details deliberately vague since I am just brainstorming here.

This scheme adds a lot of complexity, which is a concern, but it seems to me that it might have several benefits. One is concurrency. You could have one process gathering dead TIDs and adding them to the dead-TID fork while another process is vacuuming previously-gathered TIDs from some index. In fact, every index could be getting vacuumed at the same time, and different indexes could be removing different TID ranges. At the same time, you could have another process setting dead TIDs that all indexes have previously removed to unused. Furthermore, all of these operations can start in any order, and any of them can be repeated any number of times during a single run of any particular other one, or indeed, without that particular one ever being run at all. Both heap phases can also now be done in smaller chunks, if desired. You can gather TIDs from a portion of the table and remember where you left off, and come back and pick up from that point later, if you wish. You can likewise pick a subset of dead-TIDs-retired-from-all-indexes to set unused, and do just that many, and then at a later time come back and do some more.

Also, you can now make mostly-independent decisions about how to perform each of these operations, too. It's not completely independent: if you need to set some dead TIDs in the table to unused, you may have to force index vacuuming that isn't needed for bloat control. However, you only need to force it for indexes that haven't been vacuumed recently enough for some other reason, rather than every index. If you have a target of reclaiming 30,000 TIDs, you can just pick the indexes where there are fewer than 30,000 dead TIDs behind their oldest-entry pointers and force vacuuming only of those. By the time that's done, there will be at least 30,000 dead line pointers you can mark unused, and maybe more, minus whatever reclamation someone else did concurrently.

But is this worthwhile? I think it depends a lot on what you think the comparative required frequencies are for the various operations. If index A needs to be vacuumed every 40 minutes and index B needs to be vacuumed every 55 minutes and the table that owns both of them needs to be vacuumed every 70 minutes, I am not sure there is a whole lot here. I think you will be pretty much equally well off if you just do what we do today every 40 minutes and call it good. Also, you will not benefit very much if the driving force is reclaiming dead line pointers in the table itself. If that has to happen frequently, then the indexes have to be scanned frequently, and this whole thing is a lot of work for not much. But, maybe that's not the case. Suppose index A needs to be vacuumed every hour to avoid bloat, index B needs to be vacuumed every 4 hours to avoid bloat, and the table needs dead line pointers reclaimed every 5.5 hours.
Well, now you can gain a lot. You can vacuum index A frequently while vacuuming index B only as often as it needs, and you can reclaim dead line pointers on their own schedule based on whatever index vacuuming was already done for bloat avoidance. Without this scheme, there's just no way to give everybody what they need without some of the participants being "dragged along for the ride" and forced into work that they don't actually need done simply because "that's how it works."

One thing I don't know is whether the kind of scenario that I describe above is common, i.e. is the main reason we need to vacuum to control index bloat, where this kind of approach seems likely to help, or is it to reclaim dead line pointers in the heap, where it's not? I'd be interested in hearing from people who have some experience in this area, or at least better intuition than I do.

I'm interested in this idea partly because I think it would be much more likely to help in a hypothetical world where we had global indexes. Imagine a partitioned table where each partition has a local index and then there is also a global index which indexes tuples from every partition. Waving away the difficulty of making such a thing work, there's a vacuuming problem here, which has been discussed before. In short, if you tried to handle this in the naive way, you'd end up having to vacuum the global index every time you vacuumed any partition. That sucks. Suppose that there are 1000 partitions, each partition is 1GB, and each local index is 50MB. All things being equal, the global index should end up being about as big as all of the local indexes put together, which in this case would be 50GB. Clearly, we do not want to vacuum each partition by scanning the 1GB partition + the 50MB local index + the 50GB global index. That's insane. With the above system, since everything's decoupled, you can vacuum the partition tables individually as often as required, and similarly for their local indexes, but put off vacuuming the global index until you've vacuumed a bunch, maybe all, of the partitions, so that the work of cleaning up the global index cleans up dead TIDs from many or all partitions instead of just one at a time.

Now, the fly in the ointment here is that this supposes that we don't get forced into vacuuming the global index too quickly because of dead line pointer accumulation. But, I think if that does happen, with careful scheduling, we might not really be worse off than we would have been without partitioning. If we scan the table for just one partition and, say, exhaust maintenance_work_mem, we have some expensive index vacuuming to do immediately, but that would've also happened in pretty much the same way with an unpartitioned table. If we don't fill maintenance_work_mem but we do notice that the table for this partition is full of dead line pointers that we need to reclaim, we can still choose to scan some other partitions and collect some more dead TIDs before cleaning the global index. That could delay actually getting those line pointers reclaimed, but an unpartitioned table would have suffered from at least as much delay, because it wouldn't even consider the possibility of stopping before scanning every available table page, and we could choose to stop after dealing with only some partitions but not all. It's probably tricky to get the autovacuum algorithm right here, but there seems to be some room for optimism.

Even if global indexes never happened, though, I think this could have other benefits.
For example, the wraparound failsafe mechanism recently added by Masahiko Sawada and Peter Geoghegan bypasses index vacuuming when wraparound danger is imminent. The only problem is that making that decision means throwing away the accumulated list of dead TIDs, which then need to be rediscovered whenever we get around to vacuuming the indexes. But that's avoidable, if they're stored on disk rather than in RAM.

One rather serious objection to this whole line of attack is that we'd ideally like VACUUM to reclaim disk space without using any more, in case running out of disk space is the motivation for running VACUUM in the first place. A related objection is that if it's sometimes agreeable to do everything all at once as we currently do, the I/O overhead could be avoided. I think we'd probably have to retain a code path that buffers the dead TIDs in memory to account, at least, for the low-on-disk-space case, and maybe that can also be used to avoid I/O in some other cases, too. I haven't thought through all the details here. It seems to me that the actual I/O avoidance is probably not all that much - each dead TID is much smaller than the deleted tuple that gave rise to it, and typically you don't delete all the tuples at once - but it might be material in some cases, and it's definitely material if you don't have enough disk space left for it to complete without error.

Thoughts?

--
Robert Haas
EDB: http://www.enterprisedb.com
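To make the conveyor-belt bookkeeping described in the mail above concrete, here is a minimal C sketch. It is purely illustrative -- the type and field names are hypothetical and nothing here comes from a patch: each dead TID appended to the fork gets a 64-bit position that is never reused, each index remembers the oldest position it might still need to remove, and anything older than the minimum of those per-index positions can be set LP_UNUSED in the heap.

#include <stdint.h>
#include <stdio.h>

/* Illustration only -- hypothetical names, not PostgreSQL code. */

typedef struct DeadTid
{
    uint32_t    block;      /* heap block number */
    uint16_t    offset;     /* line pointer offset within that block */
} DeadTid;

#define BELT_SIZE 1024

typedef struct DeadTidBelt
{
    DeadTid     entries[BELT_SIZE];    /* the "conveyor belt" of dead TIDs */
    uint64_t    next_pos;              /* 64-bit position of the next entry; never reused */
} DeadTidBelt;

/* One per index: oldest belt position that might still need vacuuming. */
typedef struct IndexVacState
{
    const char *index_name;
    uint64_t    oldest_unvacuumed;
} IndexVacState;

/* First heap pass: append a newly discovered dead TID to the belt. */
static uint64_t
belt_append(DeadTidBelt *belt, DeadTid tid)
{
    belt->entries[belt->next_pos % BELT_SIZE] = tid;
    return belt->next_pos++;
}

/*
 * Second heap pass: every entry before the minimum of the per-index
 * pointers has already been removed from every index, so those dead
 * line pointers can be set LP_UNUSED and the storage truncated away
 * from the front.
 */
static uint64_t
belt_reclaim_horizon(const IndexVacState *idx, int nindexes)
{
    uint64_t    horizon = UINT64_MAX;

    for (int i = 0; i < nindexes; i++)
        if (idx[i].oldest_unvacuumed < horizon)
            horizon = idx[i].oldest_unvacuumed;
    return horizon;
}

int
main(void)
{
    DeadTidBelt     belt = {.next_pos = 0};
    IndexVacState   idx[] = {{"index_a", 0}, {"index_b", 0}};

    for (uint32_t blk = 0; blk < 5; blk++)
        belt_append(&belt, (DeadTid) {blk, 1});

    idx[0].oldest_unvacuumed = 5;   /* index_a vacuumed through position 5 */
    idx[1].oldest_unvacuumed = 2;   /* index_b lagging at position 2 */

    printf("dead TIDs before belt position %llu can be set LP_UNUSED\n",
           (unsigned long long) belt_reclaim_horizon(idx, 2));
    return 0;
}

The real thing would of course live on disk and be truncatable from the front, as the mail describes; the sketch only shows the pointer arithmetic that decouples index vacuuming from heap vacuuming.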
Hi, On 2021-04-21 11:21:31 -0400, Robert Haas wrote: > Opportunistic index cleanup strategies like > kill_prior_tuple and bottom-up deletion may work much better for some > indexes than others, meaning that you could have some indexes that > badly need to be vacuumed because they are full of garbage, and other > indexes on the same table where the opportunistic cleanup has worked > perfectly and there is no need for vacuuming at all. Partial indexes are another case that can lead to individual indexes being nearly free of bloat while others are severely bloated. > This requires a scheme where relations can be efficiently truncated > from the beginning rather than only at the end, which is why I said "a > conveyor belt" and "similar to WAL". Details deliberately vague since > I am just brainstorming here. I'm not sure that's the only way to deal with this. While some form of generic "conveyor belt" infrastructure would be a useful building block, and it'd be sensible to use it here if it existed, it seems feasible to store dead TIDs in a different way here. You could e.g. have per-heap-vacuum files with a header containing LSNs that indicate the age of the contents. > This scheme adds a lot of complexity, which is a concern, but it seems > to me that it might have several benefits. One is concurrency. You > could have one process gathering dead TIDs and adding them to the > dead-TID fork while another process is vacuuming previously-gathered > TIDs from some index. I think it might even open the door to using multiple processes gathering dead TIDs for the same relation. > In fact, every index could be getting vacuumed at the same time, and > different indexes could be removing different TID ranges. We kind of have this feature right now, due to parallel vacuum... > It's not completely independent: if you need to set some dead TIDs in > the table to unused, you may have to force index vacuuming that isn't > needed for bloat control. However, you only need to force it for > indexes that haven't been vacuumed recently enough for some other > reason, rather than every index. Hm - how would we know how recently that TID has been marked dead? We don't even have xids for dead ItemIds... Maybe you're intending to answer that in your next paragraph, but it's not obvious to me that'd be sufficient... > If you have a target of reclaiming 30,000 TIDs, you can just pick the > indexes where there are fewer than 30,000 dead TIDs behind their > oldest-entry pointers and force vacuuming only of those. By the time > that's done, there will be at least 30,000 dead line pointers you can > mark unused, and maybe more, minus whatever reclamation someone else > did concurrently. One thing that you didn't mention so far is that this'd allow us to add dead TIDs to the "dead tid" file outside of vacuum too. In some workloads most of the dead tuple removal happens as part of on-access HOT pruning. While some indexes are likely to see that via the killtuples logic, others may not. Being able to have more aggressive index vacuum for the one or two bloated indexes, without needing to rescan the heap, seems like it'd be a significant improvement. > Suppose index A needs to be vacuumed every hour to avoid bloat, index > B needs to be vacuumed every 4 hours to avoid bloat, and the table > needs dead line pointers reclaimed every 5.5 hours. Well, now you can > gain a lot. 
> You can vacuum index A frequently while vacuuming index B > only as often as it needs, and you can reclaim dead line pointers on > their own schedule based on whatever index vacuuming was already done > for bloat avoidance. Without this scheme, there's just no way to give > everybody what they need without some of the participants being > "dragged along for the ride" and forced into work that they don't > actually need done simply because "that's how it works." Have you thought about how we would do the scheduling of vacuums for the different indexes? We don't really have useful stats for the number of dead index entries to be expected in an index. It'd not be hard to track how many entries are removed in an index via killtuples, but e.g. estimating how many dead entries there are in a partial index seems quite hard (at least without introducing significant overhead). > One thing I don't know is whether the kind of scenario that I describe > above is common, i.e. is the main reason we need to vacuum to control > index bloat, where this kind of approach seems likely to help, or is > it to reclaim dead line pointers in the heap, where it's not? I'd be > interested in hearing from people who have some experience in this > area, or at least better intuition than I do. I think doing something like this has a fair bit of potential. Being able to perform freezing independently of index scans, without needing to scan the table again to re-discover dead line item pointers seems like it'd be a win. More aggressive/targeted index vacuum in cases where most tuples are removed via HOT pruning seems like a win. Not having to restart from scratch after a cancelled autovacuum would be a win. Additional parallelization seems like a win... > One rather serious objection to this whole line of attack is that we'd > ideally like VACUUM to reclaim disk space without using any more, in > case the motivation for running VACUUM in the first place. I suspect we'd need a global limit of space used for this data. If above that limit we'd switch to immediately performing the work required to remove some of that space. > A related objection is that if it's sometimes agreable to do > everything all at once as we currently do, the I/O overhead could be > avoided. I think we'd probably have to retain a code path that buffers > the dead TIDs in memory to account, at least, for the > low-on-disk-space case, and maybe that can also be used to avoid I/O > in some other cases, too. We'd likely want to do some batching of insertions into the "dead tid" map - which'd probably end up looking similar to a purely in-memory path anyway. Greetings, Andres Freund
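For what it's worth, the per-heap-vacuum-file alternative mentioned above could look roughly like this. A hypothetical sketch only (the field names are invented); the point is just that an LSN range in the header is enough to tell whether an index that was last vacuumed up to some LSN still needs the file's contents:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch of a per-heap-vacuum dead-TID file header. */
typedef struct DeadTidFileHeader
{
    uint64_t    oldest_lsn;     /* insert LSN when collection of these TIDs began */
    uint64_t    newest_lsn;     /* insert LSN when collection finished */
    uint32_t    num_tids;       /* number of dead TIDs stored after the header */
} DeadTidFileHeader;

/*
 * An index whose last vacuum processed everything up to
 * 'index_vacuumed_lsn' has already removed all TIDs in a file whose
 * contents are entirely older than that LSN, so the file can be
 * skipped the next time that index is vacuumed.
 */
static inline bool
file_already_handled(const DeadTidFileHeader *hdr, uint64_t index_vacuumed_lsn)
{
    return hdr->newest_lsn <= index_vacuumed_lsn;
}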
On Wed, Apr 21, 2021 at 8:21 AM Robert Haas <robertmhaas@gmail.com> wrote: > We are used to thinking about table vacuum and index vacuum as parts > of a single, indivisible operation. You vacuum the table -- among > other things by performing HOT pruning and remembering dead TIDs -- > and then you vacuum the indexes -- removing the remembered TIDs from > the index -- and then you vacuum the table some more, setting those > dead TIDs unused -- and then you're done. And along the way you do > some other things too like considering truncation that aren't relevant > to the point I want to make here. Now, the problem with this is that > every index has its own needs, which are separate from the needs of > the tables, as I think Peter Geoghegan and Masahiko Sawada were also > discussing recently. I'm very happy to see that you've taken an interest in this work! I believe it's an important area. It's too important to be left to only two contributors. I welcome your participation as an equal partner in the broader project to fix problems with VACUUM. Masahiko and I have had plenty of ideas about where this could go next -- way too many ideas, in fact. Maybe that kind of partnership sounds unnecessary or at least seems premature, but it turns out that this area is extremely broad and far reaching, if you really think it through -- you end up having to negotiate rather a lot all at once. Apart from anything else, I simply don't have the authority to commit a bunch of stuff that implicitly makes Postgres do things a certain way in a huge number of different subsystems. (Whether or not I'd be right in each case is beside the point.) My most ambitious goal is finding a way to remove the need to freeze or to set hint bits. I think that we can do this by inventing a new kind of VACUUM just for aborted transactions, which doesn't do index vacuuming. You'd need something like an ARIES-style dirty page table to make this cheap -- so it's a little like UNDO, but not very much. The basic idea is that eagerly cleaning up aborted transactions in an autovacuum worker allows you to broadly assume that most blocks contain definitely-committed heap tuples, or else LP_DEAD stubs (which of course don't contain any XIDs). You'd still have something like conventional VACUUM, which wouldn't change that much. Freezing is largely implicit, but maybe you freeze tuples the old way if and only if a backend dirties a "known-all-committed" block -- that can still be expensive. The visibility map still has an all-visible bit, but now it also has an all-committed bit (or maybe it's a separate data structure). The combination of all-visible and all-committed is precisely the same as frozen, so you don't need a separate VM bit for that anymore. Notice that this design doesn't change much about our basic approach to transaction management. It just further decouples things. Conventional VACUUMs are now only about garbage collection, and so can be further optimized with that goal in mind. It's much easier to do clever scheduling if VACUUM really only has to do garbage collection. > Opportunistic index cleanup strategies like > kill_prior_tuple and bottom-up deletion may work much better for some > indexes than others, meaning that you could have some indexes that > badly need to be vacuumed because they are full of garbage, and other > indexes on the same table where the opportunistic cleanup has worked > perfectly and there is no need for vacuuming at all. 
I know I say this all the time these days, but it seems worth repeating now: it is a qualitative difference, not a quantitative difference. Bottom-up index deletion will frequently stop most indexes on a table from growing by even one single block, while indexes that cannot use the optimization (indexes that are logically modified by UPDATE statements) might be hugely bloated. If this is the case during one VACUUM operation, it's probably going to work like that with all future VACUUM operations. It's abundantly clear that the current quantitative approach just cannot be pushed much further. > But, as things stand today, strategy options to deal with such > situations are limited. Leaving aside what the code actually does > right now, let's talk about what options we have in theory with the > technology as it exists now. They basically all boil down to stopping > early and then redoing the work later. We must always start with a > pass over the heap to collect dead TIDs, because otherwise there's > nothing else to do. Now we can choose to stop, but then the next > VACUUM will have to collect all those TIDs again. It may get to skip > more all-visible pages than the current vacuum did, but the pages that > still have dead TIDs will all have to be visited again. If we don't > stop, then we can choose to vacuum all of the indexes or just some of > them, and then afterwards stop. But if we do this, the next VACUUM > will have to scan all indexes again for the same TIDs. Here, we don't > even have the visibility map to allow skipping known-clean pages, so > it's *really* a lot of work we have to redo. Thus what we normally do > is press on to the third step, where we mark dead line pointers unused > after scanning every index in its entirety, and now they're gone and > we don't have to worry about them again. Barring emergency escape > valves, as things stand today, the frequency of table vacuuming is the > same as the frequency of index vacuuming, even though the *required* > frequency of vacuuming is not the same, and also varies from index to > index. I'm 100% in agreement about all of this. > Now, the reason for this is that when we discover dead TIDs, we only > record them in memory, not on disk. So, as soon as VACUUM ends, we > lose all knowledge of whether those TIDs were and must rediscover > them. Suppose we didn't do this, and instead had a "dead TID" fork for > each table. I had a similar idea myself recently -- clearly remembering the TIDs that you haven't vacuumed to save work later on makes a lot of sense. I didn't get very far with it, even in my own head, but I like the direction you're taking it. Having it work a little like a queue makes a lot of sense. > Suppose further that this worked like a conveyor belt, > similar to WAL, where every dead TID we store into the fork is > assigned an identifying 64-bit number that is never reused. Then, > suppose that for each index, we store the number of the oldest entry > that might still need to be vacuumed from the index. Every time you > perform what we now call the first heap pass of a VACUUM, you add the > new TIDs you find to the dead TID fork. Maybe we can combine this known-dead-tid structure with the visibility map. Index-only scans might be able to reason about blocks that are mostly all-visible, but still have stub LP_DEAD line pointers that this other structure knows about -- so you can get index-only scans without requiring a full round of traditional vacuuming. 
Maybe there is some opportunity like that, but not sure how to fit it in to everything else right now. > Every time you vacuum an > index, the TIDs that need to be removed are those between the > oldest-entry pointer for that index and the current end of the TID > fork. You remove all of those and then advance your oldest-entry > pointer accordingly. If that's too many TIDs to fit in > maintenance_work_mem, you can just read as many as will fit and > advance your oldest-entry pointer less far. Every time you perform > what we now call the second heap pass of a VACUUM, you find all the > TIDs that precede every index's oldest-entry pointer and set them > unused. You then throw away the associated storage at the OS level. > This requires a scheme where relations can be efficiently truncated > from the beginning rather than only at the end, which is why I said "a > conveyor belt" and "similar to WAL". Details deliberately vague since > I am just brainstorming here. This amounts to adding yet more decoupling -- which seems great to me. Anything that gives us the option but not the obligation to perform work either more lazily or more eagerly (whichever makes sense) seems highly desirable to me. Especially if we can delay our decision until the last possible point, when we can have relatively high confidence that we know what we're doing. And especially if holding it open as an option is pretty cheap (that's the point of remembering dead TIDs). > Furthermore, all of these operations can start in any order, and any > of them can be repeated any number of times during a single run of any > particular other one, or indeed, without that particular one ever > being run at all. Both heap phases can also now be done in smaller > chunks, if desired. > But is this worthwhile? I think it depends a lot on what you think the > comparative required frequencies are for the various operations. There is a risk that you'll never think that any optimization is worth it because each optimization seems marginal in isolation. Sometimes a diversity of strategies is the real strategy. Let's say you have a bunch of options that you can pick and choose from, with fallbacks and with ways of course correcting even halfway through the VACUUM. It's possible that that will work wonderfully well for a given complex user workload, but if you subtract away *any one* of the strategies suddenly things get much worse in some obvious high-level way. It's entirely possible for a single table to have different needs in different parts of the table. Certainly works that way with indexes -- that much I can say for sure. > If index A needs to be vacuumed every 40 minutes and index B needs to be > vacuumed every 55 minutes and the table that owns both of them needs > to be vacuumed every 70 minutes, I am not sure there is a whole lot > here. I think you will be pretty much equally well off if you just do > what we do today every 40 minutes and call it good. That's probably all true, but I think that an excellent heuristic is to work hard to avoid really terrible outcomes, rather than trying hard to get good outcomes. The extremes don't just matter -- they may even be the only thing that matters. If index A needs to be vacuumed about as frequently as index B anyway, then the user happens to naturally be in a position where the current simplistic scheduling works for them. Which is fine, as far as it goes, but that's not really where we have problems. If you consider the "qualitative, not quantitative" perspective, things change. 
It's now pretty unlikely that all of the indexes on the same table will have approximately the same needs -- except when there is very little to do with indexes at all, which is pretty much not an interesting case anyway. Because workloads generally don't logically modify all indexed columns within each UPDATE. They just don't tend to look like that in practice. > Also, you will not > benefit very much if the driving force is reclaiming dead line > pointers in the table itself. If that has to happen frequently, then > the indexes have to be scanned frequently, and this whole thing is a > lot of work for not much. But, maybe that's not the case. Suppose > index A needs to be vacuumed every hour to avoid bloat, index B needs > to be vacuumed every 4 hours to avoid bloat, and the table needs dead > line pointers reclaimed every 5.5 hours. Well, now you can gain a lot. > You can vacuum index A frequently while vacuuming index B only as > often as it needs, and you can reclaim dead line pointers on their own > schedule based on whatever index vacuuming was already done for bloat > avoidance. Without this scheme, there's just no way to give everybody > what they need without some of the participants being "dragged along > for the ride" and forced into work that they don't actually need done > simply because "that's how it works." Right. And, the differences between index A and index B will tend to be pretty consistent and often much larger than this. Many indexes would never have to be vacuumed, even with non-HOT UPDATEs due to bottom-up index deletion -- because they literally won't even have one single page split for hours, while maybe one index gets 3x larger in the same timeframe. Eventually you'll need to vacuum the indexes all the same (not just the bloated index), but that's only required to enable safely performing heap vacuuming. It's not so bad if the non-bloated indexes won't be dirtied and if it's not so frequent (dirtying pages is the real cost to keep under control here). > One thing I don't know is whether the kind of scenario that I describe > above is common, i.e. is the main reason we need to vacuum to control > index bloat, where this kind of approach seems likely to help, or is > it to reclaim dead line pointers in the heap, where it's not? I'd be > interested in hearing from people who have some experience in this > area, or at least better intuition than I do. The paradox here is:

1. Workload characteristics are important and must be exploited to get optimal performance.

2. Workloads are too complicated and unpredictable to ever truly understand.

Roughly speaking, the solution that I think has the most promise is to make it okay for your heuristics to be wrong. You do this by keeping the costs simple, fixed and low, and by doing things that have multiple benefits (not just one). This is why it's important to give VACUUM a bunch of strategies that it can choose from and switch back and forth from, with minimal commitment -- you let VACUUM figure out what to do about the workload through trial and error. It has to try and fail on occasion -- you must be willing to pay the cost of negative feedback (though the cost must be carefully managed). This approach is perhaps sufficient to cover all of the possible extremes with all workloads. I think that the extremes are where our problems all are, or close to it. The cost shouldn't be terribly noticeable because you have the flexibility to change your mind at the first sign of an issue. 
So you never pay an extreme cost (you pay a pretty low fixed cost incrementally, at worst), but you do sometimes (and maybe even often) get an extreme benefit -- the benefit of avoiding current pathological performance issues. We know that the cost of bloat is very non-linear in a bunch of ways that can be pretty well understood, so that seems like the right thing to focus on -- this is perhaps the only thing that we can expect to understand with a relatively high degree of confidence. We can live with a lot of uncertainty about what's going on with the workload by managing it continually, ramping up and down, etc. > Clearly, > we do not want to vacuum each partition by scanning the 1GB partition > + the 50MB local index + the 50GB global index. That's insane. With > the above system, since everything's decoupled, you can vacuum the > partition tables individually as often as required, and similarly for > their local indexes, but put off vacuuming the global index until > you've vacuumed a bunch, maybe all, of the partitions, so that the > work of cleaning up the global index cleans up dead TIDs from many or > all partitions instead of just one at a time. I can't think of any other way of sensibly implementing global indexes. > Now, the fly in the ointment here is that this supposes that we don't > get forced into vacuuming the global index too quickly because of dead > line pointer accumulation. But, I think if that does happen, with > careful scheduling, we might not really be worse off than we would > have been without partitioning. If we scan the table for just one > partition and, say, exhaust maintenance_work_mem, we have some > expensive index vacuuming to do immediately, but that would've also > happened in pretty much the same way with an unpartitioned table. But you can at least drop the partitions with a global index. It shouldn't be too hard to make that work without breaking things. > It's probably tricky to get the > autovacuum algorithm right here, but there seems to be some room for > optimism. I think that it's basically okay if global indexes suck when you do lots of UPDATEs -- this is a limitation that users can probably live with. What's not okay is if they suck when you do relatively few UPDATEs. It's especially not okay to have to scan the global index to delete one index tuple that points to one LP_DEAD item. Since you tend to get a tiny number of LP_DEAD items even when the DBA bends over backwards to make all UPDATEs use HOT. Getting that to happen 99%+ of the time is so much easier than getting it to happen 100% of the time. There can be enormous asymmetry with this stuff. Long term, I see VACUUM evolving into something that can only be configured in an advisory way. It's too hard to tune this stuff because what we really want here is to structure many things as an optimization problem, and to have a holistic view that considers how the table changes over time -- over multiple VACUUM operations. We can safely be very lazy if we have some basic sense of proportion about what the risk is. For example, maybe we limit the number of newly dirtied pages during VACUUM by being lazy about pruning pages that don't happen to be dirty when encountered within VACUUM. We still have some sense of how much work we've put off, so as to never get in over our head with debt. We might also have a sense of how many dirty pages in total there are in the system currently -- maybe if the DB is not busy right now we can be extra aggressive. In short, we apply holistic thinking. 
> Even if global indexes never happened, though, I think this could have > other benefits. For example, the wraparound failsafe mechanism > recently added by Masahiko Sawada and Peter Geoghegan bypasses index > vacuuming when wraparound danger is imminent. The only problem is that > making that decision means throwing away the accumulated list of dead > TIDs, which then need to be rediscovered whenever we get around to > vacuuming the indexes. But that's avoidable, if they're stored on disk > rather than in RAM. Yeah, that's not ideal. > One rather serious objection to this whole line of attack is that we'd > ideally like VACUUM to reclaim disk space without using any more, in > case the motivation for running VACUUM in the first place. A related > objection is that if it's sometimes agreable to do everything all at > once as we currently do, the I/O overhead could be avoided. Of course it's possible that what we currently do will be optimal. But it's pretty much a question of mostly-independent things all going the same way. So I expect that it will be rare. > I think > we'd probably have to retain a code path that buffers the dead TIDs in > memory to account, at least, for the low-on-disk-space case, and maybe > that can also be used to avoid I/O in some other cases, too. I haven't > thought through all the details here. It seems to me that the actual > I/O avoidance is probably not all that much - each dead TID is much > smaller than the deleted tuple that gave rise to it, and typically you > don't delete all the tuples at once - but it might be material in some > cases, and it's definitely material if you don't have enough disk > space left for it to complete without error. All true. -- Peter Geoghegan
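A trivial sketch of the visibility-map idea from earlier in this message (the bit names are hypothetical, for illustration only): if a separate all-committed bit existed alongside all-visible, "frozen" would not need its own bit, because it is implied by the conjunction of the two.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout -- illustration only, not the real VM format. */
#define VM_ALL_VISIBLE   0x01
#define VM_ALL_COMMITTED 0x02

/*
 * Under the scheme sketched above, a page whose tuples are all visible
 * and all known committed needs no separate "all frozen" bit: the
 * combination already implies it.
 */
static inline bool
page_effectively_frozen(uint8_t vm_bits)
{
    return (vm_bits & (VM_ALL_VISIBLE | VM_ALL_COMMITTED)) ==
           (VM_ALL_VISIBLE | VM_ALL_COMMITTED);
}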
On Wed, Apr 21, 2021 at 8:51 PM Robert Haas <robertmhaas@gmail.com> wrote: > Now, the reason for this is that when we discover dead TIDs, we only > record them in memory, not on disk. So, as soon as VACUUM ends, we > lose all knowledge of whether those TIDs were and must rediscover > them. Suppose we didn't do this, and instead had a "dead TID" fork for > each table. Interesting idea. > However, you only need to force it for indexes that haven't been vacuumed recently enough for > some other reason, rather than every index. If you have a target of > reclaiming 30,000 TIDs, you can just pick the indexes where there are > fewer than 30,000 dead TIDs behind their oldest-entry pointers and > force vacuuming only of those. How do we decide this target? I mean, at a given point, how do we decide what the limit of dead TIDs is at which we want to trigger the index vacuuming? > One rather serious objection to this whole line of attack is that we'd > ideally like VACUUM to reclaim disk space without using any more, in > case the motivation for running VACUUM in the first place. A related > objection is that if it's sometimes agreable to do everything all at > once as we currently do, the I/O overhead could be avoided. I think > we'd probably have to retain a code path that buffers the dead TIDs in > memory to account, at least, for the low-on-disk-space case, and maybe > that can also be used to avoid I/O in some other cases, too. I haven't > thought through all the details here. It seems to me that the actual > I/O avoidance is probably not all that much - each dead TID is much > smaller than the deleted tuple that gave rise to it, and typically you > don't delete all the tuples at once - but it might be material in some > cases, and it's definitely material if you don't have enough disk > space left for it to complete without error. Is it a good idea to always perform I/O after collecting the dead TIDs, or should there be an option the user can configure so that it aggressively vacuums all the indexes and this I/O overhead is avoided completely? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Thu, Apr 22, 2021 at 8:51 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Wed, Apr 21, 2021 at 8:21 AM Robert Haas <robertmhaas@gmail.com> wrote: > > We are used to thinking about table vacuum and index vacuum as parts > > of a single, indivisible operation. You vacuum the table -- among > > other things by performing HOT pruning and remembering dead TIDs -- > > and then you vacuum the indexes -- removing the remembered TIDs from > > the index -- and then you vacuum the table some more, setting those > > dead TIDs unused -- and then you're done. And along the way you do > > some other things too like considering truncation that aren't relevant > > to the point I want to make here. Now, the problem with this is that > > every index has its own needs, which are separate from the needs of > > the tables, as I think Peter Geoghegan and Masahiko Sawada were also > > discussing recently. > > I'm very happy to see that you've taken an interest in this work! I > believe it's an important area. It's too important to be left to only > two contributors. I welcome your participation as an equal partner in > the broader project to fix problems with VACUUM. +many > > > Now, the reason for this is that when we discover dead TIDs, we only > > record them in memory, not on disk. So, as soon as VACUUM ends, we > > lose all knowledge of whether those TIDs were and must rediscover > > them. Suppose we didn't do this, and instead had a "dead TID" fork for > > each table. > > I had a similar idea myself recently -- clearly remembering the TIDs > that you haven't vacuumed to save work later on makes a lot of sense. > I didn't get very far with it, even in my own head, but I like the > direction you're taking it. Having it work a little like a queue makes > a lot of sense. Agreed. (I now remembered I gave a talk about a similar idea at PGCon a couple years ago). Another good point of this "dead TID fork" design is that IIUC we don't necessarily need to make it crash-safe. We would not need WAL logging for remembering dead TIDs. If the server crashes, we can simply throw it away and assume we haven't done the first heap pass yet. > > > Suppose further that this worked like a conveyor belt, > > similar to WAL, where every dead TID we store into the fork is > > assigned an identifying 64-bit number that is never reused. Then, > > suppose that for each index, we store the number of the oldest entry > > that might still need to be vacuumed from the index. Every time you > > perform what we now call the first heap pass of a VACUUM, you add the > > new TIDs you find to the dead TID fork. > > Maybe we can combine this known-dead-tid structure with the visibility > map. Index-only scans might be able to reason about blocks that are > mostly all-visible, but still have stub LP_DEAD line pointers that > this other structure knows about -- so you can get index-only scans > without requiring a full round of traditional vacuuming. Maybe there > is some opportunity like that, but not sure how to fit it in to > everything else right now. Interesting idea. > > > Every time you vacuum an > > index, the TIDs that need to be removed are those between the > > oldest-entry pointer for that index and the current end of the TID > > fork. You remove all of those and then advance your oldest-entry > > pointer accordingly. If that's too many TIDs to fit in > > maintenance_work_mem, you can just read as many as will fit and > > advance your oldest-entry pointer less far. 
> > Every time you perform > > what we now call the second heap pass of a VACUUM, you find all the > > TIDs that precede every index's oldest-entry pointer and set them > > unused. You then throw away the associated storage at the OS level. > > This requires a scheme where relations can be efficiently truncated > > from the beginning rather than only at the end, which is why I said "a > > conveyor belt" and "similar to WAL". Details deliberately vague since > > I am just brainstorming here. The dead TID fork also needs to be efficiently searchable. If the heap scan runs twice, the dead TIDs collected by each heap pass could overlap. But we would not be able to merge them if we did index vacuuming on one of the indexes between those two heap scans. The second heap scan would need to record only TIDs that were not collected by the first one. > > > Clearly, > > we do not want to vacuum each partition by scanning the 1GB partition > > + the 50MB local index + the 50GB global index. That's insane. With > > the above system, since everything's decoupled, you can vacuum the > > partition tables individually as often as required, and similarly for > > their local indexes, but put off vacuuming the global index until > > you've vacuumed a bunch, maybe all, of the partitions, so that the > > work of cleaning up the global index cleans up dead TIDs from many or > > all partitions instead of just one at a time. > > I can't think of any other way of sensibly implementing global indexes. Given that we could load all dead TIDs from many or all partitions, having the dead TIDs in memory in an efficient format would also become important. > > It's probably tricky to get the > > autovacuum algorithm right here, but there seems to be some room for > > optimism. > > I think that it's basically okay if global indexes suck when you do > lots of UPDATEs -- this is a limitation that users can probably live > with. What's not okay is if they suck when you do relatively few > UPDATEs. It's especially not okay to have to scan the global index to > delete one index tuple that points to one LP_DEAD item. Right. Given that index vacuuming is decoupled, I think per-index garbage statistics become important, and they preferably need to be fetchable without accessing the indexes. It would not be hard to estimate how many index tuples might be deleted by looking at the dead TID fork, but that doesn't necessarily match the actual number. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
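A tiny sketch of the lookup that the merge problem above implies, assuming (purely hypothetically) that each run of collected TIDs is kept sorted by (block, offset): the second heap pass would only append TIDs that a binary search of the earlier run fails to find. None of this is from a patch; it only illustrates the kind of search the on-disk format would have to support cheaply.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical sketch -- not from any patch. */
typedef struct DeadTid
{
    uint32_t    block;
    uint16_t    offset;
} DeadTid;

static int
deadtid_cmp(const void *a, const void *b)
{
    const DeadTid *x = a;
    const DeadTid *y = b;

    if (x->block != y->block)
        return x->block < y->block ? -1 : 1;
    if (x->offset != y->offset)
        return x->offset < y->offset ? -1 : 1;
    return 0;
}

/*
 * During the second heap scan, record 'tid' only if the sorted run
 * collected by the first scan does not already contain it.
 */
static bool
tid_already_recorded(const DeadTid *first_run, size_t n, DeadTid tid)
{
    return bsearch(&tid, first_run, n, sizeof(DeadTid), deadtid_cmp) != NULL;
}

int
main(void)
{
    DeadTid     first_run[] = {{10, 1}, {10, 3}, {42, 7}};  /* kept sorted */
    DeadTid     candidate = {10, 3};

    printf("already recorded: %d\n",
           tid_already_recorded(first_run, 3, candidate));
    return 0;
}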
On Wed, Apr 21, 2021 at 5:38 PM Andres Freund <andres@anarazel.de> wrote: > I'm not sure that's the only way to deal with this. While some form of > generic "conveyor belt" infrastructure would be a useful building block, > and it'd be sensible to use it here if it existed, it seems feasible to > dead tids in a different way here. You could e.g. have per-heap-vacuum > files with a header containing LSNs that indicate the age of the > contents. That's true, but I have some reservations about being overly reliant on the filesystem to provide structure here. There are good reasons to be worried about bloating the number of files in the data directory. Hmm, but maybe we could mitigate that. First, we could skip this for small relations. If you can vacuum the table and all of its indexes using the naive algorithm in <10 seconds, you probably shouldn't do anything fancy. That would *greatly* reduce the number of additional files generated. Second, we could forget about treating them as separate relation forks and make them some other kind of thing entirely, in a separate directory, especially if we adopted Sawada-san's proposal to skip WAL logging. I don't know if that proposal is actually a good idea, because it effectively adds a performance penalty when you crash or fail over, and that sort of thing can be an unpleasant surprise. But it's something to think about. > > It's not completely independent: if you need to set some dead TIDs in > > the table to unused, you may have to force index vacuuming that isn't > > needed for bloat control. However, you only need to force it for > > indexes that haven't been vacuumed recently enough for some other > > reason, rather than every index. > > Hm - how would we know how recently that TID has been marked dead? We > don't even have xids for dead ItemIds... Maybe you're intending to > answer that in your next paragraph, but it's not obvious to me that'd be > sufficient... You wouldn't know anything about when things were added in terms of wall clock time, but the idea was that TIDs get added in order and stay in that order. So you know which ones were added first. Imagine a conceptually infinite array of TIDs: (17,5) (332,6) (5, 1) (2153,92) .... Each index keeps a pointer into this array. Initially it points to the start of the array, here (17,5). If an index vacuum starts after (17,5) and (332,6) have been added to the array but before (5,1) is added, then upon completion it updates its pointer to point to (5,1). If every index is pointing to (5,1) or some later element, then you know that (17,5) and (332,6) can be set LP_UNUSED. If not, and you want to get to a state where you CAN set (17,5) and (332,6) to LP_UNUSED, you just need to force index vac on indexes that are pointing to something prior to (5,1) -- and keep forcing it until those pointers reach (5,1) or later. > One thing that you didn't mention so far is that this'd allow us to add > dead TIDs to the "dead tid" file outside of vacuum too. In some > workloads most of the dead tuple removal happens as part of on-access > HOT pruning. While some indexes are likely to see that via the > killtuples logic, others may not. Being able to have more aggressive > index vacuum for the one or two bloated index, without needing to rescan > the heap, seems like it'd be a significant improvement. Oh, that's a very interesting idea. 
It does impose some additional requirements on any such system, though, because it means you have to be able to efficiently add single TIDs. For example, you mention a per-heap-VACUUM file above, but you can't get away with creating a new file per HOT prune no matter how you arrange things at the FS level. Actually, though, I think the big problem here is deduplication. A full-blown VACUUM can perhaps read all the already-known-to-be-dead TIDs into some kind of data structure and avoid re-adding them, but that's impractical for a HOT prune. > Have you thought about how we would do the scheduling of vacuums for the > different indexes? We don't really have useful stats for the number of > dead index entries to be expected in an index. It'd not be hard to track > how many entries are removed in an index via killtuples, but > e.g. estimating how many dead entries there are in a partial index seems > quite hard (at least without introducing significant overhead). No, I don't have any good ideas about that, really. Partial indexes seem like a hard problem, and so do GIN indexes or other kinds of things where you may have multiple index entries per heap tuple. We might have to accept some known-to-be-wrong approximations in such cases. > > One rather serious objection to this whole line of attack is that we'd > > ideally like VACUUM to reclaim disk space without using any more, in > > case the motivation for running VACUUM in the first place. > > I suspect we'd need a global limit of space used for this data. If above > that limit we'd switch to immediately performing the work required to > remove some of that space. I think that's entirely the wrong approach. On the one hand, it doesn't prevent you from running out of disk space during emergency maintenance, because the disk overall can be full even though you're below your quota of space for this particular purpose. On the other hand, it does subject you to random breakage when your database gets big enough that the critical information can't be stored within the configured quota. I think we'd end up with pathological cases very much like what used to happen with the fixed-size free space map. What happened there was that your database got big enough that you couldn't track all the free space any more and it just started bloating out the wazoo. What would happen here is that you'd silently lose the well-optimized version of VACUUM when your database gets too big. That does not seem like something anybody wants. -- Robert Haas EDB: http://www.enterprisedb.com
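Continuing the conveyor-belt array illustration above ((17,5) (332,6) (5,1) ...): the "force only the lagging indexes" rule from the first mail boils down to one comparison per index. A hypothetical C sketch, assuming belt positions simply count entries from the start of the conveyor belt and all index names are made up:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical sketch: which indexes must be vacuumed before the first
 * 'target' dead TIDs on the conveyor belt can be set LP_UNUSED? */
typedef struct IndexVacState
{
    const char *name;
    uint64_t    oldest_unvacuumed;  /* belt position the index has reached */
} IndexVacState;

int
main(void)
{
    /* Target: reclaim the oldest 30000 dead TIDs, i.e. positions < 30000. */
    const uint64_t  target = 30000;
    IndexVacState   idx[] = {
        {"idx_pk", 120000},     /* already well past the target: skip */
        {"idx_partial", 4000},  /* lagging: must be force-vacuumed */
        {"idx_gin", 29000},     /* lagging: must be force-vacuumed */
    };

    for (size_t i = 0; i < sizeof(idx) / sizeof(idx[0]); i++)
    {
        if (idx[i].oldest_unvacuumed < target)
            printf("force vacuum of %s (only %llu dead TIDs behind its pointer)\n",
                   idx[i].name,
                   (unsigned long long) idx[i].oldest_unvacuumed);
        else
            printf("skip %s\n", idx[i].name);
    }
    return 0;
}

Once every pointer has reached position 30000 or later, at least that many dead line pointers can be set unused, as the first mail says.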
On Wed, Apr 21, 2021 at 7:51 PM Peter Geoghegan <pg@bowt.ie> wrote: > I'm very happy to see that you've taken an interest in this work! I > believe it's an important area. It's too important to be left to only > two contributors. I welcome your participation as an equal partner in > the broader project to fix problems with VACUUM. Err, thanks. I agree this needs broad discussion and participation. > My most ambitious goal is finding a way to remove the need to freeze > or to set hint bits. I think that we can do this by inventing a new > kind of VACUUM just for aborted transactions, which doesn't do index > vacuuming. You'd need something like an ARIES-style dirty page table > to make this cheap -- so it's a little like UNDO, but not very much. I don't see how that works. An aborted transaction can have made index entries, and those index entries can have already been moved by page splits, and there can be arbitrarily many of them, so that you can't keep track of them all in RAM. Also, you can crash after making the index entries and writing them to the disk and before the abort happens. Anyway, this is probably a topic for a separate thread. > I know I say this all the time these days, but it seems worth > repeating now: it is a qualitative difference, not a quantitative > difference. For the record, I find your quantitative vs. qualitative distinction to be mostly unhelpful in understanding what's actually going on here. I've backed into it by reading the explanatory statements you've made at various times (including here, in the part I didn't quote). But that phrase in and of itself means very little to me. Other people's mileage may vary, of course; I'm just telling you how I feel about it. > Right. And, the differences between index A and index B will tend to > be pretty consistent and often much larger than this. > > Many indexes would never have to be vacuumed, even with non-HOT > UPDATES due to bottom-up index deletion -- because they literally > won't even have one single page split for hours, while maybe one index > gets 3x larger in the same timeframe. Eventually you'll need to vacuum > the indexes all the same (not just the bloated index), but that's only > required to enable safely performing heap vacuuming. It's not so bad > if the non-bloated indexes won't be dirtied and if it's not so > frequent (dirtying pages is the real cost to keep under control here). Interesting. > The cost shouldn't be terribly noticeable because you have the > flexibility to change your mind at the first sign of an issue. So you > never pay an extreme cost (you pay a pretty low fixed cost > incrementally, at worst), but you do sometimes (and maybe even often) > get an extreme benefit -- the benefit of avoiding current pathological > performance issues. We know that the cost of bloat is very non-linear > in a bunch of ways that can be pretty well understood, so that seems > like the right thing to focus on -- this is perhaps the only thing > that we can expect to understand with a relatively high degree of > confidence. We can live with a lot of uncertainty about what's going > on with the workload by managing it continually, ramping up and down, > etc. I generally agree. You want to design a system in a way that's going to do a good job avoiding pathological cases. The current system is kind of naive about that. It does things that work well in middle-of-the-road cases, but often does stupid things in extreme cases. 
There are numerous examples of that; one is the "useless vacuuming" problem about which I've blogged in http://rhaas.blogspot.com/2020/02/useless-vacuuming.html where the system keeps on vacuuming because relfrozenxid is old but doesn't actually succeed in advancing it, so that it's just spinning to no purpose. Another thing is when it picks the "wrong" thing to do first, focusing on a less urgent problem rather than a more urgent one. This can go either way: we might spend a lot of energy cleaning up bloat when a wraparound shutdown is imminent, but we also might spend a lot of energy dealing with a wraparound issue that's not yet urgent while some table bloats out of control. I think it's important not to let the present discussion get overbroad; we won't be able to solve everything at once, and trying to do too many things at the same time will likely result in instability. > > Clearly, > > we do not want to vacuum each partition by scanning the 1GB partition > > + the 50MB local index + the 50GB global index. That's insane. With > > the above system, since everything's decoupled, you can vacuum the > > partition tables individually as often as required, and similarly for > > their local indexes, but put off vacuuming the global index until > > you've vacuumed a bunch, maybe all, of the partitions, so that the > > work of cleaning up the global index cleans up dead TIDs from many or > > all partitions instead of just one at a time. > > I can't think of any other way of sensibly implementing global indexes. Awesome. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Apr 22, 2021 at 9:15 AM Robert Haas <robertmhaas@gmail.com> wrote: > > Have you thought about how we would do the scheduling of vacuums for the > > different indexes? We don't really have useful stats for the number of > > dead index entries to be expected in an index. It'd not be hard to track > > how many entries are removed in an index via killtuples, but > > e.g. estimating how many dead entries there are in a partial index seems > > quite hard (at least without introducing significant overhead). > > No, I don't have any good ideas about that, really. Partial indexes > seem like a hard problem, and so do GIN indexes or other kinds of > things where you may have multiple index entries per heap tuple. We > might have to accept some known-to-be-wrong approximations in such > cases. I think that you're both missing very important subtleties here. Apparently the "quantitative vs qualitative" distinction I like to make hasn't cleared it up. You both seem to be assuming that everything would be fine if you could somehow inexpensively know the total number of undeleted dead tuples in each index at all times. But I don't think that that's true at all. I don't mean that it might not be true. What I mean is that it's usually a meaningless number *on its own*, at least if you assume that every index is either an nbtree index (or an index that uses some other index AM that has the same index deletion capabilities). My mental models for index bloat usually involve imagining an idealized version of a real world bloated index -- I compare the empirical reality against an imagined idealized version. I then try to find optimizations that make the reality approximate the idealized version. Say a version of the same index in a traditional 2PL database without MVCC, or in real world Postgres with VACUUM that magically runs infinitely fast. Bottom-up index deletion usually leaves a huge number of undeleted-though-dead index tuples untouched for hours, even when it works perfectly. 10% - 30% of the index tuples might be undeleted-though-dead at any given point in time (traditional B-Tree space utilization math generally ensures that there is about that much free space on each leaf page if we imagine no version churn/bloat -- we *naturally* have a lot of free space to work with). These are "Schrodinger's dead index tuples". You could count them mechanistically, but then you'd be counting index tuples that are "already dead and deleted" in an important theoretical sense, despite the fact that they are not yet literally deleted. That's why bottom-up index deletion frequently avoids 100% of all unnecessary page splits. The asymmetry that was there all along was just crazy. I merely had the realization that it was there and could be exploited -- I didn't create or invent the natural asymmetry. A similar issue exists within vacuumlazy.c (though that might matter a lot less). Notice that it counts a recently dead heap tuple in its new_dead_tuples accounting, even though the workload can probably delete that tuple in a just-in-time fashion opportunistically. Might we be better off recognizing that such a heap tuple is already morally dead and gone, even if that isn't literally true? (That's a harder argument to make, and I'm not making it right now -- it's just an example.) -- Peter Geoghegan
On Thu, Apr 22, 2021 at 7:51 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > How do we decide this target, I mean at a given point how do we decide > that what is the limit of dead TID's at which we want to trigger the > index vacuuming? It's tricky. Essentially, it's a cost-benefit analysis. On the "cost" side, the expense associated with an index vacuum is basically the number of pages that we're going to visit, and the number of those that we're going to dirty. We can know the former with certainty but can only estimate the latter. On the "benefit" side, setting dead TIDs unused helps us in two ways. First, it lets us mark heap pages all-visible, which makes index-only scans work better and reduces the cost of future vacuuming. These benefits are mitigated by future DML unsetting those bits again; there's no point in marking a page all-visible if it's about to be modified again. Second, it avoids line pointer bloat. Dead line pointers still take up space on the page, and potentially force the line pointer array to be extended, which can eventually cause tuples that would have fit into the page to spill into a different page, possibly a newly-allocated one that forces a table extension. It's hard to compare the costs to the benefits because we don't know the frequency of access. Suppose it costs $1000 to vacuum relevant indexes and set dead line pointers unused. And suppose that if you do it, you thereafter will save $1 every time someone does an index-only scan. If there will be >1000 index-only scans in a meaningful time frame, it's a good trade-off, but if not, it's a bad one. It's difficult to predict the future, and we have limited information even about the past. My intuition is that two things that we want to consider are the total number of dead line pointers in the heap, and the number of pages across which they are spread. It is also my intuition that the latter is the more important number, possibly to the extent that we could ignore the former number completely. But exactly what the thresholds should be is very unclear to me. > Is it a good idea to always perform an I/O after collecting the dead > TID's or there should be an option where the user can configure so > that it aggressively vacuum all the indexes and this I/O overhead can > be avoided completely. It's my view that there should definitely be such an option. As I also mentioned on another thread recently, suppose we pick words for each phase of vacuum. For the sake of argument, suppose we refer to the first heap phase as PRUNE, the index phase as SANITIZE, and the second heap phase as RECYCLE. Then you can imagine syntax like this:
VACUUM (PRUNE) my_table;
VACUUM (SANITIZE) my_table; -- all indexes
VACUUM my_index; -- must be sanitize only
VACUUM (PRUNE, SANITIZE, RECYCLE) my_table; -- do everything
Now in the last case it is clearly possible for the system to do everything in memory since all phases are being performed, but depending on what we decide, maybe it will choose to use the dead-TID fork in some cases for some reason or other. If so, we might have explicit syntax to override that behavior, e.g.
VACUUM (PRUNE, SANITIZE, RECYCLE, TID_STORE 0) my_table;
which might be able to be abbreviated, depending on how we set the defaults, to just:
VACUUM (TID_STORE 0) my_table;
This is all just hypothetical syntax and probably needs a good deal of polish and bike-shedding.
But it would be really nice to standardize on some set of terms like prune/sanitize/recycle or whatever we pick, because then we could document what they mean, use them in autovacuum log messages, use them internally for function names, use them for VACUUM option names when we get to that point, etc. and the whole thing would be a good deal more comprehensible than at present. -- Robert Haas EDB: http://www.enterprisedb.com
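To make the cost-benefit framing above slightly more concrete, here is a minimal standalone C sketch. The struct names, the helper, and especially the numeric weights are assumptions made purely for illustration; nothing here corresponds to existing PostgreSQL code or to a proposed heuristic:

#include <stdbool.h>

typedef struct IndexVacuumEstimate
{
    double pages_to_visit;   /* known up front: the size of the index */
    double pages_to_dirty;   /* only a guess; depends on prior opportunistic cleanup */
} IndexVacuumEstimate;

typedef struct HeapGarbageEstimate
{
    double dead_line_pointers;        /* total dead TIDs accumulated so far */
    double pages_with_dead_pointers;  /* how widely they are spread */
} HeapGarbageEstimate;

static bool
index_vacuum_looks_worthwhile(const IndexVacuumEstimate *ix,
                              const HeapGarbageEstimate *heap)
{
    /* Dirtying a page is assumed to cost several times a clean visit. */
    double cost = ix->pages_to_visit + 10.0 * ix->pages_to_dirty;

    /* Weight the number of affected heap pages far more heavily than the raw
     * TID count, following the intuition above that spread matters most. */
    double benefit = 5.0 * heap->pages_with_dead_pointers
                   + 0.1 * heap->dead_line_pointers;

    return benefit > cost;
}

The weights (10.0, 5.0, 0.1) are arbitrary; the open question in the mail above is precisely what those thresholds ought to be.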
Hi, On 2021-04-22 11:30:21 -0700, Peter Geoghegan wrote: > I think that you're both missing very important subtleties here. > Apparently the "quantitative vs qualitative" distinction I like to > make hasn't cleared it up. I'm honestly getting a bit annoyed about this stuff. Yes it's a cool improvement, but no, it doesn't mean that there aren't still relevant issues in important cases. It doesn't help that you repeatedly imply that people that don't see it your way need to have their view "cleared up". "Bottom up index deletion" is practically *irrelevant* for a significant set of workloads. > You both seem to be assuming that everything would be fine if you > could somehow inexpensively know the total number of undeleted dead > tuples in each index at all times. I don't think we'd need an exact number. Just a reasonable approximation so we know whether it's worth spending time vacuuming some index. > But I don't think that that's true at all. I don't mean that it might > not be true. What I mean is that it's usually a meaningless number *on > its own*, at least if you assume that every index is either an nbtree > index (or an index that uses some other index AM that has the same > index deletion capabilities). You also have to assume that you have roughly evenly distributed index insertions and deletions. But workloads that insert into some parts of a value range and delete from another range are common. I even would say that *precisely* because "Bottom up index deletion" can be very efficient in some workloads it is useful to have per-index stats determining whether an index should be vacuumed or not. > My mental models for index bloat usually involve imagining an > idealized version of a real world bloated index -- I compare the > empirical reality against an imagined idealized version. I then try to > find optimizations that make the reality approximate the idealized > version. Say a version of the same index in a traditional 2PL database > without MVCC, or in real world Postgres with VACUUM that magically > runs infinitely fast. > > Bottom-up index deletion usually leaves a huge number of > undeleted-though-dead index tuples untouched for hours, even when it > works perfectly. 10% - 30% of the index tuples might be > undeleted-though-dead at any given point in time (traditional B-Tree > space utilization math generally ensures that there is about that much > free space on each leaf page if we imagine no version churn/bloat -- > we *naturally* have a lot of free space to work with). These are > "Schrodinger's dead index tuples". You could count them > mechanistically, but then you'd be counting index tuples that are > "already dead and deleted" in an important theoretical sense, despite > the fact that they are not yet literally deleted. That's why bottom-up > index deletion frequently avoids 100% of all unnecessary page splits. > The asymmetry that was there all along was just crazy. I merely had > the realization that it was there and could be exploited -- I didn't > create or invent the natural asymmetry. Except that heap bloat not index bloat might be the more pressing concern. Or that there will be no meaningful amount of bottom-up deletions. Or ... Greetings, Andres Freund
On Thu, Apr 22, 2021 at 10:28 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > The dead TID fork needs to also be efficiently searched. If the heap > scan runs twice, the collected dead TIDs on each heap pass could be > overlapped. But we would not be able to merge them if we did index > vacuuming on one of indexes at between those two heap scans. The > second time heap scan would need to record only TIDs that are not > collected by the first time heap scan. I agree that there's a problem here. It seems to me that it's probably possible to have a dead TID fork that implements "throw away the oldest stuff" efficiently, and it's probably also possible to have a TID fork that can be searched efficiently. However, I am not sure that it's possible to have a dead TID fork that does both of those things efficiently. Maybe you have an idea. My intuition is that if we have to pick one, it's MUCH more important to be able to throw away the oldest stuff efficiently. I think we can work around the lack of efficient lookup, but I don't see a way to work around the lack of an efficient operation to discard the oldest stuff. > Right. Given decoupling index vacuuming, I think the index’s garbage > statistics are important which preferably need to be fetchable without > accessing indexes. It would be not hard to estimate how many index > tuples might be able to be deleted by looking at the dead TID fork but > it doesn’t necessarily match the actual number. Right, and to appeal (I think) to Peter's quantitative vs. qualitative principle, it could be way off. Like, we could have a billion dead TIDs and in one index the number of index entries that need to be cleaned out could be 1 billion and in another index it could be zero (0). We know how much data we will need to scan because we can fstat() the index, but there seems to be no easy way to estimate how many of those pages we'll need to dirty, because we don't know how successful previous opportunistic cleanup has been. It is not impossible, as Peter has pointed out a few times now, that it has worked perfectly and there will be no modifications required, but it is also possible that it has done nothing. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2021-04-22 14:47:14 -0400, Robert Haas wrote: > On Thu, Apr 22, 2021 at 10:28 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Right. Given decoupling index vacuuming, I think the index’s garbage > > statistics are important which preferably need to be fetchable without > > accessing indexes. It would be not hard to estimate how many index > > tuples might be able to be deleted by looking at the dead TID fork but > > it doesn’t necessarily match the actual number. > > Right, and to appeal (I think) to Peter's quantitative vs. qualitative > principle, it could be way off. Like, we could have a billion dead > TIDs and in one index the number of index entries that need to be > cleaned out could be 1 billion and in another index it could be zero > (0). We know how much data we will need to scan because we can fstat() > the index, but there seems to be no easy way to estimate how many of > those pages we'll need to dirty, because we don't know how successful > previous opportunistic cleanup has been. That aspect seems reasonably easy to fix: We can start to report the number of opportunistically deleted index entries to pgstat. At least nbtree already performs the actual deletion in bulk and we already have (currently unused) space in the pgstat entries for it, so I don't think it'd meaningfully increase overhead. And it'd improve insight into how indexes operate significantly, even leaving autovacuum etc aside. Greetings, Andres Freund
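A rough sketch of what such per-index reporting might look like, with hypothetical field and function names (these are not existing pgstat fields):

typedef struct IndexCleanupStats
{
    long entries_inserted;
    long entries_killed_opportunistically;  /* kill_prior_tuple / bottom-up deletion */
    long entries_removed_by_vacuum;
} IndexCleanupStats;

/* nbtree already performs opportunistic deletion in batches, so one bulk
 * report per deletion pass would add little overhead. */
static void
report_opportunistic_deletion(IndexCleanupStats *stats, int ndeleted)
{
    stats->entries_killed_opportunistically += ndeleted;
}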
On Thu, Apr 22, 2021 at 11:44 AM Andres Freund <andres@anarazel.de> wrote: > I'm honestly getting a bit annoyed about this stuff. You're easily annoyed. > Yes it's a cool > improvement, but no, it doesn't mean that there aren't still relevant > issues in important cases. It doesn't help that you repeatedly imply > that people that don't see it your way need to have their view "cleared > up". I don't think that anything that I've said about it contradicts anything that you or Robert said. What I said was that you're missing a couple of important subtleties (or that you seem to be). It's not really about the optimization in particular -- it's about the subtleties that it exploits. I think that they're generalizable. Even if there was only a 1% chance of that being true, it would still be worth exploring in depth. I think that everybody's beliefs about VACUUM tend to be correct. It almost doesn't matter if scenario A is the problem in 90% of cases versus 10% of cases for scenario B (or vice-versa). What actually matters is that we have good handling for both. (It's probably some weird combination of scenario A and scenario B in any case.) > "Bottom up index deletion" is practically *irrelevant* for a significant > set of workloads. You're missing the broader point. Which is that we don't know how much it helps in each case, just as we don't know how much some other complementary optimization helps. It's important to develop complementary techniques precisely because (say) bottom-up index deletion only solves one class of problem. And because it's so hard to predict. I actually went on at length about the cases that the optimization *doesn't* help. Because that'll be a disproportionate source of problems now. And you really need to avoid all of the big sources of trouble to get a really good outcome. Avoiding each and every source of trouble might be much, much more useful than avoiding all but one. > > You both seem to be assuming that everything would be fine if you > > could somehow inexpensively know the total number of undeleted dead > > tuples in each index at all times. > > I don't think we'd need an exact number. Just a reasonable approximation > so we know whether it's worth spending time vacuuming some index. I agree. > You also have to assume that you have roughly evenly distributed index > insertions and deletions. But workloads that insert into some parts of a > value range and delete from another range are common. > > I even would say that *precisely* because "Bottom up index deletion" can > be very efficient in some workloads it is useful to have per-index stats > determining whether an index should be vacuumed or not. Exactly! > Except that heap bloat not index bloat might be the more pressing > concern. Or that there will be no meaningful amount of bottom-up > deletions. Or ... Exactly! -- Peter Geoghegan
On Thu, Apr 22, 2021 at 3:11 PM Peter Geoghegan <pg@bowt.ie> wrote: > I think that everybody's beliefs about VACUUM tend to be correct. It > almost doesn't matter if scenario A is the problem in 90% or cases > versus 10% of cases for scenario B (or vice-versa). What actually > matters is that we have good handling for both. (It's probably some > weird combination of scenario A and scenario B in any case.) I agree strongly with this. In fact, I seem to remember saying similar things to you in the past. If something wins $1 in 90% of cases and loses $5 in 10% of cases, is it a good idea? Well, it depends on how the losses are distributed. If every user can be expected to hit both winning and losing cases with approximately those frequencies, then yes, it's a good idea, because everyone will come out ahead on average. But if 90% of users will see only wins and 10% of users will see only losses, it sucks. That being said, I don't know what this really has to do with the proposal on the table, except in the most general sense. If you're just saying that decoupling stuff is good because different indexes have different needs, I am in agreement, as I said in my OP. It sort of sounded like you were saying that it's not important to try to estimate the number of undeleted dead tuples in each index, which puzzled me, because while knowing doesn't mean everything is wonderful, not knowing it sure seems worse. But I guess maybe that's not what you were saying, so I don't know. I feel like we're in danger of drifting into discussions about whether we're disagreeing with each other rather than, as I would like, focusing on how to design a system for $SUBJECT. -- Robert Haas EDB: http://www.enterprisedb.com
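The arithmetic behind that example, spelled out as a tiny C program; the dollar figures are just the ones from the example above, nothing more:

#include <stdio.h>

int main(void)
{
    /* Case 1: every user sees the same 90/10 mix of wins and losses. */
    double uniform_mix = 0.9 * 1.0 + 0.1 * (-5.0);   /* +$0.40 per event */

    /* Case 2: 90% of users only ever win, 10% only ever lose. */
    double winners = 1.0;
    double losers  = -5.0;

    printf("uniform mix: %+.2f per event, for everyone\n", uniform_mix);
    printf("skewed mix:  %+.2f for most users, %+.2f for the unlucky 10%%\n",
           winners, losers);
    return 0;
}

The aggregate average is positive in both cases; only the second distribution is unacceptable, which is the point.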
Hi, On 2021-04-22 12:15:27 -0400, Robert Haas wrote: > On Wed, Apr 21, 2021 at 5:38 PM Andres Freund <andres@anarazel.de> wrote: > > I'm not sure that's the only way to deal with this. While some form of > > generic "conveyor belt" infrastructure would be a useful building block, > > and it'd be sensible to use it here if it existed, it seems feasible to > > store dead tids in a different way here. You could e.g. have per-heap-vacuum > > files with a header containing LSNs that indicate the age of the > > contents. > > That's true, but I have some reservations about being overly reliant on > the filesystem to provide structure here. There are good reasons to be > worried about bloating the number of files in the data directory. Hmm, > but maybe we could mitigate that. First, we could skip this for small > relations. If you can vacuum the table and all of its indexes using > the naive algorithm in <10 seconds, you probably shouldn't do anything > fancy. That would *greatly* reduce the number of additional files > generated. Second, we could forget about treating them as separate > relation forks and make them some other kind of thing entirely, in a > separate directory. I'm not *too* worried about this issue. IMO the big difference to the cost of additional relation forks is that such files would only exist when the table is modified to a somewhat meaningful degree. IME the practical issues with the number of files due to forks are cases where a huge number of tables that are practically never modified exist. That's not to say that I am sure that some form of "conveyor belt" storage *wouldn't* be the right thing. How were you thinking of dealing with the per-relation aspects of this? One conveyor belt per relation? > especially if we adopted Sawada-san's proposal to skip WAL logging. I > don't know if that proposal is actually a good idea, because it > effectively adds a performance penalty when you crash or fail over, > and that sort of thing can be an unpleasant surprise. But it's > something to think about. I'm doubtful about skipping WAL logging entirely - I'd have to think harder about it, but I think that'd mean we'd restart from scratch after crashes / immediate restarts as well, because we couldn't rely on the contents of the "dead tid" files to be accurate. In addition to the replication issues you mention. > > One thing that you didn't mention so far is that this'd allow us to add > > dead TIDs to the "dead tid" file outside of vacuum too. In some > > workloads most of the dead tuple removal happens as part of on-access > > HOT pruning. While some indexes are likely to see that via the > > killtuples logic, others may not. Being able to have more aggressive > > index vacuum for the one or two bloated index, without needing to rescan > > the heap, seems like it'd be a significant improvement. > > Oh, that's a very interesting idea. It does impose some additional > requirements on any such system, though, because it means you have to > be able to efficiently add single TIDs. For example, you mention a > per-heap-VACUUM file above, but you can't get away with creating a new > file per HOT prune no matter how you arrange things at the FS level. I agree that it'd be an issue, even though I think it's not too common that only one tuple gets pruned. It might be possible to have a per-relation file per backend or such... But yes, we'd definitely have to think about it.
I've previously pondered adding some cross-page batching and deferring of hot pruning in the read case, which I guess might be more advantageous with this. The main reason for thinking about batching & deferring of HOT pruning is that I found during the AIO work that there are speed gains to be had if we pad xlog pages instead of partially filling them - obviously risking increasing WAL usage. One idea to reduce the cost of that was to fill the padded space with actually useful things, like FPIs or hot pruning records. A related speedup opportunity with AIO is to perform useful work while waiting for WAL flushes during commit (i.e. after initiating IO to flush the commit record, but before that IO has completed). > Actually, though, I think the big problem here is deduplication. A > full-blown VACUUM can perhaps read all the already-known-to-be-dead > TIDs into some kind of data structure and avoid re-adding them, but > that's impractical for a HOT prune. What is there to deduplicate during HOT pruning? It seems that hot pruning would need to log all items that it marks dead, but nothing else? And that VACUUM can't yet have put those items onto the dead tuple map, because they weren't dead yet? This actually brings to the fore a question I vaguely had: How are you assuming indexes would access the list of dead tids? As far as I can see the on-disk data would not be fully sorted even without adding things during HOT pruning - the dead tids from a single heap pass will be, but there'll be tids from multiple passes, right? Are you assuming that we'd read the data into memory and then merge-sort between each of the pre-sorted "runs"? Or that we'd read and cache parts of the on-disk data during index checks? > > Have you thought about how we would do the scheduling of vacuums for the > > different indexes? We don't really have useful stats for the number of > > dead index entries to be expected in an index. It'd not be hard to track > > how many entries are removed in an index via killtuples, but > > e.g. estimating how many dead entries there are in a partial index seems > > quite hard (at least without introducing significant overhead). > > No, I don't have any good ideas about that, really. Partial indexes > seem like a hard problem, and so do GIN indexes or other kinds of > things where you may have multiple index entries per heap tuple. We > might have to accept some known-to-be-wrong approximations in such > cases. The gin case seems a bit easier than the partial index case. Keeping stats about the number of new entries in a GIN index doesn't seem too hard, nor does tracking the number of cleaned up index entries. But knowing which indexes are affected when a heap tuple becomes dead seems harder. I guess we could just start doing a stats-only version of ExecInsertIndexTuples() for deletes, but obviously the cost of that is not enticing. Perhaps it'd not be too bad if we only did it when there's an index with predicates? > > > One rather serious objection to this whole line of attack is that we'd > > > ideally like VACUUM to reclaim disk space without using any more, in > > > case that's the motivation for running VACUUM in the first place. > > > > I suspect we'd need a global limit of space used for this data. If above > > that limit we'd switch to immediately performing the work required to > > remove some of that space. > > I think that's entirely the wrong approach.
> On the one hand, it > doesn't prevent you from running out of disk space during emergency > maintenance, because the disk overall can be full even though you're > below your quota of space for this particular purpose. On the other > hand, it does subject you to random breakage when your database gets > big enough that the critical information can't be stored within the > configured quota. What random breakage are you thinking of? I'm not thinking of a hard limit that may not be crossed at any cost, by even a single byte, but that [auto]VACUUMs would start to be more aggressive about performing index vacuums once the limit is reached. > I think we'd end up with pathological cases very much like what used > to happen with the fixed-size free space map. What happened there was > that your database got big enough that you couldn't track all the free > space any more and it just started bloating out the wazoo. What would > happen here is that you'd silently lose the well-optimized version of > VACUUM when your database gets too big. That does not seem like > something anybody wants. I don't think the consequences would really be that comparable. Once the FSM size was reached in the bad old days, we'd just lose track of free space. Whereas here we'd start to be more aggressive about cleaning up once the "dead tids" data reaches a certain size. Of course that would have efficiency impacts, but I think "global free space wasted" is a valid input in deciding when to perform index vacuums. I think max_wal_size has worked out pretty well, even if not perfect. Greetings, Andres Freund
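On the question raised above about merging pre-sorted runs, here is a minimal standalone sketch of what stream-merging the per-heap-pass runs could look like. DeadTid, TidRun, and next_merged_tid are hypothetical names used only for illustration:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct DeadTid { uint32_t block; uint16_t offset; } DeadTid;

typedef struct TidRun
{
    const DeadTid *tids;   /* one heap pass worth of TIDs, already in heap order */
    size_t         ntids;
    size_t         pos;
} TidRun;

static bool
tid_less(const DeadTid *a, const DeadTid *b)
{
    if (a->block != b->block)
        return a->block < b->block;
    return a->offset < b->offset;
}

/* Return the next TID in global heap order across all runs; false when every
 * run is exhausted.  A real implementation would use a binary heap for many
 * runs; a linear scan keeps the sketch short. */
static bool
next_merged_tid(TidRun *runs, int nruns, DeadTid *out)
{
    int best = -1;

    for (int i = 0; i < nruns; i++)
    {
        if (runs[i].pos >= runs[i].ntids)
            continue;
        if (best < 0 ||
            tid_less(&runs[i].tids[runs[i].pos], &runs[best].tids[runs[best].pos]))
            best = i;
    }
    if (best < 0)
        return false;
    *out = runs[best].tids[runs[best].pos++];
    return true;
}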
On Thu, Apr 22, 2021 at 12:27 PM Robert Haas <robertmhaas@gmail.com> wrote: > I agree strongly with this. In fact, I seem to remember saying similar > things to you in the past. If something wins $1 in 90% of cases and > loses $5 in 10% of cases, is it a good idea? Well, it depends on how > the losses are distributed. If every user can be expected to hit both > winning and losing cases with approximately those frequencies, then > yes, it's a good idea, because everyone will come out ahead on > average. But if 90% of users will see only wins and 10% of users will > see only losses, it sucks. Right. It's essential that we not disadvantage any workload by more than a small fixed amount (and only with a huge upside elsewhere). The even more general version is this: the average probably doesn't even exist in any meaningful sense. Bottom-up index deletion tends to be effective either 100% of the time or 0% of the time, which varies on an index by index basis. Does that mean we should split the difference, and assume that it's effective 50% of the time? Clearly not. Clearly that particular framing is just wrong. And clearly it basically doesn't matter if it's half of all indexes, or a quarter, or none, whatever. Because it's all of those proportions, and also because who cares. > That being said, I don't know what this really has to do with the > proposal on the table, except in the most general sense. If you're > just saying that decoupling stuff is good because different indexes > have different needs, I am in agreement, as I said in my OP. Mostly what I'm saying is that I would like to put together a rough list of things that we could do to improve VACUUM along the lines we've discussed -- all of which stem from $SUBJECT. There are literally dozens of goals (some of which are quite disparate) that we could conceivably set out to pursue under the banner of $SUBJECT. Ideally there would be soft agreement about which ideas were more promising. Ideally we'd avoid painting ourselves into a corner with respect to one of these goals, in pursuit of any other goal. I suspect that we'll need somewhat more of a top-down approach to this work, which is something that we as a community don't have much experience with. It might be useful to set the parameters of the discussion up-front, which seems weird to me too, but might actually help. (A lot of the current problems with VACUUM seem like they might be consequences of pgsql-hackers not usually working like this.) > It sort > of sounded like you were saying that it's not important to try to > estimate the number of undeleted dead tuples in each index, which > puzzled me, because while knowing doesn't mean everything is > wonderful, not knowing it sure seems worse. But I guess maybe that's > not what you were saying, so I don't know. I agree that it matters that we are able to characterize how bloated a partial index is, because an improved VACUUM implementation will need to know that. My main point about that was that it's complicated in surprising ways that actually matter. An approximate solution seems quite possible to me, but I think that that will probably have to involve the index AM directly. Sometimes 10% - 30% of the extant physical index tuples will be dead and it'll be totally fine in every practical sense -- the index won't have grown by even one page since the last VACUUM! 
Other times it might be as few as 2% - 5% that are now dead when VACUUM is considered, which will in fact be a serious problem (e.g., it's concentrated in one part of the keyspace, say). I would say that having some rough idea of which case we have on our hands is extremely important here. Even if the distinction only arises in rare cases (though FWIW I don't think that these differences will be rare at all). (I also tried to clarify what I mean about qualitative bloat in passing in my response about the case of a bloated partial index.) > I feel like we're in danger > of drifting into discussions about whether we're disagreeing with each > other rather than, as I would like, focusing on how to design a system > for $SUBJECT. While I am certainly guilty of being kind of hand-wavy and talking about lots of stuff all at once here, it's still kind of unclear what practical benefits you hope to attain through $SUBJECT. Apart from the thing about global indexes, which matters but is hardly the overwhelming reason to do all this. I myself don't expect your goals to be super crisp just yet. As I said, I'm happy to talk about it in very general terms at first -- isn't that what you were doing yourself? Or did I misunderstand -- are global indexes mostly all that you're thinking about here? (Even if they are all you care about, it still seems like you're still somewhat obligated to generalize the dead TID fork/map thing to help with a bunch of other things, just to justify the complexity of adding a dead TID relfork.) -- Peter Geoghegan
On Thu, Apr 22, 2021 at 11:16 AM Robert Haas <robertmhaas@gmail.com> wrote: > > My most ambitious goal is finding a way to remove the need to freeze > > or to set hint bits. I think that we can do this by inventing a new > > kind of VACUUM just for aborted transactions, which doesn't do index > > vacuuming. You'd need something like an ARIES-style dirty page table > > to make this cheap -- so it's a little like UNDO, but not very much. > > I don't see how that works. An aborted transaction can have made index > entries, and those index entries can have already been moved by page > splits, and there can be arbitrarily many of them, so that you can't > keep track of them all in RAM. Also, you can crash after making the > index entries and writing them to the disk and before the abort > happens. Anyway, this is probably a topic for a separate thread. This is a topic for a separate thread, but I will briefly address your question. Under the scheme I've sketched, we never do index vacuuming when invoking an autovacuum worker (or something like it) to clean-up after an aborted transaction. We track the pages that all transactions have modified. If a transaction commits then we quickly discard the relevant dirty page table metadata. If a transaction aborts (presumably a much rarer event), then we launch an autovacuum worker that visits precisely those heap blocks that were modified by the aborted transaction, and just prune each page, one by one. We have a cutoff that works a little like relfrozenxid, except that it tracks the point in the XID space before which we know any XIDs (any XIDs that we can read from extant tuple headers) must be committed. The idea of a "Dirty page table" is standard ARIES. It'd be tricky to get it working, but still quite possible. The overall goal of this design is for the system to be able to reason about committed-ness inexpensively (to obviate the need for hint bits and per-tuple freezing). We want to be able to say for sure that almost all heap blocks in the database only contain heap tuples whose headers contain only committed XIDs, or LP_DEAD items that are simply dead (the exact provenance of these LP_DEAD items is not a concern, just like today). The XID cutoff for committed-ness could be kept quite recent due to the fact that aborted transactions are naturally rare. And because we can do relatively little work to "logically roll back" aborted transactions. Note that a heap tuple whose xmin and xmax are committed might also be dead under this scheme, since of course it might have been updated or deleted by an xact that committed. We've effectively decoupled things by making aborted transactions special, and subject to very eager cleanup. I'm sure that there are significant challenges with making something like this work. But to me this design seems roughly the right combination of radical and conservative. -- Peter Geoghegan
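A deliberately simplified, standalone sketch of that "dirty page table for aborted transactions" idea follows. None of these types or functions exist in PostgreSQL, and a real version would have to be crash-safe (the ARIES part), which is ignored here:

#include <stdint.h>
#include <stdlib.h>

typedef uint32_t BlockNumber;

typedef struct XactDirtyPages
{
    BlockNumber *blocks;    /* heap blocks this transaction modified */
    int          nblocks;
    int          capacity;
} XactDirtyPages;

/* Remember that the transaction touched blkno (duplicates tolerated here). */
static void
record_modified_block(XactDirtyPages *dpt, BlockNumber blkno)
{
    if (dpt->nblocks == dpt->capacity)
    {
        dpt->capacity = dpt->capacity ? dpt->capacity * 2 : 16;
        dpt->blocks = realloc(dpt->blocks, dpt->capacity * sizeof(BlockNumber));
    }
    dpt->blocks[dpt->nblocks++] = blkno;
}

/* Commit is the common case: the metadata is simply thrown away. */
static void
on_commit(XactDirtyPages *dpt)
{
    free(dpt->blocks);
    dpt->blocks = NULL;
    dpt->nblocks = dpt->capacity = 0;
}

/* Abort: hand exactly these blocks to a worker that prunes them one by one,
 * with no full table scan and no index vacuuming. */
static void
on_abort(const XactDirtyPages *dpt, void (*prune_block)(BlockNumber blkno))
{
    for (int i = 0; i < dpt->nblocks; i++)
        prune_block(dpt->blocks[i]);
}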
Thread: Getting rid of freezing and hint bits by eagerly vacuuming aborted xacts (was: decoupling table and index vacuum)
From: Peter Geoghegan
On Thu, Apr 22, 2021 at 3:52 PM Peter Geoghegan <pg@bowt.ie> wrote: > On Thu, Apr 22, 2021 at 11:16 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > My most ambitious goal is finding a way to remove the need to freeze > > > or to set hint bits. I think that we can do this by inventing a new > > > kind of VACUUM just for aborted transactions, which doesn't do index > > > vacuuming. You'd need something like an ARIES-style dirty page table > > > to make this cheap -- so it's a little like UNDO, but not very much. > > > > I don't see how that works. An aborted transaction can have made index > > entries, and those index entries can have already been moved by page > > splits, and there can be arbitrarily many of them, so that you can't > > keep track of them all in RAM. Also, you can crash after making the > > index entries and writing them to the disk and before the abort > > happens. Anyway, this is probably a topic for a separate thread. > > This is a topic for a separate thread, but I will briefly address your question. > > Under the scheme I've sketched, we never do index vacuuming when > invoking an autovacuum worker (or something like it) to clean-up after > an aborted transaction. We track the pages that all transactions have > modified. If a transaction commits then we quickly discard the > relevant dirty page table metadata. If a transaction aborts > (presumably a much rarer event), then we launch an autovacuum worker > that visits precisely those heap blocks that were modified by the > aborted transaction, and just prune each page, one by one. We have a > cutoff that works a little like relfrozenxid, except that it tracks > the point in the XID space before which we know any XIDs (any XIDs > that we can read from extant tuple headers) must be committed. > > The idea of a "Dirty page table" is standard ARIES. It'd be tricky to > get it working, but still quite possible. > > The overall goal of this design is for the system to be able to reason > about committed-ness inexpensively (to obviate the need for hint bits > and per-tuple freezing). We want to be able to say for sure that > almost all heap blocks in the database only contain heap tuples whose > headers contain only committed XIDs, or LP_DEAD items that are simply > dead (the exact provenance of these LP_DEAD items is not a concern, > just like today). The XID cutoff for committed-ness could be kept > quite recent due to the fact that aborted transactions are naturally > rare. And because we can do relatively little work to "logically roll > back" aborted transactions. > > Note that a heap tuple whose xmin and xmax are committed might also be > dead under this scheme, since of course it might have been updated or > deleted by an xact that committed. We've effectively decoupled things > by making aborted transactions special, and subject to very eager > cleanup. > > I'm sure that there are significant challenges with making something > like this work. But to me this design seems roughly the right > combination of radical and conservative. I'll start a new thread now, as a placeholder for further discussion. This would be an incredibly ambitious project, and I'm sure that this thread will be very hand-wavy at first. But you've got to start somewhere. -- Peter Geoghegan
On Fri, Apr 23, 2021 at 3:47 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Apr 22, 2021 at 10:28 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > The dead TID fork needs to also be efficiently searched. If the heap > > scan runs twice, the collected dead TIDs on each heap pass could be > > overlapped. But we would not be able to merge them if we did index > > vacuuming on one of indexes at between those two heap scans. The > > second time heap scan would need to record only TIDs that are not > > collected by the first time heap scan. > > I agree that there's a problem here. It seems to me that it's probably > possible to have a dead TID fork that implements "throw away the > oldest stuff" efficiently, and it's probably also possible to have a > TID fork that can be searched efficiently. However, I am not sure that > it's possible to have a dead TID fork that does both of those things > efficiently. Maybe you have an idea. My intuition is that if we have > to pick one, it's MUCH more important to be able to throw away the > oldest stuff efficiently. I think we can work around the lack of > efficient lookup, but I don't see a way to work around the lack of an > efficient operation to discard the oldest stuff. Agreed. I think we can divide the TID fork into 16MB or 32MB chunks like WAL segment files so that we can easily remove old chunks. Regarding the efficient search part, I think we need to consider the case where the TID fork gets bigger than maintenance_work_mem. In that case, during the heap scan, we need to check if the discovered TID exists in a chunk of the TID fork that could be on the disk. Even if all known-dead-TIDs are loaded into an array on the memory, it could get much slower than the current heap scan to bsearch over the array for each dead TID discovered during heap scan. So it would be better to have a way to skip searching by already recorded TIDs. For example, during heap scan or HOT pruning, I think that when marking TIDs dead and recording it to the dead TID fork we can mark them “dead and recorded” instead of just “dead” so that future heap scans can skip those TIDs without existence check. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Fri, Apr 23, 2021 at 7:04 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > I think we can divide the TID fork into 16MB or 32MB chunks like WAL > segment files so that we can easily remove old chunks. Regarding the > efficient search part, I think we need to consider the case where the > TID fork gets bigger than maintenance_work_mem. In that case, during > the heap scan, we need to check if the discovered TID exists in a > chunk of the TID fork that could be on the disk. Even if all > known-dead-TIDs are loaded into an array on the memory, it could get > much slower than the current heap scan to bsearch over the array for > each dead TID discovered during heap scan. So it would be better to > have a way to skip searching by already recorded TIDs. For example, > during heap scan or HOT pruning, I think that when marking TIDs dead > and recording it to the dead TID fork we can mark them “dead and > recorded” instead of just “dead” so that future heap scans can skip > those TIDs without existence check. I'm not very excited about this. If we did this, and if we ever generated dead-but-not-recorded TIDs, then we will potentially dirty those blocks again later to mark them recorded. Also, if bsearch() is a bottleneck, how about just using an O(1) algorithm instead of an O(lg n) algorithm, rather than changing the on-disk format? Also, can you clarify exactly what you think the problem case is here? It seems to me that: 1. If we're pruning the heap to collect dead TIDs, we should stop when the number of TIDs we've accumulated reaches maintenance_work_mem. It is possible that we could find when starting to prune that there are *already* more dead TIDs than will fit, because maintenance_work_mem might have been reduced since they were gathered. But that's OK: we can figure out that there are more than will fit without loading them all, and since we shouldn't do additional pruning in this case, there's no issue. 2. If we're sanitizing indexes, we should normally discover that there are few enough TIDs that we can still fit them all in memory. But if that proves not to be the case, again because for example maintenance_work_mem has been reduced, then we can handle that with multiple index passes just as we do today. 3. If we're going back to the heap to permit TIDs to be recycled by setting dead line pointers to unused, we can load in as many of those as will fit in maintenance_work_mem, sort them by block number, and go through block by block and DTRT. Then, we can release all that memory and, if necessary, do the whole thing again. This isn't even particularly inefficient. -- Robert Haas EDB: http://www.enterprisedb.com
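Point 3 above, sketched as standalone C with placeholder helpers; read_next_recyclable_tids and set_unused are assumed to exist only for the sake of the example and are not real functions:

#include <stdint.h>
#include <stdlib.h>

typedef struct DeadTid { uint32_t block; uint16_t offset; } DeadTid;

/* Placeholders: fetch the next batch of TIDs that every index has already
 * removed, and mark one line pointer unused. */
extern size_t read_next_recyclable_tids(DeadTid *buf, size_t max_tids);
extern void   set_unused(uint32_t block, uint16_t offset);

static int
cmp_by_block(const void *a, const void *b)
{
    const DeadTid *x = a;
    const DeadTid *y = b;

    if (x->block != y->block)
        return (x->block < y->block) ? -1 : 1;
    if (x->offset != y->offset)
        return (x->offset < y->offset) ? -1 : 1;
    return 0;
}

/* Load as many recyclable TIDs as fit in memory, sort them by block so each
 * heap page is visited once per batch, process them, then repeat. */
static void
recycle_in_batches(size_t max_tids_in_memory)
{
    DeadTid *buf = malloc(max_tids_in_memory * sizeof(DeadTid));
    size_t   n;

    if (buf == NULL)
        return;
    while ((n = read_next_recyclable_tids(buf, max_tids_in_memory)) > 0)
    {
        qsort(buf, n, sizeof(DeadTid), cmp_by_block);
        for (size_t i = 0; i < n; i++)
            set_unused(buf[i].block, buf[i].offset);
    }
    free(buf);
}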
On Thu, Apr 22, 2021 at 4:52 PM Peter Geoghegan <pg@bowt.ie> wrote: > Mostly what I'm saying is that I would like to put together a rough > list of things that we could do to improve VACUUM along the lines > we've discussed -- all of which stem from $SUBJECT. There are > literally dozens of goals (some of which are quite disparate) that we > could conceivably set out to pursue under the banner of $SUBJECT. I hope not. I don't have a clue why there would be dozens of possible goals here, or why it matters. I think if we're going to do something like $SUBJECT, we should just concentrate on the best way to make that particular change happen with minimal change to anything else. Otherwise, we risk conflating this engineering effort with others that really should be separate endeavors. For example, as far as possible, I think it would be best to try to do this without changing the statistics that are currently gathered, and just make the best decisions we can with the information we already have. Ideally, I'd like to avoid introducing a new kind of relation fork that uses a different on-disk storage format (e.g. 16MB segments that are dropped from the tail) rather than the one used by the other forks, but I'm not sure we can get away with that, because conveyor-belt storage looks pretty appealing here. Regardless, the more we have to change to accomplish the immediate goal, the more likely we are to introduce instability into places where it could have been avoided, or to get tangled up in endless bikeshedding. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Apr 23, 2021 at 8:44 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Apr 22, 2021 at 4:52 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Mostly what I'm saying is that I would like to put together a rough > > list of things that we could do to improve VACUUM along the lines > > we've discussed -- all of which stem from $SUBJECT. There are > > literally dozens of goals (some of which are quite disparate) that we > > could conceivably set out to pursue under the banner of $SUBJECT. > > I hope not. I don't have a clue why there would be dozens of possible > goals here, or why it matters. Not completely distinct goals, for the most part, but I can certainly see dozens of benefits. For example, if we know before index vacuuming starts that heap vacuuming definitely won't go ahead (quite possible when we decide that we're only vacuuming a subset of indexes), we can then tell the index AM about that fact. It can then safely vacuum in an "approximate" fashion, for example by skipping pages whose LSNs are from before the last VACUUM, and by not bothering with a super-exclusive lock in the case of nbtree. The risk of a conflict between this goal and another goal that we may want to pursue (which might be a bit contrived) is that we fail to do the right thing when a large range deletion has taken place, which must be accounted for in the statistics, but creates a tension with the global index stuff. It's probably only safe to do this when we know that there have been hardly any DELETEs. There is also the question of how the TID map thing interacts with the visibility map, and how that affects how VACUUM behaves (both in general and in order to attain some kind of specific new benefit from this synergy). Who knows? We're never going to get on exactly the same page, but some rough idea of which page each of us is on might save everybody time. The stuff that I went into about making aborted transactions special as a means of decoupling transaction status management from garbage collection is arguably totally unrelated -- perhaps it's just too much of a stretch to link that to what you want to do now. I suppose it's hard to invest the time to engage with me on that stuff, and I wouldn't be surprised if you never did so. If it doesn't happen it would be understandable, though quite possibly a missed opportunity for both of us. My basic intuition there is that it's another variety of decoupling, so (for better or worse) it does *seem* related to me. (I am an intuitive thinker, which has advantages and disadvantages.)
I'll add the basic idea of doing index vacuuming for some indexes but not others to the list -- this will necessitate that we teach index AMs to assess how much bloat the index has accumulated since the last VACUUM, which presumably must work in some generalized, composable way. > For example, as far as possible, > I think it would be best to try to do this without changing the > statistics that are currently gathered, and just make the best > decisions we can with the information we already have. I have no idea if that's the right way to do it. In any case the statistics that we gather influence the behavior of autovacuum.c, but nothing stops us from doing our own dynamic gathering of statistics to decide what we should do within vacuumlazy.c each time. We don't have to change the basic triggering conditions to change the work each VACUUM performs. As I've said before, I think that we're likely to get more benefit (at least at first) from making the actual reality of what VACUUM does simpler and more predictable in practice than we are from changing how reality is modeled inside autovacuum.c. I'll go further with that now: if we do change that modeling at some point, I think that it should work in an additive way, which can probably be compatible with how the statistics and so on work already. For example, maybe vacuumlazy.c asks autovacuum.c to do a VACUUM earlier next time. This can be structured as an exception to the general rule of autovacuum scheduling, probably -- something that occurs when it becomes evident that the generic schedule isn't quite cutting it in some important, specific way. > Ideally, I'd > like to avoid introducing a new kind of relation fork that uses a > different on-disk storage format (e.g. 16MB segments that are dropped > from the tail) rather than the one used by the other forks, but I'm > not sure we can get away with that, because conveyor-belt storage > looks pretty appealing here. No opinion on that just yet. > Regardless, the more we have to change to > accomplish the immediate goal, the more likely we are to introduce > instability into places where it could have been avoided, or to get > tangled up in endless bikeshedding. Certainly true. I'm not really trying to convince you of specific actionable points just yet, though. Perhaps that was the problem (or perhaps it simply led to miscommunication). It would be so much easier to discuss some of this stuff at an event like pgCon. Oh well. -- Peter Geoghegan
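A small sketch of the "approximate" index vacuuming idea mentioned above -- skipping leaf pages by LSN when no second heap pass will follow. The types and the helper are hypothetical; only the idea comes from the discussion:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

typedef struct LeafPage
{
    XLogRecPtr page_lsn;   /* LSN of the last change to this page */
    /* ... tuples elided ... */
} LeafPage;

static bool
leaf_page_needs_visit(const LeafPage *page,
                      XLogRecPtr lsn_of_previous_vacuum,
                      bool heap_pass_will_follow)
{
    /* If dead TIDs will later be set LP_UNUSED, every page must be visited so
     * no index entry can be left pointing at a recycled line pointer. */
    if (heap_pass_will_follow)
        return true;

    /* Otherwise, a page untouched since the previous VACUUM has accumulated
     * no new garbage and can be skipped. */
    return page->page_lsn > lsn_of_previous_vacuum;
}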
On Thu, Apr 22, 2021 at 1:01 PM Andres Freund <andres@anarazel.de> wrote: > The gin case seems a bit easier than the partial index case. Keeping > stats about the number of new entries in a GIN index doesn't seem too > hard, nor does tracking the number of cleaned up index entries. But > knowing which indexes are affected when a heap tuple becomes dead seems > harder. I guess we could just start doing a stats-only version of > ExecInsertIndexTuples() for deletes, but obviously the cost of that is > not enticing. Perhaps it'd not be too bad if we only did it when there's > an index with predicates? Though I agree that we need some handling here, I doubt that an index with a predicate is truly a special case. Suppose you have a partial index that covers 10% of the table. How is that meaningfully different from an index without a predicate that is otherwise equivalent? If the churn occurs in the same keyspace in either case, and if that's the part of the keyspace that queries care about, then ISTM that the practical difference is fairly insignificant. (If you have some churn all over the standard index but queries are only interested in the same 10% of the full keyspace, then this will be less true, but still roughly true.) There is an understandable tendency to focus on the total size of the index in each case, and to be alarmed that the partial index has (say) doubled in size, while at the same time not being overly concerned about lower *proportionate* growth for the standard index case (assuming otherwise identical workload/conditions). The page splits that affect the same 10% of the keyspace will be approximately as harmful in each case, though. We can expect the same growth in leaf pages in each case, which will look very similar. It should be obvious that it's somewhat of a problem that 90% of the standard index is apparently not useful (this is an unrelated problem). But if the DBA fixes this unrelated problem (by making the standard index a partial index), surely it would be absurd to then conclude that that helpful intervention somehow had the effect of making the index bloat situation much worse! I think that a simple heuristic could work very well here, but it needs to be at least a little sensitive to the extremes. And I mean all of the extremes, not just the one from my example -- every variation exists and will cause problems if given zero weight. -- Peter Geoghegan
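A tiny numeric illustration of that point, with made-up sizes: the proportional growth looks much worse for the partial index, but the absolute number of version-churn page splits -- the thing that actually hurts -- is identical:

#include <stdio.h>

int main(void)
{
    double standard_index_pages = 100000; /* index on the whole table (assumed) */
    double partial_index_pages  = 10000;  /* index on the interesting 10% only (assumed) */
    double split_pages          = 5000;   /* leaf pages split by version churn in the
                                           * hot part of the keyspace, same in both */

    printf("standard index growth: +%.1f%%\n",
           100.0 * split_pages / standard_index_pages);
    printf("partial index growth:  +%.1f%%\n",
           100.0 * split_pages / partial_index_pages);
    printf("unnecessary page splits in both cases: %.0f\n", split_pages);
    return 0;
}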
On Fri, Apr 23, 2021 at 1:04 PM Peter Geoghegan <pg@bowt.ie> wrote: > I think that a simple heuristic could work very well here, but it > needs to be at least a little sensitive to the extremes. And I mean > all of the extremes, not just the one from my example -- every > variation exists and will cause problems if given zero weight. To expand on this a bit, my objection to counting the number of live tuples in the index (as a means to determining how aggressively each individual index needs to be vacuumed) is this: it's driven by positive feedback, not negative feedback. We should focus on *extreme* adverse events (e.g., version-driven page splits) instead. We don't even need to understand ordinary adverse events (e.g., how many dead tuples are in the index). The cost of accumulating dead tuples in an index (could be almost any index AM) grows very slowly at first, and then suddenly explodes (actually it's more like a cascade of correlated explosions, but for the purposes of this explanation that doesn't matter). In a way, this makes life easy for us. The cost of accumulating dead tuples rises so dramatically at a certain inflection point that we can reasonably assume that that's all that matters -- just stop the explosions. An extremely simple heuristic that prevents these extreme adverse events can work very well because that's where almost all of the possible downside is. We can be sure that these extreme adverse events are universally very harmful (workload doesn't matter). Note that the same is not true for an approach driven by positive feedback -- it'll be fragile because it depends on workload characteristics in unfathomably many ways. We should focus on what we can understand with a high degree of confidence. We just need to identify what the extreme adverse event is in each index AM, count them, and focus on those (could be a VACUUM thing, could be local to the index AM like bottom-up deletion is). We need to notice when things are *starting* to go really badly and intervene aggressively. So we need to be willing to try a generic index vacuuming strategy first, and then notice that it has just failed, or is just about to fail. Something like version-driven page splits really shouldn't ever happen, so even a very crude approach will probably work very well. -- Peter Geoghegan
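A minimal sketch of what triggering on such an extreme adverse event could look like, using a hypothetical counter (nothing like this exists in the statistics today):

#include <stdbool.h>

typedef struct IndexVacuumSignal
{
    long version_driven_splits;  /* splits believed to have happened only because of
                                  * dead versions that deletion could not clear */
    long total_splits;
} IndexVacuumSignal;

static bool
index_needs_aggressive_vacuum(const IndexVacuumSignal *sig)
{
    /* The threshold is deliberately tiny: this event "shouldn't ever happen",
     * so even a handful of occurrences is strong negative feedback. */
    return sig->version_driven_splits > 0;
}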
Hi, On 2021-04-24 11:21:49 -0700, Peter Geoghegan wrote: > To expand on this a bit, my objection to counting the number of live > tuples in the index (as a means to determining how aggressively each > individual index needs to be vacuumed) is this: it's driven by > positive feedback, not negative feedback. We should focus on *extreme* > adverse events (e.g., version-driven page splits) instead. We don't > even need to understand ordinary adverse events (e.g., how many dead > tuples are in the index). I don't see how that's good enough as a general approach. It won't work on indexes that insert on one end, delete from the other (think inserted_at or serial primary keys in many workloads). Greetings, Andres Freund
On Sat, Apr 24, 2021 at 11:43 AM Andres Freund <andres@anarazel.de> wrote: > I don't see how that's good enough as a general approach. It won't work > on indexes that insert on one end, delete from the other (think > inserted_at or serial primary keys in many workloads). That can be treated as another extreme that we need to treat as negative feedback. There may also be other types of negative feedback that occur only in some index AMs, that neither of us have thought of just yet. But that's okay -- we can just add that to the list. Some varieties of negative feedback might be much more common in practice than others. This shouldn't matter. The number of live tuples (or even dead tuples) in the whole entire index is simply not a useful proxy for what actually matters -- this is 100% clear. There are many cases where this will do completely the wrong thing, even if we have perfectly accurate information. I'll spare you a repeat of the example of bottom-up index deletion and "Schrodinger's dead index tuples" (it's not the only example, just the purest). -- Peter Geoghegan
Hi, On 2021-04-24 11:59:29 -0700, Peter Geoghegan wrote: > The number of live tuples (or even dead tuples) in the whole entire > index is simply not a useful proxy for what actually matters -- this > is 100% clear. Did anybody actually argue for using #live entries directly? I think *dead* entries are more relevant, particularly because various forms of local cleanup can be taken into account. Live tuples might come in to put the number of dead tuples into perspective, but otherwise not that much? > There are many cases where this will do completely the wrong thing, > even if we have perfectly accurate information. Imo the question isn't really whether criteria will ever do something wrong, but how often and how consequential such mistakes will be. E.g. unnecessarily vacuuming an index isn't fun, but it's better than ending up never cleaning up dead index pointers despite repeat accesses (think bitmap scans). Greetings, Andres Freund
On Sat, Apr 24, 2021 at 12:56 PM Andres Freund <andres@anarazel.de> wrote: > Did anybody actually argue for using #live entries directly? I think > *dead* entries are more relevant, particularly because various forms of > local cleanup can be taken into account. Live tuples might come in to > put the number of dead tuples into perspective, but otherwise not that > much? I was unclear. I can't imagine how you'd do anything like this without using both together. Or if you didn't use live tuples you'd use heap blocks instead. Something like that. > > There are many cases where this will do completely the wrong thing, > > even if we have perfectly accurate information. > > Imo the question isn't really whether criteria will ever do something > wrong, but how often and how consequential such mistakes will > be. E.g. unnecessarily vacuuming an index isn't fun, but it's better > than ending up never cleaning up dead index pointers despite repeat > accesses (think bitmap scans). I strongly agree. The risk with what I propose is that we'd somehow overlook a relevant extreme cost. But I think that that's an acceptable risk. Plus I see no workable alternative -- your "indexes that insert on one end, delete from the other" example works much better as an argument against what you propose than an argument against my own alternative proposal. Which reminds me: how would your framework for index bloat/skipping indexes in VACUUM cope with this same scenario? Though I don't think that it's useful to use quantitative thinking as a starting point here, that doesn't mean there is exactly zero role for it. Not sure about how far I'd go here. But I would probably not argue that we shouldn't vacuum an index that is known to (say) be more than 60% dead tuples. I guess I'd always prefer to have a better metric, but speaking hypothetically: Why take a chance? This is not because it's definitely worth it -- it really isn't! It's just because the benefit of being right is low compared to the cost of being wrong -- as you point out, that is really important.
On Sat, Apr 24, 2021 at 1:17 PM Peter Geoghegan <pg@bowt.ie> wrote: > On Sat, Apr 24, 2021 at 12:56 PM Andres Freund <andres@anarazel.de> wrote: > > Imo the question isn't really whether criteria will ever do something > > wrong, but how often and how consequential such mistakes will > > be. E.g. unnecessarily vacuuming an index isn't fun, but it's better > > than ending up never cleaning up dead index pointers despite repeat > > accesses (think bitmap scans). > > I strongly agree. The risk with what I propose is that we'd somehow > overlook a relevant extreme cost. But I think that that's an > acceptable risk. IMV the goal here is not really to skip index vacuuming when it's unnecessary. The goal is to do *more* index vacuuming when and where it *is* necessary (in one problematic index among several) -- maybe even much much more. We currently treat index vacuuming as an all-or-nothing thing at the level of the table, which makes this impossible. This is another reason why we can be pretty conservative about skipping. We only need to skip index vacuuming for those indexes that we're pretty confident just don't need it -- that's sufficient to be able to do vastly more index vacuuming where it is needed in almost all cases. There is some gray area, but that seems much less interesting to me. -- Peter Geoghegan
On Sat, Apr 24, 2021 at 12:22 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Apr 23, 2021 at 7:04 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > I think we can divide the TID fork into 16MB or 32MB chunks like WAL > > segment files so that we can easily remove old chunks. Regarding the > > efficient search part, I think we need to consider the case where the > > TID fork gets bigger than maintenance_work_mem. In that case, during > > the heap scan, we need to check if the discovered TID exists in a > > chunk of the TID fork that could be on the disk. Even if all > > known-dead-TIDs are loaded into an array on the memory, it could get > > much slower than the current heap scan to bsearch over the array for > > each dead TID discovered during heap scan. So it would be better to > > have a way to skip searching by already recorded TIDs. For example, > > during heap scan or HOT pruning, I think that when marking TIDs dead > > and recording it to the dead TID fork we can mark them “dead and > > recorded” instead of just “dead” so that future heap scans can skip > > those TIDs without existence check. > > I'm not very excited about this. If we did this, and if we ever > generated dead-but-not-recorded TIDs, then we will potentially dirty > those blocks again later to mark them recorded. The idea I imagined is that we always mark a TID recorded at the same time as marking it dead, so we don't dirty the page again -- but, yes, if we do that, the recorded flag is not necessary. We can simply assume that a TID marked dead has been recorded in the TID fork. Future vacuums can skip TIDs that are already marked dead. > > Also, if bsearch() is a bottleneck, how about just using an O(1) > algorithm instead of an O(lg n) algorithm, rather than changing the > on-disk format? > > Also, can you clarify exactly what you think the problem case is here? > It seems to me that: > > 1. If we're pruning the heap to collect dead TIDs, we should stop when > the number of TIDs we've accumulated reaches maintenance_work_mem. It > is possible that we could find when starting to prune that there are > *already* more dead TIDs than will fit, because maintenance_work_mem > might have been reduced since they were gathered. But that's OK: we > can figure out that there are more than will fit without loading them > all, and since we shouldn't do additional pruning in this case, > there's no issue. The case I'm thinking of is that pruning the heap and sanitizing indexes run concurrently -- as you mentioned, concurrency is one of the benefits of decoupling the vacuum phases. In that case, one process is doing index vacuuming using the known dead TIDs in the TID fork while another process is appending new dead TIDs. We could suspend heap pruning until the TID fork gets smaller, as you mentioned, but that seems inefficient. > > 2. If we're sanitizing indexes, we should normally discover that there > are few enough TIDs that we can still fit them all in memory. But if > that proves not to be the case, again because for example > maintenance_work_mem has been reduced, then we can handle that with > multiple index passes just as we do today. Yeah, there seems to be room for improvement, but it's no worse than today. I imagine users will want to set a high maintenance_work_mem for sanitizing a global index, separately from the setting for heap pruning. > > 3. 
If we're going back to the heap to permit TIDs to be recycled by > setting dead line pointers to unused, we can load in as many of those > as will fit in maintenance_work_mem, sort them by block number, and go > through block by block and DTRT. Then, we can release all that memory > and, if necessary, do the whole thing again. This isn't even > particularly inefficient. Agreed. Just an idea: while pruning the heap, we could buffer the collected dead TIDs before writing them out to the TID fork, so that we can sort the dead TIDs within a chunk (say a 16MB chunk consisting of 8KB blocks). We would write the chunk to disk either when it is filled with dead TIDs or when index sanitizing starts. The latter timing is needed to remember the chunk ID (or the uint64 ID of the dead TID) indicating how far index sanitizing has removed dead TIDs. One of the benefits would be to reduce the disk I/O for the dead TID fork. Another would be that we're likely to complete the recycle phase in one heap scan, since we load only one block per chunk while scanning the heap. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
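To make the chunk idea above a bit more concrete, here is a minimal sketch of buffering and sorting dead TIDs per chunk. It is purely illustrative: DeadTidChunk, deadtid_chunk_add, and write_chunk_to_tid_fork are made-up names, and any real patch might organize this quite differently.

#include "postgres.h"
#include "storage/itemptr.h"

/* Hypothetical in-memory buffer for one 16MB conveyor-belt chunk. */
typedef struct DeadTidChunk
{
    uint64          first_id;   /* conveyor-belt ID assigned to tids[0] */
    int             ntids;      /* number of TIDs currently buffered */
    int             maxtids;    /* capacity: roughly 16MB worth of TIDs */
    ItemPointerData tids[FLEXIBLE_ARRAY_MEMBER];
} DeadTidChunk;

/* Hypothetical I/O layer that appends one sorted chunk to the TID fork. */
extern void write_chunk_to_tid_fork(DeadTidChunk *chunk);

static int
deadtid_cmp(const void *a, const void *b)
{
    return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
}

/*
 * Add one dead TID found during heap pruning.  The chunk is flushed when it
 * fills up; the caller would also flush it explicitly when index sanitizing
 * starts, so that there is a stable boundary (first_id + ntids) for each
 * index to remember its progress against.
 */
static void
deadtid_chunk_add(DeadTidChunk *chunk, ItemPointer tid)
{
    chunk->tids[chunk->ntids++] = *tid;

    if (chunk->ntids == chunk->maxtids)
    {
        /* Sort so that readers can binary-search within a single chunk. */
        qsort(chunk->tids, chunk->ntids, sizeof(ItemPointerData), deadtid_cmp);
        write_chunk_to_tid_fork(chunk);
        chunk->first_id += chunk->ntids;
        chunk->ntids = 0;
    }
}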
On Fri, Apr 23, 2021 at 5:01 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2021-04-22 12:15:27 -0400, Robert Haas wrote: > > On Wed, Apr 21, 2021 at 5:38 PM Andres Freund <andres@anarazel.de> wrote: > > > I'm not sure that's the only way to deal with this. While some form of > > > generic "conveyor belt" infrastructure would be a useful building block, > > > and it'd be sensible to use it here if it existed, it seems feasible to > > > dead tids in a different way here. You could e.g. have per-heap-vacuum > > > files with a header containing LSNs that indicate the age of the > > > contents. > > > > That's true, but have some reservations about being overly reliant on > > the filesystem to provide structure here. There are good reasons to be > > worried about bloating the number of files in the data directory. Hmm, > > but maybe we could mitigate that. First, we could skip this for small > > relations. If you can vacuum the table and all of its indexes using > > the naive algorithm in <10 seconds, you probably shouldn't do anything > > fancy. That would *greatly* reduce the number of additional files > > generated. Second, we could forget about treating them as separate > > relation forks and make them some other kind of thing entirely, in a > > separate directory > > I'm not *too* worried about this issue. IMO the big difference to the > cost of additional relation forks is that such files would only exist > when the table is modified to a somewhat meaningful degree. IME the > practical issues with the number of files due to forks are cases where > huge number of tables that are practically never modified exist. > > That's not to say that I am sure that some form of "conveyor belt" > storage *wouldn't* be the right thing. How were you thinking of dealing > with the per-relation aspects of this? One conveyor belt per relation? > > > > especially if we adopted Sawada-san's proposal to skip WAL logging. I > > don't know if that proposal is actually a good idea, because it > > effectively adds a performance penalty when you crash or fail over, > > and that sort of thing can be an unpleasant surprise. But it's > > something to think about. > > I'm doubtful about skipping WAL logging entirely - I'd have to think > harder about it, but I think that'd mean we'd restart from scratch after > crashes / immediate restarts as well, because we couldn't rely on the > contents of the "dead tid" files to be accurate. In addition to the > replication issues you mention. Yeah, not having WAL would have a big negative impact on other various aspects. Can we piggyback the WAL for the TID fork and XLOG_HEAP2_PRUNE? That is, we add the buffer for the TID fork to XLOG_HEAP2_PRUNE and record one 64-bit number of the first dead TID in the list so that we can add dead TIDs to the TID fork during replaying XLOG_HEAP2_PRUNE. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Thu, May 6, 2021 at 8:27 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > I'm doubtful about skipping WAL logging entirely - I'd have to think > > harder about it, but I think that'd mean we'd restart from scratch after > > crashes / immediate restarts as well, because we couldn't rely on the > > contents of the "dead tid" files to be accurate. In addition to the > > replication issues you mention. > > Yeah, not having WAL would have a big negative impact on other various > aspects. Can we piggyback the WAL for the TID fork and > XLOG_HEAP2_PRUNE? That is, we add the buffer for the TID fork to > XLOG_HEAP2_PRUNE and record one 64-bit number of the first dead TID in > the list so that we can add dead TIDs to the TID fork during replaying > XLOG_HEAP2_PRUNE. That could be an option but we need to be careful about the buffer lock order because now we will have to hold the lock on the TID fork buffer as well as the heap buffer so that we don't create any deadlock. And there is also a possibility of holding the lock on multiple TID fork buffers, which will depend upon how many tid we have pruned. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Thu, May 6, 2021 at 3:38 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, May 6, 2021 at 8:27 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > I'm doubtful about skipping WAL logging entirely - I'd have to think > > > harder about it, but I think that'd mean we'd restart from scratch after > > > crashes / immediate restarts as well, because we couldn't rely on the > > > contents of the "dead tid" files to be accurate. In addition to the > > > replication issues you mention. > > > > Yeah, not having WAL would have a big negative impact on other various > > aspects. Can we piggyback the WAL for the TID fork and > > XLOG_HEAP2_PRUNE? That is, we add the buffer for the TID fork to > > XLOG_HEAP2_PRUNE and record one 64-bit number of the first dead TID in > > the list so that we can add dead TIDs to the TID fork during replaying > > XLOG_HEAP2_PRUNE. > > That could be an option but we need to be careful about the buffer > lock order because now we will have to hold the lock on the TID fork > buffer as well as the heap buffer so that we don't create any > deadlock. And there is also a possibility of holding the lock on > multiple TID fork buffers, which will depend upon how many tid we have > pruned. Not sure we will need to hold buffer locks for both the TID fork and the heap at the same time but I agree that we could need to lock on multiple TID fork buffers. We could need to add dead TIDs to up to two pages for the TID fork during replaying XLOG_HEAP2_PRUNE since we write it per heap pages. Probably we can process one by one. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Thu, May 6, 2021 at 5:02 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Not sure we will need to hold buffer locks for both the TID fork and > the heap at the same time but I agree that we could need to lock on > multiple TID fork buffers. We could need to add dead TIDs to up to two > pages for the TID fork during replaying XLOG_HEAP2_PRUNE since we > write it per heap pages. Probably we can process one by one. It seems like we do need to hold them at the same time, because typically for a WAL record you lock all the buffers, modify them all while writing the WAL record, and then unlock them all. Now maybe there's some argument that we can dodge that requirement here, but I have reservations about departing from the usual locking pattern. It's easier to reason about the behavior when everybody follows the same set of rules. -- Robert Haas EDB: http://www.enterprisedb.com
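For reference, the usual pattern Robert describes looks roughly like the sketch below. The TID-fork pieces (tidfork_buf, tidfork_append_dead_tids, the extra registered buffer) are hypothetical, variable declarations are elided, and the real pruning code also registers the offset arrays as buffer data; the only point here is the lock-all / modify / log / unlock-all ordering.

/* Both pages are modified under one WAL record, so lock both up front. */
LockBuffer(heap_buf, BUFFER_LOCK_EXCLUSIVE);
LockBuffer(tidfork_buf, BUFFER_LOCK_EXCLUSIVE);     /* hypothetical buffer */

START_CRIT_SECTION();

heap_page_prune_execute(heap_buf, redirected, nredirected,
                        nowdead, ndead, nowunused, nunused);
tidfork_append_dead_tids(tidfork_buf, nowdead, ndead);  /* hypothetical */

MarkBufferDirty(heap_buf);
MarkBufferDirty(tidfork_buf);

if (RelationNeedsWAL(relation))
{
    XLogRecPtr  recptr;

    XLogBeginInsert();
    XLogRegisterData((char *) &xlrec, SizeOfHeapPrune);
    XLogRegisterBuffer(0, heap_buf, REGBUF_STANDARD);
    XLogRegisterBuffer(1, tidfork_buf, REGBUF_STANDARD);   /* the new part */

    recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_PRUNE);

    PageSetLSN(BufferGetPage(heap_buf), recptr);
    PageSetLSN(BufferGetPage(tidfork_buf), recptr);
}

END_CRIT_SECTION();

LockBuffer(tidfork_buf, BUFFER_LOCK_UNLOCK);
LockBuffer(heap_buf, BUFFER_LOCK_UNLOCK);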
On Thu, May 6, 2021 at 7:19 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, May 6, 2021 at 5:02 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Not sure we will need to hold buffer locks for both the TID fork and > > the heap at the same time but I agree that we could need to lock on > > multiple TID fork buffers. We could need to add dead TIDs to up to two > > pages for the TID fork during replaying XLOG_HEAP2_PRUNE since we > > write it per heap pages. Probably we can process one by one. > > It seems like we do need to hold them at the same time, because > typically for a WAL record you lock all the buffers, modify them all > while writing the WAL record, and then unlock them all. > > Now maybe there's some argument that we can dodge that requirement > here, but I have reservations about departing from the usual locking > pattern. It's easier to reason about the behavior when everybody > follows the same set of rules. Yes, agreed. I was thinking of replaying WAL, not writing WAL. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Thu, 6 May 2021 at 4:12 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Thu, May 6, 2021 at 7:19 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Thu, May 6, 2021 at 5:02 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > Not sure we will need to hold buffer locks for both the TID fork and
> > > the heap at the same time but I agree that we could need to lock on
> > > multiple TID fork buffers. We could need to add dead TIDs to up to two
> > > pages for the TID fork during replaying XLOG_HEAP2_PRUNE since we
> > > write it per heap pages. Probably we can process one by one.
> >
> > It seems like we do need to hold them at the same time, because
> > typically for a WAL record you lock all the buffers, modify them all
> > while writing the WAL record, and then unlock them all.
> >
> > Now maybe there's some argument that we can dodge that requirement
> > here, but I have reservations about departing from the usual locking
> > pattern. It's easier to reason about the behavior when everybody
> > follows the same set of rules.
>
> Yes, agreed. I was thinking of replaying WAL, not writing WAL.
Right, I was referring to writing the WAL.
Andres Freund <andres@anarazel.de> wrote: > On 2021-04-21 11:21:31 -0400, Robert Haas wrote: > > This scheme adds a lot of complexity, which is a concern, but it seems > > to me that it might have several benefits. One is concurrency. You > > could have one process gathering dead TIDs and adding them to the > > dead-TID fork while another process is vacuuming previously-gathered > > TIDs from some index. > > I think it might even open the door to using multiple processes > gathering dead TIDs for the same relation. I think the possible concurrency improvements are themselves a valid reason to do the decoupling. Or rather it's hard to imagine how the current implementation of VACUUM can get parallel workers involved in gathering the dead heap TIDs efficiently. Currently, a single backend gathers the heap TIDs, and it can then launch several parallel workers to remove the TIDs from indexes. If parallel workers gathered the heap TIDs, then (w/o the decoupling) the parallel index processing would be a problem because a parallel worker cannot launch other parallel workers. > > In fact, every index could be getting vacuumed at the same time, and > > different indexes could be removing different TID ranges. > > We kind of have this feature right now, due to parallel vacuum... -- Antonin Houska Web: https://www.cybertec-postgresql.com
On Wed, Apr 21, 2021 at 8:21 AM Robert Haas <robertmhaas@gmail.com> wrote: > Now, the reason for this is that when we discover dead TIDs, we only > record them in memory, not on disk. So, as soon as VACUUM ends, we > lose all knowledge of whether those TIDs were and must rediscover > them. Suppose we didn't do this, and instead had a "dead TID" fork for > each table. Suppose further that this worked like a conveyor belt, > similar to WAL, where every dead TID we store into the fork is > assigned an identifying 64-bit number that is never reused. Have you started any work on this project? I think that it's a very good idea. Enabling index-only scans is a good enough reason to pursue this project, even on its own. The flexibility that this design offers allows VACUUM to run far more aggressively, with little possible downside. It makes it possible for VACUUM to run so frequently that it rarely dirties pages most of the time -- at least in many important cases. Imagine if VACUUM almost kept in lockstep with inserters into an append-mostly table -- that would be great. The main blocker to making VACUUM behave like that is of course indexes. Setting visibility map bits during VACUUM can make future vacuuming cheaper (for the obvious reason), which *also* makes it cheaper to set *most* visibility map bits as the table is further extended, which in turn makes future vacuuming cheaper...and so on. This virtuous circle seems like it might be really important. Especially once you factor in the cost of dirtying pages a second or a third time. I think that we can really keep the number of times VACUUM dirties pages under control, simply by decoupling. Decoupling is key to keeping the costs to a minimum. I attached a POC autovacuum logging instrumentation patch that shows how VACUUM uses *and* sets VM bits. I wrote this for my TPC-C + FSM work. Seeing both things together, and seeing how both things *change* over time was a real eye opener for me: it turns out that the master branch keeps setting and resetting VM bit pages in the two big append-mostly tables that are causing so much trouble for Postgres today. What we see right now is pretty disorderly -- the numbers don't trend in the right direction when they should. But it could be a lot more orderly, with a little work. This instrumentation helped me to discover a better approach to indexing within TPC-C, based on index-only scans [1]. It also made me realize that it's possible for a table to have real problems with dead tuple cleanup in indexes, while nevertheless being an effective target for index-only scans. There is actually no good reason to think that one condition should preclude the other -- they may very well go together. You did say this yourself when talking about global indexes, but there is no reason to think that it's limited to partitioning cases. The current "ANALYZE dead_tuples statistics" paradigm cannot recognize when both conditions go together, even though I now think that it's fairly common. I also like your idea here because it enables a more qualitative approach, based on recent information for recently modified blocks -- not whole-table statistics. Averages are notoriously misleading. [1] https://github.com/pgsql-io/benchmarksql/pull/16 -- Peter Geoghegan
On Thu, Sep 16, 2021 at 7:09 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Wed, Apr 21, 2021 at 8:21 AM Robert Haas <robertmhaas@gmail.com> wrote: > > Now, the reason for this is that when we discover dead TIDs, we only > > record them in memory, not on disk. So, as soon as VACUUM ends, we > > lose all knowledge of whether those TIDs were and must rediscover > > them. Suppose we didn't do this, and instead had a "dead TID" fork for > > each table. Suppose further that this worked like a conveyor belt, > > similar to WAL, where every dead TID we store into the fork is > > assigned an identifying 64-bit number that is never reused. > > Enabling index-only scans is a good enough reason to pursue this > project, even on its own. +1 > I attached a POC autovacuum logging instrumentation patch that shows > how VACUUM uses *and* sets VM bits. Logging how vacuum uses and sets VM bits seems a good idea. I've read the proposed PoC patch. Probably it's better to start a new thread for this patch and comment on it there, but let me leave one comment on the patch here: With the patch, we increment the allfrozen_pages counter, which is used to determine whether or not we advance relfrozenxid and relminmxid, at two places:

@@ -1141,7 +1201,9 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
          * in this case an approximate answer is OK.
          */
         if (aggressive || VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
-            vacrel->frozenskipped_pages++;
+            vacrel->allfrozen_pages++;
+        else
+            vacrel->allvisible_pages++;

         continue;

@@ -1338,6 +1400,8 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
          */
         if (!PageIsAllVisible(page))
         {
+            vacrel->allfrozen_pages++;
+

I think that we will end up doubly counting the page as scanned_pages and allfrozen_pages due to the newly added latter change. This seems wrong to me because we calculate as follows:

@@ -644,7 +656,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
      * NB: We need to check this before truncating the relation, because that
      * will change ->rel_pages.
      */
-    if ((vacrel->scanned_pages + vacrel->frozenskipped_pages)
+    if ((vacrel->scanned_pages + vacrel->allfrozen_pages)
         < vacrel->rel_pages)
     {
         Assert(!aggressive);

Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Wed, Sep 15, 2021 at 6:08 PM Peter Geoghegan <pg@bowt.ie> wrote: > Have you started any work on this project? I think that it's a very good idea. Actually, I have. I've been focusing on trying to create a general infrastructure for conveyor belt storage. An incomplete and likely quite buggy version of this can be found here: https://git.postgresql.org/gitweb/?p=users/rhaas/postgres.git;a=shortlog;h=refs/heads/conveyor Mark Dilger has been helping me debug it, but it's still very early days. I was planning to wait until it was a little more baked before posting it to the list, but since you asked... Once that infrastructure is sufficiently mature, then the next step, I think, would be to try to use it to store dead TIDs. And then after that, one has to think about how autovacuum scheduling ought to work in a world where table vacuuming and index vacuuming are decoupled. This is a very hard problem, and I don't expect to solve it quickly. I do hope to keep plugging away at it, though. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Sep 23, 2021 at 10:42 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Thu, Sep 16, 2021 at 7:09 AM Peter Geoghegan <pg@bowt.ie> wrote: > > Enabling index-only scans is a good enough reason to pursue this > > project, even on its own. > > +1 I was hoping that you might be able to work on opportunistically freezing whole pages for Postgres 15. I think that it would make sense to opportunistically make a page that is about to become all_visible during VACUUM become all_frozen instead. Our goal is to make most pages skip all_visible, and go straight to all_frozen directly. Often the page won't need to be dirtied again, ever. Right now freezing is something that we mostly just think about as occurring at the level of tuples, which doesn't seem ideal. This seems related to Robert's project because both projects are connected to the question of how autovacuum scheduling works in general. We will probably need to rethink things like the vacuum_freeze_min_age GUC. (I also think that we might need to reconsider how aggressive/anti-wraparound VACUUMs work, but that's another story.) Obviously this is a case of performing work eagerly; a form of speculation that tries to lower costs in the aggregate, over time. Heuristics that work well on average seem possible, but even excellent heuristics could be wrong -- in the end we're trying to predict the future, which is inherently impossible to do reliably for all workloads. I think that that will be okay, provided that the cost of being wrong is kept low and *fixed* (the exact definition of "fixed" will need to be worked out, but the basic idea is that any regression is once per page, not once per page per VACUUM or something). Once it's much cheaper to freeze a whole page early (i.e. all tuple headers from all tuples), then the implementation can be wrong 95%+ of the time, and maybe we'll still win by a lot. That may sound bad, until you realize that it's 95% *per VACUUM* -- the entire situation is much better once you think about the picture for the entire table over time and across many different VACUUM operations, and once you think about FPIs in the WAL stream. We'll be paying the cost of freezing in smaller and more predictable increments, too, which can make the whole system more robust. Having many pages all go from all_visible to all_frozen at the same time (just because they crossed some usually-meaningless XID-based threshold) is actually quite risky (this is why I mentioned aggressive VACUUMs in passing). The hard part is getting the cost way down. lazy_scan_prune() uses xl_heap_freeze_tuple records for each tuple it freezes. These obviously have a lot of redundancy across tuples from the same page in practice. And the WAL overhead is much larger just because these are per-tuple records, not per-page records. Getting the cost down is hard because of issues with MultiXacts, freezing xmin but not freezing xmax at the same time, etc. > Logging how vacuum uses and sets VM bits seems a good idea. > I think that we will end up doubly counting the page as scanned_pages > and allfrozen_pages due to the newly added latter change. This seems > wrong to me because we calculate as follows: I agree that that's buggy. Oops. It was just a prototype that I wrote for my own work. I do think that we should have a patch that has some of this, for users, but I am not sure about the details just yet. This is probably too much information for users, but I think it will take me more time to decide what really does matter to users. -- Peter Geoghegan
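To illustrate the shape of the opportunistic whole-page freezing idea discussed above -- this is not from any posted patch; all_tuples_freezable() and heap_freeze_all_tuples() are hypothetical, and the real code would also have to handle WAL logging, critical sections, and the MultiXact issues just mentioned -- the decision could sit at the point where VACUUM is about to mark a page all_visible:

/* Hypothetical sketch: upgrade "about to become all_visible" to all_frozen. */
static void
maybe_freeze_whole_page(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
                        Buffer vmbuffer, LVPagePruneState *prunestate)
{
    Page    page = BufferGetPage(buf);
    uint8   flags = VISIBILITYMAP_ALL_VISIBLE;

    if (!prunestate->all_visible)
        return;

    if (!prunestate->all_frozen && all_tuples_freezable(page))
    {
        /* Freeze every tuple now, ideally with one per-page WAL record. */
        heap_freeze_all_tuples(vacrel->rel, buf);
        prunestate->all_frozen = true;
    }

    PageSetAllVisible(page);
    MarkBufferDirty(buf);

    if (prunestate->all_frozen)
        flags |= VISIBILITYMAP_ALL_FROZEN;

    visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
                      vmbuffer, prunestate->visibility_cutoff_xid, flags);
}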
On Fri, Sep 24, 2021 at 11:48 AM Robert Haas <robertmhaas@gmail.com> wrote: > Actually, I have. I've been focusing on trying to create a general > infrastructure for conveyor belt storage. An incomplete and likely > quite buggy version of this can be found here: > > https://git.postgresql.org/gitweb/?p=users/rhaas/postgres.git;a=shortlog;h=refs/heads/conveyor That's great news! I think that this is the right high level direction. > Mark Dilger has been helping me debug it, but it's still very early > days. I was planning to wait until it was a little more baked before > posting it to the list, but since you asked... Reminds me of my FSM patch, in a way. It's ambitious, and still very rough, but maybe I should bite the bullet and post it as a POC soon. > Once that infrastructure is sufficiently mature, then the next step, I > think, would be to try to use it to store dead TIDs. +1. > And then after that, one has to think about how autovacuum scheduling > ought to work in a world where table vacuuming and index vacuuming are > decoupled. I'm excited about the possibility of using this infrastructure as a springboard for driving autovacuum's behavior using more or less authoritative information, rather than dubious statistics that can consistently lead us down the wrong path. ANALYZE style statistics are something that can only work under specific conditions that take their obvious limitations into account -- and even there (even within the optimizer) it's amazing that they work as well as they do. I fear that we assumed that the statistics driving autovacuum were good enough at some point in the distant past, and never really validated that assumption. Perhaps because anti-wraparound VACUUM was *accidentally* protective. The scheduling of autovacuum is itself a big problem for the two big BenchmarkSQL tables I'm always going on about -- though it did get a lot better with the introduction of the autovacuum_vacuum_insert_scale_factor stuff in Postgres 13. I recently noticed that the tables have *every* autovacuum driven by inserts (i.e. by the new autovacuum_vacuum_scale_factor stuff), and never by updates -- even though updates obviously produce significant bloat in the two tables. BenchmarkSQL on Postgres was far worse than it is now a few releases ago [1], and I think that this stats business was a big factor (on top of everything else). I can clearly see that autovacuum_vacuum_scale_factor is certainly accidentally protective with BenchmarkSQL today, in a way that wasn't particularly anticipated by anybody. The fact that the intellectual justifications for a lot of these things are so vague concerns me. For example, why do we apply autovacuum_vacuum_scale_factor based on reltuples at the end of the last VACUUM? That aspect of the design will make much less sense once we have this decoupling in place. Even with the happy accident of autovacuum_vacuum_insert_scale_factor helping BenchmarkSQL, the conventional dead tuples based approach to VACUUM still doesn't drive autovacuum sensibly -- we still systematically undercount LP_DEAD stubs because (with this workload) they're systemically concentrated in relatively few heap pages. So if this was a real app, the DBA would somehow have to work out that they should aggressively tune autovacuum_vacuum_scale_factor to clean up bloat from updates. I doubt any DBA could ever figure that out, because it doesn't make any sense. 
The problem goes both ways: in addition to undercounting dead tuples, we effectively overcount, which can lead to autovacuum chasing its own tail [2]. I think that we could do *way* better than we do today without enormous effort, and I think that it really matters. Maybe we could select from a few standard models for autovacuum scheduling using Bayesian inference -- converge on the more predictive model for a given table over time, using actual outcomes for each autovacuum. Be sensitive to how LP_DEAD stub line pointers can become concentrated in relatively few heap pages, and stuff like that. Maybe keep a little history to work off of. The problem with the current model is not that it might be wrong. The problem is that it might *never* be right (for a given table). The scheduling never learns any lessons, because it's fundamentally static -- it ought to be dynamic. How things change is much more informative than where things are at an arbitrary point in time. [1] https://www.postgresql.org/message-id/flat/0265f9e2-3e32-e67d-f106-8abde596c0e4%40commandprompt.com [2] https://postgr.es/m/CAH2-Wz=sJm3tm+FpXbyBhEhX5tbz1trQrhG6eOhYk4-+5uL=ww@mail.gmail.com -- Peter Geoghegan
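Purely as a toy illustration of the "converge on the more predictive model" brainstorm above -- nothing like this exists in autovacuum today, and the models, the variance constant, and every name here are invented -- the bookkeeping could be as small as a weight per candidate model, updated after each autovacuum from how well that model predicted what the autovacuum actually found:

#include <math.h>

#define NUM_MODELS 3            /* e.g. insert-driven, update-driven, mixed */

typedef struct BloatModelState
{
    double  weight[NUM_MODELS]; /* normalized weights, starting uniform */
} BloatModelState;

/*
 * predicted[i] is model i's prediction of, say, the fraction of heap pages
 * containing LP_DEAD items; observed is what the autovacuum actually found.
 * Models that predicted well gain weight (posterior ~ prior * likelihood).
 */
static void
bloat_models_update(BloatModelState *state,
                    const double *predicted, double observed)
{
    double  total = 0.0;

    for (int i = 0; i < NUM_MODELS; i++)
    {
        double  err = predicted[i] - observed;

        /* Gaussian-style likelihood; 0.02 is an arbitrary variance. */
        state->weight[i] *= exp(-(err * err) / 0.02);
        total += state->weight[i];
    }

    if (total <= 0.0)
    {
        /* Every model was badly wrong; fall back to uniform weights. */
        for (int i = 0; i < NUM_MODELS; i++)
            state->weight[i] = 1.0 / NUM_MODELS;
        return;
    }

    for (int i = 0; i < NUM_MODELS; i++)
        state->weight[i] /= total;
}

Scheduling would then lean on whichever model currently carries the most weight (or on the weighted mixture), and keeping a little history, as suggested above, would stop the weights from reacting too violently to one unusual vacuum.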
On Fri, Sep 24, 2021 at 7:44 PM Peter Geoghegan <pg@bowt.ie> wrote: > The scheduling of autovacuum is itself a big problem for the two big > BenchmarkSQL tables I'm always going on about -- though it did get a > lot better with the introduction of the > autovacuum_vacuum_insert_scale_factor stuff in Postgres 13. I recently > noticed that the tables have *every* autovacuum driven by inserts > (i.e. by the new autovacuum_vacuum_scale_factor stuff), and never by > updates -- even though updates obviously produce significant bloat in > the two tables. BenchmarkSQL on Postgres was far worse than it is now > a few releases ago [1], and I think that this stats business was a big > factor (on top of everything else). I can clearly see that > autovacuum_vacuum_scale_factor is certainly accidentally protective > with BenchmarkSQL today, in a way that wasn't particularly anticipated > by anybody. > So if this was a real app, the DBA would > somehow have to work out that they should aggressively tune > autovacuum_vacuum_scale_factor to clean up bloat from updates. I doubt > any DBA could ever figure that out, because it doesn't make any sense. Correction: I meant that the autovacuum_vacuum_insert_scale_factor GUC is accidentally protective with the BenchmarkSQL tables, and that no DBA could be expected to figure this out. That is, it helps to lower autovacuum_vacuum_insert_scale_factor from its default of 0.20, just to get autovacuum to better handle bloat from *updates*. This has nothing to do with inserts, or with freezing or setting VM bits -- and so overall it doesn't make any sense. -- Peter Geoghegan
On Sat, Sep 25, 2021 at 10:17 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Thu, Sep 23, 2021 at 10:42 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Sep 16, 2021 at 7:09 AM Peter Geoghegan <pg@bowt.ie> wrote: > > > Enabling index-only scans is a good enough reason to pursue this > > > project, even on its own. > > > > +1 > > I was hoping that you might be able to work on opportunistically > freezing whole pages for Postgres 15. I think that it would make sense > to opportunistically make a page that is about to become all_visible > during VACUUM become all_frozen instead. Our goal is to make most > pages skip all_visible, and go straight to all_frozen directly. Often > the page won't need to be dirtied again, ever. +1. I'm happy to work on this. There was a similar proposal before[1]: if we are going to freeze even one tuple on a page, and all tuples on the page can be frozen, then we freeze all of them and set the page all-frozen. This is also a good approach. > The hard part is getting the cost way down. lazy_scan_prune() uses > xl_heap_freeze_tuple records for each tuple it freezes. These > obviously have a lot of redundancy across tuples from the same page in > practice. And the WAL overhead is much larger just because these are > per-tuple records, not per-page records. xl_heap_freeze_page includes multiple xl_heap_freeze_tuple entries, but we write one XLOG_HEAP2_FREEZE_PAGE WAL record per page, don't we? Which WAL overhead did you refer to? Regards, [1] https://www.postgresql.org/message-id/CANP8%2Bj%2BEfLZMux6KLvb%2BumdeVYc%2BJZs5ReNSFq9WDLn%2BAKnhkg%40mail.gmail.com -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Mon, Sep 27, 2021 at 8:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Hi, Here is the first WIP patch for decoupling table and index vacuum. The first mail of the thread has already explained the complete background of why we want to do this, so instead of describing that I will directly jump into explaining what these patches do. Currently, the table vacuum and index vacuum are executed as a single operation. Basically, we vacuum the table by performing HOT pruning and remembering dead items in memory, then we perform the index vacuum, and then we perform the second pass of the heap vacuum, under which we mark items unused. In this patch, we make these multiple vacuum passes independent operations. The idea is that we provide multiple vacuum options, under which the user can perform the independent operations: "VACUUM (heap_hot_prune) tbl_name" for performing just the HOT prune (first vacuum) pass, and "VACUUM (heap_vacuum) tbl_name" for the second heap pass, which sets dead items unused once index vacuuming is done. Additionally, we now allow users to perform just the index vacuum, i.e. "VACUUM idx_name". So under the heap_hot_prune pass, we will generate the dead TIDs, and instead of directly performing the index vacuum we will flush those dead TIDs into the conveyor belt using the Deadtidstore interfaces. Then in the index pass, we will read the data from the conveyor belt and perform the index vacuum, and at last, in the heap_vacuum pass, we will read the data from the conveyor belt and mark all dead items unused. However, in the second pass, we can only mark those items unused which are dead and for which all the indexes of the table have already been vacuumed. To identify that, we store in the pg_class entry for each index the conveyor belt pageno up to which we have already done the index vacuum, and in the pg_class entry for the table the pageno up to which we have already done the heap_vacuum pass. Additionally, while doing the hot_prune pass, if we find an item that is already dead and for which the index vacuum is already done, we directly set it unused; for this, we use the Deadtidstore interfaces. Deadtidstore provides interfaces over the conveyor belt for storing and retrieving dead TIDs. This module maintains a DeadTidState, which keeps track of the current insertion progress, i.e. the first and the last conveyor belt page for the current vacuum run. On completion of the vacuum run, it takes care of setting the complete vacuum run bound, by storing the last conveyor belt pageno of the current vacuum run into the special space of the first conveyor belt page for this run. This module also provides the infrastructure to avoid adding duplicate TIDs to the conveyor belt. Basically, if we perform the first vacuum pass multiple times without executing the second vacuum pass, it is possible that we encounter the same dead TIDs in the conveyor belt again. So this module maintains a cache over the conveyor belt, such that it only loads the data relevant to the block the vacuum is currently processing, and we don't need to maintain a huge cache. Test example:

CREATE TABLE t (a int);
CREATE INDEX idx on t(a);
INSERT INTO t VALUES (generate_series(1,1000000));
DELETE FROM t where a > 300;
VACUUM (heap_hot_prune) t;
VACUUM idx;
VACUUM (heap_vacuum) t;

TODO:
- This is just a POC patch to discuss the design idea and needs a lot of improvement and testing.
- We are using a slightly different format for storing the dead TIDs in the conveyor belt, which is explained in the patch, but the traditional multi-pass vacuum is still using the old format (array of ItemPointerData), so we need to unify the two formats.
- Performance testing.
- Cleaner interfaces so that we can easily be integrated with autovacuum; currently, this is not provided for the manual vacuum.
- Add test cases.

Patches can be applied on top of the latest conveyor belt patches[1]

[1] https://www.postgresql.org/message-id/CAFiTN-sQUddO9JPiH3tz%2BvbNqRqi_pgndecy8k2yXAnO3ymqZA%40mail.gmail.com

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
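For readers trying to picture the Deadtidstore layer described above, its interface shape might look something like the declarations-only sketch below. These names and signatures are illustrative guesses, not the actual functions in the patch; the git branch referenced later in the thread has the real thing.

/* Hypothetical shape of a Deadtidstore-style API; not the patch's actual API. */
typedef struct DeadTidState DeadTidState;   /* tracks one vacuum run */

/* Begin/end one "first pass" insertion run over the conveyor belt. */
extern DeadTidState *DeadTidStoreBegin(Relation rel);
extern void DeadTidStoreEnd(DeadTidState *dts);     /* seals the run's bound */

/* Append the dead TIDs found while pruning one heap page. */
extern void DeadTidStoreAdd(DeadTidState *dts, BlockNumber blkno,
                            OffsetNumber *offsets, int noffsets);

/* Read TIDs from an index's last-vacuumed conveyor-belt page to the end. */
extern int DeadTidStoreRead(Relation rel, uint64 from_pageno,
                            ItemPointerData *dest, int maxtids);

/* Has this TID already been recorded?  Used by a later hot_prune pass. */
extern bool DeadTidStoreIsMember(DeadTidState *dts, ItemPointer tid);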
On Wed, Jan 26, 2022 at 8:58 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > TODO: > - This is just a POC patch to discuss the design idea and needs a lot > of improvement and testing. > - We are using a slightly different format for storing the dead tids > into the conveyor belt which is explained in the patch but the > traditional multi-pass vacuum is still using the same format (array of > ItemPointeData), so we need to unify that format. > - Performance testing. > - Cleaner interfaces so that we can easily be integrated with auto > vacuum, currently, this is not provided for the manual vacuum. > - Add test cases. I think this is a pretty interesting piece of work. I appreciate the effort you've obviously put into the comments, although I do think some of them are going to need some additional clarification. But I think the bigger questions here at the moment are things like (1) Is this the right idea? and if not (2) How could we change it to make it better? and (3) Is there any way that we can make it simpler? It was the last of these questions that prompted me to post http://postgr.es/m/CA+TgmoY18RzQqDm2jE2WDkiA8ngTEDHp7uLtHb3a-ABs+wbY_g@mail.gmail.com because, if that thought were to work out, then we could have more things in common between the conveyor-belt and non-conveyor-belt cases, and we might be able to start with some preliminary work to jigger more things in to the second phase, and then look to integrate the conveyor belt stuff separately. I think what we ought to do at this point is try to figure out some tests that might show how well this approach actually works in practice. Now one motivation for this work was the desire to someday have global indexes, but those don't exist yet, so it makes sense to consider other scenarios in which the patch might (or might not) be beneficial. And it seems to me that we should be looking for a scenario where we have multiple indexes with different vacuuming needs. How could that happen? Well, the first thing that occurred to me was a table with a partial index. If we have a column t whose values are randomly distributed between 1 and 10, and a partial index on some other column WHERE t = 1, then the partial index should only accumulate dead tuples about 10% as fast as a non-partial index on the same column. On the other hand, the partial index also has a much smaller number of total rows, so after a fixed number of updates, the partial index should have the same *percentage* of dead tuples as the non-partial index even though the absolute number is smaller. So maybe that's not a great idea. My second thought was that perhaps we can create a test scenario where, in one index, the deduplication and bottom-up index deletion and kill_prior_tuple mechanisms are very effective, and in another index, it's not effective at all. For example, maybe index A is an index on the primary key, and index B is a non-unique index on some column that we're updating with ever-increasing values (so that we never put new tuples into a page that could be productively cleaned up). I think what should happen in this case is that A should not grow in size even if it's never vacuumed, while B will require vacuuming to keep the size down. If this example isn't exactly right, maybe we can construct one where that does happen. Then we could try to demonstrate that with this patch we can do less vacuuming work and still keep up than what would be possible without the patch. 
We'll either be able to show that this is true, or we will see that it's false, or we won't be able to really see much difference. Any of those would be interesting findings. One thing we could try doing in order to make that easier would be: tweak things so that when autovacuum vacuums the table, it only vacuums the indexes if they meet some threshold for bloat. I'm not sure exactly what happens with the heap vacuuming then - do we do phases 1 and 2 always, or a combined heap pass, or what? But if we pick some criteria that vacuums indexes sometimes and not other times, we can probably start doing some meaningful measurement of whether this patch is making bloat better or worse, and whether it's using fewer or more resources to do it. Do you have a git branch for this work? -- Robert Haas EDB: http://www.enterprisedb.com
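One way to picture the "only vacuum the indexes that meet some threshold for bloat" experiment is a per-index filter around the index-vacuuming loop. This is a hypothetical sketch only -- bloat_threshold_met() and lazy_vacuum_one_index_from_conveyor() are invented names, not code from the conveyor-belt branch:

/* Hypothetical sketch: vacuum only the indexes that look bloated enough. */
static void
vacuum_indexes_selectively(LVRelState *vacrel, double bloat_threshold)
{
    for (int idx = 0; idx < vacrel->nindexes; idx++)
    {
        Relation    indrel = vacrel->indrels[idx];

        /*
         * bloat_threshold_met() stands in for whatever criterion we settle
         * on (index growth, dead-tuple estimates, etc.).  A skipped index
         * simply keeps its old conveyor-belt position, so its TIDs remain
         * remembered and can still be removed in a later pass.
         */
        if (!bloat_threshold_met(indrel, bloat_threshold))
            continue;

        /* Remove all conveyor-belt TIDs that this index hasn't seen yet. */
        lazy_vacuum_one_index_from_conveyor(vacrel, indrel);
    }
}

The interesting measurement is then exactly what is described above: whether total bloat and total resource usage come out better or worse than vacuuming every index every time.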
On Fri, Feb 4, 2022 at 1:15 PM Robert Haas <robertmhaas@gmail.com> wrote: > My second thought was that perhaps we can create a test scenario > where, in one index, the deduplication and bottom-up index deletion > and kill_prior_tuple mechanisms are very effective, and in another > index, it's not effective at all. For example, maybe index A is an > index on the primary key, and index B is a non-unique index on some > column that we're updating with ever-increasing values (so that we > never put new tuples into a page that could be productively cleaned > up). I think what should happen in this case is that A should not grow > in size even if it's never vacuumed, while B will require vacuuming to > keep the size down. That should work. All you need is a table with several indexes, and a workload consisting of updates that modify a column that is the key column for only one of the indexes. I would expect bottom-up index deletion to be 100% effective for the not-logically-modified indexes, in the sense that there will be no page splits -- provided there are no long held snapshots, and provided that the index isn't very small. If it is small (think of something like the pgbench_branches pkey), then even the occasional ANALYZE will act as a "long held snapshot" relative to the size of the index. And so then you might get one page split per original leaf page, but probably not a second, and very very probably not a third. The constantly modified index will be entirely dependent on index vacuuming here, and so an improved VACUUM design that allows that particular index to be vacuumed more frequently could really improve performance. BTW, it's a good idea to avoid unique indexes in test cases where there is an index that you don't want to set LP_DEAD bits for, since _bt_check_unique() tends to do a good job of setting LP_DEAD bits, independent of the kill_prior_tuple thing. You can avoid using kill_prior_tuple by forcing bitmap scans, of course. -- Peter Geoghegan
On Fri, Feb 4, 2022 at 1:46 PM Peter Geoghegan <pg@bowt.ie> wrote: > That should work. All you need is a table with several indexes, and a > workload consisting of updates that modify a column that is the key > column for only one of the indexes. I would expect bottom-up index > deletion to be 100% effective for the not-logically-modified indexes, > in the sense that there will be no page splits -- provided there are > no long held snapshots, and provided that the index isn't very small. > If it is small (think of something like the pgbench_branches pkey), > then even the occasional ANALYZE will act as a "long held snapshot" > relative to the size of the index. And so then you might get one page > split per original leaf page, but probably not a second, and very very > probably not a third. > > The constantly modified index will be entirely dependent on index > vacuuming here, and so an improved VACUUM design that allows that > particular index to be vacuumed more frequently could really improve > performance. Thanks for checking my work here - I wasn't 100% sure I had the right idea. > BTW, it's a good idea to avoid unique indexes in test cases where > there is an index that you don't want to set LP_DEAD bits for, since > _bt_check_unique() tends to do a good job of setting LP_DEAD bits, > independent of the kill_prior_tuple thing. You can avoid using > kill_prior_tuple by forcing bitmap scans, of course. Thanks for this tip, too. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Feb 4, 2022 at 1:54 PM Robert Haas <robertmhaas@gmail.com> wrote: > > The constantly modified index will be entirely dependent on index > > vacuuming here, and so an improved VACUUM design that allows that > > particular index to be vacuumed more frequently could really improve > > performance. > > Thanks for checking my work here - I wasn't 100% sure I had the right idea. I should perhaps have emphasized individual leaf pages, rather than total index size. Presumably we only need to store so many extra versions per logical row at any one time, and we have a fair amount of free space for extra versions on leaf pages. Typically 10%-30% of the space on the page (at least when it isn't already inevitable that the page will eventually split due to simple inserts). A traditional guarantee with B-Trees is that we get `ln(2)` space utilization with random insertions, which leaves just over 30% of the page free for later updates -- that's where I got 30% from. There is a complementary effect with deduplication, since that buys us time before the page has to split, making it much more likely that the split will be avoided entirely. It's very nonlinear. As I said, the competition between older snapshots and garbage collection can still lead to version-driven page splits (especially when non-hot updates are concentrated in one part of the key space, or one leaf page). But that's arguably a good thing -- it naturally relieves contention. There are actually designs that artificially split B-Tree pages early [1], detecting concurrency control related contention. Other systems need concurrency control in indexes, which we avoid by having versions live in indexes. [1] http://cidrdb.org/cidr2021/papers/cidr2021_paper21.pdf -- Peter Geoghegan
On Fri, Feb 4, 2022 at 11:45 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Jan 26, 2022 at 8:58 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > TODO: > > - This is just a POC patch to discuss the design idea and needs a lot > > of improvement and testing. > > - We are using a slightly different format for storing the dead tids > > into the conveyor belt which is explained in the patch but the > > traditional multi-pass vacuum is still using the same format (array of > > ItemPointeData), so we need to unify that format. > > - Performance testing. > > - Cleaner interfaces so that we can easily be integrated with auto > > vacuum, currently, this is not provided for the manual vacuum. > > - Add test cases. > > I think this is a pretty interesting piece of work. I appreciate the > effort you've obviously put into the comments, although I do think > some of them are going to need some additional clarification. But I > think the bigger questions here at the moment are things like (1) Is > this the right idea? and if not (2) How could we change it to make it > better? and (3) Is there any way that we can make it simpler? It was > the last of these questions that prompted me to post > http://postgr.es/m/CA+TgmoY18RzQqDm2jE2WDkiA8ngTEDHp7uLtHb3a-ABs+wbY_g@mail.gmail.com > because, if that thought were to work out, then we could have more > things in common between the conveyor-belt and non-conveyor-belt > cases, and we might be able to start with some preliminary work to > jigger more things in to the second phase, and then look to integrate > the conveyor belt stuff separately. I agree that if we can do something like that then integrating the conveyor belt will be much cleaner. > My second thought was that perhaps we can create a test scenario > where, in one index, the deduplication and bottom-up index deletion > and kill_prior_tuple mechanisms are very effective, and in another > index, it's not effective at all. For example, maybe index A is an > index on the primary key, and index B is a non-unique index on some > column that we're updating with ever-increasing values (so that we > never put new tuples into a page that could be productively cleaned > up). I think what should happen in this case is that A should not grow > in size even if it's never vacuumed, while B will require vacuuming to > keep the size down. If this example isn't exactly right, maybe we can > construct one where that does happen. Then we could try to demonstrate > that with this patch we can do less vacuuming work and still keep up > than what would be possible without the patch. We'll either be able to > show that this is true, or we will see that it's false, or we won't be > able to really see much difference. Any of those would be interesting > findings. +1 > One thing we could try doing in order to make that easier would be: > tweak things so that when autovacuum vacuums the table, it only > vacuums the indexes if they meet some threshold for bloat. I'm not > sure exactly what happens with the heap vacuuming then - do we do > phases 1 and 2 always, or a combined heap pass, or what? But if we > pick some criteria that vacuums indexes sometimes and not other times, > we can probably start doing some meaningful measurement of whether > this patch is making bloat better or worse, and whether it's using > fewer or more resources to do it. 
I think we can always trigger phase 1 and 2 and phase 2 will only vacuum conditionally based on if all the indexes are vacuumed for some conveyor belt pages so we don't have risk of scanning without marking anything unused. And we can try to measure with other approaches as well where we completely avoid phase 2 and it will be done only along with phase 1 whenever applicable. > Do you have a git branch for this work? Yeah, my repository: https://github.com/dilipbalaut11/conveyor_test branch: DecouplingIndexAndHeapVacuumUsingCB -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Sun, Feb 6, 2022 at 11:25 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > One thing we could try doing in order to make that easier would be: > > tweak things so that when autovacuum vacuums the table, it only > > vacuums the indexes if they meet some threshold for bloat. I'm not > > sure exactly what happens with the heap vacuuming then - do we do > > phases 1 and 2 always, or a combined heap pass, or what? But if we > > pick some criteria that vacuums indexes sometimes and not other times, > > we can probably start doing some meaningful measurement of whether > > this patch is making bloat better or worse, and whether it's using > > fewer or more resources to do it. > > I think we can always trigger phase 1 and 2 and phase 2 will only > vacuum conditionally based on if all the indexes are vacuumed for some > conveyor belt pages so we don't have risk of scanning without marking > anything unused. Not sure what you mean about a risk of scanning without marking any LP_DEAD items as LP_UNUSED. If VACUUM always does some amount of this, then it follows that the new mechanism added by the patch just can't safely avoid any work at all, making it all pointless. We have to expect heap vacuuming to take place much less often with the patch. Simply because that's what the invariant described in comments above lazy_scan_heap() requires. Note that this is not the same thing as saying that we do less *absolute* heap vacuuming with the conveyor belt -- my statement about less heap vacuuming taking place is *only* true relative to the amount of other work that happens in any individual "shortened" VACUUM operation. We could do exactly the same total amount of heap vacuuming as before (in a version of Postgres without the conveyor belt but with the same settings), but much *more* index vacuuming (at least for one or two problematic indexes). > And we can try to measure with other approaches as > well where we completely avoid phase 2 and it will be done only along > with phase 1 whenever applicable. I believe that the main benefit of the dead TID conveyor belt (outside of global index use cases) will be to enable us to do more (much more) index vacuuming for one index in particular. So it's not really about doing less index vacuuming or less heap vacuuming -- it's about doing a *greater* amount of *useful* index vacuuming, in less time. There is often some way in which failing to vacuum one index for a long time does lasting damage to the index structure. -- Peter Geoghegan
On Tue, Feb 8, 2022 at 12:12 PM Peter Geoghegan <pg@bowt.ie> wrote: > I believe that the main benefit of the dead TID conveyor belt (outside > of global index use cases) will be to enable us to do more (much more) > index vacuuming for one index in particular. So it's not really about > doing less index vacuuming or less heap vacuuming -- it's about doing > a *greater* amount of *useful* index vacuuming, in less time. There is > often some way in which failing to vacuum one index for a long time > does lasting damage to the index structure. This makes sense to me, and I think it's a good insight. It's not clear to me that we have enough information to make good decisions about which indexes to vacuum and which indexes to skip. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Feb 8, 2022 at 9:33 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Feb 8, 2022 at 12:12 PM Peter Geoghegan <pg@bowt.ie> wrote: > > I believe that the main benefit of the dead TID conveyor belt (outside > > of global index use cases) will be to enable us to do more (much more) > > index vacuuming for one index in particular. So it's not really about > > doing less index vacuuming or less heap vacuuming -- it's about doing > > a *greater* amount of *useful* index vacuuming, in less time. There is > > often some way in which failing to vacuum one index for a long time > > does lasting damage to the index structure. > > This makes sense to me, and I think it's a good insight. > > It's not clear to me that we have enough information to make good > decisions about which indexes to vacuum and which indexes to skip. What if "extra vacuuming, not skipping vacuuming" was not just an abstract goal, but an actual first-class part of the implementation, and the index AM API? Then the question we're asking the index/index AM is no longer "Do you [an index] *not* require index vacuuming, even though you are entitled to it according to the conventional rules of autovacuum scheduling?". The question is instead more like "Could you use an extra, early VACUUM?". If we invert the question like this, then we have something that makes more sense at the index AM level, but requires significant improvements at the level of autovacuum scheduling. On the other hand, I think that you already need to do at least some work in that area. -- Peter Geoghegan
On Tue, Feb 8, 2022 at 12:50 PM Peter Geoghegan <pg@bowt.ie> wrote: > > It's not clear to me that we have enough information to make good > > decisions about which indexes to vacuum and which indexes to skip. > > What if "extra vacuuming, not skipping vacuuming" was not just an > abstract goal, but an actual first-class part of the implementation, > and the index AM API? Then the question we're asking the index/index > AM is no longer "Do you [an index] *not* require index vacuuming, even > though you are entitled to it according to the conventional rules of > autovacuum scheduling?". The question is instead more like "Could you > use an extra, early VACUUM?". > > if we invert the question like this then we have something that makes > more sense at the index AM level, but requires significant > improvements at the level of autovacuum scheduling. On the other hand > I think that you already need to do at least some work in that area. Right, that's why I asked the question. If we're going to ask the index AM whether it would like to be vacuumed right now, we're going to have to put some logic into the index AM that knows how to answer that question. But if we don't have any useful statistics that would let us answer the question correctly, then we have problems. While I basically agree with everything that you just wrote, I'm somewhat inclined to think that the question is not best phrased as either extra-vacuum or skip-a-vacuum. Either of those supposes a normative amount of vacuuming from which we could deviate in one direction or the other. I think it would be better to phrase it in a way that doesn't make such a supposition. Maybe something like: "Hi, we are vacuuming the heap right now and we are also going to vacuum any indexes that would like it, and does that include you?" The point is that it's a continuum. If we decide that we're asking the index "do you want extra vacuuming?" then that phrasing suggests that you should only say yes if you really need it. If we decide we're asking the index "can we skip vacuuming you this time?" then the phrasing suggests that you should not feel bad about insisting on a vacuum right now, and only surrender your claim if you're sure you don't need it. But in reality, no bias either way is warranted. It is either better that this index should be vacuumed right now, or better that it should not be vacuumed right now, and whichever is better should be what we choose to do. To expand on that just a bit, if I'm a btree index and someone asks me "can we skip vacuuming you this time?" I might say "return dead_tups < tiny_amount" and if they ask me "do you want extra vacuuming" I might say "return dead_tups > quite_large_amount". But if they ask me "should we vacuum you now?" then I might say "return dead_tups > moderate_amount" which feels like the correct thing here. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Feb 8, 2022 at 10:58 AM Robert Haas <robertmhaas@gmail.com> wrote: > Right, that's why I asked the question. If we're going to ask the > index AM whether it would like to be vacuumed right now, we're going > to have to put some logic into the index AM that knows how to answer > that question. But if we don't have any useful statistics that would > let us answer the question correctly, then we have problems. I have very little faith in the use of statistical sampling for anything involving vacuuming. In fact, I think that the current way in which ANALYZE counts dead tuples is a misapplication of statistics. It isn't even wrong. One of the things that I really like about this project is that it can plausibly solve that problem by splitting up the work of VACUUM, at low cost -- it's less top-down. Not only do you get the obvious benefits of preventing bloat; you also get *continual* feedback about the actual physical reality in the table (and indexes, to a lesser extent). As I said recently, right now the more bloat we have, the more uncertainty there is about the total amount of bloat. We need to control both the bloat, and the uncertainty about the bloat. The basic high level idea behind how the optimizer uses statistics involves the assumption that *all* the rows in the table are *themselves* a sample taken from some larger distribution -- something from the real physical world (meeting this assumption is one reason why database/schema normalization really matters). And so on a good week it probably won't matter too much to the optimizer if ANALYZE doesn't run until the table size doubles (for a table that was already quite large). These are pretty delicate assumptions that (from the point of view of the optimizer) work out surprisingly well in practice. Bloat just isn't like that. Dead tuples are fundamentally cyclic and dynamic in nature -- conventional statistics just won't work with something like that. Worst of all, the process that counts dead tuples (ANALYZE) is really an active participant in the system -- the entire purpose of even looking is to *reduce* the number of dead tuples by making an autovacuum run. That's deeply weird. > The point is that it's a continuum. If we decide that we're asking the > index "do you want extra vacuuming?" then that phrasing suggests that > you should only say yes if you really need it. If we decide we're > asking the index "can we skip vacuuming you this time?" then the > phrasing suggests that you should not feel bad about insisting on a > vacuum right now, and only surrender your claim if you're sure you > don't need it. But in reality, no bias either way is warranted. Actually, I think that this particular bias *is* warranted. We should openly and plainly be biased in the direction of causing the least harm. What's wrong with that? Having accurate information is not an intrinsic good. I even think that having more information can be strictly worse, because you might actually believe it. Variance matters a lot -- the bias/variance tradeoff is pretty fundamental here. I'm also saying some of this stuff because of broader VACUUM design considerations. VACUUM fundamentally has to work at the table level, and I don't see that changing. The approach of making autovacuum do something akin to a plain VACUUM command in the simplest cases, and only later some extra "dynamic mini vacuums" (that pick up where the VACUUM command style VACUUM left off) has a lot to recommend it.
This approach allows most of the current autovacuum settings to continue to work in roughly the same way. They just need to have their documentation updated to make it clear that they're about the worst case. > To expand on that just a bit, if I'm a btree index and someone asks me > "can we skip vacuuming you this time?" I might say "return dead_tups < > tiny_amount" and if they ask me "do you want extra vacuuming" I might > say "return dead_tups > quite_large_amount". But if they ask me > "should we vacuum you now?" then I might say "return dead_tups > > moderate_amount" which feels like the correct thing here. The btree side of this shouldn't care at all about dead tuples (in general we focus way too much on dead tuples, and way too little on pages). With bottom-up index deletion the number of dead tuples in the index is just about completely irrelevant. It's entirely possible and often even likely that 20%+ of all index tuples will be dead at any one time, when the optimization perfectly preserves the index structure. The btree side of the index AM API should be focussing on the growth in index size, relative to some expectation (like maybe the growth of whichever index on the same table grew the least since last time, accounting for obvious special cases like partial indexes). Perhaps we'd give some consideration to bulk deletes, too. Overall, it should be pretty simple, and should sometimes force us to do one of these "dynamic mini vacuums" of the index just because we're not quite sure what to do. There is nothing wrong with admitting the uncertainty. -- Peter Geoghegan
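A rough standalone C sketch of the "growth in index size, relative to some expectation" idea above, using the least-grown index on the same table as the expectation. All names and the 10% fudge factor are invented for illustration; this is not the actual index AM API.

#include <stdbool.h>
#include <stdint.h>

typedef struct IndexGrowth
{
    int64_t pages_at_last_vacuum;
    int64_t pages_now;
} IndexGrowth;

static double
growth_ratio(const IndexGrowth *g)
{
    if (g->pages_at_last_vacuum <= 0)
        return 1.0;
    return (double) g->pages_now / (double) g->pages_at_last_vacuum;
}

/*
 * The least-grown index on the table stands in for the "expected"
 * growth; an index that has grown noticeably more than that is a
 * candidate for an early vacuum.  The 1.10 factor is arbitrary.
 */
static bool
could_use_early_vacuum(const IndexGrowth *idx,
                       const IndexGrowth *least_grown_on_table)
{
    return growth_ratio(idx) > growth_ratio(least_grown_on_table) * 1.10;
}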
On Tue, Feb 8, 2022 at 10:42 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Sun, Feb 6, 2022 at 11:25 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > One thing we could try doing in order to make that easier would be: > > > tweak things so that when autovacuum vacuums the table, it only > > > vacuums the indexes if they meet some threshold for bloat. I'm not > > > sure exactly what happens with the heap vacuuming then - do we do > > > phases 1 and 2 always, or a combined heap pass, or what? But if we > > > pick some criteria that vacuums indexes sometimes and not other times, > > > we can probably start doing some meaningful measurement of whether > > > this patch is making bloat better or worse, and whether it's using > > > fewer or more resources to do it. > > > > I think we can always trigger phase 1 and 2 and phase 2 will only > > vacuum conditionally based on if all the indexes are vacuumed for some > > conveyor belt pages so we don't have risk of scanning without marking > > anything unused. > > Not sure what you mean about a risk of scanning without marking any > LP_DEAD items as LP_UNUSED. I mean that, for testing purposes, we could integrate with autovacuum such that 1) we always do the first pass of the vacuum, 2) index vacuuming is done only for the indexes which have bloated more than some threshold, and then 3) we always trigger the second heap pass. So my point was that even if autovacuum triggers the second vacuum pass every time, it will not do anything for dead items whose index vacuuming is not yet done. > If VACUUM always does some amount of this, > then it follows that the new mechanism added by the patch just can't > safely avoid any work at all, making it all pointless. We have to > expect heap vacuuming to take place much less often with the patch. > Simply because that's what the invariant described in comments above > lazy_scan_heap() requires. In the second pass we make sure that we don't set any LP_DEAD item to LP_UNUSED for which index vacuuming is not yet done. Basically, we are storing the dead items in the conveyor belt, and whenever we do an index pass we remember up to which conveyor belt page the index vacuuming is done. And before starting the second heap pass we will find the minimum conveyor belt page up to which all of the indexes have been vacuumed. > Note that this is not the same thing as saying that we do less > *absolute* heap vacuuming with the conveyor belt -- my statement about > less heap vacuuming taking place is *only* true relative to the amount > of other work that happens in any individual "shortened" VACUUM > operation. We could do exactly the same total amount of heap vacuuming > as before (in a version of Postgres without the conveyor belt but with > the same settings), but much *more* index vacuuming (at least for one > or two problematic indexes). > > > And we can try to measure with other approaches as > > well where we completely avoid phase 2 and it will be done only along > > with phase 1 whenever applicable. > > I believe that the main benefit of the dead TID conveyor belt (outside > of global index use cases) will be to enable us to do more (much more) > index vacuuming for one index in particular. So it's not really about > doing less index vacuuming or less heap vacuuming -- it's about doing > a *greater* amount of *useful* index vacuuming, in less time. There is > often some way in which failing to vacuum one index for a long time > does lasting damage to the index structure. I agree with the point.
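A minimal C sketch of that rule, with invented names and types (the real conveyor belt code will of course look different). The point is just that the second heap pass is limited by the least-advanced index:

#include <stdint.h>

typedef uint64_t ConveyorPos;   /* position in the dead-TID conveyor belt */

/*
 * The second heap pass may only set LP_DEAD items LP_UNUSED up to the
 * oldest conveyor-belt position that *every* index has already been
 * vacuumed past; one never-vacuumed index holds everything back.
 */
static ConveyorPos
second_pass_limit(const ConveyorPos *vacuumed_upto, int nindexes)
{
    ConveyorPos limit = UINT64_MAX;     /* no indexes: reclaim everything */

    for (int i = 0; i < nindexes; i++)
    {
        if (vacuumed_upto[i] < limit)
            limit = vacuumed_upto[i];
    }
    return limit;
}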
-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Feb 9, 2022 at 1:21 AM Peter Geoghegan <pg@bowt.ie> wrote: > > The btree side of this shouldn't care at all about dead tuples (in > general we focus way too much on dead tuples, and way too little on > pages). With bottom-up index deletion the number of dead tuples in the > index is just about completely irrelevant. It's entirely possible and > often even likely that 20%+ of all index tuples will be dead at any > one time, when the optimization perfectly preserves the index > structure. > > The btree side of the index AM API should be focussing on the growth > in index size, relative to some expectation (like maybe the growth of > whichever index on the same table grew the least since last time, > accounting for obvious special cases like partial indexes). Perhaps > we'd give some consideration to bulk deletes, too. Overall, it should > be pretty simple, and should sometimes force us to do one of these > "dynamic mini vacuums" of the index just because we're not quite sure > what to do. There is nothing wrong with admitting the uncertainty. I agree with the point that we should be focusing more on index size growth compared to dead tuples. But I don't think that we can completely ignore the number of dead tuples. Although we have bottom-up index deletion, whether the index structure will be preserved or not will depend upon what keys we are inserting next. So, for example, if there are 80% dead tuples but so far the index size is fine, can we avoid vacuuming? If we avoid vacuuming then it is very much possible that in some cases we will create huge bloat, e.g. if we are inserting some keys which cannot take advantage of bottom-up deletion. So IMHO the decision should be a combination of index size bloat and % of dead tuples. Maybe we can add more weight to the size bloat and less weight to the % of dead tuples, but we should not completely ignore it. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
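A rough sketch of the weighted-combination idea above. The weights and the trigger threshold are invented, purely to illustrate giving size growth more weight than the dead-tuple fraction without ignoring the latter:

#include <stdbool.h>

static bool
index_needs_vacuum(double size_growth_fraction,   /* 0.25 = grew by 25% */
                   double dead_tuple_fraction)    /* 0.80 = 80% dead */
{
    const double size_weight = 2.0;     /* more weight on size bloat */
    const double dead_weight = 1.0;     /* less weight on dead tuples */

    /*
     * With these (made-up) numbers, 50% growth alone triggers a vacuum,
     * and a very high dead-tuple fraction (>80%) can still trigger one
     * even when the index has not grown at all.
     */
    return (size_weight * size_growth_fraction +
            dead_weight * dead_tuple_fraction) > 0.8;
}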
On Wed, Feb 9, 2022 at 1:18 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I agree with the point that we should be focusing more on index size > growth compared to dead tuples. But I don't think that we can > completely ignore the number of dead tuples. Although we have > bottom-up index deletion, whether the index structure will be > preserved or not will depend upon what keys we are inserting next. So, > for example, if there are 80% dead tuples but so far the index size is > fine, can we avoid vacuuming? If we avoid vacuuming then it is very much > possible that in some cases we will create huge bloat, e.g. if we are > inserting some keys which cannot take advantage of bottom-up > deletion. So IMHO the decision should be a combination of index size > bloat and % of dead tuples. Maybe we can add more weight to the size > bloat and less weight to the % of dead tuples, but we should not completely > ignore it. I think that dead index tuples really don't matter if they're going to get removed anyway before a page split happens. In particular, if we're going to do a bottom-up index deletion pass before splitting the page, then who cares if there are going to be dead tuples around until then? You might think that they'd have the unfortunate effect of slowing down scans, and they could slow down ONE scan, but if they do, then I think kill_prior_tuple will hint them dead and they won't matter any more. Now, if we have a page that is going to split, because it's going to receive inserts but neither kill_prior_tuple nor bottom-up index deletion are going to keep us out of trouble, then the dead tuples matter. And if we have a page where all the tuples are dead and no further inserts are ever going to happen, those dead tuples also matter, because getting rid of them would let us recycle the page. Just to be clear, when I say that the dead index tuples don't matter here, I mean from the point of view of the index. From the point of view of the table, the presence of dead index tuples (or even the potential presence of dead tuples) pointing to dead line pointers is an issue that can drive heap bloat. But from the point of view of the index, because we don't ever merge sibling index pages, and because we have kill_prior_tuple, there's not much value in freeing up space in index pages unless it either prevents a split or lets us free the whole page. So I agree with Peter that index growth is what really matters. However, I have a concern that Peter's idea to use the previous index growth to drive future index vacuuming decisions is retrospective rather than prospective. If the index is growing more than it should based on the data volume, then evidently we didn't do enough vacuuming at some point in the past. It's reasonable to step up our efforts in the present to make sure that the problem doesn't continue, but in some sense it's already too late. What we would really like is a measure that answers the question: is the index going to bloat in the relatively near future if we don't vacuum it now? I think that the dead tuple count is trying, however imperfectly, to figure that out. All other things being equal, the more dead tuples there are in the index, the more bloat we're going to have later if we don't clean them out now. The problem is not with that core idea, which IMHO is actually good, but that all other things are NOT equal.
Peter has shown pretty convincingly that in some workloads, essentially 100% of dead tuples are going to get removed without causing a page split and the index growth will be 0, whereas in other workloads 0% of dead tuples are going to get removed without causing index growth. If you knew that you had the second case, then counting dead index tuples to decide when to vacuum would, in my opinion, be a very sensible thing to do. It would still not be perfect, because dead tuples in pages that are going to get split are a lot worse than dead tuples in pages that aren't going to be split, but it doesn't seem meaningless. However, if all of the dead index tuples are going to get removed in a timely fashion anyway, then it's as useful as a stopped clock: it will be right whenever it says the index doesn't need to be vacuumed, and wrong when it says anything else. In a certain sense, bottom-up index deletion may have exacerbated the problems in this area. The more ways we add to remove dead tuples from indexes without vacuum, the less useful dead tuples will become as a predictor of index growth. Maybe #-of-dead-tuples and future-index-growth weren't that tightly coupled even before bottom-up index deletion, but it must be worse now. I'm not hung up on using the # of dead tuples specifically as the metric for index vacuuming, and it may be best to pick some other measure. But I am a little suspicious that if the only measure is past index growth, we will let some situations go too far before we wake up and do something about them. My intuition is that it would be a good idea to come up with something we could measure, even if it's imperfect, that would give us some clue that trouble is brewing before pages actually start splitting. Now maybe my intuition is wrong and there is nothing better, but I think it's worth a thought. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Feb 9, 2022 at 6:13 AM Robert Haas <robertmhaas@gmail.com> wrote: > Just to be clear, when I say that the dead index tuples don't matter > here, I mean from the point of view of the index. From the point of > view of the table, the presence of dead index tuples (or even the > potential presence of dead tuples) pointing to dead line pointers is > an issue that can drive heap bloat. But from the point of view of the > index, because we don't ever merge sibling index pages, and because we > have kill_prior_tuple, there's not much value in freeing up space in > index pages unless it either prevents a split or lets us free the > whole page. So I agree with Peter that index growth is what really > matters. One small caveat that I'd add is this: heap fragmentation likely makes it harder to avoid page splits in indexes, to a degree. It is arguably one cause of the page splits that do happen in a table like pgbench_tellers, with standard pgbench (and lots of throughput, lots of clients). The tellers (and also branches) primary key tends to double in size in my recent tests (still way better than a 20x increase in size, which is what happened in Postgres 11 and maybe even 13). I think that it might be possible to perfectly preserve the original index size (even with ANALYZE running now and again) by setting heap fill factor very low, maybe 50 or less. This is a minor quibble, though. It still makes sense to think of heap fragmentation as a problem for the heap itself, and not for indexes at all, since the effect I describe is relatively insignificant, and just about impossible to model. The problem really is that the heap pages are failing to hold their original logical rows in place -- the index size issue is more of a symptom than a problem unto itself. > However, I have a concern that Peter's idea to use the previous index > growth to drive future index vacuuming decisions is retrospective > rather than prospective. If the index is growing more than it should > based on the data volume, then evidently we didn't do enough vacuuming > at some point in the past. That's a very valid concern. As you know, the great advantage of retrospectively considering what's not going well (and reducing everything to some highly informative measure like growth in index size) is that it's reliable -- you don't have to understand precisely how things got that way, which is just too complicated to get right. And as you point out, the great disadvantage is that it has already happened -- which might already be too late. More on that later... > It's reasonable to step up our efforts in > the present to make sure that the problem doesn't continue, but in > some sense it's already too late. What we would really like is a > measure that answers the question: is the index going to bloat in the > relatively near future if we don't vacuum it now? I think that the > dead tuple count is trying, however imperfectly, to figure that out. > All other things being equal, the more dead tuples there are in the > index, the more bloat we're going to have later if we don't clean them > out now. One of the key intuitions behind bottom-up index deletion is to treat the state of an index page as a dynamic thing, not a static thing. The information that we take from the page that drives our decisions is very reliable on average, over time, in the aggregate. At the same time, the information is very noisy, and could be wrong in important ways at just about any time.
The fundamental idea was to take advantage of the first property, without ever getting killed by the second property. To me this seems conceptually similar to how one manages risk when playing a game of chance while applying the Kelly criterion. The criterion provides probabilistic certainty on the best strategy to use in a situation where we have known favorable odds, an infinite series of bets, and a personal bankroll. I recently came across a great blog post about it, which gets the idea across well: https://explore.paulbutler.org/bet/ If you scroll down to the bottom of the page, there are some general conclusions, some of which are pretty counterintuitive. Especially "Maximizing expected value can lead to a bad long-run strategy". Growth over time is much more important than anything else, since you can play as many individual games as you like -- provided you never go bankrupt. I think that *a lot* of problems can be usefully analyzed that way, or at least benefit from a general awareness of certain paradoxical aspects of probability. Bankruptcy must be recognized as qualitatively different to any non-zero bankroll. A little bit like how splitting a leaf page unnecessarily is truly special. Once we simply avoid ruin, we can get the benefit of playing a game with favorable odds -- growth over time is what matters, not necessarily our current bankroll. It sounds like a fairly small difference, but that's deceptive -- it's a huge difference. > The problem is not with that core idea, which IMHO is actually good, > but that all other things are NOT equal. ...now to get back to talking about VACUUM itself, and to this project. I couldn't agree more -- all other things are NOT equal. We need a way to manage the risk that things could change quickly when an index that we believe has many dead tuples hasn't grown at all just yet. It's probably also true that we should try to predict the near future, and not 100% rely on the fact that what we've been doing seems to have worked so far -- I do accept that. We should probably dispense with the idea that we'll be making these decisions about what to do with an index like this (bloated in a way that bottom-up index deletion just won't help with) in an environment that is similar to the one in which the current "skip index scan when # heap pages with one or more LP_DEAD items < 2% of rel_pages" thing operates. That mechanism has to be very conservative because we just don't know when the next opportunity to vacuum indexes will be -- we almost have to assume that the decision will be static, and made exactly once, so we better be defensive. But why should that continue to be true with the conveyor belt stuff in place, and with periodic mini-vacuums that coordinate over time? I don't think it has to be like that. We can make it much more dynamic. I can imagine a two-way dialog between the index and vacuumlazy.c that takes place over time. The index AM might be able to report something along the lines of: "While I think that this index has more dead index tuples than it really should, the fact is that it hasn't grown at all, even by one single leaf page. And so you [vacuumlazy.c] should not make me vacuum the index right now. But be careful -- check back in again in another minute or two, because the situation must be assumed to be pretty volatile for now." This is just an example, not a concrete design. My point is that approaching the problem dynamically makes it *vastly* easier to do the right thing.
It's far easier to manage the consequences of being wrong than it is to be right all the time. We're going to be wrong anyway, so better to be wrong on our own terms. > I'm not hung up on using the # of dead tuples specifically as the > metric for index vacuuming, and it may be best to pick some other > measure. But I am a little suspicious that if the only measure is past > index growth, we will let some situations go too far before we wake up > and do something about them. My intuition is that it would be a good > idea to come up with something we could measure, even if it's > imperfect, that would give us some clue that trouble is brewing before > pages actually start splitting. Now maybe my intuition is wrong and > there is nothing better, but I think it's worth a thought. We will need something like that. I think that LP_DEAD items (or would-be LP_DEAD items -- tuples with storage that would get pruned into LP_DEAD items if we were to prune) in the table are much more interesting than dead heap-only tuples, and also more interesting than dead index tuples. Especially the distribution of such LP_DEAD items in the table, and their concentration. That does seem much more likely to be robust as a quantitative driver of index vacuuming. In the extreme case when there are a huge number of LP_DEAD items in the table, we're going to want to make them LP_UNUSED anyway, which implies that we'll do index vacuuming to make it safe. Since that's already going to be true, maybe we should try to find a way to usefully scale the behavior, so that some indexes are vacuumed sooner when the number of LP_DEAD items is increasing. Not really sure, but that seems more promising than anything else. -- Peter Geoghegan
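As a purely hypothetical sketch of what "usefully scale the behavior" might mean: the 2% figure below matches the existing skip-indexes threshold mentioned earlier in the thread, while the scaling rule itself and all the names are invented for illustration.

#include <stdbool.h>
#include <stdint.h>

/*
 * Below the existing 2%-of-rel_pages threshold, don't force index
 * vacuuming at all.  Above it, an index is vacuumed sooner the more
 * concentrated LP_DEAD items are in the heap; "pressure" is whatever
 * AM-specific measure of need the index reports (0.0 .. 1.0).
 */
static bool
index_vacuum_wanted(int64_t rel_pages,
                    int64_t pages_with_lpdead,
                    double pressure)
{
    double lpdead_frac;

    if (rel_pages <= 0)
        return false;
    lpdead_frac = (double) pages_with_lpdead / (double) rel_pages;

    if (lpdead_frac < 0.02)
        return false;

    /* The more of the heap needs line pointers reclaimed, the lower the
     * bar for vacuuming any particular index right now. */
    return pressure >= 1.0 - lpdead_frac;
}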
On Wed, Feb 9, 2022 at 2:27 PM Peter Geoghegan <pg@bowt.ie> wrote: > We should probably dispense with the idea that we'll be making these > decisions about what to do with an index like this (bloated in a way > that bottom-up index deletion just won't help with) in an environment > that is similar to the one in which the current "skip index scan when > # heap pages with one or more LP_DEAD items < 2% of rel_pages" thing > operates. That > mechanism has to be very conservative because we just don't know when > the next opportunity to vacuum indexes will be -- we almost have to > assume that the decision will be static, and made exactly once, so we > better be defensive. But why should that continue to be true with the > conveyor belt stuff in place, and with periodic mini-vacuums that > coordinate over time? I don't think it has to be like that. We can > make it much more dynamic. I'm not sure that we can. I mean, there's still only going to be ~3 autovacuum workers, and there could be arbitrarily many tables. Even if the vacuum load is within the bounds of what the system can sustain, individual tables can't be assured of being visited frequently (or so it seems to me) and it could be that there are actually not enough resources to vacuum everything, and we have to try to cope as best we can. Less unnecessary vacuuming of large indexes can help, of course, but I'm not sure it fundamentally changes the calculus. > We will need something like that. I think that LP_DEAD items (or > would-be LP_DEAD items -- tuples with storage that would get pruned > into LP_DEAD items if we were to prune) in the table are much more > interesting than dead heap-only tuples, and also more interesting than > dead index tuples. Especially the distribution of such LP_DEAD items > in the table, and their concentration. That does seem much more likely > to be robust as a quantitative driver of index vacuuming. Hmm... why would the answer have to do with dead items in the heap? I was thinking along the lines of trying to figure out either a more reliable count of dead tuples in the index, subtracting out whatever we save by kill_prior_tuple and bottom-up vacuuming; or else maybe a count of the subset of dead tuples that are likely not to get opportunistically pruned in one way or another, if there's some way to guess that. Or maybe something where when we see an index page filling up we try to figure out (or guess) that it's close to really needing a split - i.e. that it's not full of tuples that we could just junk to make space - and notice how often that's happening. I realize I'm hand-waving, but if the property is a property of the heap rather than the index, how will different indexes get different treatment? -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Feb 9, 2022 at 1:41 PM Robert Haas <robertmhaas@gmail.com> wrote: > I'm not sure that we can. I mean, there's still only going to be ~3 > autovacuum workers, and there could be arbitrarily many tables. Even > if the vacuum load is within the bounds of what the system can > sustain, individual tables can't be assured of being visited > frequently (or so it seems to me) and it could be that there are > actually not enough resources to vacuum and have to try to cope as > best we can. Less unnecessary vacuuming of large indexes can help, of > course, but I'm not sure it fundamentally changes the calculus. You seem to be vastly underestimating the value in being able to spread out and reschedule the work, and manage costs more generally. If you can multiplex autovacuum workers across tables, by splitting up work across a table's index over time, then it might not matter at all that you only have 3 workers. If you can spread out the work over time, then you make things much cheaper (fewer FPIs by aligning to checkpoint boundaries). And, because you have a schedule that can be dynamically updated, you get to update your global view of the world (not just one table) before you've fully committed to it -- if you provisionally say that you think that a certain index won't need to be vacuumed for a long time, that isn't the last word anymore. Costs are paid by the whole system, but benefits only go to individual tables and indexes. Being able to manage costs over time with a sense of the benefits, and a sense of high level priorities will be *huge* for us. Managing debt at the level of the entire system (not just one table or index) is also really important. (Though maybe we should just focus on the v1, just because that's what is needed right now.) > > We will need something like that. I think that LP_DEAD items (or > > would-be LP_DEAD items -- tuples with storage that would get pruned > > into LP_DEAD items if we were to prune) in the table are much more > > interesting than dead heap-only tuples, and also more interesting that > > dead index tuples. Especially the distribution of such LP_DEAD items > > in the table, and their concentration. That does seem much more likely > > to be robust as a quantitative driver of index vacuuming. > > Hmm... why would the answer have to do with dead items in the heap? We're eventually going to have to make the LP_DEAD items LP_UNUSED anyway here. So we might as well get started on that, with the index that we *also* think is the one that might need it the most, for its own reasons. We're making a decision on the basis of multiple factors, knowing that in the worst case (when the index really didn't need anything at all) we will have at least had the benefit of doing some actually-useful work sooner rather than later. We should probably consider multiple reasons to do any unit of work. > I was thinking along the lines of trying to figure out either a more > reliable count of dead tuples in the index, subtracting out whatever > we save by kill_prior_tuple and bottom-up vacuuming; or else maybe a > count of the subset of dead tuples that are likely not to get > opportunistically pruned in one way or another, if there's some way to > guess that. I don't know how to build something like that, since that works by understanding what's working, not by noticing that some existing strategy plainly isn't working. The only positive information that I have confidence in is the extreme case where you have zero index growth. 
Which is certainly possible, but perhaps not that interesting with a real workload. There are emergent behaviors with bottom-up deletion. Purely useful behaviors, as far as I know, but still very hard to precisely nail down. For example, Victor Yegorov came up with an adversarial benchmark [1] that showed that the technique dealt with index bloat from queue-like inserts and deletes that recycled the same distinct key values over time, since they happened to be mixed with non-hot updates. It dealt very well with that, even though *I had no clue* that it would work *at all*, and might have even incorrectly predicted the opposite if Victor had asked about it in advance. > I realize I'm > hand-waving, but if the property is a property of the heap rather than > the index, how will different indexes get different treatment? Maybe by making the primary key growth an indicator of what is reasonable for the other indexes (or other B-Tree indexes) -- it has a natural tendency to be the least bloated possible index. If you have something like a GiST index, or if you have a B-Tree index that constantly gets non-HOT updates that logically modify an indexed column, then it should become reasonably obvious. Maybe there'd be some kind of feedback behavior to lock in "bloat prone index" for a time. If we can bring costs into it too (e.g., spreading out the burden of index vacuuming over time), then it becomes acceptable to incorrectly determine which index needed special attention. We will still remember that that one index has been vacuumed up to a certain point, which is still useful -- that work would have to have been completed either way, so it's really no real loss. Plus we've spread the burden out over time, which is always useful. The cost control stuff could easily more than make up for the fact that we don't have a mythical perfect model that always knows exactly what to do, when, based on the needs of indexes. I think that expanding the scope to cover cost management actually makes this project easier, not harder. Costs really matter, and are much easier to understand. Cost control makes it okay to guess about benefits for the index/queries and be wrong. [1] https://www.postgresql.org/message-id/CAGnEbogATZS1mWMVX8FzZHMXzuDEcb10AnVwwhCtXtiBpg3XLQ@mail.gmail.com -- Peter Geoghegan
On Wed, Feb 9, 2022 at 7:43 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Feb 9, 2022 at 1:18 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I think that dead index tuples really don't matter if they're going to > get removed anyway before a page split happens. In particular, if > we're going to do a bottom-up index deletion pass before splitting the > page, then who cares if there are going to be dead tuples around until > then? You might think that they'd have the unfortunate effect of > slowing down scans, and they could slow down ONE scan, but if they do, > then I think kill_prior_tuple will hint them dead and they won't > matter any more. Actually I was not worried about the scan getting slow. What I was worried about is that if we keep ignoring the dead tuples for a long time, then in the worst case we might have a huge number of dead tuples in the index, maybe 80% to 90%, and then suddenly get a lot of insertions for keys which cannot use bottom-up deletion (due to the key range). Now we have a lot of pages which contain only dead tuples, but we will still allocate new pages, because we ignored the dead tuple % and did not vacuum for a long time. In short, I am worried about creating sudden bloat in the index due to a lot of existing dead tuples. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Feb 9, 2022 at 6:18 PM Peter Geoghegan <pg@bowt.ie> wrote: > You seem to be vastly underestimating the value in being able to > spread out and reschedule the work, and manage costs more generally. Hmm. I think you're vastly overestimating the extent to which it's possible to spread out and reschedule the work. I don't know which of us is wrong. From my point of view, if VACUUM is going to do a full phase 1 heap pass and a full phase 2 heap pass on either side of whatever index work it does, there is no way that things are going to get that much more dynamic than they are today. And even if we didn't do that, in order to make any progress setting LP_DEAD pointers to LP_UNUSED, you have to vacuum the entire index, which might be BIG. It would be great to have a lot of granularity here but it doesn't seem achievable. > > I was thinking along the lines of trying to figure out either a more > > reliable count of dead tuples in the index, subtracting out whatever > > we save by kill_prior_tuple and bottom-up vacuuming; or else maybe a > > count of the subset of dead tuples that are likely not to get > > opportunistically pruned in one way or another, if there's some way to > > guess that. > > I don't know how to build something like that, since that works by > understanding what's working, not by noticing that some existing > strategy plainly isn't working. The only positive information that I have > confidence in is the extreme case where you have zero index growth. > Which is certainly possible, but perhaps not that interesting with a > real workload. > > There are emergent behaviors with bottom-up deletion. Purely useful > behaviors, as far as I know, but still very hard to precisely nail > down. For example, Victor Yegorov came up with an adversarial > benchmark [1] that showed that the technique dealt with index bloat > from queue-like inserts and deletes that recycled the same distinct > key values over time, since they happened to be mixed with non-hot > updates. It dealt very well with that, even though *I had no clue* > that it would work *at all*, and might have even incorrectly predicted > the opposite if Victor had asked about it in advance. I don't understand what your point is in these two paragraphs. I'm just arguing that, if a raw dead tuple count is meaningless because a lot of them are going to disappear harmlessly with or without vacuum, it's reasonable to try to get around that problem by counting the subset of dead tuples where that isn't true. I agree that it's unclear how to do that, but that doesn't mean that it can't be done. > > I realize I'm > > hand-waving, but if the property is a property of the heap rather than > > the index, how will different indexes get different treatment? > > Maybe by making the primary key growth an indicator of what is > reasonable for the other indexes (or other B-Tree indexes) -- it has a > natural tendency to be the least bloated possible index. If you have > something like a GiST index, or if you have a B-Tree index that > constantly gets non-HOT updates that logically modify an indexed > column, then it should become reasonably obvious. Maybe there'd be > some kind of feedback behavior to lock in "bloat prone index" for a > time. I have the same concern about this as what I mentioned before: it's purely retrospective. Therefore in my mind it's a very reasonable choice for a backstop, but a somewhat unsatisfying choice for a primary mechanism. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Feb 10, 2022 at 3:10 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Actually I was not worried about the scan getting slow. What I was > worried about is that if we keep ignoring the dead tuples for a long > time, then in the worst case we might have a huge number of dead > tuples in the index, maybe 80% to 90%, and then suddenly get a lot of > insertions for keys which cannot use bottom-up deletion (due to the key > range). Now we have a lot of pages which contain only dead tuples, but > we will still allocate new pages, because we ignored the dead tuple % > and did not vacuum for a long time. It seems like a reasonable concern to me ... and I think it's somewhat related to my comments about trying to distinguish which dead tuples matter vs. which don't. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Feb 10, 2022 at 11:16 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Feb 10, 2022 at 3:10 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Actually I was not worried about the scan getting slow. What I was > > worried about is that if we keep ignoring the dead tuples for a long > > time, then in the worst case we might have a huge number of dead > > tuples in the index, maybe 80% to 90%, and then suddenly get a lot of > > insertions for keys which cannot use bottom-up deletion (due to the key > > range). Now we have a lot of pages which contain only dead tuples, but > > we will still allocate new pages, because we ignored the dead tuple % > > and did not vacuum for a long time. > > It seems like a reasonable concern to me ... and I think it's somewhat > related to my comments about trying to distinguish which dead tuples > matter vs. which don't. It's definitely a reasonable concern. But once you find yourself in this situation, *every* index will need to be vacuumed anyway, pretty much as soon as possible. There will be many LP_DEAD items in the heap, which will be enough to force index vacuuming of all indexes. -- Peter Geoghegan
On Thu, Feb 10, 2022 at 11:14 AM Robert Haas <robertmhaas@gmail.com> wrote: > Hmm. I think you're vastly overestimating the extent to which it's > possible to spread out and reschedule the work. I don't know which of > us is wrong. From my point of view, if VACUUM is going to do a full > phase 1 heap pass and a full phase 2 heap pass on either side of > whatever index work it does, there is no way that things are going to > get that much more dynamic than they are today. Waiting to vacuum each index until the next VACUUM operation on the table gives us more TIDs to remove when we do go to vacuum one of these large indexes. Making decisions dynamically seems very promising because it gives us the most flexibility. In principle the workload might not allow for any of that, but in practice I think that it will. > I don't understand what your point is in these two paragraphs. I'm > just arguing that, if a raw dead tuple count is meaningless because a > lot of them are going to disappear harmlessly with or without vacuum, > it's reasonable to try to get around that problem by counting the > subset of dead tuples where that isn't true. I agree that it's unclear > how to do that, but that doesn't mean that it can't be done. VACUUM is a participant in the system -- it sees how physical relations are affected by the workload, but it also sees how physical relations are affected by previous VACUUM operations. If it goes to VACUUM an index on the basis of a relatively small difference (that might just be noise), and does so systematically and consistently, that might have unintended consequences. In particular, we might do the wrong thing, again and again, because we're overinterpreting noise again and again. > I have the same concern about this as what I mentioned before: it's > purely retrospective. Therefore in my mind it's a very reasonable > choice for a backstop, but a somewhat unsatisfying choice for a > primary mechanism. I'm not saying that it's impossible or even unreasonable to do something based on the current or anticipated state of the index, exactly. Just that you have to be realistic about how accurate that model is going to be in practice. In practice it'll be quite noisy, and that must be accounted for. For example, we could deliberately coarsen the information, so that only relatively large differences in apparent-bloatedness are visible to the model. The other thing is that VACUUM itself cannot be expected to operate with all that much precision, just because of how it works at a high level. Any quantitative measure will only be meaningful as a way of prioritizing work. That is going to be far easier if we make the behavior dynamic and continually reassess. Once a relatively large difference between two indexes first emerges, we can be relatively confident about what to do. But smaller differences are likely just noise. -- Peter Geoghegan