Thread: should vacuum's first heap pass be read-only?

should vacuum's first heap pass be read-only?

From
Robert Haas
Date:
VACUUM's first pass over the heap is implemented by a function called
lazy_scan_heap(), while the second pass is implemented by a function
called lazy_vacuum_heap_rel(). This seems to imply that the first pass
is primarily an examination of what is present, while the second pass
does the real work. This used to be more true than it now is. In
PostgreSQL 7.2, the first release that implemented concurrent vacuum,
the first heap pass could set hint bits as a side effect of calling
HeapTupleSatisfiesVacuum(), and it could freeze old xmins. However,
neither of those things wrote WAL, and you had a reasonable chance of
escaping without dirtying the page at all. By the time PostgreSQL 8.2
was released, it had been understood that making critical changes to
pages without writing WAL was not a good plan, and so freezing now
wrote WAL, but no big deal: most vacuums wouldn't freeze anything
anyway. Things really changed a lot in PostgreSQL 8.3. With the
addition of HOT, lazy_scan_heap() was made to prune the page, meaning
that the first heap pass would likely dirty a large fraction of the
pages that it touched, truncating dead tuples to line pointers and
defragmenting the page. The second heap pass would then have to dirty
the page again to mark dead line pointers unused. In the absolute
worst case, that's a very large increase in WAL generation. VACUUM
could write full page images for all of those pages while HOT-pruning
them, and then a checkpoint could happen, and then VACUUM could write
full-page images of all of them again while marking the dead line
pointers unused. I don't know whether anyone spent time and energy
worrying about this problem, but considering how much HOT improves
performance overall, it would be entirely understandable if this
didn't seem like a terribly important thing to worry about.

But maybe we should reconsider. What benefit do we get out of dirtying
the page twice like this, writing WAL each time? What if we went back
to the idea of having the first heap pass be read-only? In fact, I'm
thinking we might want to go even further and try to prevent even hint
bit changes to the page during the first pass, especially because now
we have checksums and wal_log_hints. If our vacuum cost settings are
to be believed (and I am not sure that they are), dirtying a page is 10
times as expensive as reading one from the disk. So on a large table,
we're paying 44 vacuum cost units per heap page vacuumed twice, when
we could be paying only 24 such cost units. What a bargain! The
downside is that we would be postponing, perhaps substantially, the
work that can be done immediately, namely freeing up space in the page
and updating the free space map. The former doesn't seem like a big
loss, because it can be done by anyone who visits the page anyway, and
skipped if nobody does. The latter might be a loss, because getting
the page into the freespace map sooner could prevent bloat by allowing
space to be recycled sooner. I'm not sure how serious a problem this
is. I'm curious what other people think. Would it be worth the delay
in getting pages into the FSM if it means we dirty the pages only
once? Could we have our cake and eat it too by updating the FSM with
the amount of free space that the page WOULD have if we pruned it, but
not actually do so?
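
To spell out the arithmetic behind those 44-vs-24 numbers (I'm assuming
the current default cost settings here, vacuum_cost_page_miss = 2 and
vacuum_cost_page_dirty = 20, quoting those from memory):

/*
 * Two-pass behavior today: the page is read and dirtied in each pass.
 *   pass 1 (prune):        miss (2) + dirty (20) = 22
 *   pass 2 (mark unused):  miss (2) + dirty (20) = 22
 *                                          total = 44
 *
 * Read-only first pass: the page is read twice but dirtied only once.
 *   pass 1 (read only):    miss (2)              =  2
 *   pass 2 (mark unused):  miss (2) + dirty (20) = 22
 *                                          total = 24
 */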

I'm thinking about this because of the "decoupling table and index
vacuuming" thread, which I was discussing with Dilip this morning. In
a world where table vacuuming and index vacuuming are decoupled, it
feels like we want to have only one kind of heap vacuum. It pushes us
in the direction of unifying the first and second pass, and doing all
the cleanup work at once. However, I don't know that we want to use
the approach described there in all cases. For a small table that is,
let's just say, not part of any partitioning hierarchy, I'm not sure
that using the conveyor belt approach makes a lot of sense, because
the total amount of work we want to do is so small that we should just
get it over with and not clutter up the disk with more conveyor belt
forks -- especially for people who have large numbers of small tables,
the inode consumption could be a real issue. And we won't really save
anything either. The value of decoupling operations has to do with
improving concurrency and error recovery and allowing global indexes
and a bunch of stuff that, for a small table, simply doesn't matter.
So it would be nice to fall back to an approach more like what we do
now. But then you end up with two fairly distinct code paths, one
where you want the heap phases combined and another where you want
them separated. If the first pass were a strictly read-only pass, you
could do that if there's no conveyor belt, or else read from the
conveyor belt if there is one, and then the phase where you dirty the
heap looks about the same either way.

Aside from the question of whether this is a good idea at all, I'm
also wondering what kinds of experiments we could run to try to find
out. What would be a best case workload for the current strategy vs.
this? What would be a worst case for the current strategy vs. this?
I'm not sure. If you have ideas, I'd love to hear them.

Thanks,

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: should vacuum's first heap pass be read-only?

From
Peter Geoghegan
Date:
On Thu, Feb 3, 2022 at 12:20 PM Robert Haas <robertmhaas@gmail.com> wrote:
> But maybe we should reconsider. What benefit do we get out of dirtying
> the page twice like this, writing WAL each time? What if we went back
> to the idea of having the first heap pass be read-only?

What about recovery conflicts? Index vacuuming WAL records don't
require their own latestRemovedXid field, since they can rely on
earlier XLOG_HEAP2_PRUNE records instead. Since the TIDs that index
vacuuming removes always point to LP_DEAD items in the heap, it's safe
to lean on that.

> In fact, I'm
> thinking we might want to go even further and try to prevent even hint
> bit changes to the page during the first pass, especially because now
> we have checksums and wal_log_hints. If our vacuum cost settings are
> to be believed (and I am not sure that they are), dirtying a page is 10
> times as expensive as reading one from the disk. So on a large table,
> we're paying 44 vacuum cost units per heap page vacuumed twice, when
> we could be paying only 24 such cost units. What a bargain!

In practice HOT generally works well enough that the number of heap
pages that prune significantly exceeds the subset that are also
vacuumed during the second pass over the heap -- at least when heap
fill factor has been tuned (which might be rare). The latter category
of pages is not reported on by the enhanced autovacuum logging added
to Postgres 14, so you might be able to get some sense of how this
works by looking at that.

> Could we have our cake and eat it too by updating the FSM with
> the amount of free space that the page WOULD have if we pruned it, but
> not actually do so?

Did you ever notice that VACUUM records free space after *it* prunes,
using its own horizon? With a long running VACUUM operation, where
unremoved "recently dead" tuples are common, it's possible that the
amount of free space that's effectively available (available to every
other backend) is significantly higher. And so we already record
"subjective amounts of free space" -- though not necessarily by
design.

> I'm thinking about this because of the "decoupling table and index
> vacuuming" thread, which I was discussing with Dilip this morning. In
> a world where table vacuuming and index vacuuming are decoupled, it
> feels like we want to have only one kind of heap vacuum. It pushes us
> in the direction of unifying the first and second pass, and doing all
> the cleanup work at once. However, I don't know that we want to use
> the approach described there in all cases. For a small table that is,
> let's just say, not part of any partitioning hierarchy, I'm not sure
> that using the conveyor belt approach makes a lot of sense, because
> the total amount of work we want to do is so small that we should just
> get it over with and not clutter up the disk with more conveyor belt
> forks -- especially for people who have large numbers of small tables,
> the inode consumption could be a real issue.

I'm not sure that what you're proposing here is the best way to go
about it, but let's assume for a moment that it is. Can't you just
simulate the conveyor belt approach, without needing a relation fork?
Just store the same information in memory, accessed using the same
interface, with a spillover path?

Ideally VACUUM will be able to use the conveyor belt for any table.
Whether or not it actually happens should be decided at the latest
possible point during VACUUM, based on considerations about the actual
number of dead items that we now need to remove from indexes, as well
as metadata from any preexisting conveyor belt.

-- 
Peter Geoghegan



Re: should vacuum's first heap pass be read-only?

From
Greg Stark
Date:
On Thu, 3 Feb 2022 at 12:21, Robert Haas <robertmhaas@gmail.com> wrote:
>
> VACUUM's first pass over the heap is implemented by a function called
> lazy_scan_heap(), while the second pass is implemented by a function
> called lazy_vacuum_heap_rel(). This seems to imply that the first pass
> is primarily an examination of what is present, while the second pass
> does the real work. This used to be more true than it now is.

I've been out of touch for a while but I'm trying to catch up with the
progress of the past few years.

Whatever happened to the idea to "rotate" the work of vacuum. So all
the work of the second pass would actually be deferred until the first
pass of the next vacuum cycle.

That would also have the effect of eliminating the duplicate work,
both the writes with the WAL generation as well as the actual scan.
The only heap scan would be "remove line pointers previously cleaned
from indexes and prune dead tuples recording them to clean from
indexes in future". The index scan would remove line pointers and
record them to be removed from the heap in a future heap scan.

The downside would mainly be in the latency before the actual tuples
get cleaned up from the table. That is not so much of an issue as far
as space these days with tuple pruning but is more and more of an
issue with xid wraparound. Also, having to record the line pointers
that have been cleaned from indexes somewhere on disk for the
subsequent vacuum would be extra state on disk and we've learned that
means extra complexity.

-- 
greg



Re: should vacuum's first heap pass be read-only?

From
Robert Haas
Date:
On Fri, Feb 4, 2022 at 2:16 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Feb 3, 2022 at 12:20 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > But maybe we should reconsider. What benefit do we get out of dirtying
> > the page twice like this, writing WAL each time? What if we went back
> > to the idea of having the first heap pass be read-only?
>
> What about recovery conflicts? Index vacuuming WAL records don't
> require their own latestRemovedXid field, since they can rely on
> earlier XLOG_HEAP2_PRUNE records instead. Since the TIDs that index
> vacuuming removes always point to LP_DEAD items in the heap, it's safe
> to lean on that.

Oh, that's an interesting consideration.

> > In fact, I'm
> > thinking we might want to go even further and try to prevent even hint
> > bit changes to the page during the first pass, especially because now
> > we have checksums and wal_log_hints. If our vacuum cost settings are
> > to be believed (and I am not sure that they are), dirtying a page is 10
> > times as expensive as reading one from the disk. So on a large table,
> > we're paying 44 vacuum cost units per heap page vacuumed twice, when
> > we could be paying only 24 such cost units. What a bargain!
>
> In practice HOT generally works well enough that the number of heap
> pages that prune significantly exceeds the subset that are also
> vacuumed during the second pass over the heap -- at least when heap
> fill factor has been tuned (which might be rare). The latter category
> of pages is not reported on by the enhanced autovacuum logging added
> to Postgres 14, so you might be able to get some sense of how this
> works by looking at that.

Is there an extra "not" in this sentence? Because otherwise it seems
like you're saying that I should look at the information that isn't
reported, which seems hard.

In any case, I think this might be a death knell for the whole idea.
It might be good to cut down the number of page writes by avoiding
writing them twice -- but not at the expense of having the second pass
have to visit a large number of pages it could otherwise skip. I
suppose we could write only those pages in the first pass that we
aren't going to need to write again later, but at that point I can't
really see that we're winning anything.

> > Could we have our cake and eat it too by updating the FSM with
> > the amount of free space that the page WOULD have if we pruned it, but
> > not actually do so?
>
> Did you ever notice that VACUUM records free space after *it* prunes,
> using its own horizon? With a long running VACUUM operation, where
> unremoved "recently dead" tuples are common, it's possible that the
> amount of free space that's effectively available (available to every
> other backend) is significantly higher. And so we already record
> "subjective amounts of free space" -- though not necessarily by
> design.

Yes, I wondered about that. It seems like maybe a running VACUUM
should periodically refresh its notion of what cutoff to use.

> I'm not sure that what you're proposing here is the best way to go
> about it, but let's assume for a moment that it is. Can't you just
> simulate the conveyor belt approach, without needing a relation fork?
> Just store the same information in memory, accessed using the same
> interface, with a spillover path?

(I'm not sure it's best either.)

I think my concern here is about not having too many different code
paths from heap vacuuming. I agree that if we're going to vacuum
without an on-disk conveyor belt we can use an in-memory substitute.
However, to Greg's point, if we're using the conveyor belt, it seems
like we want to merge the second pass of one VACUUM into the first
pass of the next one. That is, if we start up a heap vacuum already
having a list of TIDs that can be marked unused, we want to do that
during the same pass of the heap that we prune and search for
newly-discovered dead TIDs. But we can't do that in the case where the
conveyor belt is only simulated, because our in-memory data structure
can't contain leftovers from a previous vacuum the way the on-disk
conveyor belt can. So it seems like the whole algorithm has to be
different. I'd like to find a way to avoid that.

If this isn't entirely making sense, it may well be because I'm a
little fuzzy on all of it myself. But I hope it's clear enough that
you can figure out what it is that I'm worrying about. If not, I'll
keep trying to explain until we both reach a sufficiently non-fuzzy
state.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: should vacuum's first heap pass be read-only?

From
Robert Haas
Date:
On Fri, Feb 4, 2022 at 3:05 PM Greg Stark <stark@mit.edu> wrote:
> Whatever happened to the idea to "rotate" the work of vacuum. So all
> the work of the second pass would actually be deferred until the first
> pass of the next vacuum cycle.
>
> That would also have the effect of eliminating the duplicate work,
> both the writes with the WAL generation as well as the actual scan.
> The only heap scan would be "remove line pointers previously cleaned
> from indexes and prune dead tuples recording them to clean from
> indexes in future". The index scan would remove line pointers and
> record them to be removed from the heap in a future heap scan.

I vaguely remember previous discussions of this, but only vaguely, so
if there are threads on list feel free to send pointers. It seems to
me that in order to do this, we'd need some kind of way of storing the
TIDs that were found to be dead in one VACUUM so that they can be
marked unused in the next VACUUM - and the conveyor belt patches on
which Dilip's work is based provide exactly that machinery, which his
patches then leverage to do exactly that thing. But it feels like a
big, sudden change from the way things work now, and I'm trying to
think of ways to make it more incremental, and thus hopefully less
risky.

> The downside would mainly be in the latency before the actual tuples
> get cleaned up from the table. That is not so much of an issue as far
> as space these days with tuple pruning but is more and more of an
> issue with xid wraparound. Also, having to record the line pointers
> that have been cleaned from indexes somewhere on disk for the
> subsequent vacuum would be extra state on disk and we've learned that
> means extra complexity.

I don't think there's any XID wraparound issue here. Phase 1 does a
HOT prune, after which only dead line pointers remain, not dead
tuples. And those contain no XIDs. Phase 2 is only setting those dead
line pointers back to unused.
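
For reference, a dead line pointer is just the 4-byte ItemIdData with
lp_flags = LP_DEAD and no tuple storage, so there is no xmin or xmax
anywhere to worry about (paraphrasing storage/itemid.h from memory):

typedef struct ItemIdData
{
    unsigned    lp_off:15,      /* offset to tuple (from start of page) */
                lp_flags:2,     /* state of line pointer, see below */
                lp_len:15;      /* byte length of tuple */
} ItemIdData;

#define LP_UNUSED   0   /* unused (should always have lp_len=0) */
#define LP_NORMAL   1   /* used (should always have lp_len>0) */
#define LP_REDIRECT 2   /* HOT redirect (should have lp_len=0) */
#define LP_DEAD     3   /* dead, may or may not have storage */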

As for the other part, that's pretty much exactly the complexity that
I'm worrying about.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: should vacuum's first heap pass be read-only?

From
Peter Geoghegan
Date:
On Fri, Feb 4, 2022 at 3:18 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > What about recovery conflicts? Index vacuuming WAL records don't
> > require their own latestRemovedXid field, since they can rely on
> > earlier XLOG_HEAP2_PRUNE records instead. Since the TIDs that index
> > vacuuming removes always point to LP_DEAD items in the heap, it's safe
> > to lean on that.
>
> Oh, that's an interesting consideration.

You'd pretty much have to do "fake pruning", performing the same
computation as pruning without actually pruning.

> > In practice HOT generally works well enough that the number of heap
> > pages that prune significantly exceeds the subset that are also
> > vacuumed during the second pass over the heap -- at least when heap
> > fill factor has been tuned (which might be rare). The latter category
> > of pages is not reported on by the enhanced autovacuum logging added
> > to Postgres 14, so you might be able to get some sense of how this
> > works by looking at that.
>
> Is there an extra "not" in this sentence? Because otherwise it seems
> like you're saying that I should look at the information that isn't
> reported, which seems hard.

Sorry, yes. I meant "now" (as in, as of Postgres 14).

> In any case, I think this might be a death knell for the whole idea.
> It might be good to cut down the number of page writes by avoiding
> writing them twice -- but not at the expense of having the second pass
> have to visit a large number of pages it could otherwise skip. I
> suppose we could write only those pages in the first pass that we
> aren't going to need to write again later, but at that point I can't
> really see that we're winning anything.

Right. I think that we *can* be more aggressive about deferring heap
page vacuuming until another VACUUM operation with the conveyor belt
stuff. You may well end up getting almost the same benefit that way.

> Yes, I wondered about that. It seems like maybe a running VACUUM
> should periodically refresh its notion of what cutoff to use.

Yeah, Andres said something about this a few months ago. Shouldn't be
very difficult.

> I think my concern here is about not having too many different code
> paths from heap vacuuming. I agree that if we're going to vacuum
> without an on-disk conveyor belt we can use an in-memory substitute.

Avoiding special cases in vacuumlazy.c seems really important to me.

> However, to Greg's point, if we're using the conveyor belt, it seems
> like we want to merge the second pass of one VACUUM into the first
> pass of the next one.

But it's only going to be safe to do that with those dead TIDs (or
distinct generations of dead TIDs) that are known to already be
removed from all indexes, including indexes that have the least need
for vacuuming (often no direct need at all). I had imagined that we'd
want to do heap vacuuming in the same way as today with the dead TID
conveyor belt stuff -- it just might take several VACUUM operations
until we are ready to do a round of heap vacuuming.

For those indexes that use bottom-up index deletion effectively, the
index structure itself never really needs to be vacuumed to avoid
index bloat. We must nevertheless vacuum these indexes at some point,
just to be able to vacuum heap pages with LP_DEAD items safely.

Overall, I think that there will typically be stark differences among
indexes on the table, in terms of how much vacuuming each index
requires. And so the thing that drives us to perform heap vacuuming
will probably be heap vacuuming itself, and not the fact that each and
every index has become "sufficiently bloated".

> If this isn't entirely making sense, it may well be because I'm a
> little fuzzy on all of it myself.

I'm in no position to judge.   :-)

-- 
Peter Geoghegan



Re: should vacuum's first heap pass be read-only?

From
Robert Haas
Date:
On Fri, Feb 4, 2022 at 4:12 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I had imagined that we'd
> want to do heap vacuuming in the same way as today with the dead TID
> conveyor belt stuff -- it just might take several VACUUM operations
> until we are ready to do a round of heap vacuuming.

I am trying to understand exactly what you are imagining here. Do you
mean we'd continue to lazy_scan_heap() at the start of every vacuum,
and lazy_vacuum_heap_rel() at the end? I had assumed that we didn't
want to do that, because we might already know from the conveyor belt
that there are some dead TIDs that could be marked unused, and it
seems strange to just ignore that knowledge at a time when we're
scanning the heap anyway. However, on reflection, that approach has
something to recommend it, because it would be somewhat simpler to
understand what's actually being changed. We could just:

1. Teach lazy_scan_heap() that it should add TIDs to the conveyor
belt, if we're using one, unless they're already there, but otherwise
work as today.

2. Teach lazy_vacuum_heap_rel() that, if there is a conveyor belt,
it should try to clear from the indexes all of the dead TIDs that are
eligible.

3. If there is a conveyor belt, use some kind of magic to decide when
to skip vacuuming some or all indexes. When we skip one or more
indexes, the subsequent lazy_vacuum_heap_rel() can't possibly mark as
unused any of the dead TIDs we found this time, so we should just skip
it, unless somehow there are TIDs on the conveyor belt that were
already ready to be marked unused at the start of this VACUUM, in
which case we can still handle those.
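
Very roughly -- and this is only a sketch, with simplified argument
lists and hypothetical conveyor-belt helpers that don't exist today --
I'm imagining a shape like this:

/* Hypothetical sketch of steps 1-3 above; not real vacuumlazy.c code. */
static void
vacuum_rel_with_conveyor(LVRelState *vacrel)
{
    /* (1) Prune as today, but also append newly discovered dead TIDs
     * to the conveyor belt, if there is one, unless already present. */
    lazy_scan_heap(vacrel);

    /* (3) "Magic": decide per index whether vacuuming it now is worth
     * the cost; skip the ones that aren't. */
    for (int i = 0; i < vacrel->nindexes; i++)
    {
        if (index_vacuum_worthwhile(vacrel, i))          /* hypothetical */
            vacuum_one_index_from_conveyor(vacrel, i);   /* hypothetical */
    }

    /* (2) Mark LP_DEAD items LP_UNUSED, but only for TIDs that the
     * conveyor belt shows as removed from *every* index -- which may
     * be none of the TIDs found this time if we skipped indexes, and
     * may include leftovers from earlier VACUUMs. */
    if (conveyor_has_fully_indexed_tids(vacrel))         /* hypothetical */
        lazy_vacuum_heap_rel(vacrel);
}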

Is this the kind of thing you had in mind?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: should vacuum's first heap pass be read-only?

From
Peter Geoghegan
Date:
Yes, that's what I meant. That's always how I thought that it would work, for over a year now. I might have jumped to the conclusion that that's what you had in mind all along. Oops.

Although this design is simpler, which is an advantage, that's not really the point. The point is that it makes sense, and that doing extra heap vacuuming concurrently with pruning doesn't seem useful at all.
--
Peter Geoghegan

Re: should vacuum's first heap pass be read-only?

From
Dilip Kumar
Date:
On Mon, Feb 7, 2022 at 10:06 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Feb 4, 2022 at 4:12 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > I had imagined that we'd
> > want to do heap vacuuming in the same way as today with the dead TID
> > conveyor belt stuff -- it just might take several VACUUM operations
> > until we are ready to do a round of heap vacuuming.
>
> I am trying to understand exactly what you are imagining here. Do you
> mean we'd continue to lazy_scan_heap() at the start of every vacuum,
> and lazy_vacuum_heap_rel() at the end? I had assumed that we didn't
> want to do that, because we might already know from the conveyor belt
> that there are some dead TIDs that could be marked unused, and it
> seems strange to just ignore that knowledge at a time when we're
> scanning the heap anyway. However, on reflection, that approach has
> something to recommend it, because it would be somewhat simpler to
> understand what's actually being changed. We could just:
>
> 1. Teach lazy_scan_heap() that it should add TIDs to the conveyor
> belt, if we're using one, unless they're already there, but otherwise
> work as today.
>
> 2. Teach lazy_vacuum_heap_rel() that, if there is a conveyor belt,
> it should try to clear from the indexes all of the dead TIDs that are
> eligible.
>
> 3. If there is a conveyor belt, use some kind of magic to decide when
> to skip vacuuming some or all indexes. When we skip one or more
> indexes, the subsequent lazy_vacuum_heap_rel() can't possibly mark as
> unused any of the dead TIDs we found this time, so we should just skip
> it, unless somehow there are TIDs on the conveyor belt that were
> already ready to be marked unused at the start of this VACUUM, in
> which case we can still handle those.

Based on this discussion, IIUC, we are saying that now we will do the
lazy_scan_heap every time like we are doing now.  And we will
conditionally skip the index vacuum for all or some of the indexes and
then based on how much index vacuum is done we will conditionally do
the lazy_vacuum_heap_rel().  Is my understanding correct?

IMHO, if we are doing the heap scan every time and then we are going
to get the same dead items again which we had previously collected in
the conveyor belt.  I agree that we will not add them again into the
conveyor belt but why do we want to store them in the conveyor belt
when we want to redo the whole scanning again?

I think (without global indexes) the main advantage of using the
conveyor belt is that if we skip the index scan for some of the
indexes then we can save the dead item somewhere so that without
scanning the heap again we have those dead items to do the index
vacuum sometime in future but if you are going to rescan the heap
again next time before doing any index vacuuming then why we want to
store them anyway.

IMHO, what we should do is: if there are not many new dead tuples in
the heap (the total dead tuples based on the statistics, minus the
items already in the conveyor belt), then we should conditionally skip
the heap scanning (first pass) and directly jump to the index
vacuuming for some or all of the indexes, based on how bloated each
index is.
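
In pseudocode, the kind of pre-check I have in mind would be something
like this (purely illustrative -- the statistics lookup and the
conveyor belt functions here are made up):

/* Illustrative only: decide whether the first heap pass can be skipped. */
static bool
can_skip_first_heap_pass(Relation rel, ConveyorBelt *cb)
{
    double  dead_per_stats = stats_estimated_dead_tuples(rel);  /* made up */
    double  already_in_cb  = conveyor_belt_num_tids(cb);        /* made up */
    double  new_dead       = dead_per_stats - already_in_cb;

    /*
     * If nearly all of the dead tuples are already remembered in the
     * conveyor belt, there is little to gain from another heap scan;
     * go straight to vacuuming whichever indexes look bloated.
     */
    return new_dead < SKIP_SCAN_THRESHOLD;      /* some tunable threshold */
}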

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: should vacuum's first heap pass be read-only?

From
Peter Geoghegan
Date:
On Fri, Feb 25, 2022 at 5:06 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Based on this discussion, IIUC, we are saying that now we will do the
> lazy_scan_heap every time like we are doing now.  And we will
> conditionally skip the index vacuum for all or some of the indexes and
> then based on how much index vacuum is done we will conditionally do
> the lazy_vacuum_heap_rel().  Is my understanding correct?

I can only speak for myself, but that sounds correct to me. IMO what
we really want here is to create lots of options for VACUUM. To do the
work of index vacuuming when it is most convenient, based on very
recent information about what's going on in each index. There are some
specific, obvious ways that it might help. For example, it would be
nice if the failsafe could not really skip index vacuuming -- it could
just put it off until later, after relfrozenxid has been advanced to a
safe value.

Bear in mind that the cost of lazy_scan_heap is often vastly less than
the cost of vacuuming all indexes -- and so doing a bit more work
there than theoretically necessary is not necessarily a problem.
Especially if you have something like UUID indexes, where there is no
natural locality. Many tables have 10+ indexes. Even large tables.

> IMHO, if we are doing the heap scan every time and then we are going
> to get the same dead items again which we had previously collected in
> the conveyor belt.  I agree that we will not add them again into the
> conveyor belt but why do we want to store them in the conveyor belt
> when we want to redo the whole scanning again?

I don't think we want to, exactly. Maybe it's easier to store
redundant TIDs than to avoid storing them in the first place (we can
probably just accept some redundancy). There is a trade-off to be made
there. I'm not at all sure of what the best trade-off is, though.

> I think (without global indexes) the main advantage of using the
> conveyor belt is that if we skip the index scan for some of the
> indexes then we can save the dead item somewhere so that without
> scanning the heap again we have those dead items to do the index
> vacuum sometime in future

Global indexes are important in their own right, but ISTM that they
have similar needs to other things anyway. Having this flexibility is
even more important with global indexes, but the concepts themselves
are similar. We want options and maximum flexibility, everywhere.

> but if you are going to rescan the heap
> again next time before doing any index vacuuming then why we want to
> store them anyway.

It all depends, of course. The decision needs to be made using a cost
model. I suspect it will be necessary to try it out, and see.

--
Peter Geoghegan



Re: should vacuum's first heap pass be read-only?

From
Dilip Kumar
Date:
On Fri, Feb 25, 2022 at 10:45 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Fri, Feb 25, 2022 at 5:06 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > Based on this discussion, IIUC, we are saying that now we will do the
> > lazy_scan_heap every time like we are doing now.  And we will
> > conditionally skip the index vacuum for all or some of the indexes and
> > then based on how much index vacuum is done we will conditionally do
> > the lazy_vacuum_heap_rel().  Is my understanding correct?
>
> Bear in mind that the cost of lazy_scan_heap is often vastly less than
> the cost of vacuuming all indexes -- and so doing a bit more work
> there than theoretically necessary is not necessarily a problem.
> Especially if you have something like UUID indexes, where there is no
> natural locality. Many tables have 10+ indexes. Even large tables.

Completely agree with that.

> > IMHO, if we are doing the heap scan every time and then we are going
> > to get the same dead items again which we had previously collected in
> > the conveyor belt.  I agree that we will not add them again into the
> > conveyor belt but why do we want to store them in the conveyor belt
> > when we want to redo the whole scanning again?
>
> I don't think we want to, exactly. Maybe it's easier to store
> redundant TIDs than to avoid storing them in the first place (we can
> probably just accept some redundancy). There is a trade-off to be made
> there. I'm not at all sure of what the best trade-off is, though.

Yeah we can think of that.

> > I think (without global indexes) the main advantage of using the
> > conveyor belt is that if we skip the index scan for some of the
> > indexes then we can save the dead item somewhere so that without
> > scanning the heap again we have those dead items to do the index
> > vacuum sometime in future
>
> Global indexes are important in their own right, but ISTM that they
> have similar needs to other things anyway. Having this flexibility is
> even more important with global indexes, but the concepts themselves
> are similar. We want options and maximum flexibility, everywhere.

+1

> > but if you are going to rescan the heap
> > again next time before doing any index vacuuming then why we want to
> > store them anyway.
>
> It all depends, of course. The decision needs to be made using a cost
> model. I suspect it will be necessary to try it out, and see.

Yeah right.  But I still think that we should be thinking about
skipping the first vacuum pass conditionally as well.  I mean, if
there are not many new dead tuples, which we can realize even before
starting the heap scan, then why not directly jump to the index
vacuuming if some of the indexes need vacuum?  But I agree that based
on some testing and a cost model we can decide what is the best way
forward.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: should vacuum's first heap pass be read-only?

From
Robert Haas
Date:
[ returning to this thread after a bit of a hiatus ]

On Fri, Feb 25, 2022 at 12:15 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I can only speak for myself, but that sounds correct to me. IMO what
> we really want here is to create lots of options for VACUUM.

I agree. That seems like a good way of thinking about it.

> I don't think we want to, exactly. Maybe it's easier to store
> redundant TIDs than to avoid storing them in the first place (we can
> probably just accept some redundancy). There is a trade-off to be made
> there. I'm not at all sure of what the best trade-off is, though.

My intuition is that storing redundant TIDs will turn out to be a very
bad idea. I think that if we do that, the system will become prone to
a new kind of vicious cycle (to try to put this in words similar to
the ones you've been using to write about similar effects). Imagine
that we vacuum a table which contains N dead tuples a total of K times
in succession, but each time we either deliberately decide against
index vacuuming or get killed before we can complete it. If we don't
have logic to prevent duplicate entries from being added to the
conveyor belt, we will have N*K TIDs in the conveyor belt rather than
only N, and I think it's not hard to imagine situations where K could
be big enough for that to be hugely painful. Moreover, at some point,
we're going to need to deduplicate anyway, because each of those dead
TIDs can only be marked unused once. Leaving the data on the conveyor
belt in the hopes of sorting it out later seems almost certain to be a
losing proposition. The more data we collect on the conveyor belt, the
harder it's going to be when we eventually need to deduplicate.

Also, I think it's possible to deduplicate at a very reasonable cost
so long as (1) we enter each batch of TIDs into the conveyor belt as a
distinguishable "run" and (2) we never accumulate so many of these
runs at the same time that we can't do a single merge pass to turn
them into a sorted output stream. We're always going to discover dead
TIDs in sorted order, so as we make our pass over the heap, we can
simultaneously be doing an on-the-fly merge pass over the existing
runs that are on the conveyor belt. That means we never need to store
all the dead TIDs in memory at once. We just fetch enough data from
each "run" to cover the block numbers for which we're performing the
heap scan, and when the heap scan advances we throw away data for
blocks that we've passed and fetch data for the blocks we're now
reaching.
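
To make the merge idea a bit more concrete, here's a sketch (the
run-reader and sink interfaces are invented for illustration; only the
TID types and ItemPointerCompare() are real):

/*
 * Sketch of a single merge pass over K sorted runs of dead TIDs from
 * the conveyor belt, deduplicating as we go.  Each run only needs a
 * small buffer of its next few TIDs in memory at any time.
 */
static void
merge_dead_tid_runs(DeadTidRun **runs, int nruns, DeadTidSink *sink)
{
    ItemPointerData last;
    bool        have_last = false;

    for (;;)
    {
        int             best = -1;
        ItemPointerData best_tid;

        /* k-way merge step: pick the smallest head TID among the runs */
        for (int i = 0; i < nruns; i++)
        {
            ItemPointerData tid;

            if (!run_peek(runs[i], &tid))       /* invented */
                continue;                       /* this run is exhausted */
            if (best < 0 || ItemPointerCompare(&tid, &best_tid) < 0)
            {
                best = i;
                best_tid = tid;
            }
        }
        if (best < 0)
            break;                              /* all runs exhausted */
        run_advance(runs[best]);                /* invented */

        /* runs may overlap, so drop duplicates of the last emitted TID */
        if (have_last && ItemPointerCompare(&best_tid, &last) == 0)
            continue;

        sink_emit(sink, &best_tid);             /* invented */
        last = best_tid;
        have_last = true;
    }
}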

> > but if you are going to rescan the heap
> > again next time before doing any index vacuuming then why we want to
> > store them anyway.
>
> It all depends, of course. The decision needs to be made using a cost
> model. I suspect it will be necessary to try it out, and see.

But having said that, coming back to this with fresh eyes, I think
Dilip has a really good point here. If the first thing we do at the
start of every VACUUM is scan the heap in a way that is guaranteed to
rediscover all of the dead TIDs that we've previously added to the
conveyor belt plus maybe also new ones, we may as well just forget the
whole idea of having a conveyor belt at all. At that point we're just
talking about a system for deciding when to skip index vacuuming, and
the conveyor belt is a big complicated piece of machinery that stores
data we don't really need for anything because if we threw it out the
next vacuum would reconstruct it anyhow and we'd get exactly the same
results with less code. The only way the conveyor belt system has any
value is if we think that there is some set of circumstances where the
heap scan is separated in time from the index vacuum, such that we
might sometimes do an index vacuum without having done a heap scan
just before.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: should vacuum's first heap pass be read-only?

From
Peter Geoghegan
Date:
On Thu, Mar 31, 2022 at 1:25 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > I don't think we want to, exactly. Maybe it's easier to store
> > redundant TIDs than to avoid storing them in the first place (we can
> > probably just accept some redundancy). There is a trade-off to be made
> > there. I'm not at all sure of what the best trade-off is, though.
>
> My intuition is that storing redundant TIDs will turn out to be a very
> bad idea. I think that if we do that, the system will become prone to
> a new kind of vicious cycle (to try to put this in words similar to
> the ones you've been using to write about similar effects).

I don't feel very strongly about it either way. I definitely think
that there are workloads for which that will be true, and in general
I am no fan of putting off work that cannot possibly turn out to be
unnecessary in the end. I am in favor of "good laziness", not "bad
laziness".

> The more data we collect on the conveyor belt, the
> harder it's going to be when we eventually need to deduplicate.
>
> Also, I think it's possible to deduplicate at a very reasonable cost
> so long as (1) we enter each batch of TIDs into the conveyor belt as a
> distinguishable "run"

I definitely think that's the way to go, in general (regardless of
what we do about deduplicating TIDs).

> and (2) we never accumulate so many of these
> runs at the same time that we can't do a single merge pass to turn
> them into a sorted output stream. We're always going to discover dead
> TIDs in sorted order, so as we make our pass over the heap, we can
> simultaneously be doing an on-the-fly merge pass over the existing
> runs that are on the conveyor belt. That means we never need to store
> all the dead TIDs in memory at once.

That's a good idea IMV. I am vaguely reminded of an LSM tree.

> We just fetch enough data from
> each "run" to cover the block numbers for which we're performing the
> heap scan, and when the heap scan advances we throw away data for
> blocks that we've passed and fetch data for the blocks we're now
> reaching.

I wonder if you've thought about exploiting the new approach to
skipping pages using the visibility map from my patch series (which
you reviewed recently). I think that putting that in scope here could
be very helpful. As a way of making the stuff we already want to do
with [global] indexes easier, but also as independently useful work
based on the conveyor belt. The visibility map is very underexploited
as a source of accurate information about what VACUUM should be doing
IMV (in autovacuum.c's scheduling logic, but also in vacuumlazy.c
itself).

Imagine a world in which we decide up-front what pages we're going to
scan (i.e. our final vacrel->scanned_pages), by first scanning the
visibility map, and serializing it in local memory, or sometimes in
disk using the conveyor belt. Now you'll have a pretty good idea how
much TID deduplication will be necessary when you're done (to give one
example). In general "locking in" a plan of action for VACUUM seems
like it would be very useful. It will tend to cut down on the number
of "dead but not yet removable" tuples that VACUUM encounters right
now -- you at least avoid visiting concurrently modified heap pages
that were all-visible back when OldestXmin was first established.
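
As a sketch, deciding scanned_pages up front might look something like
this (visibilitymap_get_status() is the real interface; the "plan"
structure and its helper are hypothetical):

static void
plan_scanned_pages(Relation rel, BlockNumber nblocks, VacuumScanPlan *plan)
{
    Buffer      vmbuffer = InvalidBuffer;

    for (BlockNumber blkno = 0; blkno < nblocks; blkno++)
    {
        uint8   vmstatus = visibilitymap_get_status(rel, blkno, &vmbuffer);

        /*
         * Lock in the decision now: pages that are all-visible as of
         * this moment are simply not part of this VACUUM (an aggressive
         * VACUUM would also have to consider all-frozen here).
         */
        if ((vmstatus & VISIBILITYMAP_ALL_VISIBLE) == 0)
            plan_add_page(plan, blkno);         /* hypothetical */
    }

    if (BufferIsValid(vmbuffer))
        ReleaseBuffer(vmbuffer);
}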

When all skippable ranges are known in advance, we can reorder things
in many different ways -- since the work of vacuuming can be
decomposed on the lazy_scan_heap side, too. The only ordering
dependency among heap pages (that I can think of offhand) is FSM
vacuuming, which seems like it could be addressed without great
difficulty.

Ideally everything can be processed in whatever order is convenient as
of right now, based on costs and benefits. We could totally decompose
the largest VACUUM operations into individually processable units of
work (heap and index units), so that individual autovacuum workers no
longer own particular VACUUM operations at all. Autovacuum workers
would instead multiplex all of the units of work from all pending
VACUUMs, that are scheduled centrally, based on the current load of
the system. We can probably afford to be much more sensitive to the
number of pages we'll dirty relative to what the system can take right
now, and so on.

Cancelling an autovacuum worker may just make the system temporarily
suspend the VACUUM operation for later with this design (in fact
cancelling an autovacuum may not really be meaningful at all). A
design like this could also enable things like changing our mind about
advancing relfrozenxid: if we decided to skip all-visible pages in
this VACUUM operation, and come to regret that decision by the end
(because lots of XIDs are consumed in the interim), maybe we can go
back and get the all-visible pages we missed first time around.

I don't think that all of these ideas will turn out to be winners, but
I'm just trying to stimulate discussion. Breaking almost all
dependencies on the order that we process things in seems like it has
real promise.

> > It all depends, of course. The decision needs to be made using a cost
> > model. I suspect it will be necessary to try it out, and see.
>
> But having said that, coming back to this with fresh eyes, I think
> Dilip has a really good point here. If the first thing we do at the
> start of every VACUUM is scan the heap in a way that is guaranteed to
> rediscover all of the dead TIDs that we've previously added to the
> conveyor belt plus maybe also new ones, we may as well just forget the
> whole idea of having a conveyor belt at all.

I definitely agree that that's bad, and would be the inevitable result
of being lazy about deduplicating consistently.

> The only way the conveyor belt system has any
> value is if we think that there is some set of circumstances where the
> heap scan is separated in time from the index vacuum, such that we
> might sometimes do an index vacuum without having done a heap scan
> just before.

I agree.

-- 
Peter Geoghegan



Re: should vacuum's first heap pass be read-only?

From
Robert Haas
Date:
On Thu, Mar 31, 2022 at 6:43 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > The only way the conveyor belt system has any
> > value is if we think that there is some set of circumstances where the
> > heap scan is separated in time from the index vacuum, such that we
> > might sometimes do an index vacuum without having done a heap scan
> > just before.
>
> I agree.

But in http://postgr.es/m/CA+Tgmoa6kVEeurtyeOi3a+rA2XuynwQmJ_s-h4kUn6-bKMMDRw@mail.gmail.com
(and the messages just before and just after it) we seemed to be
agreeing on a design where that's exactly what happens. It seemed like
a good idea to me at the time, but now it seems like it's a bad idea,
because it involves using the conveyor belt in a way that adds no
value.

Am I confused here?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: should vacuum's first heap pass be read-only?

From
Peter Geoghegan
Date:
On Thu, Mar 31, 2022 at 5:31 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > I agree.
>
> But in http://postgr.es/m/CA+Tgmoa6kVEeurtyeOi3a+rA2XuynwQmJ_s-h4kUn6-bKMMDRw@mail.gmail.com
> (and the messages just before and just after it) we seemed to be
> agreeing on a design where that's exactly what happens. It seemed like
> a good idea to me at the time, but now it seems like it's a bad idea,
> because it involves using the conveyor belt in a way that adds no
> value.

There are two types of heap scans (in my mind, at least): those that
prune, and those that VACUUM. While there has traditionally been a 1:1
correspondence between these two scans (barring cases with no LP_DEAD
items whatsoever), that's no longer true as of Postgres 14, which added the
"bypass index scan in the event of few LP_DEAD items left by pruning"
optimization (or Postgres 12 if you count INDEX_CLEANUP=off).

When I said "I agree" earlier today, I imagined that I was pretty much
affirming everything else that I'd said up until that point of the
email. Which is that the conveyor belt is interesting as a way of
breaking (or even just loosening) dependencies on the *order* in which
we perform work within a given "VACUUM cycle". Things can be much
looser than they are today, with indexes (which we've discussed a lot
already), and even with heap pruning (which I brought up for the first
time just today).

However, I don't see any way that it will be possible to break one
particular ordering dependency, even with the conveyor belt stuff: The
"basic invariant" described in comments above lazy_scan_heap(), which
describes rules about TID recycling -- we can only recycle TIDs when a
full "VACUUM cycle" completes, just like today.

That was a point I was making in the email from back in February:
obviously it's unsafe to do lazy_vacuum_heap_page() processing of a
page until we're already 100% sure that the LP_DEAD items are not
referenced by any indexes, even indexes that have very little bloat
(that don't really need to be vacuumed for their own sake). However,
the conveyor belt can add value by doing much more frequent processing
in lazy_scan_prune() (of different pages each time, or perhaps even
repeat processing of the same heap pages), and much more frequent
index vacuuming for those indexes that seem to need it.

So the lazy_scan_prune() work (pruning and freezing) can and probably
should be separated in time from the index vacuuming (compared to the
current design). Maybe not for all of the indexes -- typically for the
majority, maybe 8 out of 10. We can do much less index vacuuming in
those indexes that don't really need it, in order to be able to do
much more in those that do. At some point we must "complete a whole
cycle of heap vacuuming" by processing all the heap pages using
lazy_vacuum_heap_page() that need it.

Separately, the conveyor belt seems to have promise as a way of
breaking up work for multiplexing, or parallel processing.

-- 
Peter Geoghegan



Re: should vacuum's first heap pass be read-only?

From
Dilip Kumar
Date:
On Fri, Apr 1, 2022 at 1:55 AM Robert Haas <robertmhaas@gmail.com> wrote:
>

> But having said that, coming back to this with fresh eyes, I think
> Dilip has a really good point here. If the first thing we do at the
> start of every VACUUM is scan the heap in a way that is guaranteed to
> rediscover all of the dead TIDs that we've previously added to the
> conveyor belt plus maybe also new ones, we may as well just forget the
> whole idea of having a conveyor belt at all. At that point we're just
> talking about a system for deciding when to skip index vacuuming, and
> the conveyor belt is a big complicated piece of machinery that stores
> data we don't really need for anything because if we threw it out the
> next vacuum would reconstruct it anyhow and we'd get exactly the same
> results with less code.

After thinking more about this I see there is some value of
remembering the dead tids in the conveyor belt.  Basically, the point
is if there are multiple indexes and we do the index vacuum for some
of the indexes and skip for others.  And now when we again do the
complete vacuum cycle that time we will again get all the old dead
tids + the new dead tids but without conveyor belt we might need to
perform the multiple cycle of the index vacuum even for the indexes
for which we had done the vacuum in previous vacuum cycle (if all tids
are not fitting in maintenance work mem). But with the conveyor belt
we remember the conveyor belt pageno upto which we have done the index
vacuum and then we only need to do vacuum for the remaining tids which
will definitely reduce the index vacuuming passes, right?
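
Just to illustrate the bookkeeping I mean (none of these types exist;
the conveyor belt position type is whatever the patch set uses):

/* Hypothetical: per-index record of how far into the conveyor belt
 * this index has already been vacuumed. */
typedef struct IndexVacuumProgress
{
    Oid         indexoid;
    uint64      last_conveyor_pageno;   /* belt position already removed
                                         * from this index */
} IndexVacuumProgress;

/* A later index-vacuum pass only needs the TIDs beyond
 * last_conveyor_pageno, so a previously vacuumed index never has to
 * re-scan the same TIDs. */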

So my stand is, a) for the global indexes we must need the conveyor
belt for remembering the partition wise dead tids (because after
vacuuming certain partitions when we go for global index vacuuming we
don't want to rescan all the partitions to get the same dead items) b)
and even without global indexes there are advantages of storing dead
items in the conveyor belt as explained in my previous paragraph.  So
I think it is worth adding the conveyor belt infrastructure.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: should vacuum's first heap pass be read-only?

From
Peter Geoghegan
Date:
On Thu, Mar 31, 2022 at 9:08 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> But with the conveyor belt
> we remember the conveyor belt pageno upto which we have done the index
> vacuum and then we only need to do vacuum for the remaining tids which
> will definitely reduce the index vacuuming passes, right?

Right, exactly -- the index or two that really need to be vacuumed a
lot can have relatively small dead_items arrays.

Other indexes (when eventually vacuumed) will need a larger dead_items
array, with everything we need to get rid of from the index in one big
array. Hopefully this won't matter much. Vacuuming these indexes
should be required infrequently (compared to the bloat-prone indexes).

As I said upthread, when we finally have to perform heap vacuuming
(not heap pruning), it'll probably happen because the heap itself
needs heap vacuuming. We could probably get away with *never* vacuuming
certain indexes on tables prone to non-HOT updates, without that ever
causing index bloat.  But heap line pointer bloat is eventually going
to become a real problem with non-HOT updates, no matter what.

-- 
Peter Geoghegan



Re: should vacuum's first heap pass be read-only?

From
Robert Haas
Date:
On Fri, Apr 1, 2022 at 12:08 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> After thinking more about this I see there is some value of
> remembering the dead tids in the conveyor belt.  Basically, the point
> is if there are multiple indexes and we do the index vacuum for some
> of the indexes and skip for others.  And now when we again do the
> complete vacuum cycle that time we will again get all the old dead
> tids + the new dead tids but without conveyor belt we might need to
> perform the multiple cycle of the index vacuum even for the indexes
> for which we had done the vacuum in previous vacuum cycle (if all tids
> are not fitting in maintenance work mem). But with the conveyor belt
> we remember the conveyor belt pageno upto which we have done the index
> vacuum and then we only need to do vacuum for the remaining tids which
> will definitely reduce the index vacuuming passes, right?

I guess you're right, and it's actually a little bit better than that,
because even if the data does fit into shared memory, we'll have to
pass fewer TIDs to the worker to be removed from the heap, which might
save a few CPU cycles. But I think both of those are very small
benefits. If that's all we're going to do with the conveyor belt
infrastructure, I don't think it's worth the effort. I am completely
in agreement with Peter's comments to the effect that the goal here is
to create flexibility, but it feels to me like the particular
development plan we discussed back in late February isn't going to
create enough flexibility to make it worth the effort it takes. It
seems to me we need to find a way to do better than that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: should vacuum's first heap pass be read-only?

From
Peter Geoghegan
Date:
On Fri, Apr 1, 2022 at 11:04 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I guess you're right, and it's actually a little bit better than that,
> because even if the data does fit into shared memory, we'll have to
> pass fewer TIDs to the worker to be removed from the heap, which might
> save a few CPU cycles. But I think both of those are very small
> benefits.

I'm not following. It seems like you're saying that the ability to
vacuum indexes on their own schedule (based on their own needs) is not
sufficiently compelling. I think it's very compelling, with enough
indexes (and maybe not very many).

The conveyor belt doesn't just save I/O from repeated scanning of the
heap. It may also save on repeated pruning (or just dirtying) of the
same heap pages again and again, for very little benefit.

Imagine an append-only table where 1% of transactions that insert are
aborts. You really want to be able to constantly VACUUM such a table,
so that its pages are proactively frozen and set all-visible in the
visibility map -- it's not that different to a perfectly append-only
table, without any garbage tuples. And so it would be very useful if
we could delay index vacuuming for much longer than the current 2% of
rel_pages heuristic seems to allow.
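
(For reference, the heuristic I mean is roughly this test in
lazy_vacuum(), paraphrasing from memory rather than quoting the code
exactly:)

/* Paraphrased from Postgres 14's vacuumlazy.c; treat details as approximate. */
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;    /* 0.02 */
bypass = (vacrel->lpdead_item_pages < threshold &&
          vacrel->lpdead_items < MAXDEADTUPLES(32L * 1024L * 1024L));
if (bypass)
{
    /* skip index vacuuming (and therefore the second heap pass) this time */
}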

That heuristic has to conservatively assume that it might be some time
before the next vacuum is launched, and has the opportunity to
reconsider index vacuuming. What if it was a more or less independent
question instead? To put it another way, it would be great if the
scheduling code for autovacuum could make inferences about what
general strategy works best for a given table over time. In order to
be able to do that sensibly, the algorithm needs more context, so that
it can course correct without paying much of a cost for being wrong.

-- 
Peter Geoghegan



Re: should vacuum's first heap pass be read-only?

From
Robert Haas
Date:
On Fri, Apr 1, 2022 at 2:27 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I'm not following. It seems like you're saying that the ability to
> vacuum indexes on their own schedule (based on their own needs) is not
> sufficiently compelling. I think it's very compelling, with enough
> indexes (and maybe not very many).
>
> The conveyor belt doesn't just save I/O from repeated scanning of the
> heap. It may also save on repeated pruning (or just dirtying) of the
> same heap pages again and again, for very little benefit.

I'm also not following. In order to get that benefit, we would have to
sometimes decide to not perform lazy_scan_heap() at the startup of a
vacuum. And in this email I asked you whether it was your idea that we
should always start a vacuum operation with lazy_scan_heap(), and you
said "yes":

http://postgr.es/m/CA+Tgmoa6kVEeurtyeOi3a+rA2XuynwQmJ_s-h4kUn6-bKMMDRw@mail.gmail.com

So I'm completely confused here. If we always start a vacuum with
lazy_scan_heap(), as you said you wanted, then we will not save any
heap scanning.

What am I missing?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: should vacuum's first heap pass be read-only?

From
Peter Geoghegan
Date:
On Fri, Apr 1, 2022 at 11:39 AM Robert Haas <robertmhaas@gmail.com> wrote:
> So I'm completely confused here. If we always start a vacuum with
> lazy_scan_heap(), as you said you wanted, then we will not save any
> heap scanning.

The term "start a VACUUM" becomes ambiguous with the conveyor belt.

What I addressed in a nearby email back in February [1] was the
idea of doing heap vacuuming of the last run (or several runs) of dead
TIDs on top of heap pruning to create the next run/runs of dead TIDs.

> What am I missing?

There is a certain sense in which we are bound to always "start a
vacuum" in lazy_scan_prune(), with any design based on the current
one. How else are we ever going to make a basic initial determination
about which heap LP_DEAD items need their TIDs deleted from indexes,
sooner or later? Obviously that information must always have
originated in lazy_scan_prune (or in lazy_scan_noprune).

With the conveyor belt, and a non-HOT-update heavy workload, we'll
eventually need to exhaustively do index vacuuming of all indexes
(even those that don't need it for their own sake) to make it safe to
remove heap line pointer bloat (to set heap LP_DEAD items to
LP_UNUSED). This will happen least often of all, and is the one
dependency the conveyor belt can't help with.

To answer your question: when heap vacuuming does finally happen, we
at least don't need to call lazy_scan_prune for any pages first
(neither the pages we're vacuuming, nor any other heap pages). Plus
the decision to finally clean up line pointer bloat can be made based
on known facts about line pointer bloat, without tying that to other
processing done by lazy_scan_prune() -- so there's greater separation
of concerns.

That having been said, maybe it would make sense to also call
lazy_scan_prune() right after these relatively rare calls to
lazy_vacuum_heap_page(), opportunistically (since we've already
dirtied the page). But that would be an additional optimization, at
best; it wouldn't be the main way that we call lazy_scan_prune().

[1] https://www.postgresql.org/message-id/CAH2-WzmG%3D_vYv0p4bhV8L73_u%2BBkd0JMWe2zHH333oEujhig1g%40mail.gmail.com
--
Peter Geoghegan



Re: should vacuum's first heap pass be read-only?

From
Dilip Kumar
Date:
On Fri, Apr 1, 2022 at 11:34 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Apr 1, 2022 at 12:08 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > After thinking more about this I see there is some value of
> > remembering the dead tids in the conveyor belt.  Basically, the point
> > is if there are multiple indexes and we do the index vacuum for some
> > of the indexes and skip for others.  And now when we again do the
> > complete vacuum cycle that time we will again get all the old dead
> > tids + the new dead tids but without conveyor belt we might need to
> > perform the multiple cycle of the index vacuum even for the indexes
> > for which we had done the vacuum in previous vacuum cycle (if all tids
> > are not fitting in maintenance work mem). But with the conveyor belt
> > we remember the conveyor belt pageno upto which we have done the index
> > vacuum and then we only need to do vacuum for the remaining tids which
> > will definitely reduce the index vacuuming passes, right?
>
> I guess you're right, and it's actually a little bit better than that,
> because even if the data does fit into shared memory, we'll have to
> pass fewer TIDs to the worker to be removed from the heap, which might
> save a few CPU cycles. But I think both of those are very small
> benefits. If that's all we're going to do with the conveyor belt
> infrastructure, I don't think it's worth the effort.

I don't think that saving extra index passes is really a small gain.
It can save a lot of I/O when index pages are not in shared buffers,
because we can completely skip the index passes for any index that
has already been vacuumed. And if that were the only advantage it
might not be worth adding this infrastructure, but what about global
indexes?

Because with global indexes we really do need this infrastructure to
store the dead items for each partition: after vacuuming 1000
partitions, if we need to vacuum the global index while vacuuming the
1001st partition, we don't want to rescan the previous 1000 partitions
just to regenerate those old dead items, right?  So I think this is
the real use case, where we indirectly skip the heap vacuuming for
some of the partitions before performing the index vacuum.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: should vacuum's first heap pass be read-only?

From
Robert Haas
Date:
On Tue, Apr 5, 2022 at 6:19 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I don't think that saving extra index passes is really a small gain.
> It can save a lot of I/O when index pages are not in shared buffers,
> because we can completely skip the index passes for any index that
> has already been vacuumed. And if that were the only advantage it
> might not be worth adding this infrastructure, but what about global
> indexes?

Sure, I agree that the gain is large when the situation arises -- but
in practice I think it's pretty rare that the dead TID array can't fit
in maintenance_work_mem. In ten years of doing PostgreSQL support,
I've seen only a handful of cases where # of index scans > 1, and
those were solved by just increasing maintenance_work_mem until the
problem went away. AFAICT, there's pretty much nobody who can't fit
the dead TID list in main memory. They just occasionally don't
configure enough memory for it to happen. It makes sense if you think
about the math. Say you run with maintenance_work_mem=64MB. That's
enough for 10 million dead TIDs. With default settings, the table
becomes eligible for vacuuming when the number of updates and deletes
exceeds 20% of the table. So to fill up that amount of memory, you
need the table to have more than 50 million tuples. If you estimate
(somewhat randomly) 100 tuples per page, that's about 500,000 pages,
or roughly 4GB. If you have a 4GB table, you don't have a problem with
using 64MB of memory to vacuum it. And similarly if you have a 64GB
table, you don't have a problem with using 1GB of memory to vacuum it.
Practically speaking, if we made work memory for autovacuum unlimited,
and allocated on demand as much as we need, I bet almost nobody would
have an issue.
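
Just to spell the arithmetic out, here is a back-of-the-envelope
sketch -- not PostgreSQL code; it simply assumes 6-byte TIDs, 8kB heap
pages, and the default autovacuum_vacuum_scale_factor of 0.2 -- which
comes out in the same ballpark as the round numbers above:

    #include <stdio.h>

    int
    main(void)
    {
        const double tid_bytes = 6.0;         /* sizeof(ItemPointerData) */
        const double scale_factor = 0.2;      /* autovacuum default */
        const double tuples_per_page = 100.0; /* rough guess, as above */
        const double page_bytes = 8192.0;

        double work_mem_bytes = 64.0 * 1024 * 1024;   /* 64MB */
        double dead_tids = work_mem_bytes / tid_bytes;
        double table_tuples = dead_tids / scale_factor;
        double table_bytes = table_tuples / tuples_per_page * page_bytes;

        printf("64MB holds ~%.0fM dead TIDs, filled by a ~%.0fGB table\n",
               dead_tids / 1e6, table_bytes / (1024 * 1024 * 1024));
        return 0;
    }

The exact figures come out slightly higher than the round ones, but
the conclusion is the same: the memory needed is a tiny fraction of
the table size.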

> Because with global indexes we really do need this infrastructure to
> store the dead items for each partition: after vacuuming 1000
> partitions, if we need to vacuum the global index while vacuuming the
> 1001st partition, we don't want to rescan the previous 1000 partitions
> just to regenerate those old dead items, right?  So I think this is
> the real use case, where we indirectly skip the heap vacuuming for
> some of the partitions before performing the index vacuum.

Well, I agree. But the question is what development path we should
pursue to get there. We want to do something that's going
to make sense if and when we eventually get global indexes, but which
is going to give us a good amount of benefit in the meanwhile, and
also doesn't involve having to make too many changes to the code at
the same time. I liked the idea of keeping VACUUM basically as it is
today -- two heap passes with an index pass in the middle, but now
with the conveyor injected -- because it keeps the code changes as
simple as possible. And perhaps we should start by doing just that
much. But now that I've realized that the benefit of doing only that
much is so little, I'm a lot less convinced that it is a good first
step. Any hope of getting a more significant benefit out of the
conveyor belt stuff relies on our ability to get more decoupling, so
that we for example collect dead TIDs on Tuesday, vacuum the indexes
on Wednesday, and set the dead TIDs unused on Thursday, doing other
things meanwhile.
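
To make the shape of that concrete, here is a toy model of the
decoupling -- the types and function names are illustrative stand-ins,
not the conveyor belt patch's actual API:

    /* Toy model only: dead TIDs flow through a "conveyor belt" in three
     * independently schedulable phases. */
    #include <stdio.h>

    typedef struct DeadTid { int page; int off; } DeadTid;

    typedef struct ConveyorBelt
    {
        DeadTid items[1024];
        int     ntids;       /* TIDs collected so far */
        int     index_done;  /* TIDs before this point are gone from all indexes */
        int     heap_done;   /* TIDs before this point are LP_UNUSED in the heap */
    } ConveyorBelt;

    /* Phase 1 ("Tuesday"): the first heap pass appends pruned TIDs. */
    static void
    collect_dead_tids(ConveyorBelt *belt)
    {
        for (int i = 0; i < 8; i++)
        {
            belt->items[belt->ntids].page = i / 4;
            belt->items[belt->ntids].off = i % 4 + 1;
            belt->ntids++;
        }
    }

    /* Phase 2 ("Wednesday"): vacuum indexes for TIDs not yet processed. */
    static void
    vacuum_indexes(ConveyorBelt *belt)
    {
        for (int i = belt->index_done; i < belt->ntids; i++)
            printf("remove (%d,%d) from every index\n",
                   belt->items[i].page, belt->items[i].off);
        belt->index_done = belt->ntids;
    }

    /* Phase 3 ("Thursday"): mark LP_DEAD items LP_UNUSED, but only those
     * already removed from every index. */
    static void
    vacuum_heap(ConveyorBelt *belt)
    {
        for (int i = belt->heap_done; i < belt->index_done; i++)
            printf("set (%d,%d) LP_UNUSED in the heap\n",
                   belt->items[i].page, belt->items[i].off);
        belt->heap_done = belt->index_done;
    }

    int
    main(void)
    {
        ConveyorBelt belt = {0};

        collect_dead_tids(&belt);  /* each phase can run on its own schedule, */
        vacuum_indexes(&belt);     /* possibly days apart, with other work */
        vacuum_heap(&belt);        /* happening in between */
        return 0;
    }

Nothing here is a new mechanism -- it's the existing two heap passes
and the index pass in between, just with persistent state separating
them, which is what would let a scheduler postpone or repeat the
individual phases.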

And from that point of view I see two problems. One problem is that I
do not think we want to force all vacuuming through the conveyor belt
model. It doesn't really make sense for a small table with no
associated global indexes. And so then there is a code structure
issue: how do we set things up so that we can vacuum as we do today,
or alternatively vacuum in completely separate stages, without filling
the code up with a million "if" statements? The other problem is
understanding whether it's really feasible to postpone the index
vacuuming and the second heap pass in realistic scenarios. Postponing
index vacuuming and the second heap pass means that dead line pointers
remain in the heap, and that can drive bloat via line pointer
exhaustion. The whole idea of decoupling table and index vacuum
supposes that there are situations in which it's worth performing the
first heap pass where we gather the dead line pointers but where it's
not necessary to follow that up as quickly as possible with a second
heap pass to mark dead line pointers unused. I think Peter and I are
in agreement that there are situations in which some indexes need to
be vacuumed much more often than others -- but that doesn't matter if
the heap needs to be vacuumed more frequently than anything else,
because you can't do that without first vacuuming all the indexes.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: should vacuum's first heap pass be read-only?

From
Peter Geoghegan
Date:
On Tue, Apr 5, 2022 at 5:44 AM Robert Haas <robertmhaas@gmail.com> wrote:
> The whole idea of decoupling table and index vacuum
> supposes that there are situations in which it's worth performing the
> first heap pass where we gather the dead line pointers but where it's
> not necessary to follow that up as quickly as possible with a second
> heap pass to mark dead line pointers unused. I think Peter and I are
> in agreement that there are situations in which some indexes need to
> be vacuumed much more often than others -- but that doesn't matter if
> the heap needs to be vacuumed more frequently than anything else,
> because you can't do that without first vacuuming all the indexes.

It's not just an enabler of more frequent index vacuuming (for those
indexes that need it the most), though. It's also an enabler of more
frequent lazy_scan_prune processing (in particular setting hint bits
and freezing), which is probably even more likely to benefit from the
decoupling you'd be enabling. I can imagine this having great value in
a world where autovacuum scheduling eagerly keeps up with inserts into
an append-mostly table, largely avoiding repeated dirtying within
lazy_scan_prune, with dynamic feedback. You just need to put off the
work of index/heap vacuuming to be able to do that kind of thing.

Postgres 14 split the WAL record previously shared by both pruning and
vacuuming (called XLOG_HEAP2_CLEAN) into two separate WAL records
(called XLOG_HEAP2_PRUNE and XLOG_HEAP2_VACUUM). That made it easy to
see, using pg_waldump, that VACUUM usually writes far fewer of the
latter records. It might be worth doing your own experiments on this.

Other instrumentation changes in 14 also helped here. In particular
the "%u pages from table (%.2f%% of total) had %lld dead item
identifiers removed" line that was added to autovacuum's log output
made it easy to spot how little heap vacuuming might really be needed.
With some tables it is roughly the opposite way around (as much or
even more heap vacuuming than pruning/freezing) -- you'll tend to see
that in tables where opportunistic pruning leaves behind a lot of
LP_DEAD stubs that only VACUUM can make LP_UNUSED.

But, these same LP_DEAD-heavy tables *also* have a very decent
chance of benefiting from a better index vacuuming strategy, something
*also* enabled by the conveyor belt design. So overall, in either scenario,
VACUUM concentrates on problems that are particular to a given table
and workload, without being hindered by implementation-level
restrictions.

--
Peter Geoghegan



Re: should vacuum's first heap pass be read-only?

From
Robert Haas
Date:
On Tue, Apr 5, 2022 at 3:22 PM Peter Geoghegan <pg@bowt.ie> wrote:
> It's not just an enabler of more frequent index vacuuming (for those
> indexes that need it the most), though. It's also an enabler of more
> frequent lazy_scan_prune processing (in particular setting hint bits
> and freezing), which is probably even more likely to benefit from the
> decoupling you'd be enabling. I can imagine this having great value in
> a world where autovacuum scheduling eagerly keeps up with inserts into
> an append-mostly table, largely avoiding repeated dirtying within
> lazy_scan_prune, with dynamic feedback. You just need to put off the
> work of index/heap vacuuming to be able to do that kind of thing.

I had assumed that this would not be the case, because if the page is
being accessed by the workload, it can be pruned - and probably frozen
too, if we wanted to write code for that and spend the cycles on it -
and if it isn't, pruning and freezing probably aren't needed.

> But, these same LP_DEAD-heavy tables *also* have a very decent
> chance of benefiting from a better index vacuuming strategy, something
> *also* enabled by the conveyor belt design. So overall, in either scenario,
> VACUUM concentrates on problems that are particular to a given table
> and workload, without being hindered by implementation-level
> restrictions.

Well this is what I'm not sure about. We need to demonstrate that
there are at least some workloads where retiring the LP_DEAD line
pointers doesn't become the dominant concern.

Also, just to be clear, I'm not arguing against my own idea. I'm
trying to figure out the first, smallest unit of work that produces a
committable patch with a meaningful benefit.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: should vacuum's first heap pass be read-only?

From
Peter Geoghegan
Date:
On Tue, Apr 5, 2022 at 1:10 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I had assumed that this would not be the case, because if the page is
> being accessed by the workload, it can be pruned - and probably frozen
> too, if we wanted to write code for that and spend the cycles on it -
> and if it isn't, pruning and freezing probably aren't needed.

VACUUM has a top-down structure, and so seems to me to be the natural
place to think about the high level needs of the table as a whole,
especially over time.

I don't think we actually need to scan the pages in which we left
some LP_DEAD items during previous VACUUM operations. It seems possible to
freeze newly appended pages quite often, without needlessly revisiting
the pages from previous batches (even those with LP_DEAD items left
behind). Maybe we need to rethink the definition of "VACUUM operation"
a little to do that, but it seems relatively tractable.

As I said upthread recently, I am excited about the potential of
"locking in" a set of scanned_pages using a local/private version of
the visibility map (a copy from just after OldestXmin is initially
established), that VACUUM can completely work off of. Especially if
combined with the conveyor belt, which could make VACUUM operations
suspendable and resumable.

I don't see any reason why it wouldn't be possible to "lock in" an
initial scanned_pages, and then use that data structure (which could
be persisted) to avoid revisiting the pages that we know we already
visited (and left LP_DEAD items in). We could "resume the VACUUM
operation that was suspended earlier" a bit later, rather than having
several technically unrelated VACUUM operations in close succession.
The later rounds of processing could even use new cutoffs for both
pruning and freezing, despite being from "the same VACUUM operation".
They could have an "expanded rel_pages" that covers the newly appended
pages that we want to quickly freeze tuples on.

AFAICT the only thing that we need to do to make this safe is to carry
forward our original vacrel->NewRelfrozenXid (which can never be later
than our original vacrel->OldestXmin). Under this architecture, we
don't really "skip index vacuuming" at all. Rather, we redefine VACUUM
operations in a way that makes the final rel_pages provisional, at
least when run in autovacuum.

VACUUM itself can notice that it might be a good idea to "expand
rel_pages" and expand the scope of the work it ultimately does, based
on the observed characteristics of the table. No heap pages get repeat
processing per "VACUUM operation" (relative to the current definition
of the term). Some indexes will get "extra, earlier index vacuuming",
which we've already said is the right way to think about all this (we
should think of it as extra index vacuuming, not less index
vacuuming).
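
Roughly speaking, the state that a suspendable VACUUM would have to
carry forward might look something like this -- the struct and field
names are hypothetical, just to illustrate what gets "locked in":

    #include <stdint.h>

    typedef uint32_t TransactionId;

    /* Hypothetical sketch, not from any patch: state persisted between
     * rounds of a suspended-and-resumed VACUUM operation. */
    typedef struct SuspendedVacuum
    {
        /* "Locked in" scanned_pages, from a private copy of the visibility
         * map taken when OldestXmin was first established. */
        uint8_t        *scanned_pages;   /* bitmap, one bit per heap page */
        uint64_t        rel_pages;       /* provisional; may be expanded later */

        /* Cutoffs carried forward so that later rounds remain safe. */
        TransactionId   OldestXmin;
        TransactionId   NewRelfrozenXid; /* never later than OldestXmin */

        /* How far the dead-TID conveyor belt has been filled and consumed. */
        uint64_t        belt_insert_page;
        uint64_t        belt_vacuum_page;
    } SuspendedVacuum;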

> > But, these same LP_DEAD-heavy tables *also* have a very decent
> > chance of benefiting from a better index vacuuming strategy, something
> > *also* enabled by the conveyor belt design. So overall, in either scenario,
> > VACUUM concentrates on problems that are particular to a given table
> > and workload, without being hindered by implementation-level
> > restrictions.
>
> Well this is what I'm not sure about. We need to demonstrate that
> there are at least some workloads where retiring the LP_DEAD line
> pointers doesn't become the dominant concern.

It will eventually become the dominant concern. But that could take a
while, compared to the growth in indexes.

An LP_DEAD line pointer stub in a heap page is 4 bytes. The smallest
possible B-Tree index tuple is 20 bytes on mainstream platforms (16
bytes + 4 byte line pointer). Granted deduplication makes this less
true, but that's far from guaranteed to help. Also, many tables have
way more than one index.
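
For example (rough arithmetic only, assuming no deduplication and the
20-byte minimum above), a million dead rows and three B-Tree indexes
work out to something like this:

    #include <stdio.h>

    int
    main(void)
    {
        const long dead_rows = 1000 * 1000;
        const int  lp_dead_stub = 4;         /* bytes per heap LP_DEAD stub */
        const int  min_btree_entry = 16 + 4; /* index tuple + line pointer */
        const int  nindexes = 3;             /* arbitrary example */

        printf("heap keeps ~%ld kB, indexes keep up to ~%ld kB\n",
               dead_rows * lp_dead_stub / 1024,
               dead_rows * (long) min_btree_entry * nindexes / 1024);
        return 0;
    }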

Of course it isn't nearly as simple as comparing the bytes of bloat in
each case. More generally, I don't claim that it's easy to
characterize which factor is more important, even in the abstract,
even under ideal conditions -- it's very hard. But I'm sure that there
are routinely very large differences among indexes and the heap
structure.

-- 
Peter Geoghegan



Re: should vacuum's first heap pass be read-only?

From
Robert Haas
Date:
On Tue, Apr 5, 2022 at 4:30 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Apr 5, 2022 at 1:10 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > I had assumed that this would not be the case, because if the page is
> > being accessed by the workload, it can be pruned - and probably frozen
> > too, if we wanted to write code for that and spend the cycles on it -
> > and if it isn't, pruning and freezing probably aren't needed.
>
> [ a lot of things ]

I don't understand what any of this has to do with the point I was raising here.

> > > But, these same LP_DEAD-heavy tables *also* have a very decent
> > > chance of benefiting from a better index vacuuming strategy, something
> > > *also* enabled by the conveyor belt design. So overall, in either scenario,
> > > VACUUM concentrates on problems that are particular to a given table
> > > and workload, without being hindered by implementation-level
> > > restrictions.
> >
> > Well this is what I'm not sure about. We need to demonstrate that
> > there are at least some workloads where retiring the LP_DEAD line
> > pointers doesn't become the dominant concern.
>
> It will eventually become the dominant concern. But that could take a
> while, compared to the growth in indexes.
>
> An LP_DEAD line pointer stub in a heap page is 4 bytes. The smallest
> possible B-Tree index tuple is 20 bytes on mainstream platforms (16
> bytes + 4 byte line pointer). Granted deduplication makes this less
> true, but that's far from guaranteed to help. Also, many tables have
> way more than one index.
>
> Of course it isn't nearly as simple as comparing the bytes of bloat in
> each case. More generally, I don't claim that it's easy to
> characterize which factor is more important, even in the abstract,
> even under ideal conditions -- it's very hard. But I'm sure that there
> are routinely very large differences among indexes and the heap
> structure.

Yeah, I think we need to better understand how this works out.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: should vacuum's first heap pass be read-only?

From
Peter Geoghegan
Date:
On Tue, Apr 5, 2022 at 2:53 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Apr 5, 2022 at 4:30 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > On Tue, Apr 5, 2022 at 1:10 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > > I had assumed that this would not be the case, because if the page is
> > > being accessed by the workload, it can be pruned - and probably frozen
> > > too, if we wanted to write code for that and spend the cycles on it -
> > > and if it isn't, pruning and freezing probably aren't needed.
> >
> > [ a lot of things ]
>
> I don't understand what any of this has to do with the point I was raising here.

Why do you assume that we'll ever have an accurate idea of how many
LP_DEAD items there are, before we've looked? And if we're wrong about
that, persistently, why should anything else we think about it really
matter? This is an inherently dynamic and cyclic process. Statistics
don't really work here. That is how my remarks relate to yours:
getting better information about what work we need to do, by blurring
the boundary between deciding what to do and executing that plan,
should be in scope.

On a long enough timeline the LP_DEAD items in heap pages are bound to
become the dominant concern in almost any interesting case for the
conveyor belt, for the obvious reason: you can't do anything about
LP_DEAD items without also doing every other piece of processing
involving those same heap pages. So in that sense, yes, they will be
the dominant problem at times, for sure.

On the other hand it seems very hard to imagine an interesting
scenario in which LP_DEAD items are the dominant problem from the
earliest stage of processing by VACUUM. But even if it was somehow
possible, would it matter? That would mean that there'd be occasional
instances of the conveyor belt being ineffective -- hardly the end of
the world. What would it have cost us to keep it as an option that wasn't
used? I don't think we'd have to do any extra work, other than
in-memory bookkeeping.

-- 
Peter Geoghegan



Re: should vacuum's first heap pass be read-only?

From
Robert Haas
Date:
On Tue, Apr 5, 2022 at 6:26 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On a long enough timeline the LP_DEAD items in heap pages are bound to
> become the dominant concern in almost any interesting case for the
> conveyor belt, for the obvious reason: you can't do anything about
> LP_DEAD items without also doing every other piece of processing
> involving those same heap pages. So in that sense, yes, they will be
> the dominant problem at times, for sure.
>
> On the other hand it seems very hard to imagine an interesting
> scenario in which LP_DEAD items are the dominant problem from the
> earliest stage of processing by VACUUM. But even if it was somehow
> possible, would it matter? That would mean that there'd be occasional
> instances of the conveyor belt being ineffective -- hardly the end of
> the world. What has it cost us to keep it as an option that wasn't
> used? I don't think we'd have to do any extra work, other than
> in-memory bookkeeping.

Well, OK, here's what I don't understand. Let's say I insert a tuple
and then I delete a tuple. Then time goes by and other things happen,
but those things do not include a heap vacuum. However,
during that time, all transactions that were in progress at the time
of the insert-then-delete have now completed. At the end of that time,
the number of things that need to be cleaned out of the heap is
exactly 1: there is either a dead line pointer, or if the page hasn't
been pruned yet, there is a dead tuple. The number of things that need
to be cleaned out of the index is <= 1, because the index tuple could
have gotten nuked by kill_prior_tuple or bottom-up index deletion, or
it might still be there. It follows that the number of dead line
pointers (or tuples that can be truncated to dead line pointers) in
the heap is always greater than or equal to the number in the index.

All things being equal, that means the heap is always in trouble
before the index is in trouble. Maybe all things are not equal, but I
don't know why that should be so. It feels like the index has
opportunistic cleanup mechanisms that can completely eliminate index
tuples, while the heap can at best replace dead tuples with dead line
pointers which still consume some resources.

And if that's the case then doing more index vacuum cycles than we do
heap vacuum cycles really isn't a sensible thing to do. You seem to
think it is, though... what am I missing?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: should vacuum's first heap pass be read-only?

From
Peter Geoghegan
Date:
On Thu, Apr 7, 2022 at 6:45 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Well, OK, here's what I don't understand. Let's say I insert a tuple
> and then I delete a tuple. Then time goes by and other things happen,
> but those things do not include a heap vacuum. However,
> during that time, all transactions that were in progress at the time
> of the insert-then-delete have now completed. At the end of that time,
> the number of things that need to be cleaned out of the heap is
> exactly 1: there is either a dead line pointer, or if the page hasn't
> been pruned yet, there is a dead tuple. The number of things that need
> to be cleaned out of the index is <= 1,

I don't think it's useful to talk about only 1 or 2 tuple (or
tuple-like) units of bloat in isolation.

> because the index tuple could
> have gotten nuked by kill_prior_tuple or bottom-up index deletion, or
> it might still be there. It follows that the number of dead line
> pointers (or tuples that can be truncated to dead line pointers) in
> the heap is always greater than or equal to the number in the index.

These techniques are effective because they limit the *concentration*
of garbage index tuples in any part of the index's key space (or the
concentration on individual leaf pages, if you prefer). There is a
huge difference between 100 dead tuples that are obsolete versions of
100 logical rows, and 100 dead tuples that are obsolete versions of
only 2 or 3 logical rows. In general just counting the number of dead
index tuples can be highly misleading.

> All things being equal, that means the heap is always in trouble
> before the index is in trouble. Maybe all things are not equal, but I
> don't know why that should be so. It feels like the index has
> opportunistic cleanup mechanisms that can completely eliminate index
> tuples, while the heap can at best replace dead tuples with dead line
> pointers which still consume some resources.

The problem with line pointer bloat is not really that you're wasting
these 4-byte units of space. The problem comes from second-order
effects. Allowing line pointer bloat in the short term tends to build
up heap fragmentation in the long term. I am
referring to a situation in which heap tuples tend to get located in
random places over time, leaving logically related heap tuples (those
originally inserted around the same time, by the same transaction)
strewn all over the place (not packed together).

Fragmentation will persist after VACUUM runs and makes all LP_DEAD
items LP_UNUSED. It can be thought of as a degenerative process. If it
was enough to just VACUUM and then reclaim the LP space periodically,
then there wouldn't be much of a problem to fix here. I don't expect
that you'll find this explanation of things satisfactory, since it is
very complicated, and admits a lot of uncertainty. That's just what
experience has led me to believe.

The fact that all of this is so complicated makes techniques that
focus on lowering costs and bounding the uncertainty/worst case seem
so compelling to me -- focusing on keeping costs low (without being
overly concerned about the needs of the workload) is underexploited
right now. Even if you could come up with a 100% needs-driven (and 0%
cost-of-cleanup-driven) model that really worked, it would still be
very sensitive to having accurate parameters -- but accurate current
information is hard to come by right now. And so I just don't see it
ever adding much value on its own.

> And if that's the case then doing more index vacuum cycles than we do
> heap vacuum cycles really isn't a sensible thing to do. You seem to
> think it is, though... what am I missing?

Bottom-up index deletion is only effective in B-Tree indexes whose
keys are not logically changed by non-HOT updates. The
kill_prior_tuple technique is
only effective when index scans happen to read the same part of the
key space for a B-Tree, GiST, or hash index. That leaves a lot of
other cases unaddressed.

I just can't imagine a plausible workload whose real problems manifest
as line pointer bloat first, with any problems in the indexes coming
up only much later, if at all (admittedly, I could probably contrive
such a case if I wanted to).
Absence of evidence isn't evidence of absence, though. Just giving you
my opinion.

Again, though, I must ask: why does it matter either way? Even if such
a scenario were reasonably common, it wouldn't necessarily make life
harder for you here.

--
Peter Geoghegan