Thread: Emit fewer vacuum records by reaping removable tuples during pruning
Hi,

When there are no indexes on the relation, we can set would-be dead items LP_UNUSED and remove them during pruning. This saves us a vacuum WAL record, reducing WAL volume (and time spent writing and syncing WAL). See this example:

    drop table if exists foo;
    create table foo(a int) with (autovacuum_enabled=false);
    insert into foo select i from generate_series(1,10000000)i;
    update foo set a = 10;
    \timing on
    vacuum foo;

On my machine, the attached patch set provides a 10% speedup for vacuum for this example -- and a 40% decrease in WAL bytes emitted. Admittedly, this case is probably unusual in the real world; on-access pruning would preclude it. Throw in a SELECT * FROM foo before the vacuum and the patch has no performance benefit. However, it has no downside as far as I can tell. And, IMHO, it is a code clarity improvement: this change means that lazy_vacuum_heap_page() is only called when we are actually doing a second pass and reaping dead items. I found it quite confusing that lazy_vacuum_heap_page() was called by lazy_scan_heap() to set dead items unused in a block that we had just pruned.

I think it also makes it clear that we should update the VM in lazy_scan_prune(). All callers of lazy_scan_prune() will now consider updating the VM after returning, and most of the state communicated back to lazy_scan_heap() from lazy_scan_prune() is to inform it whether or not to update the VM. I didn't do that in this patch set because I would need to pass all_visible_according_to_vm to lazy_scan_prune(), and that change didn't seem worth the improvement in code clarity in lazy_scan_heap(). I am planning to add a VM update into the freeze record, at which point I will move the VM update code into lazy_scan_prune(). This will then allow us to consolidate the freespace map update code for the prune and noprune cases and make lazy_scan_heap() short and sweet.

Note that (on principle) this patch set is on top of the bug fix I proposed in [1].

- Melanie

[1] https://www.postgresql.org/message-id/CAAKRu_YiL%3D44GvGnt1dpYouDSSoV7wzxVoXs8m3p311rp-TVQQ%40mail.gmail.com
Attachment
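For readers who have not opened the attached patches, the core of the change looks roughly like the sketch below, based on the hunks quoted later in this thread: where pruning a HOT chain would normally leave an LP_DEAD stub behind, the no-indexes case frees the line pointer immediately. The prstate->pronto_reap flag and the record helpers are from the posted patch; the surrounding code is elided and the exact placement is approximate.

    /*
     * Sketch of the behavioral change in heap_prune_chain() (pruneheap.c),
     * per the v2 patch quoted downthread.  Not the literal patch text.
     */
    if (prstate->pronto_reap)
    {
        /*
         * No indexes on the relation: no index entries can reference this
         * dead tuple's TID, so skip the intermediate LP_DEAD state and
         * free the line pointer immediately.
         */
        heap_prune_record_unused(prstate, rootoffnum);
    }
    else
    {
        /* Normal case: leave an LP_DEAD stub for the second vacuum pass. */
        heap_prune_record_dead(prstate, rootoffnum);
    }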
On Mon, Nov 13, 2023 at 2:29 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
> I think it also makes it clear that we should update the VM in
> lazy_scan_prune(). All callers of lazy_scan_prune() will now consider
> updating the VM after returning. And most of the state communicated back
> to lazy_scan_heap() from lazy_scan_prune() is to inform it whether or
> not to update the VM.

That makes sense.

> I didn't do that in this patch set because I would
> need to pass all_visible_according_to_vm to lazy_scan_prune() and that
> change didn't seem worth the improvement in code clarity in
> lazy_scan_heap().

Have you thought about finding a way to get rid of all_visible_according_to_vm? (Not necessarily in the scope of the ongoing work, just in general.)

all_visible_according_to_vm is inherently prone to races -- it tells us what the VM used to say about the page, back when we looked. It is very likely that the page isn't visible in this sense, anyway, because VACUUM is after all choosing to scan the page in the first place when we end up in lazy_scan_prune. (Granted, this is much less true than it should be due to the influence of SKIP_PAGES_THRESHOLD, which implements a weird and inefficient form of prefetching/readahead.)

Why should we necessarily care what all_visible_according_to_vm says or would say at this point? We're looking at the heap page itself, which is more authoritative than the VM (theoretically they're equally authoritative, but not really, not when push comes to shove). The best answer that I can think of is that all_visible_according_to_vm gives us a way of noticing and then logging any inconsistencies between VM and heap that we might end up "repairing" in passing (once we've rechecked the VM). But maybe that could happen elsewhere.

Perhaps that cross-check could be pushed into visibilitymap.c, when we go to set a !PageIsAllVisible(page) page all-visible in the VM. If we're setting it and find that it's already set from earlier on, then remember and complain/LOG it. No all_visible_according_to_vm required, plus this seems like it might be more thorough.

Just a thought.

--
Peter Geoghegan
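To make the suggested cross-check concrete, it might look something like the following inside visibilitymap_set(); this is only a sketch of the idea, with variable names (map, mapByte, mapOffset, heapBuf) guessed from the existing code in that file, and whether to LOG, WARNING, or merely count the inconsistency is left open.

    /*
     * Sketch of the suggested cross-check in visibilitymap_set(): the heap
     * page is not marked PD_ALL_VISIBLE, yet the VM already claims the page
     * is all-visible, so report the disagreement before "repairing" it.
     */
    if (BufferIsValid(heapBuf) &&
        !PageIsAllVisible(BufferGetPage(heapBuf)) &&
        ((map[mapByte] >> mapOffset) & VISIBILITYMAP_ALL_VISIBLE) != 0)
        elog(WARNING,
             "all-visible bit already set for page %u of \"%s\", but heap page is not marked all-visible",
             heapBlk, RelationGetRelationName(rel));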
Re: Emit fewer vacuum records by reaping removable tuples during pruning -- From: Melanie Plageman
On Mon, Nov 13, 2023 at 5:59 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Mon, Nov 13, 2023 at 2:29 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > I think it also makes it clear that we should update the VM in > > lazy_scan_prune(). All callers of lazy_scan_prune() will now consider > > updating the VM after returning. And most of the state communicated back > > to lazy_scan_heap() from lazy_scan_prune() is to inform it whether or > > not to update the VM. > > That makes sense. > > > I didn't do that in this patch set because I would > > need to pass all_visible_according_to_vm to lazy_scan_prune() and that > > change didn't seem worth the improvement in code clarity in > > lazy_scan_heap(). > > Have you thought about finding a way to get rid of > all_visible_according_to_vm? (Not necessarily in the scope of the > ongoing work, just in general.) > > all_visible_according_to_vm is inherently prone to races -- it tells > us what the VM used to say about the page, back when we looked. It is > very likely that the page isn't visible in this sense, anyway, because > VACUUM is after all choosing to scan the page in the first place when > we end up in lazy_scan_prune. (Granted, this is much less true than it > should be due to the influence of SKIP_PAGES_THRESHOLD, which > implements a weird and inefficient form of prefetching/readahead.) > > Why should we necessarily care what all_visible_according_to_vm says > or would say at this point? We're looking at the heap page itself, > which is more authoritative than the VM (theoretically they're equally > authoritative, but not really, not when push comes to shove). The best > answer that I can think of is that all_visible_according_to_vm gives > us a way of noticing and then logging any inconsistencies between VM > and heap that we might end up "repairing" in passing (once we've > rechecked the VM). But maybe that could happen elsewhere. > > Perhaps that cross-check could be pushed into visibilitymap.c, when we > go to set a !PageIsAllVisible(page) page all-visible in the VM. If > we're setting it and find that it's already set from earlier on, then > remember and complaing/LOG it. No all_visible_according_to_vm > required, plus this seems like it might be more thorough. Setting aside the data corruption cases (the repairing you mentioned), I think the primary case that all_visible_according_to_vm seeks to protect us from is if the page was already marked all visible in the VM and pruning did not change that. So, it avoids a call to visibilitymap_set() when the page was known to already be set all visible (as long as we didn't newly set all frozen). As you say, this does not seem like the common case. Pages we vacuum will most often not be all visible. I actually wonder if it is worse to rely on that old value of all visible and then call visibilitymap_set(), which takes an exclusive lock on the VM page, than it would be to just call visibilitymap_get_status() anew. This is what we do with all frozen. The problem is that it is kind of hard to tell because the whole thing is a bit of a tangled knot. I've ripped this code apart and put it back together again six ways from Sunday trying to get rid of all_visible_according_to_vm and next_unskippable_allvis/skipping_current_range's bootleg prefetching without incurring any additional calls to visibilitymap_get_status(). As best I can tell, our best case scenario is Thomas' streaming read API goes in, we add vacuum as a user, and we can likely remove the skip range logic. 
Then, the question remains about how useful it is to save the visibility map statuses we got when deciding whether or not to skip a block. - Melanie [1] https://www.postgresql.org/message-id/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com
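For reference, the alternative being weighed here -- rechecking the map rather than trusting the value read before pruning -- would amount to something like the following at the call site in lazy_scan_heap(). This is a sketch only, with the all-frozen handling omitted; the variable and member names are taken from the existing code as I recall it.

    /*
     * Sketch: instead of consulting all_visible_according_to_vm (read
     * before pruning), re-read the page's VM bits once pruning has
     * concluded the page is all-visible, and only then decide whether
     * the VM needs updating.
     */
    if (prunestate.all_visible)
    {
        uint8   old_vmbits = visibilitymap_get_status(vacrel->rel, blkno,
                                                      &vmbuffer);

        /* only take the exclusive VM page lock if the bit isn't set yet */
        if ((old_vmbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
            visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
                              vmbuffer, prunestate.visibility_cutoff_xid,
                              VISIBILITYMAP_ALL_VISIBLE);
    }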
Re: Emit fewer vacuum records by reaping removable tuples during pruning -- From: Melanie Plageman
On Mon, Nov 13, 2023 at 5:28 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
> When there are no indexes on the relation, we can set would-be dead
> items LP_UNUSED and remove them during pruning. This saves us a vacuum
> WAL record, reducing WAL volume (and time spent writing and syncing
> WAL).
...
> Note that (on principle) this patch set is on top of the bug fix I
> proposed in [1].
>
> [1] https://www.postgresql.org/message-id/CAAKRu_YiL%3D44GvGnt1dpYouDSSoV7wzxVoXs8m3p311rp-TVQQ%40mail.gmail.com

Rebased on top of the fix in b2e237afddc56a and registered for the January commitfest: https://commitfest.postgresql.org/46/4665/

- Melanie
Attachment
Re: Emit fewer vacuum records by reaping removable tuples during pruning -- From: Melanie Plageman
On Fri, Nov 17, 2023 at 6:12 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Mon, Nov 13, 2023 at 5:28 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > When there are no indexes on the relation, we can set would-be dead > > items LP_UNUSED and remove them during pruning. This saves us a vacuum > > WAL record, reducing WAL volume (and time spent writing and syncing > > WAL). > ... > > Note that (on principle) this patch set is on top of the bug fix I > > proposed in [1]. > > > > [1] https://www.postgresql.org/message-id/CAAKRu_YiL%3D44GvGnt1dpYouDSSoV7wzxVoXs8m3p311rp-TVQQ%40mail.gmail.com > > Rebased on top of fix in b2e237afddc56a and registered for the january fest > https://commitfest.postgresql.org/46/4665/ I got an off-list question about whether or not this codepath is exercised in existing regression tests. It is -- vacuum.sql tests include those which vacuum a table with no indexes and tuples that can be deleted. I also looked through [1] to see if there were any user-facing docs which happened to mention the exact implementation details of how and when tuples are deleted by vacuum. I didn't see anything like that, so I don't think there are user-facing docs which need updating. - Melanie [1] https://www.postgresql.org/docs/devel/routine-vacuuming.html
On Mon, Nov 13, 2023 at 07:06:15PM -0500, Melanie Plageman wrote:
> As best I can tell, our best case scenario is Thomas' streaming read API
> goes in, we add vacuum as a user, and we can likely remove the skip
> range logic.

This does not prevent the work you've been doing in 0001 and 0002 posted upthread, right? Some progress is always better than no progress, and I can see the appeal behind doing 0001 to keep the updates of the block numbers closer to where we determine whether relation truncation is safe or not, rather than using an intermediate state in LVPagePruneState.

0002 is much, much, much trickier.
--
Michael
Attachment
Re: Emit fewer vacuum records by reaping removable tuples during pruning -- From: Melanie Plageman
On Sat, Dec 23, 2023 at 9:14 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Mon, Nov 13, 2023 at 07:06:15PM -0500, Melanie Plageman wrote: > > As best I can tell, our best case scenario is Thomas' streaming read API > > goes in, we add vacuum as a user, and we can likely remove the skip > > range logic. > > This does not prevent the work you've been doing in 0001 and 0002 > posted upthread, right? Some progress is always better than no > progress Correct. Peter and I were mainly discussing next refactoring steps as we move toward combining the prune, freeze, and VM records. This thread's patches stand alone. > I can see the appeal behind doing 0001 actually to keep > the updates of the block numbers closer to where we determine if > relation truncation is safe of not rather than use an intermediate > state in LVPagePruneState. Exactly. > 0002 is much, much, much trickier.. Do you have specific concerns about its correctness? I understand it is an area where we have to be sure we are correct. But, to be fair, that is true of all the pruning and vacuuming code. - Melanie
On Wed, Dec 27, 2023 at 11:27 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > Do you have specific concerns about its correctness? I understand it > is an area where we have to be sure we are correct. But, to be fair, > that is true of all the pruning and vacuuming code. I'm kind of concerned that 0002 might be a performance regression. It pushes more branches down into the heap-pruning code, which I think can sometimes be quite hot, for the sake of a case that rarely occurs in practice. I take your point about it improving things when there are no indexes, but what about when there are? And even if there are no adverse performance consequences, is it really worth complicating the logic at such a low level? Also, I find "pronto_reap" to be a poor choice of name. "pronto" is an informal word that seems to have no advantage over something like "immediate" or "now," and I don't think "reap" has a precise, universally-understood meaning. You could call this "mark_unused_now" or "immediately_mark_unused" or something and it would be far more self-documenting, IMHO. -- Robert Haas EDB: http://www.enterprisedb.com
Hi,

On 2024-01-04 12:31:36 -0500, Robert Haas wrote:
> On Wed, Dec 27, 2023 at 11:27 AM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > Do you have specific concerns about its correctness? I understand it
> > is an area where we have to be sure we are correct. But, to be fair,
> > that is true of all the pruning and vacuuming code.
>
> I'm kind of concerned that 0002 might be a performance regression. It
> pushes more branches down into the heap-pruning code, which I think
> can sometimes be quite hot, for the sake of a case that rarely occurs
> in practice.

I was wondering the same when talking about this with Melanie. But I guess there are some scenarios that aren't unrealistic, consider e.g. bulk data loading with COPY followed by an UPDATE to massage the data afterwards, before creating the indexes.

Where I could see this becoming more interesting / powerful is if we were able to do 'pronto reaping' not just from vacuum but also during on-access pruning. For ETL workloads VACUUM might often not run between modifications of the same page, but on-access pruning will. Being able to reclaim dead items at that point seems like it could be a pretty sizable improvement?

> I take your point about it improving things when there are no indexes, but
> what about when there are?

I suspect that branch prediction should take care of the additional branches. heap_prune_chain() indeed can be very hot, but I think we're primarily bottlenecked on data cache misses. For a single VACUUM, whether we'd do pronto reaping would be constant -> very well predictable. We could add an unlikely() to make sure the branch-predictor-is-empty case optimizes for the much more common case of having indexes. Falsely assuming we'd not pronto reap wouldn't be particularly bad, as the wins for that case are so much bigger.

If we were to use pronto reaping for on-access pruning, it's perhaps a bit less predictable, as pruning for pages of a relation with indexes could be interleaved with pruning for relations without. But even there I suspect it'd not be the primary bottleneck: we call heap_prune_chain() in a loop for every tuple on a page, so the branch predictor should quickly learn whether we're using pronto reaping, whereas we're not becoming less cache-miss heavy when looking at subsequent tuples.

> And even if there are no adverse performance consequences, is it really
> worth complicating the logic at such a low level?

Yes, I think this is the main question here. It's not clear, though, that the state after the patch is meaningfully more complicated: it removes nontrivial code from lazy_scan_heap() and pays for that with a bit more complexity in heap_prune_chain().

> Also, I find "pronto_reap" to be a poor choice of name. "pronto" is an
> informal word that seems to have no advantage over something like
> "immediate" or "now,"

I didn't like that either :)

> and I don't think "reap" has a precise, universally-understood meaning.

Less concerned about that.

> You could call this "mark_unused_now" or "immediately_mark_unused" or
> something and it would be far more self-documenting, IMHO.

How about 'no_indexes' or such?

Greetings,

Andres Freund
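Concretely, the hint being suggested is just a one-word change along these lines (illustrative only; whether it measurably helps is exactly what would need benchmarking):

    /*
     * Sketch of the suggested branch hint in heap_prune_chain(): the
     * no-indexes case is the rare one, so lay out the code for the much
     * more common indexed case.
     */
    if (unlikely(prstate->pronto_reap))
        heap_prune_record_unused(prstate, rootoffnum);
    else
        heap_prune_record_dead(prstate, rootoffnum);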
Hi, On 2023-11-17 18:12:08 -0500, Melanie Plageman wrote: > diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c > index 14de8158d49..b578c32eeb6 100644 > --- a/src/backend/access/heap/heapam.c > +++ b/src/backend/access/heap/heapam.c > @@ -8803,8 +8803,13 @@ heap_xlog_prune(XLogReaderState *record) > nunused = (end - nowunused); > Assert(nunused >= 0); > > - /* Update all line pointers per the record, and repair fragmentation */ > - heap_page_prune_execute(buffer, > + /* > + * Update all line pointers per the record, and repair fragmentation. > + * We always pass pronto_reap as true, because we don't know whether > + * or not this option was used when pruning. This reduces the > + * validation done on replay in an assert build. > + */ Hm, that seems not great. Both because we loose validation and because it seems to invite problems down the line, due to pronto_reap falsely being set to true in heap_page_prune_execute(). > @@ -581,7 +589,17 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, > * function.) > */ > if (ItemIdIsDead(lp)) > + { > + /* > + * If the relation has no indexes, we can set dead line pointers > + * LP_UNUSED now. We don't increment ndeleted here since the LP > + * was already marked dead. > + */ > + if (prstate->pronto_reap) > + heap_prune_record_unused(prstate, offnum); > + > break; > + } I wasn't immediately sure whether this is reachable - but it is, e.g. after on-access pruning (which currently doesn't yet use pronto reaping), after pg_upgrade or dropping an index. > Assert(ItemIdIsNormal(lp)); > htup = (HeapTupleHeader) PageGetItem(dp, lp); > @@ -715,7 +733,17 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, > * redirect the root to the correct chain member. > */ > if (i >= nchain) > - heap_prune_record_dead(prstate, rootoffnum); > + { > + /* > + * If the relation has no indexes, we can remove dead tuples > + * during pruning instead of marking their line pointers dead. Set > + * this tuple's line pointer LP_UNUSED. > + */ > + if (prstate->pronto_reap) > + heap_prune_record_unused(prstate, rootoffnum); > + else > + heap_prune_record_dead(prstate, rootoffnum); > + } > else > heap_prune_record_redirect(prstate, rootoffnum, chainitems[i]); > } > @@ -726,9 +754,12 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, > * item. This can happen if the loop in heap_page_prune caused us to > * visit the dead successor of a redirect item before visiting the > * redirect item. We can clean up by setting the redirect item to > - * DEAD state. > + * DEAD state. If pronto_reap is true, we can set it LP_UNUSED now. > */ > - heap_prune_record_dead(prstate, rootoffnum); > + if (prstate->pronto_reap) > + heap_prune_record_unused(prstate, rootoffnum); > + else > + heap_prune_record_dead(prstate, rootoffnum); > } > > return ndeleted; There's three new calls to heap_prune_record_unused() and the logic got more nested. Is there a way to get to a nicer end result? > From 608658f2cbc0acde55aac815c0fdb523ec24c452 Mon Sep 17 00:00:00 2001 > From: Melanie Plageman <melanieplageman@gmail.com> > Date: Mon, 13 Nov 2023 16:47:08 -0500 > Subject: [PATCH v2 1/2] Indicate rel truncation unsafe in lazy_scan[no]prune > > Both lazy_scan_prune() and lazy_scan_noprune() must determine whether or > not there are tuples on the page making rel truncation unsafe. > LVRelState->nonempty_pages is updated to reflect this. 
Previously, both > functions set an output parameter or output parameter member, hastup, to > indicate that nonempty_pages should be updated to reflect the latest > non-removable page. There doesn't seem to be any reason to wait until > lazy_scan_[no]prune() returns to update nonempty_pages. Plenty of other > counters in the LVRelState are updated in lazy_scan_[no]prune(). > This allows us to get rid of the output parameter hastup. > @@ -972,20 +970,21 @@ lazy_scan_heap(LVRelState *vacrel) > continue; > } > > - /* Collect LP_DEAD items in dead_items array, count tuples */ > - if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup, > + /* > + * Collect LP_DEAD items in dead_items array, count tuples, > + * determine if rel truncation is safe > + */ > + if (lazy_scan_noprune(vacrel, buf, blkno, page, > &recordfreespace)) > { > Size freespace = 0; > > /* > * Processed page successfully (without cleanup lock) -- just > - * need to perform rel truncation and FSM steps, much like the > - * lazy_scan_prune case. Don't bother trying to match its > - * visibility map setting steps, though. > + * need to update the FSM, much like the lazy_scan_prune case. > + * Don't bother trying to match its visibility map setting > + * steps, though. > */ > - if (hastup) > - vacrel->nonempty_pages = blkno + 1; > if (recordfreespace) > freespace = PageGetHeapFreeSpace(page); > UnlockReleaseBuffer(buf); The comment continues to say that we "determine if rel truncation is safe" - but I don't see that? Oh, I see, it's done inside lazy_scan_noprune(). This doesn't seem like a clear improvement to me. Particularly because it's only set if lazy_scan_noprune() actually does something. I don't like the existing code in lazy_scan_heap(). But this kinda seems like tinkering around the edges, without getting to the heart of the issue. I think we should 1) Move everything after ReadBufferExtended() and the end of the loop into its own function 2) All the code in the loop body after the call to lazy_scan_prune() is specific to the lazy_scan_prune() path, it doesn't make sense that it's at the same level as the the calls to lazy_scan_noprune(), lazy_scan_new_or_empty() or lazy_scan_prune(). Either it should be in lazy_scan_prune() or a new wrapper function. 3) It's imo wrong that we have UnlockReleaseBuffer() (there are 6 different places unlocking if I didn't miscount!) and RecordPageWithFreeSpace() calls in this many places. I think this is largely a consequence of the previous points. Once those are addressed, we can have one common place. But I digress. Greetings, Andres Freund
Re: Emit fewer vacuum records by reaping removable tuples during pruning -- From: Melanie Plageman
Thanks for the review! On Thu, Jan 4, 2024 at 3:03 PM Andres Freund <andres@anarazel.de> wrote: > > On 2023-11-17 18:12:08 -0500, Melanie Plageman wrote: > > Assert(ItemIdIsNormal(lp)); > > htup = (HeapTupleHeader) PageGetItem(dp, lp); > > @@ -715,7 +733,17 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, > > * redirect the root to the correct chain member. > > */ > > if (i >= nchain) > > - heap_prune_record_dead(prstate, rootoffnum); > > + { > > + /* > > + * If the relation has no indexes, we can remove dead tuples > > + * during pruning instead of marking their line pointers dead. Set > > + * this tuple's line pointer LP_UNUSED. > > + */ > > + if (prstate->pronto_reap) > > + heap_prune_record_unused(prstate, rootoffnum); > > + else > > + heap_prune_record_dead(prstate, rootoffnum); > > + } > > else > > heap_prune_record_redirect(prstate, rootoffnum, chainitems[i]); > > } > > @@ -726,9 +754,12 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, > > * item. This can happen if the loop in heap_page_prune caused us to > > * visit the dead successor of a redirect item before visiting the > > * redirect item. We can clean up by setting the redirect item to > > - * DEAD state. > > + * DEAD state. If pronto_reap is true, we can set it LP_UNUSED now. > > */ > > - heap_prune_record_dead(prstate, rootoffnum); > > + if (prstate->pronto_reap) > > + heap_prune_record_unused(prstate, rootoffnum); > > + else > > + heap_prune_record_dead(prstate, rootoffnum); > > } > > > > return ndeleted; > > There's three new calls to heap_prune_record_unused() and the logic got more > nested. Is there a way to get to a nicer end result? So, I could do another loop through the line pointers in heap_page_prune() (after the loop calling heap_prune_chain()) and, if pronto_reap is true, set dead line pointers LP_UNUSED. Then, when constructing the WAL record, I would just not add the prstate.nowdead that I saved from heap_prune_chain() to the prune WAL record. This would eliminate the extra if statements from heap_prune_chain(). It will be more performant than sticking with the original (master) call to lazy_vacuum_heap_page(). However, I'm not convinced that the extra loop to set line pointers LP_DEAD -> LP_UNUSED is less confusing than keeping the if pronto_reap test in heap_prune_chain(). heap_prune_chain() is where line pointers' new values are decided. It seems weird to pick one new value for a line pointer in heap_prune_chain() and then pick another, different new value in a loop after heap_prune_chain(). I don't see any way to eliminate the if pronto_reap tests without a separate loop setting LP_DEAD->LP_UNUSED, though. > > From 608658f2cbc0acde55aac815c0fdb523ec24c452 Mon Sep 17 00:00:00 2001 > > From: Melanie Plageman <melanieplageman@gmail.com> > > Date: Mon, 13 Nov 2023 16:47:08 -0500 > > Subject: [PATCH v2 1/2] Indicate rel truncation unsafe in lazy_scan[no]prune > > > > Both lazy_scan_prune() and lazy_scan_noprune() must determine whether or > > not there are tuples on the page making rel truncation unsafe. > > LVRelState->nonempty_pages is updated to reflect this. Previously, both > > functions set an output parameter or output parameter member, hastup, to > > indicate that nonempty_pages should be updated to reflect the latest > > non-removable page. There doesn't seem to be any reason to wait until > > lazy_scan_[no]prune() returns to update nonempty_pages. Plenty of other > > counters in the LVRelState are updated in lazy_scan_[no]prune(). 
> > This allows us to get rid of the output parameter hastup. > > > > @@ -972,20 +970,21 @@ lazy_scan_heap(LVRelState *vacrel) > > continue; > > } > > > > - /* Collect LP_DEAD items in dead_items array, count tuples */ > > - if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup, > > + /* > > + * Collect LP_DEAD items in dead_items array, count tuples, > > + * determine if rel truncation is safe > > + */ > > + if (lazy_scan_noprune(vacrel, buf, blkno, page, > > &recordfreespace)) > > { > > Size freespace = 0; > > > > /* > > * Processed page successfully (without cleanup lock) -- just > > - * need to perform rel truncation and FSM steps, much like the > > - * lazy_scan_prune case. Don't bother trying to match its > > - * visibility map setting steps, though. > > + * need to update the FSM, much like the lazy_scan_prune case. > > + * Don't bother trying to match its visibility map setting > > + * steps, though. > > */ > > - if (hastup) > > - vacrel->nonempty_pages = blkno + 1; > > if (recordfreespace) > > freespace = PageGetHeapFreeSpace(page); > > UnlockReleaseBuffer(buf); > > The comment continues to say that we "determine if rel truncation is safe" - > but I don't see that? Oh, I see, it's done inside lazy_scan_noprune(). This > doesn't seem like a clear improvement to me. Particularly because it's only > set if lazy_scan_noprune() actually does something. I don't get what the last sentence means ("Particularly because..."). The new location of the hastup test in lazy_scan_noprune() is above an unconditional return true, so it is also only set if lazy_scan_noprune() actually does something. I think the lazy_scan[]prune() functions shouldn't try to export the hastup information to lazy_scan_heap(). It's confusing. We should be moving all of the page-specific processing into the individual functions instead of in the body of lazy_scan_heap(). > I don't like the existing code in lazy_scan_heap(). But this kinda seems like > tinkering around the edges, without getting to the heart of the issue. I think > we should > > 1) Move everything after ReadBufferExtended() and the end of the loop into its > own function > > 2) All the code in the loop body after the call to lazy_scan_prune() is > specific to the lazy_scan_prune() path, it doesn't make sense that it's at > the same level as the the calls to lazy_scan_noprune(), > lazy_scan_new_or_empty() or lazy_scan_prune(). Either it should be in > lazy_scan_prune() or a new wrapper function. > > 3) It's imo wrong that we have UnlockReleaseBuffer() (there are 6 different > places unlocking if I didn't miscount!) and RecordPageWithFreeSpace() calls > in this many places. I think this is largely a consequence of the previous > points. Once those are addressed, we can have one common place. I have other patches that do versions of all of the above, but they didn't seem to really fit with this patch set. I am taking a step to move code out of lazy_scan_heap() that doesn't belong there. That fact that other code should also be moved from there seems more like a "yes and" than a "no but". That being said, do you think I should introduce patches doing further refactoring of lazy_scan_heap() (like what you suggest above) into this thread? - Melanie
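For concreteness, the alternative described above -- a second pass inside heap_page_prune() after the heap_prune_chain() loop -- would look roughly like the following. This is shape only, not proposed code; it glosses over PruneState's marked[] bookkeeping and over keeping nowdead out of the prune WAL record.

    /*
     * Hypothetical second step in heap_page_prune(): when the relation has
     * no indexes, treat everything recorded as newly dead as newly unused
     * instead, and emit nothing as LP_DEAD in the WAL record.
     */
    if (prstate.pronto_reap)
    {
        for (int i = 0; i < prstate.ndead; i++)
        {
            Assert(prstate.nunused < MaxHeapTuplesPerPage);
            prstate.nowunused[prstate.nunused++] = prstate.nowdead[i];
        }
        prstate.ndead = 0;      /* nothing to record as LP_DEAD */
    }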
Re: Emit fewer vacuum records by reaping removable tuples during pruning -- From: Melanie Plageman
On Thu, Jan 4, 2024 at 12:31 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Dec 27, 2023 at 11:27 AM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > Do you have specific concerns about its correctness? I understand it > > is an area where we have to be sure we are correct. But, to be fair, > > that is true of all the pruning and vacuuming code. > > I'm kind of concerned that 0002 might be a performance regression. It > pushes more branches down into the heap-pruning code, which I think > can sometimes be quite hot, for the sake of a case that rarely occurs > in practice. I take your point about it improving things when there > are no indexes, but what about when there are? And even if there are > no adverse performance consequences, is it really worth complicating > the logic at such a low level? Regarding the additional code complexity, I think the extra call to lazy_vacuum_heap_page() in lazy_scan_heap() actually represents a fair amount of code complexity. It is a special case of page-level processing that should be handled by heap_page_prune() and not lazy_scan_heap(). lazy_scan_heap() is responsible for three main things -- loop through the blocks in a relation and process each page (pruning, freezing, etc), invoke index vacuuming, invoke functions to loop through dead_items and vacuum pages. The logic to do the per-page processing is spread across three places, though. When a single page is being processed, page pruning happens in heap_page_prune(). Freezing, dead items recording, and visibility checks happen in lazy_scan_prune(). Visibility map updates and freespace map updates happen back in lazy_scan_heap(). Except, if the table has no indexes, in which case, lazy_scan_heap() also invokes lazy_vacuum_heap_page() to set dead line pointers unused and do another separate visibility check and VM update. I maintain that all page-level processing should be done in the page-level processing functions (like lazy_scan_prune()). And lazy_scan_heap() shouldn't be directly responsible for special case page-level processing. > Also, I find "pronto_reap" to be a poor choice of name. "pronto" is an > informal word that seems to have no advantage over something like > "immediate" or "now," and I don't think "reap" has a precise, > universally-understood meaning. You could call this "mark_unused_now" > or "immediately_mark_unused" or something and it would be far more > self-documenting, IMHO. Yes, I see how pronto is unnecessarily informal. If there are no cases other than when the table has no indexes that we would consider immediately marking LPs unused, then perhaps it is better to call it "no_indexes" (per andres' suggestion)? - Melanie
On Thu, Jan 4, 2024 at 6:03 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > When a single page is being processed, page pruning happens in > heap_page_prune(). Freezing, dead items recording, and visibility > checks happen in lazy_scan_prune(). Visibility map updates and > freespace map updates happen back in lazy_scan_heap(). Except, if the > table has no indexes, in which case, lazy_scan_heap() also invokes > lazy_vacuum_heap_page() to set dead line pointers unused and do > another separate visibility check and VM update. I maintain that all > page-level processing should be done in the page-level processing > functions (like lazy_scan_prune()). And lazy_scan_heap() shouldn't be > directly responsible for special case page-level processing. But you can just as easily turn this argument on its head, can't you? In general, except for HOT tuples, line pointers are marked dead by pruning and unused by vacuum. Here you want to turn it on its head and make pruning do what would normally be vacuum's responsibility. I mean, that's not to say that your argument is "wrong" ... but what I just said really is how I think about it, too. > > Also, I find "pronto_reap" to be a poor choice of name. "pronto" is an > > informal word that seems to have no advantage over something like > > "immediate" or "now," and I don't think "reap" has a precise, > > universally-understood meaning. You could call this "mark_unused_now" > > or "immediately_mark_unused" or something and it would be far more > > self-documenting, IMHO. > > Yes, I see how pronto is unnecessarily informal. If there are no cases > other than when the table has no indexes that we would consider > immediately marking LPs unused, then perhaps it is better to call it > "no_indexes" (per andres' suggestion)? wfm. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Emit fewer vacuum records by reaping removable tuples during pruning -- From: Melanie Plageman
On Fri, Jan 5, 2024 at 8:59 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Jan 4, 2024 at 6:03 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > When a single page is being processed, page pruning happens in > > heap_page_prune(). Freezing, dead items recording, and visibility > > checks happen in lazy_scan_prune(). Visibility map updates and > > freespace map updates happen back in lazy_scan_heap(). Except, if the > > table has no indexes, in which case, lazy_scan_heap() also invokes > > lazy_vacuum_heap_page() to set dead line pointers unused and do > > another separate visibility check and VM update. I maintain that all > > page-level processing should be done in the page-level processing > > functions (like lazy_scan_prune()). And lazy_scan_heap() shouldn't be > > directly responsible for special case page-level processing. > > But you can just as easily turn this argument on its head, can't you? > In general, except for HOT tuples, line pointers are marked dead by > pruning and unused by vacuum. Here you want to turn it on its head and > make pruning do what would normally be vacuum's responsibility. I actually think we are going to want to stop referring to these steps as pruning and vacuuming. It is confusing because vacuuming refers to the whole process of doing garbage collection on the table and also to the specific step of setting dead line pointers unused. If we called these steps say, pruning and reaping, that may be more clear. Vacuuming consists of three phases -- the first pass, index vacuuming, and the second pass. I don't think we should dictate what happens in each pass. That is, we shouldn't expect only pruning to happen in the first pass and only reaping to happen in the second pass. For example, I think Andres has previously proposed doing another round of pruning after index vacuuming. The second pass/third phase is distinguished primarily by being after index vacuuming. If we think about it this way, that frees us up to set dead line pointers unused in the first pass when the table has no indexes. For clarity, I could add a block comment explaining that doing this is an optimization and not a logical requirement. One way to make this even more clear would be to set the dead line pointers unused in a separate loop after heap_prune_chain() as I proposed upthread. - Melanie
On Fri, Jan 5, 2024 at 12:59 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > I actually think we are going to want to stop referring to these steps > as pruning and vacuuming. It is confusing because vacuuming refers to > the whole process of doing garbage collection on the table and also to > the specific step of setting dead line pointers unused. If we called > these steps say, pruning and reaping, that may be more clear. I agree that there's some terminological confusion here, but if we leave all of the current naming unchanged and start using new naming conventions in new code, I don't think that's going to clear things up. Getting rid of the weird naming with pruning, "scanning" (that is part of vacuum), and "vacuuming" (that is another part of vacuum) is gonna take some work. > Vacuuming consists of three phases -- the first pass, index vacuuming, > and the second pass. I don't think we should dictate what happens in > each pass. That is, we shouldn't expect only pruning to happen in the > first pass and only reaping to happen in the second pass. For example, > I think Andres has previously proposed doing another round of pruning > after index vacuuming. The second pass/third phase is distinguished > primarily by being after index vacuuming. > > If we think about it this way, that frees us up to set dead line > pointers unused in the first pass when the table has no indexes. For > clarity, I could add a block comment explaining that doing this is an > optimization and not a logical requirement. One way to make this even > more clear would be to set the dead line pointers unused in a separate > loop after heap_prune_chain() as I proposed upthread. I don't really disagree with any of this, but I think the question is whether the code is cleaner with mark-LP-as-unused pushed down into pruning or whether it's better the way it is. Andres seems confident that the change won't suck for performance, which is good, but I'm not quite convinced that it's the right direction to go with the code, and he doesn't seem to be either. Perhaps this all turns on this point: MP> I am planning to add a VM update into the freeze record, at which point MP> I will move the VM update code into lazy_scan_prune(). This will then MP> allow us to consolidate the freespace map update code for the prune and MP> noprune cases and make lazy_scan_heap() short and sweet. Can we see what that looks like on top of this change? -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2024-01-04 17:37:27 -0500, Melanie Plageman wrote: > On Thu, Jan 4, 2024 at 3:03 PM Andres Freund <andres@anarazel.de> wrote: > > > > On 2023-11-17 18:12:08 -0500, Melanie Plageman wrote: > > > Assert(ItemIdIsNormal(lp)); > > > htup = (HeapTupleHeader) PageGetItem(dp, lp); > > > @@ -715,7 +733,17 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, > > > * redirect the root to the correct chain member. > > > */ > > > if (i >= nchain) > > > - heap_prune_record_dead(prstate, rootoffnum); > > > + { > > > + /* > > > + * If the relation has no indexes, we can remove dead tuples > > > + * during pruning instead of marking their line pointers dead. Set > > > + * this tuple's line pointer LP_UNUSED. > > > + */ > > > + if (prstate->pronto_reap) > > > + heap_prune_record_unused(prstate, rootoffnum); > > > + else > > > + heap_prune_record_dead(prstate, rootoffnum); > > > + } > > > else > > > heap_prune_record_redirect(prstate, rootoffnum, chainitems[i]); > > > } > > > @@ -726,9 +754,12 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, > > > * item. This can happen if the loop in heap_page_prune caused us to > > > * visit the dead successor of a redirect item before visiting the > > > * redirect item. We can clean up by setting the redirect item to > > > - * DEAD state. > > > + * DEAD state. If pronto_reap is true, we can set it LP_UNUSED now. > > > */ > > > - heap_prune_record_dead(prstate, rootoffnum); > > > + if (prstate->pronto_reap) > > > + heap_prune_record_unused(prstate, rootoffnum); > > > + else > > > + heap_prune_record_dead(prstate, rootoffnum); > > > } > > > > > > return ndeleted; > > > > There's three new calls to heap_prune_record_unused() and the logic got more > > nested. Is there a way to get to a nicer end result? > > So, I could do another loop through the line pointers in > heap_page_prune() (after the loop calling heap_prune_chain()) and, if > pronto_reap is true, set dead line pointers LP_UNUSED. Then, when > constructing the WAL record, I would just not add the prstate.nowdead > that I saved from heap_prune_chain() to the prune WAL record. Hm, that seems a bit sad as well. I am wondering if we could move the pronto_reap handling into heap_prune_record_dead() or a wrapper of it. I am more concerned about the human reader than the CPU here... > > > @@ -972,20 +970,21 @@ lazy_scan_heap(LVRelState *vacrel) > > > continue; > > > } > > > > > > - /* Collect LP_DEAD items in dead_items array, count tuples */ > > > - if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup, > > > + /* > > > + * Collect LP_DEAD items in dead_items array, count tuples, > > > + * determine if rel truncation is safe > > > + */ > > > + if (lazy_scan_noprune(vacrel, buf, blkno, page, > > > &recordfreespace)) > > > { > > > Size freespace = 0; > > > > > > /* > > > * Processed page successfully (without cleanup lock) -- just > > > - * need to perform rel truncation and FSM steps, much like the > > > - * lazy_scan_prune case. Don't bother trying to match its > > > - * visibility map setting steps, though. > > > + * need to update the FSM, much like the lazy_scan_prune case. > > > + * Don't bother trying to match its visibility map setting > > > + * steps, though. > > > */ > > > - if (hastup) > > > - vacrel->nonempty_pages = blkno + 1; > > > if (recordfreespace) > > > freespace = PageGetHeapFreeSpace(page); > > > UnlockReleaseBuffer(buf); > > > > The comment continues to say that we "determine if rel truncation is safe" - > > but I don't see that? 
Oh, I see, it's done inside lazy_scan_noprune(). This > > doesn't seem like a clear improvement to me. Particularly because it's only > > set if lazy_scan_noprune() actually does something. > > I don't get what the last sentence means ("Particularly because..."). Took me a second to understand myself again too, oops. What I think I meant is that it seems error-prone that it's only set in some paths inside lazy_scan_noprune(). Previously it was at least a bit clearer in lazy_scan_heap() that it would be set for the different possible paths. > > I don't like the existing code in lazy_scan_heap(). But this kinda seems like > > tinkering around the edges, without getting to the heart of the issue. I think > > we should > > > > 1) Move everything after ReadBufferExtended() and the end of the loop into its > > own function > > > > 2) All the code in the loop body after the call to lazy_scan_prune() is > > specific to the lazy_scan_prune() path, it doesn't make sense that it's at > > the same level as the the calls to lazy_scan_noprune(), > > lazy_scan_new_or_empty() or lazy_scan_prune(). Either it should be in > > lazy_scan_prune() or a new wrapper function. > > > > 3) It's imo wrong that we have UnlockReleaseBuffer() (there are 6 different > > places unlocking if I didn't miscount!) and RecordPageWithFreeSpace() calls > > in this many places. I think this is largely a consequence of the previous > > points. Once those are addressed, we can have one common place. > > I have other patches that do versions of all of the above, but they > didn't seem to really fit with this patch set. I am taking a step to > move code out of lazy_scan_heap() that doesn't belong there. That fact > that other code should also be moved from there seems more like a "yes > and" than a "no but". That being said, do you think I should introduce > patches doing further refactoring of lazy_scan_heap() (like what you > suggest above) into this thread? It probably should not be part of this patchset. I probably shouldn't have written the above here, but after concluding that I didn't think your small refactoring patch was quite right, I couldn't stop myself from thinking about what would be right. Greetings, Andres Freund
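The wrapper being suggested could be as small as the following; the name heap_prune_record_dead_or_unused and its placement are illustrative, not taken from any posted patch.

    /*
     * Sketch of the suggested wrapper: callers in heap_prune_chain() that
     * currently branch on prstate->pronto_reap would call this instead,
     * keeping the branch (and the explanation) in one place.
     */
    static void
    heap_prune_record_dead_or_unused(PruneState *prstate, OffsetNumber offnum)
    {
        /*
         * If the relation has no indexes, no index entries can point at
         * this item, so it can be marked LP_UNUSED right away instead of
         * leaving an LP_DEAD stub for a second vacuum pass.
         */
        if (prstate->pronto_reap)
            heap_prune_record_unused(prstate, offnum);
        else
            heap_prune_record_dead(prstate, offnum);
    }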
Hi,

On 2024-01-05 08:59:41 -0500, Robert Haas wrote:
> On Thu, Jan 4, 2024 at 6:03 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > When a single page is being processed, page pruning happens in
> > heap_page_prune(). Freezing, dead items recording, and visibility
> > checks happen in lazy_scan_prune(). Visibility map updates and
> > freespace map updates happen back in lazy_scan_heap(). Except, if the
> > table has no indexes, in which case, lazy_scan_heap() also invokes
> > lazy_vacuum_heap_page() to set dead line pointers unused and do
> > another separate visibility check and VM update. I maintain that all
> > page-level processing should be done in the page-level processing
> > functions (like lazy_scan_prune()). And lazy_scan_heap() shouldn't be
> > directly responsible for special case page-level processing.
>
> But you can just as easily turn this argument on its head, can't you?
> In general, except for HOT tuples, line pointers are marked dead by
> pruning and unused by vacuum. Here you want to turn it on its head and
> make pruning do what would normally be vacuum's responsibility.

OTOH, the pruning logic, including its WAL record, already supports marking items unused; all we need to do is to tell it to do so in a few more cases. If we didn't already need to have support for this, I'd have a much harder time arguing for doing this.

One important part of the larger project is to combine the WAL records for pruning, freezing and setting the all-visible/all-frozen bit into one WAL record. We can't set all-frozen before we have removed the dead items. So either we need to combine pruning and setting items unused for no-index tables, or we end up considerably less efficient in the no-indexes case.

An aside:

As I think we chatted about before, I eventually would like the option to remove index entries for a tuple during on-access pruning, for OLTP workloads. I.e. before removing the tuple, construct the corresponding index tuple, use it to look up index entries pointing to the tuple. If all the index entries were found (they might not be, if they already were marked dead during a lookup, or if an expression wasn't actually immutable), we can prune without the full index scan. Obviously this would only be suitable for some workloads, but it could be quite beneficial when you have huge indexes. The reason I mention this is that then we'd have another source of marking items unused during pruning.

Greetings,

Andres Freund
On Fri, Jan 5, 2024 at 3:05 PM Andres Freund <andres@anarazel.de> wrote: > OTOH, the pruning logic, including its WAL record, already supports marking > items unused, all we need to do is to tell it to do so in a few more cases. If > we didn't already need to have support for this, I'd a much harder time > arguing for doing this. > > One important part of the larger project is to combine the WAL records for > pruning, freezing and setting the all-visible/all-frozen bit into one WAL > record. We can't set all-frozen before we have removed the dead items. So > either we need to combine pruning and setting items unused for no-index tables > or we end up considerably less efficient in the no-indexes case. Those are fair arguments. > An aside: > > As I think we chatted about before, I eventually would like the option to > remove index entries for a tuple during on-access pruning, for OLTP > workloads. I.e. before removing the tuple, construct the corresponding index > tuple, use it to look up index entries pointing to the tuple. If all the index > entries were found (they might not be, if they already were marked dead during > a lookup, or if an expression wasn't actually immutable), we can prune without > the full index scan. Obviously this would only be suitable for some > workloads, but it could be quite beneficial when you have huge indexes. The > reason I mention this is that then we'd have another source of marking items > unused during pruning. I will be astonished if you can make this work well enough to avoid huge regressions in plausible cases. There are plenty of cases where we do a very thorough job opportunistically removing index tuples. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Emit fewer vacuum records by reaping removable tuples during pruning -- From: Melanie Plageman
On Fri, Jan 5, 2024 at 1:47 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Jan 5, 2024 at 12:59 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > MP> I am planning to add a VM update into the freeze record, at which point > MP> I will move the VM update code into lazy_scan_prune(). This will then > MP> allow us to consolidate the freespace map update code for the prune and > MP> noprune cases and make lazy_scan_heap() short and sweet. > > Can we see what that looks like on top of this change? Yes, attached is a patch set which does this. My previous patchset already reduced the number of places we unlock the buffer and update the freespace map in lazy_scan_heap(). This patchset combines the lazy_scan_prune() and lazy_scan_noprune() FSM update cases. I also have a version which moves the freespace map updates into lazy_scan_prune() and lazy_scan_noprune() -- eliminating all of these from lazy_scan_heap(). This is arguably more clear. But Andres mentioned he wanted fewer places unlocking the buffer and updating the FSM. - Melanie
Attachment
- v3-0004-Inline-LVPagePruneState-members-in-lazy_scan_prun.patch
- v3-0001-Indicate-rel-truncation-unsafe-in-lazy_scan-no-pr.patch
- v3-0003-Update-VM-in-lazy_scan_prune.patch
- v3-0002-Set-would-be-dead-items-LP_UNUSED-while-pruning.patch
- v3-0005-Remove-unneeded-struct-LVPagePruneState.patch
- v3-0006-Conslidate-FSM-update-code-in-lazy_scan_heap.patch
Hi, On 2024-01-05 15:23:12 -0500, Robert Haas wrote: > On Fri, Jan 5, 2024 at 3:05 PM Andres Freund <andres@anarazel.de> wrote: > > An aside: > > > > As I think we chatted about before, I eventually would like the option to > > remove index entries for a tuple during on-access pruning, for OLTP > > workloads. I.e. before removing the tuple, construct the corresponding index > > tuple, use it to look up index entries pointing to the tuple. If all the index > > entries were found (they might not be, if they already were marked dead during > > a lookup, or if an expression wasn't actually immutable), we can prune without > > the full index scan. Obviously this would only be suitable for some > > workloads, but it could be quite beneficial when you have huge indexes. The > > reason I mention this is that then we'd have another source of marking items > > unused during pruning. > > I will be astonished if you can make this work well enough to avoid > huge regressions in plausible cases. There are plenty of cases where > we do a very thorough job opportunistically removing index tuples. These days the AM is often involved with that, via table_index_delete_tuples()/heap_index_delete_tuples(). That IIRC has to happen before physically removing the already-marked-killed index entries. We can't rely on being able to actually prune the heap page at that point, there might be other backends pinning it, but often we will be able to. If we were to prune below heap_index_delete_tuples(), we wouldn't need to recheck that index again during "individual tuple pruning", if the to-be-marked-unused heap tuple is one of the tuples passed to heap_index_delete_tuples(). Which presumably will be very commonly the case. At least for nbtree, we are much more aggressive about marking index entries as killed, than about actually removing the index entries. "individual tuple pruning" would have to look for killed-but-still-present index entries, not just for "live" entries. Greetings, Andres Freund
Re: Emit fewer vacuum records by reaping removable tuples during pruning -- From: Melanie Plageman
On Fri, Jan 5, 2024 at 2:47 PM Andres Freund <andres@anarazel.de> wrote: > On 2024-01-04 17:37:27 -0500, Melanie Plageman wrote: > > On Thu, Jan 4, 2024 at 3:03 PM Andres Freund <andres@anarazel.de> wrote: > > > > > > On 2023-11-17 18:12:08 -0500, Melanie Plageman wrote: > > > > @@ -972,20 +970,21 @@ lazy_scan_heap(LVRelState *vacrel) > > > > continue; > > > > } > > > > > > > > - /* Collect LP_DEAD items in dead_items array, count tuples */ > > > > - if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup, > > > > + /* > > > > + * Collect LP_DEAD items in dead_items array, count tuples, > > > > + * determine if rel truncation is safe > > > > + */ > > > > + if (lazy_scan_noprune(vacrel, buf, blkno, page, > > > > &recordfreespace)) > > > > { > > > > Size freespace = 0; > > > > > > > > /* > > > > * Processed page successfully (without cleanup lock) -- just > > > > - * need to perform rel truncation and FSM steps, much like the > > > > - * lazy_scan_prune case. Don't bother trying to match its > > > > - * visibility map setting steps, though. > > > > + * need to update the FSM, much like the lazy_scan_prune case. > > > > + * Don't bother trying to match its visibility map setting > > > > + * steps, though. > > > > */ > > > > - if (hastup) > > > > - vacrel->nonempty_pages = blkno + 1; > > > > if (recordfreespace) > > > > freespace = PageGetHeapFreeSpace(page); > > > > UnlockReleaseBuffer(buf); > > > > > > The comment continues to say that we "determine if rel truncation is safe" - > > > but I don't see that? Oh, I see, it's done inside lazy_scan_noprune(). This > > > doesn't seem like a clear improvement to me. Particularly because it's only > > > set if lazy_scan_noprune() actually does something. > > > > I don't get what the last sentence means ("Particularly because..."). > > Took me a second to understand myself again too, oops. What I think I meant is > that it seems error-prone that it's only set in some paths inside > lazy_scan_noprune(). Previously it was at least a bit clearer in > lazy_scan_heap() that it would be set for the different possible paths. I see what you are saying. But if lazy_scan_noprune() doesn't do anything, then it calls lazy_scan_prune(), which does set hastup and update vacrel->nonempty_pages if needed. Using hastup in lazy_scan_[no]prune() also means that they are directly updating LVRelState after determining how to update it. lazy_scan_heap() isn't doing responsible anymore. I don't see a reason to be passing information back to lazy_scan_heap() to update LVRelState when lazy_scan_[no]prune() has access to the LVRelState. Importantly, once I combine the prune and freeze records, hastup is set in heap_page_prune() instead of lazy_scan_prune() (that whole loop in lazy_scan_prune() is eliminated()). And I don't like having to pass hastup back through lazy_scan_prune() and then to lazy_scan_heap() so that lazy_scan_heap() can use it (and use it to update a data structure available in lazy_scan_prune()). - Melanie
Re: Emit fewer vacuum records by reaping removable tuples during pruning -- From: Melanie Plageman
Patch 0001 in the attached set addresses the following review feedback: - pronto_reap renamed to no_indexes - reduce the number of callers of heap_prune_record_unused() by calling it from heap_prune_record_dead() when appropriate - add unlikely hint to no_indexes test I've also dropped the patch which moves the test of hastup into lazy_scan_[no]prune(). In the future, I plan to remove the loop from lazy_scan_prune() which sets hastup and set it instead in heap_page_prune(). Changes to hastup's usage could be done with that change instead of in this set. The one review point I did not address in the code is that which I've responded to inline below. Though not required for the immediate reaping feature, I've included patches 0002-0004 which are purely refactoring. These patches simplify lazy_scan_heap() by moving the visibility map code into lazy_scan_prune() and consolidating the updates to the FSM and vacrel->nonempty_pages. I've proposed them in this thread because there is an interdependence between eliminating the lazy_vacuum_heap_page() call, moving the VM code, and combining the three FSM updates (lazy_scan_prune() [index and no index] and lazy_scan_noprune()). This illustrates the code clarity benefits of the change to mark line pointers LP_UNUSED during pruning if the table has no indexes. I can propose them in another thread if 0001 is merged. On Thu, Jan 4, 2024 at 3:03 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2023-11-17 18:12:08 -0500, Melanie Plageman wrote: > > diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c > > index 14de8158d49..b578c32eeb6 100644 > > --- a/src/backend/access/heap/heapam.c > > +++ b/src/backend/access/heap/heapam.c > > @@ -8803,8 +8803,13 @@ heap_xlog_prune(XLogReaderState *record) > > nunused = (end - nowunused); > > Assert(nunused >= 0); > > > > - /* Update all line pointers per the record, and repair fragmentation */ > > - heap_page_prune_execute(buffer, > > + /* > > + * Update all line pointers per the record, and repair fragmentation. > > + * We always pass pronto_reap as true, because we don't know whether > > + * or not this option was used when pruning. This reduces the > > + * validation done on replay in an assert build. > > + */ > > Hm, that seems not great. Both because we loose validation and because it > seems to invite problems down the line, due to pronto_reap falsely being set > to true in heap_page_prune_execute(). I see what you are saying. With the name change to no_indexes, it would be especially misleading to future programmers who might decide to use that parameter for something else. Are you thinking it would be okay to check the catalog to see if the table has indexes in heap_xlog_prune() or are you suggesting that I add some kind of flag to the prune WAL record? - Melanie
Attachment
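To illustrate what patch 0001 above is doing at a conceptual level, here is a standalone toy model (invented enum and function names, not the pruneheap.c code): when the relation has no indexes, a dead tuple's line pointer can skip the LP_DEAD state and go straight to LP_UNUSED during pruning, so no second vacuum pass is needed for it.

#include <stdio.h>
#include <stdbool.h>

typedef enum { ITEM_UNUSED, ITEM_NORMAL, ITEM_REDIRECT, ITEM_DEAD } ItemStateSketch;

/*
 * Toy version of the choice made while pruning a dead tuple.  With indexes,
 * the item must become LP_DEAD and wait for index vacuuming before it can be
 * reused; with no indexes, nothing can reference it, so it can be marked
 * unused immediately ("immediate reaping").
 */
static ItemStateSketch
prune_dead_item(bool rel_has_indexes)
{
    return rel_has_indexes ? ITEM_DEAD : ITEM_UNUSED;
}

int
main(void)
{
    printf("no indexes  -> %s\n", prune_dead_item(false) == ITEM_UNUSED ? "LP_UNUSED" : "LP_DEAD");
    printf("has indexes -> %s\n", prune_dead_item(true) == ITEM_UNUSED ? "LP_UNUSED" : "LP_DEAD");
    return 0;
}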
On Fri, Jan 5, 2024 at 12:23 PM Robert Haas <robertmhaas@gmail.com> wrote: > > As I think we chatted about before, I eventually would like the option to > > remove index entries for a tuple during on-access pruning, for OLTP > > workloads. I.e. before removing the tuple, construct the corresponding index > > tuple, use it to look up index entries pointing to the tuple. If all the index > > entries were found (they might not be, if they already were marked dead during > > a lookup, or if an expression wasn't actually immutable), we can prune without > > the full index scan. Obviously this would only be suitable for some > > workloads, but it could be quite beneficial when you have huge indexes. The > > reason I mention this is that then we'd have another source of marking items > > unused during pruning. > > I will be astonished if you can make this work well enough to avoid > huge regressions in plausible cases. There are plenty of cases where > we do a very thorough job opportunistically removing index tuples. Right. In particular, bottom-up index deletion works well because it adds a kind of natural backpressure to one important special case (the case of non-HOT updates that don't "logically change" any indexed column). It isn't really all that "opportunistic" in my understanding of the term -- the overall effect is to *systematically* control bloat in a way that is actually quite reliable. Like you, I have my doubts that it would be valuable to be more proactive about deleting dead index tuples that are just random dead tuples. There may be a great many dead index tuples diffusely spread across an index -- these can be quite harmless, and not worth proactively cleaning up (even at a fairly low cost). What we mostly need to worry about is *concentrated* build-up of dead index tuples in particular leaf pages. A natural question to ask is: what cases remain, where we could stand to add more backpressure? What other "special case" do we not yet address? I think that retail index tuple deletion could work well as part of a limited form of "transaction rollback" that cleans up after a just-aborted transaction, within the backend that executed the transaction itself. I suspect that this case now has outsized importance, precisely because it's the one remaining case where the system accumulates index bloat without any sort of natural backpressure. Making the transaction/backend that creates bloat directly responsible for proactively cleaning it up tends to have a stabilizing effect over time. The system is made to live within its means. We could even fully reverse heap page line pointer bloat under this "transaction rollback" scheme -- I bet that aborted xacts are a disproportionate source of line pointer bloat. Barring a hard crash, or a very large transaction, we could "undo" the physical changes to relations before permitting the backend to retry the transaction from scratch. This would just work as an optimization. -- Peter Geoghegan
On Fri, Jan 5, 2024 at 12:57 PM Andres Freund <andres@anarazel.de> wrote: > > I will be astonished if you can make this work well enough to avoid > > huge regressions in plausible cases. There are plenty of cases where > > we do a very thorough job opportunistically removing index tuples. > > These days the AM is often involved with that, via > table_index_delete_tuples()/heap_index_delete_tuples(). That IIRC has to > happen before physically removing the already-marked-killed index entries. We > can't rely on being able to actually prune the heap page at that point, there > might be other backends pinning it, but often we will be able to. If we were > to prune below heap_index_delete_tuples(), we wouldn't need to recheck that > index again during "individual tuple pruning", if the to-be-marked-unused heap > tuple is one of the tuples passed to heap_index_delete_tuples(). Which > presumably will be very commonly the case. I don't understand. Making heap_index_delete_tuples() prune heap pages in passing such that we can ultimately mark dead heap tuples LP_UNUSED necessitates high level coordination -- it has to happen at a level much higher than heap_index_delete_tuples(). In other words, making it all work safely requires the same high level context that makes it safe for VACUUM to set a stub LP_DEAD line pointer to LP_UNUSED (index tuples must never be allowed to point to TIDs/heap line pointers that can be concurrently recycled). Obviously my idea of "a limited form of transaction rollback" has the required high-level context available, which is the crucial factor that allows it to safely reverse all bloat -- even line pointer bloat (which is traditionally something that only VACUUM can do safely). I have a hard time imagining a scheme that can do that outside of VACUUM without directly targeting some special case, such as the case that I'm calling "transaction rollback". In other words, I have a hard time imagining how this would ever be practical as part of any truly opportunistic cleanup process. AFAICT the dependency between indexes and the heap is just too delicate for such a scheme to ever really be practical. > At least for nbtree, we are much more aggressive about marking index entries > as killed, than about actually removing the index entries. "individual tuple > pruning" would have to look for killed-but-still-present index entries, not > just for "live" entries. These days having index tuples directly marked LP_DEAD is surprisingly unimportant to heap_index_delete_tuples(). The batching optimization implemented by _bt_simpledel_pass() tends to be very effective in practice. We only need to have the right *general* idea about which heap pages to visit -- which heap pages will yield some number of deletable index tuples. -- Peter Geoghegan
On Fri, Jan 5, 2024 at 12:59 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > But you can just as easily turn this argument on its head, can't you? > > In general, except for HOT tuples, line pointers are marked dead by > > pruning and unused by vacuum. Here you want to turn it on its head and > > make pruning do what would normally be vacuum's responsibility. > > I actually think we are going to want to stop referring to these steps > as pruning and vacuuming. It is confusing because vacuuming refers to > the whole process of doing garbage collection on the table and also to > the specific step of setting dead line pointers unused. If we called > these steps say, pruning and reaping, that may be more clear. What about index VACUUM records? Should they be renamed to REAP records, too? > Vacuuming consists of three phases -- the first pass, index vacuuming, > and the second pass. I don't think we should dictate what happens in > each pass. That is, we shouldn't expect only pruning to happen in the > first pass and only reaping to happen in the second pass. Why not? It's not self-evident that it matters much either way. I don't see it being worth the complexity (which is not to say that I'm opposed to what you're trying to do here). Note that we only need a cleanup lock for the first heap pass right now (not the second heap pass). So if you're going to prune in the second heap pass, you're going to have to add a mechanism that makes it safe (by acquiring a cleanup lock once we decide that we actually want to prune, say). Or maybe you'd just give up on the fact that we don't need cleanup locks for the second heap pass these days instead (which seems like a step backwards). > For example, > I think Andres has previously proposed doing another round of pruning > after index vacuuming. The second pass/third phase is distinguished > primarily by being after index vacuuming. I think of the second pass/third phase as being very similar to index vacuuming. Both processes/phases don't require cleanup locks (actually nbtree VACUUM does require cleanup locks, but the reasons why are pretty esoteric, and not shared by every index AM). And, both processes/phases don't need to generate their own recovery conflicts. Neither type of WAL record requires a snapshotConflictHorizon field of its own, since we can safely assume that some PRUNE record must have taken care of all that earlier on. -- Peter Geoghegan
On Fri, Jan 5, 2024 at 3:57 PM Andres Freund <andres@anarazel.de> wrote: > > I will be astonished if you can make this work well enough to avoid > > huge regressions in plausible cases. There are plenty of cases where > > we do a very thorough job opportunistically removing index tuples. > > These days the AM is often involved with that, via > table_index_delete_tuples()/heap_index_delete_tuples(). That IIRC has to > happen before physically removing the already-marked-killed index entries. We > can't rely on being able to actually prune the heap page at that point, there > might be other backends pinning it, but often we will be able to. If we were > to prune below heap_index_delete_tuples(), we wouldn't need to recheck that > index again during "individual tuple pruning", if the to-be-marked-unused heap > tuple is one of the tuples passed to heap_index_delete_tuples(). Which > presumably will be very commonly the case. > > At least for nbtree, we are much more aggressive about marking index entries > as killed, than about actually removing the index entries. "individual tuple > pruning" would have to look for killed-but-still-present index entries, not > just for "live" entries. I don't want to derail this thread, but I don't really see what you have in mind here. The first paragraph sounds like you're imagining that while pruning the index entries we might jump over to the heap and clean things up there, too, but that seems like it wouldn't work if the table has more than one index. I thought you were talking about starting with a heap tuple and bouncing around to every index to see if we can find index pointers to kill in every one of them. That *could* work out, but you only need one index to have been opportunistically cleaned up in order for it to fail to work out. There might well be some workloads where that's often the case, but the regressions in the workloads where it isn't the case seem like they would be rather substantial, because doing an extra lookup in every index for each heap tuple visited sounds pricey. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Jan 5, 2024 at 3:34 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > Yes, attached is a patch set which does this. My previous patchset > already reduced the number of places we unlock the buffer and update > the freespace map in lazy_scan_heap(). This patchset combines the > lazy_scan_prune() and lazy_scan_noprune() FSM update cases. I also > have a version which moves the freespace map updates into > lazy_scan_prune() and lazy_scan_noprune() -- eliminating all of these > from lazy_scan_heap(). This is arguably more clear. But Andres > mentioned he wanted fewer places unlocking the buffer and updating the > FSM. Hmm, interesting. I haven't had time to study this fully today, but I think 0001 looks fine and could just be committed. Hooray for killing useless variables with dumb names. This part of 0002 makes me very, very uncomfortable: + /* + * Update all line pointers per the record, and repair fragmentation. + * We always pass no_indexes as true, because we don't know whether or + * not this option was used when pruning. This reduces the validation + * done on replay in an assert build. + */ + heap_page_prune_execute(buffer, true, redirected, nredirected, nowdead, ndead, nowunused, nunused); The problem that I have with this is that we're essentially saying that it's ok to lie to heap_page_prune_execute because we know how it's going to use the information, and therefore we know that the lie is harmless. But that's not how things are supposed to work. We should either find a way to tell the truth, or change the name of the parameter so that it's not a lie, or change the function so that it doesn't need this parameter in the first place, or something. I can occasionally stomach this sort of lying as a last resort when there's no realistic way of adapting the code being called, but that's surely not the case here -- this is a newborn parameter, and it shouldn't be a lie on day 1. Just imagine if some future developer thought that the no_indexes parameter meant that the relation actually didn't have indexes (the nerve of them!). I took a VERY fast look through the rest of the patch set and I think that the overall direction looks like it's probably reasonable, but that's a very preliminary conclusion which I reserve the right to revise after studying further. @Andres: Are you planning to review/commit any of this? Are you hoping that I'm going to do that? Somebody else want to jump in here? -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Jan 08, 2024 at 03:50:47PM -0500, Robert Haas wrote: > Hmm, interesting. I haven't had time to study this fully today, but I > think 0001 looks fine and could just be committed. Hooray for killing > useless variables with dumb names. I looked at 0001 a couple of weeks ago and thought that it was fine because there's only one caller of lazy_scan_prune() and one caller of lazy_scan_noprune() so all the code paths were covered. + /* rel truncation is unsafe */ + if (hastup) + vacrel->nonempty_pages = blkno + 1; Except for this comment that I found misleading because this is not about the fact that truncation is unsafe, it's about correctly tracking the last block where we have tuples to ensure a correct truncation. Perhaps this could just reuse "Remember the location of the last page with nonremovable tuples"? If people object to that, feel free. -- Michael
Attachment
On Tue, Jan 9, 2024 at 12:56 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Mon, Jan 08, 2024 at 03:50:47PM -0500, Robert Haas wrote: > > Hmm, interesting. I haven't had time to study this fully today, but I > > think 0001 looks fine and could just be committed. Hooray for killing > > useless variables with dumb names. > > I've been looking at 0001 a couple of weeks ago and thought that it > was fine because there's only one caller of lazy_scan_prune() and one > caller of lazy_scan_noprune() so all the code paths were covered. > > + /* rel truncation is unsafe */ > + if (hastup) > + vacrel->nonempty_pages = blkno + 1; Andres had actually said that he didn't like pushing the update of nonempty_pages into lazy_scan_[no]prune(). So, my v4 patch set eliminates this. I can see an argument for doing both the update of vacrel->nonempty_pages and the FSM updates in lazy_scan_[no]prune() because it eliminates some of the back-and-forth between the block-specific functions and lazy_scan_heap(). lazy_scan_new_or_empty() has special logic for deciding how to update the FSM -- so that remains in lazy_scan_new_or_empty() either way. On the other hand, the comment above lazy_scan_new_or_empty() says we can get rid of this special handling if we make relation extension crash safe. Then it would make more sense to have a consolidated FSM update in lazy_scan_heap(). However it does still mean that we repeat the "UnlockReleaseBuffer()" and FSM update code in even more places. Ultimately I can see arguments for and against. Is it better to avoid having the same few lines of code in two places or avoid unneeded communication between page-level functions and lazy_scan_heap()? > Except for this comment that I found misleading because this is not > about the fact that truncation is unsafe, it's about correctly > tracking the the last block where we have tuples to ensure a correct > truncation. Perhaps this could just reuse "Remember the location of > the last page with nonremovable tuples"? If people object to that, > feel free. I agree the comment could be better. But, simply saying that it tracks the last page with non-removable tuples makes it less clear how important this is. It makes it sound like it could be simply for stats purposes. I'll update the comment to something that includes that sentiment but is more exact than "rel truncation is unsafe". - Melanie
On Mon, Jan 8, 2024 at 3:51 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Jan 5, 2024 at 3:34 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > This part of 0002 makes me very, very uncomfortable: > > + /* > + * Update all line pointers per the record, and repair > fragmentation. > + * We always pass no_indexes as true, because we don't > know whether or > + * not this option was used when pruning. This reduces > the validation > + * done on replay in an assert build. > + */ > + heap_page_prune_execute(buffer, true, > > redirected, nredirected, > nowdead, ndead, > > nowunused, nunused); > > The problem that I have with this is that we're essentially saying > that it's ok to lie to heap_page_prune_execute because we know how > it's going to use the information, and therefore we know that the lie > is harmless. But that's not how things are supposed to work. We should > either find a way to tell the truth, or change the name of the > parameter so that it's not a lie, or change the function so that it > doesn't need this parameter in the first place, or something. I can > occasionally stomach this sort of lying as a last resort when there's > no realistic way of adapting the code being called, but that's surely > not the case here -- this is a newborn parameter, and it shouldn't be > a lie on day 1. Just imagine if some future developer thought that the > no_indexes parameter meant that the relation actually didn't have > indexes (the nerve of them!). I agree that this is an issue. The easiest solution would be to change the name of the parameter to heap_page_prune_execute()'s from "no_indexes" to something like "validate_unused", since it is only used in assert builds for validation. However, though I wish a name change was the right way to solve this problem, my gut feeling is that it is not. It seems like we should rely only on the WAL record itself in recovery. Right now the parameter is used exclusively for validation, so it isn't so bad. But what if someone uses this parameter in the future in heap_xlog_prune() to decide how to modify the page? It seems like the right solution would be to add a flag into the prune record indicating what to pass to heap_page_prune_execute(). In the future, I'd like to add flags for updating the VM to each of the prune and vacuum records (eliminating the separate VM update record). Thus, a new flags member of the prune record could have future use. However, this would add a uint8 to the record. I can go and look for some padding if you think this is the right direction? - Melanie
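To make the "flag in the prune record" idea concrete, here is a purely hypothetical layout sketch -- the field names and widths are invented for illustration and this is not the actual xl_heap_prune struct:

#include <stdint.h>

/* Hypothetical flag bits for a prune record. */
#define SKETCH_PRUNE_MARKED_UNUSED_NOW   (1 << 0)  /* dead items were set unused during pruning */
#define SKETCH_PRUNE_SET_VM_ALL_VISIBLE  (1 << 1)  /* room for a future VM-update bit */

typedef struct PruneRecordSketch
{
    uint32_t snapshot_conflict_horizon;   /* stand-in for the conflict-horizon xid */
    uint16_t nredirected;
    uint16_t ndead;
    uint8_t  flags;                       /* the extra byte under discussion */
} PruneRecordSketch;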
On Tue, Jan 9, 2024 at 10:56 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > Andres had actually said that he didn't like pushing the update of > nonempty_pages into lazy_scan_[no]prune(). So, my v4 patch set > eliminates this. Mmph - but I was so looking forward to killing hastup! > On the other hand, the comment above lazy_scan_new_or_empty() says we > can get rid of this special handling if we make relation extension > crash safe. Then it would make more sense to have a consolidated FSM > update in lazy_scan_heap(). However it does still mean that we repeat > the "UnlockReleaseBuffer()" and FSM update code in even more places. I wouldn't hold my breath waiting for relation extension to become crash-safe. Even if you were a whale, you'd be about four orders of magnitude short of holding it long enough. > Ultimately I can see arguments for and against. Is it better to avoid > having the same few lines of code in two places or avoid unneeded > communication between page-level functions and lazy_scan_heap()? To me, the conceptual complexity of an extra structure member is a bigger cost than duplicating TWO lines of code. If we were talking about 20 lines of code, I'd say rename it to something less dumb. > > Except for this comment that I found misleading because this is not > > about the fact that truncation is unsafe, it's about correctly > > tracking the the last block where we have tuples to ensure a correct > > truncation. Perhaps this could just reuse "Remember the location of > > the last page with nonremovable tuples"? If people object to that, > > feel free. > > I agree the comment could be better. But, simply saying that it tracks > the last page with non-removable tuples makes it less clear how > important this is. It makes it sound like it could be simply for stats > purposes. I'll update the comment to something that includes that > sentiment but is more exact than "rel truncation is unsafe". I agree that "rel truncation is unsafe" is less clear than desirable, but I'm not sure that I agree that tracking the last page with non-removable tuples makes it sound unimportant. However, the comments also need to make everybody happy, not just me. Maybe something like "can't truncate away this page" or similar. A long-form comment that really spells it out is fine too, but I don't know if we really need it. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Jan 9, 2024 at 11:35 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > The easiest solution would be to change the name of the parameter to > heap_page_prune_execute()'s from "no_indexes" to something like > "validate_unused", since it is only used in assert builds for > validation. Right. > However, though I wish a name change was the right way to solve this > problem, my gut feeling is that it is not. It seems like we should > rely only on the WAL record itself in recovery. Right now the > parameter is used exclusively for validation, so it isn't so bad. But > what if someone uses this parameter in the future in heap_xlog_prune() > to decide how to modify the page? Exactly. > It seems like the right solution would be to add a flag into the prune > record indicating what to pass to heap_page_prune_execute(). In the > future, I'd like to add flags for updating the VM to each of the prune > and vacuum records (eliminating the separate VM update record). Thus, > a new flags member of the prune record could have future use. However, > this would add a uint8 to the record. I can go and look for some > padding if you think this is the right direction? I thought about this approach and it might be OK but I don't love it, because it means making the WAL record bigger on production systems for the sake of an assertion that only fires for developers. Sure, it's possible that there might be another use in the future, but there might also not be another use in the future. How about changing if (no_indexes) to if (ndead == 0) and adding a comment like this: /* If there are any tuples being marked LP_DEAD, then the relation must have indexes, so every item being marked unused must be a heap-only tuple. But if there are no tuples being marked LP_DEAD, then it's possible that the relation has no indexes, in which case all we know is that the line pointer shouldn't already be LP_UNUSED. */ BTW: + * LP_REDIRECT, or LP_DEAD items to LP_UNUSED during pruning. We + * can't check much here except that, if the item is LP_NORMAL, it + * should have storage before it is set LP_UNUSED. Is it really helpful to check this here, or just confusing/grasping at straws? I mean, the requirement that LP_NORMAL items have storage is a general one, IIUC, not something that's specific to this situation. It feels like the equivalent of checking that your children don't set fire to the couch on Tuesdays. -- Robert Haas EDB: http://www.enterprisedb.com
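A standalone sketch of the inference behind that suggested assertion (simplified stand-in types; this is not heap_page_prune_execute() itself): when the same record also marks items LP_DEAD, the relation must have indexes, so items being set unused must be heap-only tuples; when it doesn't, the weaker check is all that can be justified.

#include <assert.h>
#include <stdbool.h>

typedef struct
{
    bool is_used;        /* i.e. not already LP_UNUSED */
    bool is_heap_only;   /* HOT tuple, never pointed to directly by index entries */
} ItemSketch;

static void
check_item_being_marked_unused(const ItemSketch *item, int ndead_in_record)
{
    if (ndead_in_record > 0)
    {
        /* Some items are being marked LP_DEAD, so there must be indexes;
         * anything set unused here therefore has to be a heap-only tuple. */
        assert(item->is_used && item->is_heap_only);
    }
    else
    {
        /* The relation may have no indexes, so a dead root item can go
         * straight to unused; all we know is that it isn't unused already. */
        assert(item->is_used);
    }
}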
On Tue, Jan 9, 2024 at 1:31 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jan 9, 2024 at 10:56 AM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > Andres had actually said that he didn't like pushing the update of > > nonempty_pages into lazy_scan_[no]prune(). So, my v4 patch set > > eliminates this. > > Mmph - but I was so looking forward to killing hastup! > > > Ultimately I can see arguments for and against. Is it better to avoid > > having the same few lines of code in two places or avoid unneeded > > communication between page-level functions and lazy_scan_heap()? > > To me, the conceptual complexity of an extra structure member is a > bigger cost than duplicating TWO lines of code. If we were talking > about 20 lines of code, I'd say rename it to something less dumb. Yes, I agree. I thought about it more, and I prefer updating the FSM and setting nonempty_pages into lazy_scan_[no]prune(). Originally, I had ordered the patch set with that first (before the patch to do immediate reaping), but there is no reason for it to be so. Using hastup can be done in a subsequent commit on top of the immediate reaping patch. I will post a new version of the immediate reaping patch which addresses your feedback. Then, separately, I will post a revised version of the lazy_scan_heap() refactoring patches. - Melanie
On Tue, Jan 9, 2024 at 2:23 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > Yes, I agree. I thought about it more, and I prefer updating the FSM > and setting nonempty_pages into lazy_scan_[no]prune(). Originally, I > had ordered the patch set with that first (before the patch to do > immediate reaping), but there is no reason for it to be so. Using > hastup can be done in a subsequent commit on top of the immediate > reaping patch. I will post a new version of the immediate reaping > patch which addresses your feedback. Then, separately, I will post a > revised version of the lazy_scan_heap() refactoring patches. I kind of liked it first, because I thought we could just do it and get it out of the way, but if Andres doesn't agree with the idea, it probably does make sense to push it later, as you say here. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On January 9, 2024 11:33:29 AM PST, Robert Haas <robertmhaas@gmail.com> wrote: >On Tue, Jan 9, 2024 at 2:23 PM Melanie Plageman ><melanieplageman@gmail.com> wrote: >> Yes, I agree. I thought about it more, and I prefer updating the FSM >> and setting nonempty_pages into lazy_scan_[no]prune(). Originally, I >> had ordered the patch set with that first (before the patch to do >> immediate reaping), but there is no reason for it to be so. Using >> hastup can be done in a subsequent commit on top of the immediate >> reaping patch. I will post a new version of the immediate reaping >> patch which addresses your feedback. Then, separately, I will post a >> revised version of the lazy_scan_heap() refactoring patches. > >I kind of liked it first, because I thought we could just do it and >get it out of the way, but if Andres doesn't agree with the idea, it >probably does make sense to push it later, as you say here. I don't have that strong feelings about it. If both of you think it looks good, go ahead... Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
I had already written the patch for immediate reaping addressing the below feedback before I saw the emails that said everyone is happy with using hastup in lazy_scan_[no]prune() in a preliminary patch. Let me know if you have a strong preference for reordering. Otherwise, I will write the three subsequent patches on top of this one. On Tue, Jan 9, 2024 at 2:00 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jan 9, 2024 at 11:35 AM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > The easiest solution would be to change the name of the parameter to > > heap_page_prune_execute()'s from "no_indexes" to something like > > "validate_unused", since it is only used in assert builds for > > validation. > > Right. > > > However, though I wish a name change was the right way to solve this > > problem, my gut feeling is that it is not. It seems like we should > > rely only on the WAL record itself in recovery. Right now the > > parameter is used exclusively for validation, so it isn't so bad. But > > what if someone uses this parameter in the future in heap_xlog_prune() > > to decide how to modify the page? > > Exactly. > > > It seems like the right solution would be to add a flag into the prune > > record indicating what to pass to heap_page_prune_execute(). In the > > future, I'd like to add flags for updating the VM to each of the prune > > and vacuum records (eliminating the separate VM update record). Thus, > > a new flags member of the prune record could have future use. However, > > this would add a uint8 to the record. I can go and look for some > > padding if you think this is the right direction? > > I thought about this approach and it might be OK but I don't love it, > because it means making the WAL record bigger on production systems > for the sake of assertion that only fires for developers. Sure, it's > possible that there might be another use in the future, but there > might also not be another use in the future. > > How about changing if (no_indexes) to if (ndead == 0) and adding a > comment like this: /* If there are any tuples being marked LP_DEAD, > then the relation must have indexes, so every item being marked unused > must be a heap-only tuple. But if there are no tuples being marked > LP_DEAD, then it's possible that the relation has no indexes, in which > case all we know is that the line pointer shouldn't already be > LP_UNUSED. */ Ah, I like this a lot. Attached patch does this. I've added a modified version of the comment you suggested. My only question is if we are losing something without this sentence (from the old comment): - * ... They don't need to be left in place as LP_DEAD items until VACUUM gets - * around to doing index vacuuming. I don't feel like it adds a lot, but it is absent from the new comment, so thought I would check. > BTW: > > + * LP_REDIRECT, or LP_DEAD items to LP_UNUSED > during pruning. We > + * can't check much here except that, if the > item is LP_NORMAL, it > + * should have storage before it is set LP_UNUSED. > > Is it really helpful to check this here, or just confusing/grasping at > straws? I mean, the requirement that LP_NORMAL items have storage is a > general one, IIUC, not something that's specific to this situation. It > feels like the equivalent of checking that your children don't set > fire to the couch on Tuesdays. Hmm. Yes. I suppose I was trying to find something to validate. Is it worth checking that the line pointer is not already LP_UNUSED? Or is that a bit ridiculous? - Melanie
Attachment
On Tue, Jan 9, 2024 at 3:13 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > I had already written the patch for immediate reaping addressing the > below feedback before I saw the emails that said everyone is happy > with using hastup in lazy_scan_[no]prune() in a preliminary patch. Let > me know if you have a strong preference for reordering. Otherwise, I > will write the three subsequent patches on top of this one. I don't know if it rises to the level of a strong preference. It's just a preference. > Ah, I like this a lot. Attached patch does this. I've added a modified > version of the comment you suggested. My only question is if we are > losing something without this sentence (from the old comment): > > - * ... They don't need to be left in place as LP_DEAD items > until VACUUM gets > - * around to doing index vacuuming. > > I don't feel like it adds a lot, but it is absent from the new > comment, so thought I would check. I agree that we can leave that out. It wouldn't be bad to include it if someone had a nice way of doing that, but it doesn't seem critical, and if forcing it in there makes the comment less clear overall, it's a net loss IMHO. > Hmm. Yes. I suppose I was trying to find something to validate. Is it > worth checking that the line pointer is not already LP_UNUSED? Or is > that a bit ridiculous? I think that's worthwhile (hence my proposed wording). -- Robert Haas EDB: http://www.enterprisedb.com
On 1/8/24 2:10 PM, Robert Haas wrote: > On Fri, Jan 5, 2024 at 3:57 PM Andres Freund <andres@anarazel.de> wrote: >>> I will be astonished if you can make this work well enough to avoid >>> huge regressions in plausible cases. There are plenty of cases where >>> we do a very thorough job opportunistically removing index tuples. >> >> These days the AM is often involved with that, via >> table_index_delete_tuples()/heap_index_delete_tuples(). That IIRC has to >> happen before physically removing the already-marked-killed index entries. We >> can't rely on being able to actually prune the heap page at that point, there >> might be other backends pinning it, but often we will be able to. If we were >> to prune below heap_index_delete_tuples(), we wouldn't need to recheck that >> index again during "individual tuple pruning", if the to-be-marked-unused heap >> tuple is one of the tuples passed to heap_index_delete_tuples(). Which >> presumably will be very commonly the case. >> >> At least for nbtree, we are much more aggressive about marking index entries >> as killed, than about actually removing the index entries. "individual tuple >> pruning" would have to look for killed-but-still-present index entries, not >> just for "live" entries. > > I don't want to derail this thread, but I don't really see what you > have in mind here. The first paragraph sounds like you're imagining > that while pruning the index entries we might jump over to the heap > and clean things up there, too, but that seems like it wouldn't work > if the table has more than one index. I thought you were talking about > starting with a heap tuple and bouncing around to every index to see > if we can find index pointers to kill in every one of them. That > *could* work out, but you only need one index to have been > opportunistically cleaned up in order for it to fail to work out. > There might well be some workloads where that's often the case, but > the regressions in the workloads where it isn't the case seem like > they would be rather substantial, because doing an extra lookup in > every index for each heap tuple visited sounds pricey. The idea of probing indexes for tuples that are now dead has come up in the past, and the concern has always been whether it's actually safe to do so. An obvious example is an index on a function where the function has since changed, so you can't reliably determine if a particular tuple is present in the index. That's bad enough during an index scan, but potentially worse while doing heap cleanup. Given that operators are functions, this risk exists to some degree in even simple indexes. Depending on the gains this might still be worth doing, at least for some cases. It's hard to conceive of this breaking for indexes on integers, for example. But we'd still need to be cautious. -- Jim Nasby, Data Architect, Austin TX
On Tue, Jan 9, 2024 at 3:40 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jan 9, 2024 at 3:13 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > I had already written the patch for immediate reaping addressing the > > below feedback before I saw the emails that said everyone is happy > > with using hastup in lazy_scan_[no]prune() in a preliminary patch. Let > > me know if you have a strong preference for reordering. Otherwise, I > > will write the three subsequent patches on top of this one. > > I don't know if it rises to the level of a strong preference. It's > just a preference. Attached v6 has the immediate reaping patch first followed by the code to use hastup in lazy_scan_[no]prune(). 0003 and 0004 move the VM update code into lazy_scan_prune() and eliminate LVPagePruneState entirely. 0005 moves the FSM update into lazy_scan_[no]prune(), substantially simplifying lazy_scan_heap(). > I agree that we can leave that out. It wouldn't be bad to include it > if someone had a nice way of doing that, but it doesn't seem critical, > and if forcing it in there makes the comment less clear overall, it's > a net loss IMHO. > > > Hmm. Yes. I suppose I was trying to find something to validate. Is it > > worth checking that the line pointer is not already LP_UNUSED? Or is > > that a bit ridiculous? > > I think that's worthwhile (hence my proposed wording). Done in attached v6. - Melanie
Attachment
- v6-0003-Move-VM-update-code-to-lazy_scan_prune.patch
- v6-0001-Set-would-be-dead-items-LP_UNUSED-while-pruning.patch
- v6-0002-Indicate-rel-truncation-unsafe-in-lazy_scan-no-pr.patch
- v6-0005-Move-freespace-map-update-into-lazy_scan_-no-prun.patch
- v6-0004-Inline-LVPagePruneState-members-in-lazy_scan_prun.patch
On Tue, Jan 9, 2024 at 5:42 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > Done in attached v6. Don't kill me, but: + /* + * For now, pass no_indexes == false regardless of whether or not + * the relation has indexes. In the future we may enable immediate + * reaping for on access pruning. + */ + heap_page_prune(relation, buffer, vistest, false, + &presult, NULL); My previous comments about lying seem equally applicable here. - if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid)) + if (!ItemIdIsUsed(itemid)) continue; There is a bit of overhead here for the !no_indexes case. I assume it doesn't matter. static void heap_prune_record_dead(PruneState *prstate, OffsetNumber offnum) { + /* + * If the relation has no indexes, we can remove dead tuples during + * pruning instead of marking their line pointers dead. Set this tuple's + * line pointer LP_UNUSED. We hint that tables with indexes are more + * likely. + */ + if (unlikely(prstate->no_indexes)) + { + heap_prune_record_unused(prstate, offnum); + return; + } I think this should be pushed to the callers. Else you make the existing function name into another lie. + bool recordfreespace; Not sure if it's necessary to move this to an outer scope like this? The whole handling of this looks kind of confusing. If we're going to do it this way, then I think lazy_scan_prune() definitely needs to document how it handles this function (i.e. might set true to false, won't set false to true, also meaning). But are we sure we want to let a local variable with a weird name "leak out" like this? + Assert(vacrel->do_index_vacuuming); Is this related? -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Jan 10, 2024 at 3:54 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jan 9, 2024 at 5:42 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > Done in attached v6. > > Don't kill me, but: > > + /* > + * For now, pass no_indexes == false > regardless of whether or not > + * the relation has indexes. In the future we > may enable immediate > + * reaping for on access pruning. > + */ > + heap_page_prune(relation, buffer, vistest, false, > + &presult, NULL); > > My previous comments about lying seem equally applicable here. Yes, the options I can think of are: 1) rename the parameter to "immed_reap" or similar and make very clear in heap_page_prune_opt() that you are not to pass true. 2) make immediate reaping work for on-access pruning. I would need a low cost way to find out if there are any indexes on the table. Do you think this is possible? Should we just rename the parameter for now and think about that later? > > - if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid)) > + if (!ItemIdIsUsed(itemid)) > continue; > > There is a bit of overhead here for the !no_indexes case. I assume it > doesn't matter. Right. Should be okay. Alternatively, and I'm not saying this is a good idea, but we could throw this into the loop in heap_page_prune() which calls heap_prune_chain(): + if (ItemIdIsDead(itemid) && prstate.no_indexes) + { + heap_prune_record_unused(&prstate, offnum); + continue; + } I think that is correct? But, from a consistency perspective, we don't call the heap_prune_record* functions directly from heap_page_prune() in any other cases. And it seems like it would be nice to have all of those in one place? > static void > heap_prune_record_dead(PruneState *prstate, OffsetNumber offnum) > { > + /* > + * If the relation has no indexes, we can remove dead tuples during > + * pruning instead of marking their line pointers dead. Set this tuple's > + * line pointer LP_UNUSED. We hint that tables with indexes are more > + * likely. > + */ > + if (unlikely(prstate->no_indexes)) > + { > + heap_prune_record_unused(prstate, offnum); > + return; > + } > > I think this should be pushed to the callers. Else you make the > existing function name into another lie. Yes, so I preferred it in the body of heap_prune_chain() (the caller). Andres didn't like the extra level of indentation. I could wrap heap_record_dead() and heap_record_unused(), but I couldn't really think of a good wrapper name. heap_record_dead_or_unused() seems long and literal. But, it might be the best I can do. I can't think of a general word which encompasses both the transition to death and to disposal. > + bool recordfreespace; > > Not sure if it's necessary to move this to an outer scope like this? > The whole handling of this looks kind of confusing. If we're going to > do it this way, then I think lazy_scan_prune() definitely needs to > document how it handles this function (i.e. might set true to false, > won't set false to true, also meaning). But are we sure we want to let > a local variable with a weird name "leak out" like this? Which function do you mean when you say "document how lazy_scan_prune() handles this function". And no we definitely don't want a variable like this to be hanging out in lazy_scan_heap(), IMHO. The other patches in the stack move the FSM updates into lazy_scan_[no]prune() and eliminate this parameter. I moved up the scope because lazy_scan_noprune() already had recordfreespace as an output parameter and initialized it unconditionally inside. 
I initialize it unconditionally in lazy_scan_prune() as well. I mention in the commit message that this is temporary because we plan to eliminate recordfreespace as an output parameter by updating the FSM in lazy_scan_[no]prune(). I could have stuck recordfreespace into the LVPagePruneState with the other output parameters. But, leaving it as a separate output parameter made the diffs lovely for the rest of the patches in the stack. > + Assert(vacrel->do_index_vacuuming); > > Is this related? Yes, previously the assert was: Assert(vacrel->nindexes == 0 || vacrel->do_index_vacuuming); And we eliminated the caller of lazy_vacuum_heap_page() with vacrel->nindexes == 0. Now it should only be called after doing index vacuuming (thus index vacuuming should definitely be enabled). - Melanie
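For what it's worth, a minimal sketch of the kind of wrapper being weighed here (a standalone toy with invented names; the real record functions are static to pruneheap.c and take a PruneState):

#include <stdbool.h>

typedef unsigned short OffsetNumber;

typedef struct
{
    bool no_indexes;   /* relation has no indexes, so dead items can be reaped immediately */
} PruneStateSketch;

static void record_dead(PruneStateSketch *prstate, OffsetNumber offnum)   { (void) prstate; (void) offnum; }
static void record_unused(PruneStateSketch *prstate, OffsetNumber offnum) { (void) prstate; (void) offnum; }

/*
 * An item that would normally become LP_DEAD goes straight to LP_UNUSED when
 * the table has no indexes.  Keeping the branch in a wrapper means
 * record_dead() itself keeps doing exactly what its name says.
 */
static void
record_dead_or_unused(PruneStateSketch *prstate, OffsetNumber offnum)
{
    if (prstate->no_indexes)
        record_unused(prstate, offnum);
    else
        record_dead(prstate, offnum);
}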
On Wed, Jan 10, 2024 at 5:28 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > Yes, the options I can think of are: > > 1) rename the parameter to "immed_reap" or similar and make very clear > in heap_page_prune_opt() that you are not to pass true. > 2) make immediate reaping work for on-access pruning. I would need a > low cost way to find out if there are any indexes on the table. Do you > think this is possible? Should we just rename the parameter for now > and think about that later? I don't think we can implement (2), because: robert.haas=# create table test (a int); CREATE TABLE robert.haas=# begin; BEGIN robert.haas=*# select * from test; a --- (0 rows) <in another window> robert.haas=# create index on test (a); CREATE INDEX In English, the lock we hold during regular table access isn't strong enough to foreclose concurrent addition of an index. So renaming the parameter seems like the way to go. How about "mark_unused_now"? > > - if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid)) > > + if (!ItemIdIsUsed(itemid)) > > continue; > > > > There is a bit of overhead here for the !no_indexes case. I assume it > > doesn't matter. > > Right. Should be okay. Alternatively, and I'm not saying this is a > good idea, but we could throw this into the loop in heap_page_prune() > which calls heap_prune_chain(): > > + if (ItemIdIsDead(itemid) && prstate.no_indexes) > + { > + heap_prune_record_unused(&prstate, offnum); > + continue; > + } > > I think that is correct? Wouldn't that be wrong in any case where heap_prune_chain() calls heap_prune_record_dead()? > Yes, so I preferred it in the body of heap_prune_chain() (the caller). > Andres didn't like the extra level of indentation. I could wrap > heap_record_dead() and heap_record_unused(), but I couldn't really > think of a good wrapper name. heap_record_dead_or_unused() seems long > and literal. But, it might be the best I can do. I can't think of a > general word which encompasses both the transition to death and to > disposal. I'm sure we could solve the wordsmithing problem but I think it's clearer if the heap_prune_record_* functions don't get clever. > > + bool recordfreespace; > > > > Not sure if it's necessary to move this to an outer scope like this? > > The whole handling of this looks kind of confusing. If we're going to > > do it this way, then I think lazy_scan_prune() definitely needs to > > document how it handles this function (i.e. might set true to false, > > won't set false to true, also meaning). But are we sure we want to let > > a local variable with a weird name "leak out" like this? > > Which function do you mean when you say "document how > lazy_scan_prune() handles this function". "function" was a thinko for "variable". > And no we definitely don't > want a variable like this to be hanging out in lazy_scan_heap(), IMHO. > The other patches in the stack move the FSM updates into > lazy_scan_[no]prune() and eliminate this parameter. I moved up the > scope because lazy_scan_noprune() already had recordfreespace as an > output parameter and initialized it unconditionally inside. I > initialize it unconditionally in lazy_scan_prune() as well. I mention > in the commit message that this is temporary because we plan to > eliminate recordfreespace as an output parameter by updating the FSM > in lazy_scan_[no]prune(). I could have stuck recordfreespace into the > LVPagePruneState with the other output parameters. But, leaving it as > a separate output parameter made the diffs lovely for the rest of the > patches in the stack. 
I guess I'll have to look at more of the patches to see what I think of this. Looking at this patch in isolation, it's ugly, IMHO, but it seems we agree on that. > > + Assert(vacrel->do_index_vacuuming); > > > > Is this related? > > Yes, previously the assert was: > Assert(vacrel->nindexes == 0 || vacrel->do_index_vacuuming); > And we eliminated the caller of lazy_vacuum_heap_page() with > vacrel->nindexes == 0. Now it should only be called after doing index > vacuuming (thus index vacuuming should definitely be enabled). Ah. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Jan 9, 2024 at 2:35 PM Andres Freund <andres@anarazel.de> wrote: > I don't have that strong feelings about it. If both of you think it looks good, go ahead... Done. -- Robert Haas EDB: http://www.enterprisedb.com
Attached v7 is rebased over the commit you just made to remove LVPagePruneState->hastup. On Thu, Jan 11, 2024 at 11:54 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Jan 10, 2024 at 5:28 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > Yes, the options I can think of are: > > > > 1) rename the parameter to "immed_reap" or similar and make very clear > > in heap_page_prune_opt() that you are not to pass true. > > 2) make immediate reaping work for on-access pruning. I would need a > > low cost way to find out if there are any indexes on the table. Do you > > think this is possible? Should we just rename the parameter for now > > and think about that later? > > I don't think we can implement (2), because: > > robert.haas=# create table test (a int); > CREATE TABLE > robert.haas=# begin; > BEGIN > robert.haas=*# select * from test; > a > --- > (0 rows) > > <in another window> > > robert.haas=# create index on test (a); > CREATE INDEX > > In English, the lock we hold during regular table access isn't strong > enough to foreclose concurrent addition of an index. Ah, I see. > So renaming the parameter seems like the way to go. How about "mark_unused_now"? I've done this in attached v7. > > > - if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid)) > > > + if (!ItemIdIsUsed(itemid)) > > > continue; > > > > > > There is a bit of overhead here for the !no_indexes case. I assume it > > > doesn't matter. > > > > Right. Should be okay. Alternatively, and I'm not saying this is a > > good idea, but we could throw this into the loop in heap_page_prune() > > which calls heap_prune_chain(): > > > > + if (ItemIdIsDead(itemid) && prstate.no_indexes) > > + { > > + heap_prune_record_unused(&prstate, offnum); > > + continue; > > + } > > > > I think that is correct? > > Wouldn't that be wrong in any case where heap_prune_chain() calls > heap_prune_record_dead()? Hmm. But previously, we skipped heap_prune_chain() for already dead line pointers. In my patch, I rely on the test in the loop in heap_prune_chain() to set those LP_UNUSED if mark_unused_now is true (previously the below code just broke out of the loop). /* * Likewise, a dead line pointer can't be part of the chain. (We * already eliminated the case of dead root tuple outside this * function.) */ if (ItemIdIsDead(lp)) { /* * If the caller set mark_unused_now true, we can set dead line * pointers LP_UNUSED now. We don't increment ndeleted here since * the LP was already marked dead. */ if (unlikely(prstate->mark_unused_now)) heap_prune_record_unused(prstate, offnum); break; } so wouldn't what I suggested simply set the item LP_UNUSED before invoking heap_prune_chain()? Thereby achieving the same result without invoking heap_prune_chain() for already dead line pointers? I could be missing something. That heap_prune_chain() logic sometimes gets me turned around. > > Yes, so I preferred it in the body of heap_prune_chain() (the caller). > > Andres didn't like the extra level of indentation. I could wrap > > heap_record_dead() and heap_record_unused(), but I couldn't really > > think of a good wrapper name. heap_record_dead_or_unused() seems long > > and literal. But, it might be the best I can do. I can't think of a > > general word which encompasses both the transition to death and to > > disposal. > > I'm sure we could solve the wordsmithing problem but I think it's > clearer if the heap_prune_record_* functions don't get clever. I've gone with my heap_record_dead_or_unused() suggestion. - Melanie
Attachment
On Thu, Jan 11, 2024 at 2:30 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > Attached v7 is rebased over the commit you just made to remove > LVPagePruneState->hastup. Apologies for making you rebase but I was tired of thinking about that patch. I'm still kind of hung up on the changes that 0001 makes to vacuumlazy.c. Say we didn't add the recordfreespace parameter to lazy_scan_prune(). Couldn't the caller just compute it? lpdead_items goes out of scope, but there's also prstate.has_lpdead_items. Pressing that gripe a bit more, I just figured out that "Wait until lazy_vacuum_heap_rel() to save free space" gets turned into "If we will likely do index vacuuming, wait until lazy_vacuum_heap_rel() to save free space." That follows the good practice of phrasing the comment conditionally when the comment is outside the if-statement. But the if-statement says merely "if (recordfreespace)", which is not obviously related to "If we will likely do index vacuuming" but under the hood actually is. But it seems like an unpleasant amount of action at a distance. If that condition said if (vacrel->nindexes > 0 && vacrel->do_index_vacuuming && prstate.has_lpdead_items) UnlockReleaseBuffer(); else { PageGetHeapFreeSpace; UnlockReleaseBuffer; RecordPageWithFreeSpace } it would be a lot more obvious that the code was doing what the comment says. That's a refactoring question, but I'm also wondering if there's a functional issue. Pre-patch, if the table has no indexes and some items are left LP_DEAD, then we mark them unused, forget them, RecordPageWithFreeSpace(), and if enough pages have been visited, FreeSpaceMapVacuumRange(). Post-patch, if the table has no indexes, no items will be left LP_DEAD, and *recordfreespace will be set to true, so we'll PageRecordFreeSpace(). But this seems to me to mean that post-patch we'll PageRecordFreeSpace() in more cases than we do now. Specifically, imagine a case where the current code wouldn't mark any items LP_DEAD and the page also had no pre-existing items that were LP_DEAD and the table also has no indexes. Currently, I believe we wouldn't PageRecordFreeSpace(), but with the patch, I think we would. Am I wrong? Note that the call to FreeSpaceMapVacuumRange() seems to try to guard against this problem by testing for vacrel->tuples_deleted > tuples_already_deleted. I haven't tried to verify whether that guard is correct, but the fact that FreeSpaceMapVacuumRange() has such a guard and RecordPageWithFreeSpace() does not have one makes me suspicious. Another interesting effect of the patch is that instead of getting lazy_vacuum_heap_page()'s handling of the all-visible status of the page, we get the somewhat more complex handling done by lazy_scan_heap(). I haven't fully thought through the consequences of that, but if you have, I'd be interested to hear your analysis. -- Robert Haas EDB: http://www.enterprisedb.com
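Spelling out the suggested restructuring as a rough sketch (this paraphrases the quoted condition and assumes the surrounding lazy_scan_heap() loop's vacrel, prstate, buf, page, and blkno; it is not committed code):

if (vacrel->nindexes > 0 && vacrel->do_index_vacuuming &&
    prstate.has_lpdead_items)
{
    /* Dead items remain; lazy_vacuum_heap_rel() will revisit this page
     * and record its free space then. */
    UnlockReleaseBuffer(buf);
}
else
{
    Size        freespace = PageGetHeapFreeSpace(page);

    UnlockReleaseBuffer(buf);
    RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
}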
On Thu, Jan 11, 2024 at 4:49 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Jan 11, 2024 at 2:30 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > I'm still kind of hung up on the changes that 0001 makes to vacuumlazy.c. > > Say we didn't add the recordfreespace parameter to lazy_scan_prune(). > Couldn't the caller just compute it? lpdead_items goes out of scope, > but there's also prstate.has_lpdead_items. > > Pressing that gripe a bit more, I just figured out that "Wait until > lazy_vacuum_heap_rel() to save free space" gets turned into "If we > will likely do index vacuuming, wait until lazy_vacuum_heap_rel() to > save free space." That follows the good practice of phrasing the > comment conditionally when the comment is outside the if-statement. > But the if-statement says merely "if (recordfreespace)", which is not > obviously related to "If we will likely do index vacuuming" but under > the hood actually is. But it seems like an unpleasant amount of action > at a distance. If that condition said if (vacrel->nindexes > 0 && > vacrel->do_index_vacuuming && prstate.has_lpdead_items) > UnlockReleaseBuffer(); else { PageGetHeapFreeSpace; > UnlockReleaseBuffer; RecordPageWithFreeSpace } it would be a lot more > obvious that the code was doing what the comment says. Yes, this is a good point. Seems like writing maintainable code is really about never lying and then figuring out when hiding the truth is also lying. Darn! My only excuse is that lazy_scan_noprune() has a similarly confusing output parameter, recordfreespace, both of which I removed in a later patch in the set. I think we should change it as you say. > That's a refactoring question, but I'm also wondering if there's a > functional issue. Pre-patch, if the table has no indexes and some > items are left LP_DEAD, then we mark them unused, forget them, > RecordPageWithFreeSpace(), and if enough pages have been visited, > FreeSpaceMapVacuumRange(). Post-patch, if the table has no indexes, no > items will be left LP_DEAD, and *recordfreespace will be set to true, > so we'll PageRecordFreeSpace(). But this seems to me to mean that > post-patch we'll PageRecordFreeSpace() in more cases than we do now. > Specifically, imagine a case where the current code wouldn't mark any > items LP_DEAD and the page also had no pre-existing items that were > LP_DEAD and the table also has no indexes. Currently, I believe we > wouldn't PageRecordFreeSpace(), but with the patch, I think we would. > Am I wrong? Ah! I think you are right. Good catch. I could fix this with logic like this: bool space_freed = vacrel->tuples_deleted > tuples_already_deleted; if ((vacrel->nindexes == 0 && space_freed) || (vacrel->nindexes > 0 && (space_freed || !vacrel->do_index_vacuuming))) I think I made this mistake when working on a different version of this that combined the prune and no prune cases. I noticed that lazy_scan_noprune() updates the FSM whenever there are no indexes. I wonder why this is (and why we don't do it in the prune case). > Note that the call to FreeSpaceMapVacuumRange() seems to try to guard > against this problem by testing for vacrel->tuples_deleted > > tuples_already_deleted. I haven't tried to verify whether that guard > is correct, but the fact that FreeSpaceMapVacuumRange() has such a > guard and RecordPageWithFreeSpace() does not have one makes me > suspicious. FreeSpaceMapVacuumRange() is not called for the no prune case, so I think this is right. 
> Another interesting effect of the patch is that instead of getting > lazy_vacuum_heap_page()'s handling of the all-visible status of the > page, we get the somewhat more complex handling done by > lazy_scan_heap(). I haven't fully thought through the consequences of that, > but if you have, I'd be interested to hear your analysis. lazy_vacuum_heap_page() calls heap_page_is_all_visible() which does another HeapTupleSatisfiesVacuum() call -- which is definitely going to be more expensive than not doing that. In one case, in lazy_scan_heap(), we might do a visibilitymap_get_status() (via VM_ALL_FROZEN()) to avoid calling visibilitymap_set() if the page is already marked all frozen in the VM. But that would pale in comparison to another HeapTupleSatisfiesVacuum() (I think). The VM update code in lazy_scan_heap() looks complicated but two out of four cases are basically to deal with uncommon data corruption. - Melanie
On Thu, Jan 11, 2024 at 01:43:27PM -0500, Robert Haas wrote: > On Tue, Jan 9, 2024 at 2:35 PM Andres Freund <andres@anarazel.de> wrote: >> I don't have that strong feelings about it. If both of you think it >> looks good, go ahead... > > Done. Thanks for e2d5b3b9b643. -- Michael
Attachment
On Thu, Jan 11, 2024 at 9:05 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > Yes, this is a good point. Seems like writing maintainable code is > really about never lying and then figuring out when hiding the truth > is also lying. Darn! I think that's pretty true. In this particular case, this code has a fair number of preexisting problems of this type, IMHO. It's been touched by many hands, each person with their own design ideas, and the result isn't as coherent as if it were written by a single mind from scratch. But, because this code is also absolutely critical to the system not eating user data, any changes have to be approached with the highest level of caution. I think it's good to be really careful about this sort of refactoring anywhere, because finding out a year later that you broke something and having to go back and fix it is no fun even if the consequences are minor ... here, they might not be. > Ah! I think you are right. Good catch. I could fix this with logic like this: > > bool space_freed = vacrel->tuples_deleted > tuples_already_deleted; > if ((vacrel->nindexes == 0 && space_freed) || > (vacrel->nindexes > 0 && (space_freed || !vacrel->do_index_vacuuming))) Perhaps that would be better written as space_freed || (vacrel->nindexes > 0 && !vacrel->do_index_vacuuming), at which point you might not need to introduce the variable. > I think I made this mistake when working on a different version of > this that combined the prune and no prune cases. I noticed that > lazy_scan_noprune() updates the FSM whenever there are no indexes. I > wonder why this is (and why we don't do it in the prune case). Yeah, this all seems distinctly under-commented. As noted above, this code has grown organically, and some things just never got written down. Looking through the git history, I see that this behavior seems to date back to 44fa84881fff4529d68e2437a58ad2c906af5805 which introduced lazy_scan_noprune(). The comments don't explain the reasoning, but my guess is that it was just an accident. It's not entirely evident to me whether there might ever be good reasons to update the freespace map for a page where we haven't freed up any space -- after all, the free space map isn't crash-safe, so you could always postulate that updating it will correct an existing inaccuracy. But I really doubt that there's any good reason for lazy_scan_prune() and lazy_scan_noprune() to have different ideas about whether to update the FSM or not, especially in an obscure corner case like this. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Jan 12, 2024 at 9:44 AM Robert Haas <robertmhaas@gmail.com> wrote: > Looking through the git history, I see that this behavior seems to > date back to 44fa84881fff4529d68e2437a58ad2c906af5805 which introduced > lazy_scan_noprune(). The comments don't explain the reasoning, but my > guess is that it was just an accident. It's not entirely evident to me > whether there might ever be good reasons to update the freespace map > for a page where we haven't freed up any space -- after all, the free > space map isn't crash-safe, so you could always postulate that > updating it will correct an existing inaccuracy. But I really doubt > that there's any good reason for lazy_scan_prune() and > lazy_scan_noprune() to have different ideas about whether to update > the FSM or not, especially in an obscure corner case like this. Why do you think that lazy_scan_prune() and lazy_scan_noprune() have different ideas about whether to update the FSM or not? Barring certain failsafe edge cases, we always call PageGetHeapFreeSpace() exactly once for each scanned page. While it's true that we won't always do that in the first heap pass (in cases where VACUUM has indexes), that's only because we expect to do it in the second heap pass instead -- since we only want to do it once. It's true that lazy_scan_noprune unconditionally calls PageGetHeapFreeSpace() when vacrel->nindexes == 0. But that's the same behavior as lazy_scan_prune when vacrel-> nindexes == 0. In both cases we know that there won't be any second heap pass, and so in both cases we always call PageGetHeapFreeSpace() in the first heap pass. It's just that it's a bit harder to see that in the lazy_scan_prune case. No? -- Peter Geoghegan
On Fri, Jan 12, 2024 at 9:44 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Jan 11, 2024 at 9:05 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > Ah! I think you are right. Good catch. I could fix this with logic like this: > > > > bool space_freed = vacrel->tuples_deleted > tuples_already_deleted; > > if ((vacrel->nindexes == 0 && space_freed) || > > (vacrel->nindexes > 0 && (space_freed || !vacrel->do_index_vacuuming))) > > Perhaps that would be better written as space_freed || > (vacrel->nindexes > 0 && !vacrel->do_index_vacuuming), at which point > you might not need to introduce the variable. As I revisited this, I realized why I had written this logic for updating the FSM: if (vacrel->nindexes == 0 || !vacrel->do_index_vacuuming || lpdead_items == 0) It is because in master, in lazy_scan_heap(), when there are no indexes and no space has been freed, we actually keep going and hit the other logic to update the FSM at the bottom of the loop: if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming) The effect is that if there are no indexes and no space is freed we always update the FSM. When combined, that means that if there are no indexes, we always update the FSM. With the logic you suggested, we fail a pageinspect test which expects the FSM to exist when it doesn't. So, adding the logic: if (space_freed || (vacrel->nindexes > 0 && !vacrel->do_index_vacuuming) || vacrel->nindexes == 0) We pass that pageinspect test. But, we fail a freespacemap test which expects some freespace reported which isn't:

 id            | blkno | is_avail
-----------------+-------+----------
- freespace_tab |     0 | t
+ freespace_tab |     0 | f

which I presume is because space_freed is not the same as !has_lpdead_items. I can't really see what logic would be right here. > > I think I made this mistake when working on a different version of > > this that combined the prune and no prune cases. I noticed that > > lazy_scan_noprune() updates the FSM whenever there are no indexes. I > > wonder why this is (and why we don't do it in the prune case). > > Yeah, this all seems distinctly under-commented. As noted above, this > code has grown organically, and some things just never got written > down. > > Looking through the git history, I see that this behavior seems to > date back to 44fa84881fff4529d68e2437a58ad2c906af5805 which introduced > lazy_scan_noprune(). The comments don't explain the reasoning, but my > guess is that it was just an accident. It's not entirely evident to me > whether there might ever be good reasons to update the freespace map > for a page where we haven't freed up any space -- after all, the free > space map isn't crash-safe, so you could always postulate that > updating it will correct an existing inaccuracy. But I really doubt > that there's any good reason for lazy_scan_prune() and > lazy_scan_noprune() to have different ideas about whether to update > the FSM or not, especially in an obscure corner case like this. All of this makes me wonder why we would update the FSM in vacuum when no space was freed. I thought that RelationAddBlocks() and RelationGetBufferForTuple() would handle making sure the FSM gets created and updated if the reason there is freespace is because the page hasn't been filled yet. - Melanie
On Fri, Jan 12, 2024 at 11:50 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Fri, Jan 12, 2024 at 9:44 AM Robert Haas <robertmhaas@gmail.com> wrote: > > Looking through the git history, I see that this behavior seems to > > date back to 44fa84881fff4529d68e2437a58ad2c906af5805 which introduced > > lazy_scan_noprune(). The comments don't explain the reasoning, but my > > guess is that it was just an accident. It's not entirely evident to me > > whether there might ever be good reasons to update the freespace map > > for a page where we haven't freed up any space -- after all, the free > > space map isn't crash-safe, so you could always postulate that > > updating it will correct an existing inaccuracy. But I really doubt > > that there's any good reason for lazy_scan_prune() and > > lazy_scan_noprune() to have different ideas about whether to update > > the FSM or not, especially in an obscure corner case like this. > > Why do you think that lazy_scan_prune() and lazy_scan_noprune() have > different ideas about whether to update the FSM or not? > > Barring certain failsafe edge cases, we always call > PageGetHeapFreeSpace() exactly once for each scanned page. While it's > true that we won't always do that in the first heap pass (in cases > where VACUUM has indexes), that's only because we expect to do it in > the second heap pass instead -- since we only want to do it once. > > It's true that lazy_scan_noprune unconditionally calls > PageGetHeapFreeSpace() when vacrel->nindexes == 0. But that's the same > behavior as lazy_scan_prune when vacrel-> nindexes == 0. In both cases > we know that there won't be any second heap pass, and so in both cases > we always call PageGetHeapFreeSpace() in the first heap pass. It's > just that it's a bit harder to see that in the lazy_scan_prune case. > No? So, I think this is the logic in master: Prune case, first pass - indexes == 0 && space_freed -> update FSM - indexes == 0 && (!space_freed || !index_vacuuming) -> update FSM - indexes > 0 && (!space_freed || !index_vacuuming) -> update FSM which reduces to: - indexes == 0 || !space_freed || !index_vacuuming No Prune (except aggressive vacuum), first pass: - indexes == 0 -> update FSM - indexes > 0 && !space_freed -> update FSM which reduces to: - indexes == 0 || !space_freed which is, during the first pass, if we will not do index vacuuming, either because we have no indexes, because there is nothing to vacuum, or because do_index_vacuuming is false, make sure we update the freespace map now. but what about no prune when do_index_vacuuming is false? I still don't understand why vacuum is responsible for updating the FSM per page when no line pointers have been set unused. That is how PageGetFreeSpace() figures out if there is free space, right? - Melanie
On Fri, Jan 12, 2024 at 12:33 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > So, I think this is the logic in master: > > Prune case, first pass > > ... > - indexes > 0 && (!space_freed || !index_vacuuming) -> update FSM What is "space_freed"? Isn't that something from your uncommitted patch? As I said, the aim is to call PageGetHeapFreeSpace() (*not* PageGetFreeSpace(), which is only used for index pages) exactly once per heap page scanned. This is supposed to happen independently of whatever specific work was/will be required for the heap page. In general, we don't ever trust that the FSM is already up-to-date. Presumably because the FSM isn't crash safe. On master, prunestate.has_lpdead_items may be set true when our VACUUM wasn't actually the thing that performed pruning that freed tuple storage -- typically when some other backend was the one that did all required pruning at some earlier point in time, often via opportunistic pruning. For better or worse, the only thing that VACUUM aims to do is make sure that PageGetHeapFreeSpace() gets called exactly once per scanned page. > which is, during the first pass, if we will not do index vacuuming, > either because we have no indexes, because there is nothing to vacuum, > or because do_index_vacuuming is false, make sure we update the > freespace map now. > > but what about no prune when do_index_vacuuming is false? Note that do_index_vacuuming cannot actually affect the "nindexes == 0" case. We have an assertion that's kinda relevant to this point: if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming) { /* * Wait until lazy_vacuum_heap_rel() to save free space. * .... */ Assert(vacrel->nindexes > 0); UnlockReleaseBuffer(buf); } else { .... } We should never get this far down in the lazy_scan_heap() loop when "nindexes == 0 && prunestate.has_lpdead_items". That's handled in the special "single heap pass" branch after lazy_scan_prune(). As for the case where we use lazy_scan_noprune() for a "nindexes == 0" VACUUM's heap page, we also can't get this far down. (Actually, lazy_scan_noprune lies if it has to in order to make sure that lazy_scan_heap never has to deal with a "nindexes == 0" VACUUM with LP_DEAD items from a no-cleanup-lock page. This is a kludge -- see comments about how this is "dishonest" inside lazy_scan_noprune().) > I still don't understand why vacuum is responsible for updating the > FSM per page when no line pointers have been set unused. That is how > PageGetFreeSpace() figures out if there is free space, right? You mean PageGetHeapFreeSpace? Not really. (Though even pruning can set line pointers unused, or heap-only tuples.) Even if pruning doesn't happen in VACUUM, that doesn't mean that the FSM is up-to-date. In short, we do these things with the free space map because it is a map of free space (which isn't crash safe) -- nothing more. I happen to agree that that general design has a lot of problems, but those seem out of scope here. -- Peter Geoghegan
On Fri, Jan 12, 2024 at 11:50 AM Peter Geoghegan <pg@bowt.ie> wrote: > Why do you think that lazy_scan_prune() and lazy_scan_noprune() have > different ideas about whether to update the FSM or not? When lazy_scan_prune() is called, we call RecordPageWithFreeSpace if vacrel->nindexes == 0 && prunestate.has_lpdead_items. See the code near the comment that begins "Consider the need to do page-at-a-time heap vacuuming". When lazy_scan_noprune() is called, whether we call RecordPageWithFreeSpace depends on the output parameter recordfreespace. That is set to true whenever vacrel->nindexes == 0 || lpdead_items == 0. See the code near the comment that begins "Save any LP_DEAD items". The case where I thought there was a behavior difference is when vacrel->nindexes == 0 and lpdead_items == 0 and thus prunestate.has_lpdead_items is false. But now I see (from Melanie's email) that this isn't really true, because in that case we fall through to the logic that we use when indexes are present, giving us a second chance to call RecordPageWithFreeSpace(), which we take when (prunestate.has_lpdead_items && vacrel->do_index_vacuuming) comes out false, as it always does in the scenario that I postulated. P.S. to Melanie: I'll respond to your further emails next but I wanted to respond to this one from Peter first so he didn't think I was rudely ignoring him. :-) P.P.S. to everyone: Yikes, this logic is really confusing. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Jan 12, 2024 at 1:07 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Fri, Jan 12, 2024 at 12:33 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > So, I think this is the logic in master: > > > > Prune case, first pass > > > > ... > > - indexes > 0 && (!space_freed || !index_vacuuming) -> update FSM > > What is "space_freed"? Isn't that something from your uncommitted patch? Yes, I was mixing the two together. I just want to make sure that we agree that, on master, when lazy_scan_prune() is called, the logic for whether or not to update the FSM after the first pass is: indexes == 0 || !has_lpdead_items || !index_vacuuming and when lazy_scan_noprune() is called, the logic for whether or not to update the FSM after the first pass is: indexes == 0 || !has_lpdead_items Those seem different to me. - Melanie
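Putting the two side by side as a hypothetical helper makes the difference easy to see (illustrative only, not PostgreSQL code; the function and parameter names here are invented):

    #include <stdbool.h>

    /*
     * Summarizes, per the analysis above, when master's first heap pass
     * records a scanned page's free space in the FSM.
     */
    static bool
    first_pass_records_freespace(bool got_cleanup_lock, int nindexes,
                                 bool has_lpdead_items, bool do_index_vacuuming)
    {
        if (got_cleanup_lock)
            /* lazy_scan_prune() path */
            return nindexes == 0 || !has_lpdead_items || !do_index_vacuuming;

        /* lazy_scan_noprune() path */
        return nindexes == 0 || !has_lpdead_items;
    }

The only divergence is the !do_index_vacuuming disjunct, which matters when the failsafe has disabled index vacuuming.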
On Fri, Jan 12, 2024 at 1:07 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Fri, Jan 12, 2024 at 12:33 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > So, I think this is the logic in master: > > > > Prune case, first pass > > > > ... > > - indexes > 0 && (!space_freed || !index_vacuuming) -> update FSM > > What is "space_freed"? Isn't that something from your uncommitted patch? > > As I said, the aim is to call PageGetHeapFreeSpace() (*not* > PageGetFreeSpace(), which is only used for index pages) exactly once > per heap page scanned. This is supposed to happen independently of > whatever specific work was/will be required for the heap page. In > general, we don't ever trust that the FSM is already up-to-date. > Presumably because the FSM isn't crash safe. > > On master, prunestate.has_lpdead_items may be set true when our VACUUM > wasn't actually the thing that performed pruning that freed tuple > storage -- typically when some other backend was the one that did all > required pruning at some earlier point in time, often via > opportunistic pruning. For better or worse, the only thing that VACUUM > aims to do is make sure that PageGetHeapFreeSpace() gets called > exactly once per scanned page. ... > > I still don't understand why vacuum is responsible for updating the > > FSM per page when no line pointers have been set unused. That is how > > PageGetFreeSpace() figures out if there is free space, right? > > You mean PageGetHeapFreeSpace? Not really. (Though even pruning can > set line pointers unused, or heap-only tuples.) > > Even if pruning doesn't happen in VACUUM, that doesn't mean that the > FSM is up-to-date. > > In short, we do these things with the free space map because it is a > map of free space (which isn't crash safe) -- nothing more. I happen > to agree that that general design has a lot of problems, but those > seem out of scope here. So, there are 3 issues I am trying to understand: 1) How often should vacuum update the FSM (not vacuum as in the second pass but vacuum as in the whole thing that is happening in lazy_scan_heap())? 2) What is the exact logic in master that ensures that vacuum implements the cadence in 1)? 3) How can the logic in 2) be replicated exactly in my patch that sets would-be dead items LP_UNUSED during pruning? From what Peter is saying, I think 1) is decided and is once per page (across all passes) For 2), see my previous email. And for 3), TBD until 2) is agreed upon. - Melanie
On Fri, Jan 12, 2024 at 1:52 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > Yes, I was mixing the two together. > > I just want to make sure that we agree that, on master, when > lazy_scan_prune() is called, the logic for whether or not to update > the FSM after the first pass is: > > indexes == 0 || !has_lpdead_items || !index_vacuuming > > and when lazy_scan_noprune() is called, the logic for whether or not > to update the FSM after the first pass is: > > indexes == 0 || !has_lpdead_items > > Those seem different to me. This analysis seems correct to me, except that "when lazy_scan_noprune() is called" should really say "when lazy_scan_noprune() is called (and returns true)", because when it returns false we fall through and call lazy_scan_prune() afterwards. Here's a draft patch to clean up the inconsistency here. It also gets rid of recordfreespace, because ISTM that recordfreespace is adding to the confusion here rather than helping anything. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
On Fri, Jan 12, 2024 at 2:32 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jan 12, 2024 at 1:52 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > This analysis seems correct to me, except that "when > lazy_scan_noprune() is called" should really say "when > lazy_scan_noprune() is called (and returns true)", because when it > returns false we fall through and call lazy_scan_prune() afterwards. Now that I see your patch, I understand what Melanie must have meant. I agree that there is a small inconsistency here, that we could well do without. In general I am in favor of religiously eliminating such inconsistencies (between lazy_scan_prune and lazy_scan_noprune), unless there is a reason not to. Not because it's necessarily important. More because it's just too hard to be sure whether it might matter. It's usually easier to defensively assume that it matters. > Here's a draft patch to clean up the inconsistency here. It also gets > rid of recordfreespace, because ISTM that recordfreespace is adding to > the confusion here rather than helping anything. You're using "!prunestate.has_lpdead_items" as part of your test that sets "recordfreespace". But lazy_scan_noprune doesn't get passed a pointer to prunestate, so clearly you'll need to detect the same condition some other way. -- Peter Geoghegan
On Fri, Jan 12, 2024 at 2:43 PM Peter Geoghegan <pg@bowt.ie> wrote: > You're using "!prunestate.has_lpdead_items" as part of your test that > sets "recordfreespace". But lazy_scan_noprune doesn't get passed a > pointer to prunestate, so clearly you'll need to detect the same > condition some other way. OOPS. Thanks. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Jan 12, 2024 at 1:52 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > On Fri, Jan 12, 2024 at 1:07 PM Peter Geoghegan <pg@bowt.ie> wrote: > > What is "space_freed"? Isn't that something from your uncommitted patch? > > Yes, I was mixing the two together. An understandable mistake. > I just want to make sure that we agree that, on master, when > lazy_scan_prune() is called, the logic for whether or not to update > the FSM after the first pass is: > > indexes == 0 || !has_lpdead_items || !index_vacuuming > > and when lazy_scan_noprune() is called, the logic for whether or not > to update the FSM after the first pass is: > > indexes == 0 || !has_lpdead_items > > Those seem different to me. Right. As I said to Robert just now, I can now see that they're slightly different conditions. FWIW my brain was just ignoring " || !index_vacuuming". I dismissed it as an edge-case, only relevant when the failsafe has kicked in. Which it is. But that's still no reason to allow an inconsistency that we can easily just avoid. -- Peter Geoghegan
On Fri, Jan 12, 2024 at 2:47 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Jan 12, 2024 at 2:43 PM Peter Geoghegan <pg@bowt.ie> wrote: > > You're using "!prunestate.has_lpdead_items" as part of your test that > > sets "recordfreespace". But lazy_scan_noprune doesn't get passed a > > pointer to prunestate, so clearly you'll need to detect the same > > condition some other way. > > OOPS. Thanks. Also, I think you should combine these in lazy_scan_noprune() now:

/* Save any LP_DEAD items found on the page in dead_items array */
if (vacrel->nindexes == 0)
{
    /* Using one-pass strategy (since table has no indexes) */
    if (lpdead_items > 0)
    {

since we don't set recordfreespace in the outer if statement anymore. And I noticed you missed a reference to the recordfreespace output parameter in the function comment above lazy_scan_noprune(). - Melanie
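Combined, that block would look something like this (a sketch of the suggestion only, not the actual patch):

    /*
     * Save any LP_DEAD items found on the page in the dead_items array
     * (one-pass strategy, since the table has no indexes).
     */
    if (vacrel->nindexes == 0 && lpdead_items > 0)
    {
        /* ... existing one-pass handling ... */
    }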
On Fri, Jan 12, 2024 at 3:04 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > Also, I think you should combine these in lazy_scan_noprune() now > > /* Save any LP_DEAD items found on the page in dead_items array */ > if (vacrel->nindexes == 0) > { > /* Using one-pass strategy (since table has no indexes) */ > if (lpdead_items > 0) > { > > Since we don't set recordfreespace in the outer if statement anymore Well, maybe, but there's an else clause attached to the outer "if", so you have to be a bit careful. I didn't think it was critical to further rejigger this. > And I noticed you missed a reference to recordfreespace output > parameter in the function comment above lazy_scan_noprune(). OK. So what's the best way to solve the problem that Peter pointed out? Should we pass in the prunestate? Maybe just replace bool *recordfreespace with bool *has_lpdead_items? -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Jan 12, 2024 at 3:22 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Jan 12, 2024 at 3:04 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > So what's the best way to solve the problem that Peter pointed out? > Should we pass in the prunestate? Maybe just replace bool > *recordfreespace with bool *has_lpdead_items? Yea, that works for now. I mean, I think the way we should do it is update the FSM in lazy_scan_noprune(), but, for the purposes of this patch, yes. has_lpdead_items output parameter seems fine to me. - Melanie
On Fri, Jan 12, 2024 at 4:05 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > Yea, that works for now. I mean, I think the way we should do it is > update the FSM in lazy_scan_noprune(), but, for the purposes of this > patch, yes. has_lpdead_items output parameter seems fine to me. Here's v2. It's not exactly clear to me why you want to update the FSM in lazy_scan_[no]prune(). When I first looked at v7-0004, I said to myself "well, this is dumb, because now we're just duplicating something that is common to both cases". But then I realized that the logic was *already* duplicated in lazy_scan_heap(), and that by pushing the duplication down to lazy_scan_[no]prune(), you made the two copies identical rather than, as at present, having two copies that differ from each other. Perhaps that's a sufficient reason to make the change all by itself, but it seems like what would be really good is if we only needed one copy of the logic. I don't know if that's achievable, though. More generally, I somewhat question whether it's really right to push things from lazy_scan_heap() and into lazy_scan_[no]prune(), mostly because of the risk of having to duplicate logic. I'm not even really convinced that it's good for us to have both of those functions. There's an awful lot of code duplication between them already. Each has a loop that tests the status of each item and then, for LP_USED, switches on htsv_get_valid_status or HeapTupleSatisfiesVacuum. It doesn't seem trivial to unify all that code, but it doesn't seem very nice to have two copies of it, either. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
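The duplication Robert refers to is the per-item loop that both functions contain; schematically it looks like this (heavily simplified; the real loops also handle redirects, freezing, and various counters):

    OffsetNumber offnum,
                maxoff = PageGetMaxOffsetNumber(page);

    for (offnum = FirstOffsetNumber;
         offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
        ItemId      itemid = PageGetItemId(page, offnum);

        if (!ItemIdIsUsed(itemid) || ItemIdIsRedirected(itemid))
            continue;

        if (ItemIdIsDead(itemid))
        {
            /* remember the LP_DEAD item */
            continue;
        }

        /*
         * LP_NORMAL: classify the tuple (htsv_get_valid_status() in
         * lazy_scan_prune(), HeapTupleSatisfiesVacuum() in
         * lazy_scan_noprune()) and update live/dead/frozen accounting.
         */
    }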
On Mon, Jan 15, 2024 at 12:29:57PM -0500, Robert Haas wrote: > On Fri, Jan 12, 2024 at 4:05 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > Yea, that works for now. I mean, I think the way we should do it is > > update the FSM in lazy_scan_noprune(), but, for the purposes of this > > patch, yes. has_lpdead_items output parameter seems fine to me. > > Here's v2. > > It's not exactly clear to me why you want to update the FSM in > lazy_scan_[no]prune(). When I first looked at v7-0004, I said to > myself "well, this is dumb, because now we're just duplicating > something that is common to both cases". But then I realized that the > logic was *already* duplicated in lazy_scan_heap(), and that by > pushing the duplication down to lazy_scan_[no]prune(), you made the > two copies identical rather than, as at present, having two copies > that differ from each other. Perhaps that's a sufficient reason to > make the change all by itself, but it seems like what would be really > good is if we only needed one copy of the logic. I don't know if > that's achievable, though. Upthread in v4-0004, I do have a version which combines the two FSM updates for lazy_scan_prune() and lazy_scan_noprune(). If you move the VM updates from lazy_scan_heap() into lazy_scan_prune(), then there is little difference between the prune and no prune cases in lazy_scan_heap(). The main difference is that, when lazy_scan_noprune() returns true, you are supposed to avoid calling lazy_scan_new_or_empty() again and have to avoid calling lazy_scan_prune(). I solved this with a local variable "do_prune" and checked it before calling lazy_scan_new_or_empty() and lazy_scan_prune(). I moved away from this approach because it felt odd to test "do_prune" before calling lazy_scan_new_or_empty(). Though, perhaps it doesn't matter if we just call lazy_scan_new_or_empty() again. We do that when lazy_scan_noprune() returns false anyway. I thought perhaps we could circumvent this issue by moving lazy_scan_new_or_empty() into lazy_scan_prune(). But, this seemed wrong to me because, as it stands now, if lazy_scan_new_or_empty() returns true, we would want to skip the VM update code and FSM update code in lazy_scan_heap() after lazy_scan_prune(). That would mean lazy_scan_prune() would have to return something to indicate that. > More generally, I somewhat question whether it's really right to push > things from lazy_scan_heap() and into lazy_scan_[no]prune(), mostly > because of the risk of having to duplicate logic. I'm not even really > convinced that it's good for us to have both of those functions. > There's an awful lot of code duplication between them already. Each > has a loop that tests the status of each item and then, for LP_USED, > switches on htsv_get_valid_status or HeapTupleSatisfiesVacuum. It > doesn't seem trivial to unify all that code, but it doesn't seem very > nice to have two copies of it, either. I agree that the duplicated logic in both places is undesirable. I think the biggest issue with combining them would be that when lazy_scan_noprune() returns false, it needs to go get a cleanup lock and then invoke the logic of lazy_scan_prune() for all the tuples on the page. This seems a little hard to get right in a single function. Then there are more trivial-to-solve differences like invoking heap_tuple_should_freeze() and not bothering with the whole heap_prepare_freeze_tuple() if we only have the share lock. I am willing to try and write a version of this to see if it is better. 
I will say, though, my agenda was to eventually push the actions taken in the loop in lazy_scan_prune() into heap_page_prune() and heap_prune_chain(). > From 32684f41d1dd50f726aa0dfe8a5d816aa5c42d64 Mon Sep 17 00:00:00 2001 > From: Robert Haas <rhaas@postgresql.org> > Date: Mon, 15 Jan 2024 12:05:52 -0500 > Subject: [PATCH v2] Be more consistent about whether to update the FSM while > vacuuming. Few small review comments below, but, overall, LGTM. > Previously, when lazy_scan_noprune() was called and returned true, we would > update the FSM immediately if the relation had no indexes or if the page > contained no dead items. On the other hand, when lazy_scan_prune() was > called, we would update the FSM if either of those things was true or > if index vacuuming was disabled. Eliminate that behavioral difference by > considering vacrel->do_index_vacuuming in both cases. > > Also, instead of having lazy_scan_noprune() make the decision > internally and pass it back to the caller via *recordfreespace, just > have it pass the number of LP_DEAD items back to the caller, and then It doesn't pass the number of LP_DEAD items back to the caller. It passes a boolean. > let the caller make the decision. That seems less confusing, since > the caller also decides in the lazy_scan_prune() case; moreover, this > way, the whole test is in one place, instead of spread out. Perhaps it isn't important, but I find this wording confusing. You mention lazy_scan_prune() and then mention that "the whole test is in one place instead of spread out" -- which kind of makes it sound like you are consolidating FSM updates for both the lazy_scan_noprune() and lazy_scan_prune() cases. Perhaps simply flipping the order of the "since the caller" and "moreover, this way" conjunctions would solve it. I defer to your judgment. > --- > src/backend/access/heap/vacuumlazy.c | 58 ++++++++++++++-------------- > 1 file changed, 29 insertions(+), 29 deletions(-) > > diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c > index b63cad1335..f17816b81d 100644 > --- a/src/backend/access/heap/vacuumlazy.c > +++ b/src/backend/access/heap/vacuumlazy.c > @@ -1960,7 +1974,6 @@ lazy_scan_noprune(LVRelState *vacrel, > Assert(BufferGetBlockNumber(buf) == blkno); > > hastup = false; /* for now */ > - *recordfreespace = false; /* for now */ > > lpdead_items = 0; > live_tuples = 0; > @@ -2102,18 +2115,8 @@ lazy_scan_noprune(LVRelState *vacrel, > hastup = true; > missed_dead_tuples += lpdead_items; > } > - > - *recordfreespace = true; > - } > - else if (lpdead_items == 0) > - { > - /* > - * Won't be vacuuming this page later, so record page's freespace in > - * the FSM now > - */ > - *recordfreespace = true; > } > - else > + else if (lpdead_items != 0) It stuck out to me a bit that this test is if lpdead_items != 0 and the one above it: /* Save any LP_DEAD items found on the page in dead_items array */ if (vacrel->nindexes == 0) { /* Using one-pass strategy (since table has no indexes) */ if (lpdead_items > 0) { tests if lpdead_items > 0. It is more the inconsistency that bothered me than the fact that lpdead_items is signed. > @@ -2159,6 +2156,9 @@ lazy_scan_noprune(LVRelState *vacrel, > if (hastup) > vacrel->nonempty_pages = blkno + 1; > > + /* Did we find LP_DEAD items? */ > + *has_lpdead_items = (lpdead_items > 0); I would drop this comment. The code doesn't really need it, and the reason we care if there are LP_DEAD items is not because they are dead but because we want to know if we'll touch this page again. 
You don't need to rehash all that here, so I think omitting the comment is enough. - Melanie
On Mon, Jan 15, 2024 at 4:03 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > It doesn't pass the number of LP_DEAD items back to the caller. It > passes a boolean. Oops. > Perhaps it isn't important, but I find this wording confusing. You > mention lazy_scan_prune() and then mention that "the whole test is in > one place instead of spread out" -- which kind of makes it sound like > you are consolidating FSM updates for both the lazy_scan_noprune() and > lazy_scan_prune() cases. Perhaps simply flipping the order of the "since > the caller" and "moreover, this way" conjunctions would solve it. I > defer to your judgment. I rewrote the commit message a bit. See what you think of this version. > tests if lpdead_items > 0. It is more the inconsistency that bothered me > than the fact that lpdead_items is signed. Changed. > > @@ -2159,6 +2156,9 @@ lazy_scan_noprune(LVRelState *vacrel, > > if (hastup) > > vacrel->nonempty_pages = blkno + 1; > > > > + /* Did we find LP_DEAD items? */ > > + *has_lpdead_items = (lpdead_items > 0); > > I would drop this comment. The code doesn't really need it, and the > reason we care if there are LP_DEAD items is not because they are dead > but because we want to know if we'll touch this page again. You don't > need to rehash all that here, so I think omitting the comment is enough. I want to keep the comment. I guess it's a pet peeve of mine, but I hate it when people do this:

/* some comment */
some_code();
some_more_code();

/* some other comment */
even_more_code();

IMV, this makes it unclear whether /* some comment */ is describing both some_code() and some_more_code(), or just the former. To be fair, there is often no practical confusion, because if the comment is good and the code is nothing too complicated then you understand which way the author meant it. But sometimes the comment is bad or out of date and sometimes the code is difficult to understand and then you're left scratching your head as to what the author meant. I prefer to insert a comment above some_more_code() in such cases, even if it's a bit perfunctory. I think it makes the code easier to read. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
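Rendered out, the layout Robert says he prefers looks like this (an illustrative sketch using the same placeholder names as his example):

    /* some comment */
    some_code();

    /* explain some_more_code() here, even if only briefly */
    some_more_code();

    /* some other comment */
    even_more_code();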
On Tue, Jan 16, 2024 at 10:24 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Jan 15, 2024 at 4:03 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > Perhaps it isn't important, but I find this wording confusing. You > > mention lazy_scan_prune() and then mention that "the whole test is in > > one place instead of spread out" -- which kind of makes it sound like > > you are consolidating FSM updates for both the lazy_scan_noprune() and > > lazy_scan_prune() cases. Perhaps simply flipping the order of the "since > > the caller" and "moreover, this way" conjunctions would solve it. I > > defer to your judgment. > > I rewrote the commit message a bit. See what you think of this version. All LGTM.
On Tue, Jan 16, 2024 at 11:28 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > All LGTM. Committed. -- Robert Haas EDB: http://www.enterprisedb.com
On 1/12/24 12:45 PM, Robert Haas wrote: > P.P.S. to everyone: Yikes, this logic is really confusing. Having studied all this code several years ago when it was even simpler - it was *still* very hard to grok even back then. I *greatly appreciate* the effort that y'all are putting into increasing the clarity here. BTW, back in the day the whole "no indexes" optimization was a really tiny amount of code... I think it amounted to 2 or 3 if statements. I haven't yet attempted to grok this patchset, but I'm definitely wondering how much it's worth continuing to optimize that case. Clearly it'd be very expensive to memoize dead tuples just to trawl that list a single time to clean the heap, but outside of that I'm not sure other optimizations are worth it given the amount of code complexity/duplication they seem to require - especially for code where correctness is so crucial. -- Jim Nasby, Data Architect, Austin TX
On Tue, Jan 16, 2024 at 4:28 PM Jim Nasby <jim.nasby@gmail.com> wrote: > On 1/12/24 12:45 PM, Robert Haas wrote: > > P.P.S. to everyone: Yikes, this logic is really confusing. > > Having studied all this code several years ago when it was even simpler > - it was *still* very hard to grok even back then. I *greatly > appreciate* the effort that y'all are putting into increasing the > clarity here. Thanks. And yeah, I agree. > BTW, back in the day the whole "no indexes" optimization was a really > tiny amount of code... I think it amounted to 2 or 3 if statements. I > haven't yet attempted to grok this patchset, but I'm definitely > wondering how much it's worth continuing to optimize that case. Clearly > it'd be very expensive to memoize dead tuples just to trawl that list a > single time to clean the heap, but outside of that I'm not sure other > optimazations are worth it given the amount of code > complexity/duplication they seem to require - especially for code where > correctness is so crucial. Personally, I don't think throwing away that optimization is the way to go. The idea isn't intrinsically complicated, I believe. It's just that the code has become messy because of too many hands touching it. At least, that's my read. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Jan 16, 2024 at 2:23 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jan 16, 2024 at 11:28 AM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > All LGTM. > > Committed. Attached v8 patch set is rebased over this. In 0004, I've taken the approach you seem to favor and combined the FSM updates from the prune and no prune cases in lazy_scan_heap() instead of pushing the FSM updates into lazy_scan_prune() and lazy_scan_noprune(). I did not guard against calling lazy_scan_new_or_empty() a second time in the case that lazy_scan_noprune() was called. I can do this. I mentioned upthread I found it confusing for lazy_scan_new_or_empty() to be guarded by if (do_prune). The overhead of calling it wouldn't be terribly high. I can change that based on your opinion of what is better. The big open question/functional change is when we consider vacuuming the FSM. Previously, we only considered vacuuming the FSM in the no indexes, has dead items case. After combining the FSM updates from lazy_scan_prune()'s no indexes/has lpdead items case, lazy_scan_prune()'s regular case, and lazy_scan_noprune(), all of them consider vacuuming the FSM. I could guard against this, but I wasn't sure why we would want to only vacuum the FSM in the no indexes/has dead items case. I also noticed while rebasing something I missed while reviewing 45d395cd75ffc5b -- has_lpdead_items is set in a slightly different place in lazy_scan_noprune() than lazy_scan_prune() (with otherwise identical surrounding code). Both are correct, but it would have been nice for them to be the same. If the patches below are committed, we could standardize on the location in lazy_scan_noprune(). - Melanie
Attachment
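For orientation, the consolidated end-of-page handling being described would look roughly like this in lazy_scan_heap() (a sketch of the idea only, not the v8 patch; has_lpdead_items here stands for whatever either scan function reported for the page):

    if (vacrel->nindexes == 0 || !has_lpdead_items ||
        !vacrel->do_index_vacuuming)
    {
        Size        freespace = PageGetHeapFreeSpace(page);

        UnlockReleaseBuffer(buf);
        RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
    }
    else
    {
        /* LP_DEAD items remain; record free space after the second pass */
        UnlockReleaseBuffer(buf);
    }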
On Tue, Jan 16, 2024 at 6:07 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > Attached v8 patch set is rebased over this. Reviewing 0001, consider the case where a table has no indexes. Pre-patch, PageTruncateLinePointerArray() will happen when lazy_vacuum_heap_page() is called; post-patch, it will not occur. That's a loss. Should we care? On the plus side, visibility map repair, if required, can now take place. That's a gain. I'm otherwise satisfied with this patch now, except for some extremely minor nitpicking:

+ * For now, pass mark_unused_now == false regardless of whether or

Personally, I would write "pass mark_unused_now as false" here, because we're not testing equality. Or else "pass mark_unused_now = false". This is not an equality test.

+ * During pruning, the caller may have passed mark_unused_now == true,

Again here, but also, this is referring to the name of a parameter to a function whose name is not given. I think this should either talk fully in terms of code ("When heap_page_prune was called, mark_unused_now may have been passed as true, which allows would-be LP_DEAD items to be made LP_UNUSED instead.") or fully in English ("During pruning, we may have decided to mark would-be dead items as unused."). > In 0004, I've taken the approach you seem to favor and combined the FSM > updates from the prune and no prune cases in lazy_scan_heap() instead > of pushing the FSM updates into lazy_scan_prune() and > lazy_scan_noprune(). I do like that approach. I think do_prune can be declared one scope inward, in the per-block for loop. I would probably initialize it to true so I could drop the stubby else block:

+ /* If we do get a cleanup lock, we will definitely prune */
+ else
+     do_prune = true;

And then I'd probably write the call as if (!lazy_scan_noprune()) do_prune = true. If I wanted to stick with not initializing do_prune, I'd put the comment inside as /* We got the cleanup lock, so we will definitely prune */ and add braces since that makes it a two-line block. > I did not guard against calling lazy_scan_new_or_empty() a second time > in the case that lazy_scan_noprune() was called. I can do this. I > mentioned upthread I found it confusing for lazy_scan_new_or_empty() > to be guarded by if (do_prune). The overhead of calling it wouldn't be > terribly high. I can change that based on your opinion of what is > better. To me, the relevant question here isn't reader confusion, because that can be cleared up with a comment explaining why we do or do not test do_prune. Indeed, I'd say it's a textbook example of when you should comment a test: when it might look wrong to the reader but you have a good reason for doing it.
Like this:

bool got_cleanup_lock = ConditionalLockBufferForCleanup(buf);
if (lazy_scan_new_or_empty())
    continue;
if (!got_cleanup_lock && !lazy_scan_noprune())
{
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    LockBufferForCleanup(buf);
    got_cleanup_lock = true;
}
if (got_cleanup_lock)
    lazy_scan_prune();

The current organization of the code seems to imply that we don't need to worry about the PageIsNew() and PageIsEmpty() cases before calling lazy_scan_noprune(), and at the moment I'm not understanding why that should be the case. I wonder if this is a bug, or if I'm just confused. > The big open question/functional change is when we consider vacuuming > the FSM. Previously, we only considered vacuuming the FSM in the no > indexes, has dead items case. After combining the FSM updates from > lazy_scan_prune()'s no indexes/has lpdead items case, > lazy_scan_prune()'s regular case, and lazy_scan_noprune(), all of them > consider vacuuming the FSM. I could guard against this, but I wasn't > sure why we would want to only vacuum the FSM in the no indexes/has > dead items case. I don't get it. Conceptually, I agree that we don't want to be inconsistent here without some good reason. One of the big advantages of unifying different code paths is that you avoid being accidentally inconsistent. If different things are different it shows up as a test in the code instead of just having different code paths in different places that may or may not match. But I thought the whole point of 45d395cd75ffc5b4c824467140127a5d11696d4c was to iron out the existing inconsistencies so that we could unify this code without having to change any more behavior. In particular, I thought we just made it consistently adhere to the principle Peter articulated, where we record free space when we're touching the page for presumptively the last time. I gather that you think it's still not consistent, but I don't understand what the remaining inconsistency is. Can you explain further? -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Jan 17, 2024 at 12:17 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jan 16, 2024 at 6:07 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > Attached v8 patch set is rebased over this. > > Reviewing 0001, consider the case where a table has no indexes. > Pre-patch, PageTruncateLinePointerArray() will happen when > lazy_vacuum_heap_page() is called; post-patch, it will not occur. > That's a loss. Should we care? On the plus side, visibility map > repair, if required, can now take place. That's a gain. I thought that this wasn't an issue because heap_page_prune_execute() calls PageRepairFragmentation() which similarly modifies pd_lower and sets the hint bit about free line pointers. > I'm otherwise satisfied with this patch now, except for some extremely > minor nitpicking: > > + * For now, pass mark_unused_now == false regardless of whether or > > Personally, i would write "pass mark_unused_now as false" here, > because we're not testing equality. Or else "pass mark_unused_now = > false". This is not an equality test. > > + * During pruning, the caller may have passed mark_unused_now == true, > > Again here, but also, this is referring to the name of a parameter to > a function whose name is not given. I think this this should either > talk fully in terms of code ("When heap_page_prune was called, > mark_unused_now may have been passed as true, which allows would-be > LP_DEAD items to be made LP_USED instead.") or fully in English > ("During pruning, we may have decided to mark would-be dead items as > unused."). Fixed both of the above issues as suggested in attached v9 > > In 0004, I've taken the approach you seem to favor and combined the FSM > > updates from the prune and no prune cases in lazy_scan_heap() instead > > of pushing the FSM updates into lazy_scan_prune() and > > lazy_scan_noprune(). > > I do like that approach. > > I think do_prune can be declared one scope inward, in the per-block > for loop. I would probably initialize it to true so I could drop the > stubby else block: > > + /* If we do get a cleanup lock, we will definitely prune */ > + else > + do_prune = true; > > And then I'd probably write the call as if (!lazy_scan_noprune()) > doprune = true. > > If I wanted to stick with not initializing do_prune, I'd put the > comment inside as /* We got the cleanup lock, so we will definitely > prune */ and add braces since that makes it a two-line block. If we don't unconditionally set do_prune using the result of lazy_scan_noprune(), then we cannot leave do_prune uninitialized. I preferred having it uninitialized, as it didn't imply that doing pruning was the default. Also, it made it simpler to have that comment about always pruning when we get the cleanup lock. However, with the changes you mentioned below (got_cleanup_lock), this discussion is moot. > > I did not guard against calling lazy_scan_new_or_empty() a second time > > in the case that lazy_scan_noprune() was called. I can do this. I > > mentioned upthread I found it confusing for lazy_scan_new_or_empty() > > to be guarded by if (do_prune). The overhead of calling it wouldn't be > > terribly high. I can change that based on your opinion of what is > > better. > > To me, the relevant question here isn't reader confusion, because that > can be cleared up with a comment explaining why we do or do not test > do_prune. Indeed, I'd say it's a textbook example of when you should > comment a test: when it might look wrong to the reader but you have a > good reason for doing it. 
> > But that brings me to what I think the real question is here: do we, > uh, have a good reason for doing it? At first blush the structure > looks a bit odd here. lazy_scan_new_or_empty() is intended to handle > PageIsNew() and PageIsEmpty() cases, lazy_scan_noprune() the cases > where we process the page without a cleanup lock, and > lazy_scan_prune() the regular case. So you might think that > lazy_scan_new_or_empty() would always be applied *first*, that we > would then conditionally apply lazy_scan_noprune(), and finally > conditionally apply lazy_scan_prune(). Like this: > > bool got_cleanup_lock = ConditionalLockBufferForCleanup(buf); > if (lazy_scan_new_or_empty()) > continue; > if (!got_cleanup_lock && !lazy_scan_noprune()) > { > LockBuffer(buf, BUFFER_LOCK_UNLOCK); > LockBufferForCleanup(buf); > got_cleanup_lock = true; > } > if (got_cleanup_lock) > lazy_scan_prune(); > > The current organization of the code seems to imply that we don't need > to worry about the PageIsNew() and PageIsEmpty() cases before calling > lazy_scan_noprune(), and at the moment I'm not understanding why that > should be the case. I wonder if this is a bug, or if I'm just > confused. Yes, I also spent some time thinking about this. In master, we do always call lazy_scan_new_or_empty() before calling lazy_scan_noprune(). The code is aiming to ensure we call lazy_scan_new_or_empty() once before calling either of lazy_scan_noprune() or lazy_scan_prune(). I think it didn't call lazy_scan_new_or_empty() unconditionally first because of the different lock types expected. But, your structure has solved that. I've used a version of your example code above in attached v9. It is much nicer. > > The big open question/functional change is when we consider vacuuming > > the FSM. Previously, we only considered vacuuming the FSM in the no > > indexes, has dead items case. After combining the FSM updates from > > lazy_scan_prune()'s no indexes/has lpdead items case, > > lazy_scan_prune()'s regular case, and lazy_scan_noprune(), all of them > > consider vacuuming the FSM. I could guard against this, but I wasn't > > sure why we would want to only vacuum the FSM in the no indexes/has > > dead items case. > > I don't get it. Conceptually, I agree that we don't want to be > inconsistent here without some good reason. One of the big advantages > of unifying different code paths is that you avoid being accidentally > inconsistent. If different things are different it shows up as a test > in the code instead of just having different code paths in different > places that may or may not match. > > But I thought the whole point of > 45d395cd75ffc5b4c824467140127a5d11696d4c was to iron out the existing > inconsistencies so that we could unify this code without having to > change any more behavior. In particular, I thought we just made it > consistently adhere to the principle Peter articulated, where we > record free space when we're touching the page for presumptively the > last time. I gather that you think it's still not consistent, but I > don't understand what the remaining inconsistency is. Can you explain > further? Ah, I realize I was not clear. I am now talking about inconsistencies in vacuuming the FSM itself. FreeSpaceMapVacuumRange(). Not updating the freespace map during the course of vacuuming the heap relation. - Melanie
Attachment
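To keep the two operations straight (the function names are the real ones; the calls are shown out of context as a sketch):

    /* Updating the FSM: record one heap page's free space in its leaf slot */
    RecordPageWithFreeSpace(vacrel->rel, blkno, PageGetHeapFreeSpace(page));

    /*
     * Vacuuming the FSM: propagate leaf-level updates up the FSM tree for a
     * range of blocks, so that searches for free space can actually see the
     * newly recorded values.
     */
    FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum, blkno);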
On Wed, Jan 17, 2024 at 3:12 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > Reviewing 0001, consider the case where a table has no indexes. > > Pre-patch, PageTruncateLinePointerArray() will happen when > > lazy_vacuum_heap_page() is called; post-patch, it will not occur. > > That's a loss. Should we care? On the plus side, visibility map > > repair, if required, can now take place. That's a gain. > > I thought that this wasn't an issue because heap_page_prune_execute() > calls PageRepairFragmentation() which similarly modifies pd_lower and > sets the hint bit about free line pointers. Ah, OK, I didn't understand that PageRepairFragmentation() does what is also done by PageTruncateLinePointerArray(). > Yes, I also spent some time thinking about this. In master, we do > always call lazy_scan_new_or_empty() before calling > lazy_scan_noprune(). The code is aiming to ensure we call > lazy_scan_new_or_empty() once before calling either of > lazy_scan_noprune() or lazy_scan_prune(). I think it didn't call > lazy_scan_new_or_empty() unconditionally first because of the > different lock types expected. But, your structure has solved that. > I've used a version of your example code above in attached v9. It is > much nicer. Oh, OK, I see it now. I missed that lazy_scan_new_or_empty() was called either way. Glad that my proposed restructuring managed to be helpful despite that confusion, though. :-) At a quick glance, I also like the way this looks. I'll review it more thoroughly later. Does this patch require 0002 and 0003 or could it equally well go first? I confess that I don't entirely understand why we want 0002 and 0003. > Ah, I realize I was not clear. I am now talking about inconsistencies > in vacuuming the FSM itself. FreeSpaceMapVacuumRange(). Not updating > the freespace map during the course of vacuuming the heap relation. Fair enough, but I'm still not quite sure exactly what the question is. It looks to me like the current code, when there are indexes, vacuums the FSM after each round of index vacuuming. When there are no indexes, doing it after each round of index vacuuming would mean never doing it, so instead we vacuum the FSM every ~8GB. I assume what happened here is that somebody decided doing it after each round of index vacuuming was the "right thing," and then realized that was not going to work if no index vacuuming was happening, and so inserted the 8GB threshold to cover that case. I don't really know what to make of all of this. On a happy PostgreSQL system, doing anything after each round of index vacuuming means doing it once, because multiple rounds of indexing vacuum are extremely expensive and we hope that it won't ever occur. From that point of view, the 8GB threshold is better, because it means that when we vacuum a large relation, space should become visible to the rest of the system incrementally without needing to wait for the entire vacuum to finish. On the other hand, we also have this idea that we want to record free space in the FSM once, after the last time we touch the page. Given that behavior, vacuuming the FSM every 8GB when we haven't yet done index vacuuming wouldn't accomplish much of anything, because we haven't updated it for the pages we just touched. On the third hand, the current behavior seems slightly ridiculous, because pruning the page is where we're mostly going to free up space, so we might be better off just updating the FSM then instead of waiting. 
That free space could be mighty useful during the long wait between pruning and marking line pointers unused. On the fourth hand, that seems like a significant behavior change that we might not want to undertake without a bunch of research that we might not want to do right now -- and if we did do it, should we then update the FSM a second time after marking line pointers unused? I'm not sure if any of this is answering your actual question, though. -- Robert Haas EDB: http://www.enterprisedb.com
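For readers following along, here is a minimal sketch of the ~8GB cadence being discussed. It is reconstructed from memory of vacuumlazy.c rather than quoted from it, and the variable names (blkno, next_fsm_block_to_vacuum, has_lpdead_items) and the exact placement inside lazy_scan_heap()'s per-block loop are assumptions based on that file's conventions:

    /*
     * In the no-indexes case there is no round of index vacuuming to hang
     * the FSM vacuum on, so the FSM itself is vacuumed periodically, making
     * newly recorded free space visible on upper FSM pages before the whole
     * VACUUM finishes.  VACUUM_FSM_EVERY_PAGES works out to roughly 8GB
     * worth of blocks.
     */
    if (vacrel->nindexes == 0 && has_lpdead_items &&
        blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
    {
        FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum, blkno);
        next_fsm_block_to_vacuum = blkno;
    }

Note that blkno advances over every block in the relation, including blocks skipped via the visibility map -- the looseness Peter raises in the next message.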
On Wed, Jan 17, 2024 at 3:58 PM Robert Haas <robertmhaas@gmail.com> wrote: > > Ah, I realize I was not clear. I am now talking about inconsistencies > > in vacuuming the FSM itself. FreeSpaceMapVacuumRange(). Not updating > > the freespace map during the course of vacuuming the heap relation. > > Fair enough, but I'm still not quite sure exactly what the question > is. It looks to me like the current code, when there are indexes, > vacuums the FSM after each round of index vacuuming. When there are no > indexes, doing it after each round of index vacuuming would mean never > doing it, so instead we vacuum the FSM every ~8GB. I assume what > happened here is that somebody decided doing it after each round of > index vacuuming was the "right thing," and then realized that was not > going to work if no index vacuuming was happening, and so inserted the > 8GB threshold to cover that case. Note that VACUUM_FSM_EVERY_PAGES is applied against the number of rel_pages "processed" so far -- *including* any pages that were skipped using the visibility map. It would make a bit more sense if it was applied against scanned_pages instead (just like FAILSAFE_EVERY_PAGES has been since commit 07eef53955). In other words, VACUUM_FSM_EVERY_PAGES is applied against a thing that has only a very loose relationship with physical work performed/time elapsed. I tend to suspect that VACUUM_FSM_EVERY_PAGES is fundamentally the wrong idea. If it's such a good idea then why not apply it all the time? That is, why not apply it independently of whether nindexes==0 in the current VACUUM operation? (You know, just like with FAILSAFE_EVERY_PAGES.) -- Peter Geoghegan
On Wed, Jan 17, 2024 at 4:25 PM Peter Geoghegan <pg@bowt.ie> wrote: > I tend to suspect that VACUUM_FSM_EVERY_PAGES is fundamentally the > wrong idea. If it's such a good idea then why not apply it all the > time? That is, why not apply it independently of whether nindexes==0 > in the current VACUUM operation? (You know, just like with > FAILSAFE_EVERY_PAGES.) Actually, I suppose that we couldn't apply it independently of nindexes==0. Then we'd call FreeSpaceMapVacuumRange() before our second pass over the heap takes place for those LP_DEAD-containing heap pages scanned since the last round of index/heap vacuuming took place (or since VACUUM began). We need to make sure that the FSM has the most recent possible information known to VACUUM, which would break if we applied VACUUM_FSM_EVERY_PAGES rules when nindexes > 0. Even still, the design of VACUUM_FSM_EVERY_PAGES seems questionable to me. -- Peter Geoghegan
On Wed, Jan 17, 2024 at 3:58 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Jan 17, 2024 at 3:12 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > Yes, I also spent some time thinking about this. In master, we do > > always call lazy_scan_new_or_empty() before calling > > lazy_scan_noprune(). The code is aiming to ensure we call > > lazy_scan_new_or_empty() once before calling either of > > lazy_scan_noprune() or lazy_scan_prune(). I think it didn't call > > lazy_scan_new_or_empty() unconditionally first because of the > > different lock types expected. But, your structure has solved that. > > I've used a version of your example code above in attached v9. It is > > much nicer. > > Oh, OK, I see it now. I missed that lazy_scan_new_or_empty() was > called either way. Glad that my proposed restructuring managed to be > helpful despite that confusion, though. :-) > > At a quick glance, I also like the way this looks. I'll review it more > thoroughly later. Does this patch require 0002 and 0003 or could it > equally well go first? I confess that I don't entirely understand why > we want 0002 and 0003. Well, 0002 and 0003 move the updates to the visibility map into lazy_scan_prune(). We only want to update the VM if we called lazy_scan_prune() (i.e. not if lazy_scan_noprune() returned true). We also need the lock on the heap page when updating the visibility map but we want to have released the lock before updating the FSM, so we need to first update the VM then the FSM. The VM update code, in my opinion, belongs in lazy_scan_prune() -- since it is only done when lazy_scan_prune() is called. To keep the VM update code in lazy_scan_heap() and still consolidate the FSM update code, we would have to surround all of the VM update code in a test (if got_cleanup_lock, I suppose). I don't see any advantage in doing that. > > Ah, I realize I was not clear. I am now talking about inconsistencies > > in vacuuming the FSM itself. FreeSpaceMapVacuumRange(). Not updating > > the freespace map during the course of vacuuming the heap relation. > > Fair enough, but I'm still not quite sure exactly what the question > is. It looks to me like the current code, when there are indexes, > vacuums the FSM after each round of index vacuuming. When there are no > indexes, doing it after each round of index vacuuming would mean never > doing it, so instead we vacuum the FSM every ~8GB. I assume what > happened here is that somebody decided doing it after each round of > index vacuuming was the "right thing," and then realized that was not > going to work if no index vacuuming was happening, and so inserted the > 8GB threshold to cover that case. Ah, I see. I understood that we want to update the FSM every 8GB, but I didn't understand that we wanted to check if we were at that 8GB only after a round of index vacuuming. That would explain why we also had to do it in the no indexes case -- because, as you say, there wouldn't be a round of index vacuuming. This does mean that something is not quite right with 0001 as well as 0004. We'd end up checking if we are at 8GB much more often. I should probably find a way to replicate the cadence on master. > I don't really know what to make of > all of this. On a happy PostgreSQL system, doing anything after each > round of index vacuuming means doing it once, because multiple rounds > of indexing vacuum are extremely expensive and we hope that it won't > ever occur. 
From that point of view, the 8GB threshold is better, > because it means that when we vacuum a large relation, space should > become visible to the rest of the system incrementally without needing > to wait for the entire vacuum to finish. On the other hand, we also > have this idea that we want to record free space in the FSM once, > after the last time we touch the page. Given that behavior, vacuuming > the FSM every 8GB when we haven't yet done index vacuuming wouldn't > accomplish much of anything, because we haven't updated it for the > pages we just touched. On the third hand, the current behavior seems > slightly ridiculous, because pruning the page is where we're mostly > going to free up space, so we might be better off just updating the > FSM then instead of waiting. That free space could be mighty useful > during the long wait between pruning and marking line pointers unused. > On the fourth hand, that seems like a significant behavior change that > we might not want to undertake without a bunch of research that we > might not want to do right now -- and if we did do it, should we then > update the FSM a second time after marking line pointers unused? I suspect we'd need to do some testing of various scenarios to justify such a change. - Melanie
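As a rough illustration of the ordering constraint Melanie describes (the VM must be updated while the heap page is still locked, the FSM only after the lock is dropped), the consolidated path discussed here might look something like the sketch below. It is based on the patch discussion, not on committed code, and the condition for updating the FSM now versus in the second pass paraphrases the comment quoted later in the thread:

    if (got_cleanup_lock)
        lazy_scan_prune(vacrel, buf, blkno, page, vmbuffer,
                        all_visible_according_to_vm, &has_lpdead_items);
    /* any VM bit is set inside lazy_scan_prune(), with the buffer lock held */

    if (vacrel->nindexes == 0 || !vacrel->do_index_vacuuming || !has_lpdead_items)
    {
        /* no second heap pass will touch this page: record free space now */
        Size        freespace = PageGetHeapFreeSpace(page);

        UnlockReleaseBuffer(buf);   /* drop the lock before touching the FSM */
        RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
    }
    else
        UnlockReleaseBuffer(buf);   /* defer the FSM update to the second pass */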
On Wed, Jan 17, 2024 at 4:25 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Wed, Jan 17, 2024 at 3:58 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > Ah, I realize I was not clear. I am now talking about inconsistencies > > > in vacuuming the FSM itself. FreeSpaceMapVacuumRange(). Not updating > > > the freespace map during the course of vacuuming the heap relation. > > > > Fair enough, but I'm still not quite sure exactly what the question > > is. It looks to me like the current code, when there are indexes, > > vacuums the FSM after each round of index vacuuming. When there are no > > indexes, doing it after each round of index vacuuming would mean never > > doing it, so instead we vacuum the FSM every ~8GB. I assume what > > happened here is that somebody decided doing it after each round of > > index vacuuming was the "right thing," and then realized that was not > > going to work if no index vacuuming was happening, and so inserted the > > 8GB threshold to cover that case. > > Note that VACUUM_FSM_EVERY_PAGES is applied against the number of > rel_pages "processed" so far -- *including* any pages that were > skipped using the visibility map. It would make a bit more sense if it > was applied against scanned_pages instead (just like > FAILSAFE_EVERY_PAGES has been since commit 07eef53955). In other > words, VACUUM_FSM_EVERY_PAGES is applied against a thing that has only > a very loose relationship with physical work performed/time elapsed. This is a good point. Seems like a very reasonable change to make, as I would think that was the original intent. - Melanie
On Wed, Jan 17, 2024 at 4:31 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Wed, Jan 17, 2024 at 4:25 PM Peter Geoghegan <pg@bowt.ie> wrote: > > I tend to suspect that VACUUM_FSM_EVERY_PAGES is fundamentally the > > wrong idea. If it's such a good idea then why not apply it all the > > time? That is, why not apply it independently of whether nindexes==0 > > in the current VACUUM operation? (You know, just like with > > FAILSAFE_EVERY_PAGES.) > > Actually, I suppose that we couldn't apply it independently of > nindexes==0. Then we'd call FreeSpaceMapVacuumRange() before our > second pass over the heap takes place for those LP_DEAD-containing > heap pages scanned since the last round of index/heap vacuuming took > place (or since VACUUM began). We need to make sure that the FSM has > the most recent possible information known to VACUUM, which would > break if we applied VACUUM_FSM_EVERY_PAGES rules when nindexes > 0. > > Even still, the design of VACUUM_FSM_EVERY_PAGES seems questionable to me. I now see I misunderstood and my earlier email was wrong. I didn't notice that we only use VACUUM_FSM_EVERY_PAGES if nindexes == 0. So, in master, we call FreeSpaceMapVacuumRange() always after a round of index vacuuming and periodically if there are no indexes. It seems like you are asking whether or not we should vacuum the FSM at a different cadence for the no indexes case (and potentially count blocks actually vacuumed instead of blocks considered). And it seems like Robert is asking whether or not we should FreeSpaceMapVacuumRange() more frequently than after index vacuuming in the nindexes > 0 case. Other than the overhead of the actual vacuuming of the FSM, what are the potential downsides of knowing about freespace sooner? It could change what pages are inserted to. What are the possible undesirable side effects? - Melanie
On Wed, Jan 17, 2024 at 5:47 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Wed, Jan 17, 2024 at 4:31 PM Peter Geoghegan <pg@bowt.ie> wrote: > > > > On Wed, Jan 17, 2024 at 4:25 PM Peter Geoghegan <pg@bowt.ie> wrote: > > > I tend to suspect that VACUUM_FSM_EVERY_PAGES is fundamentally the > > > wrong idea. If it's such a good idea then why not apply it all the > > > time? That is, why not apply it independently of whether nindexes==0 > > > in the current VACUUM operation? (You know, just like with > > > FAILSAFE_EVERY_PAGES.) > > > > Actually, I suppose that we couldn't apply it independently of > > nindexes==0. Then we'd call FreeSpaceMapVacuumRange() before our > > second pass over the heap takes place for those LP_DEAD-containing > > heap pages scanned since the last round of index/heap vacuuming took > > place (or since VACUUM began). We need to make sure that the FSM has > > the most recent possible information known to VACUUM, which would > > break if we applied VACUUM_FSM_EVERY_PAGES rules when nindexes > 0. > > > > Even still, the design of VACUUM_FSM_EVERY_PAGES seems questionable to me. > > I now see I misunderstood and my earlier email was wrong. I didn't > notice that we only use VACUUM_FSM_EVERY_PAGES if nindexes ==0. > So, in master, we call FreeSpaceMapVacuumRange() always after a round > of index vacuuming and periodically if there are no indexes. The "nindexes == 0" if() that comes just after our call to lazy_scan_prune() is "the one-pass equivalent of a call to lazy_vacuum()". Though this includes the call to FreeSpaceMapVacuumRange() that immediately follows the two-pass case calling lazy_vacuum(), too. > It seems like you are asking whether not we should vacuum the FSM at a > different cadence for the no indexes case (and potentially count > blocks actually vacuumed instead of blocks considered). > > And it seems like Robert is asking whether or not we should > FreeSpaceMapVacuumRange() more frequently than after index vacuuming > in the nindexes > 0 case. There is no particular reason for the nindexes==0 case to care about how often we'd call FreeSpaceMapVacuumRange() in the counterfactual world where the same VACUUM ran on the same table, except that it was nindexes>1 instead. At least I don't see any. > Other than the overhead of the actual vacuuming of the FSM, what are > the potential downsides of knowing about freespace sooner? It could > change what pages are inserted to. What are the possible undesirable > side effects? The whole VACUUM_FSM_EVERY_PAGES thing comes from commit 851a26e266. The commit message of that work seems to suppose that calling FreeSpaceMapVacuumRange() more frequently is pretty much strictly better than calling it less frequently, at least up to the point where certain more-or-less fixed costs paid once per FreeSpaceMapVacuumRange() start to become a problem. I think that that's probably about right. The commit message also says that we "arbitrarily update upper FSM pages after each 8GB of heap" (in the nindexes==0 case). So VACUUM_FSM_EVERY_PAGES is only very approximately analogous to what we do in the nindexes>1 case. That seems reasonable because these two cases really aren't so comparable in terms of the FSM vacuuming requirements -- the nindexes==0 case legitimately doesn't have the same dependency on heap vacuuming (and index vacuuming) that we have to consider when nindexes>1. -- Peter Geoghegan
On Wed, Jan 17, 2024 at 4:31 PM Peter Geoghegan <pg@bowt.ie> wrote: > Actually, I suppose that we couldn't apply it independently of > nindexes==0. Then we'd call FreeSpaceMapVacuumRange() before our > second pass over the heap takes place for those LP_DEAD-containing > heap pages scanned since the last round of index/heap vacuuming took > place (or since VACUUM began). We need to make sure that the FSM has > the most recent possible information known to VACUUM, which would > break if we applied VACUUM_FSM_EVERY_PAGES rules when nindexes > 0. > > Even still, the design of VACUUM_FSM_EVERY_PAGES seems questionable to me. I agree with all of this. I thought I'd said all of this, actually, in my prior email, but perhaps it wasn't as clear as it needed to be. But I also said one more thing that I'd still like to hear your thoughts about, which is: why is it right to update the FSM after the second heap pass rather than the first one? I can't help but suspect this is an algorithmic holdover from pre-HOT days, when VACUUM's first heap pass was read-only and all the work happened in the second pass. Now, nearly all of the free space that will ever become free becomes free in the first pass, so why not advertise it then, instead of waiting? Admittedly, HOT is not yet 15 years old, so maybe it's too soon to adapt our VACUUM algorithm for it. *wink* -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Jan 17, 2024 at 5:33 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > This does mean that something is not quite right with 0001 as well as > 0004. We'd end up checking if we are at 8GB much more often. I should > probably find a way to replicate the cadence on master. I believe I've done this in attached v10. - Melanie
Attachment
On Thu, Jan 18, 2024 at 8:52 AM Robert Haas <robertmhaas@gmail.com> wrote: > But I also said one more thing that I'd still like to hear your > thoughts about, which is: why is it right to update the FSM after the > second heap pass rather than the first one? I can't help but suspect > this is an algorithmic holdover from pre-HOT days, when VACUUM's first > heap pass was read-only and all the work happened in the second pass. > Now, nearly all of the free space that will ever become free becomes > free in the first pass, so why not advertise it then, instead of > waiting? I don't think that doing everything FSM-related in the first heap pass is a bad idea -- especially not if it buys you something elsewhere. The problem with your justification for moving things in that direction (if any) is that it is occasionally not quite true: there are at least some cases where line pointer truncation after making a page's LP_DEAD items -> LP_UNUSED will actually matter. Plus PageGetHeapFreeSpace() will return 0 if and when "PageGetMaxOffsetNumber(page) > MaxHeapTuplesPerPage && !PageHasFreeLinePointers(page)". Of course, nothing stops you from compensating for this by anticipating what will happen later on, and assuming that the page already has that much free space. It might even be okay to just not try to compensate for anything, PageGetHeapFreeSpace-wise -- just do all FSM stuff in the first heap pass, and ignore all this. I happen to believe that a FSM_CATEGORIES of 256 is way too much granularity to be useful in practice -- I just don't have any faith in the idea that that kind of granularity is useful (it's quite the opposite). A further justification might be what we already do in the heapam.c REDO routines: the way that we use XLogRecordPageWithFreeSpace already operates with far less precision than the corresponding code from vacuumlazy.c. heap_xlog_prune() already has recovery do what you propose to do during original execution; it doesn't try to avoid duplicating an anticipated call to XLogRecordPageWithFreeSpace that'll take place when heap_xlog_vacuum() runs against the same page a bit later on. You'd likely prefer a simpler argument for doing this -- an argument that doesn't require abandoning/discrediting the idea that a high degree of FSM_CATEGORIES-wise precision is a valuable thing. Not sure that that's possible -- the current design is at least correct on its own terms. And what you propose to do will probably be less correct on those same terms, silly though they are. -- Peter Geoghegan
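To make the PageGetHeapFreeSpace() point concrete, here is a simplified paraphrase, written from memory of the bufpage.c logic Peter refers to rather than copied from it; the helper name is invented for illustration:

    /*
     * Once a heap page carries its maximum complement of line pointers
     * (MaxHeapTuplesPerPage, 291 with 8kB pages and default alignment) and
     * none of them are reusable, the page reports zero usable free space no
     * matter how much raw space sits between pd_lower and pd_upper.
     */
    static Size
    usable_heap_free_space(Page page)   /* illustrative, not a real API */
    {
        Size        space = PageGetFreeSpace(page);

        if (space > 0 &&
            PageGetMaxOffsetNumber(page) >= MaxHeapTuplesPerPage &&
            !PageHasFreeLinePointers(page))
            space = 0;

        return space;
    }

Only after the second heap pass sets those LP_DEAD items LP_UNUSED (possibly truncating the line pointer array along the way) does the reported free space jump back up.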
On Thu, Jan 18, 2024 at 9:53 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > I believe I've done this in attached v10. Oh, I see. Good catch. I've now committed 0001. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Jan 18, 2024 at 10:09 AM Peter Geoghegan <pg@bowt.ie> wrote: > The problem with your justification for moving things in that > direction (if any) is that it is occasionally not quite true: there > are at least some cases where line pointer truncation after making a > page's LP_DEAD items -> LP_UNUSED will actually matter. Plus > PageGetHeapFreeSpace() will return 0 if and when > "PageGetMaxOffsetNumber(page) > MaxHeapTuplesPerPage && > !PageHasFreeLinePointers(page)". Of course, nothing stops you from > compensating for this by anticipating what will happen later on, and > assuming that the page already has that much free space. I think we're agreeing but I want to be sure. If we only set LP_DEAD items to LP_UNUSED, that frees no space. But if doing so allows us to truncate the line pointer array, then that frees a little bit of space. Right? One problem with using this as a justification for the status quo is that truncating the line pointer array is a relatively recent behavior. It's certainly much newer than the choice to have VACUUM touch the FSM in the second heap pass rather than the first. Another problem is that the amount of space that we're freeing up in the second pass is really quite minimal even when it's >0. Any tuple that actually contains any data at all is at least 32 bytes, and most of them are quite a bit larger. Item pointers are 2 bytes. To save enough space to fit even one additional tuple, we'd have to free *at least* 16 line pointers. That's going to be really rare. And even if it happens, is it even useful to advertise that free space? Do we want to cram one more tuple into a page that has a history of extremely heavy updates? Could it be that it's smarter to just forget about that free space? You've written before about the stupidity of cramming tuples of different generations into the same page, and that concept seems to apply here. When we heap_page_prune(), we don't know how much time has elapsed since the page was last modified - but if we're lucky, it might not be very much. Updating the FSM at that time gives us some shot of filling up the page with data created around the same time as the existing page contents. By the time we vacuum the indexes and come back, that temporal locality is definitely lost. > You'd likely prefer a simpler argument for doing this -- an argument > that doesn't require abandoning/discrediting the idea that a high > degree of FSM_CATEGORIES-wise precision is a valuable thing. Not sure > that that's possible -- the current design is at least correct on its > own terms. And what you propose to do will probably be less correct on > those same terms, silly though they are. I've never really understood why you think that the number of FSM_CATEGORIES is the problem. I believe I recall you endorsing a system where pages are open or closed, to try to achieve temporal locality of data. I agree that such a system could work better than what we have now. I think there's a risk that such a system could create pathological cases where the behavior is much worse than what we have today, and I think we'd need to consider carefully what such cases might exist and what mitigation strategies might make sense. However, I don't see a reason why such a system should intrinsically want to reduce FSM_CATEGORIES. If we have two open pages and one of them has enough space for the tuple we're now trying to insert and the other doesn't, we'd still like to avoid having the FSM hand us the one that doesn't. 
Now, that said, I suspect that we actually could reduce FSM_CATEGORIES somewhat without causing any real problems, because many tables are going to have tuples that are all about the same size, and even in a table where the sizes vary more than is typical, a single tuple can't consume more than a quarter of the page, so granularity above that point seems completely useless. So if we needed some bitspace to track the open/closed status of pages or similar, I suspect we could find that in the existing FSM byte per page without losing anything. But all of that is just an argument that reducing the number of FSM_CATEGORIES is *acceptable*; it doesn't amount to an argument that it's better. My current belief is that it isn't better, just a vehicle to do something else that maybe is better, like squeezing open/closed tracking or similar into the existing bit space. My understanding is that you think it would be better on its own terms, but I have not yet been able to grasp why that would be so. -- Robert Haas EDB: http://www.enterprisedb.com
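A quick back-of-the-envelope check of the numbers in this exchange. The constants assume the stock values mentioned above (an 8192-byte BLCKSZ, 256 FSM categories); this is standalone arithmetic, not PostgreSQL code:

    #include <stdio.h>

    #define BLCKSZ          8192
    #define FSM_CATEGORIES  256
    #define FSM_CAT_STEP    (BLCKSZ / FSM_CATEGORIES)   /* 32 bytes per category */

    int
    main(void)
    {
        /* Each category covers 32 bytes -- finer than the smallest heap tuple. */
        printf("bytes per FSM category: %d\n", FSM_CAT_STEP);

        /* A tuple normally tops out around a quarter of a page, so only the
         * lower quarter of the categories does much distinguishing. */
        printf("categories at or below a quarter page: %d of %d\n",
               (BLCKSZ / 4) / FSM_CAT_STEP, FSM_CATEGORIES);

        /* Robert's arithmetic: 2-byte line pointers vs. a >= 32-byte tuple
         * means reclaiming at least 16 line pointers per extra tuple. */
        printf("line pointers freed per extra minimal tuple: %d\n", 32 / 2);
        return 0;
    }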
On Thu, Jan 18, 2024 at 10:42 AM Robert Haas <robertmhaas@gmail.com> wrote: > Now, that said, I suspect that we actually could reduce FSM_CATEGORIES > somewhat without causing any real problems, because many tables are > going to have tuples that are all about the same size, and even in a > table where the sizes vary more than is typical, a single tuple can't > consume more than a quarter of the page, Actually, I think that's a soft limit, not a hard limit. But the granularity above that level probably doesn't need to be very high, at least. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Jan 18, 2024 at 10:43 AM Robert Haas <robertmhaas@gmail.com> wrote: > I think we're agreeing but I want to be sure. If we only set LP_DEAD > items to LP_UNUSED, that frees no space. But if doing so allows us to > truncate the line pointer array, that that frees a little bit of > space. Right? That's part of it, yes. > One problem with using this as a justification for the status quo is > that truncating the line pointer array is a relatively recent > behavior. It's certainly much newer than the choice to have VACUUM > touch the FSM in the second page than the first page. True. But the way that PageGetHeapFreeSpace() returns 0 for a page with 291 LP_DEAD stubs is a much older behavior. When that happens it is literally true that the page has lots of free space. And yet it's not free space we can actually use. Not until those LP_DEAD items are marked LP_UNUSED. > Another problem is that the amount of space that we're freeing up in > the second pass is really quite minimal even when it's >0. Any tuple > that actually contains any data at all is at least 32 bytes, and most > of them are quite a bit larger. Item pointers are 2 bytes. To save > enough space to fit even one additional tuple, we'd have to free *at > least* 16 line pointers. That's going to be really rare. I basically agree with this. I would still worry about the "291 LP_DEAD stubs makes PageGetHeapFreeSpace return 0" thing specifically, though. It's sort of a special case. > And even if it happens, is it even useful to advertise that free > space? Do we want to cram one more tuple into a page that has a > history of extremely heavy updates? Could it be that it's smarter to > just forget about that free space? I think so, yes. Another big source of inaccuracies here is that we don't credit RECENTLY_DEAD tuple space with being free space. Maybe that isn't a huge problem, but it makes it even harder to believe that precision in FSM accounting is an intrinsic good. > > You'd likely prefer a simpler argument for doing this -- an argument > > that doesn't require abandoning/discrediting the idea that a high > > degree of FSM_CATEGORIES-wise precision is a valuable thing. Not sure > > that that's possible -- the current design is at least correct on its > > own terms. And what you propose to do will probably be less correct on > > those same terms, silly though they are. > > I've never really understood why you think that the number of > FSM_CATEGORIES is the problem. I believe I recall you endorsing a > system where pages are open or closed, to try to achieve temporal > locality of data. My remarks about "FSM_CATEGORIES-wise precision" were basically remarks about the fundamental problem with the free space map. Which is really that it's just a map of free space, that gives exactly zero thought to various high level things that *obviously* matter. I wasn't particularly planning on getting into the specifics of that with you now, on this thread. A brief recap might be useful: other systems with a heap table AM free space management structure typically represent the free space available on each page using a far more coarse grained counter. Usually one with less than 10 distinct increments. The immediate problem with FSM_CATEGORIES having such a fine granularity is that it increases contention/competition among backends that need to find some free space for a new tuple. 
They'll all diligently try to find the page with the least free space that still satisfies their immediate needs -- there is no thought for the second-order effects, which are really important in practice. > But all of that is just an argument that reducing the number of > FSM_CATEGORIES is *acceptable*; it doesn't amount to an argument that > it's better. My current belief is that it isn't better, just a vehicle > to do something else that maybe is better, like squeezing open/closed > tracking or similar into the existing bit space. My understanding is > that you think it would be better on its own terms, but I have not yet > been able to grasp why that would be so. I'm not really arguing that reducing FSM_CATEGORIES and changing nothing else would be better on its own (it might be, but that's not what I meant to convey). What I really wanted to convey is this: if you're going to go the route of ignoring LP_DEAD free space during vacuuming, you're conceding that having a high degree of precision about available free space isn't actually useful (or wouldn't be useful if it was actually possible at all). Which is something that I generally agree with. I'd just like it to be clear that you/Melanie are in fact taking one small step in that direction. We don't need to discuss possible later steps beyond that first step. Not right now. -- Peter Geoghegan
On Thu, Jan 18, 2024 at 11:17 AM Peter Geoghegan <pg@bowt.ie> wrote: > True. But the way that PageGetHeapFreeSpace() returns 0 for a page > with 291 LP_DEAD stubs is a much older behavior. When that happens it > is literally true that the page has lots of free space. And yet it's > not free space we can actually use. Not until those LP_DEAD items are > marked LP_UNUSED. To me, this is just accurate reporting. What we care about in this context is the amount of free space on the page that can be used to store a new tuple. When there are no line pointers available to be allocated, that amount is 0. > Another big source of inaccuracies here is that we don't credit > RECENTLY_DEAD tuple space with being free space. Maybe that isn't a > huge problem, but it makes it even harder to believe that precision in > FSM accounting is an intrinsic good. The difficulty here is that we don't know how long it will be before that space can be reused. Those recently dead tuples could become dead within a few milliseconds or stick around for hours. I've wondered about the merits of some FSM that had built-in visibility awareness, i.e. the capability to record something like "page X currently has Y space free and after XID Z is all-visible it will have Y' space free". That seems complex, but without it, we either have to bet that the space will actually become free before anyone tries to use it, or that it won't. If whatever guess we make is wrong, bad things happen. > My remarks about "FSM_CATEGORIES-wise precision" were basically > remarks about the fundamental problem with the free space map. Which > is really that it's just a map of free space, that gives exactly zero > thought to various high level things that *obviously* matter. I wasn't > particularly planning on getting into the specifics of that with you > now, on this thread. Fair. > A brief recap might be useful: other systems with a heap table AM free > space management structure typically represent the free space > available on each page using a far more coarse grained counter. > Usually one with less than 10 distinct increments. The immediate > problem with FSM_CATEGORIES having such a fine granularity is that it > increases contention/competition among backends that need to find some > free space for a new tuple. They'll all diligently try to find the > page with the least free space that still satisfies their immediate > needs -- there is no thought for the second-order effects, which are > really important in practice. I think that the completely deterministic nature of the computation is a mistake regardless of anything else. That serves to focus contention rather than spreading it out, which is dumb, and would still be dumb with any other number of FSM_CATEGORIES. > What I really wanted to convey is this: if you're going to go the > route of ignoring LP_DEAD free space during vacuuming, you're > conceding that having a high degree of precision about available free > space isn't actually useful (or wouldn't be useful if it was actually > possible at all). Which is something that I generally agree with. I'd > just like it to be clear that you/Melanie are in fact taking one small > step in that direction. We don't need to discuss possible later steps > beyond that first step. Not right now. Yeah. 
I'm not sure we're actually going to change that right now, but I agree with the high-level point regardless, which I would summarize like this: The current system provides more precision about available free space than we actually need, while failing to provide some other things that we really do need. We need not agree today on exactly what those other things are or how best to get them in order to agree that the current system has significant flaws, and we do agree that it does. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Jan 18, 2024 at 11:46 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Jan 18, 2024 at 11:17 AM Peter Geoghegan <pg@bowt.ie> wrote: > > True. But the way that PageGetHeapFreeSpace() returns 0 for a page > > with 291 LP_DEAD stubs is a much older behavior. When that happens it > > is literally true that the page has lots of free space. And yet it's > > not free space we can actually use. Not until those LP_DEAD items are > > marked LP_UNUSED. > > To me, this is just accurate reporting. What we care about in this > context is the amount of free space on the page that can be used to > store a new tuple. When there are no line pointers available to be > allocated, that amount is 0. I agree. All I'm saying is this (can't imagine you'll disagree): It's not okay if you fail to update the FSM a second time in the second heap pass -- at least in some cases. It's reasonably frequent for a page that has 0 usable free space when lazy_scan_prune returns to go on to have almost BLCKSZ free space once lazy_vacuum_heap_page() is done with it. While I am sympathetic to the argument that LP_DEAD item space just isn't that important in general, that doesn't apply with this one special case. This is a "step function" behavior, and is seen whenever VACUUM runs following bulk deletes of tuples -- a rather common case. Clearly the FSM shouldn't show pages that are actually completely empty at the end of VACUUM as having no available free space (on account of how they looked immediately after lazy_scan_prune ran). That'd just be wrong. > > Another big source of inaccuracies here is that we don't credit > > RECENTLY_DEAD tuple space with being free space. Maybe that isn't a > > huge problem, but it makes it even harder to believe that precision in > > FSM accounting is an intrinsic good. > > The difficulty here is that we don't know how long it will be before > that space can be reused. Those recently dead tuples could become dead > within a few milliseconds or stick around for hours. I've wondered > about the merits of some FSM that had built-in visibility awareness, > i.e. the capability to record something like "page X currently has Y > space free and after XID Z is all-visible it will have Y' space free". > That seems complex, but without it, we either have to bet that the > space will actually become free before anyone tries to use it, or that > it won't. If whatever guess we make is wrong, bad things happen. All true -- it is rather complex. Other systems with a heap table access method based on a foundation of 2PL (Oracle, DB2) literally need a transactionally consistent FSM structure. In fact I believe that Oracle literally requires the equivalent of an MVCC snapshot read (a "consistent get") to be able to access what seems like it ought to be strictly a physical data structure correctly. Everything needs to work in the rollback path, independent of whatever else may happen to the page before an xact rolls back (i.e. independently of what other xacts might end up doing with the page). This requires very tight coordination to avoid bugs where a transaction cannot roll back due to not having enough free space to restore the original tuple during UNDO. I don't think it's desirable to have anything as delicate as that here. But some rudimentary understanding of free space being allocated/leased to certain transactions and/or backends does seem like a good idea. 
There is some intrinsic value to these sorts of behaviors, even in a system without any UNDO segments, where it is never strictly necessary. > I think that the completely deterministic nature of the computation is > a mistake regardless of anything else. That serves to focus contention > rather than spreading it out, which is dumb, and would still be dumb > with any other number of FSM_CATEGORIES. That's a part of the problem too, I guess. The actual available free space on each page is literally changing all the time, when measured at FSM_CATEGORIES-wise granularity -- which leads to a mad dash among backends that all need the same amount of free space for their new tuple. One reason why other systems pretty much require coarse-grained increments of free space is the need to manage the WAL overhead for a crash-safe FSM/free list structure. > Yeah. I'm not sure we're actually going to change that right now, but > I agree with the high-level point regardless, which I would summarize > like this: The current system provides more precision about available > free space than we actually need, while failing to provide some other > things that we really do need. We need not agree today on exactly what > those other things are or how best to get them in order to agree that > the current system has significant flaws, and we do agree that it > does. I agree with this. -- Peter Geoghegan
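A sketch of the second-pass step Peter is describing, reconstructed from memory of lazy_vacuum_heap_rel(); the names and the lazy_vacuum_heap_page() signature (including its returned dead_items index) are assumptions, not quoted source:

    Buffer      buf;
    Size        freespace;

    buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
                             vacrel->bstrategy);
    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

    /* Set this page's LP_DEAD items LP_UNUSED, advancing through dead_items */
    index = lazy_vacuum_heap_page(vacrel, blkno, buf, index, vmbuffer);

    /* Measure free space only after reaping: on a bulk-deleted page it can
     * jump from roughly 0 to nearly BLCKSZ */
    freespace = PageGetHeapFreeSpace(BufferGetPage(buf));

    UnlockReleaseBuffer(buf);
    RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);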
On Thu, Jan 18, 2024 at 12:15 PM Peter Geoghegan <pg@bowt.ie> wrote: > It's not okay if you fail to update the FSM a second time in the > second heap pass -- at least in some cases. It's reasonably frequent > for a page that has 0 usable free space when lazy_scan_prune returns > to go on to have almost BLCKSZ free space once lazy_vacuum_heap_page() > is done with it. Oh, good point. That's an important subtlety I had missed. > That's a part of the problem too, I guess. > > The actual available free space on each page is literally changing all > the time, when measured at FSM_CATEGORIES-wise granularity -- which > leads to a mad dash among backends that all need the same amount of > free space for their new tuple. One reason why other systems pretty > much require coarse-grained increments of free space is the need to > manage the WAL overhead for a crash-safe FSM/free list structure. Interesting. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Jan 18, 2024 at 10:09 AM Robert Haas <robertmhaas@gmail.com> wrote: > Oh, I see. Good catch. > > I've now committed 0001. I have now also committed 0002 and 0003. I made some modifications to 0003. Specifically: - I moved has_lpdead_items inside the loop over blocks instead of putting it at the function toplevel. - I adjusted the comments in lazy_scan_prune() because it seemed to me that the comment about "Now scan the page..." had gotten too far separated from the loop where that happens. - I combined two lines in an if-test because one of them was kinda short. Hope that's OK with you. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Jan 18, 2024 at 3:20 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Jan 18, 2024 at 10:09 AM Robert Haas <robertmhaas@gmail.com> wrote: > > Oh, I see. Good catch. > > > > I've now committed 0001. > > I have now also committed 0002 and 0003. I made some modifications to > 0003. Specifically: > > - I moved has_lpdead_items inside the loop over blocks instead of > putting it at the function toplevel. > - I adjusted the comments in lazy_scan_prune() because it seemed to me > that the comment about "Now scan the page..." had gotten too far > separated from the loop where that happens. > - I combined two lines in an if-test because one of them was kinda short. > > Hope that's OK with you. Awesome, thanks! I have attached a rebased version of the former 0004 as v11-0001. - Melanie
Attachment
On Thu, Jan 18, 2024 at 9:23 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > I have attached a rebased version of the former 0004 as v11-0001. This looks correct to me, although I wouldn't mind some more eyes on it. However, I think that the comments still need more work. Specifically: /* * Prune, freeze, and count tuples. * * Accumulates details of remaining LP_DEAD line pointers on page in * dead_items array. This includes LP_DEAD line pointers that we * pruned ourselves, as well as existing LP_DEAD line pointers that * were pruned some time earlier. Also considers freezing XIDs in the * tuple headers of remaining items with storage. It also determines * if truncating this block is safe. */ - lazy_scan_prune(vacrel, buf, blkno, page, - vmbuffer, all_visible_according_to_vm, - &has_lpdead_items); + if (got_cleanup_lock) + lazy_scan_prune(vacrel, buf, blkno, page, + vmbuffer, all_visible_according_to_vm, + &has_lpdead_items); I think this comment needs adjusting. Possibly, the preceding calls to lazy_scan_noprune() and lazy_scan_new_or_empty() could even use a bit better comments, but in those cases, you're basically keeping the same code with the same comment, so it's kinda defensible. Here, though, you're making the call conditional without any comment update. /* * Final steps for block: drop cleanup lock, record free space in the * FSM. * * If we will likely do index vacuuming, wait until * lazy_vacuum_heap_rel() to save free space. This doesn't just save * us some cycles; it also allows us to record any additional free * space that lazy_vacuum_heap_page() will make available in cases * where it's possible to truncate the page's line pointer array. * + * Our goal is to update the freespace map the last time we touch the + * page. If the relation has no indexes, or if index vacuuming is + * disabled, there will be no second heap pass; if this particular + * page has no dead items, the second heap pass will not touch this + * page. So, in those cases, update the FSM now. + * * Note: It's not in fact 100% certain that we really will call * lazy_vacuum_heap_rel() -- lazy_vacuum() might yet opt to skip index * vacuuming (and so must skip heap vacuuming). This is deemed okay * because it only happens in emergencies, or when there is very * little free space anyway. (Besides, we start recording free space * in the FSM once index vacuuming has been abandoned.) */ I think this comment needs a rewrite, not just sticking the other comment in the middle of it. There's some duplication between these two comments, and merging it all together should iron that out. Personally, I think my comment (which was there before, this commit only moves it here) is clearer than what's already here about the intent, but it's lacking some details that are captured in the other two paragraphs, and we probably don't want to lose those details. If you'd like, I can try rewriting these comments to my satisfaction and you can reverse-review the result. Or you can rewrite them and I'll re-review the result. But I think this needs to be a little less mechanical. It's not just about shuffling all the comments around so that all the text ends up somewhere -- we also need to consider the degree to which the meaning becomes duplicated when it all gets merged together. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Jan 24, 2024 at 2:59 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Jan 18, 2024 at 9:23 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > I have attached a rebased version of the former 0004 as v11-0001. > > This looks correct to me, although I wouldn't mind some more eyes on > it. However, I think that the comments still need more work. > > Specifically: > > /* > * Prune, freeze, and count tuples. > * > * Accumulates details of remaining LP_DEAD line pointers on page in > * dead_items array. This includes LP_DEAD line pointers that we > * pruned ourselves, as well as existing LP_DEAD line pointers that > * were pruned some time earlier. Also considers freezing XIDs in the > * tuple headers of remaining items with storage. It also determines > * if truncating this block is safe. > */ > - lazy_scan_prune(vacrel, buf, blkno, page, > - vmbuffer, all_visible_according_to_vm, > - &has_lpdead_items); > + if (got_cleanup_lock) > + lazy_scan_prune(vacrel, buf, blkno, page, > + vmbuffer, all_visible_according_to_vm, > + &has_lpdead_items); > > I think this comment needs adjusting. Possibly, the preceding calls to > lazy_scan_noprune() and lazy_scan_new_or_empty() could even use a bit > better comments, but in those cases, you're basically keeping the same > code with the same comment, so it's kinda defensible. Here, though, > you're making the call conditional without any comment update. > > /* > * Final steps for block: drop cleanup lock, record free space in the > * FSM. > * > * If we will likely do index vacuuming, wait until > * lazy_vacuum_heap_rel() to save free space. This doesn't just save > * us some cycles; it also allows us to record any additional free > * space that lazy_vacuum_heap_page() will make available in cases > * where it's possible to truncate the page's line pointer array. > * > + * Our goal is to update the freespace map the last time we touch the > + * page. If the relation has no indexes, or if index vacuuming is > + * disabled, there will be no second heap pass; if this particular > + * page has no dead items, the second heap pass will not touch this > + * page. So, in those cases, update the FSM now. > + * > * Note: It's not in fact 100% certain that we really will call > * lazy_vacuum_heap_rel() -- lazy_vacuum() might yet opt to skip index > * vacuuming (and so must skip heap vacuuming). This is deemed okay > * because it only happens in emergencies, or when there is very > * little free space anyway. (Besides, we start recording free space > * in the FSM once index vacuuming has been abandoned.) > */ > > I think this comment needs a rewrite, not just sticking the other > comment in the middle of it. There's some duplication between these > two comments, and merging it all together should iron that out. > Personally, I think my comment (which was there before, this commit > only moves it here) is clearer than what's already here about the > intent, but it's lacking some details that are captured in the other > two paragraphs, and we probably don't want to lose those details. > > If you'd like, I can try rewriting these comments to my satisfaction > and you can reverse-review the result. Or you can rewrite them and > I'll re-review the result. But I think this needs to be a little less > mechanical. It's not just about shuffling all the comments around so > that all the text ends up somewhere -- we also need to consider the > degree to which the meaning becomes duplicated when it all gets merged > together. 
I will take a stab at rewriting the comments myself first. Usually, I try to avoid changing comments if the code isn't functionally different because I know it adds additional review overhead and I try to reduce that to an absolute minimum. However, I see what you are saying and agree that it would be better to have actually good comments instead of frankenstein comments made up of parts that were previously considered acceptable. I'll have a new version ready by tomorrow. - Melanie
On Wed, Jan 24, 2024 at 4:34 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Wed, Jan 24, 2024 at 2:59 PM Robert Haas <robertmhaas@gmail.com> wrote: ... > > If you'd like, I can try rewriting these comments to my satisfaction > > and you can reverse-review the result. Or you can rewrite them and > > I'll re-review the result. But I think this needs to be a little less > > mechanical. It's not just about shuffling all the comments around so > > that all the text ends up somewhere -- we also need to consider the > > degree to which the meaning becomes duplicated when it all gets merged > > together. > > I will take a stab at rewriting the comments myself first. Usually, I > try to avoid changing comments if the code isn't functionally > different because I know it adds additional review overhead and I try > to reduce that to an absolute minimum. However, I see what you are > saying and agree that it would be better to have actually good > comments instead of frankenstein comments made up of parts that were > previously considered acceptable. I'll have a new version ready by > tomorrow. v12 attached has my attempt at writing better comments for this section of lazy_scan_heap(). Above the combined FSM update code, I have written a comment that is a revised version of your comment from above the lazy_scan_noprune() FSM update code but with some of the additional details from the previous comment above the lazy_scan_prune() FSM update code. The one part that I did not incorporate was the point about how sometimes we think we'll do a second pass on the block, so we don't update the FSM, but then we end up not doing that second pass and it's all okay. * Note: It's not in fact 100% certain that we really will call * lazy_vacuum_heap_rel() -- lazy_vacuum() might yet opt to skip index * vacuuming (and so must skip heap vacuuming). This is deemed okay * because it only happens in emergencies, or when there is very * little free space anyway. (Besides, we start recording free space * in the FSM once index vacuuming has been abandoned.) I didn't incorporate it because I wasn't sure I understood the situation. I can imagine us skipping updating the FSM after lazy_scan_prune() because there are indexes on the relation and dead items on the page and we think we'll do a second pass. Later, we end up triggering a failsafe vacuum or, somehow, there are still too few TIDs for the second pass, so we update do_index_vacuuming to false. Then we wouldn't ever record this block's free space in the FSM. That seems fine (which is what the comment says). So, what does the last sentence mean? "Besides, we start recording..." - Melanie
Attachment
On Wed, Jan 24, 2024 at 9:13 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > I didn't incorporate it because I wasn't sure I understood the > situation. I can imagine us skipping updating the FSM after > lazy_scan_prune() because there are indexes on the relation and dead > items on the page and we think we'll do a second pass. Later, we end > up triggering a failsafe vacuum or, somehow, there are still too few > TIDs for the second pass, so we update do_index_vacuuming to false. > Then we wouldn't ever record this block's free space in the FSM. That > seems fine (which is what the comment says). So, what does the last > sentence mean? "Besides, we start recording..." It means: when the failsafe kicks in, from that point on we won't do any more heap vacuuming. Clearly any pages that still need to be scanned at that point won't ever be processed by lazy_vacuum_heap_rel(). So from that point on we should record the free space in every scanned heap page in the "first heap pass" -- including pages that have LP_DEAD stubs that aren't going to be made LP_UNUSED in the ongoing VACUUM. -- Peter Geoghegan
On Wed, Jan 24, 2024 at 9:13 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > v12 attached has my attempt at writing better comments for this > section of lazy_scan_heap(). + /* + * If we didn't get the cleanup lock and the page is not new or empty, + * we can still collect LP_DEAD items in the dead_items array for + * later vacuuming, count live and recently dead tuples for vacuum + * logging, and determine if this block could later be truncated. If + * we encounter any xid/mxids that require advancing the + * relfrozenxid/relminxid, we'll have to wait for a cleanup lock and + * call lazy_scan_prune(). + */ I like this comment. I would probably drop "and the page is not new or empty" from it since that's really speaking to the previous bit of code, but it wouldn't be the end of the world to keep it, either. /* - * Prune, freeze, and count tuples. + * If we got a cleanup lock, we can prune and freeze tuples and + * defragment the page. If we didn't get a cleanup lock, we will still + * consider whether or not to update the FSM. * - * Accumulates details of remaining LP_DEAD line pointers on page in - * dead_items array. This includes LP_DEAD line pointers that we - * pruned ourselves, as well as existing LP_DEAD line pointers that - * were pruned some time earlier. Also considers freezing XIDs in the - * tuple headers of remaining items with storage. It also determines - * if truncating this block is safe. + * Like lazy_scan_noprune(), lazy_scan_prune() will count + * recently_dead_tuples and live tuples for vacuum logging, determine + * if the block can later be truncated, and accumulate the details of + * remaining LP_DEAD line pointers on the page in the dead_items + * array. These dead items include those pruned by lazy_scan_prune() + * as well we line pointers previously marked LP_DEAD. */ To me, the first paragraph of this one misses the mark. What I thought we should be saying here was something like "If we don't have a cleanup lock, the code above has already processed this page to the extent that is possible. Otherwise, we either got the cleanup lock initially and have not processed the page yet, or we didn't get it initially, attempted to process it without the cleanup lock, and decided we needed one after all. Either way, if we now have the lock, we must prune, freeze, and count tuples." The second paragraph seems fine. - * Note: It's not in fact 100% certain that we really will call - * lazy_vacuum_heap_rel() -- lazy_vacuum() might yet opt to skip index - * vacuuming (and so must skip heap vacuuming). This is deemed okay - * because it only happens in emergencies, or when there is very - * little free space anyway. (Besides, we start recording free space - * in the FSM once index vacuuming has been abandoned.) Here's a suggestion from me: Note: In corner cases, it's possible to miss updating the FSM entirely. If index vacuuming is currently enabled, we'll skip the FSM update now. But if failsafe mode is later activated, disabling index vacuuming, there will also be no opportunity to update the FSM later, because we'll never revisit this page. Since updating the FSM is desirable but not absolutely required, that's OK. I think this expresses the same sentiment as the current comment, but IMHO more clearly. The one part of the current comment that I don't understand at all is the remark about "when there is very little freespace anyway". 
I get that if the failsafe activates we won't come back to the page, which is the "only happens in emergencies" part of the existing comment. But the current phrasing makes it sound like there is a second case where it can happen -- "when there is very little free space anyway" -- and I don't know what that is talking about. If it's important, we should try to make it clearer. We could also just decide to keep this entire paragraph as it is for purposes of the present patch. The part I really thought needed adjusting was "Prune, freeze, and count tuples." -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Jan 25, 2024 at 8:57 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Jan 24, 2024 at 9:13 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > v12 attached has my attempt at writing better comments for this > > section of lazy_scan_heap(). > > + /* > + * If we didn't get the cleanup lock and the page is not new or empty, > + * we can still collect LP_DEAD items in the dead_items array for > + * later vacuuming, count live and recently dead tuples for vacuum > + * logging, and determine if this block could later be truncated. If > + * we encounter any xid/mxids that require advancing the > + * relfrozenxid/relminxid, we'll have to wait for a cleanup lock and > + * call lazy_scan_prune(). > + */ > > I like this comment. I would probably drop "and the page is not new or > empty" from it since that's really speaking to the previous bit of > code, but it wouldn't be the end of the world to keep it, either. Yes, probably best to get rid of the part about new or empty. > /* > - * Prune, freeze, and count tuples. > + * If we got a cleanup lock, we can prune and freeze tuples and > + * defragment the page. If we didn't get a cleanup lock, we will still > + * consider whether or not to update the FSM. > * > - * Accumulates details of remaining LP_DEAD line pointers on page in > - * dead_items array. This includes LP_DEAD line pointers that we > - * pruned ourselves, as well as existing LP_DEAD line pointers that > - * were pruned some time earlier. Also considers freezing XIDs in the > - * tuple headers of remaining items with storage. It also determines > - * if truncating this block is safe. > + * Like lazy_scan_noprune(), lazy_scan_prune() will count > + * recently_dead_tuples and live tuples for vacuum logging, determine > + * if the block can later be truncated, and accumulate the details of > + * remaining LP_DEAD line pointers on the page in the dead_items > + * array. These dead items include those pruned by lazy_scan_prune() > + * as well we line pointers previously marked LP_DEAD. > */ > > To me, the first paragraph of this one misses the mark. What I thought > we should be saying here was something like "If we don't have a > cleanup lock, the code above has already processed this page to the > extent that is possible. Otherwise, we either got the cleanup lock > initially and have not processed the page yet, or we didn't get it > initially, attempted to process it without the cleanup lock, and > decided we needed one after all. Either way, if we now have the lock, > we must prune, freeze, and count tuples." I see. Your suggestion makes sense. The sentence starting with "Otherwise" is a bit long. I started to lose the thread at "decided we needed one after all". You previously referred to the cleanup lock as "it" -- once you start referring to it as "one", I as the future developer am no longer sure we are talking about the cleanup lock (as opposed to the page or something else). > - * Note: It's not in fact 100% certain that we really will call > - * lazy_vacuum_heap_rel() -- lazy_vacuum() might yet opt to skip index > - * vacuuming (and so must skip heap vacuuming). This is deemed okay > - * because it only happens in emergencies, or when there is very > - * little free space anyway. (Besides, we start recording free space > - * in the FSM once index vacuuming has been abandoned.) > > Here's a suggestion from me: > > Note: In corner cases, it's possible to miss updating the FSM > entirely. If index vacuuming is currently enabled, we'll skip the FSM > update now. 
But if failsafe mode is later activated, disabling index > vacuuming, there will also be no opportunity to update the FSM later, > because we'll never revisit this page. Since updating the FSM is > desirable but not absolutely required, that's OK. > > I think this expresses the same sentiment as the current comment, but > IMHO more clearly. The one part of the current comment that I don't > understand at all is the remark about "when there is very little > freespace anyway". I get that if the failsafe activates we won't come > back to the page, which is the "only happens in emergencies" part of > the existing comment. But the current phrasing makes it sound like > there is a second case where it can happen -- "when there is very > little free space anyway" -- and I don't know what that is talking > about. If it's important, we should try to make it clearer. > > We could also just decide to keep this entire paragraph as it is for > purposes of the present patch. The part I really thought needed > adjusting was "Prune, freeze, and count tuples." I think it would be nice to clarify this comment. I think the "when there is little free space anyway" is referring to the case in lazy_vacuum() where we set do_index_vacuuming to false because "there are almost zero TIDs". I initially thought it was saying that in the failsafe vacuum case the pages whose free space we wouldn't record in the FSM have little free space anyway -- which I didn't get. But then I looked at where we set do_index_vacuuming to false. As for the last sentence starting with "Besides", even with Peter's explanation I still am not sure what it should say. There are blocks whose free space we don't record in the first heap pass. Then, due to skipping index vacuuming and the second heap pass, we also don't record their free space in the second heap pass. I think he is saying that once we set do_index_vacuuming to false, we will stop skipping updating the FSM after the first pass for future blocks. So, future blocks will have their free space recorded in the FSM. But that feels self-evident. The more salient point is that there are some blocks whose free space is not recorded (those whose first pass happened before unsetting do_index_vacuuming and whose second pass did not happen before do_index_vacuuming is unset). The extra sentence made me think there was some way we might go back and record free space for those blocks, but that is not true. - Melanie
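To make the missed-FSM-update case above concrete, here is a small standalone C model. It is only a sketch, not PostgreSQL source: the block count, the point at which do_index_vacuuming gets unset, and every name other than do_index_vacuuming are invented purely for illustration.

    #include <stdbool.h>
    #include <stdio.h>

    #define NBLOCKS 8

    int
    main(void)
    {
        bool        do_index_vacuuming = true;      /* may be unset mid-vacuum */
        bool        fsm_recorded[NBLOCKS] = {false};

        /* First heap pass */
        for (int blk = 0; blk < NBLOCKS; blk++)
        {
            bool        has_lpdead_items = (blk % 2 == 0);  /* pretend half the blocks have LP_DEAD items */

            /*
             * Record free space now only if no second pass is expected for
             * this block; otherwise defer the FSM update to the second pass.
             */
            if (!has_lpdead_items || !do_index_vacuuming)
                fsm_recorded[blk] = true;

            /* Pretend index vacuuming is disabled partway through the scan */
            if (blk == 3)
                do_index_vacuuming = false;
        }

        /* Second heap pass: never happens, because index vacuuming was disabled */

        for (int blk = 0; blk < NBLOCKS; blk++)
        {
            if (!fsm_recorded[blk])
                printf("block %d: free space never recorded in the FSM\n", blk);
        }
        return 0;
    }

In this model, blocks 0 and 2 are the problem blocks: their first pass deferred the FSM update to a second pass that never runs, so their free space is never recorded.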
On Thu, Jan 25, 2024 at 9:18 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > > To me, the first paragraph of this one misses the mark. What I thought > > we should be saying here was something like "If we don't have a > > cleanup lock, the code above has already processed this page to the > > extent that is possible. Otherwise, we either got the cleanup lock > > initially and have not processed the page yet, or we didn't get it > > initially, attempted to process it without the cleanup lock, and > > decided we needed one after all. Either way, if we now have the lock, > > we must prune, freeze, and count tuples." > > I see. Your suggestion makes sense. The sentence starting with > "Otherwise" is a bit long. I started to lose the thread at "decided we > needed one after all". You previously referred to the cleanup lock as > "it" -- once you start referring to it as "one", I as the future > developer am no longer sure we are talking about the cleanup lock (as > opposed to the page or something else). Ok... trying again: If we have a cleanup lock, we must now prune, freeze, and count tuples. We may have acquired the cleanup lock originally, or we may have gone back and acquired it after lazy_scan_noprune() returned false. Either way, the page hasn't been processed yet. > I think it would be nice to clarify this comment. I think the "when > there is little free space anyway" is referring to the case in > lazy_vacuum() where we set do_index_vacuuming to false because "there > are almost zero TIDs". I initially thought it was saying that in the > failsafe vacuum case the pages whose free space we wouldn't record in > the FSM have little free space anyway -- which I didn't get. But then > I looked at where we set do_index_vacuuming to false. Oh... wow. That's kind of confusing; somehow I was thinking we were talking about free space on the disk, rather than newly free space in pages that could be added to the FSM. And it seems really questionable whether that case is OK. I mean, in the emergency case, fine, whatever, we should do whatever it takes to get the system back up, and it should barely ever happen on a well-configured system. But this case could happen regularly, and losing track of free space could easily cause bloat. This might be another argument for moving FSM updates to the first heap pass, but that's a separate task from fixing the comment. > As for the last sentence starting with "Besides", even with Peter's > explanation I still am not sure what it should say. There are blocks > whose free space we don't record in the first heap pass. Then, due to > skipping index vacuuming and the second heap pass, we also don't > record their free space in the second heap pass. I think he is saying > that once we set do_index_vacuuming to false, we will stop skipping > updating the FSM after the first pass for future blocks. So, future > blocks will have their free space recorded in the FSM. But that feels > self-evident. Yes, I don't think that necessarily needs to be mentioned here. > The more salient point is that there are some blocks > whose free space is not recorded (those whose first pass happened > before unsetting do_index_vacuuming and whose second pass did not > happen before do_index_vacuuming is unset). The extra sentence made me > think there was some way we might go back and record free space for > those blocks, but that is not true. I don't really see why that sentence made you think that, but it's not important. 
I agree with you about what point we need to emphasize here. -- Robert Haas EDB: http://www.enterprisedb.com
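The cleanup-lock wording being settled on here is easier to see as control flow. The following is a minimal standalone sketch, not the vacuumlazy.c code: the helper functions are hypothetical stand-ins for ConditionalLockBufferForCleanup(), lazy_scan_noprune(), LockBufferForCleanup(), and lazy_scan_prune(), with hard-coded return values chosen so that the retry path is taken.

    #include <stdbool.h>
    #include <stdio.h>

    /* Stand-ins for the real buffer-lock and page-processing routines. */
    static bool
    try_conditional_cleanup_lock(void)
    {
        return false;               /* pretend another backend has the page pinned */
    }

    static bool
    scan_page_without_cleanup_lock(void)
    {
        /* returns false when the page must be fully processed after all */
        return false;
    }

    static void
    wait_for_cleanup_lock(void)
    {
        printf("waiting for a cleanup lock\n");
    }

    static void
    prune_freeze_and_count(void)
    {
        printf("pruning, freezing, and counting tuples\n");
    }

    int
    main(void)
    {
        bool        got_cleanup_lock = try_conditional_cleanup_lock();

        if (!got_cleanup_lock)
        {
            if (scan_page_without_cleanup_lock())
            {
                /* Page was processed to the extent possible without the cleanup lock */
                return 0;
            }

            /* The no-cleanup-lock path decided we need the cleanup lock after all */
            wait_for_cleanup_lock();
            got_cleanup_lock = true;
        }

        /*
         * If we reach here we hold the cleanup lock: either we acquired it up
         * front, or we went back and acquired it after the no-cleanup-lock
         * path returned false.  Either way, the page has not been fully
         * processed yet.
         */
        if (got_cleanup_lock)
            prune_freeze_and_count();

        return 0;
    }

The comment under discussion describes exactly the last branch: whichever way the lock was obtained, the page still needs to be pruned, frozen, and counted.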
On Thu, Jan 25, 2024 at 10:19 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Jan 25, 2024 at 9:18 AM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > To me, the first paragraph of this one misses the mark. What I thought > > > we should be saying here was something like "If we don't have a > > > cleanup lock, the code above has already processed this page to the > > > extent that is possible. Otherwise, we either got the cleanup lock > > > initially and have not processed the page yet, or we didn't get it > > > initially, attempted to process it without the cleanup lock, and > > > decided we needed one after all. Either way, if we now have the lock, > > > we must prune, freeze, and count tuples." > > > > I see. Your suggestion makes sense. The sentence starting with > > "Otherwise" is a bit long. I started to lose the thread at "decided we > > needed one after all". You previously referred to the cleanup lock as > > "it" -- once you start referring to it as "one", I as the future > > developer am no longer sure we are talking about the cleanup lock (as > > opposed to the page or something else). > > Ok... trying again: > > If we have a cleanup lock, we must now prune, freeze, and count > tuples. We may have acquired the cleanup lock originally, or we may > have gone back and acquired it after lazy_scan_noprune() returned > false. Either way, the page hasn't been processed yet. Cool. I might add "successfully" or "fully" to "Either way, the page hasn't been processed yet" > > I think it would be nice to clarify this comment. I think the "when > > there is little free space anyway" is referring to the case in > > lazy_vacuum() where we set do_index_vacuuming to false because "there > > are almost zero TIDs". I initially thought it was saying that in the > > failsafe vacuum case the pages whose free space we wouldn't record in > > the FSM have little free space anyway -- which I didn't get. But then > > I looked at where we set do_index_vacuuming to false. > > Oh... wow. That's kind of confusing; somehow I was thinking we were > talking about free space on the disk, rather than newly free space in > pages that could be added to the FSM. Perhaps I misunderstood. I interpreted it to refer to the bypass optimization. > And it seems really questionable > whether that case is OK. I mean, in the emergency case, fine, > whatever, we should do whatever it takes to get the system back up, > and it should barely ever happen on a well-configured system. But this > case could happen regularly, and losing track of free space could > easily cause bloat. > > This might be another argument for moving FSM updates to the first > heap pass, but that's a separate task from fixing the comment. Yes, it seems we could miss recording space freed in the first pass if we never end up doing a second pass. consider_bypass_optimization is set to false only if index cleanup is explicitly enabled or there are dead items accumulated for vacuum's second pass at some point. - Melanie
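For readers unfamiliar with the bypass optimization mentioned here, a minimal sketch of that decision follows, written as a standalone C program rather than the actual lazy_vacuum() code; the 2% page threshold and all of the values are assumptions chosen purely for illustration, and the real check involves additional conditions.

    #include <stdbool.h>
    #include <stdio.h>

    int
    main(void)
    {
        /* All values below are made up for illustration. */
        bool        consider_bypass_optimization = true;    /* false if index cleanup was forced on */
        long        rel_pages = 100000;                     /* pages in the relation */
        long        lpdead_item_pages = 50;                 /* pages that accumulated LP_DEAD items */
        bool        do_index_vacuuming = true;

        /*
         * If only a tiny fraction of the pages have dead items ("almost zero
         * TIDs"), skip index vacuuming, which also skips the second heap pass.
         */
        if (consider_bypass_optimization &&
            lpdead_item_pages < (long) (rel_pages * 0.02))
            do_index_vacuuming = false;

        printf("do_index_vacuuming = %s\n", do_index_vacuuming ? "true" : "false");

        /*
         * With do_index_vacuuming now false, any block whose FSM update was
         * deferred to the second pass never gets its free space recorded --
         * the case being discussed above.
         */
        return 0;
    }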
On Thu, Jan 25, 2024 at 11:19 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > Cool. I might add "successfully" or "fully" to "Either way, the page > hasn't been processed yet" I'm OK with that. > > > I think it would be nice to clarify this comment. I think the "when > > > there is little free space anyway" is referring to the case in > > > lazy_vacuum() where we set do_index_vacuuming to false because "there > > > are almost zero TIDs". I initially thought it was saying that in the > > > failsafe vacuum case the pages whose free space we wouldn't record in > > > the FSM have little free space anyway -- which I didn't get. But then > > > I looked at where we set do_index_vacuuming to false. > > > > Oh... wow. That's kind of confusing; somehow I was thinking we were > > talking about free space on the disk, rather than newly free space in > > pages that could be added to the FSM. > > Perhaps I misunderstood. I interpreted it to refer to the bypass optimization. I think you're probably correct. I just didn't realize what was meant. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Jan 25, 2024 at 12:25 PM Robert Haas <robertmhaas@gmail.com> wrote: > I think you're probably correct. I just didn't realize what was meant. I tweaked your v12 based on this discussion and committed the result. Thanks to you for the patches, and to Peter for participating in the discussion which, IMHO, was very helpful in clarifying things. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Jan 26, 2024 at 11:44 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Jan 25, 2024 at 12:25 PM Robert Haas <robertmhaas@gmail.com> wrote: > > I think you're probably correct. I just didn't realize what was meant. > > I tweaked your v12 based on this discussion and committed the result. > > Thanks to you for the patches, and to Peter for participating in the > discussion which, IMHO, was very helpful in clarifying things. Thanks! I've marked the CF entry as committed. - Melanie
On Fri, Jan 26, 2024 at 11:44 AM Robert Haas <robertmhaas@gmail.com> wrote: > Thanks to you for the patches, and to Peter for participating in the > discussion which, IMHO, was very helpful in clarifying things. Glad I could help. -- Peter Geoghegan