Re: Combine Prune and Freeze records emitted by vacuum - Mailing list pgsql-hackers

From Melanie Plageman
Subject Re: Combine Prune and Freeze records emitted by vacuum
Date
Msg-id 20240328015326.x5gnzsohl6j23b42@liskov
In response to Re: Combine Prune and Freeze records emitted by vacuum  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: Combine Prune and Freeze records emitted by vacuum
List pgsql-hackers
On Thu, Mar 28, 2024 at 01:04:04AM +0200, Heikki Linnakangas wrote:
> On 27/03/2024 20:26, Melanie Plageman wrote:
> > On Wed, Mar 27, 2024 at 12:18 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > > 
> > > On 27/03/2024 17:18, Melanie Plageman wrote:
> > > > I need some way to modify the control flow or accounting such that I
> > > > know which HEAPTUPLE_RECENTLY_DEAD tuples will not be marked LP_DEAD.
> > > > And a way to consider freezing and do live tuple accounting for these
> > > > and HEAPTUPLE_LIVE tuples exactly once.
> > > 
> > > Just a quick update: I've been massaging this some more today, and I
> > > think I'm onto something palatable. I'll send an updated patch later
> > > today, but the key is to note that for each item on the page, there is
> > > one point where we determine the fate of the item, whether it's pruned
> > > or not. That can happen at different points in heap_page_prune().
> > > That's also when we set marked[offnum] = true. Whenever that happens, we
> > > always call one of the heap_page_prune_record_*() subroutines. We already
> > > have those subroutines for when a tuple is marked as dead or unused, but
> > > let's add similar subroutines for the case that we're leaving the tuple
> > > unchanged. If we move all the bookkeeping logic to those subroutines, we
> > > can ensure that it gets done exactly once for each tuple, and at that
> > > point we know what we are going to do to the tuple, so we can count it
> > > correctly. So heap_prune_chain() decides what to do with each tuple, and
> > > ensures that each tuple is marked only once, and the subroutines update
> > > all the variables, add the item to the correct arrays etc. depending on
> > > what we're doing with it.
> > 
> > Yes, this would be ideal.
> 
> Well, that took me a lot longer than expected. My approach of "make sure you
> call the right heap_prune_record_*() subroutine in all cases" didn't work out
> quite as easily as I thought. Because, as you pointed out, it's difficult to
> know if a non-DEAD tuple that is part of a HOT chain will be visited later
> as part of the chain processing, or needs to be counted at the top of
> heap_prune_chain().
> 
> The solution I came up with is to add a third phase to pruning. At the top
> of heap_prune_chain(), if we see a live heap-only tuple, and we're not sure
> if it will be counted later as part of a HOT chain, we stash it away and
> revisit it later, after processing all the hot chains. That's somewhat
> similar to your 'counted' array, but not quite.

I like this idea (the third phase). I've just started reviewing this, but I
thought I would share the initial thoughts I had inline; a rough sketch of
how I'm reading the revisit pass is below.
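
Just to check my understanding (sketch only, not the patch itself; I'm
guessing at the exact record call, and everything beyond the
'revisit'/'nrevisit' fields is made up):

    /* While walking the page: a live heap-only tuple whose chain we have
     * not followed yet gets stashed instead of being counted right away. */
    if (HeapTupleHeaderIsHeapOnly(htup) && !prstate->marked[offnum])
        prstate->revisit[prstate->nrevisit++] = offnum;

    /* After all HOT chains have been processed: count whatever is still
     * unmarked, so each tuple is recorded exactly once. */
    for (int i = 0; i < prstate->nrevisit; i++)
    {
        OffsetNumber off = prstate->revisit[i];

        if (!prstate->marked[off])
            heap_prune_record_live_or_recently_dead(page, prstate, off);
    }

If that's about right, the exactly-once invariant falls out nicely.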

> One change with this is that live_tuples and many of the other fields are
> now again updated, even if the caller doesn't need them. It was hard to skip
> them in a way that would save any cycles, with the other refactorings.

I am worried we are writing checks that are going to have to come out of
SELECT queries' bank accounts, but I'll do some benchmarking when we're
all done with major refactoring.

> From 2f38628373ccfb6e8f8fd883955056030092569d Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date: Thu, 28 Mar 2024 00:16:09 +0200
> Subject: [PATCH v8 21/22] Add comment about a pre-existing issue
> 
> Not sure if we want to keep this, but I wanted to document it for
> discussion at least.
> ---
>  src/backend/access/heap/pruneheap.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
> index e37ba655a7d..2b720ab6aa1 100644
> --- a/src/backend/access/heap/pruneheap.c
> +++ b/src/backend/access/heap/pruneheap.c
> @@ -792,6 +792,23 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum,
>               * Note that we might first arrive at a dead heap-only tuple
>               * either here or while following a chain below.  Whichever path
>               * gets there first will mark the tuple unused.
> +             *
> +             * FIXME: The next paragraph isn't new with these patches, but
> +             * just something I realized while looking at this. But maybe we should
> +             * add a comment like this? Or is it too much detail?

I think a comment is a good idea.

> +             *
> +             * Whether we arrive at the dead HOT tuple first here or while
> +             * following a chain below affects whether preceding RECENTLY_DEAD
> +             * tuples in the chain can be removed or not.  Imagine that you
> +             * have a chain with two tuples: RECENTLY_DEAD -> DEAD.  If we
> +             * reach the RECENTLY_DEAD tuple first, the chain-following logic
> +             * will find the DEAD tuple and conclude that both tuples are in
> +             * fact dead and can be removed.  But if we reach the DEAD tuple
> +             * at the end of the chain first, when we reach the RECENTLY_DEAD
> +             * tuple later, we will not follow the chain because the DEAD
> +             * TUPLE is already 'marked', and will not remove the
> +             * RECENTLY_DEAD tuple.  This is not a correctness issue, and the
> +             * RECENTLY_DEAD tuple will be removed by a later VACUUM.
>               */
>              if (!HeapTupleHeaderIsHotUpdated(htup))

Is this intentional? That is, would it be correct to remove the
RECENTLY_DEAD tuple during the current vacuum?

> From c2ee7508456d0e76de985f9997a6840450e342a8 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date: Thu, 28 Mar 2024 00:45:26 +0200
> Subject: [PATCH v8 22/22] WIP
> 
> - Got rid of all_visible_except_removable again. We're back to
>   roughly the same mechanism as on 'master', where the all_visible
>   doesn't include LP_DEAD items, but at the end of
>   heap_page_prune_and_freeze() when we return to the caller, we clear
>   it if there were any LP_DEAD items. I considered calling the
>   variable 'all_visible_except_lp_dead', which would be more accurate,
>   but it's also very long.

It's not longer than all_visible_except_removable. I would be happy to keep
the name more exact, but I'm also okay with just all_visible.

> - I duplicated all the fields from PruneFreezeResult to PruneState. Now
>   heap_prune_chain() and all the subroutines don't need the
>   PruneFreezeResult argument, and you don't need to remember which
>   struct each field is kept in. It's all now in PruneState, and the
>   fields that the caller is interested in are copied to
>   PruneFreezeResult at the end of heap_page_prune_and_freeze()

Yea, this makes sense to me. It makes me wonder whether we shouldn't just
have PruneFreezeResult->live_tuples/recently_dead_tuples/etc. be pointers,
so that lazy_scan_prune() can pass the actual vacrel->live_tuples counter
and heap_page_prune_and_freeze() can increment it itself; a rough sketch of
what I mean is below. Maybe that's weird, though.
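
Completely untested, and the pointer fields/types here are made up just to
illustrate the idea:

    typedef struct PruneFreezeResult
    {
        /* ... existing fields ... */

        /* Caller-owned counters; pruning increments these directly. */
        int64      *live_tuples;
        int64      *recently_dead_tuples;
    } PruneFreezeResult;

    /* in lazy_scan_prune(), before calling heap_page_prune_and_freeze(): */
    presult.live_tuples = &vacrel->live_tuples;
    presult.recently_dead_tuples = &vacrel->recently_dead_tuples;

    /* and heap_page_prune_and_freeze() would then do
     *     (*presult->live_tuples)++;
     * instead of copying the totals over at the end. */

But as I said, handing pointers into vacrel like that might be more
confusing than it's worth.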

> - Replaced the 'counted' array with 'revisit' array. I thought I could
>   get rid of it altogether, by just being careful to call the right
>   heap_prune_record_*() subroutine for each tuple in heap_prune_chain(),
>   but with live and recently-dead tuples that are part of a HOT chain,
>   we might visit the tuple as part of the HOT chain or not, depending
>   on what its position in the chain is. So I invented a new revisit
>   phase. All live heap-only tuples that we find, that haven't already
>   been processed as part of a hot chain, are stashed away in the
>   'revisit' array. After processing all the HOT chains, the 'revisit'
>   tuples are re-checked, and counted if they haven't already been counted.

Makes sense.

> - Live tuples are now also marked in the 'marked' array, when they are
>   counted. This gives a nice invariant: all tuples must be marked
>   exactly once, as part of a hot chain or otherwise. Added an
>   assertion for that.

This is a nice thing to have; below is roughly the kind of check I had
pictured.
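
Sketch only; and since marked[] is a bool, the "exactly once" half would
presumably be asserted in the record subroutines themselves, by asserting
the item isn't already marked when it is set:

    #ifdef USE_ASSERT_CHECKING
        /* After all pruning decisions have been made: every normal
         * (storage-bearing) item on the page must have been recorded. */
        for (OffsetNumber off = FirstOffsetNumber;
             off <= PageGetMaxOffsetNumber(page);
             off = OffsetNumberNext(off))
        {
            ItemId      itemid = PageGetItemId(page, off);

            if (ItemIdIsNormal(itemid))
                Assert(prstate.marked[off]);
        }
    #endif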

> ---
>  src/backend/access/heap/pruneheap.c  | 706 +++++++++++++++++----------
>  src/backend/access/heap/vacuumlazy.c |   6 +-
>  src/include/access/heapam.h          |  37 +-
>  3 files changed, 464 insertions(+), 285 deletions(-)
> 
> diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
> index 2b720ab6aa1..a8ed11a1858 100644
> --- a/src/backend/access/heap/pruneheap.c
> +++ b/src/backend/access/heap/pruneheap.c
>  
> -static void heap_prune_record_live_or_recently_dead(Page page, PruneState *prstate,
> -                                                    OffsetNumber offnum, PruneFreezeResult *presult);
>  static void page_verify_redirects(Page page);
>  
>  
> @@ -242,6 +284,8 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
>   * vistest is used to distinguish whether tuples are DEAD or RECENTLY_DEAD
>   * (see heap_prune_satisfies_vacuum).
>   *

What's this "cutoffs TODO"?

> + * cutoffs TODO
> + *
>   * presult contains output parameters needed by callers such as the number of
>   * tuples removed and the number of line pointers newly marked LP_DEAD.
>   * heap_page_prune_and_freeze() is responsible for initializing it.
> @@ -326,70 +370,63 @@ heap_page_prune_and_freeze(Relation relation, Buffer buffer,
>      prstate.latest_xid_removed = InvalidTransactionId;
>      prstate.nredirected = prstate.ndead = prstate.nunused = prstate.nfrozen = 0;
>      memset(prstate.marked, 0, sizeof(prstate.marked));
> -    memset(prstate.counted, 0, sizeof(prstate.counted));
> +    prstate.nrevisit = 0;
>  

>          if (presult->nfrozen > 0)
> @@ -728,10 +832,12 @@ htsv_get_valid_status(int status)
>   * DEAD, our visibility test is just too coarse to detect it.
>   *
>   * In general, pruning must never leave behind a DEAD tuple that still has
> - * tuple storage.  VACUUM isn't prepared to deal with that case.  That's why
> + * tuple storage.  VACUUM isn't prepared to deal with that case (FIXME: it no longer cares, right?).
> + * That's why
>   * VACUUM prunes the same heap page a second time (without dropping its lock
>   * in the interim) when it sees a newly DEAD tuple that we initially saw as
> - * in-progress.  Retrying pruning like this can only happen when an inserting
> + * in-progress (FIXME: Really? Does it still do that?).

Yea, I think all that is no longer true. I missed this comment back
then.

> + * Retrying pruning like this can only happen when an inserting
>   * transaction concurrently aborts.
>   *
>   * The root line pointer is redirected to the tuple immediately after the
> @@ -743,15 +849,18 @@ htsv_get_valid_status(int status)
>   * prstate showing the changes to be made.  Items to be redirected are added
>   * to the redirected[] array (two entries per redirection); items to be set to
>   * LP_DEAD state are added to nowdead[]; and items to be set to LP_UNUSED
> - * state are added to nowunused[].
> - *
> - * Returns the number of tuples (to be) deleted from the page.
> + * state are added to nowunused[].  We perform bookkeeping of live tuples,
> + * visibility etc. based on what the page will look like after the changes
> + * applied.  All that bookkeeping is performed in the heap_prune_record_*()
> + * subroutines.  The division of labor is that heap_prune_chain() decides the
> + * fate of each tuple, ie. whether it's going to be removed, redirected or
> + * left unchanged, and the heap_prune_record_*() subroutines update PruneState
> + * based on that outcome.
>   */
> -static int
> +static void
>  heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum,
> -                 PruneState *prstate, PruneFreezeResult *presult)
> +                 PruneState *prstate)
>  {
> -    int            ndeleted = 0;
>      Page        dp = (Page) BufferGetPage(buffer);
>      TransactionId priorXmax = InvalidTransactionId;
>      ItemId        rootlp;
> @@ -794,8 +903,8 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum,
>               * gets there first will mark the tuple unused.
>               *
>               * FIXME: The next paragraph isn't new with these patches, but
> -             * just something I realized while looking at this. But maybe we should
> -             * add a comment like this? Or is it too much detail?
> +             * just something I realized while looking at this. But maybe we
> +             * should add a comment like this? Or is it too much detail?

I don't think it is too much detail.

>               *
>               * Whether we arrive at the dead HOT tuple first here or while
>               * following a chain below affects whether preceding RECENTLY_DEAD
> @@ -809,43 +918,52 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum,
>               * TUPLE is already 'marked', and will not remove the
>               * RECENTLY_DEAD tuple.  This is not a correctness issue, and the
>               * RECENTLY_DEAD tuple will be removed by a later VACUUM.
> +             *
> +             * FIXME: Now that we have the 'revisit' array, we could stash
> +             * these DEAD items there too, instead of processing them here
> +             * immediately.  That way, DEAD tuples that are still part of a
> +             * chain would always get processed as part of the chain.
>               */

I really like this idea!

>              if (!HeapTupleHeaderIsHotUpdated(htup))
>              {
>                  if (prstate->htsv[rootoffnum] == HEAPTUPLE_DEAD)
>                  {
> -                    heap_prune_record_unused(prstate, rootoffnum);
> +                    heap_prune_record_unused(prstate, rootoffnum, true);
>                      HeapTupleHeaderAdvanceConflictHorizon(htup,
>                                                            &prstate->latest_xid_removed);
> -                    ndeleted++;
> -                }

I think the pruning code could really use more comments with examples like
this, ones that walk through a series of steps. Not least because you can't
see RECENTLY_DEAD in pageinspect (you'd have to work out that kind of status
yourself from the tuples on the page).

For example, I hadn't thought of this one:

REDIRECT -> RECENTLY_DEAD -> DEAD -> RECENTLY_DEAD -> DEAD

> +    /*----
> +     * FIXME: this helped me to visualize how different chains might look like
> +     * here. It's not an exhaustive list, just some examples to help with
> +     * thinking. Remove this comment from final version, or refine.
> +     *
> +     * REDIRECT -> LIVE (stop) -> ...
> +     * REDIRECT -> RECENTY_DEAD -> LIVE (stop) -> ...
> +     * REDIRECT -> RECENTY_DEAD -> RECENTLY_DEAD
> +     * REDIRECT -> RECENTY_DEAD -> DEAD
> +     * REDIRECT -> RECENTY_DEAD -> DEAD -> RECENTLY_DEAD -> DEAD
> +     * RECENTLY_DEAD -> ...
> +     */
> +
>      /* while not end of the chain */
>      for (;;)
>      {
> @@ -897,19 +1015,12 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum,
>  
>              case HEAPTUPLE_RECENTLY_DEAD:
>                  recent_dead = true;
> -
> -                /*
> -                 * This tuple may soon become DEAD.  Update the hint field so
> -                 * that the page is reconsidered for pruning in future.
> -                 */
> -                heap_prune_record_prunable(prstate,
> -                                           HeapTupleHeaderGetUpdateXid(htup));
>                  break;
>  
>              case HEAPTUPLE_DELETE_IN_PROGRESS:
> -
> -                /*
> -                 * This tuple may soon become DEAD.  Update the hint field so
> -                 * that the page is reconsidered for pruning in future.
> -                 */
> -                heap_prune_record_prunable(prstate,
> -                                           HeapTupleHeaderGetUpdateXid(htup));
> -                break;
> -
>              case HEAPTUPLE_LIVE:
>              case HEAPTUPLE_INSERT_IN_PROGRESS:
> -
> -                /*
> -                 * If we wanted to optimize for aborts, we might consider
> -                 * marking the page prunable when we see INSERT_IN_PROGRESS.
> -                 * But we don't.  See related decisions about when to mark the
> -                 * page prunable in heapam.c.
> -                 */
>                  break;
>  
>              default:
> @@ -981,7 +1069,18 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum,
>           * RECENTLY_DEAD tuples just in case there's a DEAD one after them;
>           * but we can't advance past anything else.  We have to make sure that
>           * we don't miss any DEAD tuples, since DEAD tuples that still have
> -         * tuple storage after pruning will confuse VACUUM.
> +         * tuple storage after pruning will confuse VACUUM (FIXME: not anymore
> +         * I think?).

Meaning, it won't confuse vacuum anymore or there won't be DEAD tuples
with storage after pruning anymore?

> +         *
> +         * FIXME: Not a new issue, but spotted it now : If there is a chain
> +         * like RECENTLY_DEAD -> DEAD, we will remove both tuples, but will
> +         * not call HeapTupleHeaderAdvanceConflictHorizon() for the
> +         * RECENTLY_DEAD tuple. Is that OK?  I think it is.  In a HOT chain,
> +         * we know that the later tuple committed before any earlier tuples in
> +         * the chain, therefore it ought to be enough to set the conflict
> +         * horizon based on the later tuple. If all snapshots on the standby
> +         * see the deleter of the last tuple as committed, they must consider
> +         * all the earlier ones as committed too.
>           */

This makes sense to me (that if all the snapshots on the standby see the
deleter of the last tuple as committed, then they consider the earlier ones
deleted too). Probably wouldn't hurt to spell that out in the comment,
though; maybe something like the wording below.
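
Just a suggested wording, sticking close to what you wrote above:

    /*
     * Note that we do not call HeapTupleHeaderAdvanceConflictHorizon() for
     * the RECENTLY_DEAD tuples earlier in the chain.  In a HOT chain, the
     * deleter of an earlier tuple is the inserter of the next one, and that
     * inserter must have committed no later than the next tuple's own
     * deleter.  So if all snapshots on the standby see the deleter of the
     * last tuple as committed, they must see the deleters of all the
     * earlier tuples as committed too, and the conflict horizon taken from
     * the last tuple covers the whole chain.
     */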

> @@ -1206,15 +1318,6 @@ heap_prune_record_live_or_recently_dead(Page page, PruneState *prstate, OffsetNu
>       * can't run inside a transaction block, which makes some cases impossible
>       * (e.g. in-progress insert from the same transaction).
>       *
> -     * We treat LP_DEAD items (which are the closest thing to DEAD tuples that
> -     * might be seen here) differently, too: we assume that they'll become
> -     * LP_UNUSED before VACUUM finishes.  This difference is only superficial.
> -     * VACUUM effectively agrees with ANALYZE about DEAD items, in the end.
> -     * VACUUM won't remember LP_DEAD items, but only because they're not
> -     * supposed to be left behind when it is done. (Cases where we bypass
> -     * index vacuuming will violate this optimistic assumption, but the
> -     * overall impact of that should be negligible.)
> -     *
>       * HEAPTUPLE_LIVE tuples are naturally counted as live. This is also what
>       * acquire_sample_rows() does.
>       *
> @@ -1224,10 +1327,21 @@ heap_prune_record_live_or_recently_dead(Page page, PruneState *prstate, OffsetNu
>       * ensure the math works out. The assumption that the transaction will
>       * commit and update the counters after we report is a bit shaky; but it
>       * is what acquire_sample_rows() does, so we do the same to be consistent.
> +     *
> +     * HEAPTUPLE_DEAD are handled by the other heap_prune_record_*()
> +     * subroutines.  They don't count dead items like acquire_sample_rows()
> +     * does, because we assume that all dead items will become LP_UNUSED
> +     * before VACUUM finishes.  This difference is only superficial.  VACUUM
> +     * effectively agrees with ANALYZE about DEAD items, in the end.  VACUUM
> +     * won't remember LP_DEAD items, but only because they're not supposed to
> +     * be left behind when it is done. (Cases where we bypass index vacuuming
> +     * will violate this optimistic assumption, but the overall impact of that
> +     * should be negligible.) FIXME: I don't understand that last sentence in
> +     * parens. It was copied from elsewhere.

If we bypass index vacuuming, there will be LP_DEAD items left behind when
we are done, because we skipped index vacuuming and the subsequent reaping
of those dead items. All of these comments are kind of copypasta, though.

>       */
>      htup = (HeapTupleHeader) PageGetItem(page, PageGetItemId(page, offnum));
>  
> -    switch (status)

Okay, that's all for now. I'll do more in-depth review tomorrow.

- Melanie


