From: Andres Freund
Subject: Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
Msg-id: 20211111030821.2xxl75qfkynaga4j@alap3.anarazel.de
In response to: Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-bugs

Hi,

On 2021-11-09 19:07:02 -0800, Peter Geoghegan wrote:
> diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
> index c7331d810..31fa4d2a3 100644
> --- a/src/backend/access/heap/pruneheap.c
> +++ b/src/backend/access/heap/pruneheap.c
> @@ -31,6 +31,7 @@
>  typedef struct
>  {
>      Relation    rel;
> +    BlockNumber targetblkno;

It's a good optimization to remove the redundant BufferGetBlockNumber() calls,
but perhaps it's worth committing that separately?
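
Roughly, the separately committable piece would just be the caching itself; an
untested sketch, reusing the targetblkno field from your patch:

    /* in heap_page_prune(), computed once per page */
    prstate.rel = relation;
    prstate.targetblkno = BufferGetBlockNumber(buffer);

    /*
     * ... and in heap_prune_chain() etc., use the cached value instead of
     * calling BufferGetBlockNumber() again for every tuple:
     */
    ItemPointerSet(&(tup.t_self), prstate->targetblkno, offnum);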


> @@ -256,10 +260,8 @@ heap_page_prune(Relation relation, Buffer buffer,
>           offnum <= maxoff;
>           offnum = OffsetNumberNext(offnum))
>      {
> -        ItemId        itemid;
> -
>          /* Ignore items already processed as part of an earlier chain */
> -        if (prstate.marked[offnum])
> +        if (prstate.fromvalidchain[offnum])
>              continue;
>  
>          /*

ISTM that we should now call heap_prune_chain() only for the start of an actual
chain (i.e. a redirect line pointer or a non-HOT tuple).

I think it'd be good to have a comment above the loop explaining that we're
just iterating over chains starting at a non-HOT element.
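
I.e. something like this for the first loop (just an untested sketch of the
shape I mean, reusing your fromvalidchain array):

    /*
     * Scan the page, processing each HOT chain in full, starting at its
     * root item.  Only redirect line pointers and non-heap-only tuples
     * can be chain roots; heap-only tuples are visited while following
     * the chain from its root, never directly.
     */
    for (offnum = FirstOffsetNumber;
         offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
        ItemId      itemid = PageGetItemId(page, offnum);

        /* Ignore items already processed as part of an earlier chain */
        if (prstate.fromvalidchain[offnum])
            continue;

        /* Heap-only tuples are never chain roots, skip them here */
        if (ItemIdIsNormal(itemid) &&
            HeapTupleHeaderIsHeapOnly((HeapTupleHeader) PageGetItem(page, itemid)))
            continue;

        if (off_loc)
            *off_loc = offnum;

        ndeleted += heap_prune_chain(buffer, offnum, &prstate);
    }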


> @@ -269,15 +271,29 @@ heap_page_prune(Relation relation, Buffer buffer,
>          if (off_loc)
>              *off_loc = offnum;
>  
> -        /* Nothing to do if slot is empty or already dead */
> -        itemid = PageGetItemId(page, offnum);
> -        if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid))
> -            continue;
> -
>          /* Process this item or chain of items */
>          ndeleted += heap_prune_chain(buffer, offnum, &prstate);
>      }

Why did you move this into heap_prune_chain()?


> +    /*
> +     * Scan the page again, processing any heap-only tuples that we now know
> +     * are not part of any valid HOT chain
> +     */
> +    for (offnum = FirstOffsetNumber;
> +         offnum <= maxoff;
> +         offnum = OffsetNumberNext(offnum))
> +    {
> +        /* Ignore items already processed as part of a known valid chain */
> +        if (prstate.fromvalidchain[offnum])
> +            continue;
> +
> +        if (off_loc)
> +            *off_loc = offnum;
> +
> +        /* Process this disconnected heap-only tuple */
> +        ndeleted += heap_prune_heaponly(buffer, offnum, &prstate);
> +    }
> +
>      /* Clear the offset information once we have processed the given page. */
>      if (off_loc)
>          *off_loc = InvalidOffsetNumber;
> @@ -383,19 +399,8 @@ heap_page_prune(Relation relation, Buffer buffer,
>          pgstat_update_heap_dead_tuples(relation, ndeleted - prstate.ndead);
>  
>      /*
> -     * XXX Should we update the FSM information of this page ?
> -     *
> -     * There are two schools of thought here. We may not want to update FSM
> -     * information so that the page is not used for unrelated UPDATEs/INSERTs
> -     * and any free space in this page will remain available for further
> -     * UPDATEs in *this* page, thus improving chances for doing HOT updates.
> -     *
> -     * But for a large table and where a page does not receive further UPDATEs
> -     * for a long time, we might waste this space by not updating the FSM
> -     * information. The relation may get extended and fragmented further.
> -     *
> -     * One possibility is to leave "fillfactor" worth of space in this page
> -     * and update FSM with the remaining space.
> +     * Don't update the FSM information on this page.  We leave it up to our
> +     * caller to decide what to do about that.
>       */

I don't think this should be part of the commit?



>      offnum = rootoffnum;
> +    nchain = 0;

>      /* while not end of the chain */
>      for (;;)
>      {
>          ItemId        lp;
> -        bool        tupdead,
> -                    recent_dead;
> +        HeapTupleHeader htup;
> +        HeapTupleData tup;
> +        bool        tupdead;
>  
>          /* Sanity check (pure paranoia) */
>          if (offnum < FirstOffsetNumber)
> @@ -601,15 +603,11 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
>              break;
>  
>          /* If item is already processed, stop --- it must not be same chain */
> -        if (prstate->marked[offnum])
> +        if (nchain != 0 && prstate->fromvalidchain[offnum])
>              break;

I think we should make it an error to reach the same tuple via multiple paths.
It's not *quite* as trivial as turning this into an assert/error though, as we
can only raise the error after the xmin/xmax chain-membership check has passed.
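
Sketch of the shape that would take, assuming priorXmax is the usual
chain-following local holding the previous tuple's update xid (untested):

    /*
     * Check that the tuple actually belongs to this chain before
     * complaining about a revisit; an xmin/xmax mismatch just means
     * we've reached the end of the chain.
     */
    if (TransactionIdIsValid(priorXmax) &&
        !TransactionIdEquals(HeapTupleHeaderGetXmin(htup), priorXmax))
        break;

    if (nchain != 0 && prstate->fromvalidchain[offnum])
        elog(ERROR, "HOT chain member at offset %u reached a second time",
             offnum);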


>          /*
>           * If we are looking at the redirected root line pointer, jump to the
>           * first normal tuple in the chain.  If we find a redirect somewhere
> @@ -619,42 +617,63 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
>          {
>              if (nchain > 0)
>                  break;            /* not at start of chain */
> +            Assert(rootisredirect);
>              chainitems[nchain++] = offnum;
>              offnum = ItemIdGetRedirect(rootlp);
>              continue;
>          }

Why are we handling this inside the loop?
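
I.e. couldn't the redirect be followed once, before entering the loop?
Untested sketch:

    offnum = rootoffnum;
    nchain = 0;

    /* A redirect can only ever be the chain root, so follow it up front */
    if (ItemIdIsRedirected(rootlp))
    {
        Assert(rootisredirect);
        chainitems[nchain++] = rootoffnum;
        offnum = ItemIdGetRedirect(rootlp);
    }

    /* while not end of the chain */
    for (;;)
    {
        /* ... as before, minus the ItemIdIsRedirected() case ... */
    }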


> -        /*
> -         * Likewise, a dead line pointer can't be part of the chain. (We
> -         * already eliminated the case of dead root tuple outside this
> -         * function.)
> -         */
> -        if (ItemIdIsDead(lp))
> +        /* LP_UNUSED or LP_DEAD items obviously not part of the chain */
> +        if (!ItemIdIsUsed(lp) || ItemIdIsDead(lp))
> +        {
> +            /*
> +             * XXX What if we just came from root item, which is a plain heap
> +             * tuple?  Do we need assert that notices when we reach an LP_DEAD
> +             * or LP_UNUSED item having started from such a root item?
> +             */
> +            Assert(!rootisredirect || nchain > 1);

I don't think we can assert that. It's perfectly possible to have a non-HOT
update chain where the following element can be vacuumed, but the preceding
element can't (e.g. when the updater has an older xid that prevents it from
being pruned).



>          chainitems[nchain++] = offnum;
> +        prstate->fromvalidchain[offnum] = true;
>  
>          /*
>           * Check tuple's visibility status.
>           */
> -        tupdead = recent_dead = false;
> +        tupdead = false;
>  
>          switch (heap_prune_satisfies_vacuum(prstate, &tup, buffer))
>          {
> @@ -663,7 +682,6 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
>                  break;
>  
>              case HEAPTUPLE_RECENTLY_DEAD:
> -                recent_dead = true;
>  
>                  /*
>                   * This tuple may soon become DEAD.  Update the hint field so
> @@ -679,6 +697,7 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
>                   * This tuple may soon become DEAD.  Update the hint field so
>                   * that the page is reconsidered for pruning in future.
>                   */
> +                Assert(!tupdead);
>                  heap_prune_record_prunable(prstate,
>                                             HeapTupleHeaderGetUpdateXid(htup));
>                  break;

> @@ -692,6 +711,7 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
>                   * But we don't.  See related decisions about when to mark the
>                   * page prunable in heapam.c.
>                   */
> +                Assert(!tupdead);
>                  break;
>  
>              default:

I don't understand these new asserts? We just set tupdead to false above, so
they can't ever fire?


> @@ -700,11 +720,18 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
>          }
>  
>          /*
> -         * Remember the last DEAD tuple seen.  We will advance past
> -         * RECENTLY_DEAD tuples just in case there's a DEAD one after them;
> -         * but we can't advance past anything else.  We have to make sure that
> -         * we don't miss any DEAD tuples, since DEAD tuples that still have
> -         * tuple storage after pruning will confuse VACUUM.
> +         * Remember the last DEAD tuple seen from this HOT chain.
> +         *
> +         * We expect to sometimes find a RECENTLY_DEAD tuple after a DEAD
> +         * tuple.  When this happens, the RECENTLY_DEAD tuple will be treated
> +         * as if it was DEAD all along.  Manage that now.

I now actually wonder why this is correct. There's another comment about it:

> We also prune any RECENTLY_DEAD tuples preceding a DEAD tuple.
> * This is OK because a RECENTLY_DEAD tuple preceding a DEAD tuple is really

But that doesn't justify very much.

What prevents the scenario where some other backend has, e.g., a snapshot with
xmin = xmax = the RECENTLY_DEAD row's xid? If the RECENTLY_DEAD row has an xid
that is later than the DEAD row's, it afaict would be perfectly legal to prune
the DEAD row, but *not* the RECENTLY_DEAD one.


> +         * We're not just interested in DEAD and RECENTLY_DEAD tuples.  We
> +         * need to traverse each and every HOT chain exhaustively, in order to
> +         * determine which heap-only tuples are part of a valid HOT chain.
> +         * Heap-only tuples that cannot be located like this must not be part
> +         * of a valid HOT chain.  They are therefore processed during our
> +         * second pass over the page.
>           */
>          if (tupdead)
>          {

This comment can't really be understood without knowing the historical
behaviour of the function.


> +static int
> +heap_prune_heaponly(Buffer buffer, OffsetNumber offnum, PruneState *prstate)
> +{
> +    Page        dp = (Page) BufferGetPage(buffer);
> +    ItemId        lp;
> +    HeapTupleHeader htup;
> +    HeapTupleData tup;
> +    HTSV_Result res;
> +
> +    lp = PageGetItemId(dp, offnum);
> +    Assert(ItemIdIsNormal(lp));
> +    htup = (HeapTupleHeader) PageGetItem(dp, lp);
> +
> +    /*
> +     * Caller must make sure that the tuple at 'offnum' is in fact a heap-only
> +     * tuple that is disconnected from its HOT chain (i.e. isn't reachable
> +     * through a HOT traversal that starts at any plausible-looking root item
> +     * on the page).
> +     */
> +    Assert(!prstate->fromvalidchain[offnum]);
> +    Assert(HeapTupleHeaderIsHeapOnly(htup));
> +
> +    /*
> +     * We expect that disconnected heap-only tuples must be from aborted
> +     * transactions.  Any RECENTLY_DEAD tuples we see here are really DEAD,
> +     * but the heap_prune_satisfies_vacuum test is too coarse to detect it.
> +     */
> +    tup.t_len = ItemIdGetLength(lp);
> +    tup.t_tableOid = RelationGetRelid(prstate->rel);
> +    tup.t_data = htup;
> +    ItemPointerSet(&(tup.t_self), prstate->targetblkno, offnum);
> +    res = heap_prune_satisfies_vacuum(prstate, &tup, buffer);
> +    if (res == HEAPTUPLE_DEAD || res == HEAPTUPLE_RECENTLY_DEAD)
> +    {
> +        heap_prune_record_unused(prstate, offnum);
> +        HeapTupleHeaderAdvanceLatestRemovedXid(htup,
> +                                               &prstate->latestRemovedXid);
> +    }
> +    else
> +        Assert(false);
> +
> +    /*
> +     * Should always be DEAD.  A DEAD heap-only tuple is always counted in
> +     * top-level ndeleted counter for pruning operation.
> +     */
> +    return 1;
> +}

It seems weird to have the if (res == HEAPTUPLE_DEAD ..) branch, but then to
unconditionally return 1.

I'm not actually sure the Assert is unreachable. I can imagine cases where
we'd see e.g. DELETE/INSERT_IN_PROGRESS due to a concurrent subtransaction
abort or such.
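
I.e. make the return value depend on which branch we took, and handle the
possibly-reachable results explicitly instead of Assert(false). Untested
sketch:

    res = heap_prune_satisfies_vacuum(prstate, &tup, buffer);
    if (res == HEAPTUPLE_DEAD || res == HEAPTUPLE_RECENTLY_DEAD)
    {
        heap_prune_record_unused(prstate, offnum);
        HeapTupleHeaderAdvanceLatestRemovedXid(htup,
                                               &prstate->latestRemovedXid);
        return 1;
    }

    /*
     * E.g. DELETE_IN_PROGRESS / INSERT_IN_PROGRESS, reachable after a
     * concurrent subtransaction abort; we didn't delete anything, so
     * don't count the tuple.
     */
    return 0;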


Greetings,

Andres Freund



pgsql-bugs by date:

Previous
From: Michael Paquier
Date:
Subject: Re: BUG #17280: global-buffer-overflow on select from pg_stat_slru
Next
From: Andres Freund
Date:
Subject: Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum