Thread: Hardening heap pruning code (was: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum)
Hardening heap pruning code (was: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum)
From: Peter Geoghegan
Date:
On Wed, Mar 9, 2022 at 4:46 PM Andres Freund <andres@anarazel.de> wrote:
> On 2022-03-03 19:31:32 -0800, Peter Geoghegan wrote:
> > Attached is a new revision of my fix. This is more or less a
> > combination of my v4 fix from November 12 [1] and Andres'
> > already-committed fix (commit 18b87b20), rebased on top of HEAD. This
> > patch was originally a bugfix, but now it's technically just
> > refactoring/hardening of the logic in pruneheap.c. It hasn't changed
> > all that much, though.
>
> Perhaps worth discussing outside of -bugs?

Okay. Replying on a new -hackers thread dedicated to the hardening patch.

Attached is v6, which goes a bit further than v5 in using local state that we build up-front to describe the state of the page being pruned (rather than rereading the page itself). I'll address your v5 review comments now.

> > We now do "3 passes" over the page. The first is the aforementioned
> > "precalculate HTSV" pass, the second is for determining the extent of
> > HOT chains, and the third is for any remaining disconnected/orphaned
> > HOT chains. I suppose that that's okay, since the amount of work has
> > hardly increased in proportion to this "extra pass over the page". Not
> > 100% sure about everything myself right now, though. I guess that "3
> > passes" might be considered excessive.
>
> We should be able to optimize away the third pass in the majority of cases, by
> keeping track of the number of tuples visited or such. That seems like it
> might be worth doing?

It independently occurred to me to go further with using the state from the first pass to save work in the second and third passes. We still have what looks like a third pass over the page in v6, but it doesn't really work that way. That is, we only need to rely on local state that describes the page, which is conveniently available already. We don't need to look at the physical page in the so-called third pass.

> > +   /*
> > +    * Start from the root item. Mark it as valid up front, since root items
> > +    * are always processed here (not as disconnected tuples in third pass
> > +    * over page).
> > +    */
> > +   prstate->visited[rootoffnum] = true;
> >     offnum = rootoffnum;
> > +   nchain = 0;
>
> I wonder if it'd be clearer if we dealt with redirects outside of the
> loop.

I found it more natural to deal with everything outside of the entire function (the heap_prune_chain function, which I've renamed to heap_prune_from_root in v6). This is kind of what you suggested anyway, since the function itself is stripped down to just the loop in v6.

> Would make it easier to assert that the target of a redirect may not be
> unused / !heap-only?

With the approach in v6 we always eliminate LP_DEAD and LP_UNUSED items up-front (by considering them "visited" in the first pass over the page). We're also able to eliminate not-heap-only tuples quickly, since whether or not each item is a heap-only tuple is also recorded in the first pass (the same pass that gets the HTSV status of each item).

It seemed natural to give more responsibility to the caller, heap_page_prune, which is what this really amounts to. This continues the trend you started in bugfix commit 18b87b20, which added the initial loop for the HTSV calls. In v6 it becomes heap_page_prune's responsibility to only call heap_prune_from_root with an item that is either an LP_REDIRECT item or a plain heap tuple (not a heap-only tuple). The new bookkeeping state gathered in our first pass over the page makes this easy.

I'm not entirely sure that it makes sense to go this far.
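To make the division of labor concrete, the caller ends up shaped roughly like this. This is only a heavily simplified sketch of the structure, not the exact v6 code (it leaves out the prunable-XID bookkeeping and actually executing the prune actions, and the details of the real patch differ); "visited" and "heaponly" are the new per-item arrays in PruneState:

    /* First pass: build local state describing every item on the page */
    for (offnum = maxoff; offnum >= FirstOffsetNumber;
         offnum = OffsetNumberPrev(offnum))
    {
        ItemId          itemid = PageGetItemId(page, offnum);
        HeapTupleHeader htup;

        if (!ItemIdIsNormal(itemid))
        {
            /*
             * LP_DEAD/LP_UNUSED items never need to be processed again;
             * LP_REDIRECT items need no precalculated state of their own.
             */
            if (!ItemIdIsRedirected(itemid))
                prstate.visited[offnum] = true;
            continue;
        }

        htup = (HeapTupleHeader) PageGetItem(page, itemid);
        prstate.heaponly[offnum] = HeapTupleHeaderIsHeapOnly(htup);

        tup.t_data = htup;
        tup.t_len = ItemIdGetLength(itemid);
        ItemPointerSet(&(tup.t_self), blockno, offnum);
        prstate.htsv[offnum] = heap_prune_satisfies_vacuum(&prstate, &tup, buffer);
    }

    /* Second pass: process each valid HOT chain, starting from its root item */
    for (offnum = FirstOffsetNumber; offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
        /* Roots are LP_REDIRECT items and plain (not heap-only) tuples */
        if (prstate.visited[offnum] || prstate.heaponly[offnum])
            continue;

        ndeleted += heap_prune_from_root(&prstate, offnum);
    }

    /*
     * "Third pass": anything still not visited must be a disconnected
     * heap-only tuple -- decided from prstate alone, without reading the
     * physical page again.
     */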
We do need an extra array of booleans to make this work (the new "heaponly[]" array in PruneState), which will need to be stored on the stack. What do you think of that aspect?

> > +           /*
> > +            * Remember the offnum of the last DEAD tuple in this HOT
> > +            * chain. To keep things simple, don't treat heap-only tuples
> > +            * from a HOT chain as DEAD unless they're only preceded by
> > +            * other DEAD tuples (in addition to actually being DEAD).
>
> s/actually/themselves/?

Fixed.

> > +            * Remaining tuples that appear DEAD (but don't get treated as
> > +            * such by us) are from concurrently aborting updaters.
>
> I don't understand this bit. A DEAD tuple following a RECENTLY_DEAD one won't
> be removed now, and doesn't need to involve a concurrent abort? Are you
> thinking of "remaining" as the tuples not referenced in the previous sentences?

This was just brainfade on my part. I think that I messed it up during rebasing.

> > +            * VACUUM will ask us to prune the heap page again when it
> > +            * sees that there is a DEAD tuple left behind, but that would
> > +            * be necessary regardless of our approach here.
> > +            */
>
> Only as long as we do another set of HTSV calls...

Removed.

> >             case HEAPTUPLE_LIVE:
> >             case HEAPTUPLE_INSERT_IN_PROGRESS:
> > +                   pastlatestdead = true;  /* no further DEAD tuples in CHAIN */
>
> If we don't do anything to the following tuples, why don't we just abort here?
> I assume it is because we'd then treat them as disconnected? That should
> probably be called out...

Good point. Fixed.

> > -   }
> > -   else if (nchain < 2 && ItemIdIsRedirected(rootlp))
> > -   {
> > -           /*
> > -            * We found a redirect item that doesn't point to a valid follow-on
> > -            * item. This can happen if the loop in heap_page_prune caused us to
> > -            * visit the dead successor of a redirect item before visiting the
> > -            * redirect item. We can clean up by setting the redirect item to
> > -            * DEAD state.
> > -            */
> > -           heap_prune_record_dead(prstate, rootoffnum);
> > +
> > +   return ndeleted;
> >     }
>
> Could there be such tuples from before a pg_upgrade? Do we need to deal with
> them somehow?

BTW, this code (which I now propose to more or less remove) predates commit 0a469c87, which removed old-style VACUUM FULL back in 2010. Prior to that commit there was an extra block that followed this one -- "else if (redirect_move && ItemIdIsRedirected(rootlp)) { ... }". Once you consider that old-style VACUUM FULL once had to rewrite LP_REDIRECT items, this code structure makes a bit more sense.

Another possibly-relevant piece of historic context about this hunk of code: it was originally part of a critical section -- the restructuring that added heap_page_prune_execute() came later, in commit 6f10eb2111.

Anyway, to answer your question: I don't believe that there will be any such pre-pg_upgrade tuples. But let's assume that I'm wrong about that; what can be done about it now? We're talking about a root LP_REDIRECT item that points to some item that just doesn't appear sane for it to point to (e.g. maybe it points past the end of the line pointer array, or to a not-heap-only tuple). If we are ever able to even *detect* such corruption using tools like amcheck, then that's just because we got lucky with the details. The item is irredeemably corrupt, no matter what we do.

To be clear, I'm not saying that you're wrong -- maybe I do need more handling for this case. In any case I haven't quite figured out where to draw the line with visibly corrupt HOT chains, even in v6.
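If we did decide that the case needs explicit handling, I imagine it would look more like a defensive check than like the old "mark the redirect dead" behavior. Something along these lines, say -- though to be clear this is only a rough illustration of the idea, not code from the v6 patch (page, maxoff and rootlp are the usual local variables in the pruning code):

    if (ItemIdIsRedirected(rootlp))
    {
        OffsetNumber tgtoffnum = ItemIdGetRedirect(rootlp);

        /*
         * A redirect had better point to a normal, heap-only tuple somewhere
         * within the line pointer array.  Anything else means the page is
         * corrupt.
         */
        if (tgtoffnum < FirstOffsetNumber || tgtoffnum > maxoff ||
            !ItemIdIsNormal(PageGetItemId(page, tgtoffnum)) ||
            !prstate->heaponly[tgtoffnum])
            ereport(ERROR,
                    (errcode(ERRCODE_DATA_CORRUPTED),
                     errmsg("invalid redirect line pointer at offset %u",
                            (unsigned) rootoffnum)));
    }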
All I'm saying is that the old code doesn't seem to tell us anything about what I ought to replace it with now. It tells us nothing about what should be possible with LP_REDIRECT items in general, with pg_upgrade. The structure doesn't even suggest that somebody once believed that it was acceptable for an existing LP_REDIRECT item to redirect to something other than a heap-only tuple. And even if it did suggest that, it would still be a wildly unreasonable thing for anybody to have ever believed IMV. As I said, the item has to be considered corrupt, no matter what might have been possible in the past.

--
Peter Geoghegan
Re: Hardening heap pruning code (was: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum)
From: Matthias van de Meent
Date:
On Sun, 20 Mar 2022 at 04:48, Peter Geoghegan <pg@bowt.ie> wrote:
>
> Attached is v6, which goes a bit further than v5 in using local state
> that we build up-front to describe the state of the page being pruned
> (rather than rereading the page itself).

I didn't test the code, so these comments reflect my general feel for this patch, based on visual analysis only.

> > > We now do "3 passes" over the page. The first is the aforementioned
> > > "precalculate HTSV" pass, the second is for determining the extent of
> > > HOT chains, and the third is for any remaining disconnected/orphaned
> > > HOT chains. I suppose that that's okay, since the amount of work has
> > > hardly increased in proportion to this "extra pass over the page". Not
> > > 100% sure about everything myself right now, though. I guess that "3
> > > passes" might be considered excessive.

The passes don't all have a very clear explanation of what they do, or of what they leave for the next pass to process. That can be distilled from the code comments at the respective phases and in the various functions, but only after careful review of all of those comments; there doesn't seem to be one clear overview.

> @@ -295,30 +305,25 @@ heap_page_prune(Relation relation, Buffer buffer,
> [...]
> +  * prstate for later passes. Scan the page backwards (in reverse item
> +  * offset number order).
> - [...]
> +  * This approach is good for performance. Most commonly tuples within a
> +  * page are stored at decreasing offsets (while the items are stored at
> +  * increasing offsets).

A reference to why this is commonly the case (i.e. PageRepairFragmentation, compactify_tuples, natural insertion order) would probably help here, since it is not obvious at first sight that this ordering is the common one.

> @@ -350,28 +370,41 @@ heap_page_prune(Relation relation, Buffer buffer,
> +     /* Now scan the page a second time to process each root item */

This second pass also processes the HOT chain of each root item, but that is not clear from the comment on the loop. I'd suggest a comment more along these lines:

    /*
     * Now scan the page a second time to process each valid HOT chain;
     * i.e. each non-HOT tuple or redirect line pointer and the HOT tuples
     * in the trailing chain, if any.
     *
     * heap_prune_from_root marks all items in the HOT chain as visited,
     * so that phase 3 knows to skip those items.
     */

Apart from my requests for extra clarity in the comments, I think this materially improves pruning, so thanks for working on this.

-Matthias
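For reference, the reason that ascending line pointer offsets usually go together with descending tuple data offsets is the way bufpage.c places tuple data: each newly added tuple goes just below the previously stored one, at the top of the remaining free space, and page compaction (PageRepairFragmentation/compactify_tuples) keeps tuples in the same relative physical order. Roughly, in the spirit of PageAddItemExtended() rather than a verbatim extract:

    /* Each new item's data is placed below all previously stored tuple data */
    upper = (int) phdr->pd_upper - (int) alignedSize;

    itemId = PageGetItemId(page, offsetNumber);
    ItemIdSetNormal(itemId, upper, size);
    memcpy((char *) page + upper, item, size);

    phdr->pd_upper = (LocationIndex) upper;

So scanning the line pointer array backwards, as the patch's first pass does, tends to access the tuples themselves in ascending memory order.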