Re: Improve search for missing parent downlinks in amcheck - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: Improve search for missing parent downlinks in amcheck
Date
Msg-id CAPpHfdt6DdEDcub6378iiLiQhoYhaxZu+neczMxH=bYcED5T5w@mail.gmail.com
Whole thread Raw
In response to Re: Improve search for missing parent downlinks in amcheck  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: Improve search for missing parent downlinks in amcheck
Re: Improve search for missing parent downlinks in amcheck
List pgsql-hackers
Hi, Peter!

On Fri, Jan 24, 2020 at 4:31 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Jan 22, 2020 at 6:41 PM Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
> > Rebased patch is attached.  Sorry for so huge delay.
>
> I really like this patch. Your interest in amcheck is something that
> makes me feel good about having put so much work into it myself.
>
> Here are some review comments:

Great, thank you very much!

> > +   /*
> > +    * Rightlink and incomplete split flag of previous block referenced by
> > +    * downlink.
> > +    */
> > +   BlockNumber prevrightlink;
> > +   bool        previncompletesplit;
> > +
>
> What downlink? What does this mean? Do you mean the most recently
> followed rightlink on the current level, or InvalidBlockNumber if
> target page is the leftmost page on the current level on the scan?
>
> (Thinks some more...)
>
> Actually, these two new fields track the state *one level down* from
> the target page level when !readonly (unless target page is on the
> leaf level). Right? Comments should be explicit about this. The
> current comments about downlinks isn't clear.

I agree. I've used very vague terms in comments.  Revised now.

> >         if (offset_is_negative_infinity(topaque, offset))
> > +       {
> > +           /*
> > +            * Initialize downlink connectivity check if needed.
> > +            */
> > +           if (!P_ISLEAF(topaque) && state->readonly)
> > +           {
> > +               bt_downlink_connectivity_check(state,
> > +                                              offset,
> > +                                              NULL,
> > +                                              topaque->btpo.level);
> > +           }
> >             continue;
> > +       }
>
> Don't need the "!P_ISLEAF()" here.

Why don't I need.  bt_downlink_connectivity_check() checks one level
down to the target level.  But there is no one level down to leaf...

> Also, you should say something like
> "we need to call this here because the usual callsite in
> bt_downlink_check() won't be reached".

Sure, fixed.

> >     /*
> > -    * * Check if page has a downlink in parent *
> > -    *
> > -    * This can only be checked in heapallindexed + readonly case.
> > +    * If we traversed the whole level to the rightmost page, there might be
> > +    * missing downlinks for the pages to the right of rightmost downlink.
> > +    * Check for them.
> >      */
>
> You mean "to the right of the child page pointed to by our rightmost downlink"?

Yep, fixed.

> I think that the final bt_downlink_connectivity_check() call within
> bt_target_page_check() should make it clear that it is kind of special
> compared to the other two calls.

Yes, this is fixed too.

> > +/*
> > + * Check connectivity of downlinks.  Traverse rightlinks from previous downlink
> > + * to the current one.  Check that there are no intermediate pages with missing
> > + * downlinks.
> > + *
> > + * If 'loaded_page' is given, it's assumed to be contents of downlink
> > + * referenced by 'downlinkoffnum'.
> > + */
>
> Say "assumed to be the page pointed to by the downlink", perhaps?

Yes, fixed.

> > +static void
> > +bt_downlink_connectivity_check(BtreeCheckState *state,
> > +                              OffsetNumber downlinkoffnum,
> > +                              Page loaded_page,
> > +                              uint32 parent_level)
> > +{
>
> In amcheck, we always have a current target page. Every page gets to
> be the target exactly once, though sometimes other subsidiary pages
> are accessed. We try to blame the target page, even with checks that
> are technically against its child/sibling/whatever. The target page is
> always our conceptual point of reference. Sometimes this is a bit
> artificial, but it's still worth doing consistently. So I think you
> should change these argument names with that in mind (see below).

Yes, the arguments were changes as you proposed.

> > +       /*
> > +        * If we visit page with high key, check that it should be equal to
> > +        * the target key next to corresponding downlink.
> > +        */
>
> I suggest "...check that it is equal to the target key..."

Agree, fixed.

> > +           /*
> > +            * There might be two situations when we examine high key.  If
> > +            * current child page is referenced by given downlink, we should
> > +            * look to the next offset number for matching key.
>
> You mean "the next offset number for the matching key from the target
> page"? I find it much easier to keep this stuff in my head if
> everything is defined in terms of its relationship with the current
> target page. For example, bt_downlink_connectivity_check()'s
> "parent_level" argument should be called "target_level" instead, while
> its "loaded_page" should be called "loaded_child". Maybe
> "downlinkoffnum" should be "target_downlinkoffnum". And downlinkoffnum
> should definitely be explained in comments at the top of
> bt_downlink_connectivity_check() (e.g., say what it means when it is
> InvalidOffsetNumber).

Renamed as you proposed.

> > Alternatively
> > +            * we might find child with high key while traversing from
> > +            * previous downlink to current one.  Then matching key resides
> > +            * the same offset number as current downlink.
> > +            */
>
> Not sure what "traversing from previous downlink to current one" means at all.

I've rephrased this comment, please check.

> > +           if (!offset_is_negative_infinity(topaque, pivotkey_offset) &&
> > +               pivotkey_offset <= PageGetMaxOffsetNumber(state->target))
> > +           {
> > +               uint32      cmp = _bt_compare(state->rel,
> > +                                             skey,
> > +                                             state->target,
> > +                                             pivotkey_offset);
>
> There is no need to bother with a _bt_compare() here. Why not just use
> memcmp() with a pointer to itup->t_tid.ip_posid (i.e. memcmp() that
> skips the block number)? I think that it is better to expect the keys
> to be *identical* among pivot tuples, including within tuple alignment
> padding (only the downlink block number can be different here). If
> non-pivot tuples were involved then you couldn't do it this way, but
> they're never involved, so it makes sense. A memcmp() will be faster,
> obviously. More importantly, it has the advantage of not relying on
> opclass infrastructure in any way. It might be worth adding an
> internal verify_nbtree.c static helper function to do the memcmp() for
> you -- bt_pivot_tuple_identical(), or something like that.

Agree, replaced _bt_compare() with bt_pivot_tuple_identical().  It
becomes even simpler now, thanks!

> I think bt_downlink_check() and bt_downlink_connectivity_check()
> should be renamed to something broader. In my mind, downlink is
> basically a block number. We have been sloppy about using the term
> downlink when we really mean "pivot tuple with a downlink" -- I am
> guilty of this myself. But it seems more important, now that you have
> the new high key check.

Hmm... Names are hard for me.  I didn't do any renaming for now.  What
about this?
bt_downlink_check() => bt_child_check()
bt_downlink_connectivity_check() => bt_child_connectivity_highkey_check()

> I particularly don't like the way you sometimes say "downlink" when
> you mean "child page". You do that in this error message:
>
> > +                   (errcode(ERRCODE_INDEX_CORRUPTED),
> > +                    errmsg("block found while traversing rightlinks from downlink of index \"%s\" has invalid
level",
> > +                           RelationGetRelationName(state->rel)),

Agree. Error message has been changed.

> Typo here:
>
> > +   /*
> > +    * If no previos rightlink is memorized, get it from current downlink for
> > +    * future usage.
> > +    */
>
> You mean "previous". Also, I think that you should say "memorized for
> current level just below target page's level".

Yep, fixed.

> >      * within bt_check_level_from_leftmost() won't reach the page either,
> >      * since the leaf's live siblings should have their sibling links updated
> > -    * to bypass the deletion target page when it is marked fully dead.)
> > +    * to bypass the deletion page under check when it is marked fully dead.)
> >      *
>
> This change seems wrong or unnecessary -- "deletion target" means
> "page undergoing deletion" (not necessarily marked P_ISDELETED() just
> yet), and has nothing to do with the amcheck target. You can change
> this if you want, but I don't get it.

Seems like random oversight.  Change removed.

> I tested this by using pg_hexedit to corrupt the least significant
> byte of a text key in the root page:
>
> pg@tpce:5432 [32610]=# select bt_index_parent_check('pk_holding');
> DEBUG:  verifying level 2 (true root level)
> DEBUG:  verifying 9 items on internal block 290
> DEBUG:  verifying level 1
> DEBUG:  verifying 285 items on internal block 3
> ERROR:  mismatch between parent key and child high key index "pk_holding"
> DETAIL:  Parent block=3 child block=9 parent page lsn=998/EFA21550.
>
> Happy to see that this works, even though this is one of the subtlest
> possible forms of index corruption. Previously, we could sometimes
> catch this with "rootdescend" verification, but only if there were
> *current* items that a scan couldn't find on lower levels (often just
> the leaf level). But now it doesn't matter -- we'll always detect it.
> (I think.)

Thank you for testing!

> Shouldn't this error message read '...in index "pk_holding"'? You
> missed the "in". Also, why not have the DETAIL message call the
> "Parent block" the target block?

Yep, fixed.

> I think that bt_downlink_connectivity_check() should have some
> high-level comments about what it's supposed to do. Perhaps an example
> is the best way to explain the concepts. Maybe say something about a
> three level B-Tree. Each of the separator keys in the grandparent/root
> page should also appear as high keys at the parent level. Each of the
> separator keys in the parent level should also appear as high keys on
> the leaf level, including the separators from the parent level high
> keys. Since each separator defines which subtrees are <= and > of the
> separator, there must be an identical seam of separators (in high
> keys) on lower levels. bt_downlink_connectivity_check() verifies that
> separator keys agree across a single level, which verifies the
> integrity of the whole tree.

I've revised finding the matching pivot key for high key.  Now, code
assumes we should always find a matching pivot key.  It could use both
target high key or left sibling high key (which is memorized as "low
key").

I've checked this on normal indexes, but I didn't try to exercise this
with broken indexes.  I would appreciate if you do.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

pgsql-hackers by date:

Previous
From: Amit Langote
Date:
Subject: Re: plan cache overhead on plpgsql expression
Next
From: Alexander Korotkov
Date:
Subject: Re: Improve search for missing parent downlinks in amcheck