Re: Improve search for missing parent downlinks in amcheck - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Improve search for missing parent downlinks in amcheck
Date
Msg-id CAH2-WznOg4q6QrtthM_4fAE8xKUPGNV6LisQRgH-63kpmKJzkw@mail.gmail.com
Whole thread Raw
In response to Re: Improve search for missing parent downlinks in amcheck  (Alexander Korotkov <a.korotkov@postgrespro.ru>)
Responses Re: Improve search for missing parent downlinks in amcheck
Re: Improve search for missing parent downlinks in amcheck
List pgsql-hackers
On Wed, Jan 22, 2020 at 6:41 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Rebased patch is attached.  Sorry for so huge delay.

I really like this patch. Your interest in amcheck is something that
makes me feel good about having put so much work into it myself.

Here are some review comments:

> +   /*
> +    * Rightlink and incomplete split flag of previous block referenced by
> +    * downlink.
> +    */
> +   BlockNumber prevrightlink;
> +   bool        previncompletesplit;
> +

What downlink? What does this mean? Do you mean the most recently
followed rightlink on the current level, or InvalidBlockNumber if
target page is the leftmost page on the current level on the scan?

(Thinks some more...)

Actually, these two new fields track the state *one level down* from
the target page level when !readonly (unless target page is on the
leaf level). Right? Comments should be explicit about this. The
current comments about downlinks isn't clear.

>         if (offset_is_negative_infinity(topaque, offset))
> +       {
> +           /*
> +            * Initialize downlink connectivity check if needed.
> +            */
> +           if (!P_ISLEAF(topaque) && state->readonly)
> +           {
> +               bt_downlink_connectivity_check(state,
> +                                              offset,
> +                                              NULL,
> +                                              topaque->btpo.level);
> +           }
>             continue;
> +       }

Don't need the "!P_ISLEAF()" here. Also, you should say something like
"we need to call this here because the usual callsite in
bt_downlink_check() won't be reached".

>     /*
> -    * * Check if page has a downlink in parent *
> -    *
> -    * This can only be checked in heapallindexed + readonly case.
> +    * If we traversed the whole level to the rightmost page, there might be
> +    * missing downlinks for the pages to the right of rightmost downlink.
> +    * Check for them.
>      */

You mean "to the right of the child page pointed to by our rightmost downlink"?

I think that the final bt_downlink_connectivity_check() call within
bt_target_page_check() should make it clear that it is kind of special
compared to the other two calls.

> +/*
> + * Check connectivity of downlinks.  Traverse rightlinks from previous downlink
> + * to the current one.  Check that there are no intermediate pages with missing
> + * downlinks.
> + *
> + * If 'loaded_page' is given, it's assumed to be contents of downlink
> + * referenced by 'downlinkoffnum'.
> + */

Say "assumed to be the page pointed to by the downlink", perhaps?

> +static void
> +bt_downlink_connectivity_check(BtreeCheckState *state,
> +                              OffsetNumber downlinkoffnum,
> +                              Page loaded_page,
> +                              uint32 parent_level)
> +{

In amcheck, we always have a current target page. Every page gets to
be the target exactly once, though sometimes other subsidiary pages
are accessed. We try to blame the target page, even with checks that
are technically against its child/sibling/whatever. The target page is
always our conceptual point of reference. Sometimes this is a bit
artificial, but it's still worth doing consistently. So I think you
should change these argument names with that in mind (see below).

> +       /*
> +        * If we visit page with high key, check that it should be equal to
> +        * the target key next to corresponding downlink.
> +        */

I suggest "...check that it is equal to the target key..."

> +           /*
> +            * There might be two situations when we examine high key.  If
> +            * current child page is referenced by given downlink, we should
> +            * look to the next offset number for matching key.

You mean "the next offset number for the matching key from the target
page"? I find it much easier to keep this stuff in my head if
everything is defined in terms of its relationship with the current
target page. For example, bt_downlink_connectivity_check()'s
"parent_level" argument should be called "target_level" instead, while
its "loaded_page" should be called "loaded_child". Maybe
"downlinkoffnum" should be "target_downlinkoffnum". And downlinkoffnum
should definitely be explained in comments at the top of
bt_downlink_connectivity_check() (e.g., say what it means when it is
InvalidOffsetNumber).

> Alternatively
> +            * we might find child with high key while traversing from
> +            * previous downlink to current one.  Then matching key resides
> +            * the same offset number as current downlink.
> +            */

Not sure what "traversing from previous downlink to current one" means at all.

> +           if (!offset_is_negative_infinity(topaque, pivotkey_offset) &&
> +               pivotkey_offset <= PageGetMaxOffsetNumber(state->target))
> +           {
> +               uint32      cmp = _bt_compare(state->rel,
> +                                             skey,
> +                                             state->target,
> +                                             pivotkey_offset);

There is no need to bother with a _bt_compare() here. Why not just use
memcmp() with a pointer to itup->t_tid.ip_posid (i.e. memcmp() that
skips the block number)? I think that it is better to expect the keys
to be *identical* among pivot tuples, including within tuple alignment
padding (only the downlink block number can be different here). If
non-pivot tuples were involved then you couldn't do it this way, but
they're never involved, so it makes sense. A memcmp() will be faster,
obviously. More importantly, it has the advantage of not relying on
opclass infrastructure in any way. It might be worth adding an
internal verify_nbtree.c static helper function to do the memcmp() for
you -- bt_pivot_tuple_identical(), or something like that.

I think bt_downlink_check() and bt_downlink_connectivity_check()
should be renamed to something broader. In my mind, downlink is
basically a block number. We have been sloppy about using the term
downlink when we really mean "pivot tuple with a downlink" -- I am
guilty of this myself. But it seems more important, now that you have
the new high key check.

I particularly don't like the way you sometimes say "downlink" when
you mean "child page". You do that in this error message:

> +                   (errcode(ERRCODE_INDEX_CORRUPTED),
> +                    errmsg("block found while traversing rightlinks from downlink of index \"%s\" has invalid
level",
> +                           RelationGetRelationName(state->rel)),

Typo here:

> +   /*
> +    * If no previos rightlink is memorized, get it from current downlink for
> +    * future usage.
> +    */

You mean "previous". Also, I think that you should say "memorized for
current level just below target page's level".

>      * within bt_check_level_from_leftmost() won't reach the page either,
>      * since the leaf's live siblings should have their sibling links updated
> -    * to bypass the deletion target page when it is marked fully dead.)
> +    * to bypass the deletion page under check when it is marked fully dead.)
>      *

This change seems wrong or unnecessary -- "deletion target" means
"page undergoing deletion" (not necessarily marked P_ISDELETED() just
yet), and has nothing to do with the amcheck target. You can change
this if you want, but I don't get it.

I tested this by using pg_hexedit to corrupt the least significant
byte of a text key in the root page:

pg@tpce:5432 [32610]=# select bt_index_parent_check('pk_holding');
DEBUG:  verifying level 2 (true root level)
DEBUG:  verifying 9 items on internal block 290
DEBUG:  verifying level 1
DEBUG:  verifying 285 items on internal block 3
ERROR:  mismatch between parent key and child high key index "pk_holding"
DETAIL:  Parent block=3 child block=9 parent page lsn=998/EFA21550.

Happy to see that this works, even though this is one of the subtlest
possible forms of index corruption. Previously, we could sometimes
catch this with "rootdescend" verification, but only if there were
*current* items that a scan couldn't find on lower levels (often just
the leaf level). But now it doesn't matter -- we'll always detect it.
(I think.)

Shouldn't this error message read '...in index "pk_holding"'? You
missed the "in". Also, why not have the DETAIL message call the
"Parent block" the target block?

I think that bt_downlink_connectivity_check() should have some
high-level comments about what it's supposed to do. Perhaps an example
is the best way to explain the concepts. Maybe say something about a
three level B-Tree. Each of the separator keys in the grandparent/root
page should also appear as high keys at the parent level. Each of the
separator keys in the parent level should also appear as high keys on
the leaf level, including the separators from the parent level high
keys. Since each separator defines which subtrees are <= and > of the
separator, there must be an identical seam of separators (in high
keys) on lower levels. bt_downlink_connectivity_check() verifies that
separator keys agree across a single level, which verifies the
integrity of the whole tree.

(Thinks some more...)

Actually, this patch doesn't quite manage to verify that there is this
"unbroken seam" of separator keys from the root to the leaf, so my
suggested wording is kind of wrong -- but I think we can fix this
weakness. The specific weakness that I saw and verified exists is as
follows:

If I corrupt the high key of most of the leaf pages in a multi-level
index just a little bit (by once again corrupting the least
significant byte of the key using pg_hexedit), then the new check
alone will usually detect the problem, which is good. However, if I
deliberately pick a leaf page that happens to be the rightmost child
of some internal page, then it is a different story -- even the new
check won't detect the problem (the existing rootdescend check may or
may not detect the problem, depending on the current non-pivot tuples
near the leaf high key in question). There is no real reason why we
shouldn't be able to detect this problem, though.

The solution is to recognize that we sometimes need to use the
target/parent page high key separator -- not the separator key from
some pivot tuple in parent that also contains a downlink. Goetz Graefe
calls this "the cousin problem" [1]. To be more precise, I think that
the "pivotkey_offset <= PageGetMaxOffsetNumber(state->target))" part
of this test can be improved:

> +           /*
> +            * There might be two situations when we examine high key.  If
> +            * current child page is referenced by given downlink, we should
> +            * look to the next offset number for matching key.  Alternatively
> +            * we might find child with high key while traversing from
> +            * previous downlink to current one.  Then matching key resides
> +            * the same offset number as current downlink.
> +            */
> +           if (blkno == downlink)
> +               pivotkey_offset = OffsetNumberNext(downlinkoffnum);
> +           else
> +               pivotkey_offset = downlinkoffnum;
> +
> +           topaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
> +
> +           if (!offset_is_negative_infinity(topaque, pivotkey_offset) &&
> +               pivotkey_offset <= PageGetMaxOffsetNumber(state->target))

When OffsetNumberNext(downlinkoffnum) exceeds the
PageGetMaxOffsetNumber(state->target), doesn't that actually mean that
the high key offset (i.e. P_HIKEY) should be used to get an item from
the next level up? We can still correctly detect the problem that way.
Remember, the high key on an internal page is a tuple that contains a
separator but no downlink, which is really not that special within an
internal page -- if you think about it from the Lehman & Yao
perspective. So we should take the pivot tuple from the parent at
offset P_HIKEY, and everything can work just the same.

That said, I guess you really do need the
"!offset_is_negative_infinity(topaque, pivotkey_offset)" part of the
test. The only other possibility is that you access the target/parent
page's downlink + separator in its own parent page (i.e. the
grandparent of the page whose high key might be corrupt), which is
significantly more complexity -- that may not be worth it. (If you did
this, you would have to teach the code the difference between
"absolute" negative infinity in the leftmost leaf page on the leaf
level, and "subtree relative" negative infinity for other leaf pages
that are merely leftmost within a subtree).

In summary: I suppose that we can also solve "the cousin problem"
quite easily, but only for rightmost cousins within a subtree --
leftmost cousins might be too messy to verify for it to be worth it.
We don't want to have to jump two or three levels up within
bt_downlink_connectivity_check() just for leftmost cousin pages. But
maybe you're feeling ambitious! What do you think?

Note: There is an existing comment about this exact negative infinity
business within bt_downlink_check(). It starts with "Negative
inifinity items can be thought of as a strict lower bound that works
transitively...". There should probably be some comment updates to
this comment block as part of this patch.

[1] https://pdfs.semanticscholar.org/fd45/15ab23c00231d96c95c1091459d0d1eebfae.pdf
--
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: doc: alter table references bogus table-specific plannerparameters
Next
From: Peter Geoghegan
Date:
Subject: Re: Improve search for missing parent downlinks in amcheck