Re: Improve search for missing parent downlinks in amcheck - Mailing list pgsql-hackers
From | Alexander Korotkov |
---|---|
Subject | Re: Improve search for missing parent downlinks in amcheck |
Date | |
Msg-id | CAPpHfdt6DdEDcub6378iiLiQhoYhaxZu+neczMxH=bYcED5T5w@mail.gmail.com Whole thread Raw |
In response to | Re: Improve search for missing parent downlinks in amcheck (Peter Geoghegan <pg@bowt.ie>) |
Responses |
Re: Improve search for missing parent downlinks in amcheck
Re: Improve search for missing parent downlinks in amcheck |
List | pgsql-hackers |
Hi, Peter! On Fri, Jan 24, 2020 at 4:31 AM Peter Geoghegan <pg@bowt.ie> wrote: > On Wed, Jan 22, 2020 at 6:41 PM Alexander Korotkov > <a.korotkov@postgrespro.ru> wrote: > > Rebased patch is attached. Sorry for so huge delay. > > I really like this patch. Your interest in amcheck is something that > makes me feel good about having put so much work into it myself. > > Here are some review comments: Great, thank you very much! > > + /* > > + * Rightlink and incomplete split flag of previous block referenced by > > + * downlink. > > + */ > > + BlockNumber prevrightlink; > > + bool previncompletesplit; > > + > > What downlink? What does this mean? Do you mean the most recently > followed rightlink on the current level, or InvalidBlockNumber if > target page is the leftmost page on the current level on the scan? > > (Thinks some more...) > > Actually, these two new fields track the state *one level down* from > the target page level when !readonly (unless target page is on the > leaf level). Right? Comments should be explicit about this. The > current comments about downlinks isn't clear. I agree. I've used very vague terms in comments. Revised now. > > if (offset_is_negative_infinity(topaque, offset)) > > + { > > + /* > > + * Initialize downlink connectivity check if needed. > > + */ > > + if (!P_ISLEAF(topaque) && state->readonly) > > + { > > + bt_downlink_connectivity_check(state, > > + offset, > > + NULL, > > + topaque->btpo.level); > > + } > > continue; > > + } > > Don't need the "!P_ISLEAF()" here. Why don't I need. bt_downlink_connectivity_check() checks one level down to the target level. But there is no one level down to leaf... > Also, you should say something like > "we need to call this here because the usual callsite in > bt_downlink_check() won't be reached". Sure, fixed. > > /* > > - * * Check if page has a downlink in parent * > > - * > > - * This can only be checked in heapallindexed + readonly case. > > + * If we traversed the whole level to the rightmost page, there might be > > + * missing downlinks for the pages to the right of rightmost downlink. > > + * Check for them. > > */ > > You mean "to the right of the child page pointed to by our rightmost downlink"? Yep, fixed. > I think that the final bt_downlink_connectivity_check() call within > bt_target_page_check() should make it clear that it is kind of special > compared to the other two calls. Yes, this is fixed too. > > +/* > > + * Check connectivity of downlinks. Traverse rightlinks from previous downlink > > + * to the current one. Check that there are no intermediate pages with missing > > + * downlinks. > > + * > > + * If 'loaded_page' is given, it's assumed to be contents of downlink > > + * referenced by 'downlinkoffnum'. > > + */ > > Say "assumed to be the page pointed to by the downlink", perhaps? Yes, fixed. > > +static void > > +bt_downlink_connectivity_check(BtreeCheckState *state, > > + OffsetNumber downlinkoffnum, > > + Page loaded_page, > > + uint32 parent_level) > > +{ > > In amcheck, we always have a current target page. Every page gets to > be the target exactly once, though sometimes other subsidiary pages > are accessed. We try to blame the target page, even with checks that > are technically against its child/sibling/whatever. The target page is > always our conceptual point of reference. Sometimes this is a bit > artificial, but it's still worth doing consistently. So I think you > should change these argument names with that in mind (see below). Yes, the arguments were changes as you proposed. > > + /* > > + * If we visit page with high key, check that it should be equal to > > + * the target key next to corresponding downlink. > > + */ > > I suggest "...check that it is equal to the target key..." Agree, fixed. > > + /* > > + * There might be two situations when we examine high key. If > > + * current child page is referenced by given downlink, we should > > + * look to the next offset number for matching key. > > You mean "the next offset number for the matching key from the target > page"? I find it much easier to keep this stuff in my head if > everything is defined in terms of its relationship with the current > target page. For example, bt_downlink_connectivity_check()'s > "parent_level" argument should be called "target_level" instead, while > its "loaded_page" should be called "loaded_child". Maybe > "downlinkoffnum" should be "target_downlinkoffnum". And downlinkoffnum > should definitely be explained in comments at the top of > bt_downlink_connectivity_check() (e.g., say what it means when it is > InvalidOffsetNumber). Renamed as you proposed. > > Alternatively > > + * we might find child with high key while traversing from > > + * previous downlink to current one. Then matching key resides > > + * the same offset number as current downlink. > > + */ > > Not sure what "traversing from previous downlink to current one" means at all. I've rephrased this comment, please check. > > + if (!offset_is_negative_infinity(topaque, pivotkey_offset) && > > + pivotkey_offset <= PageGetMaxOffsetNumber(state->target)) > > + { > > + uint32 cmp = _bt_compare(state->rel, > > + skey, > > + state->target, > > + pivotkey_offset); > > There is no need to bother with a _bt_compare() here. Why not just use > memcmp() with a pointer to itup->t_tid.ip_posid (i.e. memcmp() that > skips the block number)? I think that it is better to expect the keys > to be *identical* among pivot tuples, including within tuple alignment > padding (only the downlink block number can be different here). If > non-pivot tuples were involved then you couldn't do it this way, but > they're never involved, so it makes sense. A memcmp() will be faster, > obviously. More importantly, it has the advantage of not relying on > opclass infrastructure in any way. It might be worth adding an > internal verify_nbtree.c static helper function to do the memcmp() for > you -- bt_pivot_tuple_identical(), or something like that. Agree, replaced _bt_compare() with bt_pivot_tuple_identical(). It becomes even simpler now, thanks! > I think bt_downlink_check() and bt_downlink_connectivity_check() > should be renamed to something broader. In my mind, downlink is > basically a block number. We have been sloppy about using the term > downlink when we really mean "pivot tuple with a downlink" -- I am > guilty of this myself. But it seems more important, now that you have > the new high key check. Hmm... Names are hard for me. I didn't do any renaming for now. What about this? bt_downlink_check() => bt_child_check() bt_downlink_connectivity_check() => bt_child_connectivity_highkey_check() > I particularly don't like the way you sometimes say "downlink" when > you mean "child page". You do that in this error message: > > > + (errcode(ERRCODE_INDEX_CORRUPTED), > > + errmsg("block found while traversing rightlinks from downlink of index \"%s\" has invalid level", > > + RelationGetRelationName(state->rel)), Agree. Error message has been changed. > Typo here: > > > + /* > > + * If no previos rightlink is memorized, get it from current downlink for > > + * future usage. > > + */ > > You mean "previous". Also, I think that you should say "memorized for > current level just below target page's level". Yep, fixed. > > * within bt_check_level_from_leftmost() won't reach the page either, > > * since the leaf's live siblings should have their sibling links updated > > - * to bypass the deletion target page when it is marked fully dead.) > > + * to bypass the deletion page under check when it is marked fully dead.) > > * > > This change seems wrong or unnecessary -- "deletion target" means > "page undergoing deletion" (not necessarily marked P_ISDELETED() just > yet), and has nothing to do with the amcheck target. You can change > this if you want, but I don't get it. Seems like random oversight. Change removed. > I tested this by using pg_hexedit to corrupt the least significant > byte of a text key in the root page: > > pg@tpce:5432 [32610]=# select bt_index_parent_check('pk_holding'); > DEBUG: verifying level 2 (true root level) > DEBUG: verifying 9 items on internal block 290 > DEBUG: verifying level 1 > DEBUG: verifying 285 items on internal block 3 > ERROR: mismatch between parent key and child high key index "pk_holding" > DETAIL: Parent block=3 child block=9 parent page lsn=998/EFA21550. > > Happy to see that this works, even though this is one of the subtlest > possible forms of index corruption. Previously, we could sometimes > catch this with "rootdescend" verification, but only if there were > *current* items that a scan couldn't find on lower levels (often just > the leaf level). But now it doesn't matter -- we'll always detect it. > (I think.) Thank you for testing! > Shouldn't this error message read '...in index "pk_holding"'? You > missed the "in". Also, why not have the DETAIL message call the > "Parent block" the target block? Yep, fixed. > I think that bt_downlink_connectivity_check() should have some > high-level comments about what it's supposed to do. Perhaps an example > is the best way to explain the concepts. Maybe say something about a > three level B-Tree. Each of the separator keys in the grandparent/root > page should also appear as high keys at the parent level. Each of the > separator keys in the parent level should also appear as high keys on > the leaf level, including the separators from the parent level high > keys. Since each separator defines which subtrees are <= and > of the > separator, there must be an identical seam of separators (in high > keys) on lower levels. bt_downlink_connectivity_check() verifies that > separator keys agree across a single level, which verifies the > integrity of the whole tree. I've revised finding the matching pivot key for high key. Now, code assumes we should always find a matching pivot key. It could use both target high key or left sibling high key (which is memorized as "low key"). I've checked this on normal indexes, but I didn't try to exercise this with broken indexes. I would appreciate if you do. ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
pgsql-hackers by date: