Re: Improve search for missing parent downlinks in amcheck - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: Improve search for missing parent downlinks in amcheck |
Date | |
Msg-id | CAH2-WznOg4q6QrtthM_4fAE8xKUPGNV6LisQRgH-63kpmKJzkw@mail.gmail.com Whole thread Raw |
In response to | Re: Improve search for missing parent downlinks in amcheck (Alexander Korotkov <a.korotkov@postgrespro.ru>) |
Responses |
Re: Improve search for missing parent downlinks in amcheck
Re: Improve search for missing parent downlinks in amcheck |
List | pgsql-hackers |
On Wed, Jan 22, 2020 at 6:41 PM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > Rebased patch is attached. Sorry for so huge delay. I really like this patch. Your interest in amcheck is something that makes me feel good about having put so much work into it myself. Here are some review comments: > + /* > + * Rightlink and incomplete split flag of previous block referenced by > + * downlink. > + */ > + BlockNumber prevrightlink; > + bool previncompletesplit; > + What downlink? What does this mean? Do you mean the most recently followed rightlink on the current level, or InvalidBlockNumber if target page is the leftmost page on the current level on the scan? (Thinks some more...) Actually, these two new fields track the state *one level down* from the target page level when !readonly (unless target page is on the leaf level). Right? Comments should be explicit about this. The current comments about downlinks isn't clear. > if (offset_is_negative_infinity(topaque, offset)) > + { > + /* > + * Initialize downlink connectivity check if needed. > + */ > + if (!P_ISLEAF(topaque) && state->readonly) > + { > + bt_downlink_connectivity_check(state, > + offset, > + NULL, > + topaque->btpo.level); > + } > continue; > + } Don't need the "!P_ISLEAF()" here. Also, you should say something like "we need to call this here because the usual callsite in bt_downlink_check() won't be reached". > /* > - * * Check if page has a downlink in parent * > - * > - * This can only be checked in heapallindexed + readonly case. > + * If we traversed the whole level to the rightmost page, there might be > + * missing downlinks for the pages to the right of rightmost downlink. > + * Check for them. > */ You mean "to the right of the child page pointed to by our rightmost downlink"? I think that the final bt_downlink_connectivity_check() call within bt_target_page_check() should make it clear that it is kind of special compared to the other two calls. > +/* > + * Check connectivity of downlinks. Traverse rightlinks from previous downlink > + * to the current one. Check that there are no intermediate pages with missing > + * downlinks. > + * > + * If 'loaded_page' is given, it's assumed to be contents of downlink > + * referenced by 'downlinkoffnum'. > + */ Say "assumed to be the page pointed to by the downlink", perhaps? > +static void > +bt_downlink_connectivity_check(BtreeCheckState *state, > + OffsetNumber downlinkoffnum, > + Page loaded_page, > + uint32 parent_level) > +{ In amcheck, we always have a current target page. Every page gets to be the target exactly once, though sometimes other subsidiary pages are accessed. We try to blame the target page, even with checks that are technically against its child/sibling/whatever. The target page is always our conceptual point of reference. Sometimes this is a bit artificial, but it's still worth doing consistently. So I think you should change these argument names with that in mind (see below). > + /* > + * If we visit page with high key, check that it should be equal to > + * the target key next to corresponding downlink. > + */ I suggest "...check that it is equal to the target key..." > + /* > + * There might be two situations when we examine high key. If > + * current child page is referenced by given downlink, we should > + * look to the next offset number for matching key. You mean "the next offset number for the matching key from the target page"? I find it much easier to keep this stuff in my head if everything is defined in terms of its relationship with the current target page. For example, bt_downlink_connectivity_check()'s "parent_level" argument should be called "target_level" instead, while its "loaded_page" should be called "loaded_child". Maybe "downlinkoffnum" should be "target_downlinkoffnum". And downlinkoffnum should definitely be explained in comments at the top of bt_downlink_connectivity_check() (e.g., say what it means when it is InvalidOffsetNumber). > Alternatively > + * we might find child with high key while traversing from > + * previous downlink to current one. Then matching key resides > + * the same offset number as current downlink. > + */ Not sure what "traversing from previous downlink to current one" means at all. > + if (!offset_is_negative_infinity(topaque, pivotkey_offset) && > + pivotkey_offset <= PageGetMaxOffsetNumber(state->target)) > + { > + uint32 cmp = _bt_compare(state->rel, > + skey, > + state->target, > + pivotkey_offset); There is no need to bother with a _bt_compare() here. Why not just use memcmp() with a pointer to itup->t_tid.ip_posid (i.e. memcmp() that skips the block number)? I think that it is better to expect the keys to be *identical* among pivot tuples, including within tuple alignment padding (only the downlink block number can be different here). If non-pivot tuples were involved then you couldn't do it this way, but they're never involved, so it makes sense. A memcmp() will be faster, obviously. More importantly, it has the advantage of not relying on opclass infrastructure in any way. It might be worth adding an internal verify_nbtree.c static helper function to do the memcmp() for you -- bt_pivot_tuple_identical(), or something like that. I think bt_downlink_check() and bt_downlink_connectivity_check() should be renamed to something broader. In my mind, downlink is basically a block number. We have been sloppy about using the term downlink when we really mean "pivot tuple with a downlink" -- I am guilty of this myself. But it seems more important, now that you have the new high key check. I particularly don't like the way you sometimes say "downlink" when you mean "child page". You do that in this error message: > + (errcode(ERRCODE_INDEX_CORRUPTED), > + errmsg("block found while traversing rightlinks from downlink of index \"%s\" has invalid level", > + RelationGetRelationName(state->rel)), Typo here: > + /* > + * If no previos rightlink is memorized, get it from current downlink for > + * future usage. > + */ You mean "previous". Also, I think that you should say "memorized for current level just below target page's level". > * within bt_check_level_from_leftmost() won't reach the page either, > * since the leaf's live siblings should have their sibling links updated > - * to bypass the deletion target page when it is marked fully dead.) > + * to bypass the deletion page under check when it is marked fully dead.) > * This change seems wrong or unnecessary -- "deletion target" means "page undergoing deletion" (not necessarily marked P_ISDELETED() just yet), and has nothing to do with the amcheck target. You can change this if you want, but I don't get it. I tested this by using pg_hexedit to corrupt the least significant byte of a text key in the root page: pg@tpce:5432 [32610]=# select bt_index_parent_check('pk_holding'); DEBUG: verifying level 2 (true root level) DEBUG: verifying 9 items on internal block 290 DEBUG: verifying level 1 DEBUG: verifying 285 items on internal block 3 ERROR: mismatch between parent key and child high key index "pk_holding" DETAIL: Parent block=3 child block=9 parent page lsn=998/EFA21550. Happy to see that this works, even though this is one of the subtlest possible forms of index corruption. Previously, we could sometimes catch this with "rootdescend" verification, but only if there were *current* items that a scan couldn't find on lower levels (often just the leaf level). But now it doesn't matter -- we'll always detect it. (I think.) Shouldn't this error message read '...in index "pk_holding"'? You missed the "in". Also, why not have the DETAIL message call the "Parent block" the target block? I think that bt_downlink_connectivity_check() should have some high-level comments about what it's supposed to do. Perhaps an example is the best way to explain the concepts. Maybe say something about a three level B-Tree. Each of the separator keys in the grandparent/root page should also appear as high keys at the parent level. Each of the separator keys in the parent level should also appear as high keys on the leaf level, including the separators from the parent level high keys. Since each separator defines which subtrees are <= and > of the separator, there must be an identical seam of separators (in high keys) on lower levels. bt_downlink_connectivity_check() verifies that separator keys agree across a single level, which verifies the integrity of the whole tree. (Thinks some more...) Actually, this patch doesn't quite manage to verify that there is this "unbroken seam" of separator keys from the root to the leaf, so my suggested wording is kind of wrong -- but I think we can fix this weakness. The specific weakness that I saw and verified exists is as follows: If I corrupt the high key of most of the leaf pages in a multi-level index just a little bit (by once again corrupting the least significant byte of the key using pg_hexedit), then the new check alone will usually detect the problem, which is good. However, if I deliberately pick a leaf page that happens to be the rightmost child of some internal page, then it is a different story -- even the new check won't detect the problem (the existing rootdescend check may or may not detect the problem, depending on the current non-pivot tuples near the leaf high key in question). There is no real reason why we shouldn't be able to detect this problem, though. The solution is to recognize that we sometimes need to use the target/parent page high key separator -- not the separator key from some pivot tuple in parent that also contains a downlink. Goetz Graefe calls this "the cousin problem" [1]. To be more precise, I think that the "pivotkey_offset <= PageGetMaxOffsetNumber(state->target))" part of this test can be improved: > + /* > + * There might be two situations when we examine high key. If > + * current child page is referenced by given downlink, we should > + * look to the next offset number for matching key. Alternatively > + * we might find child with high key while traversing from > + * previous downlink to current one. Then matching key resides > + * the same offset number as current downlink. > + */ > + if (blkno == downlink) > + pivotkey_offset = OffsetNumberNext(downlinkoffnum); > + else > + pivotkey_offset = downlinkoffnum; > + > + topaque = (BTPageOpaque) PageGetSpecialPointer(state->target); > + > + if (!offset_is_negative_infinity(topaque, pivotkey_offset) && > + pivotkey_offset <= PageGetMaxOffsetNumber(state->target)) When OffsetNumberNext(downlinkoffnum) exceeds the PageGetMaxOffsetNumber(state->target), doesn't that actually mean that the high key offset (i.e. P_HIKEY) should be used to get an item from the next level up? We can still correctly detect the problem that way. Remember, the high key on an internal page is a tuple that contains a separator but no downlink, which is really not that special within an internal page -- if you think about it from the Lehman & Yao perspective. So we should take the pivot tuple from the parent at offset P_HIKEY, and everything can work just the same. That said, I guess you really do need the "!offset_is_negative_infinity(topaque, pivotkey_offset)" part of the test. The only other possibility is that you access the target/parent page's downlink + separator in its own parent page (i.e. the grandparent of the page whose high key might be corrupt), which is significantly more complexity -- that may not be worth it. (If you did this, you would have to teach the code the difference between "absolute" negative infinity in the leftmost leaf page on the leaf level, and "subtree relative" negative infinity for other leaf pages that are merely leftmost within a subtree). In summary: I suppose that we can also solve "the cousin problem" quite easily, but only for rightmost cousins within a subtree -- leftmost cousins might be too messy to verify for it to be worth it. We don't want to have to jump two or three levels up within bt_downlink_connectivity_check() just for leftmost cousin pages. But maybe you're feeling ambitious! What do you think? Note: There is an existing comment about this exact negative infinity business within bt_downlink_check(). It starts with "Negative inifinity items can be thought of as a strict lower bound that works transitively...". There should probably be some comment updates to this comment block as part of this patch. [1] https://pdfs.semanticscholar.org/fd45/15ab23c00231d96c95c1091459d0d1eebfae.pdf -- Peter Geoghegan
pgsql-hackers by date: