Thread: Improve search for missing parent downlinks in amcheck
Hi! Currently we amcheck supports lossy checking for missing parent downlinks. It collects bitmap of downlink hashes and use it to check subsequent tree level. We've experienced some large corrupted indexes which pass this check due to its looseness. However, it seems to me we can implement this check in non-lossy manner without making it significantly slower. We anyway traverse downlinks from parent to children in order to verify that hikeys are corresponding to downlink keys. We can also traverse from one downlink to subsequent using rightlinks. So, if there are some intermediate pages between them, they are candidates to have missing parent downlinks. The patch is attached. With this patch amcheck could successfully detect corruption for our customer, which unpatched amcheck couldn't find. Opinions? ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On Mon, Apr 15, 2019 at 7:30 PM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > Currently we amcheck supports lossy checking for missing parent > downlinks. It collects bitmap of downlink hashes and use it to check > subsequent tree level. We've experienced some large corrupted indexes > which pass this check due to its looseness. Can you be more specific? What was the cause of the corruption? I'm always very interested in hearing about cases that amcheck could have detected, but didn't. Was the issue that the Bloom filter was simply undersized/ineffective? > However, it seems to me we can implement this check in non-lossy > manner without making it significantly slower. We anyway traverse > downlinks from parent to children in order to verify that hikeys are > corresponding to downlink keys. Actually, we don't check the high keys in children against the parent (all other items are checked, though). We probably *should* do something with the high key when verifying consistency across levels, but currently we don't. (We only use the high key for the same-page high key check -- more on this below.) > We can also traverse from one > downlink to subsequent using rightlinks. So, if there are some > intermediate pages between them, they are candidates to have missing > parent downlinks. The patch is attached. > > With this patch amcheck could successfully detect corruption for our > customer, which unpatched amcheck couldn't find. Maybe we can be a lot less conservative about sizing the Bloom filter instead? That would be an easier fix IMV -- we can probably change that logic to be a lot more aggressive without anybody noticing, since the Bloom filter is already usually tiny. We are already not very careful about saving work within bt_index_parent_check(), but with this patch we follow each downlink twice instead of once. That seems wasteful. The reason I used a Bloom filter here is because I would eventually like the missing downlink check to fingerprint entire tuples, not just block numbers. In L&Y terms, the check could in the future fingerprint the separator key and the downlink at the same time, not just the downlink. That way, we could probe the Bloom filter on the next level down for its high key (with the right sibling pointer set to be consistent with the parent) iff we don't see that the page split was interrupted (i.e. iff P_INCOMPLETE_SPLIT() bit is not set). Obviously this would be a more effective form of verification, since we would notice high key values that don't agree with the parent's values for the same sibling/cousin/child block. I didn't do it that way for v11 because of "minus infinity" items on internal pages, which don't store the original key (the key remains the high key of the left sibling page, but we truncate the original to 0 attributes in _bt_pgaddtup()). I think that we should eventually stop truncating minus infinity items, and actually store a "low key" at P_FIRSTDATAKEY() within internal pages instead. That would be useful for other things anyway (e.g. prefix compression). -- Peter Geoghegan
On Tue, Apr 16, 2019 at 12:00 PM Peter Geoghegan <pg@bowt.ie> wrote: > Can you be more specific? What was the cause of the corruption? I'm > always very interested in hearing about cases that amcheck could have > detected, but didn't. FWIW, v4 indexes in Postgres 12 will support the new "rootdescend" verification option, which isn't lossy, and would certainly have detected your customer issue in practice. Admittedly the new check is quite expensive, even compared to the other bt_index_parent_check() checks, but it is nice that we now have a verification option that is *extremely* thorough, and uses _bt_search() directly. -- Peter Geoghegan
On Tue, Apr 16, 2019 at 12:00 PM Peter Geoghegan <pg@bowt.ie> wrote: > On Mon, Apr 15, 2019 at 7:30 PM Alexander Korotkov > <a.korotkov@postgrespro.ru> wrote: > > Currently we amcheck supports lossy checking for missing parent > > downlinks. It collects bitmap of downlink hashes and use it to check > > subsequent tree level. We've experienced some large corrupted indexes > > which pass this check due to its looseness. > > Can you be more specific? What was the cause of the corruption? I'm > always very interested in hearing about cases that amcheck could have > detected, but didn't. Ping? I am interested in doing better here. The background information seems very interesting. -- Peter Geoghegan
On Tue, Apr 16, 2019 at 10:04 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Tue, Apr 16, 2019 at 12:00 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Can you be more specific? What was the cause of the corruption? I'm
> > always very interested in hearing about cases that amcheck could have
> > detected, but didn't.
>
> FWIW, v4 indexes in Postgres 12 will support the new "rootdescend"
> verification option, which isn't lossy, and would certainly have
> detected your customer issue in practice. Admittedly the new check is
> quite expensive, even compared to the other bt_index_parent_check()
> checks, but it is nice that we now have a verification option that is
> *extremely* thorough, and uses _bt_search() directly.
"rootdescend" is cool type of check. Thank you for noticing, I wasn't aware of it.
>
> On Tue, Apr 16, 2019 at 12:00 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Can you be more specific? What was the cause of the corruption? I'm
> > always very interested in hearing about cases that amcheck could have
> > detected, but didn't.
>
> FWIW, v4 indexes in Postgres 12 will support the new "rootdescend"
> verification option, which isn't lossy, and would certainly have
> detected your customer issue in practice. Admittedly the new check is
> quite expensive, even compared to the other bt_index_parent_check()
> checks, but it is nice that we now have a verification option that is
> *extremely* thorough, and uses _bt_search() directly.
"rootdescend" is cool type of check. Thank you for noticing, I wasn't aware of it.
But can it detect the missing downlink in following situation?
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
A
/ \
B <-> C <-> D
Here A has downlinks to B and D, which downlink to C is missing,
while B, C and D are correctly connected with leftlinks and rightlinks.
I can see "rootdescend" calls _bt_search(), which would just step
right from C to D as if it was concurrent split.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Sat, Apr 27, 2019 at 4:57 PM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > "rootdescend" is cool type of check. Thank you for noticing, I wasn't aware of it. > But can it detect the missing downlink in following situation? > > A > / \ > B <-> C <-> D > > Here A has downlinks to B and D, which downlink to C is missing, > while B, C and D are correctly connected with leftlinks and rightlinks. > I can see "rootdescend" calls _bt_search(), which would just step > right from C to D as if it was concurrent split. There is a comment about this scenario above bt_rootdescend() in amcheck. You're right -- this is a kind of corruption that even the new rootdescend verification option would miss. We can imagine a version of rootdescend verification that tells the core code to only move right when there was an *interrupted* page split (i.e. P_INCOMPLETE_SPLIT() flag bit is set), but that isn't what happens right now. That said, the lossy downlink check that you want to improve on *should* already catch this situation. Of course it might not because it is lossy (uses a Bloom filter), but I think that that's very unlikely. That's why I would like to understand the problem that you found with the check. -- Peter Geoghegan
On Tue, Apr 16, 2019 at 10:00 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Mon, Apr 15, 2019 at 7:30 PM Alexander Korotkov > <a.korotkov@postgrespro.ru> wrote: > > Currently we amcheck supports lossy checking for missing parent > > downlinks. It collects bitmap of downlink hashes and use it to check > > subsequent tree level. We've experienced some large corrupted indexes > > which pass this check due to its looseness. > > Can you be more specific? What was the cause of the corruption? I'm > always very interested in hearing about cases that amcheck could have > detected, but didn't. AFAIR, the cause of corruption in this case was our (Postgres Pro) modification. Not something really present in PostgreSQL core. > > Was the issue that the Bloom filter was simply undersized/ineffective? Yes, increasing of Bloom filter size also helps. But my intention was to make non-lossy check here. > > > However, it seems to me we can implement this check in non-lossy > > manner without making it significantly slower. We anyway traverse > > downlinks from parent to children in order to verify that hikeys are > > corresponding to downlink keys. > > Actually, we don't check the high keys in children against the parent > (all other items are checked, though). We probably *should* do > something with the high key when verifying consistency across levels, > but currently we don't. (We only use the high key for the same-page > high key check -- more on this below.) Nice idea. > > We can also traverse from one > > downlink to subsequent using rightlinks. So, if there are some > > intermediate pages between them, they are candidates to have missing > > parent downlinks. The patch is attached. > > > > With this patch amcheck could successfully detect corruption for our > > customer, which unpatched amcheck couldn't find. > > Maybe we can be a lot less conservative about sizing the Bloom filter > instead? That would be an easier fix IMV -- we can probably change > that logic to be a lot more aggressive without anybody noticing, since > the Bloom filter is already usually tiny. We are already not very > careful about saving work within bt_index_parent_check(), but with > this patch we follow each downlink twice instead of once. That seems > wasteful. It wouldn't be really wasteful, because the same children were accessed recently. So, they are likely not yet evicted from shared_buffers. I think we can fit both checks into one loop over downlinks instead of two. > The reason I used a Bloom filter here is because I would eventually > like the missing downlink check to fingerprint entire tuples, not just > block numbers. In L&Y terms, the check could in the future fingerprint > the separator key and the downlink at the same time, not just the > downlink. That way, we could probe the Bloom filter on the next level > down for its high key (with the right sibling pointer set to be > consistent with the parent) iff we don't see that the page split was > interrupted (i.e. iff P_INCOMPLETE_SPLIT() bit is not set). Obviously > this would be a more effective form of verification, since we would > notice high key values that don't agree with the parent's values for > the same sibling/cousin/child block. > > I didn't do it that way for v11 because of "minus infinity" items on > internal pages, which don't store the original key (the key remains > the high key of the left sibling page, but we truncate the original to > 0 attributes in _bt_pgaddtup()). I think that we should eventually > stop truncating minus infinity items, and actually store a "low key" > at P_FIRSTDATAKEY() within internal pages instead. That would be > useful for other things anyway (e.g. prefix compression). Yes, we can do more checks with bloom filter. But assuming that we anyway iterate over children for each non-leaf page, can we do: 1) All the checks, which bt_downlink_check() does for now 2) Check there are no missing downlinks 3) Check hikeys in one pass, can't we? ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Sat, Apr 27, 2019 at 5:13 PM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > Yes, increasing of Bloom filter size also helps. But my intention was > to make non-lossy check here. I agree that that might be a good goal, but I am interested in knowing if there is something naive about how the downlinkfilter Bloom filter is sized. I think that we could probably do better than this without much work: /* * Extra readonly downlink check. * * In readonly case, we know that there cannot be a concurrent * page split or a concurrent page deletion, which gives us the * opportunity to verify that every non-ignorable page had a * downlink one level up. We must be tolerant of interrupted page * splits and page deletions, though. This is taken care of in * bt_downlink_missing_check(). */ total_pages = (int64) state->rel->rd_rel->relpages; state->downlinkfilter = bloom_create(total_pages, work_mem, seed); Maybe we could use "smgrnblocks(index->rd_smgr, MAIN_FORKNUM))" instead of relpages, for example. > It wouldn't be really wasteful, because the same children were > accessed recently. So, they are likely not yet evicted from > shared_buffers. I think we can fit both checks into one loop over > downlinks instead of two. I see your point, but if we're going to treat this as a bug then I would prefer a simple fix. > Yes, we can do more checks with bloom filter. But assuming that we > anyway iterate over children for each non-leaf page, can we do: > 1) All the checks, which bt_downlink_check() does for now > 2) Check there are no missing downlinks > 3) Check hikeys > in one pass, can't we? We can expect every high key in a page to have a copy contained within its parent, either as one of the keys, or as parent's own high key (assuming no concurrent or interrupted page splits or page deletions). This is true today, even though we truncate negative infinity items in internal pages. I think that the simple answer to your question is yes. It would be more complicated that way, and the only extra check would be the check of high keys against their parent, but overall this does seem possible. -- Peter Geoghegan
On Sat, Apr 27, 2019 at 5:13 PM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > Yes, increasing of Bloom filter size also helps. But my intention was > to make non-lossy check here. Why is that your intention? Do you want to do this as a feature for Postgres 13, or do you want to treat this as a bug that we need to backpatch a fix for? Can we avoid the problem you saw with the Bloom filter approach by using the real size of the index (i.e. smgrnblocks()/RelationGetNumberOfBlocks()) to size the Bloom filter, and/or by rethinking the work_mem cap? Maybe we can have a WARNING that advertises that work_mem is probably too low? The state->downlinkfilter Bloom filter should be small in almost all cases, so I still don't fully understand your concern. With a 100GB index, we'll have ~13 million blocks. We only need a Bloom filter that is ~250MB to have less than a 1% chance of missing inconsistencies even with such a large index. I admit that its unfriendly that users are not warned about the shortage currently, but that is something we can probably find a simple (backpatchable) fix for. -- Peter Geoghegan
On Sun, Apr 28, 2019 at 4:36 AM Peter Geoghegan <pg@bowt.ie> wrote: > On Sat, Apr 27, 2019 at 5:13 PM Alexander Korotkov > <a.korotkov@postgrespro.ru> wrote: > > Yes, increasing of Bloom filter size also helps. But my intention was > > to make non-lossy check here. > > Why is that your intention? Do you want to do this as a feature for > Postgres 13, or do you want to treat this as a bug that we need to > backpatch a fix for? I think this definitely not bug fix. Bloom filter was designed to be lossy, no way blaming it for that :) > Can we avoid the problem you saw with the Bloom filter approach by > using the real size of the index (i.e. > smgrnblocks()/RelationGetNumberOfBlocks()) to size the Bloom filter, > and/or by rethinking the work_mem cap? Maybe we can have a WARNING > that advertises that work_mem is probably too low? > > The state->downlinkfilter Bloom filter should be small in almost all > cases, so I still don't fully understand your concern. With a 100GB > index, we'll have ~13 million blocks. We only need a Bloom filter that > is ~250MB to have less than a 1% chance of missing inconsistencies > even with such a large index. I admit that its unfriendly that users > are not warned about the shortage currently, but that is something we > can probably find a simple (backpatchable) fix for. Sounds reasonable. I'll think about proposing backpatch of something like this. ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Sun, Apr 28, 2019 at 10:15 AM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > I think this definitely not bug fix. Bloom filter was designed to be > lossy, no way blaming it for that :) I will think about a simple fix, but after the upcoming point release. There is no hurry. -- Peter Geoghegan
On Wed, May 1, 2019 at 12:58 PM Peter Geoghegan <pg@bowt.ie> wrote: > On Sun, Apr 28, 2019 at 10:15 AM Alexander Korotkov > <a.korotkov@postgrespro.ru> wrote: > > I think this definitely not bug fix. Bloom filter was designed to be > > lossy, no way blaming it for that :) > > I will think about a simple fix, but after the upcoming point release. > There is no hurry. A bureaucratic question: What should the status be for this CF entry? -- Thomas Munro https://enterprisedb.com
On Sun, Jul 7, 2019 at 7:53 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Wed, May 1, 2019 at 12:58 PM Peter Geoghegan <pg@bowt.ie> wrote: > > I will think about a simple fix, but after the upcoming point release. > > There is no hurry. > > A bureaucratic question: What should the status be for this CF entry? I have plans to use RelationGetNumberOfBlocks() to size amcheck's downlink Bloom filter. I think that that will solve the problems with unreliable estimates of index size. which seemed to be the basis of Alexander's complaint. -- Peter Geoghegan
On Tue, Apr 30, 2019 at 5:58 PM Peter Geoghegan <pg@bowt.ie> wrote: > I will think about a simple fix, but after the upcoming point release. > There is no hurry. Attached draft patch uses RelationGetNumberOfBlocks() to size each of the two Bloom filters that may be used by amcheck to perform verification. The basic heapallindexed Bloom filter is now sized based on the conservative assumption that there must be *at least* "RelationGetNumberOfBlocks() * 50" elements to fingerprint (reltuples will continue to be used to size the basic heapallindexed Bloom filter in most cases, though). The patch also uses the same RelationGetNumberOfBlocks() value to size the downlink Bloom filter. This second change will fix your problem very non-invasively. I intend to backpatch this to v11 in the next few days. -- Peter Geoghegan
Attachment
On Fri, Jul 19, 2019 at 3:21 AM Peter Geoghegan <pg@bowt.ie> wrote: > On Tue, Apr 30, 2019 at 5:58 PM Peter Geoghegan <pg@bowt.ie> wrote: > > I will think about a simple fix, but after the upcoming point release. > > There is no hurry. > > Attached draft patch uses RelationGetNumberOfBlocks() to size each of > the two Bloom filters that may be used by amcheck to perform > verification. > > The basic heapallindexed Bloom filter is now sized based on the > conservative assumption that there must be *at least* > "RelationGetNumberOfBlocks() * 50" elements to fingerprint (reltuples > will continue to be used to size the basic heapallindexed Bloom filter > in most cases, though). The patch also uses the same > RelationGetNumberOfBlocks() value to size the downlink Bloom filter. > This second change will fix your problem very non-invasively. > > I intend to backpatch this to v11 in the next few days. Thank you for backpatching this! BTW, there is next revision of patch I'm proposing for v13. In this revision check for missing downlinks is combined with bt_downlink_check(). So, pages visited by bt_downlink_check() patch doesn't cause extra accessed. It only causes following additional page accesses: 1) Downlinks corresponding to "negative infinity" keys, 2) Pages of child level, which are not referenced by downlinks. But I think these two kinds are very minority, and those accesses could be trade off with more precise missing downlink check without bloom filter. What do you think? ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On Mon, Aug 12, 2019 at 12:01 PM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > BTW, there is next revision of patch I'm proposing for v13. Cool. > In this revision check for missing downlinks is combined with > bt_downlink_check(). So, pages visited by bt_downlink_check() patch > doesn't cause extra accessed. It only causes following additional > page accesses: > 1) Downlinks corresponding to "negative infinity" keys, > 2) Pages of child level, which are not referenced by downlinks. > > But I think these two kinds are very minority, and those accesses > could be trade off with more precise missing downlink check without > bloom filter. What do you think? I am generally in favor of making the downlink check absolutely reliable, and am not too worried about the modest additional overhead. After all, bt_index_parent_check() is supposed to be thorough though expensive. The only reason that I used a Bloom filter for fingerprinting downlink blocks was that it seemed important to quickly get amcheck coverage for subtle multi-level page deletion bugs just after v11 feature freeze. We can now come up with a better design for that. I was confused about how this patch worked at first. But then I remembered that Lehman and Yao consider downlinks to be distinct things to separator keys in internal pages. The high key of an internal page in the final separator key, so you have n downlinks and n + 1 separator keys per internal page -- two distinct things that appear in alternating order (the negative infinity item is not considered to have any separator key here). So, while internal page items are explicitly "(downlink, separator)" pairs that are grouped into a single tuple in nbtree, with a separate tuple just for the high key, Lehman and Yao would find it just as natural to treat them as "(separator, downlink)" pairs. You have to skip the first downlink on each internal level to make that work, but this makes our bt_downlink_check() check have something from the target page (child's parent page) that is like the high key in the child. It's already really confusing that we don't quite mean the same thing as Lehman and Yao when we say downlink (See also my long "why a highkey is never truly a copy of another item" comment block within bt_target_page_check()), and that is not your patch's fault. But maybe we need to fix that to make your patch easier to understand. (i.e. maybe we need to go over every use of the word "downlink" in nbtree, and make it say something more precise, to make everything less confusing.) Other feedback: * Did you miss the opportunity to verify that every high key matches its right sibling page's downlink tuple in parent page? We talked about this already, but you don't seem to match the key (you only match the downlink block). * You are decoupling the new check from the bt_index_parent_check() "heapallindexed" option. That's probably a good thing, but you need to update the sgml docs. * Didn't you forget to remove the BtreeCheckState.rightsplit flag? * You've significantly changed the behavior of bt_downlink_check() -- I would note this in its header comments. This is where ~99% of the new work happens. * I don't like that you use the loaded term "target" here -- anything called "target" should be BtreeCheckState.target: > static void > -bt_downlink_missing_check(BtreeCheckState *state) > +bt_downlink_missing_check(BtreeCheckState *state, bool rightsplit, > + BlockNumber targetblock, Page target) > { If it's unclear what I mean, this old comment should make it clearer: /* * State associated with verifying a B-Tree index * * target is the point of reference for a verification operation. * * Other B-Tree pages may be allocated, but those are always auxiliary (e.g., * they are current target's child pages). Conceptually, problems are only * ever found in the current target page (or for a particular heap tuple during * heapallindexed verification). Each page found by verification's left/right, * top/bottom scan becomes the target exactly once. */ -- Peter Geoghegan
On Tue, Aug 13, 2019 at 11:44 PM Peter Geoghegan <pg@bowt.ie> wrote: > > In this revision check for missing downlinks is combined with > > bt_downlink_check(). So, pages visited by bt_downlink_check() patch > > doesn't cause extra accessed. It only causes following additional > > page accesses: > > 1) Downlinks corresponding to "negative infinity" keys, > > 2) Pages of child level, which are not referenced by downlinks. > > > > But I think these two kinds are very minority, and those accesses > > could be trade off with more precise missing downlink check without > > bloom filter. What do you think? > > I am generally in favor of making the downlink check absolutely > reliable, and am not too worried about the modest additional overhead. > After all, bt_index_parent_check() is supposed to be thorough though > expensive. The only reason that I used a Bloom filter for > fingerprinting downlink blocks was that it seemed important to quickly > get amcheck coverage for subtle multi-level page deletion bugs just > after v11 feature freeze. We can now come up with a better design for > that. Great! > I was confused about how this patch worked at first. But then I > remembered that Lehman and Yao consider downlinks to be distinct > things to separator keys in internal pages. The high key of an > internal page in the final separator key, so you have n downlinks and > n + 1 separator keys per internal page -- two distinct things that > appear in alternating order (the negative infinity item is not > considered to have any separator key here). So, while internal page > items are explicitly "(downlink, separator)" pairs that are grouped > into a single tuple in nbtree, with a separate tuple just for the high > key, Lehman and Yao would find it just as natural to treat them as > "(separator, downlink)" pairs. You have to skip the first downlink on > each internal level to make that work, but this makes our > bt_downlink_check() check have something from the target page (child's > parent page) that is like the high key in the child. > > It's already really confusing that we don't quite mean the same thing > as Lehman and Yao when we say downlink (See also my long "why a > highkey is never truly a copy of another item" comment block within > bt_target_page_check()), and that is not your patch's fault. But maybe > we need to fix that to make your patch easier to understand. (i.e. > maybe we need to go over every use of the word "downlink" in nbtree, > and make it say something more precise, to make everything less > confusing.) I agree that current terms nbtree use to describe downlinks and separator keys may be confusing. I'll try to fix this and come up with patch if succeed. > Other feedback: > > * Did you miss the opportunity to verify that every high key matches > its right sibling page's downlink tuple in parent page? We talked > about this already, but you don't seem to match the key (you only > match the downlink block). > > * You are decoupling the new check from the bt_index_parent_check() > "heapallindexed" option. That's probably a good thing, but you need to > update the sgml docs. > > * Didn't you forget to remove the BtreeCheckState.rightsplit flag? > > * You've significantly changed the behavior of bt_downlink_check() -- > I would note this in its header comments. This is where ~99% of the > new work happens. > > * I don't like that you use the loaded term "target" here -- anything > called "target" should be BtreeCheckState.target: > > > static void > > -bt_downlink_missing_check(BtreeCheckState *state) > > +bt_downlink_missing_check(BtreeCheckState *state, bool rightsplit, > > + BlockNumber targetblock, Page target) > > { > > If it's unclear what I mean, this old comment should make it clearer: > > /* > * State associated with verifying a B-Tree index > * > * target is the point of reference for a verification operation. > * > * Other B-Tree pages may be allocated, but those are always auxiliary (e.g., > * they are current target's child pages). Conceptually, problems are only > * ever found in the current target page (or for a particular heap tuple during > * heapallindexed verification). Each page found by verification's left/right, > * top/bottom scan becomes the target exactly once. > */ The revised patch seems to fix all of above. ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On Mon, Aug 19, 2019 at 01:15:19AM +0300, Alexander Korotkov wrote: > The revised patch seems to fix all of above. The latest patch is failing to apply. Please provide a rebase. -- Michael
Attachment
On Fri, Nov 29, 2019 at 03:03:01PM +0900, Michael Paquier wrote: >On Mon, Aug 19, 2019 at 01:15:19AM +0300, Alexander Korotkov wrote: >> The revised patch seems to fix all of above. > >The latest patch is failing to apply. Please provide a rebase. This still does not apply (per cputube). Can you provide a fixed patch? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jan 17, 2020 at 1:05 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > On Fri, Nov 29, 2019 at 03:03:01PM +0900, Michael Paquier wrote: > >On Mon, Aug 19, 2019 at 01:15:19AM +0300, Alexander Korotkov wrote: > >> The revised patch seems to fix all of above. > > > >The latest patch is failing to apply. Please provide a rebase. > > This still does not apply (per cputube). Can you provide a fixed patch? Rebased patch is attached. Sorry for so huge delay. ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On Wed, Jan 22, 2020 at 6:41 PM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > Rebased patch is attached. Sorry for so huge delay. I really like this patch. Your interest in amcheck is something that makes me feel good about having put so much work into it myself. Here are some review comments: > + /* > + * Rightlink and incomplete split flag of previous block referenced by > + * downlink. > + */ > + BlockNumber prevrightlink; > + bool previncompletesplit; > + What downlink? What does this mean? Do you mean the most recently followed rightlink on the current level, or InvalidBlockNumber if target page is the leftmost page on the current level on the scan? (Thinks some more...) Actually, these two new fields track the state *one level down* from the target page level when !readonly (unless target page is on the leaf level). Right? Comments should be explicit about this. The current comments about downlinks isn't clear. > if (offset_is_negative_infinity(topaque, offset)) > + { > + /* > + * Initialize downlink connectivity check if needed. > + */ > + if (!P_ISLEAF(topaque) && state->readonly) > + { > + bt_downlink_connectivity_check(state, > + offset, > + NULL, > + topaque->btpo.level); > + } > continue; > + } Don't need the "!P_ISLEAF()" here. Also, you should say something like "we need to call this here because the usual callsite in bt_downlink_check() won't be reached". > /* > - * * Check if page has a downlink in parent * > - * > - * This can only be checked in heapallindexed + readonly case. > + * If we traversed the whole level to the rightmost page, there might be > + * missing downlinks for the pages to the right of rightmost downlink. > + * Check for them. > */ You mean "to the right of the child page pointed to by our rightmost downlink"? I think that the final bt_downlink_connectivity_check() call within bt_target_page_check() should make it clear that it is kind of special compared to the other two calls. > +/* > + * Check connectivity of downlinks. Traverse rightlinks from previous downlink > + * to the current one. Check that there are no intermediate pages with missing > + * downlinks. > + * > + * If 'loaded_page' is given, it's assumed to be contents of downlink > + * referenced by 'downlinkoffnum'. > + */ Say "assumed to be the page pointed to by the downlink", perhaps? > +static void > +bt_downlink_connectivity_check(BtreeCheckState *state, > + OffsetNumber downlinkoffnum, > + Page loaded_page, > + uint32 parent_level) > +{ In amcheck, we always have a current target page. Every page gets to be the target exactly once, though sometimes other subsidiary pages are accessed. We try to blame the target page, even with checks that are technically against its child/sibling/whatever. The target page is always our conceptual point of reference. Sometimes this is a bit artificial, but it's still worth doing consistently. So I think you should change these argument names with that in mind (see below). > + /* > + * If we visit page with high key, check that it should be equal to > + * the target key next to corresponding downlink. > + */ I suggest "...check that it is equal to the target key..." > + /* > + * There might be two situations when we examine high key. If > + * current child page is referenced by given downlink, we should > + * look to the next offset number for matching key. You mean "the next offset number for the matching key from the target page"? I find it much easier to keep this stuff in my head if everything is defined in terms of its relationship with the current target page. For example, bt_downlink_connectivity_check()'s "parent_level" argument should be called "target_level" instead, while its "loaded_page" should be called "loaded_child". Maybe "downlinkoffnum" should be "target_downlinkoffnum". And downlinkoffnum should definitely be explained in comments at the top of bt_downlink_connectivity_check() (e.g., say what it means when it is InvalidOffsetNumber). > Alternatively > + * we might find child with high key while traversing from > + * previous downlink to current one. Then matching key resides > + * the same offset number as current downlink. > + */ Not sure what "traversing from previous downlink to current one" means at all. > + if (!offset_is_negative_infinity(topaque, pivotkey_offset) && > + pivotkey_offset <= PageGetMaxOffsetNumber(state->target)) > + { > + uint32 cmp = _bt_compare(state->rel, > + skey, > + state->target, > + pivotkey_offset); There is no need to bother with a _bt_compare() here. Why not just use memcmp() with a pointer to itup->t_tid.ip_posid (i.e. memcmp() that skips the block number)? I think that it is better to expect the keys to be *identical* among pivot tuples, including within tuple alignment padding (only the downlink block number can be different here). If non-pivot tuples were involved then you couldn't do it this way, but they're never involved, so it makes sense. A memcmp() will be faster, obviously. More importantly, it has the advantage of not relying on opclass infrastructure in any way. It might be worth adding an internal verify_nbtree.c static helper function to do the memcmp() for you -- bt_pivot_tuple_identical(), or something like that. I think bt_downlink_check() and bt_downlink_connectivity_check() should be renamed to something broader. In my mind, downlink is basically a block number. We have been sloppy about using the term downlink when we really mean "pivot tuple with a downlink" -- I am guilty of this myself. But it seems more important, now that you have the new high key check. I particularly don't like the way you sometimes say "downlink" when you mean "child page". You do that in this error message: > + (errcode(ERRCODE_INDEX_CORRUPTED), > + errmsg("block found while traversing rightlinks from downlink of index \"%s\" has invalid level", > + RelationGetRelationName(state->rel)), Typo here: > + /* > + * If no previos rightlink is memorized, get it from current downlink for > + * future usage. > + */ You mean "previous". Also, I think that you should say "memorized for current level just below target page's level". > * within bt_check_level_from_leftmost() won't reach the page either, > * since the leaf's live siblings should have their sibling links updated > - * to bypass the deletion target page when it is marked fully dead.) > + * to bypass the deletion page under check when it is marked fully dead.) > * This change seems wrong or unnecessary -- "deletion target" means "page undergoing deletion" (not necessarily marked P_ISDELETED() just yet), and has nothing to do with the amcheck target. You can change this if you want, but I don't get it. I tested this by using pg_hexedit to corrupt the least significant byte of a text key in the root page: pg@tpce:5432 [32610]=# select bt_index_parent_check('pk_holding'); DEBUG: verifying level 2 (true root level) DEBUG: verifying 9 items on internal block 290 DEBUG: verifying level 1 DEBUG: verifying 285 items on internal block 3 ERROR: mismatch between parent key and child high key index "pk_holding" DETAIL: Parent block=3 child block=9 parent page lsn=998/EFA21550. Happy to see that this works, even though this is one of the subtlest possible forms of index corruption. Previously, we could sometimes catch this with "rootdescend" verification, but only if there were *current* items that a scan couldn't find on lower levels (often just the leaf level). But now it doesn't matter -- we'll always detect it. (I think.) Shouldn't this error message read '...in index "pk_holding"'? You missed the "in". Also, why not have the DETAIL message call the "Parent block" the target block? I think that bt_downlink_connectivity_check() should have some high-level comments about what it's supposed to do. Perhaps an example is the best way to explain the concepts. Maybe say something about a three level B-Tree. Each of the separator keys in the grandparent/root page should also appear as high keys at the parent level. Each of the separator keys in the parent level should also appear as high keys on the leaf level, including the separators from the parent level high keys. Since each separator defines which subtrees are <= and > of the separator, there must be an identical seam of separators (in high keys) on lower levels. bt_downlink_connectivity_check() verifies that separator keys agree across a single level, which verifies the integrity of the whole tree. (Thinks some more...) Actually, this patch doesn't quite manage to verify that there is this "unbroken seam" of separator keys from the root to the leaf, so my suggested wording is kind of wrong -- but I think we can fix this weakness. The specific weakness that I saw and verified exists is as follows: If I corrupt the high key of most of the leaf pages in a multi-level index just a little bit (by once again corrupting the least significant byte of the key using pg_hexedit), then the new check alone will usually detect the problem, which is good. However, if I deliberately pick a leaf page that happens to be the rightmost child of some internal page, then it is a different story -- even the new check won't detect the problem (the existing rootdescend check may or may not detect the problem, depending on the current non-pivot tuples near the leaf high key in question). There is no real reason why we shouldn't be able to detect this problem, though. The solution is to recognize that we sometimes need to use the target/parent page high key separator -- not the separator key from some pivot tuple in parent that also contains a downlink. Goetz Graefe calls this "the cousin problem" [1]. To be more precise, I think that the "pivotkey_offset <= PageGetMaxOffsetNumber(state->target))" part of this test can be improved: > + /* > + * There might be two situations when we examine high key. If > + * current child page is referenced by given downlink, we should > + * look to the next offset number for matching key. Alternatively > + * we might find child with high key while traversing from > + * previous downlink to current one. Then matching key resides > + * the same offset number as current downlink. > + */ > + if (blkno == downlink) > + pivotkey_offset = OffsetNumberNext(downlinkoffnum); > + else > + pivotkey_offset = downlinkoffnum; > + > + topaque = (BTPageOpaque) PageGetSpecialPointer(state->target); > + > + if (!offset_is_negative_infinity(topaque, pivotkey_offset) && > + pivotkey_offset <= PageGetMaxOffsetNumber(state->target)) When OffsetNumberNext(downlinkoffnum) exceeds the PageGetMaxOffsetNumber(state->target), doesn't that actually mean that the high key offset (i.e. P_HIKEY) should be used to get an item from the next level up? We can still correctly detect the problem that way. Remember, the high key on an internal page is a tuple that contains a separator but no downlink, which is really not that special within an internal page -- if you think about it from the Lehman & Yao perspective. So we should take the pivot tuple from the parent at offset P_HIKEY, and everything can work just the same. That said, I guess you really do need the "!offset_is_negative_infinity(topaque, pivotkey_offset)" part of the test. The only other possibility is that you access the target/parent page's downlink + separator in its own parent page (i.e. the grandparent of the page whose high key might be corrupt), which is significantly more complexity -- that may not be worth it. (If you did this, you would have to teach the code the difference between "absolute" negative infinity in the leftmost leaf page on the leaf level, and "subtree relative" negative infinity for other leaf pages that are merely leftmost within a subtree). In summary: I suppose that we can also solve "the cousin problem" quite easily, but only for rightmost cousins within a subtree -- leftmost cousins might be too messy to verify for it to be worth it. We don't want to have to jump two or three levels up within bt_downlink_connectivity_check() just for leftmost cousin pages. But maybe you're feeling ambitious! What do you think? Note: There is an existing comment about this exact negative infinity business within bt_downlink_check(). It starts with "Negative inifinity items can be thought of as a strict lower bound that works transitively...". There should probably be some comment updates to this comment block as part of this patch. [1] https://pdfs.semanticscholar.org/fd45/15ab23c00231d96c95c1091459d0d1eebfae.pdf -- Peter Geoghegan
On Thu, Jan 23, 2020 at 5:31 PM Peter Geoghegan <pg@bowt.ie> wrote: > In summary: I suppose that we can also solve "the cousin problem" > quite easily, but only for rightmost cousins within a subtree -- > leftmost cousins might be too messy to verify for it to be worth it. > We don't want to have to jump two or three levels up within > bt_downlink_connectivity_check() just for leftmost cousin pages. But > maybe you're feeling ambitious! What do you think? I suppose the alternative is to get the high key of the parent's left sibling, rather than going to the parent's parent (i.e. our grandparent). That would probably be the best way to get a separator key to compare against the high key in the leftmost cousin page of a subtree, if in fact we wanted to *fully* solve the "cousin problem". Goetz Graefe recommends keeping both a low key and a high key in every page for verification purposes. We don't actually have a low key (we only have a truncated negative infinity item), but this approach isn't that far off having a low key. -- Peter Geoghegan
On Thu, Jan 23, 2020 at 5:40 PM Peter Geoghegan <pg@bowt.ie> wrote: > I suppose the alternative is to get the high key of the parent's left > sibling, rather than going to the parent's parent (i.e. our > grandparent). That would probably be the best way to get a separator > key to compare against the high key in the leftmost cousin page of a > subtree, if in fact we wanted to *fully* solve the "cousin problem". I think I've confused myself here. The "!offset_is_negative_infinity(topaque, pivotkey_offset)" part of the bt_downlink_connectivity_check() high key check test that I mentioned now *also* seem unnecessary. Any high key in a page that isn't marked "half-dead" or marked "deleted" or marked "has incomplete split" can be targeted by the check. Once the page meets that criteria, there must be a pivot tuple in the parent page that contains an identical-to-highkey separator key (this could be the parent's own high key). The only thing that you need to do is be careful about rightmost parent pages of a non-rightmost page -- stuff like that. But, I think that that's only needed because an amcheck segfault isn't a very nice way to detect corruption. The existing comment about negative infinity items within bt_downlink_check() that I mentioned in my main e-mail (the "Note:" part) don't quite apply here. You're not verifying a pivot tuple that has a downlink (which might be negative infinity) against the lower levels -- you're doing *the opposite*. That is, you're verifying a high key using the parent. Which seems like the right way to do it -- you can test for the incomplete split flag and so on by doing it bottom up (not top down). This must have been why I was confused. Phew! -- Peter Geoghegan
Hi, Peter! On Fri, Jan 24, 2020 at 4:31 AM Peter Geoghegan <pg@bowt.ie> wrote: > On Wed, Jan 22, 2020 at 6:41 PM Alexander Korotkov > <a.korotkov@postgrespro.ru> wrote: > > Rebased patch is attached. Sorry for so huge delay. > > I really like this patch. Your interest in amcheck is something that > makes me feel good about having put so much work into it myself. > > Here are some review comments: Great, thank you very much! > > + /* > > + * Rightlink and incomplete split flag of previous block referenced by > > + * downlink. > > + */ > > + BlockNumber prevrightlink; > > + bool previncompletesplit; > > + > > What downlink? What does this mean? Do you mean the most recently > followed rightlink on the current level, or InvalidBlockNumber if > target page is the leftmost page on the current level on the scan? > > (Thinks some more...) > > Actually, these two new fields track the state *one level down* from > the target page level when !readonly (unless target page is on the > leaf level). Right? Comments should be explicit about this. The > current comments about downlinks isn't clear. I agree. I've used very vague terms in comments. Revised now. > > if (offset_is_negative_infinity(topaque, offset)) > > + { > > + /* > > + * Initialize downlink connectivity check if needed. > > + */ > > + if (!P_ISLEAF(topaque) && state->readonly) > > + { > > + bt_downlink_connectivity_check(state, > > + offset, > > + NULL, > > + topaque->btpo.level); > > + } > > continue; > > + } > > Don't need the "!P_ISLEAF()" here. Why don't I need. bt_downlink_connectivity_check() checks one level down to the target level. But there is no one level down to leaf... > Also, you should say something like > "we need to call this here because the usual callsite in > bt_downlink_check() won't be reached". Sure, fixed. > > /* > > - * * Check if page has a downlink in parent * > > - * > > - * This can only be checked in heapallindexed + readonly case. > > + * If we traversed the whole level to the rightmost page, there might be > > + * missing downlinks for the pages to the right of rightmost downlink. > > + * Check for them. > > */ > > You mean "to the right of the child page pointed to by our rightmost downlink"? Yep, fixed. > I think that the final bt_downlink_connectivity_check() call within > bt_target_page_check() should make it clear that it is kind of special > compared to the other two calls. Yes, this is fixed too. > > +/* > > + * Check connectivity of downlinks. Traverse rightlinks from previous downlink > > + * to the current one. Check that there are no intermediate pages with missing > > + * downlinks. > > + * > > + * If 'loaded_page' is given, it's assumed to be contents of downlink > > + * referenced by 'downlinkoffnum'. > > + */ > > Say "assumed to be the page pointed to by the downlink", perhaps? Yes, fixed. > > +static void > > +bt_downlink_connectivity_check(BtreeCheckState *state, > > + OffsetNumber downlinkoffnum, > > + Page loaded_page, > > + uint32 parent_level) > > +{ > > In amcheck, we always have a current target page. Every page gets to > be the target exactly once, though sometimes other subsidiary pages > are accessed. We try to blame the target page, even with checks that > are technically against its child/sibling/whatever. The target page is > always our conceptual point of reference. Sometimes this is a bit > artificial, but it's still worth doing consistently. So I think you > should change these argument names with that in mind (see below). Yes, the arguments were changes as you proposed. > > + /* > > + * If we visit page with high key, check that it should be equal to > > + * the target key next to corresponding downlink. > > + */ > > I suggest "...check that it is equal to the target key..." Agree, fixed. > > + /* > > + * There might be two situations when we examine high key. If > > + * current child page is referenced by given downlink, we should > > + * look to the next offset number for matching key. > > You mean "the next offset number for the matching key from the target > page"? I find it much easier to keep this stuff in my head if > everything is defined in terms of its relationship with the current > target page. For example, bt_downlink_connectivity_check()'s > "parent_level" argument should be called "target_level" instead, while > its "loaded_page" should be called "loaded_child". Maybe > "downlinkoffnum" should be "target_downlinkoffnum". And downlinkoffnum > should definitely be explained in comments at the top of > bt_downlink_connectivity_check() (e.g., say what it means when it is > InvalidOffsetNumber). Renamed as you proposed. > > Alternatively > > + * we might find child with high key while traversing from > > + * previous downlink to current one. Then matching key resides > > + * the same offset number as current downlink. > > + */ > > Not sure what "traversing from previous downlink to current one" means at all. I've rephrased this comment, please check. > > + if (!offset_is_negative_infinity(topaque, pivotkey_offset) && > > + pivotkey_offset <= PageGetMaxOffsetNumber(state->target)) > > + { > > + uint32 cmp = _bt_compare(state->rel, > > + skey, > > + state->target, > > + pivotkey_offset); > > There is no need to bother with a _bt_compare() here. Why not just use > memcmp() with a pointer to itup->t_tid.ip_posid (i.e. memcmp() that > skips the block number)? I think that it is better to expect the keys > to be *identical* among pivot tuples, including within tuple alignment > padding (only the downlink block number can be different here). If > non-pivot tuples were involved then you couldn't do it this way, but > they're never involved, so it makes sense. A memcmp() will be faster, > obviously. More importantly, it has the advantage of not relying on > opclass infrastructure in any way. It might be worth adding an > internal verify_nbtree.c static helper function to do the memcmp() for > you -- bt_pivot_tuple_identical(), or something like that. Agree, replaced _bt_compare() with bt_pivot_tuple_identical(). It becomes even simpler now, thanks! > I think bt_downlink_check() and bt_downlink_connectivity_check() > should be renamed to something broader. In my mind, downlink is > basically a block number. We have been sloppy about using the term > downlink when we really mean "pivot tuple with a downlink" -- I am > guilty of this myself. But it seems more important, now that you have > the new high key check. Hmm... Names are hard for me. I didn't do any renaming for now. What about this? bt_downlink_check() => bt_child_check() bt_downlink_connectivity_check() => bt_child_connectivity_highkey_check() > I particularly don't like the way you sometimes say "downlink" when > you mean "child page". You do that in this error message: > > > + (errcode(ERRCODE_INDEX_CORRUPTED), > > + errmsg("block found while traversing rightlinks from downlink of index \"%s\" has invalid level", > > + RelationGetRelationName(state->rel)), Agree. Error message has been changed. > Typo here: > > > + /* > > + * If no previos rightlink is memorized, get it from current downlink for > > + * future usage. > > + */ > > You mean "previous". Also, I think that you should say "memorized for > current level just below target page's level". Yep, fixed. > > * within bt_check_level_from_leftmost() won't reach the page either, > > * since the leaf's live siblings should have their sibling links updated > > - * to bypass the deletion target page when it is marked fully dead.) > > + * to bypass the deletion page under check when it is marked fully dead.) > > * > > This change seems wrong or unnecessary -- "deletion target" means > "page undergoing deletion" (not necessarily marked P_ISDELETED() just > yet), and has nothing to do with the amcheck target. You can change > this if you want, but I don't get it. Seems like random oversight. Change removed. > I tested this by using pg_hexedit to corrupt the least significant > byte of a text key in the root page: > > pg@tpce:5432 [32610]=# select bt_index_parent_check('pk_holding'); > DEBUG: verifying level 2 (true root level) > DEBUG: verifying 9 items on internal block 290 > DEBUG: verifying level 1 > DEBUG: verifying 285 items on internal block 3 > ERROR: mismatch between parent key and child high key index "pk_holding" > DETAIL: Parent block=3 child block=9 parent page lsn=998/EFA21550. > > Happy to see that this works, even though this is one of the subtlest > possible forms of index corruption. Previously, we could sometimes > catch this with "rootdescend" verification, but only if there were > *current* items that a scan couldn't find on lower levels (often just > the leaf level). But now it doesn't matter -- we'll always detect it. > (I think.) Thank you for testing! > Shouldn't this error message read '...in index "pk_holding"'? You > missed the "in". Also, why not have the DETAIL message call the > "Parent block" the target block? Yep, fixed. > I think that bt_downlink_connectivity_check() should have some > high-level comments about what it's supposed to do. Perhaps an example > is the best way to explain the concepts. Maybe say something about a > three level B-Tree. Each of the separator keys in the grandparent/root > page should also appear as high keys at the parent level. Each of the > separator keys in the parent level should also appear as high keys on > the leaf level, including the separators from the parent level high > keys. Since each separator defines which subtrees are <= and > of the > separator, there must be an identical seam of separators (in high > keys) on lower levels. bt_downlink_connectivity_check() verifies that > separator keys agree across a single level, which verifies the > integrity of the whole tree. I've revised finding the matching pivot key for high key. Now, code assumes we should always find a matching pivot key. It could use both target high key or left sibling high key (which is memorized as "low key"). I've checked this on normal indexes, but I didn't try to exercise this with broken indexes. I would appreciate if you do. ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On Tue, Feb 18, 2020 at 1:15 PM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > On Fri, Jan 24, 2020 at 4:31 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Wed, Jan 22, 2020 at 6:41 PM Alexander Korotkov > > <a.korotkov@postgrespro.ru> wrote: > > > Rebased patch is attached. Sorry for so huge delay. > > > > I really like this patch. Your interest in amcheck is something that > > makes me feel good about having put so much work into it myself. > > > > Here are some review comments: > > Great, thank you very much! Sorry, I've forgot to commit some comments before publishing a patch. The right version is attached. ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On Tue, Feb 18, 2020 at 2:16 AM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > Great, thank you very much! No problem! My remarks here are based on "amcheck-btree-improve-missing-parent-downlinks-check-6.patch". I have found a false positive corruption report bug in this latest version -- see note below about incomplete page splits. > > Don't need the "!P_ISLEAF()" here. > > Why don't I need. bt_downlink_connectivity_check() checks one level > down to the target level. But there is no one level down to leaf... Because offset_is_negative_infinity() checks P_ISLEAF() for you. Maybe it's better your way, though -- apparently it's clearer. > > > Alternatively > > > + * we might find child with high key while traversing from > > > + * previous downlink to current one. Then matching key resides > > > + * the same offset number as current downlink. > > > + */ > > > > Not sure what "traversing from previous downlink to current one" means at all. > > I've rephrased this comment, please check. > Agree, replaced _bt_compare() with bt_pivot_tuple_identical(). It > becomes even simpler now, thanks! There was actually an even better reason to invent bt_pivot_tuple_identical(): a call to _bt_compare() in amcheck needs to do something like the extra steps that you see in routines like invariant_l_offset(). _bt_compare() will return 0 when the insertion scankey has a prefix of scankey/column values that are equal, even though there may be additional columns in the index tuple that are not compared. So, you could have a truncated multi-column high key that is "equal" to pivot tuple in parent that is actually to the right in the key space. This blind spot would often occur with low cardinality indexes, where we often have something like this in pivot tuples on internal pages: 'foo, -inf' 'foo, (1,24)' 'food, -inf'. <-- This pivot tuple's downlink points to the final leaf page that's filled with duplicates of the value 'foo' 'food, (3,19)' <-- This pivot tuple's downlink points to the *first* leaf page that's filled with duplicates of the value 'food' ... The situation is really common in low cardinality indexes because nbtsplitloc.c hates splitting a leaf page between two duplicates -- it is avoided whenever possible. You reliably get a '-inf' value for the TID in the first pivot tuple for the duplicate, followed by a real heap TID for later pivot tuples for pages with the same duplicate value. (Anyway, it's not important now.) > > I think bt_downlink_check() and bt_downlink_connectivity_check() > > should be renamed to something broader. In my mind, downlink is > > basically a block number. We have been sloppy about using the term > > downlink when we really mean "pivot tuple with a downlink" -- I am > > guilty of this myself. But it seems more important, now that you have > > the new high key check. > > Hmm... Names are hard for me. I didn't do any renaming for now. What > about this? > bt_downlink_check() => bt_child_check() > bt_downlink_connectivity_check() => bt_child_connectivity_highkey_check() I suggest: bt_downlink_check() => bt_child_check() bt_downlink_connectivity_check() => bt_child_highkey_check() While bt_downlink_connectivity_check() moves right on the target's child level, this isn't the common case. Moving right like that doesn't need to be suggested by the name of the function. Most of the time, we just check the high key -- right? > I've revised finding the matching pivot key for high key. Now, code > assumes we should always find a matching pivot key. It could use both > target high key or left sibling high key (which is memorized as "low > key"). > > I've checked this on normal indexes, but I didn't try to exercise this > with broken indexes. I would appreciate if you do. I can confirm that checking the high key in target's child page against the high key in target (when appropriate) removes the "cousin page verification blind spot" that I noticed in the last version, as expected. Great! * You should say "target page lsn" here instead: pg@tpce:5432 [19852]=# select bt_index_parent_check(:'idxrelation',true, true); ERROR: mismatch between parent key and child high key in index "i_holding2" DETAIL: Target block=1570 child block=1690 parent page lsn=0/0. Time: 12.509 ms * Maybe say "Move to the right on the child level" in a comment above the bt_downlink_connectivity_check() "while (true)" loop here: > + > + while (true) > + { > + /* > + * Did we traverse the whole tree level and this is check for pages to > + * the right of rightmost downlink? > + */ * If you are going to save a low key for the target page in memory, then you only need to do so for "state->readonly"/parent verification. * You should s/lokey/lowkey/ -- I prefer the spelling "lowkey" or "low key". This is a term that nbtsort.c now uses, in case you didn't know. * The reason for saving a low key for each target page is very unclear. Can we fix this? The low key hardly ever gets used -- I can comment out the code that uses a low key within bt_downlink_connectivity_check(), and indexes that I tested appear fine. So I imagine this is for incomplete split cases -- right? I tested incomplete split cases using this simple hack: diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c index 4e5849ab8e..5811338584 100644 --- a/src/backend/access/nbtree/nbtinsert.c +++ b/src/backend/access/nbtree/nbtinsert.c @@ -1821,7 +1821,7 @@ _bt_insert_parent(Relation rel, */ _bt_relbuf(rel, rbuf); - if (pbuf == InvalidBuffer) + if (random() <= (MAX_RANDOM_VALUE / 100)) ereport(ERROR, (errcode(ERRCODE_INDEX_CORRUPTED), errmsg_internal("failed to re-find parent key in index \"%s\" for split pages %u/%u", Now, 1% of all page splits will fail after the first phase finishes, but before the second phase begins/finishes. The incomplete split flag will be found set by other, future inserts, though, at which point the split will be finished (unless we're unlucky again). An INSERT that inserts many rows to bound to fail with this hack applied, but the INSERT shouldn't leave the index in a state that looks like corruption to amcheck. I can see false positive reports of corruption like this, though. Consider the following sample session. The session inserts data from one table into another table with an equivalent schema. As you can see, I have to get my INSERT to fail three times before I see the problem: pg@tpce:5432 [13026]=# truncate holding2; TRUNCATE TABLE pg@tpce:5432 [13026]=# insert into holding2 select * from holding; ERROR: failed to re-find parent key in index "pk_holding2" for split pages 83/158 pg@tpce:5432 [13026]=# select bt_index_parent_check('i_holding2',true, true); DEBUG: verifying level 1 (true root level) DEBUG: verifying 200 items on internal block 3 DEBUG: verifying level 0 (leaf level) DEBUG: verifying 262 items on leaf block 1 DEBUG: verifying 258 items on leaf block 2 *** SNIP *** DEBUG: verifying 123 items on leaf block 201 DEBUG: verifying that tuples from index "i_holding2" are present in "holding2" DEBUG: finished verifying presence of 0 tuples from table "holding2" with bitset 4.83% set ┌───────────────────────┐ │ bt_index_parent_check │ <-- No false positive yet ├───────────────────────┤ │ │ └───────────────────────┘ (1 row) pg@tpce:5432 [13026]=# insert into holding2 select * from holding; DEBUG: finishing incomplete split of 83/158 ERROR: failed to re-find parent key in index "i_holding2" for split pages 435/436 pg@tpce:5432 [13026]=# select bt_index_parent_check(:'idxrelation',true, true); DEBUG: verifying level 2 (true root level) DEBUG: verifying 2 items on internal block 303 *** SNIP *** DEBUG: verifying 74 items on leaf block 436 DEBUG: verifying that tuples from index "i_holding2" are present in "holding2" DEBUG: finished verifying presence of 0 tuples from table "holding2" with bitset 9.97% set ┌───────────────────────┐ │ bt_index_parent_check │ <-- No false positive yet ├───────────────────────┤ │ │ └───────────────────────┘ (1 row) pg@tpce:5432 [13026]=# insert into holding2 select * from holding; ERROR: failed to re-find parent key in index "i_holding2" for split pages 331/577 pg@tpce:5432 [13026]=# select bt_index_parent_check(:'idxrelation',true, true); DEBUG: verifying level 2 (true root level) DEBUG: verifying 3 items on internal block 303 DEBUG: verifying level 1 DEBUG: verifying 151 items on internal block 3 DEBUG: verifying 189 items on internal block 521 DEBUG: verifying 233 items on internal block 302 WARNING: 577 rightsplit ERROR: leaf index block lacks downlink in index "i_holding2". <-- False positive "corruption" DETAIL: Block=127 page lsn=0/0. Notice I added a custom WARNING here, and we get the false positive error just after the point that the WARNING is seen. This WARNING is triggered by the temporary instrumentation change that I added to amcheck, and can be seen in some cases that don't have an accompanying false positive report of corruption: diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c index d6cb10e263..dcfee9e4fa 100644 --- a/contrib/amcheck/verify_nbtree.c +++ b/contrib/amcheck/verify_nbtree.c @@ -1654,6 +1654,8 @@ bt_downlink_connectivity_check(BtreeCheckState *state, errmsg("circular link chain found in block %u of index \"%s\"", blkno, RelationGetRelationName(state->rel)))); + if (rightsplit) + elog(WARNING, "%u rightsplit", blkno); if (!first && !P_IGNORE(opaque)) { /* blkno probably has missing parent downlink */ *** END DIFF *** Anyway, I didn't take the time to debug this today, but I suspect that the problem is that the lowkey thing is kind of brittle. If you can't figure it out, let me know and I'll help with it. * Can we reverse the order here, so that the common case (i.e. the offset_is_negative_infinity() case) comes first?: > + if (offset_is_negative_infinity(topaque, pivotkey_offset)) > + { > + /* > + * We're going to try match child high key to "negative > + * infinity key". That means we should match to high key of > + * left sibling of target page. > + */ *** SNIP *** > + } > + else > + { *** SNIP *** > + } * Can the offset_is_negative_infinity() branch explain what's going on at a much higher level than this? -- Peter Geoghegan
Hi! Thank you for your review. The revised version is attached. On Wed, Feb 19, 2020 at 1:16 AM Peter Geoghegan <pg@bowt.ie> wrote: > On Tue, Feb 18, 2020 at 2:16 AM Alexander Korotkov > <a.korotkov@postgrespro.ru> wrote: > > > Don't need the "!P_ISLEAF()" here. > > > > Why don't I need. bt_downlink_connectivity_check() checks one level > > down to the target level. But there is no one level down to leaf... > > Because offset_is_negative_infinity() checks P_ISLEAF() for you. Maybe > it's better your way, though -- apparently it's clearer. Oh, I see. I prefer to leave it my way. Explicit check is nearly free and makes it clearer for me. > > > > Alternatively > > > > + * we might find child with high key while traversing from > > > > + * previous downlink to current one. Then matching key resides > > > > + * the same offset number as current downlink. > > > > + */ > > > > > > Not sure what "traversing from previous downlink to current one" means at all. > > > > I've rephrased this comment, please check. > > > Agree, replaced _bt_compare() with bt_pivot_tuple_identical(). It > > becomes even simpler now, thanks! > > There was actually an even better reason to invent > bt_pivot_tuple_identical(): a call to _bt_compare() in amcheck needs > to do something like the extra steps that you see in routines like > invariant_l_offset(). _bt_compare() will return 0 when the insertion > scankey has a prefix of scankey/column values that are equal, even > though there may be additional columns in the index tuple that are not > compared. So, you could have a truncated multi-column high key that is > "equal" to pivot tuple in parent that is actually to the right in the > key space. This blind spot would often occur with low cardinality > indexes, where we often have something like this in pivot tuples on > internal pages: > > 'foo, -inf' > 'foo, (1,24)' > 'food, -inf'. <-- This pivot tuple's downlink points to the final leaf > page that's filled with duplicates of the value 'foo' > 'food, (3,19)' <-- This pivot tuple's downlink points to the *first* > leaf page that's filled with duplicates of the value 'food' > ... > > The situation is really common in low cardinality indexes because > nbtsplitloc.c hates splitting a leaf page between two duplicates -- it > is avoided whenever possible. You reliably get a '-inf' value for the > TID in the first pivot tuple for the duplicate, followed by a real > heap TID for later pivot tuples for pages with the same duplicate > value. > Thank you for the explanation! > > > I think bt_downlink_check() and bt_downlink_connectivity_check() > > > should be renamed to something broader. In my mind, downlink is > > > basically a block number. We have been sloppy about using the term > > > downlink when we really mean "pivot tuple with a downlink" -- I am > > > guilty of this myself. But it seems more important, now that you have > > > the new high key check. > > > > Hmm... Names are hard for me. I didn't do any renaming for now. What > > about this? > > bt_downlink_check() => bt_child_check() > > bt_downlink_connectivity_check() => bt_child_connectivity_highkey_check() > > I suggest: > > bt_downlink_check() => bt_child_check() > bt_downlink_connectivity_check() => bt_child_highkey_check() > > While bt_downlink_connectivity_check() moves right on the target's > child level, this isn't the common case. Moving right like that > doesn't need to be suggested by the name of the function. > > Most of the time, we just check the high key -- right? Good. Renamed as you proposed. > * You should say "target page lsn" here instead: > > pg@tpce:5432 [19852]=# select bt_index_parent_check(:'idxrelation',true, true); > ERROR: mismatch between parent key and child high key in index "i_holding2" > DETAIL: Target block=1570 child block=1690 parent page lsn=0/0. > Time: 12.509 ms Yep, fixed. > * Maybe say "Move to the right on the child level" in a comment above > the bt_downlink_connectivity_check() "while (true)" loop here: Comment is added. > > + > > + while (true) > > + { > > + /* > > + * Did we traverse the whole tree level and this is check for pages to > > + * the right of rightmost downlink? > > + */ > > * If you are going to save a low key for the target page in memory, > then you only need to do so for "state->readonly"/parent verification. Sure, now it's just for state->readonly. > * You should s/lokey/lowkey/ -- I prefer the spelling "lowkey" or "low > key". This is a term that nbtsort.c now uses, in case you didn't know. Changed, thank you for pointing. > * The reason for saving a low key for each target page is very > unclear. Can we fix this? Actually, lowkey is used for removing "cousin page verification blind spot" when there are incomplete splits. It might happen that we read child with hikey matching its parent high key only when bt_child_highkey_check() is called for "minus infinity" key of parent right sibling. Saving low key helps in this case. > The low key hardly ever gets used -- I can comment out the code that > uses a low key within bt_downlink_connectivity_check(), and indexes > that I tested appear fine. So I imagine this is for incomplete split > cases -- right? > > I tested incomplete split cases using this simple hack: > > diff --git a/src/backend/access/nbtree/nbtinsert.c > b/src/backend/access/nbtree/nbtinsert.c > index 4e5849ab8e..5811338584 100644 > --- a/src/backend/access/nbtree/nbtinsert.c > +++ b/src/backend/access/nbtree/nbtinsert.c > @@ -1821,7 +1821,7 @@ _bt_insert_parent(Relation rel, > */ > _bt_relbuf(rel, rbuf); > > - if (pbuf == InvalidBuffer) > + if (random() <= (MAX_RANDOM_VALUE / 100)) > ereport(ERROR, > (errcode(ERRCODE_INDEX_CORRUPTED), > errmsg_internal("failed to re-find parent key in > index \"%s\" for split pages %u/%u", > > Now, 1% of all page splits will fail after the first phase finishes, > but before the second phase begins/finishes. The incomplete split flag > will be found set by other, future inserts, though, at which point the > split will be finished (unless we're unlucky again). > > An INSERT that inserts many rows to bound to fail with this hack > applied, but the INSERT shouldn't leave the index in a state that > looks like corruption to amcheck. I can see false positive reports of > corruption like this, though. > > Consider the following sample session. The session inserts data from > one table into another table with an equivalent schema. As you can > see, I have to get my INSERT to fail three times before I see the > problem: > > pg@tpce:5432 [13026]=# truncate holding2; > TRUNCATE TABLE > pg@tpce:5432 [13026]=# insert into holding2 select * from holding; > ERROR: failed to re-find parent key in index "pk_holding2" for split > pages 83/158 > pg@tpce:5432 [13026]=# select bt_index_parent_check('i_holding2',true, true); > DEBUG: verifying level 1 (true root level) > DEBUG: verifying 200 items on internal block 3 > DEBUG: verifying level 0 (leaf level) > DEBUG: verifying 262 items on leaf block 1 > DEBUG: verifying 258 items on leaf block 2 > *** SNIP *** > DEBUG: verifying 123 items on leaf block 201 > DEBUG: verifying that tuples from index "i_holding2" are present in "holding2" > DEBUG: finished verifying presence of 0 tuples from table "holding2" > with bitset 4.83% set > ┌───────────────────────┐ > │ bt_index_parent_check │ <-- No false positive yet > ├───────────────────────┤ > │ │ > └───────────────────────┘ > (1 row) > > pg@tpce:5432 [13026]=# insert into holding2 select * from holding; > DEBUG: finishing incomplete split of 83/158 > ERROR: failed to re-find parent key in index "i_holding2" for split > pages 435/436 > pg@tpce:5432 [13026]=# select bt_index_parent_check(:'idxrelation',true, true); > DEBUG: verifying level 2 (true root level) > DEBUG: verifying 2 items on internal block 303 > *** SNIP *** > DEBUG: verifying 74 items on leaf block 436 > DEBUG: verifying that tuples from index "i_holding2" are present in "holding2" > DEBUG: finished verifying presence of 0 tuples from table "holding2" > with bitset 9.97% set > ┌───────────────────────┐ > │ bt_index_parent_check │ <-- No false positive yet > ├───────────────────────┤ > │ │ > └───────────────────────┘ > (1 row) > > pg@tpce:5432 [13026]=# insert into holding2 select * from holding; > ERROR: failed to re-find parent key in index "i_holding2" for split > pages 331/577 > pg@tpce:5432 [13026]=# select bt_index_parent_check(:'idxrelation',true, true); > DEBUG: verifying level 2 (true root level) > DEBUG: verifying 3 items on internal block 303 > DEBUG: verifying level 1 > DEBUG: verifying 151 items on internal block 3 > DEBUG: verifying 189 items on internal block 521 > DEBUG: verifying 233 items on internal block 302 > WARNING: 577 rightsplit > ERROR: leaf index block lacks downlink in index "i_holding2". <-- > False positive "corruption" > DETAIL: Block=127 page lsn=0/0. > > Notice I added a custom WARNING here, and we get the false positive > error just after the point that the WARNING is seen. This WARNING is > triggered by the temporary instrumentation change that I added to > amcheck, and can be seen in some cases that don't have an accompanying > false positive report of corruption: > > diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c > index d6cb10e263..dcfee9e4fa 100644 > --- a/contrib/amcheck/verify_nbtree.c > +++ b/contrib/amcheck/verify_nbtree.c > @@ -1654,6 +1654,8 @@ bt_downlink_connectivity_check(BtreeCheckState *state, > errmsg("circular link chain found in block %u of > index \"%s\"", > blkno, RelationGetRelationName(state->rel)))); > > + if (rightsplit) > + elog(WARNING, "%u rightsplit", blkno); > if (!first && !P_IGNORE(opaque)) > { > /* blkno probably has missing parent downlink */ > *** END DIFF *** > > Anyway, I didn't take the time to debug this today, but I suspect that > the problem is that the lowkey thing is kind of brittle. If you can't > figure it out, let me know and I'll help with it. It appears that these false positives were cause by very basic error made here: if (!first && !P_IGNORE(opaque)) { /* blkno probably has missing parent downlink */ bt_downlink_missing_check(state, rightsplit, blkno, page); } actually it should be if (blkno != downlink && !P_IGNORE(opaque)) { /* blkno probably has missing parent downlink */ bt_downlink_missing_check(state, rightsplit, blkno, page); } So "blkno == downlink" means blkno has downlink, not being checked first in the loop. This is remains of old version of patch which I forget to clean. Now the check you've described works for me. If you still think lowkey check is a problem, please help me figure it out. > * Can we reverse the order here, so that the common case (i.e. the > offset_is_negative_infinity() case) comes first?: > > > + if (offset_is_negative_infinity(topaque, pivotkey_offset)) > > + { > > + /* > > + * We're going to try match child high key to "negative > > + * infinity key". That means we should match to high key of > > + * left sibling of target page. > > + */ > *** SNIP *** > > > + } > > + else > > + { > *** SNIP *** > > > + } > > * Can the offset_is_negative_infinity() branch explain what's going on > at a much higher level than this? Sure, comment is revised! ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
Hi Alexander, Apologies for the delayed response. I was a little tired from the deduplication project. On Mon, Feb 24, 2020 at 2:54 PM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > Thank you for your review. The revised version is attached. This has bitrot, because of the deduplication patch. Shouldn't be hard to rebase, though. > > 'foo, -inf' > > 'foo, (1,24)' > > 'food, -inf'. <-- This pivot tuple's downlink points to the final leaf > > page that's filled with duplicates of the value 'foo' > > 'food, (3,19)' <-- This pivot tuple's downlink points to the *first* > > leaf page that's filled with duplicates of the value 'food' > > ... > Thank you for the explanation! I taught pageinspect to display a "htid" field for pivot tuples recently, making it easier to visualize this example. I think that you should say more about how "lowkey" is used here: > /* > - * Record if page that is about to become target is the right half of > - * an incomplete page split. This can go stale immediately in > - * !readonly case. > + * Copy current target low key as the high key of right sibling. > + * Allocate memory in upper level context, so it would be cleared > + * after reset of target context. > + * > + * We only need low key for parent check. > */ > - state->rightsplit = P_INCOMPLETE_SPLIT(opaque); > + if (state->readonly && !P_RIGHTMOST(opaque)) > + { Say something about concurrent page splits, since they're the only case where we actually use lowkey. Maybe say something like: "We probably won't end up doing anything with lowkey, but it's simpler for readonly verification to always have it available". > Actually, lowkey is used for removing "cousin page verification blind > spot" when there are incomplete splits. It might happen that we read > child with hikey matching its parent high key only when > bt_child_highkey_check() is called for "minus infinity" key of parent > right sibling. Saving low key helps in this case. That makes sense to me. > It appears that these false positives were cause by very basic error made here: > > if (!first && !P_IGNORE(opaque)) > { > /* blkno probably has missing parent downlink */ > bt_downlink_missing_check(state, rightsplit, blkno, page); > } > > actually it should be > > if (blkno != downlink && !P_IGNORE(opaque)) > { > /* blkno probably has missing parent downlink */ > bt_downlink_missing_check(state, rightsplit, blkno, page); > } > > So "blkno == downlink" means blkno has downlink, not being checked > first in the loop. This is remains of old version of patch which I > forget to clean. Now the check you've described works for me. > > If you still think lowkey check is a problem, please help me figure it out. * I think that these comments could still be clearer: > + /* > + * We're going to try match child high key to "negative > + * infinity key". This normally happens when the last child > + * we visited for target's left sibling was an incomplete > + * split. So, we must be still on the child of target's left > + * sibling. Thus, we should match to target's left sibling > + * high key. Thankfully we saved it, it's called a "low key". > + */ Maybe start with "We cannot try to match child's high key to a negative infinity key in target, since there is nothing to compare. However...". Perhaps use terms like "cousin page" and "subtree", which can be useful. Alternatively, mention this case in the diagram example at the top of bt_child_highkey_check(). It's tough to write comments like this, but I think it's worth it. Note that a high key is also a pivot tuple, so I wouldn't mention high keys here: > +/* > + * Check if two tuples are binary identical except the block number. So, > + * this function is capable to compare high keys with pivot keys. > + */ > +static bool > +bt_pivot_tuple_identical(IndexTuple itup1, IndexTuple itup2) > +{ v7 looks pretty close to being commitable, though I'll probably want to update some comments that you haven't touched when you commit this. I should probably wait until you've committed the patch to go do that. I'm thinking of things like old comments in bt_downlink_check(). I will test the patch properly one more time when you produce a new revision. I haven't really tested it since the last time. -- Peter Geoghegan
Hi, Peter! On Tue, Mar 3, 2020 at 3:04 AM Peter Geoghegan <pg@bowt.ie> wrote: > Apologies for the delayed response. I was a little tired from the > deduplication project. No problem. Apologies for the delayed revision as well. > I taught pageinspect to display a "htid" field for pivot tuples > recently, making it easier to visualize this example. Great! > I think that you should say more about how "lowkey" is used here: > > > /* > > - * Record if page that is about to become target is the right half of > > - * an incomplete page split. This can go stale immediately in > > - * !readonly case. > > + * Copy current target low key as the high key of right sibling. > > + * Allocate memory in upper level context, so it would be cleared > > + * after reset of target context. > > + * > > + * We only need low key for parent check. > > */ > > - state->rightsplit = P_INCOMPLETE_SPLIT(opaque); > > + if (state->readonly && !P_RIGHTMOST(opaque)) > > + { > > Say something about concurrent page splits, since they're the only > case where we actually use lowkey. Maybe say something like: "We > probably won't end up doing anything with lowkey, but it's simpler for > readonly verification to always have it available". I've revised this comment. Hopefully it's better now. > * I think that these comments could still be clearer: > > > + /* > > + * We're going to try match child high key to "negative > > + * infinity key". This normally happens when the last child > > + * we visited for target's left sibling was an incomplete > > + * split. So, we must be still on the child of target's left > > + * sibling. Thus, we should match to target's left sibling > > + * high key. Thankfully we saved it, it's called a "low key". > > + */ > > Maybe start with "We cannot try to match child's high key to a > negative infinity key in target, since there is nothing to compare. > However...". Perhaps use terms like "cousin page" and "subtree", which > can be useful. Alternatively, mention this case in the diagram example > at the top of bt_child_highkey_check(). It's tough to write comments > like this, but I think it's worth it. I've updated this comment using terms "cousin page" and "subtree". I didn't refer the diagram example, because it doesn't contain appropriate case. And I wouldn't like this diagram to contain such case, because that probably makes this diagram too complex. I've also invented term "uncle page". BTW, should it be "aunt page"? I don't know. > Note that a high key is also a pivot tuple, so I wouldn't mention high > keys here: > > > +/* > > + * Check if two tuples are binary identical except the block number. So, > > + * this function is capable to compare high keys with pivot keys. > > + */ > > +static bool > > +bt_pivot_tuple_identical(IndexTuple itup1, IndexTuple itup2) > > +{ Sure, this comment is revised. > v7 looks pretty close to being commitable, though I'll probably want > to update some comments that you haven't touched when you commit this. > I should probably wait until you've committed the patch to go do that. > I'm thinking of things like old comments in bt_downlink_check(). > > I will test the patch properly one more time when you produce a new > revision. I haven't really tested it since the last time. Attached patch also has revised commit message. I'll wait for your response before commit. ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Mon, Mar 9, 2020 at 12:52 AM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > Attached patch also has revised commit message. I'll wait for your > response before commit. Oh, I found that I haven't attached the patch. ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On Sun, Mar 8, 2020 at 2:52 PM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > I've revised this comment. Hopefully it's better now. I think that the new comments about why we need a low key for the page are much better now. > I've updated this comment using terms "cousin page" and "subtree". I > didn't refer the diagram example, because it doesn't contain > appropriate case. And I wouldn't like this diagram to contain such > case, because that probably makes this diagram too complex. I've also > invented term "uncle page". BTW, should it be "aunt page"? I don't > know. I have never heard the term "uncle page" before, but I like it -- though maybe say "right uncle page". That happens to be the exact relationship that we're talking about here. I think any one of "uncle", "aunt", or "uncle/aunt" are acceptable. We'll probably never need to use this term again, but it seems like the right term to use here. Anyway, this section also seems much better now. Other things that I noticed: * Typo: > + /* > + * We don't call bt_child_check() for "negative infinity" items. > + * But if we're performatin downlink connectivity check, we do it > + * for every item including "negative infinity" one. > + */ s/performatin/performing/ * Suggest that you say "has incomplete split flag set" here: > + * - The call for block 4 will initialize data structure, but doesn't do actual > + * checks assuming page 4 has incomplete split. * More importantly, is this the right thing to say about page 4? Isn't it also true that page 4 is the leftmost leaf page, and therefore kind of special in another way? Even without having the incomplete split flag set at all? Wouldn't it be better to address the incomplete split flag issue by making that apply to some other page that isn't also the leftmost? That would allow you to talk about the leftmost case directly here. Or it would at least make it less confusing. BTW, a P_LEFTMOST() assertion at the beginning of bt_child_highkey_check() would make this easier to follow. * Correct spelling is "happens" here: > + * current child page is not incomplete split, then its high key > + * should match to the target's key of current offset number. This > + * happends when child page referenced by previous downlink is * Actually, maybe this whole sentence should be reworded instead -- say "This happens when a previous call here (to bt_child_highkey_check()) found an incomplete split, and we reach a right sibling page without a downlink -- the right sibling page's high key still needs to be matched to a separator key on the parent/target level". * Maybe say "Don't apply OffsetNumberNext() to target_downlinkoffnum when we already had to step right on the child level. Our traversal of the child level must try to move in perfect lockstep behind (to the left of) the target/parent level traversal." I found this detail very confusing at first. * The docs should say "...relationships, including checking that there are no missing downlinks in the index structure" here: > unlike <function>bt_index_check</function>, > <function>bt_index_parent_check</function> also checks > - invariants that span parent/child relationships. > + invariants that span parent/child relationships including check > + that there are no missing downlinks in the index structure. > <function>bt_index_parent_check</function> follows the general > convention of raising an error if it finds a logical > inconsistency or other problem. This is very close now. I would be okay with you committing the patch once you deal with this feedback. If you prefer, I can take another look at a new revision. -- Peter Geoghegan
On Tue, Mar 10, 2020 at 3:07 AM Peter Geoghegan <pg@bowt.ie> wrote: > On Sun, Mar 8, 2020 at 2:52 PM Alexander Korotkov > <a.korotkov@postgrespro.ru> wrote: > > I've revised this comment. Hopefully it's better now. > > I think that the new comments about why we need a low key for the page > are much better now. Good, thank you. > > I've updated this comment using terms "cousin page" and "subtree". I > > didn't refer the diagram example, because it doesn't contain > > appropriate case. And I wouldn't like this diagram to contain such > > case, because that probably makes this diagram too complex. I've also > > invented term "uncle page". BTW, should it be "aunt page"? I don't > > know. > > I have never heard the term "uncle page" before, but I like it -- > though maybe say "right uncle page". That happens to be the exact > relationship that we're talking about here. I think any one of > "uncle", "aunt", or "uncle/aunt" are acceptable. We'll probably never > need to use this term again, but it seems like the right term to use > here. According to context that should be left uncle page. I've changed the text accordingly. > Anyway, this section also seems much better now. > > Other things that I noticed: > > * Typo: > > > + /* > > + * We don't call bt_child_check() for "negative infinity" items. > > + * But if we're performatin downlink connectivity check, we do it > > + * for every item including "negative infinity" one. > > + */ > > s/performatin/performing/ Fixed. > * Suggest that you say "has incomplete split flag set" here: > > > + * - The call for block 4 will initialize data structure, but doesn't do actual > > + * checks assuming page 4 has incomplete split. Yes, that sounds better. Changed here and in the other places. > * More importantly, is this the right thing to say about page 4? Isn't > it also true that page 4 is the leftmost leaf page, and therefore kind > of special in another way? Even without having the incomplete split > flag set at all? Wouldn't it be better to address the incomplete split > flag issue by making that apply to some other page that isn't also the > leftmost? That would allow you to talk about the leftmost case > directly here. Or it would at least make it less confusing. Yes, current example looks confusing in this aspect. But your comment spotted to me an algorithmic issue. We don't match highkey of leftmost child against parent pivot key. But we can. The "if (!BlockNumberIsValid(blkno))" branch survived from the patch version when we didn't match high keys. I've revised it. Now we enter the loop even for leftmost page on child level and match high key for that page. > BTW, a P_LEFTMOST() assertion at the beginning of > bt_child_highkey_check() would make this easier to follow. Yes, but why should it be an assert? We can imagine corruption, when there is left sibling of first child of leftmost target. I guess, current code would report such situation as an error, because this left sibling lacks of parent downlink. I've revised that "if" branch, so we don't load a child page there anymore. Error reporting is added to the main loop. > * Correct spelling is "happens" here: > > > + * current child page is not incomplete split, then its high key > > + * should match to the target's key of current offset number. This > > + * happends when child page referenced by previous downlink is > > * Actually, maybe this whole sentence should be reworded instead -- > say "This happens when a previous call here (to > bt_child_highkey_check()) found an incomplete split, and we reach a > right sibling page without a downlink -- the right sibling page's high > key still needs to be matched to a separator key on the parent/target > level". > > * Maybe say "Don't apply OffsetNumberNext() to target_downlinkoffnum > when we already had to step right on the child level. Our traversal of > the child level must try to move in perfect lockstep behind (to the > left of) the target/parent level traversal." > > I found this detail very confusing at first. > > * The docs should say "...relationships, including checking that there > are no missing downlinks in the index structure" here: > > > unlike <function>bt_index_check</function>, > > <function>bt_index_parent_check</function> also checks > > - invariants that span parent/child relationships. > > + invariants that span parent/child relationships including check > > + that there are no missing downlinks in the index structure. > > <function>bt_index_parent_check</function> follows the general > > convention of raising an error if it finds a logical > > inconsistency or other problem. This comments are revised as you proposed. > This is very close now. I would be okay with you committing the patch > once you deal with this feedback. If you prefer, I can take another > look at a new revision. Thank you. I'd like to have another feedback from you assuming there are logic changes. ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On Tue, Mar 10, 2020 at 8:30 AM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > Yes, current example looks confusing in this aspect. But your comment > spotted to me an algorithmic issue. We don't match highkey of > leftmost child against parent pivot key. But we can. The "if > (!BlockNumberIsValid(blkno))" branch survived from the patch version > when we didn't match high keys. I've revised it. Now we enter the > loop even for leftmost page on child level and match high key for that > page. Great. That looks better. > > BTW, a P_LEFTMOST() assertion at the beginning of > > bt_child_highkey_check() would make this easier to follow. > > Yes, but why should it be an assert? We can imagine corruption, when > there is left sibling of first child of leftmost target. I agree. I would only make it an assertion when it concerns an implementation detail of amcheck, but that doesn't apply here. > Thank you. I'd like to have another feedback from you assuming there > are logic changes. This looks committable. I only noticed one thing: The comments above bt_target_page_check() need to be updated to reflect the new check, which no longer has anything to do with "heapallindexed = true". -- Peter Geoghegan
On Wed, Mar 11, 2020 at 7:19 AM Peter Geoghegan <pg@bowt.ie> wrote: > This looks committable. I only noticed one thing: The comments above > bt_target_page_check() need to be updated to reflect the new check, > which no longer has anything to do with "heapallindexed = true". Thank you! Pushed with this comment revised! ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Wed, Mar 11, 2020 at 2:02 AM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > Thank you! Pushed with this comment revised! Thanks! -- Peter Geoghegan