Thread: Improve search for missing parent downlinks in amcheck

Improve search for missing parent downlinks in amcheck

From
Alexander Korotkov
Date:
Hi!

Currently we amcheck supports lossy checking for missing parent
downlinks.  It collects bitmap of downlink hashes and use it to check
subsequent tree level.  We've experienced some large corrupted indexes
which pass this check due to its looseness.

However, it seems to me we can implement this check in non-lossy
manner without making it significantly slower.  We anyway traverse
downlinks from parent to children in order to verify that hikeys are
corresponding to downlink keys.  We can also traverse from one
downlink to subsequent using rightlinks.  So, if there are some
intermediate pages between them, they are candidates to have missing
parent downlinks.  The patch is attached.

With this patch amcheck could successfully detect corruption for our
customer, which unpatched amcheck couldn't find.

Opinions?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

Re: Improve search for missing parent downlinks in amcheck

From
Peter Geoghegan
Date:
On Mon, Apr 15, 2019 at 7:30 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Currently we amcheck supports lossy checking for missing parent
> downlinks.  It collects bitmap of downlink hashes and use it to check
> subsequent tree level.  We've experienced some large corrupted indexes
> which pass this check due to its looseness.

Can you be more specific? What was the cause of the corruption? I'm
always very interested in hearing about cases that amcheck could have
detected, but didn't.

Was the issue that the Bloom filter was simply undersized/ineffective?

> However, it seems to me we can implement this check in non-lossy
> manner without making it significantly slower.  We anyway traverse
> downlinks from parent to children in order to verify that hikeys are
> corresponding to downlink keys.

Actually, we don't check the high keys in children against the parent
(all other items are checked, though). We probably *should* do
something with the high key when verifying consistency across levels,
but currently we don't. (We only use the high key for the same-page
high key check -- more on this below.)

> We can also traverse from one
> downlink to subsequent using rightlinks.  So, if there are some
> intermediate pages between them, they are candidates to have missing
> parent downlinks.  The patch is attached.
>
> With this patch amcheck could successfully detect corruption for our
> customer, which unpatched amcheck couldn't find.

Maybe we can be a lot less conservative about sizing the Bloom filter
instead? That would be an easier fix IMV -- we can probably change
that logic to be a lot more aggressive without anybody noticing, since
the Bloom filter is already usually tiny. We are already not very
careful about saving work within bt_index_parent_check(), but with
this patch we follow each downlink twice instead of once. That seems
wasteful.

The reason I used a Bloom filter here is because I would eventually
like the missing downlink check to fingerprint entire tuples, not just
block numbers. In L&Y terms, the check could in the future fingerprint
the separator key and the downlink at the same time, not just the
downlink. That way, we could probe the Bloom filter on the next level
down for its high key (with the right sibling pointer set to be
consistent with the parent) iff we don't see that the page split was
interrupted (i.e. iff P_INCOMPLETE_SPLIT() bit is not set). Obviously
this would be a more effective form of verification, since we would
notice high key values that don't agree with the parent's values for
the same sibling/cousin/child block.

I didn't do it that way for v11 because of "minus infinity" items on
internal pages, which don't store the original key (the key remains
the high key of the left sibling page, but we truncate the original to
0 attributes in _bt_pgaddtup()). I think that we should eventually
stop truncating minus infinity items, and actually store a "low key"
at P_FIRSTDATAKEY() within internal pages instead. That would be
useful for other things anyway (e.g. prefix compression).

--
Peter Geoghegan



Re: Improve search for missing parent downlinks in amcheck

From
Peter Geoghegan
Date:
On Tue, Apr 16, 2019 at 12:00 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Can you be more specific? What was the cause of the corruption? I'm
> always very interested in hearing about cases that amcheck could have
> detected, but didn't.

FWIW, v4 indexes in Postgres 12 will support the new "rootdescend"
verification option, which isn't lossy, and would certainly have
detected your customer issue in practice. Admittedly the new check is
quite expensive, even compared to the other bt_index_parent_check()
checks, but it is nice that we now have a verification option that is
*extremely* thorough, and uses _bt_search() directly.

-- 
Peter Geoghegan



Re: Improve search for missing parent downlinks in amcheck

From
Peter Geoghegan
Date:
On Tue, Apr 16, 2019 at 12:00 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Apr 15, 2019 at 7:30 PM Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
> > Currently we amcheck supports lossy checking for missing parent
> > downlinks.  It collects bitmap of downlink hashes and use it to check
> > subsequent tree level.  We've experienced some large corrupted indexes
> > which pass this check due to its looseness.
>
> Can you be more specific? What was the cause of the corruption? I'm
> always very interested in hearing about cases that amcheck could have
> detected, but didn't.

Ping?

I am interested in doing better here. The background information seems
very interesting.

-- 
Peter Geoghegan



Re: Improve search for missing parent downlinks in amcheck

From
Alexander Korotkov
Date:
On Tue, Apr 16, 2019 at 10:04 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Tue, Apr 16, 2019 at 12:00 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Can you be more specific? What was the cause of the corruption? I'm
> > always very interested in hearing about cases that amcheck could have
> > detected, but didn't.
>
> FWIW, v4 indexes in Postgres 12 will support the new "rootdescend"
> verification option, which isn't lossy, and would certainly have
> detected your customer issue in practice. Admittedly the new check is
> quite expensive, even compared to the other bt_index_parent_check()
> checks, but it is nice that we now have a verification option that is
> *extremely* thorough, and uses _bt_search() directly.

"rootdescend" is cool type of check.  Thank you for noticing, I wasn't aware of it.
But can it detect the missing downlink in following situation?

        A
     /     \
  B <-> C <-> D

Here A has downlinks to B and D, which downlink to C is missing,
while B, C and D are correctly connected with leftlinks and rightlinks.
I can see "rootdescend" calls _bt_search(), which would just step
right from C to D as if it was concurrent split.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Improve search for missing parent downlinks in amcheck

From
Peter Geoghegan
Date:
On Sat, Apr 27, 2019 at 4:57 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> "rootdescend" is cool type of check.  Thank you for noticing, I wasn't aware of it.
> But can it detect the missing downlink in following situation?
>
>         A
>      /     \
>   B <-> C <-> D
>
> Here A has downlinks to B and D, which downlink to C is missing,
> while B, C and D are correctly connected with leftlinks and rightlinks.
> I can see "rootdescend" calls _bt_search(), which would just step
> right from C to D as if it was concurrent split.

There is a comment about this scenario above bt_rootdescend() in amcheck.

You're right -- this is a kind of corruption that even the new
rootdescend verification option would miss. We can imagine a version
of rootdescend verification that tells the core code to only move
right when there was an *interrupted* page split (i.e.
P_INCOMPLETE_SPLIT() flag bit is set), but that isn't what happens
right now.

That said, the lossy downlink check that you want to improve on
*should* already catch this situation. Of course it might not because
it is lossy (uses a Bloom filter), but I think that that's very
unlikely. That's why I would like to understand the problem that you
found with the check.

-- 
Peter Geoghegan



Re: Improve search for missing parent downlinks in amcheck

From
Alexander Korotkov
Date:
On Tue, Apr 16, 2019 at 10:00 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Apr 15, 2019 at 7:30 PM Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
> > Currently we amcheck supports lossy checking for missing parent
> > downlinks.  It collects bitmap of downlink hashes and use it to check
> > subsequent tree level.  We've experienced some large corrupted indexes
> > which pass this check due to its looseness.
>
> Can you be more specific? What was the cause of the corruption? I'm
> always very interested in hearing about cases that amcheck could have
> detected, but didn't.

AFAIR, the cause of corruption in this case was our (Postgres Pro)
modification.  Not something really present in PostgreSQL core.

>
> Was the issue that the Bloom filter was simply undersized/ineffective?

Yes, increasing of Bloom filter size also helps.  But my intention was
to make non-lossy check here.

>
> > However, it seems to me we can implement this check in non-lossy
> > manner without making it significantly slower.  We anyway traverse
> > downlinks from parent to children in order to verify that hikeys are
> > corresponding to downlink keys.
>
> Actually, we don't check the high keys in children against the parent
> (all other items are checked, though). We probably *should* do
> something with the high key when verifying consistency across levels,
> but currently we don't. (We only use the high key for the same-page
> high key check -- more on this below.)

Nice idea.

> > We can also traverse from one
> > downlink to subsequent using rightlinks.  So, if there are some
> > intermediate pages between them, they are candidates to have missing
> > parent downlinks.  The patch is attached.
> >
> > With this patch amcheck could successfully detect corruption for our
> > customer, which unpatched amcheck couldn't find.
>
> Maybe we can be a lot less conservative about sizing the Bloom filter
> instead? That would be an easier fix IMV -- we can probably change
> that logic to be a lot more aggressive without anybody noticing, since
> the Bloom filter is already usually tiny. We are already not very
> careful about saving work within bt_index_parent_check(), but with
> this patch we follow each downlink twice instead of once. That seems
> wasteful.

It wouldn't be really wasteful, because the same children were
accessed recently.  So, they are likely not yet evicted from
shared_buffers.  I think we can fit both checks into one loop over
downlinks instead of two.

> The reason I used a Bloom filter here is because I would eventually
> like the missing downlink check to fingerprint entire tuples, not just
> block numbers. In L&Y terms, the check could in the future fingerprint
> the separator key and the downlink at the same time, not just the
> downlink. That way, we could probe the Bloom filter on the next level
> down for its high key (with the right sibling pointer set to be
> consistent with the parent) iff we don't see that the page split was
> interrupted (i.e. iff P_INCOMPLETE_SPLIT() bit is not set). Obviously
> this would be a more effective form of verification, since we would
> notice high key values that don't agree with the parent's values for
> the same sibling/cousin/child block.
>
> I didn't do it that way for v11 because of "minus infinity" items on
> internal pages, which don't store the original key (the key remains
> the high key of the left sibling page, but we truncate the original to
> 0 attributes in _bt_pgaddtup()). I think that we should eventually
> stop truncating minus infinity items, and actually store a "low key"
> at P_FIRSTDATAKEY() within internal pages instead. That would be
> useful for other things anyway (e.g. prefix compression).

Yes, we can do more checks with bloom filter.  But assuming that we
anyway iterate over children for each non-leaf page, can we do:
1) All the checks, which bt_downlink_check() does for now
2) Check there are no missing downlinks
3) Check hikeys
in one pass, can't we?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Improve search for missing parent downlinks in amcheck

From
Peter Geoghegan
Date:
On Sat, Apr 27, 2019 at 5:13 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Yes, increasing of Bloom filter size also helps.  But my intention was
> to make non-lossy check here.

I agree that that might be a good goal, but I am interested in knowing
if there is something naive about how the downlinkfilter Bloom filter
is sized. I think that we could probably do better than this without
much work:

            /*
             * Extra readonly downlink check.
             *
             * In readonly case, we know that there cannot be a concurrent
             * page split or a concurrent page deletion, which gives us the
             * opportunity to verify that every non-ignorable page had a
             * downlink one level up.  We must be tolerant of interrupted page
             * splits and page deletions, though.  This is taken care of in
             * bt_downlink_missing_check().
             */
            total_pages = (int64) state->rel->rd_rel->relpages;
            state->downlinkfilter = bloom_create(total_pages, work_mem, seed);

Maybe we could use "smgrnblocks(index->rd_smgr, MAIN_FORKNUM))"
instead of relpages, for example.

> It wouldn't be really wasteful, because the same children were
> accessed recently.  So, they are likely not yet evicted from
> shared_buffers.  I think we can fit both checks into one loop over
> downlinks instead of two.

I see your point, but if we're going to treat this as a bug then I
would prefer a simple fix.

> Yes, we can do more checks with bloom filter.  But assuming that we
> anyway iterate over children for each non-leaf page, can we do:
> 1) All the checks, which bt_downlink_check() does for now
> 2) Check there are no missing downlinks
> 3) Check hikeys
> in one pass, can't we?

We can expect every high key in a page to have a copy contained within
its parent, either as one of the keys, or as parent's own high key
(assuming no concurrent or interrupted page splits or page deletions).
This is true today, even though we truncate negative infinity items in
internal pages.

I think that the simple answer to your question is yes. It would be
more complicated that way, and the only extra check would be the check
of high keys against their parent, but overall this does seem
possible.

-- 
Peter Geoghegan



Re: Improve search for missing parent downlinks in amcheck

From
Peter Geoghegan
Date:
On Sat, Apr 27, 2019 at 5:13 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Yes, increasing of Bloom filter size also helps.  But my intention was
> to make non-lossy check here.

Why is that your intention? Do you want to do this as a feature for
Postgres 13, or do you want to treat this as a bug that we need to
backpatch a fix for?

Can we avoid the problem you saw with the Bloom filter approach by
using the real size of the index (i.e.
smgrnblocks()/RelationGetNumberOfBlocks()) to size the Bloom filter,
and/or by rethinking the work_mem cap? Maybe we can have a WARNING
that advertises that work_mem is probably too low?

The  state->downlinkfilter Bloom filter should be small in almost all
cases, so I still don't fully understand your concern. With a 100GB
index, we'll have ~13 million blocks. We only need a Bloom filter that
is ~250MB to have less than a 1% chance of missing inconsistencies
even with such a large index. I admit that its unfriendly that users
are not warned about the shortage currently, but that is something we
can probably find a simple (backpatchable) fix for.

-- 
Peter Geoghegan



Re: Improve search for missing parent downlinks in amcheck

From
Alexander Korotkov
Date:
On Sun, Apr 28, 2019 at 4:36 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Sat, Apr 27, 2019 at 5:13 PM Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
> > Yes, increasing of Bloom filter size also helps.  But my intention was
> > to make non-lossy check here.
>
> Why is that your intention? Do you want to do this as a feature for
> Postgres 13, or do you want to treat this as a bug that we need to
> backpatch a fix for?

I think this definitely not bug fix.  Bloom filter was designed to be
lossy, no way blaming it for that :)

> Can we avoid the problem you saw with the Bloom filter approach by
> using the real size of the index (i.e.
> smgrnblocks()/RelationGetNumberOfBlocks()) to size the Bloom filter,
> and/or by rethinking the work_mem cap? Maybe we can have a WARNING
> that advertises that work_mem is probably too low?
>
> The  state->downlinkfilter Bloom filter should be small in almost all
> cases, so I still don't fully understand your concern. With a 100GB
> index, we'll have ~13 million blocks. We only need a Bloom filter that
> is ~250MB to have less than a 1% chance of missing inconsistencies
> even with such a large index. I admit that its unfriendly that users
> are not warned about the shortage currently, but that is something we
> can probably find a simple (backpatchable) fix for.

Sounds reasonable.  I'll think about proposing backpatch of something like this.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Improve search for missing parent downlinks in amcheck

From
Peter Geoghegan
Date:
On Sun, Apr 28, 2019 at 10:15 AM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> I think this definitely not bug fix.  Bloom filter was designed to be
> lossy, no way blaming it for that :)

I will think about a simple fix, but after the upcoming point release.
There is no hurry.


-- 
Peter Geoghegan



Re: Improve search for missing parent downlinks in amcheck

From
Thomas Munro
Date:
On Wed, May 1, 2019 at 12:58 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Sun, Apr 28, 2019 at 10:15 AM Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
> > I think this definitely not bug fix.  Bloom filter was designed to be
> > lossy, no way blaming it for that :)
>
> I will think about a simple fix, but after the upcoming point release.
> There is no hurry.

A bureaucratic question:  What should the status be for this CF entry?

-- 
Thomas Munro
https://enterprisedb.com



Re: Improve search for missing parent downlinks in amcheck

From
Peter Geoghegan
Date:
On Sun, Jul 7, 2019 at 7:53 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, May 1, 2019 at 12:58 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > I will think about a simple fix, but after the upcoming point release.
> > There is no hurry.
>
> A bureaucratic question:  What should the status be for this CF entry?

I have plans to use RelationGetNumberOfBlocks() to size amcheck's
downlink Bloom filter. I think that that will solve the problems with
unreliable estimates of index size. which seemed to be the basis of
Alexander's complaint.

--
Peter Geoghegan



Re: Improve search for missing parent downlinks in amcheck

From
Peter Geoghegan
Date:
On Tue, Apr 30, 2019 at 5:58 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I will think about a simple fix, but after the upcoming point release.
> There is no hurry.

Attached draft patch uses RelationGetNumberOfBlocks() to size each of
the two Bloom filters that may be used by amcheck to perform
verification.

The basic heapallindexed Bloom filter is now sized based on the
conservative assumption that there must be *at least*
"RelationGetNumberOfBlocks()  * 50" elements to fingerprint (reltuples
will continue to be used to size the basic heapallindexed Bloom filter
in most cases, though). The patch also uses the same
RelationGetNumberOfBlocks() value to size the downlink Bloom filter.
This second change will fix your problem very non-invasively.

I intend to backpatch this to v11 in the next few days.

-- 
Peter Geoghegan

Attachment

Re: Improve search for missing parent downlinks in amcheck

From
Alexander Korotkov
Date:
On Fri, Jul 19, 2019 at 3:21 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Apr 30, 2019 at 5:58 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > I will think about a simple fix, but after the upcoming point release.
> > There is no hurry.
>
> Attached draft patch uses RelationGetNumberOfBlocks() to size each of
> the two Bloom filters that may be used by amcheck to perform
> verification.
>
> The basic heapallindexed Bloom filter is now sized based on the
> conservative assumption that there must be *at least*
> "RelationGetNumberOfBlocks()  * 50" elements to fingerprint (reltuples
> will continue to be used to size the basic heapallindexed Bloom filter
> in most cases, though). The patch also uses the same
> RelationGetNumberOfBlocks() value to size the downlink Bloom filter.
> This second change will fix your problem very non-invasively.
>
> I intend to backpatch this to v11 in the next few days.

Thank you for backpatching this!

BTW, there is next revision of patch I'm proposing for v13.

In this revision check for missing downlinks is combined with
bt_downlink_check().  So, pages visited by bt_downlink_check() patch
doesn't cause extra accessed.  It only causes following additional
page accesses:
1) Downlinks corresponding to "negative infinity" keys,
2) Pages of child level, which are not referenced by downlinks.

But I think these two kinds are very minority, and those accesses
could be trade off with more precise missing downlink check without
bloom filter.  What do you think?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

Re: Improve search for missing parent downlinks in amcheck

From
Peter Geoghegan
Date:
On Mon, Aug 12, 2019 at 12:01 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> BTW, there is next revision of patch I'm proposing for v13.

Cool.

> In this revision check for missing downlinks is combined with
> bt_downlink_check().  So, pages visited by bt_downlink_check() patch
> doesn't cause extra accessed.  It only causes following additional
> page accesses:
> 1) Downlinks corresponding to "negative infinity" keys,
> 2) Pages of child level, which are not referenced by downlinks.
>
> But I think these two kinds are very minority, and those accesses
> could be trade off with more precise missing downlink check without
> bloom filter.  What do you think?

I am generally in favor of making the downlink check absolutely
reliable, and am not too worried about the modest additional overhead.
After all, bt_index_parent_check() is supposed to be thorough though
expensive. The only reason that I used a Bloom filter for
fingerprinting downlink blocks was that it seemed important to quickly
get amcheck coverage for subtle multi-level page deletion bugs just
after v11 feature freeze. We can now come up with a better design for
that.

I was confused about how this patch worked at first. But then I
remembered that Lehman and Yao consider downlinks to be distinct
things to separator keys in internal pages. The high key of an
internal page in the final separator key, so you have n downlinks and
n + 1 separator keys per internal page -- two distinct things that
appear in alternating order (the negative infinity item is not
considered to have any separator key here). So, while internal page
items are explicitly "(downlink, separator)" pairs that are grouped
into a single tuple in nbtree, with a separate tuple just for the high
key, Lehman and Yao would find it just as natural to treat them as
"(separator, downlink)" pairs. You have to skip the first downlink on
each internal level to make that work, but this makes our
bt_downlink_check() check have something from the target page (child's
parent page) that is like the high key in the child.

It's already really confusing that we don't quite mean the same thing
as Lehman and Yao when we say downlink (See also my long "why a
highkey is never truly a copy of another item" comment block within
bt_target_page_check()), and that is not your patch's fault. But maybe
we need to fix that to make your patch easier to understand. (i.e.
maybe we need to go over every use of the word "downlink" in nbtree,
and make it say something more precise, to make everything less
confusing.)

Other feedback:

* Did you miss the opportunity to verify that every high key matches
its right sibling page's downlink tuple in parent page? We talked
about this already, but you don't seem to match the key (you only
match the downlink block).

* You are decoupling the new check from the bt_index_parent_check()
"heapallindexed" option. That's probably a good thing, but you need to
update the sgml docs.

* Didn't you forget to remove the BtreeCheckState.rightsplit flag?

* You've significantly changed the behavior of bt_downlink_check() --
I would note this in its header comments. This is where ~99% of the
new work happens.

* I don't like that you use the loaded term "target" here -- anything
called "target" should be BtreeCheckState.target:

>  static void
> -bt_downlink_missing_check(BtreeCheckState *state)
> +bt_downlink_missing_check(BtreeCheckState *state, bool rightsplit,
> +                         BlockNumber targetblock, Page target)
>  {

If it's unclear what I mean, this old comment should make it clearer:

/*
 * State associated with verifying a B-Tree index
 *
 * target is the point of reference for a verification operation.
 *
 * Other B-Tree pages may be allocated, but those are always auxiliary (e.g.,
 * they are current target's child pages).  Conceptually, problems are only
 * ever found in the current target page (or for a particular heap tuple during
 * heapallindexed verification).  Each page found by verification's left/right,
 * top/bottom scan becomes the target exactly once.
 */

-- 
Peter Geoghegan



Re: Improve search for missing parent downlinks in amcheck

From
Alexander Korotkov
Date:
On Tue, Aug 13, 2019 at 11:44 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > In this revision check for missing downlinks is combined with
> > bt_downlink_check().  So, pages visited by bt_downlink_check() patch
> > doesn't cause extra accessed.  It only causes following additional
> > page accesses:
> > 1) Downlinks corresponding to "negative infinity" keys,
> > 2) Pages of child level, which are not referenced by downlinks.
> >
> > But I think these two kinds are very minority, and those accesses
> > could be trade off with more precise missing downlink check without
> > bloom filter.  What do you think?
>
> I am generally in favor of making the downlink check absolutely
> reliable, and am not too worried about the modest additional overhead.
> After all, bt_index_parent_check() is supposed to be thorough though
> expensive. The only reason that I used a Bloom filter for
> fingerprinting downlink blocks was that it seemed important to quickly
> get amcheck coverage for subtle multi-level page deletion bugs just
> after v11 feature freeze. We can now come up with a better design for
> that.

Great!

> I was confused about how this patch worked at first. But then I
> remembered that Lehman and Yao consider downlinks to be distinct
> things to separator keys in internal pages. The high key of an
> internal page in the final separator key, so you have n downlinks and
> n + 1 separator keys per internal page -- two distinct things that
> appear in alternating order (the negative infinity item is not
> considered to have any separator key here). So, while internal page
> items are explicitly "(downlink, separator)" pairs that are grouped
> into a single tuple in nbtree, with a separate tuple just for the high
> key, Lehman and Yao would find it just as natural to treat them as
> "(separator, downlink)" pairs. You have to skip the first downlink on
> each internal level to make that work, but this makes our
> bt_downlink_check() check have something from the target page (child's
> parent page) that is like the high key in the child.
>
> It's already really confusing that we don't quite mean the same thing
> as Lehman and Yao when we say downlink (See also my long "why a
> highkey is never truly a copy of another item" comment block within
> bt_target_page_check()), and that is not your patch's fault. But maybe
> we need to fix that to make your patch easier to understand. (i.e.
> maybe we need to go over every use of the word "downlink" in nbtree,
> and make it say something more precise, to make everything less
> confusing.)

I agree that current terms nbtree use to describe downlinks and
separator keys may be confusing.  I'll try to fix this and come up
with patch if succeed.

> Other feedback:
>
> * Did you miss the opportunity to verify that every high key matches
> its right sibling page's downlink tuple in parent page? We talked
> about this already, but you don't seem to match the key (you only
> match the downlink block).
>
> * You are decoupling the new check from the bt_index_parent_check()
> "heapallindexed" option. That's probably a good thing, but you need to
> update the sgml docs.
>
> * Didn't you forget to remove the BtreeCheckState.rightsplit flag?
>
> * You've significantly changed the behavior of bt_downlink_check() --
> I would note this in its header comments. This is where ~99% of the
> new work happens.
>
> * I don't like that you use the loaded term "target" here -- anything
> called "target" should be BtreeCheckState.target:
>
> >  static void
> > -bt_downlink_missing_check(BtreeCheckState *state)
> > +bt_downlink_missing_check(BtreeCheckState *state, bool rightsplit,
> > +                         BlockNumber targetblock, Page target)
> >  {
>
> If it's unclear what I mean, this old comment should make it clearer:
>
> /*
>  * State associated with verifying a B-Tree index
>  *
>  * target is the point of reference for a verification operation.
>  *
>  * Other B-Tree pages may be allocated, but those are always auxiliary (e.g.,
>  * they are current target's child pages).  Conceptually, problems are only
>  * ever found in the current target page (or for a particular heap tuple during
>  * heapallindexed verification).  Each page found by verification's left/right,
>  * top/bottom scan becomes the target exactly once.
>  */

The revised patch seems to fix all of above.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

Re: Improve search for missing parent downlinks in amcheck

From
Michael Paquier
Date:
On Mon, Aug 19, 2019 at 01:15:19AM +0300, Alexander Korotkov wrote:
> The revised patch seems to fix all of above.

The latest patch is failing to apply.  Please provide a rebase.
--
Michael

Attachment

Re: Improve search for missing parent downlinks in amcheck

From
Tomas Vondra
Date:
On Fri, Nov 29, 2019 at 03:03:01PM +0900, Michael Paquier wrote:
>On Mon, Aug 19, 2019 at 01:15:19AM +0300, Alexander Korotkov wrote:
>> The revised patch seems to fix all of above.
>
>The latest patch is failing to apply.  Please provide a rebase.

This still does not apply (per cputube). Can you provide a fixed patch?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: Improve search for missing parent downlinks in amcheck

From
Alexander Korotkov
Date:
On Fri, Jan 17, 2020 at 1:05 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On Fri, Nov 29, 2019 at 03:03:01PM +0900, Michael Paquier wrote:
> >On Mon, Aug 19, 2019 at 01:15:19AM +0300, Alexander Korotkov wrote:
> >> The revised patch seems to fix all of above.
> >
> >The latest patch is failing to apply.  Please provide a rebase.
>
> This still does not apply (per cputube). Can you provide a fixed patch?

Rebased patch is attached.  Sorry for so huge delay.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

Re: Improve search for missing parent downlinks in amcheck

From
Peter Geoghegan
Date:
On Wed, Jan 22, 2020 at 6:41 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Rebased patch is attached.  Sorry for so huge delay.

I really like this patch. Your interest in amcheck is something that
makes me feel good about having put so much work into it myself.

Here are some review comments:

> +   /*
> +    * Rightlink and incomplete split flag of previous block referenced by
> +    * downlink.
> +    */
> +   BlockNumber prevrightlink;
> +   bool        previncompletesplit;
> +

What downlink? What does this mean? Do you mean the most recently
followed rightlink on the current level, or InvalidBlockNumber if
target page is the leftmost page on the current level on the scan?

(Thinks some more...)

Actually, these two new fields track the state *one level down* from
the target page level when !readonly (unless target page is on the
leaf level). Right? Comments should be explicit about this. The
current comments about downlinks isn't clear.

>         if (offset_is_negative_infinity(topaque, offset))
> +       {
> +           /*
> +            * Initialize downlink connectivity check if needed.
> +            */
> +           if (!P_ISLEAF(topaque) && state->readonly)
> +           {
> +               bt_downlink_connectivity_check(state,
> +                                              offset,
> +                                              NULL,
> +                                              topaque->btpo.level);
> +           }
>             continue;
> +       }

Don't need the "!P_ISLEAF()" here. Also, you should say something like
"we need to call this here because the usual callsite in
bt_downlink_check() won't be reached".

>     /*
> -    * * Check if page has a downlink in parent *
> -    *
> -    * This can only be checked in heapallindexed + readonly case.
> +    * If we traversed the whole level to the rightmost page, there might be
> +    * missing downlinks for the pages to the right of rightmost downlink.
> +    * Check for them.
>      */

You mean "to the right of the child page pointed to by our rightmost downlink"?

I think that the final bt_downlink_connectivity_check() call within
bt_target_page_check() should make it clear that it is kind of special
compared to the other two calls.

> +/*
> + * Check connectivity of downlinks.  Traverse rightlinks from previous downlink
> + * to the current one.  Check that there are no intermediate pages with missing
> + * downlinks.
> + *
> + * If 'loaded_page' is given, it's assumed to be contents of downlink
> + * referenced by 'downlinkoffnum'.
> + */

Say "assumed to be the page pointed to by the downlink", perhaps?

> +static void
> +bt_downlink_connectivity_check(BtreeCheckState *state,
> +                              OffsetNumber downlinkoffnum,
> +                              Page loaded_page,
> +                              uint32 parent_level)
> +{

In amcheck, we always have a current target page. Every page gets to
be the target exactly once, though sometimes other subsidiary pages
are accessed. We try to blame the target page, even with checks that
are technically against its child/sibling/whatever. The target page is
always our conceptual point of reference. Sometimes this is a bit
artificial, but it's still worth doing consistently. So I think you
should change these argument names with that in mind (see below).

> +       /*
> +        * If we visit page with high key, check that it should be equal to
> +        * the target key next to corresponding downlink.
> +        */

I suggest "...check that it is equal to the target key..."

> +           /*
> +            * There might be two situations when we examine high key.  If
> +            * current child page is referenced by given downlink, we should
> +            * look to the next offset number for matching key.

You mean "the next offset number for the matching key from the target
page"? I find it much easier to keep this stuff in my head if
everything is defined in terms of its relationship with the current
target page. For example, bt_downlink_connectivity_check()'s
"parent_level" argument should be called "target_level" instead, while
its "loaded_page" should be called "loaded_child". Maybe
"downlinkoffnum" should be "target_downlinkoffnum". And downlinkoffnum
should definitely be explained in comments at the top of
bt_downlink_connectivity_check() (e.g., say what it means when it is
InvalidOffsetNumber).

> Alternatively
> +            * we might find child with high key while traversing from
> +            * previous downlink to current one.  Then matching key resides
> +            * the same offset number as current downlink.
> +            */

Not sure what "traversing from previous downlink to current one" means at all.

> +           if (!offset_is_negative_infinity(topaque, pivotkey_offset) &&
> +               pivotkey_offset <= PageGetMaxOffsetNumber(state->target))
> +           {
> +               uint32      cmp = _bt_compare(state->rel,
> +                                             skey,
> +                                             state->target,
> +                                             pivotkey_offset);

There is no need to bother with a _bt_compare() here. Why not just use
memcmp() with a pointer to itup->t_tid.ip_posid (i.e. memcmp() that
skips the block number)? I think that it is better to expect the keys
to be *identical* among pivot tuples, including within tuple alignment
padding (only the downlink block number can be different here). If
non-pivot tuples were involved then you couldn't do it this way, but
they're never involved, so it makes sense. A memcmp() will be faster,
obviously. More importantly, it has the advantage of not relying on
opclass infrastructure in any way. It might be worth adding an
internal verify_nbtree.c static helper function to do the memcmp() for
you -- bt_pivot_tuple_identical(), or something like that.

I think bt_downlink_check() and bt_downlink_connectivity_check()
should be renamed to something broader. In my mind, downlink is
basically a block number. We have been sloppy about using the term
downlink when we really mean "pivot tuple with a downlink" -- I am
guilty of this myself. But it seems more important, now that you have
the new high key check.

I particularly don't like the way you sometimes say "downlink" when
you mean "child page". You do that in this error message:

> +                   (errcode(ERRCODE_INDEX_CORRUPTED),
> +                    errmsg("block found while traversing rightlinks from downlink of index \"%s\" has invalid
level",
> +                           RelationGetRelationName(state->rel)),

Typo here:

> +   /*
> +    * If no previos rightlink is memorized, get it from current downlink for
> +    * future usage.
> +    */

You mean "previous". Also, I think that you should say "memorized for
current level just below target page's level".

>      * within bt_check_level_from_leftmost() won't reach the page either,
>      * since the leaf's live siblings should have their sibling links updated
> -    * to bypass the deletion target page when it is marked fully dead.)
> +    * to bypass the deletion page under check when it is marked fully dead.)
>      *

This change seems wrong or unnecessary -- "deletion target" means
"page undergoing deletion" (not necessarily marked P_ISDELETED() just
yet), and has nothing to do with the amcheck target. You can change
this if you want, but I don't get it.

I tested this by using pg_hexedit to corrupt the least significant
byte of a text key in the root page:

pg@tpce:5432 [32610]=# select bt_index_parent_check('pk_holding');
DEBUG:  verifying level 2 (true root level)
DEBUG:  verifying 9 items on internal block 290
DEBUG:  verifying level 1
DEBUG:  verifying 285 items on internal block 3
ERROR:  mismatch between parent key and child high key index "pk_holding"
DETAIL:  Parent block=3 child block=9 parent page lsn=998/EFA21550.

Happy to see that this works, even though this is one of the subtlest
possible forms of index corruption. Previously, we could sometimes
catch this with "rootdescend" verification, but only if there were
*current* items that a scan couldn't find on lower levels (often just
the leaf level). But now it doesn't matter -- we'll always detect it.
(I think.)

Shouldn't this error message read '...in index "pk_holding"'? You
missed the "in". Also, why not have the DETAIL message call the
"Parent block" the target block?

I think that bt_downlink_connectivity_check() should have some
high-level comments about what it's supposed to do. Perhaps an example
is the best way to explain the concepts. Maybe say something about a
three level B-Tree. Each of the separator keys in the grandparent/root
page should also appear as high keys at the parent level. Each of the
separator keys in the parent level should also appear as high keys on
the leaf level, including the separators from the parent level high
keys. Since each separator defines which subtrees are <= and > of the
separator, there must be an identical seam of separators (in high
keys) on lower levels. bt_downlink_connectivity_check() verifies that
separator keys agree across a single level, which verifies the
integrity of the whole tree.

(Thinks some more...)

Actually, this patch doesn't quite manage to verify that there is this
"unbroken seam" of separator keys from the root to the leaf, so my
suggested wording is kind of wrong -- but I think we can fix this
weakness. The specific weakness that I saw and verified exists is as
follows:

If I corrupt the high key of most of the leaf pages in a multi-level
index just a little bit (by once again corrupting the least
significant byte of the key using pg_hexedit), then the new check
alone will usually detect the problem, which is good. However, if I
deliberately pick a leaf page that happens to be the rightmost child
of some internal page, then it is a different story -- even the new
check won't detect the problem (the existing rootdescend check may or
may not detect the problem, depending on the current non-pivot tuples
near the leaf high key in question). There is no real reason why we
shouldn't be able to detect this problem, though.

The solution is to recognize that we sometimes need to use the
target/parent page high key separator -- not the separator key from
some pivot tuple in parent that also contains a downlink. Goetz Graefe
calls this "the cousin problem" [1]. To be more precise, I think that
the "pivotkey_offset <= PageGetMaxOffsetNumber(state->target))" part
of this test can be improved:

> +           /*
> +            * There might be two situations when we examine high key.  If
> +            * current child page is referenced by given downlink, we should
> +            * look to the next offset number for matching key.  Alternatively
> +            * we might find child with high key while traversing from
> +            * previous downlink to current one.  Then matching key resides
> +            * the same offset number as current downlink.
> +            */
> +           if (blkno == downlink)
> +               pivotkey_offset = OffsetNumberNext(downlinkoffnum);
> +           else
> +               pivotkey_offset = downlinkoffnum;
> +
> +           topaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
> +
> +           if (!offset_is_negative_infinity(topaque, pivotkey_offset) &&
> +               pivotkey_offset <= PageGetMaxOffsetNumber(state->target))

When OffsetNumberNext(downlinkoffnum) exceeds the
PageGetMaxOffsetNumber(state->target), doesn't that actually mean that
the high key offset (i.e. P_HIKEY) should be used to get an item from
the next level up? We can still correctly detect the problem that way.
Remember, the high key on an internal page is a tuple that contains a
separator but no downlink, which is really not that special within an
internal page -- if you think about it from the Lehman & Yao
perspective. So we should take the pivot tuple from the parent at
offset P_HIKEY, and everything can work just the same.

That said, I guess you really do need the
"!offset_is_negative_infinity(topaque, pivotkey_offset)" part of the
test. The only other possibility is that you access the target/parent
page's downlink + separator in its own parent page (i.e. the
grandparent of the page whose high key might be corrupt), which is
significantly more complexity -- that may not be worth it. (If you did
this, you would have to teach the code the difference between
"absolute" negative infinity in the leftmost leaf page on the leaf
level, and "subtree relative" negative infinity for other leaf pages
that are merely leftmost within a subtree).

In summary: I suppose that we can also solve "the cousin problem"
quite easily, but only for rightmost cousins within a subtree --
leftmost cousins might be too messy to verify for it to be worth it.
We don't want to have to jump two or three levels up within
bt_downlink_connectivity_check() just for leftmost cousin pages. But
maybe you're feeling ambitious! What do you think?

Note: There is an existing comment about this exact negative infinity
business within bt_downlink_check(). It starts with "Negative
inifinity items can be thought of as a strict lower bound that works
transitively...". There should probably be some comment updates to
this comment block as part of this patch.

[1] https://pdfs.semanticscholar.org/fd45/15ab23c00231d96c95c1091459d0d1eebfae.pdf
--
Peter Geoghegan



Re: Improve search for missing parent downlinks in amcheck

From
Peter Geoghegan
Date:
On Thu, Jan 23, 2020 at 5:31 PM Peter Geoghegan <pg@bowt.ie> wrote:
> In summary: I suppose that we can also solve "the cousin problem"
> quite easily, but only for rightmost cousins within a subtree --
> leftmost cousins might be too messy to verify for it to be worth it.
> We don't want to have to jump two or three levels up within
> bt_downlink_connectivity_check() just for leftmost cousin pages. But
> maybe you're feeling ambitious! What do you think?

I suppose the alternative is to get the high key of the parent's left
sibling, rather than going to the parent's parent (i.e. our
grandparent). That would probably be the best way to get a separator
key to compare against the high key in the leftmost cousin page of a
subtree, if in fact we wanted to *fully* solve the "cousin problem".
Goetz Graefe recommends keeping both a low key and a high key in every
page for verification purposes. We don't actually have a low key (we
only have a truncated negative infinity item), but this approach isn't
that far off having a low key.

-- 
Peter Geoghegan



Re: Improve search for missing parent downlinks in amcheck

From
Peter Geoghegan
Date:
On Thu, Jan 23, 2020 at 5:40 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I suppose the alternative is to get the high key of the parent's left
> sibling, rather than going to the parent's parent (i.e. our
> grandparent). That would probably be the best way to get a separator
> key to compare against the high key in the leftmost cousin page of a
> subtree, if in fact we wanted to *fully* solve the "cousin problem".

I think I've confused myself here. The
"!offset_is_negative_infinity(topaque, pivotkey_offset)" part of the
bt_downlink_connectivity_check() high key check test that I mentioned
now *also* seem unnecessary. Any high key in a page that isn't marked
"half-dead" or marked "deleted" or marked "has incomplete split" can
be targeted by the check. Once the page meets that criteria, there
must be a pivot tuple in the parent page that contains an
identical-to-highkey separator key (this could be the parent's own
high key).

The only thing that you need to do is be careful about rightmost
parent pages of a non-rightmost page -- stuff like that. But, I think
that that's only needed because an amcheck segfault isn't a very nice
way to detect corruption.

The existing comment about negative infinity items within
bt_downlink_check() that I mentioned in my main e-mail (the "Note:"
part) don't quite apply here. You're not verifying a pivot tuple that
has a downlink (which might be negative infinity) against the lower
levels -- you're doing *the opposite*. That is, you're verifying a
high key using the parent. Which seems like the right way to do it --
you can test for the incomplete split flag and so on by doing it
bottom up (not top down). This must have been why I was confused.

Phew!
-- 
Peter Geoghegan



Re: Improve search for missing parent downlinks in amcheck

From
Alexander Korotkov
Date:
Hi, Peter!

On Fri, Jan 24, 2020 at 4:31 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Jan 22, 2020 at 6:41 PM Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
> > Rebased patch is attached.  Sorry for so huge delay.
>
> I really like this patch. Your interest in amcheck is something that
> makes me feel good about having put so much work into it myself.
>
> Here are some review comments:

Great, thank you very much!

> > +   /*
> > +    * Rightlink and incomplete split flag of previous block referenced by
> > +    * downlink.
> > +    */
> > +   BlockNumber prevrightlink;
> > +   bool        previncompletesplit;
> > +
>
> What downlink? What does this mean? Do you mean the most recently
> followed rightlink on the current level, or InvalidBlockNumber if
> target page is the leftmost page on the current level on the scan?
>
> (Thinks some more...)
>
> Actually, these two new fields track the state *one level down* from
> the target page level when !readonly (unless target page is on the
> leaf level). Right? Comments should be explicit about this. The
> current comments about downlinks isn't clear.

I agree. I've used very vague terms in comments.  Revised now.

> >         if (offset_is_negative_infinity(topaque, offset))
> > +       {
> > +           /*
> > +            * Initialize downlink connectivity check if needed.
> > +            */
> > +           if (!P_ISLEAF(topaque) && state->readonly)
> > +           {
> > +               bt_downlink_connectivity_check(state,
> > +                                              offset,
> > +                                              NULL,
> > +                                              topaque->btpo.level);
> > +           }
> >             continue;
> > +       }
>
> Don't need the "!P_ISLEAF()" here.

Why don't I need.  bt_downlink_connectivity_check() checks one level
down to the target level.  But there is no one level down to leaf...

> Also, you should say something like
> "we need to call this here because the usual callsite in
> bt_downlink_check() won't be reached".

Sure, fixed.

> >     /*
> > -    * * Check if page has a downlink in parent *
> > -    *
> > -    * This can only be checked in heapallindexed + readonly case.
> > +    * If we traversed the whole level to the rightmost page, there might be
> > +    * missing downlinks for the pages to the right of rightmost downlink.
> > +    * Check for them.
> >      */
>
> You mean "to the right of the child page pointed to by our rightmost downlink"?

Yep, fixed.

> I think that the final bt_downlink_connectivity_check() call within
> bt_target_page_check() should make it clear that it is kind of special
> compared to the other two calls.

Yes, this is fixed too.

> > +/*
> > + * Check connectivity of downlinks.  Traverse rightlinks from previous downlink
> > + * to the current one.  Check that there are no intermediate pages with missing
> > + * downlinks.
> > + *
> > + * If 'loaded_page' is given, it's assumed to be contents of downlink
> > + * referenced by 'downlinkoffnum'.
> > + */
>
> Say "assumed to be the page pointed to by the downlink", perhaps?

Yes, fixed.

> > +static void
> > +bt_downlink_connectivity_check(BtreeCheckState *state,
> > +                              OffsetNumber downlinkoffnum,
> > +                              Page loaded_page,
> > +                              uint32 parent_level)
> > +{
>
> In amcheck, we always have a current target page. Every page gets to
> be the target exactly once, though sometimes other subsidiary pages
> are accessed. We try to blame the target page, even with checks that
> are technically against its child/sibling/whatever. The target page is
> always our conceptual point of reference. Sometimes this is a bit
> artificial, but it's still worth doing consistently. So I think you
> should change these argument names with that in mind (see below).

Yes, the arguments were changes as you proposed.

> > +       /*
> > +        * If we visit page with high key, check that it should be equal to
> > +        * the target key next to corresponding downlink.
> > +        */
>
> I suggest "...check that it is equal to the target key..."

Agree, fixed.

> > +           /*
> > +            * There might be two situations when we examine high key.  If
> > +            * current child page is referenced by given downlink, we should
> > +            * look to the next offset number for matching key.
>
> You mean "the next offset number for the matching key from the target
> page"? I find it much easier to keep this stuff in my head if
> everything is defined in terms of its relationship with the current
> target page. For example, bt_downlink_connectivity_check()'s
> "parent_level" argument should be called "target_level" instead, while
> its "loaded_page" should be called "loaded_child". Maybe
> "downlinkoffnum" should be "target_downlinkoffnum". And downlinkoffnum
> should definitely be explained in comments at the top of
> bt_downlink_connectivity_check() (e.g., say what it means when it is
> InvalidOffsetNumber).

Renamed as you proposed.

> > Alternatively
> > +            * we might find child with high key while traversing from
> > +            * previous downlink to current one.  Then matching key resides
> > +            * the same offset number as current downlink.
> > +            */
>
> Not sure what "traversing from previous downlink to current one" means at all.

I've rephrased this comment, please check.

> > +           if (!offset_is_negative_infinity(topaque, pivotkey_offset) &&
> > +               pivotkey_offset <= PageGetMaxOffsetNumber(state->target))
> > +           {
> > +               uint32      cmp = _bt_compare(state->rel,
> > +                                             skey,
> > +                                             state->target,
> > +                                             pivotkey_offset);
>
> There is no need to bother with a _bt_compare() here. Why not just use
> memcmp() with a pointer to itup->t_tid.ip_posid (i.e. memcmp() that
> skips the block number)? I think that it is better to expect the keys
> to be *identical* among pivot tuples, including within tuple alignment
> padding (only the downlink block number can be different here). If
> non-pivot tuples were involved then you couldn't do it this way, but
> they're never involved, so it makes sense. A memcmp() will be faster,
> obviously. More importantly, it has the advantage of not relying on
> opclass infrastructure in any way. It might be worth adding an
> internal verify_nbtree.c static helper function to do the memcmp() for
> you -- bt_pivot_tuple_identical(), or something like that.

Agree, replaced _bt_compare() with bt_pivot_tuple_identical().  It
becomes even simpler now, thanks!

> I think bt_downlink_check() and bt_downlink_connectivity_check()
> should be renamed to something broader. In my mind, downlink is
> basically a block number. We have been sloppy about using the term
> downlink when we really mean "pivot tuple with a downlink" -- I am
> guilty of this myself. But it seems more important, now that you have
> the new high key check.

Hmm... Names are hard for me.  I didn't do any renaming for now.  What
about this?
bt_downlink_check() => bt_child_check()
bt_downlink_connectivity_check() => bt_child_connectivity_highkey_check()

> I particularly don't like the way you sometimes say "downlink" when
> you mean "child page". You do that in this error message:
>
> > +                   (errcode(ERRCODE_INDEX_CORRUPTED),
> > +                    errmsg("block found while traversing rightlinks from downlink of index \"%s\" has invalid
level",
> > +                           RelationGetRelationName(state->rel)),

Agree. Error message has been changed.

> Typo here:
>
> > +   /*
> > +    * If no previos rightlink is memorized, get it from current downlink for
> > +    * future usage.
> > +    */
>
> You mean "previous". Also, I think that you should say "memorized for
> current level just below target page's level".

Yep, fixed.

> >      * within bt_check_level_from_leftmost() won't reach the page either,
> >      * since the leaf's live siblings should have their sibling links updated
> > -    * to bypass the deletion target page when it is marked fully dead.)
> > +    * to bypass the deletion page under check when it is marked fully dead.)
> >      *
>
> This change seems wrong or unnecessary -- "deletion target" means
> "page undergoing deletion" (not necessarily marked P_ISDELETED() just
> yet), and has nothing to do with the amcheck target. You can change
> this if you want, but I don't get it.

Seems like random oversight.  Change removed.

> I tested this by using pg_hexedit to corrupt the least significant
> byte of a text key in the root page:
>
> pg@tpce:5432 [32610]=# select bt_index_parent_check('pk_holding');
> DEBUG:  verifying level 2 (true root level)
> DEBUG:  verifying 9 items on internal block 290
> DEBUG:  verifying level 1
> DEBUG:  verifying 285 items on internal block 3
> ERROR:  mismatch between parent key and child high key index "pk_holding"
> DETAIL:  Parent block=3 child block=9 parent page lsn=998/EFA21550.
>
> Happy to see that this works, even though this is one of the subtlest
> possible forms of index corruption. Previously, we could sometimes
> catch this with "rootdescend" verification, but only if there were
> *current* items that a scan couldn't find on lower levels (often just
> the leaf level). But now it doesn't matter -- we'll always detect it.
> (I think.)

Thank you for testing!

> Shouldn't this error message read '...in index "pk_holding"'? You
> missed the "in". Also, why not have the DETAIL message call the
> "Parent block" the target block?

Yep, fixed.

> I think that bt_downlink_connectivity_check() should have some
> high-level comments about what it's supposed to do. Perhaps an example
> is the best way to explain the concepts. Maybe say something about a
> three level B-Tree. Each of the separator keys in the grandparent/root
> page should also appear as high keys at the parent level. Each of the
> separator keys in the parent level should also appear as high keys on
> the leaf level, including the separators from the parent level high
> keys. Since each separator defines which subtrees are <= and > of the
> separator, there must be an identical seam of separators (in high
> keys) on lower levels. bt_downlink_connectivity_check() verifies that
> separator keys agree across a single level, which verifies the
> integrity of the whole tree.

I've revised finding the matching pivot key for high key.  Now, code
assumes we should always find a matching pivot key.  It could use both
target high key or left sibling high key (which is memorized as "low
key").

I've checked this on normal indexes, but I didn't try to exercise this
with broken indexes.  I would appreciate if you do.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

Re: Improve search for missing parent downlinks in amcheck

From
Alexander Korotkov
Date:
On Tue, Feb 18, 2020 at 1:15 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Fri, Jan 24, 2020 at 4:31 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > On Wed, Jan 22, 2020 at 6:41 PM Alexander Korotkov
> > <a.korotkov@postgrespro.ru> wrote:
> > > Rebased patch is attached.  Sorry for so huge delay.
> >
> > I really like this patch. Your interest in amcheck is something that
> > makes me feel good about having put so much work into it myself.
> >
> > Here are some review comments:
>
> Great, thank you very much!

Sorry, I've forgot to commit some comments before publishing a patch.
The right version is attached.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

Re: Improve search for missing parent downlinks in amcheck

From
Peter Geoghegan
Date:
On Tue, Feb 18, 2020 at 2:16 AM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Great, thank you very much!

No problem!

My remarks here are based on
"amcheck-btree-improve-missing-parent-downlinks-check-6.patch". I have
found a false positive corruption report bug in this latest version --
see note below about incomplete page splits.

> > Don't need the "!P_ISLEAF()" here.
>
> Why don't I need.  bt_downlink_connectivity_check() checks one level
> down to the target level.  But there is no one level down to leaf...

Because offset_is_negative_infinity() checks P_ISLEAF() for you. Maybe
it's better your way, though -- apparently it's clearer.

> > > Alternatively
> > > +            * we might find child with high key while traversing from
> > > +            * previous downlink to current one.  Then matching key resides
> > > +            * the same offset number as current downlink.
> > > +            */
> >
> > Not sure what "traversing from previous downlink to current one" means at all.
>
> I've rephrased this comment, please check.

> Agree, replaced _bt_compare() with bt_pivot_tuple_identical().  It
> becomes even simpler now, thanks!

There was actually an even better reason to invent
bt_pivot_tuple_identical(): a call to _bt_compare() in amcheck needs
to do something like the extra steps that you see in routines like
invariant_l_offset(). _bt_compare() will return 0 when the insertion
scankey has a prefix of scankey/column values that are equal, even
though there may be additional columns in the index tuple that are not
compared. So, you could have a truncated multi-column high key that is
"equal" to pivot tuple in parent that is actually to the right in the
key space. This blind spot would often occur with low cardinality
indexes, where we often have something like this in pivot tuples on
internal pages:

'foo, -inf'
'foo, (1,24)'
'food, -inf'. <-- This pivot tuple's downlink points to the final leaf
page that's filled with duplicates of the value 'foo'
'food, (3,19)' <-- This pivot tuple's downlink points to the *first*
leaf page that's filled with duplicates of the value 'food'
...

The situation is really common in low cardinality indexes because
nbtsplitloc.c hates splitting a leaf page between two duplicates -- it
is avoided whenever possible. You reliably get a '-inf' value for the
TID in the first pivot tuple for the duplicate, followed by a real
heap TID for later pivot tuples for pages with the same duplicate
value.

(Anyway, it's not important now.)

> > I think bt_downlink_check() and bt_downlink_connectivity_check()
> > should be renamed to something broader. In my mind, downlink is
> > basically a block number. We have been sloppy about using the term
> > downlink when we really mean "pivot tuple with a downlink" -- I am
> > guilty of this myself. But it seems more important, now that you have
> > the new high key check.
>
> Hmm... Names are hard for me.  I didn't do any renaming for now.  What
> about this?
> bt_downlink_check() => bt_child_check()
> bt_downlink_connectivity_check() => bt_child_connectivity_highkey_check()

I suggest:

bt_downlink_check() => bt_child_check()
bt_downlink_connectivity_check() => bt_child_highkey_check()

While bt_downlink_connectivity_check() moves right on the target's
child level, this isn't the common case. Moving right like that
doesn't need to be suggested by the name of the function.

Most of the time, we just check the high key -- right?

> I've revised finding the matching pivot key for high key.  Now, code
> assumes we should always find a matching pivot key.  It could use both
> target high key or left sibling high key (which is memorized as "low
> key").
>
> I've checked this on normal indexes, but I didn't try to exercise this
> with broken indexes.  I would appreciate if you do.

I can confirm that checking the high key in target's child page
against the high key in target (when appropriate) removes the "cousin
page verification blind spot" that I noticed in the last version, as
expected. Great!

* You should say "target page lsn" here instead:

pg@tpce:5432 [19852]=# select bt_index_parent_check(:'idxrelation',true, true);
ERROR:  mismatch between parent key and child high key in index "i_holding2"
DETAIL:  Target block=1570 child block=1690 parent page lsn=0/0.
Time: 12.509 ms

* Maybe say "Move to the right on the child level" in a comment above
the bt_downlink_connectivity_check() "while (true)" loop here:

> +
> +   while (true)
> +   {
> +       /*
> +        * Did we traverse the whole tree level and this is check for pages to
> +        * the right of rightmost downlink?
> +        */

* If you are going to save a low key for the target page in memory,
then you only need to do so for "state->readonly"/parent verification.

* You should s/lokey/lowkey/ -- I prefer the spelling "lowkey" or "low
key". This is a term that nbtsort.c now uses, in case you didn't know.

* The reason for saving a low key for each target page is very
unclear. Can we fix this?

The low key hardly ever gets used -- I can comment out the code that
uses a low key within bt_downlink_connectivity_check(), and indexes
that I tested appear fine. So I imagine this is for incomplete split
cases -- right?

I tested incomplete split cases using this simple hack:

diff --git a/src/backend/access/nbtree/nbtinsert.c
b/src/backend/access/nbtree/nbtinsert.c
index 4e5849ab8e..5811338584 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -1821,7 +1821,7 @@ _bt_insert_parent(Relation rel,
          */
         _bt_relbuf(rel, rbuf);

-        if (pbuf == InvalidBuffer)
+        if (random() <= (MAX_RANDOM_VALUE / 100))
             ereport(ERROR,
                     (errcode(ERRCODE_INDEX_CORRUPTED),
                      errmsg_internal("failed to re-find parent key in
index \"%s\" for split pages %u/%u",

Now, 1% of all page splits will fail after the first phase finishes,
but before the second phase begins/finishes. The incomplete split flag
will be found set by other, future inserts, though, at which point the
split will be finished (unless we're unlucky again).

An INSERT that inserts many rows to bound to fail with this hack
applied, but the INSERT shouldn't leave the index in a state that
looks like corruption to amcheck. I can see false positive reports of
corruption like this, though.

Consider the following sample session. The session inserts data from
one table into another table with an equivalent schema. As you can
see, I have to get my INSERT to fail three times before I see the
problem:

pg@tpce:5432 [13026]=# truncate  holding2;
TRUNCATE TABLE
pg@tpce:5432 [13026]=# insert into holding2 select * from holding;
ERROR:  failed to re-find parent key in index "pk_holding2" for split
pages 83/158
pg@tpce:5432 [13026]=# select bt_index_parent_check('i_holding2',true, true);
DEBUG:  verifying level 1 (true root level)
DEBUG:  verifying 200 items on internal block 3
DEBUG:  verifying level 0 (leaf level)
DEBUG:  verifying 262 items on leaf block 1
DEBUG:  verifying 258 items on leaf block 2
*** SNIP ***
DEBUG:  verifying 123 items on leaf block 201
DEBUG:  verifying that tuples from index "i_holding2" are present in "holding2"
DEBUG:  finished verifying presence of 0 tuples from table "holding2"
with bitset 4.83% set
┌───────────────────────┐
│ bt_index_parent_check │                          <-- No false positive yet
├───────────────────────┤
│                       │
└───────────────────────┘
(1 row)

pg@tpce:5432 [13026]=# insert into holding2 select * from holding;
DEBUG:  finishing incomplete split of 83/158
ERROR:  failed to re-find parent key in index "i_holding2" for split
pages 435/436
pg@tpce:5432 [13026]=# select bt_index_parent_check(:'idxrelation',true, true);
DEBUG:  verifying level 2 (true root level)
DEBUG:  verifying 2 items on internal block 303
*** SNIP ***
DEBUG:  verifying 74 items on leaf block 436
DEBUG:  verifying that tuples from index "i_holding2" are present in "holding2"
DEBUG:  finished verifying presence of 0 tuples from table "holding2"
with bitset 9.97% set
┌───────────────────────┐
│ bt_index_parent_check │                     <-- No false positive yet
├───────────────────────┤
│                       │
└───────────────────────┘
(1 row)

pg@tpce:5432 [13026]=# insert into holding2 select * from holding;
ERROR:  failed to re-find parent key in index "i_holding2" for split
pages 331/577
pg@tpce:5432 [13026]=# select bt_index_parent_check(:'idxrelation',true, true);
DEBUG:  verifying level 2 (true root level)
DEBUG:  verifying 3 items on internal block 303
DEBUG:  verifying level 1
DEBUG:  verifying 151 items on internal block 3
DEBUG:  verifying 189 items on internal block 521
DEBUG:  verifying 233 items on internal block 302
WARNING:  577 rightsplit
ERROR:  leaf index block lacks downlink in index "i_holding2". <--
False positive "corruption"
DETAIL:  Block=127 page lsn=0/0.

Notice I added a custom WARNING here, and we get the false positive
error just after the point that the WARNING is seen. This WARNING is
triggered by the temporary instrumentation change that I added to
amcheck, and can be seen in some cases that don't have an accompanying
false positive report of corruption:

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index d6cb10e263..dcfee9e4fa 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -1654,6 +1654,8 @@ bt_downlink_connectivity_check(BtreeCheckState *state,
                      errmsg("circular link chain found in block %u of
index \"%s\"",
                             blkno, RelationGetRelationName(state->rel))));

+        if (rightsplit)
+            elog(WARNING, "%u rightsplit", blkno);
         if (!first && !P_IGNORE(opaque))
         {
             /* blkno probably has missing parent downlink */
*** END DIFF ***

Anyway, I didn't take the time to debug this today, but I suspect that
the problem is that the lowkey thing is kind of brittle. If you can't
figure it out, let me know and I'll help with it.

* Can we reverse the order here, so that the common case (i.e. the
offset_is_negative_infinity() case) comes first?:

> +           if (offset_is_negative_infinity(topaque, pivotkey_offset))
> +           {
> +               /*
> +                * We're going to try match child high key to "negative
> +                * infinity key".  That means we should match to high key of
> +                * left sibling of target page.
> +                */
 *** SNIP ***

> +           }
> +           else
> +           {
 *** SNIP ***

> +           }

* Can the offset_is_negative_infinity() branch explain what's going on
at a much higher level than this?

--
Peter Geoghegan



Re: Improve search for missing parent downlinks in amcheck

From
Alexander Korotkov
Date:
Hi!

Thank you for your review.  The revised version is attached.

On Wed, Feb 19, 2020 at 1:16 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Feb 18, 2020 at 2:16 AM Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
> > > Don't need the "!P_ISLEAF()" here.
> >
> > Why don't I need.  bt_downlink_connectivity_check() checks one level
> > down to the target level.  But there is no one level down to leaf...
>
> Because offset_is_negative_infinity() checks P_ISLEAF() for you. Maybe
> it's better your way, though -- apparently it's clearer.

Oh, I see.  I prefer to leave it my way.  Explicit check is nearly
free and makes it clearer for me.

> > > > Alternatively
> > > > +            * we might find child with high key while traversing from
> > > > +            * previous downlink to current one.  Then matching key resides
> > > > +            * the same offset number as current downlink.
> > > > +            */
> > >
> > > Not sure what "traversing from previous downlink to current one" means at all.
> >
> > I've rephrased this comment, please check.
>
> > Agree, replaced _bt_compare() with bt_pivot_tuple_identical().  It
> > becomes even simpler now, thanks!
>
> There was actually an even better reason to invent
> bt_pivot_tuple_identical(): a call to _bt_compare() in amcheck needs
> to do something like the extra steps that you see in routines like
> invariant_l_offset(). _bt_compare() will return 0 when the insertion
> scankey has a prefix of scankey/column values that are equal, even
> though there may be additional columns in the index tuple that are not
> compared. So, you could have a truncated multi-column high key that is
> "equal" to pivot tuple in parent that is actually to the right in the
> key space. This blind spot would often occur with low cardinality
> indexes, where we often have something like this in pivot tuples on
> internal pages:
>
> 'foo, -inf'
> 'foo, (1,24)'
> 'food, -inf'. <-- This pivot tuple's downlink points to the final leaf
> page that's filled with duplicates of the value 'foo'
> 'food, (3,19)' <-- This pivot tuple's downlink points to the *first*
> leaf page that's filled with duplicates of the value 'food'
> ...
>
> The situation is really common in low cardinality indexes because
> nbtsplitloc.c hates splitting a leaf page between two duplicates -- it
> is avoided whenever possible. You reliably get a '-inf' value for the
> TID in the first pivot tuple for the duplicate, followed by a real
> heap TID for later pivot tuples for pages with the same duplicate
> value.
>

Thank you for the explanation!

> > > I think bt_downlink_check() and bt_downlink_connectivity_check()
> > > should be renamed to something broader. In my mind, downlink is
> > > basically a block number. We have been sloppy about using the term
> > > downlink when we really mean "pivot tuple with a downlink" -- I am
> > > guilty of this myself. But it seems more important, now that you have
> > > the new high key check.
> >
> > Hmm... Names are hard for me.  I didn't do any renaming for now.  What
> > about this?
> > bt_downlink_check() => bt_child_check()
> > bt_downlink_connectivity_check() => bt_child_connectivity_highkey_check()
>
> I suggest:
>
> bt_downlink_check() => bt_child_check()
> bt_downlink_connectivity_check() => bt_child_highkey_check()
>
> While bt_downlink_connectivity_check() moves right on the target's
> child level, this isn't the common case. Moving right like that
> doesn't need to be suggested by the name of the function.
>
> Most of the time, we just check the high key -- right?

Good.  Renamed as you proposed.

> * You should say "target page lsn" here instead:
>
> pg@tpce:5432 [19852]=# select bt_index_parent_check(:'idxrelation',true, true);
> ERROR:  mismatch between parent key and child high key in index "i_holding2"
> DETAIL:  Target block=1570 child block=1690 parent page lsn=0/0.
> Time: 12.509 ms

Yep, fixed.

> * Maybe say "Move to the right on the child level" in a comment above
> the bt_downlink_connectivity_check() "while (true)" loop here:

Comment is added.

> > +
> > +   while (true)
> > +   {
> > +       /*
> > +        * Did we traverse the whole tree level and this is check for pages to
> > +        * the right of rightmost downlink?
> > +        */
>
> * If you are going to save a low key for the target page in memory,
> then you only need to do so for "state->readonly"/parent verification.

Sure, now it's just for state->readonly.

> * You should s/lokey/lowkey/ -- I prefer the spelling "lowkey" or "low
> key". This is a term that nbtsort.c now uses, in case you didn't know.

Changed, thank you for pointing.

> * The reason for saving a low key for each target page is very
> unclear. Can we fix this?

Actually, lowkey is used for removing "cousin page verification blind
spot" when there are incomplete splits.  It might happen that we read
child with hikey matching its parent high key only when
bt_child_highkey_check() is called for "minus infinity" key of parent
right sibling.  Saving low key helps in this case.

> The low key hardly ever gets used -- I can comment out the code that
> uses a low key within bt_downlink_connectivity_check(), and indexes
> that I tested appear fine. So I imagine this is for incomplete split
> cases -- right?
>
> I tested incomplete split cases using this simple hack:
>
> diff --git a/src/backend/access/nbtree/nbtinsert.c
> b/src/backend/access/nbtree/nbtinsert.c
> index 4e5849ab8e..5811338584 100644
> --- a/src/backend/access/nbtree/nbtinsert.c
> +++ b/src/backend/access/nbtree/nbtinsert.c
> @@ -1821,7 +1821,7 @@ _bt_insert_parent(Relation rel,
>           */
>          _bt_relbuf(rel, rbuf);
>
> -        if (pbuf == InvalidBuffer)
> +        if (random() <= (MAX_RANDOM_VALUE / 100))
>              ereport(ERROR,
>                      (errcode(ERRCODE_INDEX_CORRUPTED),
>                       errmsg_internal("failed to re-find parent key in
> index \"%s\" for split pages %u/%u",
>
> Now, 1% of all page splits will fail after the first phase finishes,
> but before the second phase begins/finishes. The incomplete split flag
> will be found set by other, future inserts, though, at which point the
> split will be finished (unless we're unlucky again).
>
> An INSERT that inserts many rows to bound to fail with this hack
> applied, but the INSERT shouldn't leave the index in a state that
> looks like corruption to amcheck. I can see false positive reports of
> corruption like this, though.
>
> Consider the following sample session. The session inserts data from
> one table into another table with an equivalent schema. As you can
> see, I have to get my INSERT to fail three times before I see the
> problem:
>
> pg@tpce:5432 [13026]=# truncate  holding2;
> TRUNCATE TABLE
> pg@tpce:5432 [13026]=# insert into holding2 select * from holding;
> ERROR:  failed to re-find parent key in index "pk_holding2" for split
> pages 83/158
> pg@tpce:5432 [13026]=# select bt_index_parent_check('i_holding2',true, true);
> DEBUG:  verifying level 1 (true root level)
> DEBUG:  verifying 200 items on internal block 3
> DEBUG:  verifying level 0 (leaf level)
> DEBUG:  verifying 262 items on leaf block 1
> DEBUG:  verifying 258 items on leaf block 2
> *** SNIP ***
> DEBUG:  verifying 123 items on leaf block 201
> DEBUG:  verifying that tuples from index "i_holding2" are present in "holding2"
> DEBUG:  finished verifying presence of 0 tuples from table "holding2"
> with bitset 4.83% set
> ┌───────────────────────┐
> │ bt_index_parent_check │                          <-- No false positive yet
> ├───────────────────────┤
> │                       │
> └───────────────────────┘
> (1 row)
>
> pg@tpce:5432 [13026]=# insert into holding2 select * from holding;
> DEBUG:  finishing incomplete split of 83/158
> ERROR:  failed to re-find parent key in index "i_holding2" for split
> pages 435/436
> pg@tpce:5432 [13026]=# select bt_index_parent_check(:'idxrelation',true, true);
> DEBUG:  verifying level 2 (true root level)
> DEBUG:  verifying 2 items on internal block 303
> *** SNIP ***
> DEBUG:  verifying 74 items on leaf block 436
> DEBUG:  verifying that tuples from index "i_holding2" are present in "holding2"
> DEBUG:  finished verifying presence of 0 tuples from table "holding2"
> with bitset 9.97% set
> ┌───────────────────────┐
> │ bt_index_parent_check │                     <-- No false positive yet
> ├───────────────────────┤
> │                       │
> └───────────────────────┘
> (1 row)
>
> pg@tpce:5432 [13026]=# insert into holding2 select * from holding;
> ERROR:  failed to re-find parent key in index "i_holding2" for split
> pages 331/577
> pg@tpce:5432 [13026]=# select bt_index_parent_check(:'idxrelation',true, true);
> DEBUG:  verifying level 2 (true root level)
> DEBUG:  verifying 3 items on internal block 303
> DEBUG:  verifying level 1
> DEBUG:  verifying 151 items on internal block 3
> DEBUG:  verifying 189 items on internal block 521
> DEBUG:  verifying 233 items on internal block 302
> WARNING:  577 rightsplit
> ERROR:  leaf index block lacks downlink in index "i_holding2". <--
> False positive "corruption"
> DETAIL:  Block=127 page lsn=0/0.
>
> Notice I added a custom WARNING here, and we get the false positive
> error just after the point that the WARNING is seen. This WARNING is
> triggered by the temporary instrumentation change that I added to
> amcheck, and can be seen in some cases that don't have an accompanying
> false positive report of corruption:
>
> diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
> index d6cb10e263..dcfee9e4fa 100644
> --- a/contrib/amcheck/verify_nbtree.c
> +++ b/contrib/amcheck/verify_nbtree.c
> @@ -1654,6 +1654,8 @@ bt_downlink_connectivity_check(BtreeCheckState *state,
>                       errmsg("circular link chain found in block %u of
> index \"%s\"",
>                              blkno, RelationGetRelationName(state->rel))));
>
> +        if (rightsplit)
> +            elog(WARNING, "%u rightsplit", blkno);
>          if (!first && !P_IGNORE(opaque))
>          {
>              /* blkno probably has missing parent downlink */
> *** END DIFF ***
>
> Anyway, I didn't take the time to debug this today, but I suspect that
> the problem is that the lowkey thing is kind of brittle. If you can't
> figure it out, let me know and I'll help with it.

It appears that these false positives were cause by very basic error made here:

        if (!first && !P_IGNORE(opaque))
        {
            /* blkno probably has missing parent downlink */
            bt_downlink_missing_check(state, rightsplit, blkno, page);
        }

actually it should be

        if (blkno != downlink && !P_IGNORE(opaque))
        {
            /* blkno probably has missing parent downlink */
            bt_downlink_missing_check(state, rightsplit, blkno, page);
        }

So "blkno == downlink" means blkno has downlink, not being checked
first in the loop.  This is remains of old version of patch which I
forget to clean.  Now the check you've described works for me.

If you still think lowkey check is a problem, please help me figure it out.

> * Can we reverse the order here, so that the common case (i.e. the
> offset_is_negative_infinity() case) comes first?:
>
> > +           if (offset_is_negative_infinity(topaque, pivotkey_offset))
> > +           {
> > +               /*
> > +                * We're going to try match child high key to "negative
> > +                * infinity key".  That means we should match to high key of
> > +                * left sibling of target page.
> > +                */
>  *** SNIP ***
>
> > +           }
> > +           else
> > +           {
>  *** SNIP ***
>
> > +           }
>
> * Can the offset_is_negative_infinity() branch explain what's going on
> at a much higher level than this?

Sure, comment is revised!

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

Re: Improve search for missing parent downlinks in amcheck

From
Peter Geoghegan
Date:
Hi Alexander,

Apologies for the delayed response. I was a little tired from the
deduplication project.

On Mon, Feb 24, 2020 at 2:54 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Thank you for your review.  The revised version is attached.

This has bitrot, because of the deduplication patch. Shouldn't be hard
to rebase, though.

> > 'foo, -inf'
> > 'foo, (1,24)'
> > 'food, -inf'. <-- This pivot tuple's downlink points to the final leaf
> > page that's filled with duplicates of the value 'foo'
> > 'food, (3,19)' <-- This pivot tuple's downlink points to the *first*
> > leaf page that's filled with duplicates of the value 'food'
> > ...

> Thank you for the explanation!

I taught pageinspect to display a "htid" field for pivot tuples
recently, making it easier to visualize this example.

I think that you should say more about how "lowkey" is used here:

>         /*
> -        * Record if page that is about to become target is the right half of
> -        * an incomplete page split.  This can go stale immediately in
> -        * !readonly case.
> +        * Copy current target low key as the high key of right sibling.
> +        * Allocate memory in upper level context, so it would be cleared
> +        * after reset of target context.
> +        *
> +        * We only need low key for parent check.
>          */
> -       state->rightsplit = P_INCOMPLETE_SPLIT(opaque);
> +       if (state->readonly && !P_RIGHTMOST(opaque))
> +       {

Say something about concurrent page splits, since they're the only
case where we actually use lowkey. Maybe say something like: "We
probably won't end up doing anything with lowkey, but it's simpler for
readonly verification to always have it available".

> Actually, lowkey is used for removing "cousin page verification blind
> spot" when there are incomplete splits.  It might happen that we read
> child with hikey matching its parent high key only when
> bt_child_highkey_check() is called for "minus infinity" key of parent
> right sibling.  Saving low key helps in this case.

That makes sense to me.

> It appears that these false positives were cause by very basic error made here:
>
>         if (!first && !P_IGNORE(opaque))
>         {
>             /* blkno probably has missing parent downlink */
>             bt_downlink_missing_check(state, rightsplit, blkno, page);
>         }
>
> actually it should be
>
>         if (blkno != downlink && !P_IGNORE(opaque))
>         {
>             /* blkno probably has missing parent downlink */
>             bt_downlink_missing_check(state, rightsplit, blkno, page);
>         }
>
> So "blkno == downlink" means blkno has downlink, not being checked
> first in the loop.  This is remains of old version of patch which I
> forget to clean.  Now the check you've described works for me.
>
> If you still think lowkey check is a problem, please help me figure it out.

* I think that these comments could still be clearer:

> +               /*
> +                * We're going to try match child high key to "negative
> +                * infinity key".  This normally happens when the last child
> +                * we visited for target's left sibling was an incomplete
> +                * split. So, we must be still on the child of target's left
> +                * sibling. Thus, we should match to target's left sibling
> +                * high key. Thankfully we saved it, it's called a "low key".
> +                */

Maybe start with "We cannot try to match child's high key to a
negative infinity key in target, since there is nothing to compare.
However...". Perhaps use terms like "cousin page" and "subtree", which
can be useful. Alternatively, mention this case in the diagram example
at the top of bt_child_highkey_check(). It's tough to write comments
like this, but I think it's worth it.

Note that a high key is also a pivot tuple, so I wouldn't mention high
keys here:

> +/*
> + * Check if two tuples are binary identical except the block number.  So,
> + * this function is capable to compare high keys with pivot keys.
> + */
> +static bool
> +bt_pivot_tuple_identical(IndexTuple itup1, IndexTuple itup2)
> +{

v7 looks pretty close to being commitable, though I'll probably want
to update some comments that you haven't touched when you commit this.
I should probably wait until you've committed the patch to go do that.
I'm thinking of things like old comments in bt_downlink_check().

I will test the patch properly one more time when you produce a new
revision. I haven't really tested it since the last time.

-- 
Peter Geoghegan



Re: Improve search for missing parent downlinks in amcheck

From
Alexander Korotkov
Date:
Hi, Peter!

On Tue, Mar 3, 2020 at 3:04 AM Peter Geoghegan <pg@bowt.ie> wrote:
> Apologies for the delayed response. I was a little tired from the
> deduplication project.

No problem.  Apologies for the delayed revision as well.

> I taught pageinspect to display a "htid" field for pivot tuples
> recently, making it easier to visualize this example.

Great!

> I think that you should say more about how "lowkey" is used here:
>
> >         /*
> > -        * Record if page that is about to become target is the right half of
> > -        * an incomplete page split.  This can go stale immediately in
> > -        * !readonly case.
> > +        * Copy current target low key as the high key of right sibling.
> > +        * Allocate memory in upper level context, so it would be cleared
> > +        * after reset of target context.
> > +        *
> > +        * We only need low key for parent check.
> >          */
> > -       state->rightsplit = P_INCOMPLETE_SPLIT(opaque);
> > +       if (state->readonly && !P_RIGHTMOST(opaque))
> > +       {
>
> Say something about concurrent page splits, since they're the only
> case where we actually use lowkey. Maybe say something like: "We
> probably won't end up doing anything with lowkey, but it's simpler for
> readonly verification to always have it available".

I've revised this comment.  Hopefully it's better now.

> * I think that these comments could still be clearer:
>
> > +               /*
> > +                * We're going to try match child high key to "negative
> > +                * infinity key".  This normally happens when the last child
> > +                * we visited for target's left sibling was an incomplete
> > +                * split. So, we must be still on the child of target's left
> > +                * sibling. Thus, we should match to target's left sibling
> > +                * high key. Thankfully we saved it, it's called a "low key".
> > +                */
>
> Maybe start with "We cannot try to match child's high key to a
> negative infinity key in target, since there is nothing to compare.
> However...". Perhaps use terms like "cousin page" and "subtree", which
> can be useful. Alternatively, mention this case in the diagram example
> at the top of bt_child_highkey_check(). It's tough to write comments
> like this, but I think it's worth it.

I've updated this comment using terms "cousin page" and "subtree".  I
didn't refer the diagram example, because it doesn't contain
appropriate case.  And I wouldn't like this diagram to contain such
case, because that probably makes this diagram too complex.  I've also
invented term "uncle page". BTW, should it be "aunt page"?  I don't
know.

> Note that a high key is also a pivot tuple, so I wouldn't mention high
> keys here:
>
> > +/*
> > + * Check if two tuples are binary identical except the block number.  So,
> > + * this function is capable to compare high keys with pivot keys.
> > + */
> > +static bool
> > +bt_pivot_tuple_identical(IndexTuple itup1, IndexTuple itup2)
> > +{

Sure, this comment is revised.

> v7 looks pretty close to being commitable, though I'll probably want
> to update some comments that you haven't touched when you commit this.
> I should probably wait until you've committed the patch to go do that.
> I'm thinking of things like old comments in bt_downlink_check().
>
> I will test the patch properly one more time when you produce a new
> revision. I haven't really tested it since the last time.

Attached patch also has revised commit message.  I'll wait for your
response before commit.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Improve search for missing parent downlinks in amcheck

From
Alexander Korotkov
Date:
On Mon, Mar 9, 2020 at 12:52 AM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Attached patch also has revised commit message.  I'll wait for your
> response before commit.

Oh, I found that I haven't attached the patch.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

Re: Improve search for missing parent downlinks in amcheck

From
Peter Geoghegan
Date:
On Sun, Mar 8, 2020 at 2:52 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> I've revised this comment.  Hopefully it's better now.

I think that the new comments about why we need a low key for the page
are much better now.

> I've updated this comment using terms "cousin page" and "subtree".  I
> didn't refer the diagram example, because it doesn't contain
> appropriate case.  And I wouldn't like this diagram to contain such
> case, because that probably makes this diagram too complex.  I've also
> invented term "uncle page". BTW, should it be "aunt page"?  I don't
> know.

I have never heard the term "uncle page" before, but I like it --
though maybe say "right uncle page". That happens to be the exact
relationship that we're talking about here. I think any one of
"uncle", "aunt", or "uncle/aunt" are acceptable. We'll probably never
need to use this term again, but it seems like the right term to use
here.

Anyway, this section also seems much better now.

Other things that I noticed:

* Typo:

> +           /*
> +            * We don't call bt_child_check() for "negative infinity" items.
> +            * But if we're performatin downlink connectivity check, we do it
> +            * for every item including "negative infinity" one.
> +            */

s/performatin/performing/

* Suggest that you say "has incomplete split flag set" here:

> + * - The call for block 4 will initialize data structure, but doesn't do actual
> + *   checks assuming page 4 has incomplete split.

* More importantly, is this the right thing to say about page 4? Isn't
it also true that page 4 is the leftmost leaf page, and therefore kind
of special in another way? Even without having the incomplete split
flag set at all? Wouldn't it be better to address the incomplete split
flag issue by making that apply to some other page that isn't also the
leftmost? That would allow you to talk about the leftmost case
directly here. Or it would at least make it less confusing.

BTW, a P_LEFTMOST() assertion at the beginning of
bt_child_highkey_check() would make this easier to follow.

* Correct spelling is "happens" here:

> +            * current child page is not incomplete split, then its high key
> +            * should match to the target's key of current offset number. This
> +            * happends when child page referenced by previous downlink is

* Actually, maybe this whole sentence should be reworded instead --
say "This happens when a previous call here (to
bt_child_highkey_check()) found an incomplete split, and we reach a
right sibling page without a downlink -- the right sibling page's high
key still needs to be matched to a separator key on the parent/target
level".

* Maybe say "Don't apply OffsetNumberNext() to target_downlinkoffnum
when we already had to step right on the child level. Our traversal of
the child level must try to move in perfect lockstep behind (to the
left of) the target/parent level traversal."

I found this detail very confusing at first.

* The docs should say "...relationships, including checking that there
are no missing downlinks in the index structure" here:

>        unlike <function>bt_index_check</function>,
>        <function>bt_index_parent_check</function> also checks
> -      invariants that span parent/child relationships.
> +      invariants that span parent/child relationships including check
> +      that there are no missing downlinks in the index structure.
>        <function>bt_index_parent_check</function> follows the general
>        convention of raising an error if it finds a logical
>        inconsistency or other problem.

This is very close now. I would be okay with you committing the patch
once you deal with this feedback. If you prefer, I can take another
look at a new revision.

--
Peter Geoghegan



Re: Improve search for missing parent downlinks in amcheck

From
Alexander Korotkov
Date:
On Tue, Mar 10, 2020 at 3:07 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Sun, Mar 8, 2020 at 2:52 PM Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
> > I've revised this comment.  Hopefully it's better now.
>
> I think that the new comments about why we need a low key for the page
> are much better now.

Good, thank you.

> > I've updated this comment using terms "cousin page" and "subtree".  I
> > didn't refer the diagram example, because it doesn't contain
> > appropriate case.  And I wouldn't like this diagram to contain such
> > case, because that probably makes this diagram too complex.  I've also
> > invented term "uncle page". BTW, should it be "aunt page"?  I don't
> > know.
>
> I have never heard the term "uncle page" before, but I like it --
> though maybe say "right uncle page". That happens to be the exact
> relationship that we're talking about here. I think any one of
> "uncle", "aunt", or "uncle/aunt" are acceptable. We'll probably never
> need to use this term again, but it seems like the right term to use
> here.

According to context that should be left uncle page.  I've changed the
text accordingly.

> Anyway, this section also seems much better now.
>
> Other things that I noticed:
>
> * Typo:
>
> > +           /*
> > +            * We don't call bt_child_check() for "negative infinity" items.
> > +            * But if we're performatin downlink connectivity check, we do it
> > +            * for every item including "negative infinity" one.
> > +            */
>
> s/performatin/performing/

Fixed.

> * Suggest that you say "has incomplete split flag set" here:
>
> > + * - The call for block 4 will initialize data structure, but doesn't do actual
> > + *   checks assuming page 4 has incomplete split.

Yes, that sounds better.  Changed here and in the other places.

> * More importantly, is this the right thing to say about page 4? Isn't
> it also true that page 4 is the leftmost leaf page, and therefore kind
> of special in another way? Even without having the incomplete split
> flag set at all? Wouldn't it be better to address the incomplete split
> flag issue by making that apply to some other page that isn't also the
> leftmost? That would allow you to talk about the leftmost case
> directly here. Or it would at least make it less confusing.

Yes, current example looks confusing in this aspect.  But your comment
spotted to me an algorithmic issue.  We don't match highkey of
leftmost child against parent pivot key.  But we can.  The "if
(!BlockNumberIsValid(blkno))" branch survived from the patch version
when we didn't match high keys.  I've revised it.  Now we enter the
loop even for leftmost page on child level and match high key for that
page.

> BTW, a P_LEFTMOST() assertion at the beginning of
> bt_child_highkey_check() would make this easier to follow.

Yes, but why should it be an assert?  We can imagine corruption, when
there is left sibling of first child of leftmost target.  I guess,
current code would report such situation as an error, because this
left sibling lacks of parent downlink.  I've revised that "if" branch,
so we don't load a child page there anymore.  Error reporting is added
to the main loop.

> * Correct spelling is "happens" here:
>
> > +            * current child page is not incomplete split, then its high key
> > +            * should match to the target's key of current offset number. This
> > +            * happends when child page referenced by previous downlink is
>
> * Actually, maybe this whole sentence should be reworded instead --
> say "This happens when a previous call here (to
> bt_child_highkey_check()) found an incomplete split, and we reach a
> right sibling page without a downlink -- the right sibling page's high
> key still needs to be matched to a separator key on the parent/target
> level".
>
> * Maybe say "Don't apply OffsetNumberNext() to target_downlinkoffnum
> when we already had to step right on the child level. Our traversal of
> the child level must try to move in perfect lockstep behind (to the
> left of) the target/parent level traversal."
>
> I found this detail very confusing at first.
>
> * The docs should say "...relationships, including checking that there
> are no missing downlinks in the index structure" here:
>
> >        unlike <function>bt_index_check</function>,
> >        <function>bt_index_parent_check</function> also checks
> > -      invariants that span parent/child relationships.
> > +      invariants that span parent/child relationships including check
> > +      that there are no missing downlinks in the index structure.
> >        <function>bt_index_parent_check</function> follows the general
> >        convention of raising an error if it finds a logical
> >        inconsistency or other problem.

This comments are revised as you proposed.

> This is very close now. I would be okay with you committing the patch
> once you deal with this feedback. If you prefer, I can take another
> look at a new revision.

Thank you.  I'd like to have another feedback from you assuming there
are logic changes.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

Re: Improve search for missing parent downlinks in amcheck

From
Peter Geoghegan
Date:
On Tue, Mar 10, 2020 at 8:30 AM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Yes, current example looks confusing in this aspect.  But your comment
> spotted to me an algorithmic issue.  We don't match highkey of
> leftmost child against parent pivot key.  But we can.  The "if
> (!BlockNumberIsValid(blkno))" branch survived from the patch version
> when we didn't match high keys.  I've revised it.  Now we enter the
> loop even for leftmost page on child level and match high key for that
> page.

Great. That looks better.

> > BTW, a P_LEFTMOST() assertion at the beginning of
> > bt_child_highkey_check() would make this easier to follow.
>
> Yes, but why should it be an assert?  We can imagine corruption, when
> there is left sibling of first child of leftmost target.

I agree. I would only make it an assertion when it concerns an
implementation detail of amcheck, but that doesn't apply here.

> Thank you.  I'd like to have another feedback from you assuming there
> are logic changes.

This looks committable. I only noticed one thing: The comments above
bt_target_page_check() need to be updated to reflect the new check,
which no longer has anything to do with "heapallindexed = true".

-- 
Peter Geoghegan



Re: Improve search for missing parent downlinks in amcheck

From
Alexander Korotkov
Date:
On Wed, Mar 11, 2020 at 7:19 AM Peter Geoghegan <pg@bowt.ie> wrote:
> This looks committable. I only noticed one thing: The comments above
> bt_target_page_check() need to be updated to reflect the new check,
> which no longer has anything to do with "heapallindexed = true".

Thank you!  Pushed with this comment revised!

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Improve search for missing parent downlinks in amcheck

From
Peter Geoghegan
Date:
On Wed, Mar 11, 2020 at 2:02 AM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Thank you!  Pushed with this comment revised!

Thanks!

-- 
Peter Geoghegan