Thread: Corrupted btree index on HEAD because of covering indexes

Corrupted btree index on HEAD because of covering indexes

From

Michael Paquier

Date:

19 April 2018, 08:24:36

Hi all,

I was just testing the VACUUM truncation logic, and bumped into what
looks like a corrupted btree index.  Here is a reproducer:
create table aa (a int primary key, b bool);
insert into aa values (generate_series(1,1000000), false);
checkpoint;
update aa set b = false where a > 500000; -- Dirties a set of shared
buffers
delete from aa where a > 750000; -- Delete a set of rows
vacuum aa;
delete from aa where a > 10;
vacuum aa; -- error on btree with right sibling

And here is the actual failure when the second vacuum:
ERROR:  XX000: right sibling 4132 of block 2128 is not next child 5396 of block 412 in index "aa_pkey"
LOCATION:  _bt_mark_page_halfdead, nbtpage.c:1564

This works on REL_10_STABLE, so I am adding an open item.  I have not
investigated the exact problem yet, but bisect is showing me covering
indexes as the culprit (8224de4).

Thanks,
--
Michael

Attachment

signature.asc

Re: Corrupted btree index on HEAD because of covering indexes

From

Teodor Sigaev

Date:

19 April 2018, 08:29:39

Will see...

Michael Paquier wrote:
> Hi all,
> 
> I was just testing the VACUUM truncation logic, and bumped into what
> looks like a corrupted btree index.  Here is a reproducer:
> create table aa (a int primary key, b bool);
> insert into aa values (generate_series(1,1000000), false);
> checkpoint;
> update aa set b = false where a > 500000; -- Dirties a set of shared
> buffers
> delete from aa where a > 750000; -- Delete a set of rows
> vacuum aa;
> delete from aa where a > 10;
> vacuum aa; -- error on btree with right sibling
> 
> And here is the actual failure when the second vacuum:
> ERROR:  XX000: right sibling 4132 of block 2128 is not next child 5396 of block 412 in index "aa_pkey"
> LOCATION:  _bt_mark_page_halfdead, nbtpage.c:1564
> 
> This works on REL_10_STABLE, so I am adding an open item.  I have not
> investigated the exact problem yet, but bisect is showing me covering
> indexes as the culprit (8224de4).
> 
> Thanks,
> --
> Michael
> 

-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                       WWW: http://www.sigaev.ru/

Re: Corrupted btree index on HEAD because of covering indexes

From

Peter Geoghegan

Date:

19 April 2018, 09:44:05

On Wed, Apr 18, 2018 at 10:29 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> Will see...

I'll take a look tomorrow.

-- 
Peter Geoghegan

Re: Corrupted btree index on HEAD because of covering indexes

From

Teodor Sigaev

Date:

19 April 2018, 19:42:59

> I'll take a look tomorrow.

Interesting, contrib/amcheck doesn't find any error in index. Seems, it's 
subject for further improvement.

Nevertheless, seems, I found. In _bt_mark_page_halfdead() we use truncated high 
key IndexTuple as a storage of blocknumber of top parent to remove. And sets 
BTreeTupleSetNAtts(&trunctuple, 0) - it's stored in ip_posid.

But some later, in _bt_unlink_halfdead_page() we check ItemPointer high key with 
ItemPointerIsValid macro - and it returns false, because offset is actually 
InvalidOffsetNumber - i.e. 0 which was set by BTreeTupleSetNAtts. And some wrong 
decisions are follows, I didn't look at that.

Trivial and naive fix is attached, but for me it looks a bit annoing that we 
store pointer (leafhikey) somewhere inside unlocked page.



-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Attachment

vacuum_nbtree.patch

Re: Corrupted btree index on HEAD because of covering indexes

From

Peter Geoghegan

Date:

19 April 2018, 21:39:05

On Thu, Apr 19, 2018 at 9:42 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> Interesting, contrib/amcheck doesn't find any error in index. Seems, it's
> subject for further improvement.

I think that you're right that this should be detectable by
bt_index_parent_check(). I have an idea about how we can make this
happen with low overhead:

Iff we're readonly (that is, called through bt_index_parent_check()),
use a bloom filter to fingerprint downlink block numbers on each
non-leaf level, starting from the root. On the next level down, check
that we encountered a downlink at the parent level for non-P_IGNORE()
pages. If we didn't, throw an error. This should only need a small
Bloom filter to work well. I suppose we could only do it when
"heapallindexed = true" was specified to bt_index_parent_check(), and
size the Bloom filter based on the same principle as heapallindexed
verification.

This check is correct because the downlink in the parent is atomically
removed when the page (that the downlink points to) is marked
half-dead by VACUUM (and because there cannot be a concurrent VACUUM,
since we have a ShareLock on the relation). If we find a
non-P_IGNORE() page whose block number was not fingerprinted one level
up, then something must be wrong. (We should also throw an error
within bt_downlink_check() if the page turns out to be P_IGNORE(), to
make our testing "symmetrical".)

The really scary thing about corruption like this is not that it leads
to wrong answers to queries. It's that it *doesn't* lead to wrong
answers to queries (I mentioned this issue to Alexander a few weeks
ago, actually). This is true because we'll move right one level down,
maybe more than once, because it just looks like there was a
concurrent page split to index scans.

(There is actually another subtlety about this new
bt_index_parent_check() check, which is legitimate incomplete page
splits that we detect using P_INCOMPLETE_SPLIT(). The proposed check
still seems totally doable, though.)

I would like to go and implement this check now, and if all goes well
I may propose that you commit it to the master branch for v11. I don't
see this as a new feature. I see it as essential testing
infrastructure. What do you think about adding this new check soon?

> Nevertheless, seems, I found. In _bt_mark_page_halfdead() we use truncated
> high key IndexTuple as a storage of blocknumber of top parent to remove. And
> sets BTreeTupleSetNAtts(&trunctuple, 0) - it's stored in ip_posid.
>
> But some later, in _bt_unlink_halfdead_page() we check ItemPointer high key
> with ItemPointerIsValid macro - and it returns false, because offset is
> actually InvalidOffsetNumber - i.e. 0 which was set by BTreeTupleSetNAtts.
> And some wrong decisions are follows, I didn't look at that.

I'm not at all surprised. I strongly suspected that it was some simple
issue with the representation, since the INCLUDE patch didn't actually
change the page deletion algorithm in any way.

> Trivial and naive fix is attached, but for me it looks a bit annoing that we
> store pointer (leafhikey) somewhere inside unlocked page.

Well, it has worked that way for a long time. No reason to change it now.

I also think that we could have better conventional regression test
coverage here.

-- 
Peter Geoghegan

Re: Corrupted btree index on HEAD because of covering indexes

From

Teodor Sigaev

Date:

19 April 2018, 21:59:57

> I would like to go and implement this check now, and if all goes well
> I may propose that you commit it to the master branch for v11. I don't
> see this as a new feature. I see it as essential testing
> infrastructure. What do you think about adding this new check soon?
Agree. Hope, nobody found a way how to use amcheck module in production 
to serve user requests. But, it should be implemented before BETA stage, 
in opposite case we will get a lot of objections.

> 
>> Nevertheless, seems, I found. In _bt_mark_page_halfdead() we use truncated
>> high key IndexTuple as a storage of blocknumber of top parent to remove. And
>> sets BTreeTupleSetNAtts(&trunctuple, 0) - it's stored in ip_posid.
>>
>> But some later, in _bt_unlink_halfdead_page() we check ItemPointer high key
>> with ItemPointerIsValid macro - and it returns false, because offset is
>> actually InvalidOffsetNumber - i.e. 0 which was set by BTreeTupleSetNAtts.
>> And some wrong decisions are follows, I didn't look at that.
> 
> I'm not at all surprised. I strongly suspected that it was some simple
> issue with the representation, since the INCLUDE patch didn't actually
> change the page deletion algorithm in any way.
+1

> 
>> Trivial and naive fix is attached, but for me it looks a bit annoing that we
>> store pointer (leafhikey) somewhere inside unlocked page.
> 
> Well, it has worked that way for a long time. No reason to change it now.
Agree, but at least this place needs a comment - why it's safe.

> I also think that we could have better conventional regression test
> coverage here.
Will try to invent not so large test.

-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                       WWW: http://www.sigaev.ru/

Re: Corrupted btree index on HEAD because of covering indexes

From

Teodor Sigaev

Date:

19 April 2018, 22:47:02

> I also think that we could have better conventional regression test
> coverage here.

I tried to minimize Michael's test case and add it to patch.

-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                       WWW: http://www.sigaev.ru/

Attachment

vacuum_nbtree-v2.patch

Re: Corrupted btree index on HEAD because of covering indexes

From

Alexander Korotkov

Date:

20 April 2018, 00:00:11

On Thu, Apr 19, 2018 at 10:47 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:

I also think that we could have better conventional regression test
coverage here.

I tried to minimize Michael's test case and add it to patch.

- if (ItemPointerIsValid(leafhikey))

+ if (ItemPointerGetBlockNumberNoCheck(leafhikey) != InvalidBlockNumber)

Should we use BTreeInnerTupleGetDownLink() as soon as we use

BTreeInnerTupleSetDownLink() for setting this?

Or even invent BTreeInnerTupleDownLinkIsValid() macro?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Corrupted btree index on HEAD because of covering indexes

From

Peter Geoghegan

Date:

20 April 2018, 03:52:02

On Thu, Apr 19, 2018 at 2:00 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> - if (ItemPointerIsValid(leafhikey))
> + if (ItemPointerGetBlockNumberNoCheck(leafhikey) != InvalidBlockNumber)
>
> Should we use BTreeInnerTupleGetDownLink() as soon as we use
> BTreeInnerTupleSetDownLink() for setting this?
> Or even invent BTreeInnerTupleDownLinkIsValid() macro?

+1 to at least using BTreeInnerTupleGetDownLink().

I think it's okay to use ItemPointerGetBlockNumberNoCheck() in
amcheck, because it's inconvenient for amcheck to know if it's on the
leaf level or not (it's rather low-context code). However, the nbtree
code doesn't have that problem, and already has very few places that
could possibly need ItemPointerGetBlockNumberNoCheck(). In fact,
Teodor's fix touches the only remaining
ItemPointerGetBlockNumberNoCheck() caller code within all of nbtree.

In summary, I think that we should find a way to never use
ItemPointerGetBlockNumberNoCheck() in nbtree, while still using it in
amcheck.

-- 
Peter Geoghegan

Re: Corrupted btree index on HEAD because of covering indexes

From

Peter Geoghegan

Date:

20 April 2018, 04:26:49

On Thu, Apr 19, 2018 at 11:59 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> Agree. Hope, nobody found a way how to use amcheck module in production to
> serve user requests. But, it should be implemented before BETA stage, in
> opposite case we will get a lot of objections.

It shouldn't take that long to get a patch together. I've already started.

I think that any objections to pushing this amcheck enhancement after
feature freeze are unlikely, for these reasons:

* There is a precedent, since the integrity checking functions added
to pg_visibility by commit e472ce96 were added well after feature
freeze, and only 3 months before the release of 9.6.

* Anyone paying attention to the ongoing "VM map freeze corruption"
thread would have to agree that that was an extremely good idea. We're
the ones that have to maintain this code, and I think that we're
entitled to add a little extra C code to a contrib extension, in order
to make testing easier. Including for beta testing.

* The extra overhead is quite low, and will only affect
bt_index_parent_check() (not bt_index_check()).

* I deliberately created bt_index_parent_check() so that we'd have
something that does very thorough testing, that's too slow and/or
requires too heavy a relation-level lock to be useful to most users.
That was the plan from day one.

If you only care about hardware issues, then bt_index_check() is
better in almost every way, and is really all you should be using in
production unless you really don't care about the extra overhead, or
about the ShareLock. The docs more or less say this. I have certainly
heard of people using bt_index_parent_check() in production, but only
when they already knew that their database was corrupt, and wanted to
isolate the problem. I imagine that people that use
bt_index_parent_check() in production do so because they want as much
information as possible, and are willing to do whatever it takes to
get more information.

> Agree, but at least this place needs a comment - why it's safe.

Good idea.

>> I also think that we could have better conventional regression test
>> coverage here.
>
> Will try to invent not so large test.oif it means they may get a little extra

Your smaller test takes about 350ms to run on a debug build on my
laptop, which seems worth it (honestly, this test should have been
there already). Maybe we can add it to the amcheck regression tests
instead, since those are run less often. This also allows us to add a
test case specifically as part of the amcheck enhancement, which makes
a bit more sense.

-- 
Peter Geoghegan

Re: Corrupted btree index on HEAD because of covering indexes

From

Michael Paquier

Date:

20 April 2018, 04:27:50

On Thu, Apr 19, 2018 at 10:47:02PM +0300, Teodor Sigaev wrote:
> I tried to minimize Michael's test case and add it to patch.

I cannot comment on the actual fix as I lack background in the area, but
having a test case and even more having pg_upgrade do some work on those
pages are good things.  This addresses the failure as far as I can see.
--
Michael

Attachment

signature.asc

Re: Corrupted btree index on HEAD because of covering indexes

From

Teodor Sigaev

Date:

20 April 2018, 09:29:25

>     I tried to minimize Michael's test case and add it to patch.
> 
> 
> -if (ItemPointerIsValid(leafhikey))
> +if (ItemPointerGetBlockNumberNoCheck(leafhikey) != InvalidBlockNumber)
> 
> Should we use BTreeInnerTupleGetDownLink() as soon as we use
> BTreeInnerTupleSetDownLink() for setting this?
> Or even invent BTreeInnerTupleDownLinkIsValid() macro?
I am not sure. Here we actually store UP link - to top parent to remove. 
I'm afraid using BTreeInnerTupleGetDownLink/BTreeInnerTupleSetDownLink 
could cause a confusion, in other hand, introducing 
TreeInnerTupleGetUpLink/BTreeInnerTupleSetUpLink seems over-engineering

-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                       WWW: http://www.sigaev.ru/

Re: Corrupted btree index on HEAD because of covering indexes

From

Teodor Sigaev

Date:

20 April 2018, 09:41:58

> heard of people using bt_index_parent_check() in production, but only
> when they already knew that their database was corrupt, and wanted to
> isolate the problem. I imagine that people that use
> bt_index_parent_check() in production do so because they want as much
> information as possible, and are willing to do whatever it takes to
> get more information.
That why I think we need improve amcheck - ideally, it should not miss 
any mistakes in tree structure.

>> Agree, but at least this place needs a comment - why it's safe.
> 
> Good idea.
Could you write it? I'm afraid I can't give good explanation why we 
believe that nobody update this page ant it's  high key while page is 
unlocked but pinned.

> 
>>> I also think that we could have better conventional regression test
>>> coverage here.
>>
>> Will try to invent not so large test.oif it means they may get a little extra
> 
> Your smaller test takes about 350ms to run on a debug build on my
> laptop, which seems worth it (honestly, this test should have been
> there already). Maybe we can add it to the amcheck regression tests
> instead, since those are run less often. This also allows us to add a
> test case specifically as part of the amcheck enhancement, which makes
> a bit more sense.
Hm, it seems to me, that 350ms is short enough to place it in both core 
and amcheck test. I think, basic functionality should be covered by core 
tests as we test insert/create.


-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                       WWW: http://www.sigaev.ru/

Re: Corrupted btree index on HEAD because of covering indexes

From

Teodor Sigaev

Date:

20 April 2018, 17:18:12

>> Should we use BTreeInnerTupleGetDownLink() as soon as we use
>> BTreeInnerTupleSetDownLink() for setting this?
>> Or even invent BTreeInnerTupleDownLinkIsValid() macro?
> I am not sure. Here we actually store UP link - to top parent to remove. I'm 
> afraid using BTreeInnerTupleGetDownLink/BTreeInnerTupleSetDownLink could cause a 
> confusion, in other hand, introducing 
> TreeInnerTupleGetUpLink/BTreeInnerTupleSetUpLink seems over-engineering
> 

After close look I change my opinion. To have a clean code it's much better to 
have new pair get/set macroses specialy to manage link to top pare during page 
deletion. This removes last naked usage of 
ItemPointer(SetInvalid/IsInvalid/GetBlockNumberNoCheck) and uses self-described 
macroses. Patch is attached.


-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Attachment

vacuum_nbtree-v3.patch

Re: Corrupted btree index on HEAD because of covering indexes

From

Peter Geoghegan

Date:

20 April 2018, 23:06:15

On Thu, Apr 19, 2018 at 11:41 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:
>> heard of people using bt_index_parent_check() in production, but only
>> when they already knew that their database was corrupt, and wanted to
>> isolate the problem. I imagine that people that use
>> bt_index_parent_check() in production do so because they want as much
>> information as possible, and are willing to do whatever it takes to
>> get more information.
>
> That why I think we need improve amcheck - ideally, it should not miss any
> mistakes in tree structure.

I have finished writing a prototype that works, and passes all tests.
Without the fix, this is what the complaint from VACUUM looks like on
my machine:

ERROR:  right sibling 265 of block 134 is not next child 395 of block
135 in index "delete_test_table_pkey"

After the final VACUUM in your test case runs (the one that throws
this error), my working copy of amcheck can detect the same issue with
low overhead:

pg@~[20995]=# select bt_index_parent_check('delete_test_table_pkey', true);
ERROR:  index block lacks downlink in index "delete_test_table_pkey"
DETAIL:  Block=265 level=1 page lsn=0/909F4B18.

There is no complaint from amcheck if called before the final VACUUM
in your test case, since that's when the trouble starts (and ends).
It's easy to account for concurrent page splits within amcheck by
checking for BTP_SPLIT_END on the child page that could lack a
downlink. Don't confuse this with checking the BTP_INCOMPLETE_SPLIT
flag, which is what we set on the left half/sibling of a split -- not
the "new" right half/sibling. Of course, the right sibling is the type
of block that can lack a downlink until after the second phase of a
pagesplit (i.e. until it's fully finished), or after the first phase
of page deletion. (As you know, page deletion is like a page split "in
reverse".)

I said concurrent page split, but I really meant incomplete page
split, since a concurrent insert/split is impossible given that
amcheck holds a ShareLock here. (See commit 40dae7ec5.)

Of course, we don't see details about the next level up in the amcheck
error, unlike the VACUUM error. That shouldn't matter, because we're
already verifying a lot about the relationship between blocks at the
next level up, and even between the two levels. This enhancement adds
the one piece we were missing. It could in theory detect other
problems that VACUUM cannot detect, too.

Expect me to post a patch during the work day on Monday or Tuesday
(Pacific time). I need to polish this patch some more. I particularly
need to think about the use of memory for the Bloom filter (right now
I just use work_mem).

> Could you write it? I'm afraid I can't give good explanation why we believe
> that nobody update this page ant it's  high key while page is unlocked but
> pinned.

Okay. I'll come up with something soon.

> Hm, it seems to me, that 350ms is short enough to place it in both core and
> amcheck test. I think, basic functionality should be covered by core tests
> as we test insert/create.

I think you're right. There is really no excuse for not having
thorough code coverage for a module like nbtree.

-- 
Peter Geoghegan

Re: Corrupted btree index on HEAD because of covering indexes

From

Peter Geoghegan

Date:

21 April 2018, 04:56:19

On Fri, Apr 20, 2018 at 7:18 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> After close look I change my opinion. To have a clean code it's much better
> to have new pair get/set macroses specialy to manage link to top pare during
> page deletion. This removes last naked usage of
> ItemPointer(SetInvalid/IsInvalid/GetBlockNumberNoCheck) and uses
> self-described macroses. Patch is attached.

I see your point. Maybe don't have the newline between the get and
set, though, to match the existing style. And, the note about the
assertion seems unnecessary.

I suggest putting something about what general area this deals with.
Perhaps something like the following:

"Get/set leaf page highkey's link. During the second phase of
deletion, the target leaf page's high key may point to an ancestor
page (at all other times, the leaf level high key's link is not used).
See the nbtree README for full details."

-- 
Peter Geoghegan

Re: Corrupted btree index on HEAD because of covering indexes

From

Peter Geoghegan

Date:

22 April 2018, 04:02:12

On Thu, Apr 19, 2018 at 9:42 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> But some later, in _bt_unlink_halfdead_page() we check ItemPointer high key
> with ItemPointerIsValid macro - and it returns false, because offset is
> actually InvalidOffsetNumber - i.e. 0 which was set by BTreeTupleSetNAtts.
> And some wrong decisions are follows, I didn't look at that.

The problem here is that multi-level page deletion incorrectly thinks
that it's safe to fully delete a half-dead leaf page. Of course, this
is due to the issue with the representation of high keys that you
identified. IOW, VACUUM incorrect believes that fully deleting the
leaf page will leave no "dangling downlink" one level up, but, in
fact, there is one such dangling downlink. This could eventually lead
to wrong answers to queries, since the deleted page will eventually be
fully recycled.

I refined the amcheck enhancement quite a bit today. It will not just
check that a downlink is not missing; It will also confirm that it
wasn't a legitimately interrupted multi-level deletion, by descending
to the leaf page to match the leaf high key pointer to the top most
parent, which should be the target page (the page that lacked a
downlink according to the new Bloom filter). We need to worry about
multi-level deletions that are interrupted by an error or a hard
crash, which presents a legitimate case where there'll be no downlink
for an internal page in its parent. VACUUM is okay with that, so we
must be too.

I'll finish it off early next week as planned. It should
comprehensively cover issues with multi-level page deletion, including
cases where you have this kind of index corruption and page recycling
later makes it much worse. So, it won't just cover the exact test case
from Michael, which produced corruption that can be detected by
throwing an error when a downlink leads to a fully deleted page
(though only when amcheck has a ShareLock, of course).

Note that the problem here is best understood as a problem around the
presence of a dangling downlink, as oppose to a problem around the
absence of a needed downlink. There is an absent downlink, but that's
beside the point, and in any case missing a downlink is not
*inherently* wrong (as I said, it's not inherently wrong because of
the issue of interrupted multi-level page deletes).

-- 
Peter Geoghegan

Re: Corrupted btree index on HEAD because of covering indexes

From

Teodor Sigaev

Date:

23 April 2018, 15:57:34

Thank you, pushed

> I see your point. Maybe don't have the newline between the get and
> set, though, to match the existing style. And, the note about the
> assertion seems unnecessary.
Removed newline and note

> "Get/set leaf page highkey's link. During the second phase of
> deletion, the target leaf page's high key may point to an ancestor
> page (at all other times, the leaf level high key's link is not used).
> See the nbtree README for full details."
Changed as you suggested.


-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: Corrupted btree index on HEAD because of covering indexes

From

Peter Geoghegan

Date:

24 April 2018, 04:13:47

On Sat, Apr 21, 2018 at 6:02 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I refined the amcheck enhancement quite a bit today. It will not just
> check that a downlink is not missing; It will also confirm that it
> wasn't a legitimately interrupted multi-level deletion, by descending
> to the leaf page to match the leaf high key pointer to the top most
> parent, which should be the target page (the page that lacked a
> downlink according to the new Bloom filter). We need to worry about
> multi-level deletions that are interrupted by an error or a hard
> crash, which presents a legitimate case where there'll be no downlink
> for an internal page in its parent. VACUUM is okay with that, so we
> must be too.

Attached patch lets amcheck detect the issue when
bt_index_parent_check() is called, though only when heapallindexed
verification was requested (that is, only when bt_index_parent_check()
is doing something with a Bloom filter already). The new checks will
probably also detect any possible issue with multi-level page
deletions. The patch tightens up our general expectations around
half-dead and fully deleted pages, which seems necessary but also
independently useful.

I'm using work_mem to constrain the size of the second Bloom filter,
whereas the heapallindexed Bloom filter is constrained by
maintenance_work_mem. This seems fine to me, since we have always used
an additional work_mem budget for spool2 when building a unique index
within nbtsort.c. Besides, it will probably be very common for the
downlink Bloom filter to be less than 1% the size of the first Bloom
filter when we have adequate memory for both Bloom filters (i.e. very
small). I thought about mentioning this work_mem allocation in the
docs, but decided that there was no need, since the CREATE INDEX
spool2 work_mem stuff isn't documented anywhere either.

Note that the "c.relkind = 'i'" change in the docs is about not
breaking the amcheck query when local partitioned indexes happen to be
in use (when the user changed the sample SQL query to not just look at
pg_catalog indexes). See the "Local partitioned indexes and
pageinspect" thread I just started for full details.

The new P_ISDELETED() test within bt_downlink_missing_check() is what
actually detects the corruption that the test case causes, since the
fully deleted leaf page still has a sane top parent block number left
behind (that is, we don't get as far as testing "if
(BTreeTupleGetTopParent(itup) == state->targetblock)"; that's not how
the the leaf page can get corrupt in the test case that Michael
posted). Note that there are also two new similar P_ISDELETED() tests
added to two existing functions (bt_downlink_check() and
bt_check_level_from_leftmost()), but those tests won't detect the
corruption that we saw. They're really there to nail down how we think
about fully deleted pages.

-- 
Peter Geoghegan

Attachment

0001-Add-missing-and-dangling-downlink-checks-to-amcheck.patch

Re: Corrupted btree index on HEAD because of covering indexes

From

Teodor Sigaev

Date:

24 April 2018, 19:06:44

Perfect! I would like to commit it but have some suggestions:

1)
TRUNCATE bttest_multi;
INSERT INTO bttest_multi SELECT i, i%2  FROM generate_series(1, 100000) as i;
SELECT bt_index_parent_check('bttest_multi_idx', true);

to improve test stability it would be better to disable autovacuum:
ALTER TABLE bttest_multi SET (autovacuum_enabled = false)


2) Pls, add to test DELETE as it done in 6db4b49986be3fe59a1f6ba6fabf9852864efc3e

3) It's not directly connected to this patch, but allocation of BtreeCheckState 
is not needed, it could be allocated on stack.

4) Nevertheless, I suggest to use palloc0 (or memset(0)) for BtreeCheckState. 
Now several fields of that structure could be not inited.


-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: Corrupted btree index on HEAD because of covering indexes

From

Peter Geoghegan

Date:

24 April 2018, 20:12:28

On Tue, Apr 24, 2018 at 9:06 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> Perfect!

Thanks!

> I would like to commit it but have some suggestions:

I attach a revised version, which has changes based on your feedback.

> to improve test stability it would be better to disable autovacuum:
> ALTER TABLE bttest_multi SET (autovacuum_enabled = false)

Done.

> 2) Pls, add to test DELETE as it done in
> 6db4b49986be3fe59a1f6ba6fabf9852864efc3e

Done. I will leave it to you to decide whether or not the original
create_index.sql test can now be removed.

> 3) It's not directly connected to this patch, but allocation of
> BtreeCheckState is not needed, it could be allocated on stack.
>
> 4) Nevertheless, I suggest to use palloc0 (or memset(0)) for
> BtreeCheckState. Now several fields of that structure could be not inited.

I like the idea of using palloc0(). Done that way.

-- 
Peter Geoghegan

Attachment

v2-0001-Add-missing-and-dangling-downlink-checks-to-amche.patch

Re: Corrupted btree index on HEAD because of covering indexes

From

Teodor Sigaev

Date:

25 April 2018, 18:04:21

Thank you, pushed.

Peter Geoghegan wrote:
> On Tue, Apr 24, 2018 at 9:06 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
>> Perfect!
-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: Corrupted btree index on HEAD because of covering indexes

From

Teodor Sigaev

Date:

25 April 2018, 18:06:44

>> 2) Pls, add to test DELETE as it done in
>> 6db4b49986be3fe59a1f6ba6fabf9852864efc3e
> 
> Done. I will leave it to you to decide whether or not the original
> create_index.sql test can now be removed.
Leave it in both places. In ideal world, in amcheck test we need to create a 
broken index, but I don't know a correct way to do it :)

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: Corrupted btree index on HEAD because of covering indexes

From

Peter Geoghegan

Date:

25 April 2018, 19:38:12

On Wed, Apr 25, 2018 at 8:06 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> Leave it in both places. In ideal world, in amcheck test we need to create a
> broken index, but I don't know a correct way to do it :)

There are many ways to create a broken index, but they're hard to make
work with pg_regress. :-)

It looks like you pushed v1, which didn't have the tests and other
changes you asked for. Attached patch adds those back.

Thanks
-- 
Peter Geoghegan

Attachment

0001-Add-amcheck-missing-downlink-tests.patch

Re: Corrupted btree index on HEAD because of covering indexes

From

Teodor Sigaev

Date:

25 April 2018, 22:06:33

> It looks like you pushed v1, which didn't have the tests and other
> changes you asked for. Attached patch adds those back.

Oops, I missed, I don't know how... Thank you very much for your quick eye!

-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                       WWW: http://www.sigaev.ru/

Re: Corrupted btree index on HEAD because of covering indexes

From

Peter Geoghegan

Date:

25 April 2018, 22:13:27

On Wed, Apr 25, 2018 at 12:06 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> Oops, I missed, I don't know how... Thank you very much for your quick eye!

Thanks Teodor.

-- 
Peter Geoghegan

Re: Corrupted btree index on HEAD because of covering indexes

From

Tom Lane

Date:

25 April 2018, 23:45:51

Teodor Sigaev <teodor@sigaev.ru> writes:
>> It looks like you pushed v1, which didn't have the tests and other
>> changes you asked for. Attached patch adds those back.

> Oops, I missed, I don't know how... Thank you very much for your quick eye!

Teodor, are you caught up to the point where it'd be okay to run pgindent,
or are there still patches waiting?

            regards, tom lane

Re: Corrupted btree index on HEAD because of covering indexes

From

Peter Geoghegan

Date:

25 April 2018, 23:48:23

On Wed, Apr 25, 2018 at 1:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Teodor, are you caught up to the point where it'd be okay to run pgindent,
> or are there still patches waiting?

I can't speak for Teodor, but fwiw I am not aware of any more patches waiting.

-- 
Peter Geoghegan

Re: Corrupted btree index on HEAD because of covering indexes

From

Alexander Korotkov

Date:

26 April 2018, 00:09:10

On Wed, Apr 25, 2018 at 11:48 PM, Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Apr 25, 2018 at 1:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Teodor, are you caught up to the point where it'd be okay to run pgindent,
> or are there still patches waiting?

I can't speak for Teodor, but fwiw I am not aware of any more patches waiting.

Great. Thank you very much for your efforts on this feature!

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Corrupted btree index on HEAD because of covering indexes

From

Peter Geoghegan

Date:

26 April 2018, 01:13:03

On Wed, Apr 25, 2018 at 2:09 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Great.  Thank you very much for your efforts on this feature!

You're welcome. I also appreciate the efforts of all Postgres Pro
people. I especially appreciate Anastasia's work on this project.

I hope that we see more B-Tree patches from Anastasia, and others.
Getting this patch over the line was, it must be said, pretty painful
for everyone. There were many reasons for that, some good, others not
so good. However, I think that we've learned a few things about how to
manage the risks around a patch like this one. This will hopefully
make it significantly easier to do more work in this area. I think
that it's badly needed. I think that its importance is generally
underestimated, actually. The Gray and Reuter book says "B-Trees are
by far the most important access path structure in database and file
systems". I believe that that's just as true now, 25 years later.

The core algorithms that are used by nbtree are generally very well
implemented, and I don't think it's a good idea to work on features
that change that. At least, I won't be working on something like
merging of non-empty pages myself (or anything that touches
concurrency, page deletion, etc). However, I do think that we can
significantly improve nbtree for certain workloads by optimizing the
representation at the page level, which I'm glad to see more interest
in in the past few years. The major difficulty here is choosing an
on-disk representation that lets us do everything that we'll want to
do in the future. That requires strong cooperation.

-- 
Peter Geoghegan

Re: Corrupted btree index on HEAD because of covering indexes

From

Teodor Sigaev

Date:

26 April 2018, 09:40:12

> Teodor, are you caught up to the point where it'd be okay to run pgindent,
> or are there still patches waiting?

There is a gin predicate lock patch from Heikki, I will work on it. But 
I have not objection to run pgindent, I believe gin patch will be easy 
to change.

-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                       WWW: http://www.sigaev.ru/