Thread: amcheck (B-Tree integrity checking tool)

amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
I was assigned an "action point" during the FOSDEM developer meeting:
"Post new version of btree consistency checker patch". I attach a new
WIP version of my consistency checker tool, amcheck. This patch is
proposed for 9.6, as an extension in contrib -- maybe we can still get
it in. This is the first time I've added any version of this to a
commitfest, although I've posted a couple of rough versions of this in
the past couple of years. The attached version has received a major
overhaul, and is primarily aimed at finding corruption in production
systems, although I think it will still have significant value for
debugging too. Maybe it can help with some of the B-Tree patches in
the final commitfest, for example. I also have some hope that it will
become a learning tool for people interested in how B-Tree indexes
work.

To recap, the extension adds some SQL-callable functions that verify
certain invariant conditions hold within some particular B-Tree index.
These are the conditions that index scans rely on always being true.
The tool's scope may eventually cover other AMs, including heapam, but
nbtree seems like the best place to start.

Note that no function currently checks that the index is consistent
with the heap, which would be very useful (that's probably how I'd
eventually target the heapam, actually).

Invariants
========

nbtree invariants that the tool verifies with just an AccessShareLock
on the relation are:

* That all items are in the correct, opclass order on each page.

* That the page "high key", if any, actually bounds the items on the page.

* That the last item on a page is less than or equal to the first item
on the next page (the page to its right). The idea here is that the
key space spans multiple pages, not just one page, so it make sense to
check the last item where we can.

With an ExclusiveLock + ShareLock, some addition invariants are verified:

* That child pages actually have their parent's downlink as a lower bound.

* Sane right links and left links at each level.

Obviously, this tool is all about distrusting the structure of a
B-Tree index. That being the case, it's not always totally clear where
to draw the line. I think I have the balance about right, though.

Interface
=======

There are only 2 SQL callable functions in the extension, which are
very similar:

bt_index_check(index regclass)

bt_index_parent_check(index regclass)

The latter is more thorough than the former -- it performs all checks,
including those checks that I mentioned required an ExclusiveLock. So,
bt_index_check() only requires an AccessShareLock.
bt_index_parent_check() requires an ExclusiveLock on the index
relation, and a ShareLock on its heap relation, almost like REINDEX.
bt_index_parent_check() performs verification that is a superset of
the verification performed by bt_index_check() -- mostly, the extra
verification/work is that it verifies downlinks against child pages.

Both functions raise an error in the event of observing that an
invariant in a B-Tree was violated, such as items being out of order
on a page. I've written extensive documentation, which goes into
practical aspects of using amcheck effectively. It doesn't go into
significant detail about every invariant that is checked, but gives a
good idea of roughly what checks are performed.

I could almost justify only having one function with an argument about
the downlink/child verification, but that would be a significant
footgun given the varying locking requirements that such a function
would have.

Locking
======

We never rely on something like holding on to a buffer pin as an
interlock for correctness (the vacuum interlock thing isn't generally
necessary, because we don't look at the heap at all). We simply pin +
BT_READ lock a buffer, copy it into local memory allocated by
palloc(), and then immediately release the buffer lock and drop the
pin. This is the same in all instances. There is never more than one
buffer lock or pin held at a time.

We do, on the other hand, have a detailed rationale for why it's okay
that we don't have an ExclusiveLock on the index relation for checks
that span the key space of more than one page by following right links
to compare items across sibling pages. This isn't the same thing as
having an explicit interlock like a pin -- our interlock is one
against *recycling* by vacuum, which is based on recentGlobalXmin.
This rationale requires expert review.

Performance
==========

Trying to keep the tool as simple as possible, while still making it
do verification that is as useful as possible was my priority here,
not performance. Still, verification completes fairly quickly.
Certainly, it takes far less time than having to REINDEX the index,
and doesn't need too much memory. I think that in practice most
problems that can be detected by the B-Tree checker functions will be
detected with the lighter variant.

--
Peter Geoghegan

Attachment

Re: amcheck (B-Tree integrity checking tool)

From
Tomas Vondra
Date:
Hi,

I've looked at this patch today, mostly to educate myself, so this
probably should not count as a full review. Anyway, the patch seems in
excellent shape - it'd be great if all patches (including those written
by me) had this level of comments/docs.

On Mon, 2016-02-29 at 16:09 -0800, Peter Geoghegan wrote:
> I was assigned an "action point" during the FOSDEM developer meeting:
> "Post new version of btree consistency checker patch". I attach a new
> WIP version of my consistency checker tool, amcheck.

...

> To recap, the extension adds some SQL-callable functions that verify
> certain invariant conditions hold within some particular B-Tree index.
> These are the conditions that index scans rely on always being true.
> The tool's scope may eventually cover other AMs, including heapam, but
> nbtree seems like the best place to start.
> 
> Note that no function currently checks that the index is consistent
> with the heap, which would be very useful (that's probably how I'd
> eventually target the heapam, actually).

I'm afraid I don't understand what "target the heapam" means. Could you
elaborate?


> Invariants
> ========
...
> Obviously, this tool is all about distrusting the structure of a
> B-Tree index. That being the case, it's not always totally clear where
> to draw the line. I think I have the balance about right, though.

I agree. It'd be nice to have a tool that also checks for data
corruption the a lower level (e.g. that varlena headers are not
corrupted etc.) but that seems like a task for some other tool.


> Interface
> =======
> 
> There are only 2 SQL callable functions in the extension, which are
> very similar:
> 
> bt_index_check(index regclass)
> 
> bt_index_parent_check(index regclass)
> 

Can we come up with names that more clearly identify the difference
between those two functions? I mean, _parent_ does not make it
particularly obvious that the second function acquires exclusive lock
and performs more thorough checks.

This also applies to the name of the contrib module - it's not
particularly obvious what "amcheck" unless you're familiar with the
concept of access methods. Which is quite unlikely for regular users.
Maybe something like "check_index" would be more illustrative?

> Locking
> ======
...
> 
> We do, on the other hand, have a detailed rationale for why it's okay
> that we don't have an ExclusiveLock on the index relation for checks
> that span the key space of more than one page by following right links
> to compare items across sibling pages. This isn't the same thing as
> having an explicit interlock like a pin -- our interlock is one
> against *recycling* by vacuum, which is based on recentGlobalXmin.
> This rationale requires expert review.

Well, I wouldn't count myself as an expert here, but do we actually need
such protection? I mean, we only do such checks when holding an
exclusive lock on the parent relation, no? And even if we don't vacuum
can only remove entries from the pages - why should that cause
violations of any invariants?

A few minor comments:

1) This should probably say "one block per level":

/** A B-Tree cannot possibly have this many levels, since there must be * one block per page, and that is bound by the
rangeof BlockNumber:*/
 

2) comment before bt_check_every_level() says:
  ... assumed to prevent any kind of physically modification ...
  probably should be "physical modification" instead.

3) s/targecontext/targetcontext/ in BtreeCheckState comment

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: amcheck (B-Tree integrity checking tool)

From
Anastasia Lubennikova
Date:
01.03.2016 03:09, Peter Geoghegan:
> I was assigned an "action point" during the FOSDEM developer meeting:
> "Post new version of btree consistency checker patch". I attach a new
> WIP version of my consistency checker tool, amcheck. This patch is
> proposed for 9.6, as an extension in contrib -- maybe we can still get
> it in. This is the first time I've added any version of this to a
> commitfest, although I've posted a couple of rough versions of this in
> the past couple of years. The attached version has received a major
> overhaul, and is primarily aimed at finding corruption in production
> systems, although I think it will still have significant value for
> debugging too. Maybe it can help with some of the B-Tree patches in
> the final commitfest, for example. I also have some hope that it will
> become a learning tool for people interested in how B-Tree indexes
> work.
>
> To recap, the extension adds some SQL-callable functions that verify
> certain invariant conditions hold within some particular B-Tree index.
> These are the conditions that index scans rely on always being true.
> The tool's scope may eventually cover other AMs, including heapam, but
> nbtree seems like the best place to start.
>
> Note that no function currently checks that the index is consistent
> with the heap, which would be very useful (that's probably how I'd
> eventually target the heapam, actually).

Hi,
thank you for the contrib module. The patch itself looks great.
It was really useful to have such a tool while working on b-tree patches.

> Invariants
> ========
>
> nbtree invariants that the tool verifies with just an AccessShareLock
> on the relation are:
>
> * That all items are in the correct, opclass order on each page.
>
> * That the page "high key", if any, actually bounds the items on the page.
>
> * That the last item on a page is less than or equal to the first item
> on the next page (the page to its right). The idea here is that the
> key space spans multiple pages, not just one page, so it make sense to
> check the last item where we can.
>
> With an ExclusiveLock + ShareLock, some addition invariants are verified:
>
> * That child pages actually have their parent's downlink as a lower bound.
>
> * Sane right links and left links at each level.
>
> Obviously, this tool is all about distrusting the structure of a
> B-Tree index. That being the case, it's not always totally clear where
> to draw the line. I think I have the balance about right, though.
>
> Interface
> =======
>
> There are only 2 SQL callable functions in the extension, which are
> very similar:
>
> bt_index_check(index regclass)
>
> bt_index_parent_check(index regclass)
>
> The latter is more thorough than the former -- it performs all checks,
> including those checks that I mentioned required an ExclusiveLock. So,
> bt_index_check() only requires an AccessShareLock.
> bt_index_parent_check() requires an ExclusiveLock on the index
> relation, and a ShareLock on its heap relation, almost like REINDEX.
> bt_index_parent_check() performs verification that is a superset of
> the verification performed by bt_index_check() -- mostly, the extra
> verification/work is that it verifies downlinks against child pages.
>
> Both functions raise an error in the event of observing that an
> invariant in a B-Tree was violated, such as items being out of order
> on a page. I've written extensive documentation, which goes into
> practical aspects of using amcheck effectively. It doesn't go into
> significant detail about every invariant that is checked, but gives a
> good idea of roughly what checks are performed.
>
> I could almost justify only having one function with an argument about
> the downlink/child verification, but that would be a significant
> footgun given the varying locking requirements that such a function
> would have.

I've done some testing. And I can say that both announced functions work 
well.
BTW, if you know a good way to corrupt index (and do it reproducible) 
I'd be very glad to see it.

I created an index, then opened the file and changed the order of elements.
And bt_index_check() found the error successfully.

/* Ensure that the data is changed. */
SELECT * FROM bt_page_items('idx', 1);
 itemoffset | ctid  | itemlen | nulls | vars | data
------------+-------+---------+-------+------+-------------------------          1 | (0,3) |      16 | f     | f    |
0300 00 00 00 00 00 00          2 | (0,2) |      16 | f     | f    | 02 00 00 00 00 00 00 00          3 | (0,1) |
16| f     | f    | 01 00 00 00 00 00 00 00
 
(3 rows)


/* Great, amcheck found an error. */
select bt_index_check('idx');
ERROR:  page order invariant violated for index "idx"
DETAIL:  Lower index tid=(1,1) (points to heap tid=(0,3)) higher index 
tid=(1,2) (points to heap tid=(0,2)) page lsn=0/17800D0.

> Locking
> ======
>
> We never rely on something like holding on to a buffer pin as an
> interlock for correctness (the vacuum interlock thing isn't generally
> necessary, because we don't look at the heap at all). We simply pin +
> BT_READ lock a buffer, copy it into local memory allocated by
> palloc(), and then immediately release the buffer lock and drop the
> pin. This is the same in all instances. There is never more than one
> buffer lock or pin held at a time.
>
> We do, on the other hand, have a detailed rationale for why it's okay
> that we don't have an ExclusiveLock on the index relation for checks
> that span the key space of more than one page by following right links
> to compare items across sibling pages. This isn't the same thing as
> having an explicit interlock like a pin -- our interlock is one
> against *recycling* by vacuum, which is based on recentGlobalXmin.
> This rationale requires expert review.

I'm not an expert, but I promise to take a closer look at locking.
I will send another email soon.
> Performance
> ==========
>
> Trying to keep the tool as simple as possible, while still making it
> do verification that is as useful as possible was my priority here,
> not performance. Still, verification completes fairly quickly.
> Certainly, it takes far less time than having to REINDEX the index,
> and doesn't need too much memory. I think that in practice most
> problems that can be detected by the B-Tree checker functions will be
> detected with the lighter variant.

I do not see any performance issues.
I'm sure that if someone wants to check whether the index is corrupted, 
he certainly can wait a minute (.

But I have some concerns about compatibility with my patches.
I've tried to call bt_index_check() over my "including" patch [1] and 
caught a segfault.

LOG:  server process (PID 31794) was terminated by signal 11: 
Segmentation fault
DETAIL:  Failed process was running: select bt_index_check('idx');

I do hope that my patch will be accepted in 9.6, so this conflict looks 
really bad.
I think that error is caused by changes in pages layout. To save some 
space nonkey attributes are truncated
when btree copies the indexed value into inner pages or into high key. 
You can look at index_reform_tuple() calls.

I wonder, whether you can add some check of catalog version to your 
module or this case requires more changes?

With second patch which implements compression [2], amcheck causes 
another error.
postgres=# insert into tbl select 1 from generate_series(0,5);
INSERT 0 6
postgres=# SELECT * FROM bt_page_items('idx', 1); itemoffset |      ctid      | itemlen | nulls | vars | data


------------+----------------+---------+-------+------+----------------------------------------------------------------------------------------
---------------------------------------------------------          1 | (2147483664,6) |      56 | f     | f    | 01 00
0000 00 
 
00 00 00 00 00 00 00 01 00 00 00 00 00 02 00 00 00 00 00 03 00 00 00 00
00 04 00 00 00 00 00 05 00 00 00 00 00 06 00 00 00 00 00

postgres=# select bt_index_check('idx');
ERROR:  cannot check index "idx"
DETAIL:  index is not yet ready for insertions

But I'm sure that it's a problem of my patch. So I'll fix it and try again.

[1] https://commitfest.postgresql.org/9/433/
[2] https://commitfest.postgresql.org/9/494/

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: amcheck (B-Tree integrity checking tool)

From
Michael Paquier
Date:
On Fri, Mar 11, 2016 at 5:19 PM, Anastasia Lubennikova wrote:
> BTW, if you know a good way to corrupt index (and do it reproducible) I'd be
> very glad to see it.

You can use for example dd in non-truncate mode to corrupt on-disk
page data, say that for example:
dd if=/dev/random bs=8192 count=1 \   seek=$BLOCK_ID of=base/$DBOID/$RELFILENODE \   conv=notrunc
-- 
Michael



Re: amcheck (B-Tree integrity checking tool)

From
Tomas Vondra
Date:
Hi,

On Fri, 2016-03-11 at 17:24 +0100, Michael Paquier wrote:
> On Fri, Mar 11, 2016 at 5:19 PM, Anastasia Lubennikova wrote:
> > 
> > BTW, if you know a good way to corrupt index (and do it
> > reproducible) I'd be
> > very glad to see it.
> You can use for example dd in non-truncate mode to corrupt on-disk
> page data, say that for example:
> dd if=/dev/random bs=8192 count=1 \
>     seek=$BLOCK_ID of=base/$DBOID/$RELFILENODE \
>     conv=notrunc

I guess /dev/random is not very compatible with the "reproducible"
requirement. I mean, it will reproducibly break the page, but pretty
much completely, which is mostly what checksums are for.

I think the main goal of the amcheck module is to identify issues that
are much more subtle - perhaps a coding error or unhandled cornercase
which results in page with a valid structure and checksum, but broken
page. Or perhaps issues coming from outside of PostgreSQL, like a
subtle change in the locales.

Simulating those issues requires a tool that allows minor tweaks to the
page, dd is a bit too dull for that.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Fri, Mar 11, 2016 at 8:24 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> You can use for example dd in non-truncate mode to corrupt on-disk
> page data, say that for example:
> dd if=/dev/random bs=8192 count=1 \
>     seek=$BLOCK_ID of=base/$DBOID/$RELFILENODE \
>     conv=notrunc

Sure, but that would probably fail at the first hurdle -- the page
header would be corrupt. Which is a valid test, but not all that
interesting.

One testing workflow I tried is overwriting some page in a B-Tree
relfilenode with some other page in the same file:

$:~/pgdata/base/12413$ dd if=somefile of=somefile conv=notrunc bs=8192
count=1 skip=2 seek=3

That should fail due to the key space not being in order across pages,
which is slightly interesting. Or, you could selectively change one
item with a hex editor, as Anastasia did.

Or, you could add code like this to comparetup_index_btree(), to
simulate a broken opclass:

diff --git a/src/backend/utils/sort/tuplesort.c
b/src/backend/utils/sort/tuplesort.c
index 67d86ed..23712ff 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -3562,6 +3562,9 @@ comparetup_index_btree(const SortTuple *a, const
SortTuple *b,       compare = ApplySortComparator(a->datum1, a->isnull1,

b->datum1, b->isnull1,                                                                 sortKey);
+
+       if (random() <= (MAX_RANDOM_VALUE / 1000))
+               compare = -compare;       if (compare != 0)               return compare;

There are many options when you want to produce a corrupt B-Tree index!

-- 
Peter Geoghegan



Re: amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Fri, Mar 11, 2016 at 1:09 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Or, you could add code like this to comparetup_index_btree(), to
> simulate a broken opclass:
>
> diff --git a/src/backend/utils/sort/tuplesort.c
> b/src/backend/utils/sort/tuplesort.c
> index 67d86ed..23712ff 100644
> --- a/src/backend/utils/sort/tuplesort.c
> +++ b/src/backend/utils/sort/tuplesort.c
> @@ -3562,6 +3562,9 @@ comparetup_index_btree(const SortTuple *a, const
> SortTuple *b,
>         compare = ApplySortComparator(a->datum1, a->isnull1,
>
> b->datum1, b->isnull1,
>                                                                   sortKey);
> +
> +       if (random() <= (MAX_RANDOM_VALUE / 1000))
> +               compare = -compare;
>         if (compare != 0)
>                 return compare;


Note that this patch that I sketched would make CREATE INDEX produce
corrupt indexes, but the tool's verification itself would not be
affected. Although even if it was (even if _bt_compare() gave the same
wrong answers), it would still very likely detect corruption.

-- 
Peter Geoghegan



Re: amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Thu, Mar 10, 2016 at 9:18 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I've looked at this patch today, mostly to educate myself, so this
> probably should not count as a full review. Anyway, the patch seems in
> excellent shape - it'd be great if all patches (including those written
> by me) had this level of comments/docs.

Thanks. I try. This patch is something I've been working on slowly and
steadily for over 2 years. It was time to see it through.

>> Note that no function currently checks that the index is consistent
>> with the heap, which would be very useful (that's probably how I'd
>> eventually target the heapam, actually).
>
> I'm afraid I don't understand what "target the heapam" means. Could you
> elaborate?

Just that that's how I'd make amcheck verify that the heap was sane,
if I was going to undertake that project. Or, I'd start there.

> I agree. It'd be nice to have a tool that also checks for data
> corruption the a lower level (e.g. that varlena headers are not
> corrupted etc.) but that seems like a task for some other tool.

I'm not sure that that's a task for another tool. I think that this
tool is scoped at detecting corruption, and that could work well for
the heap, too. There might be fewer interesting things to test about
the heap when indexes aren't involved. Following HOT chains through an
index, and verifying things like that about the heap as we go could
detect a lot of problematic cases.

> Can we come up with names that more clearly identify the difference
> between those two functions? I mean, _parent_ does not make it
> particularly obvious that the second function acquires exclusive lock
> and performs more thorough checks.

Dunno about that. It's defining characteristic is that it checks child
pages against their parent IMV. Things are not often defined in terms
of their locking requirements.

> This also applies to the name of the contrib module - it's not
> particularly obvious what "amcheck" unless you're familiar with the
> concept of access methods. Which is quite unlikely for regular users.
> Maybe something like "check_index" would be more illustrative?

I think that given the heap may be targeted in the future, amcheck is
more future-proof. I think Robert might have liked that name a year
ago or something. In general, I'm not too worried about what the
contrib module is called in the end.

> Well, I wouldn't count myself as an expert here, but do we actually need
> such protection? I mean, we only do such checks when holding an
> exclusive lock on the parent relation, no? And even if we don't vacuum
> can only remove entries from the pages - why should that cause
> violations of any invariants?

I think that a more worked out explanation for why the ExclusiveLock
is needed is appropriate. I meant to do that.

Basically, a heavier lock is needed because of page deletion by
VACUUM, which is the major source of complexity (much more than page
splits, I'd say). In general, the key space can be consolidated by
VACUUM in a way that breaks child page checking because the downlink
we followed from our target page is no longer the current downlink.
Page deletion deletes the right sibling's downlink, and the deleted
page's downlink is used for its sibling. You could have a race, where
there was a concurrent page deletion of the left sibling of the child
page, then a concurrent insertion into the newly expanded keyspace of
the parent. Therefore, the downlink in the parent (which is the
"target", to use the patch's terminology) would not be a lower bound
on items in the page.

That's a very low probability race, because it involves deletion a
page representing a range in the keyspace that has no live items, but
then immediately getting one just in time for our check. But, I'm
pretty determined that false positives like that need to be impossible
(or else it's a bug).

I have a paranoid feeling that there is a similar very low probability
race with the left-right keyspace checks, which don't have relation
ExclusiveLock protection (IOW, I think that that might be buggy). I
need to think about that some more, but current thinking is that it
would hardly matter if we used the highkey from right page rather than
the first data item, which definitely would be correct. And it would
be simpler.

> A few minor comments:

Thanks for catching those issues. Will fix.

-- 
Peter Geoghegan



Re: amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Fri, Mar 11, 2016 at 1:31 PM, Peter Geoghegan <pg@heroku.com> wrote:
> You could have a race, where
> there was a concurrent page deletion of the left sibling of the child
> page, then a concurrent insertion into the newly expanded keyspace of
> the parent. Therefore, the downlink in the parent (which is the
> "target", to use the patch's terminology) would not be a lower bound
> on items in the page.

Excuse me: I meant the newly expanded keyspace of the *child*. (The
parent's keyspace would have covered everything. It's naturally far
larger than either child's keyspace, since it typically has several
hundred pages.)


-- 
Peter Geoghegan



Re: amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Fri, Mar 11, 2016 at 8:19 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> I do hope that my patch will be accepted in 9.6, so this conflict looks
> really bad.

I hope so too. I'll have to look into this issue.

> I think that error is caused by changes in pages layout. To save some space
> nonkey attributes are truncated
> when btree copies the indexed value into inner pages or into high key. You
> can look at index_reform_tuple() calls.

That seems like the kind of problem that would be expected when things
like that change. I think it's going to be hard to add new B-Tree
features without a tool like this, which was a big reason to work on
it; to make these new projects possible to test and review. I see many
opportunities to improve the B-Tree code, just as I imagine you do.
These projects are quite strategically important, because B-Trees are
so frequently used. I think that Postgres B-Trees produce excessive
random I/O for checkpoints for various reasons, and this disadvantages
Postgres when it is compared to other major systems. As things get
better in other areas of Postgres, these hard problems become more
important to solve.

> I wonder, whether you can add some check of catalog version to your module
> or this case requires more changes?

I think that it's just going to be tied to the Postgres version. So,
if your B-Tree patches are committed first, it's on me to make sure
they're handled correctly. Or vice-versa. Not worried that that will
be a problem.

We already take special steps to avoid the "minus infinity" item on
internal pages. I think that in the future, if Postgres B-Trees get
suffix truncation for internal page items, there are new problems for
amcheck (suffix truncation remove unneeded information from internal
page items, sometimes greatly increasing B-Tree fan-out. Internal page
items need only be sufficient to guide index scans correctly.).

Specially, with suffix truncation there might be "minus infinity"
*attributes*, too (these could make it safe to completely remove
attributes/columns past the first distinguishing/distinct attribute on
each item on internal pages). That's a case that amcheck then needs to
care about, just like it currently cares about the existing concept of
minus infinity items. That's how it goes for amcheck.

-- 
Peter Geoghegan



Re: amcheck (B-Tree integrity checking tool)

From
Jim Nasby
Date:
On 3/11/16 3:31 PM, Peter Geoghegan wrote:
>> Can we come up with names that more clearly identify the difference
>> >between those two functions? I mean,_parent_  does not make it
>> >particularly obvious that the second function acquires exclusive lock
>> >and performs more thorough checks.
> Dunno about that. It's defining characteristic is that it checks child
> pages against their parent IMV. Things are not often defined in terms
> of their locking requirements.

First, thanks for your work on this. I've wanted it in the past.

I agree the name isn't very clear. Perhaps _recurse?

I also agree that the nmodule name isn't very clear. If this is meant to 
be the start of a generic consistency checker, lets call it that. 
Otherwise, it should be marked as being specific to btrees, because 
presumably we might eventually want similar tools for GIN, etc. (FWIW 
I'd vote for a general consistency checker).

I know the vacuum race condition would be very rare, but I don't think 
it can be ignored. Last thing you want out of a consistency checker is 
false negatives/positives. I do think it would be reasonable to just 
wholesale block against concurrent vacuums, but I don't think there's 
any reasonable way to do that.

I would prefer the ability to do something other than raising an error 
when corruption is found, so that you could find all corruption in an 
index. Obviously could log to a different level. Another option would be 
SRFs that return info about all the corruption found, but that's 
probably overkill.

It'd be nice if you had the option to obey vacuum_cost_delay when 
running this, but that's clearly just a nice-to-have (or maybe just obey 
it all the time, since it defaults to 0).
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Fri, Mar 11, 2016 at 3:50 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> I also agree that the nmodule name isn't very clear. If this is meant to be
> the start of a generic consistency checker, lets call it that. Otherwise, it
> should be marked as being specific to btrees, because presumably we might
> eventually want similar tools for GIN, etc. (FWIW I'd vote for a general
> consistency checker).

It's a generic consistency checker -- that's the intent. Robert wanted
to go that way about a year ago, and I think it makes sense (this
module started out life as btreecheck). As I said, I don't really care
what the name ends up being. amcheck seems fine to me, but I'll go
with any reasonable suggestion that reflects the scope of the tool as
a generic consistency checker for AMs, including heapam.

I don't know that I understand the concern about the naming of
bt_index_parent_check(). I called it something that accurately
reflects what it does. That particular SQL-callable function is more
orientated towards helping hackers than towards helping users detect
routine problems, as the docs say, but it could help users working
with hackers, and it might even help users acting on their own. I
don't want to make it sound scary, because it isn't. I don't know what
problems will be detected by this tool most often, and I don't think
it would be wise to try to predict how that will work out. And so,
concealing the internal workings of each function/check by giving them
all totally non-technical names seems like a bad idea. It's not that
hard to understand that a B-Tree has multiple levels, and checking the
levels against each other requires more restrictive/stronger
heavyweight locks. I've only added 2 SQL-callable functions, each of
which has one argument, and the first of which has a generic name and
very light locking requirements, like a SELECT. It's not that hard to
get the general idea, as an ordinary user.

> I know the vacuum race condition would be very rare, but I don't think it
> can be ignored. Last thing you want out of a consistency checker is false
> negatives/positives.

That's what I said. And the docs also say that there should never be a
false positive. That's definitely an important design goal here,
because it enables routine testing.

> I do think it would be reasonable to just wholesale
> block against concurrent vacuums, but I don't think there's any reasonable
> way to do that.

Actually, that's exactly what bt_index_parent_check() does already.
VACUUM requires a SHARE UPDATE EXCLUSIVE lock on the heap relation,
which cannot be concurrently acquired due to bt_index_parent_check()'s
acquisition of a SHARE lock. The locking for bt_index_parent_check()
is almost the same as REINDEX, except that that acquires an ACCESS
EXCLUSIVE lock on the index relation; bt_index_parent_check() only
requires an EXCLUSIVE lock on the index relation.

Not sure about the cost delay thing. Delays are disabled by default
for manually issued VACUUM, so have doubts that that's useful.

If you want the tool to limp on when it finds an error, that can be
done by changing the constant for the CORRUPTION macro in amcheck.c.
But having that be dynamically configurable is not really compatible
with the goal of having amcheck be run routinely.

-- 
Peter Geoghegan



Re: amcheck (B-Tree integrity checking tool)

From
Jim Nasby
Date:
On 3/11/16 6:17 PM, Peter Geoghegan wrote:
> Not sure about the cost delay thing. Delays are disabled by default
> for manually issued VACUUM, so have doubts that that's useful.

Right, but you still have the option to enable them if you don't want to 
swamp your IO system. That's why CIC obeys it too. If I was running a 
consistency check on a production system I'd certainly want the option 
to throttle it. Without that option, I don't see running this on 
production systems as being an option. If that's not a goal then fine, 
but if it is a goal I think it needs to be there.

Isn't it just a few extra lines of code to support it?
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Fri, Mar 11, 2016 at 4:17 PM, Peter Geoghegan <pg@heroku.com> wrote:
> If you want the tool to limp on when it finds an error, that can be
> done by changing the constant for the CORRUPTION macro in amcheck.c.
> But having that be dynamically configurable is not really compatible
> with the goal of having amcheck be run routinely.

Also, it's just really hard to reason about what remains OK to check
after the first problem is encountered in the general case. It's
"unreasonable" for any of the checks to ever fail. So, by that
standard, assuming that they might fail at all could be called
paranoid. Who can say what "paranoid enough" should be? I think it's
useful to have a low-impact, generic check function for B-Tree
indexes, but I don't think we need to hold back on being exhaustive
elsewhere.

-- 
Peter Geoghegan



Re: amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Fri, Mar 11, 2016 at 4:30 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> Right, but you still have the option to enable them if you don't want to
> swamp your IO system. That's why CIC obeys it too. If I was running a
> consistency check on a production system I'd certainly want the option to
> throttle it. Without that option, I don't see running this on production
> systems as being an option. If that's not a goal then fine, but if it is a
> goal I think it needs to be there.
>
> Isn't it just a few extra lines of code to support it?

I see your point.

I'll add that if people like the interface you propose. (Overloading
the VACUUM cost delay functions to cause a delay for amcheck
functions, too). Note that the functions already use an appropriate
buffer access strategy (it avoids blowing shared_buffers, much like
VACUUM itself).

-- 
Peter Geoghegan



Re: amcheck (B-Tree integrity checking tool)

From
Tomas Vondra
Date:
On Fri, 2016-03-11 at 16:40 -0800, Peter Geoghegan wrote:
> On Fri, Mar 11, 2016 at 4:17 PM, Peter Geoghegan <pg@heroku.com>
> wrote:
> > 
> > If you want the tool to limp on when it finds an error, that can be
> > done by changing the constant for the CORRUPTION macro in
> > amcheck.c.
> > But having that be dynamically configurable is not really
> > compatible
> > with the goal of having amcheck be run routinely.
> Also, it's just really hard to reason about what remains OK to check
> after the first problem is encountered in the general case. It's
> "unreasonable" for any of the checks to ever fail. So, by that
> standard, assuming that they might fail at all could be called
> paranoid. Who can say what "paranoid enough" should be? I think it's
> useful to have a low-impact, generic check function for B-Tree
> indexes, but I don't think we need to hold back on being exhaustive
> elsewhere.

Right, but isn't there a difference between the two functions in this
respect? Once you find corruption involving relationship between
multiple pages, then I agree it's complicated to do any reasoning about
what additional checks are safe.

But does that problem also apply to bt_index_check, which checks pages
independently?

Admittedly, this also depends on the use case. If all we need to do is
answering a question "Is the index corrupted?" then sure, bailing out
on the first error is perfectly appropriate.

But perhaps this would be useful for some recovery/forensics tasks?

From time to time we need to investigate corruption in a database, i.e.
see how much of the data is actually corrupted, list pages that need to
be zeroed to get the cluster up to salvage as much as possible, etc.
Currently this is tedious because we essentially find/fix the pages one
by one. It'd be very useful to list all broken pages in one go and then
fix all of them.

Obviously, that's about heapam checks, but perhaps it would be useful
for an index too?

regard

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Fri, Mar 11, 2016 at 6:33 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Right, but isn't there a difference between the two functions in this
> respect? Once you find corruption involving relationship between
> multiple pages, then I agree it's complicated to do any reasoning about
> what additional checks are safe.
>
> But does that problem also apply to bt_index_check, which checks pages
> independently?

I think so, yes.

> Admittedly, this also depends on the use case. If all we need to do is
> answering a question "Is the index corrupted?" then sure, bailing out
> on the first error is perfectly appropriate.
>
> But perhaps this would be useful for some recovery/forensics tasks?

Maybe, but I feel like making it possible to change the CORRUPTION
elevel macro was the right trade-off. I don't want to have to reason
about the universe of possible problems that could occur when the tool
must limp on in the event of corruption. For example, I don't want to
have to deal with infinite loops. In practice, an expert would
probably be fine to change the constant themselves if they needed to.

Indexes can always be rebuilt. The tool is for identifying and
diagnosing corruption, but if you want to diagnose a faulty opclass or
something, then I think you need to get out pageinspect. You need
human judgement for that.

> From time to time we need to investigate corruption in a database, i.e.
> see how much of the data is actually corrupted, list pages that need to
> be zeroed to get the cluster up to salvage as much as possible, etc.
> Currently this is tedious because we essentially find/fix the pages one
> by one. It'd be very useful to list all broken pages in one go and then
> fix all of them.
>
> Obviously, that's about heapam checks, but perhaps it would be useful
> for an index too?

Only insofar as it helps diagnose the underlying issue, when it is a
more subtle issue. Actually fixing the index is almost certainly a
REINDEX. Once you're into the messy business of diagnosing a
problematic opclass, you have to be an expert, and tweaking amcheck
for your requirements (i.e. rebuilding from source) becomes
reasonable. Part of the reason that the code is so heavily commented
is to make it hackable, because I do not feel optimistic that I can
get an expert-orientated interface right, but I still want to make the
tool as useful as possible to experts.

-- 
Peter Geoghegan



Re: amcheck (B-Tree integrity checking tool)

From
Matt Kelly
Date:
On Fri, 2016-03-11 at 17:24 +0100, Michael Paquier wrote:
> On Fri, Mar 11, 2016 at 5:19 PM, Anastasia Lubennikova wrote:
> >
> > BTW, if you know a good way to corrupt index (and do it
> > reproducible) I'd be
> > very glad to see it.
> You can use for example dd in non-truncate mode to corrupt on-disk
> page data, say that for example:
> dd if=/dev/random bs=8192 count=1 \
>     seek=$BLOCK_ID of=base/$DBOID/$RELFILENODE \
>     conv=notrunc
I guess /dev/random is not very compatible with the "reproducible"
requirement. I mean, it will reproducibly break the page, but pretty
much completely, which is mostly what checksums are for.

You can actually pretty easily produce a test case by setting up streaming replication between servers running two different version of glibc.

I actually wrote a tool that spins up a pair of VMs using vagrant and then sets them up as streaming replica's using ansible.  It provides a nice one liner to get a streaming replica test environment going and it will easily provide the cross glibc test case.  Technically, though it belongs to Trip because I wrote it on company time.  Let me see if I can open source a version of it later this week that way you can use it for testing.

- Matt K.

Re: amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Sat, Mar 12, 2016 at 7:46 PM, Matt Kelly <mkellycs@gmail.com> wrote:
> You can actually pretty easily produce a test case by setting up streaming
> replication between servers running two different version of glibc.
>
> I actually wrote a tool that spins up a pair of VMs using vagrant and then
> sets them up as streaming replica's using ansible.  It provides a nice one
> liner to get a streaming replica test environment going and it will easily
> provide the cross glibc test case.  Technically, though it belongs to Trip
> because I wrote it on company time.  Let me see if I can open source a
> version of it later this week that way you can use it for testing.

That could be interesting. The earlier prototypes of this tool are
known to have detected glibc collation incompatibilities in real
production systems.

-- 
Peter Geoghegan



Re: amcheck (B-Tree integrity checking tool)

From
Jim Nasby
Date:
On 3/11/16 6:45 PM, Peter Geoghegan wrote:
> I'll add that if people like the interface you propose. (Overloading
> the VACUUM cost delay functions to cause a delay for amcheck
> functions, too).

I thought that had already been overloaded by CIC, but I'm not finding 
reference to it... ANALYZE does use it though, so the ship has already 
sorta sailed.

I'm actually a bit surprised cost delay isn't used anywhere else. As 
more background operations are added I suspect users will want it at 
some point.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: amcheck (B-Tree integrity checking tool)

From
Robert Haas
Date:
On Fri, Mar 11, 2016 at 7:45 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Fri, Mar 11, 2016 at 4:30 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> Right, but you still have the option to enable them if you don't want to
>> swamp your IO system. That's why CIC obeys it too. If I was running a
>> consistency check on a production system I'd certainly want the option to
>> throttle it. Without that option, I don't see running this on production
>> systems as being an option. If that's not a goal then fine, but if it is a
>> goal I think it needs to be there.
>>
>> Isn't it just a few extra lines of code to support it?
>
> I see your point.
>
> I'll add that if people like the interface you propose. (Overloading
> the VACUUM cost delay functions to cause a delay for amcheck
> functions, too). Note that the functions already use an appropriate
> buffer access strategy (it avoids blowing shared_buffers, much like
> VACUUM itself).

I don't particularly like that interface.  I also suggest that it
would be better to leave throttling to a future commit, and focus on
getting the basic feature in first.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Mon, Mar 14, 2016 at 10:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I don't particularly like that interface.  I also suggest that it
> would be better to leave throttling to a future commit, and focus on
> getting the basic feature in first.

Works for me. I don't think throttling is an especially compelling feature.

-- 
Peter Geoghegan



Re: amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Fri, Mar 11, 2016 at 8:19 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> But I have some concerns about compatibility with my patches.
> I've tried to call bt_index_check() over my "including" patch [1] and caught
> a segfault.
>
> LOG:  server process (PID 31794) was terminated by signal 11: Segmentation
> fault
> DETAIL:  Failed process was running: select bt_index_check('idx');
>
> I do hope that my patch will be accepted in 9.6, so this conflict looks
> really bad.
> I think that error is caused by changes in pages layout. To save some space
> nonkey attributes are truncated

> [1] https://commitfest.postgresql.org/9/433/

I posted a review of your "Covering + unique indexes" patch, where I
made an educated guess about what the problem is here (I sort of
hinted at what I thought it was already, in this thread, actually). I
haven't actually tested this theory of mine myself just yet, but let
me know what you think of it on the thread for your patch.

-- 
Peter Geoghegan



Re: amcheck (B-Tree integrity checking tool)

From
Amit Langote
Date:
On 2016/03/12 6:31, Peter Geoghegan wrote:
> On Thu, Mar 10, 2016 at 9:18 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> I've looked at this patch today, mostly to educate myself, so this
>> probably should not count as a full review. Anyway, the patch seems in
>> excellent shape - it'd be great if all patches (including those written
>> by me) had this level of comments/docs.

+1

>> Can we come up with names that more clearly identify the difference
>> between those two functions? I mean, _parent_ does not make it
>> particularly obvious that the second function acquires exclusive lock
>> and performs more thorough checks.
> 
> Dunno about that. It's defining characteristic is that it checks child
> pages against their parent IMV. Things are not often defined in terms
> of their locking requirements.

At the risk of sounding a bit verbose, do bt_check_level() for a check
that inspects a level at a time and bt_check_multi_level() for a check
that spans levels sound descriptive?

Thanks,
Amit





Re: amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Mon, Mar 14, 2016 at 11:48 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Dunno about that. It's defining characteristic is that it checks child
>> pages against their parent IMV. Things are not often defined in terms
>> of their locking requirements.
>
> At the risk of sounding a bit verbose, do bt_check_level() for a check
> that inspects a level at a time and bt_check_multi_level() for a check
> that spans levels sound descriptive?

Hmm. But all functions verify multiple levels. What distinguishes
bt_index_parent_check()'s verification is that the downlinks in
internal pages are checked against actual child pages (every item in
the child page, in fact). It's the parent/child relationship that is
verified in addition to the standard checks of every page on and
across (not between) every level.

-- 
Peter Geoghegan



Re: amcheck (B-Tree integrity checking tool)

From
Amit Langote
Date:
Hi Peter,

On 2016/03/15 16:11, Peter Geoghegan wrote:
> On Mon, Mar 14, 2016 at 11:48 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> Dunno about that. It's defining characteristic is that it checks child
>>> pages against their parent IMV. Things are not often defined in terms
>>> of their locking requirements.
>>
>> At the risk of sounding a bit verbose, do bt_check_level() for a check
>> that inspects a level at a time and bt_check_multi_level() for a check
>> that spans levels sound descriptive?
> 
> Hmm. But all functions verify multiple levels. What distinguishes
> bt_index_parent_check()'s verification is that the downlinks in
> internal pages are checked against actual child pages (every item in
> the child page, in fact). It's the parent/child relationship that is
> verified in addition to the standard checks of every page on and
> across (not between) every level.

Ah, I see the nuance.  Thanks for the explanation.  Maybe,
bt_index_check() and bt_index_parent_child_check() /
bt_index_check_parent_child().  IMHO, the latter more clearly highlights
the fact that parent/child relationships in the form of down-links are
checked.

By the way, one request (as a non-native speaker of English language, who
ends up looking up quite a few words regularly) -

Could we use "conform" or "correspond" instead of "comport" in the
following error message:

"left link/right link pair in index \"%s\" don't comport"

Thanks,
Amit





Re: amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Tue, Mar 15, 2016 at 12:31 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Ah, I see the nuance.  Thanks for the explanation.  Maybe,
> bt_index_check() and bt_index_parent_child_check() /
> bt_index_check_parent_child().  IMHO, the latter more clearly highlights
> the fact that parent/child relationships in the form of down-links are
> checked.

Well, the downlink is in the parent, because there is no such thing as
an "uplink". So I prefer bt_index_parent_check(), since it usefully
hints at starting from the parent. It's also more concise.

> By the way, one request (as a non-native speaker of English language, who
> ends up looking up quite a few words regularly) -
>
> Could we use "conform" or "correspond" instead of "comport" in the
> following error message:
>
> "left link/right link pair in index \"%s\" don't comport"

OK. I'll do something about that.

-- 
Peter Geoghegan



Re: amcheck (B-Tree integrity checking tool)

From
Amit Langote
Date:
On 2016/03/15 16:42, Peter Geoghegan wrote:
> On Tue, Mar 15, 2016 at 12:31 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> By the way, one request (as a non-native speaker of English language, who
>> ends up looking up quite a few words regularly) -
>>
>> Could we use "conform" or "correspond" instead of "comport" in the
>> following error message:
>>
>> "left link/right link pair in index \"%s\" don't comport"
> 
> OK. I'll do something about that.

Thanks a lot.

- Amit





Re: amcheck (B-Tree integrity checking tool)

From
David Steele
Date:
On 3/15/16 3:42 AM, Peter Geoghegan wrote:
> On Tue, Mar 15, 2016 at 12:31 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Ah, I see the nuance.  Thanks for the explanation.  Maybe,
>> bt_index_check() and bt_index_parent_child_check() /
>> bt_index_check_parent_child().  IMHO, the latter more clearly highlights
>> the fact that parent/child relationships in the form of down-links are
>> checked.
> 
> Well, the downlink is in the parent, because there is no such thing as
> an "uplink". So I prefer bt_index_parent_check(), since it usefully
> hints at starting from the parent. It's also more concise.
> 
>> By the way, one request (as a non-native speaker of English language, who
>> ends up looking up quite a few words regularly) -
>>
>> Could we use "conform" or "correspond" instead of "comport" in the
>> following error message:
>>
>> "left link/right link pair in index \"%s\" don't comport"
> 
> OK. I'll do something about that.

It looks like an updated patch is expected here, though it seems that
the only requests are for updates to comments.

-- 
-David
david@pgmasters.net



Re: amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Tue, Mar 22, 2016 at 9:33 AM, David Steele <david@pgmasters.net> wrote:
> It looks like an updated patch is expected here, though it seems that
> the only requests are for updates to comments.

That's right - I have a small number of feedback items to work
through. I also determined myself that there could be a very low
probability race condition when checking the key space across sibling
pages, and will work to address that. If I'm right about that then
it's not a lot of work to fix; I'm probably just going to use the
right page's high key rather than its first data item.

-- 
Peter Geoghegan



Re: amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Tue, Mar 22, 2016 at 10:57 AM, Peter Geoghegan <pg@heroku.com> wrote:
> That's right - I have a small number of feedback items to work
> through. I also determined myself that there could be a very low
> probability race condition when checking the key space across sibling
> pages, and will work to address that. If I'm right about that then
> it's not a lot of work to fix; I'm probably just going to use the
> right page's high key rather than its first data item.

I also want to use amcheck to test sorting, especially external
sorting which is currently totally untested. It would be nice to see
testing of strxfrm() on the buildfarm, too -- amcheck provides a nice
way to make sure strxfrm() and strcoll() are in agreement for at least
those cases that are tested, without having to worry about portability
in the same way as a simple pg_regress approach would require.

Note that there are reportedly systems in mainstream use where
strxfrm() is broken; it doesn't agree with strcoll().

-- 
Peter Geoghegan



Re: amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Tue, Mar 22, 2016 at 10:57 AM, Peter Geoghegan <pg@heroku.com> wrote:
> That's right - I have a small number of feedback items to work
> through. I also determined myself that there could be a very low
> probability race condition when checking the key space across sibling
> pages, and will work to address that.

I attach a V2 revision.

Behavioral changes:

* Fixes very low probability race condition when verifying ordering
across sibling pages with bt_index_check()/AccessShareLock.

Notably, the fix does not involve using the high key from the next
page instead of the first item, despite my recent suggestion that it
would; I changed my mind. In this revision,
bt_right_page_check_scankey() continues to return the first item on
the next/right page (as a scankey), just as in the first version. The
locking considerations here are more complicated than before, and
hopefully correct. I abandoned the high key idea when I realized that
a concurrent page split could happen in the right sibling, making the
high key generally no more safe in the face of concurrent target page
deletion than continuing to use the first data item. Using the first
data item from the right page is better than using the high key item
from the right page verification-wise anyway, so that's good. I found
a new way of locking down and recovering from any race conditions (the
low probability bug in V1 for bt_index_check()/AccessShareLock calls
to the C function bt_right_page_check_scankey()). I won't go into the
gory details here, but I will mention that my new approach is inspired
by how backwards scans work already.

* Adds DEBUG1 traces that indicate the level and page currently being checked.

Advanced users will probably find these traces useful on occasion.
It's nice to be able to see how amcheck walks the B-Tree in practice.

Non-behavioral changes:

* Explains why an ExclusiveLock is needed on the target index by
bt_index_parent_check() (and a ShareLock on its heap relation).

This new explanation talks about why an ExclusiveLock *is* necessary
when checking downlinks against child pages, per Tomas Vondra's
request. (Note that nothing changes about child/downlink verification
itself). Before now, I focused on why they were not necessary, but
Tomas was right to point out that that isn't enough information.

* No longer uses the word "comport" in an error message, per Amit
Langote. Similarly, minor typos fixed, per Tomas.

* Adds a variety of pg_regress-based tests for external sorting. (Via
various CREATE INDEX statements involving a variety of datatypes. Data
is generated pseudo-randomly.)

Notably, the regression tests *never* test external sorting today.

ISTM that amcheck is a convenient, reliable, and portable way of
testing the sorting of collatable types like text, so this is the best
way of testing external sorting. External sorting uses abbreviated
keys for runs, but discards them as runs are written out, and so the
merging of runs does not use abbreviated keys. Abbreviated comparisons
and authoritative comparisons are therefore reliably used within the
same sort. One particular goal here is that something like the recent
strxfrm() debacle might be caught by this testing on some platforms
(if we ever get back to trusting strxfrm() at all, that is). The style
of these new tests is unorthodox, because a variety of collations are
tested, which varies from system to system; we should discuss all
this. The tests can take a minute or two to run on my laptop, but that
could vary significantly. And so, whether or not that needs to be
targeted by something other than "check" and "installcheck" is a
discussion that needs to happen. I imagine a lot of buildfarm animals
have slow storage, but on the other hand I don't think hackers run the
contrib regression tests so often. I prefer to not follow the example
of test_decoding if possible; test_decoding's Makefile only runs tests
when wal_level=logical is expected (that's how the relevant Makefile
targets are scoped), but it's a complicated special case.

Review
------

As already noted after the first bullet point, there are some tricky
considerations on fixing the race when the user calls
bt_index_check(). Reviewers should pay close attention to this; could
I be wrong about that being race-safe even now? This seems to me to be
by far the best place for reviewers to look for problems in this
patch, as it's the only bt_index_check()/AccessShareLock case that
compares anything *across* pages. The measures taken to prevent false
positives described within bt_target_page_check() require expert
review, especially because this is the one case where a faulty
rationale may result in under-protection from false positives, and not
just harmless over-protection.

The rationale for why we can reliably recover from a race described
within the C function bt_right_page_check_scankey() is *complicated*.
I may have failed to make it as simple as possible despite significant
effort. It's hard to describe these things in an accessible way. Maybe
someone can try and understand what I've said there, and let me know
how I might have fallen short. I have used a new term, "cousin page"
in this explanation (cf. sibling page). This seems like useful
terminology that makes these difficult explanations a bit more
accessible.

--
Peter Geoghegan

Attachment

Re: amcheck (B-Tree integrity checking tool)

From
Anastasia Lubennikova
Date:
29.03.2016 06:13, Peter Geoghegan:
> On Tue, Mar 22, 2016 at 10:57 AM, Peter Geoghegan <pg@heroku.com> wrote:
>> That's right - I have a small number of feedback items to work
>> through. I also determined myself that there could be a very low
>> probability race condition when checking the key space across sibling
>> pages, and will work to address that.
> I attach a V2 revision.
>
> Behavioral changes:
>
> * Fixes very low probability race condition when verifying ordering
> across sibling pages with bt_index_check()/AccessShareLock.
>
> Notably, the fix does not involve using the high key from the next
> page instead of the first item, despite my recent suggestion that it
> would; I changed my mind. In this revision,
> bt_right_page_check_scankey() continues to return the first item on
> the next/right page (as a scankey), just as in the first version. The
> locking considerations here are more complicated than before, and
> hopefully correct. I abandoned the high key idea when I realized that
> a concurrent page split could happen in the right sibling, making the
> high key generally no more safe in the face of concurrent target page
> deletion than continuing to use the first data item. Using the first
> data item from the right page is better than using the high key item
> from the right page verification-wise anyway, so that's good. I found
> a new way of locking down and recovering from any race conditions (the
> low probability bug in V1 for bt_index_check()/AccessShareLock calls
> to the C function bt_right_page_check_scankey()). I won't go into the
> gory details here, but I will mention that my new approach is inspired
> by how backwards scans work already.
>
> * Adds DEBUG1 traces that indicate the level and page currently being checked.
>
> Advanced users will probably find these traces useful on occasion.
> It's nice to be able to see how amcheck walks the B-Tree in practice.
>
> Non-behavioral changes:
>
> * Explains why an ExclusiveLock is needed on the target index by
> bt_index_parent_check() (and a ShareLock on its heap relation).
>
> This new explanation talks about why an ExclusiveLock *is* necessary
> when checking downlinks against child pages, per Tomas Vondra's
> request. (Note that nothing changes about child/downlink verification
> itself). Before now, I focused on why they were not necessary, but
> Tomas was right to point out that that isn't enough information.
>
> * No longer uses the word "comport" in an error message, per Amit
> Langote. Similarly, minor typos fixed, per Tomas.
>
> * Adds a variety of pg_regress-based tests for external sorting. (Via
> various CREATE INDEX statements involving a variety of datatypes. Data
> is generated pseudo-randomly.)
>
> Notably, the regression tests *never* test external sorting today.
>
> ISTM that amcheck is a convenient, reliable, and portable way of
> testing the sorting of collatable types like text, so this is the best
> way of testing external sorting. External sorting uses abbreviated
> keys for runs, but discards them as runs are written out, and so the
> merging of runs does not use abbreviated keys. Abbreviated comparisons
> and authoritative comparisons are therefore reliably used within the
> same sort. One particular goal here is that something like the recent
> strxfrm() debacle might be caught by this testing on some platforms
> (if we ever get back to trusting strxfrm() at all, that is). The style
> of these new tests is unorthodox, because a variety of collations are
> tested, which varies from system to system; we should discuss all
> this. The tests can take a minute or two to run on my laptop, but that
> could vary significantly. And so, whether or not that needs to be
> targeted by something other than "check" and "installcheck" is a
> discussion that needs to happen. I imagine a lot of buildfarm animals
> have slow storage, but on the other hand I don't think hackers run the
> contrib regression tests so often. I prefer to not follow the example
> of test_decoding if possible; test_decoding's Makefile only runs tests
> when wal_level=logical is expected (that's how the relevant Makefile
> targets are scoped), but it's a complicated special case.
>
> Review
> ------
>
> As already noted after the first bullet point, there are some tricky
> considerations on fixing the race when the user calls
> bt_index_check(). Reviewers should pay close attention to this; could
> I be wrong about that being race-safe even now? This seems to me to be
> by far the best place for reviewers to look for problems in this
> patch, as it's the only bt_index_check()/AccessShareLock case that
> compares anything *across* pages. The measures taken to prevent false
> positives described within bt_target_page_check() require expert
> review, especially because this is the one case where a faulty
> rationale may result in under-protection from false positives, and not
> just harmless over-protection.
>
> The rationale for why we can reliably recover from a race described
> within the C function bt_right_page_check_scankey() is *complicated*.
> I may have failed to make it as simple as possible despite significant
> effort. It's hard to describe these things in an accessible way. Maybe
> someone can try and understand what I've said there, and let me know
> how I might have fallen short. I have used a new term, "cousin page"
> in this explanation (cf. sibling page). This seems like useful
> terminology that makes these difficult explanations a bit more
> accessible.

I'm not an expert in btree locking races, but I tested the patch it on
different loads and it works as promised.
I didn't find mistakes in the code logic as well.

As a reviewer, I don't have any objections to mark it "Ready for Committer".
The only notice I'd like to add is about it's compatibility with
covering indexes.
We already discussed it in the thread
https://commitfest.postgresql.org/9/433/
Tiny attached patch fixes this issue.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment

Re: amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Sat, Mar 12, 2016 at 12:38 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Only insofar as it helps diagnose the underlying issue, when it is a
> more subtle issue. Actually fixing the index is almost certainly a
> REINDEX. Once you're into the messy business of diagnosing a
> problematic opclass, you have to be an expert, and tweaking amcheck
> for your requirements (i.e. rebuilding from source) becomes
> reasonable. Part of the reason that the code is so heavily commented
> is to make it hackable, because I do not feel optimistic that I can
> get an expert-orientated interface right, but I still want to make the
> tool as useful as possible to experts.

Heroku began a selective roll-out of amcheck yesterday. amcheck
already found a bug in the PostGiS Geography B-Tree opclass:

https://github.com/postgis/postgis/blob/svn-trunk/postgis/geography_btree.c#L260

The issue is not that PG_RETURN_BOOL() is used here (that's just a
problem of style). Rather, the issue is that the Geography opclass
support function 1 considers an "empty geometry" equal to all other
possible values (values not limited to other empty geometries). This
breaks the transitive law, which nbtree requires and relies on.

I'll go report this to the PostGiS people.

-- 
Peter Geoghegan



Re: amcheck (B-Tree integrity checking tool)

From
Michael Paquier
Date:
On Fri, Aug 19, 2016 at 2:40 AM, Peter Geoghegan <pg@heroku.com> wrote:
> Heroku began a selective roll-out of amcheck yesterday. amcheck
> already found a bug in the PostGiS Geography B-Tree opclass:
> [...]
> I'll go report this to the PostGiS people.

Cool. I have been honestly wondering about deploying this tool as well
to allow some of the QE tests to perform live checks of btree indexes
as we use a bunch of them. By the way, I have not looked at the patch,
but this supports just btree, right? Wouldn't btree_check be a better
name, or do you think that the interface you provide is generic enough
that it could be extended in the future for gin, gist, etc.?
-- 
Michael



Re: amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Thu, Aug 18, 2016 at 5:06 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> Cool. I have been honestly wondering about deploying this tool as well
> to allow some of the QE tests to perform live checks of btree indexes
> as we use a bunch of them.

I'd certainly welcome that. There are Debian packages available from
the Github version of amcheck, which is otherwise practically
identical to the most recent version of the patch posted here:

https://github.com/petergeoghegan/amcheck

(This version also targets Postgres 9.4+).

> By the way, I have not looked at the patch,
> but this supports just btree, right? Wouldn't btree_check be a better
> name, or do you think that the interface you provide is generic enough
> that it could be extended in the future for gin, gist, etc.?

Yes, the idea of calling it amcheck was support for other AMs could be
added later.

Personally, I would like to make amcheck verifying the heap through a
B-Tree index as a next step. There is also a good case for the tool
directly verifying heap relations, without involving any index, but
that can come later.

-- 
Peter Geoghegan



Re: amcheck (B-Tree integrity checking tool)

From
Michael Paquier
Date:
On Fri, Aug 19, 2016 at 9:14 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Aug 18, 2016 at 5:06 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> Cool. I have been honestly wondering about deploying this tool as well
>> to allow some of the QE tests to perform live checks of btree indexes
>> as we use a bunch of them.
>
> I'd certainly welcome that. There are Debian packages available from
> the Github version of amcheck, which is otherwise practically
> identical to the most recent version of the patch posted here:
>
> https://github.com/petergeoghegan/amcheck

This would be packaged from source in my case, but that's no big deal
:) At least I can see that it is added in the next CF, and that's
marked as ready for committer for a couple of months now...
-- 
Michael



Re: amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Thu, Aug 18, 2016 at 5:35 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> This would be packaged from source in my case, but that's no big deal
> :) At least I can see that it is added in the next CF, and that's
> marked as ready for committer for a couple of months now...

If you consider how the code is written, you'll see that it's very
unlikely to destabilize production systems.

Every page is copied into local memory once before being operated on,
and only one buffer lock/pin is held at once (for long enough to do
that copying). If there were bugs in amcheck, it's much more likely
that they'd be "false positive" bugs. In any case, I haven't seen any
issue with the tool itself yet, having now run the tool on thousands
of servers.

I think I'll have a lot more information in about a week, when I've
had time to work through more data.

-- 
Peter Geoghegan



Re: amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Thu, Aug 18, 2016 at 5:14 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I'd certainly welcome that. There are Debian packages available from
> the Github version of amcheck, which is otherwise practically
> identical to the most recent version of the patch posted here:
>
> https://github.com/petergeoghegan/amcheck
>
> (This version also targets Postgres 9.4+).

There were some extremely minor changes made here compared to my V2
patch, which, ISTM, means that a V3 patch is necessary. I'll get to
that next week.


-- 
Peter Geoghegan



Re: amcheck (B-Tree integrity checking tool)

From
Kevin Grittner
Date:
On Fri, Sep 2, 2016 at 11:31 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Aug 18, 2016 at 5:14 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> I'd certainly welcome that. There are Debian packages available from
>> the Github version of amcheck, which is otherwise practically
>> identical to the most recent version of the patch posted here:
>>
>> https://github.com/petergeoghegan/amcheck
>>
>> (This version also targets Postgres 9.4+).
>
> There were some extremely minor changes made here compared to my V2
> patch, which, ISTM, means that a V3 patch is necessary. I'll get to
> that next week.

I changed the CF entry to "Waiting on Author" pending that.  (I was
starting to review the patch for commit, so it's none to early to
mention that the last posted version is not what I should be
looking at.)

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Fri, Sep 2, 2016 at 11:11 AM, Kevin Grittner <kgrittn@gmail.com> wrote:
> I changed the CF entry to "Waiting on Author" pending that.  (I was
> starting to review the patch for commit, so it's none to early to
> mention that the last posted version is not what I should be
> looking at.)

There are only tiny differences, which in any case you can see in the
commit log on Github. There is no reason why code review needs to
block on this V3, IMV.

-- 
Peter Geoghegan



Re: amcheck (B-Tree integrity checking tool)

From
Kevin Grittner
Date:
On Fri, Sep 2, 2016 at 1:16 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Fri, Sep 2, 2016 at 11:11 AM, Kevin Grittner <kgrittn@gmail.com> wrote:
>> I changed the CF entry to "Waiting on Author" pending that.  (I was
>> starting to review the patch for commit, so it's none to early to
>> mention that the last posted version is not what I should be
>> looking at.)
>
> There are only tiny differences, which in any case you can see in the
> commit log on Github. There is no reason why code review needs to
> block on this V3, IMV.

IMV the process is to post a patch to this list to certify that it
is yours to contribute and free of IP encumbrances that would
prevent us from using it.  I will wait for that post.

-- 
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Fri, Sep 2, 2016 at 11:16 AM, Peter Geoghegan <pg@heroku.com> wrote:
> There are only tiny differences, which in any case you can see in the
> commit log on Github. There is no reason why code review needs to
> block on this V3, IMV.

Also, I don't think that we need to include V2's tests now that there
is independent test coverage for external sorts, which is a recent
development. I imagine that the V2 test run duration might be
problematic on some buildfarm animals. I'll omit that in V3.

-- 
Peter Geoghegan



Re: amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Fri, Sep 2, 2016 at 11:19 AM, Kevin Grittner <kgrittn@gmail.com> wrote:
> IMV the process is to post a patch to this list to certify that it
> is yours to contribute and free of IP encumbrances that would
> prevent us from using it.  I will wait for that post.

I attach my V3. There are only very minor code changes here, such as
breaking out a separate elevel constant for DEBUG2 traces that show
information of how the B-Tree is accessed (e.g. current level, block
number). Tiny tweaks to about 3 comments were also performed. These
changes are trivial.

I've also removed the tests added, since external sort regression test
coverage is in a much better place now.

Finally, the documentation was updated, to make it closer to the
Github project's documentation. I expanded what was started as the
original amcheck sgml documentation based on the experience of using
amcheck on production databases at Heroku. This isn't a big change,
but it's the biggest in V3. The documentation now emphasizes the
likelihood of amcheck finding software errors, which is all we found.
Jim Gray predicted that fault tolerant systems will tend to have
almost all problems arise from operator error and software faults in
practice, so perhaps I should have predicted that that would be the
general trend.

--
Peter Geoghegan

Attachment

Re: amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Wed, Sep 7, 2016 at 11:44 AM, Peter Geoghegan <pg@heroku.com> wrote:
> I attach my V3.

I should point out that I'm leaving to go on Vacation this weekend.
I'll be away for a week, and will not be answering mail during that
period. If anyone has any questions on this for me, it might be
preferable to ask them before I leave.


-- 
Peter Geoghegan



Re: amcheck (B-Tree integrity checking tool)

From
Michael Paquier
Date:
On Thu, Sep 8, 2016 at 3:44 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Fri, Sep 2, 2016 at 11:19 AM, Kevin Grittner <kgrittn@gmail.com> wrote:
>> IMV the process is to post a patch to this list to certify that it
>> is yours to contribute and free of IP encumbrances that would
>> prevent us from using it.  I will wait for that post.
>
> I attach my V3. There are only very minor code changes here, such as
> breaking out a separate elevel constant for DEBUG2 traces that show
> information of how the B-Tree is accessed (e.g. current level, block
> number). Tiny tweaks to about 3 comments were also performed. These
> changes are trivial.
>
> I've also removed the tests added, since external sort regression test
> coverage is in a much better place now.

Okay, moved to next CF. I may look at it finally I got some use-cases
for it, similar to yours I guess..
-- 
Michael



Re: amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Sun, Oct 2, 2016 at 6:48 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> Okay, moved to next CF. I may look at it finally I got some use-cases
> for it, similar to yours I guess..

Let me know how that goes.

One thing I've definitely noticed is that the tool is good at finding
corner-case bugs, so even if you can only test a small fraction of the
number of databases that I've been able to test, there could still be
significant value in your performing your own exercise. Your customer
databases might feature far more use of Japanese collations, for
example, which might be an important factor. (Well, the use of Arabic
code points turned out to be an important factor in the question of
whether or not en_US.utf8 could have been affected by the Glibc +
abbreviated keys bug.)

-- 
Peter Geoghegan



Re: amcheck (B-Tree integrity checking tool)

From
Michael Paquier
Date:
On Fri, Oct 7, 2016 at 8:05 AM, Peter Geoghegan <pg@heroku.com> wrote:
> Your customer
> databases might feature far more use of Japanese collations, for
> example, which might be an important factor.

Not really :)
-- 
Michael



Re: amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Mon, Feb 29, 2016 at 4:09 PM, Peter Geoghegan <pg@heroku.com> wrote:
> To recap, the extension adds some SQL-callable functions that verify
> certain invariant conditions hold within some particular B-Tree index.
> These are the conditions that index scans rely on always being true.
> The tool's scope may eventually cover other AMs, including heapam, but
> nbtree seems like the best place to start.

Noah and I discussed possible future directions for amcheck in person
recently. I would like to get Noah's thoughts again here on how a tool
like amcheck might reasonably target other access methods for
verification. In particular, the heapam. MultiXacts were mentioned as
a structure that could receive verification in a future iteration of
this tool, but I lack expertise there.

I've placed a lot of emphasis on the importance of having a
low-overhead verification process, particularly in terms of the
strength of heavyweight lock that the verification process requires.
Ideally, it would be possible to run any new verification process in a
fairly indiscriminate way with only limited impact on live production
systems.

-- 
Peter Geoghegan



Re: amcheck (B-Tree integrity checking tool)

From
Noah Misch
Date:
On Fri, Oct 14, 2016 at 04:56:39PM -0700, Peter Geoghegan wrote:
> On Mon, Feb 29, 2016 at 4:09 PM, Peter Geoghegan <pg@heroku.com> wrote:
> > To recap, the extension adds some SQL-callable functions that verify
> > certain invariant conditions hold within some particular B-Tree index.
> > These are the conditions that index scans rely on always being true.
> > The tool's scope may eventually cover other AMs, including heapam, but
> > nbtree seems like the best place to start.
> 
> Noah and I discussed possible future directions for amcheck in person
> recently. I would like to get Noah's thoughts again here on how a tool
> like amcheck might reasonably target other access methods for
> verification. In particular, the heapam. MultiXacts were mentioned as
> a structure that could receive verification in a future iteration of
> this tool, but I lack expertise there.

Yes, a heap checker could examine plenty of things.  Incomplete list:

- Detect impossible conditions in the hint bits.  A tuple should not have both HEAP_XMAX_COMMITTED and
HEAP_XMAX_INVALID. Every tuple bearing HEAP_ONLY_TUPLE should bear HEAP_UPDATED.  HEAP_HASVARWIDTH should be true if
andonly if the tuple has a non-NULL value in a negative-typlen column, possibly a dropped column.  A tuple should not
haveboth HEAP_KEYS_UPDATED and HEAP_XMAX_LOCK_ONLY.
 

- Report evidence of upgrades from older versions.  If the tool sees HEAP_MOVED_IN or HEAP_MOVED_OFF, it can report
thatthe cluster was binary-upgraded from 8.3 or 8.4.  If the user did not upgrade from such a version, the user should
assumecorruption.
 

- Check VARSIZE() of each variable-length datum.  Corrupt lengths might direct you to seek past the end of the tuple,
orthey might imply excess free space at the end of the tuple.
 

- Verify agreement between CLOG, MULTIXACT, and hint bits.  If the hint bits include HEAP_XMAX_LOCK_ONLY, the multixact
shouldnot contain a MultiXactStatusUpdate member.  README.tuplock documents other invariants. If the tool sees a tuple
passingHEAP_LOCKED_UPGRADED, it can report that the cluster was binary-upgraded from a version in [8.3, 9.1].
 

- Verify that TOAST pointers (in non-dropped columns of visible tuples) point to valid data in the TOAST relation.
Thisis much more expensive than the other checks I've named, so it should be optional.
 

- If VM_ALL_VISIBLE() or VM_ALL_FROZEN() passes for a particular page, verify that the visibility data stored in the
pageis compatible with that claim.
 

- Examine PageHeaderData values.  If pd_checksum is non-zero in a cluster with checksums disabled, the cluster was
binary-upgradedfrom [8.3, 9.2].
 

> I've placed a lot of emphasis on the importance of having a
> low-overhead verification process, particularly in terms of the
> strength of heavyweight lock that the verification process requires.
> Ideally, it would be possible to run any new verification process in a
> fairly indiscriminate way with only limited impact on live production
> systems.

I suspect you could keep heap checker overhead similar to the cost of "SELECT
count(*) FROM table_name".



Re: amcheck (B-Tree integrity checking tool)

From
Michael Paquier
Date:
On Mon, Oct 17, 2016 at 10:46 AM, Noah Misch <noah@leadboat.com> wrote:
> - Detect impossible conditions in the hint bits.  A tuple should not have both
>   HEAP_XMAX_COMMITTED and HEAP_XMAX_INVALID.  Every tuple bearing
>   HEAP_ONLY_TUPLE should bear HEAP_UPDATED.  HEAP_HASVARWIDTH should be true
>   if and only if the tuple has a non-NULL value in a negative-typlen column,
>   possibly a dropped column.  A tuple should not have both HEAP_KEYS_UPDATED
>   and HEAP_XMAX_LOCK_ONLY.
>
> - Report evidence of upgrades from older versions.  If the tool sees
>   HEAP_MOVED_IN or HEAP_MOVED_OFF, it can report that the cluster was
>   binary-upgraded from 8.3 or 8.4.  If the user did not upgrade from such a
>   version, the user should assume corruption.
>
> - Check VARSIZE() of each variable-length datum.  Corrupt lengths might direct
>   you to seek past the end of the tuple, or they might imply excess free space
>   at the end of the tuple.
>
> - Verify agreement between CLOG, MULTIXACT, and hint bits.  If the hint bits
>   include HEAP_XMAX_LOCK_ONLY, the multixact should not contain a
>   MultiXactStatusUpdate member.  README.tuplock documents other invariants.
>   If the tool sees a tuple passing HEAP_LOCKED_UPGRADED, it can report that
>   the cluster was binary-upgraded from a version in [8.3, 9.1].
>
> - Verify that TOAST pointers (in non-dropped columns of visible tuples) point
>   to valid data in the TOAST relation.  This is much more expensive than the
>   other checks I've named, so it should be optional.

There is a lot of value in that actually. I got bitten lately by an
issue where this got corrupted because of an incorrect failover flow.

> - If VM_ALL_VISIBLE() or VM_ALL_FROZEN() passes for a particular page, verify
>   that the visibility data stored in the page is compatible with that claim.
>
> - Examine PageHeaderData values.  If pd_checksum is non-zero in a cluster with
>   checksums disabled, the cluster was binary-upgraded from [8.3, 9.2].

Agreed basically on all those items, though the first version does not
need to do everything :)

Honestly just btree checks have enough value to validate the
integration of this module. You can get that as "I'll look at your
patch and provide a review soon because I have a hell lot of cases
where this will be useful".
-- 
Michael



Re: amcheck (B-Tree integrity checking tool)

From
Andres Freund
Date:
Hi,


I think the patch could use a pgindent run.

On 2016-09-07 11:44:01 -0700, Peter Geoghegan wrote:
> diff --git a/contrib/amcheck/amcheck--1.0.sql b/contrib/amcheck/amcheck--1.0.sql
> new file mode 100644
> index 0000000..ebbd6ac
> --- /dev/null
> +++ b/contrib/amcheck/amcheck--1.0.sql
> @@ -0,0 +1,20 @@
> +/* contrib/amcheck/amcheck--1.0.sql */
> +
> +-- complain if script is sourced in psql, rather than via CREATE EXTENSION
> +\echo Use "CREATE EXTENSION amcheck" to load this file. \quit
> +
> +--
> +-- bt_index_check()
> +--
> +CREATE FUNCTION bt_index_check(index regclass)
> +RETURNS VOID
> +AS 'MODULE_PATHNAME', 'bt_index_check'
> +LANGUAGE C STRICT;
> +
> +--
> +-- bt_index_parent_check()
> +--
> +CREATE FUNCTION bt_index_parent_check(index regclass)
> +RETURNS VOID
> +AS 'MODULE_PATHNAME', 'bt_index_parent_check'
> +LANGUAGE C STRICT;

I'd really want a function that runs all check on a table.


> +#define CHECK_SUPERUSER() { \
> +        if (!superuser()) \
> +            ereport(ERROR, \
> +                    (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \
> +                     (errmsg("must be superuser to use verification functions")))); }

Why is this a macro?


> +/*
> + * Callers to verification functions should never receive a false positive
> + * indication of corruption.  Therefore, when using amcheck functions for
> + * stress testing, it may be useful to temporally change the CORRUPTION elevel
> + * to PANIC, to immediately halt the server in the event of detecting an
> + * invariant condition violation.  This may preserve more information about the
> + * nature of the underlying problem.  Note that modifying the CORRUPTION
> + * constant to be an elevel < ERROR is not well tested.
> + */
> +#ifdef NOT_USED
> +#define CORRUPTION        PANIC
> +#define CONCERN            WARNING
> +#define POSITION        NOTICE
> +#else
> +#define CORRUPTION        ERROR
> +#define CONCERN            DEBUG1
> +#define POSITION        DEBUG2
> +#endif

Hm, if we want that - and it doesn't seem like a bad idea - I think we
should be make it available without recompiling.


> +/*
> + * A B-Tree cannot possibly have this many levels, since there must be one
> + * block per level, which is bound by the range of BlockNumber:
> + */
> +#define InvalidBtreeLevel    ((uint32) InvalidBlockNumber)

Hm, I think it'd be, for the long term, better if we'd move the btree
check code to amcheck_nbtree.c or such.


> +Datum
> +bt_index_parent_check(PG_FUNCTION_ARGS)
> +{
> +    Oid            indrelid = PG_GETARG_OID(0);
> +    Oid            heapid;
> +    Relation    indrel;
> +    Relation    heaprel;
> +
> +    CHECK_SUPERUSER();
> +
> +    /*
> +     * We must lock table before index to avoid deadlocks.  However, if the
> +     * passed indrelid isn't an index then IndexGetRelation() will fail.
> +     * Rather than emitting a not-very-helpful error message, postpone
> +     * complaining, expecting that the is-it-an-index test below will fail.
> +     */
> +    heapid = IndexGetRelation(indrelid, true);
> +    if (OidIsValid(heapid))
> +        heaprel = heap_open(heapid, ShareLock);
> +    else
> +        heaprel = NULL;
> +
> +    /*
> +     * Open the target index relations separately (like relation_openrv(), but
> +     * with heap relation locked first to prevent deadlocking).  In hot standby
> +     * mode this will raise an error.
> +     */
> +    indrel = index_open(indrelid, ExclusiveLock);

Theoretically we should recheck that the oids still match, theoretically
the locking here allows the index->table mapping to change. It's not a
huge window tho...


> +    /* Check for active uses of the index in the current transaction */
> +    CheckTableNotInUse(indrel, "bt_index_parent_check");

Why do we actually care?



> +static void
> +btree_index_checkable(Relation rel)
> +{
> +    if (rel->rd_rel->relkind != RELKIND_INDEX ||
> +        rel->rd_rel->relam != BTREE_AM_OID)
> +        ereport(ERROR,
> +                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +                 errmsg("only nbtree access method indexes are supported"),
> +                 errdetail("Relation \"%s\" is not a B-Tree index.",
> +                           RelationGetRelationName(rel))));
> +
> +    if (RELATION_IS_OTHER_TEMP(rel))
> +        ereport(ERROR,
> +                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +                 errmsg("cannot access temporary tables of other sessions"),
> +                 errdetail("Index \"%s\" is associated with temporary relation.",
> +                           RelationGetRelationName(rel))));
> +
> +    if (!rel->rd_index->indisready)
> +        ereport(ERROR,
> +                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +                 errmsg("cannot check index \"%s\"",
> +                        RelationGetRelationName(rel)),
> +                 errdetail("Index is not yet ready for insertions")));

Why can't we check this? Seems valuable to allow checking after a
partial CONCURRENTLY index build?

Why don't you use IndexIsReady et al? Not sure if the patch compiles
against an earlier release (there weren't valid/ready/live dead fields,
but instead it was compressed into fewer)?


> +/*
> + * bt_check_every_level()
> + *
> + * Main entry point for B-Tree SQL-callable functions.  Walks the B-Tree in
> + * logical order, verifying invariants as it goes.
> + *
> + * It is the caller's responsibility to acquire appropriate heavyweight lock on
> + * the index relation, and advise us if extra checks are safe when an
> + * ExclusiveLock is held.  An ExclusiveLock is generally assumed to prevent any
> + * kind of physical modification to the index structure, including
> + * modifications that VACUUM may make.
> + */

Hm, what kind of locking does the kill_prior_tuple stuff use?


> +static void
> +bt_check_every_level(Relation rel, bool exclusivelock)
> +{

> +    /*
> +     * Certain deletion patterns can result in "skinny" B-Tree indexes, where
> +     * the fast root and true root differ.
> +     *
> +     * Start from the true root, not the fast root, unlike conventional index
> +     * scans.  This approach is more thorough, and removes the risk of
> +     * following a stale fast root from the meta page.
> +     */
> +    if (metad->btm_fastroot != metad->btm_root)
> +        ereport(CONCERN,
> +                (errcode(ERRCODE_DUPLICATE_OBJECT),
> +                 errmsg("fast root mismatch in index %s",
> +                        RelationGetRelationName(rel)),
> +                 errdetail_internal("Fast block %u (level %u) "
> +                                    "differs from true root "
> +                                    "block %u (level %u).",
> +                                    metad->btm_fastroot, metad->btm_fastlevel,
> +                                    metad->btm_root, metad->btm_level)));

Hm, isn't that a potentially spurious report?


> +static BtreeLevel
> +bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
> +{

> +    do
> +    {
> +        /* Don't rely on CHECK_FOR_INTERRUPTS() calls at lower level */
> +        CHECK_FOR_INTERRUPTS();
> +
> +        /* Initialize state for this iteration */
> +        state->targetblock = current;
> +        state->target = palloc_btree_page(state, state->targetblock);
> +        state->targetlsn = PageGetLSN(state->target);
> +
> +        opaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
> +
> +        if (P_IGNORE(opaque))
> +        {
> +            if (P_RIGHTMOST(opaque))
> +                ereport(CORRUPTION,
> +                        (errcode(ERRCODE_INDEX_CORRUPTED),
> +                         errmsg("block %u fell off the end of index \"%s\"",
> +                                current, RelationGetRelationName(state->rel))));
> +            else
> +                ereport(CONCERN,
> +                        (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +                         errmsg("block %u of index \"%s\" ignored",
> +                                current, RelationGetRelationName(state->rel))));

Hm, why is this worth a WARNING? Isn't it perfectly normal to have a few
of these?


> +            /*
> +             * Before beginning any non-trivial examination of level, establish
> +             * next level down's leftmost block number, which next call here
> +             * will pass as its leftmost (iff this isn't leaf level).

"next level down's leftmost block number" - it'd be good to simplify
that a bit.


> +/*
> + * bt_target_page_check()

Why "target"?

> +static void
> +bt_target_page_check(BtreeCheckState *state)
> +{


> +        /*
> +         *   ********************
> +         *   * Page order check *
> +         *   ********************

Isn't that item order, rather than page order?


> +        }
> +        /*
> +         *   ********************
> +         *   * Last item check  *
> +         *   ********************

Newline before block wouldn't hurt.


> +         * Check last item against next/right page's first data item's when
> +         * last item on page is reached.

I think this need some polishing.



> +        /*
> +         *   ********************
> +         *   * Downlink check   *
> +         *   ********************
> +         *
> +         * Additional check of child items against target page (their parent),
> +         * iff this is internal page and caller holds ExclusiveLock on index

s/is/is an/ /holds/holds an/


> +static ScanKey
> +bt_right_page_check_scankey(BtreeCheckState *state)
> +{

> +     * A deleted page won't actually be recycled by VACUUM early enough for us
> +     * to fail to be able follow its right link (or left link, or downlink),

"for us to fail to be able" sounds odd.

> +    /*
> +     * No ExclusiveLock held case -- why it's safe to proceed.
> +     *
> +     * Problem:
> +     *
> +     * We must avoid false positive reports of corruption when caller treats
> +     * item returned here as an upper bound on target's last item.  In general,
> +     * false positives are disallowed.  Ensuring they don't happen in
> +     * !exclusivelock case is subtle.

s/item/an item/?


This paragraph is too complicated to follow in the noisy environment
where we both currently are ;)

> +/*
> + * bt_downlink_check()
> + *
> + * Checks one of target's downlink against its child page.

downlinks?

> + * Conceptually, the target page continues to be what is checked here.  The
> + * target block is still blamed in the event of finding an invariant violation.
> + * The downlink insertion into the target is probably where any problem raised
> + * here arises, and there is no such thing as a parent link, so doing the
> + * verification this way around is much more practical.
> + */
> +static void
> +bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
> +                  ScanKey targetkey)
> +{

> +     * between the deleted page and its right sibling.  (We cannot delete a
> +     * parent page's rightmost page unless it is the last child page, and we
> +     * intend to delete the parent itself.)

"parent page's rightmost page"?


I like this. Some of the more complex pieces towards the end of the
field need some attention, there's a fair amount of word-smithing
needed, and I do think we want to make the structural changes outlined
above, but besides these, imo fairly simple adaptions, I do think this
is useful and not that far from being committable.


- Andres



Re: amcheck (B-Tree integrity checking tool)

From
Thomas Munro
Date:
On Thu, Sep 8, 2016 at 6:44 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Fri, Sep 2, 2016 at 11:19 AM, Kevin Grittner <kgrittn@gmail.com> wrote:
>> IMV the process is to post a patch to this list to certify that it
>> is yours to contribute and free of IP encumbrances that would
>> prevent us from using it.  I will wait for that post.
>
> I attach my V3

+ * Ideally, we'd compare every item in the index against every other
+ * item in the index, and not trust opclass obedience of the transitive
+ * law to bridge the gap between children and their grandparents (as
+ * well as great-grandparents, and so on).  We don't go to those
+ * lengths because that would be prohibitively expensive, and probably
+ * not markedly more effective in practice.
+ */

I skimmed the Modern B-Tree Techniques paper recently, and there was a
section on "the cousin problem" when verifying btrees, which this
comment reminded me of.  I tried to come up with an example of a pair
of characters being switched in a collation that would introduce
undetectable corruption:

T1.  Order   = a < b < c < d
    Btree   =        [a|c]                     /   \                    /     \                   /       \
   [a]-------[c]                  |         |                  |         |                 [b]-------[d]
 

T2.  Order   = a < c < b < d

Now c and b have been switched around.  Won't amcheck still pass since
we only verify immediate downlinks and siblings?  Yet searches for b
will take a wrong turn at the root.  Do I have this right?  I wonder
if there is a way to verify that each page's high key < the 'next' key
for all ancestors.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Wed, Nov 16, 2016 at 1:03 PM, Andres Freund <andres@anarazel.de> wrote:
> I think the patch could use a pgindent run.

Okay.

> I'd really want a function that runs all check on a table.

Why not just use a custom SQL query? The docs have an example of one
such query, that verifies all user indexes, but there could be another
example.

>> +#define CHECK_SUPERUSER() { \
>> +             if (!superuser()) \
>> +                     ereport(ERROR, \
>> +                                     (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \
>> +                                      (errmsg("must be superuser to use verification functions")))); }
>
> Why is this a macro?

This was just copied from pageinspect, TBH. What would you prefer?

>> +#ifdef NOT_USED
>> +#define CORRUPTION           PANIC
>> +#define CONCERN                      WARNING
>> +#define POSITION             NOTICE
>> +#else
>> +#define CORRUPTION           ERROR
>> +#define CONCERN                      DEBUG1
>> +#define POSITION             DEBUG2
>> +#endif
>
> Hm, if we want that - and it doesn't seem like a bad idea - I think we
> should be make it available without recompiling.

I suppose, provided it doesn't let CORRUPTION elevel be < ERROR. That
might be broken if it was allowed.

> Hm, I think it'd be, for the long term, better if we'd move the btree
> check code to amcheck_nbtree.c or such.

Agreed.

>> +Datum
>> +bt_index_parent_check(PG_FUNCTION_ARGS)
>> +{
>> +     Oid                     indrelid = PG_GETARG_OID(0);

> Theoretically we should recheck that the oids still match, theoretically
> the locking here allows the index->table mapping to change. It's not a
> huge window tho...

>> +     /* Check for active uses of the index in the current transaction */
>> +     CheckTableNotInUse(indrel, "bt_index_parent_check");
>
> Why do we actually care?

This was left-over from duplicating REINDEX code. Indeed, we have no
reason to care at all.

>> +     if (!rel->rd_index->indisready)
>> +             ereport(ERROR,
>> +                             (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>> +                              errmsg("cannot check index \"%s\"",
>> +                                             RelationGetRelationName(rel)),
>> +                              errdetail("Index is not yet ready for insertions")));
>
> Why can't we check this? Seems valuable to allow checking after a
> partial CONCURRENTLY index build?

Okay.

> Why don't you use IndexIsReady et al? Not sure if the patch compiles
> against an earlier release (there weren't valid/ready/live dead fields,
> but instead it was compressed into fewer)?

This version is almost identical to the Github version of amcheck that
we've been using with 9.4+. Agreed on IndexIsReady().

>> +/*
>> + * bt_check_every_level()
>> + *
>> + * Main entry point for B-Tree SQL-callable functions.  Walks the B-Tree in
>> + * logical order, verifying invariants as it goes.
>> + *
>> + * It is the caller's responsibility to acquire appropriate heavyweight lock on
>> + * the index relation, and advise us if extra checks are safe when an
>> + * ExclusiveLock is held.  An ExclusiveLock is generally assumed to prevent any
>> + * kind of physical modification to the index structure, including
>> + * modifications that VACUUM may make.
>> + */
>
> Hm, what kind of locking does the kill_prior_tuple stuff use?

The killing itself, or the setting of LP_DEAD bit? Ordinary index
scans could concurrently set LP_DEAD with only an AccessShareLock.
But, that's just metadata that amcheck doesn't care about. Anything
that needs to actually recyle space based on finding an LP_DEAD bit
set is a write operation, that is precluded from running when
bt_index_parent_check() is called by its ExclusiveLock.
_bt_vacuum_one_page() is only called by one nbtinsert.c code path,
when it looks like a page split might be needed.

>> +static void
>> +bt_check_every_level(Relation rel, bool exclusivelock)
>> +{
>
>> +     /*
>> +      * Certain deletion patterns can result in "skinny" B-Tree indexes, where
>> +      * the fast root and true root differ.
>> +      *
>> +      * Start from the true root, not the fast root, unlike conventional index
>> +      * scans.  This approach is more thorough, and removes the risk of
>> +      * following a stale fast root from the meta page.
>> +      */
>> +     if (metad->btm_fastroot != metad->btm_root)
>> +             ereport(CONCERN,
>> +                             (errcode(ERRCODE_DUPLICATE_OBJECT),
>> +                              errmsg("fast root mismatch in index %s",
>> +                                             RelationGetRelationName(rel)),
>> +                              errdetail_internal("Fast block %u (level %u) "
>> +                                                                     "differs from true root "
>> +                                                                     "block %u (level %u).",
>> +                                                                     metad->btm_fastroot, metad->btm_fastlevel,
>> +                                                                     metad->btm_root, metad->btm_level)));
>
> Hm, isn't that a potentially spurious report?

Well, it's not an error, so no. I think that it would be better with a
lower elevel, though, to indicate the low severity.

>> +                             ereport(CONCERN,
>> +                                             (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>> +                                              errmsg("block %u of index \"%s\" ignored",
>> +                                                             current, RelationGetRelationName(state->rel))));
>
> Hm, why is this worth a WARNING? Isn't it perfectly normal to have a few
> of these?

It is, but you shouldn't get very many of them, since pages that are
ignorable don't have sibling links at the second phase of deletion. I
guess WARNING might be too strong here, too.

> "next level down's leftmost block number" - it'd be good to simplify
> that a bit.

Okay.

>> +/*
>> + * bt_target_page_check()
>
> Why "target"?

Conceptually, the verification scan vists from true root to leaf, left
to right, and makes each page it finds along the way the target
exactly once. It's represented as such in the state that we pass
around -- there is some metadata stashed about the target there (e.g.,
LSN is kept there for easy retrieval by possible ereport(ERROR)
calls). I think that that makes things quite a lot clearer,
particularly when we need to discuss checks that happen against the
sibling or child of target (conceptually, we're still checking the
target). The description of what happens and why that is okay is a lot
clearer because we constantly refer to the current target page, IMV --
that's always our frame of reference.

>> +             /*
>> +              *   ********************
>> +              *   * Page order check *
>> +              *   ********************
>
> Isn't that item order, rather than page order?

I don't know...it checks that the items on the page are in the right order.

> (Various minor niggles about style)

Okay.

> This paragraph is too complicated to follow in the noisy environment
> where we both currently are ;)

As I pointed out yesterday, that paragraph is 95%+ of what makes the
patch hard to understand. It's worth it, though, to detect problems
with things like transposed pages while holding only an
AccessShareLock.

> downlinks?

Downlinks are an accepted terminology for the index tuples in internal
pages, that point to child pages (rather than having TID pointer to
table).

> "parent page's rightmost page"?

I guess that should be "parent page's rightmost child". This is
covered in the nbtree README:

"""
To preserve consistency on the parent level, we cannot merge the key space
of a page into its right sibling unless the right sibling is a child of
the same parent --- otherwise, the parent's key space assignment changes
too, meaning we'd have to make bounding-key updates in its parent, and
perhaps all the way up the tree.  Since we can't possibly do that
atomically, we forbid this case.  That means that the rightmost child of a
parent node can't be deleted unless it's the only remaining child, in which
case we will delete the parent too (see below).
''""

> I like this. Some of the more complex pieces towards the end of the
> field need some attention, there's a fair amount of word-smithing
> needed, and I do think we want to make the structural changes outlined
> above, but besides these, imo fairly simple adaptions, I do think this
> is useful and not that far from being committable.

Cool. I'll try to get a revision out soon. I'm happy to do that much.

-- 
Peter Geoghegan



Re: amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Wed, Nov 16, 2016 at 5:06 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> + * Ideally, we'd compare every item in the index against every other
> + * item in the index, and not trust opclass obedience of the transitive
> + * law to bridge the gap between children and their grandparents (as
> + * well as great-grandparents, and so on).  We don't go to those
> + * lengths because that would be prohibitively expensive, and probably
> + * not markedly more effective in practice.
> + */
>
> I skimmed the Modern B-Tree Techniques paper recently, and there was a
> section on "the cousin problem" when verifying btrees, which this
> comment reminded me of.

I'm impressed that you made the connection. This comment reminded you
of that section because it was directly influenced by it. I actually
use the term "cousin page" elsewhere, because it's really useful to
distinguish cousins from "true siblings" when talking about page
deletion in particular...and I have a lot to say about page deletion.

> I tried to come up with an example of a pair
> of characters being switched in a collation that would introduce
> undetectable corruption:
>
> T1.  Order   = a < b < c < d
>
>      Btree   =        [a|c]
>                       /   \
>                      /     \
>                     /       \
>                   [a]-------[c]
>                    |         |
>                    |         |
>                   [b]-------[d]
>
> T2.  Order   = a < c < b < d

So, the Btree you show is supposed to be good? You've left it up to me
to draw T2, the corrupt version? If so, here is a drawing of T2 for
the sake of clarity:
     Btree   =        [a|b]                      /   \                     /     \                    /       \
        [a]-------[b]                   |         |                   |         |                  [c]-------[d]
 

> Now c and b have been switched around.  Won't amcheck still pass since
> we only verify immediate downlinks and siblings?  Yet searches for b
> will take a wrong turn at the root.  Do I have this right?  I wonder
> if there is a way to verify that each page's high key < the 'next' key
> for all ancestors.

I don't think you have it right, because your argument is predicated
on classic L&Y, not the Postgres variant. The high key is literally
guaranteed to be *equal* to the initial first value of the right half
of a split page post-split (at least initially), which is where the
new downlink for the new right half comes from in turn, so it will be
exactly equal as well. There is a comment which says all this within
_bt_insert_parent(), actually.

In general, I think it's pretty bad that we're so loose about
representing the key space in downlinks compared to classic L&Y,
because it probably makes index bloat a lot worse in internal pages,
messing up the key space long term. As long as we're doing that,
though, we clearly cannot usefully "verify that each page's high key <
the 'next' key for all ancestors", because we deviate from L&Y in a
way that makes that check always fail. It's okay that it always fails
for us (it doesn't actually indicate corruption), because of our
duplicate-aware index scan logic (change to invariant).

In summary, I'm pretty sure that our "Ki <= v <= Ki+1" thing makes the
cousin problem a non-issue, because we don't follow downlinks that are
equal to our scankey -- we go to their immediate left and start
*there*, in order to support duplicates in leaf pages. Index scans for
'b' work for both T1 and T2 in Postgres, because we refuse to follow
the 'b' downlinks when doing an index scan for 'b' (relevant to T2). I
would actually like to change things to make the invariant the classic
L&Y "Ki < v <= Ki+1", to avoid bloat in the internal pages and to make
suffix truncation in internal pages work.

So, we don't have the cousin problem, but since I wish for us to adopt
the stricter classic L&Y invariant at some point, you could say that I
also wished that we had the cousin problem.

Confused yet? :-)

[1] https://archive.org/stream/symmetricconcurr00lani#page/6/mode/2up
-- 
Peter Geoghegan



Re: amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Thu, Nov 17, 2016 at 12:04 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Hm, if we want that - and it doesn't seem like a bad idea - I think we
>> should be make it available without recompiling.
>
> I suppose, provided it doesn't let CORRUPTION elevel be < ERROR. That
> might be broken if it was allowed.

What do you think about new argument with default vs. GUC? I guess
that the GUC might be a lot less of a foot-gun. We might even give it
a suitably scary name, to indicate that it will make the server PANIC.
(I gather that you don't care about other aspects of verbosity -- just
about the ability to make amcheck PANIC in the event of an invariant
violation without recompiling it.)


-- 
Peter Geoghegan



Re: amcheck (B-Tree integrity checking tool)

From
Robert Haas
Date:
On Fri, Nov 18, 2016 at 6:51 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Nov 17, 2016 at 12:04 PM, Peter Geoghegan <pg@heroku.com> wrote:
>>> Hm, if we want that - and it doesn't seem like a bad idea - I think we
>>> should be make it available without recompiling.
>>
>> I suppose, provided it doesn't let CORRUPTION elevel be < ERROR. That
>> might be broken if it was allowed.
>
> What do you think about new argument with default vs. GUC? I guess
> that the GUC might be a lot less of a foot-gun. We might even give it
> a suitably scary name, to indicate that it will make the server PANIC.
> (I gather that you don't care about other aspects of verbosity -- just
> about the ability to make amcheck PANIC in the event of an invariant
> violation without recompiling it.)

Yikes.  I don't think I want to expose any kind of API that lets the
user PANIC the server.  A value < ERROR sounds far more reasonable
than a value > ERROR.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Sat, Nov 19, 2016 at 6:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> What do you think about new argument with default vs. GUC? I guess
>> that the GUC might be a lot less of a foot-gun. We might even give it
>> a suitably scary name, to indicate that it will make the server PANIC.
>> (I gather that you don't care about other aspects of verbosity -- just
>> about the ability to make amcheck PANIC in the event of an invariant
>> violation without recompiling it.)
>
> Yikes.  I don't think I want to expose any kind of API that lets the
> user PANIC the server.  A value < ERROR sounds far more reasonable
> than a value > ERROR.

In general, I don't want to get into the business of reasoning about
how well we can limp along when there is a would-be error condition
within amcheck. Once "the impossible" has actually occurred, it's very
difficult to reason about what still works. Also, I actually agree
that making it possible for the tool to force a PANIC through a
user-visible interface is a bad idea.

Maybe we should just leave it as it is -- experts can recompile the
tool after modifying it to use an elevel that is != ERROR (the thing I
mention about elevel < ERROR is already documented in code comments).
If that breaks, they get to keep both halves.

-- 
Peter Geoghegan



Re: amcheck (B-Tree integrity checking tool)

From
Robert Haas
Date:
On Sat, Nov 19, 2016 at 11:38 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Sat, Nov 19, 2016 at 6:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> What do you think about new argument with default vs. GUC? I guess
>>> that the GUC might be a lot less of a foot-gun. We might even give it
>>> a suitably scary name, to indicate that it will make the server PANIC.
>>> (I gather that you don't care about other aspects of verbosity -- just
>>> about the ability to make amcheck PANIC in the event of an invariant
>>> violation without recompiling it.)
>>
>> Yikes.  I don't think I want to expose any kind of API that lets the
>> user PANIC the server.  A value < ERROR sounds far more reasonable
>> than a value > ERROR.
>
> In general, I don't want to get into the business of reasoning about
> how well we can limp along when there is a would-be error condition
> within amcheck. Once "the impossible" has actually occurred, it's very
> difficult to reason about what still works. Also, I actually agree
> that making it possible for the tool to force a PANIC through a
> user-visible interface is a bad idea.
>
> Maybe we should just leave it as it is -- experts can recompile the
> tool after modifying it to use an elevel that is != ERROR (the thing I
> mention about elevel < ERROR is already documented in code comments).
> If that breaks, they get to keep both halves.

OK.  If it's not reasonable to continue checking after an ERROR, then
I think ERROR is the way to go.  If somebody really doesn't like that
lack of flexibility (in either direction), they can propose a change
later for separate consideration.  That limitation is not, in my view,
a sufficient reason to hold up the patch on the table.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Sun, Nov 20, 2016 at 3:42 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> OK.  If it's not reasonable to continue checking after an ERROR, then
> I think ERROR is the way to go.  If somebody really doesn't like that
> lack of flexibility (in either direction), they can propose a change
> later for separate consideration.  That limitation is not, in my view,
> a sufficient reason to hold up the patch on the table.

I attach V4 of amcheck. I decided to leave things as they are, as far
as forcing a different elevel goes (you still need to modify a macro).
I didn't change the elevel macro from CONCERN in a couple of cases, as
Andres suggested, because in fact that's generally the same as DEBUG1,
not WARNING (I agreed with Andres that it was WARNING before, but it
isn't unless you #define NOT_USED).

Andres said: "Theoretically we should recheck that the oids still
match, theoretically the locking here allows the index->table mapping
to change". I don't know how to act on this feedback, since comparable
index + heap locking code should have the same problem, but doesn't
consider it. How is this any different to reindex_index()?

Other than that, all feedback items were worked through. I made the
functions PARALLEL SAFE, too, since I noticed that that wasn't the
case in passing.

--
Peter Geoghegan

Attachment

Re: amcheck (B-Tree integrity checking tool)

From
Thomas Munro
Date:
On Fri, Nov 18, 2016 at 12:17 PM, Peter Geoghegan <pg@heroku.com> wrote:
> So, the Btree you show is supposed to be good? You've left it up to me
> to draw T2, the corrupt version? If so, here is a drawing of T2 for
> the sake of clarity:
>
>       Btree   =        [a|b]
>                        /   \
>                       /     \
>                      /       \
>                    [a]-------[b]
>                     |         |
>                     |         |
>                    [c]-------[d]

Actually I meant that at T2 the btree is exactly same as it was at T1,
but the order of the values has changed according to the comparator
(for example, because a collation definition changed when you updated
your operating system), so that the btree is now corrupted.  Based on
Goetz Graefe's paper I was wondering if amcheck would be unable to
detect this due to the cousin problem.

>> Now c and b have been switched around.  Won't amcheck still pass since
>> we only verify immediate downlinks and siblings?  Yet searches for b
>> will take a wrong turn at the root.  Do I have this right?  I wonder
>> if there is a way to verify that each page's high key < the 'next' key
>> for all ancestors.
>
> I don't think you have it right, because your argument is predicated
> on classic L&Y, not the Postgres variant. The high key is literally
> guaranteed to be *equal* to the initial first value of the right half
> of a split page post-split (at least initially), which is where the
> new downlink for the new right half comes from in turn, so it will be
> exactly equal as well. There is a comment which says all this within
> _bt_insert_parent(), actually.
>
> In general, I think it's pretty bad that we're so loose about
> representing the key space in downlinks compared to classic L&Y,
> because it probably makes index bloat a lot worse in internal pages,
> messing up the key space long term. As long as we're doing that,
> though, we clearly cannot usefully "verify that each page's high key <
> the 'next' key for all ancestors", because we deviate from L&Y in a
> way that makes that check always fail. It's okay that it always fails
> for us (it doesn't actually indicate corruption), because of our
> duplicate-aware index scan logic (change to invariant).
>
> In summary, I'm pretty sure that our "Ki <= v <= Ki+1" thing makes the
> cousin problem a non-issue, because we don't follow downlinks that are
> equal to our scankey -- we go to their immediate left and start
> *there*, in order to support duplicates in leaf pages. Index scans for
> 'b' work for both T1 and T2 in Postgres, because we refuse to follow
> the 'b' downlinks when doing an index scan for 'b' (relevant to T2). I
> would actually like to change things to make the invariant the classic
> L&Y "Ki < v <= Ki+1", to avoid bloat in the internal pages and to make
> suffix truncation in internal pages work.
>
> So, we don't have the cousin problem, but since I wish for us to adopt
> the stricter classic L&Y invariant at some point, you could say that I
> also wished that we had the cousin problem.

Thank you for your patient explanation!  Please see my humble attempt
to test this scenario and learn something about btrees in the process,
attached.  amcheck does detect the violation of the high key
invariant.

> Confused yet? :-)

Yes.  But it's getting better slowly :-)

--
Thomas Munro
http://www.enterprisedb.com

Attachment

Re: amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Wed, Nov 23, 2016 at 2:47 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Actually I meant that at T2 the btree is exactly same as it was at T1,
> but the order of the values has changed according to the comparator
> (for example, because a collation definition changed when you updated
> your operating system), so that the btree is now corrupted.  Based on
> Goetz Graefe's paper I was wondering if amcheck would be unable to
> detect this due to the cousin problem.

I love that paper (it's more like a book, really). I'm glad that at
least one other person in the community has read some of it, because I
think it has a lot of clues as to how we can begin to address some of
our problems around bloat without novel redesign. Most of the
techniques are quite complementary.

I wrote a prototype of suffix truncation for text in two days last
week. The regression tests pass, and the technique works well for
certain cases, particularly with many column indexes without
correlation (the index-only-scan case). Let's see if that becomes a
real patch.

> Thank you for your patient explanation!

Your questions were excellent, and so deserved my careful consideration.

> Please see my humble attempt
> to test this scenario and learn something about btrees in the process,
> attached.  amcheck does detect the violation of the high key
> invariant.

This is a nice pattern for testing out if the keyspace is sane, and
how things work. I might incorporate it into my own testing in the
future.

-- 
Peter Geoghegan



Re: amcheck (B-Tree integrity checking tool)

From
Haribabu Kommi
Date:


On Wed, Nov 30, 2016 at 9:06 AM, Peter Geoghegan <pg@heroku.com> wrote:
On Wed, Nov 23, 2016 at 2:47 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Actually I meant that at T2 the btree is exactly same as it was at T1,
> but the order of the values has changed according to the comparator
> (for example, because a collation definition changed when you updated
> your operating system), so that the btree is now corrupted.  Based on
> Goetz Graefe's paper I was wondering if amcheck would be unable to
> detect this due to the cousin problem.

I love that paper (it's more like a book, really). I'm glad that at
least one other person in the community has read some of it, because I
think it has a lot of clues as to how we can begin to address some of
our problems around bloat without novel redesign. Most of the
techniques are quite complementary.

I wrote a prototype of suffix truncation for text in two days last
week. The regression tests pass, and the technique works well for
certain cases, particularly with many column indexes without
correlation (the index-only-scan case). Let's see if that becomes a
real patch.

> Thank you for your patient explanation!

Your questions were excellent, and so deserved my careful consideration.

> Please see my humble attempt
> to test this scenario and learn something about btrees in the process,
> attached.  amcheck does detect the violation of the high key
> invariant.

This is a nice pattern for testing out if the keyspace is sane, and
how things work. I might incorporate it into my own testing in the
future.

Moved to next CF with " needs review" status.

Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

From
Michael Paquier
Date:
On Mon, Dec 5, 2016 at 2:15 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
> Moved to next CF with " needs review" status.

Same, to CF 2017-03 this time.
-- 
Michael



Re: [HACKERS] amcheck (B-Tree integrity checking tool)

From
Andres Freund
Date:
Hi Heikki,

On 2016-11-22 10:56:07 -0800, Peter Geoghegan wrote:
> I attach V4 of amcheck.

Is there any chance you could look at the concurrency related parts of
amcheck?  I'd like to push this to be mergeable, but that area is a bit
outside of my area of expertise...

Andres



Re: [HACKERS] amcheck (B-Tree integrity checking tool)

From
Andres Freund
Date:
Hi,

On 2016-11-22 10:56:07 -0800, Peter Geoghegan wrote:
> Andres said: "Theoretically we should recheck that the oids still
> match, theoretically the locking here allows the index->table mapping
> to change". I don't know how to act on this feedback, since comparable
> index + heap locking code should have the same problem, but doesn't
> consider it. How is this any different to reindex_index()?

reindex_index isn't necessarily comparable, because
RangeVarCallbackForReindexIndex() is, on first blush at least, careful
enough about the locking, and reindex_relation only gets the index list
after building the index:/* * Find and lock index, and check permissions on table; use callback to * obtain lock on
tablefirst, to avoid deadlock hazard.  The lock level * used here must match the index lock obtained in
reindex_index().*/
 

I think you ought to do something similar.


I'd really like to have something like relation_check(), index_check()
which calls the correct functions for the relevan index types (and
possibly warns about !nbtree indexes not being checked?).


> @@ -0,0 +1,1227 @@
> +/*-------------------------------------------------------------------------
> + *
> + * verify_nbtree.c
> + *        Verifies the integrity of nbtree indexes based on invariants.
> + *
> + * For B-Tree indexes, verification includes the invariant that each page in
> + * the target index has items in logical order as reported by an insertion
> + * scankey (the insertion scankey sort-wise NULL semantics are needed for
> + * verification).

I'd appreciate some rephrasing of this paragraph. "includes the
invariant that" isn't that easy to understand for an ESL person.


> + *
> + * Copyright (c) 2016, PostgreSQL Global Development Group

... ;)


> +#define CHECK_SUPERUSER() { \
> +        if (!superuser()) \
> +            ereport(ERROR, \
> +                    (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \
> +                     (errmsg("must be superuser to use verification functions")))); }

Please make this either a function or just copy - the macro doesn't buy
us anything except weirder looking syntax.


> +/*
> + * Callers to verification functions should never receive a false positive
> + * indication of corruption.  Therefore, when using amcheck functions for
> + * stress testing, it may be useful to temporally change the CORRUPTION elevel
> + * to PANIC, to immediately halt the server in the event of detecting an
> + * invariant condition violation.  This may preserve more information about the
> + * nature of the underlying problem.  Note that modifying the CORRUPTION
> + * constant to be an elevel < ERROR is not well tested.
> + */
> +#ifdef NOT_USED
> +#define CORRUPTION        PANIC
> +#define CONCERN            WARNING
> +#define POSITION        NOTICE
> +#else
> +#define CORRUPTION        ERROR
> +#define CONCERN            DEBUG1
> +#define POSITION        DEBUG2
> +#endif

Please remove this block and use the normal elevels.



> +/*
> + * State associated with verifying a B-Tree index
> + */
> +typedef struct BtreeCheckState
> +{
> +    /*
> +     * Unchanging state, established at start of verification:
> +     *
> +     * rel:                 B-Tree Index Relation
> +     * exclusivelock:       ExclusiveLock held on rel; else AccessShareLock
> +     * checkstrategy:       Buffer access strategy
> +     * targetcontext:       Target page memory context
> +     */
> +    Relation                rel;
> +    bool                    exclusivelock;
> +    BufferAccessStrategy     checkstrategy;
> +    MemoryContext            targetcontext;

Can you move the field comments to the individual fields? Separating
them makes it more likely to get out of date (obviously keep the
header).  I'd also rename 'exclusivelock' to 'exclusivelylocked' or
such.


> +    /*
> +     * Mutable state, for verification of particular page:
> +     *
> +     * target:              Main target page
> +     * targetblock:         Main target page's block number
> +     * targetlsn:           Main target page's LSN

Same here.


> +PG_FUNCTION_INFO_V1(bt_index_check);
> +PG_FUNCTION_INFO_V1(bt_index_parent_check);

I think it'd be good to avoid adding a lot of functions for each
additional type of check we can think of.  I think adding a named 'bool
check_parents = false' argument or such would be better.  Then we can
simply add more and more checks subsequently, and we can have
'bool all_checks = false' so people can trivially check everything,
instead of having to add new checks in each new function.


> +/*
> + * bt_index_check(index regclass)
> + *
> + * Verify integrity of B-Tree index.
> + *
> + * Only acquires AccessShareLock on index relation.  Does not consider
> + * invariants that exist between parent/child pages.
> + */
> +Datum
> +bt_index_check(PG_FUNCTION_ARGS)
> +{
> +    Oid            relid = PG_GETARG_OID(0);
> +    Relation    indrel;
> +
> +    CHECK_SUPERUSER();
> +
> +    indrel = relation_open(relid, AccessShareLock);
>

> +    /* Relation suitable for checking as B-Tree? */
> +    btree_index_checkable(indrel);
> +
> +    /* Check index */
> +    bt_check_every_level(indrel, false);
> +
> +    relation_close(indrel, AccessShareLock);

I'm *VERY* strongly against releasing locks on user relations early.


> +/*
> + * bt_index_parent_check(index regclass)
> + *
> + * Verify integrity of B-Tree index.
> + *
> + * Acquires ExclusiveLock on index relation, and ShareLock on the associated
> + * heap relation, a bit like REINDEX.  Verifies that downlinks in parent pages
> + * are valid lower bounds on child pages.
> + */
> +Datum
> +bt_index_parent_check(PG_FUNCTION_ARGS)
> +{
> +    Oid            indrelid = PG_GETARG_OID(0);
> +    Oid            heapid;
> +    Relation    indrel;
> +    Relation    heaprel;
> +
> +    CHECK_SUPERUSER();
> +
> +    /*
> +     * We must lock table before index to avoid deadlocks.  However, if the
> +     * passed indrelid isn't an index then IndexGetRelation() will fail.
> +     * Rather than emitting a not-very-helpful error message, postpone
> +     * complaining, expecting that the is-it-an-index test below will fail.
> +     */
> +    heapid = IndexGetRelation(indrelid, true);
> +    if (OidIsValid(heapid))
> +        heaprel = heap_open(heapid, ShareLock);
> +    else
> +        heaprel = NULL;
> +
> +    /*
> +     * Open the target index relations separately (like relation_openrv(), but
> +     * with heap relation locked first to prevent deadlocking).  In hot standby
> +     * mode this will raise an error.
> +     */
> +    indrel = index_open(indrelid, ExclusiveLock);

So yea, this is what I was complaining about re locking / index & table
relation IIRC.  Note you're also deadlock prone by violating the rule
about always locking the relation first.


> +    index_close(indrel, ExclusiveLock);
> +    if (heaprel)
> +        heap_close(heaprel, ShareLock);

Same thing about releasing locks early.



> +/*
> + * btree_index_checkable()
> + *
> + * Basic checks about the suitability of a relation for checking as a B-Tree
> + * index.
> + */
> +static void
> +btree_index_checkable(Relation rel)
> +{
> +    if (rel->rd_rel->relkind != RELKIND_INDEX ||
> +        rel->rd_rel->relam != BTREE_AM_OID)
> +        ereport(ERROR,
> +                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +                 errmsg("only nbtree access method indexes are supported"),
> +                 errdetail("Relation \"%s\" is not a B-Tree index.",
> +                           RelationGetRelationName(rel))));

I think this message could stand some rephrasing.  There's a significant
overlap between msg and detail, and they use different phrasing for
btree.


> +    if (!IndexIsValid(rel->rd_index))
> +        ereport(ERROR,
> +                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +                 errmsg("cannot check index \"%s\"",
> +                        RelationGetRelationName(rel)),
> +                 errdetail("Index is not valid")));

I'm perfectly fine with not changing that, but IIUC it'd be fine to
weaken this to IndexIsLive() - right?


> + * It is the caller's responsibility to acquire appropriate heavyweight lock on
> + * the index relation, and advise us if extra checks are safe when an
> + * ExclusiveLock is held.  An ExclusiveLock is generally assumed to prevent any
> + * kind of physical modification to the index structure, including
> + * modifications that VACUUM may make.

It might make sense to be a bit more explicit about the structure bit
(which I assume refers to the fact that modifications to individual
pages are possible (re killitems)?).


> + */
> +static void
> +bt_check_every_level(Relation rel, bool exclusivelock)

Hm, is every level accurate - this is more like checking every
"reachable" page, right?


> +    /*
> +     * Certain deletion patterns can result in "skinny" B-Tree indexes, where
> +     * the fast root and true root differ.
> +     *
> +     * Start from the true root, not the fast root, unlike conventional index
> +     * scans.  This approach is more thorough, and removes the risk of
> +     * following a stale fast root from the meta page.
> +     */
> +    if (metad->btm_fastroot != metad->btm_root)
> +        ereport(CONCERN,
> +                (errcode(ERRCODE_DUPLICATE_OBJECT),
> +                 errmsg("fast root mismatch in index %s",
> +                        RelationGetRelationName(rel)),
> +                 errdetail_internal("Fast block %u (level %u) "
> +                                    "differs from true root "
> +                                    "block %u (level %u).",
> +                                    metad->btm_fastroot, metad->btm_fastlevel,
> +                                    metad->btm_root, metad->btm_level)));

I'd weaken the message to something that sounds less like an actual
problem... - which this is not, right?


> +/*
> + * bt_check_level_from_leftmost()
> + *
> + * Given a left-most block at some level, move right, verifying each page
> + * individually (with more verification across pages for "exclusivelock"
> + * callers).  Caller should pass the true root page as the leftmost initially,
> + * working their way down by passing what is returned for the last call here
> + * until level 0 (leaf page level) was reached.
> + *
> + * Returns state for next call, if any.

Wouldn't "'sets state for the next call..." be more accurate than
returning?

> +static BtreeLevel
> +bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
> +{

> +        if (P_IGNORE(opaque))
> +        {
> +            if (P_RIGHTMOST(opaque))
> +                ereport(CORRUPTION,
> +                        (errcode(ERRCODE_INDEX_CORRUPTED),
> +                         errmsg("block %u fell off the end of index \"%s\"",
> +                                current, RelationGetRelationName(state->rel))));

I think this could stand a bit more formal language.


> +            else
> +                ereport(CONCERN,
> +                        (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +                         errmsg("block %u of index \"%s\" ignored",
> +                                current, RelationGetRelationName(state->rel))));

missing verb. Also, is this actually worth logging (even at such a low level)?


> +            /*
> +             * Before beginning any non-trivial examination of level, establish
> +             * state to be passed by caller for next call here, if this isn't
> +             * last call/leaf level.

"state to be passed by caller for next call here" is hard to
understand.  "Prepare state for next bt_check_level_from_leftmost
invocation for the next level"?


> +             * There should be at least one non-ignorable page per level.
> +             */

I might have missed something, but you're not checking that, are you?

> +        if (state->exclusivelock && opaque->btpo_prev != leftcurrent)
> +            ereport(CORRUPTION,
> +                    (errcode(ERRCODE_INDEX_CORRUPTED),
> +                     errmsg("right link/left link pair in index \"%s\" not in mutual agreement",
> +                            RelationGetRelationName(state->rel)),
> +                     errdetail_internal("Deleted block=%u left block=%u link reported block=%u.",
> +                                        current, leftcurrent, opaque->btpo_prev)));

"mutual" doesn't seem to add much here.


> +        /* Try to detect circular links */
> +        if (current == leftcurrent || current == opaque->btpo_prev)
> +            ereport(CORRUPTION,
> +                    (errcode(ERRCODE_INDEX_CORRUPTED),
> +                     errmsg("circular link chain found in block %u of index \"%s\"",
> +                            current, RelationGetRelationName(state->rel))));

That's not really reliable though, is it?  I've seen corrupted link
chains (leading to endless vacuums and such) a number of times, so maybe
explicitly keeping track of already visited pages within a level would
be worthwhile?


> +/*
> + * bt_target_page_check()
> + *
> + * Function performs the following checks on target page, or pages ancillary to
> + * target page:
> + *
> + * - That every "real" data item is less than or equal to the high key, which
> + *   is an upper bound on the items on the pages (where there is a high key at
> + *   all -- pages that are rightmost lack one).
> + *
> + * - That within the page, every "real" item is less than or equal to the item
> + *   immediately to its right, if any (i.e., that the items are in order within
> + *   the page, so that the binary searches performed by index scans are sane).
> + *
> + * - That the last item stored on the page is less than or equal to the first
> + *   "real" data item on the page to the right (if such a first item is
> + *   available).

Hm. I think it'd be useful if we also made sure (or in the caller) that
the page actually is at the correct level.  I seem to recall seing a
case where downlinks were "circular" due to corruption, leading to stuck
scans.

> + * Note:  This routine is not especially proactive in freeing memory.  High
> + * watermark memory consumption is bound to some small fixed multiple of
> + * BLCKSZ, though.  Caller should reset the current context between calls here.

I'd shorten this to "Memory allocated in this routine is expected to be
released by the caller resetting the ->xxx/current context".

> +static void
> +bt_target_page_check(BtreeCheckState *state)
> +{
> +    OffsetNumber    offset;
> +    OffsetNumber    max;
> +    BTPageOpaque    topaque;
> +
> +    topaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
> +    max = PageGetMaxOffsetNumber(state->target);
> +
> +    elog(POSITION, "verifying %u items on %s block %u", max,
> +         P_ISLEAF(topaque) ? "leaf":"internal", state->targetblock);

I think it's fine if you don't, but this could theoretically integrate
with the progress stuff added in 9.6 (as in
pgstat_progress_start_command()).  That'd allow for a lot more
convenient display of this than thousands of debug messages or nothing.


> +    /*
> +     * Loop over page items, but don't start from P_HIKEY (don't have iteration
> +     * directly considering high key item, if any).

I don't understand the parenthetical comment as phrased.


> +        /*
> +         *   ********************
> +         *   * High key check   *
> +         *   ********************
> +         *
> +         * If there is a high key, which there must be for a non-rightmost
> +         * page, check that it actually is upper bound on all page items.

"which there must be for" sounds weird.


> +         * We prefer to check all items, rather than checking just the first
> +         * and trusting that the operator class obeys the transitive law (which
> +         * implies that all subsequent items also respected the high key
> +         * invariant if they pass the item order check).

s/prefer to//


> +        /*
> +         *   ********************
> +         *   * Item order check *
> +         *   ********************
> +         *
> +         * Check that items are stored on page in logical order, by checking
> +         * current item is less than or equal to next item (if any).
> +         */

Does that kind of comment look reasonable after pgindent? Actually,
could you just pindent this patch (including typedefs.list additions)
and repair the bad looking damage?


> +        if (OffsetNumberNext(offset) <= max &&
> +            !invariant_key_less_than_equal_offset(state, skey,
> +                                                  OffsetNumberNext(offset)))
> +        {
> +            char       *itid,  *htid,
> +                       *nitid, *nhtid;
> +
> +            itid = psprintf("(%u,%u)", state->targetblock, offset);
> +            htid = psprintf("(%u,%u)",
> +                            ItemPointerGetBlockNumber(&(itup->t_tid)),
> +                            ItemPointerGetOffsetNumber(&(itup->t_tid)));
> +            nitid = psprintf("(%u,%u)", state->targetblock,
> +                             OffsetNumberNext(offset));

Do all these separate psprintfs have much of a point?  It seems just as
easy to includ ethem in the actual ereport?


> +         * next/sibling page.  There may not be such an item available from
> +         * sibling for various reasons, though (e.g., target is the rightmost
> +         * page on level).
> +         */

Is the comma after e.g. correct?  </mega-pedanticism-from-esl-person>


> +                ereport(CORRUPTION,
> +                        (errcode(ERRCODE_INDEX_CORRUPTED),
> +                         errmsg("cross page item order invariant violated for index \"%s\"",
> +                                RelationGetRelationName(state->rel)),
> +                         errdetail_internal("Last item on page tid=(%u,%u) "
> +                                            "page lsn=%X/%X.",
> +                                            state->targetblock, offset,
> +                                            (uint32) (state->targetlsn >> 32),
> +                                            (uint32) state->targetlsn)));

I'm am *wondering* (as in not sure), if we could get rid of all the "for
index ..." bits and use an errcontext for that instead, which'd be
included in all messages.


> +    /*
> +     * General notes on concurrent page splits and page deletion:
> +     *
> +     * Routines like _bt_search() don't require *any* page split interlock when
> +     * descending the tree, including something very light like a buffer pin.
> +     * That's why it's okay that we don't either.  This avoidance of any need
> +     * to "couple" buffer locks is the raison d' etre of the Lehman & Yao
> +     * algorithm, in fact.

raison d'etre comment seems more like something belonging to
nbtree/README than this, but I don't care much.


> +    /*
> +     * No ExclusiveLock held case -- why it's safe to proceed.
> +     *
> +     * Problem:
> +     *
> +     * We must avoid false positive reports of corruption when caller treats
> +     * item returned here as an upper bound on target's last item.  In general,
> +     * false positives are disallowed.  Ensuring they don't happen in
> +     * !exclusivelock case is subtle.

Maybe s/Ensuring/The reasons why they don't happen in ...'?


> +     * A concurrent page deletion by VACUUM of the target page can result in
> +     * the insertion of items on to this right sibling page that would

"on to this right sibling page"?


> +     * previously have been inserted on our target page.  There might have been
> +     * insertions that followed target's downlink after it was made to point to

missing trailing 'the'?  I'm really not a native speaker (and thus might
be off the mark), but making these sentences a bit clearer really
wouldn't hurt.


> +static void
> +bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
> +                  ScanKey targetkey)
> +{
> +    OffsetNumber    offset;
> +    OffsetNumber    maxoffset;
> +    Page            child;
> +    BTPageOpaque    copaque;
> +
> +    /*
> +     * Caller must have ExclusiveLock on target relation, because of
> +     * considerations around page deletion by VACUUM.
> +     *
> +     * N.B.: In general, page deletion deletes the right sibling's downlink,

s/N.B./NB/ by precedence ;)


> +     * We prefer to check all items, rather than checking just the first and
> +     * trusting that the operator class obeys the transitive law.
> +     */

s/We prefer//


> +    child = palloc_btree_page(state, childblock);
> +    copaque = (BTPageOpaque) PageGetSpecialPointer(child);
> +    maxoffset = PageGetMaxOffsetNumber(child);
> +
> +    for (offset = P_FIRSTDATAKEY(copaque);
> +         offset <= maxoffset;
> +         offset = OffsetNumberNext(offset))
> +    {
> +        /*
> +         * Skip comparison of target page key against "minus infinity" item, if
> +         * any.  Checking it would indicate that it's not an upper bound, but
> +         * that's only because of the hard-coding within _bt_compare().
> +         */
> +        if (OFFSET_IS_MINUS_INFINITY(copaque, offset))
> +            continue;

FWIW, I'd even convert this to an inline function rather than a macro.
Macros are useful when you have to do stuff you can't do otherwise, but
shouldn't be used much otherwise.


> +        if (!invariant_key_less_than_equal_nontarget_offset(state, child,
> +                                                            targetkey, offset))
> +            ereport(CORRUPTION,
> +                    (errcode(ERRCODE_INDEX_CORRUPTED),

What's your reasoning for differing sometimes using
ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE and sometimes ERRCODE_INDEX_CORRUPTED?


> +                     errmsg("down-link lower bound invaariant violated for index \"%s\"",

s/invaariant/invariant/



> +/*
> + * invariant_key_greater_than_equal_offset()

I'd personally just remove the function name in these, but that's personal
taste and i'll leave you the decision.


- Andres



Re: [HACKERS] amcheck (B-Tree integrity checking tool)

From
Robert Haas
Date:
On Thu, Feb 9, 2017 at 3:05 AM, Andres Freund <andres@anarazel.de> wrote:
>> +#define CHECK_SUPERUSER() { \
>> +             if (!superuser()) \
>> +                     ereport(ERROR, \
>> +                                     (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \
>> +                                      (errmsg("must be superuser to use verification functions")))); }
>
> Please make this either a function or just copy - the macro doesn't buy
> us anything except weirder looking syntax.

Better yet, remove the check altogether and just revoke access from
public.  Then the default policy is superuser-only, but the DBA can
grant access to who they wish.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] amcheck (B-Tree integrity checking tool)

From
Andres Freund
Date:
On 2017-02-09 14:48:18 -0500, Robert Haas wrote:
> On Thu, Feb 9, 2017 at 3:05 AM, Andres Freund <andres@anarazel.de> wrote:
> >> +#define CHECK_SUPERUSER() { \
> >> +             if (!superuser()) \
> >> +                     ereport(ERROR, \
> >> +                                     (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \
> >> +                                      (errmsg("must be superuser to use verification functions")))); }
> >
> > Please make this either a function or just copy - the macro doesn't buy
> > us anything except weirder looking syntax.
>
> Better yet, remove the check altogether and just revoke access from
> public.  Then the default policy is superuser-only, but the DBA can
> grant access to who they wish.

That works for me as well.



Re: [HACKERS] amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Thu, Feb 9, 2017 at 12:05 AM, Andres Freund <andres@anarazel.de> wrote:
> reindex_index isn't necessarily comparable, because
> RangeVarCallbackForReindexIndex() is, on first blush at least, careful
> enough about the locking, and reindex_relation only gets the index list
> after building the index:
>         /*
>          * Find and lock index, and check permissions on table; use callback to
>          * obtain lock on table first, to avoid deadlock hazard.  The lock level
>          * used here must match the index lock obtained in reindex_index().
>          */
>
> I think you ought to do something similar.

Okay. Looking into it, but see my response to your later remarks below.

> I'd really like to have something like relation_check(), index_check()
> which calls the correct functions for the relevan index types (and
> possibly warns about !nbtree indexes not being checked?).

I feel that that's something for an in-core facility that exposes this
through DDL. One good reason to not do that just yet is that I'm not
completely comfortable with what a uniform interface looks like here.
And frankly, I really just want to get this out of the way, to prove
the general concept of verification tooling. amcheck v1 is a good
start, but much work remains. I think that no one will take the idea
completely seriously until it is usable as a core extension. Let's get
the ball rolling.

>> +#define CHECK_SUPERUSER() { \
>> +             if (!superuser()) \
>> +                     ereport(ERROR, \
>> +                                     (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \
>> +                                      (errmsg("must be superuser to use verification functions")))); }
>
> Please make this either a function or just copy - the macro doesn't buy
> us anything except weirder looking syntax.

It's actually from pageinspect, but okay.

>> +#define CORRUPTION           ERROR
>> +#define CONCERN                      DEBUG1
>> +#define POSITION             DEBUG2
>> +#endif
>
> Please remove this block and use the normal elevels.

Okay.

>> +/*
>> + * State associated with verifying a B-Tree index
>> + */
>> +typedef struct BtreeCheckState
>> +{
>> +     /*
>> +      * Unchanging state, established at start of verification:
>> +      *
>> +      * rel:                 B-Tree Index Relation
>> +      * exclusivelock:       ExclusiveLock held on rel; else AccessShareLock
>> +      * checkstrategy:       Buffer access strategy
>> +      * targetcontext:       Target page memory context
>> +      */
>> +     Relation                                rel;
>> +     bool                                    exclusivelock;
>> +     BufferAccessStrategy    checkstrategy;
>> +     MemoryContext                   targetcontext;
>
> Can you move the field comments to the individual fields? Separating
> them makes it more likely to get out of date (obviously keep the
> header).  I'd also rename 'exclusivelock' to 'exclusivelylocked' or
> such.

Okay.

>> +     /*
>> +      * Mutable state, for verification of particular page:
>> +      *
>> +      * target:              Main target page
>> +      * targetblock:         Main target page's block number
>> +      * targetlsn:           Main target page's LSN
>
> Same here.

Okay.

>> +PG_FUNCTION_INFO_V1(bt_index_check);
>> +PG_FUNCTION_INFO_V1(bt_index_parent_check);
>
> I think it'd be good to avoid adding a lot of functions for each
> additional type of check we can think of.  I think adding a named 'bool
> check_parents = false' argument or such would be better.  Then we can
> simply add more and more checks subsequently, and we can have
> 'bool all_checks = false' so people can trivially check everything,
> instead of having to add new checks in each new function.

I strongly disagree. I would agree if the lock strength were the same
in both cases, but it isn't. Varying the strength of heavyweight lock
taken based on an argument to a function is a massive foot-gun IMV. I
do think that that's the logical way to split out functionality. I
don't think that there is likely to be a problem with adding stuff in
the future. It'll either be something that we can do in passing
anyway, in which case we can just add it to the existing checks
harmlessly. Or, it will be something like a tool that verifies the
heap is consistent with a given index, which is vastly more expensive
in terms of CPU and I/O costs, and yet is probably no worse than
bt_index_check() in terms of lock strength (AccessShareLock).

My thinking here is that users really ought to use bt_index_check() in
almost all cases, not bt_index_parent_check(), which is more of a tool
for hackers that are developing new B-Tree features. The distinction
is blurry, in fact, but the lock strength requirements of
bt_index_parent_check() make it vastly less useful. Especially
considering that it seems fairly unlikely that it will ever catch a
problem that bt_index_check() would not have. bt_index_parent_check()
will become a lot more important when we implement suffix truncation
in internal pages. (I have a very rough patch that does this, actually
-- it passes the regression tests, and verification by amcheck, but
isn't anywhere near ready to post.)

>> +/*
>> + * bt_index_check(index regclass)
>> + *
>> + * Verify integrity of B-Tree index.
>> + *
>> + * Only acquires AccessShareLock on index relation.  Does not consider
>> + * invariants that exist between parent/child pages.
>> + */

> I'm *VERY* strongly against releasing locks on user relations early.

Why? pageinspect does this. I think it could be really handy for users
that want to write SQL queries when using bt_index_parent_check().
It's a bit weird, in the sense that it's a special exception that has
to be explained sometimes, but I think it is worth it for something
like this.

>> +     /*
>> +      * We must lock table before index to avoid deadlocks.  However, if the
>> +      * passed indrelid isn't an index then IndexGetRelation() will fail.
>> +      * Rather than emitting a not-very-helpful error message, postpone
>> +      * complaining, expecting that the is-it-an-index test below will fail.
>> +      */
>> +     heapid = IndexGetRelation(indrelid, true);
>> +     if (OidIsValid(heapid))
>> +             heaprel = heap_open(heapid, ShareLock);
>> +     else
>> +             heaprel = NULL;
>> +
>> +     /*
>> +      * Open the target index relations separately (like relation_openrv(), but
>> +      * with heap relation locked first to prevent deadlocking).  In hot standby
>> +      * mode this will raise an error.
>> +      */
>> +     indrel = index_open(indrelid, ExclusiveLock);
>
> So yea, this is what I was complaining about re locking / index & table
> relation IIRC.  Note you're also deadlock prone by violating the rule
> about always locking the relation first.

This is based on brin_summarize_new_values(). I'm not doing a recheck,
though, because I don't actually need the heap relation for anything
(I suppose I imagined that that was okay, possibly failing to consider
some subtlety beyond "consistency is good"). If I added a recheck for
the heap relation in the style of brin_summarize_new_values(), would
you consider this item closed, and all heavyweight lock concerns
addressed?

The comments here are almost a direct lift -- is
brin_summarize_new_values() wrong to itself claim to be
deadlock-safe?.

(BTW, brin_summarize_new_values(), an SQL-callable function, also
releases heavyweight locks early.)

>> +/*
>> + * btree_index_checkable()
>> + *
>> + * Basic checks about the suitability of a relation for checking as a B-Tree
>> + * index.
>> + */
>> +static void
>> +btree_index_checkable(Relation rel)
>> +{
>> +     if (rel->rd_rel->relkind != RELKIND_INDEX ||
>> +             rel->rd_rel->relam != BTREE_AM_OID)
>> +             ereport(ERROR,
>> +                             (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>> +                              errmsg("only nbtree access method indexes are supported"),
>> +                              errdetail("Relation \"%s\" is not a B-Tree index.",
>> +                                                RelationGetRelationName(rel))));
>
> I think this message could stand some rephrasing.  There's a significant
> overlap between msg and detail, and they use different phrasing for
> btree.

Okay.

>> +     if (!IndexIsValid(rel->rd_index))
>> +             ereport(ERROR,
>> +                             (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>> +                              errmsg("cannot check index \"%s\"",
>> +                                             RelationGetRelationName(rel)),
>> +                              errdetail("Index is not valid")));
>
> I'm perfectly fine with not changing that, but IIUC it'd be fine to
> weaken this to IndexIsLive() - right?

I believe so, but prefer to be more restrictive in the absence of a
good reason not to be.

>> + * It is the caller's responsibility to acquire appropriate heavyweight lock on
>> + * the index relation, and advise us if extra checks are safe when an
>> + * ExclusiveLock is held.  An ExclusiveLock is generally assumed to prevent any
>> + * kind of physical modification to the index structure, including
>> + * modifications that VACUUM may make.
>
> It might make sense to be a bit more explicit about the structure bit
> (which I assume refers to the fact that modifications to individual
> pages are possible (re killitems)?).

It's not really about killitems, which I explained won't matter here,
but I will add a bit about that.

>> + */
>> +static void
>> +bt_check_every_level(Relation rel, bool exclusivelock)
>
> Hm, is every level accurate - this is more like checking every
> "reachable" page, right?

Every level should have at least one page that is reachable (not
half-dead or fully dead), unless perhaps VACUUM marks the last leaf
page in the entire index as half-dead and we land on that. I think
that that's such a rare case that it shouldn't alter how we name this
function at all, which really does do what it says almost always. Note
that there are pretty severe restrictions on when vacuum can do
page-level deletion of an internal page. This is what leads to index
bloat with sparse deletion patterns (one problem I want to address by
adding suffix truncation to internal pages at a later date).

>> +     /*
>> +      * Certain deletion patterns can result in "skinny" B-Tree indexes, where
>> +      * the fast root and true root differ.
>> +      *
>> +      * Start from the true root, not the fast root, unlike conventional index
>> +      * scans.  This approach is more thorough, and removes the risk of
>> +      * following a stale fast root from the meta page.
>> +      */
>> +     if (metad->btm_fastroot != metad->btm_root)
>> +             ereport(CONCERN,
>> +                             (errcode(ERRCODE_DUPLICATE_OBJECT),
>> +                              errmsg("fast root mismatch in index %s",
>> +                                             RelationGetRelationName(rel)),
>> +                              errdetail_internal("Fast block %u (level %u) "
>> +                                                                     "differs from true root "
>> +                                                                     "block %u (level %u).",
>> +                                                                     metad->btm_fastroot, metad->btm_fastlevel,
>> +                                                                     metad->btm_root, metad->btm_level)));
>
> I'd weaken the message to something that sounds less like an actual
> problem... - which this is not, right?

No, it isn't a real problem. I'll do something about it.

>> +/*
>> + * bt_check_level_from_leftmost()
>> + *
>> + * Given a left-most block at some level, move right, verifying each page
>> + * individually (with more verification across pages for "exclusivelock"
>> + * callers).  Caller should pass the true root page as the leftmost initially,
>> + * working their way down by passing what is returned for the last call here
>> + * until level 0 (leaf page level) was reached.
>> + *
>> + * Returns state for next call, if any.
>
> Wouldn't "'sets state for the next call..." be more accurate than
> returning?

Okay.

>> +static BtreeLevel
>> +bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
>> +{
>
>> +             if (P_IGNORE(opaque))
>> +             {
>> +                     if (P_RIGHTMOST(opaque))
>> +                             ereport(CORRUPTION,
>> +                                             (errcode(ERRCODE_INDEX_CORRUPTED),
>> +                                              errmsg("block %u fell off the end of index \"%s\"",
>> +                                                             current, RelationGetRelationName(state->rel))));
>
> I think this could stand a bit more formal language.

I disagree, because it's a direct lift from nbtree itself.

>> +                     else
>> +                             ereport(CONCERN,
>> +                                             (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>> +                                              errmsg("block %u of index \"%s\" ignored",
>> +                                                             current, RelationGetRelationName(state->rel))));
>
> missing verb. Also, is this actually worth logging (even at such a low level)?

I'm not following -- "ignore" is a verb.

It may not be worth logging. I just don't know, and prefer to err on
the side of including it.

>> +                     /*
>> +                      * Before beginning any non-trivial examination of level, establish
>> +                      * state to be passed by caller for next call here, if this isn't
>> +                      * last call/leaf level.
>
> "state to be passed by caller for next call here" is hard to
> understand.  "Prepare state for next bt_check_level_from_leftmost
> invocation for the next level"?

Okay.

>> +                      * There should be at least one non-ignorable page per level.
>> +                      */
>
> I might have missed something, but you're not checking that, are you?

No. As I said just now, I think it might not be true iff every single
item (every item at the leaf level) has been deleted and is being
vacuumed. If there was even one live item in one leaf page, it would
be guaranteed to have a live parent, which would itself be guaranteed
to if it wasn't the true root, and so on, until the true root. It's
really hard to "lose" a level by having the true root and fast root
differ, in general (of course, we can never truly lose a level -- the
pages can never be reclaimed entirely from a deleted level's last
page). I guess that this fast root stuff was added only for the
benefit of cases where there is continuous bulk deletion and
insertion.

> "mutual" doesn't seem to add much here.

Okay.

>> +             /* Try to detect circular links */
>> +             if (current == leftcurrent || current == opaque->btpo_prev)
>> +                     ereport(CORRUPTION,
>> +                                     (errcode(ERRCODE_INDEX_CORRUPTED),
>> +                                      errmsg("circular link chain found in block %u of index \"%s\"",
>> +                                                     current, RelationGetRelationName(state->rel))));
>
> That's not really reliable though, is it?  I've seen corrupted link
> chains (leading to endless vacuums and such) a number of times, so maybe
> explicitly keeping track of already visited pages within a level would
> be worthwhile?

It's not reliable -- it's one of the "shot in the dark, but why not?"
checks, which are the majority -- no reason to not be defensive.
However, it's highly likely that cases like the one you describe would
be caught by the cross-page check, or even high-key check, because
they're probably explained by an underlying problem with transposed
pages, maybe not even 8KiB aligned pages (e.g., buggy firmware
controllers have been known to do this with their 4KiB pages, but you
wouldn't necessarily notice that easily with an int4 index or
similar).

I feel relatively confident that we don't need to do better with that
one, since it will never make a difference; the only case where that
might be wrong that I can think of is where you have an awful lot of
duplicates, which is unfortunately not that uncommon with Postgres.
Even if it could help, that sounds like something that won't pay for
itself.

> Hm. I think it'd be useful if we also made sure (or in the caller) that
> the page actually is at the correct level.  I seem to recall seing a
> case where downlinks were "circular" due to corruption, leading to stuck
> scans.

That's a good idea, because it is, if nothing else, an easy,
low-overhead additional check.

>> + * Note:  This routine is not especially proactive in freeing memory.  High
>> + * watermark memory consumption is bound to some small fixed multiple of
>> + * BLCKSZ, though.  Caller should reset the current context between calls here.
>
> I'd shorten this to "Memory allocated in this routine is expected to be
> released by the caller resetting the ->xxx/current context".

Okay.

>> +static void
>> +bt_target_page_check(BtreeCheckState *state)
>> +{
>> +     OffsetNumber    offset;
>> +     OffsetNumber    max;
>> +     BTPageOpaque    topaque;
>> +
>> +     topaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
>> +     max = PageGetMaxOffsetNumber(state->target);
>> +
>> +     elog(POSITION, "verifying %u items on %s block %u", max,
>> +              P_ISLEAF(topaque) ? "leaf":"internal", state->targetblock);
>
> I think it's fine if you don't, but this could theoretically integrate
> with the progress stuff added in 9.6 (as in
> pgstat_progress_start_command()).  That'd allow for a lot more
> convenient display of this than thousands of debug messages or nothing.

Never enough time. :-)

>> +     /*
>> +      * Loop over page items, but don't start from P_HIKEY (don't have iteration
>> +      * directly considering high key item, if any).
>
> I don't understand the parenthetical comment as phrased.

It means that the high key item, which is physically the first item on
the leaf level, and the second item on internal levels. (That have an
explicit "minus infinity" downlink item), is skipped in the sense that
it doesn't get its own iteration of the loop (note that there is only
a high key on non-rightmost-at-level pages -- I like to say that the
rightmost pages have an implicit/imaginary "positive infinity" high
key, to illustrate the symmetry between high keys and downlinks.
Although, we don't say that in any code.)

> "which there must be for" sounds weird.

> s/prefer to//

Okay.

> Does that kind of comment look reasonable after pgindent? Actually,
> could you just pindent this patch (including typedefs.list additions)
> and repair the bad looking damage?

I'll check.

> Do all these separate psprintfs have much of a point?  It seems just as
> easy to includ ethem in the actual ereport?

I thought it looked very convoluted that way.

>> +              * next/sibling page.  There may not be such an item available from
>> +              * sibling for various reasons, though (e.g., target is the rightmost
>> +              * page on level).
>> +              */
>
> Is the comma after e.g. correct?  </mega-pedanticism-from-esl-person>

This is a matter of style. "e.g.," is generally considered American style.

>> +                             ereport(CORRUPTION,
>> +                                             (errcode(ERRCODE_INDEX_CORRUPTED),
>> +                                              errmsg("cross page item order invariant violated for index \"%s\"",
>> +                                                             RelationGetRelationName(state->rel)),
>> +                                              errdetail_internal("Last item on page tid=(%u,%u) "
>> +                                                                                     "page lsn=%X/%X.",
>> +                                                                                     state->targetblock, offset,
>> +                                                                                     (uint32) (state->targetlsn >>
32),
>> +                                                                                     (uint32) state->targetlsn)));
>
> I'm am *wondering* (as in not sure), if we could get rid of all the "for
> index ..." bits and use an errcontext for that instead, which'd be
> included in all messages.

Doesn't seem worth it to me.

>> +     /*
>> +      * General notes on concurrent page splits and page deletion:
>> +      *
>> +      * Routines like _bt_search() don't require *any* page split interlock when
>> +      * descending the tree, including something very light like a buffer pin.
>> +      * That's why it's okay that we don't either.  This avoidance of any need
>> +      * to "couple" buffer locks is the raison d' etre of the Lehman & Yao
>> +      * algorithm, in fact.
>
> raison d'etre comment seems more like something belonging to
> nbtree/README than this, but I don't care much.

I think it adds value. This is a learning tool for nbtree. That's a
secondary goal, at least. We're *way* too shy at pointing out simple
facts like this, that are of very significant consequence. It's only
one sentence.

> Maybe s/Ensuring/The reasons why they don't happen in ...'?

> "on to this right sibling page"?

> missing trailing 'the'?  I'm really not a native speaker (and thus might
> be off the mark), but making these sentences a bit clearer really
> wouldn't hurt.

Okay. The sentences are very complicated, but that's justified by the
complexity of the underlying issue. I rewrote that text several times.

> s/N.B./NB/ by precedence ;)

> s/We prefer//

Okay.

>> +             if (OFFSET_IS_MINUS_INFINITY(copaque, offset))
>> +                     continue;
>
> FWIW, I'd even convert this to an inline function rather than a macro.
> Macros are useful when you have to do stuff you can't do otherwise, but
> shouldn't be used much otherwise.

Okay.

>> +             if (!invariant_key_less_than_equal_nontarget_offset(state, child,
>> +
targetkey, offset))
 
>> +                     ereport(CORRUPTION,
>> +                                     (errcode(ERRCODE_INDEX_CORRUPTED),
>
> What's your reasoning for differing sometimes using
> ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE and sometimes ERRCODE_INDEX_CORRUPTED?

The ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE ereports are not errors,
and do not indicate corruption.

-- 
Peter Geoghegan



Re: [HACKERS] amcheck (B-Tree integrity checking tool)

From
Andres Freund
Date:
On 2017-02-09 13:29:42 -0800, Peter Geoghegan wrote:
> > I'd really like to have something like relation_check(), index_check()
> > which calls the correct functions for the relevan index types (and
> > possibly warns about !nbtree indexes not being checked?).
> 
> I feel that that's something for an in-core facility that exposes this
> through DDL. One good reason to not do that just yet is that I'm not
> completely comfortable with what a uniform interface looks like here.
> And frankly, I really just want to get this out of the way, to prove
> the general concept of verification tooling. amcheck v1 is a good
> start, but much work remains. I think that no one will take the idea
> completely seriously until it is usable as a core extension. Let's get
> the ball rolling.

I'm not convinced at all by that argument.  If we want something in
core, having this in contrib doesn't help much.  Nor do I see much
point in explicit syntax.


> >> +#define CHECK_SUPERUSER() { \
> >> +             if (!superuser()) \
> >> +                     ereport(ERROR, \
> >> +                                     (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \
> >> +                                      (errmsg("must be superuser to use verification functions")))); }
> >
> > Please make this either a function or just copy - the macro doesn't buy
> > us anything except weirder looking syntax.
> 
> It's actually from pageinspect, but okay.

We have a lot of dodgy stuff in places ;)


> >> +PG_FUNCTION_INFO_V1(bt_index_check);
> >> +PG_FUNCTION_INFO_V1(bt_index_parent_check);
> >
> > I think it'd be good to avoid adding a lot of functions for each
> > additional type of check we can think of.  I think adding a named 'bool
> > check_parents = false' argument or such would be better.  Then we can
> > simply add more and more checks subsequently, and we can have
> > 'bool all_checks = false' so people can trivially check everything,
> > instead of having to add new checks in each new function.
> 
> I strongly disagree. I would agree if the lock strength were the same
> in both cases, but it isn't. Varying the strength of heavyweight lock
> taken based on an argument to a function is a massive foot-gun IMV.

Meh.  I don't think that's a serious problem, nor is without precedent
(say same toplevel DDL with differing lock levels depending on
options). Nor do the function names confer that any more efficiently
than parameters.


> I do think that that's the logical way to split out functionality. I
> don't think that there is likely to be a problem with adding stuff in
> the future. It'll either be something that we can do in passing
> anyway, in which case we can just add it to the existing checks
> harmlessly. Or, it will be something like a tool that verifies the
> heap is consistent with a given index, which is vastly more expensive
> in terms of CPU and I/O costs, and yet is probably no worse than
> bt_index_check() in terms of lock strength (AccessShareLock).

But you quite possibly want to do all three in one pass - and that'll be
ugly with your API.  With your API I'll not be able to do the parent
check and the heap check at once.  Lest we add fourth function that does
all of it.


> My thinking here is that users really ought to use bt_index_check() in
> almost all cases, not bt_index_parent_check(), which is more of a tool
> for hackers that are developing new B-Tree features. The distinction
> is blurry, in fact, but the lock strength requirements of
> bt_index_parent_check() make it vastly less useful.

A quite reasonable patter is to do this verification on standbys that
you can either bring up separately and/or just temporarily pause.  Or
just backups before declaring them valid.


> >> +/*
> >> + * bt_index_check(index regclass)
> >> + *
> >> + * Verify integrity of B-Tree index.
> >> + *
> >> + * Only acquires AccessShareLock on index relation.  Does not consider
> >> + * invariants that exist between parent/child pages.
> >> + */
> 
> > I'm *VERY* strongly against releasing locks on user relations early.
> 
> Why? pageinspect does this.

Again, some parts of the code doing something bad isn't a good argument
for doing again. Releasing locks early is a bad pattern, because backend
code isn't generally safe against it, we have special code-paths for
catalog tables to support it for those.


> >> +     /*
> >> +      * We must lock table before index to avoid deadlocks.  However, if the
> >> +      * passed indrelid isn't an index then IndexGetRelation() will fail.
> >> +      * Rather than emitting a not-very-helpful error message, postpone
> >> +      * complaining, expecting that the is-it-an-index test below will fail.
> >> +      */
> >> +     heapid = IndexGetRelation(indrelid, true);
> >> +     if (OidIsValid(heapid))
> >> +             heaprel = heap_open(heapid, ShareLock);
> >> +     else
> >> +             heaprel = NULL;
> >> +
> >> +     /*
> >> +      * Open the target index relations separately (like relation_openrv(), but
> >> +      * with heap relation locked first to prevent deadlocking).  In hot standby
> >> +      * mode this will raise an error.
> >> +      */
> >> +     indrel = index_open(indrelid, ExclusiveLock);
> >
> > So yea, this is what I was complaining about re locking / index & table
> > relation IIRC.  Note you're also deadlock prone by violating the rule
> > about always locking the relation first.
> 
> This is based on brin_summarize_new_values().
> I'm not doing a recheck,
> though, because I don't actually need the heap relation for anything
> (I suppose I imagined that that was okay, possibly failing to consider
> some subtlety beyond "consistency is good").

We iirc rely on indexes not being locked without the heap also being
locked... I see no reason to be careful here.

The deadlock part of my complaint is bogus.


> The comments here are almost a direct lift -- is
> brin_summarize_new_values() wrong to itself claim to be
> deadlock-safe?.

No, that seems ok.


> (BTW, brin_summarize_new_values(), an SQL-callable function, also
> releases heavyweight locks early.)

And that's afaics actually broken and a bug.



> >> +     if (!IndexIsValid(rel->rd_index))
> >> +             ereport(ERROR,
> >> +                             (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> >> +                              errmsg("cannot check index \"%s\"",
> >> +                                             RelationGetRelationName(rel)),
> >> +                              errdetail("Index is not valid")));
> >
> > I'm perfectly fine with not changing that, but IIUC it'd be fine to
> > weaken this to IndexIsLive() - right?
> 
> I believe so, but prefer to be more restrictive in the absence of a
> good reason not to be.

Seems useful to be able to verify an index during CONCURRENTLY, after
the initial build, but before the update pass, it's not like we didn't
have bugs in it :(


> >> +static void
> >> +bt_check_every_level(Relation rel, bool exclusivelock)
> >
> > Hm, is every level accurate - this is more like checking every
> > "reachable" page, right?
> 
> Every level should have at least one page that is reachable (not
> half-dead or fully dead), unless perhaps VACUUM marks the last leaf
> page in the entire index as half-dead and we land on that. I think
> that that's such a rare case that it shouldn't alter how we name this
> function at all, which really does do what it says almost always. Note
> that there are pretty severe restrictions on when vacuum can do
> page-level deletion of an internal page. This is what leads to index
> bloat with sparse deletion patterns (one problem I want to address by
> adding suffix truncation to internal pages at a later date).

I think my point was different - it's not that we're not checking every
level, it's that we're checking *more* than just "levels".



> > What's your reasoning for differing sometimes using
> > ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE and sometimes ERRCODE_INDEX_CORRUPTED?
> 
> The ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE ereports are not errors,
> and do not indicate corruption.

Don't think errcodes make much sense for debug stuff, nor do I think
that ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE would be a good fit for
something that's not an error.

Greetings,

Andres Freund



Re: [HACKERS] amcheck (B-Tree integrity checking tool)

From
Robert Haas
Date:
On Thu, Feb 9, 2017 at 4:57 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-02-09 13:29:42 -0800, Peter Geoghegan wrote:
>> > I'd really like to have something like relation_check(), index_check()
>> > which calls the correct functions for the relevan index types (and
>> > possibly warns about !nbtree indexes not being checked?).
>>
>> I feel that that's something for an in-core facility that exposes this
>> through DDL. One good reason to not do that just yet is that I'm not
>> completely comfortable with what a uniform interface looks like here.
>> And frankly, I really just want to get this out of the way, to prove
>> the general concept of verification tooling. amcheck v1 is a good
>> start, but much work remains. I think that no one will take the idea
>> completely seriously until it is usable as a core extension. Let's get
>> the ball rolling.
>
> I'm not convinced at all by that argument.  If we want something in
> core, having this in contrib doesn't help much.  Nor do I see much
> point in explicit syntax.

I don't see any need for this to be in core, nor do I think we need
explicit syntax.  But given that, I also don't think that we need the
generalized infrastructure you proposed in the quoted section above.

>> I strongly disagree. I would agree if the lock strength were the same
>> in both cases, but it isn't. Varying the strength of heavyweight lock
>> taken based on an argument to a function is a massive foot-gun IMV.
>
> Meh.  I don't think that's a serious problem, nor is without precedent
> (say same toplevel DDL with differing lock levels depending on
> options). Nor do the function names confer that any more efficiently
> than parameters.

Really?  check_btree(blah, true) vs. check_btree(blah, false) is no
more or less clear than check_btree_for_plausible_errors(blah) vs.
check_btree_in_unreasonable_detail(blah)?

>> I do think that that's the logical way to split out functionality. I
>> don't think that there is likely to be a problem with adding stuff in
>> the future. It'll either be something that we can do in passing
>> anyway, in which case we can just add it to the existing checks
>> harmlessly. Or, it will be something like a tool that verifies the
>> heap is consistent with a given index, which is vastly more expensive
>> in terms of CPU and I/O costs, and yet is probably no worse than
>> bt_index_check() in terms of lock strength (AccessShareLock).
>
> But you quite possibly want to do all three in one pass - and that'll be
> ugly with your API.  With your API I'll not be able to do the parent
> check and the heap check at once.  Lest we add fourth function that does
> all of it.

This however seems like a good point.

> Again, some parts of the code doing something bad isn't a good argument
> for doing again. Releasing locks early is a bad pattern, because backend
> code isn't generally safe against it, we have special code-paths for
> catalog tables to support it for those.

I don't agree with that.  The reason we keep relation locks until the
end of the transaction is to avoid surprising application code, not
because the system can't tolerate it.  But here it's more likely that
retaining the lock will surprise the user.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] amcheck (B-Tree integrity checking tool)

From
Andres Freund
Date:
On 2017-02-09 17:12:58 -0500, Robert Haas wrote:
> On Thu, Feb 9, 2017 at 4:57 PM, Andres Freund <andres@anarazel.de> wrote:
> > Meh.  I don't think that's a serious problem, nor is without precedent
> > (say same toplevel DDL with differing lock levels depending on
> > options). Nor do the function names confer that any more efficiently
> > than parameters.
> 
> Really?  check_btree(blah, true) vs. check_btree(blah, false) is no
> more or less clear than check_btree_for_plausible_errors(blah) vs.
> check_btree_in_unreasonable_detail(blah)?

Except that the proposed names aren't remotely like that... ;).

And I proposed documenting named parameters, and
check_btree(performing_check_requiring_exclusive_locks => true) is just
about as expressive.


> > Again, some parts of the code doing something bad isn't a good argument
> > for doing again. Releasing locks early is a bad pattern, because backend
> > code isn't generally safe against it, we have special code-paths for
> > catalog tables to support it for those.
> 
> I don't agree with that.  The reason we keep relation locks until the
> end of the transaction is to avoid surprising application code, not
> because the system can't tolerate it.  But here it's more likely that
> retaining the lock will surprise the user.

It's both.  A bunch of code paths rely on early release ony happening on
catalogs. E.g., and that's just the first thing that comes to mind,
cache invalidations which really only are guaranteed to be delivered and
visibile for other cache accesses, if the lock is only released after
the end of the transaction.  Object accesses like relation_open() accept
invalidations *after* acquiring the lock. Combined with lock releases
only happening *after* transaction commits that guarantees an up2date
view of the cache; but if others release the lock earlier and have cache
invalidations, that doesn't work anymore.  Now you can argue that it's
possibly harmless here because there's presumably no way this can ever
trigger cache invals - and you're probably right - but this isn't the
only place with such assumption and you'd definitely have to argue *why*
it's right.

There were other issues than this, but right of the top of my head I
only remember:
double
IndexBuildHeapRangeScan(Relation heapRelation,
..                /*                 * Since caller should hold ShareLock or better, normally                 * the
onlyway to see this is if it was inserted earlier                 * in our own transaction.  However, it can happen in
              * system catalogs, since we tend to release write lock                 * before commit there.  Give a
warningif neither case                 * applies.                 */                xwait =
HeapTupleHeaderGetXmin(heapTuple->t_data);               if (!TransactionIdIsCurrentTransactionId(xwait))
{                    if (!is_system_catalog)                        elog(WARNING, "concurrent insert in progress within
table\"%s\"",                             RelationGetRelationName(heapRelation));
 

which isn't an issue here, but reinforces my point about the (badly
documented) assumption that we don't release locks on user relations
early.

Andres



Re: [HACKERS] amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Thu, Feb 9, 2017 at 2:32 PM, Andres Freund <andres@anarazel.de> wrote:
> Except that the proposed names aren't remotely like that... ;).
>
> And I proposed documenting named parameters, and
> check_btree(performing_check_requiring_exclusive_locks => true) is just
> about as expressive.

I cannot think of an example, offhand, of where it isn't self-evident
what relation level locks will be acquired when a statement is
prepared. I guess you could say that VACUUM truncation is a bit like
that, but it's not the same IMV.

Maybe there is an easy compromise here. We can add an argument with a
default, "level". It's an "enum" of the style of pg_prewarm (i.e.,
it's actually a text argument). The default is 'medium' or something
like that. Maybe we don't add any other flags for now, or maybe we add
some generic ones that don't actually changed the behavior. Existing
names are retained.

Doesn't that address your concern about an inflexible API? I really
dislike the idea of having the lock strength be determined on the
basis of a function arguments value.

>> I don't agree with that.  The reason we keep relation locks until the
>> end of the transaction is to avoid surprising application code, not
>> because the system can't tolerate it.  But here it's more likely that
>> retaining the lock will surprise the user.

> There were other issues than this, but right of the top of my head I
> only remember:
> double
> IndexBuildHeapRangeScan(Relation heapRelation,
> ..
>                                         /*
>                                          * Since caller should hold ShareLock or better, normally
>                                          * the only way to see this is if it was inserted earlier
>                                          * in our own transaction.  However, it can happen in
>                                          * system catalogs, since we tend to release write lock
>                                          * before commit there.  Give a warning if neither case
>                                          * applies.
>                                          */
>                                         xwait = HeapTupleHeaderGetXmin(heapTuple->t_data);
>                                         if (!TransactionIdIsCurrentTransactionId(xwait))
>                                         {
>                                                 if (!is_system_catalog)
>                                                         elog(WARNING, "concurrent insert in progress within table
\"%s\"",
>                                                                  RelationGetRelationName(heapRelation));
>
> which isn't an issue here, but reinforces my point about the (badly
> documented) assumption that we don't release locks on user relations
> early.

You are right about the substantive issue, I assume, but I have a hard
time agreeing with the idea that it's even badly documented when there
are at least 10 counter-examples of blithely doing this. In any case,
I find it very confusing when you talk about things as established
practice/coding standards when they're very clearly not consistently
adhered to at all. You should at least say "let's not make a bad
situation any worse", or something, so that I don't need to spend 10
minutes pulling my hair out.

-- 
Peter Geoghegan



Re: [HACKERS] amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Thu, Feb 9, 2017 at 2:47 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> which isn't an issue here, but reinforces my point about the (badly
>> documented) assumption that we don't release locks on user relations
>> early.
>
> You are right about the substantive issue, I assume, but I have a hard
> time agreeing with the idea that it's even badly documented when there
> are at least 10 counter-examples of blithely doing this. In any case,
> I find it very confusing when you talk about things as established
> practice/coding standards when they're very clearly not consistently
> adhered to at all. You should at least say "let's not make a bad
> situation any worse", or something, so that I don't need to spend 10
> minutes pulling my hair out.

BTW, aren't there cases where a PARALLEL SAFE function that needs to
acquire locks on some arbitrary relation not referenced in the query
can get locks on its own, which may only last as long as the parallel
worker's lifetime? This could happen *despite* the fact that the
author of the function may have imagined that callers could not
release relation level locks early (i.e., by not releasing them
directly, and so correctly following this "badly documented
assumption").

It seems like the existence of PARALLEL RESTRICTED is primarily
motivated by making stuff that definitely needs the locks to last
until xact end being sure that that happens -- the docs say so. This
seems wholly inconsistent with the idea that you're not supposed to
let that happen under any circumstances. I find all this highly
confusing. Have I missed some further subtlety that applies with
PARALLEL RESTRICTED?

-- 
Peter Geoghegan



Re: [HACKERS] amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Thu, Feb 9, 2017 at 2:32 PM, Andres Freund <andres@anarazel.de> wrote:
> Except that the proposed names aren't remotely like that... ;).

Revision attached -- V5. We now REVOKE ALL on both functions, as
Robert suggested, instead of the previous approach of having a
hard-coded superuser check with enforcement.

> And I proposed documenting named parameters, and
> check_btree(performing_check_requiring_exclusive_locks => true) is just
> about as expressive.

I have not done this, nor have I renamed the functions. I still think
that this is something that we can fix by adding a boolean argument to
each function in the future, or something along those lines. I
*really* hate the idea of having one function with non-obvious,
variable requirements on locking, with locking implications that are
not knowable when we PREPARE an SQL statement calling the function. It
also removes a useful way of have superusers discriminate against the
stronger locking variant bt_index_parent_check() by not granting
execute on it (as an anti-footgun measure). I don't think that it's
too much to ask for you to concede this one point to me (and to
Robert).

I have acted on every other item you raised, without exception,
including retaining the locks for the duration of the transaction. I
also made the functions PARALLEL RESTRICTED, which seemed like a good
idea.

Other notes:

* A nice thing about the additional level check that you suggested is
that it's a cross-level check that is perfectly safe to perform with
only an AccessShareLock, unlike the extra bt_index_parent_check()
check, since the number of levels only ever increases, no matter what.
There is no possible reason why we should ever find that a child page
does not agree with its self-described parent about the level it's on.
I'm glad you thought of this.

* I kept ereport() DEBUG1 calls, but now use non-error errcode for
those. I thought that preserving the various ereport() fields was
worth it, since these traces are rather information dense in a way
that is harder to deal with when using raw elog()s. If you hate this,
you can still change it.

* I have not added code to check the relation permissions, which might
matter if and when the superuser grants execute privs on amcheck
functions. Anyone want to argue that I should have added
relation-level permissions checks? I tend to think it's pointless,
since the superuser almost certainly won't think to consider that
additional factor. I cannot imagine how either function could be used
to leak information, in any case. No error message dumps the contents
of pages (only item pointers).

* pgindent was run against this revision. You were right to want to
get that out of the way now, since it needed to be fixed up by hand to
look reasonable. typedef list also updated.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

From
Amit Kapila
Date:
On Fri, Feb 10, 2017 at 6:45 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Feb 9, 2017 at 2:47 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>>> which isn't an issue here, but reinforces my point about the (badly
>>> documented) assumption that we don't release locks on user relations
>>> early.
>>

I think even if we don't need to retain locks till transaction end in
the present case, but Andres's point has a merit that it is good to
follow an undocumented rule of releasing locks at transaction end for
user relations unless there is some benefit in releasing the locks
early in some situation.  I think that can avoid unnecessary worry
that whether it safe to do in some situation.

>> You are right about the substantive issue, I assume, but I have a hard
>> time agreeing with the idea that it's even badly documented when there
>> are at least 10 counter-examples of blithely doing this. In any case,
>> I find it very confusing when you talk about things as established
>> practice/coding standards when they're very clearly not consistently
>> adhered to at all. You should at least say "let's not make a bad
>> situation any worse", or something, so that I don't need to spend 10
>> minutes pulling my hair out.
>
> BTW, aren't there cases where a PARALLEL SAFE function that needs to
> acquire locks on some arbitrary relation not referenced in the query
> can get locks on its own, which may only last as long as the parallel
> worker's lifetime? This could happen *despite* the fact that the
> author of the function may have imagined that callers could not
> release relation level locks early (i.e., by not releasing them
> directly, and so correctly following this "badly documented
> assumption").
>
> It seems like the existence of PARALLEL RESTRICTED is primarily
> motivated by making stuff that definitely needs the locks to last
> until xact end being sure that that happens -- the docs say so.
>

I don't think so.  I think one of the primary reason of releasing
locks in workers is that we don't have any way to transfer them to the
leader which can release them at transaction end.

> This
> seems wholly inconsistent with the idea that you're not supposed to
> let that happen under any circumstances. I find all this highly
> confusing. Have I missed some further subtlety that applies with
> PARALLEL RESTRICTED?
>

The basic idea of parallel restricted is that any expression will be
considered parallel restricted if that can't be executed in the
worker, for example, a reference to initplans in a query.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] amcheck (B-Tree integrity checking tool)

From
Robert Haas
Date:
On Thu, Feb 9, 2017 at 8:15 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> BTW, aren't there cases where a PARALLEL SAFE function that needs to
> acquire locks on some arbitrary relation not referenced in the query
> can get locks on its own, which may only last as long as the parallel
> worker's lifetime? This could happen *despite* the fact that the
> author of the function may have imagined that callers could not
> release relation level locks early (i.e., by not releasing them
> directly, and so correctly following this "badly documented
> assumption").

Yes.

> It seems like the existence of PARALLEL RESTRICTED is primarily
> motivated by making stuff that definitely needs the locks to last
> until xact end being sure that that happens -- the docs say so. This
> seems wholly inconsistent with the idea that you're not supposed to
> let that happen under any circumstances. I find all this highly
> confusing. Have I missed some further subtlety that applies with
> PARALLEL RESTRICTED?

That's by no means the only thing that could cause you to mark
something PARALLEL RESTRICTED.  I think you might have missed this bit
of the documentation, a few paragraphs up:
   Similarly, functions must be marked <literal>PARALLEL   RESTRICTED</> if they access temporary tables, client
connectionstate,   cursors, prepared statements, or miscellaneous backend-local state which   the system cannot
synchronizeacross workers. For example,   <literal>setseed</> and <literal>random</> are parallel restricted for   this
lastreason.
 

That stuff is the main motivation.  The early lock release stuff could
come up for user-written functions, but the things listed in the
paragraph I've quoted here come up for plenty of built-in functions.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] amcheck (B-Tree integrity checking tool)

From
Robert Haas
Date:
On Thu, Feb 9, 2017 at 5:32 PM, Andres Freund <andres@anarazel.de> wrote:
>> > Again, some parts of the code doing something bad isn't a good argument
>> > for doing again. Releasing locks early is a bad pattern, because backend
>> > code isn't generally safe against it, we have special code-paths for
>> > catalog tables to support it for those.
>>
>> I don't agree with that.  The reason we keep relation locks until the
>> end of the transaction is to avoid surprising application code, not
>> because the system can't tolerate it.  But here it's more likely that
>> retaining the lock will surprise the user.
>
> It's both.  A bunch of code paths rely on early release ony happening on
> catalogs. E.g., and that's just the first thing that comes to mind,
> cache invalidations which really only are guaranteed to be delivered and
> visibile for other cache accesses, if the lock is only released after
> the end of the transaction.  Object accesses like relation_open() accept
> invalidations *after* acquiring the lock. Combined with lock releases
> only happening *after* transaction commits that guarantees an up2date
> view of the cache; but if others release the lock earlier and have cache
> invalidations, that doesn't work anymore.  Now you can argue that it's
> possibly harmless here because there's presumably no way this can ever
> trigger cache invals - and you're probably right - but this isn't the
> only place with such assumption and you'd definitely have to argue *why*
> it's right.

I agree that code which issues cache invalidations needs to hold a
blocking lock until commit IF it's critical for other backends to see
the update.  That's not always the case; for example, if an ALTER
TABLE statement only changes the fillfactor the worst thing that
happens if some other backend fails to see the invalidations is that
some transactions continue to use the old value for a while.  That's
why we now allow fillfactor and some other things using only
ShareUpdateExclusiveLock.  That's not quite the same thing but the
effect is the same: we issue invalidations that other backends aren't
guaranteed to see in a timely fashion, and it works OK.  Yeah, some
other session might not see the change for a while, but we don't view
that as a big problem.

But I think in this case that argument is basically a straw man.  A
utility that's called amcheck is obviously not issuing any cache
invalidations.  It's probably not even changing anything at all; if it
is, maybe it should be called amrepair.  So the fact that CLUSTER has
to hold AEL until commit time really doesn't have anything to do with
whether amcheck needs to hold AES until commit time.

I don't really understand your demand for "an argument why it's
right".  My argument is simple: I can't think of a single thing that
it would break, and I don't believe there is any such thing.
Furthermore, as Peter points out, we do this kind of thing all the
time in other places and it indeed does not break things.  I think
you're trying to invent a coding rule that doesn't exist, but I can't
prove a negative.  Your argument so far is that "yeah, well, maybe
inval doesn't emit sinval traffic (duh?) but it might do something
else that breaks, prove to me that there is no such thing".  But
that's just like asking me to prove that it's impossible to exceed the
speed of light.  On information and belief, nobody currently knows how
to do that and there is good scientific evidence that it cannot be
done, but there's no way to prove that someone won't in the future
discover a refinement in the physical laws of the universe as we
understand them today which sheds a different light on the situation.

From my point of view, it's likely to be common to want to run amcheck
in series on a bunch of indexes, like by writing a SELECT query
against pg_class or pg_index and passing the OIDs to amcheck.  With
the retain-locks design, that query accumulates locks on a large
number of objects, possibly blowing out the lock table or blocking
DDL.  I think that readily-forseeable problem ought to trump concerns
about purely hypothetical issues caused by releasing locks early.  If
you can come up with an actual issue with releasing locks early that
applies to this particular case, then that's different, of course.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] amcheck (B-Tree integrity checking tool)

From
Robert Haas
Date:
On Fri, Feb 10, 2017 at 8:39 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Feb 9, 2017 at 2:32 PM, Andres Freund <andres@anarazel.de> wrote:
>> Except that the proposed names aren't remotely like that... ;).
>
> Revision attached -- V5. We now REVOKE ALL on both functions, as
> Robert suggested, instead of the previous approach of having a
> hard-coded superuser check with enforcement.
>
>> And I proposed documenting named parameters, and
>> check_btree(performing_check_requiring_exclusive_locks => true) is just
>> about as expressive.
>
> I have not done this, nor have I renamed the functions. I still think
> that this is something that we can fix by adding a boolean argument to
> each function in the future, or something along those lines. I
> *really* hate the idea of having one function with non-obvious,
> variable requirements on locking, with locking implications that are
> not knowable when we PREPARE an SQL statement calling the function. It
> also removes a useful way of have superusers discriminate against the
> stronger locking variant bt_index_parent_check() by not granting
> execute on it (as an anti-footgun measure).

I think Andres is more or less correct that
"performing_check_requiring_exclusive_locks => true" is just about as
expressive as calling a different function, but I think that your
point that the superuser might want to grant access to one function
but not the other is a good one.  On the other hand, I think Andres
has a concern that we might have more modes in the future and we don't
want to end up with 2^n entrypoints.  That also seems valid.  Hmm.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] amcheck (B-Tree integrity checking tool)

From
Andres Freund
Date:
Hi,

On 2017-02-13 11:52:53 -0500, Robert Haas wrote:
> On Thu, Feb 9, 2017 at 5:32 PM, Andres Freund <andres@anarazel.de> wrote:
> >> > Again, some parts of the code doing something bad isn't a good argument
> >> > for doing again. Releasing locks early is a bad pattern, because backend
> >> > code isn't generally safe against it, we have special code-paths for
> >> > catalog tables to support it for those.
> >>
> >> I don't agree with that.  The reason we keep relation locks until the
> >> end of the transaction is to avoid surprising application code, not
> >> because the system can't tolerate it.  But here it's more likely that
> >> retaining the lock will surprise the user.
> >
> > It's both.  A bunch of code paths rely on early release ony happening on
> > catalogs. E.g., and that's just the first thing that comes to mind,
> > cache invalidations which really only are guaranteed to be delivered and
> > visibile for other cache accesses, if the lock is only released after
> > the end of the transaction.  Object accesses like relation_open() accept
> > invalidations *after* acquiring the lock. Combined with lock releases
> > only happening *after* transaction commits that guarantees an up2date
> > view of the cache; but if others release the lock earlier and have cache
> > invalidations, that doesn't work anymore.  Now you can argue that it's
> > possibly harmless here because there's presumably no way this can ever
> > trigger cache invals - and you're probably right - but this isn't the
> > only place with such assumption and you'd definitely have to argue *why*
> > it's right.
> 
> I agree that code which issues cache invalidations needs to hold a
> blocking lock until commit IF it's critical for other backends to see
> the update.

Right. And I think that needs to be the default assumption, because
otherwise you need to be careful about all the relevant code paths.  I'm
fine with adding something along the lines of
/** Release lock on relation early, this function doesn't generate any* cache invalidations, and it's convenient for
callersto be able to* check multiple functions in one go.*/
 
(after checking that's true of course) and then skipping the lock
release.  What I'm absolutely not ok with is doing so without
consideration and documentation.


> I don't really understand your demand for "an argument why it's
> right".  My argument is simple: I can't think of a single thing that
> it would break, and I don't believe there is any such thing.

That's pretty much an argument.  I just want documentation that calls
attention that we're doing something that requires a modicum of care.  I
fail to see why that's unreasonable.  I don't think it's that unlikely
will come around adding something like "hey we can update the free page
statistic here, because we just read the whole damn thing", suddenly
invalidating the logic.  Or we add a heap check that then might trigger
pruning, which quite conceivably could trigger an invalidation. Or ...


> I think you're trying to invent a coding rule that doesn't exist, but
> I can't prove a negative.

We had bugs around this in the past, so ...


> Your argument so far is that "yeah, well, maybe inval doesn't emit
> sinval traffic (duh?) but it might do something else that breaks,
> prove to me that there is no such thing".

I have no idea what you're trying to say with this.  The rest of that
paragraph just seems like bunch of odd hyperbole ("impossible to exceed
the speed of light...").


> From my point of view, it's likely to be common to want to run amcheck
> in series on a bunch of indexes, like by writing a SELECT query
> against pg_class or pg_index and passing the OIDs to amcheck.  With
> the retain-locks design, that query accumulates locks on a large
> number of objects, possibly blowing out the lock table or blocking
> DDL.

On the other hand, retaining the locks actually increases the chance to
run a check to completion.  If you e.g. write a query checking the
indexes for a table, you suddenly can't be sure that subsequent indexes
for the same table can be checked, because the relation lock has been
released inbetween.  Which then'd lead to erroring out after doing part
of the expensive work.

I don't think that refutes your argument, but I also don't think it's
as clear cut as you make it out to be.


Greetings,

Andres Freund



Re: [HACKERS] amcheck (B-Tree integrity checking tool)

From
Andres Freund
Date:
On 2017-02-13 12:05:21 -0500, Robert Haas wrote:
> On Fri, Feb 10, 2017 at 8:39 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> > On Thu, Feb 9, 2017 at 2:32 PM, Andres Freund <andres@anarazel.de> wrote:
> >> Except that the proposed names aren't remotely like that... ;).
> >
> > Revision attached -- V5. We now REVOKE ALL on both functions, as
> > Robert suggested, instead of the previous approach of having a
> > hard-coded superuser check with enforcement.
> >
> >> And I proposed documenting named parameters, and
> >> check_btree(performing_check_requiring_exclusive_locks => true) is just
> >> about as expressive.
> >
> > I have not done this, nor have I renamed the functions. I still think
> > that this is something that we can fix by adding a boolean argument to
> > each function in the future, or something along those lines. I
> > *really* hate the idea of having one function with non-obvious,
> > variable requirements on locking, with locking implications that are
> > not knowable when we PREPARE an SQL statement calling the function. It
> > also removes a useful way of have superusers discriminate against the
> > stronger locking variant bt_index_parent_check() by not granting
> > execute on it (as an anti-footgun measure).
>
> I think Andres is more or less correct that
> "performing_check_requiring_exclusive_locks => true" is just about as
> expressive as calling a different function, but I think that your
> point that the superuser might want to grant access to one function
> but not the other is a good one.  On the other hand, I think Andres
> has a concern that we might have more modes in the future and we don't
> want to end up with 2^n entrypoints.  That also seems valid.  Hmm.

That's part of my concern.  The second part is that I really want to be
able to have a check_relation() (and check_database())function that I
can pass a bunch of arguments determining how expensive checks are going
to be performed.

E.g. I'd like to be able to do something like

SELECT *
FROM check_relation(  'my_table'::regclass,  test_btree => 'true',  test_btree_heap_interlock => 'true',  test_gin =>
'true');

SELECT *
FROM check_current_database(  test_heap_update_chains => 'true',  test_heap_clog_interlock => 'true',  test_btree =>
'true', test_gin => 'false');
 

etc.

You can't really trivially replace these with a larger query and/or
function, because of the locking considerations (consider what happens
if somebody concurrently drops a table/index - your whole query errors
out, wasting hours of work).

I'm ok with not immediately doing so, but I think Peter's design isn't
in line with achieving something like this.

Regards,

Andres



Re: [HACKERS] amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Mon, Mar 6, 2017 at 3:57 PM, Andres Freund <andres@anarazel.de> wrote:
> I'm ok with not immediately doing so, but I think Peter's design isn't
> in line with achieving something like this.

I would be okay with doing this if we had a grab-bag of expensive
checks, that all pretty much require some combination of
ExclusiveLocks and/or ShareLocks anyway, and are amenable to being
combined. I could see that happening, but we're a long way off from
that. When it does happen, we might have other things that are a bit
like bt_index_parent_check(), in terms of locking requirements and
degree of thoroughness, that might themselves require other parameters
specific to the verification that they perform that we cannot
anticipate. For example, maybe bt_index_parent_check() could have a
parameter that is a probability that any given IndexTuple will be
verified against its heap tuple. Then you could have something like a
PRNG seed argument, so you can check different tuples every day. There
are many possibilities, and I hope that eventually amcheck has this
kind of very rich functionality.

When you use the interface you describe to "stack" several checks at
once, less important parameters, like the ones I suggest are going to
be an awkward fit, and so will probably be excluded. I'm not opposed
to having what amounts to a second interface to what bt_index_check()
does, to make this kind of thing work where it makes sense. Really,
bt_index_parent_check() is almost an alternative interface to the same
functionality today. There can be another one in the future, if it
serves a purpose, and the locking requirements are roughly the same
for all the checks. I'd be fine with that. Let's just get the basic
feature in for now, though.

-- 
Peter Geoghegan



Re: [HACKERS] amcheck (B-Tree integrity checking tool)

From
Andres Freund
Date:
On 2017-03-06 20:40:59 -0800, Peter Geoghegan wrote:
> On Mon, Mar 6, 2017 at 3:57 PM, Andres Freund <andres@anarazel.de> wrote:
> > I'm ok with not immediately doing so, but I think Peter's design isn't
> > in line with achieving something like this.
>
> I would be okay with doing this if we had a grab-bag of expensive
> checks, that all pretty much require some combination of
> ExclusiveLocks and/or ShareLocks anyway, and are amenable to being
> combined. I could see that happening, but we're a long way off from
> that. When it does happen, we might have other things that are a bit
> like bt_index_parent_check(), in terms of locking requirements and
> degree of thoroughness, that might themselves require other parameters
> specific to the verification that they perform that we cannot
> anticipate. For example, maybe bt_index_parent_check() could have a
> parameter that is a probability that any given IndexTuple will be
> verified against its heap tuple. Then you could have something like a
> PRNG seed argument, so you can check different tuples every day. There
> are many possibilities, and I hope that eventually amcheck has this
> kind of very rich functionality.

> When you use the interface you describe to "stack" several checks at
> once, less important parameters, like the ones I suggest are going to
> be an awkward fit, and so will probably be excluded. I'm not opposed
> to having what amounts to a second interface to what bt_index_check()
> does, to make this kind of thing work where it makes sense. Really,
> bt_index_parent_check() is almost an alternative interface to the same
> functionality today. There can be another one in the future, if it
> serves a purpose, and the locking requirements are roughly the same
> for all the checks. I'd be fine with that. Let's just get the basic
> feature in for now, though.

I disagree quite strongly with pretty much all of that.


But I also think it's more important to get some initial verification
functionality in, than resolving this conflict. I do, also quite
strongly, think we'll be better served with having something like what
you're proposing than nothing, and I don't have time/energy to turn your
patch into what I'm envisioning, nor necessarily the buyin.  Hence I'm
planning to commit the current interface.

Attached is your original patch, and a patch editorializing things. I do
plan to merge those before pushing.

Who would you like to see added to Reviewed-By?

Regards,

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Thu, Mar 9, 2017 at 3:52 PM, Andres Freund <andres@anarazel.de> wrote:
>
> But I also think it's more important to get some initial verification
> functionality in, than resolving this conflict. I do, also quite
> strongly, think we'll be better served with having something like what
> you're proposing than nothing, and I don't have time/energy to turn your
> patch into what I'm envisioning, nor necessarily the buyin.  Hence I'm
> planning to commit the current interface.
>
> Attached is your original patch, and a patch editorializing things. I do
> plan to merge those before pushing.

Your revisions all seem fine. No problems with any of it.

> Who would you like to see added to Reviewed-By?

I am generally in favor of more inclusive Reviewed-By lists. I suggest
that we expand it to:

"Reviewed-By: Andres Freund, Thomas Vondra, Thomas Munro, Anastasia
Lubennikova, Robert Haas, Amit Langote"

Thanks for your help with this!

-- 
Peter Geoghegan



Re: [HACKERS] amcheck (B-Tree integrity checking tool)

From
Andres Freund
Date:
On 2017-03-09 16:29:24 -0800, Peter Geoghegan wrote:
> I am generally in favor of more inclusive Reviewed-By lists. I suggest
> that we expand it to:
> 
> "Reviewed-By: Andres Freund, Thomas Vondra, Thomas Munro, Anastasia
> Lubennikova, Robert Haas, Amit Langote"

And pushed with these.

I don't really expect buildfarm fallout, but ..

- Andres



Re: [HACKERS] amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Thu, Mar 9, 2017 at 4:41 PM, Andres Freund <andres@anarazel.de> wrote:
> And pushed with these.

Thanks.

> I don't really expect buildfarm fallout, but ..

Unfortunately, there is some on Windows:
 verify_nbtree.obj : error LNK2001: unresolved external symbol
RecentGlobalXmin
[C:\buildfarm\buildenv\HEAD\pgsql.build\amcheck.vcxproj]

Rather than marking RecentGlobalXmin as PGDLLIMPORT, I'd rather just
remove the documenting assertion and leave that comment as-is. I'll
work on a patch for this soon.

-- 
Peter Geoghegan



Re: [HACKERS] amcheck (B-Tree integrity checking tool)

From
Andres Freund
Date:
Hi,

On 2017-03-09 19:04:46 -0800, Peter Geoghegan wrote:
> On Thu, Mar 9, 2017 at 4:41 PM, Andres Freund <andres@anarazel.de> wrote:
> > I don't really expect buildfarm fallout, but ..
> 
> Unfortunately, there is some on Windows:

Thanks for monitoring.


>   verify_nbtree.obj : error LNK2001: unresolved external symbol
> RecentGlobalXmin
> [C:\buildfarm\buildenv\HEAD\pgsql.build\amcheck.vcxproj]
> 
> Rather than marking RecentGlobalXmin as PGDLLIMPORT, I'd rather just
> remove the documenting assertion and leave that comment as-is. I'll
> work on a patch for this soon.

Hm - I think it's fair to export RecentGlobalXmin, given that we
obviously use it across modules in core code.  I see very little reason
not to export it.

Greetings,

Andres Freund



Re: [HACKERS] amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Thu, Mar 9, 2017 at 7:10 PM, Andres Freund <andres@anarazel.de> wrote:
>>   verify_nbtree.obj : error LNK2001: unresolved external symbol
>> RecentGlobalXmin
>> [C:\buildfarm\buildenv\HEAD\pgsql.build\amcheck.vcxproj]
>>
>> Rather than marking RecentGlobalXmin as PGDLLIMPORT, I'd rather just
>> remove the documenting assertion and leave that comment as-is. I'll
>> work on a patch for this soon.
>
> Hm - I think it's fair to export RecentGlobalXmin, given that we
> obviously use it across modules in core code.  I see very little reason
> not to export it.

Well, the assertion is completely useless as anything but documentation...


-- 
Peter Geoghegan



Re: [HACKERS] amcheck (B-Tree integrity checking tool)

From
Peter Geoghegan
Date:
On Thu, Mar 9, 2017 at 7:12 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> Hm - I think it's fair to export RecentGlobalXmin, given that we
>> obviously use it across modules in core code.  I see very little reason
>> not to export it.
>
> Well, the assertion is completely useless as anything but documentation...

I came up with the attached.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment