Thread: Documentation of bt_page_items()'s ctid field

Documentation of bt_page_items()'s ctid field

From

Peter Geoghegan

Date:

30 December 2014, 02:08:33

Attached documentation patch describes the purpose of
bt_page_items()'s ctid field. This has come up enough times in
disaster recovery or testing scenarios that I feel it's worth drawing
particular attention to.

--
Peter Geoghegan

Attachment

bt_page_items_ctid.patch

Re: Documentation of bt_page_items()'s ctid field

From

Heikki Linnakangas

Date:

30 December 2014, 16:59:46

On 12/30/2014 04:08 AM, Peter Geoghegan wrote:
> Attached documentation patch describes the purpose of
> bt_page_items()'s ctid field. This has come up enough times in
> disaster recovery or testing scenarios that I feel it's worth drawing
> particular attention to.

How much detail on the b-tree internals do we want to put in the 
pageinspect documentation? I can see that being useful, but should we 
also explain e.g. that the first item on each (non-rightmost) page is 
the high key?

> +      Note that <structfield>ctid</> addresses a heap tuple if the
> +      page under consideration is a B-Tree leaf page.  Otherwise, for
> +      internal B-Tree pages <structfield>ctid</> addresses a page in
> +      the B-Tree itself (excluding the root page if and only if there
> +      has not yet been a root page split, as in the example above).
> +      These internally referenced pages are child pages, and may
> +      themselves be leaf pages or internal pages.

I had a hard time understanding the remark about the root page. But in 
any case, if you look at the flags set e.g. with bt_page_stats(), the 
root page is flagged as also being a leaf page, when it is the only page 
in the index. So the root page is considered also a leaf page in that case.

I'd suggest saying the same thing (or more) with fewer words:

In a B-tree leaf page, <structfield>ctid</> points to a heap tuple. In 
an internal page, it points to another page in the index itself, and the 
offset number part (the second number) of the ctid field is ignored.

- Heikki

Re: Documentation of bt_page_items()'s ctid field

From

Peter Geoghegan

Date:

30 December 2014, 20:07:48

On Tue, Dec 30, 2014 at 8:59 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> How much detail on the b-tree internals do we want to put in the pageinspect
> documentation? I can see that being useful, but should we also explain e.g.
> that the first item on each (non-rightmost) page is the high key?

Maybe we should. I see no reason not to, and I think that it makes
sense to explain things at that level without going into flags and so
on. But don't forget that that isn't quite the full story if we're
going to talk about high keys at all; we must also explain "minus
infinity" keys, alongside any explanation of the high key:

 * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
 * "minus infinity": this routine will always claim it is less than the
 * scankey.  The actual key value stored (if any, which there probably isn't)
 * does not matter.  This convention allows us to implement the Lehman and
 * Yao convention that the first down-link pointer is before the first key.
 * See backend/access/nbtree/README for details.

In particular, this means that the key data is garbage, which is
something I've also seen causing confusion [1].

I would like to make it easier for competent non-experts on the B-Tree
code to eyeball a B-Tree with pageinspect, and be reasonably confident
that things add up. In order for such people to know that something is
wrong, we should explain what "right" looks like in moderate detail.
So, as I said, I feel an exact explanation of flags is unnecessary,
but tend to agree that a brief reference to both page highkeys and
"minus infinity" keys is appropriate, since users of the function will
see them all the time.

> I had a hard time understanding the remark about the root page. But in any
> case, if you look at the flags set e.g. with bt_page_stats(), the root page
> is flagged as also being a leaf page, when it is the only page in the index.
> So the root page is considered also a leaf page in that case.

I think that a better way of handling that originally would have been
to make root-ness a separate property from leaf-ness/internal-ness.
Too late for that now, I suppose.

> I'd suggest saying the same thing (or more) with fewer words:
>
> In a B-tree leaf page, <structfield>ctid</> points to a heap tuple. In an
> internal page, it points to another page in the index itself, and the offset
> number part (the second number) of the ctid field is ignored.

That seems good. What do you think of the attached revision?

[1] http://www.postgresql.org/message-id/20140828.110824.1195843073079055852.t-ishii@sraoss.co.jp
--
Peter Geoghegan

Attachment

bt_page_items_ctid.patch

Re: Documentation of bt_page_items()'s ctid field

From

Heikki Linnakangas

Date:

30 December 2014, 20:21:51

On 12/30/2014 10:07 PM, Peter Geoghegan wrote:
> On Tue, Dec 30, 2014 at 8:59 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> How much detail on the b-tree internals do we want to put in the pageinspect
>> documentation? I can see that being useful, but should we also explain e.g.
>> that the first item on each (non-rightmost) page is the high key?
>
> Maybe we should. I see no reason not to, and I think that it makes
> sense to explain things at that level without going into flags and so
> on. But don't forget that that isn't quite the full story if we're
> going to talk about high keys at all; we must also explain "minus
> infinity" keys, alongside any explanation of the high key:

Yeah, good point.

>   * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
>   * "minus infinity": this routine will always claim it is less than the
>   * scankey.  The actual key value stored (if any, which there probably isn't)
>   * does not matter.  This convention allows us to implement the Lehman and
>   * Yao convention that the first down-link pointer is before the first key.
>   * See backend/access/nbtree/README for details.
>
> In particular, this means that the key data is garbage, which is
> something I've also seen causing confusion [1].

In practice, we never store any actual key value for the "minus 
infinity" key. I guess the code would ignore it if it was there, but it 
would make more sense to explain that the first data key on an internal 
page does not have a key value. If there is a value there, it's a sign 
that something's wrong.

> I would like to make it easier for competent non-experts on the B-Tree
> code to eyeball a B-Tree with pageinspect, and be reasonably confident
> that things add up. In order for such people to know that something is
> wrong, we should explain what "right" looks like in moderate detail.

Makes sense.

>> I had a hard time understanding the remark about the root page. But in any
>> case, if you look at the flags set e.g. with bt_page_stats(), the root page
>> is flagged as also being a leaf page, when it is the only page in the index.
>> So the root page is considered also a leaf page in that case.
>
> I think that a better way of handling that originally would have been
> to make root-ness a separate property from leaf-ness/internal-ness.

Hmm, yeah, bt_page_stats() currently returns 'l' in the type column when 
(BTP_ROOT | BTP_LEAF).

- Heikki

Re: Documentation of bt_page_items()'s ctid field

From

Peter Geoghegan

Date:

30 December 2014, 20:59:41

On Tue, Dec 30, 2014 at 12:21 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> In practice, we never store any actual key value for the "minus infinity"
> key. I guess the code would ignore it if it was there, but it would make
> more sense to explain that the first data key on an internal page does not
> have a key value. If there is a value there, it's a sign that something's
> wrong.

Well, depends on what you're exact definition of a key is, I suppose.
Here is what I see for a B-Tree's root page (it's a few hundred KiBs
in size):

postgres=# select * from bt_page_items('ix_order_custid', 3);itemoffset |  ctid  | itemlen | nulls | vars |
data
------------+--------+---------+-------+------+-------------------------         1 | (1,1)  |       8 | f     | f    |
      2 | (2,1)  |      16 | f     | f    | 65 02 00 00 00 00 00 00         3 | (4,1)  |      16 | f     | f    | d2 04
0000 00 00 00 00         4 | (5,1)  |      16 | f     | f    | 28 07 00 00 00 00 00 00         5 | (6,1)  |      16 | f
   | f    | b9 09 00 00 00 00 00 00

*** SNIP ***

It's storing an index tuple, but not a key value itself. I guess that
the equivocating comments (that I pasted here before, that appear over
_bt_compare()) about what is stored made me use the term garbage, but
off hand I'm not aware of how that could actually happen either.
Perhaps it's an obsolete comment, and the "data" field will always be
empty for "minus infinity" items?

I'll leave the exact description of this state of the "data" field to you.
-- 
Peter Geoghegan

Re: Documentation of bt_page_items()'s ctid field

From

Jeff Janes

Date:

09 March 2015, 22:52:03

On Tue, Dec 30, 2014 at 8:59 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

On 12/30/2014 04:08 AM, Peter Geoghegan wrote:
Attached documentation patch describes the purpose of
bt_page_items()'s ctid field. This has come up enough times in
disaster recovery or testing scenarios that I feel it's worth drawing
particular attention to.

How much detail on the b-tree internals do we want to put in the pageinspect documentation? I can see that being useful, but should we also explain e.g. that the first item on each (non-rightmost) page is the high key?

How do I know if I am looking at a non-rightmost page?

Thanks,

Jeff

Re: Documentation of bt_page_items()'s ctid field

From

Peter Geoghegan

Date:

09 March 2015, 23:06:19

On Mon, Mar 9, 2015 at 3:51 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> How do I know if I am looking at a non-rightmost page?

It has a right-link (that's the easiest way to tell). It will (as the
docpatch points out) also necessarily have a ""high key" item.  We
know that we have to move right if the high key doesn't bound the
value we expected to find on the page (our scankey item - what the
index scan is searching for). So the high key goes with having a
rightlink, and in general we detect that we're looking at a
non-rightmost page based on the presence of a right-link.

The confusing thing is that some pages have a high-key *and* a "minus
infinity" item. The latter has "data" that is undefined (cannot be
interpreted as a value of any type). It is still logically a "real"
item, that represents minus infinity for the purposes of binary
searches. But the high key is not a "real" item, even though its
"data" *is* a well-defined value of the underlying IndexTuple type or
types. Typically, the high key "data" matches the first non-highkey
"real" item on the next/right page...not including the "minus
infinity" item, if any.

Is that any clearer? Basically, high key items are relevant to all
non-rightmost pages on all levels. OTOH, "minus infinity" items are
only relevant to internal (non-leaf) pages (and a pre-first-split root
page is considered a leaf page for this purpose). Internal,
non-rightmost pages (internal pages are typically only a small
minority of the total) have both.
-- 
Peter Geoghegan

Re: Documentation of bt_page_items()'s ctid field

From

Jeff Janes

Date:

10 March 2015, 00:18:26

On Mon, Mar 9, 2015 at 4:06 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Mon, Mar 9, 2015 at 3:51 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> How do I know if I am looking at a non-rightmost page?

It has a right-link (that's the easiest way to tell).

Meaning that btpo_next is not zero? Should we say that in the patch in so many words? I think it will be hard to explain the page_items more without also explaining the page_stats more.

It will (as the
docpatch points out) also necessarily have a ""high key" item. We
know that we have to move right if the high key doesn't bound the
value we expected to find on the page (our scankey item - what the
index scan is searching for). So the high key goes with having a
rightlink, and in general we detect that we're looking at a
non-rightmost page based on the presence of a right-link.

So if I understand this correctly, if there is a high key it is itemoffset 1, but to know whether itemoffset 1 is a high key I first have to look at btpo_next.

And if there is a minus infinity, it will either be itemoffset 2 or 1, depending on whether there is a high key or not.

Thanks,

Jeff

Re: Documentation of bt_page_items()'s ctid field

From

Peter Geoghegan

Date:

10 March 2015, 00:35:05

On Mon, Mar 9, 2015 at 5:18 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> It has a right-link (that's the easiest way to tell).
>
>
> Meaning that btpo_next is not zero?  Should we say that in the patch in so
> many words?  I think it will be hard to explain the page_items more without
> also explaining the page_stats more.

Yes. And my wording above was sloppy: By definition, a non-rightmost
page is a page that has a rightlink (it will also have a highkey, but
that's certainly not how we distinguish them).

Attached patch adds a definition of non-rightmost in parenthesis,
which I think helps.

> So if I understand this correctly, if there is a high key it is itemoffset
> 1, but to know whether itemoffset 1 is a high key I first have to look at
> btpo_next.

Yes. The logic for caring about "minus infinity" items within internal
pages is hardcoded into _bt_compare(), so it's kind of a low-level
detail, even by these standards...but it comes up with pageinspect,
since they consume an IndexTuple. Confusion ensues.

> And if there is a minus infinity, it will either be itemoffset 2 or 1,
> depending on whether there is a high key or not.

That's correct. So "data" can be "(has data), (has no data), (has
data), (has data) .... " on a given page, which could freak users out.

--
Peter Geoghegan

Attachment

bt_page_items_rightmost.patch

Re: Documentation of bt_page_items()'s ctid field

From

Jeff Janes

Date:

11 March 2015, 22:14:53

On Mon, Mar 9, 2015 at 5:34 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Mon, Mar 9, 2015 at 5:18 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> It has a right-link (that's the easiest way to tell).
>
>
> Meaning that btpo_next is not zero? Should we say that in the patch in so
> many words? I think it will be hard to explain the page_items more without
> also explaining the page_stats more.

Yes. And my wording above was sloppy: By definition, a non-rightmost
page is a page that has a rightlink (it will also have a highkey, but
that's certainly not how we distinguish them).

Attached patch adds a definition of non-rightmost in parenthesis,
which I think helps.

Thanks. I think the sense of the definition is wrong (you are defining a rightmost page, not a non-rightmost page), and the btpo_next field is zero, not null, when on a rightmost page. Or at least, bt_page_stats presents it as zero.

So more like the attached patch.

I think we do want this. It was asked how much we want to include internals of the btree code into pageinspect documentation, but I think the nature of pageinspect makes that unavoidable. Its docs have to tell you what it does, and inspecting the internals is what it does. And since those internals are unlikely to change other than in a way that breaks pg_upgrade, I think we can safely assume that the work of updating the doc would pale compared to the work of making the docs inaccurate.

I'd mark it ready for committer, but since I also attached a suggested replacement patch it seems presumptuous to do that.

Cheers,

Jeff

Attachment

bt_page_items_rightmost_jj.patch

Re: Documentation of bt_page_items()'s ctid field

From

Peter Geoghegan

Date:

11 March 2015, 22:57:57

On Wed, Mar 11, 2015 at 3:14 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> I'd mark it ready for committer, but since I also attached a suggested
> replacement patch it seems presumptuous to do that.


I've marked it "ready for committer". I think your version is slightly better.

Thanks
-- 
Peter Geoghegan

Re: Documentation of bt_page_items()'s ctid field

From

Tom Lane

Date:

12 March 2015, 18:19:54

Jeff Janes <jeff.janes@gmail.com> writes:
> I think we do want this.  It was asked how much we want to include
> internals of the btree code into pageinspect documentation, but I think the
> nature of pageinspect makes that unavoidable.

Yeah.  Committed with a little bit of additional wordsmithing.
        regards, tom lane