Thread: Documentation of bt_page_items()'s ctid field
Attached documentation patch describes the purpose of bt_page_items()'s ctid field. This has come up enough times in disaster recovery or testing scenarios that I feel it's worth drawing particular attention to. -- Peter Geoghegan
Attachment
On 12/30/2014 04:08 AM, Peter Geoghegan wrote: > Attached documentation patch describes the purpose of > bt_page_items()'s ctid field. This has come up enough times in > disaster recovery or testing scenarios that I feel it's worth drawing > particular attention to. How much detail on the b-tree internals do we want to put in the pageinspect documentation? I can see that being useful, but should we also explain e.g. that the first item on each (non-rightmost) page is the high key? > + Note that <structfield>ctid</> addresses a heap tuple if the > + page under consideration is a B-Tree leaf page. Otherwise, for > + internal B-Tree pages <structfield>ctid</> addresses a page in > + the B-Tree itself (excluding the root page if and only if there > + has not yet been a root page split, as in the example above). > + These internally referenced pages are child pages, and may > + themselves be leaf pages or internal pages. I had a hard time understanding the remark about the root page. But in any case, if you look at the flags set e.g. with bt_page_stats(), the root page is flagged as also being a leaf page, when it is the only page in the index. So the root page is considered also a leaf page in that case. I'd suggest saying the same thing (or more) with fewer words: In a B-tree leaf page, <structfield>ctid</> points to a heap tuple. In an internal page, it points to another page in the index itself, and the offset number part (the second number) of the ctid field is ignored. - Heikki
On Tue, Dec 30, 2014 at 8:59 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > How much detail on the b-tree internals do we want to put in the pageinspect > documentation? I can see that being useful, but should we also explain e.g. > that the first item on each (non-rightmost) page is the high key? Maybe we should. I see no reason not to, and I think that it makes sense to explain things at that level without going into flags and so on. But don't forget that that isn't quite the full story if we're going to talk about high keys at all; we must also explain "minus infinity" keys, alongside any explanation of the high key: * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be * "minus infinity": this routine will always claim it is less than the * scankey. The actual key value stored (if any, which there probably isn't) * does not matter. This convention allows us to implement the Lehman and * Yao convention that the first down-link pointer is before the first key. * See backend/access/nbtree/README for details. In particular, this means that the key data is garbage, which is something I've also seen causing confusion [1]. I would like to make it easier for competent non-experts on the B-Tree code to eyeball a B-Tree with pageinspect, and be reasonably confident that things add up. In order for such people to know that something is wrong, we should explain what "right" looks like in moderate detail. So, as I said, I feel an exact explanation of flags is unnecessary, but tend to agree that a brief reference to both page highkeys and "minus infinity" keys is appropriate, since users of the function will see them all the time. > I had a hard time understanding the remark about the root page. But in any > case, if you look at the flags set e.g. with bt_page_stats(), the root page > is flagged as also being a leaf page, when it is the only page in the index. > So the root page is considered also a leaf page in that case. I think that a better way of handling that originally would have been to make root-ness a separate property from leaf-ness/internal-ness. Too late for that now, I suppose. > I'd suggest saying the same thing (or more) with fewer words: > > In a B-tree leaf page, <structfield>ctid</> points to a heap tuple. In an > internal page, it points to another page in the index itself, and the offset > number part (the second number) of the ctid field is ignored. That seems good. What do you think of the attached revision? [1] http://www.postgresql.org/message-id/20140828.110824.1195843073079055852.t-ishii@sraoss.co.jp -- Peter Geoghegan
Attachment
On 12/30/2014 10:07 PM, Peter Geoghegan wrote: > On Tue, Dec 30, 2014 at 8:59 AM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >> How much detail on the b-tree internals do we want to put in the pageinspect >> documentation? I can see that being useful, but should we also explain e.g. >> that the first item on each (non-rightmost) page is the high key? > > Maybe we should. I see no reason not to, and I think that it makes > sense to explain things at that level without going into flags and so > on. But don't forget that that isn't quite the full story if we're > going to talk about high keys at all; we must also explain "minus > infinity" keys, alongside any explanation of the high key: Yeah, good point. > * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be > * "minus infinity": this routine will always claim it is less than the > * scankey. The actual key value stored (if any, which there probably isn't) > * does not matter. This convention allows us to implement the Lehman and > * Yao convention that the first down-link pointer is before the first key. > * See backend/access/nbtree/README for details. > > In particular, this means that the key data is garbage, which is > something I've also seen causing confusion [1]. In practice, we never store any actual key value for the "minus infinity" key. I guess the code would ignore it if it was there, but it would make more sense to explain that the first data key on an internal page does not have a key value. If there is a value there, it's a sign that something's wrong. > I would like to make it easier for competent non-experts on the B-Tree > code to eyeball a B-Tree with pageinspect, and be reasonably confident > that things add up. In order for such people to know that something is > wrong, we should explain what "right" looks like in moderate detail. Makes sense. >> I had a hard time understanding the remark about the root page. But in any >> case, if you look at the flags set e.g. with bt_page_stats(), the root page >> is flagged as also being a leaf page, when it is the only page in the index. >> So the root page is considered also a leaf page in that case. > > I think that a better way of handling that originally would have been > to make root-ness a separate property from leaf-ness/internal-ness. Hmm, yeah, bt_page_stats() currently returns 'l' in the type column when (BTP_ROOT | BTP_LEAF). - Heikki
On Tue, Dec 30, 2014 at 12:21 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > In practice, we never store any actual key value for the "minus infinity" > key. I guess the code would ignore it if it was there, but it would make > more sense to explain that the first data key on an internal page does not > have a key value. If there is a value there, it's a sign that something's > wrong. Well, depends on what you're exact definition of a key is, I suppose. Here is what I see for a B-Tree's root page (it's a few hundred KiBs in size): postgres=# select * from bt_page_items('ix_order_custid', 3);itemoffset | ctid | itemlen | nulls | vars | data ------------+--------+---------+-------+------+------------------------- 1 | (1,1) | 8 | f | f | 2 | (2,1) | 16 | f | f | 65 02 00 00 00 00 00 00 3 | (4,1) | 16 | f | f | d2 04 0000 00 00 00 00 4 | (5,1) | 16 | f | f | 28 07 00 00 00 00 00 00 5 | (6,1) | 16 | f | f | b9 09 00 00 00 00 00 00 *** SNIP *** It's storing an index tuple, but not a key value itself. I guess that the equivocating comments (that I pasted here before, that appear over _bt_compare()) about what is stored made me use the term garbage, but off hand I'm not aware of how that could actually happen either. Perhaps it's an obsolete comment, and the "data" field will always be empty for "minus infinity" items? I'll leave the exact description of this state of the "data" field to you. -- Peter Geoghegan
On Tue, Dec 30, 2014 at 8:59 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
How much detail on the b-tree internals do we want to put in the pageinspect documentation? I can see that being useful, but should we also explain e.g. that the first item on each (non-rightmost) page is the high key?On 12/30/2014 04:08 AM, Peter Geoghegan wrote:Attached documentation patch describes the purpose of
bt_page_items()'s ctid field. This has come up enough times in
disaster recovery or testing scenarios that I feel it's worth drawing
particular attention to.
How do I know if I am looking at a non-rightmost page?
Thanks,
Jeff
On Mon, Mar 9, 2015 at 3:51 PM, Jeff Janes <jeff.janes@gmail.com> wrote: > How do I know if I am looking at a non-rightmost page? It has a right-link (that's the easiest way to tell). It will (as the docpatch points out) also necessarily have a ""high key" item. We know that we have to move right if the high key doesn't bound the value we expected to find on the page (our scankey item - what the index scan is searching for). So the high key goes with having a rightlink, and in general we detect that we're looking at a non-rightmost page based on the presence of a right-link. The confusing thing is that some pages have a high-key *and* a "minus infinity" item. The latter has "data" that is undefined (cannot be interpreted as a value of any type). It is still logically a "real" item, that represents minus infinity for the purposes of binary searches. But the high key is not a "real" item, even though its "data" *is* a well-defined value of the underlying IndexTuple type or types. Typically, the high key "data" matches the first non-highkey "real" item on the next/right page...not including the "minus infinity" item, if any. Is that any clearer? Basically, high key items are relevant to all non-rightmost pages on all levels. OTOH, "minus infinity" items are only relevant to internal (non-leaf) pages (and a pre-first-split root page is considered a leaf page for this purpose). Internal, non-rightmost pages (internal pages are typically only a small minority of the total) have both. -- Peter Geoghegan
On Mon, Mar 9, 2015 at 4:06 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Mon, Mar 9, 2015 at 3:51 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> How do I know if I am looking at a non-rightmost page?
It has a right-link (that's the easiest way to tell).
Meaning that btpo_next is not zero? Should we say that in the patch in so many words? I think it will be hard to explain the page_items more without also explaining the page_stats more.
It will (as the
docpatch points out) also necessarily have a ""high key" item. We
know that we have to move right if the high key doesn't bound the
value we expected to find on the page (our scankey item - what the
index scan is searching for). So the high key goes with having a
rightlink, and in general we detect that we're looking at a
non-rightmost page based on the presence of a right-link.
So if I understand this correctly, if there is a high key it is itemoffset 1, but to know whether itemoffset 1 is a high key I first have to look at btpo_next.
And if there is a minus infinity, it will either be itemoffset 2 or 1, depending on whether there is a high key or not.
Thanks,
Jeff
On Mon, Mar 9, 2015 at 5:18 PM, Jeff Janes <jeff.janes@gmail.com> wrote: >> It has a right-link (that's the easiest way to tell). > > > Meaning that btpo_next is not zero? Should we say that in the patch in so > many words? I think it will be hard to explain the page_items more without > also explaining the page_stats more. Yes. And my wording above was sloppy: By definition, a non-rightmost page is a page that has a rightlink (it will also have a highkey, but that's certainly not how we distinguish them). Attached patch adds a definition of non-rightmost in parenthesis, which I think helps. > So if I understand this correctly, if there is a high key it is itemoffset > 1, but to know whether itemoffset 1 is a high key I first have to look at > btpo_next. Yes. The logic for caring about "minus infinity" items within internal pages is hardcoded into _bt_compare(), so it's kind of a low-level detail, even by these standards...but it comes up with pageinspect, since they consume an IndexTuple. Confusion ensues. > And if there is a minus infinity, it will either be itemoffset 2 or 1, > depending on whether there is a high key or not. That's correct. So "data" can be "(has data), (has no data), (has data), (has data) .... " on a given page, which could freak users out. -- Peter Geoghegan
Attachment
On Mon, Mar 9, 2015 at 5:34 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Mon, Mar 9, 2015 at 5:18 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> It has a right-link (that's the easiest way to tell).
>
>
> Meaning that btpo_next is not zero? Should we say that in the patch in so
> many words? I think it will be hard to explain the page_items more without
> also explaining the page_stats more.
Yes. And my wording above was sloppy: By definition, a non-rightmost
page is a page that has a rightlink (it will also have a highkey, but
that's certainly not how we distinguish them).
Attached patch adds a definition of non-rightmost in parenthesis,
which I think helps.
Thanks. I think the sense of the definition is wrong (you are defining a rightmost page, not a non-rightmost page), and the btpo_next field is zero, not null, when on a rightmost page. Or at least, bt_page_stats presents it as zero.
So more like the attached patch.
I think we do want this. It was asked how much we want to include internals of the btree code into pageinspect documentation, but I think the nature of pageinspect makes that unavoidable. Its docs have to tell you what it does, and inspecting the internals is what it does. And since those internals are unlikely to change other than in a way that breaks pg_upgrade, I think we can safely assume that the work of updating the doc would pale compared to the work of making the docs inaccurate.
I'd mark it ready for committer, but since I also attached a suggested replacement patch it seems presumptuous to do that.
Cheers,
Jeff
Attachment
On Wed, Mar 11, 2015 at 3:14 PM, Jeff Janes <jeff.janes@gmail.com> wrote: > I'd mark it ready for committer, but since I also attached a suggested > replacement patch it seems presumptuous to do that. I've marked it "ready for committer". I think your version is slightly better. Thanks -- Peter Geoghegan
Jeff Janes <jeff.janes@gmail.com> writes: > I think we do want this. It was asked how much we want to include > internals of the btree code into pageinspect documentation, but I think the > nature of pageinspect makes that unavoidable. Yeah. Committed with a little bit of additional wordsmithing. regards, tom lane