Thread: Max size of a btree index entry

Max size of a btree index entry

From
Tom Lane
Date:
Currently, we restrict btree index tuples to a size that ensures three of
them will fit on a page.  The motivation for this is the following two
considerations:

1. In a non-rightmost page, we need to include a "high key", or page
boundary key, that isn't one of the useful data keys.

2. In a non-leaf page, there had better be at least two child pages
(downlink entries), else we have failed to subdivide the page's key
range at all, and thus there would be a nonterminating recursion.

However: a non-leaf page actually has one more pointer than key,
eg a page with three children needs only two data keys:

---------------- entire key range assigned to page ------------------

-- range 1 --  boundary key -- range 2 --  boundary key -- range 3 --    |                           |
        |    v                           v                           v
 
child page 1               child page 2                 child page 3

We implement this by having the first data "tuple" on a non-leaf page
contain only a downlink TID and no key data, ie it's just the header.

So it appears to me that we could allow the maximum size of a btree
entry to be just less than half a page, rather than just less than
a third of a page --- the worst-case requirement for a non-leaf page
is not three real tuples, but one tuple header and two real tuples.
On a leaf page we might manage to fit only one real data item, but
AFAICS that doesn't pose any correctness problems.

Obviously a tree containing many such pages would be awfully inefficient
to search, but I think a more common case is that there are a few wide
entries in an index of mostly short entries, and so pushing the hard
limit up a little would add some flexibility with little performance
cost in real-world cases.

Have I missed something?  Is this worth changing?
        regards, tom lane


Re: Max size of a btree index entry

From
Josh Berkus
Date:
Tom,

> Obviously a tree containing many such pages would be awfully inefficient
> to search, but I think a more common case is that there are a few wide
> entries in an index of mostly short entries, and so pushing the hard
> limit up a little would add some flexibility with little performance
> cost in real-world cases.
> 
> Have I missed something?  Is this worth changing?

Not sure.  I don't know that the difference between 2.7K and 3.9K would 
have ever made a difference to me in any real-world case.

If we're going to tinker with this code, it would be far more valuable 
to automatically truncate b-tree entries at, say, 1K so that they could 
be efficiently indexed.

Of course, a quick archives search of -SQL, -Newbie and -General would 
indicate how popular of an issue this is.

--Josh Berkus


Re: Max size of a btree index entry

From
Hannu Krosing
Date:
Ühel kenal päeval, T, 2006-07-11 kell 10:46, kirjutas Josh Berkus:
> Tom,
> 
> > Obviously a tree containing many such pages would be awfully inefficient
> > to search, but I think a more common case is that there are a few wide
> > entries in an index of mostly short entries, and so pushing the hard
> > limit up a little would add some flexibility with little performance
> > cost in real-world cases.
> > 
> > Have I missed something?  Is this worth changing?
> 
> Not sure.  I don't know that the difference between 2.7K and 3.9K would 
> have ever made a difference to me in any real-world case.

One (hopefully) soon-to-be real-world case is index-only queries.

We discussed one approach with Luke and he expressed interest in getting
actually done in not too distant future.

> If we're going to tinker with this code, it would be far more valuable 
> to automatically truncate b-tree entries at, say, 1K so that they could 
> be efficiently indexed.

That would not work, if we want to get all data from indexes.

Maybe compressing the keys (like we do for TOAST) would be a better
solution.

> Of course, a quick archives search of -SQL, -Newbie and -General would 
> indicate how popular of an issue this is.

It may become populat again, when we will be able to do index-only
scans. 

-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com



Re: Max size of a btree index entry

From
"Jim C. Nasby"
Date:
On Tue, Jul 11, 2006 at 10:02:49AM -0400, Tom Lane wrote:
> Currently, we restrict btree index tuples to a size that ensures three of
> them will fit on a page.  The motivation for this is the following two
> considerations:
> 
> 1. In a non-rightmost page, we need to include a "high key", or page
> boundary key, that isn't one of the useful data keys.
Why does a leaf page need a boundary key? ISTM if that wasn't the case,
we could actually allow keys to be nearly 8K, constrained by a non-leaf
page needing to include two pointers.

I guess I must be missing something here (and nbtree/README isn't
helping).

> 2. In a non-leaf page, there had better be at least two child pages
> (downlink entries), else we have failed to subdivide the page's key
> range at all, and thus there would be a nonterminating recursion.
> 
> However: a non-leaf page actually has one more pointer than key,
> eg a page with three children needs only two data keys:
> 
> ---------------- entire key range assigned to page ------------------
> 
> -- range 1 --  boundary key -- range 2 --  boundary key -- range 3 --
>      |                           |                           |
>      v                           v                           v
> child page 1               child page 2                 child page 3
> 
> We implement this by having the first data "tuple" on a non-leaf page
> contain only a downlink TID and no key data, ie it's just the header.
> 
> So it appears to me that we could allow the maximum size of a btree
> entry to be just less than half a page, rather than just less than
> a third of a page --- the worst-case requirement for a non-leaf page
> is not three real tuples, but one tuple header and two real tuples.
> On a leaf page we might manage to fit only one real data item, but
> AFAICS that doesn't pose any correctness problems.
> 
> Obviously a tree containing many such pages would be awfully inefficient
> to search, but I think a more common case is that there are a few wide
> entries in an index of mostly short entries, and so pushing the hard
> limit up a little would add some flexibility with little performance
> cost in real-world cases.
> 
> Have I missed something?  Is this worth changing?
> 
>             regards, tom lane
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
>        subscribe-nomail command to majordomo@postgresql.org so that your
>        message can get through to the mailing list cleanly
> 

-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Max size of a btree index entry

From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> On Tue, Jul 11, 2006 at 10:02:49AM -0400, Tom Lane wrote:
>> 1. In a non-rightmost page, we need to include a "high key", or page
>> boundary key, that isn't one of the useful data keys.
> Why does a leaf page need a boundary key?

So you can tell whether a proposed insertion ought to go into this page,
or the one to its right.  The tree descent logic doesn't guarantee that
you descend to exactly the correct page --- if concurrent page splits
are going on, you might have to "move right" one or more times after
reaching the leaf level.  You need the boundary key to make this test
correctly.

And of course, the reason there's no high key on the rightmost page is
exactly that it has no right-hand neighbor, hence no upper limit on its
delegated key space.
        regards, tom lane


Re: Max size of a btree index entry

From
"Jim C. Nasby"
Date:
On Wed, Jul 19, 2006 at 06:23:44PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > On Tue, Jul 11, 2006 at 10:02:49AM -0400, Tom Lane wrote:
> >> 1. In a non-rightmost page, we need to include a "high key", or page
> >> boundary key, that isn't one of the useful data keys.
>  
> > Why does a leaf page need a boundary key?
> 
> So you can tell whether a proposed insertion ought to go into this page,
> or the one to its right.  The tree descent logic doesn't guarantee that
> you descend to exactly the correct page --- if concurrent page splits
> are going on, you might have to "move right" one or more times after
> reaching the leaf level.  You need the boundary key to make this test
> correctly.
> 
> And of course, the reason there's no high key on the rightmost page is
> exactly that it has no right-hand neighbor, hence no upper limit on its
> delegated key space.

Could you not just scan right and see what the first key was? Thought
granted, that means there's a chance of a wasted page scan, but I think
that'd be somewhat of a corner case, so it might not be bad.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Max size of a btree index entry

From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> Could you not just scan right and see what the first key was? Thought
> granted, that means there's a chance of a wasted page scan, but I think
> that'd be somewhat of a corner case, so it might not be bad.

No, because (a) that confuses the first key that happens to be on a page
with its keyspace boundary --- what happens when you need to delete that
data key? and (b) because of locking considerations, you don't want to
move right and then have to back up.  You'd have to hold lock on the
first page while reading in the second, which makes for a nontrivial
performance hit.
        regards, tom lane