Thread: Bizarre coding in _bt_binsrch

Bizarre coding in _bt_binsrch

From
Tom Lane
Date:
I have been puzzling out the coding in _bt_binsrch() in
backend/access/nbtree/nbtsearch.c, with an eye to speeding it up for
the many-equal-keys case.  I have finally worked out exactly what it's
doing, to wit:
    * On a leaf page, we always return the first key >= scan key    * (which could be the last slot + 1).    *    * On
anon-leaf page, there are special cases:    *    * For an insertion (srchtype != BT_DESCENT and natts == keysz)    *
alwaysreturn first key >= scan key (which could be off the end).    *    * For a standard search (srchtype ==
BT_DESCENTand natts == keysz)    * return the first equal key if one exists, else the last lesser key    * if one
exists,else the first slot on the page.    *    * For a partial-match search (srchtype == BT_DESCENT and natts < keysz)
  * return the last lesser key if one exists, else the first slot.
 

This strikes me as a tad bizarre --- in particular, the discrepancy
between treatment of equal keys in the normal and partial search cases.

I think I understand why the partial-match code works that way: there
could be matching keys in the sub-page belonging to the last lesser key.
For example, if our scan target is (x = 2) and we have internal keys(x = 1, y = 2)(x = 2, y = 1)
then we need to look at the first key's subpages, where we might find
matching keys like (x = 2, y = 0).

The full-width case appears to assume that that can't happen: if we
have a given key in an upper page, there can be *no* equal keys in
subpages to its left.  That's a rather strong assumption about how
page splitting is done; is it correct?

Even more to the point, *should* it be correct?  If we always returned
the last lesser key, then the code would work for any correctly
sequenced b-tree, but this code looks like it works only if splits occur
only at the leftmost of a group of equal keys.  If there are a lot of
equal keys, that could result in a badly unbalanced tree, no?  Maybe
that's the real reason why performance seems to be so bad for many
equal keys... maybe the split algorithm needs to be relaxed?
        regards, tom lane


Re: [HACKERS] Bizarre coding in _bt_binsrch

From
Vadim Mikheev
Date:
Tom Lane wrote:
> 
> The full-width case appears to assume that that can't happen: if we
> have a given key in an upper page, there can be *no* equal keys in
> subpages to its left.  That's a rather strong assumption about how
> page splitting is done; is it correct?
> 
> Even more to the point, *should* it be correct?  If we always returned
> the last lesser key, then the code would work for any correctly
> sequenced b-tree, but this code looks like it works only if splits occur
> only at the leftmost of a group of equal keys.  If there are a lot of
> equal keys, that could result in a badly unbalanced tree, no?  Maybe
> that's the real reason why performance seems to be so bad for many
> equal keys... maybe the split algorithm needs to be relaxed?

Our btree-s use Lehman-Yao algorithm which works in assumption
that there is no duplicates at all. It's just reminding.
It was ~ 2 years ago when I changed duplicates handling
to fix some rare bugs (this is why you see BTP_CHAIN stuff
there) and now I don't remember many things and so I can't
comment. But after I changed btree-s I learned how Oracle
handles duplicates problem - it just uses heap tuple id
as (last) part of index key! So simple! Unfortunately,
I had not time to re-implement btree-s in this way.
But this would:

1. get rid of duplicates problem;
2. simplify code (BTP_CHAIN stuff would be removed);
3. order index tuples in such way that in scan heap pages   would be read sequentially (from up of file to down);
4. speed up finding index tuple which corresponds to heap one  (good for index cleaning up).

The main problem is just programistic: you will have to add
heap tid to the end of index tuples on internal index pages,
but on leaf pages heap tid is in the begin of index tuples
(inside of btitem struct).

So, if you're going to change btree, please consider ability
to implement above.

Vadim


Re: [HACKERS] Bizarre coding in _bt_binsrch

From
Bruce Momjian
Date:
Tom, I assume you have dealt with this, right?

> I have been puzzling out the coding in _bt_binsrch() in
> backend/access/nbtree/nbtsearch.c, with an eye to speeding it up for
> the many-equal-keys case.  I have finally worked out exactly what it's
> doing, to wit:
> 
>      * On a leaf page, we always return the first key >= scan key
>      * (which could be the last slot + 1).
>      *
>      * On a non-leaf page, there are special cases:
>      *
>      * For an insertion (srchtype != BT_DESCENT and natts == keysz)
>      * always return first key >= scan key (which could be off the end).
>      *
>      * For a standard search (srchtype == BT_DESCENT and natts == keysz)
>      * return the first equal key if one exists, else the last lesser key
>      * if one exists, else the first slot on the page.
>      *
>      * For a partial-match search (srchtype == BT_DESCENT and natts < keysz)
>      * return the last lesser key if one exists, else the first slot.
> 
> This strikes me as a tad bizarre --- in particular, the discrepancy
> between treatment of equal keys in the normal and partial search cases.
> 
> I think I understand why the partial-match code works that way: there
> could be matching keys in the sub-page belonging to the last lesser key.
> For example, if our scan target is (x = 2) and we have internal keys
>     (x = 1, y = 2)
>     (x = 2, y = 1)
> then we need to look at the first key's subpages, where we might find
> matching keys like (x = 2, y = 0).
> 
> The full-width case appears to assume that that can't happen: if we
> have a given key in an upper page, there can be *no* equal keys in
> subpages to its left.  That's a rather strong assumption about how
> page splitting is done; is it correct?
> 
> Even more to the point, *should* it be correct?  If we always returned
> the last lesser key, then the code would work for any correctly
> sequenced b-tree, but this code looks like it works only if splits occur
> only at the leftmost of a group of equal keys.  If there are a lot of
> equal keys, that could result in a badly unbalanced tree, no?  Maybe
> that's the real reason why performance seems to be so bad for many
> equal keys... maybe the split algorithm needs to be relaxed?
> 
>             regards, tom lane
> 
> 


--  Bruce Momjian                        |  http://www.op.net/~candle maillist@candle.pha.pa.us            |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: [HACKERS] Bizarre coding in _bt_binsrch

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Tom, I assume you have dealt with this, right?

>> I have been puzzling out the coding in _bt_binsrch() in
>> backend/access/nbtree/nbtsearch.c, with an eye to speeding it up for
>> the many-equal-keys case.

I tweaked the code to go faster in the equal-keys case, but Vadim later
pointed out that what we *really* should do is force the algorithms to
never consider two index keys equal (eg, by including the heap tuple id
as the last part of the comparison key).  See his pgsql-hackers message
dated 06 Jun 1999 21:32:36 +0800.  Getting the full benefit would
require ripping out the BTP_CHAIN logic and doing some other major
surgery, so I don't feel like I know the btree code well enough to
tackle it.  It should be on the TODO list though:

* Include heap CTID in btree index keys, remove equal-key cruft from btree
        regards, tom lane


Re: [HACKERS] Bizarre coding in _bt_binsrch

From
Bruce Momjian
Date:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom, I assume you have dealt with this, right?
> 
> >> I have been puzzling out the coding in _bt_binsrch() in
> >> backend/access/nbtree/nbtsearch.c, with an eye to speeding it up for
> >> the many-equal-keys case.
> 
> I tweaked the code to go faster in the equal-keys case, but Vadim later
> pointed out that what we *really* should do is force the algorithms to
> never consider two index keys equal (eg, by including the heap tuple id
> as the last part of the comparison key).  See his pgsql-hackers message
> dated 06 Jun 1999 21:32:36 +0800.  Getting the full benefit would
> require ripping out the BTP_CHAIN logic and doing some other major
> surgery, so I don't feel like I know the btree code well enough to
> tackle it.  It should be on the TODO list though:
> 
> * Include heap CTID in btree index keys, remove equal-key cruft from btree

Thanks. That's what I needed.

--  Bruce Momjian                        |  http://www.op.net/~candle maillist@candle.pha.pa.us            |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026