Re: [HACKERS] Bizarre coding in _bt_binsrch - Mailing list pgsql-hackers

From Bruce Momjian
Subject Re: [HACKERS] Bizarre coding in _bt_binsrch
Date
Msg-id 199911292224.RAA15252@candle.pha.pa.us
Whole thread Raw
In response to Bizarre coding in _bt_binsrch  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [HACKERS] Bizarre coding in _bt_binsrch
List pgsql-hackers
Tom, I assume you have dealt with this, right?

> I have been puzzling out the coding in _bt_binsrch() in
> backend/access/nbtree/nbtsearch.c, with an eye to speeding it up for
> the many-equal-keys case.  I have finally worked out exactly what it's
> doing, to wit:
> 
>      * On a leaf page, we always return the first key >= scan key
>      * (which could be the last slot + 1).
>      *
>      * On a non-leaf page, there are special cases:
>      *
>      * For an insertion (srchtype != BT_DESCENT and natts == keysz)
>      * always return first key >= scan key (which could be off the end).
>      *
>      * For a standard search (srchtype == BT_DESCENT and natts == keysz)
>      * return the first equal key if one exists, else the last lesser key
>      * if one exists, else the first slot on the page.
>      *
>      * For a partial-match search (srchtype == BT_DESCENT and natts < keysz)
>      * return the last lesser key if one exists, else the first slot.
> 
> This strikes me as a tad bizarre --- in particular, the discrepancy
> between treatment of equal keys in the normal and partial search cases.
> 
> I think I understand why the partial-match code works that way: there
> could be matching keys in the sub-page belonging to the last lesser key.
> For example, if our scan target is (x = 2) and we have internal keys
>     (x = 1, y = 2)
>     (x = 2, y = 1)
> then we need to look at the first key's subpages, where we might find
> matching keys like (x = 2, y = 0).
> 
> The full-width case appears to assume that that can't happen: if we
> have a given key in an upper page, there can be *no* equal keys in
> subpages to its left.  That's a rather strong assumption about how
> page splitting is done; is it correct?
> 
> Even more to the point, *should* it be correct?  If we always returned
> the last lesser key, then the code would work for any correctly
> sequenced b-tree, but this code looks like it works only if splits occur
> only at the leftmost of a group of equal keys.  If there are a lot of
> equal keys, that could result in a badly unbalanced tree, no?  Maybe
> that's the real reason why performance seems to be so bad for many
> equal keys... maybe the split algorithm needs to be relaxed?
> 
>             regards, tom lane
> 
> 


--  Bruce Momjian                        |  http://www.op.net/~candle maillist@candle.pha.pa.us            |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


pgsql-hackers by date:

Previous
From: The Hermit Hacker
Date:
Subject: Re: [HACKERS] VACUUM as a denial-of-service attack
Next
From: Bruce Momjian
Date:
Subject: Re: [PATCHES] Re: [HACKERS] new patches