Thread: RE: btree split logic is fragile in the presence of lar ge index items

RE: btree split logic is fragile in the presence of lar ge index items

From
"Mikheev, Vadim"
Date:
> However, I'm not sure I believe the comment anymore: it has 
> not changed since Postgres95 and I can see that quite a bit of work has
> been done on the duplicate-key logic since then.  Furthermore findsplitloc
> itself sometimes ignores the claimed requirement: when it does the
> split-in-the-middle case quoted above, it does not pay attention to
> whether it is splitting in the middle of a group of duplicates.  (But

Ops. This is bug.

> that path is taken infrequently enough that it's possible it's just
> plain broken, and we haven't noticed.)
> 
> Does anyone know whether this comment still describes the btree
> equal-key logic accurately?

I think so.

Vadim


Re: btree split logic is fragile in the presence of large index items

From
Tom Lane
Date:
I've been chewing some more on the duplicate-key btree issue, and
digging through CVS logs and Postgres v4.2 to try to understand the
history of the code (and boy, this code does have a lot of history
doesn't it?)

What I think I now understand is this:

1. The PG 4.2 code used an OID appended to the key to ensure that
btree items are unique.  This fixes the Lehman and Yao algorithm's
assumption of unique keys *for insertions*, since the item being
inserted must have an OID different from existing items even if the
key values are equal.  However there is still an issue for *lookups*,
since we are necessarily doing a lookup with just a key value, and
we want to find all index items with that same key value regardless
of OID.

2. The solution used in the PG 4.2 code, and still present now, handles
equal-keys for lookup by mandating that a sequence of equal-keyed items
not cross page boundaries unless the first key in that sequence is at
the start of an index page.  This ensures that that first key is present
in the parent node and so a lookup for that key will descend to the
page in which the sequence starts.  Without this restriction the lookup
would descend to the first continuation page of the sequence and so miss
the first few matching items.  However we now see that this restriction
can cause insertion failures by preventing a page from being split where
it needs to be split to make room for the incoming item.

3. Awhile back Vadim removed the added-OID code and added a bunch of
logic for explicit management of chains of duplicate keys.  In
retrospect this change was probably a mistake.  For btree items that
point to heap tuples we can use the tuple TID as tie-breaker (since
an index should never contain two items with the same key and TID).
Btree internal pages need an added field since they have to consider the
TID as part of the key (and the field that is the TID in a leaf page is
used as the down-link pointer in non-leaf pages).

I believe that the PG 4.2 solution for equal-key lookups is also a
mistake, and that the correct answer is simple: split pages wherever
you want, but when descending through a non-leaf page, consider the
target key to be *less than* index items that actually have equal keys.
In this way, if we are looking at a downlink pointer that exactly
matches the target key, we go to the page one left of the page the
pointer references, and thereby find any equal-keyed items that may
be lurking at the end of that page.  (We could possibly implement this
by treating the key being searched for as having an attached TID of
minus infinity.  Note we already have more or less this same idea
in place for scanning a multi-column index using a partial key.)

This might sound klugy since it means a different comparison rule for
descending the tree than for actually deciding whether we should return
a particular item.  But *we have to make such a distinction anyway*
in order to handle NULL items.  (The equality check must always say
FALSE when comparing nulls, but we have to consider NULLs as sortable
data values when entering them into the tree.)

I'm starting to feel that the right fix is to bite the bullet and redo
the index logic this way, rather than add another band-aid.

One thing I'm still not too clear on is how we handle backwards
indexscans.  After looking at Lehman and Yao's paper it seems like
only forward scans are guaranteed to work when other processes are
busy splitting pages.  Anybody know how that's handled?
        regards, tom lane


RE: btree split logic is fragile in the presence of large index items

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: pgsql-hackers-owner@hub.org [mailto:pgsql-hackers-owner@hub.org]On
> Behalf Of Tom Lane
>
> I've been chewing some more on the duplicate-key btree issue, and
> digging through CVS logs and Postgres v4.2 to try to understand the
> history of the code (and boy, this code does have a lot of history
> doesn't it?)
>
> One thing I'm still not too clear on is how we handle backwards
> indexscans.  After looking at Lehman and Yao's paper it seems like
> only forward scans are guaranteed to work when other processes are
> busy splitting pages.  Anybody know how that's handled?
>

There's the following comment for backwards scan in _bt_step()
in nbtsearch.c
                                       /*                                        * If the adjacent page just split,
thenwe may have                                        * the wrong block.  Handle this
case.
Because pages                                        * only split right, we don't have
to wo
rry about this                                        * failing to terminate.
*/

Seems backwards index scans sometimes move right(scan forward).

Regards.

Hiroshi Inoue