Re: btree split logic is fragile in the presence of large index items - Mailing list pgsql-hackers

From Tom Lane
Subject Re: btree split logic is fragile in the presence of large index items
Date
Msg-id 19541.963964069@sss.pgh.pa.us
Whole thread Raw
In response to RE: btree split logic is fragile in the presence of lar ge index items  ("Mikheev, Vadim" <vmikheev@SECTORBASE.COM>)
Responses RE: btree split logic is fragile in the presence of large index items  ("Hiroshi Inoue" <Inoue@tpf.co.jp>)
List pgsql-hackers
I've been chewing some more on the duplicate-key btree issue, and
digging through CVS logs and Postgres v4.2 to try to understand the
history of the code (and boy, this code does have a lot of history
doesn't it?)

What I think I now understand is this:

1. The PG 4.2 code used an OID appended to the key to ensure that
btree items are unique.  This fixes the Lehman and Yao algorithm's
assumption of unique keys *for insertions*, since the item being
inserted must have an OID different from existing items even if the
key values are equal.  However there is still an issue for *lookups*,
since we are necessarily doing a lookup with just a key value, and
we want to find all index items with that same key value regardless
of OID.

2. The solution used in the PG 4.2 code, and still present now, handles
equal-keys for lookup by mandating that a sequence of equal-keyed items
not cross page boundaries unless the first key in that sequence is at
the start of an index page.  This ensures that that first key is present
in the parent node and so a lookup for that key will descend to the
page in which the sequence starts.  Without this restriction the lookup
would descend to the first continuation page of the sequence and so miss
the first few matching items.  However we now see that this restriction
can cause insertion failures by preventing a page from being split where
it needs to be split to make room for the incoming item.

3. Awhile back Vadim removed the added-OID code and added a bunch of
logic for explicit management of chains of duplicate keys.  In
retrospect this change was probably a mistake.  For btree items that
point to heap tuples we can use the tuple TID as tie-breaker (since
an index should never contain two items with the same key and TID).
Btree internal pages need an added field since they have to consider the
TID as part of the key (and the field that is the TID in a leaf page is
used as the down-link pointer in non-leaf pages).

I believe that the PG 4.2 solution for equal-key lookups is also a
mistake, and that the correct answer is simple: split pages wherever
you want, but when descending through a non-leaf page, consider the
target key to be *less than* index items that actually have equal keys.
In this way, if we are looking at a downlink pointer that exactly
matches the target key, we go to the page one left of the page the
pointer references, and thereby find any equal-keyed items that may
be lurking at the end of that page.  (We could possibly implement this
by treating the key being searched for as having an attached TID of
minus infinity.  Note we already have more or less this same idea
in place for scanning a multi-column index using a partial key.)

This might sound klugy since it means a different comparison rule for
descending the tree than for actually deciding whether we should return
a particular item.  But *we have to make such a distinction anyway*
in order to handle NULL items.  (The equality check must always say
FALSE when comparing nulls, but we have to consider NULLs as sortable
data values when entering them into the tree.)

I'm starting to feel that the right fix is to bite the bullet and redo
the index logic this way, rather than add another band-aid.

One thing I'm still not too clear on is how we handle backwards
indexscans.  After looking at Lehman and Yao's paper it seems like
only forward scans are guaranteed to work when other processes are
busy splitting pages.  Anybody know how that's handled?
        regards, tom lane


pgsql-hackers by date:

Previous
From: Lamar Owen
Date:
Subject: In the news...
Next
From: The Hermit Hacker
Date:
Subject: Re: TUPLE SIZE HELP