Re: Making all nbtree entries unique by having heap TIDs participate in comparisons - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Date
Msg-id CAH2-WzmfsTHMvo3Mex+1BLUNjRvHGcV8HXR6nBemVrpmFsxrYA@mail.gmail.com
In response to Re: Making all nbtree entries unique by having heap TIDs participate in comparisons  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
List pgsql-hackers
On Fri, Mar 8, 2019 at 10:48 AM Peter Geoghegan <pg@bowt.ie> wrote:
> All of that said, maybe it would be clearer if page deletion was not
> the special case that has to opt in to minusinfkey semantics. Perhaps
> it would make more sense for everyone else to opt out of minusinfkey
> semantics, and to get the !minusinfkey optimization as a result of
> that. I only did it the other way around because that meant that only
> nbtpage.c had to acknowledge the special case.

This seems like a good idea -- we should reframe the !minusinfkey
optimization, without actually changing the behavior. Flip it around.

The minusinfkey field within the insertion scankey struct would be
called something like "descendrighttrunc" instead. Same idea, but with
the definition inverted. Most _bt_search() callers (all of those
outside of nbtpage.c and amcheck) would be required to opt in to that
optimization to get it.
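
To make that more concrete, here is roughly what I have in mind for the
insertion scankey struct. This is only a sketch of the proposal -- treat
everything other than scantid/keysz/scankeys as illustrative, not as
anything that exists in the patch as posted:

#include "postgres.h"
#include "access/skey.h"        /* ScanKeyData */
#include "storage/itemptr.h"    /* ItemPointer */

typedef struct BTScanInsertData
{
    bool        heapkeyspace;   /* v4+ index, heap TID part of key space? */

    /*
     * Inverted flag, replacing minusinfkey: callers that can safely land
     * to the right of a heap-TID-truncated separator whose untruncated
     * attributes equal a prefix of their scankey opt in by setting this.
     * Page deletion (nbtpage.c) and amcheck leave it false, and keep the
     * old "first leaf page the key could be on" behavior.
     */
    bool        descendrighttrunc;

    ItemPointer scantid;        /* heap TID tiebreaker, or NULL */
    int         keysz;          /* number of entries in scankeys[] */
    ScanKeyData scankeys[INDEX_MAX_KEYS];
} BTScanInsertData;

typedef BTScanInsertData *BTScanInsert;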

Under this arrangement, nbtpage.c/page deletion would not ask for the
"descendrighttrunc" optimization, and would therefore continue to do
what it has always done: find the first leaf page that its insertion
scankey values could be on (we don't lie about searching for negative
infinity, or about having a negative infinity sentinel value in the scan key).
The only difference for page deletion between v3 indexes and v4
indexes is that with v4 indexes we'll relocate the same leaf page
reliably, since every separator key value is guaranteed to be unique
on its level (including the leaf level/leaf high keys). This is just a
detail, though, and not one that's even worth pointing out; we're not
*relying* on that being true on v4 indexes anyway (we check that the
block number is a match too, which is strictly necessary for v3
indexes and seems like a good idea for v4 indexes).
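
If it helps, the nbtpage.c side then looks about like this. Again a
sketch only: the helper name and its exact shape are made up for
illustration, and the signatures just follow the patch as I have it here,
not the real page deletion code:

#include "postgres.h"
#include "access/nbtree.h"

/*
 * Sketch of how page deletion relocates its leaf page under the proposed
 * framing: it simply never sets descendrighttrunc, so _bt_search() keeps
 * giving us the first leaf page the high key values could be on.
 */
static bool
relocate_leaf_for_deletion(Relation rel, IndexTuple highkey,
                           BlockNumber target, Buffer *lbuf)
{
    BTScanInsert itup_key = _bt_mkscankey(rel, highkey);
    BTStack      stack;

    /* no opt-in here: descendrighttrunc stays false for page deletion */
    stack = _bt_search(rel, itup_key, lbuf, BT_READ, NULL);
    _bt_freestack(stack);       /* real code keeps the stack for the parent */

    /*
     * v4 indexes have unique separator keys on every level, so this should
     * land on the target page reliably, but we don't rely on it: verify the
     * block number, which is strictly necessary for v3 indexes and cheap
     * insurance for v4.
     */
    return BufferGetBlockNumber(*lbuf) == target;
}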

This is also good because it makes it clear that the unique index code
within _bt_doinsert() (that temporarily sets scantid to NULL) benefits
from the descendrighttrunc/!minusinfkey optimization -- it should be
"honest" and ask for it explicitly. We can make _bt_doinsert() opt in
to the optimization for unique indexes, but not for other indexes,
where scantid is set from the start. The
descendrighttrunc/!minusinfkey optimization cannot help when scantid
is set from the start, because we'll always have an attribute value in
insertion scankey that breaks the tie for us instead. We'll always
move right of a heap-TID-truncated separator key whose untruncated
attributes are all equal to a prefix of our insertion scankey values.
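
In code, the _bt_doinsert() change amounts to something like the
following. Once more just a sketch: the helper and the descendrighttrunc
field are from this proposal, not from anything committed, and
"checkingunique" stands in for the existing unique index case:

#include "postgres.h"
#include "access/nbtree.h"

/*
 * Sketch: building the insertion scankey in _bt_doinsert() under the
 * proposed framing.
 */
static BTScanInsert
build_insert_scankey(Relation rel, IndexTuple itup, bool checkingunique)
{
    BTScanInsert itup_key = _bt_mkscankey(rel, itup);

    if (checkingunique)
    {
        /*
         * Unique index: search without the heap TID tiebreaker so that
         * _bt_check_unique() starts at the first leaf page a duplicate
         * could be on, and explicitly opt in to the optimization, since
         * no scantid will break ties for us.
         */
        itup_key->scantid = NULL;
        itup_key->descendrighttrunc = true;
    }

    /*
     * Otherwise scantid is set from the start and the optimization cannot
     * help: some attribute (ultimately the heap TID itself) always breaks
     * the tie, so we move right of a heap-TID-truncated separator whose
     * untruncated attributes equal a prefix of our key anyway.
     */
    return itup_key;
}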

(This _bt_doinsert() descendrighttrunc/!minusinfkey optimization for
unique indexes matters more than you might think -- we do really badly
with things like Zipfian distributions currently, and reducing the
contention goes some way towards helping with that. Postgres Pro
noticed this a couple of years back, and analyzed it in detail at that
time. It's really nice that we very rarely have to move right within
code like _bt_check_unique() and _bt_findsplitloc() with the patch.)

Does that make sense to you? Can you live with the name
"descendrighttrunc", or do you have a better one?

Thanks
-- 
Peter Geoghegan

