Re: Making all nbtree entries unique by having heap TIDs participatein comparisons - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: Making all nbtree entries unique by having heap TIDs participatein comparisons |
Date | |
Msg-id | 5ffe7392-0bf3-f559-e737-c63e75175d87@iki.fi Whole thread Raw |
In response to | Re: Making all nbtree entries unique by having heap TIDs participatein comparisons (Peter Geoghegan <pg@bowt.ie>) |
Responses |
Re: Making all nbtree entries unique by having heap TIDs participatein comparisons
|
List | pgsql-hackers |
On 08/03/2019 23:21, Peter Geoghegan wrote: > On Fri, Mar 8, 2019 at 10:48 AM Peter Geoghegan <pg@bowt.ie> wrote: >> All of that said, maybe it would be clearer if page deletion was not >> the special case that has to opt in to minusinfkey semantics. Perhaps >> it would make more sense for everyone else to opt out of minusinfkey >> semantics, and to get the !minusinfkey optimization as a result of >> that. I only did it the other way around because that meant that only >> nbtpage.c had to acknowledge the special case. > > This seems like a good idea -- we should reframe the !minusinfkey > optimization, without actually changing the behavior. Flip it around. > > The minusinfkey field within the insertion scankey struct would be > called something like "descendrighttrunc" instead. Same idea, but with > the definition inverted. Most _bt_search() callers (all of those > outside of nbtpage.c and amcheck) would be required to opt in to that > optimization to get it. > > Under this arrangement, nbtpage.c/page deletion would not ask for the > "descendrighttrunc" optimization, and would therefore continue to do > what it has always done: find the first leaf page that its insertion > scankey values could be on (we don't lie about searching for negative > infinity, or having a negative infinity sentinel value in scan key). > The only difference for page deletion between v3 indexes and v4 > indexes is that with v4 indexes we'll relocate the same leaf page > reliably, since every separator key value is guaranteed to be unique > on its level (including the leaf level/leaf high keys). This is just a > detail, though, and not one that's even worth pointing out; we're not > *relying* on that being true on v4 indexes anyway (we check that the > block number is a match too, which is strictly necessary for v3 > indexes and seems like a good idea for v4 indexes). > > This is also good because it makes it clear that the unique index code > within _bt_doinsert() (that temporarily sets scantid to NULL) benefits > from the descendrighttrunc/!minusinfkey optimization -- it should be > "honest" and ask for it explicitly. We can make _bt_doinsert() opt in > to the optimization for unique indexes, but not for other indexes, > where scantid is set from the start. The > descendrighttrunc/!minusinfkey optimization cannot help when scantid > is set from the start, because we'll always have an attribute value in > insertion scankey that breaks the tie for us instead. We'll always > move right of a heap-TID-truncated separator key whose untruncated > attributes are all equal to a prefix of our insertion scankey values. > > (This _bt_doinsert() descendrighttrunc/!minusinfkey optimization for > unique indexes matters more than you might think -- we do really badly > with things like Zipfian distributions currently, and reducing the > contention goes some way towards helping with that. Postgres pro > noticed this a couple of years back, and analyzed it in detail at that > time. It's really nice that we very rarely have to move right within > code like _bt_check_unique() and _bt_findsplitloc() with the patch.) > > Does that make sense to you? Can you live with the name > "descendrighttrunc", or do you have a better one? "descendrighttrunc" doesn't make much sense to me, either. I don't understand it. Maybe a comment would make it clear, though. I don't feel like this is an optimization. It's natural consequence of what the high key means. I guess you can think of it as an optimization, in the same way that not fully scanning the whole index for every search is an optimization, but that's not how I think of it :-). If we don't flip the meaning of the flag, then maybe calling it something like "searching_for_leaf_page" would make sense: /* * When set, we're searching for the leaf page with the given high key, * rather than leaf tuples matching the search keys. * * Normally, when !searching_for_pivot_tuple, if a page's high key * has truncated columns, and the columns that are present are equal to * the search key, the search will not descend to that page. For * example, if an index has two columns, and a page's high key is * ("foo", <omitted>), and the search key is also ("foo," <omitted>), * the search will not descend to that page, but its right sibling. The * omitted column in the high key means that all tuples on the page must * be strictly < "foo", so we don't need to visit it. However, sometimes * we perform a search to find the parent of a leaf page, using the leaf * page's high key as the search key. In that case, when we search for * ("foo", <omitted>), we do want to land on that page, not its right * sibling. */ bool searching_for_leaf_page; As the patch stands, you're also setting minusinfkey when dealing with v3 indexes. I think it would be better to only set searching_for_leaf_page in nbtpage.c. In general, I think BTScanInsert should describe the search key, regardless of whether it's a V3 or V4 index. Properties of the index belong elsewhere. (We're violating that by storing the 'heapkeyspace' flag in BTScanInsert. That wart is probably OK, it is pretty convenient to have it there. But in principle...) - Heikki
pgsql-hackers by date: