Re: Making all nbtree entries unique by having heap TIDs participatein comparisons - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: Making all nbtree entries unique by having heap TIDs participatein comparisons |
Date | |
Msg-id | CAH2-WznYfkVCOyYvoG3KUb8_HMWiV5Wu6C5nEwyG=UmJuEPF1g@mail.gmail.com Whole thread Raw |
In response to | Re: Making all nbtree entries unique by having heap TIDs participatein comparisons (Heikki Linnakangas <hlinnaka@iki.fi>) |
Responses |
Re: Making all nbtree entries unique by having heap TIDs participatein comparisons
|
List | pgsql-hackers |
On Fri, Dec 28, 2018 at 10:04 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > I spent some time reviewing this. I skipped the first patch, to add a > column to pg_depend, and I got through patches 2, 3 and 4. Impressive > results, and the code looks sane. Thanks! I really appreciate your taking the time to do such a thorough review. You were right to skip the first patch, because there is a fair chance that it won't be used in the end. Tom is looking into the pg_depend problem that I paper over with the first patch. > I wrote a laundry list of little comments on minor things, suggested > rewordings of comments etc. I hope they're useful, but feel free to > ignore/override my opinions of any of those, as you see best. I think that that feedback is also useful, and I'll end up using 95%+ of it. Much of the information I'm trying to get across is very subtle. > But first, a few slightly bigger (medium-sized?) issues that caught my eye: > > 1. How about doing the BTScanInsertData refactoring as a separate > commit, first? It seems like a good thing for readability on its own, > and would slim the big main patch. (And make sure to credit Andrey for > that idea in the commit message.) Good idea. I'll do that. > This 'assumeheapkeyspace' flag feels awkward. What if the caller knows > that it is a v3 index? There's no way to tell _bt_mkscankey() that. > (There's no need for that, currently, but seems a bit weird.) This is there for CREATE INDEX -- we cannot access the metapage during an index build. We'll only be able to create new v4 indexes with the patch applied, so we can assume that heap TID is part of the key space safely. > _bt_split() calls _bt_truncate(), which calls _bt_leave_natts(), which > calls _bt_mkscankey(). It's holding a lock on the page being split. Do > we risk deadlock by locking the metapage at the same time? I already had vague concerns along the same lines. I am also concerned about index_getprocinfo() calls that happen in the same code path, with a buffer lock held. (SP-GiST's doPickSplit() function can be considered a kind of precedent that makes the second issue okay, I suppose.) See also: My later remarks on the use of "authoritative comparisons" from this same e-mail. > I don't have any great ideas on what to do about this, but it's awkward > as it is. Can we get away without the new argument? Could we somehow > arrange things so that rd_amcache would be guaranteed to already be set? These are probably safe in practice, but the way that we rely on them being safe from a distance is a concern. Let me get back to you on this. > 3. In the "Pick nbtree split points discerningly" patch > > I find the different modes and the logic in _bt_findsplitloc() very hard > to understand. I've spent a while looking at it now, and I think I have > a vague understanding of what things it takes into consideration, but I > don't understand why it performs those multiple stages, what each stage > does, and how that leads to an overall strategy. I think a rewrite would > be in order, to make that more understandable. I'm not sure what exactly > it should look like, though. I've already refactored that a little bit for the upcoming v10. The way _bt_findsplitloc() state is initially set up becomes slightly more streamlined. It still works in the same way, though, so you'll probably only think that the new version is a minor improvement. (Actually, v10 focuses on making _bt_splitatnewitem() a bit less magical, at least right now.) > If _bt_findsplitloc() has to fall back to the MANY_DUPLICATES or > SINGLE_VALUE modes, it has to redo a lot of the work that was done in > the DEFAULT mode already. That's probably not a big deal in practice, > performance-wise, but I feel that it's another hint that some > refactoring would be in order. The logic within _bt_findsplitloc() has been very hard to refactor all along. You're right that there is a fair amount of redundant-ish work that the alternative modes (MANY_DUPLICATES + SINGLE_VALUE) perform. The idea is to not burden the common DEFAULT case, and to keep the control flow relatively simple. I'm sure that if I was in your position I'd say something similar. It is complicated in subtle ways, that looks like they might not matter, but actually do. I am working off a fair variety of test cases, which really came in handy. I remember thinking that I'd simplified it a couple of times back in August or September, only to realize that I'd regressed a case that I cared about. I eventually realized that I needed to come up with a comprehensive though relatively fast test suite, which seems essential for refactoring _bt_findsplitloc(), and maybe even for fully understanding how _bt_findsplitloc() works. Another complicating factor is that I have to worry about the number of cycles used under a buffer lock (not just the impact on space utilization). With all of that said, I am willing to give it another try. You've seen opportunities to refactor that I missed before now. More than once. > One idea on how to restructure that: > > Make a single pass over all the offset numbers, considering a split at > that location. Like the current code does. For each offset, calculate a > "penalty" based on two factors: > > * free space on each side > * the number of attributes in the pivot tuple, and whether it needs to > store the heap TID > > Define the penalty function so that having to add a heap TID to the > pivot tuple is considered very expensive, more expensive than anything > else, and truncating away other attributes gives a reward of some size. As you go on to say, accessing the tuple to calculate a penalty like this is expensive, and shouldn't be done exhaustively if at all possible. We're only access item pointer information (that is, lp_len) in the master branch's _bt_findsplitloc(), and that's all we do within the patch until the point where we have a (usually quite small) array of candidate split points, sorted by delta. Doing a pass over the page to assemble an array of candidate splits, and then doing a pass over the sorted array of splits with tolerably-low left/right space deltas works pretty well. "Mixing" the penalties together up front like that is something I considered, and decided not to pursue -- it obscures relatively uncommon though sometimes important large differences, that a single DEFAULT mode style pass would probably miss. MANY_DUPLICATES mode is totally exhaustive, because it's worth being totally exhaustive in the extreme case where there are only a few distinct values, and it's still possible to avoid a large grouping of values that spans more than one page. But it's not worth being exhaustive like that most of the time. That's the useful thing about having 2 alternative modes, that we "escalate" to if and only if it seems necessary to. MANY_DUPLICATES can be expensive, because no workload is likely to consistently use it. Most will almost always use DEFAULT, some will use SINGLE_VALUE quite a bit -- MANY_DUPLICATES is for when we're "in between" those two.Seems unlikely to be the steady state. Maybe we could just have MANY_DUPLICATES mode, and making SINGLE_VALUE mode something that happens within a DEFAULT pass. It's probably not worth it, though -- SINGLE_VALUE mode generally wants to split the page in a way that makes the left page mostly full, and the right page mostly empty. So eliminating SINGLE_VALUE mode would probably not simplify the code. > However, naively computing the penalty upfront for every offset would be > a bit wasteful. Instead, start from the middle of the page, and walk > "outwards" towards both ends, until you find a "good enough" penalty. You can't start at the middle of the page, though. You have to start at the left (though you could probably start at the right instead). This is because of page fragmentation -- it's not correct to assume that the line pointer offset into tuple space on the page (firstright linw pointer lp_off for candidate split point) tells you anything about what the space delta will be after the split. You have to exhaustively add up the free space before the line pointer (the free space for all earlier line pointers) before seeing if the line pointer works as a split point, since each previous line pointer's tuple could be located anywhere in the original page's tuple space (anywhere to the left or to the right of where it would be in the simple/unfragmented case). > 1st commits commit message: > > > Make nbtree treat all index tuples as having a heap TID trailing key > > attribute. Heap TID becomes a first class part of the key space on all > > levels of the tree. Index searches can distinguish duplicates by heap > > TID, at least in principle. > > What do you mean by "at least in principle"? I mean that we don't really do that currently, because we don't have something like retail index tuple deletion. However, we do have, uh, insertion, so I guess that this is just wrong. Will fix. > > Secondary index insertions will descend > > straight to the leaf page that they'll insert on to (unless there is a > > concurrent page split). > > What is a "Secondary" index insertion? Secondary index is how I used to refer to a non-unique index, until I realized that that was kind of wrong. (In fact, all indexes in Postgres are secondary indexes, because we always use a heap, never a clustered index.) Will fix. > Suggestion: "when there are several attributes in an index" -> "in a > multi-column index" I'll change it to say that. > > +/* > > + * Convenience macro to get number of key attributes in tuple in low-context > > + * fashion > > + */ > > +#define BTreeTupleGetNKeyAtts(itup, rel) \ > > + Min(IndexRelationGetNumberOfKeyAttributes(rel), BTreeTupleGetNAtts(itup, rel)) > > + > > What is "low-context fashion"? I mean that it works with non-pivot tuples in INCLUDE indexes without special effort on the caller's part, while also fetching the number of key attributes in any pivot tuple, where it might well be < IndexRelationGetNumberOfKeyAttributes(). Maybe no comment is necessary -- BTreeTupleGetNKeyAtts() is exactly what it sounds like to somebody that already knows about BTreeTupleGetNAtts(). > Suggestion: Reword to something like "During insertion, there must be a > scan key for every attribute, but when starting a regular index scan, > some can be omitted." Will do. > It would feel more natural to me, to have the mutable state *after* the > other fields. I fully agree, but I can't really change it. The struct BTScanInsertData ends with a flexible array member, though its sized INDEX_MAX_KEYS because _bt_first() wants to allocate it on the stack without special effort. This was found to make a measurable difference with nested loop joins -- I used to always allocate BTScanInsertData using palloc(), until I found a regression. This nestloop join issue must be why commit d961a568 removed an insertion scan key palloc() from _bt_first(), way back in 2005. It seems like _bt_first() should remain free of palloc()s, which it seems to actually manage to do, despite being so hairy. > Also, it'd feel less error-prone to have 'scantid' be > ItemPointerData, rather than a pointer to somewhere else. It's useful for me to be able to set it to NULL, though -- I'd need another bool to represent the absence of a scantid if the field was ItemPointerData (the absence could occur when _bt_mkscankey() is passed a pivot tuple with its heap TID already truncated away, for example). Besides, the raw scan keys themselves are very often pointers to an attribute in some index tuple -- a tuple that the caller needs to keep around for as long as the insertion scan key needs to be used. Why not do the same thing with scantid? It is more or less just another attribute, so it's really the same situation as before. > The 'heapkeyspace' name isn't very descriptive. I understand that it means > that the heap TID is part of the keyspace. Not sure what to suggest > instead, though. I already changed this once, based on a similar feeling. If you come up with an even better name than "heapkeyspace", let me know. :-) > > +The requirement that all btree keys be unique is satisfied by treating heap > > +TID as a tiebreaker attribute. Logical duplicates are sorted in heap item > > +pointer order. > > Suggestion: "item pointer" -> TID, to use consistent terms. Will do. > > We don't use btree keys to disambiguate downlinks from the > > +internal pages during a page split, though: only one entry in the parent > > +level will be pointing at the page we just split, so the link fields can be > > +used to re-find downlinks in the parent via a linear search. (This is > > +actually a legacy of when heap TID was not treated as part of the keyspace, > > +but it does no harm to keep things that way.) > > I don't understand this paragraph. I mean that we could now "go full Lehman and Yao" if we wanted to: it's not necessary to even use the link field like this anymore. We don't do that because of v3 indexes, but also because it doesn't actually matter. The current way of re-finding downlinks would probably even be better in a green field situation, in fact -- it's just a bit harder to explain in a research paper. > Suggestion: reword to "All tuples on non-leaf pages, and high keys on > leaf pages, are pivot tuples" Will do. > > Note that pivot tuples are > > +only used to represent which part of the key space belongs on each page, > > +and can have attribute values copied from non-pivot tuples that were > > +deleted and killed by VACUUM some time ago. A pivot tuple may contain a > > +"separator" key and downlink, just a separator key (in practice the > > +downlink will be garbage), or just a downlink. > > Rather than store garbage, set it to zeros? There may be minor forensic value in keeping the item pointer block as the heap block (but not the heap item pointer) within leaf high keys (i.e. only changing it when it gets copied over for insertion into the parent, and the block needs to point to the leaf child). I recall discussing this with Alexander Korotkov shortly before the INCLUDE patch went in. I'd rather keep it that way, rather than zeroing. I could say "undefined" instead of "garbage", though. Not at all attached to that wording. > "distringuish between ... from ..." doesn't sound like correct grammar. > Suggestion: "distinguish between ... and ...", or just "distinguish ... > from ...". Or rephrase the sentence some other way. Yeah, I mangled the grammar. Which is kind of surprising, since I make a very important point about why strict lower bounds are handy in that sentence! > > +We truncate away suffix key attributes that are not needed for a page high > > +key during a leaf page split when the remaining attributes distinguish the > > +last index tuple on the post-split left page as belonging on the left page, > > +and the first index tuple on the post-split right page as belonging on the > > +right page. > > That's a very long sentence. Will restructure. > > * Since the truncated tuple is probably smaller than the > > * original, it cannot just be copied in place (besides, we want > > * to actually save space on the leaf page). We delete the > > * original high key, and add our own truncated high key at the > > * same offset. It's okay if the truncated tuple is slightly > > * larger due to containing a heap TID value, since pivot tuples > > * are treated as a special case by _bt_check_third_page(). > > By "treated as a special case", I assume that _bt_check_third_page() > always reserves some space for that? Maybe clarify that somehow. I'll just say that _bt_check_third_page() reserves space for it in the next revision of the patch. > _bt_truncate(): > > This is possible when there are > > * attributes that follow an attribute in firstright that is not equal to the > > * corresponding attribute in lastleft (equal according to insertion scan key > > * semantics). > > I can't comprehend that sentence. Simpler English, maybe add an example, > please. Okay. > > static int > > _bt_leave_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright, > > bool build) > > IMHO "keep" would sound better here than "leave". WFM. > Could restructure this to avoid having two almost identical strings to > translate. I'll try. > > #define BTREE_METAPAGE 0 /* first page is meta */ > > #define BTREE_MAGIC 0x053162 /* magic number of btree pages */ > > -#define BTREE_VERSION 3 /* current version number */ > > +#define BTREE_VERSION 4 /* current version number */ > > #define BTREE_MIN_VERSION 2 /* minimal supported version number */ > > +#define BTREE_META_VERSION 3 /* minimal version with all meta fields */ > > BTREE_META_VERSION is a strange name for version 3. I think this > deserves a more verbose comment, above these #defines, to list all the > versions and their differences. Okay, but what would be better? I'm trying to convey that BTREE_META_VERSION is the last version where upgrading was a simple matter of changing the metapage, which can be performed on the fly. The details of what were added to v3 (what nbtree stuff went into Postgres 11) are not really interesting enough to have a descriptive nbtree.h #define name. The metapage-only distinction is actually the interesting distinction here (if I could do the upgrade on-the-fly, there'd be no need for a v3 #define at all). > v9-0003-Pick-nbtree-split-points-discerningly.patch commit message: > > Add infrastructure to determine where the earliest difference appears > > among a pair of tuples enclosing a candidate split point. > > I don't understand this sentence. A (candidate) split point is a point *between* two enclosing tuples on the original page, provided you pretend that the new tuple that caused the split is already on the original page. I probably don't need to be (un)clear on that in the commit message, though. I think that I'll probably end up committing 0002-* and 0003-* in one go anyway (though not before doing the insertion scan key struct refactoring in a separate commit, as you suggest). > > _bt_findsplitloc() is also taught to care about the case where there are > > many duplicates, making it hard to find a distinguishing split point. > > _bt_findsplitloc() may even conclude that it isn't possible to avoid > > filling a page entirely with duplicates, in which case it packs pages > > full of duplicates very tightly. > > Hmm. Is the assumption here that if a page is full of duplicates, there > will be no more insertions into that page? Why? This is a really important point, that should probably have been in your main feedback, rather than the laundry list. I was hoping you'd comment on this more, in fact. Imagine the extreme (and admittedly unrealistic) case first: We have a page full of duplicates, all of which point to one heap page, and with a gapless sequence of heap TID item pointers. It's literally impossible to have another page split in this extreme case, because VACUUM is guaranteed to kill the tuples in the leaf page before anybody can insert next time (IOW, there has to be TID recycling before an insertion into the leaf page is even possible). Now, I've made the "fillfactor" 99, so I haven't actually assumed that there will be *no* further insertions on the page. I'm almost assuming that, but not quite. My thinking was that I should match the greedy behavior that we already have to some degree, and continue to pack leaf pages full of duplicates very tight. I am quite willing to consider whether or not I'm still being too aggressive, all things considered. If I made it 50:50, that would make indexes with relatively few distinct values significantly larger than on master, which would probably be deemed a regression. FWIW, I think that even that regression in space utilization would be more than made up for in other ways. The master branch _bt_findinsertloc() stuff is a disaster with many duplicates for a bunch of reasons that are even more important than the easy-to-measure bloat issue (FPIs, unnecessary buffer lock contention... I could go on). What value do you think works better than 99? 95? 90? I'm open minded about this. I have my own ideas about why 99 works, but they're based on intuitions that might fail to consider something important. The current behavior with many duplicates is pretty awful, so we can at least be sure that it isn't any worse than that. > What do you do instead, then? memcmp? (Reading the patch, yes. > Suggestion: "We use a faster binary comparison, instead of proper > datatype-aware comparison, for speed". WFM. > Aside from performance, it would feel inappropriate to call user-defined > code while holding a buffer lock, anyway. But we do that all the time for this particular variety of user defined code? I mean, we actually *have* to use the authoritative comparisons at the last moment, once we actually make our mind up about where to split -- nothing else is truly trustworthy. So, uh, we actually do this "inappropriate" thing -- just not that much of it. > I'd leave out the ", without provoking a split" part. Or maybe reword to > "if you pretend that the incoming tuple fit and was placed on the page > already". Okay. > It took me a while to understand what the "appear as early as possible" > means here. It's talking about a multi-column index, and about finding a > difference in one of the leading key columns. Not, for example, about > finding a split point early in the page. This is probably a hold-over from when we didn't look at candidate split point tuples an attribute at a time (months ago, it was something pretty close to a raw memcmp()). Will fix. > Perhaps we should leave out these details in the README, and explain > this in the comments of the picksplit-function itself? In the README, I > think a more high-level description of what things are taken into > account when picking the split point, would be enough. Agreed. > > +Suffix truncation is primarily valuable because it makes pivot tuples > > +smaller, which delays splits of internal pages, but that isn't the only > > +reason why it's effective. > > Suggestion: reword to "... , but that isn't the only benefit" ? WFM. > > There are cases where suffix truncation can > > +leave a B-Tree significantly smaller in size than it would have otherwise > > +been without actually making any pivot tuple smaller due to restrictions > > +relating to alignment. > > Suggestion: reword to "... smaller in size than it would otherwise be, > without ..." WFM. > and "without making any pivot tuple *physically* smaller, due to alignment". WFM. > This sentence is a bit of a cliffhanger: what are those cases, and how > is that possible? This is something you see with the TPC-C indexes, even without the new split stuff. The TPC-C stock pk is about 45% smaller with that later commit, but it's something like 6% or 7% smaller even without it (or maybe it's the orderlines pk). And without ever managing to make a pivot tuple physically smaller. This happens because truncating away trailing attributes allows more stuff to go on the first right half of a split. In more general terms: suffix truncation avoids committing ourselves to rules about where values should go that are stricter than truly necessary. On balance, this improves space utilization quite noticeably, even without the special cases where really big improvements are made. If that still doesn't make sense, perhaps you should just try out the TPC-C stuff without the new split patch, and see for yourself. The easiest way to do that is to follow the procedure I describe here: https://bitbucket.org/openscg/benchmarksql/issues/6/making-it-easier-to-recreate-postgres-tpc (BenchmarkSQL is by far the best TPC-C implementation I've found that works with Postgres, BTW. Yes, I also hate Java.) > Ok, I guess these sentences resolve the cliffhanger I complained about. > But this still feels like magic. When you split a page, all of the > keyspace must belong on the left or the right page. Why does it make a > difference to space utilization, where exactly you split the key space? You have to think about the aggregate effect, rather than thinking about a single split at a time. But, like I said, maybe the best thing is to see the effect for yourself with TPC-C (while reverting the split-at-new-item patch). > Ok, so this explains it further, I guess. I find this paragraph > difficult to understand, though. The important thing here is the idea > that some split points are more "discriminating" than others, but I > think it needs some further explanation. What makes a split point more > discriminating? Maybe add an example. An understandable example seems really hard, even though the effect is clear. Maybe I should just say *nothing* about the benefits when pivot tuples don't actually shrink? I found it pretty interesting, and maybe even something that makes it more understandable, but maybe that isn't a good enough reason to keep the explanation. This doesn't address your exact concern, but I think it might help: Bayer's Prefix B-tree paper talks about the effect of being more aggressive in finding a split point. You tend to be able to make index have more leaf pages but fewer internal pages as you get more aggressive about split points. However, both internal pages and leaf pages eventually become more numerous than they'd be with a reasonable interval/level of aggression/discernment -- the saving in internal pages no longer compensates for having more downlinks in internal pages. Bayer ends up saying next to nothing about how big the "split interval" should be. BTW, somebody named Timothy L. Towns wrote the only analysis I've been able to find on split interval for "simply prefix B-Trees" (suffix truncation): https://shareok.org/bitstream/handle/11244/16442/Thesis-1983-T747e.pdf?sequence=1 He is mostly talking about the classic case from Bayer's 77 paper, where everything is a memcmp()-able string, which is probably what some systems actually do. On the other hand, I care about attribute granularity. Anyway, it's pretty clear that this Timothy L. Towns fellow should have picked a better topic for his thesis, because he fails to say anything practical about it. Unfortunately, a certain amount of magic in this area is unavoidable. > > +Suffix truncation may make a pivot tuple *larger* than the non-pivot/leaf > > +tuple that it's based on (the first item on the right page), since a heap > > +TID must be appended when nothing else distinguishes each side of a leaf > > +split. Truncation cannot simply reuse the leaf level representation: we > > +must append an additional attribute, rather than incorrectly leaving a heap > > +TID in the generic IndexTuple item pointer field. (The field is already > > +used by pivot tuples to store their downlink, plus some additional > > +metadata.) > > That's not really the fault of suffix truncation as such, but the > process of turning a leaf tuple into a pivot tuple. It would happen even > if you didn't truncate anything. Fair. Will change. > I think this point, that we have to store the heap TID differently in > pivot tuples, would deserve a comment somewhere else, too. While reading > the patch, I didn't realize that that's what we're doing, until I read > this part of the README, even though I saw the new code to deal with > heap TIDs elsewhere in the code. Not sure where, maybe in > BTreeTupleGetHeapTID(). Okay. > This is the first mention of "many duplicates" mode. Maybe just say > "_bt_findsplitloc() almost always ..." or "The logic for selecting a > split point goes to great lengths to avoid heap TIDs in pivots, and > almost always manages to pick a split point between two > user-key-distinct tuples, accepting a completely lopsided split if it must." Sure. > > Once appending a heap > > +TID to a split's pivot becomes completely unavoidable, there is a fallback > > +strategy --- "single value" mode is used, which makes page splits pack the > > +new left half full by using a high fillfactor. Single value mode leads to > > +better overall space utilization when a large number of duplicates are the > > +norm, and thereby also limits the total number of pivot tuples with an > > +untruncated heap TID attribute. > > This assumes that tuples are inserted in increasing TID order, right? > Seems like a valid assumption, no complaints there, but it's an > assumption nevertheless. I can be explicit about that. See also: my remarks above about "fillfactor" with single value mode. > I'm not sure if this level of detail is worthwhile in the README. This > logic on deciding the split point is all within the _bt_findsplitloc() > function, so maybe put this explanation there. In the README, a more > high-level explanation of what things _bt_findsplitloc() considers, > should be enough. Okay. > _bt_findsplitloc(), and all its helper structs and subroutines, are > about 1000 lines of code now, and big part of nbtinsert.c. Perhaps it > would be a good idea to move it to a whole new nbtsplitloc.c file? It's > a very isolated piece of code. Good idea. I'll give that a go. > In the comment on _bt_leave_natts_fast(): > That's an interesting tidbit, but I'd suggest just removing this comment > altogether. It's not really helping to understand the current > implementation. Will do. > v9-0005-Add-high-key-continuescan-optimization.patch commit message: > > > Note that even pre-pg_upgrade'd v3 indexes make use of this > > optimization. > > .. but we're missing the other optimizations that make it more > effective, so it probably won't do much for v3 indexes. Does it make > them slower? It's probably acceptable, even if there's a tiny > regression, but I'm curious. But v3 indexes get the same _bt_findsplitloc() treatment as v4 indexes -- the new-item-split stuff works almost as well for v3 indexes, and the other _bt_findsplitloc() stuff doesn't seem to make much difference. I'm not sure if that's the right thing to do (probably doesn't matter very much). Now, to answer your question about v3 indexes + the continuescan optimization: I think that it probably will help a bit, with or without the _bt_findsplitloc() changes. Much harder to be sure whether it's worth it on balance, since that's workload dependent. My sense is that it's a much smaller benefit much of the time, but the cost is still pretty low. So why not just make it version-generic, and keep things relatively uncluttered? Once again, I greatly appreciate your excellent review! -- Peter Geoghegan
pgsql-hackers by date: