Re: Making all nbtree entries unique by having heap TIDs participate in comparisons - Mailing list pgsql-hackers
From | Peter Geoghegan
---|---
Subject | Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Date |
Msg-id | CAH2-Wzn3g7TyaU7C40heNM9fEaDAezECM0RC3e_MC+8CW7tk4g@mail.gmail.com
In response to | Re: Making all nbtree entries unique by having heap TIDs participate in comparisons (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses | Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
 | Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
List | pgsql-hackers
On Mon, Jan 28, 2019 at 7:32 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I spent some time first trying to understand the current algorithm, and
> then rewriting it in a way that I find easier to understand. I came up
> with the attached. I think it optimizes for the same goals as your
> patch, but the approach is quite different.

Attached is v13 of the patch series, which significantly refactors nbtsplitloc.c to implement the algorithm using the approach from your prototype posted on January 28 -- I now take a "top down" approach that materializes all legal split points up front, as opposed to the initial "bottom up" approach, which used recursion and weighed everything (balance of free space, suffix truncation, etc.) all at once. Some of the code is directly lifted from your prototype, so there is now a question about whether you should be listed as a co-author. (I think that you should be credited as a secondary author of the nbtsplitloc.c patch, and as a secondary author in the release notes for the feature as a whole, which I imagine will be rolled into one item there.)

I always knew that a "top down" approach would be simpler, but I underestimated how much better it would be overall, and how manageable the downsides are -- the added cycles are not actually noticeable when compared to the master branch, even during microbenchmarks. Thanks for suggesting this approach! I don't even need to simulate recursion with a loop or a goto; everything is structured as a linear series of steps now. There are still the same modes as before, though; the algorithm is essentially unchanged. All of my tests show that v13 is at least as effective as v12 in terms of how much the final _bt_findsplitloc() results reduce index size.

The new approach will make more sophisticated suffix truncation costing much easier to implement in a future release, when suffix truncation is taught to truncate *within* individual datums/attributes (e.g. generate the text string "new m" given a split point between "new jersey" and "new york", by using some new opclass infrastructure).

"Top down" also makes the implementation of the "split after new item" optimization safer and simpler -- we already have all split points conveniently available, so we can seek out an exact match instead of interpolating where it ought to appear later using a dynamic fillfactor. We can back out of the "split after new item" optimization in the event that the *precise* split point we want to use is not legal. That shouldn't be necessary, and isn't necessary in practice, but it seems like a good idea to be defensive with something as delicate as this.

I'm using qsort() to sort the candidate split points array. I'm not trying to do something clever to avoid the up-front effort of sorting everything, even though we could probably get away with that much of the time (e.g. by doing a top-N sort in default mode). Testing has shown that using an inlined qsort() routine in the style of tuplesort.c would make my serial test cases/microbenchmarks faster, without adding much complexity. We're already competitive with the master branch even without this micro-optimization, so I've put it off for now. What do you think of the idea of specializing an inlineable qsort() for nbtsplitloc.c?

Performance is at least as good as v12 with a more relevant workload, such as BenchmarkSQL. Transaction throughput is 5% - 10% greater in my most recent tests (benchmarks for v13 specifically).

Unlike your prototype, v13 makes the array for holding candidate split points a single big allocation that is always exactly BLCKSZ. The idea is that palloc() can thereby recycle the big _bt_findsplitloc() allocation within _bt_split(). palloc() considers 8KiB to be the upper limit on the size of individual blocks it manages, and we'll always go on to palloc(BLCKSZ) through the _bt_split() call to PageGetTempPage().
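To make the qsort() discussion concrete, here is a minimal sketch of sorting candidate split points with a plain comparator. The struct fields and names are illustrative assumptions, not the actual patch code; the real patch's candidate struct and weighting differ, and "delta" here simply stands in for whatever imbalance figure the comparator ranks by.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical candidate split point -- illustrative only */
typedef struct SplitPoint
{
    int  firstoldonright;   /* offset of first pre-existing tuple on right page */
    bool newitemonleft;     /* does the incoming tuple go on the left page? */
    int  delta;             /* free-space imbalance between the two halves */
} SplitPoint;

/* qsort() comparator: most balanced split (smallest delta) sorts first */
static int
splitpoint_cmp(const void *a, const void *b)
{
    const SplitPoint *sa = (const SplitPoint *) a;
    const SplitPoint *sb = (const SplitPoint *) b;

    if (sa->delta < sb->delta)
        return -1;
    if (sa->delta > sb->delta)
        return 1;
    return 0;
}

/* usage: qsort(points, npoints, sizeof(SplitPoint), splitpoint_cmp); */
```

An inlined, specialized qsort() in the tuplesort.c style would replace the indirect comparator call with a direct comparison, which is where the cycles mentioned above would be saved.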
In a sense, we're not even allocating memory that we weren't allocating already. (Not sure that this really matters, but it is easy to do it that way.)

Other changes from your prototype:

* I found a more efficient representation than a pair of raw IndexTuple pointers for each candidate split. I actually use the same old representation (firstoldonright + newitemonleft) for each split, and provide routines to work backwards from that to get the lastleft and firstright tuples. This approach is far more space efficient, and space efficiency matters when you're allocating space for hundreds of items in a critical path like this.

* You seemed to refactor _bt_checksplitloc() in passing, making it not do the newitemisfirstonright thing. I changed that back. Did I miss something that you intended here?

* Fixed a bug in the loop that adds split points. Your refactoring made the main loop responsible for new item space handling, as just mentioned, but it didn't create a split where the new item is first on the page and the split puts the new item on the left page on its own, with all existing items on the new right page. I couldn't prove that this caused failures to find a legal split, but it still seemed like a bug.

In general, I think that we should generate our initial list of split points in exactly the same manner as we do already. The only difference as far as split legality/feasibility goes is that we pessimistically assume that suffix truncation will have to add a heap TID in all cases. I don't see any advantage to going further than that.

Changes to my own code since v12:

* Simplified the "Add 'split after new tuple' optimization" commit, and made it more consistent with associated code. This was made a lot easier by the new approach. It would be great to hear what you think of it.

* Removed a subtly wrong assertion in nbtpage.c, concerning VACUUM's page deletion. Even a page that is about to be deleted can be filled up again and split when we release and reacquire a lock, however unlikely that may be.

* Renamed _bt_checksplitloc() to _bt_recordsplit(). I think that it makes more sense to make it about recording a split point, rather than about making sure a split point is legal. It still does that, but perhaps 99%+ of calls to _bt_recordsplit()/_bt_checksplitloc() result in the split being deemed legal, so the new name makes much more sense.

--
Peter Geoghegan
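[Editor's note: the "work backwards" routines for the compact (firstoldonright + newitemonleft) representation described above might look something like the following sketch. All names and details here are hypothetical illustrations, not the actual patch code; offsets are 1-based in the style of OffsetNumber, and a sentinel stands in for the incoming tuple, which is not yet on the page.]

```c
#include <stdbool.h>

/* Sentinel meaning "the incoming (new) tuple" -- illustrative only */
#define NEW_ITEM (-1)

/*
 * Offset of the first tuple on the right half of the split, or NEW_ITEM.
 * The new item is first on the right iff it goes right and would be
 * inserted exactly at the split boundary.
 */
static int
split_firstright(int firstoldonright, bool newitemonleft, int newitemoff)
{
    if (!newitemonleft && newitemoff == firstoldonright)
        return NEW_ITEM;
    return firstoldonright;
}

/*
 * Offset of the last tuple on the left half of the split, or NEW_ITEM.
 * The new item is last on the left iff it goes left and would be
 * inserted exactly at the split boundary.
 */
static int
split_lastleft(int firstoldonright, bool newitemonleft, int newitemoff)
{
    if (newitemonleft && newitemoff == firstoldonright)
        return NEW_ITEM;
    return firstoldonright - 1;
}
```

Storing two small fields per candidate instead of two IndexTuple pointers is what makes the hundreds-of-candidates array cheap enough for this critical path.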
Attachment
- v13-0006-Allow-tuples-to-be-relocated-from-root-by-amchec.patch
- v13-0004-Add-split-after-new-tuple-optimization.patch
- v13-0005-Add-high-key-continuescan-optimization.patch
- v13-0007-DEBUG-Add-pageinspect-instrumentation.patch
- v13-0003-Consider-secondary-factors-during-nbtree-splits.patch
- v13-0001-Refactor-nbtree-insertion-scankeys.patch
- v13-0002-Make-heap-TID-a-tie-breaker-nbtree-index-column.patch