Re: Making all nbtree entries unique by having heap TIDs participatein comparisons - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Making all nbtree entries unique by having heap TIDs participatein comparisons
Date
Msg-id CAH2-Wzn3g7TyaU7C40heNM9fEaDAezECM0RC3e_MC+8CW7tk4g@mail.gmail.com
Whole thread Raw
In response to Re: Making all nbtree entries unique by having heap TIDs participatein comparisons  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: Making all nbtree entries unique by having heap TIDs participatein comparisons
Re: Making all nbtree entries unique by having heap TIDs participatein comparisons
List pgsql-hackers
On Mon, Jan 28, 2019 at 7:32 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I spent some time first trying to understand the current algorithm, and
> then rewriting it in a way that I find easier to understand. I came up
> with the attached. I think it optimizes for the same goals as your
> patch, but the approach  is quite different.

Attached is v13 of the patch series, which significantly refactors
nbtsplitloc.c to implement the algorithm using the approach from your
prototype posted on January 28 -- I now take a "top down" approach
that materializes all legal split points up-front, as opposed to the
initial "bottom up" approach that used recursion, and weighed
everything (balance of free space, suffix truncation, etc) all at
once. Some of the code is directly lifted from your prototype, so
there is now a question about whether or not you should be listed as a
co-author. (I think that you should be credited as a secondary author
of the nbtsplitloc.c patch, and as a secondary author in the release
notes for the feature as a whole, which I imagine will be rolled into
one item there.)

I always knew that a "top down" approach would be simpler, but I
underestimated how much better it would be overall, and how manageable
the downsides are -- the added cycles are not actually noticeable when
compared to the master branch, even during microbenchmarks. Thanks for
suggesting this approach!

I don't even need to simulate recursion with a loop or a goto;
everything is structured as a linear series of steps now. There are
still the same modes as before, though; the algorithm is essentially
unchanged. All of my tests show that it's at least as effective as v12
was in terms of how effective the final _bt_findsplitloc() results are
in reducing index size. The new approach will make more sophisticated
suffix truncation costing much easier to implement in a future
release, when suffix truncation is taught to truncate *within*
individual datums/attributes (e.g. generate the text string "new m"
given a split point between "new jersey" and "new york", by using some
new opclass infrastructure). "Top down" also makes the implementation
of the "split after new item" optimization safer and simpler -- we
already have all split points conveniently available, so we can seek
out an exact match instead of interpolating where it ought appear
later using a dynamic fillfactor. We can back out of the "split after
new item" optimization in the event of the *precise* split point we
want to use not being legal. That shouldn't be necessary, and isn't
necessary in practice, but it seems like a good idea be defensive with
something so delicate as this.

I'm using qsort() to sort the candidate split points array. I'm not
trying to do something clever to avoid the up-front effort of sorting
everything, even though we could probably get away with that much of
the time (e.g. by doing a top-N sort in default mode). Testing has
shown that using an inlined qsort() routine in the style of
tuplesort.c would make my serial test cases/microbenchmarks faster,
without adding much complexity. We're already competitive with the
master branch even without this microoptimization, so I've put that
off for now. What do you think of the idea of specializing an
inlineable qsort() for nbtsplitloc.c?

Performance is at least as good as v12 with a more relevant workload,
such as BenchmarkSQL. Transaction throughput is 5% - 10% greater in my
most recent tests (benchmarks for v13 specifically).

Unlike in your prototype, v13 makes the array for holding candidate
split points into a single big allocation that is always exactly
BLCKSZ. The idea is that palloc() can thereby recycle the big
_bt_findsplitloc() allocation within _bt_split(). palloc() considers
8KiB to be the upper limit on the size of individual blocks it
manages, and we'll always go on to palloc(BLCKSZ) through the
_bt_split() call to PageGetTempPage(). In a sense, we're not even
allocating memory that we weren't allocating already. (Not sure that
this really matters, but it is easy to do it that way.)

Other changes from your prototype:

*  I found a more efficient representation than a pair of raw
IndexTuple pointers for each candidate split. Actually, I use the same
old representation (firstoldonright + newitemonleft) in each split,
and provide routines to work backwards from that to get the lastleft
and firstright tuples. This approach is far more space efficient, and
space efficiency matters when you've allocating space for hundreds of
items in a critical path like this.

* You seemed to refactor _bt_checksplitloc() in passing, making it not
do the newitemisfirstonright thing. I changed that back. Did I miss
something that you intended here?

* Fixed a bug in the loop that adds split points. Your refactoring
made the main loop responsible for new item space handling, as just
mentioned, but it didn't create a split where the new item is first on
the page, and the split puts the new item on the left page on its own,
on all existing items on the new right page. I couldn't prove that
this caused failures to find a legal split, but it still seemed like a
bug.

In general, I think that we should generate our initial list of split
points in exactly the same manner as we do so already. The only
difference as far as split legality/feasibility goes is that we
pessimistically assume that suffix truncation will have to add a heap
TID in all cases. I don't see any advantage to going further than
that.

Changes to my own code since v12:

* Simplified "Add "split after new tuple" optimization" commit, and
made it more consistent with associated code. This is something that
was made a lot easier by the new approach. It would be great to hear
what you think of this.

* Removed subtly wrong assertion in nbtpage.c, concerning VACUUM's
page deletion. Even a page that is about to be deleted can be filled
up again and split when we release and reacquire a lock, however
unlikely that may be.

* Rename _bt_checksplitloc() to _bt_recordsplit(). I think that it
makes more sense to make that about recording a split point, rather
than about making sure a split point is legal. It still does that, but
perhaps 99%+ of calls to _bt_recordsplit()/_bt_checksplitloc() result
in the split being deemed legal, so the new name makes much more
sense.

--
Peter Geoghegan

Attachment

pgsql-hackers by date:

Previous
From: Stephen Frost
Date:
Subject: Re: Remove Deprecated Exclusive Backup Mode
Next
From: Haribabu Kommi
Date:
Subject: Re: [HACKERS] Block level parallel vacuum