Performance optimization of btree binary search - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Performance optimization of btree binary search
Date
Msg-id CAM3SWZS=AoDCMqBhX8gf39Zk69ujkiotmjL_rDEqFQxztkt4wQ@mail.gmail.com
Having nothing better to do over the holiday weekend, I decided to
pursue a number of ideas for improving performance that I thought
about a long time ago. These include:

* Pre-fetching list node pointers. This looks to be moderately
promising, but I'm certainly not going to be the one to land it, given
present obligations. Stephen Frost may wish to pick it up, given his
previous interest in the matter. This is slightly controversial,
because it uses a GCC intrinsic (__builtin_prefetch), but also because
the Linux kernel removed this optimization from their generic list
data structure [1]. However, that list was what we'd call an embedded
list, so we probably shouldn't be totally deterred. The amount
of effort that I put into this was, frankly, pretty low. A motivated
person, willing to do the appropriate analysis, could probably take it
further. For one thing, every single foreach() has a call to this
intrinsic, even where the list doesn't store pointers (which is not
undefined behavior, but is useless). At the very least that's going to
bloat things up, frequently for no conceivable gain, and yet with the
patch applied we're still able to see quite tangible benefits, even if
it isn't
exactly a stellar improvement. I have an idea that prefetching the
last element at the start of the loop could be much better than what I
did, because we know that those lists are mostly pretty small in
practice, and that's likely to help pipelining - prefetching too late
or even too early makes the optimization useless, because you may
still get a cache miss.
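
To make the shape of the experiment concrete, here is a minimal sketch
of prefetching during a linked-list walk. The struct and function names
are made up for illustration; this is not the actual pg_list.h code,
just the general technique of issuing __builtin_prefetch a step or two
ahead of the node currently being visited:

```c
#include <stddef.h>

/*
 * Hypothetical singly linked list, loosely mirroring the shape of a
 * PostgreSQL ListCell.  Illustration only; not the pg_list.h API.
 */
typedef struct Cell
{
    int          datum;
    struct Cell *next;
} Cell;

static int
sum_list(const Cell *head)
{
    int         sum = 0;
    const Cell *c;

    for (c = head; c != NULL; c = c->next)
    {
#if defined(__GNUC__)
        /*
         * Hint that we'll soon read the node after next, so the fetch
         * can overlap with processing the current node.  Prefetching
         * too late (or too early) buys nothing, as noted above.
         */
        if (c->next)
            __builtin_prefetch(c->next->next);
#endif
        sum += c->datum;
    }
    return sum;
}
```

Prefetching the tail element up front, as speculated above, would be a
different (and possibly better) placement of the same intrinsic.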

* Optimizing index scans - I noticed that binary searching accounted
for many cache misses during a pgbench select.sql benchmark,
instrumented with "perf record -e cache-misses". This warranted
further investigation.

I won't say anything further about the former optimization, except to
note that it's included for comparative purposes in the set of
benchmarks I've run (I haven't included a patch). The benchmark
results are here:

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/results

I took two approaches to the latter. This was the more interesting
piece of work. Test sets include:

* Master baseline (green)

* List optimization (as mentioned above, not really relevant to the
main topic of this mail) (red)

* "fib btree", earlier patch, please disregard (blue)

* "Fixed fib patch", Fibonacci search, no specialization (purple)

* The interesting one, Fibonacci search + specialization - "fib + no
jump"  (turquoise, see below for details)

Initially, I had a little bit of success with Fibonacci search [2] in
place of binary search, in the hope that it would better take
advantage of CPU cache characteristics - Fibonacci search is said to
have advantages where non-uniform memory access is an issue, since it
minimizes the final search interval. I wasn't all that optimistic that it
would work that well given the smallish size of BLCKSZ relative to
modern CPU L1 cache sizes [3], but it did make an appreciable dent on
its own. I suppose old habits die hard, because next I hacked up
_bt_compare and had it do an int4btcmp directly, in the event of
encountering a scankey that had as its comparator the relevant pg_proc
oid. This is very much in the style (and the spirit) of the grotty
early draft patches for the inlining-comparators-for-sorting patch.
Patch is attached. This is a draft, a POC, posted only to facilitate
discussion and to allow my results to be independently
duplicated/verified. Note that there is a bug (attributable to the new
search code) that causes the regression tests to fail in exactly one
place (i.e. one line of difference). I didn't think it was worth
deferring discussion to deal with that, though, since I don't think it
undermines anything.
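
For anyone unfamiliar with the technique, here is the textbook form of
Fibonacci search over a sorted int array - an illustration only, not
the code in the attached patch, which operates on index pages rather
than a plain array:

```c
/*
 * Textbook Fibonacci search: probe points are chosen so that the
 * interval shrinks by Fibonacci-sized steps rather than halving.
 * Returns the index of 'x' in sorted arr[0..n-1], or -1 if absent.
 */
static int
fib_search(const int *arr, int n, int x)
{
    int fib2 = 0;               /* fib(k-2) */
    int fib1 = 1;               /* fib(k-1) */
    int fib = fib2 + fib1;      /* fib(k), grown to smallest >= n */
    int offset = -1;            /* everything at or before this is < x */

    while (fib < n)
    {
        fib2 = fib1;
        fib1 = fib;
        fib = fib2 + fib1;
    }

    while (fib > 1)
    {
        /* probe one fib(k-2)-sized step past the eliminated prefix */
        int i = (offset + fib2 < n - 1) ? offset + fib2 : n - 1;

        if (arr[i] < x)
        {
            fib = fib1;         /* discard the low third */
            fib1 = fib2;
            fib2 = fib - fib1;
            offset = i;
        }
        else if (arr[i] > x)
        {
            fib = fib2;         /* discard the high two-thirds */
            fib1 = fib1 - fib2;
            fib2 = fib - fib1;
        }
        else
            return i;
    }

    if (fib1 && offset + 1 < n && arr[offset + 1] == x)
        return offset + 1;
    return -1;
}
```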

I'm not sure how valuable the comparator trick is if we stick with
binary search - I didn't test that. I'm sure it's a question that must
be considered, though.

I have a fairly huge amount of data here, having run plenty of
benchmarks over several nights. The short version is that the 'select'
benchmark has just over 18% greater throughput on this machine at some
client counts (in particular, when there are 4 clients - there are 4
cores, but 8 logical cores) with the attached patch. There is a 3.5%
regression with one client, which is certainly not accounted for by
noise. Note, however, that latency appears consistently better with
the patch applied. This is a machine running on dedicated hardware: a
4-core i7-4770. The working set easily fits in its 32GiB of DDR3 RAM
at all pgbench scales tested (1, 10 and 100). The kernel used is
"3.8.0-31-generic #46~precise1-Ubuntu SMP". Postgres settings are
typical for this kind of thing (6GiB shared_buffers), but you can
refer to my pgbench-tools results for full details (drill down to an
individual pgbench run for that - they're all the same). I'm kind of
curious as to what this benchmark would look like on a server with
many more cores.

I guess I could write a proper patch to have code setting up a scankey
also set a flag that indicated that it was acceptable to assume that
the special built-in comparator would do fine. I don't think that
anything as involved as the sortsupport infrastructure is required
here, because that presumed (perhaps questionably) that it was widely
useful to provide more direct comparators, and that an extensible
framework was warranted, whereas what I've done is likely to have very
limited type-applicability. In general, people don't often put indexes
on floating point columns (where we need to handle NaN comparisons in
a special way in order to provide behavior consistent with various
invariants that btree operator classes need to respect). But I'm
reasonably confident that the majority of primary indexes in the wild
are int4 or composites of int4, maybe even the large majority.

I'd be happy with a scheme that had only one built-in comparator, and
allowed a few types to be cataloged as safe to compare with it,
knowledge that could be passed to _bt_compare via the scankey. I'm
thinking of just int4, and maybe date and a few other such int4
"covariant" types.
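
Roughly, the scheme I have in mind looks like the sketch below. Every
name here (ScanKeySketch, SK_INT4_FAST, bt_compare_datum) is
hypothetical, not the actual PostgreSQL scankey API; the point is just
that scankey setup marks keys whose comparator is known to behave like
int4 comparison, letting the search loop skip the fmgr call:

```c
#include <stddef.h>
#include <stdint.h>

/* hypothetical flag: datums compare like 32-bit signed integers */
#define SK_INT4_FAST 0x0001

typedef int32_t (*cmp_fn) (uintptr_t a, uintptr_t b);

/* hypothetical, much-simplified stand-in for a btree scankey */
typedef struct ScanKeySketch
{
    uint32_t   flags;
    cmp_fn     comparator;      /* fmgr-style indirect call otherwise */
    uintptr_t  argument;        /* the search key datum */
} ScanKeySketch;

/*
 * Returns <0, 0, >0 as the scankey argument compares to the index
 * datum.  The fast path is a single comparison with no trampoline.
 */
static inline int32_t
bt_compare_datum(const ScanKeySketch *key, uintptr_t index_datum)
{
    if (key->flags & SK_INT4_FAST)
    {
        int32_t a = (int32_t) key->argument;
        int32_t b = (int32_t) index_datum;

        return (a > b) - (a < b);
    }
    return key->comparator(key->argument, index_datum);
}
```

An index Relation (or whatever hints the scankey) would be responsible
for setting the flag only for int4 and the handful of covariant types.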

I don't think that this has to be morally equivalent to violating the
btree/AM abstraction by giving it direct knowledge of types - rather,
I'm only proposing that we give it the facility to know that for the
purposes of scankey comparison, a given type (well, some scankey that
has been appropriately hinted by the index Relation or something like
that) has Datums that compare like int4 Datums. Theoretically, it has
nothing in particular to do with the cataloged int4 type as such - it
has more to do with the fact that a comparison of 32-bit integers is a
single instruction on popular ISAs. More practically speaking, this
optimization seems likely to appreciably help many common cases.
Obviously, I'm not proposing that it be committed more or less as-is,
since it's entirely possible that the applicability of what I've done
isn't universal; I don't presume that it's never counter-productive.
How this could be most usefully applied is a discussion well worth
having, though. To be frank: hopefully we won't have a protracted
discussion this time. This stuff is not a high priority for me these
days.

Having mentioned CPU instructions, I should also say: as with the
sorting stuff, this optimization has little to nothing to do with
shaving some instructions from not having to go through the fmgr
trampoline. The true cost for that trampoline is almost certainly paid
in cache misses and, perhaps to a lesser extent, missed opportunities
for successful branch prediction.

Thoughts?

[1] http://lwn.net/Articles/444336/

[2] http://www.mpri.lsu.edu/textbook/Chapter5-a.htm#fibonacci

[3] http://www.pvk.ca/Blog/2012/07/30/binary-search-is-a-pathological-case-for-caches/
--
Peter Geoghegan

