Performance optimization of btree binary search - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Performance optimization of btree binary search |
Date | |
Msg-id | CAM3SWZS=AoDCMqBhX8gf39Zk69ujkiotmjL_rDEqFQxztkt4wQ@mail.gmail.com Whole thread Raw |
Responses |
Re: Performance optimization of btree binary search
Re: Performance optimization of btree binary search |
List | pgsql-hackers |
Having nothing better to do over the holiday weekend, I decided to pursue a number of ideas for improving performance that I thought about a long time ago. These include: * Pre-fetching list node pointers. This looks to be moderately promising, but I'm certainly not going to be the one to land it, given present obligations. Stephen Frost may wish to pick it up, given his previous interest in the matter. This is slightly controversial, because it uses a GCC intrinsic (__builtin_prefetch), but also because the Linux kernel removed this optimization to their generic list data-structure [1]. However, that list was what we'd call an embedded list, so we should probably shouldn't be totally deterred. The amount of effort that I put into this was, frankly, pretty low. A motivated person, willing to do the appropriate analysis could probably bring it further. For one thing, every single foreach() has a call to this intrinsic, even where the list doesn't store pointers (which is not undefined). At the very least that's going to bloat things up, frequently for no conceivable gain, and yet with the patch applied we're still able to see see quite tangible benefits, even if it isn't exactly a stellar improvement. I have an idea that prefetching the last element at the start of the loop could be much better than what I did, because we know that those lists are mostly pretty small in practice, and that's likely to help pipelining - prefetching too late or even too early makes the optimization useless, because you may still get a cache miss. * Optimizing index scans - I noticed that binary searching accounted for many cache misses during a pgbench select.sql benchmark, instrumented with "perf record -e cache-misses". This warranted further investigation. I won't say anything further about the former optimization, except to note that it's included for comparative purposes in the set of benchmarks I've run (I haven't included a patch). The benchmark results are here: http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/results I took two approaches to the latter. This was the more interesting piece of work. Test sets include: * Master baseline (green) * List optimization (as mentioned above, not really relevant to the main topic of this mail) (red) * "fib btree", earlier patch, please disregard (blue) * "Fixed fib patch", Fibonacci search, no specialization (purple) * The interesting one, Finonacci search + specialization - "fib + no jump" (turquoise, see below for details) Initially, I had a little bit of success with Fibonnacci search [2] in place of binary search, in the hope that it would better take advantage of CPU cache characteristics - Fibonnacci search is said to have advantages where non-uniform memory access is an issue - it minimizes the final interval. I wasn't all that optimistic that it would work that well given the smallish size of BLCKSZ relative to modern CPU L1 cache sizes [3], but it did make an appreciable dent on its own. I suppose old habits die hard, because next I hacked up _bt_compare and had it do an int4btcmp directly, in the event of encountering a scankey that had as its comparator the relevant pg_proc oid. This is very much in the style (and the spirit) of the grotty early draft patches for the inlining-comparators-for-sorting patch. Patch is attached. This is a draft, a POC, posted only to facilitate discussion and to allow my results to be independently duplicated/verified. Note that there is a bug (attributable to the new search code) that causes the regression tests to fail in exactly one place (i.e. one line of difference). I didn't think it was worth deferring discussion to deal with that, though, since I don't think it undermines anything. I'm not sure how valuable the comparator trick is if we stick with binary search - I didn't test that. I'm sure it's a question that must be considered, though. I have a fairly huge amount of data here, having run plenty of benchmarks over several nights. The short version is that the 'select' benchmark has just over 18% greater throughput on this machine at some client counts (in particular, when there are 4 clients - there are 4 cores, but 8 logical cores) with the attached patch. There is a 3.5% regression with one client, which is certainly not accounted for by noise. Note, however, that latency appears consistently better with the patch applied. This is a machine running on dedicated hardware: a 4-core i7-4770. The working set easily fits in its 32GiB of DDR3 RAM at all pgbench scales tested (1, 10 and 100). The kernel used is "3.8.0-31-generic #46~precise1-Ubuntu SMP". Postgres settings are typical for this kind of thing (6GiB shared_buffers), but you can refer to my pgbench-tools results for full details (drill down to an individual pgbench run for that - they're all the same). I'm kind of curious as to what this benchmark would like like on a server with many more cores. I guess I could write a proper patch to have code setting up a scankey also set a flag that indicated that it was acceptable to assume that the special built-in comparator would do fine. I don't think that anything as involved as the sortsupport infrastructure is required here, because that presumed (perhaps questionably) that it was widely useful to provide more direct comparators, and that an extensible framework was warranted, whereas what I've done is likely to have very limited type-applicability. In general, people don't often put indexes on floating point columns (where we need to handle NaN comparisons in a special way in order to provide behavior consistent with various invariants that btree operator classes need to respect). But I'm reasonably confident that the majority of primary indexes in the wild are int4 or composites of int4, maybe even the large majority. I'd be happy with a scheme with only one built-in comparator, and allowed a few types to be cataloged such that it was indicated that just using the "built-in" comparator was acceptable, knowledge that could be passed to _bt_compare via the scankey. I'm thinking of just int4, and maybe date and a few other such int4 "covariant" types. I don't think that this has to be morally equivalent to violating the btree/AM abstraction by giving it direct knowledge of types - rather, I'm only proposing that we give it the facility to know that for the purposes of scankey comparison, a given type (well, some scankey that has been appropriately hinted by the index Relation or something like that) has Datums that compare like int4 Datums. Theoretically, it has nothing in particular to do with the cataloged int4 type as such - it has more to do with the fact that a comparison of 32-bit integers is a single instruction on popular ISAs. More practically speaking, this optimization seems likely to appreciably help many common cases. Obviously, I'm not proposing that it be committed more or less as-is, since it's entirely possible that the applicability of what I've done isn't universal; I don't presume that it's never counter-productive. How this could be most usefully applied is a discussion well worth having, though. To be frank: hopefully we won't have a protracted discussion this time. This stuff is not a high priority for me these days. Having mentioned CPU instructions, I should also say: as with the sorting stuff, this optimization has little to nothing to do with shaving some instructions from not having to go through the fmgr trampoline. The true cost for that trampoline is almost certainly paid in cache misses and, perhaps to a lesser extent, missed opportunities for successful branch prediction. Thoughts? [1] http://lwn.net/Articles/444336/ [2] http://www.mpri.lsu.edu/textbook/Chapter5-a.htm#fibonacci [3] http://www.pvk.ca/Blog/2012/07/30/binary-search-is-a-pathological-case-for-caches/ -- Peter Geoghegan
Attachment
pgsql-hackers by date: