Re: Inlining comparators as a performance optimisation - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: Inlining comparators as a performance optimisation |
Date | |
Msg-id | CAEYLb_WAmQsZU-g2auy0oWWCj2GzoxqUz4EmiUQRdS8ewh9LTA@mail.gmail.com |
In response to | Re: Inlining comparators as a performance optimisation (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Inlining comparators as a performance optimisation
Re: Inlining comparators as a performance optimisation
Re: Inlining comparators as a performance optimisation |
List | pgsql-hackers |
I've produced something much neater than my first patch, attached, although I still consider this to be at the POC stage, not least because I'm not exactly sure how I should be selecting the right specialisation in tuplesort_performsort() (a hack is used right now that does a direct pg_proc OID comparison), and also because I haven't implemented anything other than qsort for heap tuples yet (a minor detail, I suppose). I'm pleased to say that a much clearer picture of what's really going on here has emerged.

Statistics: total runtime according to EXPLAIN ANALYZE for the query "select * from orderlines order by prod_id" (dellstore2 sample db), at GCC 4.5's -O2 optimisation level, after warming the cache, on my desktop:

Without the patch:                                        ~82ms
With the patch, but with the "inline" keyword commented
out for all new functions/meta-functions:                 ~60ms
With the patch, unmodified:                               ~52ms

Recent experience suggests that using a macro rather than an inline function would have been a mistake, as it would prevent us from benefiting from the compiler's smarts in deciding just where we should inline. As I've pointed out, individual call sites are inlined, not individual functions.

Tom wanted to know if I could prove the benefit of inlining, as against just resolving the impedance mismatch without bothering with inlining (and, indeed, without the further optimisations that the compiler can avail of as a result of knowing the comparator at compile time). These figures support my contention that those optimisations have an important role to play. However, there's no question that resolving the impedance mismatch is where the biggest benefit is seen, particularly relative to maintainability. As Tom anticipated, the question of whether or not we do the inlining is less than entirely straightforward, as it's less of a win, and it has a higher price in ugliness. I myself am very much of the opinion that it's still worth it.
As I've said, sorting integers and floats is very common and very important, and this approach will likely yield benefits beyond increasing the speed of executor sort nodes, as described below. I accept that a more sophisticated benchmark is required here, but I've been a little busy actually writing the patch. Any ideas, anyone? Independent benchmarks are welcome. If someone could suggest a worst case, that would be particularly useful, as I cannot think of one - I believe that each instance of inlining a call site more-or-less either is or is not a net benefit here, regardless of the number of comparisons performed, which is why the improvement is so predictable across the size of the set of integers sorted by qsort in isolation (and which is the big reason why that ~8ms decrease due to inlining turns out to be not too shabby - it's a saving per comparison, and those can add up pretty quickly). So, for example, when I build the patch on my laptop (GCC 4.6), with a 48MB table (the same orderlines table query, but I doubled up the data a few times beforehand), we see an undiminished decrease in runtime, proportional (to the prior, isolated cost of qsorting, I think):

Without patch: ~615ms
With patch:    ~415ms

One key insight that this work brings is that resolving the impedance mismatch - making comparisons as inexpensive as possible (including by inlining) - is the best possible strategy for improving sort performance today, vastly more efficacious than, for example, tweaking the sorting algorithm. If anyone can find some more ways of shaving cycles there, it's one place where it really matters. The changes are isolated to the extent that if you decide one day that you don't like this, you can simply remove the calls to the qsort_arg specialisations, and once again switch back to just using what the patch makes a fallback, qsort_arg itself.
I know that we'd never really get into that situation, but the fact that we could do that serves to illustrate that these changes are fairly isolated. Incidentally, if you find writing code that is heavily dependent on macros to be a chore, I can highly recommend Clang 2.9.

But what of the maintenance burden of mostly duplicating qsort_arg.c, to produce a "template" version? Well, take a look at the *entire* history for that file:

[peter@localhost port]$ git log qsort_arg.c

commit 9f2e211386931f7aee48ffbc2fcaef1632d8329f
Author: Magnus Hagander <magnus@hagander.net>
Date:   Mon Sep 20 22:08:53 2010 +0200

    Remove cvs keywords from all files.

commit b9954fbb4ef25fb1ea173d26017d4d128dd15be5
Author: Neil Conway <neilc@samurai.com>
Date:   Sun Mar 18 05:36:50 2007 +0000

    Code cleanup for function prototypes: change two K&R-style prototypes
    to ANSI-style, and change "()" -> "(void)". Patch from Stefan Huehner.

commit b38900c7677657a815e75781b776fb1e41054df3
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Thu Oct 12 15:04:55 2006 +0000

    Use Min() instead of min() in qsort, for consistency and to avoid
    redefined-macro warnings on some platforms. Per gripe from Hiroshi Saito.

commit f99a569a2ee3763b4ae174e81250c95ca0fdcbb6
Author: Bruce Momjian <bruce@momjian.us>
Date:   Wed Oct 4 00:30:14 2006 +0000

    pgindent run for 8.2.

commit 6edd2b4a91bda90b7f0290203bf5c88a8a8504db
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Tue Oct 3 22:18:23 2006 +0000

    Switch over to using our own qsort() all the time, as has been
    proposed repeatedly.

***SNIP MESSAGE***

I think that it is fair to say that the maintenance burden imposed by this change is well worth it. Only one of these changes after the initial commit is not mechanical, and that is still pretty trivial.
I've attached gprof flat profile output from unmodified PostgreSQL (at master) when we create an index on a pgbench table:

pgbench -i -s 1500 index_test

That puts the number of rows in the pgbench_accounts table at 150 million, while the table is 19 GB in size. That's a reasonable size, and should usefully demonstrate the likely improvement we'll see in the performance of creating an index. I increased maintenance_work_mem to 756MB, and otherwise used a fairly straightforward postgresql.conf, with generic values for the other GUCs. The query is:

create INDEX on pgbench_accounts(abalance); -- abalance is of type integer

Here is the top of the flat profile, by far the most important part:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
43.56     79.29    79.29 6972025063    0.00     0.00  comparetup_index_btree
22.39    120.04    40.75  150000000    0.00     0.00  tuplesort_heap_siftup
 4.36    127.97     7.93 2677057771    0.00     0.00  btint4cmp
 2.88    133.21     5.24  314766350    0.00     0.00  LWLockRelease
 1.81    136.50     3.29  314612660    0.00     0.00  LWLockAcquire
 1.65    139.50     3.00  450905247    0.00     0.00  AllocSetAlloc
 1.43    142.10     2.60  300000001    0.00     0.00  LogicalTapeWrite
 1.17    144.23     2.13                              comparetup_cluster
*** SNIP ***

This looks pretty top-heavy to me, and more time is spent in comparetup_index_btree than in any other function, by some margin, suggesting that it will be worthwhile to pursue similar optimisations to make index creation faster in a later revision of this patch, or as another patch. I didn't take care to ensure that I'd be caching the entire table in memory, which probably would have been more useful here. Thoughts?

On the subject of highly ambitious optimisations to sorting, one possibility I consider much more practicable than GPU-accelerated sorting is simple threading; quicksort can be parallelised very effectively, due to its divide-and-conquer nature.
If we could agree on a threading abstraction more sophisticated than forking, it's something I'd be interested in looking at. Doing so would obviously entail lots of discussion about how it relates to whatever way we eventually decide to implement parallel query, and that's obviously a difficult discussion.

--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services