Re: Inlining comparators as a performance optimisation - Mailing list pgsql-hackers
From: Peter Geoghegan
Subject: Re: Inlining comparators as a performance optimisation
Msg-id: CAEYLb_WAtXZ67BaPryQ6n_tRVf+2=HTU1dRkYK2DMdBo7n=1Fw@mail.gmail.com
In response to: Re: Inlining comparators as a performance optimisation (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Inlining comparators as a performance optimisation
List: pgsql-hackers
On 23 November 2011 19:24, Robert Haas <robertmhaas@gmail.com> wrote:
> Well, right now the decision as to which mechanism should be used here
> gets made in tuplesort_performsort(), which has no good way of
> communicating with EXPLAIN anyway.

You could pretty easily add something to Tuplesortstate to accomplish
this. That isn't an endorsement of doing so, but I'm not sure that it
isn't appropriate.

> Actually, I think that's a
> modularity violation; using the address of comparetup_heap as a flag
> value seems quite ugly. How about moving that logic up to
> tuplesort_begin_heap()

I'll post a patch that does just that in the next day or two.
Tuplesortstate has a pointer to a sort specialisation.

> and having it set some state inside the
> Tuplesort, maybe based on a flag in the opclass (or would it have to
> attach to the individual operator)?

I'm not sure that there's much point in such a flag.

> At least on my machine, your latest patch reliably crashes the
> regression tests in multiple places.
> TRAP: FailedAssertion("!(state->nKeys == 1)", File: "tuplesort.c", Line: 1261)

Yes, sorry about that. I should have been discriminating against
nKeys > 1 cases. As of this evening, for sorts with multiple scankeys,
I'm using the optimisation for the first scankey but not for subsequent
scankeys, which is frequently almost as good as having it for all
scankeys.

> The formatting of src/include/utils/template_qsort_arg.h is hard to
> read. At ts=8, the backslashes line up, but the code doesn't fit in
> 80 columns. If you set ts=4, then it fits in 80 columns, but the
> backslashes don't line up any more, and the variable declarations
> don't either. I believe ts=4 is project standard.

Fair enough. My working copy and .vimrc have been updated.

> I still think it would be a good idea to provide a mechanism to
> override heap_comparetup() with a type-specific function. I don't
> think that would take much extra code, and then any data type could
> get at least that much benefit out of this.

> It seems like it could be a good idea to do some
> per-assembler-instruction profiling of this code, and perhaps also of
> the original code. I'm curious where the time is being spent.

How would you go about doing that? The instrumentation that profilers
use actually caused a big drop in performance here when I attempted it a
few weeks ago. There's a kind of Heisenberg effect.

This optimisation *more than doubles* raw sort performance in these
cases. There is nothing contrived or cherry-picked about the query that
I selected to represent this optimisation - it was literally the first
one that I selected. Sometimes I see an even more marked gain than a
doubling of raw sort performance - I think my earlier experiments that
indicated a much smaller improvement past a certain point may have been
methodologically flawed. Sorry about that.

If I double up the data in the orderlines table a few times, until it
reaches 385 MB (duplicating every tuple with an INSERT INTO ... SELECT),
then warm the cache, I get very interesting results.
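(As an aside, since template_qsort_arg.h came up above: the toy sketch
below, with made-up names and an insertion-sort body to keep it short,
is only meant to illustrate the general technique of stamping out a
sort specialisation per comparator so that the comparison gets inlined
into the inner loop. It is not the patch itself.)

#include <stddef.h>
#include <stdio.h>

/*
 * Toy sketch only: the names and the insertion-sort body are
 * illustrative, not taken from the patch.  The idea is that the macro
 * generates a complete sort routine per comparator; because COMPARE is
 * expanded directly into the loop, the compiler can inline it and the
 * indirect call that qsort() pays on every comparison disappears.
 */
#define TEMPLATE_SORT(NAME, TYPE, COMPARE)                  \
static void                                                 \
NAME(TYPE *a, size_t n)                                     \
{                                                           \
    for (size_t i = 1; i < n; i++)                          \
    {                                                       \
        TYPE    key = a[i];                                 \
        size_t  j = i;                                      \
                                                            \
        while (j > 0 && COMPARE(a[j - 1], key) > 0)         \
        {                                                   \
            a[j] = a[j - 1];                                \
            j--;                                            \
        }                                                   \
        a[j] = key;                                         \
    }                                                       \
}

/* A comparison expression that the compiler can inline directly. */
#define INT4_CMP(x, y)  (((x) > (y)) - ((x) < (y)))

/* Instantiate a specialisation for plain ints. */
TEMPLATE_SORT(int4_sort, int, INT4_CMP)

int
main(void)
{
    int     v[] = {3, 1, 4, 1, 5, 9, 2, 6};

    int4_sort(v, sizeof(v) / sizeof(v[0]));
    for (size_t i = 0; i < sizeof(v) / sizeof(v[0]); i++)
        printf("%d ", v[i]);
    printf("\n");
    return 0;
}

The point is simply that int4_sort() ends up with the comparison
compiled into its inner loop, whereas qsort() called with a comparator
function pointer cannot inline anything. The real template_qsort_arg.h
generates a full quicksort rather than an insertion sort, but the shape
of the win is the same.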
Here are a few runs of the same old query, unoptimised and then
optimised (note that I've excluded some earlier cold-cache runs):

Before optimisation
==============
Total runtime: 7785.473 ms - 3.517310 secs just sorting
Total runtime: 8203.533 ms - 3.577193 secs just sorting
Total runtime: 8559.743 ms - 3.892719 secs just sorting
Total runtime: 9032.564 ms - 3.844746 secs just sorting
Total runtime: 9637.179 ms - 4.434431 secs just sorting
Total runtime: 9647.215 ms - 4.440560 secs just sorting
Total runtime: 9669.701 ms - 4.448572 secs just sorting

After optimisation
==============
Total runtime: 5462.419 ms - 1.169963 secs just sorting
Total runtime: 5510.660 ms - 1.234393 secs just sorting
Total runtime: 5511.703 ms - 1.208377 secs just sorting
Total runtime: 5588.604 ms - 1.175536 secs just sorting
Total runtime: 5899.496 ms - 1.250403 secs just sorting
Total runtime: 6023.132 ms - 1.338760 secs just sorting
Total runtime: 6717.177 ms - 1.486602 secs just sorting

This is an 800109kB sort. So, taking the median value as representative
here, that looks to be just shy of a 40% improvement, or about 3.4
seconds. My /proc/cpuinfo is attached on the off chance that someone is
interested in it.

More work is needed here, but this seems promising. It will be
interesting to see how far all of this can be taken with
comparetup_index_btree. Certainly, I'm sure there's some gain to be had
there by applying lessons learned from comparetup_heap.

-- 
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services