Re: Inlining comparators as a performance optimisation - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Inlining comparators as a performance optimisation
Date
Msg-id CAEYLb_Us1cSxn94G2CyhSp+r1TxrghV6sj70htAtwL1K5cWLHg@mail.gmail.com
Whole thread Raw
In response to Re: Inlining comparators as a performance optimisation  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Inlining comparators as a performance optimisation
[PATCH] POC: inline int4 comparison in tuplesort
Re: Inlining comparators as a performance optimisation
List pgsql-hackers
On 20 September 2011 03:51, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Considering that -O2 is our standard optimization level, that
> observation seems to translate to "this patch will be useless in
> practice".  I think you had better investigate that aspect in some
> detail before spending more effort.

I don't think that the fact that that happens is at all significant at
this early stage, and it never even occurred to me that you'd think
that it might be. I was simply disclosing a quirk of this POC patch.
The workaround is probably to use a macro instead. For the benefit of
those that didn't follow the other threads, the macro-based qsort
implementation, which I found to perform significantly better than
regular qsort(), runs like this on my laptop when I built at 02 with
GCC 4.6 just now:

C stdlib quick-sort time elapsed: 2.092451 seconds
Inline quick-sort time elapsed: 1.587651 seconds

Does *that* look attractive to you? I've attached source code of the
program that produced these figures, which has been ported to C from
C++.

When I #define LARGE_SIZE 100000000, here's what I see:

[peter@peter inline_compar_test]$ ./a.out
C stdlib quick-sort time elapsed: 23.659411 seconds
Inline quick-sort time elapsed: 18.470611 seconds

Here, sorting with the function pointer/stdlib version takes about
1.28 times as long. In the prior test (with the smaller LARGE_SIZE),
it took about 1.32 times as long. Fairly predictable, linear, and not
to be sniffed at.

The variance I'm seeing across runs is low - a couple of hundredths of
a second at most. This is a Fedora 15 " Intel(R) Core(TM) i5-2540M CPU
@ 2.60GHz" machine. I'm not sure right now why the inline quick-sort
is less of a win than on my old Fedora 14 desktop (where it was 3.24
Vs 2.01), but it's still a significant win. Perhaps others can build
this simple program and tell me what they come up with.

>> This performance patch differs from most in that it's difficult in
>> principle to imagine a performance regression occurring.
>
> Really?  N copies of the same code could lead to performance loss just
> due to code bloat (ie, less of a query's inner loops fitting in CPU
> cache).

I did consider that. Of course inlining has an overhead, and I'll be
testing that each instance of inlining has a net benefit. I just meant
that many other performance patches have an obvious worst case, and I
think that it is policy to focus on that case, but I can't think of
one here. Does anyone else have any ideas?

> Not to mention the clear regression in maintainability.  So
> I'm disinclined to consider this sort of change without a significantly
> bigger win than you're suggesting above

Sure, there'll be some sort of regression in maintainability - I think
that HOT had a clear regression in maintainability too. The important
questions are obviously "how big is the loss of maintainability?", and
"is it worth it?". We'll know more when this work is actually shaped
into a proper patch. Perhaps I should have waited until I had
something along those lines before making an announcement, but I
wanted community input as early as possible. I think that there's
plenty of tweaking that can be done to get additional performance
improvements - all I've done so far is demonstrate that those
improvements are real and worth thinking about, in the fastest
possible way, partly because you expressed skepticism of the benefits
of inlining comparators to Greg Stark in an earlier thread.

Performance and maintainability are often somewhat in tension, but we
cannot ignore performance. If this work can bring us an improvement in
performance approaching the isolated macro Vs qsort() function pointer
benchmark, that's a *big* win. Sorting integers and floats is very
common and important.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

Attachment

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: PostgreSQL X/Open Socket / BSD Socket Issue on HP-UX
Next
From: Tom Lane
Date:
Subject: Re: EXPLAIN and nfiltered, take two