Progress on fast path sorting, btree index creation time

From: Peter Geoghegan

I'll try to keep this terse. I've promised to justify the number of
specialisations that are in my fast-path sorting patch, and I may yet
conclude that a reduction is appropriate. Not today though - there are
quite a few ideas in the patch (even more now), and not all have been
exhaustively evaluated. Maybe that isn't what some had hoped for right
now, but deferring a public and definitive answer to those questions
in order to focus on tweaking the patch has brought additional
benefits. I felt
that I went too long without checking in with the community though.
The purpose of this e-mail is to:

1. Report back a further improvement in the performance benefits seen
when sorting with multiple sortkeys. The performance benefits seen for
that case were previously relatively modest; they're better now.

2. Report improvements from applying these techniques to btree index
tuplesorting (short version: they're also quite good). The data used
is exactly the same as it was in my previous benchmark; orderlines is
1538MB, and has lots of duplicates. The environment is also identical.

3. Resolve two anticipated controversies that are, respectively,
somewhat orthogonal and completely orthogonal to the binary bloat
controversy. The first (controversy A) is that I have added a new
piece of infrastructure, pg_always_inline, which, as the name
suggests, is a portable way of insisting that a function should be
invariably inlined. Portable libraries like libc have been doing this
for a long time now, and I actually use the libc macro (which expands
to __attribute__((always_inline))) where possible. The second
possible point of contention (controversy B) is that I have jettisoned
various protections against bad qsort implementations that I believe
are a legacy of when we used the system qsort (pre-2006), and that can no
longer be justified. For example, quick sort performing badly in the
face of lots of duplicates is a well understood problem today
(partitioning should stop on equal keys), and ISTM that protecting
against that outside the qsort implementation (but only for index
sorts) is wrong-headed.
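
To make controversy A concrete, here is a minimal sketch of how such a
macro could be defined (my illustration only - not necessarily the
exact spelling in the patch; the MSVC equivalent keyword is
__forceinline, and the fallback degrades to a plain inline hint):

    #if defined(__GNUC__)
    #define pg_always_inline inline __attribute__((always_inline))
    #elif defined(_MSC_VER)
    #define pg_always_inline __forceinline
    #else
    #define pg_always_inline inline
    #endif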

The first possibly controversial adjustment (controversy A) is clearly
also the reason for (1) - that has been well isolated. I haven't got
around to quantifying the performance improvement seen due to the
second possibly controversial adjustment (controversy B), but I
believe that it can be justified as a case of effectively removing
redundant code. After all, we now assert against comparing a tuple to
itself, and the "cheap insurance" that existed in
comparetup_index_btree was never present in any form in
comparetup_index_heap, and we heard no complaints, AFAIK.

Attached are:

* A detailing of libc's use of __always_inline /
__attribute__((always_inline)). Note that it often appears in code
that is built for all platforms.

* The WIP patch itself, rebased to integrate with the new SortSupport
infrastructure. I've gone a bit crazy with btree specialisations, but
I suspect that's where it matters least and makes most sense.

* A new Python script for benchmarking index creation, similar to the
one I previously posted for benchmarking the sorting of heap tuples.

* A spreadsheet that shows the results of re-running my earlier heap
tuple sorting benchmark with this new patch. The improvement in the
query that orders by 2 columns is all that is pertinent there, when
considering the value of (1) and the sense in standing still for
controversy A.

* A spreadsheet that shows the difference in index creation times,
generated with the help of the new python script.

Thoughts?

I had another idea when writing this patch that I haven't developed at
all but I'll share anyway. That is, it might be a good idea to use a
balanced quicksort:

http://xlinux.nist.gov/dads//HTML/balancedqsrt.html

It might be possible to get a reasonable approximation of the actual
median value of a given column with existing statistics, which could
be hinted to qsort_arg. This would do a better job of guessing an
appropriate initial pivot value for qsort than the med3 sampling
technique (what we do now, advocated by Sedgewick: use the median of
the first, middle and last elements of the current partition), though
we'd still use that med3 technique to select all other pivots, and
perhaps signal to qsort_arg "you're on your own, fall back on med3 for
the initial pivot" with a null ptr, according to some heuristic.
There's obviously a not inconsiderable impedance mismatch to resolve
if we're to do that though, so that Tuplesortstate has a pointer to
the median SortTuple.
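
To illustrate, here is a rough sketch of the kind of pivot selection I
have in mind (all names hypothetical; qsort_arg's actual code differs):

    typedef int (*cmp_fn) (const void *a, const void *b, void *arg);

    /* Conventional median-of-three, as used today. */
    static char *
    med3(char *a, char *b, char *c, cmp_fn cmp, void *arg)
    {
        return cmp(a, b, arg) < 0
            ? (cmp(b, c, arg) < 0 ? b : (cmp(a, c, arg) < 0 ? c : a))
            : (cmp(b, c, arg) > 0 ? b : (cmp(a, c, arg) < 0 ? a : c));
    }

    /*
     * Hypothetical: at the top level only, prefer a caller-supplied
     * approximate median (perhaps derived from pg_statistic); a NULL
     * hint means "you're on your own, fall back on med3".
     */
    static char *
    choose_pivot(char *lo, char *mid, char *hi, char *hint,
                 int toplevel, cmp_fn cmp, void *arg)
    {
        if (toplevel && hint != NULL)
            return hint;
        return med3(lo, mid, hi, cmp, arg);
    }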

Can I get a straw poll on how much of a problem worst-case performance
of qsort is believed to be?

In a perfect world, if it were somehow possible to know the perfect
pivots ahead of time from a histogram or something, we'd have a quick
sort variant with worst-case performance of O(n log(n)). That, or the
limited approximation that I've sketched would perhaps be worthwhile,
even if it was subject to a number of caveats. Wikipedia claims that
the worst case for quick sort is O(n log(n)) with the refinements
recommended by Sedgewick's 1978 paper, but this seems like nonsense to
me - the med3 technique is a good heuristic in practice, but it's
perfectly possible in theory for its sampling to always get things
completely wrong (this is not an unfamiliar problem). How often it
messes up in the real world and how much it matters is something that
I wouldn't like to speculate on, though it is the main factor that
will determine if the idea is worth pursuing. I can tell you that I
have heard one person observe that it had an unpredictable runtime.
However, they might not have noticed if this patch was applied,
because in general the worse quicksort does the better this patch
does. Also, the fact that we use a median of "medians" when n > 40
(Dr. Sedgewick again, if I'm not mistaken) makes me a little skeptical
of that claim. Come to think of it, it might not be a bad idea to add
a bunch of comments to qsort_arg while I have this swapped into my
head, as it currently has no comments at all.

While I'm thinking out loud, here's another idea: Have a GUC which,
when enabled (perhaps potentially at various different granularities),
makes Timsort the in-memory sorting algorithm, as it now is by default
for Python, Java SE7 (for arrays) and the Android platform; for
certain applications, this could be a real win, and I don't think that
it would have to be invasive: it could be dropped in.
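
By "dropped in", I mean something like the following at tuplesort's
in-memory sort call site (the GUC and timsort_arg are hypothetical -
nothing like them exists today):

    /* in tuplesort_performsort(), roughly where qsort_arg is called */
    if (sort_algorithm == SORT_ALGORITHM_TIMSORT)   /* hypothetical GUC */
        timsort_arg((void *) state->memtuples, state->memtupcount,
                    sizeof(SortTuple),
                    (qsort_arg_comparator) state->comparetup,
                    (void *) state);
    else
        qsort_arg((void *) state->memtuples, state->memtupcount,
                  sizeof(SortTuple),
                  (qsort_arg_comparator) state->comparetup,
                  (void *) state);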

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: Progress on fast path sorting, btree index creation time

From: Merlin Moncure

On Thu, Dec 29, 2011 at 8:03 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> * A spreadsheet that shows the results of re-running my earlier heap
> tuple sorting benchmark with this new patch. The improvement in the
> query that orders by 2 columns is all that is pertinent there, when
> considering the value of (1) and the sense in standing still for
> controversy A.
>
> * A spreadsheet that shows the difference in index creation times,
> generated with the help of the new python script.

very nice.  let me save everyone the effort of opening his
spreadsheets (which by the way both show 'HEAD/unoptimized' --
probably not what you meant): he's showing a consistent ~50% reduction
in running time of sort driven queries -- that's money.

merlin


Re: Progress on fast path sorting, btree index creation time

From: Peter Geoghegan

On 30 December 2011 19:46, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Thu, Dec 29, 2011 at 8:03 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
>> * A spreadsheet that shows the results of re-running my earlier heap
>> tuple sorting benchmark with this new patch. The improvement in the
>> query that orders by 2 columns is all that is pertinent there, when
>> considering the value of (1) and the sense in standing still for
>> controversy A.
>>
>> * A spreadsheet that shows the difference in index creation times,
>> generated with the help of the new python script.
>
> very nice.  let me save everyone the effort of opening his
> spreadsheets (which by the way both show 'HEAD/unoptimized' --
> probably not what you meant): he's showing a consistent ~50% reduction
> in running time of sort driven queries -- that's money.

Sorry, I think you may have misinterpreted the results, which is my
fault - I introduced a formatting error. In the case of the "btree"
spreadsheet, the first query on each sheet should be "create index
test on orderlines (prod_id);", and not "select * from orderlines
order by prod_id". The idea is to compare the results from each set of
binaries across pages of the spreadsheet (note that there are two
tabs). You should not compare anything between the two spreadsheets.
Revised btree results attached. The heap results that I posted do not
have any formatting errors, so they have not been revised.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: Progress on fast path sorting, btree index creation time

From: Merlin Moncure

On Fri, Dec 30, 2011 at 2:30 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> On 30 December 2011 19:46, Merlin Moncure <mmoncure@gmail.com> wrote:
>> On Thu, Dec 29, 2011 at 8:03 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
>>> * A spreadsheet that shows the results of re-running my earlier heap
>>> tuple sorting benchmark with this new patch. The improvement in the
>>> query that orders by 2 columns is all that is pertinent there, when
>>> considering the value of (1) and the sense in standing still for
>>> controversy A.
>>>
>>> * A spreadsheet that shows the difference in index creation times,
>>> generated with the help of the new python script.
>>
>> very nice.  let me save everyone the effort of opening his
>> spreadsheets (which by the way both show 'HEAD/unoptimized' --
>> probably not what you meant): he's showing a consistent ~50% reduction
>> in running time of sort driven queries -- that's money.
>
> Sorry, I think you may have misinterpreted the results, which is my
> fault - I introduced a formatting error. In the case of the "btree"
> spreadsheet, the first query on each sheet should be "create index
> test on orderlines (prod_id);", and not "select * from orderlines
> order by prod_id". The idea is to compare the results from each set of
> binaries across pages of the spreadsheet (note that there are two
> tabs). You should not compare anything between the two spreadsheets.
> Revised btree results attached. The heap results that I posted do not
> have any formatting errors, so they have not been revised.

right-- my bad. still, that's 31-37% -- still pretty nice.

merlin


Re: Progress on fast path sorting, btree index creation time

From: Robert Haas

On Thu, Dec 29, 2011 at 9:03 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> 3. Resolve two anticipated controversies that are, respectively,
> somewhat orthogonal and completely orthogonal to the binary bloat
> controversy. The first (controversy A) is that I have added a new
> piece of infrastructure, pg_always_inline, which, as the name
> suggests, is a portable way of insisting that a function should be
> invariably inlined. Portable libraries like libc have been doing this
> for a long time now, and I actually use the libc macro (that expands
> to __attribute__((always_inline)) ) where possible.

I don't have a problem with the idea of a pg_always_inline, but I'm
wondering what sort of fallback mechanism you propose.  It seems to me
that if the performance improvement is conditioned on inlining being done
and we're not confident that we can force the compiler to inline, we
ought to fall back to not making the specialization available, rather
than to making something available that may not work well.  Of course
if it's only a question of a smaller performance gain, rather than an
actual loss, then that's not as much of an issue...

> The second
> possible point of contention (controversy B) is that I have jettisoned
> various protections against bad qsort implementations that I believe
> are a legacy of when we used the system qsort (pre-2006), and that can no
> longer be justified. For example, quick sort performing badly in the
> face of lots of duplicates is a well understood problem today
> (partitioning should stop on equal keys), and ISTM that protecting
> against that outside the qsort implementation (but only for index
> sorts) is wrong-headed.

Assuming you mean that qsort_arg() will protect against this stuff so
that callers don't need to do it separately, +1.

> I had another idea when writing this patch that I haven't developed at
> all but I'll share anyway. That is, it might be a good idea to use a
> balanced quicksort:
>
> http://xlinux.nist.gov/dads//HTML/balancedqsrt.html

I can't find any description of how this actually works... obviously,
we'd like to find a close-to-median element, but how do we actually do
that cheaply?

> It might be possible to get a reasonable approximation of the actual
> median value of a given column with existing statistics, which could
> be hinted to qsort_arg.

I doubt that this would be worthwhile -- the pivot that we pick at the
toplevel doesn't really matter much; even if it's bad, we can recover
just fine if the pivot we pick one level down is good, or the level
after that.  We only really have a problem if we keep systematically
picking bad pivots every time.  And if we do have that problem,
improving the toplevel choice of pivot is only going to help modestly,
because we'll still be O(n^2) on each partition.

> Can I get a straw poll on how much of a problem worst-case performance
> of qsort is believed to be?

I'd be reluctant to remove any of the protections we currently have,
because I bet they were all put in to fix problems that people hit in
the field.  But I don't know that we need more.  Of course,
consolidating them into qsort() itself rather than its callers
probably makes sense, as noted above.

> In a perfect world, if it were somehow possible to know the perfect
> pivots ahead of time from a histogram or something, we'd have a quick
> sort variant with worst-case performance of O(n log(n)). That, or the
> limited approximation that I've sketched would perhaps be worthwhile,
> even if it was subject to a number of caveats. Wikipedia claims that
> the worst case for quick sort is O(n log(n)) with the refinements
> recommended by Sedgewick's 1978 paper, but this seems like nonsense to
> me - the med3 technique is a good heuristic in practice, but it's
> perfectly possible in theory for its sampling to always get things
> completely wrong (this is not an unfamiliar problem).

I agree.  After reading
http://nanxkurniawan.wordpress.com/2010/05/31/quicksort/ I am inclined
to think that things are still O(n lg n) even if we always partition
so that any fixed percentage of the data is on one side of the
partition element and the remainder is on the other side.  So for
example if all of our partition elements are greater than 1% of the
elements in their partition and less than the other 99%, we'll still
be O(n lg n), though with a considerably higher constant factor, I
think.  However, if our partition elements are always a fixed NUMBER
of elements from the end - whether it's 1 or 100 - then we'll be
O(n^2), though again the constant factor will depend on how many
elements you knock off each time.  In practice I'm not sure whether an
algorithmic analysis matters much, because in real life the constant
factors are probably pretty important, hence the median-of-medians
approach with n > 40.
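
(To spell the two cases out: with a fixed fraction c, the recurrence is
T(n) = T(cn) + T((1-c)n) + O(n), whose recursion depth is roughly
log(n)/log(1/(1-c)), hence O(n log n) with a constant that blows up as
c approaches 0 or 1; with a fixed number k of elements split off each
time, it is T(n) = T(n-k) + O(n), which sums to O(n^2/k).)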

> While I'm thinking out loud, here's another idea: Have a GUC which,
> when enabled (perhaps potentially at various different granularities),
> makes Timsort the in-memory sorting algorithm, as it now is by default
> for Python, Java SE7 (for arrays) and the Android platform; for
> certain applications, this could be a real win, and I don't think that
> it would have to be invasive: it could be dropped in.

I gather from a quick read that this is supposed to be especially good
when the data is already somewhat sorted.  Would there be any merit in
trying to guess when that's likely to be the case (e.g. based on the
correlation statistic)?   That seems like a stretch, but OTOH a GUC
doesn't feel great either: what is the poor user to do with a query
that does 2 sorts, one of which is faster with QS and the other of
which is faster with Timsort?  Ugh.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Progress on fast path sorting, btree index creation time

From: Tom Lane

Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Dec 29, 2011 at 9:03 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
>> The first (controversy A) is that I have added a new
>> piece of infrastructure, pg_always_inline, which, as the name
>> suggests, is a portable way of insisting that a function should be
>> invariably inlined.

> I don't have a problem with the idea of a pg_always_inline, but I'm
> wondering what sort of fallback mechanism you propose.

There is no compiler anywhere that implements "always inline", unless
you are talking about a macro.  "inline" is a hint and nothing more,
and if you think you can force it you are mistaken.  So this controversy
is easily resolved: we do not need any such construct.

The real question is whether we should accept a patch that is a
performance loss when the compiler fails to inline some reasonably
simple function.  I think that would depend on the actual numbers
involved, so we'd need to see data before making a decision.

>> The second
>> possible point of contention (controversy B) is that I have jettisoned
>> various protections against bad qsort implementations that I believe
>> are a legacy of when we used the system qsort (pre-2006), and that can no
>> longer be justified.

No objection to that one, I think, as long as you've verified that our
implementation is in fact okay about these things.
        regards, tom lane


Re: Progress on fast path sorting, btree index creation time

From: Peter Geoghegan

On 5 January 2012 22:27, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> There is no compiler anywhere that implements "always inline", unless
> you are talking about a macro.  "inline" is a hint and nothing more,
> and if you think you can force it you are mistaken.  So this controversy
> is easily resolved: we do not need any such construct.

I'm slightly puzzled by your remarks here. GCC documentation on this
is sparse (although, as I've demonstrated, I can get better numbers
using the always_inline attribute on GCC 4.3), but take a look at
this:

http://msdn.microsoft.com/en-us/library/z8y1yy88.aspx

While it is not strictly true to say that pg_always_inline could
inline every function under every set of circumstances, it's pretty
close to the truth. I do accept that this facility could quite easily
be abused if its use isn't carefully measured. I also accept that
C99/GNU C inline functions are generally just requests to the compiler
that may be ignored (and indeed the compiler may inline without being
asked to).

It's short-sighted to see this as a case of inlining itself making a
big difference; rather, inlining makes a big difference as an enabling
transformation.

> The real question is whether we should accept a patch that is a
> performance loss when the compiler fails to inline some reasonably
> simple function.  I think that would depend on the actual numbers
> involved, so we'd need to see data before making a decision.

Who said anything about a performance loss? Since the raw improvement
to qsort_arg is so large, it's pretty difficult to imagine a
confluence of circumstances in which this results in a net loss. See
the figures at http://archives.postgresql.org/pgsql-hackers/2011-11/msg01316.php,
for example.

The pg_always_inline idea is relatively recent. It just serves to
provide additional encouragement to the compiler to inline, insofar as
that is possible on the platform in question. The compiler's
cost/benefit analysis cannot possibly appreciate how hot a code path
qsort_arg is, because it has a set of generic heuristics that are
quite fallible, and very probably err on the conservative side. We can
appreciate such things though.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: Progress on fast path sorting, btree index creation time

From: Robert Haas

On Thu, Jan 5, 2012 at 5:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> There is no compiler anywhere that implements "always inline", unless
> you are talking about a macro.  "inline" is a hint and nothing more,
> and if you think you can force it you are mistaken.  So this controversy
> is easily resolved: we do not need any such construct.

That's basically in direct contradiction to the experimental evidence.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Progress on fast path sorting, btree index creation time

From: Peter Geoghegan

On 5 January 2012 20:23, Robert Haas <robertmhaas@gmail.com> wrote:
> I don't have a problem with the idea of a pg_always_inline, but I'm
> wondering what sort of fallback mechanism you propose.  It seems to me
> that if the performance improvement is conditioned on inlining being done
> and we're not confident that we can force the compiler to inline, we
> ought to fall back to not making the specialization available, rather
> than to making something available that may not work well.  Of course
> if it's only a question of a smaller performance gain, rather than an
> actual loss, then that's not as much of an issue...

pg_always_inline is more or less a neat adjunct to what I've already
done. You can take it or leave it, but it would be irrational to
dismiss it out of hand, because it is demonstrably effective in this
case. Using non-portable things like function attributes is fairly
well precedented - we use __attribute__((format(...))) frequently. We
don't need a fallback mechanism other than "#else #define
pg_always_inline inline #endif". I think these facilities are
available to well over 99% of our users, who use GCC, GCC-compatible
compilers, or MSVC builds.

I have basically no sympathy for anyone who uses a compiler that can't
inline functions. Do such people actually exist?

>> ISTM that protecting
>> against that outside the qsort implementation (but only for index
>> sorts) is wrong-headed.
>
> Assuming you mean that qsort_arg() will protect against this stuff so
> that callers don't need to do it separately, +1.

That's exactly what I mean. Going quadratic in the face of a lot of
duplicates is something that, remarkably, was a problem well into the
1990s, apparently because some guy noticed it one day, though I think
that lots of popular implementations happened to be unaffected anyway.

As you know, all queries tested have lots and lots of duplicates (a
~1.5GB table that contains the same number of distinct elements as a
~10MB table once did), and removing the "duplicate protection" for
btrees, on top of everything else, sees the qsort do an awful lot
better than HEAD does with the pseudo-protection. As I've said, we
already lack any such "protection" for heap tuples. We are unaffected
now anyway, in particular, because we do this, as apparently
recommended by both Sedgewick and Knuth:
    while (pb <= pc && (r = cmp(pb, a)) <= 0)
    {
        if (r == 0)
        {
            swap(pa, pb);
            pa += es;
        }
        pb += es;
    }

Note that the partition swap occurs *because* the pivot and element
are equal. You might imagine that this is superfluous. It turns out
that it protects against the duplicates, resulting in sub-partitions
about the same size (though it occurs to me that it does so at the
expense of precluding parallelism). In short, Sedgewick would approve
of our qsort implementation.

As for the "compare a tuple to itself" thing, that's just silly, any
was probably only ever seen in some homespun implementation at some
point a relatively long time ago, but I've asserted against it anyway.

> I can't find any description of how this actually works... obviously,
> we'd like to find a close-to-median element, but how do we actually do
> that cheaply?

Uh, it works whatever way you want it to work. The implementation has
to figure out how to get the median ahead of time.

> I doubt that this would be worthwhile -- the pivot that we pick at the
> toplevel doesn't really matter much; even if it's bad, we can recover
> just fine if the pivot we pick one level down is good, or the level
> after that.  We only really have a problem if we keep systematically
> picking bad pivots every time.  And if we do have that problem,
> improving the toplevel choice of pivot is only going to help modestly,
> because we'll still be O(n^2) on each partition.
>
>> Can I get a straw poll on how much of a problem worst-case performance
>> of qsort is believed to be?
>
> I'd be reluctant to remove any of the protections we currently have,
> because I bet they were all put in to fix problems that people hit in
> the field.  But I don't know that we need more.  Of course,
> consolidating them into qsort() itself rather than its callers
> probably makes sense, as noted above.

I was merely proposing something that would complement the med3 method
and our existing protections.

>> In a perfect world, if it were somehow possible to know the perfect
>> pivots ahead of time from a histogram or something, we'd have a quick
>> sort variant with worst-case performance of O(n log(n)). That, or the
>> limited approximation that I've sketched would perhaps be worthwhile,
>> even if it was subject to a number of caveats. Wikipedia claims that
>> the worst case for quick sort is O(n log(n)) with the refinements
>> recommended by Sedgewick's 1978 paper, but this seems like nonsense to
>> me - the med3 technique is a good heuristic in practice, but it's
>> perfectly possible in theory for its sampling to always get things
>> completely wrong (this is not an unfamiliar problem).
>
> I agree.  After reading
> http://nanxkurniawan.wordpress.com/2010/05/31/quicksort/ I am inclined
> to think that things are still O(n lg n) even if we always partition
> so that any fixed percentage of the data is on one side of the
> partition element and the remainder is on the other side.  So for
> example if all of our partition elements are greater than 1% of the
> elements in their partition and less than the other 99%, we'll still
> be O(n lg n), though with a considerably higher constant factor, I
> think.  However, if our partition elements are always a fixed NUMBER
> of elements from the end - whether it's 1 or 100 - then we'll be
> O(n^2), though again the constant factor will depend on how many
> elements you knock off each time.  In practice I'm not sure whether an
> algorithmic analysis matters much, because in real life the constant
> factors are probably pretty important, hence the median-of-medians
> approach with n > 40.

Yeah, it's true that you have to be exceptionally unlucky to actually
see worst-case performance, but I imagined that it might be enough of
a difference to be worth pursuing. Experimentally, I can certainly see
the difference (I simulated it), but it is underwhelming next to
everything else.

It might be worth simplifying the swap logic, given that we only ever have
to swap SortTuple structs in qsort_arg, and given that it would make
the meta qsort_arg that much less ugly.
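
Since the element type would be known statically, the swap could be a
plain struct assignment rather than the generic byte-swapping code - a
sketch, not the patch's actual code:

    #define SWAP_SORTTUPLE(a, b) \
        do { \
            SortTuple swap_tmp = *(a); \
            *(a) = *(b); \
            *(b) = swap_tmp; \
        } while (0)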

>> While I'm thinking out loud, here's another idea: Have a GUC which,
>> when enabled (perhaps potentially at various different granularities),
>> makes Timsort the in-memory sorting algorithm, as it now is by default
>> for Python, Java SE7 (for arrays) and the Android platform; for
>> certain applications, this could be a real win, and I don't think that
>> it would have to be invasive: it could be dropped in.
>
> I gather from a quick read that this is supposed to be especially good
> when the data is already somewhat sorted.  Would there be any merit in
> trying to guess when that's likely to be the case (e.g. based on the
> correlation statistic)?   That seems like a stretch, but OTOH a GUC
> doesn't feel great either: what is the poor user to do with a query
> that does 2 sorts, one of which is faster with QS and the other of
> which is faster with Timsort?  Ugh.

I'd imagined that it might work a bit like default_statistics_target.
Of course, I was just thinking out loud. Also, we do a very good job
on *perfectly* pre-sorted input because we check for pre-sorted input.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: Progress on fast path sorting, btree index creation time

From: Robert Haas

On Fri, Jan 6, 2012 at 12:10 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> As you know, all queries tested have lots and lots of duplicates (a
> ~1.5GB table that contains the same number of distinct elements as a
> ~10MB table once did), and removing the "duplicate protection" for
> btrees, on top of everything else, sees the qsort do an awful lot
> better than HEAD does with the pseudo-protection.

Does that also win vs. unpatched master?  If so we may as well pull
that part out and commit it separately.

>> I can't find any description of how this actually works... obviously,
>> we'd like to find a close-to-median element, but how do we actually do
>> that cheaply?
>
> Uh, it works whatever way you want it to work. The implementation has
> to figure out how to get the median ahead of time.

Oh.  I'd be inclined to think that would be pretty inefficient, unless
we only did it for very large partitions or when we observed that some
less costly strategy was tanking.

>> I gather from a quick read that this is supposed to be especially good
>> when the data is already somewhat sorted.  Would there be any merit in
>> trying to guess when that's likely to be the case (e.g. based on the
>> correlation statistic)?   That seems like a stretch, but OTOH a GUC
>> doesn't feel great either: what is the poor user to do with a query
>> that does 2 sorts, one of which is faster with QS and the other of
>> which is faster with Timsort?  Ugh.
>
> I'd imagined that it might work a bit like default_statistics_target.
> Of course, I was just thinking out loud. Also, we do a very good job
> on *perfectly* pre-sorted input because we check for pre-sorted input.

We occasionally have people who want to do SELECT * FROM foo WHERE a >
100 AND a < 200 ORDER BY a, b where foo has an index on (a) but not
(a, b).  This tends to suck, because we fail to realize that we can
batch the sort.  We either seqscan and filter the table then sort the
results, or else we scan the index on (a) and then ignore the fact
that we have data which is already partially sorted.  It's
particularly annoying if a is "mostly unique" and needs b only
occasionally to break ties.  It would be nice to improve this, but it
may be more of a planner/executor problem than something we can fix
directly inside the sort implementation.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Progress on fast path sorting, btree index creation time

From: Peter Geoghegan

On 6 January 2012 17:29, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jan 6, 2012 at 12:10 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
>> As you know, all queries tested have lots and lots of duplicates (a
>> ~1.5GB table that contains the same number of distinct elements as a
>> ~10MB table once did), and removing the "duplicate protection" for
>> btrees, on top of everything else, sees the qsort do an awful lot
>> better than HEAD does with the pseudo-protection.
>
> Does that also win vs. unpatched master?  If so we may as well pull
> that part out and commit it separately.

I didn't bother isolating that, because it doesn't really make sense
to (not having it is probably only of particular value when doing what
I'm doing anyway, but who knows). Go ahead and commit something to
remove that code (get it in both comparetup_index_btree and
comparetup_index_hash), as well as the tuple1 != tuple2 test now if
you like. It's patently clear that it is unnecessary, and so doesn't
have to be justified as a performance enhancement - it's a simple case
of refactoring for clarity. As I've said, we don't do this for heap
tuples and we've heard no complaints in all that time. It was added in
commit fbac1272b89b547dbaacd78bbe8da68e5493cbda, presumably when
problems with system qsorts came to light.

>> I'd imagined that it might work a bit like default_statistics_target.
>> Of course, I was just thinking out loud. Also, we do a very good job
>> on *perfectly* pre-sorted input because we check for pre-sorted input.
>
> We occasionally have people who want to do SELECT * FROM foo WHERE a >
> 100 AND a < 200 ORDER BY a, b where foo has an index on (a) but not
> (a, b).  This tends to suck, because we fail to realize that we can
> batch the sort.  We either seqscan and filter the table then sort the
> results, or else we scan the index on (a) and then ignore the fact
> that we have data which is already partially sorted.  It's
> particularly annoying if a is "mostly unique" and needs b only
> occasionally to break ties.  It would be nice to improve this, but it
> may be more of a planner/executor problem than something we can fix
> directly inside the sort implementation.

That sounds like an interesting problem. Something to look into for
9.3, perhaps.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: Progress on fast path sorting, btree index creation time

From: Tom Lane

Peter Geoghegan <peter@2ndquadrant.com> writes:
> I didn't bother isolating that, because it doesn't really make sense
> to (not having it is probably only of particular value when doing what
> I'm doing anyway, but who knows). Go ahead and commit something to
> remove that code (get it in both comparetup_index_btree and
> comparetup_index_hash), as well as the tuple1 != tuple2 test now if
> you like. It's patently clear that it is unnecessary, and so doesn't
> have to be justified as a performance enhancement - it's a simple case
> of refactoring for clarity. As I've said, we don't do this for heap
> tuples and we've heard no complaints in all that time. It was added in
> commit fbac1272b89b547dbaacd78bbe8da68e5493cbda, presumably when
> problems with system qsorts came to light.

Actually, I'm going to object to reverting that commit, as I believe
that having equal-keyed index entries in physical table order may offer
some performance benefits at access time.  If you don't like the
comment, we can change that.
        regards, tom lane


Re: Progress on fast path sorting, btree index creation time

From: Peter Geoghegan

On 6 January 2012 18:45, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Geoghegan <peter@2ndquadrant.com> writes:
>> I didn't bother isolating that, because it doesn't really make sense
>> to (not having it is probably only of particular value when doing what
>> I'm doing anyway, but who knows). Go ahead and commit something to
>> remove that code (get it in both comparetup_index_btree and
>> comparetup_index_hash), as well as the tuple1 != tuple2 test now if
>> you like. It's patently clear that it is unnecessary, and so doesn't
>> have to be justified as a performance enhancement - it's a simple case
>> of refactoring for clarity. As I've said, we don't do this for heap
>> tuples and we've heard no complaints in all that time. It was added in
>> commit fbac1272b89b547dbaacd78bbe8da68e5493cbda, presumably when
>> problems with system qsorts came to light.
>
> Actually, I'm going to object to reverting that commit, as I believe
> that having equal-keyed index entries in physical table order may offer
> some performance benefits at access time.  If you don't like the
> comment, we can change that.

Please explain your position. When is this supposed to be useful? Even
if you can present an argument for keeping it, it has to be weighed
against the fact that people don't tend to have much use for indexes
with lots of duplicates anyway. Prior to 2004, we did not do this - it was
justified as insurance against bad qsort implementations only.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: Progress on fast path sorting, btree index creation time

From: Tom Lane

Peter Geoghegan <peter@2ndquadrant.com> writes:
> On 6 January 2012 18:45, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Actually, I'm going to object to reverting that commit, as I believe
>> that having equal-keyed index entries in physical table order may offer
>> some performance benefits at access time.  If you don't like the
>> comment, we can change that.

> Please explain your position. When is this supposed to be useful?

When there are lots of duplicates of a particular indexed value, the
existing code will cause an indexscan to search them in physical order,
whereas if we remove the existing logic it'll be random --- in
particular, accesses to the same heap page can no longer be expected to
be clustered.
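
(For reference, the logic under discussion is the ItemPointer
tie-break in comparetup_index_btree; simplified, and paraphrasing the
actual code from memory, it is something like:

    /* keys are equal, so order by heap position */
    {
        BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
        BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);

        if (blk1 != blk2)
            return (blk1 < blk2) ? -1 : 1;
        return (int) ItemPointerGetOffsetNumber(&tuple1->t_tid) -
               (int) ItemPointerGetOffsetNumber(&tuple2->t_tid);
    }

so removing it randomizes the order of equal keys.)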

Admittedly, I don't have any numbers quantifying just how useful that
might be, but on the other hand you've not presented any evidence
justifying removing the behavior either.  If we believe your position
that indexes don't generally have lots of duplicates, then the code in
question will seldom be reached and therefore there would be no
performance benefit in removing it.
        regards, tom lane


Re: Progress on fast path sorting, btree index creation time

From: Robert Haas

On Fri, Jan 6, 2012 at 4:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Admittedly, I don't have any numbers quantifying just how useful that
> might be, but on the other hand you've not presented any evidence
> justifying removing the behavior either.  If we believe your position
> that indexes don't generally have lots of duplicates, then the code in
> question will seldom be reached and therefore there would be no
> performance benefit in removing it.

Obviously, many indexes are unique and thus won't have duplicates at
all.  But if someone creates an index and doesn't make it unique, odds
are very high that it has some duplicates.  Not sure how many we
typically expect to see, but more than zero...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Progress on fast path sorting, btree index creation time

From: Peter Geoghegan

On 6 January 2012 21:14, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> When there are lots of duplicates of a particular indexed value, the
> existing code will cause an indexscan to search them in physical order,
> whereas if we remove the existing logic it'll be random --- in
> particular, accesses to the same heap page can no longer be expected to
> be clustered.

Isn't it possible to get them in physical order anyway, by reading
them into memory in that order? Efficient quick sort implementations
are not stable, and ours is no exception, but we could perhaps come up
with a cheaper tie-breaker value at that stage, if you're determined
to maintain this behaviour. We have sufficient incentive to, as I
describe below.

> Admittedly, I don't have any numbers quantifying just how useful that
> might be, but on the other hand you've not presented any evidence
> justifying removing the behavior either.  If we believe your position
> that indexes don't generally have lots of duplicates, then the code in
> question will seldom be reached and therefore there would be no
> performance benefit in removing it.

I ran the same btree benchmark on master, but without the "cheap
insurance". The results were interesting, to say the least.

The columns indexed were the same columns and data that I've been
using all along. Initially this made sense, as the performance of
multi-sort-key sorts often largely hinged on being able to get away
with doing one comparison per pair of tuples - with many duplicates, I
could avoid cheating and show something closer to worst case for the
patch.

I didn't think it mattered that indexing the same columns would
produce what happened to be a not so useful index in the real world,
due to having so many duplicates - better to have figures that were
somewhat comparable for btree tuple sorting and heap tuple sorting.

When I ran the same benchmark on a server that differed from master
only in that there was no insurance, it momentarily appeared that
*all* of the gains for btree index creation came from being able to
elide the "cheap insurance", but only where it would have to be paid
for a high number of times.

I soon realised that I'd made a blunder: the code (that is, the patch
that I posted most recently) wasn't even using my specialisation for
qsorting, because the SortSupport pointer was null! I did not have
tuplesort_begin_index_btree initialise the SortSupport struct as
tuplesort_begin_heap did, so my earlier benchmark was effectively
meaningless, except that it indicated the benefits of eliding the
cheap insurance alone, if only for that not so compelling case. You
should note that the benefits of not paying for the insurance can be
very significant indeed.

Attached are figures for an identical run of the same btree python
script, but with a version of the patch that actually uses my
specialisations. Granted, these numbers are still partially predicated
on the index in question having a large number of duplicates, but it's
worth noting:

1. The gain from specialisation isn't bad; not as good as the
improvements we saw for heap tuples, but not so bad either, especially
considering that binary bloat should be much less controversial for
btree tuples.

2. The index that results from the tests is still useful; the planner
is perfectly willing to use it rather than performing an
in-memory sort. It will also use it to satisfy a query like "select *
from orderlines where prod_id = 5", albeit via a bitmap index scan. I
took the precaution of increasing default_statistics_target to its
maximum value, as well as performing an ANALYZE on orderlines in
advance of checking this.

Revision to this patch that fixes the bug to follow - I produced these
new numbers from a rough cut of that.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: Progress on fast path sorting, btree index creation time

From: Josh Berkus

> Obviously, many indexes are unique and thus won't have duplicates at
> all.  But if someone creates an index and doesn't make it unique, odds
> are very high that it has some duplicates.  Not sure how many we
> typically expect to see, but more than zero...

Peter may not, but I personally admin lots of databases which have
indexes on values like "category" or "city" which have 100's or 1000's
of duplicates per value.  I don't think this is uncommon at all.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Progress on fast path sorting, btree index creation time

From: Peter Geoghegan

On 9 January 2012 19:45, Josh Berkus <josh@agliodbs.com> wrote:
>> Obviously, many indexes are unique and thus won't have duplicates at
>> all.  But if someone creates an index and doesn't make it unique, odds
>> are very high that it has some duplicates.  Not sure how many we
>> typically expect to see, but more than zero...
>
> Peter may not, but I personally admin lots of databases which have
> indexes on values like "category" or "city" which have 100's or 1000's
> of duplicates per value.  I don't think this is uncommon at all.

Uh, then all the more reason to do what I recommend, I imagine. There
is most definitely a large overhead to creating such indexes, at least
for scalar types. As far as I can tell, Tom's complaint is quite
speculative.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: Progress on fast path sorting, btree index creation time

From: Peter Geoghegan

On 6 January 2012 21:47, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jan 6, 2012 at 4:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Admittedly, I don't have any numbers quantifying just how useful that
>> might be, but on the other hand you've not presented any evidence
>> justifying removing the behavior either.  If we believe your position
>> that indexes don't generally have lots of duplicates, then the code in
>> question will seldom be reached and therefore there would be no
>> performance benefit in removing it.

I have decided on a tactical retreat in relation to this patch. This
has been dragging on since September. I cannot risk not having the
patch accepted for 9.2, due to trying to handle both heap and btree
tuple sorting at once - the additional relatively modest improvements
for common cases that it will bring to btree index creation time do
not warrant digging my heels in to cover that case in one larger
commit. For that reason, I attach for your consideration a revision of
the patch without any support for btree index tuples whatsoever
(though I have adjusted btree tuplesort comments, and one tiny piece
of code, in a non-controversial way). I'll revisit btree sorting
towards the end of the commitfest, circumstances permitting.

As to the question of binary bloat, I have devised a pgbench-tools
workload that I think will go some way towards reassuring those who
are concerned about its distributed costs. The attached spreadsheet
has all the relevant details.

A custom script has been specified (details of which are evident from
the benchmark results themselves). If we experienced some kind of
noticeable marginalisation of usefully cached instructions, that's
obviously where it'll show up.

Note that I have taken the preparatory step of updating some tables
with random data, rather than the sequential data that they have by
default, which, unusually for such large tables, would otherwise allow
our pre-sorted input check optimisation to fire, spoiling things. I ran
vacuumdb immediately afterwards, immediately before the pgbench run in
each case.

I've kept the test at 8 clients/8 threads, on a machine with 4
physical cores + hyperthreading for 8 logical ("Intel Core(TM) i7 CPU
870  @ 2.93GHz", with 4 x 32 KB instruction caches, among several
other CPU caches) on a newly-initialised, throw-away database, single
run at scale 50 for 15 minutes in all cases. This is the same server
that I have used throughout.

My analysis is that although there may be a very slight regression in
non-affected queries (it's a tough call to make - how queries happen
to coincide undoubtedly muddies the waters, as does the fact that we
cut lots of time from periods in which backends hold a lot of locally
allocated sort memory - it might actually be a win for those other
queries sometimes but not others, and it appears that the margin
either way is very small), that is more than compensated for by the
benefit of the specialisations. The CPU cache is doing its job as
expected, and we clearly see a net benefit from its preference for
caching more instructions that are specialised - those sort
operations are way more expensive (if they weren't, it wouldn't cache
them so heavily). I also include results for running the same query
again and again with a single client, to put that effect in context
(though note that previous benchmarks avoided paying a high client
overhead by using EXPLAIN ANALYZE, and indeed originally compared
pre-SortSupport Postgres, so these numbers aren't as good either).
It's unusual for a database workload to be so heavily CPU bound, so
I'd suggest that the latter benchmark is more representative than the
former. Either way, we win by some margin. If the queries didn't have
such a high client overhead, as for example with a sort node that
feeds a merge join, we'd do better still.

If a sorting specialisation is never used anyway, the overhead, for
practical purposes, is zero. I believe that sorting specialisations
are just too useful to not be a net win in almost all reasonable
cases. Why haven't I used all specialisations at once, rather than
only two (one for single int4, the other multiple)? Well, I might have
used all of them, but there were no floats available in the pgbench
tables, and besides, the chances of all of them being simultaneously
in play during any sufficiently short period for marginalisation of
CPU cache contents to be of particular concern is, in general, not all
that great.

A major systems programming language, C++, produces multiple
specialisations as its standard library's sort function is used, for
example, each of which will have separate entries in the procedure
linkage table (though various implementation-defined techniques are
frequently used to reduce the number of copies across translation
units at link time). If the programmer specifies either a different
datatype, or a different comparator (through the use of an alternative
to the default std::less_than<T> functor/predicate), a whole new
specialisation is generated by the compiler. This does raise concerns
in relation to binary bloat, but they're reasonably well understood,
and this is the default way of doing things, for general purpose
application development. Now, I know that Postgres isn't written in
C++, and you might not find "C++ does it" to be a particularly
compelling argument, but it does go to show that these ideas are not
by any stretch of the imagination radical.

I remind everyone that the improvement seen in raw qsort_arg runtime
is fairly large, as it would have to be, in order to bring such large
though proportionally smaller improvements to each given affected
query as a whole - there are of course many other sources of overhead
involved in parsing, planning and executing affected queries.

Here is a complete detailing of the specialisations that this latest
revision produces:

int4, single key (supports date too)
int8, multiple keys (supports timestamps too, with and without TZ,
where we HAVE_INT64_TIMESTAMP)
float4, single key
float8, multiple keys
Type-generic single-key specialisation. Expected to deliver some of
the same additional benefits for types that do not merit their own
specialisations but would still benefit, like name, int2, etc., but it
is used all the time where applicable, even for types like text, for
which it is expected that there will be no appreciable benefit.
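
For the curious, each specialisation is "instantiated" along roughly
these lines (a simplified sketch of the mechanism; the macro name and
details are illustrative rather than the patch's exact code):

    /* comparator with a known signature, cheap enough to inline */
    static inline int
    int4_fastcmp_sketch(Datum x, Datum y, SortSupport ssup)
    {
        int32       a = DatumGetInt32(x);
        int32       b = DatumGetInt32(y);

        if (a < b)
            return -1;
        if (a > b)
            return 1;
        return 0;
    }

    /*
     * Expands to a complete copy of the qsort_arg code with the
     * comparator inlined at its call sites, instead of being reached
     * through a function pointer on every comparison.
     */
    TEMPLATE_QSORT_ARG(int4, int4_fastcmp_sketch);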

Thoughts?
--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: Progress on fast path sorting, btree index creation time

From: Peter Geoghegan

I decided to take advantage of my ongoing access to a benchmarking
server to see how I could get on with a query with an especially large
sort. Now, that box has 16GB of ram, and some people have that much
memory in their laptops these days, so I was somewhat limited in how
far I could push things.

I eventually decided upon much the same benchmark as I'd made in my
previous e-mail (the query "SELECT * FROM pgbench_accounts ORDER BY
bid;", but primed with random data, while being sure to subsequently
vacuum).

I stress that I have not made any attempt to artificially remove
client overhead. I have, however, used a faster disk (caches were not
warmed in a separate run of pgbench or anything like that for either
of my two most recent benchmarks, so there may have been and indeed
may still be some small bias there), in addition to connecting pgbench
with the -h option. Apparently a known problem with Linux kernels
using the Completely Fair Scheduler introduced in 2.6.23 is that it
does not schedule the pgbench program very well when it's connecting
to the database using Unix-domain sockets, as I have been until now.
I'm not sure that this had all that much potential to spoil my
results, but I didn't want to take the chance.

Anyway, the results of running this latest benchmark, with 20
successive runs of each large query, were:

Patch:

pghost: localhost pgport:  nclients: 1 nxacts: 20 dbName: pgbench
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
number of transactions per client: 20
number of transactions actually processed: 20/20
tps = 0.030924 (including connections establishing)
tps = 0.030924 (excluding connections establishing)
statement latencies in milliseconds:
        31163.957400    SELECT * FROM pgbench_accounts ORDER BY bid;

Master:

pghost: localhost pgport:  nclients: 1 nxacts: 20 dbName: pgbench
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
number of transactions per client: 20
number of transactions actually processed: 20/20
tps = 0.026769 (including connections establishing)
tps = 0.026769 (excluding connections establishing)
statement latencies in milliseconds:
        36179.508750    SELECT * FROM pgbench_accounts ORDER BY bid;

That's a larger proportional improvement than reported on Thursday,
and at significantly greater scale.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: Progress on fast path sorting, btree index creation time

From: Robert Haas

On Thu, Jan 19, 2012 at 1:47 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> Thoughts?

I generated some random data using this query:

create table nodups (g) as select (g%10)*1000+g/10 from
generate_series(1,10000) g;

And then used pgbench to repeatedly sort it using this query:

select * from nodups order by g offset 10001;

The patch came out about 28% faster than master.  Admittedly, that's
with no client overhead, but still: not bad.
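
(Assuming, for illustration, that the query is saved in a file
sort-test.sql - the exact invocation isn't important - something along
these lines drives the test:

    pgbench -n -f sort-test.sql -T 300 postgres

where -n skips the implicit vacuum and -T fixes the run length in
seconds.)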

I don't like the API you've designed, though: as you have it,
PrepareSortSupportFromOrderingOp does this after calling the sort
support function:

+       ssup->usable_compar = ResolveComparatorProper(sortFunction);

I think it would be better to simply have the sort support functions
set usable_compar themselves.  That would allow any user-defined
types that happen to have the same binary representation and
comparison rules as one of the types for which we supply a custom
qsort() to still make use of the optimization.
There's no real reason to have a separate switch to decide how to
initialize that field: the sort support trampoline already does that,
and I don't see any reason to introduce a second way of doing the same
thing.
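
In other words, something like this in each type's sort support
function (a sketch; the usable_compar values would be whatever the
patch defines, and the constant name here is made up):

    Datum
    btint4sortsupport(PG_FUNCTION_ARGS)
    {
        SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);

        ssup->comparator = btint4fastcmp;
        /* announce compatibility with the int4 qsort specialisation */
        ssup->usable_compar = USABLE_COMPAR_INT4;   /* illustrative */
        PG_RETURN_VOID();
    }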

I am also a little unhappy that we have to duplicate the fastcmp
functions from nbtcompare.c in builtins.h.  Wouldn't it make more
sense to declare those functions as inline in nbtcompare.c, and then
call the qsort-generating macro from that file?

There were a couple of comment updates in tuplesort.c that looked
independent of the rest of the patch, so I've committed those
separately.  I also committed your change to downgrade the
belt-and-suspenders check for self-comparison to an assert, with some
rewording of your proposed comment.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Progress on fast path sorting, btree index creation time

From: Peter Geoghegan

On 26 January 2012 19:45, Robert Haas <robertmhaas@gmail.com> wrote:
> The patch came out about 28% faster than master.  Admittedly, that's
> with no client overhead, but still: not bad.

Thanks. There was a 28% reduction in the time it took to execute the
query, but there would also have been a larger reduction in the time
that the backend held all of that locally-allocated memory. That
might also be worth instrumenting directly, by turning on "trace_sort"
- can you report numbers on that, please?

> I don't like the API you've designed, though: as you have it,
> PrepareSortSupportFromOrderingOp does this after calling the sort
> support function:
>
> +       ssup->usable_compar = ResolveComparatorProper(sortFunction);
>
> I think it would be better to simply have the sort support functions
> set usable_compar themselves.  That would allow any user-defined
> types that happen to have the same binary representation and
> comparison rules as one of the types for which we supply a custom
> qsort() to still make use of the optimization.
> There's no real reason to have a separate switch to decide how to
> initialize that field: the sort support trampoline already does that,
> and I don't see any reason to introduce a second way of doing the same
> thing.

Hmm. You're right. I can't believe that that didn't occur to me. In
practice, types that use the SortSupport API are all going to be
façades on scalar types anyway, much like date and timestamp, and of
those a good proportion will surely have the same comparator
representation as the specialisations introduced by this patch. It
might be that virtually all third-party types that end up using the
API can avail of some specialisation.

> I am also a little unhappy that we have to duplicate the fastcmp
> functions from nbtcompare.c in builtins.h.  Wouldn't it make more
> sense to declare those functions as inline in nbtcompare.c, and then
> call the qsort-generating macro from that file?

Maybe it would, but since the meta-qsort_arg I've introduced includes
partial duplicates of code from tuplesort.c, it kind of felt right to
"instantiate" specialisations there. It may be that doing it in
nbtcompare.c is the best option available to us. Off the top of my
head, I'm pretty sure that that's a good bit less code.

> There were a couple of comment updates in tuplesort.c that looked
> independent from the rest of the patch, so I've committed those
> separately.  I also committed your change to downgrade the
> belt-and-suspenders check for self-comparison to an assert, with some
> rewording of your proposed comment.

That seems reasonable.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: Progress on fast path sorting, btree index creation time

From
Robert Haas
Date:
On Thu, Jan 26, 2012 at 4:09 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> On 26 January 2012 19:45, Robert Haas <robertmhaas@gmail.com> wrote:
>> The patch came out about 28% faster than master.  Admittedly, that's
>> with no client overhead, but still: not bad.
>
> Thanks. There was a 28% reduction in the time it took to execute the
> query, but there would have also been a larger reduction in the time
> that the backend held all of that locally-allocated memory. That
> might also be worth instrumenting directly, by turning on "trace_sort"
> - can you report numbers on that, please?

Apparently not.  The sort is too short to register in the trace_sort
output.  I just get:

LOG:  begin tuple sort: nkeys = 1, workMem = 1024, randomAccess = f
LOG:  performsort starting: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG:  performsort done: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG:  internal sort ended, 853 KB used: CPU 0.00s/0.00u sec elapsed 0.00 sec

>> I don't like the API you've designed, though: as you have it,
>> PrepareSortSupportFromOrderingOp does this after calling the sort
>> support function:
>>
>> +       ssup->usable_compar = ResolveComparatorProper(sortFunction);
>>
>> I think it would be better to simply have the sort support functions
>> set usable_compar themselves.  That would allow any user-defined
>> types that happen to have the same binary representation and
>> comparison rules as one of the types for which we supply a custom
>> qsort() to still make use of the optimization.
>> There's no real reason to have a separate switch to decide how to
>> initialize that field: the sort support trampoline already does that,
>> and I don't see any reason to introduce a second way of doing the same
>> thing.
>
> Hmm. You're right. I can't believe that that didn't occur to me. In
> practice, types that use the SortSupport API are all going to be
> façades on scalar types anyway, much like date and timestamp, and of
> those a good proportion will surely have the same comparator
> representation as the specialisations introduced by this patch. It
> might be that virtually all third-party types that end up using the
> API can avail of some specialisation.

Possibly.  At a minimum it keeps the door open.

>> I am also a little unhappy that we have to duplicate the fastcmp
>> functions from nbtcompare.c in builtins.h.  Wouldn't it make more
>> sense to declare those functions as inline in nbtcompare.c, and then
>> call the qsort-generating macro from that file?
>
> Maybe it would, but since the meta-qsort_arg I've introduced includes
> partial duplicates of code from tuplesort.c, it kind of felt right to
> "instantiate" specialisations there. It may be that doing it in
> nbtcompare.c is the best option available to us. Off the top of my
> head, I'm pretty sure that that's a good bit less code.

I was hoping so...

>> There were a couple of comment updates in tuplesort.c that looked
>> independent from the rest of the patch, so I've committed those
>> separately.  I also committed your change to downgrade the
>> belt-and-suspenders check for self-comparison to an assert, with some
>> rewording of your proposed comment.
>
> That seems reasonable.

Cool.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Progress on fast path sorting, btree index creation time

From
Peter Geoghegan
Date:
Alright, so while I agree with everything you've asked for, the fact
is that there is a controversy in relation to binary bloat, and that's
the blocker here. How can we satisfactorily resolve that, or is that
question adequately addressed by the benchmark that I produced?

What if third party modules could include the template_qsort_arg.h
header, and instantiate their own specialisation? The sort support
function could either set an enum, or set a pointer to a qsort_arg
specialisation, if their comparator was a little different, but still
just a few instructions, as with float-based timestamps (I don't care
enough about them to provide one in core, though). It would also
essentially allow for user-defined sort functions, provided they
fulfilled a basic interface. They may not even have to be
comparison-based. I know that I expressed scepticism about the weird
and wonderful ideas that some people have put forward in that area,
but that's mostly because I don't think that GPU based sorting in a
database is going to be practical.
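
To sketch what a third-party module might do under that scheme - apart
from template_qsort_arg.h itself, every name here is invented for
illustration:

#include "template_qsort_arg.h"

/* comparator for some hypothetical fixed-width pass-by-value type */
static inline int
mytype_fastcmp(Datum x, Datum y, SortSupport ssup)
{
    int64       a = DatumGetInt64(x);
    int64       b = DatumGetInt64(y);

    return (a < b) ? -1 : (a > b) ? 1 : 0;
}

/* stamps out mytype_qsort_arg(), with mytype_fastcmp() inlined */
TEMPLATE_QSORT_ARG(mytype, mytype_fastcmp);

Datum
mytype_sortsupport(PG_FUNCTION_ARGS)
{
    SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);

    ssup->comparator = mytype_fastcmp;
    /* hypothetical hook: point tuplesort at our own specialisation */
    ssup->qsort_arg_spec = mytype_qsort_arg;
    PG_RETURN_VOID();
}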

Why not add this capability to the SortSupport API, given that it
wouldn't be very hard? The user would set the enum too, so it would
appear in explain analyze output as "custom sort" or something like
that.

While a sorting specialisation of our qsort_arg is actually pretty
close to optimal for scalar types and façades thereof, there could be
a type that would benefit from using radix sort, for example, which
isn't even comparison-based, and a user couldn't very well do that
with the existing SortSupport API.

I certainly don't care about this capability enough to defend it
against any objections that anyone may have, especially at this late
stage in the cycle. I just think that we might as well have it.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: Progress on fast path sorting, btree index creation time

From
Robert Haas
Date:
On Thu, Jan 26, 2012 at 5:10 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> Alright, so while I agree with everything you've asked for, the fact
> is that there is a controversy in relation to binary bloat, and that's
> the blocker here. How can we satisfactorily resolve that, or is that
> question adequately addressed by the benchmark that I produced?

I could go either way on that, honestly.  I took a look today and
found out that on my machine, a postgres 9.0.x executable was about
282kB larger than an 8.4 postgres executable.  9.1 was about 386kB
larger than that, and 9.2 current is 239kB larger still.  So it's not
as if your patch is the first one to enlarge the size of the binary -
it's obviously been growing steadily from release to release for
years.  I found that your patch adds 55kB to the executable size,
which is clearly a lot more than what a typical patch adds, but still
under 1%.  So I wouldn't be shocked and appalled if we decided to
commit this as-is.

But if we want to put it on a diet, the first thing I'd probably be
inclined to lose is the float4 specialization.  Some members of the
audience will recall that I take a dim view of floating point arithmetic
generally, but I'll concede that there are valid reasons for using
float8.  I have a harder time coming up with a good reason to use
float4 - ever, for anything you care about.  So I would be inclined to
think that if we want to trim this back a bit, maybe that's the one to
let go.  If we want to be even more aggressive, the next thing I'd
probably lose is the optimization of multiple sortkey cases, on the
theory that single sort keys are probably by far the most common
practical case.

I'm not surprised that you weren't able to measure a performance
regression from the binary bloat.  Any such regression is bound to be
very small and probably quite difficult to notice most of the time;
it's really the cumulative effect of many binary-size-increasing
changes we're worried about, not each individual one.  Certainly,
trying to shrink the binary is micro-optimization at its finest, but
then so is inlining comparators.  I don't think it can be
realistically argued that increasing the size of the binary
arbitrarily will never get us in trouble, much like (for a typical
American family) spending $30 to have dinner at a cheap restaurant
will never break the budget.  But if you do it every day, it gets
expensive (and fattening).

> What if third party modules could include the template_qsort_arg.h
> header, and instantiate their own specialisation?

Seems reasonable to me.

> The sort support
> function could either set an enum, or set a pointer to a qsort_arg
> specialisation, if their comparator was a little different, but still

The latter seems better.

> just a few instructions, as with float-based timestamps (I don't care
> enough about them to provide one in core, though). It would also
> essentially allow for user-defined sort functions, provided they
> fulfilled a basic interface. They may not even have to be
> comparison-based. I know that I expressed scepticism about the weird
> and wonderful ideas that some people have put forward in that area,
> but that's mostly because I don't think that GPU based sorting in a
> database is going to be practical.

A question for another day.

> Why not add this capability to the SortSupport API, given that it
> wouldn't be very hard? The user would set the enum too, so it would
> appear in explain analyze output as "custom sort" or something like
> that.

I'm unenthused about going to any trouble to change the EXPLAIN format.

> While a sorting specialisation of our qsort_arg is actually pretty
> close to optimal for scalar types and façades thereof, there could be
> a type that would benefit from using radix sort, for example, which
> isn't even comparison-based, and a user couldn't very well do that
> with the existing SortSupport API.
>
> I certainly don't care about this capability enough to defend it
> against any objections that anyone may have, especially at this late
> stage in the cycle. I just think that we might as well have it.

I don't see any reason not to, assuming it's not a lot of code.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Progress on fast path sorting, btree index creation time

From
Peter Geoghegan
Date:
On 27 January 2012 03:32, Robert Haas <robertmhaas@gmail.com> wrote:
> But if we want to put it on a diet, the first thing I'd probably be
> inclined to lose is the float4 specialization.  Some members of the
> audience will recall that I take a dim view of floating point arithmetic
> generally, but I'll concede that there are valid reasons for using
> float8.  I have a harder time coming up with a good reason to use
> float4 - ever, for anything you care about.  So I would be inclined to
> think that if we want to trim this back a bit, maybe that's the one to
> let go.  If we want to be even more aggressive, the next thing I'd
> probably lose is the optimization of multiple sortkey cases, on the
> theory that single sort keys are probably by far the most common
> practical case.

Obviously I don't think that we should let anything go, as the
improvement in performance is so large that we're bound to be ahead -
we only really pay for what we use anyway, and when we use a
specialisation the difference is really quite big, particularly when
you look at sorting in isolation.  If a specialisation is never used,
it is more or less never paid for, so there's no point in worrying
about that. That said, float4 is obviously the weakest link. I'm
inclined to think that float8 is the second weakest though, mostly
because we get both dates and timestamps "for free" with the integer
specialisations.
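
To spell out the "for free" part: with integer datetimes, both types
are stored as plain integers, so the integer comparators apply to them
unchanged. From memory (so treat this as a sketch rather than the
verbatim definitions):

typedef int32 DateADT;      /* days since 2000-01-01 */
typedef int64 Timestamp;    /* microseconds since 2000-01-01 */

/*
 * A date sort can therefore reuse the int4 specialisation wholesale,
 * and a timestamp sort the int8 one - no additional copies of the
 * qsort code are needed.
 */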

> I'm not surprised that you weren't able to measure a performance
> regression from the binary bloat.  Any such regression is bound to be
> very small and probably quite difficult to notice most of the time;
> it's really the cumulative effect of many binary-size-increasing
> changes we're worried about, not each individual one.  Certainly,
> trying to shrink the binary is micro-optimization at its finest, but
> then so is inlining comparators.  I don't think it can be
> realistically argued that increasing the size of the binary
> arbitrarily will never get us in trouble, much like (for a typical
> American family) spending $30 to have dinner at a cheap restaurant
> will never break the budget.  But if you do it every day, it gets
> expensive (and fattening).

Sure. At the risk of stating the obvious, and of repeating myself, I
will point out that the true cost of increasing the size of the binary
is not necessarily linear - it's a complex equation. I hope that this
doesn't sound flippant, but if some naive person were to look at just
the increasing binary size of Postgres and its performance in each
successive release, they might conclude that there was a positive
correlation between the two (since we didn't add flab to the binary,
but muscle that pulls its own weight and then some).

At the continued risk of stating the obvious, CPUs don't just cache
instructions - they cache data too.  If we spend less than half the
time sorting data, which is the level of improvement I was able to
demonstrate against pre-SortSupport Postgres, that will surely very
often have the aggregate effect of ameliorating cache contention
between cores.

>> just a few instructions, as with float-based timestamps (I don't care
>> enough about them to provide one in core, though). It would also
>> essentially allow for user-defined sort functions, provided they
>> fulfilled a basic interface. They may not even have to be
>> comparison-based. I know that I expressed scepticism about the weird
>> and wonderful ideas that some people have put forward in that area,
>> but that's mostly because I don't think that GPU based sorting in a
>> database is going to be practical.
>
> A question for another day.

Fair enough.

>> I certainly don't care about this capability enough to defend it
>> against any objections that anyone may have, especially at this late
>> stage in the cycle. I just think that we might as well have it.
>
> I don't see any reason not to, assuming it's not a lot of code.

Good.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: Progress on fast path sorting, btree index creation time

From
Robert Haas
Date:
On Thu, Jan 26, 2012 at 11:36 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
>> I'm not surprised that you weren't able to measure a performance
>> regression from the binary bloat.  Any such regression is bound to be
>> very small and probably quite difficult to notice most of the time;
>> it's really the cumulative effect of many binary-size-increasing
>> changes we're worried about, not each individual one.  Certainly,
>> trying to shrink the binary is micro-optimization at its finest, but
>> then so is inlining comparators.  I don't think it can be
>> realistically argued that increasing the size of the binary
>> arbitrarily will never get us in trouble, much like (for a typical
>> American family) spending $30 to have dinner at a cheap restaurant
>> will never break the budget.  But if you do it every day, it gets
>> expensive (and fattening).
>
> Sure. At the risk of stating the obvious, and of repeating myself, I
> will point out that the true cost of increasing the size of the binary
> is not necessarily linear - it's a complex equation. I hope that this
> doesn't sound flippant, but if some naive person were to look at just
> the increasing binary size of Postgres and its performance in each
> successive release, they might conclude that there was a positive
> correlation between the two (since we didn't add flab to the binary,
> but muscle that pulls its own weight and then some).

I completely agree.  So the point is that, when faced with a patch that
adds an atypically large number of CPU instructions, we ought to ask
ourselves whether those instructions are pulling their weight.  By way
of comparison, the latest posted version of Tom's generalized
index-only paths patch adds 14kB to the resulting executable (on my
Mac).  Yours adds 55kB.  We might ask ourselves whether the benefits
of this patch are four times greater than the benefits of Tom's patch.
It's pretty hard to compare them directly since they're doing very
different things, and all of this is completely subjective, but I
doubt it.  On the other hand, there is no rule that every byte of code
that gets committed must be made of solid gold, either.  So, once
again: judgement call.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Progress on fast path sorting, btree index creation time

From
Peter Geoghegan
Date:
Uh, obviously I meant causal relationship and not correlation.

On 27 January 2012 13:37, Robert Haas <robertmhaas@gmail.com> wrote:
> I completely agree.  So the point is that, when faced with a patch that
> adds an atypically large number of CPU instructions, we ought to ask
> ourselves whether those instructions are pulling their weight.  By way
> of comparison, the latest posted version of Tom's generalized
> index-only paths patch adds 14kB to the resulting executable (on my
> Mac).  Yours adds 55kB.  We might ask ourselves whether the benefits
> of this patch are four times greater than the benefits of Tom's patch.
>  It's pretty hard to compare them directly since they're doing very
> different things, and all of this is completely subjective, but I
> doubt it.  On the other hand, there is no rule that every byte of code
> that gets committed must be made of solid gold, either.  So, once
> again: judgement call.

Well, I don't think it's all that subjective - it's more the case that
it is just difficult, or it gets that way as you consider more
specialisations.

As for what types/specialisations may not make the cut, I'm
increasingly convinced that floats (in the following order: float4,
float8) should be the first to go. Aside from the fact that we cannot
use their specialisations for anything like dates and timestamps,
floats are just way less useful than integers in the context of
database applications, or at least those that I've been involved with.
As important as floats are in the broad context of computing, it's
usually only acceptable to store data in a database as floats within
scientific applications, and only then when their limitations are
well-understood and acceptable. I think we've all heard anecdotes at
one time or another, involving their limitations not being well
understood.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: Progress on fast path sorting, btree index creation time

From
Robert Haas
Date:
On Fri, Jan 27, 2012 at 9:27 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> Well, I don't think it's all that subjective - it's more the case that
> it is just difficult, or it gets that way as you consider more
> specialisations.

Sure it's subjective.  Two well-meaning people could have different
opinions without either of them being "wrong".  If you do a lot of
small, in-memory sorts, more of this stuff is going to seem worthwhile
than if you don't.

> As for what types/specialisations may not make the cut, I'm
> increasingly convinced that floats (in the following order: float4,
> float8) should be the first to go. Aside from the fact that we cannot
> use their specialisations for anything like dates and timestamps,
> floats are just way less useful than integers in the context of
> database applications, or at least those that I've been involved with.
> As important as floats are in the broad context of computing, it's
> usually only acceptable to store data in a database as floats within
> scientific applications, and only then when their limitations are
> well-understood and acceptable. I think we've all heard anecdotes at
> one time or another, involving their limitations not being well
> understood.

While we're waiting for anyone else to weigh in with an opinion on the
right place to draw the line here, do you want to post an updated
patch with the changes previously discussed?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Progress on fast path sorting, btree index creation time

From
Peter Geoghegan
Date:
On 27 January 2012 14:37, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jan 27, 2012 at 9:27 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
>> Well, I don't think it's all that subjective - it's more the case that
>> it is just difficult, or it gets that way as you consider more
>> specialisations.
>
> Sure it's subjective.  Two well-meaning people could have different
> opinions without either of them being "wrong".  If you do a lot of
> small, in-memory sorts, more of this stuff is going to seem worthwhile
> than if you don't.

But if you don't, then you're not going to have your cache
compromised, so the cost is limited to having to store a few tens of
kilobytes of extra binary executable data on disk, and perhaps
main-memory, that you wouldn't otherwise have to - a cost that is
virtually indistinguishable from zero. When you do eventually need to
do some in-memory sort, you get to have that go significantly faster,
and since you don't have much use for the specialisations anyway, you
get that with essentially no down-side.

The concern is a perfect storm of all specialisations being
simultaneously used such that it'd be more efficient to use a generic
qsort. I think that's pretty implausible - the general assumption is
that database applications are not frequently CPU bound. They're
assumed to be frequently memory-bound though, so any effort to reduce
memory consumption - which this patch effectively does - is probably
going to be more valuable.

Even if we suppose that the perfect storm can and does occur, on a
chip that is so starved of instruction cache that it turns out to be a
net loss, surely even then the perfect storm is a rare occurrence, and
the aggregate effect is that they benefit. Besides, Postgres
performance optimisations for which you can contrive a case that
results in a net loss in performance are well precedented.

>> As for what types/specialisations may not make the cut, I'm
>> increasingly convinced that floats (in the following order: float4,
>> float8) should be the first to go. Aside from the fact that we cannot
>> use their specialisations for anything like dates and timestamps,
>> floats are just way less useful than integers in the context of
>> database applications, or at least those that I've been involved with.
>> As important as floats are in the broad context of computing, it's
>> usually only acceptable to store data in a database as floats within
>> scientific applications, and only then when their limitations are
>> well-understood and acceptable. I think we've all heard anecdotes at
>> one time or another, involving their limitations not being well
>> understood.
>
> While we're waiting for anyone else to weigh in with an opinion on the
> right place to draw the line here, do you want to post an updated
> patch with the changes previously discussed?

Patch is attached. I have not changed the duplicate functions. This is
because I concluded that duplicating them was the lesser of two evils -
the alternative was to have the template generate both declarations in
the header file and definitions in the .c file, which seemed
particularly obscure. We're
never going to have to expose/duplicate any more comparators anyway.
Do you agree?

It's pretty easy to remove a specialisation at any time - just remove
less than 10 lines of code. It's also pretty difficult to determine,
with everyone's absolute confidence, where the right balance lies.
Perhaps the sensible thing to do is to not be so conservative in what
we initially commit, while clearly acknowledging that we may not have
the balance right, and that it may have to change. We then have the
entire beta part of the cycle in which to decide to roll back from
that position, without any plausible downside. If, on the other hand,
we conservatively lean towards fewer specialisations in the initial
commit, no one will complain about the lack of an improvement in
performance that they never had.

Tom's objections related to the total number of specialisations, and
their distributed costs - the very idea of full specialisations was
not objected to. I think it's fair to say that there is no controversy
at all remaining about whether or not we should have *some* number of
specialisations. Therefore, I'm going to suggest that, assuming you
have no further objections to the style of the code, and no one else
voices any other objections in the next couple of days, you
provisionally commit this latest revision with all of its
specialisations, while putting people on notice about this.

I think that possibly the one remaining blocker to tentatively
committing this with all specialisations intact is that I haven't
tested this on Windows, as I don't currently have access to a Windows
development environment. I have set one up before, but it's a huge
pain. Can anyone help me out?

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

Attachment

Re: Progress on fast path sorting, btree index creation time

From
Robert Haas
Date:
On Fri, Jan 27, 2012 at 3:33 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> Patch is attached. I have not changed the duplicate functions. This is
> because I concluded that duplicating them was the lesser of two evils -
> the alternative was to have the template generate both declarations in
> the header file and definitions in the .c file, which seemed
> particularly obscure. We're
> never going to have to expose/duplicate any more comparators anyway.
> Do you agree?

Not really.  You don't really need macros to generate the prototypes;
you could just write them out longhand.

I think there's a mess of naming confusion in here, though, as perhaps
best illustrated by this macro definition:

#define TEMPLATE_QSORT_ARG_HEAP(TYPE, COMPAR)                               \
DO_TEMPLATE_QSORT_ARG(TYPE, COMPAR, inlheap,                                \
        SING_ADDITIONAL_CODE, TYPE##inlheapcomparetup_inline)               \
DO_TEMPLATE_QSORT_ARG(TYPE, COMPAR, regheap,                                \
        MULT_ADDITIONAL_CODE(TYPE##regheapAppFunc),                         \
            TYPE##regheapcomparetup_inline)

The idea here is that when we have only a single sort key, we include
SING_ADDITIONAL_CODE in the tuple comparison function, whereas when we
have more than one, we instead include MULT_ADDITIONAL_CODE.  Right
there, I think we have a naming problem, because abbreviating "single"
to "sing" and multiple to "mult" is less than entirely clear.  For a
minute or two I was trying to figure out whether our sorting code was
musically inclined, and I'm a native english speaker.  But then we
switch to another set of terminology completely for the generated
functions: inlheap for the single-key case, and regheap for the
multiple-key case.   I find that even more confusing.

I think we ought to get rid of this:

+typedef enum TypeCompar
+{
+       TYPE_COMP_OTHER,
+       TYPE_COMP_INT4,
+       TYPE_COMP_INT8,
+       TYPE_COMP_FLOAT4,
+       TYPE_COMP_FLOAT8,
+       TYPE_COMP_FULL_SPECIALISATION
+} TypeCompar;

Instead, just modify SortSupportData to have two function pointers as
members, one for the single-key case and another for the multiple-key
case, and have the sortsupport functions initialize them to the actual
functions that should be called.  The layer of indirection, AFAICS,
serves no purpose.
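
In other words, something like this - a sketch, with the two new
fields and their signatures invented for illustration:

typedef struct SortSupportData
{
    MemoryContext ssup_cxt;
    Oid         ssup_collation;
    bool        ssup_reverse;
    int         (*comparator) (Datum x, Datum y,
                               struct SortSupportData *ssup);

    /*
     * Set directly by the sort support function when a specialized
     * qsort applies; NULL means fall back to the generic path.
     */
    void        (*qsort_tuples_single) (struct Tuplesortstate *state);
    void        (*qsort_tuples_multi) (struct Tuplesortstate *state);
} SortSupportData;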

> It's pretty easy to remove a specialisation at any time - just remove
> less than 10 lines of code. It's also pretty difficult to determine,
> with everyone's absolute confidence, where the right balance lies.
> Perhaps the sensible thing to do is to not be so conservative in what
> we initially commit, while clearly acknowledging that we may not have
> the balance right, and that it may have to change. We then have the
> entire beta part of the cycle in which to decide to roll back from
> that position, without any plausible downside. If, on the other hand,
> we conservatively lean towards fewer specialisations in the initial
> commit, no one will complain about the lack of an improvement in
> performance that they never had.

Eh, really?  Typically when we do something good, the wolves are
howling at the door to make it work in more cases.

> I think that possibly the one remaining blocker to tentatively
> committing this with all specialisations intact is that I haven't
> tested this on Windows, as I don't currently have access to a Windows
> development environment. I have set one up before, but it's a huge
> pain. Can anyone help me out?

This doesn't strike me as terribly OS-dependent, unless by that we
mean compiler-dependent.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Progress on fast path sorting, btree index creation time

From
Peter Geoghegan
Date:
On 31 January 2012 19:47, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jan 27, 2012 at 3:33 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
>> Patch is attached. I have not changed the duplicate functions. This is
>> because I concluded that duplicating them was the lesser of two evils -
>> the alternative was to have the template generate both declarations in
>> the header file and definitions in the .c file, which seemed
>> particularly obscure. We're
>> never going to have to expose/duplicate any more comparators anyway.
>> Do you agree?
>
> Not really.  You don't really need macros to generate the prototypes;
> you could just write them out longhand.

Well, since that would have the advantage of forcing the client of
those macros to explicitly acknowledge what specialisations they were
getting in order to get a warning-free build, that might be a better
approach (though uncalled static functions are discarded anyway).
However, I don't really see how you expect me to make the
comparator-proper functions within nbtcompare.c inline without some
duplication. How can I have inline functions in that .c file, while
still having those functions callable from other TUs, while avoiding
duplication/moving the functions? I suppose that I could do some trick
with macros, but once again I believe that the cure is worse than the
disease - is it really such a big deal to duplicate a handful of
trivial functions, given that we're never going to have to expand that
number?

> I think there's a mess of naming confusion in here, though, as perhaps
> best illustrated by this macro definition:
>
> #define TEMPLATE_QSORT_ARG_HEAP(TYPE, COMPAR)                               \
> DO_TEMPLATE_QSORT_ARG(TYPE, COMPAR, inlheap,                                \
>        SING_ADDITIONAL_CODE, TYPE##inlheapcomparetup_inline)               \
> DO_TEMPLATE_QSORT_ARG(TYPE, COMPAR, regheap,                                \
>        MULT_ADDITIONAL_CODE(TYPE##regheapAppFunc),                         \
>            TYPE##regheapcomparetup_inline)
>
> The idea here is that when we have only a single sort key, we include
> SING_ADDITIONAL_CODE in the tuple comparison function, whereas when we
> have more than one, we instead include MULT_ADDITIONAL_CODE.  Right
> there, I think we have a naming problem, because abbreviating "single"
> to "sing" and multiple to "mult" is less than entirely clear.  For a
> minute or two I was trying to figure out whether our sorting code was
> musically inclined, and I'm a native english speaker.  But then we
> switch to another set of terminology completely for the generated
> functions: inlheap for the single-key case, and regheap for the
> multiple-key case.   I find that even more confusing.

I'm happy to clean that up, though I think that when you try and
browbeat C into the role of supporting generic programming, confusing
code is more or less inevitable.

> I think we ought to get rid of this:
>
> +typedef enum TypeCompar
> +{
> +       TYPE_COMP_OTHER,
> +       TYPE_COMP_INT4,
> +       TYPE_COMP_INT8,
> +       TYPE_COMP_FLOAT4,
> +       TYPE_COMP_FLOAT8,
> +       TYPE_COMP_FULL_SPECIALISATION
> +} TypeCompar;
>
> Instead, just modify SortSupportData to have two function pointers as
> members, one for the single-key case and another for the multiple-key
> case, and have the sortsupport functions initialize them to the actual
> functions that should be called.  The layer of indirection, AFAICS,
> serves no purpose.

It anticipates a time when the system will have both btree and heap
variants of the specialisations. It also serves to usefully document
the intent of the optimisation - the most direct interface isn't
necessarily the best. That said, I really am looking for the path of
least resistance towards getting this committed at this stage, so if
you still think it's a bad idea, I won't complain if you modify the
patch to have multiple function pointers as you described.

>> It's pretty easy to remove a specialisation at any time - just remove
>> less than 10 lines of code. It's also pretty difficult to determine,
>> with everyone's absolute confidence, where the right balance lies.
>> Perhaps the sensible thing to do is to not be so conservative in what
>> we initially commit, while clearly acknowledging that we may not have
>> the balance right, and that it may have to change. We then have the
>> entire beta part of the cycle in which to decide to roll back from
>> that position, without any plausible downside. If, on the other hand,
>> we conservatively lean towards fewer specialisations in the initial
>> commit, no one will complain about the lack of an improvement in
>> performance that they never had.
>
> Eh, really?  Typically when we do something good, the wolves are
> howling at the door to make it work in more cases.

Well, if anyone was complaining about a regression, because the
optimisation represented a net loss for their reasonable use-case,
then clearly we wouldn't be doing well, so we'd have to reconsider our
position, and no sensible wolf is going to argue with that. Obviously
it's not possible to prove that there couldn't be some kind of
regression under some set of circumstances in a water-tight fashion.
However, I believe that I have argued well that on balance, it's worth
maintaining the full set of specialisations. This is one case where
committing code and working out if there are problems makes sense,
because: 1) There very probably aren't (you yourself were not
surprised that a regression couldn't be demonstrated), and 2) if there
are, we can withdraw from having so many specialisations in small
increments, rather than having to throw everything out as might have
been the case in the past with other proposed performance patches.

>> I think that possibly the one remaining blocker to tentatively
>> committing this with all specialisations intact is that I haven't
>> tested this on Windows, as I don't currently have access to a Windows
>> development environment. I have set one up before, but it's a huge
>> pain. Can anyone help me out?
>
> This doesn't strike me as terribly OS-dependent, unless by that we
> mean compiler-dependent.

What concerns me is how the pg_always_inline macro behaves when
building with MSVC. While I have no reason to doubt that the effect
should be similar there, I would like to have that verified. Building
with MSVC is simple if you're only building release source packages,
but when you bring Flex + Bison into the equation, it becomes rather
difficult, which is why I asked for help there. You are generally
right to say that it is not terribly OS-dependent though, since, as
I've already shown, very portable software like the GNU C library uses
tricks like pg_always_inline extensively, both in code intended for a
particular arch, and code intended for all. Even where
pg_always_inline doesn't have an underlying compiler-specific way of
compelling inlining (it just falls back on plain inline, which may
well be a no-op on such an obscure platform anyway), the worst case
is simply forgoing some of the speedup. The usefulness of this
optimisation does not hinge on
pg_always_inline in any way.
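
For what it's worth, the definition amounts to something like this - a
sketch of the idea rather than the patch's exact text:

#if defined(__GNUC__)
#define pg_always_inline inline __attribute__((always_inline))
#elif defined(_MSC_VER)
#define pg_always_inline __forceinline
#else
/* no way to compel inlining here; plain inline is only a hint, and
 * the optimisation degrades gracefully without it */
#define pg_always_inline inline
#endif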

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: Progress on fast path sorting, btree index creation time

From
Jim Nasby
Date:
On Jan 26, 2012, at 9:32 PM, Robert Haas wrote:
> But if we want to put it on a diet, the first thing I'd probably be
> inclined to lose is the float4 specialization.  Some members of the
> audience will recall that I take a dim view of floating point arithmetic
> generally, but I'll concede that there are valid reasons for using
> float8.  I have a harder time coming up with a good reason to use
> float4 - ever, for anything you care about.  So I would be inclined to
> think that if we want to trim this back a bit, maybe that's the one to
> let go.  If we want to be even more aggressive, the next thing I'd
> probably lose is the optimization of multiple sortkey cases, on the
> theory that single sort keys are probably by far the most common
> practical case.

I do find float4 to be useful, though it's possible that my understanding is flawed…

We end up using float to represent ratios in our database; things that really, honest to God do NOT need to be exact.

In most cases, 7 digits of precision (which AFAIK is what you're guaranteed with float4) is plenty, so we use float4
rather than bloat the database (though, since we're on 64bit hardware I guess that distinction is somewhat moot…).

Is there something I'm missing that would make float4 useless as compared to float8?
--
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net




Re: Progress on fast path sorting, btree index creation time

From
Alvaro Herrera
Date:
Excerpts from Jim Nasby's message of Wed Feb 01 19:12:58 -0300 2012:
> On Jan 26, 2012, at 9:32 PM, Robert Haas wrote:
> > But if we want to put it on a diet, the first thing I'd probably be
> > inclined to lose is the float4 specialization.  Some members of the
> > audience will recall that I take a dim view of floating point arithmetic
> > generally, but I'll concede that there are valid reasons for using
> > float8.  I have a harder time coming up with a good reason to use
> > float4 - ever, for anything you care about.  So I would be inclined to
> > think that if we want to trim this back a bit, maybe that's the one to
> > let go.  If we want to be even more aggressive, the next thing I'd
> > probably lose is the optimization of multiple sortkey cases, on the
> > theory that single sort keys are probably by far the most common
> > practical case.
>
> I do find float4 to be useful, though it's possible that my understanding is flawed…
>
> We end up using float to represent ratios in our database; things that really, honest to God do NOT need to be exact.

But then, do you sort using those ratios?

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Progress on fast path sorting, btree index creation time

From
"ktm@rice.edu"
Date:
On Wed, Feb 01, 2012 at 04:12:58PM -0600, Jim Nasby wrote:
> On Jan 26, 2012, at 9:32 PM, Robert Haas wrote:
> > But if we want to put it on a diet, the first thing I'd probably be
> > inclined to lose is the float4 specialization.  Some members of the
> > audience will recall that I take a dim view of floating point arithmetic
> > generally, but I'll concede that there are valid reasons for using
> > float8.  I have a harder time coming up with a good reason to use
> > float4 - ever, for anything you care about.  So I would be inclined to
> > think that if we want to trim this back a bit, maybe that's the one to
> > let go.  If we want to be even more aggressive, the next thing I'd
> > probably lose is the optimization of multiple sortkey cases, on the
> > theory that single sort keys are probably by far the most common
> > practical case.
>
> I do find float4 to be useful, though it's possible that my understanding is flawed…
>
> We end up using float to represent ratios in our database; things that really, honest to God do NOT need to be exact.
>
> In most cases, 7 digits of precision (which AFAIK is what you're guaranteed with float4) is plenty, so we use float4
> rather than bloat the database (though, since we're on 64bit hardware I guess that distinction is somewhat moot…).
>
> Is there something I'm missing that would make float4 useless as compared to float8?
> --
> Jim C. Nasby, Database Architect                   jim@nasby.net
> 512.569.9461 (cell)                         http://jim.nasby.net
>
If the values stored are float4, it would be nice to have that fast-path
sort available too. In the cases where I have used float4 values in the past,
I absolutely did not need any of the float8 baggage, and in my case, using
the actual float4 comparison operator resulted in a significant time savings
over the normal float8. This could be processor specific, but it would be
worth testing before throwing it out.

Regards,
Ken


Re: Progress on fast path sorting, btree index creation time

From
Bruce Momjian
Date:
On Fri, Jan 27, 2012 at 09:37:37AM -0500, Robert Haas wrote:
> On Fri, Jan 27, 2012 at 9:27 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> > Well, I don't think it's all that subjective - it's more the case that
> > it is just difficult, or it gets that way as you consider more
> > specialisations.
> 
> Sure it's subjective.  Two well-meaning people could have different
> opinions without either of them being "wrong".  If you do a lot of
> small, in-memory sorts, more of this stuff is going to seem worthwhile
> than if you don't.
> 
> > As for what types/specialisations may not make the cut, I'm
> > increasingly convinced that floats (in the following order: float4,
> > float8) should be the first to go. Aside from the fact that we cannot
> > use their specialisations for anything like dates and timestamps,
> > floats are just way less useful than integers in the context of
> > database applications, or at least those that I've been involved with.
> > As important as floats are in the broad context of computing, it's
> > usually only acceptable to store data in a database as floats within
> > scientific applications, and only then when their limitations are
> > well-understood and acceptable. I think we've all heard anecdotes at
> > one time or another, involving their limitations not being well
> > understood.
> 
> While we're waiting for anyone else to weigh in with an opinion on the
> right place to draw the line here, do you want to post an updated
> patch with the changes previously discussed?

Well, I think we have to ask not only how many people are using
float4/8, but how many people are sorting or creating indexes on them. 
I suspect that would be few, and perhaps those specializations should
be eliminated.

Peter Geoghegan obviously has done some serious work in improving
sorting, and worked well with the community process.  He has done enough
analysis that I am hard-pressed to see how we would get similar
improvement using a different method, so I think it comes down to
whether we want the 28% speedup by adding 55k (1%) to the binary.

I think Peter has shown us how to get that, and what it will cost --- we
just need to decide now whether it is worth it.  What I am saying is
there probably isn't a cheaper way to get that speedup, either now or in
the next few years.  (COPY might need similar help for speedups.)

I believe this is a big win and well worth the increased binary size
because the speed up is significant, and because it is of general
usefulness for a wide range of queries.  Either of these would be enough
to justify the additional 1% size, but both make it an easy decision for
me.  

FYI, I believe COPY needs similar optimizations; we have gotten repeated
complaints about its performance and this method of optimization might
also be our only option.

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com
+ It's impossible for everything to be true. +


Re: Progress on fast path sorting, btree index creation time

From
Bruce Momjian
Date:
On Mon, Feb 06, 2012 at 04:19:07PM -0500, Bruce Momjian wrote:
> Peter Geoghegan obviously has done some serious work in improving
> sorting, and worked well with the community process.  He has done enough
> analysis that I am hard-pressed to see how we would get similar
> improvement using a different method, so I think it comes down to
> whether we want the 28% speedup by adding 55k (1%) to the binary.
> 
> I think Peter has shown us how to get that, and what it will cost --- we
> just need to decide now whether it is worth it.  What I am saying is
> there probably isn't a cheaper way to get that speedup, either now or in
> the next few years.  (COPY might need similar help for speedups.)
> 
> I believe this is a big win and well worth the increased binary size
> because the speed up is significant, and because it is of general
> usefulness for a wide range of queries.  Either of these would be enough
> to justify the additional 1% size, but both make it an easy decision for
> me.  
> 
> FYI, I believe COPY needs similar optimizations; we have gotten repeated
> complaints about its performance and this method of optimization might
> also be our only option.

Sorry, that was wordy.  What I am saying is that years ago we did
hot-spot optimization for storage and tuple access using macros.  We
have looked for cheap optimizations for sort (and COPY) for years, but
haven't found it.  If we want optimizations in these areas, we are
probably going to need to do the same sort of hot-spot optimization we
did earlier, and Peter's work is an excellent candidate for that.

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com
+ It's impossible for everything to be true. +


Re: Progress on fast path sorting, btree index creation time

From
Peter Geoghegan
Date:
On 6 February 2012 21:19, Bruce Momjian <bruce@momjian.us> wrote:
> Peter Geoghegan obviously has done some serious work in improving
> sorting, and worked well with the community process.

Thank you for acknowledging that.

It's unfortunate that C does not support expressing these kinds of
ideas in a more natural way.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: Progress on fast path sorting, btree index creation time

From
Bruce Momjian
Date:
On Mon, Feb 06, 2012 at 10:49:10PM +0000, Peter Geoghegan wrote:
> On 6 February 2012 21:19, Bruce Momjian <bruce@momjian.us> wrote:
> > Peter Geoghegan obviously has done some serious work in improving
> > sorting, and worked well with the community process.
> 
> Thank you for acknowledging that.
> 
> It's unfortunate that C does not support expressing these kinds of
> ideas in a more natural way.

Yes, it is a problem, and a benefit.  We have avoided C++ because these
types of trade-offs that we are discussing are often done invisibly, so
we can't make the decision ourselves.

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com
+ It's impossible for everything to be true. +


Re: Progress on fast path sorting, btree index creation time

From
Bruce Momjian
Date:
On Mon, Feb 06, 2012 at 06:43:04PM -0500, Bruce Momjian wrote:
> On Mon, Feb 06, 2012 at 10:49:10PM +0000, Peter Geoghegan wrote:
> > On 6 February 2012 21:19, Bruce Momjian <bruce@momjian.us> wrote:
> > > Peter Geoghegan obviously has done some serious work in improving
> > > sorting, and worked well with the community process.
> > 
> > Thank you for acknowledging that.
> > 
> > It's unfortunate that C does not support expressing these kinds of
> > ideas in a more natural way.
> 
> Yes, it is a problem, and a benefit.  We have avoided C++ because these
> types of trade-offs that we are discussing are often done invisibly, so
> we can't make the decision ourselves.

Let me add that while it is fine for languages like C++ to make these
decisions for application code automatically, operating system and
high-performance database developers prefer to make such decisions
explicitly, which is what we are discussing now.

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com
+ It's impossible for everything to be true. +


Re: Progress on fast path sorting, btree index creation time

From
"Jim \"Decibel!\" Nasby"
Date:
On 2/6/12 3:19 PM, Bruce Momjian wrote:
>> >  While we're waiting for anyone else to weigh in with an opinion on the
>> >  right place to draw the line here, do you want to post an updated
>> >  patch with the changes previously discussed?
> Well, I think we have to ask not only how many people are using
> float4/8, but how many people are sorting or creating indexes on them.
> I think it would be few and perhaps should be eliminated.
I agree that it's probably pretty unusual to index floats. My objection 
was on the assumption that float8 is valid but float4 isn't. If we are 
going to provide a fast-path for one then we should do it for both if 
for no other reason than least surprise.

--
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net




Re: Progress on fast path sorting, btree index creation time

From
Jay Levitt
Date:
Jim "Decibel!" Nasby wrote:
> I agree that it's probably pretty unusual to index floats.

FWIW: Cubes and points are floats, right? So would spatial indexes benefit 
from this optimization, or is it only raw floats?

Jay Levitt


Re: Progress on fast path sorting, btree index creation time

From
Robert Haas
Date:
On Tue, Feb 7, 2012 at 2:39 PM, Jay Levitt <jay.levitt@gmail.com> wrote:
> Jim "Decibel!" Nasby wrote:
>>
>> I agree that it's probably pretty unusual to index floats.
>
> FWIW: Cubes and points are floats, right? So would spatial indexes benefit
> from this optimization, or is it only raw floats?

Cubes are not floats, nor are points.

In general, what's being proposed here is to make a separate copy of
quicksort for each of N datatypes, with the comparator function
inlined.  In order for this to benefit multiple types, they'd have to
use the same set of machine instructions to compare values on disk.
So in general you can make this apply to as many datatypes as you
want: it's just that each one will require another copy of the
quicksort code in the binary.  The more complex the comparator is, the
less you'll save, because the savings presumably come largely from
things like saving and restoring registers and pushing/popping stack
frames, and that's really only going to be material if the underlying
comparator is pretty darn cheap.

Actually, to be more precise, the patch is proposing to create TWO
copies of the quicksort code for each of N datatypes, one for the case
where there is but a single sort key and the other for the case where
there are multiple sort keys.  Having a separate code for the single
sort key case saves more because it lets you get rid of a control loop
- you don't have to iterate down the list of keys, because there's
only one.
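
To make the mechanism concrete, here is a toy version of the trick -
not the patch's actual template, just an illustration of how a macro
can stamp out one sort per type with the comparator inlined:

#include <stddef.h>

/*
 * Stamp out a type-specific sort with the comparison inlined.  Each
 * instantiation costs binary space, but avoids the function-pointer
 * call per comparison that a generic qsort() pays.
 */
#define TEMPLATE_SORT(NAME, TYPE, CMP)                      \
static void                                                 \
NAME(TYPE *a, size_t n)                                     \
{                                                           \
    size_t      i, j;                                       \
                                                            \
    for (i = 1; i < n; i++)                                 \
    {                                                       \
        TYPE        key = a[i];                             \
                                                            \
        for (j = i; j > 0 && CMP(a[j - 1], key) > 0; j--)   \
            a[j] = a[j - 1];                                \
        a[j] = key;                                         \
    }                                                       \
}

#define SCALAR_CMP(x, y)    (((x) > (y)) - ((x) < (y)))

TEMPLATE_SORT(sort_int32, int, SCALAR_CMP)      /* one copy of the code */
TEMPLATE_SORT(sort_double, double, SCALAR_CMP)  /* per specialization */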

I've been skeptical of this patch for a number of reasons, the major
one of which is that most workloads spend only a very small amount of
time doing quicksorts, and most quicksorts are of very small amounts
of data and therefore fast anyway.   It is easy to construct an
artificial test case that does lots and lots of in-memory sorting, but
in real life I think that's not the greater part of what people use
databases for.  The time gets spent on things like groveling through
big tables or doing joins, and then maybe we sort the rows at the end,
but that's not where most of the time goes.  It may be rightly pointed
out, of course, that a penny saved is a penny earned: the fact that
most people don't spend much time quicksorting doesn't mean that we
shouldn't try to make quicksorting as fast as possible.  But there are
a couple of additional considerations.

First, the code is, at present, uglier than I would like.  I mean to
spend some time trying to clean that up, but there are 102 patches in
this CommitFest (the number seems to keep slowly growing, despite the
deadline being three weeks gone) and this isn't the only one I care
about, so I haven't quite gotten there yet.  But hopefully soon.
Second, there's a concern about binary bloat: duplicating lots of code
with different comparators inlined generates, well, a lot of machine
code.  Of course, a 0.8% increase in the size of the resulting binary
is very unlikely to produce any measurable performance regression on
its own, but that's not really the issue.  The issue is rather that we
could easily go nuts applying this technique in lots of different
places - or even just in one place to lots of different types.  Let's
say that we find that even for types with very complex comparators, we
can get a 5% speedup on quick-sorting speed using this technique.
Should we, then, apply the technique to every type we have?  Perhaps
the binary would grow by, I don't know, 10%.  Maybe that's still not
enough to start hurting performance (or making packagers complain),
but surely it's getting there, and what happens when we discover that
similar speedups are possible using the same trick in five other
scenarios?  I think it's a bit like going out to dinner and spending
$50.  It's nicer than eating at home, and many people can afford to do
it occasionally, but if you do it every night, it gets expensive (and
possibly fattening).

So we need some principled way of deciding how much inlining is
reasonable, because I am 100% certain this is not going to be the last
time someone discovers that a massive exercise in inlining can yield a
nifty performance benefit in some case or another: index builds and
COPY have already been mentioned on this thread, and I expect that
people will find other cases as well.  I'm not really sure what the
"budget" is - i.e. how much binary bloat we can afford to add - or how
many cases there are that can benefit, but the first isn't infinite
and the second is more than the first.

Having said all that, I am inclined to commit at least some portion of
this, but I wanted to knock off a few other patches that have been
lingering for a while first.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Progress on fast path sorting, btree index creation time

From
Bruce Momjian
Date:
On Tue, Feb 07, 2012 at 09:38:39PM -0500, Robert Haas wrote:
> So we need some principled way of deciding how much inlining is
> reasonable, because I am 100% certain this is not going to be the last
> time someone discovers that a massive exercise in inlining can yield a
> nifty performance benefit in some case or another: index builds and
> COPY have already been mentioned on this thread, and I expect that
> people will find other cases as well.  I'm not really sure what the
> "budget" is - i.e. how much binary bloat we can afford to add - or how
> many cases there are that can benefit, but the first isn't infinite
> and the second is more than the first.
> 
> Having said all that, I am inclined to commit at least some portion of
> this, but I wanted to knock off a few other patches that have been
> lingering for a while first.

One approach would be to only do a few types now, e.g. integers and
strings, and perhaps leave the others for later. 

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: Progress on fast path sorting, btree index creation time

From
Peter Geoghegan
Date:
It doesn't necessarily matter if we increase the size of the postgres
binary by 10%, precisely because most of that is not going to be in
play from one instant to the next. I'm thinking, in particular, of
btree index specialisations, where it could make perfect sense to "go
crazy". You cannot have a reasonable discussion about such costs
without considering that they will perhaps never be paid, given any
plausible workload. That's why the measured percentage increase in the
postgres binary's size really isn't pertinent at all. At
best, it's a weak proxy for such costs, assuming you don't have a
demonscene-like preoccupation with reducing binary size, and I don't
believe that we should.

It would be difficult for me to measure such things objectively, but
I'd speculate that the proprietary databases have much larger binaries
than ours, while having far fewer features, precisely because they
started applying tricks like this a long time ago. You could counter
that their code bases probably look terrible, and you'd have a point,
but so do I.

On 8 February 2012 02:38, Robert Haas <robertmhaas@gmail.com> wrote:
> I've been skeptical of this patch for a number of reasons, the major
> one of which is that most workloads spend only a very small amount of
> time in doing quicksorts, and most quicksorts are of very small amounts
> of data and therefore fast anyway.   It is easy to construct an
> artificial test case that does lots and lots of in-memory sorting, but
> in real life I think that's not the great part of what people use
> databases for.

Fair enough, but if that's true, then it's also true that the cost due
to cache marginalisation - the only cost that I think is worth
considering at all - is correspondingly a small fraction of that very
small amount of sorting.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: Progress on fast path sorting, btree index creation time

From
Tom Lane
Date:
Peter Geoghegan <peter@2ndquadrant.com> writes:
> It doesn't necessarily matter if we increase the size of the postgres
> binary by 10%, precisely because most of that is not going to be in
> play from one instant to the next.

You've heard of swapping, no?  Basically what I'm hearing from you is
total denial that binary bloat costs anything, and that just does not
square with reality.  Even if the costs in performance terms are
negligible in many common situations (something that you've asserted
but without offering any proof), there are other costs; software
distribution for instance is definitely sensitive to size.  As a Red Hat
person I've had to spend time fighting to keep Postgres included in the
minimum "does it fit on a DVD" distribution of RHEL/Fedora.  That case
gets harder to make every year, and yet it's pretty critical to the
success of the project --- if you don't get distributed, you lose users.

IMO this patch is already well past the point of diminishing returns in
value-per-byte-added.  I'd like to see it trimmed back to provide a fast
path for just single-column int4/int8/float4/float8 sorts.  The other
cases aren't going to offer enough of a win to justify the code space.
        regards, tom lane


Re: Progress on fast path sorting, btree index creation time

From
Robert Haas
Date:
On Wed, Feb 8, 2012 at 8:33 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> It doesn't necessarily matter if we increase the size of the postgres
> binary by 10%, precisely because most of that is not going to be in
> play from one instant to the next.

As Tom says, that doesn't jibe with my experience.  If you add on
enough binary bloat, you will have more page faults.  It's true (as I
think you pointed out upthread) that packing all the copies of
quicksort into the binary one after the other minimizes the effect on
other things, but it doesn't eliminate it entirely.  If you've got
this:

<random other stuff> ... <a zillion copies of quicksort> .... <more other stuff>

...then a branch from the "random other stuff" section of the binary
to the "more other stuff" section of the binary may cost more.  For
example, suppose the OS does X amount of readahead.  By stuffing all
those copies of quicksort into the middle there, you increase the
chances that the page you need was beyond the readahead window.  Or,
if it wasn't going to be in the readahead window either way, then you
increase the distance that the disk head needs to move to find the
required block.

These costs are very small, no question about it.  They are almost
impossible to measure individually, in much the same way that the cost
of pollution or greenhouse gas emissions is difficult to measure.  But
it's an error to assume that because the costs are individually small
that they will never add up to anything material.  As long as you keep
insisting on that, it's hard to have a rational conversation.  We can
reasonably debate the magnitude of the costs, but to assert that they
don't exist gets us nowhere.  Suppose we could get a 10% speedup on
sin(numeric) by adding 40GB to the binary size.  Would you be in favor
of that?  Do you think it would hurt performance on any other
workloads?  Would our packagers complain at all?  Surely your same
argument would apply to that case in spades: anyone who is not using
the gigantic hard-coded lookup table will not pay any portion of the
cost of it.

> It would be difficult for me to measure such things objectively, but
> I'd speculate that the proprietary databases have much larger binaries
> than ours, while having far fewer features, precisely because they
> started applying tricks like this a long time ago. You could counter
> that their code bases probably look terrible, and you'd have a point,
> but so do I.

That might be true; I have no idea.  There are probably lots of
reasons why their code bases look terrible, including a long history
of backward compatibility with now-defunct versions, a variety of
commercial pressures, and the fact that they don't have to take flak
in the public square for what their code looks like.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Progress on fast path sorting, btree index creation time

From
Robert Haas
Date:
On Wed, Feb 8, 2012 at 9:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> IMO this patch is already well past the point of diminishing returns in
> value-per-byte-added.  I'd like to see it trimmed back to provide a fast
> path for just single-column int4/int8/float4/float8 sorts.  The other
> cases aren't going to offer enough of a win to justify the code space.

I'm curious about how much we're gaining from the single-column
specializations vs. the type-specific specializations.  I think I'm
going to go try to characterize that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Progress on fast path sorting, btree index creation time

From
Bruce Momjian
Date:
On Wed, Feb 08, 2012 at 01:33:30PM +0000, Peter Geoghegan wrote:
> It doesn't necessarily matter if we increase the size of the postgres
> binary by 10%, precisely because most of that is not going to be in
> play from one instant to the next. I'm thinking, in particular, of
> btree index specialisations, where it could make perfect sense to "go
> crazy". You cannot have a reasonable discussion about such costs
> without considering that they will perhaps never be paid, given any
> plausible workload. That's why the percentage that the postgres binary
> size has been shown to increase by really isn't pertinent at all. At
> best, it's a weak proxy for such costs, assuming you don't have a
> demonscene-like preoccupation with reducing binary size, and I don't
> believe that we should.

When you start a binary, your executable is mapped to a file system
binary, and you page-fault in the pages you need.  Now, if your
optimization was alone in its own 4k (x86) virtual pages, and you never
called the functions, odds are you would not pay a penalty, aside from
distribution penalty, and perhaps a small penalty if useful code was
before and after your block.

The sort code expansion, however, is done in-line, in the middle of the
sort code: you are clearly filling 64-byte (x86) CPU cache lines
with type-specific expansion code for every sort case, whether we use
the code or not.  Now, I don't think it is a big problem, and I think
the speedup is worth it for common data types, but we can't say the cost
is zero.  

Saying it another way, having a binary in your file system that you
never call is not overhead except for storage, but in this case, the
sort code expansion is inside critical functions we are already calling.

Frankly, it is this cost that has kept us away from such
optimizations for a long time.  I recently posted that the zero-cost
optimizations are mostly completed for sort and COPY, and we have to
start considering non-zero-cost optimizations --- sad, but true.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + It's impossible for everything to be true. +


Re: Progress on fast path sorting, btree index creation time

From
Bruce Momjian
Date:
On Wed, Feb 08, 2012 at 10:17:36AM -0500, Robert Haas wrote:
> On Wed, Feb 8, 2012 at 9:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > IMO this patch is already well past the point of diminishing returns in
> > value-per-byte-added.  I'd like to see it trimmed back to provide a fast
> > path for just single-column int4/int8/float4/float8 sorts.  The other
> > cases aren't going to offer enough of a win to justify the code space.
> 
> I'm curious about how much we're gaining from the single-column
> specializations vs. the type-specific specializations.  I think I'm
> going to go try to characterize that.

Yes, please.  That would be a big help.   Is there no optimization for
strings?  I assume they are sorted a lot.  

We can always add more data types later.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: Progress on fast path sorting, btree index creation time

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Yes, please.  That would be a big help.   Is there no optimization for
> strings?  I assume they are sorted a lot.  

It seems unlikely that it'd be worth including strings, especially if
your locale is not C.  This whole thing only makes sense for datatypes
that are comparable in approximately 1 machine instruction.
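For concreteness, a minimal sketch (a hypothetical function, not code
from the patch): for a type like int4, the whole comparator boils down
to roughly one compare-and-branch once inlined, so the indirect
function call it replaces was most of the comparison's cost.

/*
 * Hypothetical sketch: an int4-style comparator that reduces to a
 * couple of instructions when inlined into the sort loop.
 */
static inline int
cmp_int4_sketch(int a, int b)
{
    return (a < b) ? -1 : (a > b);
}

For a locale-aware text comparison, by contrast, the work inside the
comparator dwarfs any call overhead saved, so there is little left for
inlining to win.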
        regards, tom lane


Re: Progress on fast path sorting, btree index creation time

From
Bruce Momjian
Date:
On Wed, Feb 08, 2012 at 11:35:46AM -0500, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Yes, please.  That would be a big help.   Is there no optimization for
> > strings?  I assume they are sorted a lot.  
> 
> It seems unlikely that it'd be worth including strings, especially if
> your locale is not C.  This whole thing only makes sense for datatypes
> that are comparable in approximately 1 machine instruction.

Ah, OK, interesting.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: Progress on fast path sorting, btree index creation time

From
Peter Geoghegan
Date:
On 8 February 2012 15:17, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Feb 8, 2012 at 9:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> IMO this patch is already well past the point of diminishing returns in
>> value-per-byte-added.  I'd like to see it trimmed back to provide a fast
>> path for just single-column int4/int8/float4/float8 sorts.  The other
>> cases aren't going to offer enough of a win to justify the code space.
>
> I'm curious about how much we're gaining from the single-column
> specializations vs. the type-specific specializations.  I think I'm
> going to go try to characterize that.

I think it might make more sense to lose the type-specific
specialisations for the multi-key case while adding a generic
multi-key specialisation, than to lose all multi-key specialisations,
though I have not considered that question at length, and would think
that we'd still want to keep an int4 version in that case. Note that I
*did not* include a generic multi-key specialisation, though only
because I saw little point, having already covered by far the most
common cases.

While you're at it, I'd like to suggest that you perform a benchmark
on a multi-key specialisation, so we can see just what we're throwing
away before we do so. Better to have those numbers come from you.

I continue to maintain that the most appropriate course of action is
to provisionally commit all specialisations. If it's hard to know what
effect this is going to have on real workloads, let's defer to beta
testers, who presumably try the new release out with their
application. It's a question you could put squarely to them, and
gradually rolling back from that initial position afterwards wouldn't
be much of a problem.

The mysql-server package is 45 MB on Fedora 16. That 1% of Postgres
binary figure is for my earlier patch with btree specialisations,
right? I'm not asking you to look at that right now. I also don't
think that "where do we eventually draw the line with specialisations
like this in Postgres generally?" is a question that you should expect
me to answer, though I will say that we should look at each case on
its merits.

I have not "totally denied" binary bloat costs. I have attempted to
quantify them, while acknowledging that such a task is difficult, as
was evident from the fact that Robert "wasn't surprised" that I could
not demonstrate any regression. Granted, my standard for ruling out a
regression is that there is very clearly no net loss in performance at
some reasonable granularity, which is a very practical standard. You can
quite easily contrive a case that HOT handles really badly. Some
people did, I believe, but HOT won out because it was clearly very
useful in the real world.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: Progress on fast path sorting, btree index creation time

From
Robert Haas
Date:
On Wed, Feb 8, 2012 at 10:59 AM, Bruce Momjian <bruce@momjian.us> wrote:
> On Wed, Feb 08, 2012 at 10:17:36AM -0500, Robert Haas wrote:
>> On Wed, Feb 8, 2012 at 9:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> > IMO this patch is already well past the point of diminishing returns in
>> > value-per-byte-added.  I'd like to see it trimmed back to provide a fast
>> > path for just single-column int4/int8/float4/float8 sorts.  The other
>> > cases aren't going to offer enough of a win to justify the code space.
>>
>> I'm curious about how much we're gaining from the single-column
>> specializations vs. the type-specific specializations.  I think I'm
>> going to go try to characterize that.
>
> Yes, please.  That would be a big help.   Is there no optimization for
> strings?  I assume they are sorted a lot.

They are, but the value of the optimization is expected to diminish
for more complex comparators.

I did a quick test of this using the same setup I mentioned upthread:

create table nodups (g) as select (g%10)*1000+g/10 from
generate_series(1,10000) g;

This was the actual test query:

select * from nodups order by g offset 10001;

I did a ten-second warmup after starting each postmaster, followed
by three one-minute test runs:

master:
tps = 298.824321 (including connections establishing)
tps = 301.741619 (including connections establishing)
tps = 302.238016 (including connections establishing)

Peter's patch, but with the type-specific optimizations disabled, so
using just the single-sortkey optimizations:
tps = 357.553690 (including connections establishing)
tps = 358.765875 (including connections establishing)
tps = 358.825051 (including connections establishing)

Peter's patch, intact:
tps = 394.566994 (including connections establishing)
tps = 394.228667 (including connections establishing)
tps = 394.492677 (including connections establishing)

So that's about a 19% speedup for just the single sort-key
optimizations, and about a 30% speedup in total.  Compared to a
baseline that includes the single sort-key optimizations, the
additional type-specific optimization is buying about 10% on this
test.

I tried changing the test query to return the data, like this:

select * from nodups order by g;

master:
tps = 181.301129 (including connections establishing)
tps = 180.348333 (excluding connections establishing)
tps = 179.600825 (including connections establishing)

Peter's patch, but with the type-specific optimizations disabled, so
using just the single-sortkey optimizations:
tps = 201.728269 (including connections establishing)
tps = 198.022527 (including connections establishing)
tps = 199.663082 (including connections establishing)

Peter's patch, intact:
tps = 214.037387 (including connections establishing)
tps = 212.895145 (including connections establishing)
tps = 211.596558 (including connections establishing)

So, with the overhead of actually returning the sorted data, we get
10% from the generic single-sortkey optimizations and 18% from the two
combined.  Compared to a baseline that includes the single-sortkey
optimizations, the incremental improvement from adding the
type-specific optimizations is about 6.6%.  It would be less on a
wider table, presumably.

I also tested a two-column sort:

create table twocol (a, b) as select g%101, g%17 from
generate_series(1,10000) g;
select * from twocol order by a, b offset 10000;

master:
tps = 270.218604 (including connections establishing)
tps = 265.634964 (including connections establishing)
tps = 268.520805 (including connections establishing)

Peter's patch, intact:
tps = 285.528626 (including connections establishing)
tps = 285.140345 (including connections establishing)
tps = 282.388457 (including connections establishing)

So here the type-specific optimizations are buying us even less: only
about 6%, and that's with no client overhead.

And I tested float8:

create table f8 (g) as select random() from generate_series(1,10000);
select * from f8 order by g offset 10000;

master:
tps = 228.373239 (including connections establishing)
tps = 225.117704 (including connections establishing)
tps = 225.196197 (including connections establishing)

Peter's patch, but with the type-specific optimizations disabled, so
using just the single-sortkey optimizations:
tps = 263.060820 (including connections establishing)
tps = 262.788926 (including connections establishing)
tps = 263.041186 (including connections establishing)

Peter's patch, intact:
tps = 283.274797 (including connections establishing)
tps = 280.597965 (including connections establishing)
tps = 280.151349 (including connections establishing)

That's +17% for the single-sortkey stuff, +25% for everything, and the
incremental improvement of the type-specific optimizations over the
generic single-key optimizations is about 6.7%.

It seems clear that the single sort-key optimizations are a much
better value per byte of code than the type-specific optimizations.
Ignoring client overhead, we get between half and two-thirds of the
benefit, and it costs us just one extra copy of the quicksort logic
for all data types.  The type-specific optimizations only apply to a
subset of the available data types, produce a lot more machine code
(one extra copy of quicksort per type), and the absolute performance
benefit is less.  My gut feeling is that the few extra percentage
points we can get by including lots of extra copies of quicksort is
not worth it.  There are probably many cases where we can squeeze a
few extra points out of inlining (Tom and I have discussed a few of
them in the past, in particular one related to index-only scans) and I
don't have any confidence that this is the most important one, or even
in the top ten.  If we were talking about speeding up COPY or index
lookups, I might well feel differently, but very few people are going
to quicksort 10,000 rows on a sufficiently regular basis to justify
doing this.  To even measure the benefit of this, you have to set up a
totally artificial test case (such as the above), and you have to run
it via pgbench, because if you just run it interactively, the
run-to-run variation swamps the actual improvement, and without doing
a statistical analysis of the results you can't even be sure whether
anything's changed.  I just can't get excited about that.  However, I
find the single-key optimizations much more compelling, for the
reasons stated above, and feel we ought to include those.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Progress on fast path sorting, btree index creation time

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> [ lots of numbers ]

> ... I just can't get excited about that.  However, I
> find the single-key optimizations much more compelling, for the
> reasons stated above, and feel we ought to include those.

This conclusion seems sound to me, for the reasons you stated and one
more: optimizations for a single sort key are going to be applicable
to a very wide variety of queries, whereas all the other cases are
necessarily less widely applicable.
        regards, tom lane


Re: Progress on fast path sorting, btree index creation time

From
Peter Geoghegan
Date:
On 8 February 2012 17:58, Robert Haas <robertmhaas@gmail.com> wrote:
> It seems clear that the single sort-key optimizations are a much
> better value per byte of code than the type-specific optimizations.
> Ignoring client overhead, we get between half and two-thirds of the
> benefit, and it costs us just one extra copy of the quicksort logic
> for all data types.

That was clear from an early stage, and is something that I
acknowledged way back in September - it's pretty obvious that chasing
diminishing returns is the nature of this sort of thing, which is
fine, provided that they are not chased beyond some point that doesn't
make sense. However, I do not see why we should not at least include
full integer specialisations as well (for single-keys at least), given
that we get the additional, not inconsiderable improvements. I do not
accept the premise that we need to find the optimal bytes added to
performance increase ratio, because, as I've already stressed, adding
bytes to the application binary does not have a linear cost.

Here's what might be a good compromise:

singlekey generic, singlekey int4, singlekey int8, multikey generic.

That is a total of 4 contiguous copies of the qsort, because
we've obsoleted the original qsort_arg (except for btree index
sorting, which likely matters far less). We get the most benefit for 5
very common types - int4, int8, date, timestamp and timestamptz. A 30%
improvement has to be better than the 19% speedup for just the single
sort-key optimisations, considering that in practice these types are
very commonly used.

I think that there may be additional benefits from making the
qsort_arg specialisation look less like a c stdlib one, like refining
the swap logic to have compile-time knowledge of the type it is
sorting. I'm thinking that we could usefully trim quite a bit from
this:

#define SWAPINIT(a, es) swaptype = ((char *)(a) - (char *)0) % sizeof(long) || \
        (es) % sizeof(long) ? 2 : (es) == sizeof(long) ? 0 : 1;

#define thistype_vecswap(a, b, n)                               \
        if ((n) > 0) inl_swapfunc((a), (b), (size_t)(n), swaptype)

#define thistype_swap(a, b)                                     \
        if (swaptype == 0)                                      \
        {                                                       \
                long t = *(long *)(void *)(a);                  \
                *(long *)(void *)(a) = *(long *)(void *)(b);    \
                *(long *)(void *)(b) = t;                       \
        }                                                       \
        else                                                    \
                inl_swapfunc(a, b, es, swaptype)

inline static void
inl_swapfunc(char *a, char *b, size_t n, int swaptype)
{
        if (swaptype <= 1)
                swapcode(long, a, b, n);
        else
                swapcode(char, a, b, n);
}
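For what it's worth, here is a sketch of where that trimming might lead
(hypothetical names; this is not code from the patch). If the
specialised qsort knows at compile time that it sorts SortTuple-like
elements, the whole swaptype dispatch above collapses into a plain
struct assignment that the compiler can optimise freely:

#include <stdbool.h>

/*
 * Hypothetical stand-in for tuplesort's SortTuple; the point is only
 * that the element type is fixed at compile time.
 */
typedef struct
{
    void   *tuple;
    long    datum1;
    bool    isnull1;
} ElemSketch;

/* With a known element type, swapping is just an assignment. */
#define ELEM_SWAP(a, b)             \
    do {                            \
        ElemSketch t_ = *(a);       \
        *(a) = *(b);                \
        *(b) = t_;                  \
    } while (0)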


--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: Progress on fast path sorting, btree index creation time

From
Peter Geoghegan
Date:
On 8 February 2012 18:48, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> I think that there may be additional benefits from making the
> qsort_arg specialisation look less like a c stdlib one, like refining
> the swap logic to have compile-time knowledge of the type it is
> sorting. I'm thinking that we could usefully trim quite a bit from
> this:

It seems that while we cannot get any better performance, no doubt
because the compiler is already applying all manner of optimisations,
perhaps including constant propagation, we can simplify this part of
the code quite a bit. That seems fairly incidental though.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: Progress on fast path sorting, btree index creation time

From
Robert Haas
Date:
On Wed, Feb 8, 2012 at 1:48 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> That was clear from an early stage, and is something that I
> acknowledged way back in September

OK, so why didn't/don't we do and commit that part first, and then
proceed to argue about the remainder once it's in?

> I think that there may be additional benefits from making the
> qsort_arg specialisation look less like a c stdlib one, like refining
> the swap logic to have compile-time knowledge of the type it is
> sorting. I'm thinking that we could usefully trim quite a bit from
> this:

That's an interesting idea, which seems worth pursuing, though
possibly not for 9.2.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Progress on fast path sorting, btree index creation time

From
Peter Geoghegan
Date:
On 8 February 2012 23:33, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Feb 8, 2012 at 1:48 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
>> That was clear from an early stage, and is something that I
>> acknowledged way back in September
>
> OK, so why didn't/don't we do and commit that part first, and then
> proceed to argue about the remainder once it's in?

I have no objection to that. I'm not sure that it's going to be
possible to agree to anything beyond what has already been agreed. We
don't seem to be covering new ground.

What would it take to convince you to be more inclusive, and have at
least a few full specialisations, in particular, int4 and int8 single
key full specialisations? I'm sure that you'll agree that they're the
second most compelling case. There doesn't seem to be a standard that
I can meet - one that you can even /describe/ - to have those
additional specialisations committed, which is a shame, since they
appear to be
pretty decent wins above and beyond the generic single key
specialisation, all things considered.

I suppose that the standard refrain here is "it's your patch; you must
prove the case for committing it". That would be fine by me, but this
isn't just about me, and it seems to be a practical impossibility in
this case. It may be quite impossible even with far greater resources
than I can afford to apply here, since it seems like you're hinting at
some kind of rigorous proof that this cannot cause a regression for a
single client, even though in order to see that regression multiple
other clients would have to be benefiting. I cannot provide such a
proof, since it would probably have to consider all commercially
available CPUs and all possible workloads - I doubt that anyone can.
That being the case, I'm eager not to waste any more time on this. I
bear no ill will; that just seems to be the situation that we find
ourselves in.

That said, if you can describe the standard, I'll try and meet it -
you seem to be suggesting that months and months of benchmarks
wouldn't really change things, since this is the first step on the
road to binary bloat ruin. The only fundamentally new argument that I
can offer is that the standard the community is applying here severely
limits the number of performance improvements that
can be added, and the aggregate effect is that we're quite certainly
worse off. It's a pity this isn't a really easy decision for you to
make, but that's the nature of these things, and we should expect that
to increasingly be the case. Is it really so unacceptable for us to
risk doing a little worse under some rare circumstances in order to do
better under some common circumstances? Why should it be an easy
decision - when are important decisions ever easy?

>> I think that there may be additional benefits from making the
>> qsort_arg specialisation look less like a c stdlib one, like refining
>> the swap logic to have compile-time knowledge of the type it is
>> sorting. I'm thinking that we could usefully trim quite a bit from
>> this:
>
> That's an interesting idea, which seems worth pursuing, though
> possibly not for 9.2.

Well, it's really just a code clean-up. I suspect that qsort_arg is a
minimally modified version of the NetBSD one that wasn't necessarily
thoroughly understood when it was initially added (not that I'm
suggesting that it ought to have been). Then again, you might prefer
to keep it as consistent as possible with qsort (the other copy of
that sort function that we already have).

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: Progress on fast path sorting, btree index creation time

From
Noah Misch
Date:
On Tue, Feb 07, 2012 at 09:38:39PM -0500, Robert Haas wrote:
> Second, there's a concern about binary bloat: duplicating lots of code
> with different comparators inlined generates, well, a lot of machine
> code.  Of course, an 0.8% increase in the size of the resulting binary
> is very unlikely to produce any measurable performance regression on
> its own, but that's not really the issue.  The issue is rather that we
> could easily go nuts applying this technique in lots of different
> places - or even just in one place to lots of different types.  Let's
> say that we find that even for types with very complex comparators, we
> can get a 5% speedup on quick-sorting speed using this technique.
> Should we, then, apply the technique to every type we have?  Perhaps
> the binary would grow by, I don't know, 10%.  Maybe that's still not
> enough to start hurting performance (or making packagers complain),
> but surely it's getting there, and what happens when we discover that
> similar speedups are possible using the same trick in five other
> scenarios?  I think it's a bit like going out to dinner and spending
> $50.  It's nicer than eating at home, and many people can afford to do
> it occasionally, but if you do it every night, it gets expensive (and
> possibly fattening).
> 
> So we need some principled way of deciding how much inlining is
> reasonable, because I am 100% certain this is not going to be the last
> time someone discovers that a massive exercise in inlining can yield a
> nifty performance benefit in some case or another: index builds and
> COPY have already been mentioned on this thread, and I expect that
> people will find other cases as well.  I'm not really sure what the
> "budget" is - i.e. how much binary bloat we can afford to add - or how
> many cases there are that can benefit, but the first isn't infinite
> and the second is more than the first.

Having such a metric would resolve this discussion, but formulating it will be
all but impossible.  We don't know how much time PostgreSQL installations now
spend sorting or how much additional time they would spend on cache misses,
TLB misses, page faults, etc.  That data won't show up on this thread.

You posed[1] the straw man of a sin(numeric) optimization requiring a 40GB
lookup table.  I would not feel bad about rejecting that, because it can live
quite comfortably as an external module.  Indeed, I think the best thing we
can do to constrain long-term bloat in the "postgres" executable is to improve
pluggability.  Let a wider range of features live outside the postmaster
binary.  For example, if we had the infrastructure to let hash, GIN, GiST and
SP-GiST index access methods live in their own DSOs (like PL/pgSQL does
today), I would support doing that.  Extensibility is a hallmark of
PostgreSQL.  It's a bad day when we reject a minority-interest feature or
optimization that has no other way to exist.

Better pluggability can't ease the decision process for Peter's patch, because
its specializations essentially must live in the "postgres" executable to live
at all.  Nominally, we could let external modules inject sort specializations
for core types, and Peter could publish an all_fast_sorts extension.  That
would be useless punting: if we can't make a principled decision on whether
these accelerated sorts justify 85 KiB of binary, how will a DBA who discovers
the module make that decision?

This patch has gotten more than its fair share of attention for bloat, and I
think that's mostly because there's a dial-a-bloat-level knob sticking out and
begging to be frobbed.  On my system, fastpath_sort_2012_01_19.patch adds 85
KiB to the postgres binary.  A recent commit added 57 KiB and bloat never came
up in discussions thereof.  I appreciate your concern over a slippery slope of
inlining proposals, but I also don't wish to see the day when every patch
needs to declare and justify its binary byte count.  Unless, of course, we
discover that elusive metric for evaluating it fairly.

If we'd like to start taking interest in binary bloat, how about having the
buildfarm log capture an "ls -l" on the bin directory?  We'll then have data
to mine if we ever wish to really take action.


All that being said, I'd want to see a 15-30% (depending on how contrived)
improvement to a microbenchmark or a 5% improvement to a generic benchmark
(pgbench, DBT-<N>) before adopting an optimization of this complexity.  Based
on your measurements[2], per-type inlining gives us a 7% microbenchmark
improvement.  I'd therefore excise the per-type inlining.  (For the record,
given a 20% improvement to the same benchmark, I'd vote yea on the submission
and perhaps even suggest int2 and uuid support.)

nm

[1] http://archives.postgresql.org/message-id/CA+TgmoZO1xSz+YiqZ2mRoKMcMqtb+JiR0Lz43CNe6de7--QDAA@mail.gmail.com
[2] http://archives.postgresql.org/message-id/CA+TgmobFz8SqTGTaaRYEryWUZFODGETprbuYgBhsPLR4+Li6TQ@mail.gmail.com


Re: Progress on fast path sorting, btree index creation time

From
Robert Haas
Date:
On Thu, Feb 9, 2012 at 7:24 AM, Noah Misch <noah@leadboat.com> wrote:
> On Tue, Feb 07, 2012 at 09:38:39PM -0500, Robert Haas wrote:
>> Second, there's a concern about binary bloat: duplicating lots of code
>> with different comparators inlined generates, well, a lot of machine
>> code.  Of course, an 0.8% increase in the size of the resulting binary
>> is very unlikely to produce any measurable performance regression on
>> its own, but that's not really the issue.  The issue is rather that we
>> could easily go nuts applying this technique in lots of different
>> places - or even just in one place to lots of different types.  Let's
>> say that we find that even for types with very complex comparators, we
>> can get a 5% speedup on quick-sorting speed using this technique.
>> Should we, then, apply the technique to every type we have?  Perhaps
>> the binary would grow by, I don't know, 10%.  Maybe that's still not
>> enough to start hurting performance (or making packagers complain),
>> but surely it's getting there, and what happens when we discover that
>> similar speedups are possible using the same trick in five other
>> scenarios?  I think it's a bit like going out to dinner and spending
>> $50.  It's nicer than eating at home, and many people can afford to do
>> it occasionally, but if you do it every night, it gets expensive (and
>> possibly fattening).
>>
>> So we need some principled way of deciding how much inlining is
>> reasonable, because I am 100% certain this is not going to be the last
>> time someone discovers that a massive exercise in inlining can yield a
>> nifty performance benefit in some case or another: index builds and
>> COPY have already been mentioned on this thread, and I expect that
>> people will find other cases as well.  I'm not really sure what the
>> "budget" is - i.e. how much binary bloat we can afford to add - or how
>> many cases there are that can benefit, but the first isn't infinite
>> and the second is more than the first.
>
> Having such a metric would resolve this discussion, but formulating it will be
> all but impossible.  We don't know how much time PostgreSQL installations now
> spend sorting or how much additional time they would spend on cache misses,
> TLB misses, page faults, etc.  That data won't show up on this thread.
>
> You posed[1] the straw man of a sin(numeric) optimization requiring a 40GB
> lookup table.  I would not feel bad about rejecting that, because it can live
> quite comfortably as an external module.  Indeed, I think the best thing we
> can do to constrain long-term bloat in the "postgres" executable is to improve
> pluggability.  Let a wider range of features live outside the postmaster
> binary.  For example, if we had the infrastructure to let hash, GIN, GiST and
> SP-GiST index access methods live in their own DSOs (like PL/pgSQL does
> today), I would support doing that.  Extensibility is a hallmark of
> PostgreSQL.  It's a bad day when we reject a minority-interest feature or
> optimization that has no other way to exist.
>
> Better pluggability can't ease the decision process for Peter's patch, because
> its specializations essentially must live in the "postgres" executable to live
> at all.  Nominally, we could let external modules inject sort specializations
> for core types, and Peter could publish an all_fast_sorts extension.  That
> would be useless punting: if we can't make a principled decision on whether
> these accelerated sorts justify 85 KiB of binary, how will a DBA who discovers
> the module make that decision?
>
> This patch has gotten more than its fair share of attention for bloat, and I
> think that's mostly because there's a dial-a-bloat-level knob sticking out and
> begging to be frobbed.  On my system, fastpath_sort_2012_01_19.patch adds 85
> KiB to the postgres binary.  A recent commit added 57 KiB and bloat never came
> up in discussions thereof.  I appreciate your concern over a slippery slope of
> inlining proposals, but I also don't wish to see the day when every patch
> needs to declare and justify its binary byte count.  Unless, of course, we
> discover that elusive metric for evaluating it fairly.
>
> If we'd like to start taking interest in binary bloat, how about having the
> buildfarm log capture an "ls -l" on the bin directory?  We'll then have data
> to mine if we ever wish to really take action.
>
>
> All that being said, I'd want to see a 15-30% (depending on how contrived)
> improvement to a microbenchmark or a 5% improvement to a generic benchmark
> (pgbench, DBT-<N>) before adopting an optimization of this complexity.  Based
> on your measurements[2], per-type inlining gives us a 7% microbenchmark
> improvement.  I'd therefore excise the per-type inlining.  (For the record,
> given a 20% improvement to the same benchmark, I'd vote yea on the submission
> and perhaps even suggest int2 and uuid support.)

That's all well-said and I mostly agree with all of it, including your
suggested percentages in the last paragraph (but does anyone really
sort int2s or uuids?).  What I think is different about inlining as
compared to writing code is that it's possible to generate an amount
of binary bloat that is disproportionate to the amount of code you
actually wrote.  You can blow up your binary more or less on
auto-pilot, just by adding more macro invocations.  Of course, the
overall impact is the same, but when you have to actually write the
code by hand, you're more likely to notice that you're cranking out
hundreds of lines of code of dubious real-world value.

I think the other thing to keep in mind is that, inevitably, people
decide what considerations to worry about with reference to a
particular patch based on the perceived value of the feature.  I
thought about the fact that Tom's parameterized-path patch adds a lot
of new code and to be honest I'm not totally sanguine about it, but on
the other hand it has the potential to make certain kinds of queries
that were seriously slow in previous releases run hundreds or
thousands of times faster, and maybe open the door to further planner
optimizations in the future.  When you start talking about multiples
with lots of zeros on the end, binary bloat fades out of the picture:
you COULD assess it, but we already pretty much know what the outcome
of that discussion will be.  Similarly, when you hear Tom complaining
about the cost of adding a new grammar keyword, that's a tip-off that
he either thinks that no effort whatsoever was made to find a syntax
that will work with the existing keywords, or he doesn't think the
feature is worth very much, because if you look at git log -p
--author=Lane src/include/parser/kwlist.h you'll find he's added them
on various occasions with nary a word spoken.  What we need to be wary
of, in my opinion anyway, is not adding binary bloat or keywords per
se, but rather of adding them for trivial gains.  And I do consider a
6% gain on a highly artificial macrobenchmark to be trivial.  If we
were talking about a 10x speedup, I doubt the word "bloat" would be
anywhere on this thread, and it certainly wouldn't be getting the
prominence that it has here.

I'm also more than slightly concerned that we are losing sight of the
forest for the trees.  I have heard reports that sorting large amounts
of data is many TIMES slower for us than for a certain other database
product.  I frankly wonder if this entire patch is a distraction.
When inlining is the tool of choice for further performance
improvement, it suggests that we're pretty much out of ideas, and that
whatever we get out of inlining - be it 6% or 30% - will be the end.
I don't like that idea very much.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Progress on fast path sorting, btree index creation time

From
Peter Geoghegan
Date:
On 9 February 2012 13:50, Robert Haas <robertmhaas@gmail.com> wrote:
> I'm also more than slightly concerned that we are losing sight of the
> forest for the trees.  I have heard reports that sorting large amounts
> of data is many TIMES slower for us than for a certain other database
> product.  I frankly wonder if this entire patch is a distraction.
> When inlining is the tool of choice for further performance
> improvement, it suggests that we're pretty much out of ideas, and that
> whatever we get out of inlining - be it 6% or 30% - will be the end.
> I don't like that idea very much.

If you're talking about the product I think you're talking about,
well, their contracts forbid any actual benchmarks, so we won't have
any hard figures, but I find it intuitively difficult to believe that
the difference is that large, mostly because I've demonstrated that
the performance of qsort is a pretty good proxy for the performance of
a query like "select * from foo order by bar", and qsort can't be too
far from optimal. Besides, didn't you yourself say "I've been
skeptical of this patch for a number of reasons, the major one of
which is that most workloads spend only a very small amount of time in
doing quicksorts"?

I have a new plan. I think you'll like it:

* drop the type specific specialisations entirely.

* drop qsort_arg (the original, unspecialised version). We now have
exactly the same number of source code copies of qsort as before: 2.

* gut the comparison of the leading sort key only from
comparetup_heap, comparetup_index_btree, etc (I collectively refer to
these functions as tuple-class encapsulating comparators, distinct
from their comparator proper).

* Instantiate two specialisations of qsort_arg: single key and
multiple key (exactly the same number as has already been agreed to,
since we lost the generic version). However, there is one key
difference now: we call one of the tuple class encapsulating functions
through a function pointer if the first comparison gives equality - .
Early indications are that this is almost as much of a win as the
single-key case. So the code effectively does this (but through a
function pointer):

#define MULTI_ADDITIONAL_CODE(COMPAR)                   \
{                                                       \
        return comparetup_heap(aT, bT, state);          \
}

This works because most comparisons won't actually need to look at the
second key, and because the cut-down tuple-class encapsulating
comparator that is used in the last revision of the patch doesn't
actually make any assumption about the particular tuple-class
encapsulating comparator in use.
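Put another way, here is a hedged sketch of that dispatch (hypothetical
names; the real comparators live in tuplesort.c). The leading key is
compared inline, and only on equality do we pay for an indirect call to
resolve the remaining keys:

typedef int (*RestComparator) (const void *a, const void *b, void *state);

static inline int
compare_leading_then_rest(long key_a, long key_b,
                          const void *a, const void *b,
                          RestComparator rest, void *state)
{
    if (key_a != key_b)
        return (key_a < key_b) ? -1 : 1;

    /*
     * Rare path for most data distributions: defer to the tuple-class
     * encapsulating comparator for the non-leading keys.
     */
    return rest(a, b, state);
}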

It may not even be worth-while having a separate copy of the qsort
function for the multiple key case, so the question of binary bloat
becomes entirely irrelevant, as there would be a net gain of zero
copies of qsort_arg.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: Progress on fast path sorting, btree index creation time

From
Robert Haas
Date:
On Thu, Feb 9, 2012 at 9:37 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> On 9 February 2012 13:50, Robert Haas <robertmhaas@gmail.com> wrote:
>> I'm also more than slightly concerned that we are losing sight of the
>> forest for the trees.  I have heard reports that sorting large amounts
>> of data is many TIMES slower for us than for a certain other database
>> product.  I frankly wonder if this entire patch is a distraction.
>> When inlining is the tool of choice for further performance
>> improvement, it suggests that we're pretty much out of ideas, and that
>> whatever we get out of inlining - be it 6% or 30% - will be the end.
>> I don't like that idea very much.
>
> If you're talking about the product I think you're talking about,
> well, their contracts forbid any actual benchmarks, so we won't have
> any hard figures, but I find it intuitively difficult to believe that
> the difference is that large, mostly because I've demonstrated that
> the performance of qsort is a pretty good proxy for the performance of
> a query like "select * from foo order by bar", and qsort can't be too
> far from optimal. Besides, didn't you yourself say "I've been
> skeptical of this patch for a number of reasons, the major one of
> which is that most workloads spend only a very small amount of time in
> doing quicksorts"?

Emphasis on "large" amounts of data.

I've said from the beginning that the small-sort case is less
interesting to me: it's nice to make it faster, but nobody's going to
quit using PostgreSQL because a quicksort takes 8ms instead of 6ms.
It's when we start talking about operations that take minutes or hours
that the differences become material.

> I have a new plan. I think you'll like it:
>
> * drop the type specific specialisations entirely.

Check.

> * drop qsort_arg (the original, unspecialised version). We now have
> exactly the same number of source code copies of qsort as before: 2.

This may be useful elsewhere some day, but I suppose we can always
pull it back out of git if we need it.

> * gut the comparison of the leading sort key only from
> comparetup_heap, comparetup_index_btree, etc (I collectively refer to
> these functions as tuple-class encapsulating comparators, distinct
> from their comparator proper).
>
> * Instantiate two specialisations of qsort_arg: single key and
> multiple key (exactly the same number as has already been agreed to,
> since we lost the generic version). However, there is one key
> difference now: we call one of the tuple class encapsulating functions
> through a function pointer if the first comparison gives equality - .
> Early indications are that this is almost as much of a win as the
> single-key case. So the code effectively does this (but through a
> function pointer):
>
> #define MULTI_ADDITIONAL_CODE(COMPAR)                 \
> {                                                     \
>        return comparetup_heap(aT, bT, state);         \
> }
>
> This works because most comparisons won't actually need to look at the
> second key, and because the cut-down tuple-class encapsulating
> comparator that is used in the last revision of the patch doesn't
> actually make any assumption about the particular tuple-class
> encapsulating comparator in use.
>
> It may not even be worth-while having a separate copy of the qsort
> function for the multiple key case, so the question of binary bloat
> becomes entirely irrelevant, as there would be a net gain of zero
> copies of qsort_arg.

I'm not sure I entirely follow all this, but I'll look at the code
once you have it.  Are you saying that all the comparetup_yadda
functions are redundant to each other in the single-key case?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Progress on fast path sorting, btree index creation time

From
Bruce Momjian
Date:
On Thu, Feb 09, 2012 at 07:24:49AM -0500, Noah Misch wrote:
> This patch has gotten more than its fair share of attention for bloat, and I
> think that's mostly because there's a dial-a-bloat-level knob sticking out and
> begging to be frobbed.

I already emailed Peter privately stating that he shouldn't feel bad
about the many critiques of his patch --- this dial-a-bloat-level knob
is new for us, and we have to get used to the idea.  We are more
discussing the idea of a dial-a-bloat-level knob rather than his patch.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: Progress on fast path sorting, btree index creation time

From
Peter Geoghegan
Date:
On 9 February 2012 14:51, Robert Haas <robertmhaas@gmail.com> wrote:
> I'm not sure I entirely follow all this, but I'll look at the code
> once you have it.  Are you saying that all the comparetup_yadda
> functions are redundant to each other in the single-key case?

Yes, I am. The main reason that the loops exist in those functions
(which is the only way that they substantially differ) is because they
each have to get the other keys through various ways that characterise
the tuple class that they encapsulate (index_getattr(),
heap_getattr(), etc).
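To illustrate (hypothetical names only; the real routines are in
tuplesort.c), each comparetup_* function pairs what is essentially the
same per-key loop with a different attribute getter:

#include <stdbool.h>

typedef long (*GetAttr) (const void *tup, int attno, bool *isnull);

/*
 * Sketch of the shared shape of the comparetup_* loops; NULL handling
 * and datum types are simplified away.
 */
static int
comparetup_common_sketch(const void *a, const void *b,
                         int nkeys, GetAttr getattr)
{
    for (int attno = 1; attno <= nkeys; attno++)
    {
        bool    an, bn;
        long    av = getattr(a, attno, &an);
        long    bv = getattr(b, attno, &bn);

        if (av != bv)
            return (av < bv) ? -1 : 1;
    }
    return 0;
}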

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: Progress on fast path sorting, btree index creation time

From
Bruce Momjian
Date:
On Thu, Feb 09, 2012 at 03:36:23PM +0000, Peter Geoghegan wrote:
> On 9 February 2012 14:51, Robert Haas <robertmhaas@gmail.com> wrote:
> > I'm not sure I entirely follow all this, but I'll look at the code
> > once you have it.  Are you saying that all the comparetup_yadda
> > functions are redundant to each other in the single-key case?
> 
> Yes, I am. The main reason that the loops exist in those functions
> (which is the only way that they substantially differ) is because they
> each have to get the other keys through various ways that characterise
> the tuple class that they encapsulate (index_getattr(),
> heap_getattr(), etc).

Does this help all types for sorting, including strings?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: Progress on fast path sorting, btree index creation time

From
Peter Geoghegan
Date:
On 9 February 2012 17:16, Bruce Momjian <bruce@momjian.us> wrote:
>> Yes, I am. The main reason that the loops exist in those functions
>> (which is the only way that they substantially differ) is because they
>> each have to get the other keys through various ways that characterise
>> the tuple class that they encapsulate (index_getattr(),
>> heap_getattr(), etc).
>
> Does this help all types for sorting, including strings?

Yes, it does, though of course that's not expected to make too much
difference with text when the C locale isn't used, because the
comparisons are inherently much more expensive, and there isn't a
whole lot we can do about that, at least here.
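Concretely (a sketch, not the actual varstr_cmp() code): under the C
locale a text comparison is essentially byte-wise, while under any
other locale it has to consult the collation rules via strcoll(), and
that locale machinery is where the time goes no matter how the
surrounding sort loop is arranged.

#include <stdbool.h>
#include <string.h>

/* Sketch only: the C locale takes the cheap byte-wise path. */
static int
cmp_text_sketch(const char *a, const char *b, bool c_locale)
{
    return c_locale ? strcmp(a, b) : strcoll(a, b);
}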

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: Progress on fast path sorting, btree index creation time

From
Peter Geoghegan
Date:
On 9 February 2012 14:51, Robert Haas <robertmhaas@gmail.com> wrote:
> I'm not sure I entirely follow all this, but I'll look at the code
> once you have it.

I have attached a revision of the patch, with the adjustments already
described. Note that I have not made this support btree tuplesorting
yet, as there is an impedance mismatch that must be resolved,
particularly with the SortSupport stuff, and I wanted to know what you
think of the multiple key specialisation first. Arguably, we could get
away with only a single specialisation - I haven't really thought about
it much.

You say "Well, how often will we sort 10,000 integers?", and I think
that btree index creation is one very common and useful case, so I'd
like to look at that in more detail. I certainly don't see any reason
to not do it too.

This should give you performance for sorting multiple-keys that is
almost as good as the single-key optimisation that you found to be
more compelling. Obviously the need to actually call comparetup_heap
to look at non-leading sortkeys will vary from case to case, and this
is based on your test case, where there are no duplicates and thus no
need to ever do that. That isn't too far from representative, as I
think that in general, a majority of comparisons won't result in
equality.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

Attachment

Re: Progress on fast path sorting, btree index creation time

From
Robert Haas
Date:
On Fri, Feb 10, 2012 at 10:30 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> [ new patch ]

I spent quite a bit of time looking at this today - the patch
specifically, and the issue of making quicksort fast more generally.
It seemed to me that if we're going to have separate copies of the
quicksort code for tuple sorting, we might as well go whole hog and
specialize those copies to the particular needs of tuplesort.c as much
as possible.  Accordingly, I whacked the code around so that it knows
that it is sorting SortTuple objects and need not conditionalize at
runtime on the size of the objects being swapped.  You suggested
upthread that this might be worthwhile, and it seems that it is, so I
think we should do it.

Your patch removes the CHECK_FOR_INTERRUPTS() call from
comparetup_heap, which is no good.  However, since I'd already decided
to specialize the copies of quicksort intended for sorting as much as
possible, it made sense to me to refactor things so that the qsort
routine itself, rather than the comparator, is responsible for calling
CHECK_FOR_INTERRUPTS().  This slightly reduces the number of times we
CHECK_FOR_INTERRUPTS(), but never allows more than a few comparisons
before doing it.

I find that your pg_always_inline macro is equivalent to just plain
inline on my system (MacOS X v10.6.8, gcc 4.2.1).  It seems to need
something like this:

+#elif __GNUC__
+#define pg_always_inline inline __attribute__((always_inline))

...but I'm not very happy about relying on that, because I don't know
that it will work on every gcc version (never mind non-gcc compilers),
and I'm not convinced it's actually improving performance even on this
one.  The documentation seems to indicate that this is intended to
force inlining even when not optimizing, which may have something to
do with the lack of effect: that's not really the point here anyway.
What I did instead is to replace template_qsort_arg.h with a script
called gen_qsort_tuple.pl, which generates a file called qsort_tuple.c
that tuplesort.c then #includes.  This seems more flexible to me than
the macro-based approach.  In particular, it allows me to generate
versions of qsort with different call signatures.  The attached patch
generates two:

static void qsort_tuple(SortTuple *a, size_t n,
                        SortTupleComparator cmp_tuple,
                        Tuplesortstate *state);
static void qsort_ssup(SortTuple *a, size_t n, SortSupport ssup);

The first of these is a drop-in replacement for qsort_arg() - any
tuplesort can use it, not just heap sorts.  But it is faster than
qsort_arg() because of the specializations for the SortTuple data
type.  The second handles the special case where we are sorting by a
single key that has an associated SortSupport object.  In this case we
don't need to carry the overhead of passing around the Tuplesortstate
and dereferencing it, nor do we need the SortTupleComparator: we can
just pass the SortSupport itself.  Maybe there's a way to get this
effect using macros, but I couldn't figure it out.  At any rate, at
least for the single-key case, this approach effectively forces the
comparator to be inlined without requiring pg_always_inline.
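
Roughly speaking, each comparison inside qsort_ssup() reduces to
something of this shape (a sketch - SSUP_CMP is a name I'm making up
here; ApplySortComparator is the existing sortsupport.h inline helper):

/* Sketch: compare two SortTuples on their cached leading datum, with
 * the SortSupport comparator inlined - no Tuplesortstate and no
 * SortTupleComparator function pointer involved. */
#define SSUP_CMP(x, y, ssup) \
    ApplySortComparator((x)->datum1, (x)->isnull1, \
                        (y)->datum1, (y)->isnull1, (ssup))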

With this patch, I get the following results, as compared with your
2012-02-10 version and master, using the same test cases I tested
before.

select * from nodups order by g offset 10001;
tps on master: 289.471274, 289.967984, 289.595958
tps on 2012-02-10 version: 359.150280, 356.284723, 356.888900
tps on attached version: 388.212793, 386.085083, 386.867478

select * from twocol order by a, b offset 10000;
tps on master: 261.676611, 260.440886, 259.529362
tps on 2012-02-10 version: 283.941312, 279.981723, 283.140208
tps on attached version: 283.146463, 278.344827, 280.727798

select * from f8 order by g offset 10000;
tps on master: 228.299924, 222.650355, 227.408506
tps on 2012-02-10 version: 260.289273, 257.181035, 256.377456
tps on attached version: 276.985299, 275.341575, 274.428095

There's some variation (which I can't account for) between the results
on master now and the results on master before - possibly just code
shifting around between cache lines due to unrelated changes, or maybe
some inadvertent change in my test setup.  But it looks to me like
your 2012-02-10 version, without any type-specific optimizations, does
pretty much just as well on multi-key sorting as your previous
version, which had them - or if there is a difference, it's pretty
small.

Overall, I think the numbers for the version I'm attaching here look
pretty good: the single-key performance is clearly better than your
last version, and the multi-key performance is very slightly worse.  I
think that slight worsening is a good trade-off, though, because this
version can use qsort_tuple() for all kinds of tuplesorts, not just
heap tuplesorts.  Still, it seems like we ought to be able to do even
better: the multi-key specialization that you had in your patch can be
coded in this framework, too, and in theory those are independent of
the swapcode improvements.  I tried coding up a multi-key
specialization which I believe to be quite similar to what you did,
but it didn't seem to do much.  I'm attaching it here; maybe there's
some way to improve it (or a different test case where it pays off).

It strikes me that if we wanted to take this further, we could look at
squeezing out ApplySortComparator.  For example, suppose that, upon
discovering that we can do an in-memory quicksort on a single sort
key, we make an initial pass over the data where we check whether it's
sorted and, as we go, swap all the entries with isnull1 = true to the
end of the memtuples array.  We then sort the isnull1 = true entries
with the standard comparator, and the isnull1 = false entries with an
optimized comparator that elides most of ApplySortComparator and
instead just calls the comparison function directly.  We then decide
on whether to emit the isnull1 = true entries first or last based on
NULLS FIRST/LAST, and decide whether to emit the remaining entries in
forward or reverse order based on ASC/DESC.  Or maybe not exactly that
thing, but something like that, so that we sort the null and non-null
entries separately.  The additional complexity in the read-out logic
would probably be more than offset by being able to use a simpler
comparator.
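
A sketch of that initial pass, with a hypothetical helper name (no
attempt at stability here - this is just the shape of the idea):

/*
 * Sketch only: move entries with isnull1 = true to the tail of the
 * array, returning how many non-null entries remain at the front.
 * The two halves can then be sorted with different comparators.
 */
static size_t
partition_nulls(SortTuple *a, size_t n)
{
    size_t      nonnull = 0;
    size_t      i;

    for (i = 0; i < n; i++)
    {
        if (!a[i].isnull1)
        {
            SortTuple   tmp = a[nonnull];

            a[nonnull] = a[i];
            a[i] = tmp;
            nonnull++;
        }
    }
    return nonnull;     /* a[nonnull .. n-1] all have isnull1 set */
}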

What do you think of this version?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: Progress on fast path sorting, btree index creation time

From
Peter Geoghegan
Date:
On 15 February 2012 06:16, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Feb 10, 2012 at 10:30 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
>> [ new patch ]
>
> I spent quite a bit of time looking at this today - the patch
> specifically, and the issue of making quicksort fast more generally.
> It seemed to me that if we're going to have separate copies of the
> quicksort code for tuple sorting, we might as well go whole hog and
> specialize those copies to the particular needs of tuplesort.c as much
> as possible.  Accordingly, I whacked the code around so that it knows
> that it is sorting SortTuple objects and need not conditionalize at
> runtime on the size of the objects being swapped.  You suggested
> upthread that this might be worthwhile, and it seems that it is, so I
> think we should do it.

Cool. I agree that we should do this. It doesn't need to be justified
as a performance optimisation - it makes sense to refactor in this
way. If that makes things faster, then so much the better.

> Your patch removes the CHECK_FOR_INTERRUPTS() call from
> comparetup_heap, which is no good.  However, since I'd already decided
> to specialize the copies of quicksort intended for tuple sorting as much as
> possible, it made sense to me to refactor things so that the qsort
> routine itself, rather than the comparator, is responsible for calling
> CHECK_FOR_INTERRUPTS().  This slightly reduces the number of times we
> CHECK_FOR_INTERRUPTS(), but never allows more than a few comparisons
> before doing it.

Well, the idea of that was to have one and only one
CHECK_FOR_INTERRUPTS() call, in the leading comparator. I should
perhaps have further stressed that that patch was only intended to
establish the idea of having that function-pointer call within an
inlined outer comparator that invokes the leading key's comparator.

> I find that your pg_always_inline macro is equivalent to just plain
> inline on my system (MacOS X v10.6.8, gcc 4.2.1).  It seems to need
> something like this:
>
> +#elif __GNUC__
> +#define pg_always_inline inline __attribute__((always_inline))
>
> ...but I'm not very happy about relying on that, because I don't know
> that it will work on every gcc version (never mind non-gcc compilers),
> and I'm not convinced it's actually improving performance even on this
> one.  The documentation seems to indicate that this is intended to
> force inlining even when not optimizing, which may have something to
> do with the lack of effect: that's not really the point here anyway.

There is a scant reference that suggests this, yes, but it's
contradicted by other, equally scanty sources. The macro was observed
to be helpful in the multi-key case, and the effect was quite apparent
and reproducible. At any rate, its appearance in my most recent patch
was vestigial - I think that there ought to be no need for it, now that
the compiler's cost/benefit analysis probably gets the inlining
decision right.
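
(For the record, a portable definition would presumably have ended up
looking something like this sketch - gcc and MSVC spellings, degrading
to plain inline elsewhere - though, as I say, it should no longer be
needed:)

#if defined(__GNUC__)
#define pg_always_inline inline __attribute__((always_inline))
#elif defined(_MSC_VER)
#define pg_always_inline __forceinline
#else
#define pg_always_inline inline     /* best-effort fallback */
#endif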

> What I did instead is to replace template_qsort_arg.h with a script
> called gen_qsort_tuple.pl, which generates a file called qsort_tuple.c
> that tuplesort.c then #includes.  This seems more flexible to me than
> the macro-based approach.  In particular, it allows me to generate
> versions of qsort with different call signatures.  The attached patch
> generates two:
>
> static void qsort_tuple(SortTuple *a, size_t n,
>                         SortTupleComparator cmp_tuple,
>                         Tuplesortstate *state);
> static void qsort_ssup(SortTuple *a, size_t n, SortSupport ssup);
>
> The first of these is a drop-in replacement for qsort_arg() - any
> tuplesort can use it, not just heap sorts.  But it is faster than
> qsort_arg() because of the specializations for the SortTuple data
> type.  The second handles the special case where we are sorting by a
> single key that has an associated SortSupport object.  In this case we
> don't need to carry the overhead of passing around the Tuplesortstate
> and dereferencing it, nor do we need the SortTupleComparator: we can
> just pass the SortSupport itself.  Maybe there's a way to get this
> effect using macros, but I couldn't figure it out.

I've seen the pattern where generic programming is aped with the
preprocessor, in the style of my patch, in - of all things - wxWidgets,
where it continues to be used as part of pgAdmin, or was when I worked
on wxWidgets 3 support exactly one year ago. One thing that is
particularly bizarre about that code is that clients actually #include
a .cpp file. This is, I'd guess, a legacy of having to target
pre-C++98 compilers without decent template support; MSVC 6, in
particular, was completely inadequate in this area.

I am inclined to agree that given that we already use Perl to generate
source code like this, it seems natural that we should prefer to do
that, if only to avoid paranoia about the inclusion of a dial-a-bloat
knob. I am at a disadvantage here, since I've never written a line of
Perl.

> With this patch, I get the following results, as compared with your
> 2012-02-10 version and master, using the same test cases I tested
> before.
>
> select * from nodups order by g offset 10001;
> tps on master: 289.471274, 289.967984, 289.595958
> tps on 2012-02-10 version: 359.150280, 356.284723, 356.888900
> tps on attached version: 388.212793, 386.085083, 386.867478
>
> select * from twocol order by a, b offset 10000;
> tps on master: 261.676611, 260.440886, 259.529362
> tps on 2012-02-10 version: 283.941312, 279.981723, 283.140208
> tps on attached version: 283.146463, 278.344827, 280.727798
>
> select * from f8 order by g offset 10000;
> tps on master: 228.299924, 222.650355, 227.408506
> tps on 2012-02-10 version: 260.289273, 257.181035, 256.377456
> tps on attached version: 276.985299, 275.341575, 274.428095
>
> There's some variation (which I can't account for) between the results
> on master now and the results on master before - possibly just code
> shifting around between cache lines due to unrelated changes, or maybe
> some inadvertent change in my test setup.  But it looks to me like
> your 2012-02-10 version, without any type-specific optimizations, does
> pretty much just as well on multi-key sorting as your previous
> version, which had them - or if there is a difference, it's pretty
> small.

I'd say that that's a fair assessment. It does just as well in that
case, but with far fewer additional instructions.

> Overall, I think the numbers for the version I'm attaching here look
> pretty good: the single-key performance is clearly better than your
> last version, and the multi-key performance is very slightly worse.  I
> think that slight worsening is a good trade-off, though, because this
> version can use qsort_tuple() for all kinds of tuplesorts, not just
> heap tuplesorts.  Still, it seems like we ought to be able to do even
> better: the multi-key specialization that you had in your patch can be
> coded in this framework, too, and in theory those are independent of
> the swapcode improvements.  I tried coding up a multi-key
> specialization which I believe to be quite similar to what you did,
> but it didn't seem to do much.  I'm attaching it here; maybe there's
> some way to improve it (or a different test case where it pays off).

Hmm. I'll run some tests myself when I get a chance, and attempt to
determine what's going on here. I don't have immediate access to a
good benchmarking server, which would be preferable, and I'd like to
post a revision of pg_stat_statements normalisation today, as it's
been too long. Nice work though.

> It strikes me that if we wanted to take this further, we could look at
> squeezing out ApplySortComparator.  For example, suppose that, upon
> discovering that we can do an in-memory quicksort on a single sort
> key, we make an initial pass over the data where we check whether it's
> sorted and, as we go, swap all the entries with isnull1 = true to the
> end of the memtuples array.  We then sort the isnull1 = true entries
> with the standard comparator, and the isnull1 = false entries with an
> optimized comparator that elides most of ApplySortComparator and
> instead just calls the comparison function directly.  We then decide
> on whether to emit the isnull1 = true entries first or last based on
> NULLS FIRST/LAST, and decide whether to emit the remaining entries in
> forward or reverse order based on ASC/DESC.  Or maybe not exactly that
> thing, but something like that, so that we sort the null and non-null
> entries separately.  The additional complexity in the read-out logic
> would probably be more than offset by being able to use a simpler
> comparator.

While it's really easy to be wrong about these things, I'd venture to
guess that branch prediction makes this a less than compelling win in
practice. That said, if you want to know how good a win it might be,
you need only knock out a quick-and-dirty prototype and see how fast a
sort is - however fast that turns out to be will be pretty close to
your best case, though not quite, since presumably such a prototype
wouldn't worry about the applicability of the optimisation.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: Progress on fast path sorting, btree index creation time

From
Robert Haas
Date:
On Wed, Feb 15, 2012 at 8:29 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> Cool. I agree that we should do this. It doesn't need to be justified
> as a performance optimisation - it makes sense to refactor in this
> way. If that makes things faster, then so much the better.

Well, maybe so, but I think if the performance had been the same I
might not have messed with it.

> I am inclined to agree that given that we already use Perl to generate
> source code like this, it seems natural that we should prefer to do
> that, if only to avoid paranoia about the inclusion of a dial-a-bloat
> knob. I am at a disadvantage here, since I've never written a line of
> Perl.

I think it's still dial-a-bloat, but I feel pretty comfortable about
how we've got that knob adjusted in this version.  It's almost as much
improvement as any previous version, it applies to more cases, and the
code footprint is the least of any version I've measured.

> Hmm. I'll run some tests myself when I get a chance, and attempt to
> determine what's going on here. I don't have immediate access to a
> good benchmarking server, which would be preferable, and I'd like to
> post a revision of pg_stat_statements normalisation today, as it's
> been too long. Nice work though.

Thanks.

>> It strikes me that if we wanted to take this further, we could look at
>> squeezing out ApplySortComparator.  For example, suppose that, upon
>> discovering that we can do an in-memory quicksort on a single sort
>> key, we make an initial pass over the data where we check whether it's
>> sorted and, as we go, swap all the entries with isnull1 = true to the
>> end of the memtuples array.  We then sort the isnull1 = true entries
>> with the standard comparator, and the isnull1 = false entries with an
>> optimized comparator that elides most of ApplySortComparator and
>> instead just calls the comparison function directly.  We then decide
>> on whether to emit the isnull1 = true entries first or last based on
>> NULLS FIRST/LAST, and decide whether to emit the remaining entries in
>> forward or reverse order based on ASC/DESC.  Or maybe not exactly that
>> thing, but something like that, so that we sort the null and non-null
>> entries separately.  The additional complexity in the read-out logic
>> would probably be more than offset by being able to use a simpler
>> comparator.
>
> While it's really easy to be wrong about these things, I'd venture to
> guess that branch prediction makes this a less than compelling win in
> practice. That said, if you want to know how good a win it might be,
> you need only knock out a quick-and-dirty prototype and see how fast a
> sort is - however well this does will be pretty close to your best
> case, but not quite, since presumably such a prototype wouldn't worry
> about the applicability of the optimisation.

You might be right.  But note that the swapcode improvements were
vulnerable to branch prediction too, though there could also be other
effects there.  One effect that should perhaps be considered is the
cost of saving and restoring registers.  Even if the branch itself
doesn't cost much because we know what the result will be, the
operands still have to be loaded into registers, which isn't free in
itself and potentially forces those registers to be saved and
restored across function calls.  Still, I'm inclined not to poke at
this right now: there are a lot of patches that still need to be
looked at, and at the rate we're currently disposing of patches this
CommitFest will still be ongoing come Easter.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Progress on fast path sorting, btree index creation time

From
Peter Geoghegan
Date:
On 15 February 2012 15:27, Robert Haas <robertmhaas@gmail.com> wrote:
>> I am inclined to agree that given that we already use Perl to generate
>> source code like this, it seems natural that we should prefer to do
>> that, if only to avoid paranoia about the inclusion of a dial-a-bloat
>> knob. I am at a disadvantage here, since I've never written a line of
>> Perl.
>
> I think it's still dial-a-bloat, but I feel pretty comfortable about
> how we've got that knob adjusted in this version.  It's almost as much
> improvement as any previous version, it applies to more cases, and the
> code footprint is the least of any version I've measured.

I'm happy that we've accepted the principle that a dial-a-bloat knob
isn't necessarily a bad thing, though the term is kind of pejorative,
in that it implies that the knob necessarily adds bloat to the binary.

I define bloat here as the addition of dead instructions to the
binary, or at least code that doesn't pull its weight. Clearly, that
isn't the case here, and I suspect that we will find that it isn't the
case in other places too.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services