Thread: Timsort performance, quicksort (was: Re: Memory usage during sorting)

Timsort performance, quicksort (was: Re: Memory usage during sorting)

From: Peter Geoghegan
On 17 April 2012 13:19, Greg Stark <stark@mit.edu> wrote:
> All in all I think it's handier to have a stable ORDER BY sort than an
> unstable one though. So I'm not necessarily opposed to it even if it
> means we're stuck using a stable sort indefinitely.

I think it might be useful to discard the stability property if it's
not a real requirement, as doing so gives us additional leeway
to compose descending runs (pre-sorted subsets) that are not in what
is termed "strictly descending" order (that is, they can be a[0] >=
a[1] >= a[2] >= ... and not just a[0] > a[1] > a[2] > ...).
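
To illustrate what I mean, here's a hypothetical sketch of run
detection over plain ints (the real timsort_arg works on SortTuples
through a comparator; all names here are made up):

#include <stddef.h>

/*
 * Hypothetical sketch of timsort run detection.  A stable sort must
 * insist that descending runs be strictly descending, because reversing
 * a run that contains equal elements would reorder them; an unstable
 * sort can also accept equal neighbours within a descending run, and so
 * compose longer runs.
 */
static size_t
count_run(const int *a, size_t lo, size_t hi, int stable)
{
    size_t  i = lo + 1;

    if (i >= hi)
        return 1;

    if (a[i] < a[lo])
    {
        /* descending run; only the continuation test differs by mode */
        i++;
        while (i < hi && (stable ? a[i] < a[i - 1] : a[i] <= a[i - 1]))
            i++;
    }
    else
    {
        /* ascending run: a[i - 1] <= a[i] */
        i++;
        while (i < hi && a[i] >= a[i - 1])
            i++;
    }
    return i - lo;
}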

I'll share some preliminary findings on timsort performance (and,
indeed, quicksort performance). I decided on two initial tests - one
that tested performance of sorting against master for a reasonable
case, and the other for a rather sympathetic case.

The first test was for a pgbench run of a query against the dellstore
database, "select * from products order by actor offset 10001",
pgbench -T 300. Here, actor is a text column. I didn't look at scalar
types, because presumably quicksort is going to do much better there.
After running analyze on the table, the pg_stats.correlation is
-0.00493428. There is at best a very weak physical-to-logical
correlation for the column, as far as the random sampling of analyze
can tell, though there may well still be many individual "runs" of
ordered subsets. I should be able to instrument timsort to determine
how pre-sorted the data already is in a well-principled fashion, but not
today.

master:
tps = 43.985745 (including connections establishing)
tps = 43.986028 (excluding connections establishing)

timsort:
tps = 39.181766 (including connections establishing)
tps = 39.182130 (excluding connections establishing)

Here, quicksort benefits somewhat from my earlier work (though there
is no SortSupport entry in any of the tests performed), as a full
tuplesort specialisation is used. timsort_arg is simply a drop-in
replacement for qsort_arg. It's fairly close, but quicksort does win
out here, which is perhaps not hugely surprising. Timsort does of
course have to maintain state like pending runs in a way that
quicksort does not, and quicksort is expected to take advantage of
memory locality to a greater degree. However, if we measure the
expense of the sorts purely in terms of the number of comparisons, an
altogether different picture emerges:

timsort: 119,943
quicksort: 136,071

I'd like to see what happens when timsort_arg is further specialised
(not that I envisage multiple specialisations of it or anything - just
a single tuplesort one).

I contrived a very sympathetic test-case too. Namely, I created a new
numeric column for the dellstore database's orderlines table, with a
default value of "nextval('some_seq'), resulting in a
pg_stat.correlation of 1. Then, I changed the actual value of the very
last tuple in the table, with the intention of tripping up our
quicksort "check for pre-sorted array in a single bubblesort iteration
first" optimisation.

Now, I'll grant you that that was a very carefully placed banana skin,
but the results were still quite surprising and interesting. With
the query "select * from orderlines order by test offset 60351", a
rather large gap in the performance of quicksort and timsort emerges:

Master:
=====
(first run)
number of transactions actually processed: 1891
tps = 6.301991 (including connections establishing)
tps = 6.302041 (excluding connections establishing)

(second run)

number of transactions actually processed: 1878
tps = 6.259945 (including connections establishing)
tps = 6.260839 (excluding connections establishing)

timsort:
=====
(first run)

number of transactions actually processed: 19997
tps = 66.655289 (including connections establishing)
tps = 66.655820 (excluding connections establishing)

(second run)

number of transactions actually processed: 19880
tps = 66.266234 (including connections establishing)
tps = 66.266810 (excluding connections establishing)

I can reproduce this result consistently - these two sets of figures
were produced hours apart.

Number of comparisons for single execution of this more sympathetic query:

Timsort: 60,351

Quicksort: 2,247,314 (yes, really)

The difference here is very big, and cannot be adequately explained by
the mere wastage of a single "bubble sort iteration to check if it's
presorted".  I have heard a few stories about weird quicksort
edgecases, and you probably have too, but I don't know for sure which
one this might be right now.  My theory is that we're paying a high
price for "going quadratic in the face of many duplicates" protection,
by following Sedgewick's advice and doing a partition swap in response
to the pivot and element being equal (which they pretty much always
are here).  This normally isn't so obvious, because of the "check for
pre-sorted input" thing usually takes care of this instead.

My next step is to try out a benchmark with a more realistic
dataset that still reasonably showcases timsort's strengths.  As Tim
Peters himself put it in a document that describes the algorithm, "in
a database supplied by Kevin Altis, a sort on its highly skewed "on
which stock exchange does this company's stock trade?" field ran over
twice as fast under timsort", so I'll try and find something along
those lines that is relatively easy to recreate, and relatively
realistic.  Certainly, highly skewed data is not at all uncommon.

Just for the record, I don't have any strong opinions on:

1.  What we should be doing with timsort, if anything.  It is one
thing to demonstrate that it's a useful algorithm under certain
artificial conditions, but quite another to argue for its inclusion in
Postgres, or for it being selectively used at points where that is
likely to be a win, based on some criteria or another like known
cardinality, physical/logical correlation or assumed costs of
comparisons for each type.  At the very least, it is an interesting
algorithm, but without integration that makes it actually serve user
needs, that's all it will ever be to us.  Deciding if and when it
should be used is a rather nuanced process, and I'm certainly not
about to declare that we should get rid of quicksort.  It does appear
to be a fairly good fit to some of our requirements though.

2.  Whether we should attempt to patch-up quicksort to prevent this
problem.  I lean towards no, since the cure may well be worse than the
disease.

Thoughts?
--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: Timsort performance, quicksort

From: Dimitri Fontaine
Peter Geoghegan <peter@2ndquadrant.com> writes:
> 1.  What we should be doing with timsort, if anything.  It is one
> thing to demonstrate that it's a useful algorithm under certain
> artificial conditions, but quite another to argue for its inclusion in
> Postgres, or for it being selectively used at points where that is
> likely to be a win, based on some criteria or another like known
> cardinality, physical/logical correlation or assumed costs of
> comparisons for each type.  At the very least, it is an interesting
> algorithm, but without integration that makes it actually serve user
> needs, that's all it will ever be to us.  Deciding if and when it
> should be used is a rather nuanced process, and I'm certainly not
> about to declare that we should get rid of quicksort.  It does appear
> to be a fairly good fit to some of our requirements though.

I kind of understood that timsort would shine at sorting text in a
non-C collation, because of the comparison cost. So a test in some UTF8
collation or other would be interesting, right?

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support


Re: Timsort performance, quicksort

From: Peter Geoghegan
On 19 April 2012 19:24, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
> I kind of understood that timsort would shine at sorting text in a
> non-C collation, because of the comparison cost. So a test in some UTF8
> collation or other would be interesting, right?

That's certainly the theory, yes. In practice, even though timsort
lives up to its promise of significantly reducing the number of
comparisons required in many common situations, my implementation does
not actually perform better than qsort_arg. Even a reduction of over
10% in the number of comparisons does not result in a net performance
gain. It wouldn't surprise me if the implementation used is quite
suboptimal, and it might well be worth profiling and optimising. It
doesn't appear to be the big win that I'd hoped for though. It's
necessary to stretch the assumed cost of a comparison rather a lot
further than the very common case of sorting a single key of non-C
collated text for it to be worth it, and that's just too thin for me
to sink more time into this right now.
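
To give a sense of where that per-comparison cost goes, here is a
rough sketch of what one comparison of collated text involves - a
simplification of what varstr_cmp() does for a non-C collation, with
buffer management left to the caller:

#include <string.h>

/*
 * Rough sketch of a single non-C-collation text comparison.  Text
 * datums are not NUL-terminated, so each comparison first copies both
 * values into NUL-terminated buffers, then calls strcoll() - far more
 * expensive than the memcmp() that the C locale gets away with.
 */
static int
collated_cmp(const char *a, size_t alen, const char *b, size_t blen,
             char *buf1, char *buf2)
{
    int     result;

    memcpy(buf1, a, alen);
    buf1[alen] = '\0';
    memcpy(buf2, b, blen);
    buf2[blen] = '\0';

    result = strcoll(buf1, buf2);

    /*
     * Like varstr_cmp(), break ties with a binary comparison, because
     * some locales consider distinct strings equal.
     */
    if (result == 0)
        result = strcmp(buf1, buf2);
    return result;
}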

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: Timsort performance, quicksort (was: Re: Memory usage during sorting)

From: Robert Haas
On Wed, Apr 18, 2012 at 9:31 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> Thoughts?

Interesting work.  I thought about trying to code up timsort at one
point, but I've been running short of round tuits.

I did some quick tests of quicksort using half a million random
strings.  On my MacBook Pro, if the strings are in random order,
quicksort takes about 12 seconds.  If they are presorted, it takes
about 800 ms.  If they are in sorted order with an empty string
appended onto the end, it takes about 25 seconds.

If I modify gen_qsort_tuple.pl to perform the "presorted" input check
only when n > 40, the time for the
presorted-except-the-last-element-is-small test drops from 25 seconds
to about 21.5 seconds, without apparently harming either of the other
two cases.  If I remove the check altogether, the time further drops
to about 13.5 seconds, the time to sort completely-ordered data rises
to about 10-10.5 seconds, and the time to sort randomly ordered data
still doesn't seem to change much.

Based on that, I'm inclined to propose rejiggering things so that the
presorted-input check runs only at the top level, and not during any
recursive steps.  The theory is that it won't cost much because if the
data is randomly ordered we won't do many comparisons before falling
out, and that seems to be true.  But the only point of doing it in the
recursive steps is to win when the data is partially ordered, and we
actually seem to be losing big-time in that case, perhaps because when
the data is partially ordered, the presort check will frequently run
through a significant part of the array - wasting cycles - but
fall out before it actually gets to the end.
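
In other words, something like this sketch, with stdlib qsort standing
in for the recursive workhorse (the actual change would be to the code
that gen_qsort_tuple.pl generates):

#include <stddef.h>
#include <stdlib.h>

static int
cmp_int(const void *x, const void *y)
{
    int     a = *(const int *) x;
    int     b = *(const int *) y;

    return (a > b) - (a < b);
}

/*
 * Perform the pre-sorted check exactly once, at the top level; no such
 * check is made at any recursive level below.
 */
static void
sort_with_toplevel_check(int *a, size_t n)
{
    size_t  i;

    for (i = 1; i < n; i++)
    {
        if (a[i - 1] > a[i])
            break;              /* not sorted: go ahead and sort */
    }
    if (i == n)
        return;                 /* fully sorted: one cheap linear pass */

    qsort(a, n, sizeof(int), cmp_int);
}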

Of course, even if we did that, it might not be as good as your
timsort numbers, but that doesn't mean we shouldn't do it...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Timsort performance, quicksort (was: Re: Memory usage during sorting)

From: Greg Stark
On Tue, Apr 24, 2012 at 4:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Based on that, I'm inclined to propose rejiggering things so that the
> presorted-input check runs only at the top level, and not during any
> recursive steps.

Just a thought: what about running it only every nth step? Maybe
something like every 4th step.

But actually I'm confused. This seems misguided to me. Quicksort
isn't stable, so even if you have a partially sorted data set, the
recursive partitions are going to be at best partially sorted after a
pivot. I haven't walked through it, but I suspect even your
all-but-one-sorted data set is not finding the data sorted in either
partition on the next iteration.

-- 
greg


Re: Timsort performance, quicksort (was: Re: Memory usage during sorting)

From: Peter Geoghegan
On 24 April 2012 16:17, Robert Haas <robertmhaas@gmail.com> wrote:
> If they are in sorted order with an empty string
> appended onto the end, it takes about 25 seconds.

That's exactly what I'd have expected, but was surprised not to have
found with my own test. Perhaps it was some kind of fluke (i.e. a
re-creatable one - I'm quite confident that my benchmark was not
methodologically flawed, at least in execution).

> Based on that, I'm inclined to propose rejiggering things so that the
> presorted-input check runs only at the top level, and not during any
> recursive steps.  The theory is that it won't cost much because if the
> data is randomly ordered we won't do many comparisons before falling
> out, and that seems to be true.  But the only point of doing it in the
> recursive steps is to win when the data is partially ordered, and we
> actually seem to be losing big-time in that case, perhaps because when
> the data is partially ordered, the presort check will frequently run
> through a significant part of the array - wasting cycles - but
> fall out before it actually gets to the end.

That makes sense to me, but obviously more data is needed here.

> Of course, even if we did that, it might not be as good as your
> timsort numbers, but that doesn't mean we shouldn't do it...

The frustrating thing about my timsort numbers, as I mentioned in
reply to Dimitri (he modified the subject a bit, so that might appear
to be a different thread to you), is that they appear to be almost
consistently winning when you consider the number of comparisons, but
in fact lose when you measure the duration of execution or TPS or
whatever. I expected a certain amount of this, but not enough to
entirely derail the case for replacing quicksort with timsort when
sorting a single key of text, which is obviously the really compelling
case for optimisation here. This situation is only going to be made
"worse" by the work you've done on SortSupport for text, which,
incidentally, I agree is worthwhile.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: Timsort performance, quicksort (was: Re: Memory usage during sorting)

From: Robert Haas
On Tue, Apr 24, 2012 at 12:30 PM, Greg Stark <stark@mit.edu> wrote:
> On Tue, Apr 24, 2012 at 4:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Based on that, I'm inclined to propose rejiggering things so that the
>> presorted-input check runs only at the top level, and not during any
>> recursive steps.
>
> Just a thought: what about running it only every nth step? Maybe
> something like every 4th step.

If there's actually some advantage to that, sure.  But so far, it
looks like less is more.

> But actually I'm confused. This seems misguided to me. Quicksort
> isn't stable, so even if you have a partially sorted data set, the
> recursive partitions are going to be at best partially sorted after a
> pivot.

Exactly.  That's why I think we should do it only at the topmost
level, before we've done any pivoting.  Doing it at any lower level is
apparently a waste of energy and counterproductive.

> I haven't walked through it, but I suspect even your
> all-but-one-sorted data set is not finding the data sorted in either
> partition on the next iteration.

I suspect that, too.  Actually, I'm a bit confused about why it's
picking such terrible pivots.  Our median-of-medians optimization
should be doing better than this, I would think.
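
For what it's worth, our pivot selection follows Bentley & McIlroy -
roughly this, sketched over ints (the real code goes through a
comparator): the middle element for tiny subfiles, a median of three
for mid-sized ones, and a "pseudo-median of nine" for large ones:

#include <stddef.h>

static int
med3(int a, int b, int c)
{
    return a < b ? (b < c ? b : (a < c ? c : a))
                 : (b > c ? b : (a > c ? c : a));
}

/*
 * Bentley & McIlroy pivot selection, sketched over an int array: middle
 * element for n <= 7, median of first/middle/last for n <= 40, and the
 * pseudo-median of nine sampled positions for larger n.
 */
static int
choose_pivot(const int *a, size_t n)
{
    size_t  mid = n / 2;
    size_t  d;

    if (n <= 7)
        return a[mid];
    if (n <= 40)
        return med3(a[0], a[mid], a[n - 1]);

    d = n / 8;
    return med3(med3(a[0], a[d], a[2 * d]),
                med3(a[mid - d], a[mid], a[mid + d]),
                med3(a[n - 1 - 2 * d], a[n - 1 - d], a[n - 1]));
}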

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Timsort performance, quicksort (was: Re: Memory usage during sorting)

From: Robert Haas
On Tue, Apr 24, 2012 at 12:51 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> On 24 April 2012 16:17, Robert Haas <robertmhaas@gmail.com> wrote:
>> If they are in sorted order with an empty string
>> appended onto the end, it takes about 25 seconds.
>
> That's exactly what I'd have expected, but was surprised not to have
> found with my own test. Perhaps it was some kind of fluke (i.e. a
> re-creatable one - I'm quite confident that my benchmark was not
> methodologically flawed, at least in execution).

Oh, I read your results as showing something quite similar.

>> Based on that, I'm inclined to propose rejiggering things so that the
>> presorted-input check runs only at the top level, and not during any
>> recursive steps.  The theory is that it won't cost much because if the
>> data is randomly ordered we won't do many comparisons before falling
>> out, and that seems to be true.  But the only point of doing it in the
>> recursive steps is to win when the data is partially ordered, and we
>> actually seem to be losing big-time in that case, perhaps because when
>> the data is partially ordered, the presort check will frequently run
>> through a significant part of the array - wasting cycles - but
>> fall out before it actually gets to the end.
>
> That makes sense to me, but obviously more data is needed here.

What more data do you think is needed?  I've been suspicious of that
code since the first time I looked at it, and I'm now fairly well
convinced that it's full of suckitude.  Honestly, I'm not sure I could
manage to contrive a case where that code wins if I set out to do so.

>> Of course, even if we did that, it might not be as good as your
>> timsort numbers, but that doesn't mean we shouldn't do it...
>
> The frustrating thing about my timsort numbers, as I mentioned in
> reply to Dimitri (he modified the subject a bit, so that might appear
> to be a different thread to you), is that they appear to be almost
> consistently winning when you consider the number of comparisons, but
> in fact lose when you measure the duration of execution or TPS or
> whatever. I expected a certain amount of this, but not enough to
> entirely derail the case for replacing quicksort with timsort when
> sorting a single key of text, which is obviously the really compelling
> case for optimisation here. This situation is only going to be made
> "worse" by the work you've done on SortSupport for text, ...

That is quite baffling.  Have you profiled it at all?

> ...which,
> incidentally, I agree is worthwhile.

Cool, thanks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Timsort performance, quicksort (was: Re: Memory usage during sorting)

From: Peter Geoghegan
On 24 April 2012 22:01, Robert Haas <robertmhaas@gmail.com> wrote:
>> That's exactly what I'd have expected, but was surprised not to have
>> found with my own test. Perhaps it was some kind of fluke (i.e. a
>> re-creatable one - I'm quite confident that my benchmark was not
>> methodologically flawed, at least in execution).
>
> Oh, I read your results as showing something quite similar.

Sorry, I think my quick reading of your e-mail somehow left me with
the impression that the differences were not so pronounced for your
experiment, but indeed they are. Blame it on the fact that I've been
off work for a week, I suppose.

>> That makes sense to me, but obviously more data is needed here.
>
> What more data do you think is needed?  I've been suspicious of that
> code since the first time I looked at it, and I'm now fairly well
> convinced that it's full of suckitude.  Honestly, I'm not sure I could
> manage to contrive a case where that code wins if I set out to do so.

Yeah, I thought that the rationale for introducing the pre-sorted
check, as it appeared in a commit message, was a little weak. I don't
know that I'd go as far as to say that it was full of suckitude. The
worst thing about that code to my mind is that despite being highly
performance critical, it has exactly no comments beyond a brief
description, and the names of variables are arcanely curt.

>>> Of course, even if we did that, it might not be as good as your
>>> timsort numbers, but that doesn't mean we shouldn't do it...
>>
>> The frustrating thing about my timsort numbers, as I mentioned in
>> reply to Dimitri (he modified the subject a bit, so that might appear
>> to be a different thread to you), is that they appear to be almost
>> consistently winning when you consider the number of comparisons, but
>> in fact lose when you measure the duration of execution or TPS or
>> whatever. I expected a certain amount of this, but not enough to
>> entirely derail the case for replacing quicksort with timsort when
>> sorting a single key of text, which is obviously the really compelling
>> case for optimisation here. This situation is only going to be made
>> "worse" by the work you've done on SortSupport for text, ...
>
> That is quite baffling.  Have you profiled it at all?

Yes, I have profiled it, but didn't get much further than actually
producing the figures. gprof output for a backend that had a few of
these queries executed against it is attached, FWIW. I'm not sure that
the information is really very actionable, and I'm aware that oprofile
is preferred here (besides the known deficiencies of gprof, oprofile's
annotated source feature could be particularly useful here), which is
something I might get around to. After profiling, but just before I
decided that this might not be the best way to spend my time off, I got
as far as doing this:

https://github.com/Peter2ndQuadrant/postgres/commit/cbbbd7b1d4ef82773eb272ee35a483cbf6678756

Experimentally, this appeared to make quite a big difference (at least
for the limited cases that I tried), but not so much as to alter the
practical implications of the outcome. It's not as if I've gone to any
great lengths to try and optimise the code yet though. CPython puts
the value of MIN_MERGE at 64, whereas for Java it's 32. A comment
within the Java implementation for this value reads:

  * This is the minimum sized sequence that will be merged.  Shorter
  * sequences will be lengthened by calling binarySort.  If the entire
  * array is less than this length, no merges will be performed.
  *
  * This constant should be a power of two.  It was 64 in Tim Peter's C
  * implementation, but 32 was empirically determined to work better in
  * this implementation.  In the unlikely event that you set this constant
  * to be a number that's not a power of two, you'll need to change the
...

 * See listsort.txt for a discussion
 * of the minimum stack length required as a function of the length
 * of the array being sorted and the minimum merge sequence length.

I did not determine why that is, and I have no theory as to why the
even lower value of 8 (and, consequently, a greater tendency towards
merging rather than performing insertion sorts) was a win for me.
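
For anyone following along, MIN_MERGE feeds the minimum run length as
described in listsort.txt - roughly this sketch, with 8 standing in
for the value I experimented with:

#include <stddef.h>

#define MIN_MERGE 8             /* 64 in CPython, 32 in the Java port;
                                 * 8 is merely what won in my tests */

/*
 * Compute the minimum run length: shift n right until it is below
 * MIN_MERGE, OR-ing 1 into the result if any set bit was shifted out.
 * Natural runs shorter than this get extended by binary insertion sort
 * before merging, so a smaller MIN_MERGE means shorter insertion sorts
 * and a greater tendency towards merging.
 */
static size_t
compute_minrun(size_t n)
{
    size_t  r = 0;

    while (n >= MIN_MERGE)
    {
        r |= n & 1;
        n >>= 1;
    }
    return n + r;
}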

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: Timsort performance, quicksort (was: Re: Memory usage during sorting)

From: Robert Haas
On Tue, Apr 24, 2012 at 7:16 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
>>> That makes sense to me, but obviously more data is needed here.
>>
>> What more data do you think is needed?  I've been suspicious of that
>> code since the first time I looked at it, and I'm now fairly well
>> convinced that it's full of suckitude.  Honestly, I'm not sure I could
>> manage to contrive a case where that code wins if I set out to do so.
>
> Yeah, I thought that the rationale for introducing the pre-sorted
> check, as it appeared in a commit message, was a little weak. I don't
> know that I'd go as far as to say that it was full of suckitude. The
> worst thing about that code to my mind is that despite being highly
> performance critical, it has exactly no comments beyond a brief
> description, and the names of variables are arcanely curt.

Well, what I don't like about it is that it doesn't work.  Having a
special case for pre-sorted input makes sense to me, but doing that
check recursively at every level is pointless unless it helps with
almost-sorted input, and doubling the runtime doesn't meet my
definition of helping.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company