Re: Timsort performance, quicksort (was: Re: Memory usage during sorting) - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Timsort performance, quicksort (was: Re: Memory usage during sorting)
Date
Msg-id CA+TgmobKFcb6ajVEN8eSnBa78sB3FSwuEWTHXd2x9JC5DOh5OA@mail.gmail.com
Whole thread Raw
In response to Timsort performance, quicksort (was: Re: Memory usage during sorting)  (Peter Geoghegan <peter@2ndquadrant.com>)
Responses Re: Timsort performance, quicksort (was: Re: Memory usage during sorting)
Re: Timsort performance, quicksort (was: Re: Memory usage during sorting)
List pgsql-hackers
On Wed, Apr 18, 2012 at 9:31 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> Thoughts?

Interesting work.  I thought about trying to code up timsort at one
point, but I've been running short of round tuits.

I did some quick tests of quicksort using half a million random
strings.  On my MacBook Pro, if the strings are in random order,
quicksort takes about 12 seconds.  If they are presorted, it takes
about 800 ms.  If they are in sorted order with an empty string
appended onto the end, it takes about 25 seconds.

If I modify gen_qsort_tuple.pl to perform the "presorted" input check
only when n > 40, the for the
presorted-except-the-last-element-is-small test drops from 25 seconds
to about 21.5 seconds, without apparently harming either of the other
two cases.  If I remove the check altogether, the time further drops
to about 13.5 seconds, the time to sort completely-ordered data rises
to about 10-10.5 seconds, and the time to sort randomly ordered data
still doesn't seem to change much.

Based on that, I'm inclined to propose rejiggering things so that the
presorted-input check runs only at the top level, and not during any
recursive steps.  The theory is that it won't cost much because if the
data is randomly ordered we won't do many comparisons before falling
out, and that seems to be true.  But the only point of doing it in the
recursive steps is to win when the data is partially ordered, and we
actually seem to be losing big-time in that case, perhaps because when
the data is partially ordered, the presort check will frequently to
run through a significant part of the array - wasting cycles - but
fall out before it actually gets to the end.

Of course, even if we did that, it might not be as good as your
timsort numbers, but that doesn't mean we shouldn't do it...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Usage of planner_ctx
Next
From: Robert Haas
Date:
Subject: Re: New sync commit mode remote_write