Re: Memory usage during sorting - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Memory usage during sorting
Date
Msg-id CAEYLb_UbhYiZGm4z9gXyOEDGWKO_-3VCoMAdk4dERDEvXk0m8A@mail.gmail.com
Whole thread Raw
In response to Re: Memory usage during sorting  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Memory usage during sorting  (Greg Stark <stark@mit.edu>)
List pgsql-hackers
On 13 April 2012 18:33, Robert Haas <robertmhaas@gmail.com> wrote:
> We do use insertion sort for partitions of size < 7.  I assume you are
> referring to something else.

I stand corrected. My confusion came from the removal of a later
*switch* to insertion sort in
a3f0b3d68f9a5357a3f72b40a45bcc714a9e0649, which also added the
pre-sorted check where you'd expect to see the insertion switch. Of
course, the n < 7 insertion sort thing is right beside the added
check. So another, redundant copy of insertion sort was removed, and
not the one that almost every real-world quicksort implementation has.

> Heap-sorting could also benefit from some optimization in this area.
> Right now, if you heap-sort data that's already in order, it gets
> progressively slower as you crank up work_mem, because the heap gets
> deeper and so extraction and siftup get more expensive, for no real
> benefit.  We could do something like check, every time we use up our
> available memory, whether the heap is already entirely in order.  If
> so, we could dump all but one tuple to the current run and the start
> reading tuples again.  Or maybe dump some percentage of the array,
> though that might require a bit more bookkeeping.  Or maybe there's a
> smarter algorithm that would also cater to mostly-sorted data.

Well, timsort is specifically designed to take advantage of pre-sorted
data. It does appear to have a lot of traction, as wikipedia points
out:

"Timsort has been Python's standard sorting algorithm since version
2.3. It is now also used to sort arrays in Java SE 7,[2] and on the
Android platform.[3] "

Somebody has produced a BSD-licensed C implementation that draws
heavily on the Python/Java one here:

http://code.google.com/p/timsort/source/browse/trunk/timSort.c?spec=svn17&r=17

It looks like it has exactly the same interface as a c stdlib qsort.
So it'd be fairly easy to produce a timsort_arg() based on this, if
anyone cares to.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: how to create a non-inherited CHECK constraint in CREATE TABLE
Next
From: "Kevin Grittner"
Date:
Subject: Re: [COMMITTERS] pgsql: Add new replication mode synchronous_commit = 'write'.