Re: Memory usage during sorting - Mailing list pgsql-hackers

From Greg Stark
Subject Re: Memory usage during sorting
Date
Msg-id CAM-w4HPRLqNoDTnEvAQjx8OxcR5e6QAjr76egG7dK28+p6j5vA@mail.gmail.com
In response to Re: Memory usage during sorting  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Memory usage during sorting  (Jeff Janes <jeff.janes@gmail.com>)
List pgsql-hackers
On Wed, Mar 7, 2012 at 7:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> But it would mean we have about 1.7x  more runs that need to be merged
>> (for initially random data).  Considering the minimum merge order is
>> 6, that increase in runs is likely not to lead to an additional level
>> of merging, in which case the extra speed of building the runs would
>> definitely win.  But if it does cause an additional level of merge, it
>> could end up being a loss.
>
> That's true, but the real hit to the run size should be quite a bit
> less than 1.7x, because we'd also be using memory more efficiently,
> and therefore more tuples should fit.

I'm not sure I believe the 1.7x.  Keep in mind that even after
starting a new tape we can continue to read in new tuples that belong
on the first tape. So even if some tuples are more than N positions
out of place (where N is the size of your heap), as long as there
aren't very many of them you can keep writing out the first tape for
quite a while.

I suspect analyzing truly random inputs is also a bit like observing
that no compression algorithm can win on average. Partly sorted data
is quite common, and the tapesort algorithm can handle a lot of those
cases in a single merge where quicksort-and-merge would generate a
large number of runs to merge.

All that said, quicksort-and-merge would never do more i/o in cases
where the total number of tuples to be sorted is significantly less
than N^2, since that is guaranteed to be processable with a single
merge pass. (Where "significantly less" has something to do with how
many tuples you have to read in one i/o to be efficient.) That
probably covers a lot of cases, and Postgres already has the stats to
predict when it would happen, more or less.
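Back-of-the-envelope version of that bound (my notation, not anything in the code): if memory holds N tuples, quicksorting produces runs of N tuples each, and one pass can merge roughly N/b runs when each input tape needs a read buffer of b tuples, so a single pass suffices whenever the total T is comfortably under N^2/b:

```python
import math

def merge_passes(total_tuples, mem_tuples, buf_tuples):
    """Estimate merge passes for quicksorted runs of mem_tuples each.

    Illustrative arithmetic only: fan-in is limited to
    mem_tuples // buf_tuples because each input run needs its own
    read buffer of buf_tuples during the merge.
    """
    runs = math.ceil(total_tuples / mem_tuples)
    fan_in = max(2, mem_tuples // buf_tuples)
    passes = 0
    while runs > 1:
        runs = math.ceil(runs / fan_in)
        passes += 1
    return passes
```

For example, with memory for a million tuples and buffers of a thousand, a hundred million tuples (well under N^2/b = 10^9) merge in one pass, while two billion need two.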

Fwiw, when I was doing the top-k sorting I did a bunch of experiments
and came up with a rule of thumb that our quicksort was about twice as
fast as our heapsort. I'm not sure whether that's a big deal or not in
this case. Keep in mind that the only win would be reducing the cpu
time on a sort where every tuple is being written to disk and read
back. For most users that won't make the sort noticeably faster, just
reduce cpu time consumption.
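You can reproduce the flavor of that rule of thumb in a few lines of Python (this exercises CPython's Timsort and heapq, not Postgres's qsort/tuplesort, so the exact ratio will differ, but the heap-based sort reliably loses):

```python
import heapq
import random
import time

def heapsort(seq):
    """Full sort via a binary heap, analogous to heap-based run formation."""
    heap = list(seq)
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(heap))]

random.seed(1)
data = [random.random() for _ in range(200_000)]

t0 = time.perf_counter()
qs = sorted(data)       # library comparison sort
t1 = time.perf_counter()
hs = heapsort(data)     # heap-based sort on the same input
t2 = time.perf_counter()

print(f"sorted(): {t1 - t0:.3f}s  heapsort: {t2 - t1:.3f}s")
```

Both produce identical output, so any difference is pure CPU overhead of the heap's per-element bookkeeping, which is the cost being traded against run length in this thread.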

--
greg

