Re: Using quicksort for every external sort run - Mailing list pgsql-hackers

From: Peter Geoghegan
Subject: Re: Using quicksort for every external sort run
Date:
Msg-id: CAM3SWZSqet-Bm5+1p2eZqLYK3Gzwi=tQ0YAf4Q+bPJ2Q728yTw@mail.gmail.com
In response to: Re: Using quicksort for every external sort run  (Greg Stark <stark@mit.edu>)
Responses: Re: Using quicksort for every external sort run  (Peter Geoghegan <pg@heroku.com>)
           Re: Using quicksort for every external sort run  (Greg Stark <stark@mit.edu>)
List: pgsql-hackers
On Tue, Nov 24, 2015 at 5:42 PM, Greg Stark <stark@mit.edu> wrote:
> Actually I kind of agree. What I would like to see is a series of
> numbers for increasing sizes of sorts plotted against the same series
> for the existing algorithm. Specifically with the sort size varying to
> significantly more than the physical memory on the machine. For
> example on a 16GB machine sorting data ranging from 1GB to 128GB.

There already was a test case involving a 1TB/16 billion tuple sort
[1] (well, a 1TB gensort Postgres table [2]). Granted, I don't have a
large number of similar test cases across a variety of scales, but
there are only so many hours in the day. Disappointingly, the results
at that scale were merely good, not great, but there were probably
various flaws in how representative the hardware used was.

> There's a lot more information in a series of numbers than individual
> numbers. We'll be able to see whether all our pontificating about the
> rates of growth of costs of different algorithms or which costs
> dominate at which scales are actually borne out in reality.

You yourself said that 1GB is sufficient to get a single-pass merge
phase for a sort of about 4TB - 8TB, so I think the discussion of the
growth in costs tells us plenty about what can happen at the high end.
My approach might help less overall, but it certainly won't falter.

See the 1TB test case -- output from trace_sort is all there.

> And see
> where the break points are where I/O overtakes memory costs. And it'll
> be clearer where to look for problematic cases where the new algorithm
> might not dominate the old one.

I/O doesn't really overtake memory cost -- if it does, then it should
be worthwhile to throw more sequential I/O bandwidth at the problem,
which is a realistic, economical solution with a mature implementation
(unlike buying more memory bandwidth). I didn't do that with the 1TB
test case.

If you assume, as cost_sort() does, that it takes N log2(N)
comparisons to sort N tuples, then it breaks down like this:

10 items require 33 comparisons, ratio 3.32192809489
100 items require 664 comparisons, ratio 6.64385618977
1,000 items require 9,965 comparisons, ratio 9.96578428466
1,000,000 items require 19,931,568 comparisons, ratio 19.9315685693
1,000,000,000 items require 29,897,352,853 comparisons, ratio 29.897352854
16,000,000,000 items require 542,357,645,663 comparisons, ratio 33.897352854
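
(For reference, a quick standalone sketch -- my own, not cost_sort()
itself -- that reproduces those figures; compile with -lm:)

#include <math.h>
#include <stdio.h>

int
main(void)
{
    /* input sizes from the table above */
    double  sizes[] = {10, 100, 1000, 1e6, 1e9, 16e9};

    for (int i = 0; i < 6; i++)
    {
        double  n = sizes[i];

        /* cost_sort()-style assumption: n * log2(n) comparisons in
         * total, i.e. log2(n) comparisons per item (the "ratio") */
        printf("%.0f items: %.0f comparisons, ratio %.11f\n",
               n, floor(n * log2(n)), log2(n));
    }
    return 0;
}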

The cost of writing out and reading runs should be more or less in
linear proportion to their size, which is a totally different story.
That's the main reason why "quicksort with spillover" is aimed at
relatively small sorts, which we expect to be more common overall.

I think the big issue is that a non-parallel sort is significantly
under-powered when you go to sort 16 billion tuples. It's probably not
very sensible to do so if you have a choice of parallelizing the sort.
There is no plausible way to do replacement selection in parallel,
since you cannot know ahead of time, with any accuracy, how to
partition the input among workers: with presorted inputs, runs can end
up arbitrarily larger than memory. That might be the single best
argument for what I propose to do here.
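
To make that concrete, here is a toy standalone sketch of classic
replacement selection (nothing to do with the actual tuplesort.c code)
with a pretend workMem of only 4 slots. Feed it already-sorted input
and it never starts a second run, no matter how long the input is:

#include <stdio.h>

#define MEM_SLOTS 4     /* pretend workMem: 4 tuples */
#define NINPUT    16    /* presorted, ascending input */

int
main(void)
{
    int     input[NINPUT];
    int     val[MEM_SLOTS];
    int     run[MEM_SLOTS];
    int     next = 0;
    int     current_run = 1;
    int     last_out = 0;

    for (int i = 0; i < NINPUT; i++)
        input[i] = i;               /* the "presorted" case */

    for (int i = 0; i < MEM_SLOTS; i++)
    {
        val[i] = input[next++];     /* fill memory */
        run[i] = 1;
    }

    for (int emitted = 0; emitted < NINPUT; emitted++)
    {
        int     best = -1;

        /* pick the smallest value still eligible for the current run */
        for (int i = 0; i < MEM_SLOTS; i++)
            if (run[i] == current_run && (best < 0 || val[i] < val[best]))
                best = i;

        if (best < 0)
        {
            current_run++;          /* current run exhausted; start another */
            emitted--;              /* retry this iteration against new run */
            continue;
        }

        printf("run %d: %d\n", current_run, val[best]);
        last_out = val[best];

        if (next < NINPUT)
        {
            val[best] = input[next++];
            /* a value smaller than the last one written out must be
             * held back for the next run */
            run[best] = (val[best] >= last_out) ? current_run : current_run + 1;
        }
        else
            run[best] = 0;          /* slot is now empty */
    }

    return 0;
}

The point is just that run boundaries depend entirely on the data, so
there is no way to carve up the input for workers in advance.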

This is what Corey's case showed for the final run with 30GB
maintenance_work_mem:

LOG:  starting quicksort of run 40: CPU 1815.99s/19339.80u sec elapsed
24910.38 sec
LOG:  finished quicksorting run 40: CPU 1820.09s/19565.94u sec elapsed
25140.69 sec
LOG:  finished writing run 40 to tape 39: CPU 1833.76s/19642.11u sec
elapsed 25234.44 sec

(Note that the time taken to copy the tuples comprising the final run
is not displayed or accounted for.)

This is the second-to-last run, run 40, so it uses the full 30GB of
maintenance_work_mem. We spent 00:01:33.75 writing the run, but
00:03:50.31 just sorting it. That's roughly the same ratio that I see
on my laptop with far smaller runs. I think the difference isn't wider
only because the server is quite I/O bound -- but we could fix that by
adding more disks.
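
(If it helps, here is how those two durations fall out of the elapsed
timestamps in the trace_sort output quoted above -- the constants are
just copied from the log:)

#include <stdio.h>

int
main(void)
{
    /* elapsed seconds copied from the trace_sort lines above */
    double  start_sort = 24910.38;  /* starting quicksort of run 40 */
    double  end_sort   = 25140.69;  /* finished quicksorting run 40 */
    double  end_write  = 25234.44;  /* finished writing run 40 to tape 39 */

    printf("sorting: %.2f sec\n", end_sort - start_sort);  /* 230.31 = 00:03:50.31 */
    printf("writing: %.2f sec\n", end_write - end_sort);   /*  93.75 = 00:01:33.75 */
    printf("sort/write ratio: %.2f\n",
           (end_sort - start_sort) / (end_write - end_sort));
    return 0;
}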

[1] http://www.postgresql.org/message-id/CAM3SWZQtdd=Q+EF1xSZaYG1CiOYQJ7sZFcL08GYqChpJtGnKMg@mail.gmail.com
[2] https://github.com/petergeoghegan/gensort
-- 
Peter Geoghegan


