Re: Using quicksort for every external sort run - Mailing list pgsql-hackers

From: Greg Stark
Subject: Re: Using quicksort for every external sort run
Date:
Msg-id: CAM-w4HNSDGmvY=QVz7gpoqe3b3Gy_JpVSeW8pPY2Ys1ty_c_Wg@mail.gmail.com
In response to: Re: Using quicksort for every external sort run (Peter Geoghegan <pg@heroku.com>)
Responses: Re: Using quicksort for every external sort run (Peter Geoghegan <pg@heroku.com>)
List: pgsql-hackers
On Thu, Aug 20, 2015 at 11:16 PM, Peter Geoghegan <pg@heroku.com> wrote:
> It could reduce seek time, which might be the dominant cost (but not
> I/O as such).

No, I didn't quite follow the argument to completion. Increasing the
run size is a win if it reduces the number of passes. In the
single-pass case it has to read all the data once, write it all out
to tapes, then read it all back in again, so 3x the data. If it's
still not sorted it needs to write it all back out yet again and read
it all back in again, so 5x the data. If the tapes are larger it can
avoid that 66% increase in total I/O. On large data sets it can need
3, 4, or maybe more passes through the data, and saving one pass
would be a smaller incremental difference. I haven't thought through
the exponential growth carefully enough to tell whether doubling the
run size should decrease the number of passes linearly or by a
constant number.
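
As a rough back-of-the-envelope model (my own sketch, not tuplesort's
actual accounting -- the real code uses a polyphase merge): with a
data set of size N, runs of size R, and a merge fan-in of F, you get
about N/R runs and ceil(log_F(N/R)) merge passes, and each merge pass
beyond the first adds another full write plus read. So doubling R
only shaves log_F(2) off the pass count, and it saves a whole pass
only when it happens to push you across a fan-in boundary. The sizes
and fan-in below are made-up numbers, purely for illustration:

    import math

    def io_multiple(data_size, run_size, fanin):
        """Total reads+writes of the data, as a multiple of its size.
        Run formation reads and writes everything once; each merge
        pass reads everything once, and every pass except the final
        one also writes it back out."""
        nruns = math.ceil(data_size / run_size)
        merge_passes = max(1, math.ceil(math.log(nruns, fanin)))
        return 1 + 2 * merge_passes  # 3x for one merge pass, 5x for two

    # Doubling the run size halves the run count but only sometimes
    # drops a whole merge pass (hypothetical sizes, fan-in of 300):
    for run_size in (100, 200, 400):
        print(run_size, io_multiple(100_000, run_size, 300))

That matches the figures above: one merge pass costs 3x the data,
and each additional pass another 2x.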

But you're right, that seems to be a less and less realistic
scenario. These days, when users are really processing data sets that
large, they'll just throw them into Hadoop or BigQuery or whatever to
get the parallelism of many CPUs. Or maybe Citus and the like.

The main case where I expect people actually run into this is in
building indexes, especially for larger data types (which, come to
think of it, might be exactly where the comparison is expensive
enough that quicksort's cache efficiency isn't helpful).

But to do fair tests I would suggest you configure work_mem smaller
(since running tests on multi-terabyte data sets is a pain) and sort
some slower data types, on data sets that don't fit in memory. Maybe
arrays of text or json?
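
Something along these lines would do; the Python client is just for
illustration (plain psql works equally well), and the database,
table, and column names are made up:

    import psycopg2  # assumed client driver, nothing special about it

    conn = psycopg2.connect("dbname=test")  # hypothetical database
    cur = conn.cursor()

    # Shrink work_mem so even a modest table spills to an external
    # sort, instead of having to build a multi-terabyte data set.
    cur.execute("SET work_mem = '1MB'")

    # Sort a text[] column -- comparisons are slow relative to
    # integers, so comparison cost shows up in the timings.
    # (sort_test and its "arr" column are made-up names.)
    cur.execute("EXPLAIN (ANALYZE, BUFFERS) "
                "SELECT * FROM sort_test ORDER BY arr")
    for (line,) in cur.fetchall():
        print(line)

With work_mem that low, the Sort node in the EXPLAIN output should
report an external merge on disk rather than an in-memory quicksort.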

-- 
greg


