Re: Using quicksort for every external sort run - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Using quicksort for every external sort run
Date
Msg-id CAM3SWZT+KegfDfocD2jCGF_ngMxXFepc=ps3HCBjz5HC=Rmz+w@mail.gmail.com
Whole thread Raw
In response to Re: Using quicksort for every external sort run  (Greg Stark <stark@mit.edu>)
List pgsql-hackers
On Thu, Aug 20, 2015 at 5:02 PM, Greg Stark <stark@mit.edu> wrote:
> I haven't thought through the exponential
> growth carefully enough to tell if doubling the run size should
> decrease the number of passes linearly or by a constant number.

It seems that with 5 times the data that previously required ~30MB to
avoid a multi-pass sort (where ~2300MB is required for an internal
sort -- the benchmark query), it took ~60MB to avoid a multi-pass
sort. I guess I just didn't exactly determine either threshold due to
that taking too long, and that as predicted, every time the input size
quadruples, the required amount of work_mem to avoid multiple passes
only doubles. That will need to be verified more vigorously, but it
looks that way.

> But you're right that seems to be less and less a realistic scenario.
> Times when users are really processing data sets that large nowadays
> they'll just throw it into Hadoop or Biigquery or whatever to get the
> parallelism of many cpus. Or maybe Citus and the like.

I'm not sure that even that's generally true, simply because sorting a
huge amount of data is very expensive -- it's not really a "big data"
thing, so to speak. Look at recent results on this site:

http://sortbenchmark.org

Last year's winning "Gray" entrant, TritonSort, uses a huge parallel
cluster of 186 machines, but only sorts 100TB. That's just over 500GB
per node. Each node is a 32 core Intel Xeon EC2 instance with 244GB
memory, and lots of SSDs. It seems like the point of the 100TB minimum
rule in the "Gray" contest category is that that's practically
impossible to fit entirely in memory (to avoid merging).

Eventually, linearithmic growth becomes extremely painful, not matter
how much processing power you have. It takes a while, though.

-- 
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Rajeev rastogi
Date:
Subject: Re: Autonomous Transaction is back
Next
From: Tom Lane
Date:
Subject: Re: 9.5 release notes