Re: Using quicksort for every external sort run - Mailing list pgsql-hackers

From Greg Stark
Subject Re: Using quicksort for every external sort run
Date
Msg-id CAM-w4HOHG4iGQ_xCH3xFX06qrX+mKFqvptMckS8cp4wcG+annw@mail.gmail.com
Whole thread Raw
In response to Using quicksort for every external sort run  (Peter Geoghegan <pg@heroku.com>)
Responses Re: Using quicksort for every external sort run  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Using quicksort for every external sort run  (Peter Geoghegan <pg@heroku.com>)
List pgsql-hackers
On Thu, Aug 20, 2015 at 3:24 AM, Peter Geoghegan <pg@heroku.com> wrote:
> I believe, in general, that we should consider a multi-pass sort to be
> a kind of inherently suspect thing these days, in the same way that
> checkpoints occurring 5 seconds apart are: not actually abnormal, but
> something that we should regard suspiciously. Can you really not
> afford enough work_mem to only do one pass? Does it really make sense
> to add far more I/O and CPU costs to avoid that other tiny memory
> capacity cost?

I think this is the crux of the argument. And I think you're
basically, but not entirely, right.

The key metric there is not how cheap memory has gotten but rather
what the ratio is between the system's memory and disk storage. The
use case I think you're leaving out is the classic "data warehouse"
with huge disk arrays attached to a single host running massive
queries for hours. In that case reducing run size will reduce I/O
requirements directly and halving the amount of I/O sort takes will
halve the time it takes regardless of cpu efficiency. And I have a
suspicion typical data distributions get much better than a 2x
speedup.

But I think you're basically right that this is the wrong use case to
worry about for most users. Even those users that do have large batch
queries are probably not processing so much that they should be doing
multiple passes. The ones that do are are probably more interested in
parallel query, federated databases, column stores, and so on rather
than worrying about just how many hours it takes to sort their
multiple terabytes on a single processor.

I am quite suspicious of quicksort though. It has O(n^2) worst case
and I think it's only a matter of time before people start worrying
about DOS attacks from users able to influence the data ordering. It's
also not very suitable for GPU processing. Quicksort gets most of its
advantage from cache efficiency, it isn't a super efficient algorithm
otherwise, are there not other cache efficient algorithms to consider?

Alternately, has anyone tested whether Timsort would work well?

-- 
greg



pgsql-hackers by date:

Previous
From: Simon Riggs
Date:
Subject: Re: Extension upgrade and GUCs
Next
From: David Fetter
Date:
Subject: Re: Declarative partitioning