Home > mailing lists

Re: Using quicksort for every external sort run - Mailing list pgsql-hackers

From	Greg Stark
Subject	Re: Using quicksort for every external sort run
Date	August 20, 2015 13:05:52
Msg-id	CAM-w4HOHG4iGQ_xCH3xFX06qrX+mKFqvptMckS8cp4wcG+annw@mail.gmail.com Whole thread Raw
In response to	Using quicksort for every external sort run (Peter Geoghegan <pg@heroku.com>)
Responses	Re: Using quicksort for every external sort run Re: Using quicksort for every external sort run
List	pgsql-hackers

Tree view

On Thu, Aug 20, 2015 at 3:24 AM, Peter Geoghegan <pg@heroku.com> wrote:
> I believe, in general, that we should consider a multi-pass sort to be
> a kind of inherently suspect thing these days, in the same way that
> checkpoints occurring 5 seconds apart are: not actually abnormal, but
> something that we should regard suspiciously. Can you really not
> afford enough work_mem to only do one pass? Does it really make sense
> to add far more I/O and CPU costs to avoid that other tiny memory
> capacity cost?

I think this is the crux of the argument. And I think you're
basically, but not entirely, right.

The key metric there is not how cheap memory has gotten but rather
what the ratio is between the system's memory and disk storage. The
use case I think you're leaving out is the classic "data warehouse"
with huge disk arrays attached to a single host running massive
queries for hours. In that case reducing run size will reduce I/O
requirements directly and halving the amount of I/O sort takes will
halve the time it takes regardless of cpu efficiency. And I have a
suspicion typical data distributions get much better than a 2x
speedup.

But I think you're basically right that this is the wrong use case to
worry about for most users. Even those users that do have large batch
queries are probably not processing so much that they should be doing
multiple passes. The ones that do are are probably more interested in
parallel query, federated databases, column stores, and so on rather
than worrying about just how many hours it takes to sort their
multiple terabytes on a single processor.

I am quite suspicious of quicksort though. It has O(n^2) worst case
and I think it's only a matter of time before people start worrying
about DOS attacks from users able to influence the data ordering. It's
also not very suitable for GPU processing. Quicksort gets most of its
advantage from cache efficiency, it isn't a super efficient algorithm
otherwise, are there not other cache efficient algorithms to consider?

Alternately, has anyone tested whether Timsort would work well?

-- 
greg

pgsql-hackers by date:

From: Simon Riggs
Date: 20 August 2015, 12:39:41
Subject: Re: Extension upgrade and GUCs

From: David Fetter
Date: 20 August 2015, 13:19:09
Subject: Re: Declarative partitioning

Re: Using quicksort for every external sort run - Mailing list pgsql-hackers

Previous

Next