Re: Using quicksort for every external sort run - Mailing list pgsql-hackers

From Jeff Janes
Subject Re: Using quicksort for every external sort run
Date
Msg-id CAMkU=1ym2N-v+3=aMm9m-h1A941gQV7CgM9HJuwhRFu-8tHFaw@mail.gmail.com
Whole thread Raw
In response to Using quicksort for every external sort run  (Peter Geoghegan <pg@heroku.com>)
Responses Re: Using quicksort for every external sort run  (Peter Geoghegan <pg@heroku.com>)
List pgsql-hackers
On Wed, Aug 19, 2015 at 7:24 PM, Peter Geoghegan <pg@heroku.com> wrote:

Hi Peter,

Your most recent versions of this patch series (not the ones on the
email I am replying to) give a compiler warning:

tuplesort.c: In function 'mergeruns':
tuplesort.c:2741: warning: unused variable 'memNowUsed'


> Multi-pass sorts
> ---------------------
>
> I believe, in general, that we should consider a multi-pass sort to be
> a kind of inherently suspect thing these days, in the same way that
> checkpoints occurring 5 seconds apart are: not actually abnormal, but
> something that we should regard suspiciously. Can you really not
> afford enough work_mem to only do one pass?

I don't think it is really about the cost of RAM.  What people can't
afford is spending all of their time personally supervising all the
sorts on the system. It is pretty easy for a transient excursion in
workload to make a server swap itself to death and fall over. Not just
the PostgreSQL server, but the entire OS. Since we can't let that
happen, we have to be defensive about work_mem. Yes, we have far more
RAM than we used to. We also have far more things demanding access to
it at the same time.

I agree we don't want to optimize for low memory, but I don't think we
should throw it under the bus, either.  Right now we are effectively
saying the CPU-cache problems with the heap start exceeding the larger
run size benefits at 64kb (the smallest allowed setting for work_mem).
While any number we pick is going to be a guess that won't apply to
all hardware, surely we can come up with a guess better than 64kb.
Like, 8 MB, say.  If available memory for the sort is 8MB or smaller
and the predicted size anticipates a multipass merge, then we can use
the heap method rather than the quicksort method.  Would a rule like
that complicate things much?

It doesn't matter to me personally at the moment, because the smallest
work_mem I run on a production system is 24MB.  But if for some reason
I had to increase max_connections, or had to worry about plans with
many more possible concurrent work_mem allocations (like some
partitioning), then I might not need to rethink that setting downward.

>
> In theory, the answer could be "yes", but it seems highly unlikely.
> Not only is very little memory required to avoid a multi-pass merge
> step, but as described above the amount required grows very slowly
> relative to linear growth in input. I propose to add a
> checkpoint_warning style warning (with a checkpoint_warning style GUC
> to control it).

I'm skeptical about a warning for this.  I think it is rather unlike
checkpointing, because checkpointing is done in a background process,
which greatly limits its visibility, while sorting is a foreground
thing.  I know if my sorts are slow, without having to go look in the
log file.  If we do have the warning, shouldn't it use a log-level
that gets sent to the front end where the person running the sort can
see it and locally change work_mem?  And if we have a GUC, I think it
should be a dial, not a binary.  If I have a sort that takes a 2-way
merge and then a final 29-way merge, I don't think that that is worth
reporting.  So maybe, if the maximum number of runs on a tape exceeds
2 (rather than exceeds 1, which is the current behavior with the
patch) would be the setting I would want to use, if I were to use it
at all.

...

> This patch continues to have tuplesort determine run size based on the
> availability of work_mem only. It does not entirely fix the problem of
> having work_mem sizing impact performance in counter-intuitive ways.
> In other words, smaller work_mem sizes can still be faster. It does
> make that general situation much better, though, because quicksort is
> a cache oblivious algorithm. Smaller work_mem sizes are sometimes a
> bit faster, but never dramatically faster.

Yes, that is what I found as well.   I think the main reason it is
even that small bit slower at large memory is because writing and
sorting are not finely interleaved, like they are with heap selection.
Once you sit down to qsort 3GB of data, you are not going to write any
more tuples until that qsort is entirely done.  I didn't do any
testing beyond 3GB of maintenance_work_mem, but I imagine this could
get more important if people used dozens or hundreds of GB.

One idea would be to stop and write out a just-sorted partition
whenever that partition is contiguous to the already-written portion.
If the qsort is tweaked to recurse preferentially into the left
partition first, this would result in tuples being written out at a
pretty study pace.  If the qsort was unbalanced and the left partition
was always the larger of the two, then that approach would have to be
abandoned at some point.  But I think there are already defenses
against that, and at worst you would give up and revert to the
sort-them-all then write-them-all behavior.



Overall this is very nice.  Doing some real world index builds of
short text (~20 bytes ascii) identifiers, I could easily get speed ups
of 40% with your patch if I followed the philosophy of "give it as
much maintenance_work_mem as I can afford".  If I fine-tuned the
maintenance_work_mem so that it was optimal for each sort method, then
the speed up quite a bit less, only 22%.  But 22% is still very
worthwhile, and who wants to spend their time fine-tuning the memory
use for every index build?

Cheers,

Jeff



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: [DESIGN] ParallelAppend
Next
From: Robert Haas
Date:
Subject: Re: [DESIGN] ParallelAppend