Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort" - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"
Date
Msg-id CAM3SWZQt5erOWaAvC_14o-QeeYvYw_HB=okxE=-6L9twb9+sAw@mail.gmail.com
Whole thread Raw
In response to Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"
List pgsql-hackers
On Fri, Jul 31, 2015 at 12:59 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Oh, ok, I was confused on how the heap works. You could still abstract this
> as "in-memory tails" of the tapes, but it's more complicated than I thought
> at first:
>
> When it's time to drain the heap, in performsort, divide the array into two
> arrays, based on the run number of each tuple, and then quicksort the arrays
> separately. The first array becomes the in-memory tail of the current tape,
> and the second array becomes the in-memory tail of the next tape.
>
> You wouldn't want to actually allocate two arrays and copy SortTuples
> around, but keep using the single large array, just logically divided into
> two. So the bookkeeping isn't trivial, but seems doable.

Since you're talking about the case where we must drain all tuples
within tuplesort_performsort(), I think you're talking about a
distinct idea here (since surely you're not suggesting giving up on my
idea of avoiding dumping most tuples, which is a key strength of the
patch). That's fine, but I just want to be clear on that. Importantly,
my optimization does not care about the fact that the array may have
two runs in it, because it quicksorts the array and forgets about it
being a heap. It only cares whether or not more than one run made it
out to tape, which makes it more widely applicable than it would
otherwise be. Also, the fact that much I/O can be avoided is clearly
something that can only happen when work_mem is at least ~50% of a
work_mem setting that would have resulted in an (entirely) internal
sort.

You're talking about a new thing here, that happens when it is
necessary to dump everything and do a conventional merging of on-tape
runs. IOW, we cannot fit a significant fraction of overall tuples in
memory, and we need much of the memtuples array for the next run
(usually this ends as a TSS_FINALMERGE). That being the case
(...right?), I'm confused that you're talking about doing something
clever within tuplesort_performsort(). In the case you're targeting,
won't the vast majority of tuple dumping (through calls to
dumptuples()) occur within puttuple_common()?

I think that my optimization should probably retain it's own
state.status even if we do this (i.e. TSS_MEMTAPEMERGE should stay).

> Hmm, I can see another possible optimization here, in the way the heap is
> managed in TSS_BUILDRUNS state. Instead of keeping a single heap, with
> tupindex as the leading key, it would be more cache efficient to keep one
> heap for the currentRun, and an unsorted array of tuples belonging to
> currentRun + 1. When the heap becomes empty, and currentRun is incemented,
> quicksort the unsorted array to become the new heap.
>
> That's a completely separate idea from your patch, although if you did it
> that way, you wouldn't need the extra step to divide the large array into
> two, as you'd maintain that division all the time.

This sounds to me like a refinement of your first idea (the idea that
I just wrote about).

I think the biggest problem with tuplesort after this patch of mine is
committed is that it is still too attached to the idea of
incrementally spilling and sifting. It makes sense to some degree
where it makes my patch possible...if we hang on to the idea of
incrementally spilling tuples on to tape in sorted order for a while,
then maybe we can hang on for long enough to quicksort most tuples,
*and* to avoid actually dumping most tuples (incremental spills make
the latter possible). But when work_mem is only, say, 10% of the
setting required for a fully internal sort, then the incremental
spilling and sifting starts to look dubious, at least to me, because
the TSS_MEMTAPEMERGE optimization added by my patch could not possibly
apply, and dumping and merging many times is inevitable.

What I think you're getting at here is that we still have a heap, but
we don't use the heap to distinguish between tuples within a run. In
other words, HEAPCOMPARE() often/always only cares about run number.
We quicksort after a deferred period of time, probably just before
dumping everything. Perhaps I've misunderstood, but I don't see much
point in quicksorting a run before being sure that you're sorting as
opposed to heapifying at that point (you're not clear on what we've
decided on once we quicksort). I think it could make sense to make
HEAPCOMPARE() not care about tuples within a run that is not
currentRun, though.

I think that anything that gives up on replacement selection's ability
to generate large runs, particularly for already sorted inputs will be
too hard a sell (not that I think that's what you proposed). That's
way, way less of an advantage than it was in the past (back when
external sorts took place using actual magnetic tape, it was a huge),
but the fact remains that it is an advantage. And so, I've been
prototyping an approach where we don't heapify once it is established
that this TSS_MEMTAPEMERGE optimization of mine cannot possibly apply.
We quicksort in batches rather than heapify.

With this approach, tuples are just added arbitrarily to the memtuples
array when status is TSS_BUILDRUNS and we decided not to heapify. When
this path is first taken (when we first decide not to heapify
memtuples anymore), we quicksort and dump the first half of memtuples
in a batch. Now, when puttuple_common() once again fills the first
half of memtuples (in the style of TSS_INITIAL), we quicksort again,
and dump the first half again. Repeat until done.

This is something that you could perhaps call "batch replacement
selection". This is not anticipated to affect the finished runs one
bit as compared to tuplesort today, because we still create a new run
(state->currentRun++) any time a value in a now-sorted batch is less
than the last (greatest) on-tape value in the currentRun. I am just
batching things -- it's the same basic algorithm with minimal use of a
heap.

I think this will help appreciably because we quicksort, but I'm not
sure yet. I think this may also help with introducing asynchronous I/O
(maybe we can reuse memtuples for that, with clever staggering of
stages during dumping, but that can come later). This is all fairly
hand-wavey, but I think it could help appreciably for the case where
sorts must produce runs to merge and avoiding dumping all tuples is
impossible.
-- 
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: [sqlsmith] Failed assertion in joinrels.c
Next
From: Stephen Frost
Date:
Subject: Re: RLS restrictive hook policies