Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort" - Mailing list pgsql-hackers
From | Peter Geoghegan
---|---
Subject | Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"
Date |
Msg-id | CAM3SWZQt5erOWaAvC_14o-QeeYvYw_HB=okxE=-6L9twb9+sAw@mail.gmail.com
In response to | Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort" (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses | Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"
List | pgsql-hackers
On Fri, Jul 31, 2015 at 12:59 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Oh, ok, I was confused on how the heap works. You could still abstract this
> as "in-memory tails" of the tapes, but it's more complicated than I thought
> at first:
>
> When it's time to drain the heap, in performsort, divide the array into two
> arrays, based on the run number of each tuple, and then quicksort the arrays
> separately. The first array becomes the in-memory tail of the current tape,
> and the second array becomes the in-memory tail of the next tape.
>
> You wouldn't want to actually allocate two arrays and copy SortTuples
> around, but keep using the single large array, just logically divided into
> two. So the bookkeeping isn't trivial, but seems doable.

Since you're talking about the case where we must drain all tuples within
tuplesort_performsort(), I think you're talking about a distinct idea here
(since surely you're not suggesting giving up on my idea of avoiding dumping
most tuples, which is a key strength of the patch). That's fine, but I just
want to be clear on that.

Importantly, my optimization does not care about the fact that the array may
have two runs in it, because it quicksorts the array and forgets about it
being a heap. It only cares whether or not more than one run made it out to
tape, which makes it more widely applicable than it would otherwise be.

Also, the fact that much I/O can be avoided is clearly something that can
only happen when work_mem is at least ~50% of the work_mem setting that would
have resulted in an (entirely) internal sort. You're talking about a new
thing here, which happens when it is necessary to dump everything and do a
conventional merge of on-tape runs. IOW, we cannot fit a significant fraction
of the overall tuples in memory, and we need much of the memtuples array for
the next run (usually this ends as a TSS_FINALMERGE). That being the case
(...right?), I'm confused that you're talking about doing something clever
within tuplesort_performsort(). In the case you're targeting, won't the vast
majority of tuple dumping (through calls to dumptuples()) occur within
puttuple_common()?

I think that my optimization should probably retain its own state.status even
if we do this (i.e. TSS_MEMTAPEMERGE should stay).

> Hmm, I can see another possible optimization here, in the way the heap is
> managed in TSS_BUILDRUNS state. Instead of keeping a single heap, with
> tupindex as the leading key, it would be more cache efficient to keep one
> heap for the currentRun, and an unsorted array of tuples belonging to
> currentRun + 1. When the heap becomes empty, and currentRun is incremented,
> quicksort the unsorted array to become the new heap.
>
> That's a completely separate idea from your patch, although if you did it
> that way, you wouldn't need the extra step to divide the large array into
> two, as you'd maintain that division all the time.

This sounds to me like a refinement of your first idea (the idea that I just
wrote about). I think the biggest problem with tuplesort after this patch of
mine is committed is that it is still too attached to the idea of
incrementally spilling and sifting. That makes sense to some degree, insofar
as it is what makes my patch possible: if we hang on to the idea of
incrementally spilling tuples onto tape in sorted order for a while, then
maybe we can hang on long enough to quicksort most tuples, *and* to avoid
actually dumping most tuples (incremental spills make the latter possible).
But when work_mem is only, say, 10% of the setting required for a fully
internal sort, the incremental spilling and sifting starts to look dubious,
at least to me, because the TSS_MEMTAPEMERGE optimization added by my patch
could not possibly apply, and dumping and merging many times is inevitable.
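
Going back to your first suggestion for a moment, the bookkeeping might look
something like the toy, standalone sketch below. To be clear, ToySortTuple,
split_and_sort() and the other names here are invented stand-ins for
illustration (not the real SortTuple or anything in tuplesort.c), and it
assumes that only tuples from currentRun and currentRun + 1 can coexist in
the array:

```c
/*
 * Toy, standalone model of the "logically divide the array by run number,
 * then quicksort each part" bookkeeping.  ToySortTuple and these function
 * names are invented for illustration; this is not tuplesort.c code.
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct ToySortTuple
{
    int     key;        /* stand-in for the real sort key */
    int     tupindex;   /* run number, like SortTuple's tupindex */
} ToySortTuple;

static int
cmp_key(const void *a, const void *b)
{
    const ToySortTuple *ta = (const ToySortTuple *) a;
    const ToySortTuple *tb = (const ToySortTuple *) b;

    return (ta->key > tb->key) - (ta->key < tb->key);
}

/*
 * Partition "memtuples" in place so that currentRun tuples come first, then
 * quicksort each part on the key alone.  Returns the boundary index: tuples
 * before it form the in-memory tail of the current tape, tuples at or after
 * it the in-memory tail of the next tape.
 */
static int
split_and_sort(ToySortTuple *memtuples, int memtupcount, int currentRun)
{
    int     boundary = 0;
    int     i;

    for (i = 0; i < memtupcount; i++)
    {
        if (memtuples[i].tupindex == currentRun)
        {
            ToySortTuple swap = memtuples[boundary];

            memtuples[boundary] = memtuples[i];
            memtuples[i] = swap;
            boundary++;
        }
    }

    qsort(memtuples, boundary, sizeof(ToySortTuple), cmp_key);
    qsort(memtuples + boundary, memtupcount - boundary,
          sizeof(ToySortTuple), cmp_key);

    return boundary;
}

int
main(void)
{
    ToySortTuple memtuples[] = {
        {42, 0}, {7, 1}, {13, 0}, {99, 1}, {1, 0}, {58, 1}
    };
    int     memtupcount = sizeof(memtuples) / sizeof(memtuples[0]);
    int     boundary = split_and_sort(memtuples, memtupcount, 0);
    int     i;

    for (i = 0; i < memtupcount; i++)
        printf("%s tape tail: key %d (run %d)\n",
               i < boundary ? "current" : "next",
               memtuples[i].key, memtuples[i].tupindex);
    return 0;
}
```

The point is just that a single boundary index is enough; the large array
itself never has to be split or reallocated.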
What I think you're getting at with your second idea is that we still have a
heap, but we don't use the heap to distinguish between tuples within a run.
In other words, HEAPCOMPARE() often/always only cares about run number. We
quicksort after a deferred period of time, probably just before dumping
everything. Perhaps I've misunderstood, but I don't see much point in
quicksorting a run before being sure that you're sorting as opposed to
heapifying at that point (you're not clear on what we've decided on once we
quicksort). I think it could make sense to make HEAPCOMPARE() not care about
tuples within a run that is not currentRun, though.

I think that anything that gives up on replacement selection's ability to
generate large runs, particularly for already sorted inputs, will be too hard
a sell (not that I think that's what you proposed). That's way, way less of
an advantage than it was in the past (back when external sorts took place
using actual magnetic tape, it was a huge one), but the fact remains that it
is an advantage.

And so, I've been prototyping an approach where we don't heapify once it is
established that this TSS_MEMTAPEMERGE optimization of mine cannot possibly
apply. We quicksort in batches rather than heapify. With this approach,
tuples are just added arbitrarily to the memtuples array when status is
TSS_BUILDRUNS and we have decided not to heapify. When this path is first
taken (when we first decide not to heapify memtuples anymore), we quicksort
and dump the first half of memtuples in a batch. Then, when puttuple_common()
once again fills the first half of memtuples (in the style of TSS_INITIAL),
we quicksort again, and dump the first half again. Repeat until done. This is
something that you could perhaps call "batch replacement selection".

This is not anticipated to affect the finished runs one bit as compared to
tuplesort today, because we still create a new run (state->currentRun++) any
time a value in a now-sorted batch is less than the last (greatest) on-tape
value in the currentRun. I am just batching things -- it's the same basic
algorithm with minimal use of a heap. I think this will help appreciably
because we quicksort, but I'm not sure yet. I think this may also help with
introducing asynchronous I/O (maybe we can reuse memtuples for that, with
clever staggering of stages during dumping, but that can come later).

This is all fairly hand-wavey, but I think it could help appreciably for the
case where sorts must produce runs to merge and avoiding dumping all tuples
is impossible.
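
To give a slightly less hand-wavey picture of the batching, here is a toy,
standalone model. MEMTUPSIZE, puttuple(), dump_batch() and the printf "tape"
are all made up for illustration, and this is only one rough reading of the
scheme, not the actual prototype:

```c
/*
 * Toy, standalone model of the batching described above.  MEMTUPSIZE,
 * puttuple(), dump_batch() and the printf "tape" are all made up for
 * illustration; this is one reading of the scheme, not the real prototype.
 */
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>

#define MEMTUPSIZE 8            /* stand-in for the memtuples array size */

static int  memtuples[MEMTUPSIZE];
static int  memtupcount = 0;
static int  currentRun = 0;
static int  lastDumped = INT_MIN;   /* greatest key on tape in currentRun */

static int
cmp_int(const void *a, const void *b)
{
    int     x = *(const int *) a;
    int     y = *(const int *) b;

    return (x > y) - (x < y);
}

/*
 * Quicksort everything currently in memory, "dump" the first ntodump keys to
 * tape, and keep the rest in memory.  A new run starts whenever a dumped key
 * sorts below the greatest key already written to the current run.
 */
static void
dump_batch(int ntodump)
{
    int     i;

    qsort(memtuples, memtupcount, sizeof(int), cmp_int);

    for (i = 0; i < ntodump; i++)
    {
        if (memtuples[i] < lastDumped)
        {
            currentRun++;       /* key cannot extend the current run */
            lastDumped = INT_MIN;
        }
        printf("run %d: %d\n", currentRun, memtuples[i]);
        lastDumped = memtuples[i];
    }

    /* slide the undumped (larger) keys to the front of the array */
    for (i = ntodump; i < memtupcount; i++)
        memtuples[i - ntodump] = memtuples[i];
    memtupcount -= ntodump;
}

/* Add an incoming key; once memory fills, sort and dump the first half. */
static void
puttuple(int key)
{
    memtuples[memtupcount++] = key;
    if (memtupcount == MEMTUPSIZE)
        dump_batch(MEMTUPSIZE / 2);
}

int
main(void)
{
    int     input[] = {5, 3, 9, 1, 7, 8, 2, 6, 4, 10, 0, 12, 11, 14, 13, 15};
    int     i;

    for (i = 0; i < (int) (sizeof(input) / sizeof(input[0])); i++)
        puttuple(input[i]);
    dump_batch(memtupcount);    /* final drain */
    return 0;
}
```

In this model a new run only ever starts when a just-dumped key sorts below
the current run's on-tape tail, which is the same rule that governs run
boundaries today.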
--
Peter Geoghegan