Re: Using quicksort for every external sort run - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: Using quicksort for every external sort run
Date:
Msg-id: CA+TgmoaxNQE97escn3WVLUX1zYO49zZiCJTsBdm04_zq0QeC-Q@mail.gmail.com
In response to: Re: Using quicksort for every external sort run (Peter Geoghegan <pg@heroku.com>)
Responses: Re: Using quicksort for every external sort run (Peter Geoghegan <pg@heroku.com>)
List: pgsql-hackers
On Sun, Dec 6, 2015 at 7:25 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Nov 24, 2015 at 4:33 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> So, the bottom line is: This patch seems very good, is unlikely to
>> have any notable downside (no case has been shown to be regressed),
>> but has yet to receive code review. I am working on a new version with
>> the first two commits consolidated, and better comments, but that will
>> have the same code, unless I find bugs or am dissatisfied. It mostly
>> needs thorough code review, and to a lesser extent some more
>> performance testing.
>
> I'm currently spending a lot of time working on parallel CREATE INDEX.
> I should not delay posting a new version of my patch series any
> further, though. I hope to polish up parallel CREATE INDEX to be able
> to show people something in a couple of weeks.
>
> This version features consolidated commits, the removal of the
> multipass_warning parameter, and improved comments and commit
> messages. It has almost entirely unchanged functionality.
>
> The only functional changes are:
>
> * The function useselection() is taught to distrust an obviously bogus
> caller reltuples hint (when it's already less than half of what we
> know to be the minimum number of tuples that the sort must sort,
> immediately after LACKMEM() first becomes true -- this is probably a
> generic estimate).
>
> * Prefetching only occurs when writing tuples. Explicit prefetching
> appears to hurt in some cases, as David Rowley has shown over on the
> dedicated thread. But it might still be that writing tuples is a case
> that is simple enough to benefit consistently, due to the relatively
> uniform processing that memory latency can hide behind for that case
> (before, the same prefetching instructions were used for CREATE INDEX
> and for aggregates, for example).
>
> Maybe we should consider trying to get patch 0002 (the memory
> pool/merge patch) committed first, something Greg Stark suggested
> privately.
> That might actually be an easier way of integrating this
> work, since it changes nothing about the algorithm we use for merging
> (it only improves memory locality), and so is really an independent
> piece of work (albeit one that makes a huge overall difference due to
> the other patches increasing the time spent merging in absolute terms,
> and especially as a proportion of the total).

So I was looking at the 0001 patch and came across this code:

+       /*
+        * Crossover point is somewhere between where memtuples is between 40%
+        * and all-but-one of total tuples to sort.  This weighs approximate
+        * savings in I/O, against generic heap sorting cost.
+        */
+       avgTupleSize = (double) memNowUsed / (double) state->memtupsize;
+
+       /*
+        * Starting from a threshold of 90%, refund 7.5% per 32 byte
+        * average-size-increment.
+        */
+       increments = MAXALIGN_DOWN((int) avgTupleSize) / 32;
+       crossover = 0.90 - (increments * 0.075);
+
+       /*
+        * Clamp, making either outcome possible regardless of average size.
+        *
+        * 40% is about the minimum point at which "quicksort with spillover"
+        * can still occur without a logical/physical correlation.
+        */
+       crossover = Max(0.40, Min(crossover, 0.85));
+
+       /*
+        * The point where the overhead of maintaining the heap invariant is
+        * likely to dominate over any saving in I/O is somewhat arbitrarily
+        * assumed to be the point where memtuples' size exceeds MaxAllocSize
+        * (note that overall memory consumption may be far greater).  Past
+        * this point, only the most compelling cases use replacement selection
+        * for their first run.
+        *
+        * This is not about cache characteristics so much as the O(n log n)
+        * cost of sorting larger runs dominating over the O(n) cost of
+        * writing/reading tuples.
+        */
+       if (sizeof(SortTuple) * state->memtupcount > MaxAllocSize)
+               crossover = avgTupleSize > 32 ? 0.90 : 0.95;

This looks like voodoo to me.
I assume you tested it and maybe it gives correct answers, but it's got to be some kind of world record for number of arbitrary constants per SLOC, and there's no real justification for any of it. The comments say, essentially, well, we do this because it works. But suppose I try it on some new piece of hardware and it doesn't work well. What do I do? Email the author and ask him to tweak the arbitrary constants?

The dependency on MaxAllocSize seems utterly bizarre to me. If we decide to modify our TOAST infrastructure so that we support datums up to 2GB in size, or alternatively datums of up to only 512MB in size, do you expect that to change the behavior of tuplesort.c? I bet not, but that's a major reason why MaxAllocSize is defined the way it is.

I wonder if there's a way to accomplish what you're trying to do here that avoids the need to have a cost model at all. As I understand it, and please correct me wherever I go off the rails, the situation is:

1. If we're sorting a large amount of data, such that we can't fit it all in memory, we will need to produce a number of sorted runs and then merge those runs. If we generate each run using a heap with replacement selection, rather than quicksort, we will produce runs that are, on the average, about twice as long, which means that we will have fewer runs to merge at the end.

2. Replacement selection is slower than quicksort on a per-tuple basis. Furthermore, merging more runs isn't necessarily any slower than merging fewer runs. Therefore, building runs via replacement selection tends to lose even though it tends to reduce the number of runs to merge. Even when having a larger number of runs results in an increase in the number of merge passes, we save so much time building the runs that we often (maybe not always) still come out ahead.

3. However, when replacement selection would result in a single run, and quicksort results in multiple runs, using quicksort loses.
This is especially true when the amount of data we have is between one and two times work_mem. If we fit everything into one run, we do not need to write any data to tape, but if we overflow by even a single tuple, we have to write a lot of data to tape.

If this is correct so far, then I wonder if we could do this: Forget replacement selection. Always build runs by quicksorting. However, when dumping the first run to tape, dump it a little at a time rather than all at once. If the input ends before we've completely written the run, then we've got all of run 1 in memory and run 0 split between memory and tape. So we don't need to do any extra I/O; we can do a merge between run 1 and the portion of run 0 which is on tape. When the tape is exhausted, we only need to finish merging the in-memory tails of the two runs.

I also wonder if you've thought about the case where we are asked to sort an enormous amount of data that is already in order, or very nearly in order (2,1,4,3,6,5,8,7,...). It seems worth including a check to see whether the low value of run N+1 is higher than the high value of run N, and if so, append it to the existing run rather than starting a new one. In some cases this could completely eliminate the final merge pass at very low cost, which seems likely to be worthwhile. Unfortunately, it's possible to fool this algorithm pretty easily - suppose the data is as in the parenthetical note in the previous paragraph, but the number of tuples that fits in work_mem is odd. I wonder if we can find instances where such cases regress significantly as compared with the replacement selection approach, which might be able to produce a single run out of an arbitrary amount of data.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company