Using quicksort for every external sort run - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Using quicksort for every external sort run |
Date | |
Msg-id | CAM3SWZQVSpNeHHKKq-rjJddOcbpdmyHDJUMBBL2-AP2R+4YCHg@mail.gmail.com |
Responses | Re: Using quicksort for every external sort run (Greg Stark <stark@mit.edu>) |
| Re: Using quicksort for every external sort run (Simon Riggs <simon@2ndQuadrant.com>) |
| Re: Using quicksort for every external sort run (Peter Geoghegan <pg@heroku.com>) |
| Re: Using quicksort for every external sort run (Peter Geoghegan <pg@heroku.com>) |
| Re: Using quicksort for every external sort run (Jeff Janes <jeff.janes@gmail.com>) |
List | pgsql-hackers |
I'll start a new thread for this, since my external sorting patch has now evolved well past the original "quicksort with spillover" idea...although not quite how I anticipated it would. It seems like I've reached a good point to get some feedback. I attach a patch series featuring a new, more comprehensive approach to quicksorting runs during external sorts. What I have now still includes "quicksort with spillover", but it's just one part of a larger project. I am quite happy with the improvements in performance shown by my testing, which I go into below.

Controversy
=========

A few weeks ago, I did not anticipate that I'd propose that replacement selection sort be used far less (only somewhat less, since I was only somewhat doubtful about the algorithm at the time). I had originally planned on continuing to *always* use it for the first run, both to make "quicksort with spillover" possible (thereby sometimes avoiding significant I/O by not spilling most tuples), and to preserve the cases traditionally considered sympathetic to replacement selection. I thought that second or subsequent runs could still be quicksorted, but that I must still care about those traditional sympathetic cases.

That latter category mostly comes down to one important property of replacement selection: even without a strong logical/physical correlation, the algorithm tends to produce runs that are about twice the size of work_mem. (It's also notable that replacement selection produces only one run for mostly presorted input, even where the input far exceeds work_mem, which is a neat trick.)

I wanted to avoid controversy, but the case is too strong for me to ignore: despite these upsides, replacement selection is obsolete, and should usually be avoided. Replacement selection sort still has a role to play in making "quicksort with spillover" possible (when a sympathetic case is *anticipated*), but other than that it seems generally inferior to a simple hybrid sort-merge strategy on modern hardware -- and by modern hardware, I mean anything manufactured in the last 20 years. We've already seen that the algorithm's use of a heap works badly with modern CPU caches, but that is just one factor contributing to its obsolescence.

The big selling point of replacement selection sort in the 20th century was that it sometimes avoided multi-pass sorts as compared to a simple sort-merge strategy (remember when tuplesort.c always used 7 tapes? When you need to use 7 actual magnetic tapes, rewinding is expensive, and in general this matters a lot!). We all know that memory capacity has grown enormously since then, but we must also consider another factor: over the same period, a simple hybrid sort-merge strategy's capacity to get the important detail here right -- avoiding a multi-pass sort -- has grown quadratically relative to work_mem/memory capacity. As an example, testing shows that for a datum tuplesort that requires about 2300MB of work_mem to be completed as a simple internal sort, this patch needs only 30MB to do just one pass (see the benchmark query below). I've somewhat regressed that particular property of tuplesort (it used to be less than 30MB), but that's clearly the wrong thing to worry about, for all kinds of reasons, probably even in the unimportant cases now forced to do multiple passes.
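To make the quadratic-growth claim concrete, here is a back-of-the-envelope sketch (mine, not the patch's): with work_mem of M, runs are roughly M-sized, and the number of runs a single merge pass can handle is limited by how many per-tape buffers fit within M, so one-pass capacity scales with M squared. BUF_PER_TAPE is a made-up round number for illustration, not the actual tuplesort.c figure.

#include <stdio.h>

#define BUF_PER_TAPE (256 * 1024)       /* hypothetical per-tape merge buffer */

int
main(void)
{
    double mbs[] = {4, 30, 100, 800};   /* work_mem settings, in MB */

    for (int i = 0; i < 4; i++)
    {
        double M = mbs[i] * 1024 * 1024;

        /* runs mergeable in one pass ~= per-tape buffers affordable in M */
        double tapes = M / BUF_PER_TAPE;

        /* each run holds ~M of input, so one-pass capacity ~= M^2 / buf */
        printf("work_mem %6.0fMB -> ~%5.0f tapes, one-pass capacity ~%.1fGB\n",
               mbs[i], tapes, tapes * M / (1024.0 * 1024 * 1024));
    }
    return 0;
}

Under these assumptions, 30MB of work_mem already merges ~120 runs of ~30MB each in a single pass -- comfortably more than the 2300MB input above -- and doubling work_mem quadruples one-pass capacity.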
Multi-pass sorts
---------------------

I believe, in general, that we should consider a multi-pass sort to be an inherently suspect thing these days, in the same way that checkpoints occurring 5 seconds apart are: not actually abnormal, but something that we should regard suspiciously. Can you really not afford enough work_mem to do only one pass? Does it really make sense to add far more I/O and CPU costs to avoid that other tiny memory capacity cost? In theory, the answer could be "yes", but it seems highly unlikely. Not only is very little memory required to avoid a multi-pass merge step, but as described above the amount required grows very slowly relative to linear growth in input. I propose to add a checkpoint_warning style warning (with a checkpoint_warning style GUC to control it); a minimal sketch of what that might look like appears at the end of this section. ISTM that these days, multi-pass merges are like saving $2 on replacing a stairwell light bulb at the expense of regularly stumbling down the stairs in the dark. It shouldn't matter if you have a 50 terabyte decision support database or if you're paying Heroku a small monthly fee to run a database backing your web app: simply avoiding multi-pass merges is probably always the most economical solution, and by a wide margin.

Note that I am not skeptical of polyphase merging itself, even though it is generally considered to be a complementary technique to replacement selection (some less formal writing on external sorting seemingly fails to draw a sharp distinction between the two). Nothing has changed there.
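Here is that minimal sketch of the sort of warning I have in mind -- illustrative only, and not from the attached patches. The GUC name (multipass_warning) and the hook point are hypothetical; ereport()/errmsg()/errhint() are the usual error-reporting calls.

#include "postgres.h"

/* Hypothetical GUC, in the style of checkpoint_warning; false disables */
bool        multipass_warning = true;

/*
 * Hypothetical hook, called once the number of merge passes needed by an
 * external sort is known (e.g. from within tuplesort.c).
 */
static void
warn_on_multipass(int merge_passes, long work_mem_kb)
{
    if (multipass_warning && merge_passes > 1)
        ereport(WARNING,
                (errmsg("external sort required %d merge passes",
                        merge_passes),
                 errhint("Consider increasing \"work_mem\" beyond %ldkB.",
                         work_mem_kb)));
}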
Patch, performance
===============

Let's focus on a multi-run sort that does not use "quicksort with spillover", since that is all new, and is probably the most compelling case for very large databases with hundreds of gigabytes of data to sort. I think that this patch requires a machine with more I/O bandwidth than my laptop to get a proper sense of the improvement made. I've been using a tmpfs temp_tablespace for testing, to simulate this. That may leave me slightly optimistic about I/O costs, but you can usually get significantly more sequential I/O bandwidth by adding disks, whereas you cannot really buy new hardware to improve the situation with excessive CPU cache misses.

Benchmark
---------------

-- Setup: 100 million tuple table with a high cardinality int4 column (2 billion possible int4 values)
create table big_high_cardinality_int4 as
  select (random() * 2000000000)::int4 s, 'abcdefghijlmn'::text junk
  from generate_series(1, 100000000);
-- Make cost model hinting accurate:
analyze big_high_cardinality_int4;
checkpoint;

Let's start by comparing an external sort that uses 1/3 the memory of an internal sort against the master branch. That's completely unfair on the patch, of course, but it is a useful indicator of how well external sorts do overall. Although an external sort surely cannot be as fast as an internal sort, it might be able to approach an internal sort's speed when there is plenty of I/O bandwidth. That's a good thing to aim for, I think.

-- Master (just enough memory for an internal sort):
set work_mem = '2300MB';
select count(distinct(s)) from big_high_cardinality_int4;
***** Runtime after stabilization: ~33.6 seconds *****

-- Patch series, but with just over 1/3 the memory:
set work_mem = '800MB';
select count(distinct(s)) from big_high_cardinality_int4;
***** Runtime after stabilization: ~37.1 seconds *****

The patch takes only ~10% more time to execute this query, which seems very good considering that only ~1/3 the work_mem has been put to use.

trace_sort output for the patch during execution of this case:

LOG: begin datum sort: workMem = 819200, randomAccess = f
LOG: switching to external sort with 2926 tapes: CPU 0.39s/2.66u sec elapsed 3.06 sec
LOG: replacement selection avg tuple size 24.00 crossover: 0.85
LOG: hybrid sort-merge in use from row 34952532 with 100000000.00 total rows
LOG: finished quicksorting run 1: CPU 0.39s/8.84u sec elapsed 9.24 sec
LOG: finished writing quicksorted run 1 to tape 0: CPU 0.60s/9.61u sec elapsed 10.22 sec
LOG: finished quicksorting run 2: CPU 0.87s/18.61u sec elapsed 19.50 sec
LOG: finished writing quicksorted run 2 to tape 1: CPU 1.07s/19.38u sec elapsed 20.46 sec
LOG: performsort starting: CPU 1.27s/21.79u sec elapsed 23.07 sec
LOG: finished quicksorting run 3: CPU 1.27s/27.07u sec elapsed 28.35 sec
LOG: finished writing quicksorted run 3 to tape 2: CPU 1.47s/27.69u sec elapsed 29.18 sec
LOG: performsort done (except 3-way final merge): CPU 1.51s/28.54u sec elapsed 30.07 sec
LOG: external sort ended, 146625 disk blocks used: CPU 1.76s/35.32u sec elapsed 37.10 sec

Note that the on-tape runs are small relative to CPU costs, so this query is a bit sympathetic (consider the time spent writing batches that trace_sort indicates here). CREATE INDEX would not compare so well with an internal sort, for example, especially if it was a composite index or something. I've sized work_mem here deliberately, to make sure there are 3 runs of similar size by the time the merge step is reached, which makes a small difference in the patch's favor. All told, this seems like a very significant overall improvement.

Now, consider master's performance with the same work_mem setting (a fair test, with comparable resource usage for master and patch):

-- Master
set work_mem = '800MB';
select count(distinct(s)) from big_high_cardinality_int4;
***** Runtime after stabilization: ~120.9 seconds *****

The patch is ~3.25x faster than master here, which also seems like a significant improvement. That's pretty close to the improvement previously seen for good "quicksort with spillover" cases, but it applies to every external sort case that doesn't use "quicksort with spillover". In other words, every variety of external sort is significantly improved by the patch. I think it's safe to suppose that there are also big benefits when multiple concurrent sort operations run on the same system -- for example, when pg_restore has multiple jobs.

Worst case
---------------

Even with a traditionally sympathetic case for replacement selection sort, the patch beats replacement selection with multiple on-tape runs. When experimenting here, I did not forget to account for our qsort()'s behavior in the event of *perfectly* presorted input ("Bubble sort best case" behavior [1]; see the sketch at the end of this section). Other than that, I have a hard time thinking of an unsympathetic case for the patch, and could not find any actual regressions despite a fair amount of effort. Abbreviated keys are not used when merging, but that doesn't seem to count notably against the new approach (which will have shorter runs on average). After all, the reason why abbreviated keys aren't saved on disk for merging is that they're probably not very useful when merging: they would resolve far fewer comparisons if they were used there, and having somewhat smaller runs does not result in significantly more non-abbreviated comparisons, even when sorting random noise strings.
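For reference, the qsort() behavior in question is the presorted-input check added by [1]: a single O(n) comparison pass that returns immediately when the input is already in order. The standalone rendering below is my simplification, not the actual src/port qsort code.

#include <stddef.h>

/*
 * Simplified "Bubble sort best case" check [1]: one linear scan lets
 * perfectly presorted input complete in O(n) comparisons, so presorted
 * input is not an automatic win for replacement selection.
 */
static int
check_presorted(const char *a, size_t n, size_t es,
                int (*cmp) (const void *, const void *))
{
    const char *pm;

    for (pm = a + es; pm < a + n * es; pm += es)
    {
        if (cmp(pm - es, pm) > 0)
            return 0;           /* out of order: full quicksort needed */
    }
    return 1;                   /* already sorted: nothing more to do */
}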
Avoiding replacement selection *altogether*
=================================

Assuming you agree with my conclusions on replacement selection sort mostly not being worth it, we need to avoid replacement selection except when it will probably allow a "quicksort with spillover". In my mind, that's now the *only* reason to use replacement selection. Callers pass a hint to tuplesort indicating how many tuples are estimated to ultimately be passed in before a sort is performed. (Typically, this comes from a scan plan node's row estimate, or more directly from the relcache for things like CREATE INDEX.)

Cost model -- details
----------------------------

Second or subsequent runs *never* use replacement selection -- it is only *considered* for the first run, right before the possible point of initial heapification within inittapes(). The cost model is contained within the new function useselection(), which is added by the second patch in the series; see that patch for full details. I have a fairly high bar for even using replacement selection for the first run -- several factors can result in a simple hybrid sort-merge strategy being used instead of a "quicksort with spillover", because in general most of the benefit seems to be around CPU cache misses rather than savings in I/O.

Consider my benchmark query above once more. With replacement selection used for the first run (e.g., with just the first patch in the series applied, or with the "optimize_avoid_selection" debug GUC set to "off"), I found that the query took over twice as long to execute, even though the second-or-subsequent (now smaller) runs were quicksorted just the same, and were all merged just the same. The numbers should make it obvious why I gave in to the temptation of adding an ad-hoc, tuplesort-private cost model. At this point, I'd rather scrap "quicksort with spillover" (and the use of replacement selection under all possible circumstances) than scrap the idea of a cost model; that would make more sense, even though it would give up on the idea of saving most I/O where the work_mem threshold is crossed by only a small amount. A sketch of the rough shape of the useselection() decision appears just below.
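To give a rough idea of the shape of that decision, here is a simplified sketch of the kind of test useselection() performs -- my simplification, not the function as it appears in the patch. Treating the crossover as the constant 0.85 (the figure reported by trace_sort above) is an assumption; the patch derives it from factors such as average tuple size.

/*
 * Sketch only: use replacement selection for the first run iff a
 * "quicksort with spillover" appears likely, i.e. the vast majority of
 * the estimated input is expected to fit in memory.
 */
static bool
useselection_sketch(double input_rows_hint, double rows_fitting_in_memory)
{
    double crossover = 0.85;    /* simplified; see trace_sort output above */

    /* No hint supplied: assume the worst, use hybrid sort-merge */
    if (input_rows_hint <= 0)
        return false;

    return rows_fitting_in_memory >= crossover * input_rows_hint;
}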
Future work
=========

I anticipate a number of other things within the first patch in the series, some of which are already worked out to some degree.

Asynchronous I/O
-------------------------

This patch leaves open the possibility of using something like libaio/librt for sorting. That would probably use half of memtuples as scratch space, while the other half is quicksorted.

Memory prefetching
---------------------------

To test what role memory prefetching is likely to have here, I attach a custom version of my tuplesort/tuplestore prefetch patch, with prefetching added to the "quicksort with spillover" and batch-dumping-runs WRITETUP()-calling code. This seems to help performance measurably. However, I guess it shouldn't really be considered part of this patch. It can follow the initial commit of the big, base patch (or will become part of the base patch if and when prefetching is committed first).

cost_sort() changes
--------------------------

I had every intention of making cost_sort() a continuous cost function as part of this work. This could be justified by "quicksort with spillover" allowing tuplesort to "blend" from internal to external sorting as input size is gradually increased. This seemed like something that would have significant non-obvious benefits in several other areas. However, I've put off making any change to cost_sort() because of concerns about the complexity of overlaying such changes on top of the tuplesort-private cost model. I think that this will need to be discussed in a lot more detail. As a further matter, materialization of sort nodes will probably also require tweaks to the costing for "quicksort with spillover". Recall that "quicksort with spillover" can only work for !randomAccess tuplesort callers.

Run size
------------

This patch continues to have tuplesort determine run size based only on the availability of work_mem. It does not entirely fix the problem of work_mem sizing affecting performance in counter-intuitive ways; in other words, smaller work_mem sizes can still be faster. It does make that general situation much better, though, because quicksort is a cache oblivious algorithm: smaller work_mem sizes are now sometimes a bit faster, but never dramatically faster.

In general, the whole idea of making run size as big as possible is bogus, unless that enables, or is likely to enable, a "quicksort with spillover". The caller-supplied row count hint I've added may in the future be used to determine an optimal run size ahead of time, when it's perfectly clear (leaving aside misestimation) that a fully internal sort (or "quicksort with spillover") will not occur. This will result in faster external sorts where additional work_mem cannot be put to good use; as a side benefit, external sorts will no longer effectively waste a large amount of memory.

The cost model we eventually come up with to determine optimal run size ought to balance certain things. Assuming a one-pass merge step, we should balance the time lost waiting on the first run, and the time spent quicksorting the last run, against the gradual increase in cost during the merge step as the number of runs grows. Maybe the non-use of abbreviated keys during the merge step should also be considered. Alternatively, the run size may be determined by a GUC, typically sized at around the drive controller cache size (e.g. 1GB), when any kind of I/O avoidance for the sort appears impossible.

[1] Commit a3f0b3d6

--
Peter Geoghegan