Re: Using quicksort for every external sort run - Mailing list pgsql-hackers
From | Corey Huinker |
---|---|
Subject | Re: Using quicksort for every external sort run |
Date | |
Msg-id | CADkLM=faYKwOfhDjVUFDVk_JGw1a_X-vqp-eQKJVnj9QQE4rQA@mail.gmail.com |
In response to | Re: Using quicksort for every external sort run (Peter Geoghegan <pg@heroku.com>) |
List | pgsql-hackers |
<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Fri, Nov 6, 2015 at 8:08 PM, Peter Geoghegan <span dir="ltr"><<a href="mailto:pg@heroku.com" target="_blank">pg@heroku.com</a>></span> wrote:<br /><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On Wed, Aug 19, 2015 at 7:24 PM, Peter Geoghegan <<a href="mailto:pg@heroku.com">pg@heroku.com</a>> wrote:<br /></span><span class="">> I'll start a new thread for this, since my external sorting patch has<br /> > now evolved well past the original "quicksort with spillover"<br /> > idea...although not quite how I anticipated it would. It seems like<br /> > I've reached a good point to get some feedback.<br /><br /></span>Corey Huinker has once again assisted me with this work, by doing some<br /> benchmarking on an AWS instance of his:<br /><br /> 32 cores (c3.8xlarge, I suppose)<br /> MemTotal: 251902912 kB<br /><br /> I believe it had one EBS volume.<br /><br /> This testing included 2 data sets:<br /><br /> * A data set that he happens to have that is representative of his<br /> production use-case. Corey had some complaints about the sort<br /> performance of PostgreSQL, particularly prior to 9.5, and I like to<br /> link any particular performance optimization to an improvement in an<br /> actual production workload, if at all possible.<br /><br /> * A tool that I wrote, that works on top of <a href="http://sortbenchmark.org" rel="noreferrer" target="_blank">sortbenchmark.org</a>'s<br /> "gensort" [1] data generation tool. It seems reasonable to me to drive<br /> this work in part with a benchmark devised by Jim Gray. He did after<br /> all receive a Turing award for this contribution to transaction<br /> processing. I'm certainly a fan of his work. 
A key practical advantage<br /> of that is that it has reasonable guarantees about determinism, making<br /> these results relatively easy to recreate independently.<br /><br /> The modified "gensort" is available from<br /><a href="https://github.com/petergeoghegan/gensort" rel="noreferrer" target="_blank">https://github.com/petergeoghegan/gensort</a><br /><br /> The Python script postgres_load.py performs bulk loading for<br /> Postgres using COPY FREEZE. It ought to be fairly self-documenting:<br /><br /> $:~/gensort$ ./postgres_load.py --help<br /> usage: postgres_load.py [-h] [-w WORKERS] [-m MILLION] [-s] [-l] [-c]<br /><br /> optional arguments:<br /> -h, --help show this help message and exit<br /> -w WORKERS, --workers WORKERS<br /> Number of gensort workers (default: 4)<br /> -m MILLION, --million MILLION<br /> Generate n million tuples (default: 100)<br /> -s, --skew Skew distribution of output keys (default: False)<br /> -l, --logged Use logged PostgreSQL table (default: False)<br /> -c, --collate Use default collation rather than C collation<br /> (default: False)<br /><br /> For this initial report to the list, I'm going to focus on a case<br /> involving 16 billion non-skewed tuples generated using the gensort<br /> tool. I wanted to see how a sort of a ~1TB table (1017GB as reported<br /> by psql, actually) could be improved, as compared to relatively small<br /> volumes of data (in the multiple gigabyte range) that were so improved<br /> by sorts on my laptop, which has enough memory to avoid blocking on<br /> physical I/O much of the time. 
How the new approach deals with<br /> hundreds of runs that are actually reasonably sized is also of<br /> interest. This server does have a lot of memory, and many CPU cores.<br /> It was kind of underpowered on I/O, though.<br /><br /> The initial load of 16 billion tuples (with a sortkey that is "C"<br /> locale text) took about 10 hours. My tool supports parallel generation<br /> of COPY format files, but serial performance of that stage isn't<br /> especially fast. Further, in order to support COPY FREEZE, and in<br /> order to ensure perfect determinism, the COPY operations occur<br /> serially in a single transaction that creates the table on which we<br /> later performed a CREATE INDEX.<br /><br /> Patch, with 3GB maintenance_work_mem:<br /><br /> ...<br /> LOG: performsort done (except 411-way final merge): CPU<br /> 1017.95s/17615.74u sec elapsed 23910.99 sec<br /> STATEMENT: create index on sort_test (sortkey );<br /> LOG: external sort ended, 54740802 disk blocks used: CPU<br /> 2001.81s/31395.96u sec elapsed 41648.05 sec<br /> STATEMENT: create index on sort_test (sortkey );<br /><br /> So just over 11 hours (11:34:08), then. 
The initial sorting for 411<br /> runs took 06:38:30.99, as you can see.<br /><br /> Master branch:<br /><br /> ...<br /> LOG: finished writing run 202 to tape 201: CPU 1224.68s/31060.15u sec<br /> elapsed 34409.16 sec<br /> LOG: finished writing run 203 to tape 202: CPU 1230.48s/31213.55u sec<br /> elapsed 34580.41 sec<br /> LOG: finished writing run 204 to tape 203: CPU 1236.74s/31366.63u sec<br /> elapsed 34750.28 sec<br /> LOG: performsort starting: CPU 1241.70s/31501.61u sec elapsed 34898.63 sec<br /> LOG: finished writing run 205 to tape 204: CPU 1242.19s/31516.52u sec<br /> elapsed 34914.17 sec<br /> LOG: finished writing final run 206 to tape 205: CPU<br /> 1243.23s/31564.23u sec elapsed 34963.03 sec<br /> LOG: performsort done (except 206-way final merge): CPU<br /> 1243.86s/31570.58u sec elapsed 34974.08 sec<br /> LOG: external sort ended, 54740731 disk blocks used: CPU<br /> 2026.98s/48448.13u sec elapsed 55299.24 sec<br /> CREATE INDEX<br /> Time: 55299315.220 ms<br /><br /> So 15:21:39 for master -- the patch is a big improvement over that, but<br /> this was still disappointing given the huge improvements on relatively<br /> small cases.<br /><br /> The finished index was fairly large, which can be seen here by working<br /> back from "total relation size":<br /><br /> postgres=# select pg_size_pretty(pg_total_relation_size('sort_test'));<br /> pg_size_pretty<br /> ----------------<br /> 1487 GB<br /> (1 row)<br /><br /> I think that this is probably due to the relatively slow I/O on this<br /> server, and because the merge step is more of a bottleneck. As we<br /> increase maintenance_work_mem, we're likely to then suffer from the<br /> lack of explicit asynchronous I/O here. It helps, still, but not<br /> dramatically. 
With maintenance_work_mem = 30GB, the patch is somewhat<br /> faster (no reason to think that this would help master at all, so that<br /> was untested):<br /><br /> ...<br /> LOG: starting quicksort of run 40: CPU 1815.99s/19339.80u sec elapsed<br /> 24910.38 sec<br /> LOG: finished quicksorting run 40: CPU 1820.09s/19565.94u sec elapsed<br /> 25140.69 sec<br /> LOG: finished writing run 40 to tape 39: CPU 1833.76s/19642.11u sec<br /> elapsed 25234.44 sec<br /> LOG: performsort starting: CPU 1849.46s/19803.28u sec elapsed 25499.98 sec<br /> LOG: starting quicksort of run 41: CPU 1849.46s/19803.28u sec elapsed<br /> 25499.98 sec<br /> LOG: finished quicksorting run 41: CPU 1852.37s/20000.73u sec elapsed<br /> 25700.43 sec<br /> LOG: finished writing run 41 to tape 40: CPU 1864.89s/20069.09u sec<br /> elapsed 25782.93 sec<br /> LOG: performsort done (except 41-way final merge): CPU<br /> 1965.43s/20086.28u sec elapsed 25980.80 sec<br /> LOG: external sort ended, 54740909 disk blocks used: CPU<br /> 3270.57s/31595.37u sec elapsed 40376.43 sec<br /> CREATE INDEX<br /> Time: 40383174.977 ms<br /><br /> So that takes 11:13:03 in total -- we only managed to shave about 20<br /> minutes off the total time taken, despite a 10x increase in<br /> maintenance_work_mem. Still, at least it gets moderately better, not<br /> worse, which is more than I'd expect from the master branch. 60GB<br /> was halfway between 3GB and 30GB in terms of performance, so it<br /> doesn't continue to help, but, again, at least things don't get much<br /> worse.<br /><br /> Thoughts on these results:<br /><br /> * I'd really like to know the role of I/O here. Better, low-overhead<br /> instrumentation is required to see when and how we are I/O bound. I've<br /> been doing much of that on a more-or-less ad hoc basis so far, using<br /> iotop. I'm looking into a way to usefully graph the I/O activity over<br /> many hours, to correlate with the trace_sort output that I'll also<br /> show. 
I'm open to suggestions on the easiest way of doing that; I haven't<br /> used the "perf" tool for instrumenting I/O at all in the past.<br /><br /> * Parallelism would probably help us here *a lot*.<br /><br /> * As I said, I think we suffer from the lack of asynchronous I/O much<br /> more at this scale. Will need to confirm that theory.<br /><br /> * It seems kind of ill-advised to make run size (which is always in<br /> linear proportion to maintenance_work_mem with this new approach to<br /> sorting) larger, because it probably will hurt writing runs more than<br /> it will help in making merging cheaper (perhaps mostly due to the lack<br /> of asynchronous I/O to hide the latency of writes -- Linux might not<br /> do so well at this scale).<br /><br /> * Maybe adding actual I/O bandwidth is the way to go to get a better<br /> picture. I wouldn't be surprised if we were very bottlenecked on I/O<br /> here. Might be worth using many parallel EBS volumes here, for<br /> example.<br /><br /> [1] <a href="http://sortbenchmark.org/FAQ-2015.html" rel="noreferrer" target="_blank">http://sortbenchmark.org/FAQ-2015.html</a><br /><span class="HOEnZb"><font color="#888888">--<br /> Peter Geoghegan<br /></font></span></blockquote></div><br /></div><div class="gmail_extra">The machine in question still exists, so if you have questions about it, or commands you'd like me to run to give you insight into the I/O capabilities of the machine, let me know. I can't guarantee we'll keep the machine much longer.</div></div>
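For anyone trying to line up the trace_sort log excerpts above with the quoted wall-clock totals (11:34:08, 15:21:39, 11:13:03), here is a small sketch, not part of the original thread, that pulls the elapsed seconds out of a trace_sort line and formats them as HH:MM:SS. The regex and helper names are my own; it only assumes the `elapsed N sec` wording shown in the logs above.

```python
import re

def parse_elapsed(log_line):
    """Extract elapsed seconds from a trace_sort log line, e.g.
    '... CPU 2001.81s/31395.96u sec elapsed 41648.05 sec'."""
    m = re.search(r"elapsed ([\d.]+) sec", log_line)
    return float(m.group(1)) if m else None

def hms(seconds):
    """Format elapsed seconds as HH:MM:SS (fractional part truncated)."""
    s = int(seconds)
    return "%02d:%02d:%02d" % (s // 3600, (s % 3600) // 60, s % 60)

line = ("LOG: external sort ended, 54740802 disk blocks used: "
        "CPU 2001.81s/31395.96u sec elapsed 41648.05 sec")
print(hms(parse_elapsed(line)))  # 11:34:08, the patch's total quoted above
```

Feeding the other "external sort ended" lines through the same helper reproduces the 15:21:39 (master) and 11:13:03 (30GB) totals quoted in the report.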