Re: Using quicksort for every external sort run - Mailing list pgsql-hackers

From Corey Huinker
Subject Re: Using quicksort for every external sort run
Date
Msg-id CADkLM=faYKwOfhDjVUFDVk_JGw1a_X-vqp-eQKJVnj9QQE4rQA@mail.gmail.com
In response to Re: Using quicksort for every external sort run  (Peter Geoghegan <pg@heroku.com>)
List pgsql-hackers
On Fri, Nov 6, 2015 at 8:08 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Wed, Aug 19, 2015 at 7:24 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I'll start a new thread for this, since my external sorting patch has
> now evolved well past the original "quicksort with spillover"
> idea...although not quite how I anticipated it would. It seems like
> I've reached a good point to get some feedback.

Corey Huinker has once again assisted me with this work, by doing some
benchmarking on an AWS instance of his:

32 cores (c3.8xlarge, I suppose)
MemTotal:       251902912 kB

I believe it had one EBS volume.

This testing included 2 data sets:

* A data set that he happens to have that is representative of his
production use-case. Corey had some complaints about the sort
performance of PostgreSQL, particularly prior to 9.5, and I like to
link any particular performance optimization to an improvement in an
actual production workload, if at all possible.

* A tool that I wrote, that works on top of sortbenchmark.org's
"gensort" [1] data generation tool. It seems reasonable to me to drive
this work in part with a benchmark devised by Jim Gray. He did after
all receive a Turing award for this contribution to transaction
processing. I'm certainly a fan of his work. A key practical advantage
of that is that it has reasonable guarantees about determinism, making
these results relatively easy to recreate independently.

The modified "gensort" is available from
https://github.com/petergeoghegan/gensort

The Python script postgres_load.py performs bulk-loading for Postgres
using COPY FREEZE. It ought to be fairly self-documenting:

$:~/gensort$ ./postgres_load.py --help
usage: postgres_load.py [-h] [-w WORKERS] [-m MILLION] [-s] [-l] [-c]

optional arguments:
  -h, --help            show this help message and exit
  -w WORKERS, --workers WORKERS
                        Number of gensort workers (default: 4)
  -m MILLION, --million MILLION
                        Generate n million tuples (default: 100)
  -s, --skew            Skew distribution of output keys (default: False)
  -l, --logged          Use logged PostgreSQL table (default: False)
  -c, --collate         Use default collation rather than C collation
                        (default: False)
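As an illustration (not part of the original message), and given that
-m counts millions of tuples, the 16 billion tuple, non-skewed,
C-collation, unlogged data set described below could presumably be
generated and loaded with an invocation along these lines, with the
worker count chosen to match the 32 cores:

$:~/gensort$ ./postgres_load.py --workers 32 --million 16000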
For this initial report to the list, I'm going to focus on a case
involving 16 billion non-skewed tuples generated using the gensort
tool. I wanted to see how a sort of a ~1TB table (1017GB as reported
by psql, actually) could be improved, as compared to relatively small
volumes of data (in the multiple gigabyte range) that were so improved
by sorts on my laptop, which has enough memory to avoid blocking on
physical I/O much of the time. How the new approach deals with
hundreds of runs that are actually reasonably sized is also of
interest. This server does have a lot of memory, and many CPU cores.
It was kind of underpowered on I/O, though.

The initial load of 16 billion tuples (with a sortkey that is "C"
locale text) took about 10 hours. My tool supports parallel generation
of COPY format files, but serial performance of that stage isn't
especially fast. Further, in order to support COPY FREEZE, and in
order to ensure perfect determinism, the COPY operations occur
serially in a single transaction that creates the table that we
performed a CREATE INDEX on.
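The timings that follow are trace_sort output from the CREATE INDEX
runs. As a rough sketch (not taken from the original message, and
assuming the trace_sort developer option is available in the build),
each run amounts to a psql session along these lines:

$ psql -d postgres <<'SQL'
SET maintenance_work_mem = '3GB';  -- '30GB' for the later run below
SET trace_sort = on;               -- emits the LOG lines quoted below
\timing on
CREATE INDEX ON sort_test (sortkey);
SQL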
Patch, with 3GB maintenance_work_mem:

...
LOG:  performsort done (except 411-way final merge): CPU
1017.95s/17615.74u sec elapsed 23910.99 sec
STATEMENT:  create index on sort_test (sortkey );
LOG:  external sort ended, 54740802 disk blocks used: CPU
2001.81s/31395.96u sec elapsed 41648.05 sec
STATEMENT:  create index on sort_test (sortkey );

So just over 11 hours (11:34:08), then. The initial sorting for 411
runs took 06:38:30.99, as you can see.

Master branch:

...
LOG:  finished writing run 202 to tape 201: CPU 1224.68s/31060.15u sec
elapsed 34409.16 sec
LOG:  finished writing run 203 to tape 202: CPU 1230.48s/31213.55u sec
elapsed 34580.41 sec
LOG:  finished writing run 204 to tape 203: CPU 1236.74s/31366.63u sec
elapsed 34750.28 sec
LOG:  performsort starting: CPU 1241.70s/31501.61u sec elapsed 34898.63 sec
LOG:  finished writing run 205 to tape 204: CPU 1242.19s/31516.52u sec
elapsed 34914.17 sec
LOG:  finished writing final run 206 to tape 205: CPU
1243.23s/31564.23u sec elapsed 34963.03 sec
LOG:  performsort done (except 206-way final merge): CPU
1243.86s/31570.58u sec elapsed 34974.08 sec
LOG:  external sort ended, 54740731 disk blocks used: CPU
2026.98s/48448.13u sec elapsed 55299.24 sec
CREATE INDEX
Time: 55299315.220 ms

So 15:21:39 for master -- it's much improved, but this was still
disappointing given the huge improvements on relatively small cases.

Finished index was fairly large, which can be seen here by working
back from "total relation size":

postgres=# select pg_size_pretty(pg_total_relation_size('sort_test'));
 pg_size_pretty
----------------
 1487 GB
(1 row)
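(Not from the original message: one way to work back from that total
with the standard size functions, separating heap from indexes, would
be something like the following.)

$ psql -d postgres -c "select pg_size_pretty(pg_relation_size('sort_test')) as heap,
                              pg_size_pretty(pg_indexes_size('sort_test')) as indexes;"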
I think that this is probably due to the relatively slow I/O on this
server, and because the merge step is more of a bottleneck. As we
increase maintenance_work_mem, we're likely to then suffer from the
lack of explicit asynchronous I/O here. It helps, still, but not
dramatically. With maintenance_work_mem = 30GB, patch is somewhat
faster (no reason to think that this would help master at all, so that
was untested):

...
LOG:  starting quicksort of run 40: CPU 1815.99s/19339.80u sec elapsed
24910.38 sec
LOG:  finished quicksorting run 40: CPU 1820.09s/19565.94u sec elapsed
25140.69 sec
LOG:  finished writing run 40 to tape 39: CPU 1833.76s/19642.11u sec
elapsed 25234.44 sec
LOG:  performsort starting: CPU 1849.46s/19803.28u sec elapsed 25499.98 sec
LOG:  starting quicksort of run 41: CPU 1849.46s/19803.28u sec elapsed
25499.98 sec
LOG:  finished quicksorting run 41: CPU 1852.37s/20000.73u sec elapsed
25700.43 sec
LOG:  finished writing run 41 to tape 40: CPU 1864.89s/20069.09u sec
elapsed 25782.93 sec
LOG:  performsort done (except 41-way final merge): CPU
1965.43s/20086.28u sec elapsed 25980.80 sec
LOG:  external sort ended, 54740909 disk blocks used: CPU
3270.57s/31595.37u sec elapsed 40376.43 sec
CREATE INDEX
Time: 40383174.977 ms

So that takes 11:13:03 in total -- we only managed to shave about 20
minutes off the total time taken, despite a 10x increase in
maintenance_work_mem. Still, at least it gets moderately better, not
worse, which is certainly what I'd expect from the master branch. 60GB
was halfway between 3GB and 30GB in terms of performance, so it
doesn't continue to help, but, again, at least things don't get much
worse.

Thoughts on these results:

* I'd really like to know the role of I/O here. Better, low-overhead
instrumentation is required to see when and how we are I/O bound. I've
been doing much of that on a more-or-less ad hoc basis so far, using
iotop. I'm looking into a way to usefully graph the I/O activity over
many hours, to correlate with the trace_sort output that I'll also
show. I'm open to suggestions on the easiest way of doing that, having
not used the "perf" tool for instrumenting I/O at all in the past.

* Parallelism would probably help us here *a lot*.

* As I said, I think we suffer from the lack of asynchronous I/O much
more at this scale. Will need to confirm that theory.

* It seems kind of ill-advised to make run size (which is always in
linear proportion to maintenance_work_mem with this new approach to
sorting) larger, because it probably will hurt writing runs more than
it will help in making merging cheaper (perhaps mostly due to the lack
of asynchronous I/O to hide the latency of writes -- Linux might not
do so well at this scale).

* Maybe adding actual I/O bandwidth is the way to go to get a better
picture. I wouldn't be surprised if we were very bottlenecked on I/O
here. Might be worth using many parallel EBS volumes here, for
example.

[1] http://sortbenchmark.org/FAQ-2015.html
--
Peter Geoghegan
The machine in question still exists, so if you have questions about
it, or commands you'd like me to run to give you insight as to the I/O
capabilities of the machine, let me know. I can't guarantee we'll keep
the machine much longer.
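If it helps, one low-overhead way to capture the I/O picture over many
hours (my suggestion, assuming the sysstat package is installed) would
be to leave something like this running during the next test and line
its timestamps up against the elapsed times in the trace_sort output:

# one extended per-device sample per minute, with timestamps
$ nohup iostat -dmxt 60 > iostat.log 2>&1 &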
