Thread: PoC: Duplicate Tuple Elidation during External Sort for DISTINCT
Greetings -hackers:

I have worked up a patch to PostgreSQL which elides tuples during an
external sort. The primary use case is when sorted input is being used
to feed a DISTINCT operation. The idea is to throw out tuples that
compare as identical whenever it's convenient, predicated on the
assumption that even a single I/O is more expensive than some number of
(potentially extra) comparisons. Obviously, this is where a cost model
comes in, which has not been implemented. This patch is a
work-in-progress. Being very new to hacking PostgreSQL, I've probably
done a bunch of things wrong, but I've tried to test it thoroughly.

A rough summary of the patch follows:

- a GUC variable enables or disables this capability
- in nodeAgg.c, eliding duplicate tuples is enabled if the number of
  distinct columns is equal to the number of sort columns (and both are
  greater than zero)
- in createplan.c, eliding duplicate tuples is enabled if we are
  creating a unique plan which involves sorting first
- ditto planner.c
- all of the remaining changes are in tuplesort.c, and consist of:
  + a new macro, DISCARDTUP, and a new structure member, discardtup,
    are both defined and operate similarly to COMPARETUP, COPYTUP, etc.
  + in puttuple_common, when the state is TSS_BUILDRUNS, we *may*
    simply throw out the new tuple if it compares as identical to the
    tuple at the top of the heap. Since we're already performing this
    comparison, this is essentially free (a toy sketch of this check
    appears below).
  + in mergeonerun, we may discard a tuple if it compares as identical
    to the *last written tuple*. This is a comparison that did not take
    place before, so it's not free, but it saves a write I/O.
  + we perform the same logic in dumptuples

I have tested this patch with a bunch of different data; the results
are summarized below.

Test 1: a corpus of approximately 1.5TB of textual data, consisting of
7.4 billion input rows, 12.6% of which are unique. This is a
'real-world' test.

  Before the patch: 44.25 hours using 274GB of temporary files
  After the patch:  36.44 hours using 125GB of temporary files
  Difference: -7.8 hours, 18% faster

Test 2: acquire the unique set of strings from a totally random set
that contains no duplicates. This is a 'worst case'.

  Input:  700 million rows, 32GB
  Before: 18,796 seconds
  After:  19,358 seconds
  Difference: +562 seconds, *slower* by 3%

In this particular case, the performance varies from faster to slower,
and I'm not sure why. Statistical noise?

Test 3: acquire the unique set of integers from a set of 100 million,
all of which happen to have the value '1'. This is a 'best case'.

  Input:  100 million rows, 3.4GB on disk
  Before: 26.1 seconds using 1.3GB of temporary files
  After:   8.9 seconds using 8KB of disk
  Difference: -17.2 seconds, 65% faster

Test 4: similar to test 3, but using 1 billion rows.

  Before: 1450 seconds, 13GB of temporary files
  After:   468 seconds, 8KB of temporary files
  Difference: -982 seconds, 68% faster

See below for details on test 4.

Lastly, 'make check' passes without issues (modulo rangefuncs grumping
about changes in pg_settings).

Tests 1 through 3 were performed against PostgreSQL 9.4 HEAD as of
early September 2013. I have since rebased the patch against PostgreSQL
9.4 HEAD as of 19 January 2014 and re-run tests 2, 3, and 4 with
similar results.

Regarding the patch: I've tried to conform to the indenting and style
rules, including using pg_bsd_indent and pgindent, but I've not had
100% success.
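To make the "essentially free" claim concrete, here is a toy,
self-contained C sketch of the build-phase check. It is illustrative
only: the real patch operates on SortTuples inside puttuple_common via
COMPARETUP and the new DISCARDTUP, not on a bare integer min-heap, and
there the heap-top comparison is needed anyway to assign the tuple to a
run.

/*
 * elide.c -- a toy illustration, NOT the patch itself.
 *
 * While building runs, each incoming value is compared against the
 * top of the min-heap; when that comparison reports equality, the
 * value can be discarded at no extra cost.
 */
#include <stdio.h>

static int  heap[1024];     /* min-heap of pending values */
static int  nheap;
static long elided;

static void
sift_up(int i)
{
    while (i > 0)
    {
        int parent = (i - 1) / 2;
        int tmp;

        if (heap[parent] <= heap[i])
            break;
        tmp = heap[parent];
        heap[parent] = heap[i];
        heap[i] = tmp;
        i = parent;
    }
}

static void
put_value(int v)
{
    /*
     * In the real code this heap-top comparison happens regardless, to
     * pick the new tuple's run; here it only gates the discard.
     */
    if (nheap > 0 && v == heap[0])
    {
        elided++;           /* equal to the current minimum: drop it */
        return;
    }
    heap[nheap++] = v;
    sift_up(nheap - 1);
}

int
main(void)
{
    int  input[] = {5, 1, 1, 3, 1, 5, 2, 1};
    int  i;

    for (i = 0; i < 8; i++)
        put_value(input[i]);
    printf("kept %d values, elided %ld duplicates\n", nheap, elided);
    return 0;
}

Running it prints "kept 5 values, elided 3 duplicates": only values
equal to the current minimum are caught (both 5s survive), which is
exactly why the sorted output may still contain duplicates and the
downstream Unique node stays in the plan.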
The details on test 4: a simple example, using 1 billion rows all with
the value '1' (integer). This is a fabricated *best case*.

postgres=# set enable_hashagg = false;
SET
Time: 30.402 ms
postgres=# explain analyze verbose select distinct * from i ;
                                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=245942793.27..250942793.27 rows=1 width=4) (actual time=468188.900..468188.901 rows=1 loops=1)
   Output: i
   ->  Sort  (cost=245942793.27..248442793.27 rows=1000000000 width=4) (actual time=468188.897..468188.897 rows=1 loops=1)
         Output: i
         Sort Key: i.i
         Sort Method: external sort  Disk: 8kB
         ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=34.765..254595.990 rows=1000000000 loops=1)
               Output: i
 Total runtime: 468189.125 ms
(9 rows)

Time: 468189.755 ms
postgres=# set enable_opportunistic_tuple_elide = false;
SET
Time: 30.402 ms
postgres=# explain analyze verbose select distinct * from i ;
                                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=245942793.27..250942793.27 rows=1 width=4) (actual time=1047226.451..1449371.632 rows=1 loops=1)
   Output: i
   ->  Sort  (cost=245942793.27..248442793.27 rows=1000000000 width=4) (actual time=1047226.447..1284922.616 rows=1000000000 loops=1)
         Output: i
         Sort Key: i.i
         Sort Method: external sort  Disk: 13685248kB
         ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=29.832..450978.694 rows=1000000000 loops=1)
               Output: i
 Total runtime: 1449644.717 ms
(9 rows)

There are some issues that remain. For example, the following two
queries behave very differently (they *are* different queries, of
course). Compare this:

postgres=# explain ( analyze true, verbose true, costs true, buffers
true, timing true ) select count(distinct i.i) from i ;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16924779.00..16924779.01 rows=1 width=4) (actual time=894138.114..894138.115 rows=1 loops=1)
   Output: count(DISTINCT i)
   Buffers: shared hit=102 read=4424683, temp read=1466277 written=1466277
   ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=0.012..349142.072 rows=1000000000 loops=1)
         Output: i
         Buffers: shared hit=96 read=4424683
 Total runtime: 894138.233 ms
(7 rows)

Time: 894190.392 ms

to this (with this code path enabled):

postgres=# explain ( analyze true, verbose true, costs true, buffers
true, timing true ) select count(1) from (select distinct i.i from i)
as sub ;
 Aggregate  (cost=182583418.28..182583418.29 rows=1 width=0) (actual time=451973.381..451973.417 rows=1 loops=1)
   Output: count(1)
   Buffers: shared hit=192 read=4424587, temp read=1 written=1
   ->  Unique  (cost=177583418.27..182583418.27 rows=1 width=4) (actual time=451973.370..451973.372 rows=1 loops=1)
         Output: i.i
         Buffers: shared hit=192 read=4424587, temp read=1 written=1
         ->  Sort  (cost=177583418.27..180083418.27 rows=1000000000 width=4) (actual time=451973.366..451973.367 rows=1 loops=1)
               Output: i.i
               Sort Key: i.i
               Sort Method: external sort  Disk: 8kB
               Buffers: shared hit=192 read=4424587, temp read=1 written=1
               ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=0.016..223587.173 rows=1000000000 loops=1)
                     Output: i.i
                     Buffers: shared hit=192 read=4424587
 Total runtime: 451994.583 ms

(which also logged the following information)

LOG:  tuplesort_end: opportunistically elided 999999999 tuples
      (1864133 extra comparisons): 0 during initial accrual,
      998135866 during build, 0 during merge, 1864133 during dump

For comparison purposes, here is the same query but with hash
aggregation enabled:

 Aggregate  (cost=16924779.02..16924779.03 rows=1 width=0) (actual time=559379.525..559379.525 rows=1 loops=1)
   Output: count(1)
   Buffers: shared hit=224 read=4424555
   ->  HashAggregate  (cost=16924779.00..16924779.01 rows=1 width=4) (actual time=559379.513..559379.513 rows=1 loops=1)
         Output: i.i
         Group Key: i.i
         Buffers: shared hit=224 read=4424555
         ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=0.016..223704.468 rows=1000000000 loops=1)
               Output: i.i
               Buffers: shared hit=224 read=4424555
 Total runtime: 559379.608 ms
(11 rows)

A note: this work was supported by my employer, Renesys Corporation
(http://www.renesys.com/). The patch was formatted using 'unified'
diff, per http://wiki.postgresql.org/wiki/Submitting_a_Patch

--
Jon
Jon Nelson <jnelson+pgsql@jamponi.net> writes:
> A rough summary of the patch follows:
> - a GUC variable enables or disables this capability
> - in nodeAgg.c, eliding duplicate tuples is enabled if the number of
>   distinct columns is equal to the number of sort columns (and both
>   are greater than zero).
> - in createplan.c, eliding duplicate tuples is enabled if we are
>   creating a unique plan which involves sorting first
> - ditto planner.c
> - all of the remaining changes are in tuplesort.c, which consist of:
>   + a new macro, DISCARDTUP and a new structure member, discardtup,
>     are both defined and operate similar to COMPARETUP, COPYTUP, etc...
>   + in puttuple_common, when state is TSS_BUILDRUNS, we *may* simply
>     throw out the new tuple if it compares as identical to the tuple
>     at the top of the heap. Since we're already performing this
>     comparison, this is essentially free.
>   + in mergeonerun, we may discard a tuple if it compares as identical
>     to the *last written tuple*. This is a comparison that did not
>     take place before, so it's not free, but it saves a write I/O.
>   + We perform the same logic in dumptuples

[ raised eyebrow ... ]  And what happens if the planner drops the
unique step and then the sort doesn't actually go to disk?

			regards, tom lane
On Tue, Jan 21, 2014 at 9:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jon Nelson <jnelson+pgsql@jamponi.net> writes:
>> A rough summary of the patch follows:
>> [...]
>
> [ raised eyebrow ... ]  And what happens if the planner drops the
> unique step and then the sort doesn't actually go to disk?

I'm not familiar enough with the code to be able to answer your
question with any sort of authority, but I believe that if the state
TSS_BUILDRUNS is never hit, then basically nothing new happens.

--
Jon
On 22/01/14 03:16, Jon Nelson wrote:
> Greetings -hackers:
>
> I have worked up a patch to PostgreSQL which elides tuples during an
> external sort. The primary use case is when sorted input is being
> used to feed a DISTINCT operation. The idea is to throw out tuples
> that compare as identical whenever it's convenient, predicated on the
> assumption that even a single I/O is more expensive than some number
> of (potentially extra) comparisons. Obviously, this is where a cost
> model comes in, which has not been implemented. This patch is a
> work-in-progress.

Dedup-in-sort is also done by my WIP internal merge sort, and extended
(in much the same ways as Jon's) to the external merge.

https://github.com/j47996/pgsql_sorb

I've not done a cost model either, but the dedup capability is exposed
from tuplesort.c to the executor, and downstream Unique nodes are
removed. I've not yet worked out how to eliminate upstream HashAgg
nodes, which testing results suggest would be worthwhile. (A sketch of
the general dedup-in-merge idea follows this mail.)

--
Cheers,
  Jeremy
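For readers unfamiliar with dedup-in-merge, here is a toy,
self-contained C sketch of the idea. It is illustrative only and not
taken from the pgsql_sorb repository; it merges bare integers rather
than tuples. Because a merge emits its output in sorted order,
comparing each candidate against the last value emitted suffices to
drop every duplicate:

/*
 * merge_dedup.c -- a toy illustration of dedup-in-merge; NOT code
 * from the pgsql_sorb repository.
 */
#include <stdio.h>

/* Merge sorted arrays a and b into out, dropping duplicate values. */
static int
merge_dedup(const int *a, int na, const int *b, int nb, int *out)
{
    int i = 0, j = 0, n = 0;

    while (i < na || j < nb)
    {
        int v;

        if (j >= nb || (i < na && a[i] <= b[j]))
            v = a[i++];
        else
            v = b[j++];
        if (n == 0 || out[n - 1] != v)  /* skip if equal to last output */
            out[n++] = v;
    }
    return n;
}

int
main(void)
{
    int a[] = {1, 3, 3, 7};
    int b[] = {1, 2, 3, 8};
    int out[8];
    int n = merge_dedup(a, 4, b, 4, out);
    int k;

    for (k = 0; k < n; k++)
        printf("%d ", out[k]);      /* prints: 1 2 3 7 8 */
    printf("\n");
    return 0;
}

Unlike a heap-top check during run building, this catches every
duplicate the merge sees, at the cost of one extra comparison per
value emitted.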
On 22/01/14 03:53, Tom Lane wrote:
> Jon Nelson <jnelson+pgsql@jamponi.net> writes:
>> A rough summary of the patch follows:
>> [...]
>
> [ raised eyebrow ... ]  And what happens if the planner drops the
> unique step and then the sort doesn't actually go to disk?

I don't think Jon was suggesting that the planner drop the unique step.

--
Cheers,
  Jeremy
Jeremy Harris <jgh@wizmail.org> writes:
> On 22/01/14 03:53, Tom Lane wrote:
>> Jon Nelson <jnelson+pgsql@jamponi.net> writes:
>>> - in createplan.c, eliding duplicate tuples is enabled if we are
>>> creating a unique plan which involves sorting first

>> [ raised eyebrow ... ]  And what happens if the planner drops the
>> unique step and then the sort doesn't actually go to disk?

> I don't think Jon was suggesting that the planner drop the unique step.

Hm, OK, maybe I misread what he said there. Still, if we've told
tuplesort to remove duplicates, why shouldn't we expect it to have done
the job? Passing the data through a useless Unique step is not
especially cheap.

			regards, tom lane
On Wed, Jan 22, 2014 at 3:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeremy Harris <jgh@wizmail.org> writes:
>> On 22/01/14 03:53, Tom Lane wrote:
>>> Jon Nelson <jnelson+pgsql@jamponi.net> writes:
>>>> - in createplan.c, eliding duplicate tuples is enabled if we are
>>>> creating a unique plan which involves sorting first
>
>>> [ raised eyebrow ... ]  And what happens if the planner drops the
>>> unique step and then the sort doesn't actually go to disk?
>
>> I don't think Jon was suggesting that the planner drop the unique step.
>
> Hm, OK, maybe I misread what he said there. Still, if we've told
> tuplesort to remove duplicates, why shouldn't we expect it to have
> done the job? Passing the data through a useless Unique step is not
> especially cheap.

That's correct - I do not propose to drop the unique step. Duplicates
are only dropped if it's convenient to do so. In one case, it's a
zero-cost drop (no extra comparison is made). In most other cases, an
extra comparison is made, typically right before writing a tuple to
tape: if the tuple compares as identical to the previously-written
tuple, it's thrown out instead of being written (a sketch of that
check follows this mail). The output of the modified code is still
sorted, still *might* (and in most cases, probably will) contain
duplicates, but will (probably) contain fewer duplicates.

--
Jon
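A toy, self-contained C sketch of that write-side check. It is
illustrative only: dump_tuple and write_tuple_to_tape are hypothetical
stand-ins for the patch's dumptuples/mergeonerun logic and the real
WRITETUP/COMPARETUP machinery.

/*
 * write_elide.c -- a toy illustration, NOT the patch itself.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct
{
    int     key;                /* stand-in for a real sort key */
} Tuple;

static Tuple    last_written;
static bool     have_last;
static long     writes;
static long     elided;

static int
cmp_tuple(const Tuple *a, const Tuple *b)
{
    return (a->key > b->key) - (a->key < b->key);
}

static void
write_tuple_to_tape(const Tuple *tup)
{
    (void) tup;                 /* a real implementation writes to tape */
    writes++;
}

/*
 * Called once per tuple, in sorted order, as a run is dumped or
 * merged. The extra comparison against the last written tuple saves a
 * write whenever it reports equality.
 */
static void
dump_tuple(const Tuple *tup, bool elide_duplicates)
{
    if (elide_duplicates && have_last &&
        cmp_tuple(tup, &last_written) == 0)
    {
        elided++;
        return;                 /* skip the write entirely */
    }
    write_tuple_to_tape(tup);
    last_written = *tup;
    have_last = true;
}

int
main(void)
{
    Tuple   in[] = {{1}, {1}, {1}, {2}, {3}, {3}};
    int     i;

    for (i = 0; i < 6; i++)
        dump_tuple(&in[i], true);
    printf("wrote %ld tuples, elided %ld\n", writes, elided);
    return 0;
}

Each extra comparison buys at most one saved tape write; and because
the tuples arrive in sorted order, comparing against the last written
tuple is enough to catch every duplicate in the stream being written.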
What are my next steps here? I believe the concept is sound, the code
appears to work and doesn't crash, and the results do show a
performance win in most cases (sometimes a big win). It's also
incomplete, at least insofar as it doesn't interface with the cost
model at all, etc.

--
Jon
What - if anything - do I need to do to get this on the commitfest list for the next commitfest?
On Wed, Feb 5, 2014 at 10:33 AM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
> What - if anything - do I need to do to get this on the commitfest
> list for the next commitfest?

The list of instructions is here:
http://wiki.postgresql.org/wiki/Submitting_a_Patch#Patch_submission

The next commitfest (#1 for 9.5) will be in June and is here:
https://commitfest.postgresql.org/action/commitfest_view?id=22

Note that you will need an account on postgresql.org to register your
patch.

Regards,
--
Michael