PoC: Duplicate Tuple Elidation during External Sort for DISTINCT
From:    Jon Nelson
Subject: PoC: Duplicate Tuple Elidation during External Sort for DISTINCT
Msg-id:  CAKuK5J2x-5kbALXHO1kX-njU0+JLDf4DSfjp5v4JoH3-MCxOOg@mail.gmail.com
List:    pgsql-hackers
Greetings -hackers:

I have worked up a patch to PostgreSQL which elides tuples during an external sort. The primary use case is when sorted input is being used to feed a DISTINCT operation. The idea is to throw out tuples that compare as identical whenever it's convenient, predicated on the assumption that even a single I/O is more expensive than some number of (potentially extra) comparisons. Obviously, this is where a cost model comes in, which has not been implemented. This patch is a work in progress. Being very new to hacking PostgreSQL, I've probably done a bunch of things wrong, but I've tried to test it thoroughly.

A rough summary of the patch follows:

- a GUC variable (enable_opportunistic_tuple_elide, seen in the session output below) enables or disables this capability
- in nodeAgg.c, eliding duplicate tuples is enabled if the number of distinct columns is equal to the number of sort columns (and both are greater than zero)
- in createplan.c, eliding duplicate tuples is enabled if we are creating a unique plan which involves sorting first
- ditto planner.c
- all of the remaining changes are in tuplesort.c, and consist of:
  + a new macro, DISCARDTUP, and a new structure member, discardtup, are both defined and operate similarly to COMPARETUP, COPYTUP, etc.
  + in puttuple_common, when the state is TSS_BUILDRUNS, we *may* simply throw out the new tuple if it compares as identical to the tuple at the top of the heap. Since we're already performing this comparison, this is essentially free.
  + in mergeonerun, we may discard a tuple if it compares as identical to the *last written tuple*. This is a comparison that did not take place before, so it's not free, but it saves a write I/O.
  + we perform the same logic in dumptuples

(Self-contained toy sketches of both elision checks appear just before the test-4 details, below.)

I have tested this patch with a bunch of different data, and the results are summarized below.

Test 1: I used a corpus of approximately 1.5TB of textual data, consisting of 7.4 billion input rows, 12.6% of which are unique. This is a 'real-world' test.
  Before the patch: 44.25 hours using 274GB of temporary files
  After the patch:  36.44 hours using 125GB of temporary files
  Difference: -7.8 hours, 18% faster

Test 2: acquire the unique set of strings from a totally random set that contains no duplicates. This is a 'worst case'.
  Input:  700 million rows, 32GB
  Before: 18,796 seconds
  After:  19,358 seconds
  Difference: +562 seconds, *slower* by 3%
In this particular case, the performance varies from run to run, from faster to slower, and I'm not sure why. Statistical noise?

Test 3: acquire the unique set of integers from a set of 100 million, all of which happen to have the value '1'. This is a 'best case'.
  Input:  100 million rows, 3.4GB on disk
  Before: 26.1 seconds using 1.3GB of temporary files
  After:  8.9 seconds using 8KB of disk
  Difference: -17.2 seconds, 65% faster

Test 4: similar to test 3, but using 1 billion rows.
  Before: 1450 seconds, 13GB of temporary files
  After:  468 seconds, 8KB of temporary files
  Difference: -982 seconds, 68% faster
See below for the details on test 4.

Lastly, 'make check' passes without issues (modulo rangefuncs grumping about changes in pg_settings).

Tests 1 through 3 were performed against PostgreSQL 9.4 HEAD as of early September 2013. I have since rebased the patch against PostgreSQL 9.4 HEAD as of 19 January 2014 and re-run tests 2, 3, and 4 with similar results.

Regarding the patch: I've tried to conform to the indenting and style rules, including using pg_bsd_indent and pgindent, but I've not had 100% success.
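The patch itself is not inline here, but the two elision points are easy to picture. First, the puttuple_common check: while building runs, an incoming value is compared against the top of the heap anyway, so one that compares equal can be dropped at no extra comparison cost. The following is a self-contained toy in plain C of my own devising, not the patch; the real code works on SortTuples via the COMPARETUP/DISCARDTUP machinery described above, and duplicates that are not at the heap top survive (the elision is purely opportunistic; the Unique node above still does the real work).

#include <stdbool.h>
#include <stdio.h>

#define MAXHEAP 1024

static int heap[MAXHEAP];
static int nheap = 0;

static void
swap(int i, int j)
{
	int		tmp = heap[i];

	heap[i] = heap[j];
	heap[j] = tmp;
}

/* Standard sift-up for a binary min-heap. */
static void
sift_up(int i)
{
	while (i > 0 && heap[(i - 1) / 2] > heap[i])
	{
		swap((i - 1) / 2, i);
		i = (i - 1) / 2;
	}
}

/* Standard sift-down for a binary min-heap. */
static void
sift_down(int i)
{
	for (;;)
	{
		int		smallest = i;

		if (2 * i + 1 < nheap && heap[2 * i + 1] < heap[smallest])
			smallest = 2 * i + 1;
		if (2 * i + 2 < nheap && heap[2 * i + 2] < heap[smallest])
			smallest = 2 * i + 2;
		if (smallest == i)
			break;
		swap(i, smallest);
		i = smallest;
	}
}

/*
 * Insert a value, except that a value equal to the current minimum is
 * discarded.  The comparison against heap[0] mirrors one tuplesort already
 * performs while building runs, so detecting the duplicate is free.
 */
static bool
insert_elide(int value)
{
	if (nheap > 0 && value == heap[0])
		return false;			/* the "DISCARDTUP" case */
	heap[nheap] = value;
	sift_up(nheap++);
	return true;
}

static int
pop_min(void)
{
	int		min = heap[0];

	heap[0] = heap[--nheap];
	sift_down(0);
	return min;
}

int
main(void)
{
	int		input[] = {3, 1, 1, 1, 2, 3, 1};
	int		elided = 0;

	for (int i = 0; i < 7; i++)
		if (!insert_elide(input[i]))
			elided++;

	printf("elided %d duplicates; kept:", elided);
	while (nheap > 0)
		printf(" %d", pop_min());	/* prints: 1 2 3 3 -- note one dup survives */
	printf("\n");
	return 0;
}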
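Second, the mergeonerun/dumptuples check: remember the last value written to the output and skip any value that compares equal to it. This comparison is new work, but every hit saves a write I/O. Again a toy illustration under the same caveats (my sketch, not the patch; in the real code the skipped write is a WRITETUP to the temp tapes):

#include <stdbool.h>
#include <stdio.h>

/* Merge sorted runs a[] and b[] into out[], eliding duplicates. */
static int
merge_distinct(const int *a, int na, const int *b, int nb, int *out)
{
	int		ia = 0,
			ib = 0,
			nout = 0;
	bool	have_last = false;
	int		last = 0;

	while (ia < na || ib < nb)
	{
		int		next;

		if (ib >= nb || (ia < na && a[ia] <= b[ib]))
			next = a[ia++];
		else
			next = b[ib++];

		/* The elision step: compare against the last tuple written. */
		if (have_last && next == last)
			continue;			/* "DISCARDTUP": skip the write */

		out[nout++] = next;		/* "WRITETUP": one saved I/O per skip */
		last = next;
		have_last = true;
	}
	return nout;
}

int
main(void)
{
	int		a[] = {1, 1, 2, 4, 4};
	int		b[] = {1, 3, 4, 4, 5};
	int		out[10];
	int		n = merge_distinct(a, 5, b, 5, out);

	for (int i = 0; i < n; i++)
		printf("%d ", out[i]);	/* prints: 1 2 3 4 5 */
	printf("\n");
	return 0;
}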
The details on test 4, a simple example using 1 billion rows, all with the integer value '1' (a fabricated *best case*):

postgres=# set enable_hashagg = false;
SET
Time: 30.402 ms
postgres=# explain analyze verbose select distinct * from i ;
                                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=245942793.27..250942793.27 rows=1 width=4) (actual time=468188.900..468188.901 rows=1 loops=1)
   Output: i
   ->  Sort  (cost=245942793.27..248442793.27 rows=1000000000 width=4) (actual time=468188.897..468188.897 rows=1 loops=1)
         Output: i
         Sort Key: i.i
         Sort Method: external sort  Disk: 8kB
         ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=34.765..254595.990 rows=1000000000 loops=1)
               Output: i
 Total runtime: 468189.125 ms
(9 rows)

Time: 468189.755 ms
postgres=# set enable_opportunistic_tuple_elide = false;
SET
Time: 30.402 ms
postgres=# explain analyze verbose select distinct * from i ;
                                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=245942793.27..250942793.27 rows=1 width=4) (actual time=1047226.451..1449371.632 rows=1 loops=1)
   Output: i
   ->  Sort  (cost=245942793.27..248442793.27 rows=1000000000 width=4) (actual time=1047226.447..1284922.616 rows=1000000000 loops=1)
         Output: i
         Sort Key: i.i
         Sort Method: external sort  Disk: 13685248kB
         ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=29.832..450978.694 rows=1000000000 loops=1)
               Output: i
 Total runtime: 1449644.717 ms
(9 rows)

There are some issues that remain. For example, the following two queries behave very differently (they *are* different queries, of course). Compare this:

postgres=# explain ( analyze true, verbose true, costs true, buffers true, timing true ) select count(distinct i.i) from i ;
                                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16924779.00..16924779.01 rows=1 width=4) (actual time=894138.114..894138.115 rows=1 loops=1)
   Output: count(DISTINCT i)
   Buffers: shared hit=102 read=4424683, temp read=1466277 written=1466277
   ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=0.012..349142.072 rows=1000000000 loops=1)
         Output: i
         Buffers: shared hit=96 read=4424683
 Total runtime: 894138.233 ms
(7 rows)

Time: 894190.392 ms

to this (with this code path enabled):

postgres=# explain ( analyze true, verbose true, costs true, buffers true, timing true ) select count(1) from (select distinct i.i from i) as sub ;
 Aggregate  (cost=182583418.28..182583418.29 rows=1 width=0) (actual time=451973.381..451973.417 rows=1 loops=1)
   Output: count(1)
   Buffers: shared hit=192 read=4424587, temp read=1 written=1
   ->  Unique  (cost=177583418.27..182583418.27 rows=1 width=4) (actual time=451973.370..451973.372 rows=1 loops=1)
         Output: i.i
         Buffers: shared hit=192 read=4424587, temp read=1 written=1
         ->  Sort  (cost=177583418.27..180083418.27 rows=1000000000 width=4) (actual time=451973.366..451973.367 rows=1 loops=1)
               Output: i.i
               Sort Key: i.i
               Sort Method: external sort  Disk: 8kB
               Buffers: shared hit=192 read=4424587, temp read=1 written=1
               ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=0.016..223587.173 rows=1000000000 loops=1)
                     Output: i.i
                     Buffers: shared hit=192 read=4424587
 Total runtime: 451994.583 ms

which also logged the following information:

LOG: tuplesort_end: opportunistically elided 999999999 tuples (1864133 extra comparisons): 0 during initial accrual, 998135866 during build, 0 during merge, 1864133 during dump
For comparison purposes, here is the same query but with hash aggregation enabled:

 Aggregate  (cost=16924779.02..16924779.03 rows=1 width=0) (actual time=559379.525..559379.525 rows=1 loops=1)
   Output: count(1)
   Buffers: shared hit=224 read=4424555
   ->  HashAggregate  (cost=16924779.00..16924779.01 rows=1 width=4) (actual time=559379.513..559379.513 rows=1 loops=1)
         Output: i.i
         Group Key: i.i
         Buffers: shared hit=224 read=4424555
         ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=0.016..223704.468 rows=1000000000 loops=1)
               Output: i.i
               Buffers: shared hit=224 read=4424555
 Total runtime: 559379.608 ms
(11 rows)

A note: this work was supported by my employer, Renesys Corporation (http://www.renesys.com/).

The patch was formatted using 'unified' diff, per http://wiki.postgresql.org/wiki/Submitting_a_Patch

--
Jon