From: Jon Nelson
Subject: PoC: Duplicate Tuple Elidation during External Sort for DISTINCT
Msg-id: CAKuK5J2x-5kbALXHO1kX-njU0+JLDf4DSfjp5v4JoH3-MCxOOg@mail.gmail.com
List: pgsql-hackers
Greetings -hackers:

I have worked up a patch to PostgreSQL which elides duplicate tuples
during an external sort. The primary use case is when sorted input is
being used to feed a DISTINCT operation. The idea is to throw out tuples
that compare as identical whenever it's convenient, predicated on the
assumption that even a single I/O is more expensive than some number
of (potentially extra) comparisons. Obviously, this is where a cost
model would come in; that has not been implemented. This patch is a
work-in-progress.
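
To put illustrative numbers on that assumption (these are not
measurements from the patch, just back-of-the-envelope figures): if a
tuple comparison costs on the order of 100 ns, and writing and later
re-reading an 8 kB temp-file block costs on the order of a millisecond,
then each avoided block of I/O pays for roughly 10,000 extra
comparisons. A real cost model would also need an estimate of the
duplicate fraction to decide when the trade is worthwhile.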

Being very new to hacking PostgreSQL, I've probably done a bunch of
things wrong, but I've tried to test it thoroughly.

A rough summary of the patch follows:

- a GUC variable enables or disables this capability
- in nodeAgg.c, eliding duplicate tuples is enabled if the number of
  distinct columns is equal to the number of sort columns (and both are
  greater than zero).
- in createplan.c, eliding duplicate tuples is enabled if we are
  creating a unique plan which involves sorting first
- ditto planner.c
- all of the remaining changes are in tuplesort.c, which consist of:
  + a new macro, DISCARDTUP, and a new structure member, discardtup, are
    both defined; they operate similarly to COMPARETUP, COPYTUP, etc.
  + in puttuple_common, when state is TSS_BUILDRUNS, we *may* simply
    throw out the new tuple if it compares as identical to the tuple at
    the top of the heap. Since we're already performing this comparison,
    this is essentially free.
  + in mergeonerun, we may discard a tuple if it compares as identical
    to the *last written tuple*. This is a comparison that did not take
    place before, so it's not free, but it saves a write I/O.
  + the same logic is applied in dumptuples (see the sketch just below
    this list)
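
To make the tuplesort.c changes concrete, here is a minimal sketch of
the two checks. The field names (elideDuplicates, lastWritten, the
per-phase counters) and the rememberLastWritten() helper are invented
for illustration and are not the actual patch; COMPARETUP and WRITETUP
are the existing tuplesort.c macros, and DISCARDTUP is the new one
described above.

/*
 * Sketch only -- identifiers are illustrative, not the actual patch.
 *
 * (1) In puttuple_common, while state is TSS_BUILDRUNS: the incoming
 * tuple is already compared against the current heap top to decide
 * which run it belongs to, so an equality result lets us drop it at
 * essentially no extra cost.
 */
if (state->elideDuplicates && state->memtupcount > 0 &&
    COMPARETUP(state, tuple, &state->memtuples[0]) == 0)
{
    DISCARDTUP(state, tuple);   /* freed; never reaches disk */
    state->elidedDuringBuild++;
    return;
}

/*
 * (2) In mergeonerun and dumptuples: remember the last tuple written
 * to the output tape and skip anything equal to it.  This comparison
 * did not exist before, so it is not free, but each hit saves a write.
 */
if (state->elideDuplicates && state->haveLastWritten &&
    COMPARETUP(state, tuple, &state->lastWritten) == 0)
{
    DISCARDTUP(state, tuple);
    state->elidedDuringDump++;
}
else
{
    WRITETUP(state, tapenum, tuple);        /* destination tape */
    rememberLastWritten(state, tuple);      /* hypothetical helper */
}

Note that only the second check adds comparisons; the per-phase counts
in the LOG output near the end of this mail reflect that split.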

I have tested this patch with a bunch of different data, and the
results are summarized below.

Test 1:
I used a corpus of approximately 1.5TB of textual data, consisting of
7.4 billion input rows, 12.6% of which are unique. This is a 'real-world'
test.

Before the patch: 44.25 hours using 274GB of temporary files
After the patch: 36.44 hours using 125GB of temporary files
Difference: -7.8 hours, 18% faster

Test 2:
Acquire the unique set of strings from a totally random set that
contains no duplicates. This is a 'worst case'.

Input: 700 million rows, 32GB.
Before: 18,796 seconds.
After: 19,358 seconds
Difference: +562 seconds, *slower* by 3%.
In this particular case, performance varies from run to run, sometimes
faster and sometimes slower, and I'm not sure why. Statistical noise?

Test 3:
Acquire the unique set of integers from a set of 100 million rows, all
of which happen to have the value '1'. This is a 'best case'.

Input: 100 million rows, 3.4GB on-disk
Before: 26.1 seconds using 1.3GB of temporary files
After: 8.9 seconds using 8KB of disk
Difference: -17.2 seconds, 65% faster

Test 4:
Similar to test 3, but using 1 billion rows.
Before: 1450 seconds, 13 GB of temporary files
After: 468 seconds, 8 KB of temporary files
Difference: -982 seconds, 68% faster

See below for details on test 4.

Lastly, 'make check' passes without issues (modulo rangefuncs grumping about
changes in pg_settings).

Tests 1 through 3 were performed against PostgreSQL 9.4 HEAD as of
early September 2013.  I have rebased the patch against
PostgreSQL 9.4 HEAD as of 19 January 2014, and re-run tests
2, 3, and 4 with similar results.

Regarding the patch: I've tried to conform to the indentation and style
rules, including using pg_bsd_indent and pgindent, but I've not had
100% success.

The details on test 4:

A simple example, using 1 billion rows all with the value '1' (integer):
This is a fabricated *best case*.

postgres=# set enable_hashagg = false;
SET
Time: 30.402 ms
postgres=# explain analyze verbose select distinct * from i ;
                                                              QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=245942793.27..250942793.27 rows=1 width=4) (actual time=468188.900..468188.901 rows=1 loops=1)
   Output: i
   ->  Sort  (cost=245942793.27..248442793.27 rows=1000000000 width=4) (actual time=468188.897..468188.897 rows=1 loops=1)
         Output: i
         Sort Key: i.i
         Sort Method: external sort  Disk: 8kB
         ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=34.765..254595.990 rows=1000000000 loops=1)
               Output: i
 Total runtime: 468189.125 ms
(9 rows)

Time: 468189.755 ms
postgres=# set enable_opportunistic_tuple_elide = false;
SET
Time: 30.402 ms
postgres=# explain analyze verbose select distinct * from i ;
                                                              QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=245942793.27..250942793.27 rows=1 width=4) (actual time=1047226.451..1449371.632 rows=1 loops=1)
   Output: i
   ->  Sort  (cost=245942793.27..248442793.27 rows=1000000000 width=4) (actual time=1047226.447..1284922.616 rows=1000000000 loops=1)
         Output: i
         Sort Key: i.i
         Sort Method: external sort  Disk: 13685248kB
         ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=29.832..450978.694 rows=1000000000 loops=1)
               Output: i
 Total runtime: 1449644.717 ms
(9 rows)

There are some issues that remain.
For example, the following two queries behave very differently (they
*are* different queries, of course): the first performs the dedup
inside the Aggregate node, which feeds its own tuplesort, while the
second goes through an explicit Sort/Unique pair, and judging by the
temp buffer counts only the latter benefits from the elision here.
Compare this:
postgres=# explain ( analyze true, verbose true, costs true, buffers true, timing true ) select count(distinct i.i) from i ;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16924779.00..16924779.01 rows=1 width=4) (actual time=894138.114..894138.115 rows=1 loops=1)
   Output: count(DISTINCT i)
   Buffers: shared hit=102 read=4424683, temp read=1466277 written=1466277
   ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=0.012..349142.072 rows=1000000000 loops=1)
         Output: i
         Buffers: shared hit=96 read=4424683
 Total runtime: 894138.233 ms
(7 rows)

Time: 894190.392 ms
postgres=#

to this (with this code path enabled):

postgres=# explain ( analyze true, verbose true, costs true, buffers true, timing true ) select count(1) from (select distinct i.i from i) as sub ;

 Aggregate  (cost=182583418.28..182583418.29 rows=1 width=0) (actual time=451973.381..451973.417 rows=1 loops=1)
   Output: count(1)
   Buffers: shared hit=192 read=4424587, temp read=1 written=1
   ->  Unique  (cost=177583418.27..182583418.27 rows=1 width=4) (actual time=451973.370..451973.372 rows=1 loops=1)
         Output: i.i
         Buffers: shared hit=192 read=4424587, temp read=1 written=1
         ->  Sort  (cost=177583418.27..180083418.27 rows=1000000000 width=4) (actual time=451973.366..451973.367 rows=1 loops=1)
               Output: i.i
               Sort Key: i.i
               Sort Method: external sort  Disk: 8kB
               Buffers: shared hit=192 read=4424587, temp read=1 written=1
               ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=0.016..223587.173 rows=1000000000 loops=1)
                     Output: i.i
                     Buffers: shared hit=192 read=4424587
 Total runtime: 451994.583 ms

 (which also logged the following information)
LOG:  tuplesort_end: opportunistically elided 999999999 tuples (1864133 extra comparisons): 0 during initial accrual, 998135866 during build, 0 during merge, 1864133 during dump
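
Reading those counters against the sketch above: of the one billion
input rows, 998,135,866 duplicates were caught by the (free) heap-top
comparison while building runs, and the remaining 1,864,133 were caught
in dumptuples, which is exactly where all 1,864,133 extra (non-free)
comparisons were spent.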

For comparison purposes, here is the same query but with hash
aggregation enabled:

 Aggregate  (cost=16924779.02..16924779.03 rows=1 width=0) (actual time=559379.525..559379.525 rows=1 loops=1)
   Output: count(1)
   Buffers: shared hit=224 read=4424555
   ->  HashAggregate  (cost=16924779.00..16924779.01 rows=1 width=4) (actual time=559379.513..559379.513 rows=1 loops=1)
         Output: i.i
         Group Key: i.i
         Buffers: shared hit=224 read=4424555
         ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=0.016..223704.468 rows=1000000000 loops=1)
               Output: i.i
               Buffers: shared hit=224 read=4424555
 Total runtime: 559379.608 ms
(11 rows)


A note: this work was supported by my employer, Renesys Corporation (
http://www.renesys.com/ ).

The patch was formatted using 'unified' diff, per
http://wiki.postgresql.org/wiki/Submitting_a_Patch

--
Jon
