Home > mailing lists

Use LIMIT instead of Unique for DISTINCT when all distinct pathkeys are redundant - Mailing list pgsql-hackers

From	David Rowley
Subject	Use LIMIT instead of Unique for DISTINCT when all distinct pathkeys are redundant
Date	October 12, 2022 12:19:20
Msg-id	CAApHDvqS0j8RUWRUSgCAXxOqnYjHUXmKwspRj4GzVfOO25ByHA@mail.gmail.com Whole thread Raw
Responses	Re: Use LIMIT instead of Unique for DISTINCT when all distinct pathkeys are redundant (Richard Guo <guofenglinux@gmail.com>) Re: Use LIMIT instead of Unique for DISTINCT when all distinct pathkeys are redundant (Richard Guo <guofenglinux@gmail.com>)
List	pgsql-hackers

Tree view

Over on [1], Klint highlights a query with a DISTINCT which is a
little sub-optimal in PostgreSQL.  ISTM that in cases where all
DISTINCT pathkeys have been marked as redundant due to constants
existing in all of the EquivalenceClasses of the DISTINCT columns,
then it looks like it should be okay not to bother using a Unique node
to remove duplicate results.

When all the distinct pathkeys are redundant then there can only be,
at most, 1 single distinct value. There may be many rows with that
value, but we can remove those extra ones with a LIMIT 1 rather than
troubling over needlessly uniquifing them.

This might not be a hugely common case, but; 1) it is very cheap to
detect and 2) the speedups are likely to be *very* good.

With the attached we get:

regression=# explain (analyze, costs off, timing off) SELECT DISTINCT
four,1,2,3 FROM tenk1 WHERE four = 0;
                   QUERY PLAN
-------------------------------------------------
 Limit (actual rows=1 loops=1)
   ->  Seq Scan on tenk1 (actual rows=1 loops=1)
         Filter: (four = 0)
 Planning Time: 0.215 ms
 Execution Time: 0.071 ms

naturally, if we removed the WHERE four = 0, we can't optimise this
plan using this method.

I see no reason why this also can't work for DISTINCT ON too.

regression=# explain (analyze, costs off, timing off) SELECT DISTINCT
ON (four,two) four,two FROM tenk1 WHERE four = 0 order by 1,2;
                        QUERY PLAN
----------------------------------------------------------
 Unique (actual rows=1 loops=1)
   ->  Sort (actual rows=2500 loops=1)
         Sort Key: two
         Sort Method: quicksort  Memory: 175kB
         ->  Seq Scan on tenk1 (actual rows=2500 loops=1)
               Filter: (four = 0)
               Rows Removed by Filter: 7500
 Planning Time: 0.123 ms
 Execution Time: 4.251 ms
(9 rows)

then, of course, if we introduce some column that the pathkey is not
redundant for then we must do the distinct operation as normal.

regression=# explain (analyze, costs off, timing off) SELECT DISTINCT
four,two FROM tenk1 WHERE four = 0 order by 1,2;
                        QUERY PLAN
----------------------------------------------------------
 Sort (actual rows=1 loops=1)
   Sort Key: two
   Sort Method: quicksort  Memory: 25kB
   ->  HashAggregate (actual rows=1 loops=1)
         Group Key: four, two
         Batches: 1  Memory Usage: 24kB
         ->  Seq Scan on tenk1 (actual rows=2500 loops=1)
               Filter: (four = 0)
               Rows Removed by Filter: 7500
 Planning Time: 0.137 ms
 Execution Time: 4.274 ms
(11 rows)

Does this seem like something we'd want to do?

Patch attached.

David

[1] https://postgr.es/m/MEYPR01MB7101CD5DA0A07C9DE2B74850A4239@MEYPR01MB7101.ausprd01.prod.outlook.com

Attachment

v1-use-limit-instead-of-unique-for-distinct.patch

pgsql-hackers by date:

From: bt22nakamorit
Date: 12 October 2022, 12:13:07
Subject: Re: Make ON_ERROR_STOP stop on shell script failure

From: Yugo NAGATA
Date: 12 October 2022, 12:27:04
Subject: Re: make_ctags: use -I option to ignore pg_node_attr macro

Use LIMIT instead of Unique for DISTINCT when all distinct pathkeys are redundant - Mailing list pgsql-hackers

Attachment

Previous

Next