select DISTINCT - Mailing list pgsql-general

From pg noob
Subject select DISTINCT
Date
Msg-id CAPNY-2Utce-c+kNTwsMCbAk58=9mYeAeViTXT9LO7r1k77jukw@mail.gmail.com
Whole thread Raw
Responses Re: select DISTINCT  (Kevin Grittner <kgrittn@ymail.com>)
Re: select DISTINCT  (Jeff Janes <jeff.janes@gmail.com>)
List pgsql-general

Hi all,

I'm curious about some of the query estimates that I'm seeing with queries that use DISTINCT.
I am using postgres 8.4.13

I did a couple of quick tests, and found that PostgreSQL seems to do some expensive work to
return DISTINCT rows.  This is contrary to what I was expecting because I expected that
if it knows that it is building a distinct result set that it would maintain
some kind of ordering as it built the results to be able to quickly throw
away duplicates without any real overhead to the query.  This is different than an ORDER BY
which requires knowledge of all rows in the table before it can determine the
ordering.

I would think that distinct is something that can be determined even from a partial result set.

But what I find is that there is a significant difference between a count vs. a count distinct:

db=# explain analyze select count(column1) from table1;
                                                          QUERY PLAN                                                         
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=631397.32..631397.33 rows=1 width=8) (actual time=5265.143..5265.143 rows=1 loops=1)
   ->  Seq Scan on table1 (cost=0.00..591411.65 rows=15994265 width=8) (actual time=0.005..3392.991 rows=16163849 loops=1)
 Total runtime: 5265.170 ms
(3 rows)

db=# explain analyze select count(distinct(column1)) from table1;
                                                          QUERY PLAN                                                         
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=631397.32..631397.33 rows=1 width=8) (actual time=21828.644..21828.644 rows=1 loops=1)
   ->  Seq Scan on table1  (cost=0.00..591411.65 rows=15994265 width=8) (actual time=0.005..2742.391 rows=16163937 loops=1)
 Total runtime: 21828.736 ms
(3 rows)

Somehow the estimated query plan cost of the two queries is _exactly_ the same but the actual
time takes 4 times longer for one than the other to execute.
And it is the query returning less rows in the result which takes more time.
That says that there is work being done to return distinct results that is not accounted for by the
query plan.
(I tried this several times on different systems because I could not believe it.
But it seems consistent.)

But it gets a bit stranger than that.
According to the docs,
“DISTINCT ON column will eliminate all duplicates in the specified column; this is equivalent to using GROUP BY column.”

Note that it says that using distinct is EQUIVALENT to using GROUP BY.
And yet, look at the execution time difference of DISTINCT vs. GROUP BY:

db=# explain analyze select count(distinct(column1)) from table1;
                                                          QUERY PLAN                                                         
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=631397.32..631397.33 rows=1 width=8) (actual time=18596.606..18596.607 rows=1 loops=1)
   ->  Seq Scan on table1  (cost=0.00..591411.65 rows=15994265 width=8) (actual time=0.003..2391.644 rows=16151368 loops=1)
 Total runtime: 18596.631 ms
(3 rows)

db=# explain analyze select count(foo.column1) from (select column1 from table1 group by column1) as foo;
 
                                                             QUERY PLAN                                                            
------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=631419.39..631419.40 rows=1 width=8) (actual time=4954.187..4954.188 rows=1 loops=1)
   ->  HashAggregate  (cost=631397.31..631407.12 rows=981 width=8) (actual time=4953.477..4954.044 rows=2389 loops=1)
         ->  Seq Scan on table1  (cost=0.00..591411.65 rows=15994265 width=8) (actual time=0.004..2050.133 rows=16151805 loops=1)
 Total runtime: 4954.223 ms
(4 rows)

The GROUP BY performs much better than DISTINCT even though both these two queries return the exact same count result.

In the above examples, column1 is type bigint and it is not indexed.

Any ideas why this is?  Or what I am doing wrong?

Thank you.


pgsql-general by date:

Previous
From: "Janek Sendrowski"
Date:
Subject: Re: Levenshtein Distance with more than 255 characters
Next
From: Peter Geoghegan
Date:
Subject: Re: Call for design: PostgreSQL mugs