Re: select DISTINCT - Mailing list pgsql-general

From Jeff Janes
Subject Re: select DISTINCT
Date
Msg-id CAMkU=1y58XfcYeds9RMAHad_E=Y=N9mpgyBLiqqrw+0-_MsbJQ@mail.gmail.com
Whole thread Raw
In response to select DISTINCT  (pg noob <pgnube@gmail.com>)
Responses Re: select DISTINCT  (pg noob <pgnube@gmail.com>)
List pgsql-general
On Friday, September 6, 2013, pg noob wrote:

Hi all,

I'm curious about some of the query estimates that I'm seeing with queries that use DISTINCT.
I am using postgres 8.4.13

I did a couple of quick tests, and found that PostgreSQL seems to do some expensive work to
return DISTINCT rows.  This is contrary to what I was expecting because I expected that
if it knows that it is building a distinct result set that it would maintain
some kind of ordering as it built the results to be able to quickly throw
away duplicates without any real overhead to the query.

There is no cost-free way of doing that.   You either have to sort, or hash, or walk in index order, and none of those things are free.

...
 
But it gets a bit stranger than that.
According to the docs,
“DISTINCT ON column will eliminate all duplicates in the specified column; this is equivalent to using GROUP BY column.”

DISTINCT ON is not the same thing as DISTINCT, which is not the same as count(distinct ...)


 

Note that it says that using distinct is EQUIVALENT to using GROUP BY.
And yet, look at the execution time difference of DISTINCT vs. GROUP BY:

db=# explain analyze select count(distinct(column1)) from table1;
                                                          QUERY PLAN                                                         
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=631397.32..631397.33 rows=1 width=8) (actual time=18596.606..18596.607 rows=1 loops=1)
   ->  Seq Scan on table1  (cost=0.00..591411.65 rows=15994265 width=8) (actual time=0.003..2391.644 rows=16151368 loops=1)
 Total runtime: 18596.631 ms
(3 rows)

db=# explain analyze select count(foo.column1) from (select column1 from table1 group by column1) as foo;

Compare that to:

explain analyze select count(*) from (select DISTINCT column1 from table1) as foo;
 

Any ideas why this is?  Or what I am doing wrong?

The code that executes "count(distinct ....)" has never been taught how to use hash aggregation, whereas DISTINCT and GROUP BY have been.  It always sorts, which is often much slower than a hash, and also often much less memory efficient.   I speculate that this is because the implementation of count(distinct ...) is really ancient code and never been brought up to date, either about hashing or about more thorough EXPLAIN estimates--I have been meaning to dig into it but haven't gotten around to it yet.

You can get the increased performance by just spelling it in the more verbose way, but be aware that count (distinct ...) deals with NULL differently than select count(*) from (select DISTINCT...)  does, so they are not exactly identical.
 
Cheers,

Jeff

pgsql-general by date:

Previous
From: Kevin Grittner
Date:
Subject: Re: select DISTINCT
Next
From: Steve Atkins
Date:
Subject: Constraint exclusion and overlapping range checks