Thread: DISTINCT vs. GROUP BY

DISTINCT vs. GROUP BY

From
Dimi Paun
Date:
From what I've read on the net, these should be very similar,
and should generate equivalent plans, in such cases:

SELECT DISTINCT x FROM mytable
SELECT x FROM mytable GROUP BY x

However, in my case (postgresql-server-8.1.18-2.el5_4.1),
they generated different results with quite different
execution times (73ms vs 40ms for DISTINCT and GROUP BY
respectively):

tts_server_db=# EXPLAIN ANALYZE select userdata from tagrecord where clientRmaInId = 'CPC-RMA-00110' group by userdata;
                                                                 QUERY PLAN
                    

--------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=775.68..775.69 rows=1 width=146) (actual time=40.058..40.058 rows=0 loops=1)
   ->  Bitmap Heap Scan on tagrecord  (cost=4.00..774.96 rows=286 width=146) (actual time=40.055..40.055 rows=0
loops=1)
         Recheck Cond: ((clientrmainid)::text = 'CPC-RMA-00110'::text)
         ->  Bitmap Index Scan on idx_tagdata_clientrmainid  (cost=0.00..4.00 rows=286 width=0) (actual
time=40.050..40.050rows=0 loops=1) 
               Index Cond: ((clientrmainid)::text = 'CPC-RMA-00110'::text)
 Total runtime: 40.121 ms

tts_server_db=# EXPLAIN ANALYZE select distinct userdata from tagrecord where clientRmaInId = 'CPC-RMA-00109';
                                                                    QUERY PLAN
                          

--------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=786.63..788.06 rows=1 width=146) (actual time=73.018..73.018 rows=0 loops=1)
   ->  Sort  (cost=786.63..787.34 rows=286 width=146) (actual time=73.016..73.016 rows=0 loops=1)
         Sort Key: userdata
         ->  Bitmap Heap Scan on tagrecord  (cost=4.00..774.96 rows=286 width=146) (actual time=72.940..72.940 rows=0
loops=1)
               Recheck Cond: ((clientrmainid)::text = 'CPC-RMA-00109'::text)
               ->  Bitmap Index Scan on idx_tagdata_clientrmainid  (cost=0.00..4.00 rows=286 width=0) (actual
time=72.936..72.936rows=0 loops=1) 
                     Index Cond: ((clientrmainid)::text = 'CPC-RMA-00109'::text)
 Total runtime: 73.144 ms

What gives?

--
Dimi Paun <dimi@lattica.com>
Lattica, Inc.


Re: DISTINCT vs. GROUP BY

From
Thom Brown
Date:
On 9 February 2010 21:46, Dimi Paun <dimi@lattica.com> wrote:
> >From what I've read on the net, these should be very similar,
> and should generate equivalent plans, in such cases:
>
> SELECT DISTINCT x FROM mytable
> SELECT x FROM mytable GROUP BY x
>
> However, in my case (postgresql-server-8.1.18-2.el5_4.1),
> they generated different results with quite different
> execution times (73ms vs 40ms for DISTINCT and GROUP BY
> respectively):
>
> tts_server_db=# EXPLAIN ANALYZE select userdata from tagrecord where clientRmaInId = 'CPC-RMA-00110' group by
userdata;
>                                                                 QUERY PLAN
>
--------------------------------------------------------------------------------------------------------------------------------------------
>  HashAggregate  (cost=775.68..775.69 rows=1 width=146) (actual time=40.058..40.058 rows=0 loops=1)
>   ->  Bitmap Heap Scan on tagrecord  (cost=4.00..774.96 rows=286 width=146) (actual time=40.055..40.055 rows=0
loops=1)
>         Recheck Cond: ((clientrmainid)::text = 'CPC-RMA-00110'::text)
>         ->  Bitmap Index Scan on idx_tagdata_clientrmainid  (cost=0.00..4.00 rows=286 width=0) (actual
time=40.050..40.050rows=0 loops=1) 
>               Index Cond: ((clientrmainid)::text = 'CPC-RMA-00110'::text)
>  Total runtime: 40.121 ms
>
> tts_server_db=# EXPLAIN ANALYZE select distinct userdata from tagrecord where clientRmaInId = 'CPC-RMA-00109';
>                                                                    QUERY PLAN
>
--------------------------------------------------------------------------------------------------------------------------------------------------
>  Unique  (cost=786.63..788.06 rows=1 width=146) (actual time=73.018..73.018 rows=0 loops=1)
>   ->  Sort  (cost=786.63..787.34 rows=286 width=146) (actual time=73.016..73.016 rows=0 loops=1)
>         Sort Key: userdata
>         ->  Bitmap Heap Scan on tagrecord  (cost=4.00..774.96 rows=286 width=146) (actual time=72.940..72.940 rows=0
loops=1)
>               Recheck Cond: ((clientrmainid)::text = 'CPC-RMA-00109'::text)
>               ->  Bitmap Index Scan on idx_tagdata_clientrmainid  (cost=0.00..4.00 rows=286 width=0) (actual
time=72.936..72.936rows=0 loops=1) 
>                     Index Cond: ((clientrmainid)::text = 'CPC-RMA-00109'::text)
>  Total runtime: 73.144 ms
>
> What gives?
>
Firstly, the 2 queries aren't equal.  They're matching against
different clientrmainid values.

Also, look at the bitmap index scan for each:

Bitmap Index Scan on idx_tagdata_clientrmainid  (cost=0.00..4.00
rows=286 width=0) (actual time=40.050..40.050 rows=0 loops=1)

Bitmap Index Scan on idx_tagdata_clientrmainid  (cost=0.00..4.00
rows=286 width=0) (actual time=72.936..72.936 rows=0 loops=1)

That's where the difference is.  An identical scan takes longer in one
than the other, either due to the index scan looking for different
values in each case, or at the time you were running it, another
process was using more resources.  You'd have to run these several
times to get an idea of average times.

Have you run ANALYZE on the table beforehand to make sure your stats
are up to date?

Regards

Thom

Re: DISTINCT vs. GROUP BY

From
Tom Lane
Date:
Dimi Paun <dimi@lattica.com> writes:
>> From what I've read on the net, these should be very similar,
> and should generate equivalent plans, in such cases:

> SELECT DISTINCT x FROM mytable
> SELECT x FROM mytable GROUP BY x

> However, in my case (postgresql-server-8.1.18-2.el5_4.1),
> they generated different results with quite different
> execution times (73ms vs 40ms for DISTINCT and GROUP BY
> respectively):

The results certainly ought to be the same (although perhaps not with
the same ordering) --- if they aren't, please provide a reproducible
test case.

As for efficiency, though, 8.1 didn't understand how to use hash
aggregation for DISTINCT.  Less-obsolete versions do know how to do
that.

            regards, tom lane

Re: DISTINCT vs. GROUP BY

From
Dimi Paun
Date:
On Tue, 2010-02-09 at 17:38 -0500, Tom Lane wrote:
> The results certainly ought to be the same (although perhaps not with
> the same ordering) --- if they aren't, please provide a reproducible
> test case.

The results are the same, this is not a problem.

> As for efficiency, though, 8.1 didn't understand how to use hash
> aggregation for DISTINCT.  Less-obsolete versions do know how to do
> that.

Indeed, this seem to be the issue:

tts_server_db=# EXPLAIN ANALYZE select userdata from tagrecord where clientRmaInId = 'CPC-RMA-00110' group by userdata;
                                                                   QUERY PLAN
                        

------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=253.34..253.50 rows=16 width=15) (actual time=0.094..0.094 rows=0 loops=1)
   ->  Index Scan using idx_tagdata_clientrmainid on tagrecord  (cost=0.00..252.85 rows=195 width=15) (actual
time=0.091..0.091rows=0 loops=1) 
         Index Cond: ((clientrmainid)::text = 'CPC-RMA-00110'::text)
 Total runtime: 0.146 ms
(4 rows)

tts_server_db=# EXPLAIN ANALYZE select distinct userdata from tagrecord where clientRmaInId = 'CPC-RMA-00110';
                                                                      QUERY PLAN
                              

------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=260.27..261.25 rows=16 width=15) (actual time=0.115..0.115 rows=0 loops=1)
   ->  Sort  (cost=260.27..260.76 rows=195 width=15) (actual time=0.113..0.113 rows=0 loops=1)
         Sort Key: userdata
         ->  Index Scan using idx_tagdata_clientrmainid on tagrecord  (cost=0.00..252.85 rows=195 width=15) (actual
time=0.105..0.105rows=0 loops=1) 
               Index Cond: ((clientrmainid)::text = 'CPC-RMA-00110'::text)
 Total runtime: 0.151 ms
(6 rows)

For now we are stuck with 8.1, so the easiest fix for us is to use GROUP BY.
Since this is fixed in later versions, I guess there's not much to see here... :)

Thanks for the quick reply!

--
Dimi Paun <dimi@lattica.com>
Lattica, Inc.