Re: DISTINCT vs. GROUP BY - Mailing list pgsql-performance

From Thom Brown
Subject Re: DISTINCT vs. GROUP BY
Date
Msg-id bddc86151002091422k7e401785t2407f6227643d1dd@mail.gmail.com
Whole thread Raw
In response to DISTINCT vs. GROUP BY  (Dimi Paun <dimi@lattica.com>)
List pgsql-performance
On 9 February 2010 21:46, Dimi Paun <dimi@lattica.com> wrote:
> >From what I've read on the net, these should be very similar,
> and should generate equivalent plans, in such cases:
>
> SELECT DISTINCT x FROM mytable
> SELECT x FROM mytable GROUP BY x
>
> However, in my case (postgresql-server-8.1.18-2.el5_4.1),
> they generated different results with quite different
> execution times (73ms vs 40ms for DISTINCT and GROUP BY
> respectively):
>
> tts_server_db=# EXPLAIN ANALYZE select userdata from tagrecord where clientRmaInId = 'CPC-RMA-00110' group by
userdata;
>                                                                 QUERY PLAN
>
--------------------------------------------------------------------------------------------------------------------------------------------
>  HashAggregate  (cost=775.68..775.69 rows=1 width=146) (actual time=40.058..40.058 rows=0 loops=1)
>   ->  Bitmap Heap Scan on tagrecord  (cost=4.00..774.96 rows=286 width=146) (actual time=40.055..40.055 rows=0
loops=1)
>         Recheck Cond: ((clientrmainid)::text = 'CPC-RMA-00110'::text)
>         ->  Bitmap Index Scan on idx_tagdata_clientrmainid  (cost=0.00..4.00 rows=286 width=0) (actual
time=40.050..40.050rows=0 loops=1) 
>               Index Cond: ((clientrmainid)::text = 'CPC-RMA-00110'::text)
>  Total runtime: 40.121 ms
>
> tts_server_db=# EXPLAIN ANALYZE select distinct userdata from tagrecord where clientRmaInId = 'CPC-RMA-00109';
>                                                                    QUERY PLAN
>
--------------------------------------------------------------------------------------------------------------------------------------------------
>  Unique  (cost=786.63..788.06 rows=1 width=146) (actual time=73.018..73.018 rows=0 loops=1)
>   ->  Sort  (cost=786.63..787.34 rows=286 width=146) (actual time=73.016..73.016 rows=0 loops=1)
>         Sort Key: userdata
>         ->  Bitmap Heap Scan on tagrecord  (cost=4.00..774.96 rows=286 width=146) (actual time=72.940..72.940 rows=0
loops=1)
>               Recheck Cond: ((clientrmainid)::text = 'CPC-RMA-00109'::text)
>               ->  Bitmap Index Scan on idx_tagdata_clientrmainid  (cost=0.00..4.00 rows=286 width=0) (actual
time=72.936..72.936rows=0 loops=1) 
>                     Index Cond: ((clientrmainid)::text = 'CPC-RMA-00109'::text)
>  Total runtime: 73.144 ms
>
> What gives?
>
Firstly, the 2 queries aren't equal.  They're matching against
different clientrmainid values.

Also, look at the bitmap index scan for each:

Bitmap Index Scan on idx_tagdata_clientrmainid  (cost=0.00..4.00
rows=286 width=0) (actual time=40.050..40.050 rows=0 loops=1)

Bitmap Index Scan on idx_tagdata_clientrmainid  (cost=0.00..4.00
rows=286 width=0) (actual time=72.936..72.936 rows=0 loops=1)

That's where the difference is.  An identical scan takes longer in one
than the other, either due to the index scan looking for different
values in each case, or at the time you were running it, another
process was using more resources.  You'd have to run these several
times to get an idea of average times.

Have you run ANALYZE on the table beforehand to make sure your stats
are up to date?

Regards

Thom

pgsql-performance by date:

Previous
From: Dimi Paun
Date:
Subject: DISTINCT vs. GROUP BY
Next
From: Tom Lane
Date:
Subject: Re: DISTINCT vs. GROUP BY