Thread: Why are distinct and group by choosing different plans?

Why are distinct and group by choosing different plans?

From

Ron Mayer

Date:

02 August 2007, 18:32:21

I notice that I get different plans when I run the
following two queries that I thought would be
identical.

  select distinct test_col from mytable;
  select test_col from mytable group by test_col;

Any reason why it favors one in one case but not the other?



d=# explain analyze select distinct test_col from mytable;
                                                                         QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=0.00..14927.69 rows=27731 width=4) (actual time=0.144..915.214 rows=208701 loops=1)
   ->  Index Scan using "mytable(test_col)" on mytable  (cost=0.00..14160.38 rows=306925 width=4) (actual
time=0.140..575.580rows=306925 loops=1) 
 Total runtime: 1013.657 ms
(3 rows)

d=# explain analyze select test_col from mytable group by test_col;
                                                       QUERY PLAN

------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=7241.56..7518.87 rows=27731 width=4) (actual time=609.058..745.295 rows=208701 loops=1)
   ->  Seq Scan on mytable  (cost=0.00..6474.25 rows=306925 width=4) (actual time=0.063..280.000 rows=306925 loops=1)
 Total runtime: 840.321 ms
(3 rows)


d=# select version();
                                                    version
----------------------------------------------------------------------------------------------------------------
 PostgreSQL 8.2.1 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
(1 row)

Re: Why are distinct and group by choosing different plans?

From

Gregory Stark

Date:

02 August 2007, 19:37:56

"Ron Mayer" <rm_pg@cheapcomplexdevices.com> writes:

> I notice that I get different plans when I run the
> following two queries that I thought would be
> identical.
>
>   select distinct test_col from mytable;
>   select test_col from mytable group by test_col;
>
> Any reason why it favors one in one case but not the other?

I think "distinct" just doesn't know about hash aggregates yet. That's partly
an oversight and partly of a "feature" in that it gives a convenient way to
write a query which avoids them. I think it's also partly that "distinct" is
trickier to fix because it's the same codepath as "distinct on" which is
decidedly more complex than a simple "distinct".

> d=# explain analyze select distinct test_col from mytable;
>                                                                          QUERY PLAN
>
------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Unique  (cost=0.00..14927.69 rows=27731 width=4) (actual time=0.144..915.214 rows=208701 loops=1)
>    ->  Index Scan using "mytable(test_col)" on mytable  (cost=0.00..14160.38 rows=306925 width=4) (actual
time=0.140..575.580rows=306925 loops=1) 
>  Total runtime: 1013.657 ms
> (3 rows)

I assume you have random_page_cost dialled way down? The costs seem too low
for the default random_page_cost. This query would usually generate a sort
rather than an index scan.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: Why are distinct and group by choosing different plans?

From

Tom Lane

Date:

02 August 2007, 21:48:13

Gregory Stark <stark@enterprisedb.com> writes:
> I think "distinct" just doesn't know about hash aggregates yet. That's partly
> an oversight and partly of a "feature" in that it gives a convenient way to
> write a query which avoids them. I think it's also partly that "distinct" is
> trickier to fix because it's the same codepath as "distinct on" which is
> decidedly more complex than a simple "distinct".

It's not an oversight :-(.  But the DISTINCT/DISTINCT ON code is old,
crufty, and tightly entwined with ORDER BY processing.  It'd be nice to
clean it all up someday, but the effort seems a bit out of proportion
to the reward...

            regards, tom lane