Thread: Why are distinct and group by choosing different plans?
I notice that I get different plans when I run the following two queries that I thought would be identical. select distinct test_col from mytable; select test_col from mytable group by test_col; Any reason why it favors one in one case but not the other? d=# explain analyze select distinct test_col from mytable; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------ Unique (cost=0.00..14927.69 rows=27731 width=4) (actual time=0.144..915.214 rows=208701 loops=1) -> Index Scan using "mytable(test_col)" on mytable (cost=0.00..14160.38 rows=306925 width=4) (actual time=0.140..575.580rows=306925 loops=1) Total runtime: 1013.657 ms (3 rows) d=# explain analyze select test_col from mytable group by test_col; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------ HashAggregate (cost=7241.56..7518.87 rows=27731 width=4) (actual time=609.058..745.295 rows=208701 loops=1) -> Seq Scan on mytable (cost=0.00..6474.25 rows=306925 width=4) (actual time=0.063..280.000 rows=306925 loops=1) Total runtime: 840.321 ms (3 rows) d=# select version(); version ---------------------------------------------------------------------------------------------------------------- PostgreSQL 8.2.1 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21) (1 row)
"Ron Mayer" <rm_pg@cheapcomplexdevices.com> writes: > I notice that I get different plans when I run the > following two queries that I thought would be > identical. > > select distinct test_col from mytable; > select test_col from mytable group by test_col; > > Any reason why it favors one in one case but not the other? I think "distinct" just doesn't know about hash aggregates yet. That's partly an oversight and partly of a "feature" in that it gives a convenient way to write a query which avoids them. I think it's also partly that "distinct" is trickier to fix because it's the same codepath as "distinct on" which is decidedly more complex than a simple "distinct". > d=# explain analyze select distinct test_col from mytable; > QUERY PLAN > ------------------------------------------------------------------------------------------------------------------------------------------------------------ > Unique (cost=0.00..14927.69 rows=27731 width=4) (actual time=0.144..915.214 rows=208701 loops=1) > -> Index Scan using "mytable(test_col)" on mytable (cost=0.00..14160.38 rows=306925 width=4) (actual time=0.140..575.580rows=306925 loops=1) > Total runtime: 1013.657 ms > (3 rows) I assume you have random_page_cost dialled way down? The costs seem too low for the default random_page_cost. This query would usually generate a sort rather than an index scan. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes: > I think "distinct" just doesn't know about hash aggregates yet. That's partly > an oversight and partly of a "feature" in that it gives a convenient way to > write a query which avoids them. I think it's also partly that "distinct" is > trickier to fix because it's the same codepath as "distinct on" which is > decidedly more complex than a simple "distinct". It's not an oversight :-(. But the DISTINCT/DISTINCT ON code is old, crufty, and tightly entwined with ORDER BY processing. It'd be nice to clean it all up someday, but the effort seems a bit out of proportion to the reward... regards, tom lane