Re: Multiple Uniques - Mailing list pgsql-performance
From | Markus Schaber |
---|---|
Subject | Re: Multiple Uniques |
Date | |
Msg-id | 20040906175618.5497a124@kingfisher.intern.logi-track.com Whole thread Raw |
In response to | Re: Multiple Uniques (Greg Stark <gsstark@mit.edu>) |
List | pgsql-performance |
Hi, Greg, On 02 Sep 2004 15:33:38 -0400 Greg Stark <gsstark@mit.edu> wrote: > Markus Schaber <schabios@logi-track.com> writes: > > > logigis=# explain select count(id) from (select ref_in_id as id from streets union select nref_in_id as id from streets)as blubb; > > QUERY PLAN > > --------------------------------------------------------------------------------------------------------- > > Aggregate (cost=16220893.16..16220893.16 rows=1 width=8) > > -> Subquery Scan blubb (cost=15254815.03..16082881.99 rows=55204464 width=8) > > -> Unique (cost=15254815.03..15530837.35 rows=55204464 width=8) > > -> Sort (cost=15254815.03..15392826.19 rows=55204464 width=8) > > Sort Key: id > > -> Append (cost=0.00..6810225.28 rows=55204464 width=8) > > -> Subquery Scan "*SELECT* 1" (cost=0.00..3405112.64 rows=27602232 width=8) > > -> Seq Scan on streets (cost=0.00..3129090.32 rows=27602232 width=8) > > -> Subquery Scan "*SELECT* 2" (cost=0.00..3405112.64 rows=27602232 width=8) > > -> Seq Scan on streets (cost=0.00..3129090.32 rows=27602232 width=8) > > You can actually go one step further and do: > > select count(distinct id) from (select ... union all select ...) as blubb; Hmm, as query plan, it gives me: > logigis=# explain select count(distinct id) from (select ref_in_id as id from streets union all select nref_in_id as idfrom streets) as blubb; > QUERY PLAN > --------------------------------------------------------------------------------------------- > Aggregate (cost=7500281.08..7500281.08 rows=1 width=8) > -> Subquery Scan blubb (cost=0.00..7362269.92 rows=55204464 width=8) > -> Append (cost=0.00..6810225.28 rows=55204464 width=8) > -> Subquery Scan "*SELECT* 1" (cost=0.00..3405112.64 rows=27602232 width=8) > -> Seq Scan on streets (cost=0.00..3129090.32 rows=27602232 width=8) > -> Subquery Scan "*SELECT* 2" (cost=0.00..3405112.64 rows=27602232 width=8) > -> Seq Scan on streets (cost=0.00..3129090.32 rows=27602232 width=8) > (7 rows) This leaves off the sort and unique completely, and leaves all work to the count aggrecation. Maybe this is implemented by hashing and thus somehow more efficient... > I'm not sure why this is any faster since it still has to do all the same > work, but it's a different code path and it seems to be about 20% faster for > me. When comparing the actual timings, I get: > logigis=# \timing > Timing is on. > logigis=# select count(distinct id) from (select ref_in_id as id from streets union all select nref_in_id as id from streets)as blubb; > count > ---------- > 20225596 > (1 row) > > Time: 1339098.226 ms > logigis=# select count(id) from (select ref_in_id as id from streets union select nref_in_id as id from streets) as blubb; > count > ---------- > 20225596 > (1 row) > > Time: 1558920.784 ms > logigis=# So you're right, its faster this way. There seems to be some room to play with optimizer enhancements. But simply removing all distincts from subselects would not be the best way to go, as reducing intermediate datasets can also improve the performance. Markus -- markus schaber | dipl. informatiker logi-track ag | rennweg 14-16 | ch 8001 zürich phone +41-43-888 62 52 | fax +41-43-888 62 53 mailto:schabios@logi-track.com | www.logi-track.com
pgsql-performance by date: