Home > mailing lists

Re: Multiple Uniques - Mailing list pgsql-performance

From	Markus Schaber
Subject	Re: Multiple Uniques
Date	September 6, 2004 16:57:29
Msg-id	20040906175618.5497a124@kingfisher.intern.logi-track.com Whole thread Raw
In response to	Re: Multiple Uniques (Greg Stark <gsstark@mit.edu>)
List	pgsql-performance

Tree view

Hi, Greg,

On 02 Sep 2004 15:33:38 -0400
Greg Stark <gsstark@mit.edu> wrote:

> Markus Schaber <schabios@logi-track.com> writes:
>
> > logigis=# explain select count(id) from (select  ref_in_id as id from streets union select  nref_in_id as id from
streets)as blubb; 
> >                                                QUERY PLAN
> > ---------------------------------------------------------------------------------------------------------
> >  Aggregate  (cost=16220893.16..16220893.16 rows=1 width=8)
> >    ->  Subquery Scan blubb  (cost=15254815.03..16082881.99 rows=55204464 width=8)
> >          ->  Unique  (cost=15254815.03..15530837.35 rows=55204464 width=8)
> >                ->  Sort  (cost=15254815.03..15392826.19 rows=55204464 width=8)
> >                      Sort Key: id
> >                      ->  Append  (cost=0.00..6810225.28 rows=55204464 width=8)
> >                            ->  Subquery Scan "*SELECT* 1"  (cost=0.00..3405112.64 rows=27602232 width=8)
> >                                  ->  Seq Scan on streets  (cost=0.00..3129090.32 rows=27602232 width=8)
> >                            ->  Subquery Scan "*SELECT* 2"  (cost=0.00..3405112.64 rows=27602232 width=8)
> >                                  ->  Seq Scan on streets  (cost=0.00..3129090.32 rows=27602232 width=8)
>
> You can actually go one step further and do:
>
>  select count(distinct id) from (select ... union all select ...) as blubb;

Hmm, as query plan, it gives me:

> logigis=# explain select count(distinct id) from (select ref_in_id as id from streets union all select nref_in_id as
idfrom streets) as blubb; 
>                                          QUERY PLAN
> ---------------------------------------------------------------------------------------------
>  Aggregate  (cost=7500281.08..7500281.08 rows=1 width=8)
>    ->  Subquery Scan blubb  (cost=0.00..7362269.92 rows=55204464 width=8)
>          ->  Append  (cost=0.00..6810225.28 rows=55204464 width=8)
>                ->  Subquery Scan "*SELECT* 1"  (cost=0.00..3405112.64 rows=27602232 width=8)
>                      ->  Seq Scan on streets  (cost=0.00..3129090.32 rows=27602232 width=8)
>                ->  Subquery Scan "*SELECT* 2"  (cost=0.00..3405112.64 rows=27602232 width=8)
>                      ->  Seq Scan on streets  (cost=0.00..3129090.32 rows=27602232 width=8)
> (7 rows)

This leaves off the sort and unique completely, and leaves all work to
the count aggrecation. Maybe this is implemented by hashing and thus
somehow more efficient...

> I'm not sure why this is any faster since it still has to do all the same
> work, but it's a different code path and it seems to be about 20% faster for
> me.

When comparing the actual timings, I get:

> logigis=# \timing
> Timing is on.
> logigis=# select count(distinct id) from (select ref_in_id as id from streets union all select nref_in_id as id from
streets)as blubb; 
>   count
> ----------
>  20225596
> (1 row)
>
> Time: 1339098.226 ms
> logigis=# select count(id) from (select  ref_in_id as id from streets union select  nref_in_id as id from streets) as
blubb;
>   count
> ----------
>  20225596
> (1 row)
>
> Time: 1558920.784 ms
> logigis=#

So you're right, its faster this way. There seems to be some room to
play with optimizer enhancements.

But simply removing all distincts from subselects would not be the best
way to go, as reducing intermediate datasets can also improve the
performance.


Markus

--
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:schabios@logi-track.com | www.logi-track.com

pgsql-performance by date:

From: Dennis Bjorklund
Date: 06 September 2004, 13:47:05
Subject: Re: The usual sequential scan, but with LIMIT !

From: Tom Lane
Date: 06 September 2004, 17:43:56
Subject: Re: The usual sequential scan, but with LIMIT !

Re: Multiple Uniques - Mailing list pgsql-performance

Previous

Next