Thread: Multiple Uniques

Multiple Uniques

From
Markus Schaber
Date:
Hello,

Today, we stumbled upon the following query plan on PostgreSQL 7.4.1:

logigis=# explain select count(id) from (select distinct id from (select distinct ref_in_id as id from streets union select distinct nref_in_id as id from streets) as blubb) as blabb;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=33246086.32..33246086.32 rows=1 width=8)
   ->  Subquery Scan blabb  (cost=32970061.50..33246085.82 rows=200 width=8)
         ->  Unique  (cost=32970061.50..33246083.82 rows=200 width=8)
               ->  Sort  (cost=32970061.50..33108072.66 rows=55204464 width=8)
                     Sort Key: id
                     ->  Subquery Scan blubb  (cost=23697404.79..24525471.75 rows=55204464 width=8)
                           ->  Unique  (cost=23697404.79..23973427.11 rows=55204464 width=8)
                                 ->  Sort  (cost=23697404.79..23835415.95 rows=55204464 width=8)
                                       Sort Key: id
                                       ->  Append  (cost=7212374.04..15252815.03 rows=55204464 width=8)
                                              ->  Subquery Scan "*SELECT* 1"  (cost=7212374.04..7626407.52 rows=27602232 width=8)
                                                    ->  Unique  (cost=7212374.04..7350385.20 rows=27602232 width=8)
                                                          ->  Sort  (cost=7212374.04..7281379.62 rows=27602232 width=8)
                                                                Sort Key: ref_in_id
                                                                ->  Seq Scan on streets  (cost=0.00..3129090.32 rows=27602232 width=8)
                                              ->  Subquery Scan "*SELECT* 2"  (cost=7212374.04..7626407.52 rows=27602232 width=8)
                                                    ->  Unique  (cost=7212374.04..7350385.20 rows=27602232 width=8)
                                                          ->  Sort  (cost=7212374.04..7281379.62 rows=27602232 width=8)
                                                                Sort Key: nref_in_id
                                                                ->  Seq Scan on streets  (cost=0.00..3129090.32 rows=27602232 width=8)
(20 rows)

I should add that this is not an everyday query; it's merely part of
some statistics a colleague needs for estimations, as he has to create
a tool that works on this data.

Both ref_in_id and nref_in_id are non-indexed columns with type int8.

I was somehow irritated by the fact that the Query Plan contains 4 Uniques.

Then, I tried the following query:

logigis=# explain select count(id) from (select distinct id from (select ref_in_id as id from streets union select nref_in_id as id from streets) as blubb) as blabb;
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=24803496.57..24803496.57 rows=1 width=8)
   ->  Subquery Scan blabb  (cost=24527471.75..24803496.07 rows=200 width=8)
         ->  Unique  (cost=24527471.75..24803494.07 rows=200 width=8)
               ->  Sort  (cost=24527471.75..24665482.91 rows=55204464 width=8)
                     Sort Key: id
                     ->  Subquery Scan blubb  (cost=15254815.03..16082881.99 rows=55204464 width=8)
                           ->  Unique  (cost=15254815.03..15530837.35 rows=55204464 width=8)
                                 ->  Sort  (cost=15254815.03..15392826.19 rows=55204464 width=8)
                                       Sort Key: id
                                       ->  Append  (cost=0.00..6810225.28 rows=55204464 width=8)
                                              ->  Subquery Scan "*SELECT* 1"  (cost=0.00..3405112.64 rows=27602232 width=8)
                                                    ->  Seq Scan on streets  (cost=0.00..3129090.32 rows=27602232 width=8)
                                              ->  Subquery Scan "*SELECT* 2"  (cost=0.00..3405112.64 rows=27602232 width=8)
                                                    ->  Seq Scan on streets  (cost=0.00..3129090.32 rows=27602232 width=8)
(14 rows)

And after re-parsing the documentation about UNION, the following one:

logigis=# explain select count(id) from (select ref_in_id as id from streets union select nref_in_id as id from streets) as blubb;
                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16220893.16..16220893.16 rows=1 width=8)
   ->  Subquery Scan blubb  (cost=15254815.03..16082881.99 rows=55204464 width=8)
         ->  Unique  (cost=15254815.03..15530837.35 rows=55204464 width=8)
               ->  Sort  (cost=15254815.03..15392826.19 rows=55204464 width=8)
                     Sort Key: id
                     ->  Append  (cost=0.00..6810225.28 rows=55204464 width=8)
                           ->  Subquery Scan "*SELECT* 1"  (cost=0.00..3405112.64 rows=27602232 width=8)
                                 ->  Seq Scan on streets  (cost=0.00..3129090.32 rows=27602232 width=8)
                           ->  Subquery Scan "*SELECT* 2"  (cost=0.00..3405112.64 rows=27602232 width=8)
                                 ->  Seq Scan on streets  (cost=0.00..3129090.32 rows=27602232 width=8)
(10 rows)

Those queries should give the same result.
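For what it's worth, the equivalence can be checked mechanically outside
PostgreSQL; here is a minimal sketch using SQLite from Python, with the
table and column names from above but invented data:

```python
# Sketch only: SQLite stands in for PostgreSQL, and the data is made up.
# UNION already removes duplicates, so the extra DISTINCTs are redundant
# and both queries return the same count.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table streets (ref_in_id integer, nref_in_id integer)")
con.executemany("insert into streets values (?, ?)",
                [(1, 2), (1, 3), (2, 3), (4, 4)])

q_with_distincts = """
    select count(id) from (select distinct id from
        (select distinct ref_in_id as id from streets
         union
         select distinct nref_in_id as id from streets) as blubb) as blabb"""
q_plain_union = """
    select count(id) from
        (select ref_in_id as id from streets
         union
         select nref_in_id as id from streets) as blubb"""

assert con.execute(q_with_distincts).fetchone() == con.execute(q_plain_union).fetchone()
print(con.execute(q_plain_union).fetchone()[0])  # 4 distinct ids: 1, 2, 3, 4
```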

So, now my question is, why does the query optimizer not recognize that
it can throw away those "non-unique" Sort/Unique passes?

Is PostgreSQL 8 capable of this optimization?


Thanks,
Markus


--
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:schabios@logi-track.com | www.logi-track.com

Re: Multiple Uniques

From
Tom Lane
Date:
Markus Schaber <schabios@logi-track.com> writes:
> Today, we stumbled upon the following query plan on PostgreSQL 7.4.1:

> logigis=# explain select count(id) from (select distinct id from (select distinct ref_in_id as id from streets union select distinct nref_in_id as id from streets) as blubb) as blabb;

> I was somehow irritated by the fact that the Query Plan contains 4 Uniques.

Well, if you write a silly query, you can get a silly plan ...

As you appear to have realized later, given the definition of UNION,
all three of the explicit DISTINCTs are redundant.

> So, now my question is, why does the query optimizer not recognize that
> it can throw away those "non-unique" Sort/Unique passes?

Because the issue doesn't come up often enough to justify expending
cycles to check for it.

            regards, tom lane

Re: Multiple Uniques

From
Greg Stark
Date:
Markus Schaber <schabios@logi-track.com> writes:

> logigis=# explain select count(id) from (select ref_in_id as id from streets union select nref_in_id as id from streets) as blubb;
>                                                QUERY PLAN
> ---------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=16220893.16..16220893.16 rows=1 width=8)
>    ->  Subquery Scan blubb  (cost=15254815.03..16082881.99 rows=55204464 width=8)
>          ->  Unique  (cost=15254815.03..15530837.35 rows=55204464 width=8)
>                ->  Sort  (cost=15254815.03..15392826.19 rows=55204464 width=8)
>                      Sort Key: id
>                      ->  Append  (cost=0.00..6810225.28 rows=55204464 width=8)
>                            ->  Subquery Scan "*SELECT* 1"  (cost=0.00..3405112.64 rows=27602232 width=8)
>                                  ->  Seq Scan on streets  (cost=0.00..3129090.32 rows=27602232 width=8)
>                            ->  Subquery Scan "*SELECT* 2"  (cost=0.00..3405112.64 rows=27602232 width=8)
>                                  ->  Seq Scan on streets  (cost=0.00..3129090.32 rows=27602232 width=8)

You can actually go one step further and do:

 select count(distinct id) from (select ... union all select ...) as blubb;

I'm not sure why this is any faster since it still has to do all the same
work, but it's a different code path and it seems to be about 20% faster for
me.
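Greg's rewrite is easy to sanity-check outside PostgreSQL; a sketch in
SQLite via Python (invented data, table and column names as in the
thread):

```python
# Sketch only: with UNION ALL the intermediate duplicate elimination is
# skipped, and count(distinct ...) dedupes instead; both forms must
# return the same count.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table streets (ref_in_id integer, nref_in_id integer)")
con.executemany("insert into streets values (?, ?)",
                [(1, 2), (1, 3), (2, 3), (4, 4)])

q_union = """select count(id) from
              (select ref_in_id as id from streets
               union
               select nref_in_id as id from streets) as blubb"""
q_union_all = """select count(distinct id) from
                  (select ref_in_id as id from streets
                   union all
                   select nref_in_id as id from streets) as blubb"""

assert con.execute(q_union).fetchone() == con.execute(q_union_all).fetchone()
print(con.execute(q_union_all).fetchone()[0])  # 4 distinct ids
```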

--
greg

Re: Multiple Uniques

From
Markus Schaber
Date:
Hi, Greg,

On 02 Sep 2004 15:33:38 -0400
Greg Stark <gsstark@mit.edu> wrote:

> Markus Schaber <schabios@logi-track.com> writes:
>
> > logigis=# explain select count(id) from (select ref_in_id as id from streets union select nref_in_id as id from streets) as blubb;
> >                                                QUERY PLAN
> > ---------------------------------------------------------------------------------------------------------
> >  Aggregate  (cost=16220893.16..16220893.16 rows=1 width=8)
> >    ->  Subquery Scan blubb  (cost=15254815.03..16082881.99 rows=55204464 width=8)
> >          ->  Unique  (cost=15254815.03..15530837.35 rows=55204464 width=8)
> >                ->  Sort  (cost=15254815.03..15392826.19 rows=55204464 width=8)
> >                      Sort Key: id
> >                      ->  Append  (cost=0.00..6810225.28 rows=55204464 width=8)
> >                            ->  Subquery Scan "*SELECT* 1"  (cost=0.00..3405112.64 rows=27602232 width=8)
> >                                  ->  Seq Scan on streets  (cost=0.00..3129090.32 rows=27602232 width=8)
> >                            ->  Subquery Scan "*SELECT* 2"  (cost=0.00..3405112.64 rows=27602232 width=8)
> >                                  ->  Seq Scan on streets  (cost=0.00..3129090.32 rows=27602232 width=8)
>
> You can actually go one step further and do:
>
>  select count(distinct id) from (select ... union all select ...) as blubb;

Hmm, as the query plan, it gives me:

> logigis=# explain select count(distinct id) from (select ref_in_id as id from streets union all select nref_in_id as id from streets) as blubb;
>                                          QUERY PLAN
> ---------------------------------------------------------------------------------------------
>  Aggregate  (cost=7500281.08..7500281.08 rows=1 width=8)
>    ->  Subquery Scan blubb  (cost=0.00..7362269.92 rows=55204464 width=8)
>          ->  Append  (cost=0.00..6810225.28 rows=55204464 width=8)
>                ->  Subquery Scan "*SELECT* 1"  (cost=0.00..3405112.64 rows=27602232 width=8)
>                      ->  Seq Scan on streets  (cost=0.00..3129090.32 rows=27602232 width=8)
>                ->  Subquery Scan "*SELECT* 2"  (cost=0.00..3405112.64 rows=27602232 width=8)
>                      ->  Seq Scan on streets  (cost=0.00..3129090.32 rows=27602232 width=8)
> (7 rows)

This leaves out the Sort and Unique steps completely and pushes all the
work into the count aggregation. Maybe that is implemented via hashing
and thus somehow more efficient...

> I'm not sure why this is any faster since it still has to do all the same
> work, but it's a different code path and it seems to be about 20% faster for
> me.

When comparing the actual timings, I get:

> logigis=# \timing
> Timing is on.
> logigis=# select count(distinct id) from (select ref_in_id as id from streets union all select nref_in_id as id from streets) as blubb;
>   count
> ----------
>  20225596
> (1 row)
>
> Time: 1339098.226 ms
> logigis=# select count(id) from (select ref_in_id as id from streets union select nref_in_id as id from streets) as blubb;
>   count
> ----------
>  20225596
> (1 row)
>
> Time: 1558920.784 ms
> logigis=#

So you're right, it's faster this way. There seems to be some room for
optimizer enhancements here.

But simply removing all DISTINCTs from subselects would not be the best
way to go, as reducing intermediate data sets early can also improve
performance.


Markus

--
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:schabios@logi-track.com | www.logi-track.com

Re: Multiple Uniques

From
Neil Conway
Date:
Tom Lane wrote:
> Markus Schaber <schabios@logi-track.com> writes:
>>So, now my question is, why does the query optimizer not recognize that
>>it can throw away those "non-unique" Sort/Unique passes?
>
> Because the issue doesn't come up often enough to justify expending
> cycles to check for it.

How many cycles are we really talking about, though? I have a patch
which I'll send along in a few days which implements a similar
optimization: if a subselect is referenced by EXISTS or IN, we can
discard DISTINCT and ORDER BY clauses in the subquery (actually, we
can't discard ORDER BY in the case of IN if LIMIT is also specified, but
the point remains). It's very cheap computationally for the planner to
do this simplification, and I'd imagine doing the equivalent
simplifications for UNION is similarly cheap.
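The claim that a DISTINCT inside an IN subquery cannot change the result
is easy to sanity-check; a toy sketch (SQLite via Python, hypothetical
tables t and s):

```python
# Sketch with hypothetical tables: IN is a membership test, so duplicates
# in the subquery's output cannot affect which outer rows qualify.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table t (x integer)")
con.execute("create table s (y integer)")
con.executemany("insert into t values (?)", [(1,), (2,), (3,)])
con.executemany("insert into s values (?)", [(2,), (2,), (3,), (3,)])

plain = sorted(con.execute("select x from t where x in (select y from s)"))
dedup = sorted(con.execute("select x from t where x in (select distinct y from s)"))

assert plain == dedup
print(plain)  # [(2,), (3,)]
```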

While I understand what you're saying WRT it being a silly query, in
the real world people make mistakes...

-Neil

Re: Multiple Uniques

From
Tom Lane
Date:
Neil Conway <neilc@samurai.com> writes:
> Tom Lane wrote:
>> Because the issue doesn't come up often enough to justify expending
>> cycles to check for it.

> How many cycles are we really talking about, though? I have a patch
> which I'll send along in a few days which implements a similar
> optimization: if a subselect is referenced by EXISTS or IN, we can
> discard DISTINCT and ORDER BY clauses in the subquery

I don't think either of those is worth doing.  ORDER BY in a sub-select
isn't even legal SQL, much less probable, so why should we expend even
a nanosecond to optimize it?  The DISTINCT is more of a judgment call,
but my thought when I looked at it originally is that it would give
people a possible optimization knob.  If you write DISTINCT in an IN
clause then you can get a different plan (the IN reduces to an ordinary
join) that might or might not be better than without it.  We shouldn't
take away that possibility just on the grounds of nanny-ism.

            regards, tom lane

Re: Multiple Uniques

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Neil Conway <neilc@samurai.com> writes:
>
> > How many cycles are we really talking about, though? I have a patch
> > which I'll send along in a few days which implements a similar
> > optimization: if a subselect is referenced by EXISTS or IN, we can
> > discard DISTINCT and ORDER BY clauses in the subquery
>
> I don't think either of those is worth doing.  ORDER BY in a sub-select
> isn't even legal SQL, much less probable, so why should we expend even
> a nanosecond to optimize it?  The DISTINCT is more of a judgment call,
> but my thought when I looked at it originally is that it would give
> people a possible optimization knob.  If you write DISTINCT in an IN
> clause then you can get a different plan (the IN reduces to an ordinary
> join) that might or might not be better than without it.  We shouldn't
> take away that possibility just on the grounds of nanny-ism.

Just one user's 2c: consider the plight of dynamically constructed queries.
The queries within "IN" clauses are particularly likely to be constructed
this way. The query in the IN clause could be constructed in an entirely
separate function, without any idea that it will be used within an IN clause.

E.g. something like:

$accessible_ids = $security_manager->get_accessible_ids_query($this->userid);
$selected_columns = $this->selected_columns_parameters();
$query = "select $selected_columns from some_table where id IN ($accessible_ids)";

In an ideal world, functionally equivalent queries would always generate
identical plans. Of course there are limitations, and it's not an ideal
world, but as far as possible one should be able to write code without
having to worry whether the optimizer will be able to figure it out.

--
greg

Question on Byte Sizes

From
Pierre-Frédéric Caillaud
Date:
Hello,

* I need information on the size of pg ARRAY[]s:

I did not find any info on this in the docs.
How many bytes does an array take on disk?

Is there a difference between an array of fixed-size elements like
integers and an array of variable-length elements like text? Is there
a pointer table, or are the elements packed together?

Is there any advantage in using a smallint[] over an integer[]
regarding size?

Does a smallint[] with 2 elements really take 12 bytes?

* On Alignment:

The docs say fields are aligned on 4-byte boundaries.
Does this mean that several consecutive smallint fields will take 4
bytes each? What about several consecutive "char" fields? 4 bytes each
too?

I ask this because I'll have a lot of columns with small values to
store in a table, and I would like it to be small and to fit in the
cache.
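Not an answer to the question itself, but the padding arithmetic behind
such alignment rules can be sketched generically. The per-type
alignments used below (2 for a 2-byte field, 4 for a 4-byte field) are
illustrative assumptions, not a statement about PostgreSQL's actual
on-disk layout:

```python
# Generic alignment arithmetic: each field starts at the next multiple
# of its assumed alignment. Alignments here are illustrative only.
def padded_layout(fields):
    """fields: list of (size, alignment) pairs; returns (offsets, total)."""
    off, offsets = 0, []
    for size, alignment in fields:
        off = (off + alignment - 1) // alignment * alignment  # round up
        offsets.append(off)
        off += size
    return offsets, off

# Four consecutive 2-byte fields at 2-byte alignment pack without padding:
print(padded_layout([(2, 2)] * 4))       # ([0, 2, 4, 6], 8)

# A 2-byte field followed by a 4-byte field at 4-byte alignment
# pays 2 bytes of padding:
print(padded_layout([(2, 2), (4, 4)]))   # ([0, 4], 8)
```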


Thanks for any info.