Thread: sub-select with aggregate

sub-select with aggregate

From
Tomasz Myrta
Date:
Hi
I want to perform query looking like this:

select user_id, a/sum_a as percent_a, b/sum_b as percent_b
from users join (select   group_id,   sum(a) as sum_a,   sum(b) as sum_b  from users group by group_id) X using
(group_id)
where group_id=3;

This query works, but very slow. Subquery with aggregate is performed 
for all table rows instead of group_id=3.

I wish I could write it like this, but it doesn't work at all:
select user_id, a/sum_a as percent_a, b/sum_b as percent_b
from users U, (select   sum(a) as sum_a,   sum(b) as sum_b  from users where group_id=U.group_id) X
where group_id=3;

I don't want to make it as single query with "group by", because it has 
a lot of fields and "group by" would be very long.

Regards,
Tomasz Myrta



Re: sub-select with aggregate

From
Stephan Szabo
Date:
On Wed, 23 Oct 2002, Tomasz Myrta wrote:

> Hi
> I want to perform query looking like this:
>
> select
>   user_id,
>   a/sum_a as percent_a,
>   b/sum_b as percent_b
> from
>   users join
>   (select
>     group_id,
>     sum(a) as sum_a,
>     sum(b) as sum_b
>    from users group by group_id) X using (group_id)
> where group_id=3;
>
> This query works, but very slow. Subquery with aggregate is performed
> for all table rows instead of group_id=3.

Does using X.group_id=3 in the where clause work better?

On 7.3 with no statistics for the table, that appears
to move the filter into the subquery plan rather than the
outer users scan.



Re: sub-select with aggregate

From
Tomasz Myrta
Date:
Uz.ytkownik Stephan Szabo napisa?:
> Does using X.group_id=3 in the where clause work better?
It works better, but not if you want to create a view and make
"select * from some_view where group_id=3" :-(
> 
> On 7.3 with no statistics for the table, that appears
> to move the filter into the subquery plan rather than the
> outer users scan.
Do you mean the second query will work on 7.3?
select  group_id,  user_id,  a/sum_a as percent_a,  b/sum_b as percent_b
from  users U,  (select    sum(a) as sum_a,    sum(b) as sum_b   from users where group_id=U.group_id) X
where group_id=3;

Tomasz Myrta



Re: sub-select with aggregate

From
Stephan Szabo
Date:
On Wed, 23 Oct 2002, Tomasz Myrta wrote:

> Uz.ytkownik Stephan Szabo napisa?:
> > Does using X.group_id=3 in the where clause work better?
> It works better, but not if you want to create a view and make
> "select * from some_view where group_id=3" :-(

But you can't do that anyway, because you don't expose group_id
in the original query.  I assume user_id was a mistake then and was
meant to be group_id or that both were meant to be in the
select list.

I see these two queries that in 7.3 push the clause into the
subquery and I believe should have the same output:

create view v as
select group_id, a/sum_a as percent_a, b/sum_b as percent_b
from (select   group_id,   sum(a) as sum_a,   sum(b) as sum_b  from users group by group_id) X join users using
(group_id);

and

create view v as
select X.group_id, a/sum_a as percent_a, b/sum_b as percent_b
from users join (select   group_id,   sum(a) as sum_a,   sum(b) as sum_b  from users group by group_id) X on
(X.group_id=users.group_id);

In the first case changing the order means that the output
group_id column is X.group_id rather than users.group_id
(using removes one of them which is why group_id isn't
ambiguous.  In the second it uses on to get both group_ids
and exposes the one from X.

> > On 7.3 with no statistics for the table, that appears
> > to move the filter into the subquery plan rather than the
> > outer users scan.
> Do you mean the second query will work on 7.3?

No, the case of setting X.group_id.



Re: sub-select with aggregate

From
Tomasz Myrta
Date:
Uz.ytkownik Stephan Szabo napisa?:
> But you can't do that anyway, because you don't expose group_id
> in the original query.  I assume user_id was a mistake then and was
> meant to be group_id or that both were meant to be in the
> select list.
Yes, I meant group_id, but in orginal query I didn't have to add 
group_id to select list.

> In the first case changing the order means that the output
> group_id column is X.group_id rather than users.group_id
> (using removes one of them which is why group_id isn't
> ambiguous.  In the second it uses on to get both group_ids
> and exposes the one from X.
The problem isn't ambigous columns, but speed.
I think Postgres first performs sub-query with all records from table 
(it takes very long time). After this Postgres permforms joining table 
with sub-query. The question is: How to speed up query like this? How to 
give param group_id from first table (users) to subquery?
Tomasz Myrta



Re: sub-select with aggregate

From
Stephan Szabo
Date:
On Wed, 23 Oct 2002, Tomasz Myrta wrote:

> Uz.ytkownik Stephan Szabo napisa?:
> > But you can't do that anyway, because you don't expose group_id
> > in the original query.  I assume user_id was a mistake then and was
> > meant to be group_id or that both were meant to be in the
> > select list.
> Yes, I meant group_id, but in orginal query I didn't have to add
> group_id to select list.

Without group_id in the select list you couldn't do a where
group_id = <something> if the select was a view.

> > In the first case changing the order means that the output
> > group_id column is X.group_id rather than users.group_id
> > (using removes one of them which is why group_id isn't
> > ambiguous.  In the second it uses on to get both group_ids
> > and exposes the one from X.
> The problem isn't ambigous columns, but speed.
> I think Postgres first performs sub-query with all records from table
> (it takes very long time). After this Postgres permforms joining table
> with sub-query. The question is: How to speed up query like this? How to
> give param group_id from first table (users) to subquery?

Did you see the other two queries I gave?  On 7.3, both of those queries
appear (according to explain output) to do the limiting of group_id
inside the subquery rather than doing the subquery with all rows.
The explanation above was why I believe it was different from your
original query.




Re: sub-select with aggregate

From
Stephan Szabo
Date:
On Wed, 23 Oct 2002, Stephan Szabo wrote:

> Did you see the other two queries I gave?  On 7.3, both of those queries
> appear (according to explain output) to do the limiting of group_id
> inside the subquery rather than doing the subquery with all rows.
> The explanation above was why I believe it was different from your
> original query.

Replying to myself with a clarification.

The other forms only move the filtering clauses around.  There's still
only a filter on the outer group_id equaling the inner group_id and
a filter on group_id=3.  It's just a question of whether it's:

Scan users in subselect from group_id=3, group and aggregate themand join with users on subselect's group_id =
users.group_id

or

Scan users for group_id=3, scan users in subselect, group andaggregate them and join these two on
subselect'sgroup_id=users.group_id.

The latter is effectively what your original query gave, the former
appears to be what my two queries from a couple of mails ago gives.




Re: sub-select with aggregate

From
"Tomasz Myrta"
Date:
> On Wed, 23 Oct 2002, Tomasz Myrta wrote:
> 
> > Uz.ytkownik Stephan Szabo napisa?:
> Without group_id in the select list you couldn't do a where
> group_id = <something> if the select was a view.
I know - it was just example of query.

> Did you see the other two queries I gave?  On 7.3, both of those queries
> appear (according to explain output) to do the limiting of group_id
> inside the subquery rather than doing the subquery with all rows.
> The explanation above was why I believe it was different from your
> original query.
I saw them. So should I wait for 7.3? Currently I have 7.2 and my work 
stopped because of this.
Tomasz Myrta


Re: sub-select with aggregate

From
Stephan Szabo
Date:
On Wed, 23 Oct 2002, Tomasz Myrta wrote:

> > On Wed, 23 Oct 2002, Tomasz Myrta wrote:
> >
> > > Uz.ytkownik Stephan Szabo napisa?:
> > Without group_id in the select list you couldn't do a where
> > group_id = <something> if the select was a view.
> I know - it was just example of query.
>
> > Did you see the other two queries I gave?  On 7.3, both of those queries
> > appear (according to explain output) to do the limiting of group_id
> > inside the subquery rather than doing the subquery with all rows.
> > The explanation above was why I believe it was different from your
> > original query.
> I saw them. So should I wait for 7.3? Currently I have 7.2 and my work
> stopped because of this.

IIRC 7.2's explain doesn't show which clauses are on which plans (it'll
show that it's doing an index scan or whatever, but not what it's actually
looking for.  The query probably optimizes similarly though.  I'd suggest
trying it and seeing if it runs any better.  I'm also assuming you have
an index on group_id to prevent the sequential scan of users.




Re: sub-select with aggregate

From
"Tomasz Myrta"
Date:
> The other forms only move the filtering clauses around.  There's 
> still only a filter on the outer group_id equaling the inner 
> group_id and a filter on group_id=3.  It's just a question of 
> whether it's:
> 
> Scan users in subselect from group_id=3, group and aggregate them
>  and join with users on subselect's group_id = users.group_id
> 
> or
> 
> Scan users for group_id=3, scan users in subselect, group and
>  aggregate them and join these two on subselect's
>  group_id=users.group_id.
> 
> The latter is effectively what your original query gave, the former
> appears to be what my two queries from a couple of mails ago gives.

I'm sorry, if you misunderstood me - I didn't show it clear.
I want to scan users for group_id=3 and then join subselect for the same 
group_id. I can't pass group_id=3 directly to subselect, because it's not a 
query but a view.

--
Tomasz Myrta <jasiek@klaster.net>



Re: sub-select with aggregate

From
Tomasz Myrta
Date:
Uz.ytkownik Stephan Szabo napisa?:
> I see these two queries that in 7.3 push the clause into the
> subquery and I believe should have the same output:
> 
> create view v as
> select
>   group_id,
>   a/sum_a as percent_a,
>   b/sum_b as percent_b
> from
>   (select
>     group_id,
>     sum(a) as sum_a,
>     sum(b) as sum_b
>    from users group by group_id) X join
>   users using (group_id);
> 
> and
> 
> create view v as
> select
>   X.group_id,
>   a/sum_a as percent_a,
>   b/sum_b as percent_b
> from
>   users join
>   (select
>     group_id,
>     sum(a) as sum_a,
>     sum(b) as sum_b
>    from users group by group_id) X
>   on (X.group_id=users.group_id);

I made tests of these queries (Postgres 7.2). In my database there are 
master table "bilety" and detail "przejazdy":

1) Exposing field from sub-query
drop view v; create view v as select X.id_biletu, netto, vat
from bilety B join (select  id_biletu,  sum(netto) as netto,  sum(vat) as vat  from przejazdy group by id_biletu) X on
(X.id_biletu=B.id_biletu);
explain select * from v where id_biletu=12345;
Nested Loop  (cost=0.00..29.58 rows=5 width=24)  ->  Subquery Scan x  (cost=0.00..7.94 rows=1 width=20)        ->
Aggregate (cost=0.00..7.94 rows=1 width=20)              ->  Group  (cost=0.00..7.93 rows=1 width=20)
->  Index Scan using qq2 on przejazdy 
 
(cost=0.00..7.92 rows=1 width=20)  ->  Index Scan using ind_bil_id on bilety b  (cost=0.00..21.58 rows=5 
width=4)

2) Exposing field from master-table
drop view v; create view v as select X.id_biletu, netto, vat
from bilety B join (select  id_biletu,  sum(netto) as netto,  sum(vat) as vat  from przejazdy group by id_biletu) X on
(X.id_biletu=B.id_biletu);
explain select * from v where id_biletu=12345;
Merge Join  (cost=4595.39..4627.36 rows=3 width=24)  ->  Index Scan using ind_bil_id on bilety b  (cost=0.00..21.55
rows=5
 
width=4)  ->  Sort  (cost=4595.39..4595.39 rows=4146 width=20)        ->  Subquery Scan x  (cost=0.00..4346.25
rows=4146width=20)              ->  Aggregate  (cost=0.00..4346.25 rows=4146 width=20)                    ->  Group
(cost=0.00..4138.93rows=41463 width=20)                          ->  Index Scan using qq2 on przejazdy 
 
(cost=0.00..4035.28 rows=41463 width=20)

3) Using planner to choose exposed field
drop view v;
create view v as
select id_biletu, netto, vat
from bilety B join (select  id_biletu,  sum(netto) as netto,  sum(vat) as vat  from przejazdy group by id_biletu) X
using(id_biletu);
 
explain select * from v where id_biletu=12345;
Merge Join  (cost=4595.39..4627.36 rows=3 width=24)  ->  Index Scan using ind_bil_id on bilety b  (cost=0.00..21.55
rows=5
 
width=4)  ->  Sort  (cost=4595.39..4595.39 rows=4146 width=20)        ->  Subquery Scan x  (cost=0.00..4346.25
rows=4146width=20)              ->  Aggregate  (cost=0.00..4346.25 rows=4146 width=20)                    ->  Group
(cost=0.00..4138.93rows=41463 width=20)                          ->  Index Scan using qq2 on przejazdy 
 
(cost=0.00..4035.28 rows=41463 width=20)


Now I know the solution - to speed up this query I have to manually 
expose field from sub-query. It works fine, but I still don't know why I 
can't do this opposite way.

Tomasz Myrta



Re: sub-select with aggregate

From
Tomasz Myrta
Date:
Sorry,
In second query is:
drop view v; create view v as select B.id_biletu...

Tomasz Myrta



Re: sub-select with aggregate

From
Stephan Szabo
Date:
On Thu, 24 Oct 2002, Tomasz Myrta wrote:

> Now I know the solution - to speed up this query I have to manually
> expose field from sub-query. It works fine, but I still don't know why I
> can't do this opposite way.

The system doesn't realize that it can limit the subquery when the only
things it has is the join clause and the limiting clause on the other
table.  Like Tom's message about the other view, it won't imply that
X.foo=12345 from foo=12345 when the exposed foo is B.foo.




Re: sub-select with aggregate

From
"Tomasz Myrta"
Date:
> The system doesn't realize that it can limit the subquery when the only
> things it has is the join clause and the limiting clause on the other
> table.  Like Tom's message about the other view, it won't imply that
> X.foo=12345 from foo=12345 when the exposed foo is B.foo.

Is this what we should expect, or rather a bug?
The system knows how to join tables, but fails with joining views or 
subselects :-(
Tomasz Myrta


Re: sub-select with aggregate

From
Stephan Szabo
Date:
On Thu, 24 Oct 2002, Tomasz Myrta wrote:

> > The system doesn't realize that it can limit the subquery when the only
> > things it has is the join clause and the limiting clause on the other
> > table.  Like Tom's message about the other view, it won't imply that
> > X.foo=12345 from foo=12345 when the exposed foo is B.foo.
>
> Is this what we should expect, or rather a bug?
> The system knows how to join tables, but fails with joining views or
> subselects :-(

It knows how to join them, it doesn't know that it can change the
clauses given to it in order to make it more efficient.

Basically, as I understand it,select * from a,b where a.a=b.a and a.a=3;
isn't going to realize that b.a=3 and act as if
you typed that. It may consider an index scan in
a nested loop for a.a=b.a, but that's not quite
the same.




Re: sub-select with aggregate

From
Tom Lane
Date:
Stephan Szabo <sszabo@megazone23.bigpanda.com> writes:
> Basically, as I understand it,
>  select * from a,b where a.a=b.a and a.a=3;
> isn't going to realize that b.a=3 and act as if
> you typed that.

We have talked about adding code to make that deduction.  The issue is
how to do so without expending a lot of cycles (that will be wasted in
every query where no improvement results, which is most of them).  There
are also some ticklish issues about side-effects of adding such clauses.
in the above example, if we did filter b with b.a=3, then the join
clause a.a=b.a becomes a no-op and does not reduce the number of rows;
if we fail to account for that fact we will underestimate the number of
rows out of the join, possibly pessimizing higher levels of plan.  Also,
we might as well implement the join as a nestloop; no percentage in
fooling with hash or merge overhead.

I have some thoughts about this, but it's not happening for 7.3 ...
        regards, tom lane


Re: sub-select with aggregate

From
Tom Lane
Date:
[ followup on a gripe from October ]

Tomasz Myrta <jasiek@klaster.net> writes:
> I want to perform query looking like this:

> select
>   user_id,
>   a/sum_a as percent_a,
>   b/sum_b as percent_b
> from
>   users join
>   (select
>     group_id,
>     sum(a) as sum_a,
>     sum(b) as sum_b
>    from users group by group_id) X using (group_id)
> where group_id=3;

> This query works, but very slow. Subquery with aggregate is performed 
> for all table rows instead of group_id=3.

Just FYI, this problem is fixed as of CVS tip.  Using an empty table
with an index on group_id, I get a plan like so:
Nested Loop  (cost=17.10..34.21 rows=1 width=36)  ->  Subquery Scan x  (cost=17.10..17.11 rows=1 width=12)        ->
HashAggregate (cost=17.10..17.11 rows=1 width=12)              ->  Index Scan using users_group_id on users
(cost=0.00..17.07rows=5 width=12)                    Index Cond: (3 = group_id)  ->  Index Scan using users_group_id on
users (cost=0.00..17.08 rows=1 width=16)        Index Cond: (users.group_id = "outer".group_id)        Filter:
(group_id= 3)
 

Notice the condition on group_id has been propagated into both sides of
the join.
        regards, tom lane