Thread: sub-select with aggregate
Hi I want to perform query looking like this: select user_id, a/sum_a as percent_a, b/sum_b as percent_b from users join (select group_id, sum(a) as sum_a, sum(b) as sum_b from users group by group_id) X using (group_id) where group_id=3; This query works, but very slow. Subquery with aggregate is performed for all table rows instead of group_id=3. I wish I could write it like this, but it doesn't work at all: select user_id, a/sum_a as percent_a, b/sum_b as percent_b from users U, (select sum(a) as sum_a, sum(b) as sum_b from users where group_id=U.group_id) X where group_id=3; I don't want to make it as single query with "group by", because it has a lot of fields and "group by" would be very long. Regards, Tomasz Myrta
On Wed, 23 Oct 2002, Tomasz Myrta wrote: > Hi > I want to perform query looking like this: > > select > user_id, > a/sum_a as percent_a, > b/sum_b as percent_b > from > users join > (select > group_id, > sum(a) as sum_a, > sum(b) as sum_b > from users group by group_id) X using (group_id) > where group_id=3; > > This query works, but very slow. Subquery with aggregate is performed > for all table rows instead of group_id=3. Does using X.group_id=3 in the where clause work better? On 7.3 with no statistics for the table, that appears to move the filter into the subquery plan rather than the outer users scan.
Uz.ytkownik Stephan Szabo napisa?: > Does using X.group_id=3 in the where clause work better? It works better, but not if you want to create a view and make "select * from some_view where group_id=3" :-( > > On 7.3 with no statistics for the table, that appears > to move the filter into the subquery plan rather than the > outer users scan. Do you mean the second query will work on 7.3? select group_id, user_id, a/sum_a as percent_a, b/sum_b as percent_b from users U, (select sum(a) as sum_a, sum(b) as sum_b from users where group_id=U.group_id) X where group_id=3; Tomasz Myrta
On Wed, 23 Oct 2002, Tomasz Myrta wrote: > Uz.ytkownik Stephan Szabo napisa?: > > Does using X.group_id=3 in the where clause work better? > It works better, but not if you want to create a view and make > "select * from some_view where group_id=3" :-( But you can't do that anyway, because you don't expose group_id in the original query. I assume user_id was a mistake then and was meant to be group_id or that both were meant to be in the select list. I see these two queries that in 7.3 push the clause into the subquery and I believe should have the same output: create view v as select group_id, a/sum_a as percent_a, b/sum_b as percent_b from (select group_id, sum(a) as sum_a, sum(b) as sum_b from users group by group_id) X join users using (group_id); and create view v as select X.group_id, a/sum_a as percent_a, b/sum_b as percent_b from users join (select group_id, sum(a) as sum_a, sum(b) as sum_b from users group by group_id) X on (X.group_id=users.group_id); In the first case changing the order means that the output group_id column is X.group_id rather than users.group_id (using removes one of them which is why group_id isn't ambiguous. In the second it uses on to get both group_ids and exposes the one from X. > > On 7.3 with no statistics for the table, that appears > > to move the filter into the subquery plan rather than the > > outer users scan. > Do you mean the second query will work on 7.3? No, the case of setting X.group_id.
Uz.ytkownik Stephan Szabo napisa?: > But you can't do that anyway, because you don't expose group_id > in the original query. I assume user_id was a mistake then and was > meant to be group_id or that both were meant to be in the > select list. Yes, I meant group_id, but in orginal query I didn't have to add group_id to select list. > In the first case changing the order means that the output > group_id column is X.group_id rather than users.group_id > (using removes one of them which is why group_id isn't > ambiguous. In the second it uses on to get both group_ids > and exposes the one from X. The problem isn't ambigous columns, but speed. I think Postgres first performs sub-query with all records from table (it takes very long time). After this Postgres permforms joining table with sub-query. The question is: How to speed up query like this? How to give param group_id from first table (users) to subquery? Tomasz Myrta
On Wed, 23 Oct 2002, Tomasz Myrta wrote: > Uz.ytkownik Stephan Szabo napisa?: > > But you can't do that anyway, because you don't expose group_id > > in the original query. I assume user_id was a mistake then and was > > meant to be group_id or that both were meant to be in the > > select list. > Yes, I meant group_id, but in orginal query I didn't have to add > group_id to select list. Without group_id in the select list you couldn't do a where group_id = <something> if the select was a view. > > In the first case changing the order means that the output > > group_id column is X.group_id rather than users.group_id > > (using removes one of them which is why group_id isn't > > ambiguous. In the second it uses on to get both group_ids > > and exposes the one from X. > The problem isn't ambigous columns, but speed. > I think Postgres first performs sub-query with all records from table > (it takes very long time). After this Postgres permforms joining table > with sub-query. The question is: How to speed up query like this? How to > give param group_id from first table (users) to subquery? Did you see the other two queries I gave? On 7.3, both of those queries appear (according to explain output) to do the limiting of group_id inside the subquery rather than doing the subquery with all rows. The explanation above was why I believe it was different from your original query.
On Wed, 23 Oct 2002, Stephan Szabo wrote: > Did you see the other two queries I gave? On 7.3, both of those queries > appear (according to explain output) to do the limiting of group_id > inside the subquery rather than doing the subquery with all rows. > The explanation above was why I believe it was different from your > original query. Replying to myself with a clarification. The other forms only move the filtering clauses around. There's still only a filter on the outer group_id equaling the inner group_id and a filter on group_id=3. It's just a question of whether it's: Scan users in subselect from group_id=3, group and aggregate themand join with users on subselect's group_id = users.group_id or Scan users for group_id=3, scan users in subselect, group andaggregate them and join these two on subselect'sgroup_id=users.group_id. The latter is effectively what your original query gave, the former appears to be what my two queries from a couple of mails ago gives.
> On Wed, 23 Oct 2002, Tomasz Myrta wrote: > > > Uz.ytkownik Stephan Szabo napisa?: > Without group_id in the select list you couldn't do a where > group_id = <something> if the select was a view. I know - it was just example of query. > Did you see the other two queries I gave? On 7.3, both of those queries > appear (according to explain output) to do the limiting of group_id > inside the subquery rather than doing the subquery with all rows. > The explanation above was why I believe it was different from your > original query. I saw them. So should I wait for 7.3? Currently I have 7.2 and my work stopped because of this. Tomasz Myrta
On Wed, 23 Oct 2002, Tomasz Myrta wrote: > > On Wed, 23 Oct 2002, Tomasz Myrta wrote: > > > > > Uz.ytkownik Stephan Szabo napisa?: > > Without group_id in the select list you couldn't do a where > > group_id = <something> if the select was a view. > I know - it was just example of query. > > > Did you see the other two queries I gave? On 7.3, both of those queries > > appear (according to explain output) to do the limiting of group_id > > inside the subquery rather than doing the subquery with all rows. > > The explanation above was why I believe it was different from your > > original query. > I saw them. So should I wait for 7.3? Currently I have 7.2 and my work > stopped because of this. IIRC 7.2's explain doesn't show which clauses are on which plans (it'll show that it's doing an index scan or whatever, but not what it's actually looking for. The query probably optimizes similarly though. I'd suggest trying it and seeing if it runs any better. I'm also assuming you have an index on group_id to prevent the sequential scan of users.
> The other forms only move the filtering clauses around. There's > still only a filter on the outer group_id equaling the inner > group_id and a filter on group_id=3. It's just a question of > whether it's: > > Scan users in subselect from group_id=3, group and aggregate them > and join with users on subselect's group_id = users.group_id > > or > > Scan users for group_id=3, scan users in subselect, group and > aggregate them and join these two on subselect's > group_id=users.group_id. > > The latter is effectively what your original query gave, the former > appears to be what my two queries from a couple of mails ago gives. I'm sorry, if you misunderstood me - I didn't show it clear. I want to scan users for group_id=3 and then join subselect for the same group_id. I can't pass group_id=3 directly to subselect, because it's not a query but a view. -- Tomasz Myrta <jasiek@klaster.net>
Uz.ytkownik Stephan Szabo napisa?: > I see these two queries that in 7.3 push the clause into the > subquery and I believe should have the same output: > > create view v as > select > group_id, > a/sum_a as percent_a, > b/sum_b as percent_b > from > (select > group_id, > sum(a) as sum_a, > sum(b) as sum_b > from users group by group_id) X join > users using (group_id); > > and > > create view v as > select > X.group_id, > a/sum_a as percent_a, > b/sum_b as percent_b > from > users join > (select > group_id, > sum(a) as sum_a, > sum(b) as sum_b > from users group by group_id) X > on (X.group_id=users.group_id); I made tests of these queries (Postgres 7.2). In my database there are master table "bilety" and detail "przejazdy": 1) Exposing field from sub-query drop view v; create view v as select X.id_biletu, netto, vat from bilety B join (select id_biletu, sum(netto) as netto, sum(vat) as vat from przejazdy group by id_biletu) X on (X.id_biletu=B.id_biletu); explain select * from v where id_biletu=12345; Nested Loop (cost=0.00..29.58 rows=5 width=24) -> Subquery Scan x (cost=0.00..7.94 rows=1 width=20) -> Aggregate (cost=0.00..7.94 rows=1 width=20) -> Group (cost=0.00..7.93 rows=1 width=20) -> Index Scan using qq2 on przejazdy (cost=0.00..7.92 rows=1 width=20) -> Index Scan using ind_bil_id on bilety b (cost=0.00..21.58 rows=5 width=4) 2) Exposing field from master-table drop view v; create view v as select X.id_biletu, netto, vat from bilety B join (select id_biletu, sum(netto) as netto, sum(vat) as vat from przejazdy group by id_biletu) X on (X.id_biletu=B.id_biletu); explain select * from v where id_biletu=12345; Merge Join (cost=4595.39..4627.36 rows=3 width=24) -> Index Scan using ind_bil_id on bilety b (cost=0.00..21.55 rows=5 width=4) -> Sort (cost=4595.39..4595.39 rows=4146 width=20) -> Subquery Scan x (cost=0.00..4346.25 rows=4146width=20) -> Aggregate (cost=0.00..4346.25 rows=4146 width=20) -> Group (cost=0.00..4138.93rows=41463 width=20) -> Index Scan using qq2 on przejazdy (cost=0.00..4035.28 rows=41463 width=20) 3) Using planner to choose exposed field drop view v; create view v as select id_biletu, netto, vat from bilety B join (select id_biletu, sum(netto) as netto, sum(vat) as vat from przejazdy group by id_biletu) X using(id_biletu); explain select * from v where id_biletu=12345; Merge Join (cost=4595.39..4627.36 rows=3 width=24) -> Index Scan using ind_bil_id on bilety b (cost=0.00..21.55 rows=5 width=4) -> Sort (cost=4595.39..4595.39 rows=4146 width=20) -> Subquery Scan x (cost=0.00..4346.25 rows=4146width=20) -> Aggregate (cost=0.00..4346.25 rows=4146 width=20) -> Group (cost=0.00..4138.93rows=41463 width=20) -> Index Scan using qq2 on przejazdy (cost=0.00..4035.28 rows=41463 width=20) Now I know the solution - to speed up this query I have to manually expose field from sub-query. It works fine, but I still don't know why I can't do this opposite way. Tomasz Myrta
Sorry, In second query is: drop view v; create view v as select B.id_biletu... Tomasz Myrta
On Thu, 24 Oct 2002, Tomasz Myrta wrote: > Now I know the solution - to speed up this query I have to manually > expose field from sub-query. It works fine, but I still don't know why I > can't do this opposite way. The system doesn't realize that it can limit the subquery when the only things it has is the join clause and the limiting clause on the other table. Like Tom's message about the other view, it won't imply that X.foo=12345 from foo=12345 when the exposed foo is B.foo.
> The system doesn't realize that it can limit the subquery when the only > things it has is the join clause and the limiting clause on the other > table. Like Tom's message about the other view, it won't imply that > X.foo=12345 from foo=12345 when the exposed foo is B.foo. Is this what we should expect, or rather a bug? The system knows how to join tables, but fails with joining views or subselects :-( Tomasz Myrta
On Thu, 24 Oct 2002, Tomasz Myrta wrote: > > The system doesn't realize that it can limit the subquery when the only > > things it has is the join clause and the limiting clause on the other > > table. Like Tom's message about the other view, it won't imply that > > X.foo=12345 from foo=12345 when the exposed foo is B.foo. > > Is this what we should expect, or rather a bug? > The system knows how to join tables, but fails with joining views or > subselects :-( It knows how to join them, it doesn't know that it can change the clauses given to it in order to make it more efficient. Basically, as I understand it,select * from a,b where a.a=b.a and a.a=3; isn't going to realize that b.a=3 and act as if you typed that. It may consider an index scan in a nested loop for a.a=b.a, but that's not quite the same.
Stephan Szabo <sszabo@megazone23.bigpanda.com> writes: > Basically, as I understand it, > select * from a,b where a.a=b.a and a.a=3; > isn't going to realize that b.a=3 and act as if > you typed that. We have talked about adding code to make that deduction. The issue is how to do so without expending a lot of cycles (that will be wasted in every query where no improvement results, which is most of them). There are also some ticklish issues about side-effects of adding such clauses. in the above example, if we did filter b with b.a=3, then the join clause a.a=b.a becomes a no-op and does not reduce the number of rows; if we fail to account for that fact we will underestimate the number of rows out of the join, possibly pessimizing higher levels of plan. Also, we might as well implement the join as a nestloop; no percentage in fooling with hash or merge overhead. I have some thoughts about this, but it's not happening for 7.3 ... regards, tom lane
[ followup on a gripe from October ] Tomasz Myrta <jasiek@klaster.net> writes: > I want to perform query looking like this: > select > user_id, > a/sum_a as percent_a, > b/sum_b as percent_b > from > users join > (select > group_id, > sum(a) as sum_a, > sum(b) as sum_b > from users group by group_id) X using (group_id) > where group_id=3; > This query works, but very slow. Subquery with aggregate is performed > for all table rows instead of group_id=3. Just FYI, this problem is fixed as of CVS tip. Using an empty table with an index on group_id, I get a plan like so: Nested Loop (cost=17.10..34.21 rows=1 width=36) -> Subquery Scan x (cost=17.10..17.11 rows=1 width=12) -> HashAggregate (cost=17.10..17.11 rows=1 width=12) -> Index Scan using users_group_id on users (cost=0.00..17.07rows=5 width=12) Index Cond: (3 = group_id) -> Index Scan using users_group_id on users (cost=0.00..17.08 rows=1 width=16) Index Cond: (users.group_id = "outer".group_id) Filter: (group_id= 3) Notice the condition on group_id has been propagated into both sides of the join. regards, tom lane