Thread: Pull up aggregate subquery
I sometimes wonder if we could pull up aggregate query in the optimizer. My typical problem is this: consider two relations, a medium-sized M and a large L. L has a reference column to M's primary key. Say,

create table size_m as
  select i as id, repeat(i::text, i % 100) as val
  from generate_series(1, 20000) i;

create table size_l as
  select i as id, m.id as m_id, repeat(i::text, i % 100) as val
  from generate_series(1, 100000) i, size_m m
  where i.i / 10 = m.id;

Now you want to query M under some condition, joined to an aggregate of L grouped by M's primary key:

select m_id, sum_len
from size_m m
inner join (select m_id, sum(length(val)) as sum_len
            from size_l group by m_id) l on m.id = l.m_id
where val = '1';

The generated plan is:

                                                            QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=36116.92..38339.67 rows=50 width=235) (actual time=440.679..1039.964 rows=1 loops=1)
   Join Filter: (m.id = size_l.m_id)
   ->  Seq Scan on size_m m  (cost=0.00..897.00 rows=1 width=227) (actual time=0.021..16.698 rows=1 loops=1)
         Filter: (val = '1'::text)
   ->  GroupAggregate  (cost=36116.92..37217.09 rows=10026 width=248) (actual time=440.651..1013.704 rows=10000 loops=1)
         ->  Sort  (cost=36116.92..36366.90 rows=99991 width=248) (actual time=440.619..593.062 rows=99991 loops=1)
               Sort Key: size_l.m_id
               Sort Method: external sort  Disk: 25736kB
               ->  Seq Scan on size_l  (cost=0.00..4565.91 rows=99991 width=248) (actual time=0.011..138.243 rows=99991 loops=1)
 Total runtime: 1044.039 ms
(10 rows)

Note that most of the aggregate's result is discarded by the join against M, because the filter on M.val leaves only a small output from M. If we could filter M first and then restrict L by M's result before running the aggregate, the response could get dramatically faster.

If you filter by M.id instead of M.val, the optimizer is smart enough to push the condition down to L, which is then filtered before the aggregate:

select m_id, sum_len
from size_m m
inner join (select m_id, sum(length(val)) as sum_len
            from size_l group by m_id) l on m.id = l.m_id
where id = 1;

                                                      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..5713.02 rows=2 width=235) (actual time=72.121..82.364 rows=1 loops=1)
   ->  Seq Scan on size_m m  (cost=0.00..897.00 rows=1 width=227) (actual time=0.028..10.252 rows=1 loops=1)
         Filter: (id = 1)
   ->  GroupAggregate  (cost=0.00..4815.98 rows=2 width=248) (actual time=72.065..72.067 rows=1 loops=1)
         ->  Seq Scan on size_l  (cost=0.00..4815.89 rows=10 width=248) (actual time=0.051..71.968 rows=10 loops=1)
               Filter: (m_id = 1)
 Total runtime: 82.525 ms
(7 rows)

This is the benefit of EquivalenceClass: 1 = M.id = L.m_id is implied, and the optimizer adds the implicit constant filter L.m_id = 1 to the plan tree. In contrast, in the former case M.val = '1' doesn't imply any condition on L.m_id. That's fair enough. However, I think we can still filter L on L.m_id before the aggregate, because L.m_id is a grouping column as well as the join key, and M.id is unique in M. In such cases the query can be transformed into something like:

GroupAggregate
  -> NestLoop (L.m_id = M.id)
       -> SeqScan L
       -> SeqScan M (filter: M.val = '1')

This transformation can be done by pulling up the aggregate Query in pull_up_subqueries(). Currently the optimizer doesn't pull up any query that contains aggregates, but as shown above, in some cases we can do it. Attached is a WIP proof-of-concept patch that does it.
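In SQL terms, what the transformed plan computes is roughly the hand-flattened query below (shown only for illustration; the patch operates on the Query and plan level, not by rewriting SQL text):

select m.id as m_id, sum(length(l.val)) as sum_len
from size_l l
inner join size_m m on l.m_id = m.id
where m.val = '1'
group by m.id;

That is, the join and the filter on M run below a single GroupAggregate, so only the size_l rows belonging to qualifying M rows ever reach the aggregate.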
I know the patch breaks general queries, but it performs the transformation described above. I suppose the missing piece is the condition for "when" it is safe to pull up the aggregate; the "how" is done in the patch.

db1=# explain analyze select m_id, sum_len from size_m m inner join (select m_id, sum(length(val)) as sum_len from size_l group by m_id) l on m.id = l.m_id where val = '1';
                                                            QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=6712.96..6713.16 rows=10 width=471) (actual time=125.496..125.499 rows=1 loops=1)
   ->  Sort  (cost=6712.96..6712.99 rows=10 width=471) (actual time=125.228..125.288 rows=10 loops=1)
         Sort Key: size_l.m_id
         Sort Method: quicksort  Memory: 25kB
         ->  Nested Loop  (cost=0.00..6712.80 rows=10 width=471) (actual time=0.142..125.089 rows=10 loops=1)
               Join Filter: (m.id = size_l.m_id)
               ->  Seq Scan on size_m m  (cost=0.00..897.00 rows=1 width=227) (actual time=0.037..8.956 rows=1 loops=1)
                     Filter: (val = '1'::text)
               ->  Seq Scan on size_l  (cost=0.00..4565.91 rows=99991 width=248) (actual time=0.060..65.910 rows=99991 loops=1)
 Total runtime: 125.658 ms
(10 rows)

(You may notice that after the transformation a non-grouping/non-aggregate TargetEntry ends up in the Agg node, but there is actually no check for that in the optimizer or the executor. Maybe thanks to functional dependency.)

Comments?

Regards,

-- 
Hitoshi Harada
Hitoshi Harada <umi.tanuki@gmail.com> writes:
> I sometimes wonder if we could pull up aggregate query in the optimizer.

I don't have time to look at this right now, but please add to the upcoming commitfest.

			regards, tom lane
2011/5/5 Tom Lane <tgl@sss.pgh.pa.us>:
> Hitoshi Harada <umi.tanuki@gmail.com> writes:
>> I sometimes wonder if we could pull up aggregate query in the optimizer.
>
> I don't have time to look at this right now, but please add to the
> upcoming commitfest.

Done.
https://commitfest.postgresql.org/action/patch_view?id=548

I'll work further if I find time.

Regards,

-- 
Hitoshi Harada
2011/5/5 Hitoshi Harada <umi.tanuki@gmail.com>:
> https://commitfest.postgresql.org/action/patch_view?id=548
>
> I'll work further if I find time.

After more thought, pulling up the aggregate subquery only in narrow conditional cases is quite a hard path, especially when the joinrel involves more than two relations. It will be hard to check that the pull-up is safe with respect to relations other than the target relation.

It was a big shame that I missed Tom Lane's session at PGCon, but seeing "Parameterized Scan" in his slides, it occurred to me that it might help my problem, too. Before hitting on the "pull up" idea, I had wondered whether it would be possible to push the outer Var of the join down into the Agg's HAVING, from where it would be transferred to the underlying SeqScan's filter. That would result in something like:

NestLoop
  -> SeqScan M (filter: M.val = '1')
  -> GroupAggregate
       -> SeqScan L (filter: L.m_id = M.id)

However, we currently have no mechanism to push a Var down as a qual to a non-NestLoop node. And even if we could do that today, we should avoid looping the Agg N times; we want to scan the Agg once, with Param $1 = M.id taking multiple values. Since I didn't attend the session I'm afraid I don't understand "Parameterized Scan" correctly, but once we have such a mechanism, the example introduced in Robert Haas's blog[1] (originally shown by Andrew Gierth[2]), and perhaps LATERAL, would benefit as well.

Do I understand correctly? If so, could someone explain in more detail how to get Parameterized Scan into the planner?

Regards,

[1]: http://rhaas.blogspot.com/2010/04/finding-our-way-to-lateral.html
[2]: http://archives.postgresql.org/pgsql-hackers/2009-09/msg00525.php

-- 
Hitoshi Harada
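To make the desired per-outer-row behaviour concrete, a hand-written correlated form of the same query (assuming the size_m/size_l tables defined at the top of the thread) evaluates the aggregate once per qualifying M row rather than over all of size_l:

select m.id as m_id,
       (select sum(length(l.val))
        from size_l l
        where l.m_id = m.id) as sum_len
from size_m m
where m.val = '1';

This is only a sketch of the intent: unlike the original inner join, it also returns M rows that have no match in L, with a NULL sum.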
On Sat, May 21, 2011 at 12:49 PM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
> 2011/5/5 Hitoshi Harada <umi.tanuki@gmail.com>:
>> https://commitfest.postgresql.org/action/patch_view?id=548
>>
>> I'll work further if I find time.
>
> After more thought, pulling up aggregate subquery in narrow
> conditional cases is quite hard path, especially when the joinrel is
> more than 2. It will be hard to check pulling up is safe for other
> relations than the target relation.
>
> It was a big shame I missed Tom Lane's session in PGCon, but finding
> "Parameterized Scan" in his slides, it occurred to me that it might
> help my problem, too. Before hitting the "pull up" idea, I once
> thought if it would be possible to push outer Var of join down to
> Agg's HAVING, which is transferred to underlying SeqScan's filter.
> Resulted in something like:
>
> NestLoop
>   -> SeqScan M (filter: M.val = '1')
>   -> GroupAggregate
>        -> SeqScan L (filter: L.m_id = M.id)
>
> However, currently we don't have such mechanism to push down Var as a
> qual to non-NestLoop. Yeah, it could be even now, but we should avoid
> N-loop of Agg. We want to scan Agg once, with Param $1 = M.id =
> multiple values. Since I didn't attend his session I'm afraid I don't
> understand "Parameterized Scan" correctly, but once we've got such
> mechanism, one example introduced in Robert Haas's blog[1] (originally
> shown by Andrew Gierth[2]) and LATERAL maybe.
>
> Do I understand correctly? If so, could someone explain more detail of
> how to get Parameterized Scan in the planner?

I think we're going to need Tom to give the definitive word on this, but I believe that the current situation is that the executor is capable of handling a parameterized scan (yeah!) but the planner doesn't know how to generate them (boo!). This is an improvement of a sort over the 9.0 code base, where neither the planner nor the executor could handle this case, but we need planner support in order to get anywhere useful with it.

The problem is how to figure out whether a parameterized scan is a win without expending too much planning time. For example, in the case you mention:

select m_id, sum_len
from size_m m
inner join (select m_id, sum(length(val)) as sum_len
            from size_l group by m_id) l on m.id = l.m_id
where val = '1';

...we'd need to plan the subquery twice, once with a parameterized qual m_id = $1 pushed down, and once without that. We could then compare the cost of a nest-loop with the qual to the cost of a merge or hash join without it. But this seems very expensive. In the particular case you have here, the subquery is simple enough that this probably wouldn't be any big deal, but in general there's no reason why that subquery couldn't be quite complex - or why it couldn't have subqueries of its own that would require the same treatment recursively.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
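To spell out the two plannings being compared: the subquery would be costed in roughly the two shapes below. ($1 stands for the value supplied by the outer size_m row, so the second form is only a sketch and is not directly executable on its own.)

-- unparameterized: aggregate all of size_l, then join the result
select m_id, sum(length(val)) as sum_len from size_l group by m_id;

-- parameterized: aggregate only the group matching one outer value
select m_id, sum(length(val)) as sum_len from size_l where m_id = $1 group by m_id;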
Robert Haas <robertmhaas@gmail.com> writes:
> On Sat, May 21, 2011 at 12:49 PM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
>> Do I understand correctly? If so, could someone explain more detail of
>> how to get Parameterized Scan in the planner?

> I think we're going to need Tom to give the definitive word on this,
> but I believe that the current situation is that the executor is
> capable of handling a parameterized scan (yeah!) but the planner
> doesn't know how to generate them (boo!). This is an improvement of
> a sort over the 9.0 code base, where neither the planner nor the
> executor could handle this case, but we need planner support in order
> to get anywhere useful with it.

Yeah. I fixed the executor in 9.1, but got hung up on how to fix the planner. So the planner still only knows how to do this for the case of an inner indexscan, ie, it can't handle parameterizing any piece of a plan larger than a single indexscan. I have some ideas about fixing that but have had no time to pursue them since December :-(

> ...we'd need to plan the subquery twice, once with a parameterized
> qual m_id = $1 pushed down, and once without that. We could then
> compare the cost of a nest-loop with the qual to the cost of a merge
> or hash join without it. But this seems very expensive. In the
> particular case you have here, the subquery is simple enough that this
> probably wouldn't be any big deal, but in general there's no reason
> why that subquery couldn't be quite complex - or why it couldn't have
> subqueries of its own that would require the same treatment
> recursively.

Yeah. For simple scan/join queries it seems likely that we only care about parameterizing indexscans, since the main opportunity for a win is to not scan all of a large table. Restricting things that way would help reduce the number of extra Paths to carry around. But I'm not sure whether the same argument can be made for arbitrary subqueries.

			regards, tom lane
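As a concrete note on what parameterizing an indexscan buys in the running example: with an index on the join column, for instance

create index size_l_m_id_idx on size_l (m_id);   -- index name chosen arbitrarily for illustration

the inner side of a nestloop over a plain (non-aggregated) join of size_m and size_l can already fetch just the rows matching each outer row instead of seq-scanning all of size_l. The open question in this thread is how to get the same effect when the inner side is an entire aggregate subquery.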
2011/5/24 Robert Haas <robertmhaas@gmail.com>:
> On Sat, May 21, 2011 at 12:49 PM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
>> 2011/5/5 Hitoshi Harada <umi.tanuki@gmail.com>:
>> Do I understand correctly? If so, could someone explain more detail of
>> how to get Parameterized Scan in the planner?
>
> I think we're going to need Tom to give the definitive word on this,
> but I believe that the current situation is that the executor is
> capable of handling a parameterized scan (yeah!)

Ah, grepping the git log pointed me to
http://git.postgresql.org/gitweb?p=postgresql.git;a=commit;h=53e757689ce94520f1c53a89dbaa14ea57b09da7

> The problem is how to figure out whether a parameterized scan is a win
> without expending too much planning time. For example, in the case
> you mention:
>
> select m_id, sum_len from size_m m inner join (select m_id,
> sum(length(val)) as sum_len from size_l group by m_id) l on m.id =
> l.m_id where val = '1';
>
> ...we'd need to plan the subquery twice, once with a parameterized
> qual m_id = $1 pushed down, and once without that. We could then
> compare the cost of a nest-loop with the qual to the cost of a merge
> or hash join without it. But this seems very expensive. In the
> particular case you have here, the subquery is simple enough that this
> probably wouldn't be any big deal, but in general there's no reason
> why that subquery couldn't be quite complex - or why it couldn't have
> subqueries of its own that would require the same treatment
> recursively.

That's true. But if planning cost is the only issue, why not add a new GUC so users can choose whether they prefer it? Of course, if we have some method to predict which way to go before actually planning both ways, that's even better. Do you have a rough picture of how that would work?

In addition, I wonder whether the "generalized nestloop" pattern can run the whole outer scan before probing the inner side. I mean, in my example, if the outer scan qualifies more than one tuple, I'm afraid the inner Agg and its underlying SeqScan are re-scanned more than once. That will hurt performance if the Agg is expensive. It would be great if the outer scan could run to the end, store the param values in something like an array, and then probe the inner Agg once, so that we can ensure the Agg is computed only once. To do that, do we need more work in the executor?

Regards,

-- 
Hitoshi Harada
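A rough SQL-level picture of that "run the outer scan first, collect the values, then compute the Agg once" idea, using the tables from the first message (this illustrates the intended behaviour by hand; it is not a sketch of any proposed executor mechanism):

select m_id, sum(length(val)) as sum_len
from size_l
where m_id = any (array(select id from size_m where val = '1'))
group by m_id;

Here size_m is scanned once to collect the qualifying ids, and size_l is scanned and aggregated a single time over just the matching rows.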
2011/5/24 Tom Lane <tgl@sss.pgh.pa.us>:
> Robert Haas <robertmhaas@gmail.com> writes:
> Yeah. I fixed the executor in 9.1, but got hung up on how to fix the
> planner. So the planner still only knows how to do this for the case of
> an inner indexscan, ie, it can't handle parameterizing any piece of a
> plan larger than a single indexscan. I have some ideas about fixing
> that but have had no time to pursue them since December :-(

If you can spare the time, could you describe those ideas? Or at least point me to where to start looking? In any case, I'm going to dig into the optimizer more myself.

> But I'm not sure
> whether the same argument can be made for arbitrary subqueries.

I guess it's hard to apply the mechanism to arbitrary subqueries, but there are many use cases like my example, with an Agg node on the inner side.

Regards,

-- 
Hitoshi Harada
On Mon, May 23, 2011 at 4:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> ...we'd need to plan the subquery twice, once with a parameterized
>> qual m_id = $1 pushed down, and once without that. We could then
>> compare the cost of a nest-loop with the qual to the cost of a merge
>> or hash join without it. But this seems very expensive. In the
>> particular case you have here, the subquery is simple enough that this
>> probably wouldn't be any big deal, but in general there's no reason
>> why that subquery couldn't be quite complex - or why it couldn't have
>> subqueries of its own that would require the same treatment
>> recursively.
>
> Yeah. For simple scan/join queries it seems likely that we only care
> about parameterizing indexscans, since the main opportunity for a win is
> to not scan all of a large table. Restricting things that way would
> help reduce the number of extra Paths to carry around. But I'm not sure
> whether the same argument can be made for arbitrary subqueries.

I must be misunderstanding you, because index scans are the thing we already *do* parameterize; and what else would make any sense?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, May 23, 2011 at 10:47 PM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
> That's true. But if the planning cost is an only issue, why not adding
> new GUC for user to choose if they prefer it or not? Of course if we
> have some method to predict which way to go before proving both ways,
> it's great. Do you have some blue picture on it?

I don't really like the idea of adding a GUC for this, unless we convince ourselves that nothing else is sensible. I mean, that leads to conversations like this:

Newbie: My query is slow.
Hacker: Turn on enable_magic_pixie_dust and it'll get better.
Newbie: Oh, yeah, that is better. Why isn't this turned on by default, anyway?
Hacker: Well, on pathological queries, it makes planning take way too long, so we think it's not really safe to enable it by default.
Newbie: Wait... I thought you just told me to enable it. It's not safe?
Hacker: Well, sort of. I mean, uh... hey, look, an elephant!

I think that if you're interested in working on this, the first step would be to get it to work at all, even if it's a bit inefficient. Even though in theory you could get exponential planning time growth if you have an aggregate inside of an aggregate inside of an aggregate inside of an aggregate, very few people are going to do that. If it turns out to be a problem, we can dream up some suitable defense; as I think about it a little more, I don't think that would be terribly difficult to work out. The hard part is probably to get this to work in the first place.

> In addition, I wonder if the "generalized nestloop" pattern can do
> whole outer scan before proving inner? I mean, in my example if the
> outer scan qualified more than one tuple I'm afraid that inner Agg and
> its underlying SeqScan are re-scanned more than one. That will bloat
> performance if the Agg is expensive. It would be great if we can outer
> scan can do scan to the end and stores param variables in something
> like array and prove once inner Agg, so that we can ensure the Agg is
> computed once. To do that, do we need more work on executor?

That sounds to me like way too much complexity for the first go-around. We should either push the agg down or not; it sounds like you're proposing some kind of half-way state. That might not be useful in practice, and it will certainly be a lot more work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, May 24, 2011 at 11:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yeah. For simple scan/join queries it seems likely that we only care
>> about parameterizing indexscans, since the main opportunity for a win is
>> to not scan all of a large table. Restricting things that way would
>> help reduce the number of extra Paths to carry around. But I'm not sure
>> whether the same argument can be made for arbitrary subqueries.

> I must be misunderstanding you, because index scans are the thing we
> already *do* parameterize; and what else would make any sense?

The point I was trying to make is that the ultimate reason for having a parameterized portion-of-a-plan will be that there's a parameterized indexscan somewhere down at the bottom. I had originally imagined that we might parameterize any old scan; for example consider replacing

	Nestloop
	  Join Filter: a.x = b.y
	  -> Seqscan on a
	  -> Seqscan on b

with

	Nestloop
	  -> Seqscan on a
	  -> Seqscan on b
	       Filter: b.y = a.x

Although this isn't nearly as useful as if we could be using an index on b.y, there would still be some marginal gain to be had, because we'd be able to reject rows of b without first passing them up to the join node. But I'm afraid that going all-out like that would slow the planner down far too much (too many Paths to consider) to be justified by a marginal runtime gain.

So the idea I have at the moment is that we'll still only parameterize indexscans, but then allow those to be joined to unrelated relations while still remaining parameterized. That should reduce the number of parameterized Paths hanging around, because only joinclauses that match indexes will give rise to such Paths.

But I think this is all fairly unrelated to the case that Hitoshi is on about. As you said earlier, it seems like we'd have to derive both parameterized and unparameterized plans for the subquery, which seems mighty expensive.

			regards, tom lane
On Tue, May 24, 2011 at 11:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I must be misunderstanding you, because index scans are the thing we
>> already *do* parameterize; and what else would make any sense?
>
> The point I was trying to make is that the ultimate reason for having a
> parameterized portion-of-a-plan will be that there's a parameterized
> indexscan somewhere down at the bottom. I had originally imagined that
> we might parameterize any old scan; for example consider replacing
>
>	Nestloop
>	  Join Filter: a.x = b.y
>	  -> Seqscan on a
>	  -> Seqscan on b
>
> with
>
>	Nestloop
>	  -> Seqscan on a
>	  -> Seqscan on b
>	       Filter: b.y = a.x

Oh, I see. I have a general gripe with nested loop plans: we already consider too many of them. IIRC, when I last fooled around with this, the number of nested loop paths that we generate far exceeded the number of merge or hash join paths, and most of those paths suck and are a complete waste of time. It strikes me that we ought to be trying to find ways to get rid of some of the paths we're already considering, rather than adding any more. In this particular case, if the second plan is actually faster, it probably won't be by much; we could think about trying to make some kind of ex-post-facto transformation instead of throwing everything into the path machinery.

> Although this isn't nearly as useful as if we could be using an index on
> b.y, there would still be some marginal gain to be had, because we'd be
> able to reject rows of b without first passing them up to the join node.
> But I'm afraid that going all-out like that would slow the planner down
> far too much (too many Paths to consider) to be justified by a marginal
> runtime gain.

Agreed.

> So the idea I have at the moment is that we'll still only parameterize
> indexscans, but then allow those to be joined to unrelated relations
> while still remaining parameterized. That should reduce the number
> of parameterized Paths hanging around, because only joinclauses that
> match indexes will give rise to such Paths.

That seems fine, yeah. If anything, we might want to limit it even more, but certainly that's a good place to start, and see how it shakes out.

> But I think this is all fairly unrelated to the case that Hitoshi is on
> about. As you said earlier, it seems like we'd have to derive both
> parameterized and unparameterized plans for the subquery, which seems
> mighty expensive.

That was my first thought, too, but then I wondered if I was getting cheap. Most of the time, the subquery will be something simple, and replanning it twice won't really matter much. If it happens to be something complicated, then it will take longer, but on the other hand that's exactly the sort of byzantine query where you probably want the planner to pull out all the stops. Aggregates tend to "feel" slow almost invariably, because the amount of data being processed under the hood is much larger than what actually gets emitted, so I think we should at least consider the possibility that users really won't care about a bit of extra work. The case I'm concerned about is where you have several levels of nested aggregates, and the effect starts to multiply. But even if that turns out to be a problem, we could handle it by limiting consideration of the alternate path to the top 1 or 2 levels and handle the rest as we do now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, May 24, 2011 at 11:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The point I was trying to make is that the ultimate reason for having a
>> parameterized portion-of-a-plan will be that there's a parameterized
>> indexscan somewhere down at the bottom.

> Oh, I see. I have a general gripe with nested loop plans: we already
> consider too many of them. IIRC, when I last fooled around with this,
> the number of nested loop paths that we generate far exceeded the
> number of merge or hash join paths, and most of those paths suck and
> are a complete waste of time.

Hm, really? My experience is that it's the mergejoin paths that breed like rabbits, because there are so many potential sort orders.

>> But I think this is all fairly unrelated to the case that Hitoshi is on
>> about. As you said earlier, it seems like we'd have to derive both
>> parameterized and unparameterized plans for the subquery, which seems
>> mighty expensive.

> That was my first thought, too, but then I wondered if I was getting
> cheap.

Yeah, it's certainly possible that we're worrying too much. Usually I only get concerned about added planner logic if it will impact the planning time for simple queries. "Simple" tends to be in the eye of the beholder, but something with a complicated aggregate subquery is probably not simple by anyone's definition.

In this case the sticky point is that there could be multiple possible sets of clauses available to be pushed down, depending on what you assume is the outer relation for the eventual upper-level nestloop. So worst case, you could have not just one parameterized plan to generate in addition to the regular kind, but 2^N of them ...

			regards, tom lane
2011/5/25 Tom Lane <tgl@sss.pgh.pa.us>:
> Robert Haas <robertmhaas@gmail.com> writes:
>> That was my first thought, too, but then I wondered if I was getting
>> cheap.
>
> Yeah, it's certainly possible that we're worrying too much. Usually
> I only get concerned about added planner logic if it will impact the
> planning time for simple queries. "Simple" tends to be in the eye of
> the beholder, but something with a complicated aggregate subquery is
> probably not simple by anyone's definition.
>
> In this case the sticky point is that there could be multiple possible
> sets of clauses available to be pushed down, depending on what you
> assume is the outer relation for the eventual upper-level nestloop.
> So worst case, you could have not just one parameterized plan to
> generate in addition to the regular kind, but 2^N of them ...

My intention is that the Var can be pushed down only if the join qual matches the subquery Agg's grouping keys, so I'm not worried about exponential growth in the number of paths. And I found the right place to hack: set_subquery_pathlist(), which already pushes down some baserestrictinfo quals. We don't have outer Vars in those RestrictInfos now, but I guess we can put them there somehow before reaching that point. Even if I can do that, though, it is only effective when the outer side produces a single tuple. As I noted earlier, completing this optimization needs the executor's cooperation, something like gathering the param values into an array before starting the Agg scan. So I'm still considering which of pulling up and parameterized scan is better.

Regards,

-- 
Hitoshi Harada
On Tue, May 24, 2011 at 12:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Oh, I see. I have a general gripe with nested loop plans: we already
>> consider too many of them. IIRC, when I last fooled around with this,
>> the number of nested loop paths that we generate far exceeded the
>> number of merge or hash join paths, and most of those paths suck and
>> are a complete waste of time.
>
> Hm, really? My experience is that it's the mergejoin paths that breed
> like rabbits, because there are so many potential sort orders.

*scratches head* Well, I'm pretty sure that's how it looked when I was testing it. I wonder how this could be different for the two of us. Or maybe one of us is confused. Admittedly, I haven't looked at it in a while.

>>> But I think this is all fairly unrelated to the case that Hitoshi is on
>>> about. As you said earlier, it seems like we'd have to derive both
>>> parameterized and unparameterized plans for the subquery, which seems
>>> mighty expensive.
>
>> That was my first thought, too, but then I wondered if I was getting
>> cheap.
>
> Yeah, it's certainly possible that we're worrying too much. Usually
> I only get concerned about added planner logic if it will impact the
> planning time for simple queries. "Simple" tends to be in the eye of
> the beholder, but something with a complicated aggregate subquery is
> probably not simple by anyone's definition.
>
> In this case the sticky point is that there could be multiple possible
> sets of clauses available to be pushed down, depending on what you
> assume is the outer relation for the eventual upper-level nestloop.
> So worst case, you could have not just one parameterized plan to
> generate in addition to the regular kind, but 2^N of them ...

Hmm. Well, 2^N is more than 2. But I bet most of them are boring. Judging by his followup email, Hitoshi Harada seems to think we can just look at the case where we can parameterize on all of the grouping columns. The only other case that seems like it might be interesting is parameterizing on any single one of the grouping columns. I can't get excited about pushing down arbitrary subsets. Of course, even O(N) in the number of grouping columns might be too much, but then we could fall back to just all or nothing. I think the "all" case by itself would probably extract 90%+ of the benefit, especially since "all" will often mean "the only one there is".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
2011/5/25 Hitoshi Harada <umi.tanuki@gmail.com>:
> So I'm still
> thinking which of pulling up and parameterized scan is better.

After more investigation I came up with a third idea: pushing a RangeTblEntry down into the aggregate subquery. This sounds like a crazy idea, but it seems to me slightly easier than pulling up the agg subquery. The main point is that when you pull up, you must make sure the final output is still correct with respect to join relations other than the aggregate-related one. In contrast, when pushing the target relation down into the agg subquery, it is simple to ensure a correct result.

I'm looking around pull_up_subqueries() in subquery_planner() to add the push-down logic. It might be possible to do it around make_one_rel() instead, but I think query-structure changes are better done against the Query, not the PlannerInfo.

I'd appreciate any comments.

Regards,

-- 
Hitoshi Harada
Hitoshi Harada <umi.tanuki@gmail.com> writes:
> 2011/5/25 Hitoshi Harada <umi.tanuki@gmail.com>:
>> So I'm still
>> thinking which of pulling up and parameterized scan is better.

> After more investigation I came up with third idea, pushing down
> RangeTblEntry to aggregate subquery. This sounds like a crazy idea,

Yes, it sure does. Won't that typically change the number of rows produced by the subquery's jointree? And thus possibly change the aggregate's result?

			regards, tom lane
2011/5/26 Tom Lane <tgl@sss.pgh.pa.us>:
> Hitoshi Harada <umi.tanuki@gmail.com> writes:
>> 2011/5/25 Hitoshi Harada <umi.tanuki@gmail.com>:
>>> So I'm still
>>> thinking which of pulling up and parameterized scan is better.
>
>> After more investigation I came up with third idea, pushing down
>> RangeTblEntry to aggregate subquery. This sounds like a crazy idea,
>
> Yes, it sure does. Won't that typically change the number of rows
> produced by the subquery's jointree? And thus possibly change the
> aggregate's result?

No, because it will be restricted to the case where the join qual matches the grouping key and the outer relation is unique on the var in that join qual. Pushing the baserel down into the sub-aggregate ends up being equivalent to pulling the sub-aggregate up when there are only two relations (which is my first example). The difference is whether the optimizer needs to worry about other relations that might change the number of rows produced.

Regards,

-- 
Hitoshi Harada
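As a SQL-level sketch of that restriction, using the tables from the first message (the rewrite below is only valid because m.id is unique and the join qual m.id = l.m_id matches the grouping key):

-- original form: aggregate size_l first, then join to the filtered size_m
select m_id, sum_len
from size_m m
inner join (select m_id, sum(length(val)) as sum_len
            from size_l group by m_id) l on m.id = l.m_id
where m.val = '1';

-- size_m pushed down into the aggregate subquery
select m_id, sum_len
from (select l.m_id, sum(length(l.val)) as sum_len
      from size_l l
      inner join size_m m on l.m_id = m.id
      where m.val = '1'
      group by l.m_id) sub;

Because each size_l row joins to at most one size_m row, the join inside the subquery cannot duplicate group members, and the rows it removes belong only to groups that the original query's join would have discarded anyway; meanwhile the filter on m.val is now applied before aggregation.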
On Wed, May 25, 2011 at 10:35 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
> 2011/5/25 Hitoshi Harada <umi.tanuki@gmail.com>:
>> So I'm still
>> thinking which of pulling up and parameterized scan is better.
>
> After more investigation I came up with third idea, pushing down
> RangeTblEntry to aggregate subquery. This sounds like a crazy idea,
> but it seems to me like it is slightly easier than pulling up
> agg-subquery. The main point is that when you want to pull up, you
> must care if the final output would be correct with other join
> relations than the aggregate-related one. In contrast, when pushing
> down the target relation to agg-subquery it is simple to ensure the
> result.
>
> I'm looking around pull_up_subqueries() in subquery_planner() to add
> the pushing down logic. It could be possible to do it around
> make_one_rel() but I bet query structure changes are doable against
> Query, not PlannerInfo.

How do you decide whether or not to push down?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
2011/5/26 Robert Haas <robertmhaas@gmail.com>:
> On Wed, May 25, 2011 at 10:35 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
>> 2011/5/25 Hitoshi Harada <umi.tanuki@gmail.com>:
>>> So I'm still
>>> thinking which of pulling up and parameterized scan is better.
>>
>> After more investigation I came up with third idea, pushing down
>> RangeTblEntry to aggregate subquery. This sounds like a crazy idea,
>> but it seems to me like it is slightly easier than pulling up
>> agg-subquery. The main point is that when you want to pull up, you
>> must care if the final output would be correct with other join
>> relations than the aggregate-related one. In contrast, when pushing
>> down the target relation to agg-subquery it is simple to ensure the
>> result.
>>
>> I'm looking around pull_up_subqueries() in subquery_planner() to add
>> the pushing down logic. It could be possible to do it around
>> make_one_rel() but I bet query structure changes are doable against
>> Query, not PlannerInfo.
>
> How do you decide whether or not to push down?

Yeah, that's the problem. In addition to the conditions that the join qual matches the grouping key and that the outer relation is unique on the qual, we need some criterion for whether it should be done at all. At first I thought I could compare the costs of the two alternatives, generated by calling subquery_planner() twice. But now my plan is to apply a heuristic, something like "the join qual selectivity is less than 10% or so". I don't like magic numbers either, but given that we are restructuring the Query rather than the PlannerInfo (which means we cannot use Paths), it is the only way left. Getting it to work is my first goal anyway.

Regards,

-- 
Hitoshi Harada
On Mon, May 23, 2011 at 11:08:40PM -0400, Robert Haas wrote:
>
> I don't really like the idea of adding a GUC for this, unless we
> convince ourselves that nothing else is sensible. I mean, that leads
> to conversations like this:
>
> Newbie: My query is slow.
> Hacker: Turn on enable_magic_pixie_dust and it'll get better.
> Newbie: Oh, yeah, that is better. Why isn't this turned on by default, anyway?
> Hacker: Well, on pathological queries, it makes planning take way too
> long, so we think it's not really safe to enable it by default.
> Newbie: Wait... I thought you just told me to enable it. It's not safe?
> Hacker: Well, sort of. I mean, uh... hey, look, an elephant!
>

ROTFL! This needs to go on the wiki somewhere, in the discussion of why HACKERS rejects tunable knobs as often as it does.

Ross
-- 
Ross Reedstrom, Ph.D.                                 reedstrm@rice.edu
Systems Engineer & Admin, Research Scientist          phone: 713-348-6166
Connexions                    http://cnx.org          fax: 713-348-3665
Rice University MS-375, Houston, TX 77005
GPG Key fingerprint = F023 82C8 9B0E 2CC6 0D8E  F888 D3AE 810E 88F0 BEDE
On Wed, May 25, 2011 at 1:37 PM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
>> How do you decide whether or not to push down?
>
> Yeah, that's the problem. In addition to the conditions of join-qual
> == grouping key && outer is unique on qual, we need some criteria if
> it should be done. At first I started to think I can compare cost of
> two different plan nodes, which are generated by calling
> subquery_planner() twice. But now my plan is to apply some heuristics
> like that join qual selectivity is less than 10% or so. To
> get it work is my first goal anyway.

I think getting it working is probably a good first goal. I am not really sure that we want to commit it that way, and I think my vote would be for you to work on the approach we discussed before rather than this one, but it's your project, and I think you'll probably learn enough in getting it working that it will be a step forward in any case. The planner is complex enough that it's worth trying to get something that works, first, before trying to make it perfect.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> I think getting it working is probably a good first goal. I am not
> really sure that we want to commit it that way, and I think my vote
> would be for you to work on the approach we discussed before rather
> than this one, but it's your project, and I think you'll probably
> learn enough in getting it working that it will be a step forward in
> any case. The planner is complex enough that it's worth trying to get
> something that works, first, before trying to make it perfect.

Remember Brooks' advice: "Plan to throw one away. You will anyhow."

			regards, tom lane
On Tue, May 24, 2011 at 3:47 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
> That's true. But if the planning cost is an only issue, why not adding
> new GUC for user to choose if they prefer it or not? Of course if we
> have some method to predict which way to go before proving both ways,
> it's great. Do you have some blue picture on it?

I like your simple patch, and it looks like it fixes your concern. Your problem statement ignores the fact, though, that most people would not write the "original query" like this:

select m_id, sum_len
from size_m m
inner join (select m_id, sum(length(val)) as sum_len
            from size_l group by m_id) l on m.id = l.m_id
where val = '1';

They would write it like this:

select m_id, sum(length(val))
from size_m m join size_l l on m.id = l.m_id
where val = '1'
group by m_id;

which gives a far worse plan, and one that is not solved by your patch. Your way of writing the SQL is one of the "hand optimized" ways that an SQL expert would try to rewrite the query. We shouldn't be optimizing only for hand-altered code, since it can always be further tweaked by hand. We should be optimizing the original, simple queries (as well as other forms of expressing the same thing).

This highlights that we do not have the infrastructure to push aggregates up or down, and that the lack of a known "primary key" for the output of each plan node prevents us from developing a general transformation infrastructure to solve the general case. That particular piece of infrastructure is also an essential step towards materialized views, which would be pretty useless without the capability to transform aggregates up and down the join tree.

In terms of costing, I think it is likely that we can apply simple heuristics. We already assume that applying quals down at the lowest level possible makes sense. I would guess that anything that reduces the number of rows should be pushed down as far as possible. I'm sure there are cases where that isn't true, but let's not stop ourselves from solving simple general cases because of the theoretical existence of complex cases.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
2011/6/4 Simon Riggs <simon@2ndquadrant.com>:
>
> I like your simple patch and looks like it fixes your concern.

Thanks for your interest. I forgot to mention it, but this type of query is quite common with one-to-many entities and is likely to be generated by simple OR mappers.

> Your problem statement ignores the fact that most people would not
> write the "original query" like this
>
> select m_id, sum_len from size_m m inner join(select m_id,
> sum(length(val)) as sum_len from size_l group by m_id)l on m.id =
> l.m_id where val = '1';
>
> they would write it like this
>
> select m_id, sum(length(val))
> from size_m m join size_l l on m.id = l.m_id
> where val = '1'
> group by m_id;

Hm, I hadn't thought about that hand transformation. I will check what happens with it. But my example is simplified, and it is quite possible that something else joins (not uniquely) to size_m.

> In terms of costing, I think it would be likely that we can apply
> simple heuristics. We already assume that applying quals down to the
> lowest level possible make sense. I would guess that anything that
> reduces the number of rows should be pushed down as far as possible.
> I'm sure there are cases where that isn't true, but lets not stop from
> solving simple general cases because of the theoretical existence of
> complex cases

Agreed. After more thought, the push-down-qual approach would be better than pushing down or pulling up aggregates. My only concern was the multiple-aggregate-call case, i.e. when more than one row of size_m qualifies. But it is clear that each aggregate call scans only the qualifying size_l rows if we can push the parameter qual down into the subquery. The nestloop-with-parameter-pushed-down-to-aggregate-subquery approach is also more general, because it doesn't depend on the "primary key" issue: you can push the parameter down whenever the total execution cost comes out smaller.

So, I'm now walking through the planner code, and I've finally found a clue about where to start. The current problem with the parameterized-nestloop idea in the general case is that while a nested index scan pushes a parameter into the other join relation, the more general approach needs to do the same with a subquery. A nested index scan Path is generated in match_unsorted_outer(), which sits deep inside make_rel_from_joinlist(), which runs after set_base_rel_pathlist(), which contains set_subquery_pathlist(), which calls subquery_planner(). This means that if you want to add code that generates a "general nestloop with parameter pushed down" during the join search, it is too late to push the parameter down into the subquery, because the subquery has already been planned by that point.

So we need to add a new operation before subquery_planner(). That is hard, because no join-relevant information is available at that stage. But I bet that in some simple cases, like my aggregate join, it is possible to derive a parameter from the join qual. (I haven't written any code yet, nor proven the theory.) In this case we need to plan the subquery twice, once plain and once parameterized.

Apart from the subquery case, LATERAL should be fine with something close to the nested-index-scan approach, since the joined relation is a FunctionScan, which is not planned like a subquery. The "s JOIN (l1 LEFT JOIN l2)" case is unclear: subquery approach or index scan approach? Maybe a third way, because l1 LEFT JOIN l2 sits inside the deconstructed jointree, which is neither a planned subquery nor a plain RelOptInfo like a base relation or function scan.

Regards,

-- 
Hitoshi Harada
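For reference, a sketch of the LATERAL spelling of the original query (LATERAL syntax was not available in PostgreSQL when this thread was written; the sketch assumes a release that supports it):

select m.id as m_id, l.sum_len
from size_m m
cross join lateral (select sum(length(val)) as sum_len
                    from size_l
                    where m_id = m.id) l
where m.val = '1';

This expresses the per-outer-row parameterization directly in SQL; as with the correlated form shown earlier in the thread, M rows with no match in L come back with a NULL sum rather than being dropped by the join.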