Thread: Pull up aggregate subquery

Pull up aggregate subquery

From
Hitoshi Harada
Date:
I sometimes wonder if we could pull up an aggregate subquery in the optimizer.

My typical problem is:

Consider two relations, a medium-sized one M and a large one L. L has a
reference column to M's primary key. Say,

    create table size_m as
        select i as id, repeat(i::text, i % 100) as val
        from generate_series(1, 20000) i;

    create table size_l as
        select i as id, m.id as m_id, repeat(i::text, i % 100) as val
        from generate_series(1, 100000) i, size_m m
        where i.i / 10 = m.id;

Now, suppose you want to query M under some condition, joined with an
aggregate over L grouped by M's primary key.

    select m_id, sum_len
    from size_m m
    inner join (select m_id, sum(length(val)) as sum_len
                from size_l
                group by m_id) l on m.id = l.m_id
    where val = '1';

The generated plan is:

                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=36116.92..38339.67 rows=50 width=235) (actual time=440.679..1039.964 rows=1 loops=1)
   Join Filter: (m.id = size_l.m_id)
   ->  Seq Scan on size_m m  (cost=0.00..897.00 rows=1 width=227) (actual time=0.021..16.698 rows=1 loops=1)
         Filter: (val = '1'::text)
   ->  GroupAggregate  (cost=36116.92..37217.09 rows=10026 width=248) (actual time=440.651..1013.704 rows=10000 loops=1)
         ->  Sort  (cost=36116.92..36366.90 rows=99991 width=248) (actual time=440.619..593.062 rows=99991 loops=1)
               Sort Key: size_l.m_id
               Sort Method: external sort  Disk: 25736kB
               ->  Seq Scan on size_l  (cost=0.00..4565.91 rows=99991 width=248) (actual time=0.011..138.243 rows=99991 loops=1)
 Total runtime: 1044.039 ms
(10 rows)

Note that most of the aggregate's result is discarded by the join with M,
because the filter on M.val leaves only a small output from M. If we could
filter M first and then filter L by M's result before running the
aggregate, the response could get dramatically faster.

If you filter by M.id instead of M.val, the optimizer is smart enough
to push the condition down to L, which is then filtered before the aggregate.

    select m_id, sum_len
    from size_m m
    inner join (select m_id, sum(length(val)) as sum_len
                from size_l
                group by m_id) l on m.id = l.m_id
    where id = 1;

                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..5713.02 rows=2 width=235) (actual time=72.121..82.364 rows=1 loops=1)
   ->  Seq Scan on size_m m  (cost=0.00..897.00 rows=1 width=227) (actual time=0.028..10.252 rows=1 loops=1)
         Filter: (id = 1)
   ->  GroupAggregate  (cost=0.00..4815.98 rows=2 width=248) (actual time=72.065..72.067 rows=1 loops=1)
         ->  Seq Scan on size_l  (cost=0.00..4815.89 rows=10 width=248) (actual time=0.051..71.968 rows=10 loops=1)
               Filter: (m_id = 1)
 Total runtime: 82.525 ms
(7 rows)

This shows the benefit of EquivalenceClasses: 1 = M.id = L.m_id is
implied, and the optimizer adds the implicit constant filter L.m_id = 1 to
the plan tree. In contrast, in the former case M.val = '1' doesn't imply
any condition on L.m_id. That's fair enough.

However, I think we can filter L by L.m_id before the aggregate, because
L.m_id is part of the group clause as well as the join condition, and M.id
is unique in M. In such cases, the query can be transformed into something
like:
    GroupAggregate
      -> NestLoop (L.m_id = M.id)
        -> SeqScan L
        -> SeqScan M (filter: M.val = '1')
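
Semantically, the transformed query is the same as this hand-rewritten SQL
(a sketch for illustration; it is equivalent only because M.id is unique,
so joining before aggregating does not duplicate any size_l rows):

    select l.m_id, sum(length(l.val)) as sum_len
    from size_m m
    inner join size_l l on m.id = l.m_id
    where m.val = '1'
    group by l.m_id;
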
This transformation can be done by pulling up the aggregate Query in
pull_up_subqueries(). Currently the optimizer doesn't pull up any query
that contains an aggregate, but as shown above, in some cases we can do
it. Attached is a WIP proof-of-concept patch that does it. I know it
breaks general queries, but it performs the transformation I described
above. I suppose the missing piece is the condition for "when" to pull up
an aggregate; the "how" is done in the patch.

db1=# explain select m_id, sum_len
      from size_m m
      inner join (select m_id, sum(length(val)) as sum_len
                  from size_l
                  group by m_id) l on m.id = l.m_id
      where val = '1';
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=6712.96..6713.16 rows=10 width=471) (actual time=125.496..125.499 rows=1 loops=1)
   ->  Sort  (cost=6712.96..6712.99 rows=10 width=471) (actual time=125.228..125.288 rows=10 loops=1)
         Sort Key: size_l.m_id
         Sort Method: quicksort  Memory: 25kB
         ->  Nested Loop  (cost=0.00..6712.80 rows=10 width=471) (actual time=0.142..125.089 rows=10 loops=1)
               Join Filter: (m.id = size_l.m_id)
               ->  Seq Scan on size_m m  (cost=0.00..897.00 rows=1 width=227) (actual time=0.037..8.956 rows=1 loops=1)
                     Filter: (val = '1'::text)
               ->  Seq Scan on size_l  (cost=0.00..4565.91 rows=99991 width=248) (actual time=0.060..65.910 rows=99991 loops=1)
 Total runtime: 125.658 ms
(10 rows)

(You may notice that after the transformation a non-group/non-agg
TargetEntry appears in the Agg node, but there's actually no check for
that in either the optimizer or the executor. Maybe thanks to functional
dependency.)

Comments?

Regards,

--
Hitoshi Harada

Attachment

Re: Pull up aggregate subquery

From
Tom Lane
Date:
Hitoshi Harada <umi.tanuki@gmail.com> writes:
> I sometimes wonder if we could pull up an aggregate subquery in the optimizer.

I don't have time to look at this right now, but please add to the
upcoming commitfest.
        regards, tom lane


Re: Pull up aggregate subquery

From
Hitoshi Harada
Date:
2011/5/5 Tom Lane <tgl@sss.pgh.pa.us>:
> Hitoshi Harada <umi.tanuki@gmail.com> writes:
>> I sometimes wonder if we could pull up an aggregate subquery in the optimizer.
>
> I don't have time to look at this right now, but please add to the
> upcoming commitfest.

Done.
https://commitfest.postgresql.org/action/patch_view?id=548

I'll work further if I find time.

Regards,

-- 
Hitoshi Harada


Re: Pull up aggregate subquery

From
Hitoshi Harada
Date:
2011/5/5 Hitoshi Harada <umi.tanuki@gmail.com>:
> https://commitfest.postgresql.org/action/patch_view?id=548
>
> I'll work further if I find time.

After more thought, pulling up an aggregate subquery under such narrow
conditions is quite a hard path, especially when the join involves more
than two relations. It is hard to check that pulling up is safe for
relations other than the target relation.

It was a big shame that I missed Tom Lane's session at PGCon, but seeing
"Parameterized Scan" in his slides, it occurred to me that it might help
my problem, too. Before hitting on the "pull up" idea, I had wondered
whether it would be possible to push the outer Var of the join down into
the Agg's HAVING, from where it would be transferred to the underlying
SeqScan's filter. That would result in something like:

 NestLoop
   -> SeqScan M (filter: M.val = '1')
   -> GroupAggregate
     -> SeqScan L (filter: L.m_id = M.id)

However, we currently have no mechanism to push an outer Var down as a
qual except through a NestLoop. Yes, it could be done even now, but we
should avoid executing the Agg in an N-loop: we want to scan the Agg once,
with Param $1 = M.id covering multiple values. Since I didn't attend his
session I'm afraid I don't understand "Parameterized Scan" correctly, but
once we have such a mechanism, the example introduced in Robert Haas's
blog [1] (originally shown by Andrew Gierth [2]), and maybe LATERAL, could
be handled as well.
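
To make that concrete, the effect I am after corresponds roughly to this
hypothetical LATERAL-style query, in which the aggregate subquery can see
the outer M row (a sketch only, not syntax we support today):

    select l.m_id, l.sum_len
    from size_m m,
         lateral (select m_id, sum(length(val)) as sum_len
                  from size_l
                  where m_id = m.id
                  group by m_id) l
    where m.val = '1';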

Do I understand correctly? If so, could someone explain more detail of
how to get Parameterized Scan in the planner?

Regards,

[1]: http://rhaas.blogspot.com/2010/04/finding-our-way-to-lateral.html
[2]: http://archives.postgresql.org/pgsql-hackers/2009-09/msg00525.php

-- 
Hitoshi Harada


Re: Pull up aggregate subquery

From
Robert Haas
Date:
On Sat, May 21, 2011 at 12:49 PM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
> 2011/5/5 Hitoshi Harada <umi.tanuki@gmail.com>:
>> https://commitfest.postgresql.org/action/patch_view?id=548
>>
>> I'll work further if I find time.
>
> After more thought, pulling up an aggregate subquery under such narrow
> conditions is quite a hard path, especially when the join involves more
> than two relations. It is hard to check that pulling up is safe for
> relations other than the target relation.
>
> It was a big shame that I missed Tom Lane's session at PGCon, but seeing
> "Parameterized Scan" in his slides, it occurred to me that it might help
> my problem, too. Before hitting on the "pull up" idea, I had wondered
> whether it would be possible to push the outer Var of the join down into
> the Agg's HAVING, from where it would be transferred to the underlying
> SeqScan's filter. That would result in something like:
>
>  NestLoop
>    -> SeqScan M (filter: M.val = '1')
>    -> GroupAggregate
>      -> SeqScan L (filter: L.m_id = M.id)
>
> However, we currently have no mechanism to push an outer Var down as a
> qual except through a NestLoop. Yes, it could be done even now, but we
> should avoid executing the Agg in an N-loop: we want to scan the Agg
> once, with Param $1 = M.id covering multiple values. Since I didn't
> attend his session I'm afraid I don't understand "Parameterized Scan"
> correctly, but once we have such a mechanism, the example introduced in
> Robert Haas's blog [1] (originally shown by Andrew Gierth [2]), and
> maybe LATERAL, could be handled as well.
>
> Do I understand correctly? If so, could someone explain more detail of
> how to get Parameterized Scan in the planner?

I think we're going to need Tom to give the definitive word on this,
but I believe that the current situation is that the executor is
capable of handling a parameterized scan (yeah!) but the planner
doesn't know how to generate them (boo!).   This is an improvement of
a sort over the 9.0 code base, where neither the planner nor the
executor could handle this case, but we need planner support in order
to get anywhere useful with it.

The problem is how to figure out whether a parameterized scan is a win
without expending too much planning time.  For example, in the case
you mention:

select m_id, sum_len from size_m m inner join (select m_id,
sum(length(val)) as sum_len from size_l group by m_id) l on m.id =
l.m_id where val = '1';

...we'd need to plan the subquery twice, once with a parameterized
qual m_id = $1 pushed down, and once without that.  We could then
compare the cost of a nest-loop with the qual to the cost of a merge
or hash join without it.  But this seems very expensive.  In the
particular case you have here, the subquery is simple enough that this
probably wouldn't be any big deal, but in general there's no reason
why that subquery couldn't be quite complex - or why it couldn't have
subqueries of its own that would require the same treatment
recursively.
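
Concretely, the two shapes of the subquery that would have to be costed
look roughly like this (just a sketch; the PREPARE merely stands in for
the pushed-down join parameter):

    -- unparameterized: aggregate all of size_l, then join
    select m_id, sum(length(val)) as sum_len
    from size_l
    group by m_id;

    -- parameterized: the join qual is pushed down as m_id = $1, so only
    -- the matching size_l rows are aggregated
    prepare size_l_agg(int) as
        select m_id, sum(length(val)) as sum_len
        from size_l
        where m_id = $1
        group by m_id;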

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Pull up aggregate subquery

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Sat, May 21, 2011 at 12:49 PM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
>> Do I understand correctly? If so, could someone explain more detail of
>> how to get Parameterized Scan in the planner?

> I think we're going to need Tom to give the definitive word on this,
> but I believe that the current situation is that the executor is
> capable of handling a parameterized scan (yeah!) but the planner
> doesn't know how to generate them (boo!).   This is an improvement of
> a sort over the 9.0 code base, where neither the planner nor the
> executor could handle this case, but we need planner support in order
> to get anywhere useful with it.

Yeah.  I fixed the executor in 9.1, but got hung up on how to fix the
planner.  So the planner still only knows how to do this for the case of
an inner indexscan, ie, it can't handle parameterizing any piece of a
plan larger than a single indexscan.  I have some ideas about fixing
that but have had no time to pursue them since December :-(

> ...we'd need to plan the subquery twice, once with a parameterized
> qual m_id = $1 pushed down, and once without that.  We could then
> compare the cost of a nest-loop with the qual to the cost of a merge
> or hash join without it.  But this seems very expensive.  In the
> particular case you have here, the subquery is simple enough that this
> probably wouldn't be any big deal, but in general there's no reason
> why that subquery couldn't be quite complex - or why it couldn't have
> subqueries of its own that would require the same treatment
> recursively.

Yeah.  For simple scan/join queries it seems likely that we only care
about parameterizing indexscans, since the main opportunity for a win is
to not scan all of a large table.  Restricting things that way would
help reduce the number of extra Paths to carry around.  But I'm not sure
whether the same argument can be made for arbitrary subqueries.
        regards, tom lane


Re: Pull up aggregate subquery

From
Hitoshi Harada
Date:
2011/5/24 Robert Haas <robertmhaas@gmail.com>:
> On Sat, May 21, 2011 at 12:49 PM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
>> 2011/5/5 Hitoshi Harada <umi.tanuki@gmail.com>:
>> Do I understand correctly? If so, could someone explain more detail of
>> how to get Parameterized Scan in the planner?
>
> I think we're going to need Tom to give the definitive word on this,
> but I believe that the current situation is that the executor is
> capable of handling a parameterized scan (yeah!)

Ah, grepping the git log pointed me to
http://git.postgresql.org/gitweb?p=postgresql.git;a=commit;h=53e757689ce94520f1c53a89dbaa14ea57b09da7

> The problem is how to figure out whether a parameterized scan is a win
> without expending too much planning time.  For example, in the case
> you mention:
>
> select m_id, sum_len from size_m m inner join (select m_id,
> sum(length(val)) as sum_len from size_l group by m_id) l on m.id =
> l.m_id where val = '1';
>
> ...we'd need to plan the subquery twice, once with a parameterized
> qual m_id = $1 pushed down, and once without that.  We could then
> compare the cost of a nest-loop with the qual to the cost of a merge
> or hash join without it.  But this seems very expensive.  In the
> particular case you have here, the subquery is simple enough that this
> probably wouldn't be any big deal, but in general there's no reason
> why that subquery couldn't be quite complex - or why it couldn't have
> subqueries of its own that would require the same treatment
> recursively.

That's true. But if planning cost is the only issue, why not add a new
GUC so the user can choose whether they prefer it or not? Of course, if
we have some method to predict which way to go without planning both
ways, that's great. Do you have some blueprint for it?

In addition, I wonder if the "generalized nestloop" pattern can do
whole outer scan before proving inner? I mean, in my example if the
outer scan qualified more than one tuple I'm afraid that inner Agg and
its underlying SeqScan are re-scanned more than one. That will bloat
performance if the Agg is expensive. It would be great if we can outer
scan can do scan to the end and stores param variables in something
like array and prove once inner Agg, so that we can ensure the Agg is
computed once. To do that, do we need more work on executor?
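
In SQL terms, the behavior I have in mind is roughly this (only a sketch;
it works here because every output column of the query comes from the
subquery):

    -- gather the qualifying M keys once, then aggregate the matching L
    -- rows in a single pass
    select l.m_id, l.sum_len
    from (select m_id, sum(length(val)) as sum_len
          from size_l
          where m_id = any (array(select id from size_m where val = '1'))
          group by m_id) l;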

Regards,



--
Hitoshi Harada


Re: Pull up aggregate subquery

From
Hitoshi Harada
Date:
2011/5/24 Tom Lane <tgl@sss.pgh.pa.us>:
> Robert Haas <robertmhaas@gmail.com> writes:
> Yeah.  I fixed the executor in 9.1, but got hung up on how to fix the
> planner.  So the planner still only knows how to do this for the case of
> an inner indexscan, ie, it can't handle parameterizing any piece of a
> plan larger than a single indexscan.  I have some ideas about fixing
> that but have had no time to pursue them since December :-(

If you can spare the time, could you describe the ideas? Or I'd be glad
if you could tell me where to start looking. Anyway, I'm going to dig into
the optimizer more by myself.

> But I'm not sure
> whether the same argument can be made for arbitrary subqueries.

I guess it's hard to apply the mechanism to arbitrary subqueries, but
there are many use cases like my example, with an Agg node on the inner side.

Regards,

--
Hitoshi Harada


Re: Pull up aggregate subquery

From
Robert Haas
Date:
On Mon, May 23, 2011 at 4:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> ...we'd need to plan the subquery twice, once with a parameterized
>> qual m_id = $1 pushed down, and once without that.  We could then
>> compare the cost of a nest-loop with the qual to the cost of a merge
>> or hash join without it.  But this seems very expensive.  In the
>> particular case you have here, the subquery is simple enough that this
>> probably wouldn't be any big deal, but in general there's no reason
>> why that subquery couldn't be quite complex - or why it couldn't have
>> subqueries of its own that would require the same treatment
>> recursively.
>
> Yeah.  For simple scan/join queries it seems likely that we only care
> about parameterizing indexscans, since the main opportunity for a win is
> to not scan all of a large table.  Restricting things that way would
> help reduce the number of extra Paths to carry around.  But I'm not sure
> whether the same argument can be made for arbitrary subqueries.

I must be misunderstanding you, because index scans are the thing we
already *do* parameterize; and what else would make any sense?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Pull up aggregate subquery

From
Robert Haas
Date:
On Mon, May 23, 2011 at 10:47 PM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
> That's true. But if planning cost is the only issue, why not add a new
> GUC so the user can choose whether they prefer it or not? Of course, if
> we have some method to predict which way to go without planning both
> ways, that's great. Do you have some blueprint for it?

I don't really like the idea of adding a GUC for this, unless we
convince ourselves that nothing else is sensible.  I mean, that leads
to conversations like this:

Newbie: My query is slow.
Hacker: Turn on enable_magic_pixie_dust and it'll get better.
Newbie: Oh, yeah, that is better.  Why isn't this turned on by default, anyway?
Hacker: Well, on pathological queries, it makes planning take way too
long, so we think it's not really safe to enable it by default.
Newbie: Wait... I thought you just told me to enable it.  It's not safe?
Hacker: Well, sort of.  I mean, uh... hey, look, an elephant!

I think that if you're interested in working on this, the first step
would be to get it to work at all, even if it's a bit inefficient.
Even though in theory you could get exponential planning time growth
if you have an aggregate inside of an aggregate inside of an aggregate
inside of an aggregate, very few people are going to do that.  If it
turns out to be a problem, we can dream up some suitable defense; as I
think about it a little more, I don't think that would be terribly
difficult to work out.  The hard part is probably to get this to work
in the first place.

> In addition, I wonder if the "generalized nestloop" pattern could do the
> whole outer scan before probing the inner side. I mean, in my example,
> if the outer scan qualifies more than one tuple, I'm afraid the inner
> Agg and its underlying SeqScan get re-scanned more than once. That will
> hurt performance if the Agg is expensive. It would be great if the outer
> scan could run to the end, store the param values in something like an
> array, and then probe the inner Agg once, so that we can ensure the Agg
> is computed only once. To do that, do we need more work on the executor?

That sounds to me like way too much complexity for the first
go-around.  We should either push the agg down or not; it sounds like
you're proposing some kind of half-way state.  That might not be
useful in practice, and it will certainly be a lot more work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Pull up aggregate subquery

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, May 23, 2011 at 4:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yeah.  For simple scan/join queries it seems likely that we only care
>> about parameterizing indexscans, since the main opportunity for a win is
>> to not scan all of a large table.  Restricting things that way would
>> help reduce the number of extra Paths to carry around.  But I'm not sure
>> whether the same argument can be made for arbitrary subqueries.

> I must be misunderstanding you, because index scans are the thing we
> already *do* parameterize; and what else would make any sense?

The point I was trying to make is that the ultimate reason for having a
parameterized portion-of-a-plan will be that there's a parameterized
indexscan somewhere down at the bottom.  I had originally imagined that
we might parameterize any old scan; for example consider replacing
        Nestloop
          Join Filter: a.x = b.y
          -> Seqscan on a
          -> Seqscan on b

with
        Nestloop
          -> Seqscan on a
          -> Seqscan on b
               Filter: b.y = a.x

Although this isn't nearly as useful as if we could be using an index on
b.y, there would still be some marginal gain to be had, because we'd be
able to reject rows of b without first passing them up to the join node.
But I'm afraid that going all-out like that would slow the planner down
far too much (too many Paths to consider) to be justified by a marginal
runtime gain.

So the idea I have at the moment is that we'll still only parameterize
indexscans, but then allow those to be joined to unrelated relations
while still remaining parameterized.  That should reduce the number
of parameterized Paths hanging around, because only joinclauses that
match indexes will give rise to such Paths.
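
For example, reusing the a/b sketch above, the joinclause a.x = b.y would
only give rise to such a parameterized Path if it matches an index, say
(an illustrative index on the made-up table above):

    create index b_y_idx on b (y);

so that the inner side can become a parameterized indexscan with an index
condition on b.y rather than a filtered seqscan.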

But I think this is all fairly unrelated to the case that Hitoshi is on
about.  As you said earlier, it seems like we'd have to derive both
parameterized and unparameterized plans for the subquery, which seems
mighty expensive.
        regards, tom lane


Re: Pull up aggregate subquery

From
Robert Haas
Date:
On Tue, May 24, 2011 at 11:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I must be misunderstanding you, because index scans are the thing we
>> already *do* parameterize; and what else would make any sense?
>
> The point I was trying to make is that the ultimate reason for having a
> parameterized portion-of-a-plan will be that there's a parameterized
> indexscan somewhere down at the bottom.  I had originally imagined that
> we might parameterize any old scan; for example consider replacing
>
>        Nestloop
>          Join Filter: a.x = b.y
>          -> Seqscan on a
>          -> Seqscan on b
>
> with
>
>        Nestloop
>          -> Seqscan on a
>          -> Seqscan on b
>               Filter: b.y = a.x

Oh, I see.  I have a general gripe with nested loop plans: we already
consider too many of them.  IIRC, when I last fooled around with this,
the number of nested loop paths that we generate far exceeded the
number of merge or hash join paths, and most of those paths suck and
are a complete waste of time.  It strikes me that we ought to be
trying to find ways to get rid of some of the paths we're already
considering, rather than adding any more.  In this particular case, if
the second plan is actually faster, it probably won't be by much; we
could think about trying to make some kind of ex-post-facto
transformation instead of throwing everything into the path machinery.

> Although this isn't nearly as useful as if we could be using an index on
> b.y, there would still be some marginal gain to be had, because we'd be
> able to reject rows of b without first passing them up to the join node.
> But I'm afraid that going all-out like that would slow the planner down
> far too much (too many Paths to consider) to be justified by a marginal
> runtime gain.

Agreed.

> So the idea I have at the moment is that we'll still only parameterize
> indexscans, but then allow those to be joined to unrelated relations
> while still remaining parameterized.  That should reduce the number
> of parameterized Paths hanging around, because only joinclauses that
> match indexes will give rise to such Paths.

That seems fine, yeah.  If anything, we might want to limit it even
more, but certainly that's a good place to start, and see how it
shakes out.

> But I think this is all fairly unrelated to the case that Hitoshi is on
> about.  As you said earlier, it seems like we'd have to derive both
> parameterized and unparameterized plans for the subquery, which seems
> mighty expensive.

That was my first thought, too, but then I wondered if I was getting
cheap.  Most of the time, the subquery will be something simple, and
replanning it twice won't really matter much.  If it happens to be
something complicated, then it will take longer, but on the other hand
that's exactly the sort of byzantine query where you probably want the
planner to pull out all the stops.  Aggregates tend to "feel" slow
almost invariably, because the amount of data being processed under
the hood is much larger than what actually gets emitted, so I think we
should at least consider the possibility that users really won't care
about a bit of extra work.  The case I'm concerned about is where you
have several levels of nested aggregates, and the effect starts to
multiply.  But even if that turns out to be a problem, we could handle
it by limiting consideration of the alternate path to the top 1 or 2
levels and handle the rest as we do now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Pull up aggregate subquery

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, May 24, 2011 at 11:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The point I was trying to make is that the ultimate reason for having a
>> parameterized portion-of-a-plan will be that there's a parameterized
>> indexscan somewhere down at the bottom.

> Oh, I see.  I have a general gripe with nested loop plans: we already
> consider too many of them.  IIRC, when I last fooled around with this,
> the number of nested loop paths that we generate far exceeded the
> number of merge or hash join paths, and most of those paths suck and
> are a complete waste of time.

Hm, really?  My experience is that it's the mergejoin paths that breed
like rabbits, because there are so many potential sort orders.

>> But I think this is all fairly unrelated to the case that Hitoshi is on
>> about.  As you said earlier, it seems like we'd have to derive both
>> parameterized and unparameterized plans for the subquery, which seems
>> mighty expensive.

> That was my first thought, too, but then I wondered if I was getting
> cheap.

Yeah, it's certainly possible that we're worrying too much.  Usually
I only get concerned about added planner logic if it will impact the
planning time for simple queries.  "Simple" tends to be in the eye of
the beholder, but something with a complicated aggregate subquery is
probably not simple by anyone's definition.

In this case the sticky point is that there could be multiple possible
sets of clauses available to be pushed down, depending on what you
assume is the outer relation for the eventual upper-level nestloop.
So worst case, you could have not just one parameterized plan to
generate in addition to the regular kind, but 2^N of them ...
        regards, tom lane


Re: Pull up aggregate subquery

From
Hitoshi Harada
Date:
2011/5/25 Tom Lane <tgl@sss.pgh.pa.us>:
> Robert Haas <robertmhaas@gmail.com> writes:
>> That was my first thought, too, but then I wondered if I was getting
>> cheap.
>
> Yeah, it's certainly possible that we're worrying too much.  Usually
> I only get concerned about added planner logic if it will impact the
> planning time for simple queries.  "Simple" tends to be in the eye of
> the beholder, but something with a complicated aggregate subquery is
> probably not simple by anyone's definition.
>
> In this case the sticky point is that there could be multiple possible
> sets of clauses available to be pushed down, depending on what you
> assume is the outer relation for the eventual upper-level nestloop.
> So worst case, you could have not just one parameterized plan to
> generate in addition to the regular kind, but 2^N of them ...

My intention is that if the join qual matches the subquery Agg's grouping
keys then the Var can be pushed down, so I'm not worried about an
exponential growth in the number of paths.

And I found the right place to hack, where set_subquery_pathlist()
pushes down some baserestrictinfo. We don't have Vars in those
RestrictInfos now, but I guess we can put them in somehow before
reaching there.

Even if I can do it, the only effective case is when the outer side
produces just one tuple. As I noted earlier, this optimization is only
complete with the executor's cooperation, namely something like gathering
the param values into an array before starting the Agg scan. So I'm still
thinking which of pulling up and parameterized scan is better.

Regards,



--
Hitoshi Harada


Re: Pull up aggregate subquery

From
Robert Haas
Date:
On Tue, May 24, 2011 at 12:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Oh, I see.  I have a general gripe with nested loop plans: we already
>> consider too many of them.  IIRC, when I last fooled around with this,
>> the number of nested loop paths that we generate far exceeded the
>> number of merge or hash join paths, and most of those paths suck and
>> are a complete waste of time.
>
> Hm, really?  My experience is that it's the mergejoin paths that breed
> like rabbits, because there are so many potential sort orders.

*scratches head*

Well, I'm pretty sure that's how it looked when I was testing it.  I
wonder how this could be different for the two of us.  Or maybe one of
us is confused.  Admittedly, I haven't looked at it in a while.

>>> But I think this is all fairly unrelated to the case that Hitoshi is on
>>> about.  As you said earlier, it seems like we'd have to derive both
>>> parameterized and unparameterized plans for the subquery, which seems
>>> mighty expensive.
>
>> That was my first thought, too, but then I wondered if I was getting
>> cheap.
>
> Yeah, it's certainly possible that we're worrying too much.  Usually
> I only get concerned about added planner logic if it will impact the
> planning time for simple queries.  "Simple" tends to be in the eye of
> the beholder, but something with a complicated aggregate subquery is
> probably not simple by anyone's definition.
>
> In this case the sticky point is that there could be multiple possible
> sets of clauses available to be pushed down, depending on what you
> assume is the outer relation for the eventual upper-level nestloop.
> So worst case, you could have not just one parameterized plan to
> generate in addition to the regular kind, but 2^N of them ...

Hmm.  Well, 2^N is more than 2.  But I bet most of them are boring.
Judging by his followup email, Hitoshi Harada seems to think we can
just look at the case where we can parameterize on all of the grouping
columns.  The only other case that seems like it might be interesting
is parameterizing on any single one of the grouping columns.  I can't
get excited about pushing down arbitrary subsets.  Of course, even
O(N) in the number of grouping columns might be too much, but then we
could fall back to just all or nothing.  I think the "all" case by
itself would probably extract 90%+ of the benefit, especially since
"all" will often mean "the only one there is".

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Pull up aggregate subquery

From
Hitoshi Harada
Date:
2011/5/25 Hitoshi Harada <umi.tanuki@gmail.com>:
> So I'm still
> thinking which of pulling up and parameterized scan is better.

After more investigation I came up with third idea, pushing down
RangeTblEntry to aggregate subquery. This sounds like a crazy idea,
but it seems to me like it is slightly easier than pulling up
agg-subquery. The main point is that when you want to pull up, you
must care if the final output would be correct with other join
relations than the aggregate-related one. In contrast, when pushing
down the target relation to agg-subquery it is simple to ensure the
result.
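
As a rough illustration in SQL (a sketch only; it is valid here because
m.id is unique and the join qual is exactly the grouping key):

    select m_id, sum_len
    from (select l.m_id, sum(length(l.val)) as sum_len
          from size_l l
          join size_m m on m.id = l.m_id
          where m.val = '1'
          group by l.m_id) sub;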

I'm looking around pull_up_subqueries() in subquery_planner() to add
the pushing down logic. It could be possible to do it around
make_one_rel() but I bet query structure changes are doable against
Query, not PlannerInfo.

I appreciate any comments.

Regards,

-- 
Hitoshi Harada


Re: Pull up aggregate subquery

From
Tom Lane
Date:
Hitoshi Harada <umi.tanuki@gmail.com> writes:
> 2011/5/25 Hitoshi Harada <umi.tanuki@gmail.com>:
>> So I'm still
>> thinking which of pulling up and parameterized scan is better.

> After more investigation I came up with third idea, pushing down
> RangeTblEntry to aggregate subquery. This sounds like a crazy idea,

Yes, it sure does.  Won't that typically change the number of rows
produced by the subquery's jointree?  And thus possibly change the
aggregate's result?
        regards, tom lane


Re: Pull up aggregate subquery

From
Hitoshi Harada
Date:
2011/5/26 Tom Lane <tgl@sss.pgh.pa.us>:
> Hitoshi Harada <umi.tanuki@gmail.com> writes:
>> 2011/5/25 Hitoshi Harada <umi.tanuki@gmail.com>:
>>> So I'm still
>>> thinking which of pulling up and parameterized scan is better.
>
>> After more investigation I came up with third idea, pushing down
>> RangeTblEntry to aggregate subquery. This sounds like a crazy idea,
>
> Yes, it sure does.  Won't that typically change the number of rows
> produced by the subquery's jointree?  And thus possibly change the
> aggregate's result?

No, because it will be restricted to the assumption that the join qual ==
the grouping key and the outer relation is unique on the var in that join
qual. Pushing the baserel down into the sub-aggregate is eventually
equivalent to pulling up the sub-aggregate if there are only two relations
(which is my first example). The difference is whether the optimizer needs
to care about other relations, which may change the number of rows
produced.

Regards,

--
Hitoshi Harada


Re: Pull up aggregate subquery

From
Robert Haas
Date:
On Wed, May 25, 2011 at 10:35 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
> 2011/5/25 Hitoshi Harada <umi.tanuki@gmail.com>:
>> So I'm still
>> thinking which of pulling up and parameterized scan is better.
>
> After more investigation I came up with third idea, pushing down
> RangeTblEntry to aggregate subquery. This sounds like a crazy idea,
> but it seems to me like it is slightly easier than pulling up
> agg-subquery. The main point is that when you want to pull up, you
> must care if the final output would be correct with other join
> relations than the aggregate-related one. In contrast, when pushing
> down the target relation to agg-subquery it is simple to ensure the
> result.
>
> I'm looking around pull_up_subqueries() in subquery_planner() to add
> the pushing down logic. It could be possible to do it around
> make_one_rel() but I bet query structure changes are doable against
> Query, not PlannerInfo.

How do you decide whether or not to push down?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Pull up aggregate subquery

From
Hitoshi Harada
Date:
2011/5/26 Robert Haas <robertmhaas@gmail.com>:
> On Wed, May 25, 2011 at 10:35 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
>> 2011/5/25 Hitoshi Harada <umi.tanuki@gmail.com>:
>>> So I'm still
>>> thinking which of pulling up and parameterized scan is better.
>>
>> After more investigation I came up with third idea, pushing down
>> RangeTblEntry to aggregate subquery. This sounds like a crazy idea,
>> but it seems to me like it is slightly easier than pulling up
>> agg-subquery. The main point is that when you want to pull up, you
>> must care if the final output would be correct with other join
>> relations than the aggregate-related one. In contrast, when pushing
>> down the target relation to agg-subquery it is simple to ensure the
>> result.
>>
>> I'm looking around pull_up_subqueries() in subquery_planner() to add
>> the pushing down logic. It could be possible to do it around
>> make_one_rel() but I bet query structure changes are doable against
>> Query, not PlannerInfo.
>
> How do you decide whether or not to push down?

Yeah, that's the problem. In addition to the conditions of join qual ==
grouping key && outer is unique on the qual, we need some criterion for
whether it should be done at all. At first I thought I could compare the
costs of two different plan nodes, generated by calling subquery_planner()
twice. But now my plan is to apply some heuristic, such as requiring the
join qual selectivity to be less than 10% or so. I don't like magic
numbers either, but given that this restructures the Query rather than the
PlannerInfo (which means we cannot use Paths), it is the only way left.
Getting it to work is my first goal anyway.

Regards,

-- 
Hitoshi Harada


Re: Pull up aggregate subquery

From
"Ross J. Reedstrom"
Date:
On Mon, May 23, 2011 at 11:08:40PM -0400, Robert Haas wrote:
> 
> I don't really like the idea of adding a GUC for this, unless we
> convince ourselves that nothing else is sensible.  I mean, that leads
> to conversations like this:
> 
> Newbie: My query is slow.
> Hacker: Turn on enable_magic_pixie_dust and it'll get better.
> Newbie: Oh, yeah, that is better.  Why isn't this turned on by default, anyway?
> Hacker: Well, on pathological queries, it makes planning take way too
> long, so we think it's not really safe to enable it by default.
> Newbie: Wait... I thought you just told me to enable it.  It's not safe?
> Hacker: Well, sort of.  I mean, uh... hey, look, an elephant!
> 

ROTFL! This needs to go on the wiki somewhere, in a discussion of why
HACKERS rejects tunable knobs as often as it does.

Ross
-- 
Ross Reedstrom, Ph.D.                                 reedstrm@rice.edu
Systems Engineer & Admin, Research Scientist        phone: 713-348-6166
Connexions                  http://cnx.org            fax: 713-348-3665
Rice University MS-375, Houston, TX 77005
GPG Key fingerprint = F023 82C8 9B0E 2CC6 0D8E  F888 D3AE 810E 88F0 BEDE


Re: Pull up aggregate subquery

From
Robert Haas
Date:
On Wed, May 25, 2011 at 1:37 PM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
>> How do you decide whether or not to push down?
>
> Yeah, that's the problem. In addition to the conditions of join qual ==
> grouping key && outer is unique on the qual, we need some criterion for
> whether it should be done at all. At first I thought I could compare the
> costs of two different plan nodes, generated by calling
> subquery_planner() twice. But now my plan is to apply some heuristic,
> such as requiring the join qual selectivity to be less than 10% or so. I
> don't like magic numbers either, but given that this restructures the
> Query rather than the PlannerInfo (which means we cannot use Paths), it
> is the only way left. Getting it to work is my first goal anyway.

I think getting it working is probably a good first goal.  I am not
really sure that we want to commit it that way, and I think my vote
would be for you to work on the approach we discussed before rather
than this one, but it's your project, and I think you'll probably
learn enough in getting it working that it will be a step forward in
any case.  The planner is complex enough that it's worth trying to get
something that works, first, before trying to make it perfect.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Pull up aggregate subquery

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I think getting it working is probably a good first goal.  I am not
> really sure that we want to commit it that way, and I think my vote
> would be for you to work on the approach we discussed before rather
> than this one, but it's your project, and I think you'll probably
> learn enough in getting it working that it will be a step forward in
> any case.  The planner is complex enough that it's worth trying to get
> something that works, first, before trying to make it perfect.

Remember Brooks' advice: "Plan to throw one away.  You will anyhow."
        regards, tom lane


Re: Pull up aggregate subquery

From
Simon Riggs
Date:
On Tue, May 24, 2011 at 3:47 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:

> That's true. But if planning cost is the only issue, why not add a new
> GUC so the user can choose whether they prefer it or not? Of course, if
> we have some method to predict which way to go without planning both
> ways, that's great. Do you have some blueprint for it?

I like your simple patch, and it looks like it fixes your concern.

Your problem statement ignores the fact that most people would not
write the "original query" like this

select m_id, sum_len from size_m m inner join(select m_id,
sum(length(val)) as sum_len from size_l group by m_id)l on m.id =
l.m_id where val = '1';

they would write it like this

select m_id, sum(length(val))
from size_m m join size_l l on m.id = l.m_id
where val = '1'
group by m_id;

Which gives a far worse plan and one that is not solved by your patch.
Your way of writing the SQL is one of the "hand optimized" ways that
an SQL expert would try to re-write the SQL. We shouldn't be
optimizing only for hand-altered code, since it can always be further
tweaked by hand. We should be optimizing the original, simple queries
(as well as other forms of expressing the same thing).

This highlights that we do not have the infrastructure to push
aggregates up or down, and that the lack of a known "primary key" for
the output of each plan node prevents us from developing a general
transformation infrastructure to solve the general case. That
particular piece of infrastructure is also an essential step towards
materialized views, which would be pretty useless without the
capability to transform aggregates up and down the join tree.

In terms of costing, I think it would be likely that we can apply
simple heuristics. We already assume that applying quals down to the
lowest level possible makes sense. I would guess that anything that
reduces the number of rows should be pushed down as far as possible.
I'm sure there are cases where that isn't true, but let's not stop ourselves from
solving simple general cases because of the theoretical existence of
complex cases.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Pull up aggregate subquery

From
Hitoshi Harada
Date:
2011/6/4 Simon Riggs <simon@2ndquadrant.com>:
>
> I like your simple patch, and it looks like it fixes your concern.

Thanks for your interest. I forgot to mention it, but this type of query
is quite common for one-to-many entities and is likely to be generated by
simple OR mappers.

> Your problem statement ignores the fact that most people would not
> write the "original query" like this
>
> select m_id, sum_len from size_m m inner join(select m_id,
> sum(length(val)) as sum_len from size_l group by m_id)l on m.id =
> l.m_id where val = '1';
>
> they would write it like this
>
> select m_id, sum(length(val))
> from size_m m join size_l l on m.id = l.m_id
> where val = '1'
> group by m_id;

Hm, I hadn't considered this hand transformation. I will check what
happens with it. But my example is simplified, and it is quite possible
that something else joins (not uniquely) to size_m.

> In terms of costing, I think it would be likely that we can apply
> simple heuristics. We already assume that applying quals down to the
> lowest level possible makes sense. I would guess that anything that
> reduces the number of rows should be pushed down as far as possible.
> I'm sure there are cases where that isn't true, but let's not stop ourselves from
> solving simple general cases because of the theoretical existence of
> complex cases.

Agreed. After more thought, the push-down-qual approach would be better
than pushing down or pulling up aggregates. The only concern was the
multiple-aggregate-call case, i.e. when more than one row of size_m
qualifies. But it is clear that each aggregate call scans only the
qualified size_l rows if we can push the parameter qual down into the
subquery. The approach of a nestloop with parameter push-down into the
aggregate subquery is more general because it does not have to worry
about the "primary key" issue. You can push it down whenever the total
execution cost is smaller.

So, I'm now walking through the planner code, and I finally found the
clue to start from. First, the problem with the parameterized nestloop
idea in the general case is that while a nested index scan pushes the
parameter into the other join relation, the more general approach needs
to do it with a subquery. A nested index scan Path is generated in
match_unsorted_outer(), which is buried deep inside
make_rel_from_joinlist(); that runs after set_base_rel_pathlist(), which
contains set_subquery_pathlist(), which calls subquery_planner(). This
means that if you want to add code to generate a "general nestloop with
parameter push down" during the join search process, it is too late to
push the parameter down into the subquery, because the subquery has
already been planned by that time.

So we need to add a new operation before subquery_planner(). That is hard
because the join-relevant information is not ready at that stage. But I
bet that in some simple cases like my aggregate join it is possible to
build the parameter from the join qual. (I haven't written any code yet,
nor proven my theory.) In that case we need to plan the subquery twice,
once plain and once parameterized.

Other than the subquery case, LATERAL should be OK with something close
to the nested index scan approach, since the joined relation is a
FunctionScan, which is not planned like a subquery. It is unclear whether
the "s JOIN (l1 LEFT JOIN l2)" case is more like the subquery case or the
index scan case. Maybe it is a third way, because l1 LEFT JOIN l2 lives
inside the deconstructed jointree, which is not a planned subquery but
also not a plain RelOptInfo like a base relation or a function scan.

Regards,

-- 
Hitoshi Harada