Re: Pull up aggregate subquery - Mailing list pgsql-hackers

From Hitoshi Harada
Subject Re: Pull up aggregate subquery
Date
Msg-id BANLkTikX_HELz7Yzx9hxXKE5s24zeiMdvQ@mail.gmail.com
Whole thread Raw
In response to Re: Pull up aggregate subquery  (Simon Riggs <simon@2ndQuadrant.com>)
List pgsql-hackers
2011/6/4 Simon Riggs <simon@2ndquadrant.com>:
>
> I like your simple patch and looks like it fixes your concern.

Thanks for your interest. I forgot to mention but this type of query
is quite general in one-to-many entities and likely to be generated by
simple ORMappers.

> Your problem statement ignores the fact that most people would not
> write the "original query" like this
>
> select m_id, sum_len from size_m m inner join(select m_id,
> sum(length(val)) as sum_len from size_l group by m_id)l on m.id =
> l.m_id where val = '1';
>
> they would write it like this
>
> select m_id, sum(length(val))
> from size_m m join size_l l on m.id = l.m_id
> where val = '1'
> group by m_id;

Hm, I didn't notice this hand transformation. Will check what's
happened. But my example is simplified and it might be likely that
some other joins (not uniquely) to size_m.

> In terms of costing, I think it would be likely that we can apply
> simple heuristics. We already assume that applying quals down to the
> lowest level possible make sense. I would guess that anything that
> reduces the number of rows should be pushed down as far as possible.
> I'm sure there are cases where that isn't true, but lets not stop from
> solving simple general cases because of the theoretical existence of
> complex cases

Agreed. After more thought, push-down-qual approach would be better
than push down/pull up aggregates. The only concern was multiple
aggregate call case in such cases like more rows than one are
qualified in size_m. But it is clear that each aggregate call scans
only qualified size_l rows if we can push down parameter qual to the
subquery. Nestloop with parameter push down to aggregate subquery
appoach is more general because it doesn't concern about "primary key"
issue. You can push it down whenever the total execution cost is
smaller.

So, I'm now walking through planner code and finally I found the clue
to start. First, the current problem of parameterized nestloop idea in
general case is that while nested index scan pushes parameter to the
other join relation, more general approach needs to do it with
subquery. A nested index scan Path is generated in
match_unsorted_outer(), which is at much deeper of
make_rel_from_joinlist(), which is after set_base_rel_pathlist(),
which contains set_subquery_pathlist(), which calls
subquery_planner(). This means that if you want to add code to
generate "general nestloop with parameter push down" during join
search process, it is too late to push down the parameter to subquery,
because subquery was already planned at that time.

So we need to add new operation before subquery_planner(). It is hard
because any join-relevant information is not ready at the stage. But I
bet some simple cases like my aggregate-join can find it possible to
make parameter from join qual. (I haven't yet written any code nor
proven my theory). In this case we need to plan subquery twice, one
with pure and the other with parameterized.

Other than subquery case, LATERAL will be ok with near the nested
index scan approach, since the joinned relation is FunctionScan, which
is not planned lie subquery. "s JOIN(l1 LEFT JOIN l2)" case is unclear
which of subquery or index scan. Maybe the third way, because l1 LEFT
JOIN l2 is inside deconstructed jointree which is not planned subquery
but also not plain RelOptInfo like base relation / function scan.

Regards,

-- 
Hitoshi Harada


pgsql-hackers by date:

Previous
From: "Kevin Grittner"
Date:
Subject: Re: reducing the overhead of frequent table locks - now, with WIP patch
Next
From: Tom Lane
Date:
Subject: Re: Postmaster holding unlinked files for pg_largeobject table