Re: upper planner path-ification - Mailing list pgsql-hackers

From Kouhei Kaigai
Subject Re: upper planner path-ification
Date
Msg-id 9A28C8860F777E439AA12E8AEA7694F8011090B3@BPXM15GP.gisp.nec.co.jp
Whole thread Raw
In response to upper planner path-ification  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: upper planner path-ification  (David Rowley <david.rowley@2ndquadrant.com>)
List pgsql-hackers
> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Robert Haas
> Sent: Thursday, May 14, 2015 10:39 AM
> To: pgsql-hackers@postgresql.org; Tom Lane
> Subject: [HACKERS] upper planner path-ification
> 
> Hi,
> 
> I've been pulling over Tom's occasional remarks about redoing
> grouping_planner - and maybe further layers of the planner - to work
> with Paths instead of Plans.  I've had difficulty locating all of the
> relevant threads.  Here's everything I've found so far, which I'm
> pretty sure is not everything:
> 
> http://www.postgresql.org/message-id/17400.1311716272@sss.pgh.pa.us
> http://www.postgresql.org/message-id/2614.1375730493@sss.pgh.pa.us
> http://www.postgresql.org/message-id/22721.1385048662@sss.pgh.pa.us
> http://www.postgresql.org/message-id/BANLkTinDjjFHNOzESG2J2U4GOkqLu69Zqg@mai
> l.gmail.com
> http://www.postgresql.org/message-id/8479.1418420018@sss.pgh.pa.us
> 
> I think there are two separate problems here.  First, there's the
> problem that grouping_planner() is complicated.  It's doing cost
> comparisons, but all in ad-hoc fashion rather than using a consistent
> mechanic the way add_path() does.  Generally, we can plan an aggregate
> using either (1) a hashed aggregate, (2) a sorted aggregate, or (3)
> for min or max, an index scan that just grabs the highest or lowest
> value in lieu of a full table scan.  Instead of generating a plan for
> each of these possibilities, we'd like to generate paths for each one,
> and then pick one to turn into a plan.  AIUI, the hope is that this
> would simplify the cost calculations, and also make it easier to
> inject other paths, such as a path where an FDW performs the
> aggregation step on the remote server.
> 
> Second, there's the problem that we might like to order aggregates
> with respect to joins.  If we have something like SELECT DISTINCT ON
> (foo.x) foo.x, foo.y, bar.y FROM foo, bar WHERE foo.x = bar.x, then
> (a) if foo.x has many duplicates, it will be better to DISTINCT-ify
> foo and then join to bar afterwards but (b) if foo.x = bar.x is highly
> selective, it will be better to join to bar first and then
> DISTINCT-ify the result.  Currently, aggregation is always last;
> that's not great.  Hitoshi Harada's proposed strategy of essentially
> figuring out where the aggregation steps can go and then re-planning
> for each one is also not great, because each join problem will be a
> subtree of the one we use for the aggregate-last strategy, and thus
> we're wasting effort by planning the same subtrees multiple times.
> Instead, we might imagine letting grouping planner fish out the best
> paths for the joinrels that represent possible aggregation points,
> generate aggregation paths for each of those, and then work out what
> additional rels need to be joined afterwards.  That sounds hard, but
> not as hard as doing something sensible with what we have today.
>
Once we support to add aggregation path during path consideration,
we need to pay attention morphing of the final target-list according
to the intermediate path combination, tentatively chosen.
For example, if partial-aggregation makes sense from cost perspective;
like SUM(NRows) of partial COUNT(*) AS NRows instead of COUNT(*) on
billion rows, planner also has to adjust the final target-list according
to the interim paths. In this case, final output shall be SUM(), instead
of COUNT().

I think, this planner enhancement needs a mechanism to inform planner
expected final output for each Path. It means, individual Path node
can have a replacement rule: which target-entry should have what
expression instead, once this path is applied.
In above case, PartialAggPath will have an indication; SUM(nrows)
shall be applied on the location where "COUNT(*)" is originally
placed.

In case of relations join that involves paths with this replacement
rules, it may be a bit complicated if an expression takes argument
come from both relations; like CORR(inner_rel.X, outer_rel.Y),
I'm not sure whether we can have more broken-down partial aggregate.
However, relations on either of inner/outer side are responsible to
the columns in inner/outer side. Unless all the arguments in aggregate
function come from same side, it should not block aggregate push-
down as long as estimated cost makes sense.

That is a probably makes a problem when we allow to add aggregate
path as a part of partial plan tree.

Thanks,


> I'm inclined to think that it would be useful to solve the first
> problem even if we didn't solve the second one right away (but that
> might be wrong).  As a preparatory step, I'm thinking it would be
> sensible to split grouping_planner() into an outer function that would
> handle the addition of Limit and LockRows nodes and maybe planning of
> set operations, and an inner function that would handle GROUP BY,
> DISTINCT, and possibly window function planning.
> 
> Thoughts?
>
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: less log level for success dynamic background workers for 9.5
Next
From: Tomas Vondra
Date:
Subject: Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H