Re: upper planner path-ification - Mailing list pgsql-hackers

From Kouhei Kaigai
Subject Re: upper planner path-ification
Date
Msg-id 9A28C8860F777E439AA12E8AEA7694F801109422@BPXM15GP.gisp.nec.co.jp
Whole thread Raw
In response to Re: upper planner path-ification  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tom Lane
> Sent: Monday, May 18, 2015 1:12 AM
> To: Robert Haas
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] upper planner path-ification
>
> Robert Haas <robertmhaas@gmail.com> writes:
> > So, getting back to this part, what's the value of returning a list of
> > Paths rather than a list of Plans?
>
> (1) less work, since we don't have to fill in details not needed for
>     costing purposes;
> (2) paths carry info that the planner wants but the executor doesn't,
>     notably sort-order annotations.
>
> > target lists are normally computed when paths are converted to plans,
> > but for the higher-level plan nodes adding by grouping_planner, the
> > path list is typically just passed in.  So would the new path types be
> > expected to carry target lists of their own, or would they need to
> > figure out the target list on the fly at plan generation time?
>
> Yeah, that is something I've been struggling with while thinking about
> this.  I don't really want to add tlists as such to Paths, but it's
> not very clear how else to annotate a Path as to what it computes,
> and that seems like an annotation we have to have in some form in order
> to convert these planning steps into a Path universe.
>
> There are other cases where it would be useful to have some notion of
> this kind.  An example is that right now, if you have an expression index
> on an expensive function and a query that wants the value of that function,
> the planner is able to extract the value from the index --- but there is
> nothing that gives any cost benefit to doing so, so it's just as likely
> to choose some other index and eat the cost of recalculating the function.
> It seems like the only way to fix that in a principled fashion is to have
> some concept that the index-scan Path can produce the function value,
> and then when we come to some appropriate costing step, penalize the other
> paths for having to compute the value that's available for free from this
> one.
>
> Rather than adding tlists per se to Paths, I've been vaguely toying with
> a notion of identifying all the "interesting" subexpressions in a query
> (expensive functions, aggregates, etc), giving them indexes 1..n, and then
> marking Paths with bitmapsets showing which interesting subexpressions
> they can produce values for.  This would make things like "does this Path
> compute all the needed aggregates" much cheaper to deal with than a raw
> tlist representation would do.  But maybe that's still not the best way.
>
Hmm.... it seems to me a little bit complicated than flat expression node.

> Another point is that a Path that computes aggregates is fundamentally
> different from a Path that doesn't, because it doesn't even produce the
> same number of rows.  So I'm not at all sure how to visualize the idea
> of a Path that computes only some aggregates, or whether it's even a
> sensible thing to worry about supporting.
>
I expect partial aggregate shall be done per a particular input stream,
not per aggregate function. In other words, once planner determined
a relation scan/join has advantage to run partial aggregate, all the
aggregate functions that consume rows produced by this scan/join have
to have partial aggregate / final aggregate form, doesn't it?
If so, number of rows to be returned is associated with a Path.

For example, when we break down a query below using 2-phase aggregation,
 SELECT t1.cat, avg(t2.x) FROM t1 JOIN t2 ON t1.id_1 = t2.id_2 GROUP BY t1.cat;

expected plan is as shown below, isn't it?
 FinalAgg (nrows=100)     tlist: t1.cat, avg(nrows, sum(t2.x))     grouping key: t1.cat  -> HashJoin (nrows=1000)
tlist:t1.cat, count(t2.x) nrows, sum(t2.x)      -> PartialAgg (nrows=1000)         tlist: count(t2.x) nrows, sum(t2.x),
t2.id_2        grouping key: t2.id_2          -> SeqScan on t2 (nrows=100000000)      -> Hash          -> SeqScan on t1
(nrows=100)

It is clear that PartialAgg consumes 100000000 rows, then output 1000
rows because of partial reduction. All the partial aggregation on this
node will work in a coordinated manner.

Do you have another vision for the partial aggregation behavior?


> > One thing that seems like it might complicate things here is that a
> > lot of planner functions take PlannerInfo *root as an argument.  But
> > if we generate only paths in grouping_planner() and path-ify them
> > later, the subquery's root will not be available when we're trying to
> > do the Path -> Plan transformation.
>
> Ah, you're wrong there, because we hang onto the subquery's root already
> (I forget why exactly, but see PlannerGlobal.subroots for SubPlans, and
> RelOptInfo.subroot for subquery-in-FROM).  So it would not be a
> fundamental problem to postpone create_plan() for a subquery.
>
> > I think grouping_planner() is badly in need of some refactoring just
> > to make it shorter.  It's over 1000 lines of code, which IMHO is a
> > fairly ridiculous length for a single function.
>
> Amen to that.  But as I said to Andrew, I think this will be a side-effect
> of path-ification in this area, and is probably not something to set out
> to do first.
>

--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>




pgsql-hackers by date:

Previous
From: Craig Ringer
Date:
Subject: Re: Time to get rid of PQnoPasswordSupplied?
Next
From: Kouhei Kaigai
Date:
Subject: Re: upper planner path-ification