Re: Combining Aggregates - Mailing list pgsql-hackers

From David Rowley
Subject Re: Combining Aggregates
Date
Msg-id CAKJS1f8wkD0mK4jdg5bHcTvFCUhGj76_VkNaEhNJVkqkg_FhVQ@mail.gmail.com
Whole thread Raw
In response to Re: Combining Aggregates  (Kouhei Kaigai <kaigai@ak.jp.nec.com>)
List pgsql-hackers
On 27 July 2015 at 12:14, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> The main use case people have been talking about is parallel query, but
> is there some other case this would be useful right now, without the
> parallel query feature? You and Simon talked about this case:
>
> > 2. Queries such as:
> >
> > SELECT p.name, SUM(s.qty) FROM sales s INNER JOIN product p ON s.product_id
> > = p.product_id GROUP BY p.name;
> >
> > Such a query could be transformed into:
> >
> > SELECT p.name,SUM(qty) FROM (SELECT product_id,SUM(qty) AS qty FROM sales
> > GROUP BY product_id) s
> > INNER JOIN product p ON p.product_id = s.product_id GROUP BY p_name;
> >
> > Of course the outer query's SUM and GROUP BY would not be required if there
> > happened to be a UNIQUE index on product(name), but assuming there's not
> > then the above should produce the results faster. This of course works ok
> > for SUM(), but for something like AVG() or STDDEV() the combine/merge
> > aggregate functions would be required to process those intermediate
> > aggregate results that were produced by the sub-query.
>
> Any chance you could implement that in the planner?
>
It likely needs planner enhancement prior to other applications...
http://www.postgresql.org/message-id/CA+TgmobgWKHfZc09B+s2LxJTwoRD5Ht-avoVDvaQ4+RpwrO4bg@mail.gmail.com


I had thought this too, but I'm not sure that's 100% correct. As I just said in the my previous email on this thread, I am now working on "group by before join". I had started it with the intentions of path-ifying the grouping planner, but I soon realised that the grouping_planner() does quite a bit more at that final stage that I propose to do to allow "group by before join". This is mostly around handling of DISTINCT, HAVING and LIMIT. I don't think those need to be handled in my patch, perhaps with the exception of DISTINCT without GROUP BY, but not when both are present.

Instead I've started by inventing GroupingPath and I'm now building these new path types once there's enough tables joined for all Vars of the GROUP BY and agg parameters.

I believe this optimisation needs to be costed as pushing the GROUP BY down to happen as early as possible is *not* always a win. Perhaps the JOIN is very selective and eliminates many groups. Hence I've added costing and allowed the planner to choose what it thinks is cheapest.
 
Once planner allowed to have both of normal path and partial aggregation
paths to compare according to the cost, it is the straightforward way to
do.

I've ended up with 2 boolean members to struct GroupingPath, combine_states and finalize_aggs. I plan to modify nodeAgg.c to use these, and EXPLAIN to give a better description of what its doing.
 

Here are various academic research, for example, below is the good starting
point to clarify aggregate queries that we can run with 2-phase approach.
http://www.researchgate.net/publication/2715288_Performing_Group-By_before_Join


Thanks, I've been using that very paper.

Regards

David Rowley

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

pgsql-hackers by date:

Previous
From: David Rowley
Date:
Subject: Re: Combining Aggregates
Next
From: Craig Ringer
Date:
Subject: Re: security labels on databases are bad for dump & restore