On Fri, 14 Jun 2019 at 11:45, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> On Wed, Jun 12, 2019 at 10:58:44AM +0800, Richard Guo wrote:
> ># explain (costs off, verbose) select c1, c2, avg(c3) from t2 group by
> >grouping sets((c1,c2), (c1));
> > QUERY PLAN
> >--------------------------------------------------------------
> > Finalize GroupAggregate
> > Output: c1, c2, avg(c3), (gset_id)
> > Group Key: t2.c1, t2.c2, (gset_id)
> > -> Gather Merge
> > Output: c1, c2, (gset_id), (PARTIAL avg(c3))
> > Workers Planned: 2
> > -> Sort
> > Output: c1, c2, (gset_id), (PARTIAL avg(c3))
> > Sort Key: t2.c1, t2.c2, (gset_id)
> > -> Partial HashAggregate
> > Output: c1, c2, gset_id, PARTIAL avg(c3)
> > Hash Key: t2.c1, t2.c2
> > Hash Key: t2.c1
> > -> Parallel Seq Scan on public.t2
> > Output: c1, c2, c3
> >(15 rows)
> >
>
> OK, I'm not sure I understand the point of this - can you give an
> example which is supposed to benefit from this? Where does the speedup
> come from?

I think this is a bad example since the first grouping set is a
superset of the 2nd. If the sets were independent and each produced a
reasonable number of groups, then it may be better to do it this way
instead of grouping by all the exprs in all grouping sets in the first
phase, as #1 does. Doing #2 would require that we tag the aggregate
state with the grouping set it belongs to, which seems to be what
gset_id is in Richard's output.
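
For example, something like this (a made-up variant of Richard's
query, not taken from his mail), where neither set is a superset of
the other:

select c1, c2, avg(c3) from t2 group by grouping sets((c1), (c2));

With #1 the partial phase groups by all the exprs, i.e. (c1, c2), and
the finalize phase then computes each grouping set from those partial
groups. With #2 the partial phase groups each set separately and tags
each partial group with its gset_id so the finalize phase can keep the
sets apart.
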
In my example upthread the first phase of aggregation produced a group
per input row. Method #2 would work better for that case since it
would only produce 2000 groups instead of 1 million.
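
Roughly, I mean a case shaped like this (illustrative table and
numbers, not the exact one from that mail):

-- t has ~1 million rows; a and b each have about 1000 distinct values,
-- but (a, b) is nearly unique across the table
select a, b, count(*) from t group by grouping sets((a), (b));

-- #1: partial phase groups by (a, b)         -> ~1 million partial groups
-- #2: partial phase groups (a) and (b) each,
--     tagged with gset_id                    -> ~1000 + ~1000 = ~2000 groups
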
Likely both methods would be good to consider, but since #1 seems much
easier than #2, it seems to make sense to me to start there.
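
For Richard's original query, I'd expect #1 to give a plan shaped
roughly like this (a hand-written sketch, not real EXPLAIN output):

 Finalize GroupAggregate
   Group Key: c1, c2
   Group Key: c1
   ->  Gather Merge
         ->  Sort
               Sort Key: c1, c2
               ->  Partial HashAggregate
                     Group Key: c1, c2    <- all exprs from all the sets
                     ->  Parallel Seq Scan on t2
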
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services