On Fri, Jun 14, 2019 at 12:02:52PM +1200, David Rowley wrote:
>On Fri, 14 Jun 2019 at 11:45, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Wed, Jun 12, 2019 at 10:58:44AM +0800, Richard Guo wrote:
>
>> ># explain (costs off, verbose) select c1, c2, avg(c3) from t2 group by
>> >grouping sets((c1,c2), (c1));
>> >                          QUERY PLAN
>> >--------------------------------------------------------------
>> > Finalize GroupAggregate
>> >   Output: c1, c2, avg(c3), (gset_id)
>> >   Group Key: t2.c1, t2.c2, (gset_id)
>> >   ->  Gather Merge
>> >         Output: c1, c2, (gset_id), (PARTIAL avg(c3))
>> >         Workers Planned: 2
>> >         ->  Sort
>> >               Output: c1, c2, (gset_id), (PARTIAL avg(c3))
>> >               Sort Key: t2.c1, t2.c2, (gset_id)
>> >               ->  Partial HashAggregate
>> >                     Output: c1, c2, gset_id, PARTIAL avg(c3)
>> >                     Hash Key: t2.c1, t2.c2
>> >                     Hash Key: t2.c1
>> >                     ->  Parallel Seq Scan on public.t2
>> >                           Output: c1, c2, c3
>> >(15 rows)
>> >
>>
>> OK, I'm not sure I understand the point of this - can you give an
>> example which is supposed to benefit from this? Where does the speedup
>> come from?
>
>I think this is a bad example since the first grouping set is a
>superset of the 2nd. If those were independent and each grouping set
>produced a reasonable number of groups then it may be better to do it
>this way instead of grouping by all exprs in all grouping sets in the
>first phase, as is done by #1. To do #2 would require that we tag
>the aggregate state with the grouping set it belongs to, which seems
>to be what gset_id is in Richard's output.
>
Aha! So if we have grouping sets (a,b) and (c,d), then with the first
approach we'd do a partial aggregate on (a,b,c,d) - which may produce
quite a few distinct groups, making it inefficient. But with the second
approach, we'd do just (a,b) and (c,d) and mark the rows with gset_id.
Neat!
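
To put rough numbers on it (a hypothetical table with the columns
pairwise independent, ~100 distinct values each, purely for
illustration):

  -- hypothetical table and data distribution, just to illustrate
  create table t (a int, b int, c int, d int, x int);

  select a, b, c, d, sum(x)
  from t
  group by grouping sets ((a, b), (c, d));

With the first approach the partial phase groups by (a,b,c,d) and
could produce up to 100^4 = 100M partial groups, whereas the second
approach aggregates (a,b) and (c,d) separately (tagged with gset_id)
and produces at most 100^2 + 100^2 = 20k groups, which is far cheaper
to transfer and combine.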
>In my example upthread the first phase of aggregation produced a group
>per input row. Method #2 would work better for that case since it
>would only produce 2000 groups instead of 1 million.
>
>Likely both methods would be good to consider, but since #1 seems much
>easier than #2, it makes sense to me to start there.
>
Yep. Thanks for the explanation.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services