Re: Parallel grouping sets - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Parallel grouping sets
Date
Msg-id 20190614004438.e2y7ubdnpobq5t37@development
In response to Re: Parallel grouping sets  (David Rowley <david.rowley@2ndquadrant.com>)
List pgsql-hackers
On Fri, Jun 14, 2019 at 12:02:52PM +1200, David Rowley wrote:
>On Fri, 14 Jun 2019 at 11:45, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Wed, Jun 12, 2019 at 10:58:44AM +0800, Richard Guo wrote:
>
>> ># explain (costs off, verbose) select c1, c2, avg(c3) from t2 group by
>> >grouping sets((c1,c2), (c1));
>> >                          QUERY PLAN
>> >--------------------------------------------------------------
>> > Finalize GroupAggregate
>> >   Output: c1, c2, avg(c3), (gset_id)
>> >   Group Key: t2.c1, t2.c2, (gset_id)
>> >   ->  Gather Merge
>> >         Output: c1, c2, (gset_id), (PARTIAL avg(c3))
>> >         Workers Planned: 2
>> >         ->  Sort
>> >               Output: c1, c2, (gset_id), (PARTIAL avg(c3))
>> >               Sort Key: t2.c1, t2.c2, (gset_id)
>> >               ->  Partial HashAggregate
>> >                     Output: c1, c2, gset_id, PARTIAL avg(c3)
>> >                     Hash Key: t2.c1, t2.c2
>> >                     Hash Key: t2.c1
>> >                     ->  Parallel Seq Scan on public.t2
>> >                           Output: c1, c2, c3
>> >(15 rows)
>> >
>>
>> OK, I'm not sure I understand the point of this - can you give an
>> example which is supposed to benefit from this? Where does the speedup
>> come from?
>
>I think this is a bad example since the first grouping set is a
>superset of the 2nd. If those were independent and each grouping set
>produced a reasonable number of groups then it may be better to do it
>this way instead of grouping by all exprs in all grouping sets in the
>first phase, as is done by #1. To do #2 would require that we tag
>the aggregate state with the grouping set it belongs to, which seems
>to be what gset_id is in Richard's output.
>

Aha! So if we have grouping sets (a,b) and (c,d), then with the first
approach we'd do partial aggregate on (a,b,c,d) - which may produce
quite a few distinct groups, making it inefficient. But with the second
approach, we'd do just (a,b) and (c,d) and mark the rows with gset_id.

Neat!
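To make the difference concrete, here is a minimal sketch (hypothetical Python, not PostgreSQL source) that counts the partial groups each strategy would produce for grouping sets (a,b) and (c,d) over the same synthetic input; the (sum, count) pair stands in for the PARTIAL avg() state:

```python
# Hypothetical sketch (not PostgreSQL source): count the partial groups each
# strategy produces for grouping sets (a,b) and (c,d) over the same input.
from collections import defaultdict

# Synthetic rows: 8 distinct values per key column, plus a value to average.
rows = [((i % 8), (i // 8) % 8, (i // 64) % 8, (i // 512) % 8, float(i))
        for i in range(100_000)]

# Method #1: partial aggregate on the union of all grouping-set columns
# (a,b,c,d); each partial state mirrors PARTIAL avg() as [sum, count].
method1 = defaultdict(lambda: [0.0, 0])
for a, b, c, d, v in rows:
    st = method1[(a, b, c, d)]
    st[0] += v
    st[1] += 1

# Method #2: tag each partial state with a gset_id, one per grouping set,
# so every input row feeds each grouping set separately.
method2 = defaultdict(lambda: [0.0, 0])
for a, b, c, d, v in rows:
    for gset_id, key in ((0, (a, b)), (1, (c, d))):
        st = method2[(gset_id, key)]
        st[0] += v
        st[1] += 1

print(len(method1))  # 4096 partial groups (8^4 key combinations)
print(len(method2))  # 128 partial groups (64 per grouping set)
```

Under these assumptions the partial phase of #2 emits far fewer groups than #1, which is exactly the win described above when the grouping sets are independent rather than one being a superset of the other.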

>In my example upthread the first phase of aggregation produced a group
>per input row. Method #2 would work better for that case since it
>would only produce 2000 groups instead of 1 million.
>
>Likely both methods would be good to consider, but since #1 seems much
>easier than #2, then to me it seems to make sense to start there.
>

Yep. Thanks for the explanation.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 


