Re: Parallel grouping sets - Mailing list pgsql-hackers

From Richard Guo
Subject Re: Parallel grouping sets
Date
Msg-id CAN_9JTyPnkHuZ8ruWCwW-QiCefntONaE+E0w7H63hw2ywyh-4w@mail.gmail.com
Whole thread Raw
In response to Re: Parallel grouping sets  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: Parallel grouping sets  (Pengzhou Tang <ptang@pivotal.io>)
List pgsql-hackers
On Tue, Jul 30, 2019 at 11:05 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On Tue, Jul 30, 2019 at 03:50:32PM +0800, Richard Guo wrote:
>On Wed, Jun 12, 2019 at 10:58 AM Richard Guo <riguo@pivotal.io> wrote:
>
>> Hi all,
>>
>> Paul and I have been hacking recently to implement parallel grouping
>> sets, and here we have two implementations.
>>
>> Implementation 1
>> ================
>>
>> Attached is the patch and also there is a github branch [1] for this
>> work.
>>
>
>Rebased with the latest master.
>

Hi Richard,

thanks for the rebased patch. I think the patch is mostly fine (at least I
don't see any serious issues). A couple minor comments:

Hi Tomas,

Thank you for reviewing this patch.
 

1) I think get_number_of_groups() would deserve a short explanation why
it's OK to handle (non-partial) grouping sets and regular GROUP BY in the
same branch. Before these cases were clearly separated, now it seems a bit
mixed up and it may not be immediately obvious why it's OK.

Added a short comment in get_number_of_groups() explaining the behavior
when doing partial aggregation for grouping sets.
 

2) There are new regression tests, but they are not added to any schedule
(parallel or serial), and so are not executed as part of "make check". I
suppose this is a mistake.

Yes, thanks. Added the new regression test in parallel_schedule and
serial_schedule.
 

3) The regression tests do check plan and results like this:

    EXPLAIN (COSTS OFF, VERBOSE) SELECT ...;
    SELECT ... ORDER BY 1, 2, 3;

which however means that the query might easily use a different plan than
what's verified in the eplain (thanks to the additional ORDER BY clause).
So I think this should explain and execute the same query.

(In this case the plans seems to be the same, but that may easily change
in the future, and we could miss it here, failing to verify the results.)

Thank you for pointing this out. Fixed it in V4 patch.
 

4) It might be a good idea to check the negative case too, i.e. a query on
data set that we should not parallelize (because the number of partial
groups would be too high).

Yes, agree. Added a negative case.
 


Do you have any plans to hack on the second approach too? AFAICS those two
approaches are complementary (address different data sets / queries), and
it would be nice to have both. One of the things I've been wondering is if
we need to invent gset_id as a new concept, or if we could simply use the
existing GROUPING() function - that uniquely identifies the grouping set.


Yes, I'm planning to hack on the second approach in short future. I'm
also reconsidering the gset_id stuff since it brings a lot of complexity
for the second approach.  I agree with you that we can try GROUPING()
function to see if it can replace gset_id.

Thanks
Richard
Attachment

pgsql-hackers by date:

Previous
From: Amit Langote
Date:
Subject: Re: partition routing layering in nodeModifyTable.c
Next
From: Amit Langote
Date:
Subject: Re: Problem with default partition pruning