Re: Parallel Aggregate - Mailing list pgsql-hackers

From Haribabu Kommi
Subject Re: Parallel Aggregate
Date
Msg-id CAJrrPGdPCezCAt0DNtJBpPMQ=ypfKSv+EryedVN=AiXi2iTe+w@mail.gmail.com
In response to Re: Parallel Aggregate  (David Rowley <david.rowley@2ndquadrant.com>)
Responses Re: Parallel Aggregate  (David Rowley <david.rowley@2ndquadrant.com>)
List pgsql-hackers
On Tue, Oct 13, 2015 at 5:53 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 13 October 2015 at 17:09, Haribabu Kommi <kommi.haribabu@gmail.com>
> wrote:
>>
>> On Tue, Oct 13, 2015 at 12:14 PM, Robert Haas <robertmhaas@gmail.com>
>> wrote:
>> > Also, I think the path for parallel aggregation should probably be
>> > something like FinalizeAgg -> Gather -> PartialAgg -> some partial
>> > path here.  I'm not clear whether that is what you are thinking or
>> > not.
>>
>> No. I am thinking of the following way.
>> Gather->partialagg->some partial path
>>
>> I want the Gather node to merge the results coming from all workers;
>> otherwise it may be difficult to merge at the parent of the Gather node,
>> because if the partial group aggregate is under the Gather node and any
>> two workers return data for the same group key, we need to compare them
>> and combine them into a single group. At the Gather node it is possible
>> to wait until we get slots from all workers. Once all workers return
>> their slots we can compare and merge the necessary slots and return the
>> result. Am I missing something?
>
>
> My assumption is the same as Robert's here.
> Unless I've misunderstood, it sounds like you're proposing to add logic into
> the Gather node to handle final aggregation? That sounds like a modularity
> violation of the whole node concept.
>
> The handling of the final aggregate stage is not all that different from the
> initial aggregate stage. The primary difference is just that you're calling
> the combine function instead of the transition function, and the values

Yes, you are correct. Until now I was thinking of using transition types as
the approach, which is why I proposed having the Gather node handle the
finalize aggregation.

> being aggregated are aggregate states rather than the type of the values
> which were initially aggregated. The handling of GROUP BY is all the same,
> yet you only apply the HAVING clause during final aggregation. This is why I
> ended up implementing this in nodeAgg.c instead of inventing some new node
> type that's mostly a copy and paste of nodeAgg.c [1]
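
To make sure I follow the transition vs. combine distinction, here is a
rough standalone sketch I put together (plain C, just an avg-style example,
not PostgreSQL code): the workers only run the transition function over raw
values, and the final stage only runs the combine and final functions over
the partial states.

/* Rough standalone illustration of an avg-style aggregate split into
 * transition, combine and final functions. */
#include <stdio.h>

typedef struct { double sum; long count; } AvgState;

/* transition: fold one raw input value into a state (partial aggregate) */
static void trans(AvgState *s, double value) { s->sum += value; s->count++; }

/* combine: fold one partial state into another (finalize aggregate) */
static void combine(AvgState *dst, const AvgState *src)
{
    dst->sum += src->sum;
    dst->count += src->count;
}

/* final: turn the combined state into the aggregate result */
static double final_avg(const AvgState *s) { return s->sum / s->count; }

int main(void)
{
    AvgState w1 = {0, 0};       /* "worker 1" partial state */
    AvgState w2 = {0, 0};       /* "worker 2" partial state */
    AvgState total = {0, 0};    /* finalize-stage state */

    /* each worker aggregates its share of the input ... */
    trans(&w1, 1.0); trans(&w1, 2.0);
    trans(&w2, 3.0); trans(&w2, 4.0);

    /* ... and the final stage only combines the partial states */
    combine(&total, &w1);
    combine(&total, &w2);
    printf("avg = %g\n", final_avg(&total));    /* prints 2.5 */
    return 0;
}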

After going through your Partial Aggregation / GROUP BY before JOIN patch,
the following is my understanding of parallel aggregate.

Finalize [hash] aggregate
    -> Gather
        -> Partial [hash] aggregate

The data that comes from the Gather node contains the group key and the
partial grouping results. In the hash aggregate case we can build another
hash table from these at the finalize aggregate and return the final
results. This approach works for both plain and hash aggregates.
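
As a rough illustration of that finalize step (plain C, not PostgreSQL code;
an array plus linear scan stands in for the real hash table), each
(group key, partial state) row coming from Gather is looked up by key and
its state is combined into the finalize-side entry:

/* Rough sketch: finalize hash aggregate over (group key, partial state)
 * rows arriving through Gather. */
#include <stdio.h>

typedef struct { int key; double sum; long count; } Partial;

#define MAX_GROUPS 16
static Partial table[MAX_GROUPS];
static int ngroups = 0;

/* find (or create) the finalize-side entry for p's group key and combine
 * the partial state into it */
static void combine_partial(const Partial *p)
{
    int i;

    for (i = 0; i < ngroups; i++)
    {
        if (table[i].key == p->key)
        {
            table[i].sum += p->sum;
            table[i].count += p->count;
            return;
        }
    }
    table[ngroups++] = *p;      /* first time we see this group key */
}

int main(void)
{
    /* partial results: both workers saw rows for the same group keys */
    Partial from_workers[] = {
        {1, 10.0, 2}, {2, 6.0, 3},      /* worker 1 */
        {1, 5.0, 1},  {2, 4.0, 1},      /* worker 2 */
    };
    int i;

    for (i = 0; i < 4; i++)
        combine_partial(&from_workers[i]);
    for (i = 0; i < ngroups; i++)
        printf("key %d: avg = %g\n", table[i].key,
               table[i].sum / table[i].count);
    return 0;
}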

For group aggregate support in parallel aggregate, the plan should be as
follows.

Finalize Group aggregate
    -> sort
        -> Gather
            -> Partial group aggregate
                -> sort

The data that comes from the Gather node needs to be sorted again on the
grouping key; the finalize group aggregate then merges the data and
generates the final grouping result.

With this approach, we don't need to change anything in the Gather node. Is
my understanding correct?
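
A rough standalone sketch of that sorted merge (again plain C, not
PostgreSQL code): once the rows from Gather are re-sorted on the group key,
partial states for the same key are adjacent, so the finalize group
aggregate only has to combine consecutive equal-key states:

/* Rough sketch: finalize group aggregate; the explicit qsort() plays the
 * role of the Sort node above Gather, making equal group keys adjacent. */
#include <stdio.h>
#include <stdlib.h>

typedef struct { int key; double sum; long count; } Partial;

static int cmp_key(const void *a, const void *b)
{
    return ((const Partial *) a)->key - ((const Partial *) b)->key;
}

int main(void)
{
    /* partial states as they might arrive from two workers, unsorted */
    Partial rows[] = { {2, 6.0, 3}, {1, 10.0, 2}, {1, 5.0, 1}, {2, 4.0, 1} };
    int n = 4;
    int i = 0;

    qsort(rows, n, sizeof(Partial), cmp_key);   /* sort above Gather */

    while (i < n)
    {
        Partial g = rows[i++];

        while (i < n && rows[i].key == g.key)   /* same group: combine */
        {
            g.sum += rows[i].sum;
            g.count += rows[i].count;
            i++;
        }
        printf("key %d: avg = %g\n", g.key, g.sum / g.count);
    }
    return 0;
}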

Regards,
Hari Babu
Fujitsu Australia


