Re: [HACKERS] Partition-wise aggregation/grouping - Mailing list pgsql-hackers

From Ashutosh Bapat
Subject Re: [HACKERS] Partition-wise aggregation/grouping
Date
Msg-id CAFjFpReeyP53HDPoEimU9QYV+SO5OYDp4iMfXxfbJ52Dac=1zg@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] Partition-wise aggregation/grouping  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Mon, Mar 19, 2018 at 11:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Mar 16, 2018 at 1:50 PM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>>> This patch also renames can_parallel_agg to
>>> can_partial_agg and removes the parallelism-specific bits from it.
>>
>> I think we need to update the comments in this function to use phrase
>> "partial aggregation" instead of "parallel aggregation". And I think
>> we need to change the conditions as well. For example if
>> parse->groupClause == NIL, why can't we do partial aggregation? This
>> is the classical case when we will need patial aggregation. Probably
>> we should test this with Jeevan's patches for partition-wise aggregate
>> to see if it considers partition-wise aggregate or not.
>
> I think the case where we have neither any aggregates nor a grouping
> clause is where we are doing SELECT DISTINCT.

That's a case which will also benefit from partial partition-wise
grouping where each partition produces its own distinct values, thus
reducing the number of rows that flow through an Append node or better
over network, and finalization step removes duplicates. In fact, what
we are attempting here is partition-wise grouping and partial
aggregation happens to be a requirement for doing that if there are
aggregates. If there are no aggregates, we still benefit from
partition-wise grouping. So, can_partial_agg should only tell whether
we can calculate partial aggregates. If there are aggregates present,
and can_partial_agg is false, we can not attempt partition-wise
grouping. But if there are no aggregates and can_partial_agg is false,
it shouldn't prohibit us from using partition-wise aggregates.

> Something like SELECT
> COUNT(*) FROM ... is not this case because that has an aggregate.
>
> I'm sort of on the fence as to whether and how to update the comments.
> I agree that it's a little strange to leave the comments here
> referencing parallel aggregation when the function has been renamed to
> is_partial_agg(), but a simple search-and-replace doesn't necessarily
> improve the situation very much.

Hmm, agree. And as you have mentioned downthread, it's not part of the
this patchset to attempt to do that. May be I will try providing a
patch to update comments once we have committed PWA.

> Most notably, hasNonSerial is
> irrelevant for partial but non-parallel aggregation, but we still have
> the test because we haven't done the work to really do the right thing
> here, which is to separately track whether we can do parallel partial
> aggregation (either hasNonPartial or hasNonSerial is a problem) and
> non-parallel partial aggregation (only hasNonPartial is a problem).
> This needs a deeper reassessment, but I don't think that can or should
> be something we try to do right now.

I think this bit is easy to mention in the comments. We could always
say that since partial aggregation is using serialization and
deserialization while calculating partial aggregates, even though the
step is not needed, we need hasNonSerial to be true for partial
aggregation to work. A future improvement to avoid serialization and
deserialization in partial aggregation when no parallel query is
involved should remove this condition from here and treat it as a
requirement for parallel aggregation.

>
>> OR When parse->groupingSets is true, I can see why we can't use
>> parallel query, but we can still compute partial aggregates. This
>> condition doesn't hurt since partition-wise aggregation bails out when
>> there are grouping sets, so it's not that harmful here.
>
> I haven't thought deeply about what will break when GROUPING SETS are
> in use, but it's not the purpose of this patch set to make them work
> where they didn't before.  The point of hoisting the first two tests
> out of this function was just to avoid doing repeated work when
> partition-wise aggregate is in use.  Those two tests could conceivably
> produce different results for different children, whereas the
> remaining tests won't give different answers.  Let's not get
> distracted by the prospect of improving the tests.  I suspect that's
> not anyway so simple to achieve as what you seem to be speculating
> here...

+1.


>
> I'm fine with using a structure to bundle details that need to be sent
> to the FDW, but why should the FDW need to know about
> can_sort/can_hash?  I suppose if it needs to know about
> can_partial_agg then it doesn't really cost anything to pass through
> all the flags, but I doubt whether the FDW has any use for the others.
> Anyway, if that's the goal, let's just make sure that the set of
> things we're passing that way are exactly the set of things that we
> think the FDW might need.

I am speculating that an FDW or custom planner hook, which does some
local processing or hints something to the foreign server can use
those flags. But I agree, that unless we see such a requirement we
shouldn't expose those in the structure. It will get difficult to
remove those later.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


pgsql-hackers by date:

Previous
From: Kyotaro HORIGUCHI
Date:
Subject: Re: [HACKERS] GUC for cleanup indexes threshold.
Next
From: Andrew Dunstan
Date:
Subject: Re: jsonpath