Re: Final Patch for GROUPING SETS - Mailing list pgsql-hackers

From Andrew Gierth
Subject Re: Final Patch for GROUPING SETS
Date
Msg-id 8761dhu9ts.fsf@news-spur.riddles.org.uk
In response to Re: Final Patch for GROUPING SETS  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Final Patch for GROUPING SETS  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
>> What that code does is produce plans that look like this:
>> GroupAggregate
>> -> Sort
>>    -> ChainAggregate
>>       -> Sort
>>          -> ChainAggregate
>> in much the same way that WindowAgg nodes are generated.
Tom> That seems pretty messy, especially given your further comments
Tom> that these plan nodes are interconnected and know about each
Tom> other (though you failed to say exactly how).

I'd already explained in more detail way back when we posted the
patch. But to reiterate: the ChainAggregate nodes pass through their
input data unchanged, but on group boundaries they write aggregated
result rows to a tuplestore shared by the whole chain. The top node
returns the data from the tuplestore after its own output is
completed.
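As a rough illustration of that behaviour (plain Python, not the actual executor code; the row layout, the `sum` aggregate, and all helper names here are invented for the sketch): each node in the chain yields its input rows unchanged, and on its own group boundaries appends an aggregated result row to a store shared by the whole chain.

```python
def chain_aggregate(rows, key, agg_col, shared_store):
    """Pass input rows through unchanged; on each group boundary of
    `key`, append a summed result row to the shared tuplestore.
    Assumes `rows` arrive sorted on `key`."""
    current, total, first = None, 0, True
    for row in rows:
        if not first and row[key] != current:
            shared_store.append({key: current, "sum": total})
            total = 0
        current, first = row[key], False
        total += row[agg_col]
        yield row                      # pass-through, unchanged
    if not first:
        shared_store.append({key: current, "sum": total})

rows = [{"a": 1, "b": 1, "v": 10}, {"a": 1, "b": 2, "v": 20},
        {"a": 2, "b": 1, "v": 30}]
store = []
# Chain of two nodes: the inner one groups by b (input sorted by b);
# a re-sort between them plays the role of the intervening Sort node.
inner = chain_aggregate(sorted(rows, key=lambda r: r["b"]), "b", "v", store)
outer_rows = sorted(inner, key=lambda r: r["a"])
top = list(chain_aggregate(outer_rows, "a", "v", store))
# In the real plan the head GroupAggregate would return its own groups
# and then drain the shared tuplestore.
```

This is only a model of the dataflow; projection into the tuplestore, quals, and rescan handling are the parts tied to the chain_head pointer described below.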

The chain_head pointer in the ChainAggregate nodes is used for:
 - obtaining the head node's targetlist and qual, to use to project
   rows into the tuplestore (the ChainAggregate nodes don't do
   ordinary projection, so they have dummy targetlists like the Sort
   nodes do)
 
 - obtaining the pointer to the tuplestore itself
 - on rescan without parameter change, to inform the parent node
   whether or not the child nodes are also being rescanned (since
   the Sort nodes may or may not block this)
 
Tom> The claimed analogy to WindowAgg therefore seems bogus since
Tom> stacked WindowAggs are independent, AFAIR anyway.

The analogy is only in that they need to see the same input rows but
in different sort orders.
Tom> I'm also wondering about performance: doesn't this imply more
Tom> rows passing through some of the plan steps than really
Tom> necessary?

There's no way to cut down the number of rows seen by intermediate
nodes unless you implement (and require) associative aggregates, which
we do not do in this patch (that's left for possible future
optimization efforts). Our approach makes no new demands on the
implementation of aggregate functions.
Tom> Also, how would this extend to preferring hashed aggregation in
Tom> some of the grouping steps?

My suggestion for extending it to hashed aggregation: have a (single)
HashAggregate node keep multiple hash tables, one per grouping set.
Then any arbitrary collection of grouping sets can be handled in one
node, provided that memory permits and no non-hashable features are
used.
So the normal plan for CUBE(a,b) under this scheme would be just:
 HashAggregate
   Grouping Sets: (), (a), (b), (a,b)
   -> (input path in unsorted order)
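A toy model of that one-node scheme (illustrative Python with invented names, not the executor implementation): a single scan over unsorted input maintains one hash table per grouping set.

```python
from collections import defaultdict

def hash_aggregate(rows, grouping_sets, agg_col):
    """One pass over unsorted input; one hash table per grouping set,
    keyed on that set's column values (the empty tuple for ())."""
    tables = {gs: defaultdict(int) for gs in grouping_sets}
    for row in rows:
        for gs, table in tables.items():
            key = tuple(row[c] for c in gs)
            table[key] += row[agg_col]   # sum() stands in for any agg
    return tables

rows = [{"a": 1, "b": 1, "v": 10}, {"a": 1, "b": 2, "v": 20},
        {"a": 2, "b": 1, "v": 30}]
# All four grouping sets of CUBE(a,b) in a single pass:
cube = hash_aggregate(rows, [(), ("a",), ("b",), ("a", "b")], "v")
```

The memory caveat above corresponds to the combined size of all the hash tables; the per-set tables are what would have to fit in work_mem.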

If a mixture of hashable and non-hashable data types is used, for
example CUBE(hashable,unhashable), then a plan of this form could be
constructed:
 HashAggregate
   Grouping Sets: (), (hashable)
   -> ChainAggregate
        Grouping Sets: (unhashable), (unhashable,hashable)
        -> (input path sorted by (unhashable,hashable))

Likewise, plans of this form could be considered for cases like
CUBE(low_card, high_card) where hashed grouping on high_card would
require excessive memory:
 HashAggregate
   Grouping Sets: (), (low_card)
   -> ChainAggregate
        Grouping Sets: (high_card), (high_card,low_card)
        -> (input path sorted by (high_card, low_card))
Tom> ISTM that maybe what we should do is take a cue from the SQL
Tom> spec, which defines these things in terms of UNION ALL of
Tom> plain-GROUP-BY operations reading from a common CTE.

I looked at that; in fact, that was our original plan. But it became
clear quite quickly that it was not going to be easy.

I tried two different approaches. First was to actually re-plan the
input (i.e. running query_planner more than once) for different sort
orders; that crashed and burned quickly thanks to the extent to which
the planner assumes that it'll be run once only on any given input.

Second was to generate a CTE for the input data. This didn't get very
far because everything that already exists to handle CTE nodes assumes
that they are explicit in the planner's input (that they have their
own Query node, etc.) and I was not able to determine a good solution.
If you have any suggestions for how to approach the problem, I'm happy
to have another go at it.
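For reference, the spec-style rewrite under discussion can be mimicked outside the planner (illustrative Python, invented helper names): materialize the input once as the "CTE", run one plain GROUP BY per grouping set, and concatenate (UNION ALL) the results.

```python
from collections import defaultdict

def group_by(rows, cols, agg_col):
    """A plain GROUP BY over `cols`, summing `agg_col`."""
    table = defaultdict(int)
    for row in rows:
        table[tuple(row[c] for c in cols)] += row[agg_col]
    return [(cols, key, total) for key, total in table.items()]

# The common CTE: the input is materialized exactly once.
cte = [{"a": 1, "b": 1, "v": 10}, {"a": 1, "b": 2, "v": 20}]

union_all = []
for gs in [(), ("a",), ("b",), ("a", "b")]:   # CUBE(a,b)
    union_all.extend(group_by(cte, gs, "v"))  # one GROUP BY per set
```

This shows the semantics only; as described above, getting the planner to generate such a shared-input plan internally is exactly the part that proved hard.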

-- 
Andrew (irc:RhodiumToad)


