Re: Final Patch for GROUPING SETS - Mailing list pgsql-hackers

From: Andrew Gierth
Subject: Re: Final Patch for GROUPING SETS
Msg-id: 87oam6tjux.fsf@news-spur.riddles.org.uk
In response to: Re: Final Patch for GROUPING SETS (Andres Freund <andres@anarazel.de>)
>>>>> "Andres" == Andres Freund <andres@anarazel.de> writes:
Andres> This is not a real review.  I'm just scanning through the
Andres> patch, without reading the thread, to understand if I see
Andres> something "worthy" of controversy. While scanning I might have
Andres> a couple observations or questions.
 
>> + *      A list of grouping sets which is structurally equivalent to a ROLLUP
>> + *      clause (e.g. (a,b,c), (a,b), (a)) can be processed in a single pass over
>> + *      ordered data.  We do this by keeping a separate set of transition values
>> + *      for each grouping set being concurrently processed; for each input tuple
>> + *      we update them all, and on group boundaries we reset some initial subset
>> + *      of the states (the list of grouping sets is ordered from most specific to
>> + *      least specific).  One AGG_SORTED node thus handles any number of grouping
>> + *      sets as long as they share a sort order.
 
Andres> Found "initial subset" not very clear, even if I probably
Andres> guessed the right meaning.

How about:
 *  [...], and on group boundaries we reset those states
 *  (starting at the front of the list) whose grouping values have
 *  changed (the list of grouping sets is ordered from most specific to
 *  least specific).  One AGG_SORTED node thus handles any number [...]
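
For illustration, this is the kind of query that comment is about (a
table "sales" with columns a, b, c, v is assumed, purely for the
example):

    -- These grouping sets form a rollup-shaped chain: each set is a
    -- prefix of (a, b, c), so a single pass over input sorted by
    -- (a, b, c) can maintain one transition state per set, resetting
    -- at each group boundary only those states whose grouping values
    -- changed.
    SELECT a, b, c, sum(v)
      FROM sales
     GROUP BY GROUPING SETS ((a, b, c), (a, b), (a));

    -- ROLLUP (a, b, c) would be the same list plus the grand total ().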
 
>> + *      To handle multiple grouping sets that _don't_ share a sort order, we use
>> + *      a different strategy.  An AGG_CHAINED node receives rows in sorted order
>> + *      and returns them unchanged, but computes transition values for its own
>> + *      list of grouping sets.  At group boundaries, rather than returning the
>> + *      aggregated row (which is incompatible with the input rows), it writes it
>> + *      to a side-channel in the form of a tuplestore.  Thus, a number of
>> + *      AGG_CHAINED nodes are associated with a single AGG_SORTED node (the
>> + *      "chain head"), which creates the side channel and, when it has returned
>> + *      all of its own data, returns the tuples from the tuplestore to its own
>> + *      caller.
 
Andres> This paragraph deserves to be expanded imo.

OK, but what in particular needs clarifying?
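
As a concrete example of what that paragraph describes (same assumed
table as before):

    -- (a) and (b) share no common sort order, so one AGG_SORTED pass
    -- cannot compute both.  One set goes to the AGG_SORTED chain head;
    -- the other is handled by an AGG_CHAINED node behind its own Sort,
    -- which passes rows through unchanged and pushes its aggregated
    -- rows into the tuplestore side-channel.
    SELECT a, b, sum(v)
      FROM sales
     GROUP BY GROUPING SETS ((a), (b));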
>> + *      In order to avoid excess memory consumption from a chain of alternating
>> + *      Sort and AGG_CHAINED nodes, we reset each child Sort node preemptively,
>> + *      allowing us to cap the memory usage for all the sorts in the chain at
>> + *      twice the usage for a single node.
 
Andres> What does resetting 'preemptively' mean?

Plan nodes are normally not reset (in the sense of calling ExecReScan)
just because they have finished; rather, that normally happens only
just before a subsequent new scan begins.  Doing the rescan call as
soon as all the sorted output has been read means we discard the data
from each Sort node as soon as it has been transferred to the next one.

There is a more specific comment in agg_retrieve_chained where this
actually happens.
>> + *      From the perspective of aggregate transition and final functions, the
>> + *      only issue regarding grouping sets is this: a single call site (flinfo)
>> + *      of an aggregate function may be used for updating several different
>> + *      transition values in turn. So the function must not cache in the flinfo
>> + *      anything which logically belongs as part of the transition value (most
>> + *      importantly, the memory context in which the transition value exists).
>> + *      The support API functions (AggCheckCallContext, AggRegisterCallback) are
>> + *      sensitive to the grouping set for which the aggregate function is
>> + *      currently being called.
 
Andres> Hm. I've seen a bunch of aggregates do this.

Such as? This was discussed about a year ago in the context of WITHIN
GROUP:

http://www.postgresql.org/message-id/87r424i24w.fsf@news-spur.riddles.org.uk
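
To make the hazard concrete (same assumed table; the choice of
aggregate is just an example):

    -- One textual call site of array_agg, but two grouping sets being
    -- processed concurrently means two live transition values behind
    -- that single flinfo: each input row advances first one state,
    -- then the other, through the same flinfo.  Anything per-state
    -- (above all, the state's memory context) must therefore live in
    -- the transition value, not be cached in the flinfo.
    SELECT a, b, array_agg(v)
      FROM sales
     GROUP BY GROUPING SETS ((a), (b));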
>> + *      TODO: AGG_HASHED doesn't support multiple grouping sets yet.
Andres> Are you intending to resolve this before an eventual commit?

Original plan was to tackle AGG_HASHED after a working implementation
was committed; we figured that we'd have two commitfests to get the
basics right, and then have a chance to get AGG_HASHED done for the
third one. Also, there was talk of other people working on hashagg
memory usage issues, and we didn't want to conflict with that.

Naturally the extended delays rather put paid to that plan. Going ahead
and writing code for AGG_HASHED anyway wasn't really an option, since
with the overall structural questions unresolved there was too much
chance of it being wasted effort.
Andres> Possibly after the 'structural' issues are resolved? Or do you
Andres> think this can safely be put off for another release?
 

I think the feature is useful even without AGG_HASHED, even though that
means it can sometimes be beaten on performance by using UNION ALL of
many separate GROUP BYs; but I'd defer to the opinions of others on that
point.
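
For comparison, the UNION ALL rewrite I mean is something like this
(same assumed table); each branch can potentially use a plain
HashAggregate, but the input is scanned once per grouping set:

    SELECT a, NULL AS b, sum(v) FROM sales GROUP BY a
    UNION ALL
    SELECT NULL, b, sum(v) FROM sales GROUP BY b;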
Andres> Maybe it's just me, but I get twitchy if I see a default being
Andres> used like this. I'd much, much rather see the two remaining
Andres> AGG_* types and get a warning from the compiler if a new one is
Andres> added.
 

Meh. It also needs a bogus initialization of "result" to avoid compiler
warnings if done that way.
Andres> I'll bet this will look absolutely horrid after a pgindent run :/

pgindent doesn't touch it, I just checked.
[making CUBE and ROLLUP work without being reserved]
Andres> This is somewhat icky. I've not really thought about this very
Andres> much, but it's imo something to pay attention to.
 

This one was discussed in December or so - all the arguments were
thrashed out then.

The bottom line is that reserving "cube" is really painful due to
contrib/cube, and of the possible workarounds, using precedence rules to
resolve it in the grammar is already being used for some other
constructs.
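
For anyone who didn't follow that discussion, the ambiguity looks like
this (table t and its columns assumed for illustration):

    SELECT cube(1, 2);          -- contrib/cube function call
    SELECT a, b, sum(v)
      FROM t
     GROUP BY CUBE (a, b);      -- grouping construct

The precedence rules let the grammar accept both readings without
making "cube" a reserved word.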
Andres> So, having quickly scanned through the patch, do I understand
Andres> correctly that the contentious problems are:

Andres> * Arguably this converts the execution *tree* into a DAG. Tom
Andres> seems to be rather uncomfortable with that. I am wondering
Andres> whether this really is a big deal - essentially this only
Andres> happens in a relatively 'isolated' part of the tree, right?
Andres> I.e. if those chained-together nodes were considered one node,
Andres> there would not be any loops?  Additionally, the way
Andres> parameterized scans work already essentially "violates" the
Andres> tree paradigm somewhat.
 

The major downsides as I see them with the current approach are:

1. It makes plans (and hence explain output) nest very deeply if you
have complex grouping sets (especially cubes with high dimensionality).

This can make explain very slow in the most extreme cases (explain
seems to take about O(N^3) time in the nesting depth of the plan; I
don't know why).  But it's important not to exaggerate this effect: if anyone
actually has a real-world example of a 12-d cube I'll eat the headgear
of their choice, and for an 8-d cube the explain overhead is only on the
order of ~40ms.  (A 12-d cube generates more than 35 megs of explain
output, nested about 1850 levels deep, and takes about 45 seconds to
explain, though only about 200ms to plan.)

In more realistic cases the nesting isn't too bad (a 4-d cube gives a
12-deep plan: 6 sorts and 6 agg nodes; see the example query after this
list), though it's still somewhat less readable than a union-based plan
would be. But honestly, I think explain output aesthetics should not be
a major determining factor for implementations.

2. A union-based approach would have a chance of including AGG_HASHED
support without any significant code changes, just by using one HashAgg
node per qualifying grouping set. However, this would be potentially
significantly slower than teaching HashAgg to do multiple grouping sets,
and memory usage would be an issue.

(The current approach is specifically intended to allow the use of an
AGG_HASHED node as the head of the chain, once it has been extended to
support multiple grouping sets.)
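
The 4-d case mentioned under point 1 is nothing exotic, for reference
(assumed table again):

    -- Under the current approach this plans as a 12-deep chain:
    -- 6 Sort nodes and 6 Agg nodes covering the 16 grouping sets.
    EXPLAIN
    SELECT a, b, c, d, sum(v)
      FROM sales
     GROUP BY CUBE (a, b, c, d);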
Andres> There still might be better representations, about which I want
Andres> to think, don't get me wrong. I'm "just" not seeing this as a
Andres> fundamental problem.
 

The obvious alternative is this:
-> CTE x
   -> entire input subplan here
-> Append
   -> GroupAggregate
      -> Sort
         -> CTE Scan x
   -> GroupAggregate
      -> Sort
         -> CTE Scan x
   -> HashAggregate
      -> CTE Scan x
   [...]
 

which was basically what we expected to do originally. But all of the
existing code to deal with CTE / CTEScan is based on the assumption that
each CTE has a rangetable entry before planning starts, and it is
completely non-obvious how to arrange to generate such CTEs on the fly
while planning.  Tom offered in December to look into that aspect for
us, and of course we've heard nothing about it since.
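
In SQL terms, that plan shape corresponds to writing the query out by
hand like this (same assumed table):

    WITH x AS (SELECT a, b, v FROM sales)   -- entire input subplan
    SELECT a, b, sum(v) FROM x GROUP BY a, b
    UNION ALL
    SELECT a, NULL, sum(v) FROM x GROUP BY a
    UNION ALL
    SELECT NULL, NULL, sum(v) FROM x;

The planner would have to construct the equivalent of that CTE and its
rangetable entry on the fly, which is the part nobody has worked out.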
Andres> * The whole grammar/keyword issue. To me this seems to be a
Andres> problem of swallowing one out of several similarly coloured
Andres> poisonous pills.
 

Right. Which is why this issue was thrashed out months ago and the
current approach decided on. I consider this question closed.
Andres> Are those the two bigger controversial areas? Or are there
Andres> others in your respective views?

Another controversial item was the introduction of GroupedVar. The need
for this can be avoided by explicitly setting to NULL the relevant
columns of the representative group tuple when evaluating result rows,
but (a) I don't think that's an especially clean approach (though I'm
not pushing back very hard on it) and (b) the logic needed in its
absence is different between the current chaining implementation and a
possible union implementation, so I decided against making any changes
on wasted-effort grounds.
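
For anyone not following the GroupedVar question, the issue shows up in
output like this (assumed table again):

    -- In result rows for the (a) grouping set, column b must read as
    -- NULL even though the representative group tuple carries a real
    -- b value.  GroupedVar makes ungrouped columns resolve to NULL;
    -- the alternative is to physically null out those columns in the
    -- representative tuple.
    SELECT a, b, sum(v)
      FROM sales
     GROUP BY ROLLUP (a, b);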

-- 
Andrew (irc:RhodiumToad)


