Re: Final Patch for GROUPING SETS - Mailing list pgsql-hackers
From: Andrew Gierth
Subject: Re: Final Patch for GROUPING SETS
Msg-id: 87oam6tjux.fsf@news-spur.riddles.org.uk
In response to: Re: Final Patch for GROUPING SETS (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
>>>>> "Andres" == Andres Freund <andres@anarazel.de> writes:

 Andres> This is not a real review. I'm just scanning through the
 Andres> patch, without reading the thread, to understand if I see
 Andres> something "worthy" of controversy. While scanning I might have
 Andres> a couple observations or questions.

 >> + * A list of grouping sets which is structurally equivalent to a ROLLUP
 >> + * clause (e.g. (a,b,c), (a,b), (a)) can be processed in a single pass over
 >> + * ordered data. We do this by keeping a separate set of transition values
 >> + * for each grouping set being concurrently processed; for each input tuple
 >> + * we update them all, and on group boundaries we reset some initial subset
 >> + * of the states (the list of grouping sets is ordered from most specific to
 >> + * least specific). One AGG_SORTED node thus handles any number of grouping
 >> + * sets as long as they share a sort order.

 Andres> Found "initial subset" not very clear, even if I probably
 Andres> guessed the right meaning.

How about:

 * [...], and on group boundaries we reset those states
 * (starting at the front of the list) whose grouping values have
 * changed (the list of grouping sets is ordered from most specific to
 * least specific). One AGG_SORTED node thus handles any number [...]

 >> + * To handle multiple grouping sets that _don't_ share a sort order, we use
 >> + * a different strategy. An AGG_CHAINED node receives rows in sorted order
 >> + * and returns them unchanged, but computes transition values for its own
 >> + * list of grouping sets. At group boundaries, rather than returning the
 >> + * aggregated row (which is incompatible with the input rows), it writes it
 >> + * to a side-channel in the form of a tuplestore. Thus, a number of
 >> + * AGG_CHAINED nodes are associated with a single AGG_SORTED node (the
 >> + * "chain head"), which creates the side channel and, when it has returned
 >> + * all of its own data, returns the tuples from the tuplestore to its own
 >> + * caller.
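For intuition, here is a toy single-pass simulation of the AGG_SORTED scheme quoted above (plain Python with invented names, not the patch's C): one transition value per grouping set in the rollup chain, and at each group boundary the states whose grouping values changed are emitted and reset.

```python
def rollup_sorted(rows, keys):
    """rows: dicts sorted by `keys`; computes count(*) for the rollup
    (keys, keys[:-1], ..., ()) in a single pass over the sorted input."""
    # Grouping sets ordered from most specific to least specific.
    sets = [keys[:i] for i in range(len(keys), -1, -1)]
    states = [0] * len(sets)       # one transition value (a count) per set
    current = [None] * len(sets)   # grouping values each state belongs to
    out = []
    for row in rows:
        vals = [tuple(row[k] for k in gs) for gs in sets]
        # Reset (and emit) the states whose grouping values changed;
        # since more specific sets change first, this is always an
        # initial subset ("prefix") of the list.
        for i, gs in enumerate(sets):
            if current[i] is not None and vals[i] != current[i]:
                out.append((dict(zip(gs, current[i])), states[i]))
                states[i] = 0
            current[i] = vals[i]
        for i in range(len(sets)):
            states[i] += 1         # advance every transition value
    for i, gs in enumerate(sets):  # flush the final group of each set
        if current[i] is not None:
            out.append((dict(zip(gs, current[i])), states[i]))
    return out
```

A call like `rollup_sorted(rows, ['a', 'b'])` thus produces counts for the grouping sets (a,b), (a) and () from one scan, which is the point of the AGG_SORTED design.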
 Andres> This paragraph deserves to be expanded imo.

OK, but what in particular needs clarifying?

 >> + * In order to avoid excess memory consumption from a chain of alternating
 >> + * Sort and AGG_CHAINED nodes, we reset each child Sort node preemptively,
 >> + * allowing us to cap the memory usage for all the sorts in the chain at
 >> + * twice the usage for a single node.

 Andres> What does resetting 'preemptively' mean?

Plan nodes are normally not reset (in the sense of calling ExecReScan) just because they finished; rather, that is done before a subsequent new scan. Doing the rescan call after all the sorted output has been read means we discard the data from each sort node as soon as it has been transferred to the next one. There is a more specific comment in agg_retrieve_chained where this actually happens.

 >> + * From the perspective of aggregate transition and final functions, the
 >> + * only issue regarding grouping sets is this: a single call site (flinfo)
 >> + * of an aggregate function may be used for updating several different
 >> + * transition values in turn. So the function must not cache in the flinfo
 >> + * anything which logically belongs as part of the transition value (most
 >> + * importantly, the memory context in which the transition value exists).
 >> + * The support API functions (AggCheckCallContext, AggRegisterCallback) are
 >> + * sensitive to the grouping set for which the aggregate function is
 >> + * currently being called.

 Andres> Hm. I've seen a bunch of aggregates do this.

Such as? This was discussed about a year ago in the context of WITHIN GROUP:

http://www.postgresql.org/message-id/87r424i24w.fsf@news-spur.riddles.org.uk

 >> + * TODO: AGG_HASHED doesn't support multiple grouping sets yet.

 Andres> Are you intending to resolve this before an eventual commit?
Original plan was to tackle AGG_HASHED after a working implementation was committed; we figured that we'd have two commitfests to get the basics right, and then have a chance to get AGG_HASHED done for the third one. Also, there was talk of other people working on hashagg memory usage issues, and we didn't want to conflict with that. Naturally the extended delays rather put paid to that plan.

Going ahead and writing code for AGG_HASHED anyway wasn't really an option, since with the overall structural questions unresolved there was too much chance of it being wasted effort.

 Andres> Possibly after the 'structural' issues are resolved? Or do you
 Andres> think this can safely be put off for another release?

I think the feature is useful even without AGG_HASHED, even though that means it can sometimes be beaten on performance by using UNION ALL of many separate GROUP BYs; but I'd defer to the opinions of others on that point.

 Andres> Maybe it's just me, but I get twitchy if I see a default being
 Andres> used like this. I'd much, much rather see the two remaining
 Andres> AGG_* types and get a warning from the compiler if a new one is
 Andres> added.

Meh. It also needs a bogus initialization of "result" to avoid compiler warnings if done that way.

 Andres> I'll bet this will look absolutely horrid after a pgindent run :/

pgindent doesn't touch it, I just checked.

[making CUBE and ROLLUP work without being reserved]

 Andres> This is somewhat icky. I've not really thought about this very
 Andres> much, but it's imo something to pay attention to.

This one was discussed in December or so - all the arguments were thrashed out then. The bottom line is that reserving "cube" is really painful due to contrib/cube, and of the possible workarounds, using precedence rules to resolve it in the grammar is already being used for some other constructs.
 Andres> So, having quickly scanned through the patch, do I understand
 Andres> correctly that the contentious problems are:

 Andres> * Arguably this converts the execution *tree* into a DAG. Tom
 Andres>   seems to be rather uncomfortable with that. I am wondering
 Andres>   whether this really is a big deal - essentially this only
 Andres>   happens in a relatively 'isolated' part of the tree, right?
 Andres>   I.e. if those chained together nodes were considered one node,
 Andres>   there would not be any loops? Additionally, the way
 Andres>   parametrized scans work already essentially "violates" the
 Andres>   tree paradigm somewhat.

The major downsides as I see them with the current approach are:

1. It makes plans (and hence explain output) nest very deeply if you have complex grouping sets (especially cubes with high dimensionality). This can make explain very slow in the most extreme cases (explain seems to perform about O(N^3) in the nesting depth of the plan; I don't know why). But it's important not to exaggerate this effect: if anyone actually has a real-world example of a 12-d cube I'll eat the headgear of their choice, and for an 8-d cube the explain overhead is only on the order of ~40ms. (A 12-d cube generates more than 35 megs of explain output, nested about 1850 levels deep, and takes about 45 seconds to explain, though only about 200ms to plan.)

In more realistic cases the nesting isn't too bad (a 4-d cube gives a 12-deep plan: 6 sorts and 6 agg nodes), but it's still somewhat less readable than a union-based plan would be. But honestly, I think that explain output aesthetics should not be a major determining factor for implementations.

2. A union-based approach would have a chance of including AGG_HASHED support without any significant code changes, just by using one HashAgg node per qualifying grouping set. However, this would be potentially significantly slower than teaching HashAgg to do multiple grouping sets, and memory usage would be an issue.
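To make the tradeoff in point 2 concrete, here is a toy simulation (invented names, plain Python rather than the patch's C) of what a HashAgg taught to handle multiple grouping sets would do: a single pass over unsorted input maintains one hash table per grouping set, instead of spending one full scan of the input per grouping set as one-HashAgg-per-set would.

```python
def hashagg_grouping_sets(rows, sets):
    """rows: dicts in any order; sets: list of column-name lists.
    Returns count(*) per group for every grouping set, from one pass."""
    tables = [{} for _ in sets]        # one hash table per grouping set
    for row in rows:                   # a single scan of the input...
        for gs, table in zip(sets, tables):
            key = tuple(row[c] for c in gs)
            table[key] = table.get(key, 0) + 1   # ...updates every set
    out = []
    for gs, table in zip(sets, tables):
        for key, n in table.items():
            out.append((dict(zip(gs, key)), n))
    return out
```

The memory cost of this strategy is that all the per-set hash tables must fit concurrently, which is why memory usage is flagged as an issue above.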
(The current approach is specifically intended to allow the use of an AGG_HASHED node as the head of the chain, once it has been extended to support multiple grouping sets.)

 Andres> There still might be better representations, about which I want
 Andres> to think, don't get me wrong. I'm "just" not seeing this as a
 Andres> fundamental problem.

The obvious alternative is this:

  -> CTE x
     -> entire input subplan here
  -> Append
     -> GroupAggregate
        -> Sort
           -> CTE Scan x
     -> GroupAggregate
        -> Sort
           -> CTE Scan x
     -> HashAggregate
        -> CTE Scan x
  [...]

which was basically what we expected to do originally. But all of the existing code to deal with CTE / CTEScan is based on the assumption that each CTE has a rangetable entry before planning starts, and it is completely non-obvious how to arrange to generate such CTEs on the fly while planning. Tom offered in December to look into that aspect for us, and of course we've heard nothing about it since.

 Andres> * The whole grammar/keyword issue. To me this seems to be a
 Andres>   problem of swallowing one out of several similarly coloured
 Andres>   poisonous pills.

Right. Which is why this issue was thrashed out months ago and the current approach decided on. I consider this question closed.

 Andres> Are those the two bigger controversial areas? Or are there
 Andres> others in your respective views?

Another controversial item was the introduction of GroupedVar. The need for this can be avoided by explicitly setting to NULL the relevant columns of the representative group tuple when evaluating result rows, but (a) I don't think that's an especially clean approach (though I'm not pushing back very hard on it), and (b) the logic needed in its absence is different between the current chaining implementation and a possible union implementation, so I decided against making any changes on wasted-effort grounds.

-- 
Andrew (irc:RhodiumToad)