From: Tomas Vondra
Subject: Re: Final Patch for GROUPING SETS - unrecognized node type: 347
Date:
Msg-id: 540C5081.90205@fuzzy.cz
In response to: Re: Final Patch for GROUPING SETS - unrecognized node type: 347  (Andrew Gierth <andrew@tao11.riddles.org.uk>)
Responses: Re: Final Patch for GROUPING SETS - unrecognized node type: 347
List: pgsql-hackers
On 6.9.2014 23:34, Andrew Gierth wrote:
>>>>>> "Tomas" == Tomas Vondra <tv@fuzzy.cz> writes:
> 
>  Tomas> I have significant doubts about the whole design,
>  Tomas> though. Especially the decision not to use HashAggregate,
> 
> There is no "decision not to use HashAggregate". There is simply no
> support for HashAggregate yet.
> 
> Having it be able to work with GroupAggregate is essential, because
> there are always cases where HashAggregate is simply not permitted
> (e.g. when using distinct or sorted aggs; or unhashable types; or with
> the current code, when the estimated memory usage exceeds work_mem).
> HashAggregate may be a performance improvement, but it's something
> that can be added afterwards rather than an essential part of the
> feature.

Ah, OK. I got confused by the "final patch" subject, and so the
possibility of additional optimization somehow didn't occur to me.
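
And sure, a query like the following (t being a made-up table) can never use
hashing because of the DISTINCT/ordered aggregate, so making it work with
GroupAggregate clearly comes first:

    SELECT a, b, array_agg(DISTINCT c ORDER BY c)
      FROM t
     GROUP BY CUBE (a, b);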

>  Tomas> Now, the chaining only makes this worse, because it
>  Tomas> effectively forces a separate sort of the whole table for each
>  Tomas> grouping set.
> 
> It's not one sort per grouping set, it's the minimal number of sorts
> needed to express the result as a union of ROLLUP clauses. The planner
> code will (I believe) always find the smallest number of sorts needed.

You're probably right. Although with GROUP BY CUBE (a,b,c,a) I get one more
ChainAggregate than with CUBE (a,b,c), and we seem to compute all the
aggregates twice. Not sure we need to address this, though, as it's mostly
the user's fault.
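
FWIW, what I compared was essentially this (t being a made-up table):

    -- plain cube
    SELECT a, b, c, count(*) FROM t GROUP BY CUBE (a, b, c);

    -- the same columns with 'a' duplicated -- this gives me one extra
    -- ChainAggregate and the aggregates seem to be computed twice
    SELECT a, b, c, count(*) FROM t GROUP BY CUBE (a, b, c, a);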


> Each aggregate node can process any number of grouping sets as long as
> they represent a single rollup list (and therefore share a single sort
> order).
> 
> Yes, this is slower than using one hashagg. But it solves the general
> problem in a way that does not interfere with future optimization.
> 
> (HashAggregate can be added to the current implementation by first
> adding executor support for hashagg with multiple grouping sets, then
> in the planner, extracting as many hashable grouping sets as possible
> from the list before looking for rollup lists. The chained aggregate
> code can work just fine with a HashAggregate as the chain head.
> 
> We have not actually tackled this, since I'm not going to waste any
> time adding optimizations before the basic idea is accepted.)

OK, understood.

> 
>  Tomas> What I envisioned when considering hacking on this a few
>  Tomas> months back, was extending the aggregate API with "merge
>  Tomas> state" function,
> 
> That's not really on the cards for arbitrary non-trivial aggregate
> functions.
> 
> Yes, it can be done for simple ones, and if you want to use that as a
> basis for adding optimizations that's fine. But a solution that ONLY
> works in simple cases isn't sufficient, IMO.

I believe it can be done for most aggregates, assuming you have access
to the internal state somehow (not just the final output value). Adding
it for the in-core aggregates would not be difficult in most cases. But
you're right that we don't have anything like this at the moment.
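
To illustrate what I mean by "merge state" (purely hypothetical, nothing
like this exists in the aggregate API today): for something like avg(),
whose state is essentially a running sum and count, merging two partial
states would be trivial:

    -- hypothetical sketch only, just to show what merging partial states means
    CREATE TYPE avg_state AS (sum numeric, count bigint);

    CREATE FUNCTION avg_merge(avg_state, avg_state) RETURNS avg_state AS $$
        SELECT ROW(($1).sum + ($2).sum, ($1).count + ($2).count)::avg_state;
    $$ LANGUAGE sql;

For aggregates with an opaque internal state this would of course have to
be a C-level function, but the principle is the same.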

regards
Tomas
