Re: Aggregate Supporting Functions - Mailing list pgsql-hackers

From David Rowley
Subject Re: Aggregate Supporting Functions
Date
Msg-id CAKJS1f9EQ64EJHOy81X38_K8u_XE7209NiuRwMU-Uh-FYO_pAw@mail.gmail.com
Whole thread Raw
In response to Re: Aggregate Supporting Functions  (Kevin Grittner <kgrittn@ymail.com>)
List pgsql-hackers
On 10 June 2015 at 01:53, Kevin Grittner <kgrittn@ymail.com> wrote:
David Rowley <david.rowley@2ndquadrant.com> wrote:

> 3. Add logic in the planner to look for look for supporting
> cases. With logic something along the lines of:
>
>   a. Does the query have any aggregates? If not -> return;
>   b. Does the query have more than 1 aggregate? If not -> return;
>   c. Does the at least one of the aggregates have hassuppagg set
>      to true? if not -> return;
>   d. Analyze aggregates to eliminate aggregates that are covered
>      by another aggregate. We should use the aggregate which
>      eliminates the most other aggregates*
>
> * For example stddev(x) will support avg(x), sum(x) and count(x)
> so a query such as select stddev(x), avg(x), sum(x), count(x)
> will eliminate avg(x), sum(x), count(x) as stddev(x) supports 3,
> avg(x) only supports 2, so will have to be eliminated.

Are you assuming that "x" must match exactly among the aggregates
to allow coverage? 

In my haste I neglected to mention that critical part :) Yeah the expression within the aggregation function must match exactly.
This would mean that count(*) would not optimise but count(x) would. I believe that's an ok restriction on a first cut implementation, as that rewrite requires some more NULLability analysis.


Have you given any thought to whether ties are
possible and how they should be resolved?


I guess ties are possible, although I can't immediately think of which of the standard aggs could cause that. I've not thought on it a great deal to be honest. I'm no too sure if pg_proc.procost is better to check or if pg_aggregate.aggtransspace would be better. I see procost is not very well set for quite a lot of functions, e.g float4pl and numeric_sum currently both have the cost of 1, which is certainly not the case. So perhaps it would be easier to just use aggtransspace, and if there's still a tie then just prefer the one that comes first in the targetlist. Likely that would be good as it keeps it predictable and allows the users to have influence on the decision.

 
> I'm a little bit concerned that someone will one day report that:
>
> SELECT avg(x), sum(x), count(x) from bigtable;
>
> Is faster than:
>
> SELECT sum(x), count(x) from bigtable;
>
> Of course, this will be just because we've made case 1 faster,
> NOT because we've slowed down case 2.

Yeah, that seems possible (if not with these functions, at least
technically possible with *some* functions), and hard to explain to
a novice.

> I can't immediately think of a way to fix that without risking
> slowing down: select count(x) from bigtable;

From the information you have proposed storing, with cost factors
associated with the functions, it seems technically possible to
infer that you could run (for example) the avg() aggregate to
accumulate both but only run the final functions of the aggregates
referenced by the query.  That seems like an optimization to try
hard to forget about until you have at least one real-world use
case where it would yield a significant benefit.  It seems
premature to optimize for that before having the rest working.

I see your point, and in my example I think your idea works well, but there's a risk of slowing things down here too for other cases.

With the "supporting aggregates" we're just listing the aggregates we happen to calculate as a byproduct. Using their value is about as cheap as calling the final function, but with this, if we decided to use avg(x) instead of sum(x) and count(x) we really have no way to understand what else avg(x) might be doing.  For example, if we pretend avg() does not exist, and we have stddev(x), sum(x) and count(x) as aggregates, given the query "select sum(x), count(x) from t"...  stddev(x) gives us count(x) and sum(x) as a byproduct, but stddev also calculates sum(x*x), which quite likely ends up slower than just doing sum(x) and count(x) like normal. Perhaps the code that looks for these patterns could take the aggregate that supports the smallest number of supporting aggregates that's enough to cover the requirements, but still...

We could perhaps add some smarts that analyses the costs of the transition function calls here which goes to ensure that stuff won't actually happen. Maybe that's phase 3 though. I don't think I want to touch it at this stage, but for sure something to remember for the future!
 
--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
 

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: [patch] PL/Python is too lossy with floats
Next
From: David Rowley
Date:
Subject: Re: The Future of Aggregation