Thread: Aggregate Supporting Functions

Aggregate Supporting Functions

From
David Rowley
Date:
I believe this is an idea that's been discussed before, but I'm not exactly sure where that happened:

Overview:

The idea is that we skip a major chunk of processing in situations like:

SELECT avg(x),sum(x),count(x) FROM bigtable;

Because avg(x) already technically knows what the values of sum(x) and count(x) are.

The performance improvement of this particular case is as follows:

create table bigtable as select
generate_series(1,1000000)::numeric as x; vacuum bigtable;

SELECT avg(x),sum(x),count(x) FROM bigtable; -- Query 1

Time: 390.325 ms
Time: 392.297 ms
Time: 400.790 ms

SELECT avg(x) FROM bigtable; -- Query 2

Time: 219.700 ms
Time: 215.285 ms
Time: 233.691 ms

With the implementation I'm proposing below, I believe that query 1 should perform almost as well as query 2. The only extra CPU work that would be done would be some extra checks during planning, and the calling of 2 simple new final functions which will extract the count(x) and sum(x) from the avg transition's state.
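To illustrate the idea (a hypothetical sketch in Python, not PostgreSQL internals): avg's transition state already carries the running sum and row count, so the proposed extra final functions only need to extract them:

```python
def avg_transfn(state, value):
    """Transition function: accumulate the running (sum, count)."""
    total, count = state
    return (total + value, count + 1)

def avg_finalfn(state):
    total, count = state
    return total / count

# The two proposed "supporting" final functions just pull the relevant
# piece out of avg's transition state -- no second pass over the table:
def avg_sum_finalfn(state):
    return state[0]

def avg_count_finalfn(state):
    return state[1]

state = (0, 0)
for x in [1, 2, 3, 4]:
    state = avg_transfn(state, x)
# avg_finalfn(state) == 2.5, avg_sum_finalfn(state) == 10,
# avg_count_finalfn(state) == 4
```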

Purpose of this Email:

For technical review of proposed implementation.

Implementation:

1. Add a new boolean column to pg_aggregate named hassuppagg, which will be set to true if the aggregate supports other aggregates. For example, avg(int) will support count(int) and sum(int).
2. Add a new system table named pg_aggregate_support (or some better, shorter name).

This system table will be defined as follows:
aspfnoid regproc,
aspfnsupported regproc,
aspfinalfn regproc,
primary key (aspfnoid, aspfnsupported)

Where, in the above example, aspfnoid will be avg(int) and two rows will exist: one with count(int) in aspfnsupported, and one with sum(int). aspfinalfn will be a new final function which extracts the required portion of avg's aggregate state.

3. Add logic in the planner to look for supporting cases, along the lines of:

  a. Does the query have any aggregates? If not -> return;
  b. Does the query have more than 1 aggregate? If not -> return;
  c. Does at least one of the aggregates have hassuppagg set to true? If not -> return;
  d. Analyze aggregates to eliminate aggregates that are covered by another aggregate. We should use the aggregate which eliminates the most other aggregates*

* For example, stddev(x) supports avg(x), sum(x) and count(x), so in a query such as select stddev(x), avg(x), sum(x), count(x) we can eliminate avg(x), sum(x) and count(x). stddev(x) supports 3 of the others while avg(x) supports only 2, so avg(x) is among those eliminated rather than being used as the covering aggregate.
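Step (d) might be sketched roughly as follows (Python used purely for illustration; the SUPPORTS table and function names are hypothetical, not actual catalog contents):

```python
# Hypothetical planner-side sketch: given which other aggregates each
# aggregate covers as a byproduct, greedily keep the one covering the
# most of the remaining aggregates and eliminate the covered ones.
SUPPORTS = {
    "stddev": {"avg", "sum", "count"},
    "avg": {"sum", "count"},
    "sum": set(),
    "count": set(),
}

def eliminate_covered(aggs):
    aggs = set(aggs)
    kept, covered_by = set(), {}
    while aggs:
        # Pick the aggregate covering the most of the remaining ones.
        best = max(aggs, key=lambda a: len(SUPPORTS[a] & aggs))
        kept.add(best)
        aggs.discard(best)
        for covered in SUPPORTS[best] & aggs:
            covered_by[covered] = best   # computed from best's state
            aggs.discard(covered)
    return kept, covered_by

kept, covered = eliminate_covered({"stddev", "avg", "sum", "count"})
# kept == {"stddev"}; avg, sum and count all come from stddev's state
```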

Concerns:

I'm a little bit concerned that someone will one day report that:

SELECT avg(x), sum(x), count(x) from bigtable;

Is faster than:

SELECT sum(x), count(x) from bigtable;

Of course, this will be just because we've made case 1 faster, NOT because we've slowed down case 2.
I can't immediately think of a way to fix that without risking slowing down: select count(x) from bigtable;

CREATE AGGREGATE Syntax:

To allow users to implement aggregates which take advantage of this, we should also expand the CREATE AGGREGATE syntax.

I've not given this a huge amount of thought. The only thing I've come up with so far is:

CREATE AGGREGATE avg(bigint)
(FINALFUNC = avgfinal)
SUPPORTS (count(bigint) = int8_avg_countfn, sum(bigint) = int8_avg_sumfn);


Can anyone think of anything that I've not accounted for before I go off and work on this?

Regards

David Rowley

"The research leading to these results has received funding from the European Union’s
Seventh Framework Programme (FP7/2007-2015) under grant agreement n° 318633"

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Aggregate Supporting Functions

From
Kevin Grittner
Date:
David Rowley <david.rowley@2ndquadrant.com> wrote:

> The idea is that we skip a major chunk of processing in
> situations like:
>
> SELECT avg(x),sum(x),count(x) FROM bigtable;
>
> Because avg(x) already technically knows what the values of
> sum(x) and count(x) are.

That has occurred to me as a possible optimization, but I hadn't
gone so far as to try to determine what the possible performance
improvement would be.

> The performance improvement of this particular case is as
> follows:
>
> create table bigtable as select
> generate_series(1,1000000)::numeric as x; vacuum bigtable;
>
> SELECT avg(x),sum(x),count(x) FROM bigtable; -- Query 1
>
> Time: 390.325 ms
> Time: 392.297 ms
> Time: 400.790 ms
>
> SELECT avg(x) FROM bigtable; -- Query 2
>
> Time: 219.700 ms
> Time: 215.285 ms
> Time: 233.691 ms
>
> With the implementation I'm proposing below, I believe that
> query 1 should perform almost as well as query 2. The only extra
> CPU work that would be done would be some extra checks during
> planning, and the calling of 2 simple new final functions which
> will extract the count(x) and sum(x) from the avg transition's
> state.

I agree that if the increase in planning time is negligible, the
performance to be gained in this example is a reduction of about
45% from the original run time.  That certainly seems to make it
worth considering.

> Implementation:
>
> 1. Add a new boolean column pg_aggregate named hassuppagg which
> will be set to true if the aggregate supports other aggregates.
> For example avg(int) will support count(int) and sum(int)
> 2. Add new system table named pg_aggregate_support (Or some
> better shorter name)
>
> This system table will be defined as follows:
> aspfnoid regproc,
> aspfnsupported regproc,
> aspfinalfn regproc,
> primary key (aspfnoid, aspfnsupported)
>
> Where in the above example aspfnoid will be avg(int) and 2 rows
> will exist. 1 with count(int) in aspfnsupported, and one with
> sum(int) in the aspfnsupported column. aspfinalfn will be a new
> final function which extracts the required portion of the avg's
> aggregate state.

The caffeine hasn't had a chance to fully kick in yet, but that
seems like enough information to optimize out the count() and sum()
aggregations.

> 3. Add logic in the planner to look for supporting
> cases, along the lines of:
>
>   a. Does the query have any aggregates? If not -> return;
>   b. Does the query have more than 1 aggregate? If not -> return;
>   c. Does at least one of the aggregates have hassuppagg set
>      to true? If not -> return;
>   d. Analyze aggregates to eliminate aggregates that are covered
>      by another aggregate. We should use the aggregate which
>      eliminates the most other aggregates*
>
> * For example stddev(x) will support avg(x), sum(x) and count(x)
> so a query such as select stddev(x), avg(x), sum(x), count(x)
> will eliminate avg(x), sum(x), count(x) as stddev(x) supports 3,
> avg(x) only supports 2, so will have to be eliminated.

Are you assuming that "x" must match exactly among the aggregates
to allow coverage?  Have you given any thought to whether ties are
possible and how they should be resolved?

> I'm a little bit concerned that someone will one day report that:
>
> SELECT avg(x), sum(x), count(x) from bigtable;
>
> Is faster than:
>
> SELECT sum(x), count(x) from bigtable;
>
> Of course, this will be just because we've made case 1 faster,
> NOT because we've slowed down case 2.

Yeah, that seems possible (if not with these functions, at least
technically possible with *some* functions), and hard to explain to
a novice.

> I can't immediately think of a way to fix that without risking
> slowing down: select count(x) from bigtable;

From the information you have proposed storing, with cost factors
associated with the functions, it seems technically possible to
infer that you could run (for example) the avg() aggregate to
accumulate both but only run the final functions of the aggregates
referenced by the query.  That seems like an optimization to try
hard to forget about until you have at least one real-world use
case where it would yield a significant benefit.  It seems
premature to optimize for that before having the rest working.

> To allow users to implement aggregates which take advantage of
> this we'd better also expand the CREATE AGGREGATE syntax.
>
> I've not given this a huge amount of thought.  The only thing
> I've come up with so far is;
>
> CREATE AGGREGATE avg(bigint)
>   (FINALFUNC = avgfinal)
>   SUPPORTS (count(bigint) = int8_avg_countfn,
>             sum(bigint) = int8_avg_sumfn);

I would try to find an existing keyword which could be used instead
of adding SUPPORTS to the list.  On a quick look over the keyword
list EXTRACT jumped out at me as a candidate.  There should be
something usable in there somewhere.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Aggregate Supporting Functions

From
Tom Lane
Date:
Kevin Grittner <kgrittn@ymail.com> writes:
> David Rowley <david.rowley@2ndquadrant.com> wrote:
>> [ avoid duplicate calculations for related aggregates ]

> From the information you have proposed storing, with cost factors
> associated with the functions, it seems technically possible to
> infer that you could run (for example) the avg() aggregate to
> accumulate both but only run the final functions of the aggregates
> referenced by the query.  That seems like an optimization to try
> hard to forget about until you have at least one real-world use
> case where it would yield a significant benefit.  It seems
> premature to optimize for that before having the rest working.

Actually, I would suggest that you forget about all the other aspects
and *just* do that, because it could be made to work today on existing
aggregate functions, and it would not require hundreds-to-thousands
of lines of boilerplate support code in the grammar, catalog support,
pg_dump, yadda yadda.  That is, look to see which aggregates use the
same transition function and run that just once.  We already have the
rule that the final function can't destroy the transition output,
so running two different final functions on the same transition result
should Just Work.
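This suggestion might be modelled like so (an illustrative Python sketch, not PostgreSQL code; the function names only echo the real transition/final function pairs):

```python
# Aggregates that share the same transition function and initial value
# run the transition once; each then applies its own final function to
# the shared transition result.

def shared_accum(state, x):            # shared transition: (sum, count)
    return (state[0] + x, state[1] + 1)

def avg_final(state):
    return state[0] / state[1]

def sum_final(state):
    return state[0]

# (transfn, initval, finalfn) per aggregate appearing in the query
aggregates = {
    "avg": (shared_accum, (0, 0), avg_final),
    "sum": (shared_accum, (0, 0), sum_final),
}

rows = [1, 2, 3, 4]
shared_states = {}
for transfn, initval, _finalfn in aggregates.values():
    key = (transfn, initval)
    if key not in shared_states:       # run each distinct transition once
        state = initval
        for x in rows:
            state = transfn(state, x)
        shared_states[key] = state

results = {name: finalfn(shared_states[(transfn, initval)])
           for name, (transfn, initval, finalfn) in aggregates.items()}
# results["avg"] == 2.5, results["sum"] == 10
```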

The rest of what David is thinking about could be done in a followon
version by allowing the same aggregate to be implemented by any of several
transition-function/final-function pairs, then teaching the planner to
prefer pairs that let the same transition function be used for several
aggregates.  But I'd see that as a later refinement that might well fail
the bang-for-buck test, and hence shouldn't be the first step.
        regards, tom lane



Re: Aggregate Supporting Functions

From
Kevin Grittner
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Kevin Grittner <kgrittn@ymail.com> writes:
>> David Rowley <david.rowley@2ndquadrant.com> wrote:
>>> [ avoid duplicate calculations for related aggregates ]

>> From the information you have proposed storing, with cost factors
>> associated with the functions, it seems technically possible to
>> infer that you could run (for example) the avg() aggregate to
>> accumulate both but only run the final functions of the aggregates
>> referenced by the query.  That seems like an optimization to try
>> hard to forget about until you have at least one real-world use
>> case where it would yield a significant benefit.  It seems
>> premature to optimize for that before having the rest working.

> Actually, I would suggest that you forget about all the other aspects
> and *just* do that, because it could be made to work today on existing
> aggregate functions, and it would not require hundreds-to-thousands
> of lines of boilerplate support code in the grammar, catalog support,
> pg_dump, yadda yadda.  That is, look to see which aggregates use the
> same transition function and run that just once.

I was responding to David's suggestion that this particular query
be optimized by using the transition function from avg():
 SELECT sum(x), count(x) from bigtable;

Reviewing what you said caused me to notice something that I had
missed before -- that sum() and avg() share a transition function.
Still, that function is not used for count(), so I don't see how
that fits into what you're saying above.

I agree that what you're suggesting does allow access to some very
low-hanging fruit that I had not noticed; it makes sense to get
that first.

> The rest of what David is thinking about could be done in a followon
> version by allowing the same aggregate to be implemented by any of several
> transition-function/final-function pairs, then teaching the planner to
> prefer pairs that let the same transition function be used for several
> aggregates.  But I'd see that as a later refinement that might well fail
> the bang-for-buck test, and hence shouldn't be the first step.

Well, that part will be a little tricky, because it would be
infrastructure which would allow what are likely significant
optimizations in several other features.  There's a bit of a
chicken-and-egg problem, because these other optimizations can't be
written without the infrastructure, and the infrastructure will not
show its full worth until the other optimizations are able to take
advantage of it.  But we can cross that bridge when we come to it.

This doesn't look to me like the traditional 80/20 rule.  I think
the easy stuff might be 5% of the benefit for 1% of the work; but
still a better bang for the buck than the other work.

What you are proposing as an alternative to what David proposed for
the later work looks (on the face of it) like a bigger, more
complicated mechanism than he proposed, but more flexible if it can
be made to work.  What I'd hate to see is failure to get a toaster 
because it's too expensive to get one that also bakes pizza.  We're 
gonna want to make a lot of toast over the next few years.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Aggregate Supporting Functions

From
David Rowley
Date:
On 10 June 2015 at 01:53, Kevin Grittner <kgrittn@ymail.com> wrote:
David Rowley <david.rowley@2ndquadrant.com> wrote:

> 3. Add logic in the planner to look for supporting
> cases, along the lines of:
>
>   a. Does the query have any aggregates? If not -> return;
>   b. Does the query have more than 1 aggregate? If not -> return;
>   c. Does at least one of the aggregates have hassuppagg set
>      to true? If not -> return;
>   d. Analyze aggregates to eliminate aggregates that are covered
>      by another aggregate. We should use the aggregate which
>      eliminates the most other aggregates*
>
> * For example stddev(x) will support avg(x), sum(x) and count(x)
> so a query such as select stddev(x), avg(x), sum(x), count(x)
> will eliminate avg(x), sum(x), count(x) as stddev(x) supports 3,
> avg(x) only supports 2, so will have to be eliminated.

Are you assuming that "x" must match exactly among the aggregates
to allow coverage? 

In my haste I neglected to mention that critical part :) Yeah, the expression within the aggregate function must match exactly.
This would mean that count(*) would not be optimised but count(x) would. I believe that's an OK restriction for a first-cut implementation, as that rewrite requires some more NULLability analysis.


Have you given any thought to whether ties are
possible and how they should be resolved?


I guess ties are possible, although I can't immediately think of which of the standard aggregates could cause that. I've not thought about it a great deal, to be honest. I'm not too sure whether pg_proc.procost or pg_aggregate.aggtransspace would be the better thing to check. I see procost is not very well set for quite a lot of functions; e.g. float4pl and numeric_sum currently both have a cost of 1, which is certainly not accurate. So perhaps it would be easier to just use aggtransspace, and if there's still a tie then prefer the one that comes first in the target list. Likely that would be good, as it keeps the behaviour predictable and allows users to have some influence on the decision.

 
> I'm a little bit concerned that someone will one day report that:
>
> SELECT avg(x), sum(x), count(x) from bigtable;
>
> Is faster than:
>
> SELECT sum(x), count(x) from bigtable;
>
> Of course, this will be just because we've made case 1 faster,
> NOT because we've slowed down case 2.

Yeah, that seems possible (if not with these functions, at least
technically possible with *some* functions), and hard to explain to
a novice.

> I can't immediately think of a way to fix that without risking
> slowing down: select count(x) from bigtable;

From the information you have proposed storing, with cost factors
associated with the functions, it seems technically possible to
infer that you could run (for example) the avg() aggregate to
accumulate both but only run the final functions of the aggregates
referenced by the query.  That seems like an optimization to try
hard to forget about until you have at least one real-world use
case where it would yield a significant benefit.  It seems
premature to optimize for that before having the rest working.

I see your point, and in my example I think your idea works well, but there's a risk of slowing things down in other cases too.

With the "supporting aggregates" we're just listing the aggregates we happen to calculate as a byproduct; using their values is about as cheap as calling an extra final function. But with this, if we decide to use avg(x) instead of sum(x) and count(x), we really have no way to know what else avg(x) might be doing. For example, if we pretend avg() does not exist, and we have stddev(x), sum(x) and count(x) as aggregates, then given the query "select sum(x), count(x) from t": stddev(x) gives us count(x) and sum(x) as a byproduct, but stddev also calculates sum(x*x), which quite likely ends up slower than just doing sum(x) and count(x) as normal. Perhaps the code that looks for these patterns could choose the aggregate that supports the smallest set of aggregates sufficient to cover the requirements, but still...
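A tiny sketch of the concern (illustrative Python, not the real transition functions; per-row work noted in comments):

```python
# stddev's transition state also accumulates sum(x*x), so pressing it
# into service just for sum() and count() does strictly more per-row
# work than the plain sum/count transition would.

def sumcount_transfn(state, x):
    total, count = state
    return (total + x, count + 1)          # one addition per row

def stddev_transfn(state, x):
    n, sx, sxx = state
    return (n + 1, sx + x, sxx + x * x)    # two additions and a multiply per row

plain, sd = (0, 0), (0, 0, 0)
for x in [1, 2, 3]:
    plain = sumcount_transfn(plain, x)
    sd = stddev_transfn(sd, x)
# plain == (6, 3); sd carries the extra sum-of-squares: (3, 6, 14)
```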

We could perhaps add some smarts that analyse the costs of the transition function calls to ensure that doesn't actually happen. Maybe that's phase 3, though. I don't think I want to touch it at this stage, but it's certainly something to remember for the future!
 
--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
 

Re: Aggregate Supporting Functions

From
David Rowley
Date:

On 10 June 2015 at 02:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Kevin Grittner <kgrittn@ymail.com> writes:
> David Rowley <david.rowley@2ndquadrant.com> wrote:
>> [ avoid duplicate calculations for related aggregates ]

> From the information you have proposed storing, with cost factors
> associated with the functions, it seems technically possible to
> infer that you could run (for example) the avg() aggregate to
> accumulate both but only run the final functions of the aggregates
> referenced by the query.  That seems like an optimization to try
> hard to forget about until you have at least one real-world use
> case where it would yield a significant benefit.  It seems
> premature to optimize for that before having the rest working.

Actually, I would suggest that you forget about all the other aspects
and *just* do that, because it could be made to work today on existing
aggregate functions, and it would not require hundreds-to-thousands
of lines of boilerplate support code in the grammar, catalog support,
pg_dump, yadda yadda.  That is, look to see which aggregates use the
same transition function and run that just once.  We already have the
rule that the final function can't destroy the transition output,
so running two different final functions on the same transition result
should Just Work.


Good idea.
I believe the only extra check, besides whether they use the same transfn, would be the initvalue of the aggregate.

I'll write a patch and post in the next couple of days.

Regards

David Rowley

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services