Re: Aggregate Supporting Functions - Mailing list pgsql-hackers

From Kevin Grittner
Subject Re: Aggregate Supporting Functions
Date
Msg-id 429093099.7931718.1433858027761.JavaMail.yahoo@mail.yahoo.com
Whole thread Raw
In response to Aggregate Supporting Functions  (David Rowley <david.rowley@2ndquadrant.com>)
Responses Re: Aggregate Supporting Functions  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Aggregate Supporting Functions  (David Rowley <david.rowley@2ndquadrant.com>)
List pgsql-hackers
David Rowley <david.rowley@2ndquadrant.com> wrote:

> The idea is that we skip a major chunk of processing in
> situations like:
>
> SELECT avg(x),sum(x),count(x) FROM bigtable;
>
> Because avg(x) already technically knows what the values of
> sum(x) and count(x) are.

That has occurred to me as a possible optimization, but I hadn't
gone so far as to try to determine what the possible performance
improvement would be.

> The performance improvement of this particular case is as
> follows:
>
> create table bigtable as select
> generate_series(1,1000000)::numeric as x; vacuum bigtable;
>
> SELECT avg(x),sum(x),count(x) FROM bigtable; -- Query 1
>
> Time: 390.325 ms
> Time: 392.297 ms
> Time: 400.790 ms
>
> SELECT avg(x) FROM bigtable; -- Query 2
>
> Time: 219.700 ms
> Time: 215.285 ms
> Time: 233.691 ms
>
> With the implementation I'm proposing below, I believe that
> query 1 should perform almost as well as query 2. The only extra
> CPU work that would be done would be some extra checks during
> planning, and the calling of 2 simple new final functions which
> will extract the count(x) and sum(x) from the avg transition's
> state.

I agree that if the increase in planning time is negligible, the
performance to be gained in this example is a reduction of about
45% from the original run time.  That certainly seems to make it
worth considering.

> Implementation:
>
> 1. Add a new boolean column pg_aggregate named hassuppagg which
> will be set to true if the aggregate supports other aggregates.
> For example avg(int) will support count(int) and sum(int)
> 2. Add new system table named pg_aggregate_support (Or some
> better shorter name)
>
> This system table will be defined as follows:
> aspfnoid regproc,
> aspfnsupported regproc,
> aspfinalfn regproc,
> primary key (aspfnoid, aspfnsupported)
>
> Where in the above example aspfnoid will be avg(int) and 2 rows
> will exist. 1 with count(int) in aspfnsupported, and one with
> sum(int) in the aspfnsupported column. aspfinalfn will be a new
> final function which extracts the required portion of the avg's
> aggregate state.

The caffeine hasn't had a chance to fully kick in yet, but that
seems like enough information to optimize out the count() and sum()
aggregations.

> 3. Add logic in the planner to look for look for supporting
> cases. With logic something along the lines of:
>
>   a. Does the query have any aggregates? If not -> return;
>   b. Does the query have more than 1 aggregate? If not -> return;
>   c. Does the at least one of the aggregates have hassuppagg set
>      to true? if not -> return;
>   d. Analyze aggregates to eliminate aggregates that are covered
>      by another aggregate. We should use the aggregate which
>      eliminates the most other aggregates*
>
> * For example stddev(x) will support avg(x), sum(x) and count(x)
> so a query such as select stddev(x), avg(x), sum(x), count(x)
> will eliminate avg(x), sum(x), count(x) as stddev(x) supports 3,
> avg(x) only supports 2, so will have to be eliminated.

Are you assuming that "x" must match exactly among the aggregates
to allow coverage?  Have you given any thought to whether ties are
possible and how they should be resolved?

> I'm a little bit concerned that someone will one day report that:
>
> SELECT avg(x), sum(x), count(x) from bigtable;
>
> Is faster than:
>
> SELECT sum(x), count(x) from bigtable;
>
> Of course, this will be just because we've made case 1 faster,
> NOT because we've slowed down case 2.

Yeah, that seems possible (if not with these functions, at least
technically possible with *some* functions), and hard to explain to
a novice.

> I can't immediately think of a way to fix that without risking
> slowing down: select count(x) from bigtable;

From the information you have proposed storing, with cost factors
associated with the functions, it seems technically possible to
infer that you could run (for example) the avg() aggregate to
accumulate both but only run the final functions of the aggregates
referenced by the query.  That seems like an optimization to try
hard to forget about until you have at least one real-world use
case where it would yield a significant benefit.  It seems
premature to optimize for that before having the rest working.

> To allow users to implement aggregates which take advantage of
> this we'd better also expand the CREATE AGGREGATE syntax.
>
> I've not given this a huge amount of thought.  The only thing
> I've come up with so far is;
>
> CREATE AGGREGATE avg(bigint)
>   (FINALFUNC = avgfinal)
>   SUPPORTS (count(bigint) = int8_avg_countfn,
>             sum(bigint) = int8_avg_sumfn);

I would try to find an existing keyword which could be used instead
of adding SUPPORTS to the list.  On a quick look over the keyword
list EXTRACT jumped out at me as a candidate.  There should be
something usable in there somewhere.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Jeevan Chalke
Date:
Subject: Re: psql :: support for \ev viewname and \sv viewname
Next
From: Tomas Vondra
Date:
Subject: Re: The Future of Aggregation