Aggregate Supporting Functions - Mailing list pgsql-hackers

From David Rowley
Subject Aggregate Supporting Functions
Date
Msg-id CAKJS1f8ebkc=EhEq+ArM8vwYZ5vSapJ1Seub5=FvRRuDCtFfsQ@mail.gmail.com
Whole thread Raw
Responses Re: Aggregate Supporting Functions  (Kevin Grittner <kgrittn@ymail.com>)
List pgsql-hackers
I believe this is an idea that's been discussed before, but I'm not exactly sure where that happened:

Overview:

The idea is that we skip a major chunk of processing in situations like:

SELECT avg(x),sum(x),count(x) FROM bigtable;

Because avg(x) already technically knows what the values of sum(x) and count(x) are.

The performance improvement of this particular case is as follows:

create table bigtable as select
generate_series(1,1000000)::numeric as x; vacuum bigtable;

SELECT avg(x),sum(x),count(x) FROM bigtable; -- Query 1

Time: 390.325 ms
Time: 392.297 ms
Time: 400.790 ms

SELECT avg(x) FROM bigtable; -- Query 2

Time: 219.700 ms
Time: 215.285 ms
Time: 233.691 ms

With the implementation I'm proposing below, I believe that query 1 should perform almost as well as query 2. The only extra CPU work that would be done would be some extra checks during planning, and the calling of 2 simple new final functions which will extract the count(x) and sum(x) from the avg transition's state.

Purpose of this Email:

For technical review of proposed implementation.

Implementation:

1. Add a new boolean column pg_aggregate named hassuppagg which will be set to true if the aggregate supports other aggregates. For example avg(int) will support count(int) and sum(int)
2. Add new system table named pg_aggregate_support (Or some better shorter name)

This system table will be defined as follows:
aspfnoid regproc,
aspfnsupported regproc,
aspfinalfn regproc,
primary key (aspfnoid, aspfnsupported)

Where in the above example aspfnoid will be avg(int) and 2 rows will exist. 1 with count(int) in aspfnsupported, and one with sum(int) in the aspfnsupported column. aspfinalfn will be a new final function which extracts the required portion of the avg's aggregate state.

3. Add logic in the planner to look for look for supporting cases. With logic something along the lines of:

  a. Does the query have any aggregates? If not -> return;
  b. Does the query have more than 1 aggregate? If not -> return;
  c. Does the at least one of the aggregates have hassuppagg set to true? if not -> return;
  d. Analyze aggregates to eliminate aggregates that are covered by another aggregate. We should use the aggregate which eliminates the most other aggregates*

* For example stddev(x) will support avg(x), sum(x) and count(x) so a query such as select stddev(x), avg(x), sum(x), count(x) will eliminate avg(x), sum(x), count(x) as stddev(x) supports 3, avg(x) only supports 2, so will have to be eliminated.

Concerns:

I'm a little bit concerned that someone will one day report that:

SELECT avg(x), sum(x), count(x) from bigtable;

Is faster than:

SELECT sum(x), count(x) from bigtable;

Of course, this will be just because we've made case 1 faster, NOT because we've slowed down case 2.
I can't immediately think of a way to fix that without risking slowing down: select count(x) from bigtable;

CREATE AGGREGATE Syntax:

To allow users to implement aggregates which take advantage of this we'd better also expand the CREATE AGGREGATE syntax.

I've not given this a huge amount of thought. The only thing I've come up with so far is;

CREATE AGGREGATE avg(bigint)
(FINALFUNC = avgfinal)
SUPPORTS (count(bigint) = int8_avg_countfn, sum(bigint) = int8_avg_sumfn);


Can anyone think of anything that I've not accounted for before I go off and work on this?

Regards

David Rowley

"The research leading to these results has received funding from the European Union’s
Seventh Framework Programme (FP7/2007-2015) under grant agreement n° 318633"

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

pgsql-hackers by date:

Previous
From: Fujii Masao
Date:
Subject: Re: Memory leak with XLogFileCopy since de768844 (WAL file with .partial)
Next
From: Michael Paquier
Date:
Subject: Information of pg_stat_ssl visible to all users