The Future of Aggregation - Mailing list pgsql-hackers

From David Rowley
Subject The Future of Aggregation
Date
Msg-id CAKJS1f-TmWi-4c5K6CBLRdTfGsVxOJhadefzjE7SWuVBgMSkXA@mail.gmail.com
Whole thread Raw
Responses Re: The Future of Aggregation  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Re: The Future of Aggregation  (Kevin Grittner <kgrittn@ymail.com>)
List pgsql-hackers
It appears to me that there's quite a few new features and optimisations on the not too distant horizon which will require adding yet more fields into pg_aggregate.

These are things along the lines of:

1. Parallel Aggregation (computes each aggregate state in parallel worker processes and then merges these states in serial mode)
2. Aggregate push-down / Aggregate before join (requires passing partially computed aggregate states between executor nodes)
3. Auto-updating Materialized views (ones which contain aggregate functions)
4. Foreign table aggregation

Items 1-4 above I believe require support of "Aggregate State Combine Support" -> https://commitfest.postgresql.org/5/131/ which I believe will need to be modified to implement complex database types to backup our internal aggregate state types so that these types be properly passed between executor nodes, between worker processes and perhaps foreign data wrappers (maybe just postgres_fdw I've not looked into this yet)

Item 5 makes items 1-4 a bit more complex as with this item there's opportunity for very good performance improvements by allowing aggregates like AVG(x) also perform all the required work to allow SUM(x) and COUNT(x) to be calculated for "free" in a query containing all 3 aggregates.

I've discussed item 5 off-list with Simon and he mentioned that we might invent a transition state and transition functions which can have parts switched on and off much like how calcSumX2 controls if do_numeric_accum() should calculate sum(x*x) or not.  The problem with this is that if we ever want to store aggregate states in an auto-updating materialized view, then this generic aggregate state will have to contain at least 3 fields (to store count(x), sum(x) and sum(x*x)), and those 3 fields would have to be stored even if the aggregate was just a simple count(*).

The idea I discussed in the link in item 5 above gets around this problem, but it's a perhaps more surprise filled implementation as it will mean "select avg(x),sum(x),count(x) from t" is actually faster than "select sum(x),count(x) from t" as the agg state for avg() will satisfy sum and count too.

The purpose of this email is to open the doors for discussion about this so that nobody goes off and develops feature X into a corner and disallows feature Y and so that we end up with the most optimal solution that does not trip us up in the future.

I'm interested to hear if Kevin or Amit have had any time to give this any thought before. It would be good to ensure we all have the same vision here.

Regards

David Rowley

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

pgsql-hackers by date:

Previous
From: Jeevan Chalke
Date:
Subject: Dead code in Create/RenameRole() after RoleSpec changes related to CURRENT/SESSION_USER
Next
From: Jeevan Chalke
Date:
Subject: Re: bugfix: incomplete implementation of errhidecontext