Re: Combining Aggregates - Mailing list pgsql-hackers

From David Rowley
Subject Re: Combining Aggregates
Date
Msg-id CAKJS1f_yVNaHawu+ty-_1p5i7nLX0iK3ntjWcjGqvXNm6YDjaQ@mail.gmail.com
Whole thread Raw
In response to Re: Combining Aggregates  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On 20 January 2016 at 05:56, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Dec 21, 2015 at 4:02 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> Now, there has been talk of this previously, on various threads, but I don't
> believe any final decisions were made on how exactly it should be done. At
> the moment I plan to make changes as follows:
>
>  Add 3 new columns to pg_aggregate, aggserialfn, aggdeserialfn and
> aggserialtype These will only be required when aggtranstype is INTERNAL.

Check.

> Perhaps we should disallow CREATE AGGREAGET from accepting them for any
> other type...

Well, we should definitely not accept them and have them be
meaningless.  We should either reject them or accept them and then use
them.  I can't immediately think of a reason to serialize one
non-internal type as another, but maybe there is one.


In the latest patch I'm disallowing serial and deserial functions for non INTERNAL state aggregates during CREATE AGGREGATE.
I think it's best to keep this disabled for now, and do so until we discover some reason that we might want want to enable it. If we enabled it, and later decided that was a dumb idea, then it'll be much harder to add the restriction later, since it may cause errors for users who have created their own aggregates.
 
> Add a new bool field to nodeAgg's state named serialStates. If this is field
> is set to true then when we're in finalizeAgg = false mode, we'll call the
> serialfn on the agg state instead of the finalfn. This will allow the
> serialized state to be stored in the tuple and sent off to the main backend.
> The combine agg node should also be set to serialStates = true, so that it
> knows to deserialize instead of just assuming that the agg state is of type
> aggtranstype.

I'm not quite sure, but it sounds like you might be overloading
serialStates with two different meanings here.

Hmm, only in the sense that serialStates means "serialise" and "deserialise". The only time that both could occur in the same node is if combineStates=true and finalizeAggs=false (in other words 3 or more aggregation stages).

Let's say we are performing aggregation in 3 stages, stage 1 is operating on normal values and uses the transfn on these values, but does not finalise the states. If serialStates = true here then, if one exists, we call the serialfn, passing in the aggregated state, and pass the return value of that function up to the next node. Now, this next (middle) node is important, serialStates must also be true here so that the executor knows to deserialise the previously serialised states. Now, this node uses the combinefn to merge states which require, and then since serialStates is true, it also (re)serialises those new states again. Now if there was some reason that this middle node should deserialise, but not re-serialise those states, then we may need an extra parameter to instruct it to do so. I guess this may be required if we were to perform some partial aggregation and combining again within a single process (in which case we'd not need to serialise INTERNAL states, we can just pass the pointer to them in the Tuple),  but then we might require the states to be serialised in order to hand them over to the main process, from a worker process.

I can imagine cases where we might want to do this in the future, so perhaps it is worth it:

Imagine:

SELECT COUNT(*) FROM (SELECT * FROM (SELECT * FROM a UNION ALL SELECT * FROM b) AS ab UNION ALL (SELECT * FROM c UNION ALL SELECT * FROM d)) abcd; 

Where the planner may decide to, on 1 worker perform count(*) on a then b, and combine that into ab, while doing the same for c and d on some other worker process.


> I believe this should allow us to not cause any performance regressions by
> moving away from INTERNAL agg states. It should also be very efficient for
> internal states such as Int8TransTypeData, as this struct merely has 2 int64
> fields which should be very simple to stuff into a bytea varlena type. We
> don't need to mess around converting the ->count and ->sum into a strings or
> anything.

I think it would be more user-friendly to emit these as 2-element
integer arrays rather than bytea.  Sure, bytea is fine when PostgreSQL
is talking to itself, but if you are communicating with an external
system, it's a different situation.  If the remote system is
PostgreSQL then you are again OK, except for the data going over the
wire being incomprehensible and maybe byte-order-dependent, but what
if you want some other database system to do partial aggregation and
then further aggregate the result?  You don't want the intermediate
state to be some kooky thing that only another PostgreSQL database has
a chance of generating correctly.

If we do that then we also need to invent composite database types for the more complicated internal states such as NumericAggState.
We should probably also consider performance here too. I had been considering just using the *send() functions to build a bytea.

> Then in order for the planner to allow parallel aggregation all aggregates
> must:
>
> Not have a DISTINCT or ORDER BY clause
> Have a combinefn
> If aggtranstype = INTERNAL, must have a aggserialfn and aggdeserialfn.
>
> We can relax the requirement on 3 if we're using 2-stage aggregation, but
> not parallel aggregation.

When would we do that?

This is probably best explained in one of the code comments:

+ *       states as input rather than input tuples. This mode facilitates multiple
+ *       aggregate stages which allows us to support pushing aggregation down
+ *       deeper into the plan rather than leaving it for the final stage. For
+ *       example with a query such as:
+ *
+ *       SELECT count(*) FROM (SELECT * FROM a UNION ALL SELECT * FROM b);
+ *
+ *       with this functionality the planner has the flexibility to generate a
+ *       plan which performs count(*) on table a and table b separately and then
+ *       add a combine phase to combine both results. In this case the combine
+ *       function would simply add both counts together.

There's also the possibility to use this functionality later to allow GROUP BY to occur before the final plan stage, e.g before joining to some relation.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: checkpointer continuous flushing
Next
From: Tomas Vondra
Date:
Subject: Re: PATCH: postpone building buckets to the end of Hash (in HashJoin)