Re: Combining Aggregates - Mailing list pgsql-hackers
From | David Rowley |
---|---|
Subject | Re: Combining Aggregates |
Date | |
Msg-id | CAKJS1f-_EzM2d=rhNiqZObMg1mfbHEwSyKXWdAo7f4nav0AQAQ@mail.gmail.com Whole thread Raw |
In response to | Re: Combining Aggregates (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: Combining Aggregates
Re: Combining Aggregates |
List | pgsql-hackers |
On 25 March 2016 at 07:26, Robert Haas <robertmhaas@gmail.com> wrote: > OK, I noticed another documentation problem: you need to update > catalogs.sgml for these new columns. Thanks. Done. > > + * Validate the serial function, if present. We must ensure > that the return > + * Validate the de-serial function, if present. We must ensure that the > I think that you should refer to these consistently in the comments as > the "serialization function" and the "deserialization function", even > though the SQL syntax is different. And unhyphenated. That's probably better. Fixed. > + /* check that we also got a serial type */ > + if (!OidIsValid(aggSerialType)) > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_FUNCTION_DEFINITION), > + errmsg("must specify serialization > type when specifying serialization function"))); > > I think that in parallel cases, this check is in DefineAggregate(), > not here. See, e.g. "aggregate mfinalfunc must not be specified > without mstype". Good point. I've fixed this. > Existing type parameters to CREATE AGGREGATE have IsPolymorphicType() > checks to enforce sanity in various ways, but you seem not to have > added that for the serial type. Added check for this to disallow pseudo types that are not polymorphic. > + /* don't call a strict serial function with > NULL input */ > + if (pertrans->serialfn.fn_strict && > + pergroupstate->transValueIsNull) > + continue; > > Shouldn't this instead set aggnulls[aggno] = true? And doesn't the > hunk in combine_aggregates() have the same problem? hmm, no, I can't see a problem there. aggvalues[aggno] is what's going to be set after that condition. > + /* > + * serial and de-serial functions must match, if > present. Remember that > + * these will be InvalidOid if they're not required > for this agg node > + */ > > Explain WHY they need to match. And maybe update the overall comment > for the function. I've added: /* * The serialization and deserialization functions must match, if * present, as we're unable to share the trans state for aggregates * which will serialize or deserialize into different formats. Remember * that these will be InvalidOid if they're not required for this agg * node. */ This is for the code which shares transition states between aggregate functions which have the same transition function, for example SUM(NUMERIC) and AVG(NUMERIC). The serialization functions must match too. Perhaps, providing the serial/deserial function does its job properly, then it does not matter, but it does seem better to be safe here. I can't really imagine why anyone would want to have the same transition function but have different serialization functions. > + "'-' AS > aggdeserialfn,aggmtransfn, aggminvtransfn, " > > Whitespace. Fixed. > In your pg_aggregate.h changes, avg(numeric) sets aggserialtype but no > aggserialfn or aggdeserialfn. Fixed. I've also added some checks to ensure a combine function being set on an aggregate with an INTERNAL state is not-strict. I realised when writing the documentation for combine functions that it needed to be this way, as, if the function was strict, then advance_combine_function() could incorrectly assign the memory address of the 2nd state to the transValue. Actually, this seems a little tricky to optimise well, as in the case when we're serialising states, the deserialisation function should allocate a new state in the correct memory context, so technically the combine function could be strict in this case, as it would be fine to use state2 if state1 was null, but parallel aggregation is probably unique and could be the only time we'll need to use the serialisation/deserialisation functions, so without serialStates = true, the memory address could still belong to some other aggregate node's memory context which is certainly not right, so the code as of this patch is playing it safe. You could say that this is technically a bug in master, but it should be impossible to hit it now due to lack of ability to combine internal aggregate states. The checks for the correct strict property in the combine function are 3-fold. 1. CREATE AGGREGATE check. 2. executor startup for Agg nodes (must be here too, since someone could modify the strict property after CREATE AGGREGATE). 3. Regression test for built-in aggregates. ... long pause... Ok, so on further look at this I've decided to make changes and have it so the serialisation function can be dumb about memory contexts in the same way as finalize_aggregate() allows the final function to be dumb... notice at the end of the function it copies byref types into the correct memory context, if they're not already. So in the attached the serialisation function call code now does the same thing. In fact I decided to move all that code off into a function called finalize_partialaggregate(). As for the deserialisation function... there's not much we can do there in regards to copying the value back to the correct memory context, as we always expect the return value of that function to be INTERNAL, which is byval (pointer). So I've just set the code to aggstate->tmpcontext->ecxt_per_tuple_memory, and leave it up to the combine function to do the right thing, and make sure the new state is built properly in the aggregate memory context, which is analogous to how INTERNAL state transition functions work. As for the optimisation I mentioned above; what I have now is not perfect, as when we Gather the first state per group from the first worker to return it, we deserialise that serialised state into a per tuple context, next the combine function pulls that state apart and rebuilds it again into another newly built state in the correct memory context. I can't really see a way around this. I'd really like it if someone could look over this and make sure its all sane, as I'm not sure if I've fully wrapped my head around the expected life time of each context used here... Status of other patched: 0002: * Changed the new incorrectly defined combine functions so they're no longer strict. * Removed all memory context switches from serialisation/deserialisation functions. 0003: I've added a few new regression tests to test for strictness of combine functions, and also threw in some tests to ensure the serialisation/deserialisation functions are all strict, since the current ones can't handle NULLs, and it would be wasteful to make them trouble themselves with that ability. 0004: No change from last time. 0005: Haribabu's patch; no change from last time. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
- 0001-Allow-INTERNAL-state-aggregates-to-participate-in-pa_2016-03-26.patch
- 0002-Add-various-aggregate-combine-and-serialize-de-seria_2016-03-26.patch
- 0003-Add-sanity-regression-tests-for-new-aggregate-serial_2016-03-26.patch
- 0004-Add-documents-to-explain-which-aggregates-support-pa_2016-03-26.patch
- 0005-Add-combine-functions-for-various-floating-point-agg_2016-03-26.patch
pgsql-hackers by date: