Re: Combining Aggregates - Mailing list pgsql-hackers

From David Rowley
Subject Re: Combining Aggregates
Date
Msg-id CAKJS1f-DrQztzZHxHc74fgKO3cAKoEf4QocGv+YUWr2xr6=b7w@mail.gmail.com
In response to Re: Combining Aggregates  (Haribabu Kommi <kommi.haribabu@gmail.com>)
Responses Re: Combining Aggregates  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On 18 January 2016 at 14:36, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
On Sat, Jan 16, 2016 at 12:00 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 16 January 2016 at 03:03, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Tue, Dec 29, 2015 at 7:39 PM, David Rowley
>> <david.rowley@2ndquadrant.com> wrote:
>> >> No, the idea I had in mind was to allow it to continue to exist in the
>> >> expanded format until you really need it in the varlena format, and
>> >> then serialize it at that point.  You'd actually need to do the
>> >> opposite: if you get an input that is not in expanded format, expand
>> >> it.
>> >
>> > Admittedly I'm struggling to see how this can be done. I've spent a good
>> > bit of time analysing how the expanded object stuff works.
>> >
>> > Hypothetically let's say we can make it work like:
>> >
>> > 1. During partial aggregation (finalizeAggs = false), in
>> > finalize_aggregates(), where we'd normally call the final function,
>> > instead flatten INTERNAL states and store the flattened Datum instead
>> > of the pointer to the INTERNAL state.
>> > 2. During combining aggregation (combineStates = true) have all the
>> > combine functions written in such a way that the INTERNAL states
>> > expand the flattened states before combining the aggregate states.
>> >
>> > Does that sound like what you had in mind?
>>
>> More or less.  But what I was really imagining is that we'd get rid of
>> the internal states and replace them with new datatypes built to
>> purpose.  So, for example, for string_agg(text, text) you could make a
>> new datatype that is basically a StringInfo.  In expanded form, it
>> really is a StringInfo.  When you flatten it, you just get the string.
>> When somebody expands it again, they again have a StringInfo.  So the
>> RW pointer to the expanded form supports append cheaply.
>
>
> That sounds fine in theory, but where and how do you suppose we determine
> which expand function to call? Nothing exists in the catalogs to decide this
> currently.

I am thinking of the transition function accepting and returning a
StringInfoData instead of the PolyNumAggState internal data, for
int8_avg_accum for example.

Hmm, so wouldn't that mean that the transition function would need to do the following for each input tuple:

1. Parse that StringInfo into tokens.
2. Create a new aggregate state object.
3. Populate the new aggregate state based on the tokenised StringInfo. This would perhaps require calling various *_in() functions on each token.
4. Add the new tuple to the aggregate state.
5. Build a new StringInfo based on the aggregate state modified in 4.

?

Currently the transition function only does step 4, and performs step 2 only if it's the first tuple.

Is that what you mean? If so, I'd say that would slow things down significantly!

To get a gauge on how much more CPU work that would be for some aggregates, have a look at how simple int8_avg_accum() currently is when HAVE_INT128 is defined. For the case of AVG(BIGINT) we really just have:

state->sumX += newval;
state->N++;

The above code is step 4 only. So unless I've misunderstood you, that would need to turn into steps 1-5 above. Step 4 is probably just a handful of instructions right now, but adding code for steps 1, 2, 3 and 5 would turn that into hundreds of instructions per input tuple.
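
A rough standalone sketch of that comparison follows. It is not PostgreSQL code: AvgState, accum_direct() and accum_via_string() are hypothetical names, long long stands in for the 128-bit sum, and sscanf()/snprintf() stand in for whatever per-type input and output functions would really be called.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the internal state; not the real PolyNumAggState. */
typedef struct AvgState
{
    long long sumX;     /* running sum of the inputs */
    long long N;        /* number of inputs seen so far */
} AvgState;

/* Step 4 only: all the work the transition function does per tuple today. */
static void
accum_direct(AvgState *state, long long newval)
{
    state->sumX += newval;
    state->N++;
}

/*
 * Steps 1-5 on every tuple: parse the text form, rebuild the state,
 * add the new value, then flatten the state back to text.
 * An empty buffer means "no state yet" (the first tuple).
 */
static void
accum_via_string(char *buf, size_t buflen, long long newval)
{
    AvgState state = {0, 0};    /* step 2: fresh state object */

    assert(buf != NULL && buflen > 0);

    /* steps 1 and 3: tokenise the text and populate the state */
    if (buf[0] != '\0')
        sscanf(buf, "%lld %lld", &state.N, &state.sumX);

    /* step 4: the only part that is real aggregation work */
    accum_direct(&state, newval);

    /* step 5: build the text form again */
    snprintf(buf, buflen, "%lld %lld", state.N, state.sumX);
}
```

Both functions produce the same sum and count; the second simply pays for a parse and a format on every input row.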
 
I've been trying to avoid any overhead by adding the serializeStates flag to make_agg() so that we can maintain the same performance when we're just passing internal states around in the same process. This keeps the conversions between internal state and serialised state to a minimum.
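
As a loose illustration of the idea (a standalone sketch, not the patch itself), the state can stay in its internal form for every transition and combine step, and be flattened only when it has to leave the process. AvgPartial, serialize_partial(), deserialize_partial() and combine_partials() are hypothetical names, and the raw memcpy() format is an assumption that only holds between processes sharing the same ABI.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical internal partial-aggregate state for an average. */
typedef struct AvgPartial
{
    long long sumX;     /* partial sum */
    long long N;        /* partial count */
} AvgPartial;

/* Flatten the state to bytes; needed only when crossing a process boundary. */
static size_t
serialize_partial(const AvgPartial *state, unsigned char *out, size_t outlen)
{
    assert(outlen >= sizeof(AvgPartial));
    memcpy(out, state, sizeof(AvgPartial));
    return sizeof(AvgPartial);
}

/* Rebuild the internal state from its flattened form. */
static AvgPartial
deserialize_partial(const unsigned char *in, size_t inlen)
{
    AvgPartial state;

    assert(inlen >= sizeof(AvgPartial));
    memcpy(&state, in, sizeof(AvgPartial));
    return state;
}

/* Combine two partial states; operates purely on the internal form. */
static void
combine_partials(AvgPartial *dst, const AvgPartial *src)
{
    dst->sumX += src->sumX;
    dst->N += src->N;
}
```

With this split, the conversion cost is paid once per partial state handed between processes rather than once per input tuple, and not at all when the states never leave the process.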

The StringInfoData is formed from the members of the PolyNumAggState
structure. The input StringInfoData is transformed into PolyNumAggState
data, the calculation is performed, and the StringInfoData is formed
again and returned. Similar changes need to be made for the final
functions' input type as well. I am not sure whether this approach may
have some impact on performance?

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
