Thread: Parallel Aggregates for string_agg and array_agg
Hi,

While working on partial aggregation a few years ago, I didn't really think it was worthwhile allowing partial aggregation of string_agg and array_agg. I soon realised I was wrong about that: allowing parallelisation of these aggregates can still be very useful when many rows are filtered out during the scan.

Some benchmarks that I've done locally show that parallel string_agg and array_agg do actually perform better, despite the fact that the aggregate state grows linearly with each aggregated item. Obviously, the performance will get even better when workers are filtering out rows before aggregation takes place, so it seems worthwhile doing this. However, the main reason I'm motivated to do this is that there are more uses for partial aggregation than just parallel aggregation, and it seems a shame to disable all of those features if a single aggregate does not support partial mode.

I've attached a patch which implements all this. I've had most of it stashed away for a while now, but I managed to get some time this weekend to get it into a more completed state.

Things are now looking pretty good for the number of aggregates that support partial mode. Just a handful of aggregates now don't support partial aggregation:

postgres=# select aggfnoid from pg_aggregate where aggcombinefn=0 and aggkind='n';
     aggfnoid
------------------
 xmlagg
 json_agg
 json_object_agg
 jsonb_agg
 jsonb_object_agg
(5 rows)

... and a good number do support it:

postgres=# select count(*) from pg_aggregate where aggcombinefn<>0 and aggkind='n';
 count
-------
   122
(1 row)

There's probably no reason why the last 5 of those couldn't be done either; it might just require shifting a bit more work into the final functions. I'm not planning on that for this patch, though.

As for the patch: there's a bit of a quirk in the implementation of string_agg. We previously always threw away the delimiter that belongs to the first aggregated value, but we now need to keep that around so we can put it in between two states in the combine function. I decided the path of least resistance was to borrow StringInfo's cursor variable to use as a pointer to the start of the first value, and put the first delimiter before that. Both string_agg(text) and string_agg(bytea) already have a final function, so we just need to skip over the bytes up to the cursor position to get rid of the first delimiter. I could go and invent some new state type to do the same, but I don't really see the trouble with what I've done with StringInfo; I'll certainly listen if someone else thinks this is wrong.

Another thing that I might review later is getting rid of some of the code duplication between array_agg_array_combine and accumArrayResultArr.

I'm going to add this to PG11's final commitfest rather than the January 'fest as it seems more like a final commitfest type of patch.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
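To make the cursor trick described above concrete, here is a minimal sketch (the helper name and exact layout are invented for illustration, not taken from the patch) of how a transition step can record the offset of the first value in StringInfo's cursor so a final step can strip the leading delimiter again:

	#include "postgres.h"
	#include "lib/stringinfo.h"

	/*
	 * Sketch only: append the delimiter before every value, and remember in
	 * the otherwise-unused cursor field where the first real value starts.
	 * append_item() is a hypothetical helper, not part of the patch.
	 */
	static void
	append_item(StringInfo state, const char *delim, int delimlen,
				const char *value, int valuelen)
	{
		bool		isfirst = (state->len == 0);

		appendBinaryStringInfo(state, delim, delimlen);
		if (isfirst)
			state->cursor = delimlen;	/* offset of the first value */
		appendBinaryStringInfo(state, value, valuelen);
	}

	/*
	 * A final step would then return only the bytes after the cursor, i.e.
	 * state->data + state->cursor for state->len - state->cursor bytes,
	 * which strips the leading delimiter off again.
	 */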
Hi David,

On 12/17/17 9:30 AM, David Rowley wrote:
>
> I'm going to add this to PG11's final commitfest rather than the
> January 'fest as it seems more like a final commitfest type of patch.

This patch applies but no longer builds:

$ make -C /home/vagrant/test/build
<...>
cd /postgres/src/include/catalog && '/usr/bin/perl' ./duplicate_oids
3419
3420
3421
3422

Looks like duplicate OIDs in pg_proc.h.

Thanks,
--
-David
david@pgmasters.net
Hi,

On 2017-12-18 03:30:55 +1300, David Rowley wrote:
> While working on partial aggregation a few years ago, I didn't really
> think it was worthwhile allowing partial aggregation of string_agg and
> array_agg. I soon realised that I was wrong about that and allowing
> parallelisation of these aggregates still could be very useful when
> many rows are filtered out during the scan.

+1

> Some benchmarks that I've done locally show that parallel string_agg
> and array_agg do actually perform better, despite the fact that the
> aggregate state grows linearly with each aggregated item.

It also allows *other* aggregates with different space consumption to be
computed in parallel, that can be fairly huge.

> Just a handful of aggregates now don't support partial aggregation;
>
> postgres=# select aggfnoid from pg_aggregate where aggcombinefn=0 and
> aggkind='n';
>      aggfnoid
> ------------------
>  xmlagg
>  json_agg
>  json_object_agg
>  jsonb_agg
>  jsonb_object_agg
> (5 rows)

FWIW, I've heard numerous people yearn for json*agg

> I'm going to add this to PG11's final commitfest rather than the
> January 'fest as it seems more like a final commitfest type of patch.

Not sure I understand?

> +/*
> + * array_agg_serialize
> + *		Serialize ArrayBuildState into bytea.
> + */
> +Datum
> +array_agg_serialize(PG_FUNCTION_ARGS)
> +{
> +	/*
> +	 * dvalues -- this is very simple when the value type is byval, we can
> +	 * simply just send the Datums over, however, for non-byval types we must
> +	 * work a little harder.
> +	 */
> +	if (state->typbyval)
> +		pq_sendbytes(&buf, (char *) state->dvalues, sizeof(Datum) * state->nelems);
> +	else
> +	{
> +		Oid			typsend;
> +		bool		typisvarlena;
> +		bytea	   *outputbytes;
> +		int			i;
> +
> +		/* XXX is there anywhere to cache this to save calling each time? */
> +		getTypeBinaryOutputInfo(state->element_type, &typsend, &typisvarlena);

Yes, you should be able to store this in fcinfo->flinfo->fn_extra. Or do
you see a problem doing so?

> +
> +Datum
> +array_agg_deserialize(PG_FUNCTION_ARGS)
> +{
> +	/*
> +	 * dvalues -- this is very simple when the value type is byval, we can
> +	 * simply just get all the Datums at once, however, for non-byval types we
> +	 * must work a little harder.
> +	 */
> +	if (result->typbyval)
> +	{
> +		temp = pq_getmsgbytes(&buf, sizeof(Datum) * nelems);
> +		memcpy(result->dvalues, temp, sizeof(Datum) * nelems);
> +	}
> +	else
> +	{
> +		Oid			typreceive;
> +		Oid			typioparam;
> +		int			i;
> +
> +		getTypeBinaryInputInfo(element_type, &typreceive, &typioparam);
> +
> +		for (i = 0; i < nelems; i++)
> +		{
> +			/* XXX? Surely this cannot be the way to do this? */
> +			StringInfoData tmp;
> +			tmp.cursor = 0;
> +			tmp.maxlen = tmp.len = pq_getmsgint(&buf, 4);
> +			tmp.data = (char *) pq_getmsgbytes(&buf, tmp.len);
> +
> +			result->dvalues[i] = OidReceiveFunctionCall(typreceive, &tmp,
> +														typioparam, -1);

Hm, doesn't look too bad to me? What's your concern here?


This isn't a real review, I was basically just doing a quick
scroll-through.

Greetings,

Andres Freund
Thanks for looking at the patch.

On 2 March 2018 at 08:33, David Steele <david@pgmasters.net> wrote:
> This patch applies but no longer builds:
...
> Looks like duplicate OIDs in pg_proc.h.

Who stole my OIDs?!

Updated patch attached.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 2 March 2018 at 10:26, Andres Freund <andres@anarazel.de> wrote:
> On 2017-12-18 03:30:55 +1300, David Rowley wrote:
>> Just a handful of aggregates now don't support partial aggregation;
>>
>> postgres=# select aggfnoid from pg_aggregate where aggcombinefn=0 and
>> aggkind='n';
>>      aggfnoid
>> ------------------
>>  xmlagg
>>  json_agg
>>  json_object_agg
>>  jsonb_agg
>>  jsonb_object_agg
>> (5 rows)
>
> FWIW, I've heard numerous people yearn for json*agg

I guess those will need to be PG12 for now. I'd imagine string_agg and
array_agg are the more important ones to tick off for now, but I bet many
people would disagree with that.

>> I'm going to add this to PG11's final commitfest rather than the
>> January 'fest as it seems more like a final commitfest type of patch.
>
> Not sure I understand?

hmm, yeah. That was more of a consideration that I should be working
hard on, and reviewing, more complex and invasive patches before the
final 'fest. I saw this more as something that would gain more interest
once patches such as partition-wise aggregation get in.

>> +	/* XXX is there anywhere to cache this to save calling each time? */
>> +	getTypeBinaryOutputInfo(state->element_type, &typsend, &typisvarlena);
>
> Yes, you should be able to store this in fcinfo->flinfo->fn_extra. Or do
> you see a problem doing so?

No, just didn't think of it.

>> +			/* XXX? Surely this cannot be the way to do this? */
>> +			StringInfoData tmp;
>> +			tmp.cursor = 0;
>> +			tmp.maxlen = tmp.len = pq_getmsgint(&buf, 4);
>> +			tmp.data = (char *) pq_getmsgbytes(&buf, tmp.len);
>> +
>> +			result->dvalues[i] = OidReceiveFunctionCall(typreceive, &tmp,
>> +														typioparam, -1);
>
> Hm, doesn't look too bad to me? What's your concern here?

I was just surprised at having to fake up a StringInfoData.

> This isn't a real review, I was basically just doing a quick
> scroll-through.

Understood. I've attached a delta patch which can be applied on top of
the v2 patch; it removes that XXX comment above and also implements
the fn_extra caching.

Thanks for looking!

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
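For readers unfamiliar with the idiom Andres suggests, here is a minimal sketch (the cache struct and helper name are invented for illustration, not the patch's code) of caching a type's binary output function in fcinfo->flinfo->fn_extra so the catalog lookup happens only once per executor node rather than once per call:

	#include "postgres.h"
	#include "fmgr.h"
	#include "utils/lsyscache.h"

	/* hypothetical per-node cache of the element type's send function */
	typedef struct SerialCache
	{
		FmgrInfo	typsend;
	} SerialCache;

	static SerialCache *
	get_serial_cache(FunctionCallInfo fcinfo, Oid element_type)
	{
		SerialCache *cache = (SerialCache *) fcinfo->flinfo->fn_extra;

		if (cache == NULL)
		{
			Oid			typsend;
			bool		typisvarlena;

			/* allocate in fn_mcxt so the cache lives as long as the FmgrInfo */
			cache = (SerialCache *) MemoryContextAlloc(fcinfo->flinfo->fn_mcxt,
													   sizeof(SerialCache));
			getTypeBinaryOutputInfo(element_type, &typsend, &typisvarlena);
			fmgr_info_cxt(typsend, &cache->typsend, fcinfo->flinfo->fn_mcxt);
			fcinfo->flinfo->fn_extra = (void *) cache;
		}
		return cache;
	}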
On 2018-03-02 13:48:00 +1300, David Rowley wrote:
> On 2 March 2018 at 10:26, Andres Freund <andres@anarazel.de> wrote:
>> FWIW, I've heard numerous people yearn for json*agg
>
> I guess those will need to be PG12 for now. I'd imagine string_agg and
> array_agg are the more important ones to tick off for now, but I bet many
> people would disagree with that.

Oh, this was definitely not meant as an ask for something done together
with this submission.

Thanks for the quick update.

Greetings,

Andres Freund
I've attached a rebased patch which fixes the conflicts caused by fd1a421.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Hi,

I've looked at the rebased patch you sent yesterday, and I do have a
couple of comments.

1) There seems to be a forgotten declaration of initArrayResultInternal
in arrayfuncs.c. I suppose you've renamed it to initArrayResultWithSize
and moved it to a header file, and forgotten to remove this bit.

2) bytea_string_agg_deserialize has this comment:

	/*
	 * data: technically we could reuse the buf.data buffer, but that is
	 * slightly larger than we need due to the extra 4 bytes for the cursor
	 */

I find the argument "it has 4 extra bytes, so we'll allocate a new chunk"
somewhat weak. We do allocate the memory in 2^n chunks anyway, so the
new chunk is likely to be much larger anyway. I'm not saying we need to
reuse the buffer; IMHO the savings will be non-measurable.

3) string_agg_transfn and bytea_string_agg_transfn say this:

	/*
	 * We always append the delimiter text before the value text, even
	 * with the first aggregated item.  The reason for this is that if
	 * this state needs to be combined with another state using the
	 * aggregate's combine function, then we need to have the delimiter
	 * for the first aggregated item.  The first delimiter will be
	 * stripped off in the final function anyway.  We use a little cheat
	 * here and store the position of the actual first value (after the
	 * delimiter) using the StringInfo's cursor variable.  This relies on
	 * the cursor not being used for any other purpose.
	 */

How does this make the code any simpler, compared to not adding the
delimiter (as before) and adding it when combining the values (at which
point you should know when it's actually needed)?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 5 March 2018 at 04:54, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> 1) There seems to be a forgotten declaration of initArrayResultInternal
> in arrayfuncs.c. I suppose you've renamed it to initArrayResultWithSize
> and moved it to a header file, and forgotten to remove this bit.

Oops. Must be left over from trying things another way. Removed.

> 2) bytea_string_agg_deserialize has this comment:
>
>	/*
>	 * data: technically we could reuse the buf.data buffer, but that is
>	 * slightly larger than we need due to the extra 4 bytes for the cursor
>	 */
>
> I find the argument "it has 4 extra bytes, so we'll allocate a new chunk"
> somewhat weak. We do allocate the memory in 2^n chunks anyway, so the
> new chunk is likely to be much larger anyway. I'm not saying we need to
> reuse the buffer; IMHO the savings will be non-measurable.

Agreed. I've removed that part of the comment.

> 3) string_agg_transfn and bytea_string_agg_transfn say this:
>
>	/*
>	 * We always append the delimiter text before the value text, even
>	 * with the first aggregated item.  The reason for this is that if
>	 * this state needs to be combined with another state using the
>	 * aggregate's combine function, then we need to have the delimiter
>	 * for the first aggregated item.  The first delimiter will be
>	 * stripped off in the final function anyway.  We use a little cheat
>	 * here and store the position of the actual first value (after the
>	 * delimiter) using the StringInfo's cursor variable.  This relies on
>	 * the cursor not being used for any other purpose.
>	 */
>
> How does this make the code any simpler, compared to not adding the
> delimiter (as before) and adding it when combining the values (at which
> point you should know when it's actually needed)?

The problem is that if we don't record the first delimiter then we
won't know what it is when it comes to combining the states.

Consider the following slightly backward-looking case;

select string_agg(',', x::text) from generate_Series(1,10)x;
      string_agg
----------------------
 ,2,3,4,5,6,7,8,9,10,

Here the delimiter is the number, not the ','. Of course, the
delimiter for the first aggregated result is not shown. The previous
implementation of the transfn for this aggregate just threw away the
first delimiter, but that's no good for parallel aggregates, as the
transfn may be getting called in a parallel worker, in which case
we'll need that delimiter in the combine function to properly join to
the other partial aggregated results matching the same group.

Probably I could improve the comment a bit. I understand that having a
variable delimiter is not commonplace in the normal world. I certainly
had never considered it before working on this. I scratched my head a
bit when doing this and thought about inventing a new trans type, but
I decided that the most efficient design for an aggregate state was to
store the delimiter and data all in one buffer and have a pointer to
the start of the actual data. StringInfo has exactly what's required
if you use the cursor as the pointer, so I didn't go and reinvent the
wheel.

I've now changed the comment to read:

	/*
	 * It's important that we store the delimiter text for all aggregated
	 * items, even the first one, which at first thought you might think
	 * could just be discarded.  The reason for this is that if this
	 * function is being called from a parallel worker, then we'll need
	 * the first delimiter in order to properly combine the partially
	 * aggregated state with the states coming from other workers.  In the
	 * final output, the first delimiter will be stripped off of the final
	 * aggregate state, but in order to know where the actual first data
	 * is we must store the position of the first data value somewhere.
	 * Conveniently, StringInfo has a cursor property which can be used
	 * to serve our needs here.
	 */

I've also updated an incorrect Assert in array_agg_array_serialize.

Please find attached the updated patch. Many thanks for reviewing this.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
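To illustrate why keeping the first delimiter makes combining straightforward, here is an illustrative sketch (not the patch's actual combine function, and it ignores the aggregate memory-context handling a real combine function must do) of merging two string_agg states that each still carry their own leading delimiter:

	#include "postgres.h"
	#include "lib/stringinfo.h"

	/*
	 * Sketch only: because each partial state still starts with its own
	 * first delimiter, combining two states is a plain concatenation, and
	 * state->cursor keeps pointing at the first real value in state1.
	 */
	static StringInfo
	string_agg_combine_sketch(StringInfo state1, StringInfo state2)
	{
		if (state2 == NULL)
			return state1;		/* nothing to add */

		if (state1 == NULL)
		{
			/* adopt state2's contents, including its first-delimiter cursor */
			state1 = makeStringInfo();
			appendBinaryStringInfo(state1, state2->data, state2->len);
			state1->cursor = state2->cursor;
			return state1;
		}

		/* state2 begins with its own delimiter, exactly what we need here */
		appendBinaryStringInfo(state1, state2->data, state2->len);
		return state1;
	}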
On 03/05/2018 04:51 AM, David Rowley wrote:
> On 5 March 2018 at 04:54, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> 1) There seems to be a forgotten declaration of initArrayResultInternal
>> in arrayfuncs.c. I suppose you've renamed it to initArrayResultWithSize
>> and moved it to a header file, and forgotten to remove this bit.
>
> Oops. Must be left over from trying things another way. Removed.
>
>> 2) bytea_string_agg_deserialize has this comment:
>>
>>	/*
>>	 * data: technically we could reuse the buf.data buffer, but that is
>>	 * slightly larger than we need due to the extra 4 bytes for the cursor
>>	 */
>>
>> I find the argument "it has 4 extra bytes, so we'll allocate a new chunk"
>> somewhat weak. We do allocate the memory in 2^n chunks anyway, so the
>> new chunk is likely to be much larger anyway. I'm not saying we need to
>> reuse the buffer; IMHO the savings will be non-measurable.
>
> Agreed. I've removed that part of the comment.
>
>> 3) string_agg_transfn and bytea_string_agg_transfn say this:
>>
>>	/*
>>	 * We always append the delimiter text before the value text, even
>>	 * with the first aggregated item.  The reason for this is that if
>>	 * this state needs to be combined with another state using the
>>	 * aggregate's combine function, then we need to have the delimiter
>>	 * for the first aggregated item.  The first delimiter will be
>>	 * stripped off in the final function anyway.  We use a little cheat
>>	 * here and store the position of the actual first value (after the
>>	 * delimiter) using the StringInfo's cursor variable.  This relies on
>>	 * the cursor not being used for any other purpose.
>>	 */
>>
>> How does this make the code any simpler, compared to not adding the
>> delimiter (as before) and adding it when combining the values (at which
>> point you should know when it's actually needed)?
>
> The problem is that if we don't record the first delimiter then we
> won't know what it is when it comes to combining the states.
>
> Consider the following slightly backward-looking case;
>
> select string_agg(',', x::text) from generate_Series(1,10)x;
>       string_agg
> ----------------------
>  ,2,3,4,5,6,7,8,9,10,
>
> Here the delimiter is the number, not the ','. Of course, the
> delimiter for the first aggregated result is not shown. The previous
> implementation of the transfn for this aggregate just threw away the
> first delimiter, but that's no good for parallel aggregates, as the
> transfn may be getting called in a parallel worker, in which case
> we'll need that delimiter in the combine function to properly join to
> the other partial aggregated results matching the same group.
>

Hmmm, you're right, I hadn't considered that the delimiter might be
variable. But isn't that just yet another hint that while StringInfo was
suitable as the aggregate state of non-parallel string_agg, it may not
be for the parallel version?

What if the parallel string_agg instead used something like this:

struct StringAggState
{
    char           *delimiter;  /* first delimiter */
    StringInfoData  str;        /* partial aggregate state */
} StringAggState;

I think that would make the code cleaner, compared to using the cursor
field for that purpose. But maybe it'd make it not possible to share
code with the non-parallel aggregate, not sure. I always considered the
cursor field to be meant for scanning through a StringInfo, so using it
for this purpose seems a bit like a misuse.

> Probably I could improve the comment a bit. I understand that having a
> variable delimiter is not commonplace in the normal world. I certainly
> had never considered it before working on this. I scratched my head a
> bit when doing this and thought about inventing a new trans type, but
> I decided that the most efficient design for an aggregate state was to
> store the delimiter and data all in one buffer and have a pointer to
> the start of the actual data. StringInfo has exactly what's required
> if you use the cursor as the pointer, so I didn't go and reinvent the
> wheel.
>

I wonder why the variable delimiter is not mentioned in the docs ...
(at least I haven't found anything mentioning the behavior).

> I've now changed the comment to read:
>
>	/*
>	 * It's important that we store the delimiter text for all aggregated
>	 * items, even the first one, which at first thought you might think
>	 * could just be discarded.  The reason for this is that if this
>	 * function is being called from a parallel worker, then we'll need
>	 * the first delimiter in order to properly combine the partially
>	 * aggregated state with the states coming from other workers.  In the
>	 * final output, the first delimiter will be stripped off of the final
>	 * aggregate state, but in order to know where the actual first data
>	 * is we must store the position of the first data value somewhere.
>	 * Conveniently, StringInfo has a cursor property which can be used
>	 * to serve our needs here.
>	 */
>
> I've also updated an incorrect Assert in array_agg_array_serialize.
>
> Please find attached the updated patch.
>

Seems fine to me. I plan to do a bit more testing/review next week, and
I plan to move it to RFC after that (unless I run into something during
the review, but I don't expect that).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 11 March 2018 at 12:11, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> On 03/05/2018 04:51 AM, David Rowley wrote:
>> On 5 March 2018 at 04:54, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> Consider the following slightly backward-looking case;
>>
>> select string_agg(',', x::text) from generate_Series(1,10)x;
>>       string_agg
>> ----------------------
>>  ,2,3,4,5,6,7,8,9,10,
>>
>> Here the delimiter is the number, not the ','. Of course, the
>> delimiter for the first aggregated result is not shown. The previous
>> implementation of the transfn for this aggregate just threw away the
>> first delimiter, but that's no good for parallel aggregates, as the
>> transfn may be getting called in a parallel worker, in which case
>> we'll need that delimiter in the combine function to properly join to
>> the other partial aggregated results matching the same group.
>>
>
> Hmmm, you're right, I hadn't considered that the delimiter might be
> variable. But isn't that just yet another hint that while StringInfo was
> suitable as the aggregate state of non-parallel string_agg, it may not
> be for the parallel version?
>
> What if the parallel string_agg instead used something like this:
>
> struct StringAggState
> {
>     char           *delimiter;  /* first delimiter */
>     StringInfoData  str;        /* partial aggregate state */
> } StringAggState;
>
> I think that would make the code cleaner, compared to using the cursor
> field for that purpose. But maybe it'd make it not possible to share
> code with the non-parallel aggregate, not sure.

It would be possible to use something like that. The final function
would just need to take 'str' and ignore 'delimiter', whereas the
combine function would need both. However, you could optimize the
above to just have a single StringInfoData and a pointer to the start
of the actual data (beyond the first delimiter); that would save us a
call to palloc and also allow the state's data to exist in one
contiguous block of memory, which will be more cache friendly when it
comes to reading it back again. The pointer here would, of course,
have to be an offset into the data array, since repallocs would cause
problems as the state grew.

This is kinda the option I was going for with using the cursor. I
didn't think that was going to be a problem. It seems that cursor was
invented so that users of StringInfo could store a marker somehow
along with the string. The comment for cursor reads:

 * cursor is initialized to zero by makeStringInfo or initStringInfo,
 * but is not otherwise touched by the stringinfo.c routines.
 * Some routines use it to scan through a StringInfo.

The use of the cursor field does not interfere with pqformat.c's use
of it, as in the serialization functions we're creating a new
StringInfo for the 'result'. If anything tries to send the internal
state, then that's certainly broken; it needs to be serialized before
that can happen.

I don't quite see how inventing a new struct would make the code
cleaner. It would make the serialization and deserialization functions
have to write and read more fields for the lengths of the data
contained in the state.

I understand that the cursor field is used in pqformat.c quite a bit,
but I don't quite understand why it should get to use that field the
way it wants, but the serialization and deserialization functions
shouldn't. I'd understand that if we were trying to phase out the
cursor field from StringInfoData, but I can't imagine that's going to
happen.

Of course, with all that said, if you really strongly object to how
I've done this, then I'll change it to use a new struct type. I had
just tried to make the patch footprint as small as possible, and the
above text is me just explaining my reasoning behind this, not me
objecting to your request for me to change it. Let me know your
thoughts after reading the above.

In the meantime, I've attached an updated patch. The only change it
contains is updating the "Partial Mode" column in the docs from "No"
to "Yes".

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 03/11/2018 07:31 AM, David Rowley wrote:
> On 11 March 2018 at 12:11, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> On 03/05/2018 04:51 AM, David Rowley wrote:
>>> On 5 March 2018 at 04:54, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>> Consider the following slightly backward-looking case;
>>>
>>> select string_agg(',', x::text) from generate_Series(1,10)x;
>>>       string_agg
>>> ----------------------
>>>  ,2,3,4,5,6,7,8,9,10,
>>>
>>> Here the delimiter is the number, not the ','. Of course, the
>>> delimiter for the first aggregated result is not shown. The previous
>>> implementation of the transfn for this aggregate just threw away the
>>> first delimiter, but that's no good for parallel aggregates, as the
>>> transfn may be getting called in a parallel worker, in which case
>>> we'll need that delimiter in the combine function to properly join to
>>> the other partial aggregated results matching the same group.
>>>
>>
>> Hmmm, you're right, I hadn't considered that the delimiter might be
>> variable. But isn't that just yet another hint that while StringInfo was
>> suitable as the aggregate state of non-parallel string_agg, it may not
>> be for the parallel version?
>>
>> What if the parallel string_agg instead used something like this:
>>
>> struct StringAggState
>> {
>>     char           *delimiter;  /* first delimiter */
>>     StringInfoData  str;        /* partial aggregate state */
>> } StringAggState;
>>
>> I think that would make the code cleaner, compared to using the cursor
>> field for that purpose. But maybe it'd make it not possible to share
>> code with the non-parallel aggregate, not sure.
>
> It would be possible to use something like that. The final function
> would just need to take 'str' and ignore 'delimiter', whereas the
> combine function would need both. However, you could optimize the
> above to just have a single StringInfoData and a pointer to the start
> of the actual data (beyond the first delimiter); that would save us a
> call to palloc and also allow the state's data to exist in one
> contiguous block of memory, which will be more cache friendly when it
> comes to reading it back again. The pointer here would, of course,
> have to be an offset into the data array, since repallocs would cause
> problems as the state grew.
>
> This is kinda the option I was going for with using the cursor. I
> didn't think that was going to be a problem. It seems that cursor was
> invented so that users of StringInfo could store a marker somehow
> along with the string. The comment for cursor reads:
>
>  * cursor is initialized to zero by makeStringInfo or initStringInfo,
>  * but is not otherwise touched by the stringinfo.c routines.
>  * Some routines use it to scan through a StringInfo.
>
> The use of the cursor field does not interfere with pqformat.c's use
> of it, as in the serialization functions we're creating a new
> StringInfo for the 'result'. If anything tries to send the internal
> state, then that's certainly broken; it needs to be serialized before
> that can happen.
>
> I don't quite see how inventing a new struct would make the code
> cleaner. It would make the serialization and deserialization functions
> have to write and read more fields for the lengths of the data
> contained in the state.
>
> I understand that the cursor field is used in pqformat.c quite a bit,
> but I don't quite understand why it should get to use that field the
> way it wants, but the serialization and deserialization functions
> shouldn't. I'd understand that if we were trying to phase out the
> cursor field from StringInfoData, but I can't imagine that's going to
> happen.
>
> Of course, with all that said, if you really strongly object to how
> I've done this, then I'll change it to use a new struct type. I had
> just tried to make the patch footprint as small as possible, and the
> above text is me just explaining my reasoning behind this, not me
> objecting to your request for me to change it. Let me know your
> thoughts after reading the above.
>

Hmmm, I guess you're right, I didn't really think that through. I don't
object to sticking to the current approach.

> In the meantime, I've attached an updated patch. The only change it
> contains is updating the "Partial Mode" column in the docs from "No"
> to "Yes".
>

Yeah, seems fine to me. I wonder what else would be needed before
switching the patch to RFC. I plan to do that after a bit more testing
sometime early next week, unless someone objects.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 03/11/2018 12:10 PM, Tomas Vondra wrote:
>
> ...
>
> Yeah, seems fine to me. I wonder what else would be needed before
> switching the patch to RFC. I plan to do that after a bit more
> testing sometime early next week, unless someone objects.
>

I've done more testing on this patch, and I haven't found any new issues
with it, so PFA I'm marking it as ready for committer.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 16 March 2018 at 13:46, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> I've done more testing on this patch, and I haven't found any new issues
> with it, so PFA I'm marking it as ready for committer.

Great! Many thanks for the review.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Greetings,

* David Rowley (david.rowley@2ndquadrant.com) wrote:
> On 16 March 2018 at 13:46, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> > I've done more testing on this patch, and I haven't found any new issues
> > with it, so PFA I'm marking it as ready for committer.
>
> Great! Many thanks for the review.

I read through the thread and started reviewing this, comments below.

diff --git a/src/backend/utils/adt/array_userfuncs.c b/src/backend/utils/adt/array_userfuncs.c
[...]
+		for (i = 0; i < state2->nelems; i++)
+			state1->dvalues[i] = datumCopy(state2->dvalues[i],
+										   state1->typbyval, state1->typlen);
+
+		memcpy(state1->dnulls, state2->dnulls, sizeof(bool) * state2->nelems);

The header at the top of datumCopy() pretty clearly says that it's for
"non-NULL" datums, yet this function seems to be happily ignoring that
and just trying to use it to copy everything. Perhaps I'm missing
something, but I don't see anything obvious that would lead me to
conclude that what's being done here (or in other similar cases in this
patch) is acceptable.

[...]
+		for (i = 0; i < state->nelems; i++)
+		{
+			outputbytes = OidSendFunctionCall(iodata->typsend, state->dvalues[i]);
+			pq_sendint32(&buf, VARSIZE(outputbytes) - VARHDRSZ);
+			pq_sendbytes(&buf, VARDATA(outputbytes),
+						 VARSIZE(outputbytes) - VARHDRSZ);
+		}
+	}
+
+	/* dnulls */
+	pq_sendbytes(&buf, (char *) state->dnulls, sizeof(bool) * state->nelems);

SendFunctionCall() has a similar "Do not call this on NULL datums" comment.

+		for (i = 0; i < nelems; i++)
+		{
+			StringInfoData tmp;
+			tmp.cursor = 0;
+			tmp.maxlen = tmp.len = pq_getmsgint(&buf, 4);
+			tmp.data = (char *) pq_getmsgbytes(&buf, tmp.len);
+
+			result->dvalues[i] = OidReceiveFunctionCall(iodata->typreceive,
+														&tmp,
+														iodata->typioparam,
+														-1);
+		}
+	}
+
+	/* dnulls */
+	temp = pq_getmsgbytes(&buf, sizeof(bool) * nelems);
+	memcpy(result->dnulls, temp, sizeof(bool) * nelems);

And if we aren't sending them then we probably need to do something
different on the receive side...

I glanced through the rest and it seemed alright, but will want to
still go over it more carefully once the above comments are addressed.

Marking Waiting-for-Author.

Thanks!

Stephen
Attachment
On 26 March 2018 at 15:26, Stephen Frost <sfrost@snowman.net> wrote:
> The header at the top of datumCopy() pretty clearly says that it's for
> "non-NULL" datums, yet this function seems to be happily ignoring that
> and just trying to use it to copy everything. Perhaps I'm missing
> something, but I don't see anything obvious that would lead me to
> conclude that what's being done here (or in other similar cases in this
> patch) is acceptable.

Thanks for looking at this.

You're right; I've overlooked this. The code should be checking for
NULL value Datums there. I've fixed this locally, but while testing, I
discovered another bug around string_agg. At the moment string_agg's
transfn only allocates the state when it gets a non-NULL string to
aggregate, whereas it seems other trans functions which return an
internal state allocate their state on the first call, e.g.
int8_avg_accum(). This NULL state is causing the serial function to
segfault on a null pointer dereference. I think the fix is to always
allocate the state in the transfn, but I just wanted to point this out
before I go and do that.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Greetings,

* David Rowley (david.rowley@2ndquadrant.com) wrote:
> On 26 March 2018 at 15:26, Stephen Frost <sfrost@snowman.net> wrote:
> > The header at the top of datumCopy() pretty clearly says that it's for
> > "non-NULL" datums, yet this function seems to be happily ignoring that
> > and just trying to use it to copy everything. Perhaps I'm missing
> > something, but I don't see anything obvious that would lead me to
> > conclude that what's being done here (or in other similar cases in this
> > patch) is acceptable.
>
> Thanks for looking at this.
>
> You're right; I've overlooked this. The code should be checking for
> NULL value Datums there. I've fixed this locally, but while testing, I
> discovered another bug around string_agg. At the moment string_agg's
> transfn only allocates the state when it gets a non-NULL string to
> aggregate, whereas it seems other trans functions which return an
> internal state allocate their state on the first call, e.g.
> int8_avg_accum(). This NULL state is causing the serial function to
> segfault on a null pointer dereference. I think the fix is to always
> allocate the state in the transfn, but I just wanted to point this out
> before I go and do that.

Just to be clear- the segfault is just happening with your patch, and
you're contemplating having string_agg always allocate state on the
first call, similar to what int8_avg_accum() does?

If so, then, yes, that seems alright to me.

Thanks!

Stephen
Attachment
On 27 March 2018 at 02:20, Stephen Frost <sfrost@snowman.net> wrote:
> Just to be clear- the segfault is just happening with your patch, and
> you're contemplating having string_agg always allocate state on the
> first call, similar to what int8_avg_accum() does?
>
> If so, then, yes, that seems alright to me.

I did write some code to do this, but I ended up having to invent a new
state type which contained an additional "gotvalue" bool flag. We
needed to know the difference between a new state and one that's just
not had any non-NULL values aggregated yet. This meant the transfn and
combinefn needed to track this extra flag, and the serial and deserial
functions had to send it too. It started to look pretty horrible, so I
started wondering about modifying the serial and deserial functions to
allow NULL states instead. That's when I realised that the serial and
deserial functions are both strict, and should never have been called
with NULL states in the first place!

It turns out this is a bit of an existing bug, but perhaps one that is
pretty harmless without this patch. Basically, the string_agg trans
function was always returning PG_RETURN_POINTER(state), even when the
state was NULL. This meant that the fcinfo->isnull flag was not
properly set, which caused the nodeAgg.c code to call the strict serial
function regardless.

In the end, I just fixed this by changing the PG_RETURN_POINTER(state)
into:

	if (state)
		PG_RETURN_POINTER(state);
	PG_RETURN_NULL();

The final functions seem to have managed to escape this bug by the way
they check for NULL states:

	state = PG_ARGISNULL(0) ? NULL : (StringInfo) PG_GETARG_POINTER(0);

	if (state != NULL)
	...

which allows either an SQL NULL or a NULL pointer. These could probably
now be changed to become:

	if (!PG_ARGISNULL(0))
	{
		state = (StringInfo) PG_GETARG_POINTER(0);
	...

Or it may be possible to make the finalfn strict, but I didn't do
either of these things.

I did also wonder if PG_RETURN_POINTER should test for NULL pointers
and set fcinfo->isnull, but there might be valid cases where we'd want
to not set the flag... I just can't imagine a valid reason why right
now.

In the attached, I also got rid of the separate serial and deserial
functions for string_agg(text). I realised that these were doing
pq_sendtext() instead of pq_sendbytes(). This meant that text would be
converted to the client encoding, which was not what we wanted, as we
were not sending anything to the client. Fixing this meant that the
string_agg(bytea) and string_agg(text) aggregates had identical
serial/deserial functions, so they now share the same functions
instead.

I also looked at all the built-in aggregate trans functions which use
an internal aggregate state to verify that they're also doing the right
thing in regards to NULLs. The following query returns 16 distinct
functions; all of these, apart from the ones I've fixed in the attached
patch, allocate the internal state on first call, so it does not appear
that there's anything else to fix in this area:

select distinct aggtransfn from pg_aggregate where aggtranstype=2281 order by 1;

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
David Rowley <david.rowley@2ndquadrant.com> writes:
> [ combinefn_for_string_and_array_aggs_v7.patch ]

I spent a fair amount of time hacking on this with intent to commit,
but just as I was getting to code that I liked, I started to have
second thoughts about whether this is a good idea at all. I quote from
the fine manual:

	The aggregate functions array_agg, json_agg, jsonb_agg,
	json_object_agg, jsonb_object_agg, string_agg, and xmlagg, as well as
	similar user-defined aggregate functions, produce meaningfully
	different result values depending on the order of the input values.
	This ordering is unspecified by default, but can be controlled by
	writing an ORDER BY clause within the aggregate call, as shown in
	Section 4.2.7. Alternatively, supplying the input values from a
	sorted subquery will usually work ...

I do not think it is accidental that these aggregates are exactly the
ones that do not have parallelism support today. Rather, that's because
you just about always have an interest in the order in which the inputs
get aggregated, which is something that parallel aggregation cannot
support.

I fear that what will happen, if we commit this, is that something like
0.01% of the users of array_agg and string_agg will be pleased, another
maybe 20% will be unaffected because they wrote ORDER BY which prevents
parallel aggregation, and the remaining 80% will scream because we broke
their queries. Telling them they should've written ORDER BY isn't going
to cut it, IMO, when the benefit of that breakage will accrue only to
some very tiny fraction of use-cases.

In short, I think we ought to reject this.

Just in case I'm outvoted, attached is what I'd gotten done so far. The
main noncosmetic changes I'd made were to improve the caching logic
(it's silly to set up a lookup cache and then not cache the fmgr_info
lookup) and to not cheat quite as much on the StringInfo passed down to
the element typreceive function. There isn't any other place, I don't
think, where we don't honor the expectation that StringInfos have
trailing null bytes, and some places may depend on it --- array_recv
does.

The main thing that remains undone is to get some test coverage ---
AFAICS, none of these new functions get exercised in the standard
regression tests.

I'm also a bit unhappy that the patch introduces code into
array_userfuncs.c that knows everything there is to know about the
contents of ArrayBuildState and ArrayBuildStateArr. Previously that
knowledge was pretty well localized in arrayfuncs.c. I wonder if it'd
be better to move the new combinefuncs and serial/deserial funcs into
arrayfuncs.c. Or perhaps export primitives from there that could be
used to build them.

			regards, tom lane

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 7b1a85f..faa4939 100644 *** a/doc/src/sgml/func.sgml --- b/doc/src/sgml/func.sgml *************** NULL baz</literallayout>(3 rows)</entry> *** 13358,13364 **** <entry> array of the argument type </entry> ! <entry>No</entry> <entry>input values, including nulls, concatenated into an array</entry> </row> --- 13358,13364 ---- <entry> array of the argument type </entry> ! <entry>Yes</entry> <entry>input values, including nulls, concatenated into an array</entry> </row> *************** NULL baz</literallayout>(3 rows)</entry> *** 13372,13378 **** <entry> same as argument data type </entry> !
<entry>No</entry> <entry>input arrays concatenated into array of one higher dimension (inputs must all have same dimensionality, and cannot be empty or NULL)</entry> --- 13372,13378 ---- <entry> same as argument data type </entry> ! <entry>Yes</entry> <entry>input arrays concatenated into array of one higher dimension (inputs must all have same dimensionality, and cannot be empty or NULL)</entry> *************** NULL baz</literallayout>(3 rows)</entry> *** 13633,13639 **** <entry> same as argument types </entry> ! <entry>No</entry> <entry>input values concatenated into a string, separated by delimiter</entry> </row> --- 13633,13639 ---- <entry> same as argument types </entry> ! <entry>Yes</entry> <entry>input values concatenated into a string, separated by delimiter</entry> </row> diff --git a/src/backend/utils/adt/array_userfuncs.c b/src/backend/utils/adt/array_userfuncs.c index cb7a6b6..0007a28 100644 *** a/src/backend/utils/adt/array_userfuncs.c --- b/src/backend/utils/adt/array_userfuncs.c *************** *** 13,24 **** --- 13,44 ---- #include "postgres.h" #include "catalog/pg_type.h" + #include "libpq/pqformat.h" #include "common/int.h" #include "utils/array.h" + #include "utils/datum.h" #include "utils/builtins.h" #include "utils/lsyscache.h" #include "utils/typcache.h" + /* + * SerialIOData + * Used for caching element-type data in array_agg_serialize + */ + typedef struct SerialIOData + { + FmgrInfo typsend; + } SerialIOData; + + /* + * DeserialIOData + * Used for caching element-type data in array_agg_deserialize + */ + typedef struct DeserialIOData + { + FmgrInfo typreceive; + Oid typioparam; + } DeserialIOData; static Datum array_position_common(FunctionCallInfo fcinfo); *************** array_agg_transfn(PG_FUNCTION_ARGS) *** 499,504 **** --- 519,838 ---- } Datum + array_agg_combine(PG_FUNCTION_ARGS) + { + ArrayBuildState *state1; + ArrayBuildState *state2; + MemoryContext agg_context; + MemoryContext old_context; + int i; + + if (!AggCheckCallContext(fcinfo, &agg_context)) + elog(ERROR, "aggregate function called in non-aggregate context"); + + state1 = PG_ARGISNULL(0) ? NULL : (ArrayBuildState *) PG_GETARG_POINTER(0); + state2 = PG_ARGISNULL(1) ? 
NULL : (ArrayBuildState *) PG_GETARG_POINTER(1); + + if (state2 == NULL) + { + /* + * NULL state2 is easy, just return state1, which we know is already + * in the agg_context + */ + if (state1 == NULL) + PG_RETURN_NULL(); + PG_RETURN_POINTER(state1); + } + + if (state1 == NULL) + { + /* We must copy state2's data into the agg_context */ + state1 = initArrayResultWithSize(state2->element_type, agg_context, + false, state2->alen); + + old_context = MemoryContextSwitchTo(agg_context); + + for (i = 0; i < state2->nelems; i++) + { + if (!state2->dnulls[i]) + state1->dvalues[i] = datumCopy(state2->dvalues[i], + state1->typbyval, + state1->typlen); + else + state1->dvalues[i] = (Datum) 0; + } + + MemoryContextSwitchTo(old_context); + + memcpy(state1->dnulls, state2->dnulls, sizeof(bool) * state2->nelems); + + state1->nelems = state2->nelems; + + PG_RETURN_POINTER(state1); + } + else if (state2->nelems > 0) + { + /* We only need to combine the two states if state2 has any elements */ + int reqsize = state1->nelems + state2->nelems; + MemoryContext oldContext = MemoryContextSwitchTo(state1->mcontext); + + Assert(state1->element_type == state2->element_type); + + /* Enlarge state1 arrays if needed */ + if (state1->alen < reqsize) + { + /* Use a power of 2 size rather than allocating just reqsize */ + while (state1->alen < reqsize) + state1->alen *= 2; + + state1->dvalues = (Datum *) repalloc(state1->dvalues, + state1->alen * sizeof(Datum)); + state1->dnulls = (bool *) repalloc(state1->dnulls, + state1->alen * sizeof(bool)); + } + + /* Copy in the state2 elements to the end of the state1 arrays */ + for (i = 0; i < state2->nelems; i++) + { + if (!state2->dnulls[i]) + state1->dvalues[i + state1->nelems] = + datumCopy(state2->dvalues[i], + state1->typbyval, + state1->typlen); + else + state1->dvalues[i + state1->nelems] = (Datum) 0; + } + + memcpy(&state1->dnulls[state1->nelems], state2->dnulls, + sizeof(bool) * state2->nelems); + + state1->nelems = reqsize; + + MemoryContextSwitchTo(oldContext); + } + + PG_RETURN_POINTER(state1); + } + + /* + * array_agg_serialize + * Serialize ArrayBuildState into bytea. + */ + Datum + array_agg_serialize(PG_FUNCTION_ARGS) + { + ArrayBuildState *state; + StringInfoData buf; + bytea *result; + + /* cannot be called directly because of internal-type argument */ + Assert(AggCheckCallContext(fcinfo, NULL)); + + state = (ArrayBuildState *) PG_GETARG_POINTER(0); + + pq_begintypsend(&buf); + + /* + * element_type. Putting this first is more convenient in deserialization + */ + pq_sendint32(&buf, state->element_type); + + /* + * nelems -- send first so we know how large to make the dvalues and + * dnulls array during deserialization. + */ + pq_sendint64(&buf, state->nelems); + + /* alen can be decided during deserialization */ + + /* typlen */ + pq_sendint16(&buf, state->typlen); + + /* typbyval */ + pq_sendbyte(&buf, state->typbyval); + + /* typalign */ + pq_sendbyte(&buf, state->typalign); + + /* dnulls */ + pq_sendbytes(&buf, (char *) state->dnulls, sizeof(bool) * state->nelems); + + /* + * dvalues. By agreement with array_agg_deserialize, when the element + * type is byval, we just transmit the Datum array as-is, including any + * null elements. For by-ref types, we must invoke the element type's + * send function, and we skip null elements (which is why the nulls flags + * must be sent first). 
+ */ + if (state->typbyval) + pq_sendbytes(&buf, (char *) state->dvalues, + sizeof(Datum) * state->nelems); + else + { + SerialIOData *iodata; + int i; + + /* Avoid repeat catalog lookups for typsend function */ + iodata = (SerialIOData *) fcinfo->flinfo->fn_extra; + if (iodata == NULL) + { + Oid typsend; + bool typisvarlena; + + iodata = (SerialIOData *) + MemoryContextAlloc(fcinfo->flinfo->fn_mcxt, + sizeof(SerialIOData)); + getTypeBinaryOutputInfo(state->element_type, &typsend, + &typisvarlena); + fmgr_info_cxt(typsend, &iodata->typsend, + fcinfo->flinfo->fn_mcxt); + fcinfo->flinfo->fn_extra = (void *) iodata; + } + + for (i = 0; i < state->nelems; i++) + { + bytea *outputbytes; + + if (state->dnulls[i]) + continue; + outputbytes = SendFunctionCall(&iodata->typsend, + state->dvalues[i]); + pq_sendint32(&buf, VARSIZE(outputbytes) - VARHDRSZ); + pq_sendbytes(&buf, VARDATA(outputbytes), + VARSIZE(outputbytes) - VARHDRSZ); + } + } + + result = pq_endtypsend(&buf); + + PG_RETURN_BYTEA_P(result); + } + + Datum + array_agg_deserialize(PG_FUNCTION_ARGS) + { + bytea *sstate; + ArrayBuildState *result; + StringInfoData buf; + Oid element_type; + int64 nelems; + const char *temp; + + if (!AggCheckCallContext(fcinfo, NULL)) + elog(ERROR, "aggregate function called in non-aggregate context"); + + sstate = PG_GETARG_BYTEA_PP(0); + + /* + * Copy the bytea into a StringInfo so that we can "receive" it using the + * standard recv-function infrastructure. + */ + initStringInfo(&buf); + appendBinaryStringInfo(&buf, + VARDATA_ANY(sstate), VARSIZE_ANY_EXHDR(sstate)); + + /* element_type */ + element_type = pq_getmsgint(&buf, 4); + + /* nelems */ + nelems = pq_getmsgint64(&buf); + + /* Create output ArrayBuildState with the needed number of elements */ + result = initArrayResultWithSize(element_type, CurrentMemoryContext, + false, nelems); + result->nelems = nelems; + + /* typlen */ + result->typlen = pq_getmsgint(&buf, 2); + + /* typbyval */ + result->typbyval = pq_getmsgbyte(&buf); + + /* typalign */ + result->typalign = pq_getmsgbyte(&buf); + + /* dnulls */ + temp = pq_getmsgbytes(&buf, sizeof(bool) * nelems); + memcpy(result->dnulls, temp, sizeof(bool) * nelems); + + /* dvalues --- see comment in array_agg_serialize */ + if (result->typbyval) + { + temp = pq_getmsgbytes(&buf, sizeof(Datum) * nelems); + memcpy(result->dvalues, temp, sizeof(Datum) * nelems); + } + else + { + DeserialIOData *iodata; + int i; + + /* Avoid repeat catalog lookups for typreceive function */ + iodata = (DeserialIOData *) fcinfo->flinfo->fn_extra; + if (iodata == NULL) + { + Oid typreceive; + + iodata = (DeserialIOData *) + MemoryContextAlloc(fcinfo->flinfo->fn_mcxt, + sizeof(DeserialIOData)); + getTypeBinaryInputInfo(element_type, &typreceive, + &iodata->typioparam); + fmgr_info_cxt(typreceive, &iodata->typreceive, + fcinfo->flinfo->fn_mcxt); + fcinfo->flinfo->fn_extra = (void *) iodata; + } + + for (i = 0; i < nelems; i++) + { + int itemlen; + StringInfoData elem_buf; + char csave; + + if (result->dnulls[i]) + { + result->dvalues[i] = (Datum) 0; + continue; + } + + itemlen = pq_getmsgint(&buf, 4); + if (itemlen < 0 || itemlen > (buf.len - buf.cursor)) + ereport(ERROR, + (errcode(ERRCODE_INVALID_BINARY_REPRESENTATION), + errmsg("insufficient data left in message"))); + + /* + * Rather than copying data around, we just set up a phony + * StringInfo pointing to the correct portion of the input buffer. + * We assume we can scribble on the input buffer so as to maintain + * the convention that StringInfos have a trailing null. 
+ */ + elem_buf.data = &buf.data[buf.cursor]; + elem_buf.maxlen = itemlen + 1; + elem_buf.len = itemlen; + elem_buf.cursor = 0; + + buf.cursor += itemlen; + + csave = buf.data[buf.cursor]; + buf.data[buf.cursor] = '\0'; + + /* Now call the element's receiveproc */ + result->dvalues[i] = ReceiveFunctionCall(&iodata->typreceive, + &elem_buf, + iodata->typioparam, + -1); + + buf.data[buf.cursor] = csave; + } + } + + pq_getmsgend(&buf); + pfree(buf.data); + + PG_RETURN_POINTER(result); + } + + Datum array_agg_finalfn(PG_FUNCTION_ARGS) { Datum result; *************** array_agg_array_transfn(PG_FUNCTION_ARGS *** 578,583 **** --- 912,1216 ---- } Datum + array_agg_array_combine(PG_FUNCTION_ARGS) + { + ArrayBuildStateArr *state1; + ArrayBuildStateArr *state2; + MemoryContext agg_context; + MemoryContext old_context; + + if (!AggCheckCallContext(fcinfo, &agg_context)) + elog(ERROR, "aggregate function called in non-aggregate context"); + + state1 = PG_ARGISNULL(0) ? NULL : (ArrayBuildStateArr *) PG_GETARG_POINTER(0); + state2 = PG_ARGISNULL(1) ? NULL : (ArrayBuildStateArr *) PG_GETARG_POINTER(1); + + if (state2 == NULL) + { + /* + * NULL state2 is easy, just return state1, which we know is already + * in the agg_context + */ + if (state1 == NULL) + PG_RETURN_NULL(); + PG_RETURN_POINTER(state1); + } + + if (state1 == NULL) + { + /* We must copy state2's data into the agg_context */ + old_context = MemoryContextSwitchTo(agg_context); + + state1 = initArrayResultArr(state2->array_type, InvalidOid, + agg_context, false); + + state1->abytes = state2->abytes; + state1->data = (char *) palloc(state1->abytes); + + if (state2->nullbitmap) + { + int size = (state2->aitems + 7) / 8; + + state1->nullbitmap = (bits8 *) palloc(size); + memcpy(state1->nullbitmap, state2->nullbitmap, size); + } + + memcpy(state1->data, state2->data, state2->nbytes); + state1->nbytes = state2->nbytes; + state1->aitems = state2->aitems; + state1->nitems = state2->nitems; + state1->ndims = state2->ndims; + memcpy(state1->dims, state2->dims, sizeof(state2->dims)); + memcpy(state1->lbs, state2->lbs, sizeof(state2->lbs)); + state1->array_type = state2->array_type; + state1->element_type = state2->element_type; + + MemoryContextSwitchTo(old_context); + + PG_RETURN_POINTER(state1); + } + + /* We only need to combine the two states if state2 has any items */ + else if (state2->nitems > 0) + { + MemoryContext oldContext; + int reqsize = state1->nbytes + state2->nbytes; + int i; + + /* + * Check the states are compatible with each other. Ensure we use the + * same error messages that are listed in accumArrayResultArr so that + * the same error is shown as would have been if we'd not used the + * combine function for the aggregation. + */ + if (state1->ndims != state2->ndims) + ereport(ERROR, + (errcode(ERRCODE_ARRAY_SUBSCRIPT_ERROR), + errmsg("cannot accumulate arrays of different dimensionality"))); + + /* Check dimensions match ignoring the first dimension. */ + for (i = 1; i < state1->ndims; i++) + { + if (state1->dims[i] != state2->dims[i] || state1->lbs[i] != state2->lbs[i]) + ereport(ERROR, + (errcode(ERRCODE_ARRAY_SUBSCRIPT_ERROR), + errmsg("cannot accumulate arrays of different dimensionality"))); + } + + + oldContext = MemoryContextSwitchTo(state1->mcontext); + + /* + * If there's not enough space in state1 then we'll need to reallocate + * more. 
+ */ + if (state1->abytes < reqsize) + { + /* use a power of 2 size rather than allocating just reqsize */ + while (state1->abytes < reqsize) + state1->abytes *= 2; + + state1->data = (char *) repalloc(state1->data, state1->abytes); + } + + if (state2->nullbitmap) + { + int newnitems = state1->nitems + state2->nitems; + + if (state1->nullbitmap == NULL) + { + /* + * First input with nulls; we must retrospectively handle any + * previous inputs by marking all their items non-null. + */ + state1->aitems = 256; + while (state1->aitems <= newnitems) + state1->aitems *= 2; + state1->nullbitmap = (bits8 *) palloc((state1->aitems + 7) / 8); + array_bitmap_copy(state1->nullbitmap, 0, + NULL, 0, + state1->nitems); + } + else if (newnitems > state1->aitems) + { + int newaitems = state1->aitems + state2->aitems; + + while (state1->aitems < newaitems) + state1->aitems *= 2; + + state1->nullbitmap = (bits8 *) + repalloc(state1->nullbitmap, (state1->aitems + 7) / 8); + } + array_bitmap_copy(state1->nullbitmap, state1->nitems, + state2->nullbitmap, 0, + state2->nitems); + } + + memcpy(state1->data + state1->nbytes, state2->data, state2->nbytes); + state1->nbytes += state2->nbytes; + state1->nitems += state2->nitems; + + state1->dims[0] += state2->dims[0]; + /* remaing dims already match, per test above */ + + Assert(state1->array_type == state2->array_type); + Assert(state1->element_type = state2->element_type); + + MemoryContextSwitchTo(oldContext); + } + + PG_RETURN_POINTER(state1); + } + + /* + * array_agg_array_serialize + * Serialize ArrayBuildStateArr into bytea. + */ + Datum + array_agg_array_serialize(PG_FUNCTION_ARGS) + { + ArrayBuildStateArr *state; + StringInfoData buf; + bytea *result; + + /* cannot be called directly because of internal-type argument */ + Assert(AggCheckCallContext(fcinfo, NULL)); + + state = (ArrayBuildStateArr *) PG_GETARG_POINTER(0); + + pq_begintypsend(&buf); + + /* + * element_type. Putting this first is more convenient in deserialization + * so that we can init the new state sooner. + */ + pq_sendint32(&buf, state->element_type); + + /* array_type */ + pq_sendint32(&buf, state->array_type); + + /* nbytes */ + pq_sendint32(&buf, state->nbytes); + + /* data */ + pq_sendbytes(&buf, state->data, state->nbytes); + + /* abytes */ + pq_sendint32(&buf, state->abytes); + + /* aitems */ + pq_sendint32(&buf, state->aitems); + + /* nullbitmap */ + if (state->nullbitmap) + { + Assert(state->aitems > 0); + pq_sendbytes(&buf, (char *) state->nullbitmap, (state->aitems + 7) / 8); + } + + /* nitems */ + pq_sendint32(&buf, state->nitems); + + /* ndims */ + pq_sendint32(&buf, state->ndims); + + /* dims: XXX should we just send ndims elements? */ + pq_sendbytes(&buf, (char *) state->dims, sizeof(state->dims)); + + /* lbs */ + pq_sendbytes(&buf, (char *) state->lbs, sizeof(state->lbs)); + + result = pq_endtypsend(&buf); + + PG_RETURN_BYTEA_P(result); + } + + Datum + array_agg_array_deserialize(PG_FUNCTION_ARGS) + { + bytea *sstate; + ArrayBuildStateArr *result; + StringInfoData buf; + Oid element_type; + Oid array_type; + int nbytes; + const char *temp; + + /* cannot be called directly because of internal-type argument */ + Assert(AggCheckCallContext(fcinfo, NULL)); + + sstate = PG_GETARG_BYTEA_PP(0); + + /* + * Copy the bytea into a StringInfo so that we can "receive" it using the + * standard recv-function infrastructure. 
+ */ + initStringInfo(&buf); + appendBinaryStringInfo(&buf, + VARDATA_ANY(sstate), VARSIZE_ANY_EXHDR(sstate)); + + /* element_type */ + element_type = pq_getmsgint(&buf, 4); + + /* array_type */ + array_type = pq_getmsgint(&buf, 4); + + /* nbytes */ + nbytes = pq_getmsgint(&buf, 4); + + result = initArrayResultArr(array_type, element_type, + CurrentMemoryContext, false); + + result->abytes = 1024; + while (result->abytes < nbytes) + result->abytes *= 2; + + result->data = (char *) palloc(result->abytes); + + /* data */ + temp = pq_getmsgbytes(&buf, nbytes); + memcpy(result->data, temp, nbytes); + result->nbytes = nbytes; + + /* abytes */ + result->abytes = pq_getmsgint(&buf, 4); + + /* aitems: might be 0 */ + result->aitems = pq_getmsgint(&buf, 4); + + /* nullbitmap */ + if (result->aitems > 0) + { + int size = (result->aitems + 7) / 8; + + result->nullbitmap = (bits8 *) palloc(size); + temp = pq_getmsgbytes(&buf, size); + memcpy(result->nullbitmap, temp, size); + } + else + result->nullbitmap = NULL; + + /* nitems */ + result->nitems = pq_getmsgint(&buf, 4); + + /* ndims */ + result->ndims = pq_getmsgint(&buf, 4); + + /* dims */ + temp = pq_getmsgbytes(&buf, sizeof(result->dims)); + memcpy(result->dims, temp, sizeof(result->dims)); + + /* lbs */ + temp = pq_getmsgbytes(&buf, sizeof(result->lbs)); + memcpy(result->lbs, temp, sizeof(result->lbs)); + + pq_getmsgend(&buf); + pfree(buf.data); + + PG_RETURN_POINTER(result); + } + + Datum array_agg_array_finalfn(PG_FUNCTION_ARGS) { Datum result; diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c index 0cbdbe5..1210704 100644 *** a/src/backend/utils/adt/arrayfuncs.c --- b/src/backend/utils/adt/arrayfuncs.c *************** array_insert_slice(ArrayType *destArray, *** 4996,5007 **** * subcontext=false. * * In cases when the array build states have different lifetimes, using a ! * single memory context is impractical. Instead, pass subcontext=true so that ! * the array build states can be freed individually. */ ArrayBuildState * initArrayResult(Oid element_type, MemoryContext rcontext, bool subcontext) { ArrayBuildState *astate; MemoryContext arr_context = rcontext; --- 4996,5025 ---- * subcontext=false. * * In cases when the array build states have different lifetimes, using a ! * single memory context is impractical. Instead, pass subcontext=true so ! * that the array build states can be freed individually. */ ArrayBuildState * initArrayResult(Oid element_type, MemoryContext rcontext, bool subcontext) { + /* + * When using a subcontext, we can afford to start with a somewhat larger + * initial array size. Without subcontexts, we'd better hope that most of + * the states stay small ... + */ + return initArrayResultWithSize(element_type, rcontext, subcontext, + subcontext ? 64 : 8); + } + + /* + * initArrayResultWithSize + * As initArrayResult, but allow the initial size of the allocated arrays + * to be specified. + */ + ArrayBuildState * + initArrayResultWithSize(Oid element_type, MemoryContext rcontext, + bool subcontext, int initsize) + { ArrayBuildState *astate; MemoryContext arr_context = rcontext; *************** initArrayResult(Oid element_type, Memory *** 5015,5021 **** MemoryContextAlloc(arr_context, sizeof(ArrayBuildState)); astate->mcontext = arr_context; astate->private_cxt = subcontext; ! astate->alen = (subcontext ? 
64 : 8); /* arbitrary starting array size */ astate->dvalues = (Datum *) MemoryContextAlloc(arr_context, astate->alen * sizeof(Datum)); astate->dnulls = (bool *) --- 5033,5039 ---- MemoryContextAlloc(arr_context, sizeof(ArrayBuildState)); astate->mcontext = arr_context; astate->private_cxt = subcontext; ! astate->alen = initsize; astate->dvalues = (Datum *) MemoryContextAlloc(arr_context, astate->alen * sizeof(Datum)); astate->dnulls = (bool *) diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c index 4346410..fff89e7 100644 *** a/src/backend/utils/adt/varlena.c --- b/src/backend/utils/adt/varlena.c *************** bytea_string_agg_transfn(PG_FUNCTION_ARG *** 453,481 **** state = PG_ARGISNULL(0) ? NULL : (StringInfo) PG_GETARG_POINTER(0); ! /* Append the value unless null. */ if (!PG_ARGISNULL(1)) { bytea *value = PG_GETARG_BYTEA_PP(1); ! /* On the first time through, we ignore the delimiter. */ if (state == NULL) state = makeStringAggState(fcinfo); ! else if (!PG_ARGISNULL(2)) { bytea *delim = PG_GETARG_BYTEA_PP(2); ! appendBinaryStringInfo(state, VARDATA_ANY(delim), VARSIZE_ANY_EXHDR(delim)); } ! appendBinaryStringInfo(state, VARDATA_ANY(value), VARSIZE_ANY_EXHDR(value)); } /* * The transition type for string_agg() is declared to be "internal", * which is a pass-by-value type the same size as a pointer. */ ! PG_RETURN_POINTER(state); } Datum --- 453,502 ---- state = PG_ARGISNULL(0) ? NULL : (StringInfo) PG_GETARG_POINTER(0); ! /* Append the value unless null, preceding it with the delimiter. */ if (!PG_ARGISNULL(1)) { bytea *value = PG_GETARG_BYTEA_PP(1); + bool isfirst = false; ! /* ! * We store the delimiter for each aggregated item, even the first ! * one, though you might think that could be discarded immediately. ! * Otherwise, partial aggregation can't work, because the combine ! * function needs the delimiter for the first item in the second ! * aggregated state. (IOW, what we now think is the first item might ! * not be first overall.) The finalfn is responsible for stripping ! * off the first delimiter. So that it can tell how much to strip, we ! * store the length of the first delimiter in the StringInfo's cursor ! * field, which we don't otherwise need here. ! */ if (state == NULL) + { state = makeStringAggState(fcinfo); ! isfirst = true; ! } ! ! if (!PG_ARGISNULL(2)) { bytea *delim = PG_GETARG_BYTEA_PP(2); ! appendBinaryStringInfo(state, VARDATA_ANY(delim), ! VARSIZE_ANY_EXHDR(delim)); ! if (isfirst) ! state->cursor = VARSIZE_ANY_EXHDR(delim); } ! appendBinaryStringInfo(state, VARDATA_ANY(value), ! VARSIZE_ANY_EXHDR(value)); } /* * The transition type for string_agg() is declared to be "internal", * which is a pass-by-value type the same size as a pointer. */ ! if (state) ! PG_RETURN_POINTER(state); ! PG_RETURN_NULL(); } Datum *************** bytea_string_agg_finalfn(PG_FUNCTION_ARG *** 490,500 **** if (state != NULL) { bytea *result; ! result = (bytea *) palloc(state->len + VARHDRSZ); ! SET_VARSIZE(result, state->len + VARHDRSZ); ! memcpy(VARDATA(result), state->data, state->len); PG_RETURN_BYTEA_P(result); } else --- 511,523 ---- if (state != NULL) { + /* As per comment in transfn, strip data before the cursor position */ bytea *result; + int strippedlen = state->len - state->cursor; ! result = (bytea *) palloc(strippedlen + VARHDRSZ); ! SET_VARSIZE(result, strippedlen + VARHDRSZ); ! 
memcpy(VARDATA(result), &state->data[state->cursor], strippedlen); PG_RETURN_BYTEA_P(result); } else *************** string_agg_transfn(PG_FUNCTION_ARGS) *** 4653,4675 **** state = PG_ARGISNULL(0) ? NULL : (StringInfo) PG_GETARG_POINTER(0); ! /* Append the value unless null. */ if (!PG_ARGISNULL(1)) { ! /* On the first time through, we ignore the delimiter. */ if (state == NULL) state = makeStringAggState(fcinfo); ! else if (!PG_ARGISNULL(2)) ! appendStringInfoText(state, PG_GETARG_TEXT_PP(2)); /* delimiter */ ! appendStringInfoText(state, PG_GETARG_TEXT_PP(1)); /* value */ } /* * The transition type for string_agg() is declared to be "internal", * which is a pass-by-value type the same size as a pointer. */ ! PG_RETURN_POINTER(state); } Datum --- 4676,4846 ---- state = PG_ARGISNULL(0) ? NULL : (StringInfo) PG_GETARG_POINTER(0); ! /* Append the value unless null, preceding it with the delimiter. */ if (!PG_ARGISNULL(1)) { ! text *value = PG_GETARG_TEXT_PP(1); ! bool isfirst = false; ! ! /* ! * We store the delimiter for each aggregated item, even the first ! * one, though you might think that could be discarded immediately. ! * Otherwise, partial aggregation can't work, because the combine ! * function needs the delimiter for the first item in the second ! * aggregated state. (IOW, what we now think is the first item might ! * not be first overall.) The finalfn is responsible for stripping ! * off the first delimiter. So that it can tell how much to strip, we ! * store the length of the first delimiter in the StringInfo's cursor ! * field, which we don't otherwise need here. ! */ if (state == NULL) + { state = makeStringAggState(fcinfo); ! isfirst = true; ! } ! if (!PG_ARGISNULL(2)) ! { ! text *delim = PG_GETARG_TEXT_PP(2); ! ! appendStringInfoText(state, delim); ! if (isfirst) ! state->cursor = VARSIZE_ANY_EXHDR(delim); ! } ! ! appendStringInfoText(state, value); } /* * The transition type for string_agg() is declared to be "internal", * which is a pass-by-value type the same size as a pointer. */ ! if (state) ! PG_RETURN_POINTER(state); ! PG_RETURN_NULL(); ! } ! ! /* ! * string_agg_combine ! * Aggregate combine function for string_agg(text) and string_agg(bytea) ! */ ! Datum ! string_agg_combine(PG_FUNCTION_ARGS) ! { ! StringInfo state1; ! StringInfo state2; ! MemoryContext agg_context; ! ! if (!AggCheckCallContext(fcinfo, &agg_context)) ! elog(ERROR, "aggregate function called in non-aggregate context"); ! ! state1 = PG_ARGISNULL(0) ? NULL : (StringInfo) PG_GETARG_POINTER(0); ! state2 = PG_ARGISNULL(1) ? NULL : (StringInfo) PG_GETARG_POINTER(1); ! ! if (state2 == NULL) ! { ! /* ! * NULL state2 is easy, just return state1, which we know is already ! * in the agg_context ! */ ! if (state1 == NULL) ! PG_RETURN_NULL(); ! PG_RETURN_POINTER(state1); ! } ! ! if (state1 == NULL) ! { ! /* We must copy state2's data into the agg_context */ ! MemoryContext old_context; ! ! old_context = MemoryContextSwitchTo(agg_context); ! state1 = makeStringAggState(fcinfo); ! appendBinaryStringInfo(state1, state2->data, state2->len); ! state1->cursor = state2->cursor; ! MemoryContextSwitchTo(old_context); ! } ! else if (state2->len > 0) ! { ! /* Combine ... state1->cursor does not change in this case */ ! appendBinaryStringInfo(state1, state2->data, state2->len); ! } ! ! PG_RETURN_POINTER(state1); ! } ! ! /* ! * string_agg_serialize ! * Aggregate serialize function for string_agg(text) and string_agg(bytea) ! * ! * This is strict, so we need not handle NULL input ! */ ! Datum ! 
string_agg_serialize(PG_FUNCTION_ARGS) ! { ! StringInfo state; ! StringInfoData buf; ! bytea *result; ! ! /* cannot be called directly because of internal-type argument */ ! Assert(AggCheckCallContext(fcinfo, NULL)); ! ! state = (StringInfo) PG_GETARG_POINTER(0); ! ! pq_begintypsend(&buf); ! ! /* cursor */ ! pq_sendint(&buf, state->cursor, 4); ! ! /* data */ ! pq_sendbytes(&buf, state->data, state->len); ! ! result = pq_endtypsend(&buf); ! ! PG_RETURN_BYTEA_P(result); ! } ! ! /* ! * string_agg_deserialize ! * Aggregate deserial function for string_agg(text) and string_agg(bytea) ! * ! * This is strict, so we need not handle NULL input ! */ ! Datum ! string_agg_deserialize(PG_FUNCTION_ARGS) ! { ! bytea *sstate; ! StringInfo result; ! StringInfoData buf; ! char *data; ! int datalen; ! ! /* cannot be called directly because of internal-type argument */ ! Assert(AggCheckCallContext(fcinfo, NULL)); ! ! sstate = PG_GETARG_BYTEA_PP(0); ! ! /* ! * Copy the bytea into a StringInfo so that we can "receive" it using the ! * standard recv-function infrastructure. ! */ ! initStringInfo(&buf); ! appendBinaryStringInfo(&buf, ! VARDATA_ANY(sstate), VARSIZE_ANY_EXHDR(sstate)); ! ! result = makeStringAggState(fcinfo); ! ! /* cursor */ ! result->cursor = pq_getmsgint(&buf, 4); ! ! /* data */ ! datalen = VARSIZE_ANY_EXHDR(sstate) - 4; ! data = (char *) pq_getmsgbytes(&buf, datalen); ! appendBinaryStringInfo(result, data, datalen); ! ! pq_getmsgend(&buf); ! pfree(buf.data); ! ! PG_RETURN_POINTER(result); } Datum *************** string_agg_finalfn(PG_FUNCTION_ARGS) *** 4683,4689 **** state = PG_ARGISNULL(0) ? NULL : (StringInfo) PG_GETARG_POINTER(0); if (state != NULL) ! PG_RETURN_TEXT_P(cstring_to_text_with_len(state->data, state->len)); else PG_RETURN_NULL(); } --- 4854,4864 ---- state = PG_ARGISNULL(0) ? NULL : (StringInfo) PG_GETARG_POINTER(0); if (state != NULL) ! { ! /* As per comment in transfn, strip data before the cursor position */ ! PG_RETURN_TEXT_P(cstring_to_text_with_len(&state->data[state->cursor], ! state->len - state->cursor)); ! } else PG_RETURN_NULL(); } diff --git a/src/include/catalog/pg_aggregate.h b/src/include/catalog/pg_aggregate.h index 125bb5b..b624f4e 100644 *** a/src/include/catalog/pg_aggregate.h --- b/src/include/catalog/pg_aggregate.h *************** DATA(insert ( 2243 n 0 bitor - bitor *** 300,313 **** DATA(insert ( 2901 n 0 xmlconcat2 - - - - - - - f f r r 0 142 0 0 0 _null_ _null_ )); /* array */ ! DATA(insert ( 2335 n 0 array_agg_transfn array_agg_finalfn - - - - - - t f r r 0 2281 0 0 0 _null_ _null_ )); ! DATA(insert ( 4053 n 0 array_agg_array_transfn array_agg_array_finalfn - - - - - - t f r r 0 2281 0 0 0 _null_ _null_ )); /* text */ ! DATA(insert ( 3538 n 0 string_agg_transfn string_agg_finalfn - - - - - - f f r r 0 2281 0 0 0 _null_ _null_ )); /* bytea */ ! DATA(insert ( 3545 n 0 bytea_string_agg_transfn bytea_string_agg_finalfn - - - - - - f f r r 0 2281 0 0 0 _null_ _null_ )); /* json */ DATA(insert ( 3175 n 0 json_agg_transfn json_agg_finalfn - - - - - - f f r r 0 2281 0 0 0 _null_ _null_ )); --- 300,313 ---- DATA(insert ( 2901 n 0 xmlconcat2 - - - - - - - f f r r 0 142 0 0 0 _null_ _null_ )); /* array */ ! DATA(insert ( 2335 n 0 array_agg_transfn array_agg_finalfn array_agg_combine array_agg_serialize array_agg_deserialize - - - t f r r 0 2281 0 0 0 _null_ _null_ )); ! 
DATA(insert ( 4053 n 0 array_agg_array_transfn array_agg_array_finalfn array_agg_array_combine array_agg_array_serialize array_agg_array_deserialize - - - t f r r 0 2281 0 0 0 _null_ _null_ )); /* text */ ! DATA(insert ( 3538 n 0 string_agg_transfn string_agg_finalfn string_agg_combine string_agg_serialize string_agg_deserialize - - - f f r r 0 2281 0 0 0 _null__null_ )); /* bytea */ ! DATA(insert ( 3545 n 0 bytea_string_agg_transfn bytea_string_agg_finalfn string_agg_combine string_agg_serialize string_agg_deserialize - - - f f r r 0 2281 0 0 0 _null_ _null_ )); /* json */ DATA(insert ( 3175 n 0 json_agg_transfn json_agg_finalfn - - - - - - f f r r 0 2281 0 0 0 _null_ _null_ )); diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h index bfc9009..84cdc44 100644 *** a/src/include/catalog/pg_proc.h --- b/src/include/catalog/pg_proc.h *************** DATA(insert OID = 3168 ( array_replace *** 942,953 **** --- 942,965 ---- DESCR("replace any occurrences of an element in an array"); DATA(insert OID = 2333 ( array_agg_transfn PGNSP PGUID 12 1 0 0 0 f f f f f i s 2 0 2281 "2281 2776" _null_ _null_ _null__null_ _null_ array_agg_transfn _null_ _null_ _null_ )); DESCR("aggregate transition function"); + DATA(insert OID = 3423 ( array_agg_combine PGNSP PGUID 12 1 0 0 0 f f f f f i s 2 0 2281 "2281 2281" _null_ _null__null_ _null_ _null_ array_agg_combine _null_ _null_ _null_ )); + DESCR("aggregate combine function"); + DATA(insert OID = 3424 ( array_agg_serialize PGNSP PGUID 12 1 0 0 0 f f f t f i s 1 0 17 "2281" _null_ _null_ _null__null_ _null_ array_agg_serialize _null_ _null_ _null_ )); + DESCR("aggregate serial function"); + DATA(insert OID = 3425 ( array_agg_deserialize PGNSP PGUID 12 1 0 0 0 f f f t f i s 2 0 2281 "17 2281" _null_ _null__null_ _null_ _null_ array_agg_deserialize _null_ _null_ _null_ )); + DESCR("aggregate deserial function"); DATA(insert OID = 2334 ( array_agg_finalfn PGNSP PGUID 12 1 0 0 0 f f f f f i s 2 0 2277 "2281 2776" _null_ _null_ _null__null_ _null_ array_agg_finalfn _null_ _null_ _null_ )); DESCR("aggregate final function"); DATA(insert OID = 2335 ( array_agg PGNSP PGUID 12 1 0 0 0 a f f f f i s 1 0 2277 "2776" _null_ _null_ _null__null_ _null_ aggregate_dummy _null_ _null_ _null_ )); DESCR("concatenate aggregate input into an array"); DATA(insert OID = 4051 ( array_agg_array_transfn PGNSP PGUID 12 1 0 0 0 f f f f f i s 2 0 2281 "2281 2277" _null_ _null__null_ _null_ _null_ array_agg_array_transfn _null_ _null_ _null_ )); DESCR("aggregate transition function"); + DATA(insert OID = 3426 ( array_agg_array_combine PGNSP PGUID 12 1 0 0 0 f f f f f i s 2 0 2281 "2281 2281" _null_ _null__null_ _null_ _null_ array_agg_array_combine _null_ _null_ _null_ )); + DESCR("aggregate combine function"); + DATA(insert OID = 3427 ( array_agg_array_serialize PGNSP PGUID 12 1 0 0 0 f f f t f i s 1 0 17 "2281" _null_ _null_ _null__null_ _null_ array_agg_array_serialize _null_ _null_ _null_ )); + DESCR("aggregate serial function"); + DATA(insert OID = 3428 ( array_agg_array_deserialize PGNSP PGUID 12 1 0 0 0 f f f t f i s 2 0 2281 "17 2281" _null__null_ _null_ _null_ _null_ array_agg_array_deserialize _null_ _null_ _null_ )); + DESCR("aggregate deserial function"); DATA(insert OID = 4052 ( array_agg_array_finalfn PGNSP PGUID 12 1 0 0 0 f f f f f i s 2 0 2277 "2281 2277" _null_ _null__null_ _null_ _null_ array_agg_array_finalfn _null_ _null_ _null_ )); DESCR("aggregate final function"); DATA(insert OID = 4053 ( array_agg PGNSP PGUID 12 1 0 0 0 a f f f f i s 1 0 2277 
"2277" _null_ _null_ _null__null_ _null_ aggregate_dummy _null_ _null_ _null_ )); *************** DESCR("aggregate final function"); *** 2710,2715 **** --- 2722,2733 ---- DATA(insert OID = 3535 ( string_agg_transfn PGNSP PGUID 12 1 0 0 0 f f f f f i s 3 0 2281 "2281 25 25" _null_ _null__null_ _null_ _null_ string_agg_transfn _null_ _null_ _null_ )); DESCR("aggregate transition function"); + DATA(insert OID = 3429 ( string_agg_combine PGNSP PGUID 12 1 0 0 0 f f f f f i s 2 0 2281 "2281 2281" _null_ _null__null_ _null_ _null_ string_agg_combine _null_ _null_ _null_ )); + DESCR("aggregate combine function"); + DATA(insert OID = 3430 ( string_agg_serialize PGNSP PGUID 12 1 0 0 0 f f f t f i s 1 0 17 "2281" _null_ _null__null_ _null_ _null_ string_agg_serialize _null_ _null_ _null_ )); + DESCR("aggregate serial function"); + DATA(insert OID = 3431 ( string_agg_deserialize PGNSP PGUID 12 1 0 0 0 f f f t f i s 2 0 2281 "17 2281" _null_ _null__null_ _null_ _null_ string_agg_deserialize _null_ _null_ _null_ )); + DESCR("aggregate deserial function"); DATA(insert OID = 3536 ( string_agg_finalfn PGNSP PGUID 12 1 0 0 0 f f f f f i s 1 0 25 "2281" _null_ _null_ _null__null_ _null_ string_agg_finalfn _null_ _null_ _null_ )); DESCR("aggregate final function"); DATA(insert OID = 3538 ( string_agg PGNSP PGUID 12 1 0 0 0 a f f f f i s 2 0 25 "25 25" _null_ _null_ _null__null_ _null_ aggregate_dummy _null_ _null_ _null_ )); diff --git a/src/include/utils/array.h b/src/include/utils/array.h index afbb532..f99e05a 100644 *** a/src/include/utils/array.h --- b/src/include/utils/array.h *************** extern bool array_contains_nulls(ArrayTy *** 393,398 **** --- 393,401 ---- extern ArrayBuildState *initArrayResult(Oid element_type, MemoryContext rcontext, bool subcontext); + extern ArrayBuildState *initArrayResultWithSize(Oid element_type, + MemoryContext rcontext, + bool subcontext, int initsize); extern ArrayBuildState *accumArrayResult(ArrayBuildState *astate, Datum dvalue, bool disnull, Oid element_type,
On 03/26/2018 10:27 PM, Tom Lane wrote:
> David Rowley <david.rowley@2ndquadrant.com> writes:
>> [ combinefn_for_string_and_array_aggs_v7.patch ]
>
> I spent a fair amount of time hacking on this with intent to commit,
> but just as I was getting to code that I liked, I started to have second
> thoughts about whether this is a good idea at all. I quote from the fine
> manual:
>
> The aggregate functions array_agg, json_agg, jsonb_agg,
> json_object_agg, jsonb_object_agg, string_agg, and xmlagg, as well as
> similar user-defined aggregate functions, produce meaningfully
> different result values depending on the order of the input
> values. This ordering is unspecified by default, but can be controlled
> by writing an ORDER BY clause within the aggregate call, as shown in
> Section 4.2.7. Alternatively, supplying the input values from a sorted
> subquery will usually work ...
>
> I do not think it is accidental that these aggregates are exactly the ones
> that do not have parallelism support today. Rather, that's because you
> just about always have an interest in the order in which the inputs get
> aggregated, which is something that parallel aggregation cannot support.
>

I don't think that's quite true. I know plenty of people who do things
like this:

SELECT a, b, avg(c), sum(d), array_agg(e), array_agg(f), string_agg(g)
FROM hugetable
GROUP BY a,b
HAVING avg(c) > 100.89;

and then do some additional processing on the result in some way
(subquery, matview, ...). They don't really care about ordering of
values in the arrays, as long as orderings of all the arrays match.

Currently queries like this can't use parallelism at all, and the patch
fixes that I believe.

> I fear that what will happen, if we commit this, is that something like
> 0.01% of the users of array_agg and string_agg will be pleased, another
> maybe 20% will be unaffected because they wrote ORDER BY which prevents
> parallel aggregation, and the remaining 80% will scream because we broke
> their queries. Telling them they should've written ORDER BY isn't going
> to cut it, IMO, when the benefit of that breakage will accrue only to some
> very tiny fraction of use-cases.
>

Isn't the ordering unreliable *already*? It depends on ordering of
tuples on the input. So if the table is scanned by index scan or
sequential scan, that will affect the array_agg/string_agg results. If
the input is a join, it's even more volatile.

IMHO it's not like we're making the ordering unpredictable - it's been
like that since forever.

Also, how is this different from ORDER BY clause? If a user does not
specify an ORDER BY clause, I don't think we'd care very much about
changes to output ordering due to plan changes, for example.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: > On 03/26/2018 10:27 PM, Tom Lane wrote: >> I fear that what will happen, if we commit this, is that something like >> 0.01% of the users of array_agg and string_agg will be pleased, another >> maybe 20% will be unaffected because they wrote ORDER BY which prevents >> parallel aggregation, and the remaining 80% will scream because we broke >> their queries. Telling them they should've written ORDER BY isn't going >> to cut it, IMO, when the benefit of that breakage will accrue only to some >> very tiny fraction of use-cases. > Isn't the ordering unreliable *already*? Not if the query is such that what gets chosen is, say, an indexscan or mergejoin. It might be theoretically unreliable and yet work fine for a given application. I might be too pessimistic about the fraction of users who are depending on ordered input without having written anything that explicitly forces that ... but I stand by the theory that it substantially exceeds the fraction of users who could get any benefit. Your own example of assuming that separate aggregates are computed in the same order reinforces my point, I think. In principle, anybody who's doing that should write array_agg(e order by x), array_agg(f order by x), string_agg(g order by x) because otherwise they shouldn't assume that; the manual certainly doesn't promise it. But nobody does that in production, because if they did they'd get killed by the fact that the sorts are all done independently. (We should improve that someday, but it hasn't been done yet.) So I think there are an awful lot of people out there who are assuming more than a lawyerly reading of the manual would allow. Their reaction to this will be about like ours every time the GCC guys decide that some longstanding behavior of C code isn't actually promised by the text of the C standard. regards, tom lane
Greetings, * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: > On 03/26/2018 10:27 PM, Tom Lane wrote: > > I fear that what will happen, if we commit this, is that something like > > 0.01% of the users of array_agg and string_agg will be pleased, another > > maybe 20% will be unaffected because they wrote ORDER BY which prevents > > parallel aggregation, and the remaining 80% will scream because we broke > > their queries. Telling them they should've written ORDER BY isn't going > > to cut it, IMO, when the benefit of that breakage will accrue only to some > > very tiny fraction of use-cases. > > Isn't the ordering unreliable *already*? It depends on ordering of > tuples on the input. So if the table is scanned by index scan or > sequential scan, that will affect the array_agg/string_agg results. If > the input is a join, it's even more volatile. It's not even just that- for a seq scan, it'll also depend on if anyone else is also scanning the table as we might not start at the beginning of the table thanks to synchronize seq scans. > IMHO it's not like we're making the ordering unpredictable - it's been > like that since forever. Yeah, there certainly seems like a lot of opportunity for the ordering to end up being volatile already and queries depending on it not changing really shouldn't be making that assumption. I do think it was probably a bad move on our part to say that ordering a subquery will "usually" work in the documentation but having that slip-up in the documentation mean that we actually *are* going to guarantee it'll always work to use a subquery ordering to feed an aggregate is a pretty big stretch (and is it even really true today anyway? the documentation certainly doesn't seem to be clear on that...). > Also, how is this different from ORDER BY clause? If a user does not > specify an ORDER BY clause, I don't think we'd care very much about > changes to output ordering due to plan changes, for example. This seems like it boils down to "well, everyone *knows* that rows can come back in any order" combined with "our docs claim that an ordered subquery will *usually* work". In the end, I do tend to agree that we probably should add parallel support to these aggregates, but it'd also be nice to hear from those who had worked to add parallelism to the various aggregates as to if there was some reason these were skipped. Thanks! Stephen
Stephen Frost <sfrost@snowman.net> writes: > Yeah, there certainly seems like a lot of opportunity for the ordering > to end up being volatile already and queries depending on it not > changing really shouldn't be making that assumption. I do think it was > probably a bad move on our part to say that ordering a subquery will > "usually" work in the documentation but having that slip-up in the > documentation mean that we actually *are* going to guarantee it'll > always work to use a subquery ordering to feed an aggregate is a pretty > big stretch (and is it even really true today anyway? the documentation > certainly doesn't seem to be clear on that...). It's true as long as there's nothing in the outer query that would cause re-ordering of the subquery output. I'd be the first to agree that the wording is a bit sloppy, but I think there are lots of people depending on that behavior, not least because that was the only way to do it before we supported ORDER BY inside an aggregate call. Now, that doesn't directly bear on the argument at hand, because if you're aggregating over the result of an ORDER BY subquery then you aren't doing parallel aggregation anyway. (Perhaps there's parallelism inside the subquery, but that's irrelevant.) It is related in the sense that there are lots of people depending on things other than ORDER-BY-in-the- aggregate-call to get the results they expect. regards, tom lane
On 27 March 2018 at 10:20, Stephen Frost <sfrost@snowman.net> wrote: > In the end, I do tend to agree that we probably should add parallel > support to these aggregates, but it'd also be nice to hear from those > who had worked to add parallelism to the various aggregates as to if > there was some reason these were skipped. That was me. I originally skipped string_agg and array_agg as I thought parallelism would not be much use due to the growth of the size of the state. I thought the combine function could become very expensive as it has to work on each aggregated value, instead of a consolidated result, as the other aggregates provide. The reason for adding this now and not back then is: 1. I was wrong, parallel aggregation can improve performance for these aggregates, and; 2. It's annoying that a single non-partial aggregate in the query disabled parallel aggregation, and; 3. When I originally considered the performance I didn't consider that some queries may filter values before aggregation via a FILTER or a WHERE clause. 4. It's not just parallel aggregation that uses partial aggregation, we can now push aggregation below Append for partitioned tables in some cases. Hopefully, we'll also have "group before join" one day too. While 2 is still true, there should be a reduction in the times this happens. Just the json and xml aggs holding us back from the standard set now. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
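To illustrate point 3 above, here is a minimal sketch (table and column names are invented) of the kind of query where most rows are thrown away before aggregation, so each worker's string_agg/array_agg state stays small:

SELECT device_id,
       array_agg(payload) FILTER (WHERE status = 'error') AS error_payloads
FROM event_log
WHERE created_at >= now() - interval '1 day'
GROUP BY device_id;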
On 27 March 2018 at 09:27, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I spent a fair amount of time hacking on this with intent to commit,
> but just as I was getting to code that I liked, I started to have second
> thoughts about whether this is a good idea at all. I quote from the fine
> manual:
>
> The aggregate functions array_agg, json_agg, jsonb_agg,
> json_object_agg, jsonb_object_agg, string_agg, and xmlagg, as well as
> similar user-defined aggregate functions, produce meaningfully
> different result values depending on the order of the input
> values. This ordering is unspecified by default, but can be controlled
> by writing an ORDER BY clause within the aggregate call, as shown in
> Section 4.2.7. Alternatively, supplying the input values from a sorted
> subquery will usually work ...
>
> I do not think it is accidental that these aggregates are exactly the ones
> that do not have parallelism support today. Rather, that's because you
> just about always have an interest in the order in which the inputs get
> aggregated, which is something that parallel aggregation cannot support.

This was not in my list of reasons for not adding them the first time
around. I mentioned these reasons in a response to Stephen.

> I fear that what will happen, if we commit this, is that something like
> 0.01% of the users of array_agg and string_agg will be pleased, another
> maybe 20% will be unaffected because they wrote ORDER BY which prevents
> parallel aggregation, and the remaining 80% will scream because we broke
> their queries. Telling them they should've written ORDER BY isn't going
> to cut it, IMO, when the benefit of that breakage will accrue only to some
> very tiny fraction of use-cases.

This very much reminds me of something that exists in the 8.4 release notes:

> SELECT DISTINCT and UNION/INTERSECT/EXCEPT no longer always produce sorted output (Tom)
> Previously, these types of queries always removed duplicate rows by means
> of Sort/Unique processing (i.e., sort then remove adjacent duplicates).
> Now they can be implemented by hashing, which will not produce sorted
> output. If an application relied on the output being in sorted order, the
> recommended fix is to add an ORDER BY clause. As a short-term workaround,
> the previous behavior can be restored by disabling enable_hashagg, but
> that is a very performance-expensive fix. SELECT DISTINCT ON never uses
> hashing, however, so its behavior is unchanged.

Seems we were happy enough then to tell users to add an ORDER BY.

However, this case is different, since before the results were always
ordered. This time they're possibly ordered. So we'll probably surprise
fewer people this time around.

--
David Rowley  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
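For anyone relying on a particular aggregation order, the workaround mirrors the 8.4 advice above: state the order explicitly inside the aggregate call. A minimal sketch, reusing the hypothetical hugetable example from upthread:

SELECT a,
       string_agg(g, ',' ORDER BY g) AS gs,
       array_agg(e ORDER BY e) AS es
FROM hugetable
GROUP BY a;

As already noted, an ORDER BY inside the aggregate call prevents parallel aggregation of that aggregate, so results written this way stay deterministic either way.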
On 03/26/2018 11:19 PM, Tom Lane wrote: > Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: >> On 03/26/2018 10:27 PM, Tom Lane wrote: >>> I fear that what will happen, if we commit this, is that something like >>> 0.01% of the users of array_agg and string_agg will be pleased, another >>> maybe 20% will be unaffected because they wrote ORDER BY which prevents >>> parallel aggregation, and the remaining 80% will scream because we broke >>> their queries. Telling them they should've written ORDER BY isn't going >>> to cut it, IMO, when the benefit of that breakage will accrue only to some >>> very tiny fraction of use-cases. > >> Isn't the ordering unreliable *already*? > > Not if the query is such that what gets chosen is, say, an indexscan > or mergejoin. It might be theoretically unreliable and yet work fine > for a given application. > What about parallel versions of those plans, though? That is, what prevents the planner from generating plans like this Hash Aggregate -> Gather -> Partial Index Scan AFAIK that's a valid plan, and it does not have the stable ordering property of a non-parallel index scan. Similarly for Merge Join. > I might be too pessimistic about the fraction of users who are depending > on ordered input without having written anything that explicitly forces > that ... but I stand by the theory that it substantially exceeds the > fraction of users who could get any benefit. > Not sure, it's pretty much impossible to produce a number that would meaningfully represent the whole user base. All I can say is that the number of people I know, who would benefit from this is significant. > > Your own example of assuming that separate aggregates are computed > in the same order reinforces my point, I think. In principle, anybody > who's doing that should write > > array_agg(e order by x), > array_agg(f order by x), > string_agg(g order by x) > > because otherwise they shouldn't assume that; the manual certainly doesn't > promise it. But nobody does that in production, because if they did > they'd get killed by the fact that the sorts are all done independently. Yes, the manual does not promise that. But I find that expectation way more "natural" than deterministic ordering without ORDER BY clause. I expect each tuple to be processed "at once" - that it's added to all arrays in the same step, irrespectedly of the ordering of input tuples. > (We should improve that someday, but it hasn't been done yet.) So I think > there are an awful lot of people out there who are assuming more than a > lawyerly reading of the manual would allow. Their reaction to this will > be about like ours every time the GCC guys decide that some longstanding > behavior of C code isn't actually promised by the text of the C standard. > But we're not *already* providing reliable ordering without an ORDER BY clause (both for queries as a whole and array_agg/string_agg). It can produce arbitrary ordering, depending on what plan you happen to get. So I don't think we'd be changing any longstanding behavior, because it's already unreliable. If it was semi-reliable before, assuming the query only ever chose index scans or merge joins, it likely got broken by introduction of parallel versions of those plans. If the users disable parallel queries (to get the old plans), they're going to get non-parallel semi-reliable plans again. OTOH we are currently accumulating all the arrays in the same order, so the results will match. (At least that's what I believe is the current reliable behavior.) 
So following the GCC analogy, changing this seems much more akin to GCC people changing some longstanding behavior. Also, if the arrays were not accumulated in the same way reliably, wouldn't this mean all the discussion about reliable ordering with certain plans is pointless? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
David Rowley <david.rowley@2ndquadrant.com> writes: > On 27 March 2018 at 09:27, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I do not think it is accidental that these aggregates are exactly the ones >> that do not have parallelism support today. Rather, that's because you >> just about always have an interest in the order in which the inputs get >> aggregated, which is something that parallel aggregation cannot support. > This very much reminds me of something that exists in the 8.4 release notes: >> SELECT DISTINCT and UNION/INTERSECT/EXCEPT no longer always produce sorted output (Tom) That's a completely false analogy: we got a significant performance benefit for a significant fraction of users by supporting hashed aggregation. My argument here is that only a very tiny fraction of string_agg/array_agg users will not care about aggregation order, and thus I don't believe that this patch can help very many people. Against that, it's likely to hurt other people, by breaking their queries and forcing them to insert expensive explicit sorts to fix it. Even discounting the backwards-compatibility question, we don't normally adopt performance features for which it's unclear that the net gain over all users is positive. Maybe what that says is that rather than giving up on this altogether, we should shelve it till we have less stupid planner+executor behavior for ORDER BY inside aggregates. That's been on the to-do list for a long while, but not risen to the top ... regards, tom lane
I wrote: > The main thing that remains undone is to get some test coverage --- > AFAICS, none of these new functions get exercised in the standard > regression tests. Oh, I thought of another thing that would need to get done, if we decide to commit this. array_agg_serialize/deserialize only work if the array element type has send/receive functions. The planner's logic to decide whether partial aggregation is possible doesn't look any deeper than whether the aggregate has serialize/deserialize, but I think we have to teach it that if it's these particular serialize/deserialize functions, it needs to check the array element type. Otherwise we'll get runtime failures if partial aggregation is attempted on array_agg(). This would not matter for any core datatypes, since they all have send/receive functions, but I imagine it's still pretty common for extension types not to. For my money it'd be sufficient to hard-code the test like if ((aggserialfn == F_ARRAY_AGG_SERIALIZE || aggdeserialfn == F_ARRAY_AGG_DESERIALIZE) && !array_element_type_has_send_and_recv(exprType(aggregate input))) Someday we might need more flexible testing, but that can be solved some other time. regards, tom lane
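On the test-coverage point, a rough sketch of the kind of query a regression test might use to push these aggregates through the partial-aggregation path (the cost settings and table here are made up, and the eventual tests may well look different):

SET parallel_setup_cost = 0;
SET parallel_tuple_cost = 0;
SET min_parallel_table_scan_size = 0;
SET max_parallel_workers_per_gather = 2;

CREATE TABLE agg_src AS
  SELECT g AS x, (g % 10)::text AS y FROM generate_series(1, 10000) g;
ANALYZE agg_src;

-- order-insensitive results, so the output does not depend on which
-- worker happened to aggregate which rows
SELECT length(string_agg(y, ',')), cardinality(array_agg(x)) FROM agg_src;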
On 27 March 2018 at 11:28, Tom Lane <tgl@sss.pgh.pa.us> wrote: > David Rowley <david.rowley@2ndquadrant.com> writes: >> This very much reminds me of something that exists in the 8.4 release notes: >>> SELECT DISTINCT and UNION/INTERSECT/EXCEPT no longer always produce sorted output (Tom) > > That's a completely false analogy: we got a significant performance > benefit for a significant fraction of users by supporting hashed > aggregation. My argument here is that only a very tiny fraction of > string_agg/array_agg users will not care about aggregation order, and thus > I don't believe that this patch can help very many people. Against that, > it's likely to hurt other people, by breaking their queries and forcing > them to insert expensive explicit sorts to fix it. Even discounting the > backwards-compatibility question, we don't normally adopt performance > features for which it's unclear that the net gain over all users is > positive. I don't believe you can go and claim this is a false analogy based on your estimates on the number of cases in which each applies. The analogy which I was pointing at was that we've been here before... we've had versions of the planner which generated plans which would have a deterministic sort order and we've later discovered that we can improve performance by allowing the planner to have more flexibility to do things in different ways which may no longer provide implicitly sorted results. We've previously recognised that some users may have become accustomed to the previous behaviour and we've mentioned a workaround in the release notes so that those users are not out in the cold. This seems exactly the same to me, and certainly not "false". I have to say, it really would be a shame to have this concern block us from future optimisations in aggregation. I also think that anyone who expects string_agg to always aggregate in the same order has not done a good job of looking at the docs. I see you already quoted the "This ordering is unspecified by default, but can be controlled by writing an ORDER BY clause within the aggregate call, as shown in Section 4.2.7." part. This text appears in all version of PostgreSQL that we currently support. We do sometimes change things where the docs say something like "this works, but don't rely on it, we might change in the future", in this case, we've never even claimed that it worked in the first place! Anyway, the options are not zero for anyone who is strongly affected with no other workaround. They just need to disable parallel query, which to me seems fairly similar to the 8.4 release note's "the previous behavior can be restored by disabling enable_hashagg" If we go by your argument then we should perhaps remove parallel aggregation for all the floating point types too, since the order in which such values are aggregated also can affect the result. I mentioned this in [1], but nobody seemed too concerned at the time. I see some other discussion on this whole topic in [2]. Looks like parallel array_agg would please the PostGIS people. [1] https://www.postgresql.org/message-id/CAKJS1f8QRDLvewk336SzUzxiXH1wBHG8rdKsqWEHbAraMXA2_Q%40mail.gmail.com [2] https://www.postgresql.org/message-id/flat/CAFjFpRe9W5xvYai-QOs6RshrJf7gWFsiZEZtxnu8vD4qLQZ3LQ%40mail.gmail.com#CAFjFpRe9W5xvYai-QOs6RshrJf7gWFsiZEZtxnu8vD4qLQZ3LQ@mail.gmail.com -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
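For completeness, the "just disable parallel query" workaround mentioned above can be applied per session; one way, shown as a sketch:

SET max_parallel_workers_per_gather = 0;  -- planner stops choosing parallel plans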
Hi, On 2018-03-27 13:14:15 +1300, David Rowley wrote: > I have to say, it really would be a shame to have this concern block > us from future optimisations in aggregation. Yea, I think that's an important point. By the dint of Tom's argument we're never going to be able to provide parallelism for any aggregate where input order matters. It's not like users are magically going to stop depending on it, if they already do. Seems more likely that we'll grow more users depending on it. We could theoretically provide differently named aggregates where one allows parallelism, the other doesn't, without duplicating the backing code... Not pretty tho. Greetings, Andres Freund
David Rowley wrote: > Anyway, the options are not zero for anyone who is strongly affected > with no other workaround. They just need to disable parallel query, > which to me seems fairly similar to the 8.4 release note's "the > previous behavior can be restored by disabling enable_hashagg" synchronized_seqscans is another piece of precedent in the area, FWIW. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 27 March 2018 at 13:26, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > David Rowley wrote: > >> Anyway, the options are not zero for anyone who is strongly affected >> with no other workaround. They just need to disable parallel query, >> which to me seems fairly similar to the 8.4 release note's "the >> previous behavior can be restored by disabling enable_hashagg" > > synchronized_seqscans is another piece of precedent in the area, FWIW. This is true. I guess the order of aggregation could be made more certain if we remove the cost based optimiser completely, and just rely on a syntax based optimiser. Unfortunately, even with that UPDATE will still reuse free space created by DELETE+VACUUM, so we'd need to get rid of VACUUM too. Of course, I'm not being serious, but I was just trying to show there's a lot more to be done if we want to make the aggregate input order deterministic without an ORDER BY in the aggregate function. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 27 March 2018 at 12:49, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I wrote: >> The main thing that remains undone is to get some test coverage --- >> AFAICS, none of these new functions get exercised in the standard >> regression tests. > > Oh, I thought of another thing that would need to get done, if we decide > to commit this. array_agg_serialize/deserialize only work if the array > element type has send/receive functions. The planner's logic to decide > whether partial aggregation is possible doesn't look any deeper than > whether the aggregate has serialize/deserialize, but I think we have to > teach it that if it's these particular serialize/deserialize functions, > it needs to check the array element type. Otherwise we'll get runtime > failures if partial aggregation is attempted on array_agg(). > > This would not matter for any core datatypes, since they all have > send/receive functions, but I imagine it's still pretty common for > extension types not to. > > For my money it'd be sufficient to hard-code the test like > > if ((aggserialfn == F_ARRAY_AGG_SERIALIZE || > aggdeserialfn == F_ARRAY_AGG_DESERIALIZE) && > !array_element_type_has_send_and_recv(exprType(aggregate input))) I'd failed to consider this. Would it not be cleaner to just reject trans types without a send/receive function? Or do you think that would exclude other valid cases? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 27 March 2018 at 13:45, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 27 March 2018 at 12:49, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Oh, I thought of another thing that would need to get done, if we decide
>> to commit this. array_agg_serialize/deserialize only work if the array
>> element type has send/receive functions. The planner's logic to decide
>> whether partial aggregation is possible doesn't look any deeper than
>> whether the aggregate has serialize/deserialize, but I think we have to
>> teach it that if it's these particular serialize/deserialize functions,
>> it needs to check the array element type. Otherwise we'll get runtime
>> failures if partial aggregation is attempted on array_agg().
>>
>> This would not matter for any core datatypes, since they all have
>> send/receive functions, but I imagine it's still pretty common for
>> extension types not to.
>>
>> For my money it'd be sufficient to hard-code the test like
>>
>> if ((aggserialfn == F_ARRAY_AGG_SERIALIZE ||
>>      aggdeserialfn == F_ARRAY_AGG_DESERIALIZE) &&
>>     !array_element_type_has_send_and_recv(exprType(aggregate input)))
>
> I'd failed to consider this.
>
> Would it not be cleaner to just reject trans types without a
> send/receive function? Or do you think that would exclude other valid
> cases?

Seems I didn't mean "trans types". I should have said aggregate
function argument types.

The attached patch adds this check.

--
David Rowley  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
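As an aside, a quick catalog query (a sketch) shows which base types in a given database lack binary send/receive functions, i.e. the types for which such a check would have to disable partial aggregation of array_agg:

SELECT typname
FROM pg_type
WHERE typtype = 'b'
  AND (typsend = 0 OR typreceive = 0);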
On Tue, Mar 27, 2018 at 12:28 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> David Rowley <david.rowley@2ndquadrant.com> writes:
>> On 27 March 2018 at 09:27, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I do not think it is accidental that these aggregates are exactly the ones
>>> that do not have parallelism support today. Rather, that's because you
>>> just about always have an interest in the order in which the inputs get
>>> aggregated, which is something that parallel aggregation cannot support.
>
>> This very much reminds me of something that exists in the 8.4 release notes:
>>> SELECT DISTINCT and UNION/INTERSECT/EXCEPT no longer always produce sorted output (Tom)
>
> That's a completely false analogy: we got a significant performance
> benefit for a significant fraction of users by supporting hashed
> aggregation. My argument here is that only a very tiny fraction of
> string_agg/array_agg users will not care about aggregation order, and thus
> I don't believe that this patch can help very many people. Against that,
> it's likely to hurt other people, by breaking their queries and forcing
> them to insert expensive explicit sorts to fix it. Even discounting the
> backwards-compatibility question, we don't normally adopt performance
> features for which it's unclear that the net gain over all users is
> positive.

I think you are quite wrong in claiming that only a tiny fraction of
the users are going to care.

This may, and quite probably does, hold true for string_agg(), but not
for array_agg(). I see a lot of cases where people use that to load it
into an unordered array/hashmap/set/whatever on the client side, which
loses ordering *anyway*, and they would definitely benefit from it.
On Tue, Mar 27, 2018 at 5:36 PM, Magnus Hagander <magnus@hagander.net> wrote: > On Tue, Mar 27, 2018 at 12:28 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> >> David Rowley <david.rowley@2ndquadrant.com> writes: >> > On 27 March 2018 at 09:27, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> >> I do not think it is accidental that these aggregates are exactly the >> >> ones >> >> that do not have parallelism support today. Rather, that's because you >> >> just about always have an interest in the order in which the inputs get >> >> aggregated, which is something that parallel aggregation cannot >> >> support. >> >> > This very much reminds me of something that exists in the 8.4 release >> > notes: >> >> SELECT DISTINCT and UNION/INTERSECT/EXCEPT no longer always produce >> >> sorted output (Tom) >> >> That's a completely false analogy: we got a significant performance >> benefit for a significant fraction of users by supporting hashed >> aggregation. My argument here is that only a very tiny fraction of >> string_agg/array_agg users will not care about aggregation order, and thus >> I don't believe that this patch can help very many people. Against that, >> it's likely to hurt other people, by breaking their queries and forcing >> them to insert expensive explicit sorts to fix it. Even discounting the >> backwards-compatibility question, we don't normally adopt performance >> features for which it's unclear that the net gain over all users is >> positive. > > > I think you are quite wrong in claiming that only a tiny fraction of the > users are going to care. > > This may, and quite probably does, hold true for string_agg(), but not for > array_agg(). I see a lot of cases where people use that to load it into an > unordered array/hashmap/set/whatever on the client side, which looses > ordering *anyway*,and they would definitely benefit from it. Agreed, I have seen lots of uses of array_agg where the order didn't matter. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
David Rowley <david.rowley@2ndquadrant.com> writes: > On 27 March 2018 at 13:26, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: >> synchronized_seqscans is another piece of precedent in the area, FWIW. > This is true. I guess the order of aggregation could be made more > certain if we remove the cost based optimiser completely, and just > rely on a syntax based optimiser. None of this is responding to my point. I think the number of people who actually don't care about aggregation order for these aggregates is negligible, and none of you have argued against that; you've instead selected straw men to attack. regards, tom lane
On 03/27/2018 04:58 PM, Tom Lane wrote: > David Rowley <david.rowley@2ndquadrant.com> writes: >> On 27 March 2018 at 13:26, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: >>> synchronized_seqscans is another piece of precedent in the area, FWIW. > >> This is true. I guess the order of aggregation could be made more >> certain if we remove the cost based optimiser completely, and just >> rely on a syntax based optimiser. > > None of this is responding to my point. I think the number of people > who actually don't care about aggregation order for these aggregates > is negligible, and none of you have argued against that; you've > instead selected straw men to attack. > I don't quite see what the straw men are, TBH. (And David's response was obviously meant as a tongue-in-cheek.) Let me try addressing your point, as stated in your earlier message. On 03/27/2018 12:28 AM, Tom Lane wrote: > My argument here is that only a very tiny fraction of > string_agg/array_agg users will not care about aggregation order, > and thus I don't believe that this patch can help very many people. > Against that, it's likely to hurt other people, by breaking their > queries and forcing them to insert expensive explicit sorts to fix > it. You're claiming this is a choice between "reliable aggregation order" vs. "parallel aggregation." I find that dubious. My argument is that we do not provide any reliable aggregation order, unless the user supplies ORDER BY in the aggregate, or takes care about input ordering (essentially doing a subquery with ORDER BY). Users may rely on getting the "right" plan, but unless they take special care (disabling some plans etc.) that's just pure luck. Not only is there plenty of stuff that can trigger plan change (misestimates, data growth, ...) but we also add new plans that don't produce the same ordering (e.g. parallel versions of some plans). And if they already take the extra care, they can disable parallel query to get the same plan. It's true that's all or nothing, and they might be benefiting from parallel query in different part of the query, in which case they'll need to modify the query. > Even discounting the backwards-compatibility question, we don't > normally adopt performance features for which it's unclear that the > net gain over all users is positive. We only really have anecdotal data, both of us. We can make some guesses, but that's about it. I don't know anyone who would be relying on array_agg ordering without the explicit ORDER BY somewhere (particularly not after major upgrades). But obviously those issues only surface after the change is made. I do however know people who are badly affected by array_agg() disabling parallel aggregate for other aggregates. They don't care about aggregate order (typically it's filtering + unnest or export). > Maybe what that says is that rather than giving up on this > altogether, we should shelve it till we have less stupid > planner+executor behavior for ORDER BY inside aggregates. That's > been on the to-do list for a long while, but not risen to the top > ... That would be a valid choice if there was a reliable aggregate order. But AFAIK there isn't, as explained above. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
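As an illustration of the filtering + unnest pattern mentioned above (hypothetical names), where the element order within each array is irrelevant because the arrays are immediately flattened again:

SELECT s.a, unnest(s.es) AS e
FROM (SELECT a, array_agg(e) AS es
      FROM big_table
      WHERE e IS NOT NULL
      GROUP BY a) s;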
On 03/27/2018 04:51 AM, David Rowley wrote: > On 27 March 2018 at 13:45, David Rowley <david.rowley@2ndquadrant.com> wrote: >> On 27 March 2018 at 12:49, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Oh, I thought of another thing that would need to get done, if we decide >>> to commit this. array_agg_serialize/deserialize only work if the array >>> element type has send/receive functions. The planner's logic to decide >>> whether partial aggregation is possible doesn't look any deeper than >>> whether the aggregate has serialize/deserialize, but I think we have to >>> teach it that if it's these particular serialize/deserialize functions, >>> it needs to check the array element type. Otherwise we'll get runtime >>> failures if partial aggregation is attempted on array_agg(). >>> >>> This would not matter for any core datatypes, since they all have >>> send/receive functions, but I imagine it's still pretty common for >>> extension types not to. >>> >>> For my money it'd be sufficient to hard-code the test like >>> >>> if ((aggserialfn == F_ARRAY_AGG_SERIALIZE || >>> aggdeserialfn == F_ARRAY_AGG_DESERIALIZE) && >>> !array_element_type_has_send_and_recv(exprType(aggregate input))) >> >> I'd failed to consider this. >> >> Would it not be cleaner to just reject trans types without a >> send/receive function? Or do you think that would exclude other valid >> cases? > > Seems I didn't mean "trans types". I should have said aggregate > function argument types. > > The attached I adds this check. > I'm not sure that's better than the check proposed by Tom. An argument type without send/receive function does not necessarily mean we can't serialize/deserialize the trans value. Because who says the argument value will be embedded in the trans value? For example, I might create an aggregate to build a bloom filter, accepting anyelement. That will hash the argument value, and obviously has no issues with serializing/deserializing the trans value. This check would effectively disable parallelism for all aggregates with anyarray, anyelement, anynonarray, anyenum (and a few more) arguments. That for example includes jsonb_agg, which some people already mentioned as another candidate for enabling parallelism. Not great, I guess. Tom's solution is much more focused - it recognizes that array_agg is a fairly rare case where the (actual) argument type gets stored in the trans type, and detects that. PS: The function is misnamed - you're talking about argument types, yet the function name refers to transtypes. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: > On 03/27/2018 04:51 AM, David Rowley wrote: >> Seems I didn't mean "trans types". I should have said aggregate >> function argument types. > I'm not sure that's better than the check proposed by Tom. An argument > type without send/receive function does not necessarily mean we can't > serialize/deserialize the trans value. Because who says the argument > value will be embedded in the trans value? In general we would not know that, but *for these specific serial/ deserial functions*, we know exactly what they will do. Also, IIRC, the trans type is declared as INTERNAL, so we don't really have any hope of identifying the behavior by inspecting that type declaration. Getting a solution that would work for other polymorphic serialization functions seems like a bit of a research project to me. In the meantime, I think David's right that what we need to look at is the actual input type of the aggregate, and then assume that what's to be serialized is an array of that. Conceivably an aggregate could be built that uses these serial/deserial functions and yet its input type is something else than what it constructs an array of ... but I find it a bit hard to wrap my brain around what that would be exactly. regards, tom lane
On 03/28/2018 05:28 AM, Tom Lane wrote: > Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: >> On 03/27/2018 04:51 AM, David Rowley wrote: >>> Seems I didn't mean "trans types". I should have said aggregate >>> function argument types. > >> I'm not sure that's better than the check proposed by Tom. An argument >> type without send/receive function does not necessarily mean we can't >> serialize/deserialize the trans value. Because who says the argument >> value will be embedded in the trans value? > > In general we would not know that, but *for these specific serial/ > deserial functions*, we know exactly what they will do. Also, IIRC, > the trans type is declared as INTERNAL, so we don't really have any > hope of identifying the behavior by inspecting that type declaration. > Sure, which is why I thought your solution was better, as it was targeted at those particular functions. > Getting a solution that would work for other polymorphic serialization > functions seems like a bit of a research project to me. In the meantime, > I think David's right that what we need to look at is the actual input > type of the aggregate, and then assume that what's to be serialized is > an array of that. Conceivably an aggregate could be built that uses > these serial/deserial functions and yet its input type is something else > than what it constructs an array of ... but I find it a bit hard to > wrap my brain around what that would be exactly. > But David's fix doesn't check the aggregate to produce an array of the input type (or anyarray). It could easily be an aggregate computing a bloom filter or something like that, which has no such issues in the serial/deserial functions. Also, if it's checking aggref->aggargtypes, it'll reject anyelement parameters, no? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: > On 03/28/2018 05:28 AM, Tom Lane wrote: >> Getting a solution that would work for other polymorphic serialization >> functions seems like a bit of a research project to me. In the meantime, >> I think David's right that what we need to look at is the actual input >> type of the aggregate, and then assume that what's to be serialized is >> an array of that. Conceivably an aggregate could be built that uses >> these serial/deserial functions and yet its input type is something else >> than what it constructs an array of ... but I find it a bit hard to >> wrap my brain around what that would be exactly. > But David's fix doesn't check the aggregate to produce an array of the > input type (or anyarray). It could easily be an aggregate computing a > bloom filter or something like that, which has no such issues in the > serial/deserial functions. Oh, if he's not restricting it to these serialization functions, I agree that seems wrong. I thought the discussion was about what to do after checking the functions. > Also, if it's checking aggref->aggargtypes, it'll reject anyelement > parameters, no? I had in mind to look at exprType() of the argument. regards, tom lane
On 28 March 2018 at 03:58, Tom Lane <tgl@sss.pgh.pa.us> wrote: > David Rowley <david.rowley@2ndquadrant.com> writes: >> On 27 March 2018 at 13:26, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: >>> synchronized_seqscans is another piece of precedent in the area, FWIW. > >> This is true. I guess the order of aggregation could be made more >> certain if we remove the cost based optimiser completely, and just >> rely on a syntax based optimiser. > > None of this is responding to my point. I think the number of people > who actually don't care about aggregation order for these aggregates > is negligible, and none of you have argued against that; you've instead > selected straw men to attack. I think what everyone else is getting at is that we're not offering to improve the performance for the people who care about the order in which the values are aggregated. This patch only offers a possible performance benefit to those who don't. I mentioned and linked to a thread about someone from PostGIS asking for this, which as far as I understand has quite a large user base. You've either ignored that or think the number of people using PostGIS is negligible. As far as I understand your argument, it's about the possibility of a group of people existing who rely on the aggregation order being defined without an ORDER BY in the aggregate function. Unfortunately, it appears from the responses of many of the others who voiced an opinion about this that there is no shortage of other reasons why relying on values being aggregated in a defined order without an ORDER BY in the aggregate function arguments is a dangerous assumption to make. Several reasons were listed why this is undefined, and I mentioned the great lengths we'd need to go to in order to make the order more defined without an explicit ORDER BY, and I still imagine I'd have missed some of the reasons. I imagine the number of people who rely on the order being defined without an ORDER BY is diminishing each release as we add parallelism support for more node types. Both Andres and I agree that it's a shame to block useful optimisations due to the needs of a small diminishing group of people who are not very good at reading our carefully crafted documentation... for the past 8 years [1]. I imagine this small group of people, if they do exist, must slowly be waking up to the fact that we've been devising new ways to ruin their query results for many years now. It seems pretty strange for us to call a truce now after we've been wreaking havoc on this group for so many releases... I really do hope this part is not true, but if such a person appeared in -novice or -general asking for help, we'd be telling them to add an ORDER BY, and we'd be quoting the 8-year-old line in the documents which states that we make no guarantees in this area in the absence of an ORDER BY. If they truly have no other choice then we might consider suggesting they may get what they want if they disable parallel query, and we'd probably rhyme off a few other reasons why it might suddenly break on them again. [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blobdiff;f=doc/src/sgml/func.sgml;h=7d6125c97e5203c9d092ceec3aaf351c1a5fcf1b;hp=f2906cc82230150f72353609e9c831e90dcc10ca;hb=34d26872ed816b299eef2fa4240d55316697f42d;hpb=6a6efb964092902bf53965649c3ed78b1868b37e -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 03/28/2018 03:54 PM, Tom Lane wrote: > Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: >> On 03/28/2018 05:28 AM, Tom Lane wrote: >>> Getting a solution that would work for other polymorphic serialization >>> functions seems like a bit of a research project to me. In the meantime, >>> I think David's right that what we need to look at is the actual input >>> type of the aggregate, and then assume that what's to be serialized is >>> an array of that. Conceivably an aggregate could be built that uses >>> these serial/deserial functions and yet its input type is something else >>> than what it constructs an array of ... but I find it a bit hard to >>> wrap my brain around what that would be exactly. > >> But David's fix doesn't check the aggregate to produce an array of the >> input type (or anyarray). It could easily be an aggregate computing a >> bloom filter or something like that, which has no such issues in the >> serial/deserial functions. > > Oh, if he's not restricting it to these serialization functions, I > agree that seems wrong. I thought the discussion was about what to > do after checking the functions. > Nope, David's patch does not check what the serial/deserial functions are (and checks aggref->aggargtypes directly, not the exprType thing). >> Also, if it's checking aggref->aggargtypes, it'll reject anyelement >> parameters, no? > > I had in mind to look at exprType() of the argument. > Right, I'm fine with that. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 27 March 2018 at 09:27, Tom Lane <tgl@sss.pgh.pa.us> wrote: > The main thing that remains undone is to get some test coverage --- > AFAICS, none of these new functions get exercised in the standard > regression tests. I've attached a patch which implements some tests for the new functions. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 29 March 2018 at 03:05, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > On 03/28/2018 03:54 PM, Tom Lane wrote: >> I had in mind to look at exprType() of the argument. > > Right, I'm fine with that. Attached is v9, which is based on Tom's v8 but includes the new tests and I think the required fix to disable use of the serial/deserial function for array_agg(). -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 03/29/2018 05:49 AM, David Rowley wrote: > On 29 March 2018 at 03:05, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> On 03/28/2018 03:54 PM, Tom Lane wrote: >>> I had in mind to look at exprType() of the argument. >> >> Right, I'm fine with that. > > Attached is v9, which is based on Tom's v8 but includes the new tests > and I think the required fix to disable use of the serial/deserial > function for array_agg(). > I have only looked at the diff, but it seems fine to me (in the sense that it's doing the right checks to disable parallelism only when really needed). FWIW I wonder if we might simply fallback to input/output functions when the send/receive functions are not available. Not sure it's worth it. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 30 March 2018 at 02:00, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > On 03/29/2018 05:49 AM, David Rowley wrote: >> Attached is v9, which is based on Tom's v8 but includes the new tests >> and I think the required fix to disable use of the serial/deserial >> function for array_agg(). >> > > I have only looked at the diff, but it seems fine to me (in the sense > that it's doing the right checks to disable parallelism only when really > needed). > > FWIW I wonder if we might simply fallback to input/output functions when > the send/receive functions are not available. Not sure it's worth it. I think it's a corner case to have a type without send/receive, but I might just be lacking imagination. I meant to mention earlier that I coded agg_args_have_sendreceive_funcs() to only check for send/receive functions. Really we could allow byval types without send/receive functions, since the serial/deserial functions just send the raw Datums in that case, but then the function becomes agg_byref_args_have_sendreceive_funcs(), which seemed a bit obscure, so I didn't do that. Maybe I should? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 03/29/2018 03:09 PM, David Rowley wrote: > On 30 March 2018 at 02:00, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> On 03/29/2018 05:49 AM, David Rowley wrote: >>> Attached is v9, which is based on Tom's v8 but includes the new tests >>> and I think the required fix to disable use of the serial/deserial >>> function for array_agg(). >>> >> >> I have only looked at the diff, but it seems fine to me (in the sense >> that it's doing the right checks to disable parallelism only when really >> needed). >> >> FWIW I wonder if we might simply fallback to input/output functions when >> the send/receive functions are not available. Not sure it's worth it. > > I think it's a corner case to have a type without send/receive, but I > might just be lacking imagination. > Not sure, but it's a valid case. OTOH we probably don't need to obsess about handling it perfectly. > I meant to mention earlier that I coded > agg_args_have_sendreceive_funcs() to only check for send/receive > functions. Really we could allow a byval types without send/receive > functions, since the serial/deserial just send the raw datums in that > case, but then the function becomes > agg_byref_args_have_sendreceive_funcs(), which seemed a bit obscure, > so I didn't do that. Maybe I should? > I'd do that. Not sure the function name needs to change, but perhaps agg_args_support_sendreceive() would be better - it covers both byref types (which require send/receive functions) and byval (which don't). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
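As a rough illustration of what such a check covers, a catalog query along these lines (just a sketch) lists the pass-by-reference base types that lack binary send/receive functions - aggregate arguments of these types are the ones that would have to fall back to non-parallel aggregation:

-- byref base types without binary I/O functions
SELECT typname
FROM pg_type
WHERE typtype = 'b'
  AND NOT typbyval
  AND (typsend = 0 OR typreceive = 0);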
On 30 March 2018 at 02:55, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > On 03/29/2018 03:09 PM, David Rowley wrote: >> I meant to mention earlier that I coded >> agg_args_have_sendreceive_funcs() to only check for send/receive >> functions. Really we could allow a byval types without send/receive >> functions, since the serial/deserial just send the raw datums in that >> case, but then the function becomes >> agg_byref_args_have_sendreceive_funcs(), which seemed a bit obscure, >> so I didn't do that. Maybe I should? > > I'd do that. Not sure the function name needs to change, but perhaps > agg_args_support_sendreceive() would be better - it covers both byref > types (which require send/receive functions) and byval (which don't). The attached patch implements this. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 03/31/2018 04:42 PM, David Rowley wrote: > On 30 March 2018 at 02:55, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> On 03/29/2018 03:09 PM, David Rowley wrote: >>> I meant to mention earlier that I coded >>> agg_args_have_sendreceive_funcs() to only check for send/receive >>> functions. Really we could allow a byval types without send/receive >>> functions, since the serial/deserial just send the raw datums in that >>> case, but then the function becomes >>> agg_byref_args_have_sendreceive_funcs(), which seemed a bit obscure, >>> so I didn't do that. Maybe I should? >> >> I'd do that. Not sure the function name needs to change, but perhaps >> agg_args_support_sendreceive() would be better - it covers both byref >> types (which require send/receive functions) and byval (which don't). > > The attached patch implements this. > Seems fine to me, although we should handle the anyarray case too, I guess. That is, get_agg_clause_costs_walker() should do this too: /* Same thing for array_agg_array_(de)serialize. */ if ((aggserialfn == F_ARRAY_AGG_ARRAY_SERIALIZE || aggdeserialfn == F_ARRAY_AGG_ARRAY_DESERIALIZE) && !agg_args_support_sendreceive(aggref)) costs->hasNonSerial = true; Other than that, the patch seems fine to me, and it's already marked as RFC so I'll leave it at that. The last obstacle seems to be the argument about the risks of the patch breaking queries of people relying on the ordering. Not sure what the right next steps are in this regard ... regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi Tomas, Thanks for taking another look. On 5 April 2018 at 07:12, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > Seems fine to me, although we should handle the anyarray case too, I > guess. That is, get_agg_clause_costs_walker() should do this too: > > /* Same thing for array_agg_array_(de)serialize. */ > if ((aggserialfn == F_ARRAY_AGG_ARRAY_SERIALIZE || > aggdeserialfn == F_ARRAY_AGG_ARRAY_DESERIALIZE) && > !agg_args_support_sendreceive(aggref)) > costs->hasNonSerial = true; hmm, array_agg_array_serialize and array_agg_array_deserialize don't use the send/receive functions though, so not sure why that's required? > Other than that, the patch seems fine to me, and it's already marked as > RFC so I'll leave it at that. Thanks. > The last obstacle seems to be the argument about the risks of the patch > breaking queries of people relying on the ordering. Not sure what's are > the right next steps in this regard ... yeah, seems like a bit of a stalemate. Personally, I think if the group of people Tom mentions do exist, then they've already been through some troubled times since Parallel Query was born. I'd hope they've already put up their defenses due to the advent of that feature. I stand by my thoughts that it's pretty bizarre to draw the line here when we've probably been causing these people issues for many years already. I said my piece on this already so likely not much point in going on about it. These people are also perfectly capable of sidestepping the whole issue with setting max_parallel_workers_per_gather TO 0. Perhaps one solution is to drop the string_agg and keep the array_agg. Unsure if that would be good enough for Tom? More people seem to have voiced that array_agg ordering is generally less of a concern, which I imagine is probably true, but my opinion might not matter here as I'm the sort of person who, if I needed a specific ordering, would have written an ORDER BY clause. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Moin, On Wed, April 4, 2018 11:41 pm, David Rowley wrote: > Hi Tomas, > > Thanks for taking another look. > > On 5 April 2018 at 07:12, Tomas Vondra <tomas.vondra@2ndquadrant.com> > wrote: >> Other than that, the patch seems fine to me, and it's already marked as >> RFC so I'll leave it at that. > > Thanks. I have one more comment - sorry for not writing sooner, the flu got to me ... Somewhere in the code there is a new allocation of memory when the string grows beyond the current size - and that doubles the size. This can lead to a lot of wasted space (think: constructing a string that is a bit over 1 GByte, which would presumably allocate 2 GByte). The same issue happens when each worker allocates 512 MByte for a 256 MByte + 1 result. IMHO a factor of like 1.4 or 1.2 would work better here - not sure what the current standard is in situations like this in PG. >> The last obstacle seems to be the argument about the risks of the patch >> breaking queries of people relying on the ordering. Not sure what's are >> the right next steps in this regard ... > > yeah, seems like a bit of a stalemate. > > Personally, think if the group of people Tom mentions do exist, then > they've already been through some troubled times since Parallel Query > was born. I'd hope they've already put up their defenses due to the > advent of that feature. I stand by my thoughts that it's pretty > bizarre to draw the line here when we've probably been causing these > people issues for many years already. I said my piece on this already > so likely not much point in going on about it. These people are also > perfectly capable of sidestepping the whole issue with setting > max_parallel_workers_per_gather TO 0. Coming from the Perl side, this issue has popped up there a lot with the ordering of hash keys (e.g. "for my $key (keys %$hash) { ... }") and even tho there was never a guarantee, it often caught people by surprise. Mostly in testsuites, tho, because real output often needs some ordering, anyway. However, changes were nevertheless done, and code had to be adapted. But the world didn't end, so to speak. For PG, I think the benefits greatly outweigh the problems with the sorting order - after all, if you have code that relies on an implicit order, you've already got a problem, no? So my favourite would be something along these lines: * add the string_agg * document it in the release notes, and document workarounds/solutions (add ORDER BY, disable workers etc.) * if not already done, stress in the documentation that if you don't ORDER things, things might not come out in the order you expect - especially with parallel queries, new processing nodes etc. Best wishes, Tels PS: We use string_agg() in a case where we first agg each row, then string_agg() all rows, and the resulting string is really huge. We did run into the "out of memory"-problem, so we now use a LIMIT and assemble the resulting parts on the client side. My wish here would be to better know how large the LIMIT can be; I found it quite difficult to predict with how many rows PG runs out of memory for the string buffer, even tho all rows have almost the same length as text. But that aside, getting the parts faster with parallel agg would be very cool, too.
On 04/05/2018 09:10 PM, Tels wrote: > Moin, > > On Wed, April 4, 2018 11:41 pm, David Rowley wrote: >> Hi Tomas, >> >> Thanks for taking another look. >> >> On 5 April 2018 at 07:12, Tomas Vondra <tomas.vondra@2ndquadrant.com> >> wrote: >>> Other than that, the patch seems fine to me, and it's already marked as >>> RFC so I'll leave it at that. >> >> Thanks. > > I have one more comment - sorry for not writing sooner, the flu got to me ... > > Somewhere in the code there is a new allocation of memory when the string > grows beyond the current size - and that doubles the size. This can lead > to a lot of wasted space (think: constructing a string that is a bit over > 1 Gbyte, which would presumable allocate 2 GByte). > I don't think we support memory chunks above 1GB, so that's likely going to fail anyway. See #define MaxAllocSize ((Size) 0x3fffffff) /* 1 gigabyte - 1 */ #define AllocSizeIsValid(size) ((Size) (size) <= MaxAllocSize) But I get your point - we may be wasting space here. But that's hardly something this patch should mess with - that's a more generic allocation question. > The same issue happens when each worker allocated 512 MByte for a 256 > Mbyte + 1 result. > > IMHO a factor of like 1.4 or 1.2 would work better here - not sure what > the current standard in situations like this in PG is. > With a 2x scale factor, we only waste 25% of the space on average. Consider that you're growing because you've reached the current size, and you double the size - say, from 1MB to 2MB. But the 1MB wasted space is the worst case - in reality we'll use something between 1MB and 2MB, so 1.5MB on average. At which point we've wasted just 0.5MB, i.e. 25%. That sounds perfectly reasonable to me. Lower factor would be more expensive in terms of repalloc, for example. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 04/05/2018 05:41 AM, David Rowley wrote: > Hi Tomas, > > Thanks for taking another look. > > On 5 April 2018 at 07:12, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> Seems fine to me, although we should handle the anyarray case too, I >> guess. That is, get_agg_clause_costs_walker() should do this too: >> >> /* Same thing for array_agg_array_(de)serialize. */ >> if ((aggserialfn == F_ARRAY_AGG_ARRAY_SERIALIZE || >> aggdeserialfn == F_ARRAY_AGG_ARRAY_DESERIALIZE) && >> !agg_args_support_sendreceive(aggref)) >> costs->hasNonSerial = true; > > hmm, array_agg_array_serialize and array_agg_array_deserialize don't > use the send/receive functions though, so not sure why that's > required? > >> Other than that, the patch seems fine to me, and it's already marked as >> RFC so I'll leave it at that. > > Thanks. > >> The last obstacle seems to be the argument about the risks of the patch >> breaking queries of people relying on the ordering. Not sure what's are >> the right next steps in this regard ... > > yeah, seems like a bit of a stalemate. > > Personally, think if the group of people Tom mentions do exist, then > they've already been through some troubled times since Parallel Query > was born. I'd hope they've already put up their defenses due to the > advent of that feature. I stand by my thoughts that it's pretty > bizarre to draw the line here when we've probably been causing these > people issues for many years already. I said my piece on this already > so likely not much point in going on about it. These people are also > perfectly capable of sidestepping the whole issue with setting > max_parallel_workers_per_gather TO 0. > > Perhaps one solution is to drop the string_agg and keep the array_agg. > Unsure if that would be good enough for Tom? More people seem to have > voiced that array_agg ordering is generally less of a concern, which I > imagine is probably true, but my opinion might not matter here as I'm > the sort of person who if I needed a specific ordering I'd have > written an ORDER BY clause. > I don't really see how dropping string_agg would solve anything. Either the risks apply to both these functions, or neither of them I think. If anything, doing this only for one of them would be inconsistent and surprising, considering how similar they are. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Thanks for your input on all this. It's good to hear some words from people using the software rather than just writing it. On 6 April 2018 at 07:10, Tels <nospam-pg-abuse@bloodgate.com> wrote: > PS: We use string_agg() in a case where we first agg each row, then > string_agg() all rows, and the resulting string is really huge. We did run > into the "out of memory"-problem, so we now use a LIMIT and assembly the > resulting parts on the client side. My wish here would be to better know > how large the LIMIT can be, I found it quite difficult to predict with how > many rows PG runs out of memory for the string buffer, even tho all rows > have almost the same length as text. But that aside, getting the parts > faster with parallel agg would be very cool, too. For this problem, if you're getting "Cannot enlarge string buffer containing X bytes by Y more bytes.", then you're not actually hitting an OOM error; it's more that the required string length is above MaxAllocSize, which as mentioned by Tomas is 1GB. See enlargeStringInfo() in stringinfo.c: if (((Size) needed) >= (MaxAllocSize - (Size) str->len)) ereport(ERROR, (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), errmsg("out of memory"), errdetail("Cannot enlarge string buffer containing %d bytes by %d more bytes.", str->len, needed))); Allocating chunks in smaller increments won't help if you're getting this. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
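On the question of predicting where that limit will be hit, one rough approach (just a sketch; the table and column names are made up) is to total the byte lengths plus delimiters before aggregating:

-- Estimate the size of string_agg(txt, ',') over big_table; results
-- approaching 1GB (MaxAllocSize) will fail with the error quoted above.
SELECT sum(octet_length(txt) + 1) AS estimated_bytes
FROM big_table;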
On Mon, Mar 26, 2018 at 4:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I do not think it is accidental that these aggregates are exactly the ones > that do not have parallelism support today. Rather, that's because you > just about always have an interest in the order in which the inputs get > aggregated, which is something that parallel aggregation cannot support. That's not the reason, as has been stated by the person who wrote the code. As the person who committed the code, my recollection of the reasoning involved matches his. > I fear that what will happen, if we commit this, is that something like > 0.01% of the users of array_agg and string_agg will be pleased, another > maybe 20% will be unaffected because they wrote ORDER BY which prevents > parallel aggregation, and the remaining 80% will scream because we broke > their queries. Telling them they should've written ORDER BY isn't going > to cut it, IMO, when the benefit of that breakage will accrue only to some > very tiny fraction of use-cases. I think your estimated percentages here are wildly inaccurate. I am in favor of this patch, like everyone else who has voted with the exception of you. I have not reviewed the code, but I firmly believe that preventing parallel query is far more damaging than breaking queries for whatever fraction of users are relying on behavior that wasn't guaranteed in the first place. I do understand your concern and I respect your opinion, so please do not take the fact that I do not agree with you in this case as an indication of disrespect. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Mar 26, 2018 at 4:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I fear that what will happen, if we commit this, is that something like >> 0.01% of the users of array_agg and string_agg will be pleased, another >> maybe 20% will be unaffected because they wrote ORDER BY which prevents >> parallel aggregation, and the remaining 80% will scream because we broke >> their queries. Telling them they should've written ORDER BY isn't going >> to cut it, IMO, when the benefit of that breakage will accrue only to some >> very tiny fraction of use-cases. > I think your estimated percentages here are wildly inaccurate. My estimate for the number of people positively impacted could be off by a factor of a thousand, and it still wouldn't change the conclusion that this will hurt more people than it helps. I see that I'm in the minority on this, so I'm prepared to accept defeat, but I stand by that conclusion. regards, tom lane
> On Mar 27, 2018, at 7:58 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > David Rowley <david.rowley@2ndquadrant.com> writes: >> On 27 March 2018 at 13:26, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: >>> synchronized_seqscans is another piece of precedent in the area, FWIW. > >> This is true. I guess the order of aggregation could be made more >> certain if we remove the cost based optimiser completely, and just >> rely on a syntax based optimiser. > > None of this is responding to my point. I think the number of people > who actually don't care about aggregation order for these aggregates > is negligible, and none of you have argued against that; you've instead > selected straw men to attack. I frequently care about the order, but only to the extent that the order is stable between aggregates of several different columns, along the lines of: select array_agg(a) AS x, array_agg(b) AS y from generate_a_b_func(foo); I don't care which order the data is in, as long as x[i] and y[i] are matched correctly. It sounds like this patch would force me to write that as, for example: select array_agg(a order by a, b) AS x, array_agg(b order by a, b) AS y from generate_a_b_func(foo); which I did not need to do before. I would expect a performance regression from the two newly required sorts. So in that case I agree with Tom. But I also agree with others that I want the parallel aggregation functionality. Could we perhaps introduce some option for the aggregate to force it to be stable? Something like: select array_agg(a order undisturbed) AS x, array_agg(b order undisturbed) AS y from generate_a_b_func(foo); which would not perform an extra sort operation but would guarantee to not disturb the pre-existing sort order coming from generate_a_b_func(foo)? I don't care about the syntax / keywords in the example above. I'm just looking to get the benefits of the parallel aggregation when I don't care about ordering while preserving the order for these cases where it matters. mark
Hi, On 2018-05-01 14:09:39 -0700, Mark Dilger wrote: > I don't care which order the data is in, as long as x[i] and y[i] are > matched correctly. It sounds like this patch would force me to write > that as, for example: > > select array_agg(a order by a, b) AS x, array_agg(b order by a, b) AS y > from generate_a_b_func(foo); > > which I did not need to do before. Why would it require that? Rows are still processed row-by-row even if there's parallelism, no? Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2018-05-01 14:09:39 -0700, Mark Dilger wrote: >> I don't care which order the data is in, as long as x[i] and y[i] are >> matched correctly. It sounds like this patch would force me to write >> that as, for example: >> >> select array_agg(a order by a, b) AS x, array_agg(b order by a, b) AS y >> from generate_a_b_func(foo); >> >> which I did not need to do before. > Why would it require that? Rows are still processed row-by-row even if > there's parallelism, no? Yeah, as long as we distribute all the aggregates in the same way, it seems like they'd all see the same random-ish input ordering. I can vaguely conceive of future optimizations that might break that, but not what we have today. regards, tom lane
On 2018-05-01 17:16:16 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2018-05-01 14:09:39 -0700, Mark Dilger wrote: > >> I don't care which order the data is in, as long as x[i] and y[i] are > >> matched correctly. It sounds like this patch would force me to write > >> that as, for example: > >> > >> select array_agg(a order by a, b) AS x, array_agg(b order by a, b) AS y > >> from generate_a_b_func(foo); > >> > >> which I did not need to do before. > > > Why would it require that? Rows are still processed row-by-row even if > > there's parallelism, no? > > Yeah, as long as we distribute all the aggregates in the same way, > it seems like they'd all see the same random-ish input ordering. > I can vaguely conceive of future optimizations that might break > that, but not what we have today. Yea, a column store would with a and b being in different column sets, or a and b originating from different tables and processing the two aggregates in independent parts of the query tree, or other similar stuff could result in trouble for the above assumption. But that seems pretty unrelated to the matter at hand... Greetings, Andres Freund
> On May 1, 2018, at 2:11 PM, Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2018-05-01 14:09:39 -0700, Mark Dilger wrote: >> I don't care which order the data is in, as long as x[i] and y[i] are >> matched correctly. It sounds like this patch would force me to write >> that as, for example: >> >> select array_agg(a order by a, b) AS x, array_agg(b order by a, b) AS y >> from generate_a_b_func(foo); >> >> which I did not need to do before. > > Why would it require that? Rows are still processed row-by-row even if > there's parallelism, no? I was responding in part to Tom's upthread statement: Your own example of assuming that separate aggregates are computed in the same order reinforces my point, I think. In principle, anybody who's doing that should write array_agg(e order by x), array_agg(f order by x), string_agg(g order by x) because otherwise they shouldn't assume that; It seems Tom is saying that you can't assume separate aggregates will be computed in the same order. Hence my response. What am I missing here? mark
On 2018-05-01 14:35:46 -0700, Mark Dilger wrote: > > > On May 1, 2018, at 2:11 PM, Andres Freund <andres@anarazel.de> wrote: > > > > Hi, > > > > On 2018-05-01 14:09:39 -0700, Mark Dilger wrote: > >> I don't care which order the data is in, as long as x[i] and y[i] are > >> matched correctly. It sounds like this patch would force me to write > >> that as, for example: > >> > >> select array_agg(a order by a, b) AS x, array_agg(b order by a, b) AS y > >> from generate_a_b_func(foo); > >> > >> which I did not need to do before. > > > > Why would it require that? Rows are still processed row-by-row even if > > there's parallelism, no? > > I was responding in part to Tom's upthread statement: > > Your own example of assuming that separate aggregates are computed > in the same order reinforces my point, I think. In principle, anybody > who's doing that should write > > array_agg(e order by x), > array_agg(f order by x), > string_agg(g order by x) > > because otherwise they shouldn't assume that; > > It seems Tom is saying that you can't assume separate aggregates will be > computed in the same order. Hence my response. What am I missing here? Afaict Tom was just making a theoretical argument, and one that seems largely independent of the form of parallelism we're discussing here. Greetings, Andres Freund
On 2 May 2018 at 08:59, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Mon, Mar 26, 2018 at 4:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> I fear that what will happen, if we commit this, is that something like >>> 0.01% of the users of array_agg and string_agg will be pleased, another >>> maybe 20% will be unaffected because they wrote ORDER BY which prevents >>> parallel aggregation, and the remaining 80% will scream because we broke >>> their queries. Telling them they should've written ORDER BY isn't going >>> to cut it, IMO, when the benefit of that breakage will accrue only to some >>> very tiny fraction of use-cases. > >> I think your estimated percentages here are wildly inaccurate. > > My estimate for the number of people positively impacted could be off > by a factor of a thousand, and it still wouldn't change the conclusion > that this will hurt more people than it helps. It's probably best to have this argument again in 6 months or so. Someone might decide to put some effort into teaching the planner about ordered aggregates so it can choose to provide pre-sorted input to aggregate functions so that nodeAgg.c no longer has to perform the costly explicit sorts. Windowing functions already do this for the first window clause, so it might not be too hard to do it for the first aggregate, and any subsequent aggregates which share the same ORDER BY / DISTINCT clause. We'd probably need to add some sort of evaluation order to the Aggref, but ... details need not be discussed here ... I think if we had that, then the current objection to the patch loses quite a bit of traction. I imagine Tom will give up his objection given that he wrote this: On 27 March 2018 at 11:28, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Maybe what that says is that rather than giving up on this altogether, > we should shelve it till we have less stupid planner+executor behavior > for ORDER BY inside aggregates. That's been on the to-do list for > a long while, but not risen to the top ... So it seems there's probably room here to please everyone. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2 May 2018 at 10:14, David Rowley <david.rowley@2ndquadrant.com> wrote: > On 2 May 2018 at 08:59, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> My estimate for the number of people positively impacted could be off >> by a factor of a thousand, and it still wouldn't change the conclusion >> that this will hurt more people than it helps. > > It's probably best to have this argument again in 6 months or so. > Someone might decide to put some effort into teaching the planner > about ordered aggregates so it can choose to provide pre-sorted input > to aggregate functions so that nodeAgg.c no longer has to perform the > costly explicit sorts. It seems unlikely that ordered aggregates will be fixed in the July 'fest or even any time soon, so I'm going to mark this patch as returned with feedback in aid of slightly reducing the patch list. I'll submit it again when there is more consensus that we want this. Thanks for all the reviews. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, 20 Jun 2018 at 19:08, David Rowley <david.rowley@2ndquadrant.com> wrote: > I'll submit it again when there more consensus that we want this. Waking up this old thread again. If you don't have a copy, the entire thread is in [1]. The remaining item that seemed to cause this patch to be rejected was raised in [2]. The summary of that was that it might not be a good idea to allow parallel aggregation of string_agg() and array_agg() as there might be some people who rely on the current ordering they get without having an ORDER BY clause in the aggregate function call. Tom mentioned in [3] that users might not want to add an ORDER BY to their aggregate function because the performance of it is terrible. That was true up until 1349d2790 [4], where I changed how ORDER BY / DISTINCT aggregation worked to allow the planner to provide pre-sorted input rather than always having nodeAgg.c do the sorting. I think this removes quite a lot of the argument against the patch, but not all of it. So here goes testing the water on seeing if any opinions have changed over the past few years? A rebased patch is attached. David [1] https://www.postgresql.org/message-id/flat/CAKJS1f98yPkRMsE0JnDh72%3DAQEUuE3atiCJtPVCtjhFwzCRJHQ%40mail.gmail.com#8bbce15b9279d2da2da99071f732a99d [2] https://www.postgresql.org/message-id/6538.1522096067@sss.pgh.pa.us [3] https://www.postgresql.org/message-id/18594.1522099194@sss.pgh.pa.us [4] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=1349d2790bf48a4de072931c722f39337e72055e
Attachment
David Rowley <dgrowleyml@gmail.com> writes: > Waking up this old thread again. If you don't have a copy, the entire > thread is in [1]. > The remaining item that seemed to cause this patch to be rejected was > raised in [2]. Hmm. My estimate of the percentages of users who will be pleased or not hasn't really changed since [2]. Also, realizing that you can't apply these aggregates to extremely large input sets, I wonder just how much it buys to be able to parallelize them. So I still put this patch in the category of "just because we can doesn't mean we should". Now as against that, if the underlying relation scan is parallelized then you already have unpredictable input ordering and thus unpredictable aggregate results. So anyone who cares about that has already had to either disable parallelism or insert an ORDER BY somewhere, and either of those fixes will still work if the aggregates support parallelism. Hence, I'm not going to fight hard if you really want to do it. But I remain unconvinced that the cost/benefit tradeoff is attractive. regards, tom lane
On Wed, 3 Aug 2022 at 12:18, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Now as against that, if the underlying relation scan is parallelized > then you already have unpredictable input ordering and thus unpredictable > aggregate results. So anyone who cares about that has already had > to either disable parallelism or insert an ORDER BY somewhere, and > either of those fixes will still work if the aggregates support > parallelism. I think users relying on the transition order without an ORDER BY clause in the aggregate function have likely become accustomed to having to disable various GUCs before they run their query. For example, they might have something like: BEGIN; SET LOCAL enable_hashagg = 0; SET LOCAL enable_hashjoin = 0; SET LOCAL enable_nestloop = 0; SELECT col, string_agg(col2, ','), array_agg(col3) FROM tab1 INNER JOIN tab2 ON ... ORDER BY ... COMMIT; and are relying on a Merge Join being chosen so that the final join rel is ordered according to the ORDER BY clause, which allows GroupAgg to skip the Sort phase, thus providing a pre-sorted path for aggregation. Here we're adding to the list of GUCs they'd need to disable, as they'll maybe want to add SET max_parallel_workers_per_gather = 0; now too. I bet all those people knew they were on thin ice already. I stand by the fact that we can't let undefined behaviour stand in the way of future optimisations. Not having these functions disables more than just parallel query. There are other reasons to want combine functions; for example, partial aggregation done during partition-wise aggregation is disabled for aggregates without a combine function too. I'm pretty sure we must be nearing wanting a give-me-partial-aggregate-results syntax so that we can have FDWs feed us with partial results so we can harness the resources of an array of foreign servers to farm off the aggregation work for us. We already have async Append. It feels (to me) like we're on the cusp of needing that syntax. > Hence, I'm not going to fight hard if you really want > to do it. But I remain unconvinced that the cost/benefit tradeoff > is attractive. Perhaps we're early enough in the release cycle to give enough time for people to realise and complain if they're in that 80% group, according to your estimates. David
On Tue, Aug 2, 2022 at 4:46 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Wed, 20 Jun 2018 at 19:08, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I'll submit it again when there more consensus that we want this.
Waking up this old thread again. If you don't have a copy, the entire
thread is in [1].
The remaining item that seemed to cause this patch to be rejected was
raised in [2]. The summary of that was that it might not be a good
idea to allow parallel aggregation of string_agg() and array_agg() as
there might be some people who rely on the current ordering they get
without having an ORDER BY clause in the aggregate function call. Tom
mentioned in [3] that users might not want to add an ORDER BY to their
aggregate function because the performance of it is terrible. That
was true up until 1349d2790 [4], where I changed how ORDER BY /
DISTINCT aggregation worked to allow the planner to provide pre-sorted
input rather than always having nodeAgg.c do the sorting. I think
this removes quite a lot of the argument against the patch, but not
all of it. So here goes testing the water on seeing if any opinions
have changed over the past few years?
A rebased patch is attached.
David
[1] https://www.postgresql.org/message-id/flat/CAKJS1f98yPkRMsE0JnDh72%3DAQEUuE3atiCJtPVCtjhFwzCRJHQ%40mail.gmail.com#8bbce15b9279d2da2da99071f732a99d
[2] https://www.postgresql.org/message-id/6538.1522096067@sss.pgh.pa.us
[3] https://www.postgresql.org/message-id/18594.1522099194@sss.pgh.pa.us
[4] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=1349d2790bf48a4de072931c722f39337e72055e
Hi,
For array_agg_combine():
+ if (state1->alen < reqsize)
+ {
+ /* Use a power of 2 size rather than allocating just reqsize */
+ state1->alen = pg_nextpower2_32(reqsize);
...
+ state1->nelems = reqsize;
I wonder why pg_nextpower2_32(reqsize) is used in the if block. It seems reqsize should suffice.
Cheers
On Wed, 3 Aug 2022 at 14:40, Zhihong Yu <zyu@yugabyte.com> wrote: > For array_agg_combine(): > > + if (state1->alen < reqsize) > + { > + /* Use a power of 2 size rather than allocating just reqsize */ > + state1->alen = pg_nextpower2_32(reqsize); > ... > + state1->nelems = reqsize; > > I wonder why pg_nextpower2_32(reqsize) is used in the if block. It seems reqsize should suffice. It would suffice, but it's a question of trying to minimise the reallocations. There might be many parallel workers to combine results from. Nothing says this one is the last call to combine the given aggregate state. As far as aset.c allocations, for small / medium allocations, the actual memory is allocated in powers of 2 anyway. If you request less, you'll maybe need to repalloc more often, needlessly. The actual underlying allocation might be big enough already, it's just that the palloc() caller does not have knowledge of that. David
Hi
út 27. 12. 2022 v 6:26 odesílatel David Rowley <dgrowleyml@gmail.com> napsal:
On Wed, 20 Jun 2018 at 19:08, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I'll submit it again when there more consensus that we want this.
Waking up this old thread again. If you don't have a copy, the entire
thread is in [1].
The remaining item that seemed to cause this patch to be rejected was
raised in [2]. The summary of that was that it might not be a good
idea to allow parallel aggregation of string_agg() and array_agg() as
there might be some people who rely on the current ordering they get
without having an ORDER BY clause in the aggregate function call. Tom
mentioned in [3] that users might not want to add an ORDER BY to their
aggregate function because the performance of it is terrible. That
was true up until 1349d2790 [4], where I changed how ORDER BY /
DISTINCT aggregation worked to allow the planner to provide pre-sorted
input rather than always having nodeAgg.c do the sorting. I think
this removes quite a lot of the argument against the patch, but not
all of it. So here goes testing the water on seeing if any opinions
have changed over the past few years?
A rebased patch is attached.
The technical part of this patch was discussed before, and there were no objections to it.
I did some performance benchmarks and I agree with David. Sure - it can change the ordering, but without an ORDER BY clause the order was never guaranteed before either. And the performance with an ORDER BY clause is unchanged with and without this patch.
I can confirm that all regression tests pass, and there are no problems with compilation.
I'll mark this patch as ready for committer
Regards
Pavel
David
[1] https://www.postgresql.org/message-id/flat/CAKJS1f98yPkRMsE0JnDh72%3DAQEUuE3atiCJtPVCtjhFwzCRJHQ%40mail.gmail.com#8bbce15b9279d2da2da99071f732a99d
[2] https://www.postgresql.org/message-id/6538.1522096067@sss.pgh.pa.us
[3] https://www.postgresql.org/message-id/18594.1522099194@sss.pgh.pa.us
[4] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=1349d2790bf48a4de072931c722f39337e72055e
On Wed, 3 Aug 2022 at 12:18, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Hmm. My estimate of the percentages of users who will be pleased or > not hasn't really changed since [2]. Also, realizing that you can't > apply these aggregates to extremely large input sets, I wonder just how > much it buys to be able to parallelize them. So I still put this patch > in the category of "just because we can doesn't mean we should". It's a shame you categorised it without testing the actual performance. That's disappointingly unscientific. Here's a quick benchmark to let you see what we might expect and how it compares to the existing aggregate functions which can already be parallelised: create table ab (a int, b int); insert into ab select x,y from generate_series(1,1000)x,generate_series(1,1000)y; In the attached agg_comparison_milliseconds.png, you can see the time it took to execute "select a,sum(b) from ab group by a;" and "select a,string_agg(b::text,',') from ab group by a;". Both queries seem to increase in performance by roughly the same amount when adding additional parallel workers. The attached agg_comparison_scalability.png shows the time it took to execute the query with 0 workers divided by the time for N workers, thus showing how many times the query was sped up with N workers. The scalability of the string_agg() query appears slightly better than the sum() query in this test due to the string_agg() query being slightly slower with 0 workers than the sum() version of the query. The sum query got 7.63x faster with 32 workers, whereas the string_agg query beat that slightly at 7.67x faster than when the query was executed with parallelism disabled. I'd expect that scalability, especially for the sum() version of the query, to improve further with more rows. Tested on an amd3990x machine with 64 cores / 128 threads. > Now as against that, if the underlying relation scan is parallelized > then you already have unpredictable input ordering and thus unpredictable > aggregate results. So anyone who cares about that has already had > to either disable parallelism or insert an ORDER BY somewhere, and > either of those fixes will still work if the aggregates support > parallelism. I'm glad you're no longer worried about users skipping the ORDER BY to increase performance and relying on luck for the input ordering. I think, in addition to this now allowing parallel aggregation for string_agg() and array_agg(), it's also important not to forget that there are more reasons to have combine/serialize/deserialize functions for all aggregates. Currently, a partition-wise aggregate can be performed partially on each partition; then, if further grouping is required at the top level, we need the combine function to perform the final grouping. No parallelism is required there. Also, with all the advancements that have been made in postgres_fdw of late, I imagine it surely can't be long until someone wants to invent some syntax for requesting partially aggregated results at the SQL level. It sure would be a shame if that worked for every aggregate, but string_agg and array_agg needed the foreign server to spit those tuples out individually so that the aggregation could be performed (painstakingly) on the local server. > Hence, I'm not going to fight hard if you really want > to do it. But I remain unconvinced that the cost/benefit tradeoff > is attractive. Thanks. Pending one last look, I'm planning to move ahead with it unless there are any further objections or concerns. David
Attachment
On Thu, 19 Jan 2023 at 20:44, David Rowley <dgrowleyml@gmail.com> wrote: > The attached agg_comparison_scalability.png shows the time it took to > execute the query with 0 workers divided by the time for N workers, > thus showing how many times the query was sped up with N workers. The > scalability of the string_agg() appears slightly better than the sum() > in this test due to the string_agg() query being slightly slower with > 0 workers than the sum() version of the query. The sum query got 7.63x > faster with 32 workers whereas the string_agg query beat that slightly > with 7.67x faster than when the query was executed with parallelism > disabled. I was just looking at this again today and I noticed I messed up the script and used the sum(b) query twice instead of the string_agg(b::text, ',') query. oops. I've attached the correct graphs this time. The scalability of string_agg() is not quite as good as I reported before, but it's still good. sum(b) with 32 workers with a new benchmark run I found to be 7.15x faster than with 0 parallel workers and string_agg(b::text, ',') I found to be 6.49x faster with 32 workers than with 0 parallel workers. This test showed that the string_agg query scaled better with up to 6 workers before the scalability dropped below that of the sum(b) query. David
Attachment
On Thu, 19 Jan 2023 at 20:44, David Rowley <dgrowleyml@gmail.com> wrote: > Thanks. Pending one last look, I'm planning to move ahead with it > unless there are any further objections or concerns. I've now pushed this. Thank you to everyone who reviewed or gave input on this patch. David
Hello David, 23.01.2023 07:37, David Rowley wrote: > I've now pushed this. > I've discovered that the test query: -- Ensure parallel aggregation is actually being used. explain (costs off) select * from v_pagg_test order by y; added by 16fd03e95 fails sometimes. For instance: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=urutu&dt=2024-03-19%2021%3A04%3A05 --- /home/bf/bf-build/urutu/HEAD/pgsql/src/test/regress/expected/aggregates.out 2024-02-24 06:42:47.039073180 +0000 +++ /home/bf/bf-build/urutu/HEAD/pgsql.build/src/test/regress/results/aggregates.out 2024-03-19 22:24:18.155876135 +0000 @@ -1993,14 +1993,16 @@ Sort Key: pagg_test.y, (((unnest(regexp_split_to_array((string_agg((pagg_test.x)::text, ','::text)), ','::text))))::integer) -> Result -> ProjectSet - -> Finalize HashAggregate + -> Finalize GroupAggregate ... https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2024-02-28%2007%3A38%3A27 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=avocet&dt=2024-02-08%2008%3A47%3A45 I suspect that these failures caused by autovacuum. I could reproduce this plan change when running multiple tests in parallel, and also with the attached test patch applied (several sleeps are needed for autovacuum/relation_needs_vacanalyze() to get a non-zero mod_since_analyze value from pgstat): TEMP_CONFIG=/tmp/temp.config TESTS="test_setup create_index create_aggregate aggregates" make -s check-tests where /tmp/temp.config is: autovacuum_naptime = 1 log_autovacuum_min_duration = 0 log_min_messages = INFO log_min_error_statement = log log_statement = 'all' With EXPLAIN (VERBOSE), I see a slight change of the Seq Scan cost/rows estimate: ... -> Parallel Seq Scan on public.pagg_test (cost=0.00..48.99 rows=2599 width=8) vs .. -> Parallel Seq Scan on public.pagg_test (cost=0.00..48.00 rows=2500 width=8) (after automatic analyze of table "regression.public.pagg_test") Best regards, Alexander
Attachment
On Wed, 27 Mar 2024 at 03:00, Alexander Lakhin <exclusion@gmail.com> wrote: > I've discovered that the test query: > -- Ensure parallel aggregation is actually being used. > explain (costs off) select * from v_pagg_test order by y; > > added by 16fd03e95 fails sometimes. For instance: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=urutu&dt=2024-03-19%2021%3A04%3A05 > > --- /home/bf/bf-build/urutu/HEAD/pgsql/src/test/regress/expected/aggregates.out 2024-02-24 06:42:47.039073180 +0000 > +++ /home/bf/bf-build/urutu/HEAD/pgsql.build/src/test/regress/results/aggregates.out 2024-03-19 22:24:18.155876135 +0000 > @@ -1993,14 +1993,16 @@ > Sort Key: pagg_test.y, (((unnest(regexp_split_to_array((string_agg((pagg_test.x)::text, ','::text)), > ','::text))))::integer) > -> Result > -> ProjectSet > - -> Finalize HashAggregate > + -> Finalize GroupAggregate > ... > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2024-02-28%2007%3A38%3A27 > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=avocet&dt=2024-02-08%2008%3A47%3A45 > > I suspect that these failures caused by autovacuum. I agree with this analysis. In hindsight, adding 5000 records to a table without disabling autovacuum or vacuuming the table manually was a bad idea. I opted to fix by disabling autovacuum for the table as that allows us to maintain the current plan. Vacuuming would result in the plan changing, which I opted not to do as it's a bit more churn in the expected output file. Thanks for the report and the analysis. David
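For reference, disabling autovacuum for a single table, as described in this fix, uses a per-table storage parameter along these lines (shown here as a sketch rather than the exact committed change):

-- keep autovacuum/autoanalyze away from the test table so its row
-- estimates, and therefore the expected EXPLAIN output, stay stable
ALTER TABLE pagg_test SET (autovacuum_enabled = off);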