Re: Parallel Aggregates for string_agg and array_agg - Mailing list pgsql-hackers

From Mark Dilger
Subject Re: Parallel Aggregates for string_agg and array_agg
Date
Msg-id 1C2959D0-56F2-4067-B2AC-DF9A3B1D0FB5@gmail.com
Whole thread Raw
In response to Re: Parallel Aggregates for string_agg and array_agg  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Parallel Aggregates for string_agg and array_agg  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
> On Mar 27, 2018, at 7:58 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <david.rowley@2ndquadrant.com> writes:
>> On 27 March 2018 at 13:26, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>>> synchronized_seqscans is another piece of precedent in the area, FWIW.
>
>> This is true. I guess the order of aggregation could be made more
>> certain if we remove the cost based optimiser completely, and just
>> rely on a syntax based optimiser.
>
> None of this is responding to my point.  I think the number of people
> who actually don't care about aggregation order for these aggregates
> is negligible, and none of you have argued against that; you've instead
> selected straw men to attack.

I frequently care about the order, but only to the extent that the order
is stable between aggregates of several different columns, along the lines
of:

select array_agg(a) AS x, array_agg(b) AS y
  from generate_a_b_func(foo);

I don't care which order the data is in, as long as x[i] and y[i] are
matched correctly.  It sounds like this patch would force me to write
that as, for example:

select array_agg(a order by a, b) AS x, array_agg(b order by a, b) AS y
  from generate_a_b_func(foo);

which I did not need to do before.  I would expect a performance regression
from the two newly required sorts.  So in that case I agree with Tom.

But I also agree with others that I want the parallel aggregation functionality.
Could we perhaps introduce some option for the aggregate to force it to be
stable?  Something like:

select array_agg(a order undisturbed) AS x, array_agg(b order undisturbed) AS y
  from generate_a_b_func(foo);

which would not perform an extra sort operation but would guarantee to not
disturb the pre-existing sort order coming from generate_a_b_func(foo)?


I don't care about the syntax / keywords in the example above.  I'm just
looking to get the benefits of the parallel aggregation when I don't care
about ordering while preserving the order for these cases where it matters.

mark



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: Oddity in tuple routing for foreign partitions
Next
From: Andres Freund
Date:
Subject: Re: Parallel Aggregates for string_agg and array_agg