Re: Parallel Aggregates for string_agg and array_agg - Mailing list pgsql-hackers

From David Rowley
Subject Re: Parallel Aggregates for string_agg and array_agg
Date
Msg-id CAKJS1f9zUpF4Ntb4=2ba5cQ9YmHptnKUB4tuZZNDmv0OAZ2T4g@mail.gmail.com
In response to Re: Parallel Aggregates for string_agg and array_agg  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Parallel Aggregates for string_agg and array_agg  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On 27 March 2018 at 09:27, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I spent a fair amount of time hacking on this with intent to commit,
> but just as I was getting to code that I liked, I started to have second
> thoughts about whether this is a good idea at all.  I quote from the fine
> manual:
>
>     The aggregate functions array_agg, json_agg, jsonb_agg,
>     json_object_agg, jsonb_object_agg, string_agg, and xmlagg, as well as
>     similar user-defined aggregate functions, produce meaningfully
>     different result values depending on the order of the input
>     values. This ordering is unspecified by default, but can be controlled
>     by writing an ORDER BY clause within the aggregate call, as shown in
>     Section 4.2.7. Alternatively, supplying the input values from a sorted
>     subquery will usually work ...
>
> I do not think it is accidental that these aggregates are exactly the ones
> that do not have parallelism support today.  Rather, that's because you
> just about always have an interest in the order in which the inputs get
> aggregated, which is something that parallel aggregation cannot support.

This was not in my list of reasons for not adding them the first time
around. I mentioned these reasons in a response to Stephen.

> I fear that what will happen, if we commit this, is that something like
> 0.01% of the users of array_agg and string_agg will be pleased, another
> maybe 20% will be unaffected because they wrote ORDER BY which prevents
> parallel aggregation, and the remaining 80% will scream because we broke
> their queries.  Telling them they should've written ORDER BY isn't going
> to cut it, IMO, when the benefit of that breakage will accrue only to some
> very tiny fraction of use-cases.

This very much reminds me of something that exists in the 8.4 release notes:

> SELECT DISTINCT and UNION/INTERSECT/EXCEPT no longer always produce sorted output (Tom)

> Previously, these types of queries always removed duplicate rows by
> means of Sort/Unique processing (i.e., sort then remove adjacent
> duplicates). Now they can be implemented by hashing, which will not
> produce sorted output. If an application relied on the output being in
> sorted order, the recommended fix is to add an ORDER BY clause. As a
> short-term workaround, the previous behavior can be restored by
> disabling enable_hashagg, but that is a very performance-expensive fix.
> SELECT DISTINCT ON never uses hashing, however, so its behavior is
> unchanged.

Seems we were happy enough then to tell users to add an ORDER BY.

However, this case is different: back then the results had always been
ordered, whereas this time they are only possibly ordered. So we'll
probably surprise fewer people this time around.
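
To make the ordering concern concrete, here is a minimal sketch in Python
rather than PostgreSQL code. It assumes a hypothetical round-robin split of
rows between workers and a combine step that concatenates the partial
results; the function names and the split are illustrative only and do not
reflect how the executor actually hands rows to workers.

```python
from typing import List


def string_agg(values: List[str], sep: str = ",") -> str:
    """Serial aggregation: concatenate values in input order."""
    return sep.join(values)


def parallel_string_agg(values: List[str], n_workers: int, sep: str = ",") -> str:
    """Toy model of a parallel aggregate (hypothetical, not PostgreSQL's code).

    Each worker receives every n-th row (an assumed round-robin split),
    builds a partial string, and the partial strings are then combined.
    """
    chunks = [values[i::n_workers] for i in range(n_workers)]
    partials = [sep.join(chunk) for chunk in chunks if chunk]
    return sep.join(partials)


rows = ["a", "b", "c", "d", "e", "f"]

serial = string_agg(rows)
parallel = parallel_string_agg(rows, n_workers=2)

# Sorting explicitly gives the same answer either way, which is the role
# an ORDER BY inside the aggregate call plays.
serial_sorted = string_agg(sorted(rows))
parallel_sorted = string_agg(sorted(parallel.split(",")))
```

In this toy model the serial and parallel results contain the same elements
but in a different order, and only the explicitly sorted variants agree.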

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

