Re: Parallel Aggregates for string_agg and array_agg - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Parallel Aggregates for string_agg and array_agg
Date
Msg-id cd620e7f-a9dd-ad15-72a8-6c0ac10ea4c3@2ndquadrant.com
Whole thread Raw
In response to Re: Parallel Aggregates for string_agg and array_agg  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Parallel Aggregates for string_agg and array_agg  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Parallel Aggregates for string_agg and array_agg  (Stephen Frost <sfrost@snowman.net>)
List pgsql-hackers
On 03/26/2018 10:27 PM, Tom Lane wrote:
> David Rowley <david.rowley@2ndquadrant.com> writes:
>> [ combinefn_for_string_and_array_aggs_v7.patch ]
> 
> I spent a fair amount of time hacking on this with intent to commit,
> but just as I was getting to code that I liked, I started to have second
> thoughts about whether this is a good idea at all.  I quote from the fine
> manual:
> 
>     The aggregate functions array_agg, json_agg, jsonb_agg,
>     json_object_agg, jsonb_object_agg, string_agg, and xmlagg, as well as
>     similar user-defined aggregate functions, produce meaningfully
>     different result values depending on the order of the input
>     values. This ordering is unspecified by default, but can be controlled
>     by writing an ORDER BY clause within the aggregate call, as shown in
>     Section 4.2.7. Alternatively, supplying the input values from a sorted
>     subquery will usually work ...
> 
> I do not think it is accidental that these aggregates are exactly the ones
> that do not have parallelism support today.  Rather, that's because you
> just about always have an interest in the order in which the inputs get
> aggregated, which is something that parallel aggregation cannot support.
> 

I don't think that's quite true. I know plenty of people who do things
like this:

    SELECT
       a,
       b,
       avg(c),
       sum(d),
       array_agg(e),
       array_agg(f),
       string_agg(g)
    FROM hugetable GROUP BY a,b HAVING avg(c) > 100.89;

and then do some additional processing on the result in some way
(subquery, matview, ...). They don't really care about ordering of
values in the arrays, as long as orderings of all the arrays match.

Currently queries like this can use parallelism at all, and the patch
fixes that I believe.

> I fear that what will happen, if we commit this, is that something like
> 0.01% of the users of array_agg and string_agg will be pleased, another
> maybe 20% will be unaffected because they wrote ORDER BY which prevents
> parallel aggregation, and the remaining 80% will scream because we broke
> their queries.  Telling them they should've written ORDER BY isn't going
> to cut it, IMO, when the benefit of that breakage will accrue only to some
> very tiny fraction of use-cases.
> 

Isn't the ordering unreliable *already*? It depends on ordering of
tuples on the input. So if the table is scanned by index scan or
sequential scan, that will affect the array_agg/string_agg results. If
the input is a join, it's even more volatile.

IMHO it's not like we're making the ordering unpredictable - it's been
like that since forever.

Also, how is this different from ORDER BY clause? If a user does not
specify an ORDER BY clause, I don't think we'd care very much about
changes to output ordering due to plan changes, for example.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


pgsql-hackers by date:

Previous
From: Mark Rofail
Date:
Subject: Re: Re: [HACKERS] GSoC 2017: Foreign Key Arrays
Next
From: Dean Rasheed
Date:
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists