Re: Document aggregate functions better w.r.t. ORDER BY - Mailing list pgsql-hackers

From David G. Johnston
Subject Re: Document aggregate functions better w.r.t. ORDER BY
Date
Msg-id CAKFQuwadWwDfyA5O4v92CCvDb=hKCGY97pq8wrH=dXFX84WnCA@mail.gmail.com
Whole thread Raw
In response to Re: Document aggregate functions better w.r.t. ORDER BY  (David Rowley <dgrowleyml@gmail.com>)
Responses Re: Document aggregate functions better w.r.t. ORDER BY
List pgsql-hackers
On Wed, Oct 25, 2023 at 7:13 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 26 Oct 2023 at 13:10, David G. Johnston
<david.g.johnston@gmail.com> wrote:
> Question: Do you know whether we for certain always sort ascending here to compute the unique values or whether if, say, there is an index on the column in descending order (or ascending and traversed backwards) that the data within the aggregate could, with an order by, be returned in descending order?

The way it's currently coded, we seem to always require ascending
order.  See addTargetToGroupList().  The call to
get_sort_group_operators() only requests the ltOpr.

A quick test creating an index on a column with DESC shows that we end
up doing a backwards index scan so that we get the requested ascending
order:

create table b (b text);
create index on b (b desc);
explain select string_agg(distinct b,',') from b;
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Aggregate  (cost=67.95..67.97 rows=1 width=32)
   ->  Index Only Scan Backward using b_b_idx on b  (cost=0.15..64.55
rows=1360 width=32)
(2 rows)

However, I think we'd best stay clear of offering any guarantees in
the documents about this.  If we did that it would be much harder in
the future if we wanted to implement the DISTINCT aggregates by
hashing.

So, I think we are mischaracterizing the Standard here, if only in the specific case of array_agg.

SQL Standard: 4.16.4

Every unary aggregate function takes an arbitrary <value expression> as the argument; most unary aggregate
functions can optionally be qualified with either DISTINCT or ALL.

If ARRAY_AGG is specified, then an array value with one element formed from the <value expression>
evaluated for each row that qualifies.

Neither DISTINCT nor ALL are allowed to be specified for VAR_POP, VAR_SAMP, STDDEV_POP, or
STDDEV_SAMP; redundant duplicates are not removed when computing these functions.

10.9

<array aggregate function> ::=
ARRAY_AGG
<left paren> <value expression> [ ORDER BY <sort specification list> ] <right paren>

I would reword the existing note to be something like:

The SQL Standard defines specific aggregates and their properties, including which of DISTINCT and/or ORDER BY is allowed.  Due to the extensible nature of PostgreSQL it accepts either or both clauses for any aggregate.

From the most recent patch:

    <para>
-    If <literal>DISTINCT</literal> is specified in addition to an
-    <replaceable>order_by_clause</replaceable>, then all the <literal>ORDER BY</literal>
-    expressions must match regular arguments of the aggregate; that is,
-    you cannot sort on an expression that is not included in the
-    <literal>DISTINCT</literal> list.
+    If <literal>DISTINCT</literal> is specified with an
+    <replaceable>order_by_clause</replaceable>, <literal>ORDER
+    BY</literal> expressions can only reference columns in the
+    <literal>DISTINCT</literal> list.  For example:
+<programlisting>
+WITH vals (v1, v2) AS ( VALUES (1,'Z'),(3,'D'),(4,'R'),(3,'A'),(2,'T') )
+SELECT array_agg(DISTINCT v2 ORDER BY v2 DESC) FROM vals;
+  array_agg
+-------------
+ {Z,T,R,D,A}
+</programlisting>

The change to a two-column vals was mostly to try and find corner-cases that might need to be addressed.  If we don't intend to show the error case of DISTINCT v1 ORDER BY v2 then we should go back to the original example and just add ORDER BY v DESC.  I'm fine with not using string_agg here.

+    For example:
+<programlisting>
+WITH vals (v) AS ( VALUES (1),(3),(4),(3),(2) )
+SELECT array_agg(DISTINCT v ORDER BY v DESC) FROM vals;
+ array_agg
+-----------
+ {4,3,2,1}
+</programlisting>

We get enough complaints regarding "apparent ordering" that I would like to add:

As a reminder, while some DISTINCT processing algorithms produce sorted output as a side-effect, only by specifying ORDER BY is the output order guaranteed.

David J.


pgsql-hackers by date:

Previous
From: Jeff Davis
Date:
Subject: Re: Is this a problem in GenericXLogFinish()?
Next
From: Jeff Davis
Date:
Subject: Re: Container Types