Thread: BUG #18126: Ordered set aggregate: result does not depend on sort order

BUG #18126: Ordered set aggregate: result does not depend on sort order

From

PG Bug reporting form

Date:

21 September 2023, 10:57:26

The following bug has been logged on the website:

Bug reference:      18126
Logged by:          Richard Wollhofen
Email address:      wollhofen@unisoftwareplus.com
PostgreSQL version: 16.0
Operating system:   Debian 16.0-1.pgdg120+1
Description:

Probably related to bug report #18125.
Using a custom ordered set aggregate, the sort order in the ORDER BY after
the WITHIN GROUP does not show an influence on the results, as demonstrated
in the script below. There are three possible ways to achieve the same
ascending or descending sort order in the test data, respectively. However,
the result depends on WHAT is used for sorting (columns, whole row) instead
of the actual ordered result set.
Tested for postgres major versions 10 - 16.


Link to test script
https://dbfiddle.uk/2G7NgXPn


-- custom md5 aggregate sfunc
CREATE FUNCTION md5_agg_sfunc(text, anyelement)
       RETURNS text
       LANGUAGE sql
AS
$$
  SELECT md5($1 || $2::text)
$$;

-- custom md5 aggregate function
CREATE AGGREGATE md5_agg (ORDER BY anyelement)
(
  STYPE = text,
  SFUNC = md5_agg_sfunc,
  INITCOND = ''
);

-- test data, two correlated columns
CREATE TABLE t_test (a int, b int);
INSERT INTO t_test
       SELECT   x, x + 10
       FROM     generate_series(1, 100000) AS x;

-- Result for ASC sort order is the same for all three criteria!
SELECT * FROM t_test ORDER BY t_test ASC LIMIT 3;
SELECT * FROM t_test ORDER BY t_test.a ASC LIMIT 3;
SELECT * FROM t_test ORDER BY t_test.b ASC LIMIT 3;

-- Result for DESC sort order is the same for all three criteria!
SELECT * FROM t_test ORDER BY t_test DESC LIMIT 3;
SELECT * FROM t_test ORDER BY t_test.a DESC LIMIT 3;
SELECT * FROM t_test ORDER BY t_test.b DESC LIMIT 3;

-- expected 2 hashes, hash H1 for columns 1-3 for ASC order, 
--                    hash H2 for columns 4-6 for DESC order
SELECT md5_agg() WITHIN GROUP (ORDER BY t_test ASC),
       md5_agg() WITHIN GROUP (ORDER BY t_test.a ASC),
       md5_agg() WITHIN GROUP (ORDER BY t_test.b ASC),
       md5_agg() WITHIN GROUP (ORDER BY t_test DESC),
       md5_agg() WITHIN GROUP (ORDER BY t_test.a DESC),
       md5_agg() WITHIN GROUP (ORDER BY t_test.b DESC)
FROM t_test;
-- received 3 hashes:
--               H3 for col. 1, 4
--               H4 for col. 2, 5
--               H5 for col. 3, 6
-- Sort order has no influence. 
-- Specification of order criterion (whole record or different columns)
determines the result.

Re: BUG #18126: Ordered set aggregate: result does not depend on sort order

From

Tom Lane

Date:

21 September 2023, 15:01:30

PG Bug reporting form <noreply@postgresql.org> writes:
> Probably related to bug report #18125.
> Using a custom ordered set aggregate, the sort order in the ORDER BY after
> the WITHIN GROUP does not show an influence on the results, as demonstrated
> in the script below.

I think the problem in both of these bugs is that you're expecting
the system to sort the input for you.  As per [1], this is not the
case for ordered-set aggregates:

    Unlike the case for normal aggregates, the sorting of input rows for
    an ordered-set aggregate is not done behind the scenes, but is the
    responsibility of the aggregate's support functions. The typical
    implementation approach is to keep a reference to a “tuplesort” object
    in the aggregate's state value, feed the incoming rows into that
    object, and then complete the sorting and read out the data in the
    final function. This design allows the final function to perform
    special operations such as injecting additional “hypothetical” rows
    into the data to be sorted. While normal aggregates can often be
    implemented with support functions written in PL/pgSQL or another PL
    language, ordered-set aggregates generally have to be written in C,
    since their state values aren't definable as any SQL data type.

Perhaps it would be useful to have some kind of shim whereby a
"dumb" transition function could be the basis of an ordered-set
aggregate.  But I'm not quite seeing why that would be an
improvement over a traditional aggregate that you use with an
ORDER BY option.

            regards, tom lane

[1] https://www.postgresql.org/docs/current/xaggr.html#XAGGR-ORDERED-SET-AGGREGATES