Thread: ARRAY_AGG(DISTINCT a ORDER BY b) Error

ARRAY_AGG(DISTINCT a ORDER BY b) Error

From
"David Johnston"
Date:
"in an aggregate with DISTINCT, ORDER BY expressions must appear in argument
list"

Why?

If I add the fields of the ORDER BY expression to the DISTINCT clause I can
no longer use DISTINCT since the ORDER BY values are not unique. Nor do I
want the contents of the final ARRAY to contain the ORDER BY column.

I presuppose this is a technical limitation since my particular use-case
(and I've come across a few cases where this would be useful) doesn't seem
that obscure.

My specific sample query (use-case) is as follows:

SELECT control, ARRAY_AGG(DISTINCT accountnumber ORDER BY amount DESC)
FROM (VALUES ('A', '1000', 100), ('A', '1000', -50), ('A', '2000',200))
accountdetail (control, accountnumber, amount)
GROUP BY control

I want to create an array of the unique account numbers associated with a
control with the ordering of the array matching the order of the amounts.
In this case I would want the output to be:

(A, {'2000','1000'})

David J.







Re: ARRAY_AGG(DISTINCT a ORDER BY b) Error

From
Tom Lane
Date:
"David Johnston" <polobo@yahoo.com> writes:
> "in an aggregate with DISTINCT, ORDER BY expressions must appear in argument
> list"

> Why?

Because the results are ill-defined otherwise.  In your example,

>    ... ARRAY_AGG(DISTINCT accountnumber ORDER BY amount DESC) ...

there may be many rows with the same "accountnumber" and yet different
"amount" values.  Which of those amounts should be used as the sort key
for the aggregated row?

            regards, tom lane

Re: ARRAY_AGG(DISTINCT a ORDER BY b) Error

From
Thom Brown
Date:
On 31 August 2011 23:54, David Johnston <polobo@yahoo.com> wrote:
> "in an aggregate with DISTINCT, ORDER BY expressions must appear in argument
> list"
>
> Why?
>
> If I add the fields of the ORDER BY expression to the DISTINCT clause I can
> no longer use DISTINCT since the ORDER BY values are not unique. Nor do I
> want the contents of the final ARRAY to contain the ORDER BY column.
>
> I presuppose this is a technical limitation since my particular use-case
> (and I've come across a few cases where this would be useful) doesn't seem
> that obscure.
>
> My specific sample query (use-case) is as follows:
>
> SELECT control, ARRAY_AGG(DISTINCT accountnumber ORDER BY amount DESC)
> FROM (VALUES ('A', '1000', 100), ('A', '1000', -50), ('A', '2000',200))
> accountdetail (control, accountnumber, amount)
> GROUP BY control
>
> I want to create an array of the unique account numbers associated with a
> control with the ordering of the array matching the order of the amounts.
> In this case I would want the output to be:
>
> (A, {'2000','1000'})

I'm not sure that makes sense.  If you're aggregating accountnumber as
an array of distinct values, what do you expect your query to output
if, say you had the following:

accountnumber, amount
1000,100
2000,200
1000,300

You've ordered by amount, but accountnumber has 2 identical values,
where the amount is less than the amount corresponding to
accountnumber 2000 in one instance, but greater in another.  Where
does 1000 appear?  Before or after 2000?

--
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: ARRAY_AGG(DISTINCT a ORDER BY b) Error

From
"David Johnston"
Date:
-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Wednesday, August 31, 2011 7:10 PM
To: David Johnston
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] ARRAY_AGG(DISTINCT a ORDER BY b) Error

"David Johnston" <polobo@yahoo.com> writes:
> "in an aggregate with DISTINCT, ORDER BY expressions must appear in
> argument list"

> Why?

Because the results are ill-defined otherwise.  In your example,

>    ... ARRAY_AGG(DISTINCT accountnumber ORDER BY amount DESC) ...

there may be many rows with the same "accountnumber" and yet different
"amount" values.  Which of those amounts should be used as the sort key for
the aggregated row?

            regards, tom lane


----------------------------------

My take is that you look at the non-aggregated data using the same ORDER BY
clause and then add the values in the order they appear in the detail.

So (account, amount, [Action]):

1000, 150, [Add]
1000, 130, [Skip]
2000, 120, [Add]
2000, 100, [Skip]
1000, 50, [Skip]
2000, 0, [Skip]
3000, -10, [Add]
1000, -50, [Skip]

This is basically how a plain DISTINCT would have to work (taking the first
encountered value and skipping any subsequent repeats).  I want the same
general behavior but have the opportunity to order the input "frame" so that
I can manipulate the encounter order.

In a 2-phase process you would do the following (I will remain specific for
the time being):

Phase 1: Select the representative record for each DISTINCT (accountnumber);
The representative would be the FIRST record as determined via an ORDER BY
of all candidate records.
Phase 2: Order the DISTINCT representatives based upon the same ORDER BY
clause

So:
Phase 1: (3000, -10), (1000, 150), (2000, 120) [Not Ordered]
Phase 2: (1000, 150), (2000, 120), (3000, -10) [ORDER BY amount DESC]

The behavior/concept of [ARRAY_AGG(DISTINCT ON (accountnumber) .?. ORDER BY
amount] matches here but I do not want to output an amount column at all.

The other view is that you can create the ARRAY using just the ORDER BY and
then immediately "DISTINCT"ify the array so that there are no duplicates.
This is basically the 2-phase process described above.  In this view you
basically keep the DISTINCT value that has the lowest array index.

I would be interested in other possible interpretations/algorithms that
would then cause ambiguity in deciding which algorithm to implement.

I know that I am only considering "ARRAY_AGG" in my examples but my first
reaction is that other aggregates would behave acceptably under the
algorithm described; and since the current behavior is FAIL at least some
well-defined behavior would exist for the specified syntax.

David J.