Thread: ARRAY_AGG(DISTINCT a ORDER BY b) Error
"in an aggregate with DISTINCT, ORDER BY expressions must appear in argument list" Why? If I add the fields of the ORDER BY expression to the DISTINCT clause I can no longer use DISTINCT since the ORDER BY values are not unique. Nor do I want the contents of the final ARRAY to contain the ORDER BY column. I presuppose this is a technical limitation since my particular use-case (and I've come across a few cases where this would be useful) doesn't seem that obscure. My specific sample query (use-case) is as follows: SELECT control, ARRAY_AGG(DISTINCT accountnumber ORDER BY amount DESC) FROM (VALUES ('A', '1000', 100), ('A', '1000', -50), ('A', '2000',200)) accountdetail (control, accountnumber, amount) GROUP BY control I want to create an array of the unique account numbers associated with a control with the ordering of the array matching the order of the amounts. In this case I would want the output to be: (A, {'2000','1000'}) David J.
"David Johnston" <polobo@yahoo.com> writes: > "in an aggregate with DISTINCT, ORDER BY expressions must appear in argument > list" > Why? Because the results are ill-defined otherwise. In your example, > ... ARRAY_AGG(DISTINCT accountnumber ORDER BY amount DESC) ... there may be many rows with the same "accountnumber" and yet different "amount" values. Which of those amounts should be used as the sort key for the aggregated row? regards, tom lane
On 31 August 2011 23:54, David Johnston <polobo@yahoo.com> wrote: > "in an aggregate with DISTINCT, ORDER BY expressions must appear in argument > list" > > Why? > > If I add the fields of the ORDER BY expression to the DISTINCT clause I can > no longer use DISTINCT since the ORDER BY values are not unique. Nor do I > want the contents of the final ARRAY to contain the ORDER BY column. > > I presuppose this is a technical limitation since my particular use-case > (and I've come across a few cases where this would be useful) doesn't seem > that obscure. > > My specific sample query (use-case) is as follows: > > SELECT control, ARRAY_AGG(DISTINCT accountnumber ORDER BY amount DESC) > FROM (VALUES ('A', '1000', 100), ('A', '1000', -50), ('A', '2000',200)) > accountdetail (control, accountnumber, amount) > GROUP BY control > > I want to create an array of the unique account numbers associated with a > control with the ordering of the array matching the order of the amounts. > In this case I would want the output to be: > > (A, {'2000','1000'}) I'm not sure that makes sense. If you're aggregating accountnumber as an array of distinct values, what do you expect your query to output if, say you had the following: accountnumber, amount 1000,100 2000,200 1000,300 You've ordered by amount, but accountnumber has 2 identical values, where the amount is less than the amount corresponding to accountnumber 2000 in one instance, but greater in another. Where does 1000 appear? Before or after 2000? -- Thom Brown Twitter: @darkixion IRC (freenode): dark_ixion Registered Linux user: #516935 EnterpriseDB UK: http://www.enterprisedb.com The Enterprise PostgreSQL Company
-----Original Message----- From: Tom Lane [mailto:tgl@sss.pgh.pa.us] Sent: Wednesday, August 31, 2011 7:10 PM To: David Johnston Cc: pgsql-general@postgresql.org Subject: Re: [GENERAL] ARRAY_AGG(DISTINCT a ORDER BY b) Error "David Johnston" <polobo@yahoo.com> writes: > "in an aggregate with DISTINCT, ORDER BY expressions must appear in > argument list" > Why? Because the results are ill-defined otherwise. In your example, > ... ARRAY_AGG(DISTINCT accountnumber ORDER BY amount DESC) ... there may be many rows with the same "accountnumber" and yet different "amount" values. Which of those amounts should be used as the sort key for the aggregated row? regards, tom lane ---------------------------------- My take is that you look at the non-aggregated data using the same ORDER BY clause and then add the values in the order they appear in the detail. So (account, amount, [Action]): 1000, 150, [Add] 1000, 130, [Skip] 2000, 120, [Add] 2000, 100, [Skip] 1000, 50, [Skip] 2000, 0, [Skip] 3000, -10, [Add] 1000, -50, [Skip] This is basically how a plain DISTINCT would have to work (taking the first encountered value and skipping any subsequent repeats). I want the same general behavior but have the opportunity to order the input "frame" so that I can manipulate the encounter order. In a 2-phase process you would do the following (I will remain specific for the time being): Phase 1: Select the representative record for each DISTINCT (accountnumber); The representative would be the FIRST record as determined via an ORDER BY of all candidate records. Phase 2: Order the DISTINCT representatives based upon the same ORDER BY clause So: Phase 1: (3000, -10), (1000, 150), (2000, 120) [Not Ordered] Phase 2: (1000, 150), (2000, 120), (3000, -10) [ORDER BY amount DESC] The behavior/concept of [ARRAY_AGG(DISTINCT ON (accountnumber) .?. ORDER BY amount] matches here but I do not want to output an amount column at all. The other view is that you can create the ARRAY using just the ORDER BY and then immediately "DISTINCT"ify the array so that there are no duplicates. This is basically the 2-phase process described above. In this view you basically keep the DISTINCT value that has the lowest array index. I would be interested in other possible interpretations/algorithms that would then cause ambiguity in deciding which algorithm to implement. I know that I am only considering "ARRAY_AGG" in my examples but my first reaction is that other aggregates would behave acceptably under the algorithm described; and since the current behavior is FAIL at least some well-defined behavior would exist for the specified syntax. David J.