RFC: array_agg() per SQL:200n - Mailing list pgsql-hackers

I recently noticed that SQL:200n[1] defines a new aggregate function,
array_agg(). The relevant portions of the spec are:

p. 66: "If ARRAY_AGG is specified, then an array value with one element
formed from the <value expression> evaluated for each row that
qualifies."

p. 556: <array aggregate function> ::= ARRAY_AGG <left paren>     <value expression> [ ORDER BY <sort specification
list>] <right paren>
 

p. 564 discusses the required behavior. The result of array_agg() is an
array with one element per input value, sorted according to the optional
ORDER BY clause. NULL input values are included in the array, and the
result for an empty group is NULL, not an empty array. Note that per
page 66, I'd expect array values in the input to array_agg() not to be
flattened.

I'd like to implement array_agg() for 8.4. In the past, we've talked
about moving the array_accum() example aggregate into the backend[2].
Now that there's SQL-standard syntax, that's another reason to do it --
I think this is clearly useful functionality.

The previous discussion got tied up in how to expose the aggregate's
transition value to the type system. The problem is that the aggregate
wants to use a transition value that is not a SQL-level type, to allow
efficient array append operations. Various solutions were mooted about,
typically involving a pass-by-val pseudotype used to hold a pointer to
the C struct holding the transition state.

AFAIK the conclusion reached by the previous thread was that to be type
safe, you'd need one distinct pseudotype per aggregate function, along
with some way to let the planner distinguish this class of pseudotypes
from other types (in order to apply the heuristic that functions like
these are likely to consume more memory). You could identify this class
by an additional column in pg_type, but I think we'd need a lot of
machinery to do this properly (e.g. to allow these types to be created
via SQL). I wonder if this isn't over-engineering: the simple approach
originally followed by Stephen Frost was to declare the transition value
as, say, int8, and just disallow the transition and final functions from
being called outside an aggregate context. AFAIK this would be safe,
although of course it is ugly.

To parse the ORDER BY clause, we'd need to special-case array_agg() in
the grammar, which is a bit unfortunate. Implementation-wise, because
there is no way to lazily evaluate an array expression, I don't think
there's much to be gained by using the tuplesort infrastructure -- we'll
need to materialize the entire array into memory when the final function
is called anyway. Therefore, a simpler approach might be to just
accumulate inputs in the transition function as usual, and then qsort()
them in the final function. We could also have the planner arrange for
the sort to be skipped if it knows that the input to the aggregate will
be delivered in a compatible ordering.

Comments welcome.

-Neil

[1] http://www.wiscorp.com/SQLStandards.html ; apparently SQL:200n is
likely to become SQL:2008 without further changes.

[2] http://archives.postgresql.org/pgsql-hackers/2006-10/msg00362.php
http://archives.postgresql.org/pgsql-patches/2006-10/msg00059.php
http://archives.postgresql.org/pgsql-hackers/2006-10/msg00683.php




pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Vacuum threshold and non-serializable read-only transaction
Next
From: ITAGAKI Takahiro
Date:
Subject: Re: Vacuum threshold and non-serializable read-only transaction