Thread: Proposal/design feedback needed: WITHIN GROUP (sql standard ordered set aggregate functions)

The spec defines two types of aggregate function classed as "ordered set
function", as follows:
1. An "inverse distribution function" taking one argument (which must be  a grouped column or otherwise constant within
groups)plus a sorted  group with exactly one column:  =# SELECT (func(p) WITHIN GROUP (ORDER BY q)) from ...  The
motivatingexample for this (and the only ones in the spec) are  percentile_cont and percentile_disc, to return a
percentileresult  from a continuous or discrete distribution. (Thus  percentile_cont(0.5) within group (order by x) is
thespec's version  of a median(x) function.)
 
2. A "hypothetical set function" taking N arguments of arbitrary types  (a la VARIADIC "any", rather than a fixed list)
plusa sorted group  with N columns of matching types:  =# SELECT (func(p1,p2,...) WITHIN GROUP (ORDER BY q1,q2,...))
from...  (where typeof(p1)==typeof(q1) and so on, at least up to trivial  conversions)  The motivating example here is
tobe able to do rank(p1,p2,...) to  return the rank that the specified values would have had if they were  added to the
group.
As usual, we do not want to constrain ourselves to supporting only the
specific cases in the spec, but would prefer a general solution.
We (meaning myself and Atri) have an implementation that basically
works, though it is not yet complete, but before taking it any further
we need to resolve the design question of how to represent these two
types of function in the system catalogs. The fact that there are in
effect two parts to the parameter list, which are either independent
(for inverse distribution funcs) or closely related (for hypothetical
set functions), doesn't seem to point to an obvious way to represent
this in pg_proc/pg_aggregate.
I'm not yet satisfied with the method used in our implementation, so
we're throwing this open for suggestions. We will post the
work-in-progress patch along with a description of its current
implementation shortly.
One of the major complications is that we ideally want to be able to do
polymorphism based on the type of the sorted group, specifically in
order to be able to do
percentile_disc(float8) within group (order by anyelement)
returning anyelement. (i.e. we should be able to get a discrete
percentile from any type that is orderable.) The question here is how to
resolve the return type both of the aggregate itself and of the finalfn.
We've also had an expression of interest in extending this to allow
percentile_disc(float8[]) and percentile_cont(float8[]) returning
arrays; e.g. percentile_cont(array[0, 0.25, 0.5, 0.75, 1]) to return an
array containing the bounds, median and quartiles in one go. This is an
extension to the spec but it seems sufficiently obviously useful to be
worth supporting.
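
For example (purely illustrative; "measurements" and "x" are made-up names):

   =# SELECT percentile_cont(array[0, 0.25, 0.5, 0.75, 1])
             WITHIN GROUP (ORDER BY x)
        FROM measurements;

would return a float8[] containing the minimum, lower quartile, median,
upper quartile and maximum of x in one pass over the group.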

Comments?

-- 
Andrew (irc:RhodiumToad)



On Thu, Jul 18, 2013 at 03:15:14AM +0000, Andrew Gierth wrote:
> The spec defines two types of aggregate function classed as "ordered set
> function", as follows:
>  
> 1. An "inverse distribution function" taking one argument (which must be
>    a grouped column or otherwise constant within groups) plus a sorted
>    group with exactly one column:
>  
>    =# SELECT (func(p) WITHIN GROUP (ORDER BY q)) from ...
>  
>    The motivating example for this (and the only ones in the spec) are
>    percentile_cont and percentile_disc, to return a percentile result
>    from a continuous or discrete distribution. (Thus
>    percentile_cont(0.5) within group (order by x) is the spec's version
>    of a median(x) function.)
>  
> 2. A "hypothetical set function" taking N arguments of arbitrary types
>    (a la VARIADIC "any", rather than a fixed list) plus a sorted group
>    with N columns of matching types:
>  
>    =# SELECT (func(p1,p2,...) WITHIN GROUP (ORDER BY q1,q2,...)) from ...
>  
>    (where typeof(p1)==typeof(q1) and so on, at least up to trivial
>    conversions)
>  
>    The motivating example here is to be able to do rank(p1,p2,...) to
>    return the rank that the specified values would have had if they were
>    added to the group.
>  
> As usual, we do not want to constrain ourselves to supporting only the
> specific cases in the spec, but would prefer a general solution.
>  
> We (meaning myself and Atri) have an implementation that basically
> works, though it is not yet complete, but before taking it any further
> we need to resolve the design question of how to represent these two
> types of function in the system catalogs. The fact that there are in
> effect two parts to the parameter list, which are either independent
> (for inverse distribution funcs) or closely related (for hypothetical
> set functions), doesn't seem to point to an obvious way to represent
> this in pg_proc/pg_aggregate.
>  
> I'm not yet satisfied with the method used in our implementation,

What is that method?

> so we're throwing this open for suggestions. We will post the
> work-in-progress patch along with a description of its current
> implementation shortly.
>  
> One of the major complications is that we ideally want to be able to
> do polymorphism based on the type of the sorted group, specifically
> in order to be able to do
>  
> percentile_disc(float8) within group (order by anyelement)
>  
> returning anyelement. (i.e. we should be able to get a discrete
> percentile from any type that is orderable.) The question here is
> how to resolve the return type both of the aggregate itself and of
> the finalfn.
>  
> We've also had an expression of interest in extending this to allow
> percentile_disc(float8[]) and percentile_cont(float8[]) returning
> arrays; e.g. percentile_cont(array[0, 0.25, 0.5, 0.75, 1]) to return
> an array containing the bounds, median and quartiles in one go. This
> is an extension to the spec but it seems sufficiently obviously
> useful to be worth supporting.
> 
> Comments?

I'm really happy to see PostgreSQL come into its own when it comes to
the analytics side of the house :)

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



On Thu, Jul 18, 2013 at 10:02 AM, David Fetter <david@fetter.org> wrote:
> On Thu, Jul 18, 2013 at 03:15:14AM +0000, Andrew Gierth wrote:
>> The spec defines two types of aggregate function classed as "ordered set
>> function", as follows:
>>
>> 1. An "inverse distribution function" taking one argument (which must be
>>    a grouped column or otherwise constant within groups) plus a sorted
>>    group with exactly one column:
>>
>>    =# SELECT (func(p) WITHIN GROUP (ORDER BY q)) from ...
>>
>>    The motivating example for this (and the only ones in the spec) are
>>    percentile_cont and percentile_disc, to return a percentile result
>>    from a continuous or discrete distribution. (Thus
>>    percentile_cont(0.5) within group (order by x) is the spec's version
>>    of a median(x) function.)
>>
>> 2. A "hypothetical set function" taking N arguments of arbitrary types
>>    (a la VARIADIC "any", rather than a fixed list) plus a sorted group
>>    with N columns of matching types:
>>
>>    =# SELECT (func(p1,p2,...) WITHIN GROUP (ORDER BY q1,q2,...)) from ...
>>
>>    (where typeof(p1)==typeof(q1) and so on, at least up to trivial
>>    conversions)
>>
>>    The motivating example here is to be able to do rank(p1,p2,...) to
>>    return the rank that the specified values would have had if they were
>>    added to the group.
>>
>> As usual, we do not want to constrain ourselves to supporting only the
>> specific cases in the spec, but would prefer a general solution.
>>
>> We (meaning myself and Atri) have an implementation that basically
>> works, though it is not yet complete, but before taking it any further
>> we need to resolve the design question of how to represent these two
>> types of function in the system catalogs. The fact that there are in
>> effect two parts to the parameter list, which are either independent
>> (for inverse distribution funcs) or closely related (for hypothetical
>> set functions), doesn't seem to point to an obvious way to represent
>> this in pg_proc/pg_aggregate.
>>
>> I'm not yet satisfied with the method used in our implementation,
>
> What is that method?

We currently represent ordered set functions with a new bool flag in
pg_aggregate. The flag is set to true for ordered set functions
(obviously) and false for all others. The currently implemented
functions, i.e. percentile_disc, percentile_cont and percentile_cont
for intervals, have their finalfns present in pg_aggregate.

The aggregate functions take two arguments: one for the percentile
value and one for the input row set. So percentile_cont's entry in
pg_proc has float8 and float8 as its parameter types, and another
entry for percentile_cont (with the interval version as the finalfn)
has float8 and interval as its parameter types.
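
Concretely, that means two separate catalog entries along the lines of:

   percentile_cont(float8, float8)    -- returns float8
   percentile_cont(float8, interval)  -- returns interval

rather than a single polymorphic declaration.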

As you can see, there isn't a way right now to resolve the return type
of the aggregate for polymorphic cases. This is something we wish to
resolve.

Regards,

Atri





--
Regards,

Atri
l'apprenant



On 07/17/2013 08:15 PM, Andrew Gierth wrote:
> The spec defines two types of aggregate function classed as "ordered set
> function", as follows:
>  
> 1. An "inverse distribution function" taking one argument (which must be
>    a grouped column or otherwise constant within groups) plus a sorted
>    group with exactly one column:
>  
>    =# SELECT (func(p) WITHIN GROUP (ORDER BY q)) from ...
>  
>    The motivating example for this (and the only ones in the spec) are
>    percentile_cont and percentile_disc, to return a percentile result
>    from a continuous or discrete distribution. (Thus
>    percentile_cont(0.5) within group (order by x) is the spec's version
>    of a median(x) function.)

One question is how this relates to the existing
  SELECT agg_func(x order by y)

... syntax.  Clearly there's some extra functionality here, but the two
are very similar conceptually.

> 2. A "hypothetical set function" taking N arguments of arbitrary types
>    (a la VARIADIC "any", rather than a fixed list) plus a sorted group
>    with N columns of matching types:
>  
>    =# SELECT (func(p1,p2,...) WITHIN GROUP (ORDER BY q1,q2,...)) from ...
>  
>    (where typeof(p1)==typeof(q1) and so on, at least up to trivial
>    conversions)
>  
>    The motivating example here is to be able to do rank(p1,p2,...) to
>    return the rank that the specified values would have had if they were
>    added to the group.

Wow, I can't possibly grasp the purpose of this.  Maybe a practical example?

> We've also had an expression of interest in extending this to allow
> percentile_disc(float8[]) and percentile_cont(float8[]) returning
> arrays; e.g. percentile_cont(array[0, 0.25, 0.5, 0.75, 1]) to return an
> array containing the bounds, median and quartiles in one go. This is an
> extension to the spec but it seems sufficiently obviously useful to be
> worth supporting.

To be specific, I asked for this because it's already something I do
using PL/R, although in PL/R it's pretty much limited to floats.

Anyway, for anyone who isn't following why we want this: statistical
summary reports.  For example, I'd love to be able to do a quartile
distribution of query execution times without resorting to R.
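
For example, something like this (pure illustration; the table and column
names are made up):

   SELECT percentile_cont(array[0.25, 0.5, 0.75])
          WITHIN GROUP (ORDER BY duration)
     FROM query_log;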

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Josh Berkus wrote:
> On 07/17/2013 08:15 PM, Andrew Gierth wrote:
> > The spec defines two types of aggregate function classed as "ordered set
> > function", as follows:
> >  
> > 1. An "inverse distribution function" taking one argument (which must be
> >    a grouped column or otherwise constant within groups) plus a sorted
> >    group with exactly one column:
> >  
> >    =# SELECT (func(p) WITHIN GROUP (ORDER BY q)) from ...
> >  
> >    The motivating example for this (and the only ones in the spec) are
> >    percentile_cont and percentile_disc, to return a percentile result
> >    from a continuous or discrete distribution. (Thus
> >    percentile_cont(0.5) within group (order by x) is the spec's version
> >    of a median(x) function.)
> 
> One question is how this relates to the existing
> 
>    SELECT agg_func(x order by y)
> 
> ... syntax.  Clearly there's some extra functionality here, but the two
> are very similar conceptually.

Well, as you probably know, the spec is a whole pile of random
special-case syntax and any similarities are probably more accidental
than anything else.

A major difference is that in agg(x order by y), the values of y are
not passed to the aggregate function - they serve no purpose other
than controlling the order of the "x" values. Whereas in WITHIN GROUP,
the values in the ORDER BY ... clause are in some sense the primary
input to the aggregate, and the "p" argument is secondary and can't
vary between rows of the group.
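
To put it concretely:

   =# SELECT array_agg(x ORDER BY y) FROM (VALUES (10,2),(20,1)) v(x,y);
      -- y only determines the order of the x values: {20,10}

   =# SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY x)
        FROM (VALUES (10,2),(20,1)) v(x,y);
      -- x is the aggregated data; 0.5 is constant for the group: result 15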

Our implementation does heavily reuse the existing executor mechanics
for ORDER BY in aggregates, and it also reuses a fair chunk of the
parser code for it, but there are significant differences.

>[of hypothetical set functions]
> Wow, I can't possibly grasp the purpose of this.  Maybe a practical
> example?

=# select rank(123) within group (order by x)
     from (values (10),(50),(100),(200),(500)) v(x);

would return 1 row containing the value 4, because if you added the
value 123 to the grouped values, it would have been ranked 4th.

Any time you want to calculate what the rank, dense_rank or cume_dist
would be of a specific row within a group without actually adding the
row to the group, this is how it's done.

I don't have any practical examples to hand, but this beast seems to
be implemented in at least Oracle and MSSQL so I guess it has uses.

>[on supporting arrays of percentiles]
> To be specific, I asked for this because it's already something I do
> using PL/R, although in PL/R it's pretty much limited to floats.

percentile_cont is limited to floats and intervals in the spec; to be
precise, it's limited to taking args of either interval or any numeric
type, and returns interval for interval args, and "approximate numeric
with implementation-defined precision", i.e. some form of float, for
numeric-type args. The definition requires interpolation between values,
so it's not clear that there's any point in trying to allow other types.

percentile_disc is also limited to floats and intervals in the spec, but
I see absolutely no reason whatsoever for this, since the definition
given is valid for any type with ordering operators; there is no reason
not to make it fully polymorphic. (The requirement for ordering will be
enforced in parse-analysis anyway, by the ORDER BY transformations, and
the function simply returns one of the input values unaltered.)

-- 
Andrew (irc:RhodiumToad)



Andrew,

> Well, as you probably know, the spec is a whole pile of random
> special-case syntax and any similarities are probably more accidental
> than anything else.

Hah, I didn't realize that our ordered aggregate syntax even *was* spec.

> A major difference is that in agg(x order by y), the values of y are
> not passed to the aggregate function - they serve no purpose other
> than controlling the order of the "x" values. Whereas in WITHIN GROUP,
> the values in the ORDER BY ... clause are in some sense the primary
> input to the aggregate, and the "p" argument is secondary and can't
> vary between rows of the group.
> 
> Our implementation does heavily reuse the existing executor mechanics
> for ORDER BY in aggregates, and it also reuses a fair chunk of the
> parser code for it, but there are significant differences.

Well, seems like it would work the same as
    agg_func(constx,coly,colz ORDER BY coly, colz)

... which means you could reuse a LOT of the internal plumbing.  Or am I
missing something?

Also, what would a CREATE AGGREGATE and state function definition for
custom WITHIN GROUP aggregates look like?

> Any time you want to calculate what the rank, dense_rank or cume_dist
> would be of a specific row within a group without actually adding the
> row to the group, this is how it's done.
> 
> I don't have any practical examples to hand, but this beast seems to
> be implemented in at least Oracle and MSSQL so I guess it has uses.

Well, I still can't imagine a practical use for it, at least based on
RANK.  I certainly have no objections if you have the code, though.

I'll also point out that mode() requires ordered input as well, so add
that to the set of functions we'll want to eventually support.

One thing I find myself wanting with ordered aggregates is the ability
to exclude NULLs.  Thoughts?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Josh Berkus wrote:
> Hah, I didn't realize that our ordered aggregate syntax even *was* spec.

The spec defines agg(x order by y) only for array_agg and xmlagg; the
generalization to arbitrary other aggregates is our extension. (But
kind of obvious really.)

> > Our implementation does heavily reuse the existing executor mechanics
> > for ORDER BY in aggregates, and it also reuses a fair chunk of the
> > parser code for it, but there are significant differences.
>
> Well, seems like it would work the same as
>
>      agg_func(constx,coly,colz ORDER BY coly, colz)
>
> ... which means you could reuse a LOT of the internal plumbing.  Or am I
> missing something?

You missed the part above which said "...does heavily reuse..." :-)

i.e. we are in fact reusing a lot of the internal plumbing.

> Also, what would a CREATE AGGREGATE and state function definition for
> custom WITHIN GROUP aggregates look like?

Now this is exactly the part we haven't nailed down yet and want ideas
for.

The problem is, given that the parser is looking at:
 foo(p1,p2,...) within group (order by q1,q2,...)

how do we best represent the possible matching functions in pg_proc
and pg_aggregate? Our partial solution so far does not allow
polymorphism to work properly, so we need a better way; I'm hoping for
some independent suggestions before I post my own ideas.

-- 
Andrew (irc:RhodiumToad)



> The problem is, given that the parser is looking at:
> 
>   foo(p1,p2,...) within group (order by q1,q2,...)
> 
> how do we best represent the possible matching functions in pg_proc
> and pg_aggregate? Our partial solution so far does not allow
> polymorphism to work properly, so we need a better way; I'm hoping for
> some independent suggestions before I post my own ideas.

Yeah, you'd need to extend VARIADIC somehow.  That is, I should be able
to define a function as:

percentile_state (pctl float, ordercols VARIADIC ANY)
returns VARIADIC ANY

... so that it can handle the sorting.  Another way to look at it would be:

percentile_state (pctl float, orderedset ANONYMOUS ROW)
returns ANONYMOUS ROW as ...

... because really, what you're handing the state function is an
anonymous row type constructed of the order by phrase.  Of course, then
we have to have some way to manipulate the anonymous row from within the
function; at the very least, an equality operator.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



On Thu, Jul 18, 2013 at 10:33:15PM +0000, Andrew Gierth wrote:
> Josh Berkus wrote:
> > Well, seems like it would work the same as
> >
> >      agg_func(constx,coly,colz ORDER BY coly, colz)

I'd try transforming WITHIN GROUP into the above during parse analysis.  The
default would be the transformation for hypothetical set functions:
   agg(x,y,z) WITHIN GROUP (ORDER BY a,b,c) -> agg(x,y,z ORDER BY a,b,c)

Add a CREATE AGGREGATE option, say SQL_INVERSE_DISTRIBUTION_FUNCTION =
{true|false} or SQLIDF, that chooses the IDF transformation:
   agg(x) WITHIN GROUP (ORDER BY y) -> agg(x, y ORDER BY y)

Then there's perhaps no new core aggregation or function candidacy machinery.
(I don't know whether VARIADIC transition functions work today, but that would
become an orthogonal project.)  Compare how we handle standard interval typmod
syntax; only the parser and deparser know about it.

Atri's description upthread sounded pretty similar to that.

> > Also, what would a CREATE AGGREGATE and state function definition for
> > custom WITHIN GROUP aggregates look like?
> 
> Now this is exactly the part we haven't nailed down yet and want ideas
> for.

PERCENTILE_DISC would be declared as (float8, anyelement) with that SQLIDF
option.  To diagnose nonsensical calls made through nonstandard syntax, it
could dig into its AggState to verify that its second argument is equal() to
its first ORDER BY expression.
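
A declaration sketch, with hypothetical support-function names, and glossing
over the polymorphic-result question raised upthread:

   CREATE AGGREGATE percentile_disc(float8, anyelement) (
     SFUNC = percentile_disc_transfn,   -- would see (state, fraction, value) per row
     STYPE = internal,
     FINALFUNC = percentile_disc_final,
     SQL_INVERSE_DISTRIBUTION_FUNCTION = true
   );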

There would be a question of whether to accept the WITHIN GROUP syntax for any
aggregate or just for those for which the standard indicates it.  Then follows
the question of when to deparse as WITHIN GROUP and when to deparse as the
internal syntax.  I'd lean toward accepting WITHIN GROUP for any aggregate but
deparsing that way only SQLIDF aggregates and aggregates with names matching
the standard hypothetical set function names.  Or you could add a second
CREATE AGGREGATE option requesting hypothetical-set-function deparse style.

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com



Noah Misch <noah@leadboat.com> writes:
> (I don't know whether VARIADIC transition functions work today, but that would
> become an orthogonal project.)

Coincidentally enough, some Salesforce folk were asking me about allowing
VARIADIC aggregates just a few days ago.  I experimented enough to find
out that if you make an array-accepting transition function, and then
force the aggregate's pg_proc entry to look like it's variadic (by
manually setting provariadic and some other fields), then everything
seems to Just Work: the parser and executor are both fine with it.
So I think all that's needed here is to add some syntax support to
CREATE AGGREGATE, and probably make some tweaks in pg_dump.  I was
planning to go work on that sometime soon.
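
For instance, the sort of thing I'd expect to be able to write once the
syntax exists (names invented; array-variadic, not "any"):

   CREATE FUNCTION vsum_step(numeric, VARIADIC numeric[]) RETURNS numeric
       LANGUAGE sql IMMUTABLE
       AS $$ SELECT $1 + coalesce((SELECT sum(v) FROM unnest($2) AS u(v)), 0) $$;

   CREATE AGGREGATE vsum(VARIADIC numeric[]) (
       SFUNC = vsum_step,
       STYPE = numeric,
       INITCOND = '0'
   );

   -- e.g. SELECT vsum(a, b, c) FROM some_table;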

Having said that, though, what Andrew seemed to want was VARIADIC ANY,
which is a *completely* different kettle of fish, since the actual
parameters can't be converted to an array.  I'm not sure if that's
as easy to support.
        regards, tom lane



Hi all,

This is our current work-in-progress patch for WITHIN GROUP.

What mostly works:
  - percentile_cont(float8) within group (order by float8)
  - percentile_cont(float8) within group (order by interval)
  - percentile_disc(float8) within group (order by float8)

What doesn't work:
  - supporting other types in percentile_disc (want polymorphism to
    work first)
  - no commands yet to add new ordered set functions (want to nail
    down the catalog representation first)
  - no hypothetical set functions yet (need to resolve the above two
    points first)
  - some rough edges
  - probably some bugs
  - docs

Implementation details:

For execution, we repurpose the existing aggregate-orderby mechanics.
Given func(directargs) WITHIN GROUP (ORDER BY args), we process the
(ORDER BY args) into a tuplesort in the same way currently done for
agg(args ORDER BY args). Rather than using a transfn, we then call the
finalfn as finalfn(directargs), providing an API by which the finalfn
can access the tuplesort. (This is somewhat inspired by the window
function API, but unfortunately has nothing in common with it in terms
of requirements, so we couldn't just reuse it.)

func(p1,p2,...) WITHIN GROUP (ORDER BY q1,q2,...) is represented in the
catalog with two pg_proc rows:

func(p1,p2,...,q1,q2,...)  (proisagg=true)
func_final(p1,p2,...)

with the usual pg_aggregate row linking them, though aggtransfn is set
to InvalidOid (as is aggtranstype) and an additional flag indicates
that this is an ordered set function.

(This representation is inadequate for a number of reasons; it does not
handle polymorphism well and would require special-case coding for
hypothetical set functions, which we have not yet tackled. See our other
post.)


Regards,

Atri
--
Regards,

Atri
l'apprenant

Attachment


"some Salesforce folks" that would be me!   It looks like I didn't quite communicate to Tom just what I was looking for as I do indeed want to have a variable number of "any" types, as:

CREATE AGGREGATE FOO ( ANYELEMENT, <more types>, VARIADIC "any") (
...
    STYPE = ANYARRAY
...)
so the corresponding transition function would be
CREATE FUNCTION FOO_sfunc( ANYARRAY, ANYELEMENT, <more types>, VARIADIC "any") RETURNS ANYARRAY
and the final func is
CREATE FUNCTION FOO_ffunc( ANYARRAY ) RETURNS ANYELEMENT

The functions are in C, and I cheat and actually use the ANYARRAY transition variable as a struct, just keeping the varlena length correct (thanks to Tom for that idea).  Currently I only support a fixed number of "any" args, but really need that to be variable.

So supporting VARIADIC "any" for user defined aggregates would be most useful.


On Thu, Jul 18, 2013 at 7:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Noah Misch <noah@leadboat.com> writes:
>> (I don't know whether VARIADIC transition functions work today, but that would
>> become an orthogonal project.)
>
> Coincidentally enough, some Salesforce folk were asking me about allowing
> VARIADIC aggregates just a few days ago.  I experimented enough to find
> out that if you make an array-accepting transition function, and then
> force the aggregate's pg_proc entry to look like it's variadic (by
> manually setting provariadic and some other fields), then everything
> seems to Just Work: the parser and executor are both fine with it.
> So I think all that's needed here is to add some syntax support to
> CREATE AGGREGATE, and probably make some tweaks in pg_dump.  I was
> planning to go work on that sometime soon.
>
> Having said that, though, what Andrew seemed to want was VARIADIC ANY,
> which is a *completely* different kettle of fish, since the actual
> parameters can't be converted to an array.  I'm not sure if that's
> as easy to support.
>
>                         regards, tom lane


Ok, since Atri posted our work-so-far and there's not been much
comment, I'll outline here my proposed plan of attack.

Rather than, as in the WIP patch, using the agg finalfn to validate
the split between normal args and ORDER BY args, I propose this:

Firstly, as in the WIP patch,
 func(a) within group (order by b)

is looked up as though it were func(a,b). The result must be marked as
an ordered set function. A new pg_aggregate integer column,
aggordnargs (?), must be equal to the number of normal args (i.e. (a)
in this case). Note that this may be 0; one can see a legitimate use
case for mode() within group (order by anyelement) for example.

The finalfn must be defined to have the same signature, so its args
are processed as if it were func_final(a,b) - but only a dummy arg is
passed for b. (Similar to the case for window functions.) Resolution
of polymorphic parameters and result types therefore works as normal.
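
For example (the finalfn name here is just for illustration), percentile_disc
would end up looking like:

   percentile_disc(float8, anyelement)         -- the aggregate, aggordnargs = 1
   percentile_disc_final(float8, anyelement)   -- returns anyelement; only a dummy
                                               -- value is ever passed for the second
                                               -- argument, but its presence lets the
                                               -- result type resolve to the type of
                                               -- the ORDER BY column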

For hypothetical set functions we add a special case, aggordnargs=-1,
for which both the aggregate and the finalfn must be defined as
(variadic "any") and parse analysis detects this case and unifies the
types of the normal args with those of the ORDER BY args.

I propose this new syntax:

create aggregate func(argtypes...) within group (argtypes...) (
  [ STYPE = ... , ]
  [ SORTOP = ... , ]
  [ INITCOND = ... , ]
  FINALFUNC = func_final
);

Ordered set functions will typically not need STYPE etc., but
hypothetical set functions will be declared as, e.g.:

create aggregate rank(variadic "any") within group (variadic "any") (
  STYPE = boolean,
  INITCOND = 'f',
  SORTOP = >,
  FINALFUNC = rank_hypothetical_final
);

(I'm open to comment as to whether to simply overload the aggsortop
column in pg_aggregate or add a new one. I'm inclined to do the latter.)

The idea here is that a column of type STYPE will be appended to the
list of columns to be sorted, using SORTOP as sort operator, and all
input rows will have this column initialized to the INITCOND value.
This is to make it easy to implement rank() and friends by simply
inserting the hypothetical row into the sort, with a true value to
flag it, and finding its position in the sort result. (Better that
than comparing the hypothetical row against the whole group.)
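
(In plain SQL terms, and purely to illustrate the idea rather than what the
executor will actually do, it's the moral equivalent of:

   SELECT rn
     FROM (SELECT flag, row_number() OVER (ORDER BY x, flag DESC) AS rn
             FROM (SELECT x, false AS flag
                     FROM (VALUES (10),(50),(100),(200),(500)) v(x)
                   UNION ALL
                   SELECT 123, true) s) t
    WHERE flag;

which returns 4, matching the rank(123) example upthread. Sorting the flag
column descending is what the SORTOP is for: it makes the hypothetical row
sort ahead of any equal values.)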

(Security caveat: it will be necessary for the finalfn in such cases
to validate that the additional column exists with the right
type. Producing the wrong result is acceptable if the values in it are
unexpected; crashing is not.)

Any comment before we get back to coding?

-- 
Andrew (irc:RhodiumToad)



On Tue, Jul 23, 2013 at 01:21:52AM +0000, Andrew Gierth wrote:
> For hypothetical set functions we add a special case, aggordnargs=-1,
> for which both the aggregate and the finalfn must be defined as
> (variadic "any") and parse analysis detects this case and unifies the
> types of the normal args with those of the ORDER BY args.

Other aggregates based on this syntax might not desire such type unification.
Having parse analysis do that distorts the character of an "any" argument.  I
think the proper place for such processing is the first call to a transition
function.  The transition functions could certainly call a new API exposed
under src/backend/parser to do the heavy lifting.  But let's not make the
parser presume that an aggordnargs=-1 aggregate always wants its "any"
arguments handled in the manner of the standard hypothetical set functions.

The rest of the plan looks good so far.

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com



Noah Misch said:
> Other aggregates based on this syntax might not desire such type unification.

Then there would have to be some way to distinguish that. Maybe those could
have -1 and the standard hypothetical set functions -2, with some flag in
CREATE AGGREGATE to sort it out.

> Having parse analysis do that distorts the character of an "any" argument.  I
> think the proper place for such processing is the first call to a transition
> function.

Except there isn't one.

> But let's not make the
> parser presume that an aggordnargs=-1 aggregate always wants its "any"
> arguments handled in the manner of the standard hypothetical set functions.

This has to happen in the parser because these are errors that should be
caught before execution:

rank(foo) within group (order by bar,baz)
rank(integercol) within group (order by textcol)

And collations have to be resolved (pairwise) before sorting can happen:

rank(textval COLLATE "C") within group (order by foo)  -- sorts in "C"
rank(textval COLLATE "C") within group (order by bar COLLATE "en_US")  -- error

(basically, in rank(x) within group (order by y) where x and y are
collatable, the collation rules apply exactly as though you were doing
(x < y), with all the implicit vs. explicit stuff included)

And this:

rank(1.1) within group (order by intcol)

should become  rank(1.1) within group (order by intcol::numeric)

-- 
Andrew (irc:RhodiumToad)



On Wed, Jul 24, 2013 at 04:16:28AM +0000, Andrew Gierth wrote:
> Noah Misch said:
> > Other aggregates based on this syntax might not desire such type unification.
> 
> Then there would have to be some way to distinguish that. Maybe those could
> have -1 and the standard hypothetical set functions -2, with some flag in
> CREATE AGGREGATE to sort it out.

Sure.

> > But let's not make the
> > parser presume that an aggordnargs=-1 aggregate always wants its "any"
> > arguments handled in the manner of the standard hypothetical set functions.
> 
> This has to happen in the parser because these are errors that should be
> caught before execution:
> 
> rank(foo) within group (order by bar,baz)
> rank(integercol) within group (order by textcol)

Would be nice to have, but not an overriding concern.

> And collations have to be resolved (pairwise) before sorting can happen:
> 
> rank(textval COLLATE "C") within group (order by foo)  -- sorts in "C"
> rank(textval COLLATE "C") within group (order by bar COLLATE "en_US")  -- error
> 
> (basically, in rank(x) within group (order by y) where x and y are
> collatable, the collation rules apply exactly as though you were doing
> (x < y), with all the implicit vs. explicit stuff included)

This, though, makes direct parser support nigh inevitable.


The issue to resolve here is whether and to what extent we should implement
the SQL-standard hypothetical set functions as special cases of some
more-generic concept.  Here's the declaration you proposed for the rank() HSF:

> create aggregate rank(variadic "any") within group (variadic "any") (

This would be the first place to my knowledge where "any" doesn't mean
unrestricted type acceptance at the parser level.  If we need bespoke syntax
declaring a function as an HSF with the entailed call behavior nuances, fine.
Treating a declaration with this particular mix of "any" as a secret handshake
requesting those nuances is not the way to go.

>   STYPE = boolean,
>   INITCOND = 'f',
>   SORTOP = >,
>   FINALFUNC = rank_hypothetical_final
> );

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com



On Fri, 2013-07-19 at 21:29 +0530, Atri Sharma wrote:
> Hi all,
>
> This is our current work-in-progress patch for WITHIN GROUP.

Please fix these compiler warnings:

parse_agg.c: In function ‘check_ungrouped_columns_walker’:
parse_agg.c:848:3: warning: passing argument 1 of ‘check_ungrouped_columns_walker’ from incompatible pointer type [enabled by default]
parse_agg.c:822:1: note: expected ‘struct Node *’ but argument is of type ‘struct List *’

parse_func.c: In function ‘make_fn_arguments’:
parse_func.c:1540:9: warning: assignment from incompatible pointer type [enabled by default]
parse_func.c:1547:15: warning: assignment from incompatible pointer type [enabled by default]





On 7/19/13 11:59 AM, Atri Sharma wrote:
> Hi all,
> 
> This is our current work-in-progress patch for WITHIN GROUP.

This patch needs to be rebased.





Sent from my iPad

On 04-Sep-2013, at 21:38, Peter Eisentraut <peter_e@gmx.net> wrote:

> On 7/19/13 11:59 AM, Atri Sharma wrote:
>> Hi all,
>>
>> This is our current work-in-progress patch for WITHIN GROUP.
>
> This patch needs to be rebased.
>
This version of the patch is quite old. We will be sending an updated patch
before the start of the September commitfest, with all the points you
mentioned taken care of.

Thanks for the points.

Regards,

Atri


On Wed, Sep 4, 2013 at 10:01 PM, Atri Sharma <atri.jiit@gmail.com> wrote:
>
>
> Sent from my iPad
>
> On 04-Sep-2013, at 21:38, Peter Eisentraut <peter_e@gmx.net> wrote:
>
>> On 7/19/13 11:59 AM, Atri Sharma wrote:
>>> Hi all,
>>>
>>> This is our current work-in-progress patch for WITHIN GROUP.
>>
>> This patch needs to be rebased.

Hi All,

This is our complete patch for implementation of WITHIN GROUP.

Functions supported:

percentile_disc
percentile_cont for float8 and intervals
percentile_disc and percentile_cont support arrays of percentiles as well.
mode
rank
dense_rank
percent_rank
cume_dist

The patch also adds support for user defined ordered set functions
with CREATE AGGREGATE.

Polymorphism is now supported, with the original gripes about it
resolved. Essentially, we have added a new field in pg_aggregate,
aggordnargs, which we use to verify, having looked up the function,
that it is being called correctly. aggordnargs holds the number of
direct args to the aggregate.

Hypothetical set functions build on the VARIADIC extension, and all of
the hypothetical set functions have variadic 'any' as their parameter
types.
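
Example calls (table and column names invented):

   SELECT percentile_disc(0.9) WITHIN GROUP (ORDER BY duration) FROM requests;
   SELECT mode() WITHIN GROUP (ORDER BY category) FROM items;
   SELECT rank(42, 'abc') WITHIN GROUP (ORDER BY n, t) FROM samples;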

Needs review:

1) psql \df and \dfa output.
2) Handling of non-hypothetical collations.
3) Need for mode(), and the name.

Feedback/Comments?

Regards,

Atri

Attachment


On Fri, 2013-09-13 at 14:56 +0530, Atri Sharma wrote:
> This is our complete patch for implementation of WITHIN GROUP.

Please fix compiler warnings:

inversedistribution.c: In function ‘mode_final’:
inversedistribution.c:276:11: warning: ‘mode_val’ may be used uninitialized in this function [-Wmaybe-uninitialized]
inversedistribution.c:299:8: warning: ‘last_val’ may be used uninitialized in this function [-Wmaybe-uninitialized]





>>>>> "Peter" == Peter Eisentraut <peter_e@gmx.net> writes:
Peter> Please fix compiler warnings:

Someone should do the same in WaitForBackgroundWorkerStartup so that
building with -Werror works.

New patch coming shortly.

-- 
Andrew.



>>>>> "Peter" == Peter Eisentraut <peter_e@gmx.net> writes:

 Peter> Please fix compiler warnings:

Done.

--
Andrew (irc:RhodiumToad)


Attachment


On Sat, Sep 14, 2013 at 7:23 AM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
>  Peter> Please fix compiler warnings:
>
> Someone should do the same in WaitForBackgroundWorkerStartup so that
> building with -Werror works.

I don't get a warning there.  Can you be more specific about the problem?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



>>>>> "Robert" == Robert Haas <robertmhaas@gmail.com> writes:
>> Someone should do the same in WaitForBackgroundWorkerStartup so>> that building with -Werror works.
Robert> I don't get a warning there.  Can you be more specific aboutRobert> the problem?

bgworker.c: In function 'WaitForBackgroundWorkerStartup':
bgworker.c:866: warning: 'pid' may be used uninitialized in this function

gcc 4.2.2 / freebsd 8.2

-- 
Andrew (irc:RhodiumToad)



On Tue, Sep 17, 2013 at 2:27 PM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
>>>>>> "Robert" == Robert Haas <robertmhaas@gmail.com> writes:
>
>  >> Someone should do the same in WaitForBackgroundWorkerStartup so
>  >> that building with -Werror works.
>
>  Robert> I don't get a warning there.  Can you be more specific about
>  Robert> the problem?
>
> bgworker.c: In function 'WaitForBackgroundWorkerStartup':
> bgworker.c:866: warning: 'pid' may be used uninitialized in this function

Does the attached patch fix it for you?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment


>>>>> "Robert" == Robert Haas <robertmhaas@gmail.com> writes:
>> bgworker.c: In function 'WaitForBackgroundWorkerStartup':>> bgworker.c:866: warning: 'pid' may be used uninitialized
inthis function
 
Robert> Does the attached patch fix it for you?

It compiles without error and looks ok...

-- 
Andrew (irc:RhodiumToad)



On Thu, Sep 19, 2013 at 12:52 PM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
> It compiles without error and looks ok...

Thanks for checking.   Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company