Re: Proposal/design feedback needed: WITHIN GROUP (sql standard ordered set aggregate functions) - Mailing list pgsql-hackers

From Andrew Gierth
Subject Re: Proposal/design feedback needed: WITHIN GROUP (sql standard ordered set aggregate functions)
Date
Msg-id 0f53c634ff4777551240aa8efc805398@news-out.riddles.org.uk
Whole thread Raw
In response to Proposal/design feedback needed: WITHIN GROUP (sql standard ordered set aggregate functions)  (Andrew Gierth <andrew@tao11.riddles.org.uk>)
List pgsql-hackers
Josh Berkus wrote:
> On 07/17/2013 08:15 PM, Andrew Gierth wrote:
> > The spec defines two types of aggregate function classed as "ordered set
> > function", as follows:
> >  
> > 1. An "inverse distribution function" taking one argument (which must be
> >    a grouped column or otherwise constant within groups) plus a sorted
> >    group with exactly one column:
> >  
> >    =# SELECT (func(p) WITHIN GROUP (ORDER BY q)) from ...
> >  
> >    The motivating example for this (and the only ones in the spec) are
> >    percentile_cont and percentile_disc, to return a percentile result
> >    from a continuous or discrete distribution. (Thus
> >    percentile_cont(0.5) within group (order by x) is the spec's version
> >    of a median(x) function.)
> 
> One question is how this relates to the existing
> 
>    SELECT agg_func(x order by y)
> 
> ... syntax.  Clearly there's some extra functionality here, but the two
> are very similar conceptually.

Well, as you probably know, the spec is a whole pile of random
special-case syntax and any similarities are probably more accidental
than anything else.

A major difference is that in agg(x order by y), the values of y are
not passed to the aggregate function - they serve no purpose other
than controlling the order of the "x" values. Whereas in WITHIN GROUP,
the values in the ORDER BY ... clause are in some sense the primary
input to the aggregate, and the "p" argument is secondary and can't
vary between rows of the group.

Our implementation does heavily reuse the existing executor mechanics
for ORDER BY in aggregates, and it also reuses a fair chunk of the
parser code for it, but there are significant differences.

>[of hypothetical set functions]
> Wow, I can't possibly grasp the purpose of this.  Maybe a practical
> example?

=# select rank(123) within group (order by x)    from (values (10),(50),(100),(200),(500)) v(x);

would return 1 row containing the value 4, because if you added the
value 123 to the grouped values, it would have been ranked 4th.

Any time you want to calculate what the rank, dense_rank or cume_dist
would be of a specific row within a group without actually adding the
row to the group, this is how it's done.

I don't have any practical examples to hand, but this beast seems to
be implemented in at least Oracle and MSSQL so I guess it has uses.

>[on supporting arrays of percentiles]
> To be specific, I asked for this because it's already something I do
> using PL/R, although in PL/R it's pretty much limited to floats.

percentile_cont is limited to floats and intervals in the spec; to be
precise, it's limited to taking args of either interval or any numeric
type, and returns interval for interval args, and "approximate numeric
with implementation-defined precision", i.e. some form of float, for
numeric-type args. The definition requires interpolation between values,
so it's not clear that there's any point in trying to allow other types.

percentile_disc is also limited to floats and intervals in the spec, but
I see absolutely no reason whatsoever for this, since the definition
given is valid for any type with ordering operators; there is no reason
not to make it fully polymorphic. (The requirement for ordering will be
enforced in parse-analysis anyway, by the ORDER BY transformations, and
the function simply returns one of the input values unaltered.)

-- 
Andrew (irc:RhodiumToad)



pgsql-hackers by date:

Previous
From: Josh Berkus
Date:
Subject: Re: ALTER SYSTEM SET command to change postgresql.conf parameters (RE: Proposal for Allow postgresql.conf values to be changed via SQL [review])
Next
From: Josh Berkus
Date:
Subject: Re: Proposal/design feedback needed: WITHIN GROUP (sql standard ordered set aggregate functions)