Thread: Window functions patch v04 for the September commit fest

Window functions patch v04 for the September commit fest

From
"Hitoshi Harada"
Date:
Here's the latest window functions patch against HEAD. It seems to be
ready for the September commit fest, as added documents, WINDOW clause
feature and misc tests. I guess this would be the window functions
feature freeze for 8.4. The remaining feature will be implemented for
the later release.

This patch consists of:
- windowed aggregates
- cooperate with GROUP BY aggregates
- some optimizations with multiple windows
- ranking functions
- WINDOW clause
- windowing SQL regression tests
- sgml documents

Looking up the total road map, the dropped features are:

- sliding window (window framing)
- lead(), lag(), etc. that reach for random rows
- user defined window functions

The first and second topics are difficult to implement currently.
Because these features require random row access, it seems that
tuplestore would be able to save multiple positions to mark/restore.
This is fundamental change that is over my capability. Also, user
defined window functions seem to have much more to decide. I think I
can't put into shape the general needs of user's window functions now.
Lacking these feature, this stage looks compatible to SQLServer 2005,
while Oracle and DB2 have almost full of the specification.

Also, current implementation has only a type of plan which uses sort
operation. It should be optimized by re-position the windows and/or
using hashtable.

Oh, git hosting repository is updated as well.

Regards,


--
Hitoshi Harada

Attachment

Re: Window functions patch v04 for the September commit fest

From
"Hitoshi Harada"
Date:
2008/8/30 Hitoshi Harada <umi.tanuki@gmail.com>:
> Here's the latest window functions patch against HEAD. It seems to be
> ready for the September commit fest, as added documents, WINDOW clause
> feature and misc tests. I guess this would be the window functions
> feature freeze for 8.4. The remaining feature will be implemented for
> the later release.
>
> This patch consists of:
> - windowed aggregates
> - cooperate with GROUP BY aggregates
> - some optimizations with multiple windows
> - ranking functions
> - WINDOW clause
> - windowing SQL regression tests
> - sgml documents
>
> Looking up the total road map, the dropped features are:
>
> - sliding window (window framing)
> - lead(), lag(), etc. that reach for random rows
> - user defined window functions
>
> The first and second topics are difficult to implement currently.
> Because these features require random row access, it seems that
> tuplestore would be able to save multiple positions to mark/restore.
> This is fundamental change that is over my capability. Also, user
> defined window functions seem to have much more to decide. I think I
> can't put into shape the general needs of user's window functions now.
> Lacking these feature, this stage looks compatible to SQLServer 2005,
> while Oracle and DB2 have almost full of the specification.
>
> Also, current implementation has only a type of plan which uses sort
> operation. It should be optimized by re-position the windows and/or
> using hashtable.
>
> Oh, git hosting repository is updated as well.

README is updated.
http://umitanuki.net/pgsql/wfv04/design.html

Please add link to commit fest wiki.

Regards,

-- 
Hitoshi Harada


Re: Window functions patch v04 for the September commit fest

From
David Fetter
Date:
On Mon, Sep 01, 2008 at 12:15:19PM +0900, Hitoshi Harada wrote:
> README is updated.
> http://umitanuki.net/pgsql/wfv04/design.html
> 
> Please add link to commit fest wiki.

Added :)

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: Window functions patch v04 for the September commit fest

From
Heikki Linnakangas
Date:
Hitoshi Harada wrote:
> 2008/8/30 Hitoshi Harada <umi.tanuki@gmail.com>:
>> Here's the latest window functions patch against HEAD. It seems to be
>> ready for the September commit fest, as added documents, WINDOW clause
>> feature and misc tests. I guess this would be the window functions
>> feature freeze for 8.4. The remaining feature will be implemented for
>> the later release.
>>
>> This patch consists of:
>> - windowed aggregates
>> - cooperate with GROUP BY aggregates
>> - some optimizations with multiple windows
>> - ranking functions
>> - WINDOW clause
>> - windowing SQL regression tests
>> - sgml documents
>>
>> Looking up the total road map, the dropped features are:
>>
>> - sliding window (window framing)
>> - lead(), lag(), etc. that reach for random rows
>> - user defined window functions

Ok, I'm starting to read up on SQL2003 window functions, and to review 
this patch. Here's a long brain-dump of my first thoughts on the design.

Let's list a couple of requirements first:

1. It's important that what gets committed now can be extended to handle 
all of the window function stuff in SQL2003 in the future, as well as 
user-defined-window-functions in the spirit of PostgreSQL extensibility. 
Even if we don't implement all of it in this release.

2. Performance needs to be reasonable, in the face a large number of 
input rows. These are typically used in OLAP queries that process 
gazillions of rows. Some window functions need access to all rows in the 
window frame or partition, so you can't reasonably expect great 
performance with those if the working set is large, but those that 
don't, should work with finite memory requirements, and reasonably fast.

Because of 1., let's focus on the interface for 
user-defined-window-functions first. Ideally, the interface is such that 
all the window functions and aggregates defined in SQL2003, and others 
we might want to have in core or as pgfoundry projects, can be 
implemented using it.

I think there's still some confusion in the design document about what a 
frame is. The terminology section is great, it helped me a lot to get 
started, so thank you for that. However, in the "Actual design in the 
patch" you describe how a window function works, and you talk about a 
window frame, but elsewhere in the doc you note that window frames are 
not implemented yet.

As already discussed, there's two different kind of window expressions: 
window aggregates, and ranking aggregates (you call them window 
functions in your doc). It seems that they are different enough that we 
can't bang them together. For example, you can't use a window frame 
clause with ranking aggregates, and ranking aggregates need to see the 
whole row, whereas window aggregates only see the expressions passed as 
argument.

API for window functions (aka. ranking aggregate functions)
------------------------

Borrowing ideas from the many approaches listed in the design doc, I 
propose an interface using a set-returning function. Similar to Simon's 
proposal, but taking into account that a ranking function might need to 
see some or all input tuples before returning anything.

For each input tuple, the window function can return any number of 
tuples, including 0. After processing all input values, though, the 
total number of output rows returned must match the total number of 
input values fed into it. The function must eventually output one value 
for each input value, in the same order.

This is a flexible design that allows for different kind of 
implementations; the function can return one output tuple for each input 
tuple (ROW_NUMBER(), RANK() etc.), or it can swallow all input tuples 
first, and only then return all output tuples (e.g. CUME_DIST).

Here's a couple of step-by-step examples of how some functions would 
work. Imagine input values 1, 3, 4, 4, 5:

RANK
----

Invocation    Input    Value returned    Internal state (after)
1        1    1        (last value 1, count 1, rank=1)
2        3    2        (last value 2, count 1, rank=2)
3        4    3        (last value 4, count 1, rank=3)
4        4    3        (last value 4, count 2, rank=3)
5        5    5        (last value 5, count 1, rank=5)

So RANK returns one tuple after each input tuple. Just like in your 
current patch, keep the last value returned, a row count, and the 
current rank as state.

LAG
---

Internal state is a FIFO queue of values. Each input value is pushed to 
the FIFO, and the tuple that falls out of the queue is returned.

For example, LAG(<col>,2,0):

Invocation    Input row    Value returned    Internal state (after)
1        1        0        (1)
2        3        0        (3, 1)
3        4        1        (4, 3)
4        4        3        (4, 4)
5        5        4        (5, 4)

LEAD
----

Return nothing for first <offset> input tuples. Then return the current 
input tuple for the rest of the input tuples, and after the last input 
tuple, return <offset> number of default values.

For example, LEAD(<col>,2,0):

Invocation    Input row    Value returned    Internal state (after)
1        1        <none>        (offsetbegin = 1, pad=2)
2        3        <none>        (offsetbegin = 0, pad=2)
3        4        4        (offsetbegin = 0, pad=2)
4        4        4        (offsetbegin = 0, pad=2)
5        5        5        (offsetbegin = 0, pad=2)
6        <none>        0        (offsetbegin = 0, pad=1)
7        <none>        0        (offsetbegin = 0, pad=0)


For functions like LEAD or CUME_DIST that don't immediately start 
returning values, the executor will need a tuplestore to buffer input 
rows until it gets their values from the window function.


The syntax for creating a ranking aggregate might look something like this:

CREATE RANKING AGGREGATE name (    SFUNC = sfunc,    STYPE = state_data_type,    OUTFUNC = outfunc
)

This is very similar to Simon's proposal, and to current CREATE 
AGGREGATE, but tweaked a bit to fix the problems Hitoshi pointed out.

sfunc is called once for each input row. Unlike with normal aggregates, 
sfunc is passed the whole input row, so that e.g RANK can compare it 
against the previous row, or LEAD can buffer it.

outfunc is a set-returning function, and it is called until it returns 
no more rows, after each sfunc invocation.

Window aggregates
-----------------

Like normal aggregates, window aggregates process N input values, and 
return one output value. In normal GROUP BY expressions, the aggregate 
function is called once each group, but in a window aggregate expression 
with a window frame (aka sliding window), there is as many "groups" as 
there is rows.

In fact, any normal aggregate function can also be used as a window 
aggregate and vice versa. The trivial implementation is to have a buffer 
to hold all values currently in the window frame, and for each input 
row, invoke the aggregate function over all the rows currently in the 
frame. I think we'll need support that, to allow using any built-in or 
user-defined aggregate as a window aggregate.

However, we also want to provide optimized versions of the common 
aggregates. Aggregates like COUNT, SUM, and AVG can easily be 
implemented more efficiently, without rescanning all the tuples in the 
group.

I propose an extension to the current CREATE AGGREGATE syntax to allow 
for optimized versions. Currently, the syntax is like:

CREATE AGGREGATE name ( input_data_type [ , ... ] ) (    SFUNC = sfunc,    STYPE = state_data_type    [ , FINALFUNC =
ffunc]    [ , INITCOND = initial_condition ]    [ , SORTOP = sort_operator ]
 
)

Let's add a function to allow removing tuples from the current window frame:

CREATE AGGREGATE name ( input_data_type [ , ... ] ) (    SFUNC = sfunc,    STYPE = state_data_type    [ , FINALFUNC =
ffunc]    [ , INITCOND = initial_condition ]    [ , SORTOP = sort_operator ]    [ , SUBTRACTFUNC = subfunc,
 
)

subfunc is like sfunc, except that the input value is "subtracted" from 
the current value. On each input row, the executor calls sfunc for all 
new tuples that enter the window frame, and subfunc for all tuples that 
exit it. Another difference is that finalfunc can be called many times 
during the lifecycle of an aggregate invocation. For example, the 
subfunc for COUNT would decrement the row count by one, and for SUM,
subfunc would subtract the removed value from the current sum. Perhaps 
we should also support an "opportunistic subfunc", that could either 
perform the subtraction, or return a special value indicating that it 
can't be done, and we need to calculate the aggregate from scratch as if 
subfunc wasn't defined. That would be useful for MIN/MAX, which can be 
implemented by just keeping track of the current MIN/MAX value, and 
"subtracting" any other value than the current MIN/MAX is a no-op, but 
"subtracting" the current MIN/MAX value requires scanning all the rest 
of the tuples from scratch.

PS. Have you looked at the "hypothetical set functions" in SQL2003?

PPS. I was just about to send this, when I noticed that LEAD/LAG are in 
fact not in the SQL2003 draft I have at hand. In fact, none of the 
window functions defined there, RANK, DENSE_RANK, PERCENT_RANK, 
CUME_DIST and ROW_NUMBER, require random access to the tuples. Still 
seems like a good idea to be have the flexibility for them.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Window functions patch v04 for the September commit fest

From
David Fetter
Date:
On Mon, Sep 01, 2008 at 09:00:47PM +0300, Heikki Linnakangas wrote:
> Hitoshi Harada wrote:
>> 2008/8/30 Hitoshi Harada <umi.tanuki@gmail.com>:
>>> Here's the latest window functions patch against HEAD. It seems to be
>>> ready for the September commit fest, as added documents, WINDOW clause
>>> feature and misc tests. I guess this would be the window functions
>>> feature freeze for 8.4. The remaining feature will be implemented for
>>> the later release.
>>>
>>> This patch consists of:
>>> - windowed aggregates
>>> - cooperate with GROUP BY aggregates
>>> - some optimizations with multiple windows
>>> - ranking functions
>>> - WINDOW clause
>>> - windowing SQL regression tests
>>> - sgml documents
>>>
>>> Looking up the total road map, the dropped features are:
>>>
>>> - sliding window (window framing)
>>> - lead(), lag(), etc. that reach for random rows
>>> - user defined window functions
>
> Ok, I'm starting to read up on SQL2003 window functions,

Maybe it would be better to read the SQL2008 standard
<http://wiscorp.com/sql200n.zip> :)

> and to review  this patch.  Here's a long brain-dump of my first
> thoughts on the design.
>
> Let's list a couple of requirements first:
>
> 1. It's important that what gets committed now can be extended to
> handle  all of the window function stuff in SQL2003 in the future,
> as well as  user-defined-window-functions in the spirit of
> PostgreSQL extensibility.  Even if we don't implement all of it in
> this release.

> 2. Performance needs to be reasonable, in the face a large number of  
> input rows. These are typically used in OLAP queries that process  
> gazillions of rows. Some window functions need access to all rows in the  
> window frame or partition, so you can't reasonably expect great  
> performance with those if the working set is large, but those that  
> don't, should work with finite memory requirements, and reasonably fast.
>
> Because of 1., let's focus on the interface for
> user-defined-window-functions first. Ideally, the interface is such
> that  all the window functions and aggregates defined in SQL2003,
> and others  we might want to have in core or as pgfoundry projects,
> can be  implemented using it.

SQL2008 defines the following:

<window function> ::=   <window function type> OVER <window name or specification>

<window function type> ::=     <rank function type> <left paren> <right paren>   | ROW_NUMBER <left paren> <right
paren>  | <aggregate function>   | <ntile function>   | <lead or lag function>   | <first or last value function>   |
<nthvalue function>
 

<rank function type> ::=     RANK   | DENSE_RANK   | PERCENT_RANK   | CUME_DIST

<ntile function> ::=   NTILE <left paren> <number of tiles> <right paren>

<number of tiles> ::=     <simple value specification>   | <dynamic parameter specification>

<lead or lag function> ::=   <lead or lag> <left paren> <lead or lag extent>       [ <comma> <offset> [ <comma>
<defaultexpression> ] ] <right paren>       [ <null treatment> ]
 

<lead or lag> ::=   LEAD | LAG

<lead or lag extent> ::=   <value expression>

<offset> ::=   <exact numeric literal>

<default expression> ::=   <value expression>

<null treatment> ::=   RESPECT NULLS | IGNORE NULLS

<first or last value function> ::=   <first or last value> <left paren> <value expression> <right paren> [ <null
treatment>]
 

<first or last value> ::=   FIRST_VALUE | LAST_VALUE

<nth value function> ::=   NTH_VALUE <left paren> <value expression> <comma> <nth row> <right paren>       [ <from
firstor last> ] [ <null treatment> ]
 

<nth row> ::=     <simple value specification>   | <dynamic parameter specification>

<from first or last> ::=     FROM FIRST   | FROM LAST

<window name or specification> ::=     <window name>   | <in-line window specification>

<in-line window specification> ::=   <window specification>

> For functions like LEAD or CUME_DIST that don't immediately start  
> returning values, the executor will need a tuplestore to buffer input  
> rows until it gets their values from the window function.

For these aggregates, should there be an API that signals that a tuplestore
will be needed?

> The syntax for creating a ranking aggregate might look something like this:
>
> CREATE RANKING AGGREGATE name (
>     SFUNC = sfunc,
>     STYPE = state_data_type,
>     OUTFUNC = outfunc
> )
>
> This is very similar to Simon's proposal, and to current CREATE
> AGGREGATE, but tweaked a bit to fix the problems Hitoshi pointed
> out.
>
> sfunc is called once for each input row. Unlike with normal
> aggregates,  sfunc is passed the whole input row, so that e.g RANK
> can compare it  against the previous row, or LEAD can buffer it.
>
> outfunc is a set-returning function, and it is called until it
> returns  no more rows, after each sfunc invocation.

> Window aggregates
> -----------------
>
> Like normal aggregates, window aggregates process N input values, and  
> return one output value. In normal GROUP BY expressions, the aggregate  
> function is called once each group, but in a window aggregate expression  
> with a window frame (aka sliding window), there is as many "groups" as  
> there is rows.
>
> In fact, any normal aggregate function can also be used as a window  
> aggregate and vice versa. The trivial implementation is to have a buffer  
> to hold all values currently in the window frame, and for each input  
> row, invoke the aggregate function over all the rows currently in the  
> frame. I think we'll need support that, to allow using any built-in or  
> user-defined aggregate as a window aggregate.
>
> However, we also want to provide optimized versions of the common  
> aggregates. Aggregates like COUNT, SUM, and AVG can easily be  
> implemented more efficiently, without rescanning all the tuples in the  
> group.
>
> I propose an extension to the current CREATE AGGREGATE syntax to allow  
> for optimized versions. Currently, the syntax is like:
>
> CREATE AGGREGATE name ( input_data_type [ , ... ] ) (
>     SFUNC = sfunc,
>     STYPE = state_data_type
>     [ , FINALFUNC = ffunc ]
>     [ , INITCOND = initial_condition ]
>     [ , SORTOP = sort_operator ]
> )
>
> Let's add a function to allow removing tuples from the current window frame:
>
> CREATE AGGREGATE name ( input_data_type [ , ... ] ) (
>     SFUNC = sfunc,
>     STYPE = state_data_type
>     [ , FINALFUNC = ffunc ]
>     [ , INITCOND = initial_condition ]
>     [ , SORTOP = sort_operator ]
>     [ , SUBTRACTFUNC = subfunc,
> )
>
> subfunc is like sfunc, except that the input value is "subtracted" from  
> the current value. On each input row, the executor calls sfunc for all  
> new tuples that enter the window frame, and subfunc for all tuples that  
> exit it. Another difference is that finalfunc can be called many times  
> during the lifecycle of an aggregate invocation. For example, the  
> subfunc for COUNT would decrement the row count by one, and for SUM,  
> subfunc would subtract the removed value from the current sum. Perhaps  
> we should also support an "opportunistic subfunc", that could either  
> perform the subtraction, or return a special value indicating that it  
> can't be done, and we need to calculate the aggregate from scratch as if  
> subfunc wasn't defined. That would be useful for MIN/MAX, which can be  
> implemented by just keeping track of the current MIN/MAX value, and  
> "subtracting" any other value than the current MIN/MAX is a no-op, but  
> "subtracting" the current MIN/MAX value requires scanning all the rest  
> of the tuples from scratch.
>
> PS. Have you looked at the "hypothetical set functions" in SQL2003?

They're kinda neat :)

> PPS. I was just about to send this, when I noticed that LEAD/LAG are in  
> fact not in the SQL2003 draft I have at hand. In fact, none of the  
> window functions defined there, RANK, DENSE_RANK, PERCENT_RANK,  
> CUME_DIST and ROW_NUMBER, require random access to the tuples. Still
> seems like a good idea to be have the flexibility for them.

They're in SQL:2008.

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: Window functions patch v04 for the September commit fest

From
Gregory Stark
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

> sfunc is called once for each input row. Unlike with normal aggregates, sfunc
> is passed the whole input row, so that e.g RANK can compare it against the
> previous row, or LEAD can buffer it.

I'm not sure I follow this bit about being passed the whole input row. How
would that relate to something like lag(x*y, 2) for example?

> outfunc is a set-returning function, and it is called until it returns no more
> rows, after each sfunc invocation.

And in any case I feel like this is abstraction on the cheap. The only reason
it's so general is because it's leaving all the work to the functions to
implement. 

It also means we get no benefit from cases like

SELECT lag(x,5),lag(y,5),lag(z,5)

where the executor could keep one tuplestore and for all of them. For that
matter it could keep one tuplestore even for cases like lag(x,5),lag(y,4).

What would the executor do for a query like

SELECT lead(x,1),lead(y,2),lead(y,3)

It would not only have to keep a tuplestore to buffer the output but it would
have to deal with receiving data from different SRFs at different times. The
best approach I can think of would be to keep a tuplestore for each SRF using
themas queues, reading old values from the head as soon as they all have at
least one new value in them.

And it doesn't answer how to deal with things like

SELECT lag(x,1) OVER (ORDER BY a), lag(x,1) OVER (ORDER BY b)

I, uh, don't actually have any ideas of how to deal with that one :(

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!


Re: Window functions patch v04 for the September commit fest

From
"Hitoshi Harada"
Date:
Thanks for comments,

2008/9/2 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>:
>
> Ok, I'm starting to read up on SQL2003 window functions, and to review this
> patch. Here's a long brain-dump of my first thoughts on the design.
>
> Let's list a couple of requirements first:
>
> 1. It's important that what gets committed now can be extended to handle all
> of the window function stuff in SQL2003 in the future, as well as
> user-defined-window-functions in the spirit of PostgreSQL extensibility.
> Even if we don't implement all of it in this release.
>
> 2. Performance needs to be reasonable, in the face a large number of input
> rows. These are typically used in OLAP queries that process gazillions of
> rows. Some window functions need access to all rows in the window frame or
> partition, so you can't reasonably expect great performance with those if
> the working set is large, but those that don't, should work with finite
> memory requirements, and reasonably fast.
>
> Because of 1., let's focus on the interface for
> user-defined-window-functions first. Ideally, the interface is such that all
> the window functions and aggregates defined in SQL2003, and others we might
> want to have in core or as pgfoundry projects, can be implemented using it.
>
> I think there's still some confusion in the design document about what a
> frame is. The terminology section is great, it helped me a lot to get
> started, so thank you for that. However, in the "Actual design in the patch"
> you describe how a window function works, and you talk about a window frame,
> but elsewhere in the doc you note that window frames are not implemented
> yet.

Sorry about that. In my understanding, the "Window Frame" is defined
by clauses such like "ROWS BETWEEN ... ", "RANGE BETWEEN ... " or so,
contrast to "Window Partition" defined by "PARTITION BY" clause. A
frame slides within a partition or there's only one frame if those
clauses are not specified. The current patch has not implemented this
yet. I'll update the docs.

>
> As already discussed, there's two different kind of window expressions:
> window aggregates, and ranking aggregates (you call them window functions in
> your doc). It seems that they are different enough that we can't bang them
> together. For example, you can't use a window frame clause with ranking
> aggregates, and ranking aggregates need to see the whole row, whereas window
> aggregates only see the expressions passed as argument.

I don't like to call the second type "ranking aggregates" because it
may refer to only ranking functions though there are more types of
function like ntile(), lead() and lag(). But "window functions"
doesn't seem appropriate either since it's ambiguous with the general
name of window expressions.

>
> API for window functions (aka. ranking aggregate functions)
> ------------------------
>
> Borrowing ideas from the many approaches listed in the design doc, I propose
> an interface using a set-returning function. Similar to Simon's proposal,
> but taking into account that a ranking function might need to see some or
> all input tuples before returning anything.
>
> For each input tuple, the window function can return any number of tuples,
> including 0. After processing all input values, though, the total number of
> output rows returned must match the total number of input values fed into
> it. The function must eventually output one value for each input value, in
> the same order.
>
> This is a flexible design that allows for different kind of implementations;
> the function can return one output tuple for each input tuple (ROW_NUMBER(),
> RANK() etc.), or it can swallow all input tuples first, and only then return
> all output tuples (e.g. CUME_DIST).
>
> Here's a couple of step-by-step examples of how some functions would work.
> Imagine input values 1, 3, 4, 4, 5:
>
> RANK
> ----
>
> Invocation      Input   Value returned  Internal state (after)
> 1               1       1               (last value 1, count 1, rank=1)
> 2               3       2               (last value 2, count 1, rank=2)
> 3               4       3               (last value 4, count 1, rank=3)
> 4               4       3               (last value 4, count 2, rank=3)
> 5               5       5               (last value 5, count 1, rank=5)
>
> So RANK returns one tuple after each input tuple. Just like in your current
> patch, keep the last value returned, a row count, and the current rank as
> state.
>
> LAG
> ---
>
> Internal state is a FIFO queue of values. Each input value is pushed to the
> FIFO, and the tuple that falls out of the queue is returned.
>
> For example, LAG(<col>,2,0):
>
> Invocation      Input row       Value returned  Internal state (after)
> 1               1               0               (1)
> 2               3               0               (3, 1)
> 3               4               1               (4, 3)
> 4               4               3               (4, 4)
> 5               5               4               (5, 4)
>
> LEAD
> ----
>
> Return nothing for first <offset> input tuples. Then return the current
> input tuple for the rest of the input tuples, and after the last input
> tuple, return <offset> number of default values.
>
> For example, LEAD(<col>,2,0):
>
> Invocation      Input row       Value returned  Internal state (after)
> 1               1               <none>          (offsetbegin = 1, pad=2)
> 2               3               <none>          (offsetbegin = 0, pad=2)
> 3               4               4               (offsetbegin = 0, pad=2)
> 4               4               4               (offsetbegin = 0, pad=2)
> 5               5               5               (offsetbegin = 0, pad=2)
> 6               <none>          0               (offsetbegin = 0, pad=1)
> 7               <none>          0               (offsetbegin = 0, pad=0)
>
>
> For functions like LEAD or CUME_DIST that don't immediately start returning
> values, the executor will need a tuplestore to buffer input rows until it
> gets their values from the window function.
>
>
> The syntax for creating a ranking aggregate might look something like this:
>
> CREATE RANKING AGGREGATE name (
>    SFUNC = sfunc,
>    STYPE = state_data_type,
>    OUTFUNC = outfunc
> )
>
> This is very similar to Simon's proposal, and to current CREATE AGGREGATE,
> but tweaked a bit to fix the problems Hitoshi pointed out.
>
> sfunc is called once for each input row. Unlike with normal aggregates,
> sfunc is passed the whole input row, so that e.g RANK can compare it against
> the previous row, or LEAD can buffer it.
>
> outfunc is a set-returning function, and it is called until it returns no
> more rows, after each sfunc invocation.

Your proposal is smarter than the current implementation. But it
doesn't seem complete in some way. From logical point of view, the
window functions should be allowed to access whole of rows in the
frame the current row belongs to (c.f. inverse distribution
functions). From implementing and performance point of view, there
need to consider such case like mixing window aggregates and ranking
aggregates in the same window, which means it is better that the two
types of functions are processed in the similar flow. Also logically,
SQL spec doesn't forbid ranking aggregates in sliding window frames.
So your design may be flexible enough to cover different requirements
of lead()/lag() but I don't know if it will.

> Window aggregates
> -----------------
>
> Like normal aggregates, window aggregates process N input values, and return
> one output value. In normal GROUP BY expressions, the aggregate function is
> called once each group, but in a window aggregate expression with a window
> frame (aka sliding window), there is as many "groups" as there is rows.
>
> In fact, any normal aggregate function can also be used as a window
> aggregate and vice versa. The trivial implementation is to have a buffer to
> hold all values currently in the window frame, and for each input row,
> invoke the aggregate function over all the rows currently in the frame. I
> think we'll need support that, to allow using any built-in or user-defined
> aggregate as a window aggregate.
>
> However, we also want to provide optimized versions of the common
> aggregates. Aggregates like COUNT, SUM, and AVG can easily be implemented
> more efficiently, without rescanning all the tuples in the group.
>
> I propose an extension to the current CREATE AGGREGATE syntax to allow for
> optimized versions. Currently, the syntax is like:
>
> CREATE AGGREGATE name ( input_data_type [ , ... ] ) (
>    SFUNC = sfunc,
>    STYPE = state_data_type
>    [ , FINALFUNC = ffunc ]
>    [ , INITCOND = initial_condition ]
>    [ , SORTOP = sort_operator ]
> )
>
> Let's add a function to allow removing tuples from the current window frame:
>
> CREATE AGGREGATE name ( input_data_type [ , ... ] ) (
>    SFUNC = sfunc,
>    STYPE = state_data_type
>    [ , FINALFUNC = ffunc ]
>    [ , INITCOND = initial_condition ]
>    [ , SORTOP = sort_operator ]
>    [ , SUBTRACTFUNC = subfunc,
> )
>
> subfunc is like sfunc, except that the input value is "subtracted" from the
> current value. On each input row, the executor calls sfunc for all new
> tuples that enter the window frame, and subfunc for all tuples that exit it.
> Another difference is that finalfunc can be called many times during the
> lifecycle of an aggregate invocation. For example, the subfunc for COUNT
> would decrement the row count by one, and for SUM, subfunc would subtract
> the removed value from the current sum. Perhaps we should also support an
> "opportunistic subfunc", that could either perform the subtraction, or
> return a special value indicating that it can't be done, and we need to
> calculate the aggregate from scratch as if subfunc wasn't defined. That
> would be useful for MIN/MAX, which can be implemented by just keeping track
> of the current MIN/MAX value, and "subtracting" any other value than the
> current MIN/MAX is a no-op, but "subtracting" the current MIN/MAX value
> requires scanning all the rest of the tuples from scratch.

Hmmm, its name may be better as "invfunc" than subfunc. I feel
horrible in imagining to manage the combination of functions that
don't support subfunc, ones that does and ranking aggregates but it's
possible.

> PS. Have you looked at the "hypothetical set functions" in SQL2003?

I had a glance but not deeply, since I found I would be lost in design
if deeply diving into it. :)


-- 
Hitoshi Harada


Re: Window functions patch v04 for the September commit fest

From
"Hitoshi Harada"
Date:
2008/9/2 Gregory Stark <stark@enterprisedb.com>:
> And it doesn't answer how to deal with things like
>
> SELECT lag(x,1) OVER (ORDER BY a), lag(x,1) OVER (ORDER BY b)
>
> I, uh, don't actually have any ideas of how to deal with that one :(

For different windows we make different window nodes with different
sort nodes as the patch does.



Regards,


-- 
Hitoshi Harada


Re: Window functions patch v04 for the September commit fest

From
Heikki Linnakangas
Date:
David Fetter wrote:
> On Mon, Sep 01, 2008 at 09:00:47PM +0300, Heikki Linnakangas wrote:
>> Ok, I'm starting to read up on SQL2003 window functions,
> 
> Maybe it would be better to read the SQL2008 standard
> <http://wiscorp.com/sql200n.zip> :)

Ah, thanks!

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Window functions patch v04 for the September commit fest

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> David Fetter wrote:
>> On Mon, Sep 01, 2008 at 09:00:47PM +0300, Heikki Linnakangas wrote:
>>> Ok, I'm starting to read up on SQL2003 window functions,
>> 
>> Maybe it would be better to read the SQL2008 standard
>> <http://wiscorp.com/sql200n.zip> :)

> Ah, thanks!

Er, s/2008 standard/200n draft with uncertain chances of passage/

It's not like we haven't seen a SQL draft go down in flames before.
        regards, tom lane


Re: Window functions patch v04 for the September commit fest

From
Heikki Linnakangas
Date:
Gregory Stark wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> 
>> sfunc is called once for each input row. Unlike with normal aggregates, sfunc
>> is passed the whole input row, so that e.g RANK can compare it against the
>> previous row, or LEAD can buffer it.
> 
> I'm not sure I follow this bit about being passed the whole input row. How
> would that relate to something like lag(x*y, 2) for example?

Well, sfunc needs to be passed not only the whole input tuple, but also 
the values of the arguments.

Hmm. The spec says that the offset argument of lead and lag needs to be 
a numeric literal. To enforce that at parse time, we'd have to hard-code 
that into the grammar. Or we could check it at run-time, and throw an 
error if its value changes across invocations of the sfunc function.

>> outfunc is a set-returning function, and it is called until it returns no more
>> rows, after each sfunc invocation.
> 
> And in any case I feel like this is abstraction on the cheap. The only reason
> it's so general is because it's leaving all the work to the functions to
> implement. 
> 
> It also means we get no benefit from cases like
> 
> SELECT lag(x,5),lag(y,5),lag(z,5)
> 
> where the executor could keep one tuplestore and for all of them. For that
> matter it could keep one tuplestore even for cases like lag(x,5),lag(y,4).

True. I guess it comes down to how much complexity we're willing to have 
in the executor node to handle that more efficiently.

Note that the memory usage is the same either way, except for the extra 
constant overhead of multiple tuplestores, because each tuplestore 
inside the lag implementation only needs to store its own column.

> What would the executor do for a query like
> 
> SELECT lead(x,1),lead(y,2),lead(y,3)
> 
> It would not only have to keep a tuplestore to buffer the output but it would
> have to deal with receiving data from different SRFs at different times. The
> best approach I can think of would be to keep a tuplestore for each SRF using
> themas queues, reading old values from the head as soon as they all have at
> least one new value in them.

Hitoshi solved that creating a separate Window node for each window 
function. So the plan tree for that would look something like:

Window (lead(x,1))  Window (lead(y,2))    Window (lead(y,3))      Seq Scan ...

That keeps the Window node implementation quite simple because it only 
needs to handle one window function at a time.

> And it doesn't answer how to deal with things like
> 
> SELECT lag(x,1) OVER (ORDER BY a), lag(x,1) OVER (ORDER BY b)
> 
> I, uh, don't actually have any ideas of how to deal with that one :(

The above handles that by putting extra sort nodes in between the Window 
nodes. Not too efficient, but I don't see any way around it.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Window functions patch v04 for the September commit fest

From
David Fetter
Date:
On Tue, Sep 02, 2008 at 02:42:25AM -0400, Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> > David Fetter wrote:
> >> On Mon, Sep 01, 2008 at 09:00:47PM +0300, Heikki Linnakangas wrote:
> >>> Ok, I'm starting to read up on SQL2003 window functions,
> >> 
> >> Maybe it would be better to read the SQL2008 standard
> >> <http://wiscorp.com/sql200n.zip> :)
> 
> > Ah, thanks!
> 
> Er, s/2008 standard/200n draft with uncertain chances of passage/
> 
> It's not like we haven't seen a SQL draft go down in flames before.

Do you think that anything in the windowing functions section will
disappear?

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: Window functions patch v04 for the September commit fest

From
Tom Lane
Date:
David Fetter <david@fetter.org> writes:
> On Tue, Sep 02, 2008 at 02:42:25AM -0400, Tom Lane wrote:
>> It's not like we haven't seen a SQL draft go down in flames before.

> Do you think that anything in the windowing functions section will
> disappear?

Who's to say?

I have no objection to looking at the 2003 and 200n documents in
parallel, especially if there are places where 200n clarifies the
intent of 2003.  But I'd be suspicious of designing around
entirely-new features presented by 200n.
        regards, tom lane


Re: Window functions patch v04 for the September commit fest

From
Stefan Kaltenbrunner
Date:
Tom Lane wrote:
> David Fetter <david@fetter.org> writes:
>> On Tue, Sep 02, 2008 at 02:42:25AM -0400, Tom Lane wrote:
>>> It's not like we haven't seen a SQL draft go down in flames before.
> 
>> Do you think that anything in the windowing functions section will
>> disappear?
> 
> Who's to say?
> 
> I have no objection to looking at the 2003 and 200n documents in
> parallel, especially if there are places where 200n clarifies the
> intent of 2003.  But I'd be suspicious of designing around
> entirely-new features presented by 200n.

well http://www.wiscorp.com/SQLStandards.html

states:

"This points to the documents which wlll likely be the documents that 
represent the SQL 2008 Standard. These documents are out for 
International Standard ballot at this time. The vote is an Up/Down vote. 
No changes allowed."


Stefan


Re: Window functions patch v04 for the September commit fest

From
"Hitoshi Harada"
Date:
2008/9/2 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>:
> Gregory Stark wrote:
>> What would the executor do for a query like
>>
>> SELECT lead(x,1),lead(y,2),lead(y,3)
>>
>> It would not only have to keep a tuplestore to buffer the output but it
>> would
>> have to deal with receiving data from different SRFs at different times.
>> The
>> best approach I can think of would be to keep a tuplestore for each SRF
>> using
>> themas queues, reading old values from the head as soon as they all have
>> at
>> least one new value in them.
>
> Hitoshi solved that creating a separate Window node for each window
> function. So the plan tree for that would look something like:
>
> Window (lead(x,1))
>  Window (lead(y,2))
>    Window (lead(y,3))
>      Seq Scan ...
>
> That keeps the Window node implementation quite simple because it only needs
> to handle one window function at a time.

To say more accurately, one Window node can handle more than one
window function. If it is thought to be the same using equalFuncs
comparing targetlists, some functions are put into one node.

Regards,


-- 
Hitoshi Harada


Re: Window functions patch v04 for the September commit fest

From
Simon Riggs
Date:
On Sat, 2008-08-30 at 02:04 +0900, Hitoshi Harada wrote:

> Here's the latest window functions patch against HEAD. It seems to be
> ready for the September commit fest, as added documents, WINDOW clause
> feature and misc tests. I guess this would be the window functions
> feature freeze for 8.4. The remaining feature will be implemented for
> the later release.
> 
> This patch consists of:
> - windowed aggregates
> - cooperate with GROUP BY aggregates
> - some optimizations with multiple windows
> - ranking functions
> - WINDOW clause
> - windowing SQL regression tests
> - sgml documents
> 
> Looking up the total road map, the dropped features are:
> 
> - sliding window (window framing)
> - lead(), lag(), etc. that reach for random rows
> - user defined window functions
> 
> The first and second topics are difficult to implement currently.
> Because these features require random row access, it seems that
> tuplestore would be able to save multiple positions to mark/restore.
> This is fundamental change that is over my capability. Also, user
> defined window functions seem to have much more to decide. I think I
> can't put into shape the general needs of user's window functions now.
> Lacking these feature, this stage looks compatible to SQLServer 2005,
> while Oracle and DB2 have almost full of the specification.

If you've done all of that, then I'm impressed. Well done.

Few general comments

* The docs talk about "windowing functions", yet you talk about "window
functions" here. I think the latter is correct, but whichever we choose
we should be consistent (and hopefully matching SQL Standard).

* You don't use duplicate the examples from the docs into the tests,
which is always a good way to get conflicting reports from users. :-)

* The tests seem very light for such a huge range of new functionality.
(8 tests is hardly sufficient). I'd like to see a wide range of tests -
probably 5-10 times as many individual test statements. I would also
like to see test failures that illustrate the as-yet unimplemented
features and the warning messages that are thrown - this will help us
understand exactly what is missing also. It would also be useful to see
other common coding mistakes/misconceptions and the corresponding error
messages.

> Also, current implementation has only a type of plan which uses sort
> operation. It should be optimized by re-position the windows and/or
> using hashtable.

I would like to see some performance test results also. It would be good
to know whether they are fast/slow etc.. It will definitely help the
case for inclusion if they are faster than alternative multi-statement
approaches to solving the basic data access problems.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Window functions patch v04 for the September commit fest

From
Simon Riggs
Date:
On Mon, 2008-09-01 at 21:00 +0300, Heikki Linnakangas wrote:

> 1. It's important that what gets committed now can be extended to handle 
> all of the window function stuff in SQL2003 in the future, as well as 
> user-defined-window-functions in the spirit of PostgreSQL extensibility. 
> Even if we don't implement all of it in this release.

I think whatever public APIs get published now must be sufficient to
support user-defined-window functions across all future releases, so on
that point I agree completely. (One reason why I argued earlier in
favour of avoiding an API for now).

We shouldn't restrict the implementation of the internals to be upward
compatible though because I foresee some aspect of complexity stalling
and thus killing the patch in the short term if we do that. We don't
have much time left for this release.

If we only have the combined (brain * time) to get a partial
implementation in for this release then I would urge we go for that,
rather than wait for perfection - as long as there are no other negative
effects.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Window functions patch v04 for the September commit fest

From
Martijn van Oosterhout
Date:
On Tue, Sep 02, 2008 at 10:44:31AM +0100, Simon Riggs wrote:
> If we only have the combined (brain * time) to get a partial
> implementation in for this release then I would urge we go for that,
> rather than wait for perfection - as long as there are no other negative
> effects.

"premature optimization is the root of all evil." (Knuth, Donald.)

"make it work, then make it better".

Getting a partial implementation that works out now is far better than
waiting until its perfect.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Re: Window functions patch v04 for the September commit fest

From
Simon Riggs
Date:
On Tue, 2008-09-02 at 03:14 -0400, Tom Lane wrote:
> David Fetter <david@fetter.org> writes:
> > On Tue, Sep 02, 2008 at 02:42:25AM -0400, Tom Lane wrote:
> >> It's not like we haven't seen a SQL draft go down in flames before.
> 
> > Do you think that anything in the windowing functions section will
> > disappear?
> 
> Who's to say?
> 
> I have no objection to looking at the 2003 and 200n documents in
> parallel, especially if there are places where 200n clarifies the
> intent of 2003.  But I'd be suspicious of designing around
> entirely-new features presented by 200n.

I have confirmation from Michael Gorman, Wiscorp, that 

> The new standard was approved in early Summer. SQL 2008 is finished.

So as of now, SQL2008 exists, all hail. SQL2003 and earlier versions
have been superseded and can be ignored.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Window functions patch v04 for the September commit fest

From
"Hitoshi Harada"
Date:
2008/9/2 Simon Riggs <simon@2ndquadrant.com>:
> If you've done all of that, then I'm impressed. Well done.
>
> Few general comments
>
> * The docs talk about "windowing functions", yet you talk about "window
> functions" here. I think the latter is correct, but whichever we choose
> we should be consistent (and hopefully matching SQL Standard).

That's what I'm embarrassed. Now we have "window functions" meaning
two, one for generic name of window expressions and the other for
non-window-aggregates. It is a word play, which is difficult problem
for non-native people, but... let's use "window functions". I'll
revise sgml docs.

> * You don't use duplicate the examples from the docs into the tests,
> which is always a good way to get conflicting reports from users. :-)
>
> * The tests seem very light for such a huge range of new functionality.
> (8 tests is hardly sufficient). I'd like to see a wide range of tests -
> probably 5-10 times as many individual test statements. I would also
> like to see test failures that illustrate the as-yet unimplemented
> features and the warning messages that are thrown - this will help us
> understand exactly what is missing also. It would also be useful to see
> other common coding mistakes/misconceptions and the corresponding error
> messages.
>
>> Also, current implementation has only a type of plan which uses sort
>> operation. It should be optimized by re-position the windows and/or
>> using hashtable.
>
> I would like to see some performance test results also. It would be good
> to know whether they are fast/slow etc.. It will definitely help the
> case for inclusion if they are faster than alternative multi-statement
> approaches to solving the basic data access problems.
>

OK, thanks for your advices. I'll work on tests, docs and benchmarks,
then send in another patch in a week or so.

Regards,


-- 
Hitoshi Harada


Re: Window functions patch v04 for the September commit fest

From
Heikki Linnakangas
Date:
Hitoshi Harada wrote:
> 2008/9/2 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>:
> In my understanding, the "Window Frame" is defined
> by clauses such like "ROWS BETWEEN ... ", "RANGE BETWEEN ... " or so,
> contrast to "Window Partition" defined by "PARTITION BY" clause. A
> frame slides within a partition or there's only one frame if those
> clauses are not specified. The current patch has not implemented this
> yet. I'll update the docs.

Yes, that's how I read it as well. Another way to think of it is that 
there's always a window frame, but if no ROWS BETWEEN or similar clause 
is given, the window frame spans the whole partition (or from the 1st 
row of the partition, up to the current row, if ORDER BY is given).

> I don't like to call the second type "ranking aggregates" because it
> may refer to only ranking functions though there are more types of
> function like ntile(), lead() and lag(). But "window functions"
> doesn't seem appropriate either since it's ambiguous with the general
> name of window expressions.

Yep, confusing :-(. The SQL standard says that "a window function is one 
of: a rank function, a distribution function, a row number function, a 
window aggregate function, the ntile function, the lead function, the 
lag function, the first-value function, the last-value function, the 
nth-value function". So, window aggregate functions are a special class 
of window functions, and there's no term to refer to all the rest of the 
window functions excluding window aggregate functions.

Your doc        SQL spec
Window expression    Window function
Window function        Any window function other than a window aggregate function
Window aggregate    Window aggregate function

I tried to coin the term "ranking aggregate" for the SQL2008 term "Any 
window function other than a window aggregate function", but you're 
right that that's still confusing, because the SQL2008 term "rank 
function" includes only RANK() and DENSE_RANK().

The spec calls them "group aggregate functions", when they're used with 
GROUP BY, rather than as a window function. I think we could use that term.

> Your proposal is smarter than the current implementation. But it
> doesn't seem complete in some way. From logical point of view, the
> window functions should be allowed to access whole of rows in the
> frame the current row belongs to (c.f. inverse distribution
> functions). 

By the whole of rows, do you mean
a) the chosen value or expression of all rows, or
b) all columns of the input rows?

Different window functions have different needs. RANK() for example does 
need to see all columns, to compare them, but it only needs access to 
the previous and current row. CUME_DIST on the other hand needs access 
to all columns of all rows, and LAG needs access to a specific column of 
a fixed number of rows. And row_number needs nothing.

The needs of access to the rows are so different that it seems best to 
me to delegate the buffering to the window function.

Actually, looking closer to the ranking functions, they don't really 
need access to all columns. They just need to be able to compare them, 
according to the window ordering, so we should probably only provide 
access to the arguments of the aggregate, evaluated for any row in the 
frame, and a comparator function, that can determine if any given rows 
in the frame are <, = or >.

> From implementing and performance point of view, there
> need to consider such case like mixing window aggregates and ranking
> aggregates in the same window, which means it is better that the two
> types of functions are processed in the similar flow. Also logically,
> SQL spec doesn't forbid ranking aggregates in sliding window frames.

It doesn't? Oh, bummer. It seems we need a grand unified theory of 
ranking and other aggregates then :-(. How does something like RANK work 
with a window frame? What does it return, if the current row is excluded 
from the window frame, e.g with EXCLUDE CURRENT ROW?


Let's look at the trivial, generic, and slow implementation first, and 
then figure out what tricks we can do to make it faster. I gather that 
that generic algorithm, following how the SQL spec defines the window 
frame, looks like this:

0. Construct the ordered set of tuples, containing all the tuples in the 
partition.
For each row, start with the ordered set constructed in step 0, and:
1. Remove all tuples from the set that are before the start of the 
sliding frame
2. Remove all tuples from the set that are after the end of the sliding 
frame
3. Remove all tuples specified by the window-frame-exclusion clause 
(EXCLUDE CURRENT ROW/GROUP/TIES/NO OTHERS).
4. Run the aggregate over all tuples still in the set.

Now, the first obvious optimization is that we don't want to construct 
the window frame from scratch for each row. Instead:

0. Start with an empty ordered set S.
For each row:
1. Read forward until we hit the end of the new sliding frame, and add 
those rows to S.
2. Remove all tuples from S that are before the start of the sliding frame.
3. Construct a new set T, which is the same as S, but all tuples 
specified by the window-frame-exclusion clause (EXCLUDE CURRENT 
ROW/GROUP/TIES/NO OTHERS) are removed.
5. Run the aggregate over all tuples in T.

This is applicable to all ranking and window aggregates. Per spec, if no 
window-frame-clause is given, the start of the frame is always the 1st 
row in the partition. If an ORDER BY is given, the end of the frame is 
the the current row. Otherwise it's the end of partition, which means 
that we can just run the aggregate once and return the same value for 
all rows in the partition.

Now, let's consider how to speed up step 5. There's the difference 
between ranking and window aggregates. Window aggregates don't care 
about the position of the current row within the window frame, so the 
interface I proposed earlier, with the subfunc, would work fine. But for 
ranking aggregates, the position of the current row matters.

Another must-have optimization is that we must be able to eagerly 
discard rows that we know won't be accessed anymore. And likewise, we 
mustn't read ahead many rows that we don't need to calculate the 
aggregate for the current row.

I think I'm coming around to your suggestion of passing a Window object 
to the aggregate function, allowing random access to the current frame. 
As Simon pointed out, there needs to be a way for the aggregate function 
to signal when it doesn't need some rows in the frame anymore, for 
performance.

The sliding window frame always moves forward, never backwards. 
Therefore it's sufficient if the user-defined function can indicate a 
cutoff point, allowing the executor node to discard any preceding rows. 
However, the window-frame-exclusion clause complicates that, because the 
current row and its peers might or might not be in the frame, which 
means that rows can disappear and reappear into the middle of the window 
frame. We don't need to implement window-frame-exclusion in this 
release, but we should consider how that affects the API.

The API could be like this (in pseudo-syntax):

CREATE AGGREGATE name (    <type> evalfunc(Window object)    enterfunc(Window object, int pos)    exitfunc(Window
object,int pos)
 
)

where evalfunc, enterfunc and exitfunc are functions provided by the 
aggregate function author.

Window object {  /* Returns the position of current row within the frame. */  int get_current_pos();  /* Returns the
numberof rows in the frame. */  int get_max_pos();  /* Compares two tuple in the window frame, according to window 
 
ordering */  int compare(int posa, int posb)  /* Returns the value of an aggregate argument for tuple at given 
position */  Datum get_arg_value(int pos, int argno);  /* Promise that we won't call get_arg_value or compare for any
pos< 
 
cutoff_pos anymore */  void signal_cutoff(int cutoff_pos);
}

The executor calls enterfunc for each row that enters the window frame, 
and exitfunc for each row that exits the frame (before removing the
tuple from the frame, so that it's still accessible to get_arg_value() 
and compare()). After calling enter/exitfuncs for all rows that 
enter/exit the frame, evalfunc is called once, to return the value for 
the current row.

I believe this allows for an efficient implementation of all the window 
functions we've discussed. It works fine for implementing regular 
aggregates as well, so there no need to treat ranking and other 
aggregates any differently. We can completely replace the current method 
of defining aggregates with this if we want to, as long as we provide 
some sort of glue for backwards-compatibility with aggregates in 
external projects.

Here's sketches of some aggregate implementations using this API:

RANK()
------
enterfunc:  if position of the new row is > current row, do nothing. Otherwise, 
decrement current rank.
exitfunc:  if position of the new row is > current row, do nothing. Otherwise, 
increment current rank.

LEAD(offset)
------------
evalfunc: Call get_row(get_current_row() + offset). Call signal_cutoff(get_current_row());

LAG(offset)
-----------
evalfunc: Call get_row(get_current_row() - offset). Call signal_cutoff(get_current_row() - offset);

MIN/MAX
-------
enterfunc: if new value is </> the current min/max value, replace it with the new 
value
exitfunc: if removed value = the current min/max value, rescan the current frame 
for new min/max value, iterating through all values (1..get_max_row()).

COUNT
-----
enterfunc: increment counter
exitfunc: decrement counter

CUME_DIST
---------
CUME_DIST is defined as NP/NR, where NP is the number of rows preceding 
or peer with the current row, and NR is the total number of rows in the 
frame. NR = get_max_pos(). Maintain NP-counter in enterfunc/exitfunc, 
incrementing it whenever a row preceding current row enters or exits the 
frame. In evalfunc, increment the counter to reflect the new current row 
position.


> I feel horrible in imagining to manage the combination of functions that
> don't support subfunc, ones that does and ranking aggregates but it's
> possible.

Well, if you handle each window/ranking aggregate in a separate Window 
node, like you did, it should be much easier to manage. If it gets 
overwhelmingly complex, you could also split it into two slightly 
different Window nodes, one for the generic case, for aggregates that 
don't have subfunc, and one for ones that do.

(the point is moot, if we go with the API I proposed above, using a 
Window object)


All in all, this is such a monstrous piece of functionality, that if we 
try to implement everything in one patch, it'll never get finished. So 
we clearly have to implement this in phases, and just make sure we don't 
paint ourselves in the corner with the design.

I'd suggest:

1. Implement Window node, with the capability to invoke an aggregate 
function, using the above API. Implement required parser/planner 
changes. Implement a few simple ranking aggregates using the API.
2. Implement glue to call normal aggregates through the new API
3. Implement the optimization to drop rows that are no longer needed 
(signal_cutoff can be a no-op until this phase)
4. Implement window framing (the frame can always be all rows in the 
partition, or all rows until the current row, until this phase)
5. Expose the new API to user-defined aggregates. It can be an internal 
API only used by built-in functions until this phase

I believe you already have phase 1 in your patch, except for the API 
changes.

>> PS. Have you looked at the "hypothetical set functions" in SQL2003?
> 
> I had a glance but not deeply, since I found I would be lost in design
> if deeply diving into it. :)

Yeah ;-). A quick glance suggests that it won't affect aggregate the API 
we come up with, but will require changes to parser/executor.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Window functions patch v04 for the September commit fest

From
Heikki Linnakangas
Date:
Martijn van Oosterhout wrote:
> On Tue, Sep 02, 2008 at 10:44:31AM +0100, Simon Riggs wrote:
>> If we only have the combined (brain * time) to get a partial
>> implementation in for this release then I would urge we go for that,
>> rather than wait for perfection - as long as there are no other negative
>> effects.
> 
> "premature optimization is the root of all evil." (Knuth, Donald.)
> 
> "make it work, then make it better".
> 
> Getting a partial implementation that works out now is far better than
> waiting until its perfect.

Sure. Just have to watch out that we don't follow a dead-end, making it 
harder to add missing functionality later on.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Window functions patch v04 for the September commit fest

From
David Fetter
Date:
On Tue, Sep 02, 2008 at 12:42:45PM +0100, Simon Riggs wrote:
> 
> On Tue, 2008-09-02 at 03:14 -0400, Tom Lane wrote:
> > David Fetter <david@fetter.org> writes:
> > > On Tue, Sep 02, 2008 at 02:42:25AM -0400, Tom Lane wrote:
> > >> It's not like we haven't seen a SQL draft go down in flames
> > >> before.
> > 
> > > Do you think that anything in the windowing functions section
> > > will disappear?
> > 
> > Who's to say?
> > 
> > I have no objection to looking at the 2003 and 200n documents in
> > parallel, especially if there are places where 200n clarifies the
> > intent of 2003.  But I'd be suspicious of designing around
> > entirely-new features presented by 200n.
> 
> I have confirmation from Michael Gorman, Wiscorp, that 
> 
> > The new standard was approved in early Summer. SQL 2008 is
> > finished.
> 
> So as of now, SQL2008 exists, all hail. SQL2003 and earlier versions
> have been superseded and can be ignored.

Any chance we can buy a few copies of the official one for use on the
project?

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: Window functions patch v04 for the September commit fest

From
Simon Riggs
Date:
On Tue, 2008-09-02 at 09:35 -0700, David Fetter wrote:
> On Tue, Sep 02, 2008 at 12:42:45PM +0100, Simon Riggs wrote:
> > 
> > On Tue, 2008-09-02 at 03:14 -0400, Tom Lane wrote:
> > > David Fetter <david@fetter.org> writes:
> > > > On Tue, Sep 02, 2008 at 02:42:25AM -0400, Tom Lane wrote:
> > > >> It's not like we haven't seen a SQL draft go down in flames
> > > >> before.
> > > 
> > > > Do you think that anything in the windowing functions section
> > > > will disappear?
> > > 
> > > Who's to say?
> > > 
> > > I have no objection to looking at the 2003 and 200n documents in
> > > parallel, especially if there are places where 200n clarifies the
> > > intent of 2003.  But I'd be suspicious of designing around
> > > entirely-new features presented by 200n.
> > 
> > I have confirmation from Michael Gorman, Wiscorp, that 
> > 
> > > The new standard was approved in early Summer. SQL 2008 is
> > > finished.
> > 
> > So as of now, SQL2008 exists, all hail. SQL2003 and earlier versions
> > have been superseded and can be ignored.
> 
> Any chance we can buy a few copies of the official one for use on the
> project?

You have to buy them from ISO website I think, but it was $00s when I
looked some years back - we'd probably need a dozen copies at least
since we hardly ever meet. Michael's .pdf was the final version, so we
do actually have the final form even if the page formatting somewhat
different.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Window functions patch v04 for the September commit fest

From
"Neil Conway"
Date:
On Tue, Sep 2, 2008 at 9:35 AM, David Fetter <david@fetter.org> wrote:
> Any chance we can buy a few copies of the official one for use on the
> project?

AFAIK there is no significant difference between the "official"
standard and the draft version available online, so I don't see the
point.

Neil


Re: Window functions patch v04 for the September commit fest

From
"Hitoshi Harada"
Date:
2008/9/2 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>:
> Hitoshi Harada wrote:
>>
>> 2008/9/2 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>:
>> In my understanding, the "Window Frame" is defined
>> by clauses such like "ROWS BETWEEN ... ", "RANGE BETWEEN ... " or so,
>> contrast to "Window Partition" defined by "PARTITION BY" clause. A
>> frame slides within a partition or there's only one frame if those
>> clauses are not specified. The current patch has not implemented this
>> yet. I'll update the docs.
>
> Yes, that's how I read it as well. Another way to think of it is that
> there's always a window frame, but if no ROWS BETWEEN or similar clause is
> given, the window frame spans the whole partition (or from the 1st row of
> the partition, up to the current row, if ORDER BY is given).
>
>> I don't like to call the second type "ranking aggregates" because it
>> may refer to only ranking functions though there are more types of
>> function like ntile(), lead() and lag(). But "window functions"
>> doesn't seem appropriate either since it's ambiguous with the general
>> name of window expressions.
>
> Yep, confusing :-(. The SQL standard says that "a window function is one of:
> a rank function, a distribution function, a row number function, a window
> aggregate function, the ntile function, the lead function, the lag function,
> the first-value function, the last-value function, the nth-value function".
> So, window aggregate functions are a special class of window functions, and
> there's no term to refer to all the rest of the window functions excluding
> window aggregate functions.
>
> Your doc                SQL spec
> Window expression       Window function
> Window function         Any window function other than a window aggregate
> function
> Window aggregate        Window aggregate function
>
> I tried to coin the term "ranking aggregate" for the SQL2008 term "Any
> window function other than a window aggregate function", but you're right
> that that's still confusing, because the SQL2008 term "rank function"
> includes only RANK() and DENSE_RANK().
>
> The spec calls them "group aggregate functions", when they're used with
> GROUP BY, rather than as a window function. I think we could use that term.

Agree. So from now on, we use "window functions" for all kinds of
functions including window aggregates. "Window expression" is
discarded. "Window functions" also means the mechanism to support
these functions to process and this project.

>> Your proposal is smarter than the current implementation. But it
>> doesn't seem complete in some way. From logical point of view, the
>> window functions should be allowed to access whole of rows in the
>> frame the current row belongs to (c.f. inverse distribution
>> functions).
>
> By the whole of rows, do you mean
> a) the chosen value or expression of all rows, or
> b) all columns of the input rows?

a). I mean all input rows in a window frame. But later I found
"inverse distribution function" is not one of window functions. That
is actually one of aggregate functions. Forget about it.

>
> Different window functions have different needs. RANK() for example does
> need to see all columns, to compare them, but it only needs access to the
> previous and current row. CUME_DIST on the other hand needs access to all
> columns of all rows, and LAG needs access to a specific column of a fixed
> number of rows. And row_number needs nothing.
>
> The needs of access to the rows are so different that it seems best to me to
> delegate the buffering to the window function.

Delegating optimization to them depending on functions' needs is a
good idea. So executor can concentrate on the window function process
flow. Let's unify it in executor and let trivial optimizations get
into individual functions.

>
> Actually, looking closer to the ranking functions, they don't really need
> access to all columns. They just need to be able to compare them, according
> to the window ordering, so we should probably only provide access to the
> arguments of the aggregate, evaluated for any row in the frame, and a
> comparator function, that can determine if any given rows in the frame are
> <, = or >.

That is kind of problem. If your task is only to define that window
node executor simply stores window frame rows and pass them to window
functions as they need, the rank functions' needs don't come. As you
point out, rank functions need ordering key columns and its
comparators. So you have to imagine what comes next? What will be
wanted other than ordering key columns, if we think about "universe
window functions" much more than SQL spec says?

>
>> From implementing and performance point of view, there
>> need to consider such case like mixing window aggregates and ranking
>> aggregates in the same window, which means it is better that the two
>> types of functions are processed in the similar flow. Also logically,
>> SQL spec doesn't forbid ranking aggregates in sliding window frames.
>
> It doesn't? Oh, bummer. It seems we need a grand unified theory of ranking
> and other aggregates then :-(. How does something like RANK work with a
> window frame? What does it return, if the current row is excluded from the
> window frame, e.g with EXCLUDE CURRENT ROW?

It doesn't but implicitly does. Also it seems that the Oracle and
other RDMBSs do. Rank functions in sliding frame are possible because
if you take the current row the window frame can be defined and with
it aggregate. Also in implementing, without having catalog info about
which function can take sliding window, parser/planner/executor cannot
see how to forbid it. Have the info, otherwise we have to allow it.

> Let's look at the trivial, generic, and slow implementation first, and then
> figure out what tricks we can do to make it faster. I gather that that
> generic algorithm, following how the SQL spec defines the window frame,
> looks like this:

So as to satisfy all of window functions' needs, Window object does
well. But it is so heavy that we need to optimize as functions need,
by reducing stored rows and avoiding to call redundant functions.
Still I'm afraid if user can define such a complex function...

> Here's sketches of some aggregate implementations using this API:
>
> RANK()
> ------
> enterfunc:
>  if position of the new row is > current row, do nothing. Otherwise,
> decrement current rank.
> exitfunc:
>  if position of the new row is > current row, do nothing. Otherwise,
> increment current rank.

I don't see why the row's positions affect rank. The rank depends on
comparing two rows ordering columns, doen't it?

> I'd suggest:
>
> 1. Implement Window node, with the capability to invoke an aggregate
> function, using the above API. Implement required parser/planner changes.
> Implement a few simple ranking aggregates using the API.
> 2. Implement glue to call normal aggregates through the new API
> 3. Implement the optimization to drop rows that are no longer needed
> (signal_cutoff can be a no-op until this phase)
> 4. Implement window framing (the frame can always be all rows in the
> partition, or all rows until the current row, until this phase)
> 5. Expose the new API to user-defined aggregates. It can be an internal API
> only used by built-in functions until this phase
>
> I believe you already have phase 1 in your patch, except for the API
> changes.
>

I am willing to challenge to implement the API above, after maintain
the current patch adding docs and tests. Since the API includes
changes much more like Aggregate syntax than current patch, I'm not
sure if I can finish it by next commit fest, which is said to be
"feature freeze". For safety, remain the current patch to review
excluding API and executor then if I fail to finish use it for next
release. Git helps it by cutting a branch, does it? How do you think?

Regards,



-- 
Hitoshi Harada


Re: Window functions patch v04 for the September commit fest

From
Simon Riggs
Date:
On Tue, 2008-09-02 at 15:51 +0300, Heikki Linnakangas wrote:

> The needs of access to the rows are so different that it seems best to
> me to delegate the buffering to the window function.

That seems sensible in some ways, not others.

Some of the window functions, like lead and lag merely specify window
size and shape for other functions to act upon. For those types of
request I don't see any need for custom functions, whereas for the
comparison/calculation functions there might be a need.

We don't need to implement all the things the SQL Standard calls window
functions with a 1:1 mapping to Postgres functions.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Window functions patch v04 for the September commit fest

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Tue, 2008-09-02 at 15:51 +0300, Heikki Linnakangas wrote:
> 
>> The needs of access to the rows are so different that it seems best to
>> me to delegate the buffering to the window function.
> 
> That seems sensible in some ways, not others.

In the API I proposed later in that mail, the buffering is actually done 
by the executor node, not by the window function. Instead, the window 
function can request abitrary rows of the frame from the executor, and 
can signal that some rows are no longer required, allowing them to be 
discarded.

> Some of the window functions, like lead and lag merely specify window
> size and shape for other functions to act upon.

I don't  understand that. LEAD/LAG return a value. They don't affect the 
size or shape of the window in any way. It doesn't affect other functions.

> For those types of
> request I don't see any need for custom functions, whereas for the
> comparison/calculation functions there might be a need.
> 
> We don't need to implement all the things the SQL Standard calls window
> functions with a 1:1 mapping to Postgres functions.

Sure, we have special hacks for things like MIN/MAX already. But using 
PostgreSQL functions does seem like the simplest solution to me, as the 
backend code can get quite complex if we have to add special handling 
for different window functions. LEAD/LAG fall quite nicely into the 
framework I proposed, but if something comes along that doesn't, then 
we'll have to extend the framework or add a special case.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Window functions patch v04 for the September commit fest

From
Heikki Linnakangas
Date:
Hitoshi Harada wrote:
> 2008/9/2 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>:
>> Hitoshi Harada wrote:
>>> 2008/9/2 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>:
>>> In my understanding, the "Window Frame" is defined
>>> by clauses such like "ROWS BETWEEN ... ", "RANGE BETWEEN ... " or so,
>>> contrast to "Window Partition" defined by "PARTITION BY" clause. A
>>> frame slides within a partition or there's only one frame if those
>>> clauses are not specified. The current patch has not implemented this
>>> yet. I'll update the docs.
>> Yes, that's how I read it as well. Another way to think of it is that
>> there's always a window frame, but if no ROWS BETWEEN or similar clause is
>> given, the window frame spans the whole partition (or from the 1st row of
>> the partition, up to the current row, if ORDER BY is given).
>>
>>> I don't like to call the second type "ranking aggregates" because it
>>> may refer to only ranking functions though there are more types of
>>> function like ntile(), lead() and lag(). But "window functions"
>>> doesn't seem appropriate either since it's ambiguous with the general
>>> name of window expressions.
>> Yep, confusing :-(. The SQL standard says that "a window function is one of:
>> a rank function, a distribution function, a row number function, a window
>> aggregate function, the ntile function, the lead function, the lag function,
>> the first-value function, the last-value function, the nth-value function".
>> So, window aggregate functions are a special class of window functions, and
>> there's no term to refer to all the rest of the window functions excluding
>> window aggregate functions.
>>
>> Your doc                SQL spec
>> Window expression       Window function
>> Window function         Any window function other than a window aggregate
>> function
>> Window aggregate        Window aggregate function
>>
>> I tried to coin the term "ranking aggregate" for the SQL2008 term "Any
>> window function other than a window aggregate function", but you're right
>> that that's still confusing, because the SQL2008 term "rank function"
>> includes only RANK() and DENSE_RANK().
>>
>> The spec calls them "group aggregate functions", when they're used with
>> GROUP BY, rather than as a window function. I think we could use that term.
> 
> Agree. So from now on, we use "window functions" for all kinds of
> functions including window aggregates. "Window expression" is
> discarded. "Window functions" also means the mechanism to support
> these functions to process and this project.
> 
>>> Your proposal is smarter than the current implementation. But it
>>> doesn't seem complete in some way. From logical point of view, the
>>> window functions should be allowed to access whole of rows in the
>>> frame the current row belongs to (c.f. inverse distribution
>>> functions).
>> By the whole of rows, do you mean
>> a) the chosen value or expression of all rows, or
>> b) all columns of the input rows?
> 
> a). I mean all input rows in a window frame. But later I found
> "inverse distribution function" is not one of window functions. That
> is actually one of aggregate functions. Forget about it.
> 
>> Different window functions have different needs. RANK() for example does
>> need to see all columns, to compare them, but it only needs access to the
>> previous and current row. CUME_DIST on the other hand needs access to all
>> columns of all rows, and LAG needs access to a specific column of a fixed
>> number of rows. And row_number needs nothing.
>>
>> The needs of access to the rows are so different that it seems best to me to
>> delegate the buffering to the window function.
> 
> Delegating optimization to them depending on functions' needs is a
> good idea. So executor can concentrate on the window function process
> flow. Let's unify it in executor and let trivial optimizations get
> into individual functions.
> 
>> Actually, looking closer to the ranking functions, they don't really need
>> access to all columns. They just need to be able to compare them, according
>> to the window ordering, so we should probably only provide access to the
>> arguments of the aggregate, evaluated for any row in the frame, and a
>> comparator function, that can determine if any given rows in the frame are
>> <, = or >.
> 
> That is kind of problem. If your task is only to define that window
> node executor simply stores window frame rows and pass them to window
> functions as they need, the rank functions' needs don't come. As you
> point out, rank functions need ordering key columns and its
> comparators. So you have to imagine what comes next? What will be
> wanted other than ordering key columns, if we think about "universe
> window functions" much more than SQL spec says?

It might be a good idea to google around what window functions other 
DBMSs support, and see if this scheme could support all of them. I 
couldn't find any that it couldn't, but I didn't look very hard.

>> Let's look at the trivial, generic, and slow implementation first, and then
>> figure out what tricks we can do to make it faster. I gather that that
>> generic algorithm, following how the SQL spec defines the window frame,
>> looks like this:
> 
> So as to satisfy all of window functions' needs, Window object does
> well. But it is so heavy that we need to optimize as functions need,
> by reducing stored rows and avoiding to call redundant functions.
> Still I'm afraid if user can define such a complex function...

An average user probably can't. Creating a regular aggregate using the 
current method is not exactly trivial either.

At the moment, I'm more worried about finding a good abstraction to use 
within the backend, and worry about exposing it to user-defined 
functions later.

>> Here's sketches of some aggregate implementations using this API:
>>
>> RANK()
>> ------
>> enterfunc:
>>  if position of the new row is > current row, do nothing. Otherwise,
>> decrement current rank.
>> exitfunc:
>>  if position of the new row is > current row, do nothing. Otherwise,
>> increment current rank.
> 
> I don't see why the row's positions affect rank. The rank depends on
> comparing two rows ordering columns, doen't it?

Yes. Imagine a window frame like this (e.g. from "ROWS BETWEEN 2 
PRECEDING AND 2 FOLLOWING"):

10
20
30 <-- current row
40
50

RANK() doesn't care about rows that enter frame, that are after current 
row. For example, if 60 enters the frame:

10
20
30 <-- current row
40
50
60

The RANK() of the current row is still 3, as it was before. RANK() only 
advances when the current row is advanced, or new rows that precede the 
current row enter the frame, e.g if a peer of the current row enters the 
frame, which is possible with a window-frame-exclusion clause:

10
20
30
30 <-- current row
40
50
60

> 
>> I'd suggest:
>>
>> 1. Implement Window node, with the capability to invoke an aggregate
>> function, using the above API. Implement required parser/planner changes.
>> Implement a few simple ranking aggregates using the API.
>> 2. Implement glue to call normal aggregates through the new API
>> 3. Implement the optimization to drop rows that are no longer needed
>> (signal_cutoff can be a no-op until this phase)
>> 4. Implement window framing (the frame can always be all rows in the
>> partition, or all rows until the current row, until this phase)
>> 5. Expose the new API to user-defined aggregates. It can be an internal API
>> only used by built-in functions until this phase
>>
>> I believe you already have phase 1 in your patch, except for the API
>> changes.
> 
> I am willing to challenge to implement the API above, after maintain
> the current patch adding docs and tests. Since the API includes
> changes much more like Aggregate syntax than current patch, I'm not
> sure if I can finish it by next commit fest, which is said to be
> "feature freeze". For safety, remain the current patch to review
> excluding API and executor then if I fail to finish use it for next
> release. Git helps it by cutting a branch, does it? How do you think?

We do allow changes to the user manual after the feature freeze, so I'd 
suggest concentrating on the code and tests first. Code comments and 
internal docs are important, though, for easy review.

I'm sure we won't get all the way to phase 5 for 8.4, but if we can even 
get 1-3, plus some of the most important window functions, this this 
will be a great release!

I'll review the parser/planner changes from the current patch.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Window functions patch v04 for the September commit fest

From
Simon Riggs
Date:
On Wed, 2008-09-03 at 09:51 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Tue, 2008-09-02 at 15:51 +0300, Heikki Linnakangas wrote:
> > 
> >> The needs of access to the rows are so different that it seems best to
> >> me to delegate the buffering to the window function.
> > 
> > That seems sensible in some ways, not others.
> 
> In the API I proposed later in that mail, the buffering is actually done 
> by the executor node, not by the window function. Instead, the window 
> function can request abitrary rows of the frame from the executor, and 
> can signal that some rows are no longer required, allowing them to be 
> discarded.

I'm happy with that.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Window functions patch v04 for the September commit fest

From
"Hitoshi Harada"
Date:
2008/9/3 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>:
> Hitoshi Harada wrote:
>>
>>> I'd suggest:
>>>
>>> 1. Implement Window node, with the capability to invoke an aggregate
>>> function, using the above API. Implement required parser/planner changes.
>>> Implement a few simple ranking aggregates using the API.
>>> 2. Implement glue to call normal aggregates through the new API
>>> 3. Implement the optimization to drop rows that are no longer needed
>>> (signal_cutoff can be a no-op until this phase)
>>> 4. Implement window framing (the frame can always be all rows in the
>>> partition, or all rows until the current row, until this phase)
>>> 5. Expose the new API to user-defined aggregates. It can be an internal
>>> API
>>> only used by built-in functions until this phase
>>>
>>> I believe you already have phase 1 in your patch, except for the API
>>> changes.
>>
>> I am willing to challenge to implement the API above, after maintain
>> the current patch adding docs and tests. Since the API includes
>> changes much more like Aggregate syntax than current patch, I'm not
>> sure if I can finish it by next commit fest, which is said to be
>> "feature freeze". For safety, remain the current patch to review
>> excluding API and executor then if I fail to finish use it for next
>> release. Git helps it by cutting a branch, does it? How do you think?
>
> We do allow changes to the user manual after the feature freeze, so I'd
> suggest concentrating on the code and tests first. Code comments and
> internal docs are important, though, for easy review.
>
> I'm sure we won't get all the way to phase 5 for 8.4, but if we can even get
> 1-3, plus some of the most important window functions, this this will be a
> great release!

OK, so first tests and internal docs/comments, then comes trying to
catch API , finally docs.

BTW, I think it is better to put together the discussion points we
have done as "general roadmap to complete window functions". It is not
about the features for the next release but is the complete tasks.
Where to go? Wiki, or my design docs?

Regards,


-- 
Hitoshi Harada


Re: Window functions patch v04 for the September commit fest

From
Heikki Linnakangas
Date:
Hitoshi Harada wrote:
> BTW, I think it is better to put together the discussion points we
> have done as "general roadmap to complete window functions". It is not
> about the features for the next release but is the complete tasks.
> Where to go? Wiki, or my design docs?

That's up to you, really. I like your design docs page, but you're free 
to use the Wiki as well, of course.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Window functions patch v04 for the September commit fest

From
Heikki Linnakangas
Date:
Heikki Linnakangas wrote:
> I'll review the parser/planner changes from the current patch.

Looks pretty sane to me. Few issues:

Is it always OK to share a window between two separate window function 
invocations, if they both happen to have identical OVER clause? It seems 
OK for stable functions, but I'm not sure that's correct for expressions 
involving volatile functions. I wonder if the SQL spec has anything to 
say about that.

As you noted in the comments, the planner could be a lot smarter about 
avoiding sorts. Currently you just always put a sort node below each 
Window, and another on top of them if there's an ORDER BY. There's 
obviously a lot of room for improvement there.

This line:
>         result_plan->targetlist = preprocess_window(tlist, result_plan);
in planner.c, won't work if result_plan is one that can't do projection.  A few screenfuls above that call, there's
this:
>                 /*
>                  * If the top-level plan node is one that cannot do expression
>                  * evaluation, we must insert a Result node to project the
>                  * desired tlist.
>                  */
>                 if (!is_projection_capable_plan(result_plan))
>                 {
>                     result_plan = (Plan *) make_result(root,
>                                                        sub_tlist,
>                                                        NULL,
>                                                        result_plan);
>                 }
>                 else
>                 {
>                     /*
>                      * Otherwise, just replace the subplan's flat tlist with
>                      * the desired tlist.
>                      */
>                     result_plan->targetlist = sub_tlist;
>                 }
which is what you need to do with the preprocess_window call as well. I 
think this is caused by that:
postgres=# explain SELECT row_number() OVER (ORDER BY id*10) FROM 
(SELECT * FROM foo UNION ALL SELECT * FROM foo) sq;
ERROR:  bogus varattno for OUTER var: 2

And then there's these:

postgres=# explain SELECT * FROm foo WHERE row_number() OVER (ORDER BY 
id) > 10;
ERROR:  winaggref found in non-Window plan node
postgres=# explain UPDATE foo SET id = 10 RETURNING (ROW_NUMBER() OVER 
(ORDER BY random())) ;
ERROR:  winaggref found in non-Window plan node
postgres=# explain SELECT row_number() OVER (ORDER BY (1)) FROm foo;
ERROR:  ORDER/GROUP BY expression not found in targetlist
postgres=# explain SELECT row_number() OVER (ORDER BY length('foo')) 
FROM foo;
ERROR:  could not find pathkey item to sort

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Window functions patch v04 for the September commit fest

From
"Hitoshi Harada"
Date:
2008/9/5 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>:
> Heikki Linnakangas wrote:
>>
>> I'll review the parser/planner changes from the current patch.
>
> Looks pretty sane to me. Few issues:
>
> Is it always OK to share a window between two separate window function
> invocations, if they both happen to have identical OVER clause? It seems OK
> for stable functions, but I'm not sure that's correct for expressions
> involving volatile functions. I wonder if the SQL spec has anything to say
> about that.

It may be here:

---quote---
In general, two <window function>s are computed independently, each
one performing its own sort of its data, even if they use the same
data and the same <sort specification list>. Since sorts may specify
partial orderings,
the computation of <window function>s is inevitably non-deterministic
to the extent that the ordering is not total. Nevertheless, the user
may desire that two <window function>s be computed using the same
ordering, so that, for example, two moving aggregates move through the
rows of a partition in precisely the same order.

Two <window function>s are computed using the same (possibly
non-deterministic) window ordering of the rows if any of the following
are true:

― The <window function>s identify the same window structure descriptor.

― The <window function>s' window structure descriptors have window
partitioning clauses that enumerate the same number of column
references, and those column references are pairwise equivalent in
their order
of occurrence; and their window structure descriptors have window
ordering clauses with the same number of <sort key>s, and those <sort
key>s are all column references, and those column references are
pairwise equivalent in their order of occurrence, and the <sort
specification>s pairwise specify or imply <collate clause>s that
specify equivalent <collation name>s, the same <ordering
specification> (ASC or DESC), and the same <null ordering> (NULLS
FIRST or NULLS LAST).

― The window structure descriptor of one <window function> is the
ordering window of the other <window function>, or both window
structure descriptors identify the same ordering window.
/---quote---

But it doesn't say anything about volatile functions. Do you have
example that is bad with the current design?

The other issuses are OK. I missed those cases. will fix them.

Regards,


-- 
Hitoshi Harada


Re: Window functions patch v04 for the September commit fest

From
"Hitoshi Harada"
Date:
>> Also, current implementation has only a type of plan which uses sort
>> operation. It should be optimized by re-position the windows and/or
>> using hashtable.
>
> I would like to see some performance test results also. It would be good
> to know whether they are fast/slow etc.. It will definitely help the
> case for inclusion if they are faster than alternative multi-statement
> approaches to solving the basic data access problems.
>

Just for the report, I attach the result I have tested today. You see
the result says the current window function is faster than
sort-operated self-join and slower than hashagg-operated self-join.

This test is on the Redhat Linux ES3 Xeon 2.13GHz with 100,000 rows 2
column integers. I wrote simple perl script using psql invoking the
shell so it may contain the invocation overhead overall.


test0    test1    test2    test3    test4    test5
------------------------------------------------------------
689.502    416.633    257.970    1195.294    954.318    1204.292
687.254    447.676    256.629    1075.342    949.711    1154.754
700.602    421.818    260.742    1105.680    926.462    1203.012
736.594    476.388    334.310    1157.818    978.861    1199.944
676.572    418.782    270.270    1060.900    909.474    1175.079
687.260    428.564    257.032    1069.013    1045.387    1275.988
700.252    429.289    263.216    1074.749    1018.968    1273.965
719.478    445.218    258.464    1087.932    1015.744    1273.637
694.865    453.737    261.286    1065.229    1039.941    1262.208
685.756    430.169    258.017    1124.795    1102.055    1297.603
------------------------------------------------------------
697.81    436.83    267.79    1101.68    994.09    1232.05

test0    SELECT sum(amount) OVER (PARTITION BY sector) FROM bench1;
test1    SELECT amount FROM bench1 ORDER BY sector;
test2    SELECT sum(amount) FROM bench1 GROUP BY sector;
test3    SELECT id, amount - avg(amount) OVER (PARTITION BY sector) FROM bench1;
test4    SELECT id, amount - avg FROM bench1 INNER JOIN(SELECT sector,
avg(amount) FROM bench1 GROUP BY sector)t USING(sector)
test5    SET enable_hashagg TO off; SELECT id, amount - avg FROM bench1
INNER JOIN(SELECT sector, avg(amount) FROM bench1 GROUP BY sector)t
USING(sector)

I'll include this test in my docs later.

Regards,


-- 
Hitoshi Harada