Re: Window functions patch v04 for the September commit fest - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Window functions patch v04 for the September commit fest
Date
Msg-id 48BC2DCF.2090103@enterprisedb.com
Whole thread Raw
In response to Re: Window functions patch v04 for the September commit fest  ("Hitoshi Harada" <umi.tanuki@gmail.com>)
Responses Re: Window functions patch v04 for the September commit fest  (David Fetter <david@fetter.org>)
Re: Window functions patch v04 for the September commit fest  (Gregory Stark <stark@enterprisedb.com>)
Re: Window functions patch v04 for the September commit fest  ("Hitoshi Harada" <umi.tanuki@gmail.com>)
Re: Window functions patch v04 for the September commit fest  (Simon Riggs <simon@2ndQuadrant.com>)
List pgsql-hackers
Hitoshi Harada wrote:
> 2008/8/30 Hitoshi Harada <umi.tanuki@gmail.com>:
>> Here's the latest window functions patch against HEAD. It seems to be
>> ready for the September commit fest, as added documents, WINDOW clause
>> feature and misc tests. I guess this would be the window functions
>> feature freeze for 8.4. The remaining feature will be implemented for
>> the later release.
>>
>> This patch consists of:
>> - windowed aggregates
>> - cooperate with GROUP BY aggregates
>> - some optimizations with multiple windows
>> - ranking functions
>> - WINDOW clause
>> - windowing SQL regression tests
>> - sgml documents
>>
>> Looking up the total road map, the dropped features are:
>>
>> - sliding window (window framing)
>> - lead(), lag(), etc. that reach for random rows
>> - user defined window functions

Ok, I'm starting to read up on SQL2003 window functions, and to review 
this patch. Here's a long brain-dump of my first thoughts on the design.

Let's list a couple of requirements first:

1. It's important that what gets committed now can be extended to handle 
all of the window function stuff in SQL2003 in the future, as well as 
user-defined-window-functions in the spirit of PostgreSQL extensibility. 
Even if we don't implement all of it in this release.

2. Performance needs to be reasonable, in the face a large number of 
input rows. These are typically used in OLAP queries that process 
gazillions of rows. Some window functions need access to all rows in the 
window frame or partition, so you can't reasonably expect great 
performance with those if the working set is large, but those that 
don't, should work with finite memory requirements, and reasonably fast.

Because of 1., let's focus on the interface for 
user-defined-window-functions first. Ideally, the interface is such that 
all the window functions and aggregates defined in SQL2003, and others 
we might want to have in core or as pgfoundry projects, can be 
implemented using it.

I think there's still some confusion in the design document about what a 
frame is. The terminology section is great, it helped me a lot to get 
started, so thank you for that. However, in the "Actual design in the 
patch" you describe how a window function works, and you talk about a 
window frame, but elsewhere in the doc you note that window frames are 
not implemented yet.

As already discussed, there's two different kind of window expressions: 
window aggregates, and ranking aggregates (you call them window 
functions in your doc). It seems that they are different enough that we 
can't bang them together. For example, you can't use a window frame 
clause with ranking aggregates, and ranking aggregates need to see the 
whole row, whereas window aggregates only see the expressions passed as 
argument.

API for window functions (aka. ranking aggregate functions)
------------------------

Borrowing ideas from the many approaches listed in the design doc, I 
propose an interface using a set-returning function. Similar to Simon's 
proposal, but taking into account that a ranking function might need to 
see some or all input tuples before returning anything.

For each input tuple, the window function can return any number of 
tuples, including 0. After processing all input values, though, the 
total number of output rows returned must match the total number of 
input values fed into it. The function must eventually output one value 
for each input value, in the same order.

This is a flexible design that allows for different kind of 
implementations; the function can return one output tuple for each input 
tuple (ROW_NUMBER(), RANK() etc.), or it can swallow all input tuples 
first, and only then return all output tuples (e.g. CUME_DIST).

Here's a couple of step-by-step examples of how some functions would 
work. Imagine input values 1, 3, 4, 4, 5:

RANK
----

Invocation    Input    Value returned    Internal state (after)
1        1    1        (last value 1, count 1, rank=1)
2        3    2        (last value 2, count 1, rank=2)
3        4    3        (last value 4, count 1, rank=3)
4        4    3        (last value 4, count 2, rank=3)
5        5    5        (last value 5, count 1, rank=5)

So RANK returns one tuple after each input tuple. Just like in your 
current patch, keep the last value returned, a row count, and the 
current rank as state.

LAG
---

Internal state is a FIFO queue of values. Each input value is pushed to 
the FIFO, and the tuple that falls out of the queue is returned.

For example, LAG(<col>,2,0):

Invocation    Input row    Value returned    Internal state (after)
1        1        0        (1)
2        3        0        (3, 1)
3        4        1        (4, 3)
4        4        3        (4, 4)
5        5        4        (5, 4)

LEAD
----

Return nothing for first <offset> input tuples. Then return the current 
input tuple for the rest of the input tuples, and after the last input 
tuple, return <offset> number of default values.

For example, LEAD(<col>,2,0):

Invocation    Input row    Value returned    Internal state (after)
1        1        <none>        (offsetbegin = 1, pad=2)
2        3        <none>        (offsetbegin = 0, pad=2)
3        4        4        (offsetbegin = 0, pad=2)
4        4        4        (offsetbegin = 0, pad=2)
5        5        5        (offsetbegin = 0, pad=2)
6        <none>        0        (offsetbegin = 0, pad=1)
7        <none>        0        (offsetbegin = 0, pad=0)


For functions like LEAD or CUME_DIST that don't immediately start 
returning values, the executor will need a tuplestore to buffer input 
rows until it gets their values from the window function.


The syntax for creating a ranking aggregate might look something like this:

CREATE RANKING AGGREGATE name (    SFUNC = sfunc,    STYPE = state_data_type,    OUTFUNC = outfunc
)

This is very similar to Simon's proposal, and to current CREATE 
AGGREGATE, but tweaked a bit to fix the problems Hitoshi pointed out.

sfunc is called once for each input row. Unlike with normal aggregates, 
sfunc is passed the whole input row, so that e.g RANK can compare it 
against the previous row, or LEAD can buffer it.

outfunc is a set-returning function, and it is called until it returns 
no more rows, after each sfunc invocation.

Window aggregates
-----------------

Like normal aggregates, window aggregates process N input values, and 
return one output value. In normal GROUP BY expressions, the aggregate 
function is called once each group, but in a window aggregate expression 
with a window frame (aka sliding window), there is as many "groups" as 
there is rows.

In fact, any normal aggregate function can also be used as a window 
aggregate and vice versa. The trivial implementation is to have a buffer 
to hold all values currently in the window frame, and for each input 
row, invoke the aggregate function over all the rows currently in the 
frame. I think we'll need support that, to allow using any built-in or 
user-defined aggregate as a window aggregate.

However, we also want to provide optimized versions of the common 
aggregates. Aggregates like COUNT, SUM, and AVG can easily be 
implemented more efficiently, without rescanning all the tuples in the 
group.

I propose an extension to the current CREATE AGGREGATE syntax to allow 
for optimized versions. Currently, the syntax is like:

CREATE AGGREGATE name ( input_data_type [ , ... ] ) (    SFUNC = sfunc,    STYPE = state_data_type    [ , FINALFUNC =
ffunc]    [ , INITCOND = initial_condition ]    [ , SORTOP = sort_operator ]
 
)

Let's add a function to allow removing tuples from the current window frame:

CREATE AGGREGATE name ( input_data_type [ , ... ] ) (    SFUNC = sfunc,    STYPE = state_data_type    [ , FINALFUNC =
ffunc]    [ , INITCOND = initial_condition ]    [ , SORTOP = sort_operator ]    [ , SUBTRACTFUNC = subfunc,
 
)

subfunc is like sfunc, except that the input value is "subtracted" from 
the current value. On each input row, the executor calls sfunc for all 
new tuples that enter the window frame, and subfunc for all tuples that 
exit it. Another difference is that finalfunc can be called many times 
during the lifecycle of an aggregate invocation. For example, the 
subfunc for COUNT would decrement the row count by one, and for SUM,
subfunc would subtract the removed value from the current sum. Perhaps 
we should also support an "opportunistic subfunc", that could either 
perform the subtraction, or return a special value indicating that it 
can't be done, and we need to calculate the aggregate from scratch as if 
subfunc wasn't defined. That would be useful for MIN/MAX, which can be 
implemented by just keeping track of the current MIN/MAX value, and 
"subtracting" any other value than the current MIN/MAX is a no-op, but 
"subtracting" the current MIN/MAX value requires scanning all the rest 
of the tuples from scratch.

PS. Have you looked at the "hypothetical set functions" in SQL2003?

PPS. I was just about to send this, when I noticed that LEAD/LAG are in 
fact not in the SQL2003 draft I have at hand. In fact, none of the 
window functions defined there, RANK, DENSE_RANK, PERCENT_RANK, 
CUME_DIST and ROW_NUMBER, require random access to the tuples. Still 
seems like a good idea to be have the flexibility for them.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


pgsql-hackers by date:

Previous
From: Stephen Frost
Date:
Subject: Re: Extending grant insert on tables to sequences
Next
From: "Ryan Bradetich"
Date:
Subject: Re: [Patch Review] TRUNCATE Permission