Re: Window functions patch v04 for the September commit fest - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Window functions patch v04 for the September commit fest
Date
Msg-id 48BD36BD.2070209@enterprisedb.com
Whole thread Raw
In response to Re: Window functions patch v04 for the September commit fest  ("Hitoshi Harada" <umi.tanuki@gmail.com>)
Responses Re: Window functions patch v04 for the September commit fest  ("Hitoshi Harada" <umi.tanuki@gmail.com>)
Re: Window functions patch v04 for the September commit fest  (Simon Riggs <simon@2ndQuadrant.com>)
List pgsql-hackers
Hitoshi Harada wrote:
> 2008/9/2 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>:
> In my understanding, the "Window Frame" is defined
> by clauses such like "ROWS BETWEEN ... ", "RANGE BETWEEN ... " or so,
> contrast to "Window Partition" defined by "PARTITION BY" clause. A
> frame slides within a partition or there's only one frame if those
> clauses are not specified. The current patch has not implemented this
> yet. I'll update the docs.

Yes, that's how I read it as well. Another way to think of it is that 
there's always a window frame, but if no ROWS BETWEEN or similar clause 
is given, the window frame spans the whole partition (or from the 1st 
row of the partition, up to the current row, if ORDER BY is given).

> I don't like to call the second type "ranking aggregates" because it
> may refer to only ranking functions though there are more types of
> function like ntile(), lead() and lag(). But "window functions"
> doesn't seem appropriate either since it's ambiguous with the general
> name of window expressions.

Yep, confusing :-(. The SQL standard says that "a window function is one 
of: a rank function, a distribution function, a row number function, a 
window aggregate function, the ntile function, the lead function, the 
lag function, the first-value function, the last-value function, the 
nth-value function". So, window aggregate functions are a special class 
of window functions, and there's no term to refer to all the rest of the 
window functions excluding window aggregate functions.

Your doc        SQL spec
Window expression    Window function
Window function        Any window function other than a window aggregate function
Window aggregate    Window aggregate function

I tried to coin the term "ranking aggregate" for the SQL2008 term "Any 
window function other than a window aggregate function", but you're 
right that that's still confusing, because the SQL2008 term "rank 
function" includes only RANK() and DENSE_RANK().

The spec calls them "group aggregate functions", when they're used with 
GROUP BY, rather than as a window function. I think we could use that term.

> Your proposal is smarter than the current implementation. But it
> doesn't seem complete in some way. From logical point of view, the
> window functions should be allowed to access whole of rows in the
> frame the current row belongs to (c.f. inverse distribution
> functions). 

By the whole of rows, do you mean
a) the chosen value or expression of all rows, or
b) all columns of the input rows?

Different window functions have different needs. RANK() for example does 
need to see all columns, to compare them, but it only needs access to 
the previous and current row. CUME_DIST on the other hand needs access 
to all columns of all rows, and LAG needs access to a specific column of 
a fixed number of rows. And row_number needs nothing.

The needs of access to the rows are so different that it seems best to 
me to delegate the buffering to the window function.

Actually, looking closer to the ranking functions, they don't really 
need access to all columns. They just need to be able to compare them, 
according to the window ordering, so we should probably only provide 
access to the arguments of the aggregate, evaluated for any row in the 
frame, and a comparator function, that can determine if any given rows 
in the frame are <, = or >.

> From implementing and performance point of view, there
> need to consider such case like mixing window aggregates and ranking
> aggregates in the same window, which means it is better that the two
> types of functions are processed in the similar flow. Also logically,
> SQL spec doesn't forbid ranking aggregates in sliding window frames.

It doesn't? Oh, bummer. It seems we need a grand unified theory of 
ranking and other aggregates then :-(. How does something like RANK work 
with a window frame? What does it return, if the current row is excluded 
from the window frame, e.g with EXCLUDE CURRENT ROW?


Let's look at the trivial, generic, and slow implementation first, and 
then figure out what tricks we can do to make it faster. I gather that 
that generic algorithm, following how the SQL spec defines the window 
frame, looks like this:

0. Construct the ordered set of tuples, containing all the tuples in the 
partition.
For each row, start with the ordered set constructed in step 0, and:
1. Remove all tuples from the set that are before the start of the 
sliding frame
2. Remove all tuples from the set that are after the end of the sliding 
frame
3. Remove all tuples specified by the window-frame-exclusion clause 
(EXCLUDE CURRENT ROW/GROUP/TIES/NO OTHERS).
4. Run the aggregate over all tuples still in the set.

Now, the first obvious optimization is that we don't want to construct 
the window frame from scratch for each row. Instead:

0. Start with an empty ordered set S.
For each row:
1. Read forward until we hit the end of the new sliding frame, and add 
those rows to S.
2. Remove all tuples from S that are before the start of the sliding frame.
3. Construct a new set T, which is the same as S, but all tuples 
specified by the window-frame-exclusion clause (EXCLUDE CURRENT 
ROW/GROUP/TIES/NO OTHERS) are removed.
5. Run the aggregate over all tuples in T.

This is applicable to all ranking and window aggregates. Per spec, if no 
window-frame-clause is given, the start of the frame is always the 1st 
row in the partition. If an ORDER BY is given, the end of the frame is 
the the current row. Otherwise it's the end of partition, which means 
that we can just run the aggregate once and return the same value for 
all rows in the partition.

Now, let's consider how to speed up step 5. There's the difference 
between ranking and window aggregates. Window aggregates don't care 
about the position of the current row within the window frame, so the 
interface I proposed earlier, with the subfunc, would work fine. But for 
ranking aggregates, the position of the current row matters.

Another must-have optimization is that we must be able to eagerly 
discard rows that we know won't be accessed anymore. And likewise, we 
mustn't read ahead many rows that we don't need to calculate the 
aggregate for the current row.

I think I'm coming around to your suggestion of passing a Window object 
to the aggregate function, allowing random access to the current frame. 
As Simon pointed out, there needs to be a way for the aggregate function 
to signal when it doesn't need some rows in the frame anymore, for 
performance.

The sliding window frame always moves forward, never backwards. 
Therefore it's sufficient if the user-defined function can indicate a 
cutoff point, allowing the executor node to discard any preceding rows. 
However, the window-frame-exclusion clause complicates that, because the 
current row and its peers might or might not be in the frame, which 
means that rows can disappear and reappear into the middle of the window 
frame. We don't need to implement window-frame-exclusion in this 
release, but we should consider how that affects the API.

The API could be like this (in pseudo-syntax):

CREATE AGGREGATE name (    <type> evalfunc(Window object)    enterfunc(Window object, int pos)    exitfunc(Window
object,int pos)
 
)

where evalfunc, enterfunc and exitfunc are functions provided by the 
aggregate function author.

Window object {  /* Returns the position of current row within the frame. */  int get_current_pos();  /* Returns the
numberof rows in the frame. */  int get_max_pos();  /* Compares two tuple in the window frame, according to window 
 
ordering */  int compare(int posa, int posb)  /* Returns the value of an aggregate argument for tuple at given 
position */  Datum get_arg_value(int pos, int argno);  /* Promise that we won't call get_arg_value or compare for any
pos< 
 
cutoff_pos anymore */  void signal_cutoff(int cutoff_pos);
}

The executor calls enterfunc for each row that enters the window frame, 
and exitfunc for each row that exits the frame (before removing the
tuple from the frame, so that it's still accessible to get_arg_value() 
and compare()). After calling enter/exitfuncs for all rows that 
enter/exit the frame, evalfunc is called once, to return the value for 
the current row.

I believe this allows for an efficient implementation of all the window 
functions we've discussed. It works fine for implementing regular 
aggregates as well, so there no need to treat ranking and other 
aggregates any differently. We can completely replace the current method 
of defining aggregates with this if we want to, as long as we provide 
some sort of glue for backwards-compatibility with aggregates in 
external projects.

Here's sketches of some aggregate implementations using this API:

RANK()
------
enterfunc:  if position of the new row is > current row, do nothing. Otherwise, 
decrement current rank.
exitfunc:  if position of the new row is > current row, do nothing. Otherwise, 
increment current rank.

LEAD(offset)
------------
evalfunc: Call get_row(get_current_row() + offset). Call signal_cutoff(get_current_row());

LAG(offset)
-----------
evalfunc: Call get_row(get_current_row() - offset). Call signal_cutoff(get_current_row() - offset);

MIN/MAX
-------
enterfunc: if new value is </> the current min/max value, replace it with the new 
value
exitfunc: if removed value = the current min/max value, rescan the current frame 
for new min/max value, iterating through all values (1..get_max_row()).

COUNT
-----
enterfunc: increment counter
exitfunc: decrement counter

CUME_DIST
---------
CUME_DIST is defined as NP/NR, where NP is the number of rows preceding 
or peer with the current row, and NR is the total number of rows in the 
frame. NR = get_max_pos(). Maintain NP-counter in enterfunc/exitfunc, 
incrementing it whenever a row preceding current row enters or exits the 
frame. In evalfunc, increment the counter to reflect the new current row 
position.


> I feel horrible in imagining to manage the combination of functions that
> don't support subfunc, ones that does and ranking aggregates but it's
> possible.

Well, if you handle each window/ranking aggregate in a separate Window 
node, like you did, it should be much easier to manage. If it gets 
overwhelmingly complex, you could also split it into two slightly 
different Window nodes, one for the generic case, for aggregates that 
don't have subfunc, and one for ones that do.

(the point is moot, if we go with the API I proposed above, using a 
Window object)


All in all, this is such a monstrous piece of functionality, that if we 
try to implement everything in one patch, it'll never get finished. So 
we clearly have to implement this in phases, and just make sure we don't 
paint ourselves in the corner with the design.

I'd suggest:

1. Implement Window node, with the capability to invoke an aggregate 
function, using the above API. Implement required parser/planner 
changes. Implement a few simple ranking aggregates using the API.
2. Implement glue to call normal aggregates through the new API
3. Implement the optimization to drop rows that are no longer needed 
(signal_cutoff can be a no-op until this phase)
4. Implement window framing (the frame can always be all rows in the 
partition, or all rows until the current row, until this phase)
5. Expose the new API to user-defined aggregates. It can be an internal 
API only used by built-in functions until this phase

I believe you already have phase 1 in your patch, except for the API 
changes.

>> PS. Have you looked at the "hypothetical set functions" in SQL2003?
> 
> I had a glance but not deeply, since I found I would be lost in design
> if deeply diving into it. :)

Yeah ;-). A quick glance suggests that it won't affect aggregate the API 
we come up with, but will require changes to parser/executor.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


pgsql-hackers by date:

Previous
From: Zdenek Kotala
Date:
Subject: Page layout footprint
Next
From: Heikki Linnakangas
Date:
Subject: Re: Window functions patch v04 for the September commit fest