proposal: add window function to 8.4 - Mailing list pgsql-hackers

From H.Harada
Subject proposal: add window function to 8.4
Date
Msg-id e08cc0400806090532p307f0687i3a246aa5f9f293e8@mail.gmail.com
Whole thread Raw
Responses Re: proposal: add window function to 8.4  (Decibel! <decibel@decibel.org>)
List pgsql-hackers
This topic has been discussed on this list and many user expect that
PostgreSQL implements it.
I'd like to work on this feature and hope that we can include it on 8.4.

Former discussions are here:

http://archives.postgresql.org/pgsql-hackers/2004-11/msg01093.php

http://archives.postgresql.org/pgsql-hackers/2007-01/msg00861.php


How it works and examples:SELECT dept, empno,RANK() OVER(PARTITION BY dept ORDER BY age) as age_rank,RANK()
OVER(PARTITIONBY dept ORDER BY salary) as salary_rank,SUM(salary) OVER(PARTITION BY dept ORDER BY age) as run_totalFROM
employeesORDER BY 1, 3, 4;
 
dept empno age_rank salary_rank run_totalENG  2     1        2           40000ENG  1     2        1           90000QA
3    1        2           30000QA   4     2        1           65000(ref.:
http://www.gavinsherry.org/talks/window_talk.pdf)


My current idea and concept:
- add "window function" and treat it specially such like aggregate
function and setof function.
- some features may be dropped at the first release, considering to
support them later.
- to formulate and to let it work properly are primary, performance
optimization is secondary.


From my survey around web and list archive, the points are:
- what is "window function" rank(), rank_dense(), lead() and others? First of all, we should define the window function
suchlike
 
aggregate function. In my opinion, there are two types of functions in
OVER () call context. One of them is aggregate, and the other is
window (ranking) function. Sum() in a query like
 SELECT empno, sum(salary) OVER (PARTITION BY depno) FROM empsalary;
is obviously aggregate function. This type of function can be used as
it is now. Only executer will change its behavior.Current pgsql feature sets lack window function like rank(). This
type of function must 1) have a context such as SETOF functions, 2)
return values until executor says "DONE", rather than function itself
says "DONE" as in SETOF function, and 3) know about other tuples
(mainly ORDER BY clause), as rank() doesn't take any arguments but
knows the ranking boundary.  I suppose that pgsql have new function
system for these types called "window function".Once we can define window function, users have extensibility to this
type of function.

- do we really need FRAME clause?From my survey, Oracle and DB2 have FRAME clause
 SELECT empno, sum(salary) OVER (ORDER BY empno ROWS BETWEEN
UNBOUNDED PRECEDING AND CURRENT ROW) FROM empsalary;
while MS SQL Server doesn't (note that the literal from "ROWS" to
"CURRENT ROW" is called FRAME clause). To implement FRAME clause is
much more complicated than only PARTITION and ORDER clause support
because we need random access to result tuples. Though we will be
expected to support FAME clause in long-term, for the first release it
might be able to be put off. Even though rank() doesn't support FRAME
clause (only PARTITION and ORDER) it is so useful, more than now at
least.

- execution orderIn SQL:2003, the execution order is defined as
where -> group by -> having -> (windowing) * N -> order by (outer,
currently existing one)where windowing ispartition by -> order by -> (framing) * N
But Oracle seems that it has another order
(windowing) * N -> where -> group by ... and so on.
which is better for us? With Oracle's one you can say SELECT empno, rank() OVER (PARTITION BY depno ORDER BY saraly)
AS
topsalary FROM empsalary WHERE topsaraly < 3to get the top 3 people taking heighest salary. In the SQL standard,
you should the nest query.I insist the first (standard) one is better because we may want use
the result of normal aggregate in OVER clause.

- plan and nodeCurrently in my mind the execution strategy could be:
1. Where & GroupBy & Having|2. SortBy partitionClause, orderByClause|3. Window  foreach partition:    if not
there_is_frame():     aggvalue = null      foreach row in partition:        aggvalue = agg_trans_func(aggvalue)
aggvalue= agg_final_func(aggvalue)
 
    foreach row in partition:      if has frame clause:        aggvalue = null        frame = make_frame()
foreachrow_in_frame:          aggvalue = aggregate_trans_func(aggvalue)        aggvalue =
aggregate_final_func(aggvalue)
      set aggvalue to row      val = window_func()      set val to row  goto 2. if another window remained|4. SortBy
ORDERBY clause (outer) & Limit|5. Output
 
This pseudo code is quite simple and stupid. We may optimize it by
splitting tasks with MergeJoin, etc. or think about process 2. that
collect same PARTITION clauses to reduce sort operation. Or to use
Hash Agg to create PARTITION. But let's be stupid at first.
Optimization is secondary.


References:description by Stefan DeBloch

http://wwwdvs.informatik.uni-kl.de/courses/NEDM/WS0607/Vorlesungsunterlagen/NEDM.Chapter.06.Windows_and_Query_Functions_in_SQL.pdfvia
Wikipedia[Select(SQL)] http://en.wikipedia.org/wiki/Select_(SQL)
 

BTW, what about Bizgres now?
I appreciate for any comments and dis -:)


pgsql-hackers by date:

Previous
From: Andrew Chernow
Date:
Subject: Re: libpq support for arrays and composites
Next
From: Magnus Hagander
Date:
Subject: Re: TODO, FAQs to Wiki?