Thread: Planning aggregates which require sorted or distinct input
Recently, I've been researching and putting together a proposal for window functions. I have not finished this but when I do, I will post it. A nice list of examples can be found here[1]. Rather than spend a lot of time talking about the problems window functions present to the planner and executor, I'd like to bring up an existing piece of SQL which is well understood and presents a similar problem. Consider the following query:

  select saledate, count(distinct prodid), count(distinct sellerid),
      sum(price)
  from sales
  group by 1;

The point here is that aggregates usually just receive the input that the lower level of the plan generates. When qualified with DISTINCT, however, that changes. Notice that the counts on prodid and sellerid receive unique input while the sum on price does not. We do not create separate plan nodes to achieve this. Rather, it is a property of the aggregation code inside the executor[2]. It seems to me that by creating actual plan nodes for this distinct step we can improve the range of options for executing these types of queries. For example, consider a more complex query than the above that groups over a join using a join key of (saledate, prodid), and that the planner implements with a merge join. This means that the sort order is preserved and count(distinct prodid) will receive sorted input. As far as I can tell, however, the executor doesn't know this -- but the planner does. That is, the sort step inside the aggregate code is redundant. Another way it could be improved is if we ever introduce a 'hash distinct' execution node. This has been discussed before. My interest here is not so much distinct aggregates but, rather, window functions. Window functions have this same problem, as the input to the functions is generally sorted by different keys. So, hypothetically, let's say we wanted to create a plan for the above query which had an explicit Sort -> Unique 'branch' for each of the distinct aggregates.
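As a rough illustration of the point about aggregate input (a Python sketch with invented sales rows; the table and column names follow the example query above), here is how, within one group, the DISTINCT aggregates see de-duplicated input while sum(price) sees every row:

```python
from itertools import groupby
from operator import itemgetter

# Invented rows: (saledate, prodid, sellerid, price)
sales = [
    ("2007-01-19", 1, 10, 5.0),
    ("2007-01-19", 1, 11, 5.0),   # duplicate prodid for this date
    ("2007-01-19", 2, 10, 7.5),   # duplicate sellerid for this date
    ("2007-01-20", 2, 12, 7.5),
]

result = []
for saledate, grp in groupby(sorted(sales), key=itemgetter(0)):
    grp = list(grp)
    result.append((
        saledate,
        len({r[1] for r in grp}),   # count(distinct prodid): unique input
        len({r[2] for r in grp}),   # count(distinct sellerid): unique input
        sum(r[3] for r in grp),     # sum(price): sees every row
    ))

print(result)
# [('2007-01-19', 2, 2, 17.5), ('2007-01-20', 1, 1, 7.5)]
```

The two distinct aggregates and the plain sum consume different multisets drawn from the same group, which is exactly what a per-aggregate Sort -> Unique branch would make explicit in the plan.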
This is actually difficult to represent with the existing plan constructs, as it turns out. Currently, the plan looks something like:

  GroupAggregate
        ^
        |
     Sort Op
        ^
        |
      Scan

What we want to do is have a kind of 'sub plan' for each aggregate. In effect, the plan might start looking like a directed graph. Here is part of the plan as a directed graph:

           GroupAggregate
   /-----------------^---------------...
   |                 |
   |                 |
   ^                 |
   |              Unique
   |                 ^
   |                 |
  Sort              Sort
  (saledate)        (saledate,prodid)
   ^                 ^
   |                 |
   -------------- Fan Out ------------...
                     ^
                     |
                   Scan

This idea was presented by Brian Hagenbuch at Greenplum. He calls it a 'Fan Out' plan. It is trivial to rejoin the data because all data input to the aggregates is sorted by the same primary key. Also, we could/would optimize this to perform a single sort for those columns which require the same sort order (saledate and sum(price) in the example query). An extra step would be required for (some) window functions because they won't necessarily preserve order. That's not important now.

An alternative approach would be a 'pipeline' design. This would fit into the existing plan structure better. In this approach, each aggregate would be a distinct step in the plan:

  Finalize (like Result)
        ^
        |
  Agg (sum(price))
        ^
        |
  Sort (saledate)
        ^
        |
  Agg (count(sellerid))
        ^
        |
  Sort (saledate, sellerid)/Unique
        ^
        |
  Agg (count(prodid))
        ^
        |
  Sort (saledate, prodid)/Unique
        ^
        |
      Scan

Now, this would actually work as follows: the input is received from the Scan node. We sort the input by the key (saledate, prodid) and produce a unique result. It is input to the aggregate and we produce a result for a distinct grouping expression. We then pass the output of the aggregate for this grouping expression up the tree along with a pointer to the scan data. We do not discard the result of the sort on (saledate, prodid) because we will need to use it again (for the next grouping expression). We head up the tree. We produce a new sort, this time sorted on the key (saledate, sellerid).
We get the result for the current grouping expression. We do this until we hit the top level of the plan and produce a single tuple. This top level continues to pull from the plan until it runs out of data, naturally. This approach looks inefficient but it should use a similar amount of resources as we do at the moment. There are optimizations we can make specifically for window functions, and probably for the sample query as well, but that isn't really the point. A query which would generate a plan like this is:

  select xx.saledate, xx.cp, yy.cs
  from (select saledate, count(prodid) as cp
        from (select distinct saledate, prodid from sales) x
        group by saledate) xx,
       (select saledate, count(sellerid) as cs
        from (select distinct saledate, sellerid from sales) y
        group by saledate) yy
  where xx.saledate = yy.saledate;

But, with sharing of the output of the scan on sales. The question is: what would be the best way for data to flow when you have aggregates expecting different multisets as their input? Are there other, better ways to represent it than I have presented here? Which of these approaches is best (if they are worthwhile at all)?

Thanks,

Gavin

[1] http://www.akadia.com/services/ora_analytic_functions.html
[2] See advance_aggregates() and process_sorted_aggregate().
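To make the rejoin step concrete, here is a hedged Python sketch (function and variable names are invented, not executor code) of why rejoining the fan-out branches is cheap when every branch's output is primarily sorted on the grouping key: a single synchronized walk over the streams suffices, as in a merge join.

```python
def merge_on_key(*streams):
    """Merge per-aggregate result streams, each sorted on the same key and
    containing exactly one row per key, into combined output rows."""
    iters = [iter(s) for s in streams]
    while True:
        rows = [next(it, None) for it in iters]
        if any(r is None for r in rows):
            return
        # All branches were fed from the same sorted scan, so the key
        # sequences must line up; no hashing or re-sorting is needed.
        assert len({r[0] for r in rows}) == 1, "streams disagree on key order"
        yield (rows[0][0],) + tuple(r[1] for r in rows)

# Invented per-branch outputs for the example query, keyed on saledate:
count_distinct_prodid = [("2007-01-19", 2), ("2007-01-20", 1)]
count_distinct_seller = [("2007-01-19", 2), ("2007-01-20", 1)]
sum_price             = [("2007-01-19", 17.5), ("2007-01-20", 7.5)]

print(list(merge_on_key(count_distinct_prodid, count_distinct_seller, sum_price)))
# [('2007-01-19', 2, 2, 17.5), ('2007-01-20', 1, 1, 7.5)]
```

This is only a model of the data flow; the real question in the thread is how to represent that flow in the plan tree.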
Gavin Sherry <swm@alcove.com.au> writes:
> What we want to do is have a kind of 'sub plan' for each aggregate. In
> effect, the plan might start looking like a directed graph. Here is part
> of the plan as a directed graph.
>
>           GroupAggregate
>   /-----------------^---------------...
>   |                 |
>   |                 |
>   ^                 |
>   |              Unique
>   |                 ^
>   |                 |
>  Sort              Sort
>  (saledate)        (saledate,prodid)
>   ^                 ^
>   |                 |
>   -------------- Fan Out ------------...
>                     ^
>                     |
>                   Scan
>
> This idea was presented by Brian Hagenbuch at Greenplum. He calls it a
> 'Fan Out' plan. It is trivial to rejoin the data because all data input to
> the aggregates is sorted by the same primary key.

Er, what primary key would that be exactly? And even if you had a key, I wouldn't call joining on it trivial; I'd call it expensive ...

Still, it looks better than your "pipeline" idea, which is even more full of handwaving --- the problem with that one is that you're either duplicating the earlier aggregates' results a lot of times, or you've got different numbers of rows for different columns at various steps of the pipeline.

I'd stick with the fanout idea but work on some way to keep related rows together that doesn't depend on untenable assumptions like having a primary key. When I've thought about this in the past, I had in mind leaving the plan structure pretty much as it is, but making the planner concern itself with the properties of individual aggregates more than it does now --- eg, mark DISTINCT aggregates as to whether they should use sorting or hashing, or mark that they can assume pre-sorted input. Perhaps this is another way of describing what you call a fan-out plan.

regards, tom lane
Gavin Sherry wrote:
> Recently, I've been researching and putting together a proposal for
> window functions.

Implementing NULLS FIRST and NULLS LAST appears like another challenging step to getting window functions wrapped up. Has your research led you to any ideas on what your strategy for NULLS FIRST and NULLS LAST would likely be?
Karen Hill wrote:
> Gavin Sherry wrote:
>> Recently, I've been researching and putting together a proposal for
>> window functions.
>
> Implementing NULLS FIRST and NULLS LAST appears like another
> challenging step to getting window functions wrapped up. Has your
> research led you to any ideas on what your strategy for NULLS FIRST
> and NULLS LAST would likely be?

Maybe I'm misunderstanding your question, but Tom implemented that already in -HEAD: http://archives.postgresql.org/pgsql-committers/2007-01/msg00123.php

Stefan
Stefan Kaltenbrunner wrote:
> Karen Hill wrote:
> > Gavin Sherry wrote:
> >> Recently, I've been researching and putting together a proposal for
> >> window functions.
> >
> > Implementing NULLS FIRST and NULLS LAST appears like another
> > challenging step to getting window functions wrapped up. Has your
> > research led you to any ideas on what your strategy for NULLS FIRST
> > and NULLS LAST would likely be?
>
> Maybe I'm misunderstanding your question, but Tom implemented that
> already in -HEAD:

Cool. I missed that. :-)
On Fri, 19 Jan 2007, Tom Lane wrote:

> Gavin Sherry <swm@alcove.com.au> writes:
> > What we want to do is have a kind of 'sub plan' for each aggregate. In
> > effect, the plan might start looking like a directed graph. Here is part
> > of the plan as a directed graph.
> >
> >           GroupAggregate
> >   /-----------------^---------------...
> >   |                 |
> >   |                 |
> >   ^                 |
> >   |              Unique
> >   |                 ^
> >   |                 |
> >  Sort              Sort
> >  (saledate)        (saledate,prodid)
> >   ^                 ^
> >   |                 |
> >   -------------- Fan Out ------------...
> >                     ^
> >                     |
> >                   Scan
> >
> > This idea was presented by Brian Hagenbuch at Greenplum. He calls it a
> > 'Fan Out' plan. It is trivial to rejoin the data because all data input to
> > the aggregates is sorted by the same primary key.
>
> Er, what primary key would that be exactly? And even if you had a key,
> I wouldn't call joining on it trivial; I'd call it expensive ...

I should have used slightly different language. What I meant to say was, both sets are primarily sorted by saledate so they can be merged back together. This is why I said it was trivial.

Thanks,

Gavin
On Sat, 19 Jan 2007, Karen Hill wrote:

> Gavin Sherry wrote:
> > Recently, I've been researching and putting together a proposal for
> > window functions.
>
> Implementing NULLS FIRST and NULLS LAST appears like another
> challenging step to getting window functions wrapped up. Has your
> research led you to any ideas on what your strategy for NULLS FIRST
> and NULLS LAST would likely be?

Explicit handling of NULL is only in the Oracle extension to window functions. I will look at the core SQL functionality first -- maybe with the extension of LEAD and LAG, because they are heavily utilised by those using window functions. Besides, Tom's been working on NULLS FIRST/LAST already.

Gavin
Gavin Sherry <swm@alcove.com.au> writes: > On Fri, 19 Jan 2007, Tom Lane wrote: >> Er, what primary key would that be exactly? And even if you had a key, >> I wouldn't call joining on it trivial; I'd call it expensive ... > I should have used slightly different language. What I meant to say was, > both sets are primarily sorted by saledate so they can be merged back > together. This is why I said it was trivial. Ah, my misunderstanding. Then isn't this basically isomorphic to what I was thinking of, ie, somewhat-smarter Aggref nodes attached to the existing GroupAggregate plan node? regards, tom lane
On Fri, 19 Jan 2007, Tom Lane wrote:

> Gavin Sherry <swm@alcove.com.au> writes:
> > On Fri, 19 Jan 2007, Tom Lane wrote:
> >> Er, what primary key would that be exactly? And even if you had a key,
> >> I wouldn't call joining on it trivial; I'd call it expensive ...
>
> > I should have used slightly different language. What I meant to say was,
> > both sets are primarily sorted by saledate so they can be merged back
> > together. This is why I said it was trivial.
>
> Ah, my misunderstanding. Then isn't this basically isomorphic to what
> I was thinking of, ie, somewhat-smarter Aggref nodes attached to the
> existing GroupAggregate plan node?

Yep. I was thinking about this all morning. I think I've over-engineered the problem in my head. Window function input just looks like slightly more complex distinct aggregate input. I'll think on it more though.

To bring out a slightly different point -- and I know this is putting the cart before the horse -- but window functions will (potentially) output rows in the wrong order. I made a passing reference to this earlier. For example, say we have a table employees with the following data:

  empno | salary | age
  ====================
      1 |   2000 |  50
      2 |   6000 |  30
      3 |   3000 |  20

We want to answer the following: for each employee, what is their rank in terms of salary and what is their rank in terms of age? This query answers that:

  select empno, rank() over (order by salary) as srank,
         rank() over (order by age) as arank
  from employees order by empno;

The result will be:

  empno | srank | arank
  =====================
      1 |     1 |     3
      2 |     3 |     2
      3 |     2 |     1

Both window functions provide results based on the order of their input. So, in terms of empno, srank will output in this order: empno = 1, 3, 2; arank will output in this order: empno = 3, 2, 1. We need to glue these back together and the only way I can think how is via a synthetic key.
Ideally, the planner would have some clue about how large the result set will be and the orders coming from the window functions, so that it can decide whether to use a nested loop, merge join or hash join to do it.

Can you think of a different approach?

Thanks,

Gavin
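A minimal Python sketch of the synthetic-key idea above (all names are invented for illustration): tag each row with its input position, evaluate each rank() over its own sort order, then glue the per-window outputs back together on the tag.

```python
# (empno, salary, age) -- the example table from the post
employees = [(1, 2000, 50), (2, 6000, 30), (3, 3000, 20)]

def rank_over(rows, key):
    """Return {row_tag: rank} with rows ranked by key (no ties in this data)."""
    order = sorted(range(len(rows)), key=lambda i: key(rows[i]))
    return {i: pos + 1 for pos, i in enumerate(order)}

# Each window sorts the same input its own way; the row tag (list index)
# acts as the synthetic key for reassembly.
srank = rank_over(employees, key=lambda r: r[1])  # over (order by salary)
arank = rank_over(employees, key=lambda r: r[2])  # over (order by age)

result = sorted((emp[0], srank[i], arank[i]) for i, emp in enumerate(employees))
print(result)
# [(1, 1, 3), (2, 3, 2), (3, 2, 1)] -- matching the table in the post
```

The join back on the tag is the step whose cost (nested loop, merge, or hash) the planner would need to estimate.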
Gavin Sherry <swm@alcove.com.au> writes:
> We want to answer the following: for each employee, what is their rank in
> terms of salary and what is their rank in terms of age. This query
> answers that:
>
> select empno, rank() over (order by salary) as srank,
>        rank() over (order by age) as arank
> from employees order by empno;

Eeek. This seems like the worst sort of action-at-a-distance. How does rank() know what value it's supposed to report the rank of?

regards, tom lane
Hi,

Tom Lane wrote:
>> select empno, rank() over (order by salary) as srank,
>>        rank() over (order by age) as arank
>> from employees order by empno;
>
> Eeek. This seems like the worst sort of action-at-a-distance. How does
> rank() know what value it's supposed to report the rank of?

All of these ranking aggregate functions (rank, dense_rank, percent_rank, cume_dist and row_number) are normally used without any arguments; see the examples in [1] or [2]. However, they explicitly require an ORDER BY clause anyway, so I suppose they need one with exactly *one* argument? Does the standard say anything more explicit? Or should those functions just take the first ORDER BY argument? I.e., what should the following query do? Is it a legal query at all?

  select empno, cume_dist() over (order by salary, age) as rank
  from employees order by empno;

Regards

Markus

[1]: SQL Anywhere Server - SQL Reference, Window Clause: http://www.ianywhere.com/developer/product_manuals/sqlanywhere/1000/en/html/dbrfen10/rf-window-clause-statement.html
[2]: A techonthenet.com article about the cume_dist() function: http://www.techonthenet.com/oracle/functions/cume_dist.php
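One plausible reading of the multi-column case, sketched in Python with invented data: treat (salary, age) as a single composite sort key, so cume_dist() counts the fraction of rows whose composite key is less than or equal to the current row's. This is an assumption about the semantics for illustration, not a statement of what the standard requires.

```python
# (empno, salary, age); the extra row with a tied salary is invented to
# show the second ORDER BY column breaking the tie.
employees = [(1, 2000, 50), (2, 6000, 30), (3, 3000, 20), (4, 2000, 20)]

n = len(employees)

def cume_dist(row):
    """Fraction of rows whose composite (salary, age) key is <= this row's."""
    key = (row[1], row[2])
    return sum(1 for r in employees if (r[1], r[2]) <= key) / n

result = {r[0]: cume_dist(r) for r in employees}
print(result)
# {1: 0.5, 2: 1.0, 3: 0.75, 4: 0.25}
```

Under this reading the two-column ORDER BY is perfectly legal: it just refines which rows count as peers.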
On Sat, 2007-01-20 at 09:34 +1100, Gavin Sherry wrote:
> On Sat, 19 Jan 2007, Karen Hill wrote:
> > Gavin Sherry wrote:
> > > Recently, I've been researching and putting together a proposal for
> > > window functions.

Good news.

> > Implementing NULLS FIRST and NULLS LAST appears like another
> > challenging step to getting window functions wrapped up. Has your
> > research led you to any ideas on what your strategy for NULLS FIRST
> > and NULLS LAST would likely be?
>
> Explicit handling of NULL is only in the Oracle extension to window
> functions. I will look at the core SQL functionality first -- maybe with
> the extension of LEAD and LAG, because they are heavily utilised by
> those using window functions.

NULLS FIRST | LAST is part of SQL:2003; specifically, the definition of a window over which an aggregate function can be applied contains a sort specification list.

> Besides, Tom's been working on NULLS FIRST/LAST already.

For that very reason, I imagined.

-- Simon Riggs EnterpriseDB http://www.enterprisedb.com
On Sat, 2007-01-20 at 15:58 +1100, Gavin Sherry wrote:
> On Fri, 19 Jan 2007, Tom Lane wrote:
> > Gavin Sherry <swm@alcove.com.au> writes:
> > > On Fri, 19 Jan 2007, Tom Lane wrote:
> > >> Er, what primary key would that be exactly? And even if you had a key,
> > >> I wouldn't call joining on it trivial; I'd call it expensive ...
> >
> > > I should have used slightly different language. What I meant to say was,
> > > both sets are primarily sorted by saledate so they can be merged back
> > > together. This is why I said it was trivial.

Yes, your fan-out plan sounds best, together with the optimisation to remove whatever you call the individual strands of the fan-out. Think of a good name now, so we can discuss it more easily...

> Yep. I was thinking about this all morning. I think I've over-engineered
> the problem in my head. Window function input just looks like slightly
> more complex distinct aggregate input. I'll think on it more though.

I'm working on modifying Tuplestore to allow you to store the last N tuples, rather than all previous input. This is specifically designed to allow Merge Join to do Mark/Restore without materializing the whole sort set. It can also be used to materialize Windows (i.e. <window clause> in SQL:2003), so you can store the current row plus n PRECEDING and/or n FOLLOWING rows as appropriate. Reading from the Window would then be similar-ish to doing a Mark/Restore pair, which we can rename to MarkWindowStart and ReturnToWindowStart.

I'll present the prototype shortly; I've got a few bugs, but the basic principle is working already. I'm happy to craft it to your exact needs, so that you'll be able to press ahead with the rest of the implementation of windowed functions.

The Materialize node is almost unchanged, but I was thinking of renaming it when used in this way to make the EXPLAIN more readable. Perhaps we should call it a Materialized Window for both the MJ and Window function cases.
This won't help with UNBOUNDED window definitions, but I imagine that these are best handled by running aggregates, which would be an O(N) operation, rather than recalculating everything each time, which would be O(N^2).

> To bring out a slightly different point -- and I know this is putting the
> cart before the horse -- but window functions will (potentially) output
> rows in the wrong order. I made a passing reference to this earlier. For
> example, say we have a table employees with the following data:
>
>   empno | salary | age
>   ====================
>       1 |   2000 |  50
>       2 |   6000 |  30
>       3 |   3000 |  20
>
> We want to answer the following: for each employee, what is their rank in
> terms of salary and what is their rank in terms of age. This query
> answers that:
>
>   select empno, rank() over (order by salary) as srank,
>          rank() over (order by age) as arank
>   from employees order by empno;
>
> The result will be:
>
>   empno | srank | arank
>   =====================
>       1 |     1 |     3
>       2 |     3 |     2
>       3 |     2 |     1
>
> Both window functions provide results based on the order of their input.
> So, in terms of empno, srank will output in this order: empno = 1, 3, 2;
> arank will output in this order: empno = 3, 2, 1. We need to glue these
> back together and the only way I can think how is via a synthetic key.

Anything wrong with using empno?

> Ideally, the planner would have some clue about how large the result set
> will be and the orders coming from the window functions, so that it can
> decide whether to use nested loop, merge join or hash join to do it.
>
> Can you think of a different approach?

Sounds like figuring out and agreeing on the executor issues first is the best bet. Once we know what's to be done, extending the planner to do it will be easier.

-- Simon Riggs EnterpriseDB http://www.enterprisedb.com
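The running-aggregate point can be sketched in a few lines of Python (running sum() used as the example aggregate; names invented): the state survives across rows, so each row costs one transition rather than a rescan of all preceding rows.

```python
def running_sum_over_unbounded_preceding(values):
    """sum() over RANGE UNBOUNDED PRECEDING as a running aggregate: O(N)."""
    state = 0
    out = []
    for v in values:      # one state transition per row ...
        state += v        # ... the state is never rebuilt from scratch
        out.append(state) # finalize per row without discarding the state
    return out

print(running_sum_over_unbounded_preceding([5, 1, 3]))
# [5, 6, 9]
```

The naive alternative, re-aggregating rows 0..i for every row i, does O(N^2) work for the same answer.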
Simon Riggs wrote: > I'm working on modifying Tuplestore that will allow you to store the > last N tuples, rather than all previous input. This is specifically > designed to allow Merge Join to do Mark/Restore without materializing > the whole sort set. This can also be used to materialize Windows (i.e. > <window clause> in SQL:2003), so you can store the current row plus n > PRECEDING and/or n FOLLOWING rows as appropriate. Interesting. This could also be used to implement the "partial sort" stuff that Oleg is so interested in, and in which I believe Greg Stark was also interested. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Sat, 20 Jan 2007, Simon Riggs wrote: > On Sat, 2007-01-20 at 15:58 +1100, Gavin Sherry wrote: > > On Fri, 19 Jan 2007, Tom Lane wrote: > > > > > Gavin Sherry <swm@alcove.com.au> writes: > > > > On Fri, 19 Jan 2007, Tom Lane wrote: > > > >> Er, what primary key would that be exactly? And even if you had a key, > > > >> I wouldn't call joining on it trivial; I'd call it expensive ... > > > > > > > I should have used slightly different language. What I meant to say was, > > > > both sets are primarily sorted by saledate so they can be merged back > > > > together. This is why I said it was trivial. > > Yes, your fan out plan sounds best, together with the optimisation to > remove whatever you call the individual strands of the fan-out. Think of > a good name now, so we can discuss it more easily... Hah. Yes. I've been using the term 'subplan' but it isn't a good one. Can we just choose something 'cool' like iPlan2006? ;) > > Yep. I was thinking about this all morning. I think I've over engineered > > the problem in my head. Window function input just looks like a slightly > > more complex distinct aggregate input. I'll think on it more though. > > I'm working on modifying Tuplestore that will allow you to store the > last N tuples, rather than all previous input. This is specifically > designed to allow Merge Join to do Mark/Restore without materializing > the whole sort set. This can also be used to materialize Windows (i.e. > <window clause> in SQL:2003), so you can store the current row plus n > PRECEDING and/or n FOLLOWING rows as appropriate. Reading from the > Window would then be similar-ish to doing a Mark/Restore pair, which we > can rename to MarkWindowStart and ReturnToWindowStart. Wow. What a coincidence! Windows are slightly more complex though. 
As you probably know, there are two ways of specifying the window frame: by an absolute number of rows (ROWS N PRECEDING, for example); or by a 'range' (RANGE N PRECEDING), where the range, in the case of 'preceding', is determined by subtracting the range parameter from the value of the current field -- i.e., the window attribute. This presents a problem in terms of managing the size of the buffer. If you have data clustered around a value, then the amount of data input into the window function when we cross this value explodes. The tuplestore code can or will deal with this, but it is currently not designed for this kind of usage pattern. That is, every time we advance a row we must (a) recalculate the multiset to be input to the window function and (b) generate the value from the window function by passing the input to it. The problem arises when the window contains more data than can be stored in work_mem. Then, each time we need to recalculate the window function, we'll churn data through the buffer and get no effect from the buffering itself.

A lot of the research around window functions recommends offsetting this effect by buffering data 'pre-aggregated' for each distinct value in the buffer. Most of the research, however, relies on a trick we don't have available in the SQL window function spec: windows in SQL have a partition (irrelevant here), a data sort order and a range; windows in the world of window-function streaming-data research have all this and a 'slide'. The slide is the interval at which the aggregate is regenerated, and in SQL the value is regenerated for every new row. The research concedes that at this level, pre-aggregation either stops making sense or is very complex. I've come up with a way to do it, but not for all aggregates (median(), for example). I'll discuss this in the proposal to be sent through soon. The problem is, the requirements we have for buffering data around window functions could be very complex.
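A small Python sketch of the ROWS vs RANGE distinction described above (invented data and function names): ROWS N PRECEDING is always a bounded number of rows, while RANGE N PRECEDING admits every row whose sort-key value falls within N of the current value, so a cluster of equal values blows the frame up.

```python
def frame_rows(values, i, n):
    """ROWS n PRECEDING: the current row plus at most n prior rows."""
    return values[max(0, i - n): i + 1]

def frame_range(values, i, n):
    """RANGE n PRECEDING over input already sorted on `values`: every prior
    row whose value is within n of the current row's value."""
    return [v for v in values[: i + 1] if values[i] - v <= n]

vals = [1, 10, 10, 10, 11]        # data clustered around the value 10
print(frame_rows(vals, 4, 1))     # [10, 11] -- always two rows
print(frame_range(vals, 4, 1))    # [10, 10, 10, 11] -- the cluster floods in
```

The buffer for a ROWS frame is trivially bounded; the buffer for a RANGE frame is bounded only by how the data clusters, which is the work_mem concern above.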
> I'll present the prototype shortly, I've got a few bugs, but the basic
> principle is working already. I'm happy to craft that to your exact
> needs, so that you'll be able to press ahead with the rest of the
> implementation of Windowed functions.

Awesome. I will get the proposal off so that we can discuss the requirements at length.

> The Materialize node is almost unchanged, but I was thinking of renaming
> it when used in this way to make the EXPLAIN more readable. Perhaps we
> should call it a Materialized Window for both the MJ and Window function
> cases.

I think that would be confusing in the case of MJ.

> This won't help with UNBOUNDED window definitions, but I imagine that
> these are best handled by running aggregates which would be an O(N)
> operation, rather than recalculating everything each time, which would
> be O(N^2).

Correct. You only need to recalculate the aggregate if it has a moving window frame.

> > To bring out a slightly different point -- and I know this is putting the
> > cart before the horse -- but window functions will (potentially) output
> > rows in the wrong order. I made a passing reference to this earlier. For
> > example, say we have a table employees with the following data:
> >
> >   empno | salary | age
> >   ====================
> >       1 |   2000 |  50
> >       2 |   6000 |  30
> >       3 |   3000 |  20
> >
> > We want to answer the following: for each employee, what is their rank in
> > terms of salary and what is their rank in terms of age. This query
> > answers that:
> >
> >   select empno, rank() over (order by salary) as srank,
> >          rank() over (order by age) as arank
> >   from employees order by empno;
> >
> > The result will be:
> >
> >   empno | srank | arank
> >   =====================
> >       1 |     1 |     3
> >       2 |     3 |     2
> >       3 |     2 |     1
> >
> > Both window functions provide results based on the order of their input.
> > So, in terms of empno, srank will output in this order: empno = 1, 3, 2;
> > arank will output in this order: empno = 3, 2, 1.
> > We need to glue these back together and the only way I can think how is
> > via a synthetic key.
>
> Anything wrong with using empno?

It might not be a unique key.

> > Ideally, the planner would have some clue about how large the result set
> > will be and the orders coming from the window functions, so that it can
> > decide whether to use nested loop, merge join or hash join to do it.
> >
> > Can you think of a different approach?
>
> Sounds like figuring out and agreeing on the executor issues first is the
> best bet. Once we know what's to be done, extending the planner to do it
> will be easier.

Exactly. As you can see, it's a pretty big topic, so I'll work further on the proposal and send it off.

Thanks,

Gavin
On Sat, 20 Jan 2007, Tom Lane wrote:

> Gavin Sherry <swm@alcove.com.au> writes:
> > We want to answer the following: for each employee, what is their rank
> > in terms of salary and what is their rank in terms of age. This query
> > answers that:
>
> > select empno, rank() over (order by salary) as srank,
> >        rank() over (order by age) as arank
> > from employees order by empno;
>
> Eeek. This seems like the worst sort of action-at-a-distance. How does
> rank() know what value it's supposed to report the rank of?

This is a frustratingly inconsistent bit of the spec. Rank is defined as follows:

  RANK() OVER WNS is equivalent to:
    ( COUNT (*) OVER (WNS1 RANGE UNBOUNDED PRECEDING)
      - COUNT (*) OVER (WNS1 RANGE CURRENT ROW) + 1 )

Say the salary column has the following values: {100, 200, 200, 300}. This would give the following output: {1, 2, 2, 4}. DENSE_RANK() would give: {1, 2, 2, 3}.

These functions are pretty ugly (if you think about them in terms of our existing aggregates). However, they are by far the most heavily used window functions (along with ROW_NUMBER()).

Thanks,

Gavin
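Computing the quoted equivalence literally over the example values, in Python (a sketch, not executor code): rank(r) is the count of rows in r's RANGE UNBOUNDED PRECEDING frame, minus the count of r's peers (its RANGE CURRENT ROW frame), plus one.

```python
def rank(values):
    """RANK() over (order by value), via the spec's COUNT(*) equivalence."""
    return [sum(1 for w in values if w <= v)     # RANGE UNBOUNDED PRECEDING
            - sum(1 for w in values if w == v)   # RANGE CURRENT ROW (peers)
            + 1
            for v in values]

def dense_rank(values):
    """DENSE_RANK(): one rank per distinct value below or at the current."""
    return [sum(1 for w in set(values) if w < v) + 1 for v in values]

salaries = [100, 200, 200, 300]
print(rank(salaries))        # [1, 2, 2, 4]
print(dense_rank(salaries))  # [1, 2, 2, 3]
```

This reproduces the outputs given in the post, and makes the "gap after ties" behaviour of RANK() versus DENSE_RANK() explicit.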
"Alvaro Herrera" <alvherre@commandprompt.com> writes:

> Simon Riggs wrote:
>
>> I'm working on modifying Tuplestore that will allow you to store the
>> last N tuples, rather than all previous input. This is specifically
>> designed to allow Merge Join to do Mark/Restore without materializing
>> the whole sort set. This can also be used to materialize Windows (i.e.
>> <window clause> in SQL:2003), so you can store the current row plus n
>> PRECEDING and/or n FOLLOWING rows as appropriate.
>
> Interesting. This could also be used to implement the "partial sort"
> stuff that Oleg is so interested in, and in which I believe Greg Stark
> was also interested.

I already have that implemented in tuplesort. The place where I was asking for more guidance was at higher levels: how to communicate this information down to the Sort node. It's not exactly the same as this case, where you want to control when the old tuples are purged. I think the SORT LIMIT case is simpler since you just have to tell the tuplesort how many tuples you're interested in and it can take care of pruning irrelevant tuples as it goes.

-- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gavin,

I'm also interested in the topic, but right now I am wondering if rank() is a reserved function name? We're working on built-in tsearch2 for the 8.3 release and we already have a rank() function.

Oleg

On Sun, 21 Jan 2007, Gavin Sherry wrote:

> On Sat, 20 Jan 2007, Tom Lane wrote:
>
>> Gavin Sherry <swm@alcove.com.au> writes:
>>> We want to answer the following: for each employee: what is their rank in
>>> terms of salary and what is their rank in terms of age. This query
>>> answers that:
>>
>>> select empno, rank() over (order by salary) as srank,
>>>        rank() over (order by age) as arank
>>> from employees order by empno;
>>
>> Eeek. This seems like the worst sort of action-at-a-distance. How does
>> rank() know what value it's supposed to report the rank of?
>
> This is a frustratingly inconsistent bit of the spec. Rank is defined as
> follows:
>
>   RANK() OVER WNS is equivalent to:
>     ( COUNT (*) OVER (WNS1 RANGE UNBOUNDED PRECEDING)
>       - COUNT (*) OVER (WNS1 RANGE CURRENT ROW) + 1 )
>
> Say the salary column has the following values: {100, 200, 200, 300}. This
> would give the following output: {1, 2, 2, 4}. DENSE_RANK() would give:
> {1, 2, 2, 3}.
>
> These functions are pretty ugly (if you think about them in terms of our
> existing aggregates). However, they are by far the most heavily used
> window functions (along with ROW_NUMBER()).
>
> Thanks,
>
> Gavin

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
"Gavin Sherry" <swm@alcove.com.au> writes:

> Wow. What a coincidence! Windows are slightly more complex though. As you
> probably know, there are two ways of specifying the window frame: by an
> absolute number of rows (ROWS N PRECEDING, for example); or, by a 'range'
> (RANGE N PRECEDING), where the range, in the case of 'preceding', is
> determined by subtracting the range parameter from the value of the
> current field -- i.e., the window attribute.

Actually I think there's a third distinct subcase here as well. While in theory "RANGE UNBOUNDED PRECEDING" could be done using the same logic as N PRECEDING, I think it makes more sense to treat it as a distinct case because it is amenable to better plans.

For RANGE N PRECEDING in the general case we need to reapply the window aggregates over the entire window partition for every record. Some window aggregate functions may have the property of being able to "remove" the effects of a state transition, which allows for an optimization (for example avg(), which keeps a running sum and count, can easily subtract the old tuple being aged out of the window). But not all aggregates can do this. RANK(), I believe, will need to re-sort the entire window partition for every record.

However for RANGE UNBOUNDED PRECEDING we can apply a different plan: keep the state variable for each window aggregate around for the entire time. For each record, apply the state transition function, then apply the FINAL function to generate the result for that record, but keep the state variable as it was for the next record.

-- Gregory Stark EnterpriseDB http://www.enterprisedb.com
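The "removable" state transition can be sketched in Python as follows (names invented; avg() as the example aggregate): the oldest tuple is aged out of a ROWS N PRECEDING frame by applying an inverse transition to the running state, so each row costs O(1) instead of a rescan of the frame.

```python
from collections import deque

def moving_avg(values, n_preceding):
    """avg() over ROWS n_preceding PRECEDING, using an invertible transition:
    the running total is adjusted as tuples enter and leave the frame."""
    frame, total, out = deque(), 0.0, []
    for v in values:
        frame.append(v)
        total += v                        # forward state transition
        if len(frame) > n_preceding + 1:
            total -= frame.popleft()      # inverse transition: age out old tuple
        out.append(total / len(frame))    # per-row final function
    return out

print(moving_avg([2, 4, 6, 8], 1))
# [2.0, 3.0, 5.0, 7.0]
```

An aggregate with no inverse transition (rank() or median(), say) cannot be maintained this way and must reprocess the whole frame per row, which is the O(N^2) case described above.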
On Sat, 20 Jan 2007, Gregory Stark wrote:

> However for RANGE UNBOUNDED PRECEDING we can apply a different plan. Keep the
> state variable for each window aggregate around for the entire time. For each
> record apply the state transition function then apply the FINAL function to
> generate the result for that record but keep the state variable as it was for
> the next record.

Yes, I made reference to this case elsewhere in the email (or was it a
different email in the thread?).

Thanks,

Gavin
On Sat, 2007-01-20 at 23:54 +1100, Gavin Sherry wrote:

> > > Yep. I was thinking about this all morning. I think I've over-engineered
> > > the problem in my head. Window function input just looks like a slightly
> > > more complex distinct aggregate input. I'll think on it more though.
> >
> > I'm working on modifying Tuplestore so that it will allow you to store the
> > last N tuples, rather than all previous input. This is specifically
> > designed to allow Merge Join to do Mark/Restore without materializing
> > the whole sort set. This can also be used to materialize Windows (i.e.
> > <window clause> in SQL:2003), so you can store the current row plus n
> > PRECEDING and/or n FOLLOWING rows as appropriate. Reading from the
> > Window would then be similar-ish to doing a Mark/Restore pair, which we
> > can rename to MarkWindowStart and ReturnToWindowStart.
>
> Wow. What a coincidence!

Not much of one, though. The overhaul of the sort code was the first step on
the road to Windowed aggregates.

> Windows are slightly more complex though. As you
> probably know, there are two ways of specifying the window frame: by an
> absolute number of rows (ROWS N PRECEDING, for example); or, by a 'range'
> (RANGE N PRECEDING), where the range, in the case of 'preceding', is
> determined by subtracting the range parameter from the value of the current
> field -- i.e., the window attribute.

Sure. The MJ situation is to have the MergeJoin node call a Materialize node
which calls the tuplestore functions. The MJ node does all the clever stuff,
which includes a variable-size buffer according to the values arriving. It
seems OK to have a Window node that does its own brand of clever stuff,
while the Materialize node just does what it's told in managing the
tuplestore.

ROWS-type calls can do a read next/mark at the same time, so they always
maintain a fixed lead/lag as they go.

RANGE-type calls can do a read, then some processing to determine whether
they should mark yet, just as MJ does.
Or at least we can support an additional call to make RangeWindowStart hunt
for the appropriate row to set as the end of the window, and then read
forward until the end of the range is found.

> This presents a problem in terms of managing the size of the buffer. If
> you have data clustered around a value then the amount of data input into
> the window function when we cross this value explodes. The tuplestore code
> can or will deal with this but it is currently not designed for this kind
> of usage pattern. That is, every time we advance a row we must (a)
> recalculate the multiset to be input to the window function and (b)
> generate the value from the window function by passing the input to it.
> The problem arises when the window contains more data than can be stored
> in work_mem. Then, each time we need to recalculate the window function,
> we'll churn data through the buffer and get no effect from the buffering
> itself.

The reason I'm doing this is that MJs suck when they have large sorted
input, so the problem you describe is essentially the same one I worried
about when discussing the MJ optimisation. The cost of preparing
RandomAccess sort files is *immense*, and scrolling backwards is just not a
clever thing to do. The problem you describe won't occur in all cases, and
my guess is that accepting that it can happen occasionally will still make
it run loads faster for large queries than if we use a full RandomAccess
sort. We don't really think of it that way because we are paying the
RandomAccess costs every time at present, but Windowed aggregates would
stick out rather badly if they used that method every time, no matter what
the window size is.

There'll always be a stupidly specified query that performs badly. That's
where the planner can step in to recognize poor situations and do something
about it.

> A lot of the research around window functions recommends offsetting this
> effect by buffering data 'pre-aggregated' for each distinct value in the
> buffer.
> Most of the research, however, relies on a trick we don't have
> available in the SQL window function spec: windows in SQL can have a
> partition (irrelevant here), data sort order and a range; windows in the
> world of window function streaming data research have this and a 'slide'.
> Slide is the interval at which the aggregate is regenerated, and in SQL the
> value is regenerated for every new row. The research concedes that at this
> level, preaggregation either stops making sense or is very complex.

That sounds like Phase 2, don't you think? I'd like to get something into
8.3... The user can always use a subselect to pre-aggregate if they're
worried about the performance of their query. Maybe not in a complex
fan-out, but then anybody with a large query that has multiple conflicting,
large windows might reasonably expect poor performance in the first release.

> I've come up with a way to be able to do it, but not for all aggregates
> (median() for example). I'll discuss this in the proposal to be sent
> through soon. The problem is, the requirements we have for buffering data
> around window functions could be very complex.
>
> > I'll present the prototype shortly. I've got a few bugs, but the basic
> > principle is working already. I'm happy to craft that to your exact
> > needs, so that you'll be able to press ahead with the rest of the
> > implementation of Windowed functions.
>
> Awesome. I will get the proposal off so that we can discuss the
> requirements at length.

Looking forward to it. I'll help where I can.

> > > To bring out a slightly different point -- and I know this is putting the
> > > cart before the horse -- but window functions will (potentially) output
> > > rows in the wrong order. I made a passing reference to this earlier.
> > > For example, say we have a table employees with the following data:
> > >
> > > empno | salary | age
> > > ====================
> > >   1   |  2000  | 50
> > >   2   |  6000  | 30
> > >   3   |  3000  | 20
> > >
> > > We want to answer the following: for each employee, what is their rank
> > > in terms of salary and what is their rank in terms of age? This query
> > > answers that:
> > >
> > > select empno, rank() over (order by salary) as srank,
> > >        rank() over (order by age) as arank
> > > from employees order by empno;
> > >
> > > The result will be:
> > >
> > > empno | srank | arank
> > > =====================
> > >   1   |   1   |   3
> > >   2   |   3   |   2
> > >   3   |   2   |   1
> > >
> > > Both window functions provide results based on the order of their input.
> > > So, in terms of empno, srank will output in this order: empno = 1, 3, 2;
> > > arank will output in this order: empno = 3, 2, 1. We need to glue these
> > > back together and the only way I can think how is via a synthetic key.
> >
> > Anything wrong with using empno?
>
> It might not be a unique key.

Not sure how such a result set could have meaning to anyone, e.g.

empno	srank	arank
1	1	3
1	3	2
1	2	1

Which emp came 1st? 1. Which emp came 3rd? 1... no useful meaning. I'd be
inclined to suggest that such a query should be prevented from being
executed, with a message along the lines of "no GROUP BY specified". Or at
least the problem could be ignored in the first implementation; it's not
high up the list of meaningful queries we would like to allow users to
execute, IMHO. [Order by meaning desc nulls last :-)]

> > > Ideally, the planner would have some clue about how large the result
> > > set will be and the orders from the window functions, so that it can
> > > decide whether to use a nested loop, merge join or hash join to do it.
> > >
> > > Can you think of a different approach?
> >
> > Sounds like figuring out and agreeing the executor issues first is the
> > best bet.
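The re-glueing problem Gavin describes can be modelled like so (a Python
sketch; the synthetic key here is simply the original row position, which is
an assumption for illustration, not a proposed design):

```python
employees = [(1, 2000, 50), (2, 6000, 30), (3, 3000, 20)]  # (empno, salary, age)

def rank_by(rows, col):
    # Each window function sorts its own input, so its results come back in
    # a different order; return {synthetic_row_id: rank} so they can be
    # re-joined afterwards. (No ties in this data, so simple enumeration
    # stands in for full RANK() semantics.)
    ordered = sorted(range(len(rows)), key=lambda i: rows[i][col])
    return {i: pos + 1 for pos, i in enumerate(ordered)}

srank = rank_by(employees, 1)   # rank() over (order by salary)
arank = rank_by(employees, 2)   # rank() over (order by age)

# glue both rank streams back together on the synthetic key
result = [(employees[i][0], srank[i], arank[i]) for i in range(len(employees))]
print(result)   # [(1, 1, 3), (2, 3, 2), (3, 2, 1)]
```

If empno were duplicated, the synthetic key still distinguishes the rows,
which is exactly why Gavin argues empno alone is not sufficient.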
> > Once we know what's to be done, extending the planner to do it
> > will be easier.
>
> Exactly. As you can see, it's a pretty big topic so I'll work further on
> the proposal and send it off.

It would be very cool to start with a set of queries that are specifically
designed to cover all of the complexities, with expected output. That way we
can reference a real example of what we are discussing, just like your
example here. The range of functionality is very large and hard to
visualise, methinks.

-- 
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
On Sat, 2007-01-20 at 13:36 +0000, Gregory Stark wrote:

> "Alvaro Herrera" <alvherre@commandprompt.com> writes:
>
> > Simon Riggs wrote:
> >
> >> I'm working on modifying Tuplestore that will allow you to store the
> >> last N tuples, rather than all previous input. This is specifically
> >> designed to allow Merge Join to do Mark/Restore without materializing
> >> the whole sort set. This can also be used to materialize Windows (i.e.
> >> <window clause> in SQL:2003), so you can store the current row plus n
> >> PRECEDING and/or n FOLLOWING rows as appropriate.
> >
> > Interesting. This could also be used to implement the "partial sort"
> > stuff that Oleg is so interested in, and in which I believe Greg Stark
> > was also interested.
>
> I already have that implemented in tuplesort. The place where I was asking
> for more guidance was at higher levels: how to communicate this information
> down to the Sort node.
>
> It's not exactly the same as this case, where you want to control when the
> old tuples are purged. I think the SORT LIMIT case is simpler, since you
> just have to tell the tuplesort how many tuples you're interested in and it
> can take care of pruning irrelevant tuples as it goes.

It does sound the same, but I think Greg is right: Sort Limit optimises the
sort itself, not how the processing is handled following the Sort. Both
approaches are required, for different cases.

-- 
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
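The SORT LIMIT pruning Greg describes can be sketched with a bounded heap (a
Python model of the idea, not the tuplesort code):

```python
import heapq

def bounded_sort(rows, limit):
    # Keep only the `limit` smallest rows while scanning the input,
    # discarding anything that can no longer make the cut -- the sort
    # never needs to hold more than `limit` tuples at once.
    heap = []                       # max-heap of the current best, via negation
    for r in rows:
        if len(heap) < limit:
            heapq.heappush(heap, -r)
        elif r < -heap[0]:
            heapq.heapreplace(heap, -r)   # evict the current worst
    return sorted(-x for x in heap)

print(bounded_sort([5, 1, 4, 2, 3], 2))   # [1, 2]
```

This is the sense in which SORT LIMIT optimises the sort itself; the
circular-buffer tuplestore instead controls what happens to tuples after the
sort, which is why Simon says both are needed.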
On Sat, 2007-01-20 at 14:20 +0000, Simon Riggs wrote:

> On Sat, 2007-01-20 at 23:54 +1100, Gavin Sherry wrote:
>
> > Windows are slightly more complex though. As you
> > probably know, there are two ways of specifying the window frame: by an
> > absolute number of rows (ROWS N PRECEDING, for example); or, by a 'range'
> > (RANGE N PRECEDING), where the range, in the case of 'preceding', is
> > determined by subtracting the range parameter from the value of the
> > current field -- i.e., the window attribute.
>
> Sure. The MJ situation is to have the MergeJoin node call a Materialize
> node which calls the tuplestore functions. The MJ node does all the
> clever stuff, which includes a variable-size buffer according to the
> values arriving. Seems OK to have a Window node that does its own brand
> of clever stuff, while the Materialize node just does what it's told in
> managing the tuplestore.
>
> ROWS-type calls can do a read next/mark at the same time, so they always
> maintain a fixed lead/lag as they go.
>
> RANGE-type calls can do a read, then some processing to determine whether
> they should mark yet, just as MJ does. Or at least we can support an
> additional call to make RangeWindowStart hunt for the appropriate row to
> set as the end of the window and then read forward until the end of the
> range is found.

Gavin,

Following earlier thoughts, I thought I'd make some notes about how the new
tuplestore changes could be used for preparing Windows for use with Windowed
aggregate functionality. The new tuplestore commands are:

  tuplestore_moveBufferStartForwards() == mark
  tuplestore_rewindToBufferStart()     == restore

(happy to change the names...)

They'd be used something like this, in simplified pseudo code that would be
executed by a Window? node.
With ROWS, for each new tuple:
------------------------------
  tuplestore_puttupleslot()                     // add one new tuple
  tuplestore_moveBufferStartForwards(++markpos) // move start forwards one

  // position Window for processing
  tuplestore_rewindToBufferStart()

With RANGE, for each new row:
-----------------------------
  // locate new range start
  tuplestore_rewindToBufferStart()
  ...step forward until start of new range found...
  tuplestore_moveBufferStartForwards() at that point

  // locate new range end
  while (!end-of-new-range)
  {
      if (!eof)
          tuplestore_gettuple()       // move forwards
      else
          tuplestore_puttupleslot()   // add new tuple
  }

  // position Window for processing
  tuplestore_rewindToBufferStart()

(The above is simplified to remove boundary conditions.)

So AFAICS, there's not actually any additional work to do over and above the
changes I'm working on to support a circular buffer in tuplestore for merge
joins.

The above makes the assumption that for RANGE windows, the range start of
tuple i+1 is always greater than or equal to the range start of tuple i --
it could be very interesting if that's not true.

Anyway, some options for you to consider, at least.

-- 
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
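The ROWS case of the pseudo code above can be modelled with a small circular
buffer (a Python sketch; the method names merely echo the proposed
tuplestore calls and are not real APIs):

```python
from collections import deque

class WindowBuffer:
    # toy stand-in for the circular-buffer tuplestore
    def __init__(self):
        self.buf = deque()

    def put(self, tup):            # tuplestore_puttupleslot()
        self.buf.append(tup)

    def trim_start(self):          # tuplestore_moveBufferStartForwards()
        self.buf.popleft()

    def scan(self):                # tuplestore_rewindToBufferStart() + reads
        return list(self.buf)

def rows_preceding(stream, n):
    # ROWS n PRECEDING: the frame is the current row plus up to n prior rows,
    # so the buffer keeps a fixed lead/lag as it advances.
    wb = WindowBuffer()
    for tup in stream:
        wb.put(tup)
        if len(wb.buf) > n + 1:
            wb.trim_start()        # move the buffer start forward one
        yield wb.scan()            # position the Window for processing

print(list(rows_preceding([1, 2, 3, 4], 2)))
# [[1], [1, 2], [1, 2, 3], [2, 3, 4]]
```

The RANGE case would replace the fixed trim with a search for the new range
start, exactly as the pseudo code's "step forward until start of new range
found" loop describes.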