Thread: Planning aggregates which require sorted or distinct input
Recently, I've been researching and putting together a proposal for window functions. I have not finished this but when I do, I will post it. A nice list of examples can be found here[1]. Rather than spend a lot of time talking about the problems window functions present to the planner and executor, I'd like to bring up an existing piece of SQL which is well understood and presents a similar problem. Consider the following query:

  select saledate, count(distinct prodid), count(distinct sellerid),
      sum(price)
  from sales
  group by 1;

The point here is that aggregates usually just receive the input that the lower level of the plan generates. When qualified with DISTINCT, however, that changes. Notice that the counts on prodid and sellerid receive unique input while the sum on price does not. We do not create separate plan nodes to achieve this. Rather, it is a property of the aggregation code inside the executor[2]. It seems to me that by creating actual plan nodes for this distinct step we can improve the range of options for executing these types of queries. For example, consider a more complex query than the above that groups over a join using a join key of (saledate, prodid), and that the planner implements with a merge join. This means that the sort order is preserved and count(distinct prodid) will receive sorted input. As far as I can tell, however, the executor doesn't know this -- but the planner does. That is, the sort step inside the aggregate code is redundant. Another way it could be improved is if we ever introduce a 'hash distinct' execution node. This has been discussed before. My interest here is not so much distinct aggregates but, rather, window functions. Window functions have this same problem, as the input to the functions is generally sorted by different keys. So, hypothetically, let's say we wanted to create a plan for the above query which had an explicit Sort -> Unique 'branch' for each of the distinct aggregates.
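As a rough illustration of the point about aggregate input (a Python sketch with invented sales rows; the table and column names follow the example query above), here is how, within one group, the DISTINCT aggregates see de-duplicated input while sum(price) sees every row:

```python
from itertools import groupby
from operator import itemgetter

# Invented rows: (saledate, prodid, sellerid, price)
sales = [
    ("2007-01-19", 1, 10, 5.0),
    ("2007-01-19", 1, 11, 5.0),   # duplicate prodid for this date
    ("2007-01-19", 2, 10, 7.5),   # duplicate sellerid for this date
    ("2007-01-20", 2, 12, 7.5),
]

result = []
for saledate, grp in groupby(sorted(sales), key=itemgetter(0)):
    grp = list(grp)
    result.append((
        saledate,
        len({r[1] for r in grp}),   # count(distinct prodid): unique input
        len({r[2] for r in grp}),   # count(distinct sellerid): unique input
        sum(r[3] for r in grp),     # sum(price): sees every row
    ))

print(result)
# [('2007-01-19', 2, 2, 17.5), ('2007-01-20', 1, 1, 7.5)]
```

The two distinct aggregates and the plain sum consume different multisets drawn from the same group, which is exactly what a per-aggregate Sort -> Unique branch would make explicit in the plan.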
This is actually difficult to represent with the existing plan constructs, as it turns out. Currently, the plan looks something like:

  GroupAggregate
        ^
        |
     Sort Op
        ^
        |
      Scan

What we want to do is have a kind of 'sub plan' for each aggregate. In effect, the plan might start looking like a directed graph. Here is part of the plan as a directed graph:

           GroupAggregate
   /-----------------^---------------...
   |                 |
   |                 |
   ^                 |
   |              Unique
   |                 ^
   |                 |
  Sort              Sort
  (saledate)        (saledate,prodid)
   ^                 ^
   |                 |
   -------------- Fan Out ------------...
                     ^
                     |
                   Scan

This idea was presented by Brian Hagenbuch at Greenplum. He calls it a 'Fan Out' plan. It is trivial to rejoin the data because all data input to the aggregates is sorted by the same primary key. Also, we could/would optimize this to perform a single sort for those columns which require the same sort order (saledate and sum(price) in the example query). An extra step would be required for (some) window functions because they won't necessarily preserve order. That's not important now.

An alternative approach would be a 'pipeline' design. This would fit into the existing plan structure better. In this approach, each aggregate would be a distinct step in the plan:

  Finalize (like Result)
        ^
        |
  Agg (sum(price))
        ^
        |
  Sort (saledate)
        ^
        |
  Agg (count(sellerid))
        ^
        |
  Sort (saledate, sellerid)/Unique
        ^
        |
  Agg (count(prodid))
        ^
        |
  Sort (saledate, prodid)/Unique
        ^
        |
      Scan

Now, this would actually work as follows: the input is received from the Scan node. We sort the input by the key (saledate, prodid) and produce a unique result. It is input to the aggregate and we produce a result for a distinct grouping expression. We then pass the output of the aggregate for this grouping expression up the tree along with a pointer to the scan data. We do not discard the result of the sort on (saledate, prodid) because we will need to use it again (for the next grouping expression). We head up the tree. We produce a new sort, this time sorted on the key (saledate, sellerid).
We get the result for the current grouping expression. We do this until we hit the top level of the plan and produce a single tuple. This top level continues to pull from the plan until it runs out of data, naturally. This approach looks inefficient but it should use a similar amount of resources as we do at the moment. There are optimizations we can make specifically for window functions, and probably for the sample query as well, but that isn't really the point. A query which would generate a plan like this is:

  select xx.saledate, xx.cp, yy.cs
  from (select saledate, count(prodid) as cp
        from (select distinct saledate, prodid from sales) x
        group by saledate) xx,
       (select saledate, count(sellerid) as cs
        from (select distinct saledate, sellerid from sales) y
        group by saledate) yy
  where xx.saledate = yy.saledate;

But, with sharing of the output of the scan on sales. The question is: what would be the best way for data to flow when you have aggregates expecting different multisets as their input? Are there other, better ways to represent it than I have presented here? Which of these approaches is best (if they are worthwhile at all)?

Thanks,

Gavin

[1] http://www.akadia.com/services/ora_analytic_functions.html
[2] See advance_aggregates() and process_sorted_aggregate().
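To make the rejoin step concrete, here is a hedged Python sketch (function and variable names are invented, not executor code) of why rejoining the fan-out branches is cheap when every branch's output is primarily sorted on the grouping key: a single synchronized walk over the streams suffices, as in a merge join.

```python
def merge_on_key(*streams):
    """Merge per-aggregate result streams, each sorted on the same key and
    containing exactly one row per key, into combined output rows."""
    iters = [iter(s) for s in streams]
    while True:
        rows = [next(it, None) for it in iters]
        if any(r is None for r in rows):
            return
        # All branches were fed from the same sorted scan, so the key
        # sequences must line up; no hashing or re-sorting is needed.
        assert len({r[0] for r in rows}) == 1, "streams disagree on key order"
        yield (rows[0][0],) + tuple(r[1] for r in rows)

# Invented per-branch outputs for the example query, keyed on saledate:
count_distinct_prodid = [("2007-01-19", 2), ("2007-01-20", 1)]
count_distinct_seller = [("2007-01-19", 2), ("2007-01-20", 1)]
sum_price             = [("2007-01-19", 17.5), ("2007-01-20", 7.5)]

print(list(merge_on_key(count_distinct_prodid, count_distinct_seller, sum_price)))
# [('2007-01-19', 2, 2, 17.5), ('2007-01-20', 1, 1, 7.5)]
```

This is only a model of the data flow; the real question in the thread is how to represent that flow in the plan tree.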
Gavin Sherry <swm@alcove.com.au> writes:
> What we want to do is have a kind of 'sub plan' for each aggregate. In
> effect, the plan might start looking like a directed graph. Here is part
> of the plan as a directed graph.
>
>           GroupAggregate
>   /-----------------^---------------...
>   |                 |
>   |                 |
>   ^                 |
>   |              Unique
>   |                 ^
>   |                 |
>  Sort              Sort
>  (saledate)        (saledate,prodid)
>   ^                 ^
>   |                 |
>   -------------- Fan Out ------------...
>                     ^
>                     |
>                   Scan
>
> This idea was presented by Brian Hagenbuch at Greenplum. He calls it a
> 'Fan Out' plan. It is trivial to rejoin the data because all data input to
> the aggregates is sorted by the same primary key.

Er, what primary key would that be exactly? And even if you had a key, I wouldn't call joining on it trivial; I'd call it expensive ...

Still, it looks better than your "pipeline" idea, which is even more full of handwaving --- the problem with that one is that you're either duplicating the earlier aggregates' results a lot of times, or you've got different numbers of rows for different columns at various steps of the pipeline.

I'd stick with the fanout idea but work on some way to keep related rows together that doesn't depend on untenable assumptions like having a primary key. When I've thought about this in the past, I had in mind leaving the plan structure pretty much as it is, but making the planner concern itself with the properties of individual aggregates more than it does now --- eg, mark DISTINCT aggregates as to whether they should use sorting or hashing, or mark that they can assume pre-sorted input. Perhaps this is another way of describing what you call a fan-out plan.

regards, tom lane
Gavin Sherry wrote:
> Recently, I've been researching and putting together a proposal for
> window functions.

Implementing NULLS FIRST and NULLS LAST appears like another challenging step to getting window functions wrapped up. Has your research led you to any ideas on what your strategy for NULLS FIRST and NULLS LAST would likely be?
Karen Hill wrote:
> Gavin Sherry wrote:
>> Recently, I've been researching and putting together a proposal for
>> window functions.
>
> Implementing NULLS FIRST and NULLS LAST appears like another
> challenging step to getting window functions wrapped up. Has your
> research led you to any ideas on what your strategy for NULLS FIRST
> and NULLS LAST would likely be?

Maybe I'm misunderstanding your question, but Tom implemented that already in -HEAD: http://archives.postgresql.org/pgsql-committers/2007-01/msg00123.php

Stefan
Stefan Kaltenbrunner wrote:
> Karen Hill wrote:
> > Gavin Sherry wrote:
> >> Recently, I've been researching and putting together a proposal for
> >> window functions.
> >
> > Implementing NULLS FIRST and NULLS LAST appears like another
> > challenging step to getting window functions wrapped up. Has your
> > research led you to any ideas on what your strategy for NULLS FIRST
> > and NULLS LAST would likely be?
>
> Maybe I'm misunderstanding your question, but Tom implemented that
> already in -HEAD:

Cool. I missed that. :-)
On Fri, 19 Jan 2007, Tom Lane wrote:

> Gavin Sherry <swm@alcove.com.au> writes:
> > What we want to do is have a kind of 'sub plan' for each aggregate. In
> > effect, the plan might start looking like a directed graph. Here is part
> > of the plan as a directed graph.
> >
> >           GroupAggregate
> >   /-----------------^---------------...
> >   |                 |
> >   |                 |
> >   ^                 |
> >   |              Unique
> >   |                 ^
> >   |                 |
> >  Sort              Sort
> >  (saledate)        (saledate,prodid)
> >   ^                 ^
> >   |                 |
> >   -------------- Fan Out ------------...
> >                     ^
> >                     |
> >                   Scan
> >
> > This idea was presented by Brian Hagenbuch at Greenplum. He calls it a
> > 'Fan Out' plan. It is trivial to rejoin the data because all data input to
> > the aggregates is sorted by the same primary key.
>
> Er, what primary key would that be exactly? And even if you had a key,
> I wouldn't call joining on it trivial; I'd call it expensive ...

I should have used slightly different language. What I meant to say was, both sets are primarily sorted by saledate so they can be merged back together. This is why I said it was trivial.

Thanks,

Gavin
On Sat, 19 Jan 2007, Karen Hill wrote:

> Gavin Sherry wrote:
> > Recently, I've been researching and putting together a proposal for
> > window functions.
>
> Implementing NULLS FIRST and NULLS LAST appears like another
> challenging step to getting window functions wrapped up. Has your
> research led you to any ideas on what your strategy for NULLS FIRST
> and NULLS LAST would likely be?

Explicit handling of NULL is only in the Oracle extension to window functions. I will look at the core SQL functionality first -- maybe with the extension of LEAD and LAG, because they are heavily utilised by those using window functions. Besides, Tom's been working on NULLS FIRST/LAST already.

Gavin
Gavin Sherry <swm@alcove.com.au> writes: > On Fri, 19 Jan 2007, Tom Lane wrote: >> Er, what primary key would that be exactly? And even if you had a key, >> I wouldn't call joining on it trivial; I'd call it expensive ... > I should have used slightly different language. What I meant to say was, > both sets are primarily sorted by saledate so they can be merged back > together. This is why I said it was trivial. Ah, my misunderstanding. Then isn't this basically isomorphic to what I was thinking of, ie, somewhat-smarter Aggref nodes attached to the existing GroupAggregate plan node? regards, tom lane
On Fri, 19 Jan 2007, Tom Lane wrote:

> Gavin Sherry <swm@alcove.com.au> writes:
> > On Fri, 19 Jan 2007, Tom Lane wrote:
> >> Er, what primary key would that be exactly? And even if you had a key,
> >> I wouldn't call joining on it trivial; I'd call it expensive ...
>
> > I should have used slightly different language. What I meant to say was,
> > both sets are primarily sorted by saledate so they can be merged back
> > together. This is why I said it was trivial.
>
> Ah, my misunderstanding. Then isn't this basically isomorphic to what
> I was thinking of, ie, somewhat-smarter Aggref nodes attached to the
> existing GroupAggregate plan node?

Yep. I was thinking about this all morning. I think I've over-engineered the problem in my head. Window function input just looks like slightly more complex distinct aggregate input. I'll think on it more though.

To bring out a slightly different point -- and I know this is putting the cart before the horse -- but window functions will (potentially) output rows in the wrong order. I made a passing reference to this earlier. For example, say we have a table employees with the following data:

  empno | salary | age
  ====================
      1 |   2000 |  50
      2 |   6000 |  30
      3 |   3000 |  20

We want to answer the following: for each employee, what is their rank in terms of salary and what is their rank in terms of age? This query answers that:

  select empno, rank() over (order by salary) as srank,
         rank() over (order by age) as arank
  from employees order by empno;

The result will be:

  empno | srank | arank
  =====================
      1 |     1 |     3
      2 |     3 |     2
      3 |     2 |     1

Both window functions provide results based on the order of their input. So, in terms of empno, srank will output in this order: empno = 1, 3, 2; arank will output in this order: empno = 3, 2, 1. We need to glue these back together and the only way I can think how is via a synthetic key.
Ideally, the planner would have some clue about how large the result set will be and the orders coming from the window functions, so that it can decide whether to use a nested loop, merge join or hash join to do it.

Can you think of a different approach?

Thanks,

Gavin
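A minimal Python sketch of the synthetic-key idea above (all names are invented for illustration): tag each row with its input position, evaluate each rank() over its own sort order, then glue the per-window outputs back together on the tag.

```python
# (empno, salary, age) -- the example table from the post
employees = [(1, 2000, 50), (2, 6000, 30), (3, 3000, 20)]

def rank_over(rows, key):
    """Return {row_tag: rank} with rows ranked by key (no ties in this data)."""
    order = sorted(range(len(rows)), key=lambda i: key(rows[i]))
    return {i: pos + 1 for pos, i in enumerate(order)}

# Each window sorts the same input its own way; the row tag (list index)
# acts as the synthetic key for reassembly.
srank = rank_over(employees, key=lambda r: r[1])  # over (order by salary)
arank = rank_over(employees, key=lambda r: r[2])  # over (order by age)

result = sorted((emp[0], srank[i], arank[i]) for i, emp in enumerate(employees))
print(result)
# [(1, 1, 3), (2, 3, 2), (3, 2, 1)] -- matching the table in the post
```

The join back on the tag is the step whose cost (nested loop, merge, or hash) the planner would need to estimate.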
Gavin Sherry <swm@alcove.com.au> writes:
> We want to answer the following: for each employee, what is their rank in
> terms of salary and what is their rank in terms of age. This query
> answers that:
>
> select empno, rank() over (order by salary) as srank,
>        rank() over (order by age) as arank
> from employees order by empno;

Eeek. This seems like the worst sort of action-at-a-distance. How does rank() know what value it's supposed to report the rank of?

regards, tom lane
Hi,

Tom Lane wrote:
>> select empno, rank() over (order by salary) as srank,
>>        rank() over (order by age) as arank
>> from employees order by empno;
>
> Eeek. This seems like the worst sort of action-at-a-distance. How does
> rank() know what value it's supposed to report the rank of?

All of these ranking aggregate functions (rank, dense_rank, percent_rank, cume_dist and row_number) are normally used without any arguments; see the examples in [1] or [2]. However, they explicitly require an ORDER BY clause anyway, so I suppose they need one with exactly *one* argument? Does the standard say anything more explicit? Or should those functions just take the first ORDER BY argument? I.e., what should the following query do? Is it a legal query at all?

  select empno, cume_dist() over (order by salary, age) as rank
  from employees order by empno;

Regards

Markus

[1]: SQL Anywhere Server - SQL Reference, Window Clause: http://www.ianywhere.com/developer/product_manuals/sqlanywhere/1000/en/html/dbrfen10/rf-window-clause-statement.html
[2]: A techonthenet.com article about the cume_dist() function: http://www.techonthenet.com/oracle/functions/cume_dist.php
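One plausible reading of the multi-column case, sketched in Python with invented data: treat (salary, age) as a single composite sort key, so cume_dist() counts the fraction of rows whose composite key is less than or equal to the current row's. This is an assumption about the semantics for illustration, not a statement of what the standard requires.

```python
# (empno, salary, age); the extra row with a tied salary is invented to
# show the second ORDER BY column breaking the tie.
employees = [(1, 2000, 50), (2, 6000, 30), (3, 3000, 20), (4, 2000, 20)]

n = len(employees)

def cume_dist(row):
    """Fraction of rows whose composite (salary, age) key is <= this row's."""
    key = (row[1], row[2])
    return sum(1 for r in employees if (r[1], r[2]) <= key) / n

result = {r[0]: cume_dist(r) for r in employees}
print(result)
# {1: 0.5, 2: 1.0, 3: 0.75, 4: 0.25}
```

Under this reading the two-column ORDER BY is perfectly legal: it just refines which rows count as peers.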
On Sat, 2007-01-20 at 09:34 +1100, Gavin Sherry wrote:
> On Sat, 19 Jan 2007, Karen Hill wrote:
> > Gavin Sherry wrote:
> > > Recently, I've been researching and putting together a proposal for
> > > window functions.

Good news.

> > Implementing NULLS FIRST and NULLS LAST appears like another
> > challenging step to getting window functions wrapped up. Has your
> > research led you to any ideas on what your strategy for NULLS FIRST
> > and NULLS LAST would likely be?
>
> Explicit handling of NULL is only in the Oracle extension to window
> functions. I will look at the core SQL functionality first -- maybe with
> the extension of LEAD and LAG, because they are heavily utilised by
> those using window functions.

NULLS FIRST | LAST is part of SQL:2003; specifically, the definition of a window over which an aggregate function can be applied contains a sort specification list.

> Besides, Tom's been working on NULLS FIRST/LAST already.

For that very reason, I imagined.

-- Simon Riggs EnterpriseDB http://www.enterprisedb.com
On Sat, 2007-01-20 at 15:58 +1100, Gavin Sherry wrote:
> On Fri, 19 Jan 2007, Tom Lane wrote:
> > Gavin Sherry <swm@alcove.com.au> writes:
> > > On Fri, 19 Jan 2007, Tom Lane wrote:
> > >> Er, what primary key would that be exactly? And even if you had a key,
> > >> I wouldn't call joining on it trivial; I'd call it expensive ...
> >
> > > I should have used slightly different language. What I meant to say was,
> > > both sets are primarily sorted by saledate so they can be merged back
> > > together. This is why I said it was trivial.

Yes, your fan-out plan sounds best, together with the optimisation to remove whatever you call the individual strands of the fan-out. Think of a good name now, so we can discuss it more easily...

> Yep. I was thinking about this all morning. I think I've over-engineered
> the problem in my head. Window function input just looks like slightly
> more complex distinct aggregate input. I'll think on it more though.

I'm working on modifying Tuplestore to allow you to store the last N tuples, rather than all previous input. This is specifically designed to allow Merge Join to do Mark/Restore without materializing the whole sort set. It can also be used to materialize Windows (i.e. <window clause> in SQL:2003), so you can store the current row plus n PRECEDING and/or n FOLLOWING rows as appropriate. Reading from the Window would then be similar-ish to doing a Mark/Restore pair, which we can rename to MarkWindowStart and ReturnToWindowStart.

I'll present the prototype shortly; I've got a few bugs, but the basic principle is working already. I'm happy to craft it to your exact needs, so that you'll be able to press ahead with the rest of the implementation of windowed functions.

The Materialize node is almost unchanged, but I was thinking of renaming it when used in this way to make the EXPLAIN more readable. Perhaps we should call it a Materialized Window for both the MJ and Window function cases.
This won't help with UNBOUNDED window definitions, but I imagine that these are best handled by running aggregates, which would be an O(N) operation, rather than recalculating everything each time, which would be O(N^2).

> To bring out a slightly different point -- and I know this is putting the
> cart before the horse -- but window functions will (potentially) output
> rows in the wrong order. I made a passing reference to this earlier. For
> example, say we have a table employees with the following data:
>
>   empno | salary | age
>   ====================
>       1 |   2000 |  50
>       2 |   6000 |  30
>       3 |   3000 |  20
>
> We want to answer the following: for each employee, what is their rank in
> terms of salary and what is their rank in terms of age. This query
> answers that:
>
>   select empno, rank() over (order by salary) as srank,
>          rank() over (order by age) as arank
>   from employees order by empno;
>
> The result will be:
>
>   empno | srank | arank
>   =====================
>       1 |     1 |     3
>       2 |     3 |     2
>       3 |     2 |     1
>
> Both window functions provide results based on the order of their input.
> So, in terms of empno, srank will output in this order: empno = 1, 3, 2;
> arank will output in this order: empno = 3, 2, 1. We need to glue these
> back together and the only way I can think how is via a synthetic key.

Anything wrong with using empno?

> Ideally, the planner would have some clue about how large the result set
> will be and the orders coming from the window functions, so that it can
> decide whether to use nested loop, merge join or hash join to do it.
>
> Can you think of a different approach?

Sounds like figuring out and agreeing on the executor issues first is the best bet. Once we know what's to be done, extending the planner to do it will be easier.

-- Simon Riggs EnterpriseDB http://www.enterprisedb.com
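The running-aggregate point can be sketched in a few lines of Python (running sum() used as the example aggregate; names invented): the state survives across rows, so each row costs one transition rather than a rescan of all preceding rows.

```python
def running_sum_over_unbounded_preceding(values):
    """sum() over RANGE UNBOUNDED PRECEDING as a running aggregate: O(N)."""
    state = 0
    out = []
    for v in values:      # one state transition per row ...
        state += v        # ... the state is never rebuilt from scratch
        out.append(state) # finalize per row without discarding the state
    return out

print(running_sum_over_unbounded_preceding([5, 1, 3]))
# [5, 6, 9]
```

The naive alternative, re-aggregating rows 0..i for every row i, does O(N^2) work for the same answer.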
Simon Riggs wrote: > I'm working on modifying Tuplestore that will allow you to store the > last N tuples, rather than all previous input. This is specifically > designed to allow Merge Join to do Mark/Restore without materializing > the whole sort set. This can also be used to materialize Windows (i.e. > <window clause> in SQL:2003), so you can store the current row plus n > PRECEDING and/or n FOLLOWING rows as appropriate. Interesting. This could also be used to implement the "partial sort" stuff that Oleg is so interested in, and in which I believe Greg Stark was also interested. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Sat, 20 Jan 2007, Simon Riggs wrote: > On Sat, 2007-01-20 at 15:58 +1100, Gavin Sherry wrote: > > On Fri, 19 Jan 2007, Tom Lane wrote: > > > > > Gavin Sherry <swm@alcove.com.au> writes: > > > > On Fri, 19 Jan 2007, Tom Lane wrote: > > > >> Er, what primary key would that be exactly? And even if you had a key, > > > >> I wouldn't call joining on it trivial; I'd call it expensive ... > > > > > > > I should have used slightly different language. What I meant to say was, > > > > both sets are primarily sorted by saledate so they can be merged back > > > > together. This is why I said it was trivial. > > Yes, your fan out plan sounds best, together with the optimisation to > remove whatever you call the individual strands of the fan-out. Think of > a good name now, so we can discuss it more easily... Hah. Yes. I've been using the term 'subplan' but it isn't a good one. Can we just choose something 'cool' like iPlan2006? ;) > > Yep. I was thinking about this all morning. I think I've over engineered > > the problem in my head. Window function input just looks like a slightly > > more complex distinct aggregate input. I'll think on it more though. > > I'm working on modifying Tuplestore that will allow you to store the > last N tuples, rather than all previous input. This is specifically > designed to allow Merge Join to do Mark/Restore without materializing > the whole sort set. This can also be used to materialize Windows (i.e. > <window clause> in SQL:2003), so you can store the current row plus n > PRECEDING and/or n FOLLOWING rows as appropriate. Reading from the > Window would then be similar-ish to doing a Mark/Restore pair, which we > can rename to MarkWindowStart and ReturnToWindowStart. Wow. What a coincidence! Windows are slightly more complex though. 
As you probably know, there are two ways of specifying the window frame: by an absolute number of rows (ROWS N PRECEDING, for example); or by a 'range' (RANGE N PRECEDING), where the range, in the case of 'preceding', is determined by subtracting the range parameter from the value of the current field -- i.e., the window attribute. This presents a problem in terms of managing the size of the buffer. If you have data clustered around a value, then the amount of data input into the window function when we cross this value explodes. The tuplestore code can or will deal with this, but it is currently not designed for this kind of usage pattern. That is, every time we advance a row we must (a) recalculate the multiset to be input to the window function and (b) generate the value from the window function by passing the input to it. The problem arises when the window contains more data than can be stored in work_mem. Then, each time we need to recalculate the window function, we'll churn data through the buffer and get no effect from the buffering itself.

A lot of the research around window functions recommends offsetting this effect by buffering data 'pre-aggregated' for each distinct value in the buffer. Most of the research, however, relies on a trick we don't have available in the SQL window function spec: windows in SQL have a partition (irrelevant here), a data sort order and a range; windows in the world of window-function streaming-data research have all this and a 'slide'. The slide is the interval at which the aggregate is regenerated, and in SQL the value is regenerated for every new row. The research concedes that at this level, pre-aggregation either stops making sense or is very complex. I've come up with a way to do it, but not for all aggregates (median(), for example). I'll discuss this in the proposal to be sent through soon. The problem is, the requirements we have for buffering data around window functions could be very complex.
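A small Python sketch of the ROWS vs RANGE distinction described above (invented data and function names): ROWS N PRECEDING is always a bounded number of rows, while RANGE N PRECEDING admits every row whose sort-key value falls within N of the current value, so a cluster of equal values blows the frame up.

```python
def frame_rows(values, i, n):
    """ROWS n PRECEDING: the current row plus at most n prior rows."""
    return values[max(0, i - n): i + 1]

def frame_range(values, i, n):
    """RANGE n PRECEDING over input already sorted on `values`: every prior
    row whose value is within n of the current row's value."""
    return [v for v in values[: i + 1] if values[i] - v <= n]

vals = [1, 10, 10, 10, 11]        # data clustered around the value 10
print(frame_rows(vals, 4, 1))     # [10, 11] -- always two rows
print(frame_range(vals, 4, 1))    # [10, 10, 10, 11] -- the cluster floods in
```

The buffer for a ROWS frame is trivially bounded; the buffer for a RANGE frame is bounded only by how the data clusters, which is the work_mem concern above.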
> I'll present the prototype shortly, I've got a few bugs, but the basic
> principle is working already. I'm happy to craft that to your exact
> needs, so that you'll be able to press ahead with the rest of the
> implementation of Windowed functions.

Awesome. I will get the proposal off so that we can discuss the requirements at length.

> The Materialize node is almost unchanged, but I was thinking of renaming
> it when used in this way to make the EXPLAIN more readable. Perhaps we
> should call it a Materialized Window for both the MJ and Window function
> cases.

I think that would be confusing in the case of MJ.

> This won't help with UNBOUNDED window definitions, but I imagine that
> these are best handled by running aggregates which would be an O(N)
> operation, rather than recalculating everything each time, which would
> be O(N^2).

Correct. You only need to recalculate the aggregate if it has a moving window frame.

> > To bring out a slightly different point -- and I know this is putting the
> > cart before the horse -- but window functions will (potentially) output
> > rows in the wrong order. I made a passing reference to this earlier. For
> > example, say we have a table employees with the following data:
> >
> >   empno | salary | age
> >   ====================
> >       1 |   2000 |  50
> >       2 |   6000 |  30
> >       3 |   3000 |  20
> >
> > We want to answer the following: for each employee, what is their rank in
> > terms of salary and what is their rank in terms of age. This query
> > answers that:
> >
> >   select empno, rank() over (order by salary) as srank,
> >          rank() over (order by age) as arank
> >   from employees order by empno;
> >
> > The result will be:
> >
> >   empno | srank | arank
> >   =====================
> >       1 |     1 |     3
> >       2 |     3 |     2
> >       3 |     2 |     1
> >
> > Both window functions provide results based on the order of their input.
> > So, in terms of empno, srank will output in this order: empno = 1, 3, 2;
> > arank will output in this order: empno = 3, 2, 1.
> > We need to glue these back together and the only way I can think how is
> > via a synthetic key.
>
> Anything wrong with using empno?

It might not be a unique key.

> > Ideally, the planner would have some clue about how large the result set
> > will be and the orders coming from the window functions, so that it can
> > decide whether to use nested loop, merge join or hash join to do it.
> >
> > Can you think of a different approach?
>
> Sounds like figuring out and agreeing on the executor issues first is the
> best bet. Once we know what's to be done, extending the planner to do it
> will be easier.

Exactly. As you can see, it's a pretty big topic, so I'll work further on the proposal and send it off.

Thanks,

Gavin
On Sat, 20 Jan 2007, Tom Lane wrote:

> Gavin Sherry <swm@alcove.com.au> writes:
> > We want to answer the following: for each employee, what is their rank
> > in terms of salary and what is their rank in terms of age. This query
> > answers that:
>
> > select empno, rank() over (order by salary) as srank,
> >        rank() over (order by age) as arank
> > from employees order by empno;
>
> Eeek. This seems like the worst sort of action-at-a-distance. How does
> rank() know what value it's supposed to report the rank of?

This is a frustratingly inconsistent bit of the spec. Rank is defined as follows:

  RANK() OVER WNS is equivalent to:
    ( COUNT (*) OVER (WNS1 RANGE UNBOUNDED PRECEDING)
      - COUNT (*) OVER (WNS1 RANGE CURRENT ROW) + 1 )

Say the salary column has the following values: {100, 200, 200, 300}. This would give the following output: {1, 2, 2, 4}. DENSE_RANK() would give: {1, 2, 2, 3}.

These functions are pretty ugly (if you think about them in terms of our existing aggregates). However, they are by far the most heavily used window functions (along with ROW_NUMBER()).

Thanks,

Gavin
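Computing the quoted equivalence literally over the example values, in Python (a sketch, not executor code): rank(r) is the count of rows in r's RANGE UNBOUNDED PRECEDING frame, minus the count of r's peers (its RANGE CURRENT ROW frame), plus one.

```python
def rank(values):
    """RANK() over (order by value), via the spec's COUNT(*) equivalence."""
    return [sum(1 for w in values if w <= v)     # RANGE UNBOUNDED PRECEDING
            - sum(1 for w in values if w == v)   # RANGE CURRENT ROW (peers)
            + 1
            for v in values]

def dense_rank(values):
    """DENSE_RANK(): one rank per distinct value below or at the current."""
    return [sum(1 for w in set(values) if w < v) + 1 for v in values]

salaries = [100, 200, 200, 300]
print(rank(salaries))        # [1, 2, 2, 4]
print(dense_rank(salaries))  # [1, 2, 2, 3]
```

This reproduces the outputs given in the post, and makes the "gap after ties" behaviour of RANK() versus DENSE_RANK() explicit.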
"Alvaro Herrera" <alvherre@commandprompt.com> writes:

> Simon Riggs wrote:
>
>> I'm working on modifying Tuplestore that will allow you to store the
>> last N tuples, rather than all previous input. This is specifically
>> designed to allow Merge Join to do Mark/Restore without materializing
>> the whole sort set. This can also be used to materialize Windows (i.e.
>> <window clause> in SQL:2003), so you can store the current row plus n
>> PRECEDING and/or n FOLLOWING rows as appropriate.
>
> Interesting. This could also be used to implement the "partial sort"
> stuff that Oleg is so interested in, and in which I believe Greg Stark
> was also interested.

I already have that implemented in tuplesort. The place where I was asking for more guidance was at higher levels: how to communicate this information down to the Sort node. It's not exactly the same as this case, where you want to control when the old tuples are purged. I think the SORT LIMIT case is simpler since you just have to tell the tuplesort how many tuples you're interested in and it can take care of pruning irrelevant tuples as it goes.

-- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gavin,

I'm also interested in the topic, but right now I am wondering if rank() is a reserved function name? We're working on built-in tsearch2 for the 8.3 release and we already have a rank() function.

Oleg

On Sun, 21 Jan 2007, Gavin Sherry wrote:

> On Sat, 20 Jan 2007, Tom Lane wrote:
>
>> Gavin Sherry <swm@alcove.com.au> writes:
>>> We want to answer the following: for each employee: what is their rank in
>>> terms of salary and what is their rank in terms of age. This query
>>> answers that:
>>
>>> select empno, rank() over (order by salary) as srank,
>>>        rank() over (order by age) as arank
>>> from employees order by empno;
>>
>> Eeek. This seems like the worst sort of action-at-a-distance. How does
>> rank() know what value it's supposed to report the rank of?
>
> This is a frustratingly inconsistent bit of the spec. Rank is defined as
> follows:
>
>   RANK() OVER WNS is equivalent to:
>     ( COUNT (*) OVER (WNS1 RANGE UNBOUNDED PRECEDING)
>       - COUNT (*) OVER (WNS1 RANGE CURRENT ROW) + 1 )
>
> Say the salary column has the following values: {100, 200, 200, 300}. This
> would give the following output: {1, 2, 2, 4}. DENSE_RANK() would give:
> {1, 2, 2, 3}.
>
> These functions are pretty ugly (if you think about them in terms of our
> existing aggregates). However, they are by far the most heavily used
> window functions (along with ROW_NUMBER()).
>
> Thanks,
>
> Gavin

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
"Gavin Sherry" <swm@alcove.com.au> writes:

> Wow. What a coincidence! Windows are slightly more complex though. As you
> probably know, there are two ways of specifying the window frame: by an
> absolute number of rows (ROWS N PRECEDING, for example); or, by a 'range'
> (RANGE N PRECEDING), where the range, in the case of 'preceding', is
> determined by subtracting the range parameter from the value of the
> current field -- i.e., the window attribute.

Actually I think there's a third distinct subcase here as well. While in theory "RANGE UNBOUNDED PRECEDING" could be done using the same logic as N PRECEDING, I think it makes more sense to treat it as a distinct case because it is amenable to better plans.

For RANGE N PRECEDING in the general case we need to reapply the window aggregates over the entire window partition for every record. Some window aggregate functions may have the property of being able to "remove" the effects of a state transition, which allows for an optimization (for example avg(), which keeps a running sum and count, can easily subtract the old tuple being aged out of the window). But not all aggregates can do this. RANK(), I believe, will need to re-sort the entire window partition for every record.

However for RANGE UNBOUNDED PRECEDING we can apply a different plan: keep the state variable for each window aggregate around for the entire time. For each record, apply the state transition function, then apply the FINAL function to generate the result for that record, but keep the state variable as it was for the next record.

-- Gregory Stark EnterpriseDB http://www.enterprisedb.com
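The "removable" state transition can be sketched in Python as follows (names invented; avg() as the example aggregate): the oldest tuple is aged out of a ROWS N PRECEDING frame by applying an inverse transition to the running state, so each row costs O(1) instead of a rescan of the frame.

```python
from collections import deque

def moving_avg(values, n_preceding):
    """avg() over ROWS n_preceding PRECEDING, using an invertible transition:
    the running total is adjusted as tuples enter and leave the frame."""
    frame, total, out = deque(), 0.0, []
    for v in values:
        frame.append(v)
        total += v                        # forward state transition
        if len(frame) > n_preceding + 1:
            total -= frame.popleft()      # inverse transition: age out old tuple
        out.append(total / len(frame))    # per-row final function
    return out

print(moving_avg([2, 4, 6, 8], 1))
# [2.0, 3.0, 5.0, 7.0]
```

An aggregate with no inverse transition (rank() or median(), say) cannot be maintained this way and must reprocess the whole frame per row, which is the O(N^2) case described above.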
On Sat, 20 Jan 2007, Gregory Stark wrote:

> However for RANGE UNBOUNDED PRECEDING we can apply a different plan. Keep the
> state variable for each window aggregate around for the entire time. For each
> record apply the state transition function then apply the FINAL function to
> generate the result for that record but keep the state variable as it was for
> the next record.

Yes, I made reference to this case elsewhere in the email (or was it a
different email in the thread?).

Thanks,

Gavin
On Sat, 2007-01-20 at 23:54 +1100, Gavin Sherry wrote:

> > > Yep. I was thinking about this all morning. I think I've over-engineered
> > > the problem in my head. Window function input just looks like a slightly
> > > more complex distinct aggregate input. I'll think on it more though.
> >
> > I'm working on modifying Tuplestore so that it will allow you to store the
> > last N tuples, rather than all previous input. This is specifically
> > designed to allow Merge Join to do Mark/Restore without materializing
> > the whole sort set. This can also be used to materialize Windows (i.e.
> > <window clause> in SQL:2003), so you can store the current row plus n
> > PRECEDING and/or n FOLLOWING rows as appropriate. Reading from the
> > Window would then be similar-ish to doing a Mark/Restore pair, which we
> > can rename to MarkWindowStart and ReturnToWindowStart.
>
> Wow. What a coincidence!

Not much of one, though. The overhaul of the sort code was the first step on
the road to Windowed aggregates.

> Windows are slightly more complex though. As you
> probably know, there are two ways of specifying the window frame: by an
> absolute number of rows (ROWS N PRECEDING, for example); or, by a 'range'
> (RANGE N PRECEDING), where the range, in the case of 'preceding', is
> determined by subtracting the range parameter from the value of the current
> field -- i.e., the window attribute.

Sure. The MJ situation is to have the MergeJoin node call a Materialize node
which calls the tuplestore functions. The MJ node does all the clever stuff,
which includes a variable-size buffer according to the values arriving. It
seems OK to have a Window node that does its own brand of clever stuff,
while the Materialize node just does what it's told in managing the
tuplestore.

ROWS-type calls can do a read next/mark at the same time, so they always
maintain a fixed lead/lag as they go.

RANGE-type calls can do a read, then some processing to determine whether
they should mark yet, just as MJ does.
Or at least we can support an additional call to make RangeWindowStart hunt
for the appropriate row to set as the end of the window, and then read
forward until the end of the range is found.

> This presents a problem in terms of managing the size of the buffer. If
> you have data clustered around a value then the amount of data input into
> the window function when we cross this value explodes. The tuplestore code
> can or will deal with this but it is currently not designed for this kind
> of usage pattern. That is, every time we advance a row we must (a)
> recalculate the multiset to be input to the window function and (b)
> generate the value from the window function by passing the input to it.
> The problem arises when the window contains more data than can be stored
> in work_mem. Then, each time we need to recalculate the window function,
> we'll churn data through the buffer and get no effect from the buffering
> itself.

The reason I'm doing this is that MJs suck when they have large sorted
input, so the problem you describe is essentially the same one I worried
about when discussing the MJ optimisation. The cost of preparing
RandomAccess sort files is *immense*, and scrolling backwards is just not a
clever thing to do. The problem you describe won't occur in all cases, and
my guess is that accepting that it can happen occasionally will still make
it run loads faster for large queries than if we use a full RandomAccess
sort. We don't really think of it that way because we are paying the
RandomAccess costs every time at present, but Windowed aggregates would
stick out rather badly if they used that method every time, no matter what
the window size is.

There'll always be a stupidly specified query that performs badly. That's
where the planner can step in to recognize poor situations and do something
about it.

> A lot of the research around window functions recommends offsetting this
> effect by buffering data 'pre-aggregated' for each distinct value in the
> buffer.
> Most of the research, however, relies on a trick we don't have
> available in the SQL window function spec: windows in SQL can have a
> partition (irrelevant here), data sort order and a range; windows in the
> world of window function streaming data research have this and a 'slide'.
> Slide is the interval at which the aggregate is regenerated, and in SQL the
> value is regenerated for every new row. The research concedes that at this
> level, preaggregation either stops making sense or is very complex.

That sounds like Phase 2, don't you think? I'd like to get something into
8.3... The user can always use a subselect to pre-aggregate if they're
worried about the performance of their query. Maybe not in a complex
fan-out, but then anybody with a large query that has multiple conflicting,
large windows might reasonably expect poor performance in the first release.

> I've come up with a way to be able to do it, but not for all aggregates
> (median() for example). I'll discuss this in the proposal to be sent
> through soon. The problem is, the requirements we have for buffering data
> around window functions could be very complex.
>
> > I'll present the prototype shortly. I've got a few bugs, but the basic
> > principle is working already. I'm happy to craft that to your exact
> > needs, so that you'll be able to press ahead with the rest of the
> > implementation of Windowed functions.
>
> Awesome. I will get the proposal off so that we can discuss the
> requirements at length.

Looking forward to it. I'll help where I can.

> > > To bring out a slightly different point -- and I know this is putting the
> > > cart before the horse -- but window functions will (potentially) output
> > > rows in the wrong order. I made a passing reference to this earlier.
> > > For example, say we have a table employees with the following data:
> > >
> > > empno | salary | age
> > > ====================
> > >   1   |  2000  | 50
> > >   2   |  6000  | 30
> > >   3   |  3000  | 20
> > >
> > > We want to answer the following: for each employee, what is their rank
> > > in terms of salary and what is their rank in terms of age? This query
> > > answers that:
> > >
> > > select empno, rank() over (order by salary) as srank,
> > >        rank() over (order by age) as arank
> > > from employees order by empno;
> > >
> > > The result will be:
> > >
> > > empno | srank | arank
> > > =====================
> > >   1   |   1   |   3
> > >   2   |   3   |   2
> > >   3   |   2   |   1
> > >
> > > Both window functions provide results based on the order of their input.
> > > So, in terms of empno, srank will output in this order: empno = 1, 3, 2;
> > > arank will output in this order: empno = 3, 2, 1. We need to glue these
> > > back together and the only way I can think how is via a synthetic key.
> >
> > Anything wrong with using empno?
>
> It might not be a unique key.

Not sure how such a result set could have meaning to anyone, e.g.

empno	srank	arank
1	1	3
1	3	2
1	2	1

Which emp came 1st? 1. Which emp came 3rd? 1... no useful meaning. I'd be
inclined to suggest that such a query should be prevented from being
executed, with a message along the lines of "no GROUP BY specified". Or at
least the problem could be ignored in the first implementation; it's not
high up the list of meaningful queries we would like to allow users to
execute, IMHO. [Order by meaning desc nulls last :-)]

> > > Ideally, the planner would have some clue about how large the result
> > > set will be and the orders from the window functions, so that it can
> > > decide whether to use a nested loop, merge join or hash join to do it.
> > >
> > > Can you think of a different approach?
> >
> > Sounds like figuring out and agreeing the executor issues first is the
> > best bet.
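The re-glueing problem Gavin describes can be modelled like so (a Python
sketch; the synthetic key here is simply the original row position, which is
an assumption for illustration, not a proposed design):

```python
employees = [(1, 2000, 50), (2, 6000, 30), (3, 3000, 20)]  # (empno, salary, age)

def rank_by(rows, col):
    # Each window function sorts its own input, so its results come back in
    # a different order; return {synthetic_row_id: rank} so they can be
    # re-joined afterwards. (No ties in this data, so simple enumeration
    # stands in for full RANK() semantics.)
    ordered = sorted(range(len(rows)), key=lambda i: rows[i][col])
    return {i: pos + 1 for pos, i in enumerate(ordered)}

srank = rank_by(employees, 1)   # rank() over (order by salary)
arank = rank_by(employees, 2)   # rank() over (order by age)

# glue both rank streams back together on the synthetic key
result = [(employees[i][0], srank[i], arank[i]) for i in range(len(employees))]
print(result)   # [(1, 1, 3), (2, 3, 2), (3, 2, 1)]
```

If empno were duplicated, the synthetic key still distinguishes the rows,
which is exactly why Gavin argues empno alone is not sufficient.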
> > Once we know what's to be done, extending the planner to do it
> > will be easier.
>
> Exactly. As you can see, it's a pretty big topic so I'll work further on
> the proposal and send it off.

It would be very cool to start with a set of queries that are specifically
designed to cover all of the complexities, with expected output. That way we
can reference a real example of what we are discussing, just like your
example here. The range of functionality is very large and hard to
visualise, methinks.

-- 
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
On Sat, 2007-01-20 at 13:36 +0000, Gregory Stark wrote:

> "Alvaro Herrera" <alvherre@commandprompt.com> writes:
>
> > Simon Riggs wrote:
> >
> >> I'm working on modifying Tuplestore that will allow you to store the
> >> last N tuples, rather than all previous input. This is specifically
> >> designed to allow Merge Join to do Mark/Restore without materializing
> >> the whole sort set. This can also be used to materialize Windows (i.e.
> >> <window clause> in SQL:2003), so you can store the current row plus n
> >> PRECEDING and/or n FOLLOWING rows as appropriate.
> >
> > Interesting. This could also be used to implement the "partial sort"
> > stuff that Oleg is so interested in, and in which I believe Greg Stark
> > was also interested.
>
> I already have that implemented in tuplesort. The place where I was asking
> for more guidance was at higher levels: how to communicate this information
> down to the Sort node.
>
> It's not exactly the same as this case, where you want to control when the
> old tuples are purged. I think the SORT LIMIT case is simpler, since you
> just have to tell the tuplesort how many tuples you're interested in and it
> can take care of pruning irrelevant tuples as it goes.

It does sound the same, but I think Greg is right: Sort Limit optimises the
sort itself, not how the processing is handled following the Sort. Both
approaches are required, for different cases.

-- 
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
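The SORT LIMIT pruning Greg describes can be sketched with a bounded heap (a
Python model of the idea, not the tuplesort code):

```python
import heapq

def bounded_sort(rows, limit):
    # Keep only the `limit` smallest rows while scanning the input,
    # discarding anything that can no longer make the cut -- the sort
    # never needs to hold more than `limit` tuples at once.
    heap = []                       # max-heap of the current best, via negation
    for r in rows:
        if len(heap) < limit:
            heapq.heappush(heap, -r)
        elif r < -heap[0]:
            heapq.heapreplace(heap, -r)   # evict the current worst
    return sorted(-x for x in heap)

print(bounded_sort([5, 1, 4, 2, 3], 2))   # [1, 2]
```

This is the sense in which SORT LIMIT optimises the sort itself; the
circular-buffer tuplestore instead controls what happens to tuples after the
sort, which is why Simon says both are needed.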
On Sat, 2007-01-20 at 14:20 +0000, Simon Riggs wrote:

> On Sat, 2007-01-20 at 23:54 +1100, Gavin Sherry wrote:
>
> > Windows are slightly more complex though. As you
> > probably know, there are two ways of specifying the window frame: by an
> > absolute number of rows (ROWS N PRECEDING, for example); or, by a 'range'
> > (RANGE N PRECEDING), where the range, in the case of 'preceding', is
> > determined by subtracting the range parameter from the value of the
> > current field -- i.e., the window attribute.
>
> Sure. The MJ situation is to have the MergeJoin node call a Materialize
> node which calls the tuplestore functions. The MJ node does all the
> clever stuff, which includes a variable-size buffer according to the
> values arriving. Seems OK to have a Window node that does its own brand
> of clever stuff, while the Materialize node just does what it's told in
> managing the tuplestore.
>
> ROWS-type calls can do a read next/mark at the same time, so they always
> maintain a fixed lead/lag as they go.
>
> RANGE-type calls can do a read, then some processing to determine whether
> they should mark yet, just as MJ does. Or at least we can support an
> additional call to make RangeWindowStart hunt for the appropriate row to
> set as the end of the window and then read forward until the end of the
> range is found.

Gavin,

Following earlier thoughts, I thought I'd make some notes about how the new
tuplestore changes could be used for preparing Windows for use with Windowed
aggregate functionality. The new tuplestore commands are:

  tuplestore_moveBufferStartForwards() == mark
  tuplestore_rewindToBufferStart()     == restore

(happy to change the names...)

They'd be used something like this, in simplified pseudo code that would be
executed by a Window? node.
With ROWS, for each new tuple:
------------------------------
  tuplestore_puttupleslot()                     // add one new tuple
  tuplestore_moveBufferStartForwards(++markpos) // move start forwards one

  // position Window for processing
  tuplestore_rewindToBufferStart()

With RANGE, for each new row:
-----------------------------
  // locate new range start
  tuplestore_rewindToBufferStart()
  ...step forward until start of new range found...
  tuplestore_moveBufferStartForwards() at that point

  // locate new range end
  while (!end-of-new-range)
  {
      if (!eof)
          tuplestore_gettuple()       // move forwards
      else
          tuplestore_puttupleslot()   // add new tuple
  }

  // position Window for processing
  tuplestore_rewindToBufferStart()

(The above is simplified to remove boundary conditions.)

So AFAICS, there's not actually any additional work to do over and above the
changes I'm working on to support a circular buffer in tuplestore for merge
joins.

The above makes the assumption that for RANGE windows, the range start of
tuple i+1 is always greater than or equal to the range start of tuple i --
it could be very interesting if that's not true.

Anyway, some options for you to consider, at least.

-- 
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
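The ROWS case of the pseudo code above can be modelled with a small circular
buffer (a Python sketch; the method names merely echo the proposed
tuplestore calls and are not real APIs):

```python
from collections import deque

class WindowBuffer:
    # toy stand-in for the circular-buffer tuplestore
    def __init__(self):
        self.buf = deque()

    def put(self, tup):            # tuplestore_puttupleslot()
        self.buf.append(tup)

    def trim_start(self):          # tuplestore_moveBufferStartForwards()
        self.buf.popleft()

    def scan(self):                # tuplestore_rewindToBufferStart() + reads
        return list(self.buf)

def rows_preceding(stream, n):
    # ROWS n PRECEDING: the frame is the current row plus up to n prior rows,
    # so the buffer keeps a fixed lead/lag as it advances.
    wb = WindowBuffer()
    for tup in stream:
        wb.put(tup)
        if len(wb.buf) > n + 1:
            wb.trim_start()        # move the buffer start forward one
        yield wb.scan()            # position the Window for processing

print(list(rows_preceding([1, 2, 3, 4], 2)))
# [[1], [1, 2], [1, 2, 3], [2, 3, 4]]
```

The RANGE case would replace the fixed trim with a search for the new range
start, exactly as the pseudo code's "step forward until start of new range
found" loop describes.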