Thread: Missing optimization when filters are applied after window functions

Missing optimization when filters are applied after window functions

From
Volker Grabsch
Date:
Dear PostgreSQL hackers,

[ please CC to me as I'm not subscribed to the list ]

I think there is a missing optimization when a filter is
applied after a window function, where the filtered field
is also used for partitioning.

Here a simplified example: Suppose we have a table
that stores 100.000 ids, and 10 random values per id:
   create table test (       id integer,       value double precision,       primary key (id, value)   );
   insert into test (id, value)       select           generate_series(1, 100000) id,           random()       from
     generate_series(1, 10);
 

We are now asking for the rank of each value per id
(it doesn't matter if it's rank() or any other window
function):
   select       id,       value,       rank() over (partition by id order by value)   from       test;

Now suppose we're interested only in the result for
id 23000. If the query above was defined in a view,
this would lead to the following query:
   select       *   from       (           select               id,               value,               rank() over
(partitionby id order by value)           from               test       ) subquery   where       id = 23000;
 

This is very slow! The "explain" command shows the reason:
   Subquery Scan on subquery  (cost=0.00..86556.80 rows=5000 width=20)      Filter: (subquery.id = 23000)       ->
WindowAgg (cost=0.00..74056.80 rows=1000000 width=12)            ->  Index Scan using test_pkey on test
(cost=0.00..56556.80rows=1000000 width=12)
 

The Filter (id = 23000) is applied after WindowAgg,
although it could certainly be applied before!

Of course we could rewrite the above query like this:
   select       id,       value,       rank() over (partition by id order by value)   from       test   where       id
=23000;
 

This would lead to a more sane plan where the filter is
applied before WindowAgg, making use of the primary key:
   WindowAgg  (cost=43.53..43.70 rows=10 width=12)     ->  Sort  (cost=43.53..43.55 rows=10 width=12)           Sort
Key:value           ->  Bitmap Heap Scan on test  (cost=4.53..43.36 rows=10 width=12)                 Recheck Cond: (id
=23000)                 ->  Bitmap Index Scan on test_pkey  (cost=0.00..4.53 rows=10 width=0)
IndexCond: (id = 23000)
 

However, the second query cannot be cleanly split into
a generic view and a specialized part anymore.

In other words, it is currently impossible to use Window
functions in a View without risking serious, unexpected
performance issues, even in simple cases!

I propose the following general optimization: If all window
functions are partitioned by the same first field (here: id),
then any filter on that field should be executed before
WindowAgg. So a query like this:
   select       ...   from       (select           ... over (partition by FIELD, other2, ...)           ... over
(partitionby FIELD, other3, ...)           ... over (partition by FIELD, ...)           ...       )   where
SOME_FILTER(FIELD)

Shoule be transformed into that:
   select       ...   from       (select           ... over (partition by FIELD, other2, ...)           ... over
(partitionby FIELD, other3, ...)           ... over (partition by FIELD, ...)           ...        where
SOME_FILTER(FIELD)      )
 

I verfied that this issue exists with PostgreSQL 9.2 beta 1
as well as the PostgreSQL 9.1 release.

Is there any work on this issue? Or am I wrong and this
optimization should not be implemented for some reason?


Regards,
Volker

-- 
Volker Grabsch
---<<(())>>---


Volker Grabsch <vog@notjusthosting.com> writes:
> I propose the following general optimization: If all window
> functions are partitioned by the same first field (here: id),
> then any filter on that field should be executed before
> WindowAgg.

I'm not sure if that rule is correct in detail, but in any case the
short answer is that window aggregates are a new feature in Postgres
and we basically haven't done any optimization work on them yet.
Feel free to work in that area if it interests you...
        regards, tom lane


Re: Missing optimization when filters are applied after window functions

From
Hitoshi Harada
Date:
On Wed, May 16, 2012 at 12:50 AM, Volker Grabsch <vog@notjusthosting.com> wrote:
> I propose the following general optimization: If all window
> functions are partitioned by the same first field (here: id),
> then any filter on that field should be executed before
> WindowAgg. So a query like this:

I think that's possible.  Currently the planner doesn't think any
qualification from the upper query can be pushed down to a query that
has a window function.  It would be possible to let it push down if
the expression matches PARTITION BY expression.  However, the
challenge is that a query may have a number of window functions that
have different PARTITION BY expressions.  At the time of pushing down
in the planning, it is not obvious which window function comes first.
One idea is to restrict such optimization in only case of single
window function, and the other is to make it generalize and cover a
lot of cases.

That said, our planner on window functions has a lot of improvement to
be done.  Every kind of optimization I see is what I raised above;
they can be done easily by hacking in a small case, or they can be
done by generalizing for the most of cases.  My understanding is our
project tends to like the latter and it takes a little time but covers
more use cases.

Thanks,
-- 
Hitoshi Harada


Re: Missing optimization when filters are applied after window functions

From
Volker Grabsch
Date:
Hitoshi Harada schrieb:
> On Wed, May 16, 2012 at 12:50 AM, Volker Grabsch <vog@notjusthosting.com> wrote:
> > I propose the following general optimization: If all window
> > functions are partitioned by the same first field (here: id),
> > then any filter on that field should be executed before
> > WindowAgg. So a query like this:
> 
> I think that's possible.  Currently the planner doesn't think any
> qualification from the upper query can be pushed down to a query that
> has a window function.  It would be possible to let it push down if
> the expression matches PARTITION BY expression.

Sounds great!

> However, the
> challenge is that a query may have a number of window functions that
> have different PARTITION BY expressions.  At the time of pushing down
> in the planning, it is not obvious which window function comes first.

I'm don't really unterstand what you mean with "which window function
comes first", because to my understanding, all window functions of
a query belong to the same level in the query hierarchy. But then,
my knowledge of PostgreSQL internals isn't very deep, either.

> One idea is to restrict such optimization in only case of single
> window function, and the other is to make it generalize and cover a
> lot of cases.

From a practical point of view, the restriction to a single window
function wouldn't be that bad, although I'd prefer to think about
the number of different windows rather than number of window functions.

In other words, every optimization that is correct for a single window
function is also correct for multiple window functions if those use
all the same window.

> That said, our planner on window functions has a lot of improvement to
> be done.  Every kind of optimization I see is what I raised above;
> they can be done easily by hacking in a small case, or they can be
> done by generalizing for the most of cases.  My understanding is our
> project tends to like the latter and it takes a little time but covers
> more use cases.

I'd also prefer to see a general solution, as this provides less
room for unpleasant surprises (e.g. "This query is only slightly
different from the previous one. Why does it take so much longer?").

On the other hand, any small improvement is a big step forward
regarding window functions.

Unfortunately, I can't voluteer on that, as it is currently
impossible for me to allocate enough time for this.

However, any pointer to where to look at the source (or in the
manual) would be of great. Maybe I'll find at least enough time
to provide a rough proposal, or to improve existing attempts
to solve this issue.

Also, is there any chance to include a (simple) attempt of
such an optimiztation into PostgreSQL-9.2 beta, or is this
only a possible topic for 9.3 and later?


Regards,
Volker

-- 
Volker Grabsch
---<<(())>>---


Re: Missing optimization when filters are applied after window functions

From
Nicolas Barbier
Date:
2012/5/17 Volker Grabsch <vog@notjusthosting.com>:

> Also, is there any chance to include a (simple) attempt of
> such an optimiztation into PostgreSQL-9.2 beta, or is this
> only a possible topic for 9.3 and later?

For 9.2, you’re about 4 months late :-). The last commitfest was in Januari:

<URL:https://commitfest.postgresql.org/action/commitfest_view?id=13>

Nicolas

--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?


Re: Missing optimization when filters are applied after window functions

From
Hitoshi Harada
Date:
On Thu, May 17, 2012 at 7:26 AM, Volker Grabsch <vog@notjusthosting.com> wrote:
> Hitoshi Harada schrieb:
>> On Wed, May 16, 2012 at 12:50 AM, Volker Grabsch <vog@notjusthosting.com> wrote:
>> > I propose the following general optimization: If all window
>> > functions are partitioned by the same first field (here: id),
>> > then any filter on that field should be executed before
>> > WindowAgg. So a query like this:
>>
>> I think that's possible.  Currently the planner doesn't think any
>> qualification from the upper query can be pushed down to a query that
>> has a window function.  It would be possible to let it push down if
>> the expression matches PARTITION BY expression.
>
> Sounds great!
>
>> However, the
>> challenge is that a query may have a number of window functions that
>> have different PARTITION BY expressions.  At the time of pushing down
>> in the planning, it is not obvious which window function comes first.
>
> I'm don't really unterstand what you mean with "which window function
> comes first", because to my understanding, all window functions of
> a query belong to the same level in the query hierarchy. But then,
> my knowledge of PostgreSQL internals isn't very deep, either.

No, you can specify as many window specification as you like.  Say,

SELECT count(*) over (w1), count(*) over (w2)
FROM tbl
WINDOW w1 AS (PARTITION BY x ORDER BY y), w2 AS (PARTITION BY y ORDER BYx);

and in the same query level there are different type of window
specifications.  The code as stands today doesn't have any semantics
about which of w1 or w2 to run first (probably w1 will be run first,
but the query semantics doesn't enforce anything).

>> One idea is to restrict such optimization in only case of single
>> window function, and the other is to make it generalize and cover a
>> lot of cases.
>
> From a practical point of view, the restriction to a single window
> function wouldn't be that bad, although I'd prefer to think about
> the number of different windows rather than number of window functions.
>
> In other words, every optimization that is correct for a single window
> function is also correct for multiple window functions if those use
> all the same window.

Yeah, I mean, multiple windows, not window functions.

>> That said, our planner on window functions has a lot of improvement to
>> be done.  Every kind of optimization I see is what I raised above;
>> they can be done easily by hacking in a small case, or they can be
>> done by generalizing for the most of cases.  My understanding is our
>> project tends to like the latter and it takes a little time but covers
>> more use cases.
>
> I'd also prefer to see a general solution, as this provides less
> room for unpleasant surprises (e.g. "This query is only slightly
> different from the previous one. Why does it take so much longer?").
>
> On the other hand, any small improvement is a big step forward
> regarding window functions.
>
> Unfortunately, I can't voluteer on that, as it is currently
> impossible for me to allocate enough time for this.
>
> However, any pointer to where to look at the source (or in the
> manual) would be of great. Maybe I'll find at least enough time
> to provide a rough proposal, or to improve existing attempts
> to solve this issue.

Look at subquery_is_pushdown_safe in allpath.c.  Here we stop looking
deeper if the upper qualification can be pushed down to the inner
sub-query if the sub-query has any window function expressions.


Thanks,
--
Hitoshi Harada


Re: Missing optimization when filters are applied after window functions

From
Bruce Momjian
Date:
On Wed, May 16, 2012 at 09:25:13AM -0400, Tom Lane wrote:
> Volker Grabsch <vog@notjusthosting.com> writes:
> > I propose the following general optimization: If all window
> > functions are partitioned by the same first field (here: id),
> > then any filter on that field should be executed before
> > WindowAgg.
> 
> I'm not sure if that rule is correct in detail, but in any case the
> short answer is that window aggregates are a new feature in Postgres
> and we basically haven't done any optimization work on them yet.
> Feel free to work in that area if it interests you...

Is this a TODO?

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + It's impossible for everything to be true. +