Thread: limit is sometimes not pushed in view with order
I was investigating some performance issues and stumbled upon this behavior:

create table main_table (i serial primary key, data varchar, ord int);
create view main_view_order as select m.i, m.data, m.ord from main_table m order by m.i desc;

insert into main_table select i, i::text, i/10 from generate_series(1,1000000) i;

create index ix_ord on main_table(ord);
analyze main_table;

explain analyze select * from main_view_order m where m.ord >= 5000 and m.ord <= 5500 limit 10;

Limit  (cost=0.00..69.01 rows=10 width=14) (actual time=330.943..330.951 rows=10 loops=1)
  ->  Index Scan Backward using main_table_pkey on main_table m  (cost=0.00..36389.36 rows=5281 width=14) (actual time=330.937..330.940 rows=10 loops=1)
        Filter: ((ord >= 5000) AND (ord <= 5500))
Total runtime: 330.975 ms

I haven't found it on the TODO list or in the archives, so I'm wondering if this is known behavior.

Regards,
Rikard

--
Rikard Pavelic
http://www.ngs.hr/
Rikard Pavelic <rikard@ngs.hr> writes:
> I was investigating some performance issues and stumbled upon this behavior:
> create table main_table (i serial primary key, data varchar, ord int);
> create view main_view_order as select m.i, m.data, m.ord from main_table m order by m.i desc;
> insert into main_table select i, i::text, i/10 from generate_series(1,1000000) i;
> create index ix_ord on main_table(ord);
> analyze main_table;
> explain analyze select * from main_view_order m where m.ord >= 5000 and m.ord <= 5500 limit 10;
> Limit  (cost=0.00..69.01 rows=10 width=14) (actual time=330.943..330.951 rows=10 loops=1)
>   ->  Index Scan Backward using main_table_pkey on main_table m  (cost=0.00..36389.36 rows=5281 width=14) (actual time=330.937..330.940 rows=10 loops=1)
>         Filter: ((ord >= 5000) AND (ord <= 5500))
> Total runtime: 330.975 ms
> I haven't found it on the TODO list or in the archives, so I'm wondering if this is known behavior.

There is nothing particularly wrong with that plan, or at least I'd not recommend holding your breath waiting for it to get better.  Given this scenario, there are two possible (index-based) plans.  The one above scans the pkey index in decreasing order, reports out only the rows satisfying the "ord" condition, and stops as soon as it has 10 rows.  The only other alternative is to scan the ord index to collect the 5000 or so rows satisfying the "ord" condition, sort them all by "i", and then throw away 4990 of them and return just the first 10.

The planner realizes that about 1/200th of the table satisfies the "ord" condition, so it estimates that the first plan will require scanning about 2000 entries in the pkey index to get 10 results.  So that looks significantly cheaper than the other plan, which would require 5000 index fetches, not to mention a sort step.

Now, in this artificial test case, the cost estimate is wrong because "i" and "ord" are perfectly correlated and all the desired rows are quite far down the descending-i index scan; so the chosen plan actually has to scan a lot more than 2000 index entries.  In a more realistic case that plan would probably work noticeably better.  However, the planner has no statistics that would tell it about the degree of order correlation of the two columns, so it's not able to find that out.

Having said all that, neither of these plan choices is exactly ideal; the planner is basically reduced to having to guess which one will suck less.  You might try experimenting with two-column indexes on (i, ord) or (ord, i) to give the planner some other cards to play.  I'm not sure how much that would help in this exact query type, but for example the common case of "where x = constant order by y" is perfectly matched to a btree index on (x, y).

			regards, tom lane
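For example, against the test table upthread the "where x = constant order by y" shape would look like this (a sketch only; the index name is arbitrary and I haven't verified the exact plan it produces here):

    create index ix_ord_i on main_table (ord, i);

    -- equality on ord plus ORDER BY i DESC can be satisfied by walking the
    -- (ord, i) index backward within the ord = 5000 range, so no separate
    -- sort step should be needed
    explain analyze
    select *
    from main_table m
    where m.ord = 5000
    order by m.i desc
    limit 10;

For the original range condition on ord, such a two-column index at best gives the planner another option to cost, as noted above.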
On 13/04/13 18:25, Rikard Pavelic wrote:
> I was investigating some performance issues and stumbled upon this behavior:
>
> create table main_table (i serial primary key, data varchar, ord int);
> create view main_view_order as select m.i, m.data, m.ord from main_table m order by m.i desc;
>
> insert into main_table select i, i::text, i/10 from generate_series(1,1000000) i;
>
> create index ix_ord on main_table(ord);
> analyze main_table;
>
> explain analyze select * from main_view_order m where m.ord >= 5000 and m.ord <= 5500 limit 10;
>
> Limit  (cost=0.00..69.01 rows=10 width=14) (actual time=330.943..330.951 rows=10 loops=1)
>   ->  Index Scan Backward using main_table_pkey on main_table m  (cost=0.00..36389.36 rows=5281 width=14) (actual time=330.937..330.940 rows=10 loops=1)
>         Filter: ((ord >= 5000) AND (ord <= 5500))
> Total runtime: 330.975 ms
>
> I haven't found it on the TODO list or in the archives, so I'm wondering if this is known behavior.
>
> Regards,
> Rikard

Hi,

Disregard the VIEW for the moment (it's not the issue here).

I wasn't able to get much better than a LIMIT of around 50 after a SET STATISTICS 1000 on the PK column (i).

julian=# explain analyse select * from main_table m where m.ord >= 5000 and m.ord <= 5500 order by m.i desc limit 49;
                                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=352.73..352.85 rows=49 width=14) (actual time=3.215..3.227 rows=49 loops=1)
   ->  Sort  (cost=352.73..365.23 rows=5000 width=14) (actual time=3.213..3.217 rows=49 loops=1)
         Sort Key: i
         Sort Method: top-N heapsort  Memory: 27kB
         ->  Index Scan using ix_ord on main_table m  (cost=0.00..187.36 rows=5000 width=14) (actual time=0.025..1.479 rows=5010 loops=1)
               Index Cond: ((ord >= 5000) AND (ord <= 5500))
 Total runtime: 3.252 ms

However, at LIMIT 48 it goes bad:

julian=# explain analyse select * from main_table m where m.ord >= 5000 and m.ord <= 5500 order by m.i desc limit 48;
                                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..349.34 rows=48 width=14) (actual time=280.158..280.179 rows=48 loops=1)
   ->  Index Scan Backward using main_table_pkey on main_table m  (cost=0.00..36389.36 rows=5000 width=14) (actual time=280.156..280.172 rows=48 loops=1)
         Filter: ((ord >= 5000) AND (ord <= 5500))
         Rows Removed by Filter: 944991
 Total runtime: 280.206 ms

49 rows is pretty good IMO. But others might want to give some tips, because I don't use LIMIT much.

You might want to consider using CURSORs, which in this example would cache the 49 rows and let you FETCH the rows you want more efficiently.

Regards,
Jules.
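A minimal sketch of the cursor approach (note that the planner costs a cursor on the assumption that only a fraction of the rows will be fetched, per cursor_tuple_fraction, so the plan may differ from the plain LIMIT query):

    begin;

    declare c cursor for
        select * from main_table m
        where m.ord >= 5000 and m.ord <= 5500
        order by m.i desc;

    -- grab rows in batches instead of re-running the query with different LIMITs
    fetch 10 from c;
    fetch 10 from c;

    close c;
    commit;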
On Sat, 13 Apr 2013 11:21:19 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The planner realizes that about 1/200th of the table satisfies the
> "ord" condition, so it estimates that the first plan will require
> scanning about 2000 entries in the pkey index to get 10 results.  So
> that looks significantly cheaper than the other plan, which would
> require 5000 index fetches, not to mention a sort step.

Is it really realizing that? Maybe I oversimplified my example. I was trying to show that the planner is not using the limit information when it should. For example, if there were joins, it would do the joins first using the filter estimate and only apply the limit at the end.

> Now, in this artificial test case, the cost estimate is wrong because
> "i" and "ord" are perfectly correlated and all the desired rows are
> quite far down the descending-i index scan; so the chosen plan
> actually has to scan a lot more than 2000 index entries.  In a more
> realistic case that plan would probably work noticeably better.
> However, the planner has no statistics that would tell it about the
> degree of order correlation of the two columns, so it's not able to
> find that out.

That's actually a pretty realistic scenario (if you change ord int to a created timestamp), but yes, it's probably best to create a composite index for that scenario.

But look at this standard ERP example:

create table main_table (id serial primary key, data varchar, created timestamptz);
create table detail_table (main_id int references main_table, index int, data varchar, primary key(main_id, index));

create view main_view as
select m.id, m.data, m.created, d.details
from main_table m
left join (
    select main_id, array_agg(d order by index) details
    from detail_table d
    group by main_id
) d on m.id = d.main_id
order by m.created desc;

insert into main_table select i, i::text, now() + (i/10 || ' sec')::interval from generate_series(1,100001) i;
insert into detail_table select i/5 + 1, i%5, i::text from generate_series(1,500000) i;

create index ix_created on main_table(created desc);

analyze main_table;
analyze detail_table;

explain analyze select * from main_view m where m.created >= now() + '1 min'::interval and m.created <= now() + '5 min'::interval limit 10;

Limit  (cost=0.01..22913.81 rows=10 width=49) (actual time=35.548..920.034 rows=10 loops=1)
  ->  Nested Loop Left Join  (cost=0.01..5879654.29 rows=2566 width=49) (actual time=35.546..920.028 rows=10 loops=1)
        Join Filter: (m.id = d.main_id)
        Rows Removed by Join Filter: 904978
        ->  Index Scan using ix_created on main_table m  (cost=0.01..98.79 rows=2521 width=17) (actual time=0.059..0.103 rows=10 loops=1)
              Index Cond: ((created >= (now() + '00:01:00'::interval)) AND (created <= (now() + '00:05:00'::interval)))
        ->  Materialize  (cost=0.00..25343.93 rows=101773 width=36) (actual time=0.012..84.037 rows=90499 loops=10)
              ->  Subquery Scan on d  (cost=0.00..24039.07 rows=101773 width=36) (actual time=0.036..630.576 rows=100001 loops=1)
                    ->  GroupAggregate  (cost=0.00..23021.34 rows=101773 width=46) (actual time=0.035..619.240 rows=100001 loops=1)
                          ->  Index Scan using detail_table_pkey on detail_table d  (cost=0.00..19249.17 rows=500000 width=46) (actual time=0.012..154.834 rows=500000 loops=1)
Total runtime: 922.272 ms

While one could argue that the optimizer doesn't know how to optimize a left join grouped by its primary key, you can replace that join with some other joins (e.g. a left join to another table's pk) and the same behavior will be displayed (joining all the tables and applying the limit at the end).
That's why I asked whether this fence for pushing the limit down is known behavior.

Since this behavior is really important to me, I will spend a lot of time looking at Postgres to try and improve this.

Regards,
Rikard
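P.S. For this particular view I can work around the fence by pushing the left join into a scalar subquery in the select list, roughly like this (a sketch; the details column keeps the same array-of-detail-rows type, and rows without details still come back with NULL):

    create or replace view main_view as
    select m.id,
           m.data,
           m.created,
           (select array_agg(d order by d.index)
            from detail_table d
            where d.main_id = m.id) as details
    from main_table m
    order by m.created desc;

With this form the aggregate subplan only runs for the rows that actually make it past the limit, but it's a per-view rewrite, not a general fix.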
On Sat, 13 Apr 2013 20:08:16 +0200
Rikard Pavelic <rikard@ngs.hr> wrote:
> While one could argue that the optimizer doesn't know how to optimize
> a left join grouped by its primary key, you can replace that join with
> some other joins (e.g. a left join to another table's pk) and the same
> behavior will be displayed (joining all the tables and applying the
> limit at the end).
> That's why I asked whether this fence for pushing the limit down is known behavior.

While I can work around that problem by pushing the left join into a subquery in the select list, this doesn't solve all problems, since the limit is not pushed down in nontrivial queries.

This is probably the best example:

create table big_table(i serial primary key, delay int);

create function some_calculation(i int) returns int as $$
begin
    perform pg_sleep(i);
    return i*i;
end $$ language plpgsql stable cost 100000;

create view big_view as
select t.i, some_calculation(t.delay) as calc, s.delay as d2
from big_table t
left join big_table s on t.i = s.i + s.i
order by t.i asc;

insert into big_table select i, i%5 from generate_series(1, 100000) i;
analyze big_table;

explain analyze select * from big_view v where i >= 100 and i <= 105 limit 1;

Limit  (cost=3201.63..3201.64 rows=1 width=12) (actual time=10017.471..10017.471 rows=1 loops=1)
  ->  Sort  (cost=3201.63..3201.64 rows=5 width=12) (actual time=10017.469..10017.469 rows=1 loops=1)
        Sort Key: t.i
        Sort Method: top-N heapsort  Memory: 25kB
        ->  Hash Right Join  (cost=8.52..3201.57 rows=5 width=12) (actual time=0.078..10017.436 rows=6 loops=1)
              Hash Cond: ((s.i + s.i) = t.i)
              ->  Seq Scan on big_table s  (cost=0.00..1443.00 rows=100000 width=8) (actual time=0.005..6.294 rows=100000 loops=1)
              ->  Hash  (cost=8.46..8.46 rows=5 width=8) (actual time=0.012..0.012 rows=6 loops=1)
                    Buckets: 1024  Batches: 1  Memory Usage: 1kB
                    ->  Index Scan using big_table_pkey on big_table t  (cost=0.00..8.46 rows=5 width=8) (actual time=0.007..0.008 rows=6 loops=1)
                          Index Cond: ((i >= 100) AND (i <= 105))
Total runtime: 10017.514 ms

explain analyze select * from big_view v where i >= 100 and i <= 10005 limit 1;

Limit  (cost=0.00..2391.22 rows=1 width=12) (actual time=0.088..0.088 rows=1 loops=1)
  ->  Nested Loop Left Join  (cost=0.00..23780547.26 rows=9945 width=12) (actual time=0.087..0.087 rows=1 loops=1)
        Join Filter: (t.i = (s.i + s.i))
        Rows Removed by Join Filter: 49
        ->  Index Scan using big_table_pkey on big_table t  (cost=0.00..359.26 rows=9945 width=8) (actual time=0.014..0.014 rows=1 loops=1)
              Index Cond: ((i >= 100) AND (i <= 10005))
        ->  Materialize  (cost=0.00..2334.00 rows=100000 width=8) (actual time=0.009..0.020 rows=50 loops=1)
              ->  Seq Scan on big_table s  (cost=0.00..1443.00 rows=100000 width=8) (actual time=0.005..0.010 rows=50 loops=1)
Total runtime: 0.122 ms

explain analyze select * from big_view v where i >= 100 and i <= 10005 limit 10;

takes too long...

To me this looks like it should be fixable if the limit were applied before all the targets are evaluated. If I remove the left join from the view, Postgres works as expected, so I guess it already knows how to apply the limit before evaluating the selected targets, but this is probably missing for subqueries whose targets are pulled up out of them.

Maybe this is a hard problem, since you would probably need closures to pull off a general solution, but there are plenty of use cases without group by that would benefit from applying the limit before evaluating targets that are used only in the topmost result.

So, I was wondering if this is a known problem and whether there is any interest in tackling it?

Regards,
Rikard
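P.S. To make "apply the limit before evaluating the targets" concrete, this is the rewrite I effectively have to do by hand today (a sketch that inlines the view and keeps some_calculation out of the inner query, so it only runs for the rows that survive the limit):

    select sub.i, some_calculation(sub.delay) as calc, sub.d2
    from (
        select t.i, t.delay, s.delay as d2
        from big_table t
        left join big_table s on t.i = s.i + s.i
        where t.i >= 100 and t.i <= 10005
        order by t.i asc
        limit 10
    ) sub
    order by sub.i asc;

It would be nice if querying the view with a limit behaved like this automatically.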