Thread: limit is sometimes not pushed in view with order
I was investigating some performance issues and stumbled upon this behavior:

create table main_table (i serial primary key, data varchar, ord int);
create view main_view_order as select m.i, m.data, m.ord from main_table m order by m.i desc;

insert into main_table select i, i::text, i/10 from generate_series(1,1000000) i;

create index ix_ord on main_table(ord);
analyze main_table;

explain analyze select * from main_view_order m where m.ord >= 5000 and m.ord <= 5500 limit 10;

Limit  (cost=0.00..69.01 rows=10 width=14) (actual time=330.943..330.951 rows=10 loops=1)
  ->  Index Scan Backward using main_table_pkey on main_table m  (cost=0.00..36389.36 rows=5281 width=14) (actual time=330.937..330.940 rows=10 loops=1)
        Filter: ((ord >= 5000) AND (ord <= 5500))
Total runtime: 330.975 ms

I haven't found it on the TODO list or in the archives, so I'm wondering if this is known behavior.

Regards,
Rikard

--
Rikard Pavelic
http://www.ngs.hr/
Rikard Pavelic <rikard@ngs.hr> writes:
> I was investigating some performance issues and stumbled upon this behavior:
> create table main_table (i serial primary key, data varchar, ord int);
> create view main_view_order as select m.i, m.data, m.ord from main_table m order by m.i desc;
> insert into main_table select i, i::text, i/10 from generate_series(1,1000000) i;
> create index ix_ord on main_table(ord);
> analyze main_table;
> explain analyze select * from main_view_order m where m.ord >= 5000 and m.ord <= 5500 limit 10;
> Limit  (cost=0.00..69.01 rows=10 width=14) (actual time=330.943..330.951 rows=10 loops=1)
>   ->  Index Scan Backward using main_table_pkey on main_table m  (cost=0.00..36389.36 rows=5281 width=14) (actual time=330.937..330.940 rows=10 loops=1)
>         Filter: ((ord >= 5000) AND (ord <= 5500))
> Total runtime: 330.975 ms
> I haven't found it on the TODO list or in the archives, so I'm wondering if this is known behavior.

There is nothing particularly wrong with that plan, or at least I'd not recommend holding your breath waiting for it to get better.  Given this scenario, there are two possible (index-based) plans.  The one above scans the pkey index in decreasing order, reports out only the rows satisfying the "ord" condition, and stops as soon as it has 10 rows.  The only other alternative is to scan the ord index to collect the 5000 or so rows satisfying the "ord" condition, sort them all by "i", and then throw away 4990 of them and return just the first 10.

The planner realizes that about 1/200th of the table satisfies the "ord" condition, so it estimates that the first plan will require scanning about 2000 entries in the pkey index to get 10 results.  So that looks significantly cheaper than the other plan, which would require 5000 index fetches, not to mention a sort step.

Now, in this artificial test case, the cost estimate is wrong because "i" and "ord" are perfectly correlated and all the desired rows are quite far down the descending-i index scan; so the chosen plan actually has to scan a lot more than 2000 index entries.  In a more realistic case that plan would probably work noticeably better.  However, the planner has no statistics that would tell it about the degree of order correlation of the two columns, so it's not able to find that out.

Having said all that, neither of these plan choices is exactly ideal; the planner is basically reduced to having to guess which one will suck less.  You might try experimenting with two-column indexes on (i, ord) or (ord, i) to give the planner some other cards to play.  I'm not sure how much that would help in this exact query type, but for example the common case of "where x = constant order by y" is perfectly matched to a btree index on (x, y).

			regards, tom lane
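For example, against the test table upthread the "where x = constant order by y" shape would look like this (a sketch only; the index name is arbitrary and I haven't verified the exact plan it produces here):

    create index ix_ord_i on main_table (ord, i);

    -- equality on ord plus ORDER BY i DESC can be satisfied by walking the
    -- (ord, i) index backward within the ord = 5000 range, so no separate
    -- sort step should be needed
    explain analyze
    select *
    from main_table m
    where m.ord = 5000
    order by m.i desc
    limit 10;

For the original range condition on ord, such a two-column index at best gives the planner another option to cost, as noted above.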
On 13/04/13 18:25, Rikard Pavelic wrote:
> I was investigating some performance issues and stumbled upon this behavior:
>
> create table main_table (i serial primary key, data varchar, ord int);
> create view main_view_order as select m.i, m.data, m.ord from main_table m order by m.i desc;
>
> insert into main_table select i, i::text, i/10 from generate_series(1,1000000) i;
>
> create index ix_ord on main_table(ord);
> analyze main_table;
>
> explain analyze select * from main_view_order m where m.ord >= 5000 and m.ord <= 5500 limit 10;
>
> Limit  (cost=0.00..69.01 rows=10 width=14) (actual time=330.943..330.951 rows=10 loops=1)
>   ->  Index Scan Backward using main_table_pkey on main_table m  (cost=0.00..36389.36 rows=5281 width=14) (actual time=330.937..330.940 rows=10 loops=1)
>         Filter: ((ord >= 5000) AND (ord <= 5500))
> Total runtime: 330.975 ms
>
> I haven't found it on the TODO list or in the archives, so I'm wondering if this is known behavior.
>
> Regards,
> Rikard

Hi,

Disregard the VIEW for the moment (it's not the issue here).

I wasn't able to get much better than a LIMIT of around 50 after a SET STATISTICS 1000 on the PK column (i).

julian=# explain analyse select * from main_table m where m.ord >= 5000 and m.ord <= 5500 order by m.i desc limit 49;
                                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=352.73..352.85 rows=49 width=14) (actual time=3.215..3.227 rows=49 loops=1)
   ->  Sort  (cost=352.73..365.23 rows=5000 width=14) (actual time=3.213..3.217 rows=49 loops=1)
         Sort Key: i
         Sort Method: top-N heapsort  Memory: 27kB
         ->  Index Scan using ix_ord on main_table m  (cost=0.00..187.36 rows=5000 width=14) (actual time=0.025..1.479 rows=5010 loops=1)
               Index Cond: ((ord >= 5000) AND (ord <= 5500))
 Total runtime: 3.252 ms

However, at LIMIT 48 it goes bad:

julian=# explain analyse select * from main_table m where m.ord >= 5000 and m.ord <= 5500 order by m.i desc limit 48;
                                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..349.34 rows=48 width=14) (actual time=280.158..280.179 rows=48 loops=1)
   ->  Index Scan Backward using main_table_pkey on main_table m  (cost=0.00..36389.36 rows=5000 width=14) (actual time=280.156..280.172 rows=48 loops=1)
         Filter: ((ord >= 5000) AND (ord <= 5500))
         Rows Removed by Filter: 944991
 Total runtime: 280.206 ms

49 rows is pretty good IMO. But others might want to give some tips, because I don't use LIMIT much.

You might want to consider using CURSORs, which in this example would cache the 49 rows and let you FETCH the rows you want more efficiently.

Regards,
Jules.
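A minimal sketch of the cursor approach (note that the planner costs a cursor on the assumption that only a fraction of the rows will be fetched, per cursor_tuple_fraction, so the plan may differ from the plain LIMIT query):

    begin;

    declare c cursor for
        select * from main_table m
        where m.ord >= 5000 and m.ord <= 5500
        order by m.i desc;

    -- grab rows in batches instead of re-running the query with different LIMITs
    fetch 10 from c;
    fetch 10 from c;

    close c;
    commit;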
On Sat, 13 Apr 2013 11:21:19 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The planner realizes that about 1/200th of the table satisfies the
> "ord" condition, so it estimates that the first plan will require
> scanning about 2000 entries in the pkey index to get 10 results.  So
> that looks significantly cheaper than the other plan, which would
> require 5000 index fetches, not to mention a sort step.

Is it really realizing that? Maybe I oversimplified my example. I was trying to show that the planner is not using the limit information when it should. For example, if there were joins, it would do the joins first using the filter estimate and only apply the limit at the end.

> Now, in this artificial test case, the cost estimate is wrong because
> "i" and "ord" are perfectly correlated and all the desired rows are
> quite far down the descending-i index scan; so the chosen plan
> actually has to scan a lot more than 2000 index entries.  In a more
> realistic case that plan would probably work noticeably better.
> However, the planner has no statistics that would tell it about the
> degree of order correlation of the two columns, so it's not able to
> find that out.

That's actually a pretty realistic scenario (if you change ord int to a created timestamp), but yes, it's probably best to create a composite index for that scenario.

But look at this standard ERP example:

create table main_table (id serial primary key, data varchar, created timestamptz);
create table detail_table (main_id int references main_table, index int, data varchar, primary key(main_id, index));

create view main_view as
select m.id, m.data, m.created, d.details
from main_table m
left join (
    select main_id, array_agg(d order by index) details
    from detail_table d
    group by main_id
) d on m.id = d.main_id
order by m.created desc;

insert into main_table select i, i::text, now() + (i/10 || ' sec')::interval from generate_series(1,100001) i;
insert into detail_table select i/5 + 1, i%5, i::text from generate_series(1,500000) i;

create index ix_created on main_table(created desc);

analyze main_table;
analyze detail_table;

explain analyze select * from main_view m where m.created >= now() + '1 min'::interval and m.created <= now() + '5 min'::interval limit 10;

Limit  (cost=0.01..22913.81 rows=10 width=49) (actual time=35.548..920.034 rows=10 loops=1)
  ->  Nested Loop Left Join  (cost=0.01..5879654.29 rows=2566 width=49) (actual time=35.546..920.028 rows=10 loops=1)
        Join Filter: (m.id = d.main_id)
        Rows Removed by Join Filter: 904978
        ->  Index Scan using ix_created on main_table m  (cost=0.01..98.79 rows=2521 width=17) (actual time=0.059..0.103 rows=10 loops=1)
              Index Cond: ((created >= (now() + '00:01:00'::interval)) AND (created <= (now() + '00:05:00'::interval)))
        ->  Materialize  (cost=0.00..25343.93 rows=101773 width=36) (actual time=0.012..84.037 rows=90499 loops=10)
              ->  Subquery Scan on d  (cost=0.00..24039.07 rows=101773 width=36) (actual time=0.036..630.576 rows=100001 loops=1)
                    ->  GroupAggregate  (cost=0.00..23021.34 rows=101773 width=46) (actual time=0.035..619.240 rows=100001 loops=1)
                          ->  Index Scan using detail_table_pkey on detail_table d  (cost=0.00..19249.17 rows=500000 width=46) (actual time=0.012..154.834 rows=500000 loops=1)
Total runtime: 922.272 ms

While one could argue that the optimizer doesn't know how to optimize a left join grouped by its primary key, you can replace that join with some other joins (e.g. a left join to another table's pk) and the same behavior will be displayed (joining all the tables and applying the limit at the end).
That's why I asked whether this fence for pushing the limit down is known behavior.

Since this behavior is really important to me, I will spend a lot of time looking at Postgres to try and improve this.

Regards,
Rikard
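P.S. For this particular view I can work around the fence by pushing the left join into a scalar subquery in the select list, roughly like this (a sketch; the details column keeps the same array-of-detail-rows type, and rows without details still come back with NULL):

    create or replace view main_view as
    select m.id,
           m.data,
           m.created,
           (select array_agg(d order by d.index)
            from detail_table d
            where d.main_id = m.id) as details
    from main_table m
    order by m.created desc;

With this form the aggregate subplan only runs for the rows that actually make it past the limit, but it's a per-view rewrite, not a general fix.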
On Sat, 13 Apr 2013 20:08:16 +0200
Rikard Pavelic <rikard@ngs.hr> wrote:
> While one could argue that the optimizer doesn't know how to optimize
> a left join grouped by its primary key, you can replace that join with
> some other joins (e.g. a left join to another table's pk) and the same
> behavior will be displayed (joining all the tables and applying the
> limit at the end).
> That's why I asked whether this fence for pushing the limit down is known behavior.

While I can work around that problem by pushing the left join into a subquery in the select list, this doesn't solve all problems, since the limit is not pushed down in nontrivial queries.

This is probably the best example:

create table big_table(i serial primary key, delay int);

create function some_calculation(i int) returns int as $$
begin
    perform pg_sleep(i);
    return i*i;
end $$ language plpgsql stable cost 100000;

create view big_view as
select t.i, some_calculation(t.delay) as calc, s.delay as d2
from big_table t
left join big_table s on t.i = s.i + s.i
order by t.i asc;

insert into big_table select i, i%5 from generate_series(1, 100000) i;
analyze big_table;

explain analyze select * from big_view v where i >= 100 and i <= 105 limit 1;

Limit  (cost=3201.63..3201.64 rows=1 width=12) (actual time=10017.471..10017.471 rows=1 loops=1)
  ->  Sort  (cost=3201.63..3201.64 rows=5 width=12) (actual time=10017.469..10017.469 rows=1 loops=1)
        Sort Key: t.i
        Sort Method: top-N heapsort  Memory: 25kB
        ->  Hash Right Join  (cost=8.52..3201.57 rows=5 width=12) (actual time=0.078..10017.436 rows=6 loops=1)
              Hash Cond: ((s.i + s.i) = t.i)
              ->  Seq Scan on big_table s  (cost=0.00..1443.00 rows=100000 width=8) (actual time=0.005..6.294 rows=100000 loops=1)
              ->  Hash  (cost=8.46..8.46 rows=5 width=8) (actual time=0.012..0.012 rows=6 loops=1)
                    Buckets: 1024  Batches: 1  Memory Usage: 1kB
                    ->  Index Scan using big_table_pkey on big_table t  (cost=0.00..8.46 rows=5 width=8) (actual time=0.007..0.008 rows=6 loops=1)
                          Index Cond: ((i >= 100) AND (i <= 105))
Total runtime: 10017.514 ms

explain analyze select * from big_view v where i >= 100 and i <= 10005 limit 1;

Limit  (cost=0.00..2391.22 rows=1 width=12) (actual time=0.088..0.088 rows=1 loops=1)
  ->  Nested Loop Left Join  (cost=0.00..23780547.26 rows=9945 width=12) (actual time=0.087..0.087 rows=1 loops=1)
        Join Filter: (t.i = (s.i + s.i))
        Rows Removed by Join Filter: 49
        ->  Index Scan using big_table_pkey on big_table t  (cost=0.00..359.26 rows=9945 width=8) (actual time=0.014..0.014 rows=1 loops=1)
              Index Cond: ((i >= 100) AND (i <= 10005))
        ->  Materialize  (cost=0.00..2334.00 rows=100000 width=8) (actual time=0.009..0.020 rows=50 loops=1)
              ->  Seq Scan on big_table s  (cost=0.00..1443.00 rows=100000 width=8) (actual time=0.005..0.010 rows=50 loops=1)
Total runtime: 0.122 ms

explain analyze select * from big_view v where i >= 100 and i <= 10005 limit 10;

takes too long...

To me this looks like it should be fixable if the limit were applied before all the targets are evaluated. If I remove the left join from the view, Postgres works as expected, so I guess it already knows how to apply the limit before evaluating the selected targets, but this is probably missing for subqueries whose targets are pulled up out of them.

Maybe this is a hard problem, since you would probably need closures to pull off a general solution, but there are plenty of use cases without group by that would benefit from applying the limit before evaluating targets that are used only in the topmost result.

So, I was wondering if this is a known problem and whether there is any interest in tackling it?

Regards,
Rikard
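P.S. To make "apply the limit before evaluating the targets" concrete, this is the rewrite I effectively have to do by hand today (a sketch that inlines the view and keeps some_calculation out of the inner query, so it only runs for the rows that survive the limit):

    select sub.i, some_calculation(sub.delay) as calc, sub.d2
    from (
        select t.i, t.delay, s.delay as d2
        from big_table t
        left join big_table s on t.i = s.i + s.i
        where t.i >= 100 and t.i <= 10005
        order by t.i asc
        limit 10
    ) sub
    order by sub.i asc;

It would be nice if querying the view with a limit behaved like this automatically.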