Thread: BUG #7495: chosen wrong index
The following bug has been logged on the website:

Bug reference:      7495
Logged by:          Andreas
Email address:      psql@elbrief.de
PostgreSQL version: 9.1.4
Operating system:   Debian Linux
Description:

Hello.

create table bla ( a int , b int ) ;

insert into bla ( a , b )
select a , a
from generate_series( 1 , 1000000 ) as a ( a ) ;

create index bla_a on bla ( a ) ;
create index bla_b on bla ( b ) ;

explain analyze select * from bla
where b > 990000 limit 10 ;
                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.27 rows=10 width=8) (actual time=0.150..0.173 rows=10 loops=1)
   ->  Index Scan using bla_b on bla  (cost=0.00..265.29 rows=10000 width=8) (actual time=0.147..0.159 rows=10 loops=1)
         Index Cond: (b > 990000)
 Total runtime: 0.226 ms

explain analyze select * from bla
where b > 990000 order by a limit 10 ;
                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..26.32 rows=10 width=8) (actual time=991.096..991.113 rows=10 loops=1)
   ->  Index Scan using bla_a on bla  (cost=0.00..26322.29 rows=10000 width=8) (actual time=991.093..991.103 rows=10 loops=1)
         Filter: (b > 990000)
 Total runtime: 991.164 ms

explain analyze select * from
( select * from bla where b > 990000
  union
  select * from bla where b < 0 ) a
order by a limit 10 ;
                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=835.76..835.78 rows=10 width=8) (actual time=51.551..51.571 rows=10 loops=1)
   ->  Sort  (cost=835.76..860.76 rows=10001 width=8) (actual time=51.547..51.548 rows=10 loops=1)
         Sort Key: wasnoch.bla.a
         Sort Method: top-N heapsort  Memory: 17kB
         ->  HashAggregate  (cost=419.62..519.63 rows=10001 width=8) (actual time=32.061..42.544 rows=10000 loops=1)
               ->  Append  (cost=0.00..369.62 rows=10001 width=8) (actual time=0.037..19.857 rows=10000 loops=1)
                     ->  Index Scan using bla_b on bla  (cost=0.00..265.29 rows=10000 width=8) (actual time=0.035..11.538 rows=10000 loops=1)
                           Index Cond: (b > 990000)
                     ->  Index Scan using bla_b on bla  (cost=0.00..4.31 rows=1 width=8) (actual time=0.012..0.012 rows=0 loops=1)
                           Index Cond: (b < 0)
 Total runtime: 51.678 ms

seq_page_cost = 1.0
random_page_cost = 20.0
restart server

explain analyze select * from bla
where b > 997400 order by a limit 10 ;
                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=253.37..253.40 rows=10 width=8) (actual time=3.642..3.653 rows=10 loops=1)
   ->  Sort  (cost=253.37..259.87 rows=2600 width=8) (actual time=3.639..3.643 rows=10 loops=1)
         Sort Key: a
         Sort Method: top-N heapsort  Memory: 17kB
         ->  Index Scan using bla_b on bla  (cost=0.00..197.19 rows=2600 width=8) (actual time=0.041..2.155 rows=2600 loops=1)
               Index Cond: (b > 997400)
 Total runtime: 3.698 ms

seq_page_cost = 1.0
random_page_cost = 2.0
restart server

explain analyze select * from bla
where b > 997400 order by a limit 10 ;
                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..101.24 rows=10 width=8) (actual time=726.649..726.667 rows=10 loops=1)
   ->  Index Scan using bla_a on bla  (cost=0.00..26322.29 rows=2600 width=8) (actual time=726.642..726.652 rows=10 loops=1)
         Filter: (b > 997400)
 Total runtime: 726.731 ms

explain analyze select * from bla
where b > 997699 order by a limit 10 ;
                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=114.29..114.31 rows=10 width=8) (actual time=4.009..4.020 rows=10 loops=1)
   ->  Sort  (cost=114.29..120.04 rows=2301 width=8) (actual time=4.007..4.011 rows=10 loops=1)
         Sort Key: a
         Sort Method: top-N heapsort  Memory: 17kB
         ->  Index Scan using bla_b on bla  (cost=0.00..64.56 rows=2301 width=8) (actual time=0.068..2.448 rows=2301 loops=1)
               Index Cond: (b > 997699)
 Total runtime: 4.073 ms

I have also played with cpu_tuple_cost, cpu_index_tuple_cost and cpu_operator_cost, but I have not found a setting that chooses index bla_b below b > 996000, even though up to b > 900000 it is faster to use bla_b instead of bla_a. I think the planner estimates the costs wrongly.

Best regards,
Andreas
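A side note on the methodology above: the "restart server" steps are not actually required for these parameters. seq_page_cost and random_page_cost are ordinary session-level settings, so the cost variations can be tried much more quickly with plain SET; a minimal sketch, reusing the bla table from the report:

-- Planner cost parameters can be changed for the current session
-- only; no server restart is needed for these settings.
SET seq_page_cost = 1.0 ;
SET random_page_cost = 20.0 ;

explain analyze select * from bla
where b > 997400 order by a limit 10 ;

-- Return to the configured defaults for this session.
RESET seq_page_cost ;
RESET random_page_cost ;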
<psql@elbrief.de> wrote:

> insert into bla ( a , b )
> select a , a
> from generate_series( 1 , 1000000 ) as a ( a ) ;

> explain analyze select * from bla
> where b > 990000 order by a limit 10 ;
> [uses the index on a and has a long run time]

The problem is that PostgreSQL doesn't have any sense of the
correlation between columns a and b (i.e., that they are always
equal), and assumes that on a scan of the index on a it will find
enough rows matching the predicate soon enough to make that cheaper
than finding all matching rows and sorting them.

Try your test suite again with the only change being the insert
statement:

insert into bla ( a , b )
select a , floor(random() * 1000000) + 1
from generate_series( 1 , 1000000 ) as a ( a ) ;

On my machine, with that data, all of the queries run fast.

We've been looking at ways to develop statistics on multiple
columns, so that correlations like that don't confuse the
optimizer, and at ways to evaluate the "risk" of a query taking a
long time based on unexpected correlations. Not really a bug; more
like a recognized opportunity to improve the optimizer.

-Kevin
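For background: the per-column statistics the planner does keep are visible in the pg_stats view, including each column's correlation with the table's physical row order, but nothing there links a to b, which is exactly the gap Kevin describes. A minimal sketch of how to inspect them, using the bla table from the report:

-- Refresh the per-column statistics, then look at what the
-- planner knows. There is a per-column correlation with physical
-- row order, but no statistic relating a to b.
analyze bla ;

select attname , n_distinct , correlation
from pg_stats
where tablename = 'bla' ;

(Long after this thread, PostgreSQL 10 added CREATE STATISTICS for declaring exactly this kind of multi-column dependency.)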
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote: <> psql@elbrief.de> wrote: > >> insert into bla ( a , b ) >> select a , a >> from generate_series( 1 , 1000000 ) as a ( a ) ; > >> explain analyze select * from bla >> where b > 990000 order by a limit 10 ; >> [uses index on b and has a long run time] > > The problem is that PostgreSQL doesn't have any sense of the > correlation between columns a and b (i.e., they are always equal) A belated thought on this -- you went right from your load to running queries without leaving any time for autovacuum to kick in and generate statistics. I recommend that somewhere after you insert the data and before you run your selects, you run: VACUUM ANALYZE bla; This will get you to something which more closely resembles a "steady state" environment. -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote: > insert into bla ( a , b ) > select a , floor(random() * 1000000) 1 > from generate_series( 1 , 1000000 ) as a ( a ) ; > > On my machine, with that data, all of the queries run fast. Yes, this runs by me fast too. But here is no relation between a and b. By my data psql must scan mostly all data to find the rows. This is only an example. In my live environment i have a table with a boolean where the boolean is usually true. The boolean is false on new or changed entrys and if i select the false-rows order by primary key i get slow querys. The table has a lot of million rows and a very small amount of rows with false. The boolean indicates that this row has an entry in a second table so i can select the rows without a join which is expensive too. BTW: it would be nice to have an index wich works on select * from a left join b on b.id = a.id where b.id is null. explain select * from bla where b > 990000 order by a limit 10 ; QUERY PLAN ------------------------------------------------------------------------------- Limit (cost=0.00..30.75 rows=10 width=8) -> Index Scan using bla_a on bla (cost=0.00..30747.29 rows=10000 width=8) Filter: (b > 990000) drop index bla_a ; explain select * from bla where b > 990000 order by a limit 10 ; QUERY PLAN ----------------------------------------------------------------------------------- Limit (cost=633.50..633.52 rows=10 width=8) -> Sort (cost=633.50..658.50 rows=10000 width=8) Sort Key: a -> Index Scan using bla_b on bla (cost=0.00..417.40 rows=10000 width=8) Index Cond: (b > 990000) The first explain reduce by the limit the cost by the faktor 10/10000. This is good for data with no relation between a and b. But in my example it is about 1000 times higher. BTW: in the first explain this is not an Index Scan, it is an sequential scan on an index. It would be nice to show this in the explain. Perhaps it would be good to reduce the cost for the limit on an sequential scan on an index not linear. Best regards, Andreas
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote: > insert into bla ( a , b ) > select a , floor(random() * 1000000) 1 > from generate_series( 1 , 1000000 ) as a ( a ) ; > > On my machine, with that data, all of the queries run fast. Yes, this runs by me fast too. But here is no relation between a and b. By my data psql must scan mostly all data to find the rows. This is only an example. In my live environment i have a table with a boolean where the boolean is usually true. The boolean is false on new or changed entrys and if i select the false-rows order by primary key i get slow querys. The table has a lot of million rows and a very small amount of rows with false. The boolean indicates that this row has an entry in a second table so i can select the rows without a join which is expensive too. BTW: it would be nice to have an index wich works on select * from a left join b on b.id = a.id where b.id is null. explain select * from bla where b > 990000 order by a limit 10 ; QUERY PLAN ------------------------------------------------------------------------------- Limit (cost=0.00..30.75 rows=10 width=8) -> Index Scan using bla_a on bla (cost=0.00..30747.29 rows=10000 width=8) Filter: (b > 990000) drop index bla_a ; explain select * from bla where b > 990000 order by a limit 10 ; QUERY PLAN ----------------------------------------------------------------------------------- Limit (cost=633.50..633.52 rows=10 width=8) -> Sort (cost=633.50..658.50 rows=10000 width=8) Sort Key: a -> Index Scan using bla_b on bla (cost=0.00..417.40 rows=10000 width=8) Index Cond: (b > 990000) The first explain reduce by the limit the cost by the faktor 10/10000. This is good for data with no relation between a and b. But in my example it is about 1000 times higher. BTW: in the first explain this is not an Index Scan, it is an sequential scan on an index. It would be nice to show this in the explain. Perhaps it would be good to reduce the cost for the limit on an sequential scan on an index not linear. Best regards, Andreas
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote: > insert into bla ( a , b ) > select a , floor(random() * 1000000) 1 > from generate_series( 1 , 1000000 ) as a ( a ) ; > > On my machine, with that data, all of the queries run fast. Yes, this runs by me fast too. But here is no relation between a and b. By my data psql must scan mostly all data to find the rows. This is only an example. In my live environment i have a table with a boolean where the boolean is usually true. The boolean is false on new or changed entrys and if i select the false-rows order by primary key i get slow querys. The table has a lot of million rows and a very small amount of rows with false. The boolean indicates that this row has an entry in a second table so i can select the rows without a join which is expensive too. BTW: it would be nice to have an index wich works on select * from a left join b on b.id = a.id where b.id is null. explain select * from bla where b > 990000 order by a limit 10 ; QUERY PLAN ------------------------------------------------------------------------------- Limit (cost=0.00..30.75 rows=10 width=8) -> Index Scan using bla_a on bla (cost=0.00..30747.29 rows=10000 width=8) Filter: (b > 990000) drop index bla_a ; explain select * from bla where b > 990000 order by a limit 10 ; QUERY PLAN ----------------------------------------------------------------------------------- Limit (cost=633.50..633.52 rows=10 width=8) -> Sort (cost=633.50..658.50 rows=10000 width=8) Sort Key: a -> Index Scan using bla_b on bla (cost=0.00..417.40 rows=10000 width=8) Index Cond: (b > 990000) The first explain reduce by the limit the cost by the faktor 10/10000. This is good for data with no relation between a and b. But in my example it is about 1000 times higher. BTW: in the first explain this is not an Index Scan, it is an sequential scan on an index. It would be nice to show this in the explain. Perhaps it would be good to reduce the cost for the limit on an sequential scan on an index not linear. Best regards, Andreas
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote: > insert into bla ( a , b ) > select a , floor(random() * 1000000) 1 > from generate_series( 1 , 1000000 ) as a ( a ) ; > > On my machine, with that data, all of the queries run fast. Yes, this runs by me fast too. But here is no relation between a and b. By my data psql must scan mostly all data to find the rows. This is only an example. In my live environment i have a table with a boolean where the boolean is usually true. The boolean is false on new or changed entrys and if i select the false-rows order by primary key i get slow querys. The table has a lot of million rows and a very small amount of rows with false. The boolean indicates that this row has an entry in a second table so i can select the rows without a join which is expensive too. BTW: it would be nice to have an index wich works on select * from a left join b on b.id = a.id where b.id is null. explain select * from bla where b > 990000 order by a limit 10 ; QUERY PLAN ------------------------------------------------------------------------------- Limit (cost=0.00..30.75 rows=10 width=8) -> Index Scan using bla_a on bla (cost=0.00..30747.29 rows=10000 width=8) Filter: (b > 990000) drop index bla_a ; explain select * from bla where b > 990000 order by a limit 10 ; QUERY PLAN ----------------------------------------------------------------------------------- Limit (cost=633.50..633.52 rows=10 width=8) -> Sort (cost=633.50..658.50 rows=10000 width=8) Sort Key: a -> Index Scan using bla_b on bla (cost=0.00..417.40 rows=10000 width=8) Index Cond: (b > 990000) The first explain reduce by the limit the cost by the faktor 10/10000. This is good for data with no relation between a and b. But in my example it is about 1000 times higher. BTW: in the first explain this is not an Index Scan, it is an sequential scan on an index. It would be nice to show this in the explain. Perhaps it would be good to reduce the cost for the limit on an sequential scan on an index not linear. Best regards, Andreas
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote: > insert into bla ( a , b ) > select a , floor(random() * 1000000) 1 > from generate_series( 1 , 1000000 ) as a ( a ) ; > > On my machine, with that data, all of the queries run fast. Yes, this runs by me fast too. But here is no relation between a and b. By my data psql must scan mostly all data to find the rows. This is only an example. In my live environment i have a table with a boolean where the boolean is usually true. The boolean is false on new or changed entrys and if i select the false-rows order by primary key i get slow querys. The table has a lot of million rows and a very small amount of rows with false. The boolean indicates that this row has an entry in a second table so i can select the rows without a join which is expensive too. BTW: it would be nice to have an index wich works on select * from a left join b on b.id = a.id where b.id is null. explain select * from bla where b > 990000 order by a limit 10 ; QUERY PLAN ------------------------------------------------------------------------------- Limit (cost=0.00..30.75 rows=10 width=8) -> Index Scan using bla_a on bla (cost=0.00..30747.29 rows=10000 width=8) Filter: (b > 990000) drop index bla_a ; explain select * from bla where b > 990000 order by a limit 10 ; QUERY PLAN ----------------------------------------------------------------------------------- Limit (cost=633.50..633.52 rows=10 width=8) -> Sort (cost=633.50..658.50 rows=10000 width=8) Sort Key: a -> Index Scan using bla_b on bla (cost=0.00..417.40 rows=10000 width=8) Index Cond: (b > 990000) The first explain reduce by the limit the cost by the faktor 10/10000. This is good for data with no relation between a and b. But in my example it is about 1000 times higher. BTW: in the first explain this is not an Index Scan, it is an sequential scan on an index. It would be nice to show this in the explain. Perhaps it would be good to reduce the cost for the limit on an sequential scan on an index not linear. Best regards, Andreas
<psql@elbrief.de> wrote:

> In my live environment I have a table with a boolean column that
> is usually true. The boolean is false on new or changed entries,
> and if I select the false rows ordered by the primary key I get
> slow queries. The table has many millions of rows and a very
> small number of rows with false.

That sure sounds like a situation where you should use a partial
index.

http://www.postgresql.org/docs/current/interactive/indexes-partial.html

-Kevin
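A minimal sketch of that suggestion, with hypothetical names standing in for Andreas's live table (a flag that is almost always true, queried in primary-key order):

-- Hypothetical table resembling the one described: millions of
-- rows, flag is false only for new or changed entries.
create table items (
    id   bigserial primary key ,
    flag boolean not null default false
) ;

-- The partial index stays tiny because it covers only the rare
-- false rows, and its key order (id) matches the ORDER BY, so the
-- query needs no sort.
create index items_pending on items ( id ) where not flag ;

select * from items
where not flag
order by id
limit 10 ;

The planner matches the query's WHERE NOT flag against the index predicate, so it can read the first ten rows straight off items_pending instead of walking the primary key and filtering.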