Thread: BUG #7495: chosen wrong index

BUG #7495: chosen wrong index

From
psql@elbrief.de
Date:
The following bug has been logged on the website:

Bug reference:      7495
Logged by:          Andreas
Email address:      psql@elbrief.de
PostgreSQL version: 9.1.4
Operating system:   Debian Linux
Description:


Hello.

create table bla ( a int , b int ) ;

insert into bla ( a , b ) select a , a from generate_series( 1 , 1000000 )
as a ( a ) ;

create index bla_a on bla ( a ) ;

create index bla_b on bla ( b ) ;

explain analyze select * from bla where b > 990000 limit 10 ;
                                                       QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.27 rows=10 width=8) (actual time=0.150..0.173 rows=10 loops=1)
   ->  Index Scan using bla_b on bla  (cost=0.00..265.29 rows=10000 width=8) (actual time=0.147..0.159 rows=10 loops=1)
         Index Cond: (b > 990000)
 Total runtime: 0.226 ms

 explain analyze select * from bla where b > 990000 order by a limit 10 ;
                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..26.32 rows=10 width=8) (actual time=991.096..991.113 rows=10 loops=1)
   ->  Index Scan using bla_a on bla  (cost=0.00..26322.29 rows=10000 width=8) (actual time=991.093..991.103 rows=10 loops=1)
         Filter: (b > 990000)
 Total runtime: 991.164 ms

explain analyze select * from ( select * from bla where b > 990000 union
select * from bla where b < 0 ) a order by a limit 10 ;
                                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=835.76..835.78 rows=10 width=8) (actual time=51.551..51.571 rows=10 loops=1)
   ->  Sort  (cost=835.76..860.76 rows=10001 width=8) (actual time=51.547..51.548 rows=10 loops=1)
         Sort Key: wasnoch.bla.a
         Sort Method: top-N heapsort  Memory: 17kB
         ->  HashAggregate  (cost=419.62..519.63 rows=10001 width=8) (actual time=32.061..42.544 rows=10000 loops=1)
               ->  Append  (cost=0.00..369.62 rows=10001 width=8) (actual time=0.037..19.857 rows=10000 loops=1)
                     ->  Index Scan using bla_b on bla  (cost=0.00..265.29 rows=10000 width=8) (actual time=0.035..11.538 rows=10000 loops=1)
                           Index Cond: (b > 990000)
                     ->  Index Scan using bla_b on bla  (cost=0.00..4.31 rows=1 width=8) (actual time=0.012..0.012 rows=0 loops=1)
                           Index Cond: (b < 0)
 Total runtime: 51.678 ms

seq_page_cost = 1.0
random_page_cost = 20.0
restart server

explain analyze select * from bla where b > 997400 order by a limit 10 ;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=253.37..253.40 rows=10 width=8) (actual time=3.642..3.653 rows=10 loops=1)
   ->  Sort  (cost=253.37..259.87 rows=2600 width=8) (actual time=3.639..3.643 rows=10 loops=1)
         Sort Key: a
         Sort Method: top-N heapsort  Memory: 17kB
         ->  Index Scan using bla_b on bla  (cost=0.00..197.19 rows=2600 width=8) (actual time=0.041..2.155 rows=2600 loops=1)
               Index Cond: (b > 997400)
 Total runtime: 3.698 ms

seq_page_cost = 1.0
random_page_cost = 2.0
restart server

explain analyze select * from bla where b > 997400 order by a limit 10 ;
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..101.24 rows=10 width=8) (actual time=726.649..726.667 rows=10 loops=1)
   ->  Index Scan using bla_a on bla  (cost=0.00..26322.29 rows=2600 width=8) (actual time=726.642..726.652 rows=10 loops=1)
         Filter: (b > 997400)
 Total runtime: 726.731 ms

explain analyze select * from bla where b > 997699 order by a limit 10 ;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=114.29..114.31 rows=10 width=8) (actual time=4.009..4.020 rows=10 loops=1)
   ->  Sort  (cost=114.29..120.04 rows=2301 width=8) (actual time=4.007..4.011 rows=10 loops=1)
         Sort Key: a
         Sort Method: top-N heapsort  Memory: 17kB
         ->  Index Scan using bla_b on bla  (cost=0.00..64.56 rows=2301 width=8) (actual time=0.068..2.448 rows=2301 loops=1)
               Index Cond: (b > 997699)
 Total runtime: 4.073 ms

I have also experimented with cpu_tuple_cost, cpu_index_tuple_cost
and cpu_operator_cost, but I have not found a setting that makes the
planner choose index bla_b below b > 996000, even though up to about
b > 900000 it is faster to use bla_b instead of bla_a.
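(For anyone reproducing this: these planner cost settings can also be
changed per session with SET, so a server restart is not required.
The values below are arbitrary illustrations, not recommendations:

set cpu_tuple_cost = 0.02 ;        -- default 0.01
set cpu_index_tuple_cost = 0.01 ;  -- default 0.005
set cpu_operator_cost = 0.005 ;    -- default 0.0025
explain analyze select * from bla where b > 996000 order by a limit 10 ;
)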

I think the planner estimates the costs incorrectly.

best regards,
Andreas

Re: BUG #7495: chosen wrong index

From
"Kevin Grittner"
Date:
<psql@elbrief.de> wrote:

> insert into bla ( a , b )
>   select a , a
>     from generate_series( 1 , 1000000 ) as a ( a ) ;

> explain analyze select * from bla
>   where b > 990000 order by a limit 10 ;
> [uses index on b and has a long run time]

The problem is that PostgreSQL doesn't have any sense of the
correlation between columns a and b (i.e., they are always equal)
and assumes that it will find enough matching rows soon enough on
the scan of the index on b to make it cheaper than sorting the
results of finding all rows that match the predicate.  Try your test
suite again with the only change being the insert statement:

insert into bla ( a , b )
  select a , floor(random() * 1000000) + 1
  from generate_series( 1 , 1000000 ) as a ( a ) ;

On my machine, with that data, all of the queries run fast.

We've been looking at ways to develop statistics on multiple
columns, so that correlations like that don't confuse the optimizer,
and at ways to evaluate the "risk" of a query taking a long time
because of unexpected correlations.

Not really a bug; more like a recognized opportunity to improve the
optimizer.

-Kevin

Re: BUG #7495: chosen wrong index

From
"Kevin Grittner"
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:
> <psql@elbrief.de> wrote:
>
>> insert into bla ( a , b )
>>   select a , a
>>     from generate_series( 1 , 1000000 ) as a ( a ) ;
>
>> explain analyze select * from bla
>>   where b > 990000 order by a limit 10 ;
>> [uses index on b and has a long run time]
>
> The problem is that PostgreSQL doesn't have any sense of the
> correlation between columns a and b (i.e., they are always equal)

A belated thought on this -- you went right from your load to
running queries without leaving any time for autovacuum to kick in
and generate statistics.  I recommend that somewhere after you
insert the data and before you run your selects, you run:

VACUUM ANALYZE bla;

This will get you to something which more closely resembles a
"steady state" environment.

-Kevin

Re-2: BUG #7495: chosen wrong index

From
psql@elbrief.de
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:
> insert into bla ( a , b )
>   select a , floor(random() * 1000000) + 1
>   from generate_series( 1 , 1000000 ) as a ( a ) ;
>
> On my machine, with that data, all of the queries run fast.

Yes, this runs fast for me too. But there is no correlation
between a and b in that data. With my data, PostgreSQL must scan
almost all of the rows to find the matches.

This is only an example. In my live environment I have a table with
a boolean column that is usually true. The boolean is false for new
or changed entries, and if I select the false rows ordered by the
primary key I get slow queries. The table has many millions of rows
and only a very small number of rows where the flag is false. The
boolean indicates that the row has an entry in a second table, so I
can select those rows without a join, which would be expensive as well.

BTW: it would be nice to have an index which works for
select * from a left join b on b.id = a.id where b.id is null.

explain select * from bla where b > 990000 order by a limit 10 ;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Limit  (cost=0.00..30.75 rows=10 width=8)
   ->  Index Scan using bla_a on bla  (cost=0.00..30747.29 rows=10000 width=8)
         Filter: (b > 990000)

drop index bla_a ;

explain select * from bla where b > 990000 order by a limit 10 ;
                                    QUERY PLAN
-----------------------------------------------------------------------------------
 Limit  (cost=633.50..633.52 rows=10 width=8)
   ->  Sort  (cost=633.50..658.50 rows=10000 width=8)
         Sort Key: a
         ->  Index Scan using bla_b on bla  (cost=0.00..417.40 rows=10000 width=8)
               Index Cond: (b > 990000)

In the first explain, the limit reduces the estimated cost by the
factor 10/10000. That is fine for data with no correlation between a
and b, but in my example the real cost is about 1000 times higher.

BTW: in the first explain this is not really an index scan, it is a
sequential scan over an index. It would be nice if explain showed
this.

Perhaps it would be good if the limit reduced the cost of a
sequential scan over an index in a non-linear way.


Best regards,
Andreas

Re: Re-2: BUG #7495: chosen wrong index

From
"Kevin Grittner"
Date:
<psql@elbrief.de> wrote:

> In my live environment I have a table with a boolean column that is
> usually true. The boolean is false for new or changed entries, and
> if I select the false rows ordered by the primary key I get slow
> queries. The table has many millions of rows and only a very small
> number of rows where the flag is false.

That sure sounds like a situation where you should use a partial
index.

http://www.postgresql.org/docs/current/interactive/indexes-partial.html
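
A minimal sketch of what that could look like (the table and column
names below are made up, since the real schema wasn't posted):

create index mytable_unprocessed_idx on mytable ( id ) where flag = false ;

-- a query such as
--   select * from mytable where flag = false order by id limit 10 ;
-- can then read the first matching rows directly from this small
-- partial index instead of scanning the whole primary key index.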

-Kevin