Thread: BUG #11441: Weird (and seems wrong) behavior of partial indexes with order by/limit

BUG #11441: Weird (and seems wrong) behavior of partial indexes with order by/limit

From
maxim.boguk@postgresql-consulting.com
Date:
The following bug has been logged on the website:

Bug reference:      11441
Logged by:          Maksym Boguk
Email address:      maxim.boguk@postgresql-consulting.com
PostgreSQL version: 9.2.9
Operating system:   Linux
Description:

Today I found very strange behavior of partial indexes with correlating
order by/limit.

Simple test case look like:

(postgres@[local]:5432)=# \d qqq_test
      Table "public.qqq_test"
   Column    |  Type   | Modifiers
-------------+---------+-----------
 resume_id   | integer |
 is_finished | integer |

10M rows, vacuumed and analyzed

1.Initial query (no problem found):
select * from qqq_test where is_finished=0 order by resume_id limit 1;
with index:
create index qqq_test_1_key on qqq_test(resume_id, is_finished) where
is_finished=0;
there are no issue:
(postgres@[local]:5432)=# explain analyze select * from qqq_test where
is_finished=0 order by resume_id limit 1;
                                                                 QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.32 rows=1 width=8) (actual time=0.062..0.062 rows=1
loops=1)
   ->  Index Only Scan using qqq_test_1_key on qqq_test
(cost=0.00..44145.36 rows=138808 width=8) (actual time=0.062..0.062 rows=1
loops=1)
         Index Cond: (is_finished = 0)
 Total runtime: 0.035 ms
(index only scan as expected).

2.Now a bit more complicated case:
select * from qqq_test where is_finished = ANY (ARRAY[0, 5]) order by
resume_id limit 1;
With index:
create index qqq_test_3_key on qqq_test(resume_id) where (is_finished = ANY
(ARRAY[0, 5]));
also no issue:
(postgres@[local]:5432)=# explain analyze select * from qqq_test where
is_finished = ANY (ARRAY[0, 5]) order by resume_id limit 1;
                                                               QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.31 rows=1 width=8) (actual time=0.035..0.035 rows=1
loops=1)
   ->  Index Scan using qqq_test_3_key on qqq_test  (cost=0.00..54166.69
rows=173510 width=8) (actual time=0.034..0.034 rows=1 loops=1)
 Total runtime: 0.049 ms
(index scan as expected).

3.Now it would be nice to have index only scan for the second query (lets
add is_finished to index description to let IOS be used), and there problem
begin:
drop index qqq_test_3_key;
create index qqq_test_2_key on qqq_test(resume_id, is_finished) where
(is_finished = ANY (ARRAY[0, 5]));
And now oops:
(postgres@[local]:5432)=# explain analyze select * from qqq_test where
is_finished = ANY (ARRAY[0, 5]) order by resume_id limit 1;
                                                                     QUERY
PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=38740.23..38740.24 rows=1 width=8) (actual time=54.251..54.251
rows=1 loops=1)
   ->  Sort  (cost=38740.23..40359.65 rows=161942 width=8) (actual
time=54.251..54.251 rows=1 loops=1)
         Sort Key: resume_id
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Index Only Scan using qqq_test_2_key on qqq_test
(cost=0.00..35501.39 rows=161942 width=8) (actual time=0.030..33.449
rows=145278 loops=1)
               Index Cond: (is_finished = ANY ('{0,5}'::integer[]))
               Heap Fetches: 0
 Total runtime: 54.282 ms


Is there any reason why index scan/index only scan could not be used for
this query instead of slow sort+limit?
It's bug or planner/executor limitation (set enable_sort to 0 - have no
effect)?
Query/index pair looks like pretty suitable for simple IOS without
sort+limit.

Now I cannot think any way to perform such query using IOS because with no
is_finished column in index there will be no IOS, and if I add is_finished
to the index there no index scan but slow order by+limit plan.

Kind Regards,
Maksym
Some update now with full reproducible test case (and some surprising
results):

Test case initialization:

drop table if exists test;
create table test as (select g.i as id, (random()*100)::integer as
is_finished from generate_series(1,1000000) as g(i));
create index test_1_key on test(id, is_finished) where is_finished =3D ANY
(ARRAY[0, 5]);
vacuum analyze test;

Good (but not expected in that case) plan:

explain analyze select * from test where is_finished=3D0 or is_finished=3D5
order by id limit 1;
                                                            QUERY PLAN
---------------------------------------------------------------------------=
--------------------------------------------------------
 Limit  (cost=3D0.00..0.24 rows=3D1 width=3D8) (actual time=3D0.052..0.052 =
rows=3D1
loops=3D1)
   ->  Index Only Scan using test_1_key on test  (cost=3D0.00..4493.05
rows=3D18921 width=3D8) (actual time=3D0.052..0.052 rows=3D1 loops=3D1)
         Heap Fetches: 1
 Total runtime: 0.066 ms
(i'm very surprised than the PostgreSQL managed deduct is_finished =3D ANY
(ARRAY[0, 5]) from (is_finished=3D0 or is_finished=3D5))



Bad plan (techically the same query and even better suitable for the
partial index and should have the same plan but no luck):

explain analyze select * from test where is_finished IN (0,5) order by id
limit 1;
                                                                  QUERY PLA=
N
---------------------------------------------------------------------------=
-------------------------------------------------------------------
 Limit  (cost=3D4809.18..4809.19 rows=3D1 width=3D8) (actual time=3D15.410.=
.15.410
rows=3D1 loops=3D1)
   ->  Sort  (cost=3D4809.18..4999.44 rows=3D19026 width=3D8) (actual
time=3D15.408..15.408 rows=3D1 loops=3D1)
         Sort Key: id
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Index Only Scan using test_1_key on test  (cost=3D0.00..4428.6=
6
rows=3D19026 width=3D8) (actual time=3D0.051..12.277 rows=3D15222 loops=3D1=
)
               Index Cond: (is_finished =3D ANY ('{0,5}'::integer[]))
               Heap Fetches: 15222
 Total runtime: 15.469 ms


--=20
Maxim Boguk
Senior Postgresql DBA
http://www.postgresql-consulting.ru/ <http://www.postgresql-consulting.com/=
>

Phone RU: +7 910 405 4718
Phone AU: +61 45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk
Jabber: maxim.boguk@gmail.com
=D0=9C=D0=BE=D0=B9=D0=9A=D1=80=D1=83=D0=B3: http://mboguk.moikrug.ru/

"People problems are solved with people.
If people cannot solve the problem, try technology.
People will then wish they'd listened at the first stage."

Re: BUG #11441: Weird (and seems wrong) behavior of partial indexes with order by/limit

From
Kyotaro HORIGUCHI
Date:
Hello, I think this is a behavior as desinged.

An index scans having scalar array operator on a non-first index
column are considered as 'not ordered'. This is because scalar
array operator on non-first index columns breaks ordering of the
result tuples.

Btree is the the only index capable of accepting scalar-array
operators so far.

> =# select amname from pg_am where amsearcharray;
>  amname
> --------
>  btree
> (1 row)

If my understnding is correct, it repeats scanning the index
using non-array restrictions for every array element, or every
possible combination of elements of multiple scalar arrays, so
the index-order generally won't be preserved in the result
tuples.

The one obvious exception is the case of the scalar-array
operation on the first index column. The value in the array is
sorted before the iterations mentioned above, so the the planner
can determine it to be ordered *only* for this case.

The result could be ordered if the all restrictions on all index
columns before scalar-array-op column are equal conditions, but
the case is judged to be abandoned from the viewpoint of cost and
modularitly.


Therefore, the planner eliminates the sort for the following
example, even though no meaning in itself.

create table test as (select g.i as id, (random()*100)::integer as
is_finished from generate_series(1,1000000) as g(i));
create index test_2_key on test(is_finished, id) where is_finished = ANY
(ARRAY[0, 5]);
vacuum analyze test;

explain (costs off) select * from test where is_finished IN (0,5) order by is_finished, id limit 1;

                          QUERY PLAN
--------------------------------------------------------------
 Limit
   ->  Index Only Scan using test_2_key on test
         Index Cond: (is_finished = ANY ('{0,5}'::integer[]))


regards,

At Thu, 18 Sep 2014 21:34:07 +1000, Maxim Boguk <maxim.boguk@gmail.com> wrote in
<CAK-MWwS2=8iE=BV-vPORd-+PL76HZsgC9PVzydkAUgnXXntkyQ@mail.gmail.com>
> Some update now with full reproducible test case (and some surprising
> results):
>
> Test case initialization:
>
> drop table if exists test;
> create table test as (select g.i as id, (random()*100)::integer as
> is_finished from generate_series(1,1000000) as g(i));
> create index test_1_key on test(id, is_finished) where is_finished = ANY
> (ARRAY[0, 5]);
> vacuum analyze test;
>
> Good (but not expected in that case) plan:
>
> explain analyze select * from test where is_finished=0 or is_finished=5
> order by id limit 1;
>                                                             QUERY PLAN
>
-----------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=0.00..0.24 rows=1 width=8) (actual time=0.052..0.052 rows=1
> loops=1)
>    ->  Index Only Scan using test_1_key on test  (cost=0.00..4493.05
> rows=18921 width=8) (actual time=0.052..0.052 rows=1 loops=1)
>          Heap Fetches: 1
>  Total runtime: 0.066 ms
> (i'm very surprised than the PostgreSQL managed deduct is_finished = ANY
> (ARRAY[0, 5]) from (is_finished=0 or is_finished=5))
>
>
>
> Bad plan (techically the same query and even better suitable for the
> partial index and should have the same plan but no luck):
>
> explain analyze select * from test where is_finished IN (0,5) order by id
> limit 1;
>                                                                   QUERY PLAN
>
----------------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=4809.18..4809.19 rows=1 width=8) (actual time=15.410..15.410
> rows=1 loops=1)
>    ->  Sort  (cost=4809.18..4999.44 rows=19026 width=8) (actual
> time=15.408..15.408 rows=1 loops=1)
>          Sort Key: id
>          Sort Method: top-N heapsort  Memory: 25kB
>          ->  Index Only Scan using test_1_key on test  (cost=0.00..4428.66
> rows=19026 width=8) (actual time=0.051..12.277 rows=15222 loops=1)
>                Index Cond: (is_finished = ANY ('{0,5}'::integer[]))
>                Heap Fetches: 15222
>  Total runtime: 15.469 ms

--
Kyotaro Horiguchi
NTT Open Source Software Center
>
> If my understnding is correct, it repeats scanning the index
> using non-array restrictions for every array element, or every
> possible combination of elements of multiple scalar arrays, so
> the index-order generally won't be preserved in the result
> tuples.
>
> The one obvious exception is the case of the scalar-array
> operation on the first index column. The value in the array is
> sorted before the iterations mentioned above, so the the planner
> can determine it to be ordered *only* for this case.
>
> The result could be ordered if the all restrictions on all index
> columns before scalar-array-op column are equal conditions, but
> the case is judged to be abandoned from the viewpoint of cost and
> modularitly.
>
>
> Therefore, the planner eliminates the sort for the following
> example, even though no meaning in itself.
>
> create table test as (select g.i as id, (random()*100)::integer as
> is_finished from generate_series(1,1000000) as g(i));
> create index test_2_key on test(is_finished, id) where is_finished = ANY
> (ARRAY[0, 5]);
> vacuum analyze test;
>
> explain (costs off) select * from test where is_finished IN (0,5) order by
> is_finished, id limit 1;
>
>                           QUERY PLAN
> --------------------------------------------------------------
>  Limit
>    ->  Index Only Scan using test_2_key on test
>          Index Cond: (is_finished = ANY ('{0,5}'::integer[]))
>
>
>
Hi,

But why index scan working for completely equivalent query with OR
condition than?

explain analyze select * from test where is_finished=0 or is_finished=5
order by id limit 1;

                              QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.24 rows=1 width=8) (actual time=0.052..0.052 rows=1
loops=1)
   ->  Index Only Scan using test_1_key on test  (cost=0.00..4493.05
rows=18921 width=8) (actual time=0.052..0.052 rows=1 loops=1)


"is_finished = ANY ('{0,5}'::integer[])" is equivalent to  " is_finished=0
or is_finished=5"
what's more planner aware about it for sure (or it will not be able use
conditional index with "where is_finished = ANY (ARRAY[0, 5]);" for the OR
query).


Kind Regards,
Maksym

Re: BUG #11441: Weird (and seems wrong) behavior of partial indexes with order by/limit

From
David G Johnston
Date:
Kyotaro HORIGUCHI-2 wrote
> Hello, I think this is a behavior as desinged.

It may not technically be a bug but it definitely could use some TLC from
Tom Lane.

I'm curious, but too busy to check myself, how these three queries would
perform with the index columns placed in reverse order.  Then create a
summary post with the 3 queries x 2 indexes, and the plan summaries (not the
entire explain) would let someone quickly see the end result without sifting
through 3 posts and lots of explanation.

Also, can you test on 9.3 and/or 9.4?  Someone knowledgable may know
outright but otherwise this may have been recently improved and, because it
is not really a bug, not back-patched.  Release notes should indicate this
but testing is more accurate.

David J.




--
View this message in context:
http://postgresql.1045698.n5.nabble.com/BUG-11441-Weird-and-seems-wrong-behavior-of-partial-indexes-with-order-by-limit-tp5819400p5819658.html
Sent from the PostgreSQL - bugs mailing list archive at Nabble.com.
On Sat, Sep 20, 2014 at 12:05 AM, David G Johnston <
david.g.johnston@gmail.com> wrote:

> Kyotaro HORIGUCHI-2 wrote
> > Hello, I think this is a behavior as desinged.
>
> It may not technically be a bug but it definitely could use some TLC from
> Tom Lane.
>
> I'm curious, but too busy to check myself, how these three queries would
> perform with the index columns placed in reverse order.  Then create a
> summary post with the 3 queries x 2 indexes, and the plan summaries (not
> the
> entire explain) would let someone quickly see the end result without
> sifting
> through 3 posts and lots of explanation.
>
> Also, can you test on 9.3 and/or 9.4?  Someone knowledgable may know
> outright but otherwise this may have been recently improved and, because it
> is not really a bug, not back-patched.  Release notes should indicate this
> but testing is more accurate.
>

Hi David,
full test sequence as you requested:

drop table if exists test;
create table test as (select g.i as id, (random()*100)::integer as
is_finished from generate_series(1,1000000) as g(i));
create index test_1_key on test(id, is_finished) where is_finished = ANY
(ARRAY[0, 5]);
vacuum analyze test;
--q1
explain analyze select * from test where is_finished=0 or is_finished=5
order by id limit 1;
--q2
explain analyze select * from test where is_finished=0 or is_finished=5
order by is_finished,id limit 1;
--q3
explain analyze select * from test where is_finished IN (0,5) order by id
limit 1;
--q4
explain analyze select * from test where is_finished IN (0,5) order by
is_finished,id limit 1;

drop index test_1_key;
create index test_2_key on test(is_finished, id) where is_finished = ANY
(ARRAY[0, 5]);
vacuum analyze test;
--q1
explain analyze select * from test where is_finished=0 or is_finished=5
order by id limit 1;
--q2
explain analyze select * from test where is_finished=0 or is_finished=5
order by is_finished,id limit 1;
--q3
explain analyze select * from test where is_finished IN (0,5) order by id
limit 1;
--q4
explain analyze select * from test where is_finished IN (0,5) order by
is_finished,id limit 1;


The tests had been performed on versions 9.2.9, 9.3.5 and 9.4beta2, all
produced the same plans for every version.

Result table:
             test_1_key                test_2_key
q1           IOS+limit                   IOS+sort/limit
q2           IOS+sort/limit            IOS+limit
q3           IOS+sort/limit            IOS+sort/limit
q4           IOS+sort/limit            IOS+limit

Everything work as expected except the problem case test_1_key + q3:
(
create index test_1_key on test(id, is_finished) where is_finished = ANY
(ARRAY[0, 5]);
and
select * from test where is_finished IN (0,5) order by id limit 1;
).

PS: Results of q1+test_1_key verified and found to be correct so there are
no problem in using IOS+limit plan for query.

Kind Regards,
Maksym
maxim.boguk@postgresql-consulting.com writes:
> create index qqq_test_2_key on qqq_test(resume_id, is_finished) where
> (is_finished = ANY (ARRAY[0, 5]));

> (postgres@[local]:5432)=# explain analyze select * from qqq_test where
> is_finished = ANY (ARRAY[0, 5]) order by resume_id limit 1;
> [ doesn't use the index ]

The reason why not is that it starts by generating a path that uses the
is_finished = ANY() clause as an indexqual, and decides that such a
path will not produce data that's ordered by resume_id.  Which is correct:
it won't, because there will first be a scan to find the is_finished = 0
data and then another scan to find the is_finished = 5 data.

Now in point of fact, we don't need to use that clause as an indexqual
because it's implied by the index predicate.  However, indxpath.c has
never tested for such cases and I'm a bit hesitant to add the cycles that
would be required to do so.  This sort of case doesn't really seem
compelling enough to justify slowing down planning for *every* query on
tables having partial indexes, which would be the likely outcome.  If it
were compelling, we'd have heard about it before ...

            regards, tom lane
On Sun, Sep 21, 2014 at 6:21 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> maxim.boguk@postgresql-consulting.com writes:
> > create index qqq_test_2_key on qqq_test(resume_id, is_finished) where
> > (is_finished = ANY (ARRAY[0, 5]));
>
> > (postgres@[local]:5432)=# explain analyze select * from qqq_test where
> > is_finished = ANY (ARRAY[0, 5]) order by resume_id limit 1;
> > [ doesn't use the index ]
>
> The reason why not is that it starts by generating a path that uses the
> is_finished = ANY() clause as an indexqual, and decides that such a
> path will not produce data that's ordered by resume_id.  Which is correct:
> it won't, because there will first be a scan to find the is_finished = 0
> data and then another scan to find the is_finished = 5 data.
>
> Now in point of fact, we don't need to use that clause as an indexqual
> because it's implied by the index predicate.  However, indxpath.c has
> never tested for such cases and I'm a bit hesitant to add the cycles that
> would be required to do so.  This sort of case doesn't really seem
> compelling enough to justify slowing down planning for *every* query on
> tables having partial indexes, which would be the likely outcome.  If it
> were compelling, we'd have heard about it before ...
>
>                         regards, tom lane
>

Hi Tom,

I'm very sorry but are you sure that your analysis fully correct?
I see two potential problem here:

1)I always thought that the index on (a,b) during index scan will provide
results ordered by 'a'  in any case and independently of which additional
conditions on the 'b' we have in query. I made a mistake on that assumption?
So index scan using  'qqq_test(resume_id, is_finished) where (is_finished =
ANY (ARRAY[0, 5]))'  index will be ordered by resume_id anyway?


2)How to planner found the same index pretty suitable for the equivalent
query:
select * from qqq_test where is_finished = 0 OR is_finished=5 order by
resume_id limit 1;
?
As I see the planner able to substitute 'is_finished = 0 OR is_finished=5'
with equivalent  'is_finished = ANY (ARRAY[0, 5])' (without such
substitution the partial index will be not used). So i assume the same
logic should be applicable, but PostgreSQL able use index only scan for
that case:
explain analyze select * from qqq_test where is_finished = 0 OR
is_finished=5 order by resume_id limit 1;;
                                                             QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.90 rows=1 width=12) (actual time=0.016..0.016 rows=0
loops=1)
   ->  Index Only Scan using qqq_test_2_key on qqq_test  (cost=0.00..1.81
rows=2 width=12) (actual time=0.016..0.016 rows=0 loops=1)
         Heap Fetches: 0
 Total runtime: 0.032 ms


Kind Regards,
Maksym