Thread: Can Postgres use an INDEX over an OR?

Can Postgres use an INDEX over an OR?

From

Robert James

Date:

19 July 2009, 20:03:41

Hi. I notice that when I do a WHERE x, Postgres uses an index, and when I do WHERE y, it does so as well, but when I do WHERE x OR y, it doesn't. Why is this so? And how can I shut this off?
select * from dict
where
word in (select substr('moon', 0, generate_series(3,length('moon')))) -- this is my X above
OR word like 'moon%' -- this is my Y above

Seq Scan on dict (cost=0.02..2775.66 rows=30422 width=24) (actual time=16.635..28.580 rows=8 loops=1)
Filter: ((hashed subplan) OR ((word)::text ~~ 'moon%'::text))
SubPlan
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.014..0.019 rows=2 loops=1)
Total runtime: 28.658 ms
(Using just X or Y alone uses the index, and completes in 0.150 ms)

Is this a bug?

PS Running "PostgreSQL 8.2.1 on i686-pc-mingw32, compiled by GCC gcc.exe (GCC) 3.4.2 (mingw-special)"

Re: Can Postgres use an INDEX over an OR?

From

Віталій Тимчишин

Date:

20 July 2009, 03:24:58

2009/7/20 Robert James <srobertjames@gmail.com>

Hi. I notice that when I do a WHERE x, Postgres uses an index, and when I do WHERE y, it does so as well, but when I do WHERE x OR y, it doesn't. Why is this so?

It's not clever enough.

And how can I shut this off?

Use UNION/UNION ALL if possible in your case.

Re: Can Postgres use an INDEX over an OR?

From

Chris

Date:

20 July 2009, 05:11:34

Віталій Тимчишин wrote:
>
>
> 2009/7/20 Robert James <srobertjames@gmail.com
> <mailto:srobertjames@gmail.com>>
>
>
>     Hi. I notice that when I do a WHERE x, Postgres uses an index, and
>     when I do WHERE y, it does so as well, but when I do WHERE x OR y,
>     it doesn't. Why is this so?
>
>
> It's not clever enough.

Of course it is.

I'm running 8.3.7.

create table t1(id int primary key);
insert into t1(id) select a from generate_series(1, 500000) as s(a);
analyze t1;

explain analyze select * from t1 where id=5000 or id=25937;
                                                       QUERY PLAN

----------------------------------------------------------------------------------------------------------------------
  Bitmap Heap Scan on t1  (cost=8.60..16.44 rows=2 width=4) (actual
time=0.077..0.083 rows=2 loops=1)
    Recheck Cond: ((id = 5000) OR (id = 25937))
    ->  BitmapOr  (cost=8.60..8.60 rows=2 width=0) (actual
time=0.063..0.063 rows=0 loops=1)
          ->  Bitmap Index Scan on t1_pkey  (cost=0.00..4.30 rows=1
width=0) (actual time=0.034..0.034 rows=1 loops=1)
                Index Cond: (id = 5000)
          ->  Bitmap Index Scan on t1_pkey  (cost=0.00..4.30 rows=1
width=0) (actual time=0.021..0.021 rows=1 loops=1)
                Index Cond: (id = 25937)
  Total runtime: 0.153 ms
(8 rows)

What Robert didn't post was his query, see

http://archives.postgresql.org/pgsql-general/2009-07/msg00767.php

which makes it a lot harder to 'optimize' since they aren't straight
forward conditions.

--
Postgresql & php tutorials
http://www.designmagick.com/

Re: Can Postgres use an INDEX over an OR?

From

Robert James

Date:

20 July 2009, 10:26:37

Query is:

select * from dict
where
word in (select substr('moon', 0, generate_series(3,length('moon')))) -- this is my X above
OR word like 'moon%' -- this is my Y above

dict is indexed on word

2009/7/20 Chris <dmagick@gmail.com>

2009/7/20 Robert James <srobertjames@gmail.com <mailto:srobertjames@gmail.com>>

Hi. I notice that when I do a WHERE x, Postgres uses an index, and
when I do WHERE y, it does so as well, but when I do WHERE x OR y,
it doesn't. Why is this so?

What Robert didn't post was his query, see

http://archives.postgresql.org/pgsql-general/2009-07/msg00767.php

which makes it a lot harder to 'optimize' since they aren't straight forward conditions.

Re: Can Postgres use an INDEX over an OR?

From

Віталій Тимчишин

Date:

20 July 2009, 12:06:16

20 липня 2009 р. 11:02 Chris <dmagick@gmail.com> написав:

Віталій Тимчишин wrote:

2009/7/20 Robert James <srobertjames@gmail.com <mailto:srobertjames@gmail.com>>

Hi. I notice that when I do a WHERE x, Postgres uses an index, and
when I do WHERE y, it does so as well, but when I do WHERE x OR y,
it doesn't. Why is this so?

It's not clever enough.

Of course it is.

For simple cases

I'm running 8.3.7.

create table t1(id int primary key);
insert into t1(id) select a from generate_series(1, 500000) as s(a);
analyze t1;

explain analyze select * from t1 where
id < 10000

"Index Scan using t1_pkey on t1 (cost=0.00..322.51 rows=9612 width=4) (actual time=0.030..3.700 rows=9999 loops=1)"
" Index Cond: (id < 10000)"
"Total runtime: 4.835 ms"

explain analyze select * from t1 where
id in (select (random() * 500000)::int4 from generate_series(0,10))

"Nested Loop (cost=32.50..1341.49 rows=200 width=4) (actual time=15.353..67.014 rows=11 loops=1)"
" -> HashAggregate (cost=32.50..34.50 rows=200 width=4) (actual time=0.028..0.043 rows=11 loops=1)"
"        -> Function Scan on generate_series (cost=0.00..20.00 rows=1000 width=0) (actual time=0.014..0.020 rows=11 loops=1)"
" -> Index Scan using t1_pkey on t1 (cost=0.00..6.52 rows=1 width=4) (actual time=6.083..6.084 rows=1 loops=11)"
"        Index Cond: (t1.id = (((random() * 500000::double precision))::integer))"
"Total runtime: 67.070 ms"

explain analyze select * from t1 where
id in (select (random() * 500000)::int4 from generate_series(0,10))
or
id < 10000

"Seq Scan on t1 (cost=22.50..9735.50 rows=254806 width=4) (actual time=0.049..148.947 rows=10010 loops=1)"
" Filter: ((hashed subplan) OR (id < 10000))"
" SubPlan"
"    -> Function Scan on generate_series (cost=0.00..20.00 rows=1000 width=0) (actual time=0.014..0.019 rows=11 loops=1)"
"Total runtime: 150.123 ms"

explain analyze
select * from t1 where
id in (select (random() * 500000)::int4 from generate_series(0,10))
union
select * from t1 where
id < 10000

"Unique (cost=2412.68..2461.74 rows=9812 width=4) (actual time=89.190..95.014 rows=10010 loops=1)"
" -> Sort (cost=2412.68..2437.21 rows=9812 width=4) (actual time=89.189..91.167 rows=10010 loops=1)"
"        Sort Key: public.t1.id"
"        Sort Method: quicksort Memory: 854kB"
"        -> Append (cost=32.50..1762.13 rows=9812 width=4) (actual time=16.641..76.338 rows=10010 loops=1)"
"              -> Nested Loop (cost=32.50..1341.49 rows=200 width=4) (actual time=16.641..70.051 rows=11 loops=1)"
"                    -> HashAggregate (cost=32.50..34.50 rows=200 width=4) (actual time=0.033..0.049 rows=11 loops=1)"
"                          -> Function Scan on generate_series (cost=0.00..20.00 rows=1000 width=0) (actual time=0.020..0.026 rows=11 loops=1)"
"                    -> Index Scan using t1_pkey on t1 (cost=0.00..6.52 rows=1 width=4) (actual time=6.359..6.361 rows=1 loops=11)"
"                          Index Cond: (public.t1.id = (((random() * 500000::double precision))::integer))"
"              -> Index Scan using t1_pkey on t1 (cost=0.00..322.51 rows=9612 width=4) (actual time=0.023..4.075 rows=9999 loops=1)"
"                    Index Cond: (id < 10000)"
"Total runtime: 112.694 ms"

So, if it founds out anything complex, it sadly falls back to Sequence scan.

Re: Can Postgres use an INDEX over an OR?

From

Robert Haas

Date:

27 July 2009, 07:53:20

2009/7/20 Віталій Тимчишин <tivv00@gmail.com>:
> 20 липня 2009 р. 11:02 Chris <dmagick@gmail.com> написав:
>>
>> Віталій Тимчишин wrote:
>>>
>>>
>>> 2009/7/20 Robert James <srobertjames@gmail.com
>>> <mailto:srobertjames@gmail.com>>
>>>
>>>
>>>    Hi. I notice that when I do a WHERE x, Postgres uses an index, and
>>>    when I do WHERE y, it does so as well, but when I do WHERE x OR y,
>>>    it doesn't. Why is this so?
>>>
>>> It's not clever enough.
>>
>> Of course it is.
>
> For simple cases
>
>>
>>
>> I'm running 8.3.7.
>>
>> create table t1(id int primary key);
>> insert into t1(id) select a from generate_series(1, 500000) as s(a);
>> analyze t1;
>
> explain analyze select * from t1 where
> id < 10000
>
> "Index Scan using t1_pkey on t1  (cost=0.00..322.51 rows=9612 width=4)
> (actual time=0.030..3.700 rows=9999 loops=1)"
> "  Index Cond: (id < 10000)"
> "Total runtime: 4.835 ms"
>
> explain analyze select * from t1 where
> id in (select (random() * 500000)::int4 from generate_series(0,10))
>
> "Nested Loop  (cost=32.50..1341.49 rows=200 width=4) (actual
> time=15.353..67.014 rows=11 loops=1)"
> "  ->  HashAggregate  (cost=32.50..34.50 rows=200 width=4) (actual
> time=0.028..0.043 rows=11 loops=1)"
> "        ->  Function Scan on generate_series  (cost=0.00..20.00 rows=1000
> width=0) (actual time=0.014..0.020 rows=11 loops=1)"
> "  ->  Index Scan using t1_pkey on t1  (cost=0.00..6.52 rows=1 width=4)
> (actual time=6.083..6.084 rows=1 loops=11)"
> "        Index Cond: (t1.id = (((random() * 500000::double
> precision))::integer))"
> "Total runtime: 67.070 ms"
>
> explain analyze select * from t1 where
> id in (select (random() * 500000)::int4 from generate_series(0,10))
> or
> id < 10000
>
> "Seq Scan on t1  (cost=22.50..9735.50 rows=254806 width=4) (actual
> time=0.049..148.947 rows=10010 loops=1)"
> "  Filter: ((hashed subplan) OR (id < 10000))"
> "  SubPlan"
> "    ->  Function Scan on generate_series  (cost=0.00..20.00 rows=1000
> width=0) (actual time=0.014..0.019 rows=11 loops=1)"
> "Total runtime: 150.123 ms"
>
> explain analyze
> select * from t1 where
> id in (select (random() * 500000)::int4 from generate_series(0,10))
> union
> select * from t1 where
> id < 10000
>
> "Unique  (cost=2412.68..2461.74 rows=9812 width=4) (actual
> time=89.190..95.014 rows=10010 loops=1)"
> "  ->  Sort  (cost=2412.68..2437.21 rows=9812 width=4) (actual
> time=89.189..91.167 rows=10010 loops=1)"
> "        Sort Key: public.t1.id"
> "        Sort Method:  quicksort  Memory: 854kB"
> "        ->  Append  (cost=32.50..1762.13 rows=9812 width=4) (actual
> time=16.641..76.338 rows=10010 loops=1)"
> "              ->  Nested Loop  (cost=32.50..1341.49 rows=200 width=4)
> (actual time=16.641..70.051 rows=11 loops=1)"
> "                    ->  HashAggregate  (cost=32.50..34.50 rows=200 width=4)
> (actual time=0.033..0.049 rows=11 loops=1)"
> "                          ->  Function Scan on generate_series
> (cost=0.00..20.00 rows=1000 width=0) (actual time=0.020..0.026 rows=11
> loops=1)"
> "                    ->  Index Scan using t1_pkey on t1  (cost=0.00..6.52
> rows=1 width=4) (actual time=6.359..6.361 rows=1 loops=11)"
> "                          Index Cond: (public.t1.id = (((random() *
> 500000::double precision))::integer))"
> "              ->  Index Scan using t1_pkey on t1  (cost=0.00..322.51
> rows=9612 width=4) (actual time=0.023..4.075 rows=9999 loops=1)"
> "                    Index Cond: (id < 10000)"
> "Total runtime: 112.694 ms"

Hmm.  What you're suggesting here is that we could consider
implementing OR conditions by rescanning the inner side for each index
qual and then unique-ifying the results on the index column.  That's
probably possible, but it doesn't sound easy, especially since our
selectivity-estimation code for OR conditions is not very good, so we
might choose to do it this way when that's not actually the best plan.

...Robert

Re: Can Postgres use an INDEX over an OR?

From

Віталій Тимчишин

Date:

27 July 2009, 08:37:27

27 липня 2009 р. 13:53 Robert Haas <robertmhaas@gmail.com> написав:

Hmm. What you're suggesting here is that we could consider
implementing OR conditions by rescanning the inner side for each index
qual and then unique-ifying the results on the index column. That's
probably possible, but it doesn't sound easy, especially since our
selectivity-estimation code for OR conditions is not very good, so we
might choose to do it this way when that's not actually the best plan.

...Robert

Actually what I am talking about is to make OR with UNION (or UNION-like because it's a little different depending on input rows uniqueness) as an option. All of OR parts can use/not use different strategies (including multiple different idexes or hash joins).
In cases when conditions are complex this can drastically increase performance by winning over sequence scan.

As of selectivity, I'd say this is general problem - sometimes it is estimated OK, sometimes not, but this should not prevent from trying different plans. (From my current work: it does wrong estimations of filter selectivity, introduces HASH join and kills the server with OOM).

Best regards, Vitaliy Tymchyshyn.

Re: Can Postgres use an INDEX over an OR?

From

Robert Haas

Date:

27 July 2009, 09:02:59

2009/7/27 Віталій Тимчишин <tivv00@gmail.com>:
>
>
> 27 липня 2009 р. 13:53 Robert Haas <robertmhaas@gmail.com> написав:
>>
>> Hmm.  What you're suggesting here is that we could consider
>> implementing OR conditions by rescanning the inner side for each index
>> qual and then unique-ifying the results on the index column.  That's
>> probably possible, but it doesn't sound easy, especially since our
>> selectivity-estimation code for OR conditions is not very good, so we
>> might choose to do it this way when that's not actually the best plan.
>>
>> ...Robert
>
> Actually what I am talking about is to make OR with UNION (or UNION-like
> because it's a little different depending on input rows uniqueness) as an
> option. All of OR parts can use/not use different strategies (including
> multiple different idexes or hash joins).
> In cases when conditions are complex this can drastically increase
> performance by winning over sequence scan.

That's exactly what I was talking about.

> As of selectivity, I'd say this is general problem - sometimes it is
> estimated OK, sometimes not, but this should not prevent from trying
> different plans. (From my current work: it does wrong estimations of filter
> selectivity, introduces HASH join and kills the server with OOM).

Yep.  I think the two things that would help the most with this are
multi-column statistics and some kind of  cache, but AFAIK nobody is
actively developing code for either one ATM.

The problem, though, is that it won't ALWAYS be right to implement OR
using UNION, so you have to have some way of deciding which is better.

...Robert

Re: Can Postgres use an INDEX over an OR?

From

Віталій Тимчишин

Date:

27 July 2009, 10:00:36

27 липня 2009 р. 15:02 Robert Haas <robertmhaas@gmail.com> написав:

The problem, though, is that it won't ALWAYS be right to implement OR
using UNION, so you have to have some way of deciding which is better.

That's easy - you propose both ways to planner and it's up to it to decide. Yes, it can decide wrong way, but we are returning to statistics problem. At least one can tune costs and enable_ settings. Now one have to rewrite query that may be not possible/too complex.

Re: Can Postgres use an INDEX over an OR?

From

Tom Lane

Date:

27 July 2009, 11:18:43

=?KOI8-U?B?96bUwcymyiD0yc3eydvJzg==?= <tivv00@gmail.com> writes:
> Actually what I am talking about is to make OR with UNION (or UNION-like
> because it's a little different depending on input rows uniqueness) as an
> option. All of OR parts can use/not use different strategies (including
> multiple different idexes or hash joins).

AFAICS you're proposing re-inventing the old implementation of OR'd
indexscans.  We took that out when we added bitmap scans because it
didn't have any performance advantage over BitmapOr.

            regards, tom lane

Re: Can Postgres use an INDEX over an OR?

From

Віталій Тимчишин

Date:

27 July 2009, 11:33:46

27 липня 2009 р. 17:18 Tom Lane <tgl@sss.pgh.pa.us> написав:

Віталій Тимчишин <tivv00@gmail.com> writes:
> Actually what I am talking about is to make OR with UNION (or UNION-like
> because it's a little different depending on input rows uniqueness) as an
> option. All of OR parts can use/not use different strategies (including
> multiple different idexes or hash joins).

AFAICS you're proposing re-inventing the old implementation of OR'd
indexscans. We took that out when we added bitmap scans because it
didn't have any performance advantage over BitmapOr.

It's not tied to indexscans at all. Different parts can do (as in UNION) totally different strategy - e.g. perform two hash joins or perform merge join for one part and nested loop for another or ...

As of performance - see above in this thread. UNION now often provides much better performance when different parts of OR expression involve different additional tables.