Thread: optimizer choosing the wrong index

optimizer choosing the wrong index

From
Martin Below
Date:
Hello,

I'm facing a strange problem where the optimizer does pick the wrong index.
Im using postgres 8.4, and my schema look like this:

 client_id   | character varying(36)       | not null
 key         | character varying(16)       | not null
 expires_on  | timestamp without time zone | not null

Indexe:
   "ps_pkey" PRIMARY KEY, btree (client_id, key)
   "idx_correct" btree (client_id, expires_on)
   "idx_wrong" btree (expires_on)


the query:
explain analyze select * from ps where client_id='foo' and expires_on
= timestamp '2010-11-24';

                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Index Scan using idx_wrong on ps  (cost=0.00..8.29 rows=1 width=61)
(actual time=0.010..0.010 rows=0 loops=1)
  Index Cond: (expires_on = '2010-11-24 00:00:00'::timestamp without time zone)
  Filter: ((client_id)::text = 'foo'::text)
 Total runtime: 0.089 ms


Why is "idx_wrong" used (which only includes one of the fields
queried) instead of idx_correct (which contains both fields)?
If I drop idx_wrong, the correct index is choosen:

test=# explain analyze select * from ps where client_id='foo' and
expires_on = timestamp '2010-11-24';

                                                      QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_correct on ps  (cost=0.00..8.34 rows=1 width=53)
(actual time=0.023..0.023 rows=0 loops=1)
  Index Cond: (((client_id)::text = 'foo'::text) AND (expires_on =
'2010-11-24 00:00:00'::timestamp without time zone))
 Total runtime: 0.058 ms



The problem seems to me that the estimates costs are not correct.
With the table containing about 200.000 records, using the "wrong"
index takes about 22 times as long as using the "right" index. I did
run "vacuum analyze", without any effect.

Any help would be very much appreciated.

Thanks,
Martin Below

Re: optimizer choosing the wrong index

From
Merlin Moncure
Date:
On Wed, Jul 7, 2010 at 7:42 AM, Martin Below
<machtin.below@googlemail.com> wrote:
> Hello,
>
> I'm facing a strange problem where the optimizer does pick the wrong index.
> Im using postgres 8.4, and my schema look like this:
>
>  client_id   | character varying(36)       | not null
>  key         | character varying(16)       | not null
>  expires_on  | timestamp without time zone | not null
>
> Indexe:
>    "ps_pkey" PRIMARY KEY, btree (client_id, key)
>    "idx_correct" btree (client_id, expires_on)
>    "idx_wrong" btree (expires_on)
>
>
> the query:
> explain analyze select * from ps where client_id='foo' and expires_on
> = timestamp '2010-11-24';
>
>                                                  QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------
>  Index Scan using idx_wrong on ps  (cost=0.00..8.29 rows=1 width=61)
> (actual time=0.010..0.010 rows=0 loops=1)
>   Index Cond: (expires_on = '2010-11-24 00:00:00'::timestamp without time zone)
>   Filter: ((client_id)::text = 'foo'::text)
>  Total runtime: 0.089 ms
>
>
> Why is "idx_wrong" used (which only includes one of the fields
> queried) instead of idx_correct (which contains both fields)?
> If I drop idx_wrong, the correct index is choosen:
>
> test=# explain analyze select * from ps where client_id='foo' and
> expires_on = timestamp '2010-11-24';
>
>                                                       QUERY PLAN
>
-------------------------------------------------------------------------------------------------------------------------
>  Index Scan using idx_correct on ps  (cost=0.00..8.34 rows=1 width=53)
> (actual time=0.023..0.023 rows=0 loops=1)
>   Index Cond: (((client_id)::text = 'foo'::text) AND (expires_on =
> '2010-11-24 00:00:00'::timestamp without time zone))
>  Total runtime: 0.058 ms
>
>
>
> The problem seems to me that the estimates costs are not correct.
> With the table containing about 200.000 records, using the "wrong"
> index takes about 22 times as long as using the "right" index. I did
> run "vacuum analyze", without any effect.
>
> Any help would be very much appreciated.

can you supply the plans on the actual tables?  the 'wrong' index
might actually be the 'right' one if expires_on is of high cardinality
(perhaps it's distributed badly and the table needs a stats tweak to
make it correct).

btw, consider using 'date' type for dates vs non timezone timestamp,
which is a bit of a kludge imo.

You can probably force the right index like this:
explain analyze select * from ps where (client_id, expires_on) =
('foo', '2010-11-24'::timestamp);

merlin

Re: optimizer choosing the wrong index

From
Tom Lane
Date:
Martin Below <machtin.below@googlemail.com> writes:
> Why is "idx_wrong" used (which only includes one of the fields
> queried) instead of idx_correct (which contains both fields)?

It's not immediately clear that that choice is "wrong".  A two-column
index is bigger and hence more expensive to search than a one-column
index --- perhaps quite substantially so, given this particular
combination of datatypes.  The planner is probably estimating that
the expires_on condition is selective enough that nothing much will be
gained by using the two-column index.  If that's wrong (which is not
proven by your example) you might be able to fix it by increasing
the stats target for the table.  Also, if you haven't customized
effective_cache_size, increasing that makes large indexes look
cheaper to use.

            regards, tom lane

Re: optimizer choosing the wrong index

From
Martin Below
Date:
Hello Merlin,

thanks for your help.

> can you supply the plans on the actual tables?  the 'wrong' index
> might actually be the 'right' one if expires_on is of high cardinality
> (perhaps it's distributed badly and the table needs a stats tweak to
> make it correct).

test=# select count(*) total, count(distinct client_id) ids,
count(distinct expires_on) dates from ps;
 total  |  ids   | dates
--------+--------+--------
 213645 | 123366 | 213549

I played arround with set statistics on both columns, but that didn't
seem to help. (I did run analyze)

> You can probably force the right index like this:
> explain analyze select * from ps where (client_id, expires_on) =
> ('foo', '2010-11-24'::timestamp);

That didn't seem to work either:
test=# explain select * from ps where (client_id, expires_on) =
('123', '24.11.2010'::timestamp);
                                   QUERY PLAN
---------------------------------------------------------------------------------
 Index Scan using idx_wrong on ps  (cost=0.00..8.29 rows=1 width=53)
   Index Cond: (expires_on = '2010-11-24 00:00:00'::timestamp without time zone)
   Filter: ((client_id)::text = '123'::text)

>
> merlin
>

Re: optimizer choosing the wrong index

From
Tom Lane
Date:
Martin Below <machtin.below@googlemail.com> writes:
> test=# select count(*) total, count(distinct client_id) ids,
> count(distinct expires_on) dates from ps;
>  total  |  ids   | dates
> --------+--------+--------
>  213645 | 123366 | 213549

That says the expires_on column is practically unique, which makes me
think the planner is indeed making the right choice.

            regards, tom lane

Re: optimizer choosing the wrong index

From
Merlin Moncure
Date:
On Thu, Jul 8, 2010 at 10:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Martin Below <machtin.below@googlemail.com> writes:
>> test=# select count(*) total, count(distinct client_id) ids,
>> count(distinct expires_on) dates from ps;
>>  total  |  ids   | dates
>> --------+--------+--------
>>  213645 | 123366 | 213549
>
> That says the expires_on column is practically unique, which makes me
> think the planner is indeed making the right choice.

I think so too, but suppose we wanted to force the other plan anyways:
select * from ps where (client_id, expires_on) >=
('123', '24.11.2010'::timestamp) and (client_id, expires_on) < ('123',
null) order by client_id, expires_on;

I'd be curious to see explain analyze (not explain) comparisons for
the 'wrong' index vs above.  I suspect the plan is 'correct' for
*most* of the data, or you cherry picked (or unluckily drew) a bad
value to get your 22 times speed difference.

merlin

Re: optimizer choosing the wrong index

From
Tom Lane
Date:
Merlin Moncure <mmoncure@gmail.com> writes:
> I think so too, but suppose we wanted to force the other plan anyways:
> select * from ps where (client_id, expires_on) >=
> ('123', '24.11.2010'::timestamp) and (client_id, expires_on) < ('123',
> null) order by client_id, expires_on;

A simpler way to force use of the other index is

    begin;
    drop index wrong_index;
    explain analyze my_query;
    rollback;

> I'd be curious to see explain analyze (not explain) comparisons for
> the 'wrong' index vs above.  I suspect the plan is 'correct' for
> *most* of the data, or you cherry picked (or unluckily drew) a bad
> value to get your 22 times speed difference.

Yeah, it would be interesting to see explain analyze output for both
cases, and for a few different values of the expires_on date if this
one is unlike the typical value.

            regards, tom lane

Re: optimizer choosing the wrong index

From
Adrian von Bidder
Date:
Heyho!

On Wednesday 07 July 2010 13.42:59 Martin Below wrote:
> I'm facing a strange problem where the optimizer does pick the wrong
> index.

What's not clear to me is why you want pg to chose a particular index.  Are
you having performance issues? (The case you show is a table with 200k rows
- this doesn't look like very much.)

cheers
-- vbi

--
Verbing weirds language.
        -- Calvin & Hobbes

Attachment