Thread: How to get the optimizer to use an index with multiple fields

How to get the optimizer to use an index with multiple fields

From

"Jim C. Nasby"

Date:

28 April 2003, 17:21:51

select id into temp NewRetires from stats_participant where retire_to>=1
AND retire_date = (SELECT last_date FROM Project_statsrun WHERE
project_id = :ProjectID);

results in a table with 5 values...

explain analyze delete from email_rank where project_id=25 and id in
(select id from NewRetires);

 Index Scan using email_rank__day_rank on email_rank
(cost=0.00..9003741627715.16 rows=45019 width=6) (actual time=408.12..9688.37 rows=3 loops=1)
   Index Cond: (project_id = 25)
   Filter: (subplan)
   SubPlan
     ->  Seq Scan on newretires  (cost=100000000.00..100000020.00 rows=1000 width=4) (actual time=0.01..0.05 rows=5
loops=91834)
 Total runtime: 9689.86 msec

But, there's already an index that would fit the bill here perfectly:

              Table "public.email_rank"
        Column         |  Type   |     Modifiers
-----------------------+---------+--------------------
 project_id            | integer | not null
 id                    | integer | not null
 first_date            | date    | not null
 last_date             | date    | not null
 day_rank              | integer | not null default 0
 day_rank_previous     | integer | not null default 0
 overall_rank          | integer | not null default 0
 overall_rank_previous | integer | not null default 0
 work_today            | bigint  | not null default 0
 work_total            | bigint  | not null default 0
Indexes: email_rank_pkey primary key btree (project_id, id),
         email_rank__day_rank btree (project_id, day_rank),
         email_rank__overall_rank btree (project_id, overall_rank)

Why isn't it using email_rank_pkey instead of using day_rank then a
filter? The original query on sybase (see below) is essentially instant,
because it's using the index of (project_id, id), so it doesn't have to
read the whole table.

stats=> select project_id,count(*) from email_rank group by project_id;
 project_id | count
------------+--------
          5 | 327856
          8 |  28304
         24 |  34622
         25 |  91834
        205 | 331464

Also, changing the WHERE IN to a WHERE EXISTS in the delete is
substantially faster in this case (3.5 seconds as opposed to 9); it
would be nice if the optimizer could rewrite the query on-the-fly. I
started looking into this in the first place because the original query
was taking 6-10 seconds, which seemed too long...

Original query:
DELETE FROM Email_Rank
    WHERE project_id = :ProjectID
        AND id IN  (SELECT id
                        FROM STATS_Participant sp
                        WHERE retire_to >= 1
                            AND retire_date = (SELECT last_date FROM Project_statsrun WHERE project_id = :ProjectID)
                    )
;

I tried changing this to an EXISTS and it takes over a minute. So in
this case, the range of runtimes is ~4 seconds (building the temp table
takes ~0.25 seconds) to over a minute.
--
Jim C. Nasby (aka Decibel!)                    jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Re: How to get the optimizer to use an index with multiple fields

From

"Jim C. Nasby"

Date:

28 April 2003, 17:32:47

On Mon, Apr 28, 2003 at 01:21:43PM -0500, Jim C. Nasby wrote:
> I tried changing this to an EXISTS and it takes over a minute. So in
> this case, the range of runtimes is ~4 seconds (building the temp table
> takes ~0.25 seconds) to over a minute.

BTW, I forgot to mention that building the temp table only takes 0.25
seconds if I first disable sequential scans. :/
--
Jim C. Nasby (aka Decibel!)                    jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Re: How to get the optimizer to use an index with multiple

From

"scott.marlowe"

Date:

28 April 2003, 19:04:10

On Mon, 28 Apr 2003, Jim C. Nasby wrote:

> select id into temp NewRetires from stats_participant where retire_to>=1
> AND retire_date = (SELECT last_date FROM Project_statsrun WHERE
> project_id = :ProjectID);
>
> results in a table with 5 values...
>
> explain analyze delete from email_rank where project_id=25 and id in
> (select id from NewRetires);
>
>  Index Scan using email_rank__day_rank on email_rank
> (cost=0.00..9003741627715.16 rows=45019 width=6) (actual time=408.12..9688.37 rows=3 loops=1)
>    Index Cond: (project_id = 25)
>    Filter: (subplan)
>    SubPlan
>      ->  Seq Scan on newretires  (cost=100000000.00..100000020.00 rows=1000 width=4) (actual time=0.01..0.05 rows=5
loops=91834)
>  Total runtime: 9689.86 msec
>
> But, there's already an index that would fit the bill here perfectly:
>
>               Table "public.email_rank"
>         Column         |  Type   |     Modifiers
> -----------------------+---------+--------------------
>  project_id            | integer | not null
>  id                    | integer | not null
>  first_date            | date    | not null
>  last_date             | date    | not null
>  day_rank              | integer | not null default 0
>  day_rank_previous     | integer | not null default 0
>  overall_rank          | integer | not null default 0
>  overall_rank_previous | integer | not null default 0
>  work_today            | bigint  | not null default 0
>  work_total            | bigint  | not null default 0
> Indexes: email_rank_pkey primary key btree (project_id, id),
>          email_rank__day_rank btree (project_id, day_rank),
>          email_rank__overall_rank btree (project_id, overall_rank)
>
> Why isn't it using email_rank_pkey instead of using day_rank then a
> filter? The original query on sybase (see below) is essentially instant,
> because it's using the index of (project_id, id), so it doesn't have to
> read the whole table.

It looks like the seq scan is newretires up there, from your 'id in
(select id from NewRetires);' part of your query.  I.e. the where in() has
to be done first, and the query planner has no stats on that table, so it
assumes a seq scan will be faster in case we need the whole thing anyway.

Try adding an analyze newretires in there between the two queries.

No clue as to why it's choosing one index over the other.  I don't think
that really matters a lot, it's the seq scan on the temp table that is
taking your time on this.

Re: How to get the optimizer to use an index with multiple

From

"Jim C. Nasby"

Date:

28 April 2003, 21:54:45

On Mon, Apr 28, 2003 at 01:52:57PM -0600, scott.marlowe wrote:
> On Mon, 28 Apr 2003, Jim C. Nasby wrote:
>
> > select id into temp NewRetires from stats_participant where retire_to>=1
> > AND retire_date = (SELECT last_date FROM Project_statsrun WHERE
> > project_id = :ProjectID);
> >
> > results in a table with 5 values...
> >
> > explain analyze delete from email_rank where project_id=25 and id in
> > (select id from NewRetires);
> >
> >  Index Scan using email_rank__day_rank on email_rank
> > (cost=0.00..9003741627715.16 rows=45019 width=6) (actual time=408.12..9688.37 rows=3 loops=1)
> >    Index Cond: (project_id = 25)
> >    Filter: (subplan)
> >    SubPlan
> >      ->  Seq Scan on newretires  (cost=100000000.00..100000020.00 rows=1000 width=4) (actual time=0.01..0.05 rows=5
loops=91834)
> >  Total runtime: 9689.86 msec
> >
> > But, there's already an index that would fit the bill here perfectly:
> >
> >               Table "public.email_rank"
> >         Column         |  Type   |     Modifiers
> > -----------------------+---------+--------------------
> >  project_id            | integer | not null
> >  id                    | integer | not null
> >  first_date            | date    | not null
> >  last_date             | date    | not null
> >  day_rank              | integer | not null default 0
> >  day_rank_previous     | integer | not null default 0
> >  overall_rank          | integer | not null default 0
> >  overall_rank_previous | integer | not null default 0
> >  work_today            | bigint  | not null default 0
> >  work_total            | bigint  | not null default 0
> > Indexes: email_rank_pkey primary key btree (project_id, id),
> >          email_rank__day_rank btree (project_id, day_rank),
> >          email_rank__overall_rank btree (project_id, overall_rank)
> >
> > Why isn't it using email_rank_pkey instead of using day_rank then a
> > filter? The original query on sybase (see below) is essentially instant,
> > because it's using the index of (project_id, id), so it doesn't have to
> > read the whole table.
>
> It looks like the seq scan is newretires up there, from your 'id in
> (select id from NewRetires);' part of your query.  I.e. the where in() has
> to be done first, and the query planner has no stats on that table, so it
> assumes a seq scan will be faster in case we need the whole thing anyway.
>
> Try adding an analyze newretires in there between the two queries.
>
> No clue as to why it's choosing one index over the other.  I don't think
> that really matters a lot, it's the seq scan on the temp table that is
> taking your time on this.

There's no index at all on the temporary table; I fully expect it to
seqscan than. :) The issue is the choice of index on email_rank. It's
only going to hit at most 5 rows in email_rank (which it should be able
to figure out based on newretires and the fact that email_rank_pkey is
unique. I didn't show it, but I did run analyze on the temporary table
(why it doesn't have statistics I don't know...)
--
Jim C. Nasby (aka Decibel!)                    jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"