Thread: Why is not using the index

Why is not using the index

From
Luis Cornide Arce
Date:
Hi everyone,

I have some problems with a quite long query and the plan postgreSQL is choosing. The query joins 12 tables and in the WHERE clause I use a IN expression with a lot of identifiers (up to 2000). The problem is that the planner is proposing a seq_scan on two tables 2M rows each (
internalexpressionprofile and expressionprofile)

I have just try this query (after doing a vacuum analyze), in the 'IN' clause there are 1552 identifiers, and the query should return 14K rows.
I'm using a PostgreSQL 8.0.2 on a SuSE 8.1 with 1GB of RAM.

explain analyze SELECT DISTINCT rset.replicatesetid, tra.value as value, tra.expressionprofileid, rep.*, epg.expprogeneid,  con.ordinal
FROM expprogene epg JOIN reporter rep ON  (epg.reporterid=rep.reporterid), expressionprofile epro,
transformedexpressionprofile tra, internalexpressionprofile int,
meanvalues mea, replicateset rset, replicateset_condition rsco, condition con,
"CLUSTER" clu, clustertree tre, clusteranalysis an
WHERE epg.expprogeneid IN (80174,84567,...) AND epg.expprogeneid=epro.expprogeneid
AND epro.expressionprofileid=tra.expressionprofileid AND tra.expressionprofileid=int.expressionprofileid
AND int.meanvaluesid=mea.meanvaluesid AND mea.replicatesetid=rset.replicatesetid
AND rset.replicatesetid=rsco.replicatesetid AND rsco.conditionid=con.conditionid
AND tra.clusterid=clu.clusterid AND clu.clustertreeid=tre.clustertreeid AND tre.clustertreeid=an.genetreeid
AND an.clusteranalysisid=1 AND con.clusteranalysisid = an.clusteranalysisid
ORDER BY epg.expprogeneid, con.ordinal;

The plan...

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=129132.53..129132.59 rows=2 width=150) (actual time=12637.224..12676.016 rows=13968 loops=1)
   ->  Sort  (cost=129132.53..129132.54 rows=2 width=150) (actual time=12637.217..12646.484 rows=13968 loops=1)
         Sort Key: epg.expprogeneid, con.ordinal, rset.replicatesetid, tra.value, tra.expressionprofileid, rep.reporterid, rep.name, rep.anotation, rep.otherinfo, rep.incidences
         ->  Nested Loop  (cost=62927.42..129132.52 rows=2 width=150) (actual time=7112.942..12586.314 rows=13968 loops=1)
               Join Filter: ("outer".genetreeid = "inner".clustertreeid)
               ->  Nested Loop  (cost=62927.42..127893.86 rows=409 width=162) (actual time=7112.864..11960.324 rows=41904 loops=1)
                     ->  Nested Loop  (cost=62927.42..125727.31 rows=369 width=154) (actual time=7112.825..11500.645 rows=13968 loops=1)
                           ->  Merge Join  (cost=3.02..7.70 rows=1 width=12) (actual time=0.057..0.073 rows=1 loops=1)
                                 Merge Cond: ("outer".clustertreeid = "inner".genetreeid)
                                 ->  Index Scan using clustertree_pk on clustertree tre  (cost=0.00..4.35 rows=123 width=4) (actual time=0.017..0.024 rows=2 loops=1)
                                 ->  Sort  (cost=3.02..3.03 rows=1 width=8) (actual time=0.028..0.030 rows=1 loops=1)
                                       Sort Key: an.genetreeid
                                       ->  Index Scan using clusteranalysis_pk on clusteranalysis an  (cost=0.00..3.01 rows=1 width=8) (actual time=0.015..0.018 rows=1 loops=1)
                                             Index Cond: (clusteranalysisid = 1)
                           ->  Hash Join  (cost=62924.39..125715.53 rows=408 width=150) (actual time=7112.758..11455.797 rows=13968 loops=1)
                                 Hash Cond: ("outer".expressionprofileid = "inner".expressionprofileid)
                                 ->  Hash Join  (cost=15413.58..78079.33 rows=24339 width=134) (actual time=1489.347..5721.306 rows=41904 loops=1)
                                       Hash Cond: ("outer".expprogeneid = "inner".expprogeneid)
                                       ->  Seq Scan on expressionprofile epro  (cost=0.00..48263.24 rows=2831824 width=8) (actual time=0.039..3097.656 rows=2839676 loops=1)
                                       ->  Hash  (cost=15409.72..15409.72 rows=1546 width=130) (actual time=43.365..43.365 rows=0 loops=1)
                                             ->  Nested Loop  (cost=0.00..15409.72 rows=1546 width=130) (actual time=0.056..40.637 rows=1552 loops=1)
                                                   ->  Index Scan using expprogene_pk, expprogene_pk, [......] on expprogene epg  (cost=0.00..10698.83 rows=1546 width=8) (actual time=0.027..15.907 rows=1552 loops=1)
                                                         Index Cond: ((expprogeneid = 80174) OR (expprogeneid = 84567) OR (expprogeneid = 83608) OR [OR ....])
                                                   ->  Index Scan using reporter_pkey on reporter rep  (cost=0.00..3.03 rows=1 width=126) (actual time=0.009..0.010 rows=1 loops=1552)
                                                         Index Cond: ("outer".reporterid = rep.reporterid)
                                 ->  Hash  (cost=47403.68..47403.68 rows=42853 width=16) (actual time=5623.174..5623.174 rows=0 loops=1)
                                       ->  Hash Join  (cost=2369.91..47403.68 rows=42853 width=16) (actual time=346.040..5538.571 rows=75816 loops=1)
                                             Hash Cond: ("outer".meanvaluesid = "inner".meanvaluesid)
                                             ->  Seq Scan on internalexpressionprofile "int"  (cost=0.00..34506.16 rows=2019816 width=8) (actual time=0.003..2231.427 rows=2019816 loops=1)
                                             ->  Hash  (cost=2262.78..2262.78 rows=42853 width=16) (actual time=345.803..345.803 rows=0 loops=1)
                                                   ->  Nested Loop  (cost=17.49..2262.78 rows=42853 width=16) (actual time=1.965..259.363 rows=75816 loops=1)
                                                         ->  Hash Join  (cost=17.49..28.42 rows=6 width=16) (actual time=1.881..2.387 rows=9 loops=1)
                                                               Hash Cond: ("outer".replicatesetid = "inner".replicatesetid)
                                                               ->  Seq Scan on replicateset rset  (cost=0.00..9.58 rows=258 width=4) (actual time=0.003..0.295 rows=258 loops=1)
                                                               ->  Hash  (cost=17.47..17.47 rows=6 width=12) (actual time=1.575..1.575 rows=0 loops=1)
                                                                     ->  Hash Join  (cost=3.17..17.47 rows=6 width=12) (actual time=0.315..1.557 rows=9 loops=1)
                                                                           Hash Cond: ("outer".conditionid = "inner".conditionid)
                                                                           ->  Seq Scan on replicateset_condition rsco  (cost=0.00..10.83 rows=683 width=8) (actual time=0.004..0.688 rows=683 loops=1)
                                                                           ->  Hash  (cost=3.14..3.14 rows=9 width=12) (actual time=0.059..0.059 rows=0 loops=1)
                                                                                 ->  Index Scan using clustering_analysis_fk on condition con  (cost=0.00..3.14 rows=9 width=12) (actual time=0.019..0.039 rows=9 loops=1)
                                                                                       Index Cond: (clusteranalysisid = 1)
                                                         ->  Index Scan using has_meanvalues_fk on meanvalues mea  (cost=0.00..264.03 rows=8669 width=8) (actual time=0.027..13.032 rows=8424 loops=9)
                                                               Index Cond: ("outer".replicatesetid = mea.replicatesetid)
                     ->  Index Scan using comes_from_raw_fk on transformedexpressionprofile tra  (cost=0.00..5.86 rows=1 width=16) (actual time=0.010..0.018 rows=3 loops=13968)
                           Index Cond: (tra.expressionprofileid = "outer".expressionprofileid)
               ->  Index Scan using _cluster__pk on "CLUSTER" clu  (cost=0.00..3.01 rows=1 width=8) (actual time=0.009..0.010 rows=1 loops=41904)
                     Index Cond: ("outer".clusterid = clu.clusterid)
 Total runtime: 12696.289 ms
(48 rows)

I tried setting the enable_seq_scan to off and the query's runtime returned by the explain analyze is 4000ms.
Why postgre is not using the indexes?
What is the real impact of having such a big 'IN' clause?


Thanks in advance,

Luis Cornide

Re: Why is not using the index

From
Richard Huxton
Date:
Luis Cornide Arce wrote:
> Hi everyone,
>
> I have some problems with a quite long query and the plan postgreSQL is
> choosing. The query joins 12 tables and in the WHERE clause I use a IN
> expression with a lot of identifiers (up to 2000). The problem is that
> the planner is proposing a seq_scan on two tables 2M rows each
> (internalexpressionprofile and expressionprofile)
>
> I have just try this query (after doing a vacuum analyze), in the 'IN'
> clause there are 1552 identifiers, and the query should return 14K rows.
> I'm using a PostgreSQL 8.0.2 on a SuSE 8.1 with 1GB of RAM.

> WHERE epg.expprogeneid IN (80174,84567,...) AND
> epg.expprogeneid=epro.expprogeneid

->  Hash Join
    (cost=15413.58..78079.33 rows=24339 width=134)
    (actual time=1489.347..5721.306 rows=41904 loops=1)
    Hash Cond: ("outer".expprogeneid = "inner".expprogeneid)
    ->  Seq Scan on expressionprofile epro
        (cost=0.00..48263.24 rows=2831824 width=8)
        (actual time=0.039..3097.656 rows=2839676 loops=1)

->  Index Scan using
expprogene_pk, expprogene_pk, [......] on expprogene epg
(cost=0.00..10698.83 rows=1546 width=8) (actual time=0.027..15.907
rows=1552 loops=1)
    Index Cond: ((expprogeneid = 80174) OR (expprogeneid = 84567)
    OR (expprogeneid = 83608) OR [OR ....])

OK - it looks like the "IN" clause is using your index. The fact that
it's using a Seq-scan on "expressionprofile epro" looks odd though,
especially since it expects 24339 matches (out of 2.8 million rows -
that should favour an index).

Of course, I've not considered the context of the rest of the query, but
I'd expect the index to be used.

Do you have any unusual config settings?
--
   Richard Huxton
   Archonet Ltd

Re: Why is not using the index

From
Luis Cornide Arce
Date:
Well I have change the next setting in the postgresql.conf

shared_buffers= 16384
work_mem =32768
maintenance_work_mem= 65536
bgwriter_delay =800
bgwriter_maxpages= 100
wal_buffers =64
efective_cache_size= 20000

The rest of the settings are the default.

Thanks,

Luis

Richard Huxton escribió:

> Luis Cornide Arce wrote:
>
>> Hi everyone,
>>
>> I have some problems with a quite long query and the plan postgreSQL
>> is choosing. The query joins 12 tables and in the WHERE clause I use
>> a IN expression with a lot of identifiers (up to 2000). The problem
>> is that the planner is proposing a seq_scan on two tables 2M rows
>> each (internalexpressionprofile and expressionprofile)
>>
>> I have just try this query (after doing a vacuum analyze), in the
>> 'IN' clause there are 1552 identifiers, and the query should return
>> 14K rows.
>> I'm using a PostgreSQL 8.0.2 on a SuSE 8.1 with 1GB of RAM.
>
>
>> WHERE epg.expprogeneid IN (80174,84567,...) AND
>> epg.expprogeneid=epro.expprogeneid
>
>
> ->  Hash Join
>     (cost=15413.58..78079.33 rows=24339 width=134)
>     (actual time=1489.347..5721.306 rows=41904 loops=1)
>     Hash Cond: ("outer".expprogeneid = "inner".expprogeneid)
>    ->  Seq Scan on expressionprofile epro
>         (cost=0.00..48263.24 rows=2831824 width=8)
>         (actual time=0.039..3097.656 rows=2839676 loops=1)
>
> ->  Index Scan using
> expprogene_pk, expprogene_pk, [......] on expprogene epg
> (cost=0.00..10698.83 rows=1546 width=8) (actual time=0.027..15.907
> rows=1552 loops=1)
>     Index Cond: ((expprogeneid = 80174) OR (expprogeneid = 84567)
>     OR (expprogeneid = 83608) OR [OR ....])
>
> OK - it looks like the "IN" clause is using your index. The fact that
> it's using a Seq-scan on "expressionprofile epro" looks odd though,
> especially since it expects 24339 matches (out of 2.8 million rows -
> that should favour an index).
>
> Of course, I've not considered the context of the rest of the query,
> but I'd expect the index to be used.
>
> Do you have any unusual config settings?