Why is not using the index - Mailing list pgsql-performance
From | Luis Cornide Arce |
---|---|
Subject | Why is not using the index |
Date | |
Msg-id | 42FB3791.2060200@almabioinfo.com Whole thread Raw |
Responses |
Re: Why is not using the index
|
List | pgsql-performance |
Hi everyone,
I have some problems with a quite long query and the plan postgreSQL is choosing. The query joins 12 tables and in the WHERE clause I use a IN expression with a lot of identifiers (up to 2000). The problem is that the planner is proposing a seq_scan on two tables 2M rows each (internalexpressionprofile and expressionprofile)
I have just try this query (after doing a vacuum analyze), in the 'IN' clause there are 1552 identifiers, and the query should return 14K rows.
I'm using a PostgreSQL 8.0.2 on a SuSE 8.1 with 1GB of RAM.
explain analyze SELECT DISTINCT rset.replicatesetid, tra.value as value, tra.expressionprofileid, rep.*, epg.expprogeneid, con.ordinal
FROM expprogene epg JOIN reporter rep ON (epg.reporterid=rep.reporterid), expressionprofile epro,
transformedexpressionprofile tra, internalexpressionprofile int,
meanvalues mea, replicateset rset, replicateset_condition rsco, condition con,
"CLUSTER" clu, clustertree tre, clusteranalysis an
WHERE epg.expprogeneid IN (80174,84567,...) AND epg.expprogeneid=epro.expprogeneid
AND epro.expressionprofileid=tra.expressionprofileid AND tra.expressionprofileid=int.expressionprofileid
AND int.meanvaluesid=mea.meanvaluesid AND mea.replicatesetid=rset.replicatesetid
AND rset.replicatesetid=rsco.replicatesetid AND rsco.conditionid=con.conditionid
AND tra.clusterid=clu.clusterid AND clu.clustertreeid=tre.clustertreeid AND tre.clustertreeid=an.genetreeid
AND an.clusteranalysisid=1 AND con.clusteranalysisid = an.clusteranalysisid
ORDER BY epg.expprogeneid, con.ordinal;
The plan...
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=129132.53..129132.59 rows=2 width=150) (actual time=12637.224..12676.016 rows=13968 loops=1)
-> Sort (cost=129132.53..129132.54 rows=2 width=150) (actual time=12637.217..12646.484 rows=13968 loops=1)
Sort Key: epg.expprogeneid, con.ordinal, rset.replicatesetid, tra.value, tra.expressionprofileid, rep.reporterid, rep.name, rep.anotation, rep.otherinfo, rep.incidences
-> Nested Loop (cost=62927.42..129132.52 rows=2 width=150) (actual time=7112.942..12586.314 rows=13968 loops=1)
Join Filter: ("outer".genetreeid = "inner".clustertreeid)
-> Nested Loop (cost=62927.42..127893.86 rows=409 width=162) (actual time=7112.864..11960.324 rows=41904 loops=1)
-> Nested Loop (cost=62927.42..125727.31 rows=369 width=154) (actual time=7112.825..11500.645 rows=13968 loops=1)
-> Merge Join (cost=3.02..7.70 rows=1 width=12) (actual time=0.057..0.073 rows=1 loops=1)
Merge Cond: ("outer".clustertreeid = "inner".genetreeid)
-> Index Scan using clustertree_pk on clustertree tre (cost=0.00..4.35 rows=123 width=4) (actual time=0.017..0.024 rows=2 loops=1)
-> Sort (cost=3.02..3.03 rows=1 width=8) (actual time=0.028..0.030 rows=1 loops=1)
Sort Key: an.genetreeid
-> Index Scan using clusteranalysis_pk on clusteranalysis an (cost=0.00..3.01 rows=1 width=8) (actual time=0.015..0.018 rows=1 loops=1)
Index Cond: (clusteranalysisid = 1)
-> Hash Join (cost=62924.39..125715.53 rows=408 width=150) (actual time=7112.758..11455.797 rows=13968 loops=1)
Hash Cond: ("outer".expressionprofileid = "inner".expressionprofileid)
-> Hash Join (cost=15413.58..78079.33 rows=24339 width=134) (actual time=1489.347..5721.306 rows=41904 loops=1)
Hash Cond: ("outer".expprogeneid = "inner".expprogeneid)
-> Seq Scan on expressionprofile epro (cost=0.00..48263.24 rows=2831824 width=8) (actual time=0.039..3097.656 rows=2839676 loops=1)
-> Hash (cost=15409.72..15409.72 rows=1546 width=130) (actual time=43.365..43.365 rows=0 loops=1)
-> Nested Loop (cost=0.00..15409.72 rows=1546 width=130) (actual time=0.056..40.637 rows=1552 loops=1)
-> Index Scan using expprogene_pk, expprogene_pk, [......] on expprogene epg (cost=0.00..10698.83 rows=1546 width=8) (actual time=0.027..15.907 rows=1552 loops=1)
Index Cond: ((expprogeneid = 80174) OR (expprogeneid = 84567) OR (expprogeneid = 83608) OR [OR ....])
-> Index Scan using reporter_pkey on reporter rep (cost=0.00..3.03 rows=1 width=126) (actual time=0.009..0.010 rows=1 loops=1552)
Index Cond: ("outer".reporterid = rep.reporterid)
-> Hash (cost=47403.68..47403.68 rows=42853 width=16) (actual time=5623.174..5623.174 rows=0 loops=1)
-> Hash Join (cost=2369.91..47403.68 rows=42853 width=16) (actual time=346.040..5538.571 rows=75816 loops=1)
Hash Cond: ("outer".meanvaluesid = "inner".meanvaluesid)
-> Seq Scan on internalexpressionprofile "int" (cost=0.00..34506.16 rows=2019816 width=8) (actual time=0.003..2231.427 rows=2019816 loops=1)
-> Hash (cost=2262.78..2262.78 rows=42853 width=16) (actual time=345.803..345.803 rows=0 loops=1)
-> Nested Loop (cost=17.49..2262.78 rows=42853 width=16) (actual time=1.965..259.363 rows=75816 loops=1)
-> Hash Join (cost=17.49..28.42 rows=6 width=16) (actual time=1.881..2.387 rows=9 loops=1)
Hash Cond: ("outer".replicatesetid = "inner".replicatesetid)
-> Seq Scan on replicateset rset (cost=0.00..9.58 rows=258 width=4) (actual time=0.003..0.295 rows=258 loops=1)
-> Hash (cost=17.47..17.47 rows=6 width=12) (actual time=1.575..1.575 rows=0 loops=1)
-> Hash Join (cost=3.17..17.47 rows=6 width=12) (actual time=0.315..1.557 rows=9 loops=1)
Hash Cond: ("outer".conditionid = "inner".conditionid)
-> Seq Scan on replicateset_condition rsco (cost=0.00..10.83 rows=683 width=8) (actual time=0.004..0.688 rows=683 loops=1)
-> Hash (cost=3.14..3.14 rows=9 width=12) (actual time=0.059..0.059 rows=0 loops=1)
-> Index Scan using clustering_analysis_fk on condition con (cost=0.00..3.14 rows=9 width=12) (actual time=0.019..0.039 rows=9 loops=1)
Index Cond: (clusteranalysisid = 1)
-> Index Scan using has_meanvalues_fk on meanvalues mea (cost=0.00..264.03 rows=8669 width=8) (actual time=0.027..13.032 rows=8424 loops=9)
Index Cond: ("outer".replicatesetid = mea.replicatesetid)
-> Index Scan using comes_from_raw_fk on transformedexpressionprofile tra (cost=0.00..5.86 rows=1 width=16) (actual time=0.010..0.018 rows=3 loops=13968)
Index Cond: (tra.expressionprofileid = "outer".expressionprofileid)
-> Index Scan using _cluster__pk on "CLUSTER" clu (cost=0.00..3.01 rows=1 width=8) (actual time=0.009..0.010 rows=1 loops=41904)
Index Cond: ("outer".clusterid = clu.clusterid)
Total runtime: 12696.289 ms
(48 rows)
I tried setting the enable_seq_scan to off and the query's runtime returned by the explain analyze is 4000ms.
Why postgre is not using the indexes?
What is the real impact of having such a big 'IN' clause?
Thanks in advance,
Luis Cornide
I have some problems with a quite long query and the plan postgreSQL is choosing. The query joins 12 tables and in the WHERE clause I use a IN expression with a lot of identifiers (up to 2000). The problem is that the planner is proposing a seq_scan on two tables 2M rows each (internalexpressionprofile and expressionprofile)
I have just try this query (after doing a vacuum analyze), in the 'IN' clause there are 1552 identifiers, and the query should return 14K rows.
I'm using a PostgreSQL 8.0.2 on a SuSE 8.1 with 1GB of RAM.
explain analyze SELECT DISTINCT rset.replicatesetid, tra.value as value, tra.expressionprofileid, rep.*, epg.expprogeneid, con.ordinal
FROM expprogene epg JOIN reporter rep ON (epg.reporterid=rep.reporterid), expressionprofile epro,
transformedexpressionprofile tra, internalexpressionprofile int,
meanvalues mea, replicateset rset, replicateset_condition rsco, condition con,
"CLUSTER" clu, clustertree tre, clusteranalysis an
WHERE epg.expprogeneid IN (80174,84567,...) AND epg.expprogeneid=epro.expprogeneid
AND epro.expressionprofileid=tra.expressionprofileid AND tra.expressionprofileid=int.expressionprofileid
AND int.meanvaluesid=mea.meanvaluesid AND mea.replicatesetid=rset.replicatesetid
AND rset.replicatesetid=rsco.replicatesetid AND rsco.conditionid=con.conditionid
AND tra.clusterid=clu.clusterid AND clu.clustertreeid=tre.clustertreeid AND tre.clustertreeid=an.genetreeid
AND an.clusteranalysisid=1 AND con.clusteranalysisid = an.clusteranalysisid
ORDER BY epg.expprogeneid, con.ordinal;
The plan...
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=129132.53..129132.59 rows=2 width=150) (actual time=12637.224..12676.016 rows=13968 loops=1)
-> Sort (cost=129132.53..129132.54 rows=2 width=150) (actual time=12637.217..12646.484 rows=13968 loops=1)
Sort Key: epg.expprogeneid, con.ordinal, rset.replicatesetid, tra.value, tra.expressionprofileid, rep.reporterid, rep.name, rep.anotation, rep.otherinfo, rep.incidences
-> Nested Loop (cost=62927.42..129132.52 rows=2 width=150) (actual time=7112.942..12586.314 rows=13968 loops=1)
Join Filter: ("outer".genetreeid = "inner".clustertreeid)
-> Nested Loop (cost=62927.42..127893.86 rows=409 width=162) (actual time=7112.864..11960.324 rows=41904 loops=1)
-> Nested Loop (cost=62927.42..125727.31 rows=369 width=154) (actual time=7112.825..11500.645 rows=13968 loops=1)
-> Merge Join (cost=3.02..7.70 rows=1 width=12) (actual time=0.057..0.073 rows=1 loops=1)
Merge Cond: ("outer".clustertreeid = "inner".genetreeid)
-> Index Scan using clustertree_pk on clustertree tre (cost=0.00..4.35 rows=123 width=4) (actual time=0.017..0.024 rows=2 loops=1)
-> Sort (cost=3.02..3.03 rows=1 width=8) (actual time=0.028..0.030 rows=1 loops=1)
Sort Key: an.genetreeid
-> Index Scan using clusteranalysis_pk on clusteranalysis an (cost=0.00..3.01 rows=1 width=8) (actual time=0.015..0.018 rows=1 loops=1)
Index Cond: (clusteranalysisid = 1)
-> Hash Join (cost=62924.39..125715.53 rows=408 width=150) (actual time=7112.758..11455.797 rows=13968 loops=1)
Hash Cond: ("outer".expressionprofileid = "inner".expressionprofileid)
-> Hash Join (cost=15413.58..78079.33 rows=24339 width=134) (actual time=1489.347..5721.306 rows=41904 loops=1)
Hash Cond: ("outer".expprogeneid = "inner".expprogeneid)
-> Seq Scan on expressionprofile epro (cost=0.00..48263.24 rows=2831824 width=8) (actual time=0.039..3097.656 rows=2839676 loops=1)
-> Hash (cost=15409.72..15409.72 rows=1546 width=130) (actual time=43.365..43.365 rows=0 loops=1)
-> Nested Loop (cost=0.00..15409.72 rows=1546 width=130) (actual time=0.056..40.637 rows=1552 loops=1)
-> Index Scan using expprogene_pk, expprogene_pk, [......] on expprogene epg (cost=0.00..10698.83 rows=1546 width=8) (actual time=0.027..15.907 rows=1552 loops=1)
Index Cond: ((expprogeneid = 80174) OR (expprogeneid = 84567) OR (expprogeneid = 83608) OR [OR ....])
-> Index Scan using reporter_pkey on reporter rep (cost=0.00..3.03 rows=1 width=126) (actual time=0.009..0.010 rows=1 loops=1552)
Index Cond: ("outer".reporterid = rep.reporterid)
-> Hash (cost=47403.68..47403.68 rows=42853 width=16) (actual time=5623.174..5623.174 rows=0 loops=1)
-> Hash Join (cost=2369.91..47403.68 rows=42853 width=16) (actual time=346.040..5538.571 rows=75816 loops=1)
Hash Cond: ("outer".meanvaluesid = "inner".meanvaluesid)
-> Seq Scan on internalexpressionprofile "int" (cost=0.00..34506.16 rows=2019816 width=8) (actual time=0.003..2231.427 rows=2019816 loops=1)
-> Hash (cost=2262.78..2262.78 rows=42853 width=16) (actual time=345.803..345.803 rows=0 loops=1)
-> Nested Loop (cost=17.49..2262.78 rows=42853 width=16) (actual time=1.965..259.363 rows=75816 loops=1)
-> Hash Join (cost=17.49..28.42 rows=6 width=16) (actual time=1.881..2.387 rows=9 loops=1)
Hash Cond: ("outer".replicatesetid = "inner".replicatesetid)
-> Seq Scan on replicateset rset (cost=0.00..9.58 rows=258 width=4) (actual time=0.003..0.295 rows=258 loops=1)
-> Hash (cost=17.47..17.47 rows=6 width=12) (actual time=1.575..1.575 rows=0 loops=1)
-> Hash Join (cost=3.17..17.47 rows=6 width=12) (actual time=0.315..1.557 rows=9 loops=1)
Hash Cond: ("outer".conditionid = "inner".conditionid)
-> Seq Scan on replicateset_condition rsco (cost=0.00..10.83 rows=683 width=8) (actual time=0.004..0.688 rows=683 loops=1)
-> Hash (cost=3.14..3.14 rows=9 width=12) (actual time=0.059..0.059 rows=0 loops=1)
-> Index Scan using clustering_analysis_fk on condition con (cost=0.00..3.14 rows=9 width=12) (actual time=0.019..0.039 rows=9 loops=1)
Index Cond: (clusteranalysisid = 1)
-> Index Scan using has_meanvalues_fk on meanvalues mea (cost=0.00..264.03 rows=8669 width=8) (actual time=0.027..13.032 rows=8424 loops=9)
Index Cond: ("outer".replicatesetid = mea.replicatesetid)
-> Index Scan using comes_from_raw_fk on transformedexpressionprofile tra (cost=0.00..5.86 rows=1 width=16) (actual time=0.010..0.018 rows=3 loops=13968)
Index Cond: (tra.expressionprofileid = "outer".expressionprofileid)
-> Index Scan using _cluster__pk on "CLUSTER" clu (cost=0.00..3.01 rows=1 width=8) (actual time=0.009..0.010 rows=1 loops=41904)
Index Cond: ("outer".clusterid = clu.clusterid)
Total runtime: 12696.289 ms
(48 rows)
I tried setting the enable_seq_scan to off and the query's runtime returned by the explain analyze is 4000ms.
Why postgre is not using the indexes?
What is the real impact of having such a big 'IN' clause?
Thanks in advance,
Luis Cornide
pgsql-performance by date: