Thread: force the use of a particular index
Hello,

I am wondering if there is a way to force the use of a particular index when doing a query. I have two tables that are pretty big (each >3 million rows), and when I do a join between them the performance is generally quite poor as it does not use the indexes that I think it should use. Here is an example query:

SELECT DISTINCT f.name, fl.fmin, fl.fmax, fl.strand, f.type_id, f.feature_id
  FROM feature f, featureloc fl
 WHERE f.feature_id = fl.feature_id
   AND fl.srcfeature_id = 6
   AND fl.fmin <= 2585581
   AND fl.fmax >= 2565581
   AND f.type_id = 219

Now, I know that if the query planner will use an index on featureloc on (srcfeature_id, fmin, fmax), that will reduce the amount of data from the featureloc table from over 3 million rows to at most a few thousand, and it will go quite quickly (if I drop the other indexes on this table, it does use that index and completes in about 1/1000th of the time). After that, the join with the feature table should go quite quickly as well, using the primary key on feature.

So, the question is, is there a way I can force the query planner to use the index I want it to use? I have experimented with using INNER JOIN and changing the order of the tables in the join clause, but nothing seems to work. Any suggestions?

Thanks much,
Scott
--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                  cain@cshl.org
GMOD Coordinator (http://www.gmod.org/)             216-392-3087
Cold Spring Harbor Laboratory
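P.S. For reference, the featureloc index I am hoping the planner will pick is declared roughly like this (a sketch from memory; featureloc_idx3 is the name in my schema):

CREATE INDEX featureloc_idx3 ON featureloc (srcfeature_id, fmin, fmax);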
On Fri, 2003-07-11 at 13:17, Scott Cain wrote:
> The problem (at least as it appears to me) is not that it is performing
> a table scan instead of an index scan, it is that it is using the wrong
> index. Here is the output from EXPLAIN ANALYZE:
>
>                                     QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Unique  (cost=494008.47..494037.59 rows=166 width=54) (actual time=114660.37..114660.38 rows=1 loops=1)
>    ->  Sort  (cost=494008.47..494012.63 rows=1664 width=54) (actual time=114660.37..114660.37 rows=1 loops=1)
>          Sort Key: f.name, fl.fmin, fl.fmax, fl.strand, f.type_id, f.feature_id
>          ->  Nested Loop  (cost=0.00..493919.44 rows=1664 width=54) (actual time=2596.13..114632.90 rows=1 loops=1)
>                ->  Index Scan using feature_pkey on feature f  (cost=0.00..134601.43 rows=52231 width=40) (actual time=105.74..56048.87 rows=13825 loops=1)
>                      Filter: (type_id = 219)
>                ->  Index Scan using featureloc_idx1 on featureloc fl  (cost=0.00..6.87 rows=1 width=14) (actual time=4.23..4.23 rows=0 loops=13825)
>                      Index Cond: ("outer".feature_id = fl.feature_id)
>                      Filter: ((srcfeature_id = 6) AND (fmin <= 2585581) AND (fmax >= 2565581))
>  Total runtime: 114660.91 msec
>
> The index that it is using on featureloc (featureloc_idx1) is on the foreign key
> feature_id. It should instead be using another index, featureloc_idx3,
> which is built on (srcfeature_id, fmin, fmax).

Nope.. The optimizer is right in the decision to use featureloc_idx1.
You will notice it is expecting to retrieve a single row from this
index, but the featureloc_idx3 is bound to be larger (due to indexing
more data), thus take more disk reads for the exact same information (or
in this case, lack thereof).

What is taking a long time is the scan on feature_pkey. It looks like it
is throwing away a ton of rows that are not type_id = 219. Either that,
or you do a pile of deletes and haven't run REINDEX recently.

Create an index consisting of (feature_id, type_id). This will probably
make a significant difference in execution time.
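Something along these lines (a sketch only; the index name is just an example, use whatever fits your naming scheme):

CREATE INDEX feature_idx_feature_type ON feature (feature_id, type_id);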
On Thu, 2003-07-10 at 15:18, Scott Cain wrote:
> Hello,
>
> I am wondering if there is a way to force the use of a particular index
> when doing a query. I have two tables that are pretty big (each >3
> million rows), and when I do a join between them the performance is
> generally quite poor as it does not use the indexes that I think it
> should use. Here is an example query:

Please send the EXPLAIN ANALYZE results for that query with and without
sequential scans enabled.

set enable_seqscan = true;
EXPLAIN ANALYZE <query>;

set enable_seqscan = false;
EXPLAIN ANALYZE <query>;
On Fri, 2003-07-11 at 12:20, Rod Taylor wrote:
> On Fri, 2003-07-11 at 11:36, Scott Cain wrote:
> > Any other ideas?
>
> Out of curiosity, what do you get if you disable hash joins?
>
> set enable_hashjoin = false;

BINGO!

                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=161718.69..161738.38 rows=113 width=53) (actual time=529.03..529.03 rows=1 loops=1)
   ->  Sort  (cost=161718.69..161721.50 rows=1125 width=53) (actual time=529.02..529.02 rows=1 loops=1)
         Sort Key: f.name, fl.fmin, fl.fmax, fl.strand, f.type_id, f.feature_id
         ->  Merge Join  (cost=26493.64..161661.65 rows=1125 width=53) (actual time=416.46..528.77 rows=1 loops=1)
               Merge Cond: ("outer".feature_id = "inner".feature_id)
               ->  Index Scan using feature_pkey on feature f  (cost=0.00..134592.43 rows=47912 width=39) (actual time=0.46..502.50 rows=431 loops=1)
                     Filter: (type_id = 219)
               ->  Sort  (cost=26493.64..26722.33 rows=91476 width=14) (actual time=23.98..24.38 rows=570 loops=1)
                     Sort Key: fl.feature_id
                     ->  Index Scan using featureloc_src_6 on featureloc fl  (cost=0.00..18039.22 rows=91476 width=14) (actual time=15.16..21.85 rows=570 loops=1)
                           Index Cond: ((fmin <= 2585581) AND (fmax >= 2565581))
                           Filter: (srcfeature_id = 6)
 Total runtime: 529.52 msec
(13 rows)

> How about a partial index on (feature_id) where type_id = 219?

That is a possibility. type_id is a foreign key on another table that has
several thousand rows, but in practice there will be only a subset of
those that we are interested in using with this query, so it may not be
too unwieldy to do for each interesting type_id. However, for testing I
just created the partial index on type_id=219 and it was not used, so it
may not make a difference anyway.

Thanks much,
Scott
--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                  cain@cshl.org
GMOD Coordinator (http://www.gmod.org/)             216-392-3087
Cold Spring Harbor Laboratory
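P.S. The partial index I created for that test looked roughly like this (a sketch; the index name here is just what I happened to use):

CREATE INDEX feature_feature_id_type_219 ON feature (feature_id) WHERE type_id = 219;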
> > set enable_hashjoin = false;
>
> BINGO!

I'm not so sure about that. Your dataset seems to have changed fairly
significantly since the last test.

>    ->  Index Scan using feature_pkey on feature f  (cost=0.00..134592.43 rows=47912 width=39) (actual time=0.46..502.50 rows=431 loops=1)

Notice it only pulled out 431 rows where prior runs pulled out several
thousand (~13000). I think what really happened was something came along
and deleted a bunch of stuff, then vacuum ran.
Scott Cain <cain@cshl.org> writes:
> So, the question is, is there a way I can force the query planner to use
> the index I want it to use?

No (and I don't think there should be). Given that it *can* generate
the plan you want, this is clearly an estimation failure. What is the
index it does use? Would you show us EXPLAIN ANALYZE results when
using each index?

			regards, tom lane
Hi Tom,

Embarrassingly, I can't. I've been monkeying with the database so much
that I can't seem to get it back to the state where I can reproduce the
behavior I want. A database drop and reload may be the only way, but
since that is a time consuming thing to do, I won't be able to do it
until this evening.

Thanks,
Scott

On Fri, 2003-07-11 at 11:24, Tom Lane wrote:
> Scott Cain <cain@cshl.org> writes:
> > So, the question is, is there a way I can force the query planner to use
> > the index I want it to use?
>
> No (and I don't think there should be). Given that it *can* generate
> the plan you want, this is clearly an estimation failure. What is the
> index it does use? Would you show us EXPLAIN ANALYZE results when
> using each index?
>
> 			regards, tom lane
--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                  cain@cshl.org
GMOD Coordinator (http://www.gmod.org/)             216-392-3087
Cold Spring Harbor Laboratory
The problem (at least as it appears to me) is not that it is performing
a table scan instead of an index scan, it is that it is using the wrong
index. Here is the output from EXPLAIN ANALYZE:

                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=494008.47..494037.59 rows=166 width=54) (actual time=114660.37..114660.38 rows=1 loops=1)
   ->  Sort  (cost=494008.47..494012.63 rows=1664 width=54) (actual time=114660.37..114660.37 rows=1 loops=1)
         Sort Key: f.name, fl.fmin, fl.fmax, fl.strand, f.type_id, f.feature_id
         ->  Nested Loop  (cost=0.00..493919.44 rows=1664 width=54) (actual time=2596.13..114632.90 rows=1 loops=1)
               ->  Index Scan using feature_pkey on feature f  (cost=0.00..134601.43 rows=52231 width=40) (actual time=105.74..56048.87 rows=13825 loops=1)
                     Filter: (type_id = 219)
               ->  Index Scan using featureloc_idx1 on featureloc fl  (cost=0.00..6.87 rows=1 width=14) (actual time=4.23..4.23 rows=0 loops=13825)
                     Index Cond: ("outer".feature_id = fl.feature_id)
                     Filter: ((srcfeature_id = 6) AND (fmin <= 2585581) AND (fmax >= 2565581))
 Total runtime: 114660.91 msec

This is the same regardless of enable_seqscan's setting. The index that
it is using on featureloc (featureloc_idx1) is on the foreign key
feature_id. It should instead be using another index, featureloc_idx3,
which is built on (srcfeature_id, fmin, fmax).

I should also mention that I've done a VACUUM FULL ANALYZE on this
database, and I've been using it for a while, and this is the primary
type of query I perform on the database.

Thanks,
Scott

On Fri, 2003-07-11 at 06:51, Rod Taylor wrote:
> On Thu, 2003-07-10 at 15:18, Scott Cain wrote:
> > Hello,
> >
> > I am wondering if there is a way to force the use of a particular index
> > when doing a query. I have two tables that are pretty big (each >3
> > million rows), and when I do a join between them the performance is
> > generally quite poor as it does not use the indexes that I think it
> > should use. Here is an example query:
>
> Please send the EXPLAIN ANALYZE results for that query with and without
> sequential scans enabled.
>
> set enable_seqscan = true;
> EXPLAIN ANALYZE <query>;
>
> set enable_seqscan = false;
> EXPLAIN ANALYZE <query>;
--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                  cain@cshl.org
GMOD Coordinator (http://www.gmod.org/)             216-392-3087
Cold Spring Harbor Laboratory
Rod,

I see what you mean about the scan on the feature_pkey taking a long
time. I tried several things to remedy that. I created an index on
feature (feature_id, type_id) (which I don't think makes sense since
feature_id is the primary key, so adding another column really doesn't
help). I also created an index on feature (type_id, feature_id), but the
planner doesn't use it. Also, there was an already existing index on
feature (type_id) that the planner never used.

One thing I tried that changed the query plan and improved performance
slightly (but still nowhere near what I need) was to add a partial index
on featureloc on (fmin, fmax) where srcfeature_id=6 (a sketch of that
index definition is appended after the quoted message below). This is
something I could realistically do since there are relatively few (>30)
srcfeature_ids that I am interested in, so putting in place a partial
index for each of them would not be a big deal. Nevertheless, the
performance is still not there. Here is the EXPLAIN ANALYZE for this
situation:

                                    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=156172.23..156200.11 rows=159 width=54) (actual time=63631.93..63631.93 rows=1 loops=1)
   ->  Sort  (cost=156172.23..156176.21 rows=1594 width=54) (actual time=63631.93..63631.93 rows=1 loops=1)
         Sort Key: f.name, fl.fmin, fl.fmax, fl.strand, f.type_id, f.feature_id
         ->  Hash Join  (cost=135100.30..156087.46 rows=1594 width=54) (actual time=63631.29..63631.79 rows=1 loops=1)
               Hash Cond: ("outer".feature_id = "inner".feature_id)
               ->  Index Scan using featureloc_src_6 on featureloc fl  (cost=0.00..18064.99 rows=101883 width=14) (actual time=26.11..430.00 rows=570 loops=1)
                     Index Cond: ((fmin <= 2585581) AND (fmax >= 2565581))
                     Filter: (srcfeature_id = 6)
               ->  Hash  (cost=134601.43..134601.43 rows=48347 width=40) (actual time=63182.86..63182.86 rows=0 loops=1)
                     ->  Index Scan using feature_pkey on feature f  (cost=0.00..134601.43 rows=48347 width=40) (actual time=69.98..62978.27 rows=13825 loops=1)
                           Filter: (type_id = 219)
 Total runtime: 63632.28 msec
(12 rows)

Any other ideas?

Thanks,
Scott

On Fri, 2003-07-11 at 09:38, Rod Taylor wrote:
> On Fri, 2003-07-11 at 13:17, Scott Cain wrote:
> > The problem (at least as it appears to me) is not that it is performing
> > a table scan instead of an index scan, it is that it is using the wrong
> > index. Here is the output from EXPLAIN ANALYZE:
> >
> >                                     QUERY PLAN
> > ------------------------------------------------------------------------------------------------------------------------------------------------------------
> >  Unique  (cost=494008.47..494037.59 rows=166 width=54) (actual time=114660.37..114660.38 rows=1 loops=1)
> >    ->  Sort  (cost=494008.47..494012.63 rows=1664 width=54) (actual time=114660.37..114660.37 rows=1 loops=1)
> >          Sort Key: f.name, fl.fmin, fl.fmax, fl.strand, f.type_id, f.feature_id
> >          ->  Nested Loop  (cost=0.00..493919.44 rows=1664 width=54) (actual time=2596.13..114632.90 rows=1 loops=1)
> >                ->  Index Scan using feature_pkey on feature f  (cost=0.00..134601.43 rows=52231 width=40) (actual time=105.74..56048.87 rows=13825 loops=1)
> >                      Filter: (type_id = 219)
> >                ->  Index Scan using featureloc_idx1 on featureloc fl  (cost=0.00..6.87 rows=1 width=14) (actual time=4.23..4.23 rows=0 loops=13825)
> >                      Index Cond: ("outer".feature_id = fl.feature_id)
> >                      Filter: ((srcfeature_id = 6) AND (fmin <= 2585581) AND (fmax >= 2565581))
> >  Total runtime: 114660.91 msec
> >
> > The index that it is using on featureloc (featureloc_idx1) is on the foreign key
> > feature_id. It should instead be using another index, featureloc_idx3,
> > which is built on (srcfeature_id, fmin, fmax).
>
> Nope.. The optimizer is right in the decision to use featureloc_idx1.
> You will notice it is expecting to retrieve a single row from this
> index, but the featureloc_idx3 is bound to be larger (due to indexing
> more data), thus take more disk reads for the exact same information (or
> in this case, lack thereof).
>
> What is taking a long time is the scan on feature_pkey. It looks like it
> is throwing away a ton of rows that are not type_id = 219. Either that,
> or you do a pile of deletes and haven't run REINDEX recently.
>
> Create an index consisting of (feature_id, type_id). This will probably
> make a significant difference in execution time.
--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                  cain@cshl.org
GMOD Coordinator (http://www.gmod.org/)             216-392-3087
Cold Spring Harbor Laboratory
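P.S. As mentioned above, the partial index I added looks roughly like this (a sketch; featureloc_src_6 is the name that shows up in the plan):

CREATE INDEX featureloc_src_6 ON featureloc (fmin, fmax) WHERE srcfeature_id = 6;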
On Fri, 2003-07-11 at 11:36, Scott Cain wrote:
> Rod,
>
> I see what you mean about the scan on the feature_pkey taking a long
> time. I tried several things to remedy that. I created an index on
> feature (feature_id,type_id) (which I don't think makes sense since
> feature_id is the primary key, so add another column really doesn't

It may be the primary key, but the system looked like it was throwing
away many rows based on type_id. If it was throwing away many more rows
than found, the index with type_id may have been cheaper. It is
difficult to tell from an EXPLAIN ANALYZE as it doesn't tell you exactly
how many rows were filtered, just the cost to read them and how many
were used after the filter.

> help). I also created a index on feature (type_id, feature_id), but the
> planner doesn't use it. Also, there was an already existing index on
> feature (type_id) that the planner never used.

It cannot use more than one index for a given table scan at the moment.
There are proposals on how to 'fix' that, but those require significant
overhauls of various systems.

> Any other ideas?

Out of curiosity, what do you get if you disable hash joins?

set enable_hashjoin = false;

How about a partial index on (feature_id) where type_id = 219?
On Fri, 2003-07-11 at 14:14, Rod Taylor wrote:
> > > set enable_hashjoin = false;
> >
> > BINGO!
>
> I'm not so sure about that. Your dataset seems to have changed fairly
> significantly since the last test.
>
> >    ->  Index Scan using feature_pkey on feature f  (cost=0.00..134592.43 rows=47912 width=39) (actual time=0.46..502.50 rows=431 loops=1)
>
> Notice it only pulled out 431 rows where prior runs pulled out several
> thousand (~13000). I think what really happened was something came along
> and deleted a bunch of stuff, then vacuum ran.

There is nearly a zero chance that happened. This database is accessible
only by me, and I haven't deleted anything. The only things I have done
are to create and drop various indexes and run vacuum.

Is there anything else that could explain the difference? Is the index
scan on feature_pkey using information from the index scan on
featureloc_src_6 to limit the number of rows to get from feature?

Scott
--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                  cain@cshl.org
GMOD Coordinator (http://www.gmod.org/)             216-392-3087
Cold Spring Harbor Laboratory
Scott Cain <cain@cshl.org> writes:
> Embarrassingly, I can't. I've been monkeying with the database so much
> that I can't seem to get it back to the state where I can reproduce the
> behavior I want.

If the thing works as desired after a VACUUM ANALYZE, then I suggest the
estimation failure was just due to out-of-date statistics ...

			regards, tom lane
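P.S. In other words, something along these lines for the two tables involved (a minimal sketch; a plain ANALYZE would also refresh the planner statistics):

VACUUM ANALYZE feature;
VACUUM ANALYZE featureloc;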