Thread: performace problem after VACUUM ANALYZE
Hello, I am going to do my best to describe this problem, but the description may be quite long. Also, this is my first post to this list, so I miss important details please let me know. All of the following analysis was done on my P4 laptop with 0.5 Gig ram and postgresql 7.3 installed from RPM for RedHat linux 7.3 I have a database with two large tables that I need to do a join on. Here is the query that is causing the problems: select distinct f.name,fl.nbeg,fl.nend,fl.strand,f.type_id,f.feature_id from feature f, featureloc fl where fl.srcfeature_id = 1 and ((fl.strand=1 and fl.nbeg <= 393164 and fl.nend >= 390956) OR (fl.strand=-1 and fl.nend <= 393164 and fl.nbeg >= 390956)) and f.feature_id = fl.feature_id The feature table has 3,685,287 rows and featureloc has 3,803,762 rows. Here are all of the relevant indexes on these tables: Index "feature_pkey" Column | Type ------------+--------- feature_id | integer unique btree (primary key) Index "featureloc_idx1" Column | Type ------------+--------- feature_id | integer btree Index "featureloc_idx2" Column | Type ---------------+--------- srcfeature_id | integer btree Index "featureloc_src_strand_beg_end" Column | Type ---------------+---------- srcfeature_id | integer strand | smallint nbeg | integer nend | integer btree In a naive database (ie, no ANALYZE has ever been run), the above query runs in about 3 seconds after the first time it has been run (due to caching?). Here is the output of EXPLAIN on that query: Unique (cost=75513.46..75513.48 rows=1 width=167) -> Sort (cost=75513.46..75513.46 rows=1 width=167) -> Nested Loop (cost=0.00..75513.45 rows=1 width=167) -> Index Scan using featureloc_idx2 on featureloc fl (cost=0.00..75508.43 rows=1 width=14) -> Index Scan using feature_pkey on feature f (cost=0.00..5.01 rows=1 width=153) Notice that for featureloc it is using featureloc_idx2, which is the index on srcfeature_id. Ideally, it would be using featureloc_src_strand_beg_end, but this is not bad. In fact, if I drop featureloc_idx2 (not something I can do in a production database), it does use that index and cuts the query runtime in half. Now comes the really bizarre part: if I run VACUUM ANALYZE, the performance becomes truly awful. Instead of using an index for featureloc, it now does a seq scan, causing the runtime to go up to about 30 seconds. Here is the output of explain now: Unique (cost=344377.70..344759.85 rows=2548 width=47) -> Sort (cost=344377.70..344377.70 rows=25477 width=47) -> Nested Loop (cost=0.00..342053.97 rows=25477 width=47) -> Seq Scan on featureloc fl (cost=0.00..261709.31 rows=25477 width=14) -> Index Scan using feature_pkey on feature f (cost=0.00..3.14 rows=1 width=33) If I try to force it to use an index by setting enable_seqscan=0, it then uses featureloc_idx1, the index on feature_id. This has slightly worse performance than the seq scan, at just over 30 seconds. Here is the output of explain for this case: Unique (cost=356513.46..356895.61 rows=2548 width=47) -> Sort (cost=356513.46..356513.46 rows=25477 width=47) -> Nested Loop (cost=0.00..354189.73 rows=25477 width=47) -> Index Scan using featureloc_idx1 on featureloc fl (cost=0.00..273845.08 rows=25477 width=14) -> Index Scan using feature_pkey on feature f (cost=0.00..3.14 rows=1 width=33) Now, if I drop featureloc_idx1 (again, not something I can do in production) and still disallow seq scans, it uses featureloc_idx2, as it did with the naive database above, but the performance is much worse, running about 24 seconds for the query. Here is the output of explain: Unique (cost=1310195.21..1310577.36 rows=2548 width=47) -> Sort (cost=1310195.21..1310195.21 rows=25477 width=47) -> Nested Loop (cost=0.00..1307871.48 rows=25477 width=47) -> Index Scan using featureloc_idx2 on featureloc fl (cost=0.00..1227526.82 rows=25477 width=14) -> Index Scan using feature_pkey on feature f (cost=0.00..3.14 rows=1 width=33) Finally, if I drop featureloc_idx2, it uses the right index, and the runtime gets back to about 2 seconds, but the database is unusable for other queries because I dropped the other indexes and disallowed seq scans. Here is the output for explain in this case: Unique (cost=1414516.98..1414899.13 rows=2548 width=47) -> Sort (cost=1414516.98..1414516.98 rows=25477 width=47) -> Nested Loop (cost=0.00..1412193.25 rows=25477 width=47) -> Index Scan using featureloc_src_strand_beg_end on featureloc fl (cost=0.00..1331848.60 rows=25477 width=14) -> Index Scan using feature_pkey on feature f (cost=0.00..3.14 rows=1 width=33) Now the question: can anybody tell me why the query planner does such a bad job figuring out how to run this query after I run VACUUM ANALYZE, and can you give me any advice on making this arrangement work without forcing postgres' hand by limiting its options until it gets it right? Thank you very much for reading down to here, Scott -- ------------------------------------------------------------------------ Scott Cain, Ph. D. cain@cshl.org GMOD Coordinator (http://www.gmod.org/) 216-392-3087 Cold Spring Harbor Laboratory
Scott Cain <cain@cshl.org> writes: > [ much stuff ] Could we see EXPLAIN ANALYZE, not just EXPLAIN, output for all these alternatives? Your question boils down to "why is the planner misestimating these queries" ... which is a difficult question to answer when given only the estimates and not the reality. A suggestion though is that you might need to raise the statistics target on the indexed columns, so that ANALYZE will collect finer-grained statistics. (See ALTER TABLE ... SET STATISTICS.) Try booting it up to 100 (from the default 10), re-analyze, and then see if/how the plans change. regards, tom lane
Tom, Sorry about that: I'll try to briefly give the information you are looking for. I've read the docs on ALTER TABLE, but it is not clear to me what columns I should change STATISTICS on, or should I just do it on all of the columns for which indexes exist? Here's the query again: select distinct f.name,fl.nbeg,fl.nend,fl.strand,f.type_id,f.feature_id from feature f, featureloc fl where fl.srcfeature_id = 1 and ((fl.strand=1 and fl.nbeg <= 393164 and fl.nend >= 390956) OR (fl.strand=-1 and fl.nend <= 393164 and fl.nbeg >= 390956)) and f.feature_id = fl.feature_id -------------------------------------------------------------------------- Naive database: Unique (cost=75513.46..75513.48 rows=1 width=167) (actual time=22815.25..22815.93 rows=179 loops=1) -> Sort (cost=75513.46..75513.46 rows=1 width=167) (actual time=22815.24..22815.43 rows=186 loops=1) -> Nested Loop (cost=0.00..75513.45 rows=1 width=167) (actual time=2471.25..22814.01 rows=186 loops=1) -> Index Scan using featureloc_idx2 on featureloc fl (cost=0.00..75508.43 rows=1 width=14) (actual time=2463.83..22796.50 rows=186 loops=1) -> Index Scan using feature_pkey on feature f (cost=0.00..5.01 rows=1 width=153) (actual time=0.08..0.08 rows=1 loops=186) Total runtime: 22816.63 msec -------------------------------------------------------------------------- Naive database after featureloc_idx2 dropped: Unique (cost=75545.46..75545.48 rows=1 width=167) (actual time=5232.36..5234.51 rows=179 loops=1) -> Sort (cost=75545.46..75545.46 rows=1 width=167) (actual time=5232.35..5232.54 rows=186 loops=1) -> Nested Loop (cost=0.00..75545.45 rows=1 width=167) (actual time=291.46..5220.69 rows=186 loops=1) -> Index Scan using featureloc_src_strand_beg_end on featureloc fl (cost=0.00..75540.43 rows=1 width=14) (actual time=291.30..5214.46 rows=186 loops=1) -> Index Scan using feature_pkey on feature f (cost=0.00..5.01 rows=1 width=153) (actual time=0.02..0.03 rows=1 loops=186) Total runtime: 5234.89 msec -------------------------------------------------------------------------- Database after VACUUM ANALYZE was run: Unique (cost=344377.70..344759.85 rows=2548 width=47) (actual time=26466.82..26467.51 rows=179 loops=1) -> Sort (cost=344377.70..344377.70 rows=25477 width=47) (actual time=26466.82..26467.01 rows=186 loops=1) -> Nested Loop (cost=0.00..342053.97 rows=25477 width=47) (actual time=262.66..26465.63 rows=186 loops=1) -> Seq Scan on featureloc fl (cost=0.00..261709.31 rows=25477 width=14) (actual time=118.62..26006.05 rows=186 loops=1) -> Index Scan using feature_pkey on feature f (cost=0.00..3.14 rows=1 width=33) (actual time=2.45..2.46 rows=1 loops=186) Total runtime: 26467.85 msec -------------------------------------------------------------------------- After disallowing seqscans (set enable_seqscan=0): Unique (cost=356513.46..356895.61 rows=2548 width=47) (actual time=27494.62..27495.34 rows=179 loops=1) -> Sort (cost=356513.46..356513.46 rows=25477 width=47) (actual time=27494.61..27494.83 rows=186 loops=1) -> Nested Loop (cost=0.00..354189.73 rows=25477 width=47) (actual time=198.88..27493.48 rows=186 loops=1) -> Index Scan using featureloc_idx1 on featureloc fl (cost=0.00..273845.08 rows=25477 width=14) (actual time=129.30..27280.95 rows=186 loops=1) -> Index Scan using feature_pkey on feature f (cost=0.00..3.14 rows=1 width=33) (actual time=1.13..1.13 rows=1 loops=186) Total runtime: 27495.66 msec -------------------------------------------------------------------------- After dropping featureloc_idx1: Unique (cost=1310195.21..1310577.36 rows=2548 width=47) (actual time=21692.69..21693.37 rows=179 loops=1) -> Sort (cost=1310195.21..1310195.21 rows=25477 width=47) (actual time=21692.69..21692.88 rows=186 loops=1) -> Nested Loop (cost=0.00..1307871.48 rows=25477 width=47) (actual time=2197.65..21691.39 rows=186 loops=1) -> Index Scan using featureloc_idx2 on featureloc fl (cost=0.00..1227526.82 rows=25477 width=14) (actual time=2197.49..21618.89 rows=186 loops=1) -> Index Scan using feature_pkey on feature f (cost=0.00..3.14 rows=1 width=33) (actual time=0.37..0.38 rows=1 loops=186) Total runtime: 21693.72 msec -------------------------------------------------------------------------- After dropping featureloc_idx2: Unique (cost=1414516.98..1414899.13 rows=2548 width=47) (actual time=1669.17..1669.86 rows=179 loops=1) -> Sort (cost=1414516.98..1414516.98 rows=25477 width=47) (actual time=1669.17..1669.36 rows=186 loops=1) -> Nested Loop (cost=0.00..1412193.25 rows=25477 width=47) (actual time=122.69..1668.08 rows=186 loops=1) -> Index Scan using featureloc_src_strand_beg_end on featureloc fl (cost=0.00..1331848.60 rows=25477 width=14) (actual time=122.51..1661.81 rows=186 loops=1) -> Index Scan using feature_pkey on feature f (cost=0.00..3.14 rows=1 width=33) (actual time=0.02..0.03 rows=1 loops=186) Total runtime: 1670.20 msec On Fri, 2003-02-14 at 12:00, Tom Lane wrote: > Scott Cain <cain@cshl.org> writes: > > [ much stuff ] > > Could we see EXPLAIN ANALYZE, not just EXPLAIN, output for all these > alternatives? Your question boils down to "why is the planner > misestimating these queries" ... which is a difficult question to > answer when given only the estimates and not the reality. > > A suggestion though is that you might need to raise the statistics > target on the indexed columns, so that ANALYZE will collect > finer-grained statistics. (See ALTER TABLE ... SET STATISTICS.) > Try booting it up to 100 (from the default 10), re-analyze, and > then see if/how the plans change. > > regards, tom lane -- ------------------------------------------------------------------------ Scott Cain, Ph. D. cain@cshl.org GMOD Coordinator (http://www.gmod.org/) 216-392-3087 Cold Spring Harbor Laboratory
An update: I ran alter table as suggested, ie, alter table featureloc alter srcfeature_id set statistics 100; on each column in the table, running vacuum analyze and explain analyze on the query in between each alter to see if it made any difference. It did not. Postgres still instists on doing a seq scan on featureloc: Unique (cost=336831.46..337179.45 rows=2320 width=47) (actual time=27219.62..27220.30 rows=179 loops=1) -> Sort (cost=336831.46..336831.46 rows=23200 width=47) (actual time=27219.61..27219.80 rows=186 loops=1) -> Nested Loop (cost=0.00..334732.77 rows=23200 width=47) (actual time=1003.04..27217.99 rows=186 loops=1) -> Seq Scan on featureloc fl (cost=0.00..261709.31 rows=23200 width=14) (actual time=814.68..26094.18 rows=186 loops=1) -> Index Scan using feature_pkey on feature f (cost=0.00..3.14 rows=1 width=33) (actual time=6.03..6.03 rows=1 loops=186) Total runtime: 27220.63 msec On Fri, 2003-02-14 at 12:29, Scott Cain wrote: > Tom, > > Sorry about that: I'll try to briefly give the information you are > looking for. I've read the docs on ALTER TABLE, but it is not clear to > me what columns I should change STATISTICS on, or should I just do it on > all of the columns for which indexes exist? > > Here's the query again: > > select distinct f.name,fl.nbeg,fl.nend,fl.strand,f.type_id,f.feature_id > from feature f, featureloc fl > where > fl.srcfeature_id = 1 and > ((fl.strand=1 and fl.nbeg <= 393164 and fl.nend >= 390956) OR > (fl.strand=-1 and fl.nend <= 393164 and fl.nbeg >= 390956)) and > f.feature_id = fl.feature_id > > -------------------------------------------------------------------------- > > Naive database: > > Unique (cost=75513.46..75513.48 rows=1 width=167) (actual > time=22815.25..22815.93 rows=179 loops=1) > -> Sort (cost=75513.46..75513.46 rows=1 width=167) (actual > time=22815.24..22815.43 rows=186 loops=1) > -> Nested Loop (cost=0.00..75513.45 rows=1 width=167) (actual > time=2471.25..22814.01 rows=186 loops=1) > -> Index Scan using featureloc_idx2 on featureloc fl > (cost=0.00..75508.43 rows=1 width=14) (actual time=2463.83..22796.50 > rows=186 loops=1) > -> Index Scan using feature_pkey on feature f > (cost=0.00..5.01 rows=1 width=153) (actual time=0.08..0.08 rows=1 > loops=186) > Total runtime: 22816.63 msec > -------------------------------------------------------------------------- > > Naive database after featureloc_idx2 dropped: > > Unique (cost=75545.46..75545.48 rows=1 width=167) (actual > time=5232.36..5234.51 rows=179 loops=1) > -> Sort (cost=75545.46..75545.46 rows=1 width=167) (actual > time=5232.35..5232.54 rows=186 loops=1) > -> Nested Loop (cost=0.00..75545.45 rows=1 width=167) (actual > time=291.46..5220.69 rows=186 loops=1) > -> Index Scan using featureloc_src_strand_beg_end on > featureloc fl (cost=0.00..75540.43 rows=1 width=14) (actual > time=291.30..5214.46 rows=186 loops=1) > -> Index Scan using feature_pkey on feature f > (cost=0.00..5.01 rows=1 width=153) (actual time=0.02..0.03 rows=1 > loops=186) > Total runtime: 5234.89 msec > -------------------------------------------------------------------------- > > Database after VACUUM ANALYZE was run: > > Unique (cost=344377.70..344759.85 rows=2548 width=47) (actual > time=26466.82..26467.51 rows=179 loops=1) > -> Sort (cost=344377.70..344377.70 rows=25477 width=47) (actual > time=26466.82..26467.01 rows=186 loops=1) > -> Nested Loop (cost=0.00..342053.97 rows=25477 width=47) > (actual time=262.66..26465.63 rows=186 loops=1) > -> Seq Scan on featureloc fl (cost=0.00..261709.31 > rows=25477 width=14) (actual time=118.62..26006.05 rows=186 loops=1) > -> Index Scan using feature_pkey on feature f > (cost=0.00..3.14 rows=1 width=33) (actual time=2.45..2.46 rows=1 > loops=186) > Total runtime: 26467.85 msec > -------------------------------------------------------------------------- > > After disallowing seqscans (set enable_seqscan=0): > > Unique (cost=356513.46..356895.61 rows=2548 width=47) (actual > time=27494.62..27495.34 rows=179 loops=1) > -> Sort (cost=356513.46..356513.46 rows=25477 width=47) (actual > time=27494.61..27494.83 rows=186 loops=1) > -> Nested Loop (cost=0.00..354189.73 rows=25477 width=47) > (actual time=198.88..27493.48 rows=186 loops=1) > -> Index Scan using featureloc_idx1 on featureloc fl > (cost=0.00..273845.08 rows=25477 width=14) (actual time=129.30..27280.95 > rows=186 loops=1) > -> Index Scan using feature_pkey on feature f > (cost=0.00..3.14 rows=1 width=33) (actual time=1.13..1.13 rows=1 > loops=186) > Total runtime: 27495.66 msec > -------------------------------------------------------------------------- > > After dropping featureloc_idx1: > > Unique (cost=1310195.21..1310577.36 rows=2548 width=47) (actual > time=21692.69..21693.37 rows=179 loops=1) > -> Sort (cost=1310195.21..1310195.21 rows=25477 width=47) (actual > time=21692.69..21692.88 rows=186 loops=1) > -> Nested Loop (cost=0.00..1307871.48 rows=25477 width=47) > (actual time=2197.65..21691.39 rows=186 loops=1) > -> Index Scan using featureloc_idx2 on featureloc fl > (cost=0.00..1227526.82 rows=25477 width=14) (actual > time=2197.49..21618.89 rows=186 loops=1) > -> Index Scan using feature_pkey on feature f > (cost=0.00..3.14 rows=1 width=33) (actual time=0.37..0.38 rows=1 > loops=186) > Total runtime: 21693.72 msec > -------------------------------------------------------------------------- > > After dropping featureloc_idx2: > > Unique (cost=1414516.98..1414899.13 rows=2548 width=47) (actual > time=1669.17..1669.86 rows=179 loops=1) > -> Sort (cost=1414516.98..1414516.98 rows=25477 width=47) (actual > time=1669.17..1669.36 rows=186 loops=1) > -> Nested Loop (cost=0.00..1412193.25 rows=25477 width=47) > (actual time=122.69..1668.08 rows=186 loops=1) > -> Index Scan using featureloc_src_strand_beg_end on > featureloc fl (cost=0.00..1331848.60 rows=25477 width=14) (actual > time=122.51..1661.81 rows=186 loops=1) > -> Index Scan using feature_pkey on feature f > (cost=0.00..3.14 rows=1 width=33) (actual time=0.02..0.03 rows=1 > loops=186) > Total runtime: 1670.20 msec > > > On Fri, 2003-02-14 at 12:00, Tom Lane wrote: > > Scott Cain <cain@cshl.org> writes: > > > [ much stuff ] > > > > Could we see EXPLAIN ANALYZE, not just EXPLAIN, output for all these > > alternatives? Your question boils down to "why is the planner > > misestimating these queries" ... which is a difficult question to > > answer when given only the estimates and not the reality. > > > > A suggestion though is that you might need to raise the statistics > > target on the indexed columns, so that ANALYZE will collect > > finer-grained statistics. (See ALTER TABLE ... SET STATISTICS.) > > Try booting it up to 100 (from the default 10), re-analyze, and > > then see if/how the plans change. > > > > regards, tom lane -- ------------------------------------------------------------------------ Scott Cain, Ph. D. cain@cshl.org GMOD Coordinator (http://www.gmod.org/) 216-392-3087 Cold Spring Harbor Laboratory
PG really does not do the right thing as I mentioned earlier for joining tables. To force not to use seqscan, it still does not use right index (srcfeature_id, nbeg, nend) and performance is even worse. c_gadfly3=# \d fl_src_b_e_key; Index "public.fl_src_b_e_key" Column | Type ---------------+--------- srcfeature_id | integer nbeg | integer nend | integer btree, for table "public.featureloc" c_gadfly3=# explain analyze select * from featureloc fl, feature f where f.feature_id = fl.feature_id and srcfeature_id=1 and (nbeg >= 1000 and nend <= 2000); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------ Merge Join (cost=115700.27..232516.85 rows=69007 width=445) (actual time=12342.97..12461.23 rows=32 loops=1) Merge Cond: ("outer".feature_id = "inner".feature_id) -> Index Scan using feature_pkey on feature f (cost=0.00..110535.53 rows=2060653 width=361) (actual time=17.85..490.86 rows=28341 loops=1) -> Sort (cost=115700.27..115872.79 rows=69007 width=84) (actual time=11944.23..11944.25 rows=32 loops=1) Sort Key: fl.feature_id -> Seq Scan on featureloc fl (cost=0.00..107580.43 rows=69007 width=84) (actual time=375.85..11944.10 rows=32 loops=1) Filter: ((srcfeature_id = 1) AND (nbeg >= 1000) AND (nend <= 2000)) Total runtime: 12461.37 msec (8 rows) c_gadfly3=# c_gadfly3=# set enable_seqscan=0; SET c_gadfly3=# explain analyze select * from featureloc fl, feature f where f.feature_id = fl.feature_id and srcfeature_id=1 and (nbeg >= 1000 and nend <= 2000); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------ Merge Join (cost=0.00..236345.49 rows=69007 width=445) (actual time=721.75..26078.64 rows=32 loops=1) Merge Cond: ("outer".feature_id = "inner".feature_id) -> Index Scan using fl_feature_id_key on featureloc fl (cost=0.00..119701.43 rows=69007 width=84) (actual time=549.14..25854.12 rows=32 loops=1) Filter: ((srcfeature_id = 1) AND (nbeg >= 1000) AND (nend <= 2000)) -> Index Scan using feature_pkey on feature f (cost=0.00..110535.53 rows=2060653 width=361) (actual time=50.95..200.37 rows=28342 loops=1) Total runtime: 26078.80 msec (6 rows) Scott Cain wrote: > An update: I ran alter table as suggested, ie, > > alter table featureloc alter srcfeature_id set statistics 100; > > on each column in the table, running vacuum analyze and explain analyze > on the query in between each alter to see if it made any difference. It > did not. Postgres still instists on doing a seq scan on featureloc: > > Unique (cost=336831.46..337179.45 rows=2320 width=47) (actual > time=27219.62..27220.30 rows=179 loops=1) > -> Sort (cost=336831.46..336831.46 rows=23200 width=47) (actual > time=27219.61..27219.80 rows=186 loops=1) > -> Nested Loop (cost=0.00..334732.77 rows=23200 width=47) > (actual time=1003.04..27217.99 rows=186 loops=1) > -> Seq Scan on featureloc fl (cost=0.00..261709.31 > rows=23200 width=14) (actual time=814.68..26094.18 rows=186 loops=1) > -> Index Scan using feature_pkey on feature f > (cost=0.00..3.14 rows=1 width=33) (actual time=6.03..6.03 rows=1 > loops=186) > Total runtime: 27220.63 msec > > > On Fri, 2003-02-14 at 12:29, Scott Cain wrote: > > >Tom, > > > >Sorry about that: I'll try to briefly give the information you are > >looking for. I've read the docs on ALTER TABLE, but it is not clear to > >me what columns I should change STATISTICS on, or should I just do it on > >all of the columns for which indexes exist? > > > >Here's the query again: > > > >select distinct f.name,fl.nbeg,fl.nend,fl.strand,f.type_id,f.feature_id > > from feature f, featureloc fl > > where > > fl.srcfeature_id = 1 and > > ((fl.strand=1 and fl.nbeg <= 393164 and fl.nend >= 390956) OR > > (fl.strand=-1 and fl.nend <= 393164 and fl.nbeg >= 390956)) and > > f.feature_id = fl.feature_id > > > >-------------------------------------------------------------------------- > > > >Naive database: > > > >Unique (cost=75513.46..75513.48 rows=1 width=167) (actual > >time=22815.25..22815.93 rows=179 loops=1) > > -> Sort (cost=75513.46..75513.46 rows=1 width=167) (actual > >time=22815.24..22815.43 rows=186 loops=1) > > -> Nested Loop (cost=0.00..75513.45 rows=1 width=167) (actual > >time=2471.25..22814.01 rows=186 loops=1) > > -> Index Scan using featureloc_idx2 on featureloc fl > >(cost=0.00..75508.43 rows=1 width=14) (actual time=2463.83..22796.50 > >rows=186 loops=1) > > -> Index Scan using feature_pkey on feature f > >(cost=0.00..5.01 rows=1 width=153) (actual time=0.08..0.08 rows=1 > >loops=186) > >Total runtime: 22816.63 msec > >-------------------------------------------------------------------------- > > > >Naive database after featureloc_idx2 dropped: > > > >Unique (cost=75545.46..75545.48 rows=1 width=167) (actual > >time=5232.36..5234.51 rows=179 loops=1) > > -> Sort (cost=75545.46..75545.46 rows=1 width=167) (actual > >time=5232.35..5232.54 rows=186 loops=1) > > -> Nested Loop (cost=0.00..75545.45 rows=1 width=167) (actual > >time=291.46..5220.69 rows=186 loops=1) > > -> Index Scan using featureloc_src_strand_beg_end on > >featureloc fl (cost=0.00..75540.43 rows=1 width=14) (actual > >time=291.30..5214.46 rows=186 loops=1) > > -> Index Scan using feature_pkey on feature f > >(cost=0.00..5.01 rows=1 width=153) (actual time=0.02..0.03 rows=1 > >loops=186) > >Total runtime: 5234.89 msec > >-------------------------------------------------------------------------- > > > >Database after VACUUM ANALYZE was run: > > > >Unique (cost=344377.70..344759.85 rows=2548 width=47) (actual > >time=26466.82..26467.51 rows=179 loops=1) > > -> Sort (cost=344377.70..344377.70 rows=25477 width=47) (actual > >time=26466.82..26467.01 rows=186 loops=1) > > -> Nested Loop (cost=0.00..342053.97 rows=25477 width=47) > >(actual time=262.66..26465.63 rows=186 loops=1) > > -> Seq Scan on featureloc fl (cost=0.00..261709.31 > >rows=25477 width=14) (actual time=118.62..26006.05 rows=186 loops=1) > > -> Index Scan using feature_pkey on feature f > >(cost=0.00..3.14 rows=1 width=33) (actual time=2.45..2.46 rows=1 > >loops=186) > >Total runtime: 26467.85 msec > >-------------------------------------------------------------------------- > > > >After disallowing seqscans (set enable_seqscan=0): > > > >Unique (cost=356513.46..356895.61 rows=2548 width=47) (actual > >time=27494.62..27495.34 rows=179 loops=1) > > -> Sort (cost=356513.46..356513.46 rows=25477 width=47) (actual > >time=27494.61..27494.83 rows=186 loops=1) > > -> Nested Loop (cost=0.00..354189.73 rows=25477 width=47) > >(actual time=198.88..27493.48 rows=186 loops=1) > > -> Index Scan using featureloc_idx1 on featureloc fl > >(cost=0.00..273845.08 rows=25477 width=14) (actual time=129.30..27280.95 > >rows=186 loops=1) > > -> Index Scan using feature_pkey on feature f > >(cost=0.00..3.14 rows=1 width=33) (actual time=1.13..1.13 rows=1 > >loops=186) > >Total runtime: 27495.66 msec > >-------------------------------------------------------------------------- > > > >After dropping featureloc_idx1: > > > >Unique (cost=1310195.21..1310577.36 rows=2548 width=47) (actual > >time=21692.69..21693.37 rows=179 loops=1) > > -> Sort (cost=1310195.21..1310195.21 rows=25477 width=47) (actual > >time=21692.69..21692.88 rows=186 loops=1) > > -> Nested Loop (cost=0.00..1307871.48 rows=25477 width=47) > >(actual time=2197.65..21691.39 rows=186 loops=1) > > -> Index Scan using featureloc_idx2 on featureloc fl > >(cost=0.00..1227526.82 rows=25477 width=14) (actual > >time=2197.49..21618.89 rows=186 loops=1) > > -> Index Scan using feature_pkey on feature f > >(cost=0.00..3.14 rows=1 width=33) (actual time=0.37..0.38 rows=1 > >loops=186) > >Total runtime: 21693.72 msec > >-------------------------------------------------------------------------- > > > >After dropping featureloc_idx2: > > > >Unique (cost=1414516.98..1414899.13 rows=2548 width=47) (actual > >time=1669.17..1669.86 rows=179 loops=1) > > -> Sort (cost=1414516.98..1414516.98 rows=25477 width=47) (actual > >time=1669.17..1669.36 rows=186 loops=1) > > -> Nested Loop (cost=0.00..1412193.25 rows=25477 width=47) > >(actual time=122.69..1668.08 rows=186 loops=1) > > -> Index Scan using featureloc_src_strand_beg_end on > >featureloc fl (cost=0.00..1331848.60 rows=25477 width=14) (actual > >time=122.51..1661.81 rows=186 loops=1) > > -> Index Scan using feature_pkey on feature f > >(cost=0.00..3.14 rows=1 width=33) (actual time=0.02..0.03 rows=1 > >loops=186) > >Total runtime: 1670.20 msec > > > > > >On Fri, 2003-02-14 at 12:00, Tom Lane wrote: > > > >>Scott Cain writes: > >> > >>>[ much stuff ] > >> > >>Could we see EXPLAIN ANALYZE, not just EXPLAIN, output for all these > >>alternatives? Your question boils down to "why is the planner > >>misestimating these queries" ... which is a difficult question to > >>answer when given only the estimates and not the reality. > >> > >>A suggestion though is that you might need to raise the statistics > >>target on the indexed columns, so that ANALYZE will collect > >>finer-grained statistics. (See ALTER TABLE ... SET STATISTICS.) > >>Try booting it up to 100 (from the default 10), re-analyze, and > >>then see if/how the plans change. > >> > >> regards, tom lane
Scott Cain <cain@cshl.org> writes: > Here is the query that is causing the problems: > select distinct f.name,fl.nbeg,fl.nend,fl.strand,f.type_id,f.feature_id > from feature f, featureloc fl > where > fl.srcfeature_id = 1 and > ((fl.strand=1 and fl.nbeg <= 393164 and fl.nend >= 390956) OR > (fl.strand=-1 and fl.nend <= 393164 and fl.nbeg >= 390956)) and > f.feature_id = fl.feature_id > [ and the index he'd like it to use is ] > Index "featureloc_src_strand_beg_end" > Column | Type > ---------------+---------- > srcfeature_id | integer > strand | smallint > nbeg | integer > nend | integer > btree After fooling with this I see a couple of problems. One is the same old cross-datatype-comparison issue that so frequently bites people: "1" and "-1" are integer constants, and comparing them to a smallint column isn't an indexable operation. You need casts. (Or, forget the "optimization" of storing strand as a smallint. Depending on your row layout, it's quite likely saving you no space anyway.) Problem two is that the optimizer isn't smart enough to recognize that a query condition laid out in this form should be processed as two indexscans --- it would possibly have gotten it right if the first index column had been inside the OR, but not this way. The upshot is that when you force it to use index featureloc_src_strand_beg_end, it's actually only using the srcfeature_id column of the index --- which is slow of course, and also explains why the optimizer doesn't find that option very attractive. I had some success in getting current sources to generate a desirable plan by doing this: regression=# explain select distinct * regression-# from feature f join featureloc fl on (f.feature_id = fl.feature_id) where regression-# ((fl.srcfeature_id = 1 and fl.strand=1::int2 and fl.nbeg <= 393164 and fl.nend >= 390956) OR regression(# (fl.srcfeature_id = 1 and fl.strand=-1::int2 and fl.nend <= 393164 and fl.nbeg >= 390956)); Unique (cost=34.79..34.85 rows=5 width=50) -> Sort (cost=34.79..34.80 rows=5 width=50) Sort Key: f.name, fl.nbeg, fl.nend, fl.strand -> Hash Join (cost=9.68..34.73 rows=5 width=50) Hash Cond: ("outer".feature_id = "inner".feature_id) -> Seq Scan on feature f (cost=0.00..20.00 rows=1000 width=36) -> Hash (cost=9.68..9.68 rows=1 width=14) -> Index Scan using featureloc_src_strand_beg_end, featureloc_src_strand_beg_end on featureloc fl (cost=0.00..9.68 rows=1 width=14) Index Cond: (((srcfeature_id = 1) AND (strand = 1::smallint) AND (nbeg <= 393164) AND (nend >=390956)) OR ((srcfeature_id = 1) AND (strand = -1::smallint) AND (nbeg >= 390956) AND (nend <= 393164))) Filter: (((srcfeature_id = 1) AND (strand = 1::smallint) AND (nbeg <= 393164) AND (nend >= 390956))OR ((srcfeature_id = 1) AND (strand = -1::smallint) AND (nend <= 393164) AND (nbeg >= 390956))) (10 rows) Shoving the join condition into an explicit JOIN clause is a hack, but it nicely does the job of keeping the WHERE clause as a pristine OR-of-ANDs structure, so that the optimizer can hardly fail to notice that that's the preferred canonical form. I would strongly recommend an upgrade to PG 7.3, both on general principles and because you can actually see what the indexscan condition is in EXPLAIN's output. Before 7.3 you had to grovel through EXPLAIN VERBOSE to be sure what was really happening with a multicolumn index. regards, tom lane