Thread: Tell postgres which index to use?
Hi,

is there a way to tell Postgres which index to use when a query is
issued in 7.4.2?

I have a query for which, cost-wise, a hash join without any index usage
is best, but which runs 3 - 4 times faster when the index is used for a
nested-loop join instead. I vacuumed before I started the comparison, so
Postgres should be doing its best. And I don't want to constantly switch
enable_hashjoin and enable_mergejoin on and off.

Regards,
Silke Trissl
Silke,

> is there a way to tell Postgres which index to use when a query is
> issued in 7.4.2?

PostgreSQL adjusts index usage through global planner parameters,
per-column statistics, and periodic ANALYZE. Please post an EXPLAIN
ANALYZE (not just EXPLAIN) for your query and people on this list can
help you with your specific problem.

--
Josh Berkus
Aglio Database Solutions
San Francisco
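P.S. To give you an idea of what "statistics and periodic ANALYZE" means
in practice before you post the plans, it looks something like the
following (table and column names here are only placeholders):

  -- sample a column more finely so the planner estimates it better
  ALTER TABLE my_table ALTER COLUMN my_column SET STATISTICS 100;
  -- refresh the statistics so the planner actually sees them
  ANALYZE my_table;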
Sorry,

>> is there a way to tell Postgres which index to use when a query is
>> issued in 7.4.2?
>
> PostgreSQL adjusts index usage through global planner parameters,
> per-column statistics, and periodic ANALYZE. Please post an EXPLAIN
> ANALYZE (not just EXPLAIN) for your query and people on this list can
> help you with your specific problem.

here are the plans, but I would still like to be able to tell Postgres
which index or join method to use (like HINT in Oracle).

First the vacuum:

db=# VACUUM FULL ANALYZE;
VACUUM

Then the query, first with the default plan:

db=# EXPLAIN ANALYZE
db-# SELECT chain.pdb_id, chain.id FROM PDB_ENTRY, CHAIN
db-# WHERE PDB_ENTRY.resolution > 0.0 AND PDB_ENTRY.resolution < 1.7
db-# AND PDB_ENTRY.id = CHAIN.pdb_id;
                                                        QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1426.75..5210.52 rows=7533 width=8) (actual time=77.712..399.108 rows=5798 loops=1)
   Hash Cond: ("outer".pdb_id = "inner".id)
   ->  Seq Scan on "chain"  (cost=0.00..3202.11 rows=67511 width=8) (actual time=0.048..151.885 rows=67511 loops=1)
   ->  Hash  (cost=1418.68..1418.68 rows=3226 width=4) (actual time=77.062..77.062 rows=0 loops=1)
         ->  Seq Scan on pdb_entry  (cost=0.00..1418.68 rows=3226 width=4) (actual time=0.118..71.956 rows=3329 loops=1)
               Filter: ((resolution > 0::double precision) AND (resolution < 1.7::double precision))
 Total runtime: 404.434 ms
(7 rows)

And then try to avoid the hash join:

db=# SET enable_hashjoin = off;
SET
db=# EXPLAIN ANALYZE
db-# SELECT chain.pdb_id, chain.id FROM PDB_ENTRY, CHAIN
db-# WHERE PDB_ENTRY.resolution > 0.0 AND PDB_ENTRY.resolution < 1.7
db-# AND PDB_ENTRY.id = CHAIN.pdb_id;
                                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=9163.85..11100.74 rows=7533 width=8) (actual time=606.505..902.740 rows=5798 loops=1)
   Merge Cond: ("outer".id = "inner".pdb_id)
   ->  Index Scan using pdb_entry_pkey on pdb_entry  (cost=0.00..1516.03 rows=3226 width=4) (actual time=0.440..102.912 rows=3329 loops=1)
         Filter: ((resolution > 0::double precision) AND (resolution < 1.7::double precision))
   ->  Sort  (cost=9163.85..9332.63 rows=67511 width=8) (actual time=605.838..694.190 rows=67501 loops=1)
         Sort Key: "chain".pdb_id
         ->  Seq Scan on "chain"  (cost=0.00..3202.11 rows=67511 width=8) (actual time=0.064..225.859 rows=67511 loops=1)
 Total runtime: 911.024 ms
(8 rows)

And finally the method that is fastest time-wise, but not cost-wise.
Even for joins that touch almost the whole table, this method is the
fastest:

db=# SET enable_mergejoin = off;
SET
db=# EXPLAIN ANALYZE
db-# SELECT chain.pdb_id, chain.id FROM PDB_ENTRY, CHAIN
db-# WHERE PDB_ENTRY.resolution > 0.0 AND PDB_ENTRY.resolution < 1.7
db-# AND PDB_ENTRY.id = CHAIN.pdb_id;
                                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..23849.81 rows=7533 width=8) (actual time=0.341..198.162 rows=5798 loops=1)
   ->  Seq Scan on pdb_entry  (cost=0.00..1418.68 rows=3226 width=4) (actual time=0.145..78.177 rows=3329 loops=1)
         Filter: ((resolution > 0::double precision) AND (resolution < 1.7::double precision))
   ->  Index Scan using chain_pdb_id_ind on "chain"  (cost=0.00..6.87 rows=6 width=8) (actual time=0.021..0.027 rows=2 loops=3329)
         Index Cond: ("outer".id = "chain".pdb_id)
 Total runtime: 204.105 ms
(6 rows)
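P.S. For now I am working around it with SET LOCAL inside a transaction,
so the planner settings reset at COMMIT and I do not have to switch them
back by hand -- if I read the 7.4 docs right, something like:

  BEGIN;
  SET LOCAL enable_hashjoin = off;
  SET LOCAL enable_mergejoin = off;
  SELECT chain.pdb_id, chain.id FROM PDB_ENTRY, CHAIN
   WHERE PDB_ENTRY.resolution > 0.0 AND PDB_ENTRY.resolution < 1.7
     AND PDB_ENTRY.id = CHAIN.pdb_id;
  COMMIT;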
Silke Trissl wrote:

> Sorry,
>
>>> is there a way to tell Postgres which index to use when a query is
>>> issued in 7.4.2?
>>
>> PostgreSQL adjusts index usage through global planner parameters,
>> per-column statistics, and periodic ANALYZE. Please post an EXPLAIN
>> ANALYZE (not just EXPLAIN) for your query and people on this list can
>> help you with your specific problem.
>
> here are the plans, but I would still like to be able to tell Postgres
> which index or join method to use (like HINT in Oracle).
>
> First the vacuum:
>
> db=# VACUUM FULL ANALYZE;
> VACUUM
>
> Then the query, first with the default plan:
>
> db=# EXPLAIN ANALYZE
> db-# SELECT chain.pdb_id, chain.id FROM PDB_ENTRY, CHAIN
> db-# WHERE PDB_ENTRY.resolution > 0.0 AND PDB_ENTRY.resolution < 1.7
> db-# AND PDB_ENTRY.id = CHAIN.pdb_id;
>
>  Hash Join  (cost=1426.75..5210.52 rows=7533 width=8) (actual time=77.712..399.108 rows=5798 loops=1)
>    Hash Cond: ("outer".pdb_id = "inner".id)
>    ->  Seq Scan on "chain"  (cost=0.00..3202.11 rows=67511 width=8) (actual time=0.048..151.885 rows=67511 loops=1)
>    ->  Hash  (cost=1418.68..1418.68 rows=3226 width=4) (actual time=77.062..77.062 rows=0 loops=1)

This seems to be at least one of the problems. The planner thinks there
are going to be 3000+ rows, but in reality there are 0.

>          ->  Seq Scan on pdb_entry  (cost=0.00..1418.68 rows=3226 width=4) (actual time=0.118..71.956 rows=3329 loops=1)
>                Filter: ((resolution > 0::double precision) AND (resolution < 1.7::double precision))
>  Total runtime: 404.434 ms
> (7 rows)
>
> And then try to avoid the hash join:
>
> db=# SET enable_hashjoin = off;
> SET
> db=# EXPLAIN ANALYZE
> db-# SELECT chain.pdb_id, chain.id FROM PDB_ENTRY, CHAIN
> db-# WHERE PDB_ENTRY.resolution > 0.0 AND PDB_ENTRY.resolution < 1.7
> db-# AND PDB_ENTRY.id = CHAIN.pdb_id;
>
>  Merge Join  (cost=9163.85..11100.74 rows=7533 width=8) (actual time=606.505..902.740 rows=5798 loops=1)
>    Merge Cond: ("outer".id = "inner".pdb_id)
>    ->  Index Scan using pdb_entry_pkey on pdb_entry  (cost=0.00..1516.03 rows=3226 width=4) (actual time=0.440..102.912 rows=3329 loops=1)
>          Filter: ((resolution > 0::double precision) AND (resolution < 1.7::double precision))
>    ->  Sort  (cost=9163.85..9332.63 rows=67511 width=8) (actual time=605.838..694.190 rows=67501 loops=1)
>          Sort Key: "chain".pdb_id
>          ->  Seq Scan on "chain"  (cost=0.00..3202.11 rows=67511 width=8) (actual time=0.064..225.859 rows=67511 loops=1)
>  Total runtime: 911.024 ms
> (8 rows)
>
> And finally the method that is fastest time-wise, but not cost-wise.
> Even for joins that touch almost the whole table, this method is the
> fastest:
> db=# SET enable_mergejoin = off;
> SET
> db=# EXPLAIN ANALYZE
> db-# SELECT chain.pdb_id, chain.id FROM PDB_ENTRY, CHAIN
> db-# WHERE PDB_ENTRY.resolution > 0.0 AND PDB_ENTRY.resolution < 1.7
> db-# AND PDB_ENTRY.id = CHAIN.pdb_id;
>
>  Nested Loop  (cost=0.00..23849.81 rows=7533 width=8) (actual time=0.341..198.162 rows=5798 loops=1)
>    ->  Seq Scan on pdb_entry  (cost=0.00..1418.68 rows=3226 width=4) (actual time=0.145..78.177 rows=3329 loops=1)
>          Filter: ((resolution > 0::double precision) AND (resolution < 1.7::double precision))
>    ->  Index Scan using chain_pdb_id_ind on "chain"  (cost=0.00..6.87 rows=6 width=8) (actual time=0.021..0.027 rows=2 loops=3329)
>          Index Cond: ("outer".id = "chain".pdb_id)
>  Total runtime: 204.105 ms
> (6 rows)

I'm guessing the filter on resolution (everything between 0 and 1.7) is
more selective than postgres thinks it is. You might try increasing the
statistics target for that column. You might also try playing with
random_page_cost to make index scans relatively cheaper than seq scans.
It might also be that your effective_cache_size isn't quite right, which
makes postgres think most pages have to come from disk when in reality
they are already in memory (a correct setting also makes index scans
much cheaper in the planner's eyes).

Also, this query may sort itself out in time. As the tables grow, the
fraction of the rows you want probably decreases, which makes index
scans more attractive.

John
=:->
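P.S. Concretely, something like the following -- the numbers are only
illustrative starting points, not values tested against your data, and
note that in 7.4 effective_cache_size is measured in 8 KB pages:

  -- sample the resolution column more finely, then refresh statistics
  ALTER TABLE pdb_entry ALTER COLUMN resolution SET STATISTICS 100;
  ANALYZE pdb_entry;

  -- make index scans look relatively cheaper (the default is 4)
  SET random_page_cost = 2;

  -- roughly how much of the database the OS keeps cached:
  -- 65536 pages * 8 KB = 512 MB
  SET effective_cache_size = 65536;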
John Arbash Meinel <john@arbash-meinel.com> writes:

>>    ->  Hash  (cost=1418.68..1418.68 rows=3226 width=4) (actual time=77.062..77.062 rows=0 loops=1)
>
> This seems to be at least one of the problems. The planner thinks there
> are going to be 3000+ rows, but in reality there are 0.

No, that's a red herring. Hash nodes always report 0 rows.

>>  Nested Loop  (cost=0.00..23849.81 rows=7533 width=8) (actual time=0.341..198.162 rows=5798 loops=1)
>>    ->  Seq Scan on pdb_entry  (cost=0.00..1418.68 rows=3226 width=4) (actual time=0.145..78.177 rows=3329 loops=1)
>>          Filter: ((resolution > 0::double precision) AND (resolution < 1.7::double precision))
>>    ->  Index Scan using chain_pdb_id_ind on "chain"  (cost=0.00..6.87 rows=6 width=8) (actual time=0.021..0.027 rows=2 loops=3329)
>>          Index Cond: ("outer".id = "chain".pdb_id)

The actual number of records is pretty close to the estimated number.
The difference seems to come primarily from the selectivity of the join,
where the planner thinks an average of 6 rows will match each outer row
whereas in fact an average of about 2 rows matches.

So it thinks it's going to read about 18,000 records out of 67,000, or
about 25%. In that case the sequential scan is almost certainly better.
In fact it's going to read about 6,000, or just under 10%, in which case
the sequential scan is probably still better, but it's not so clear-cut.

I suspect the only reason you're seeing such a big difference, when I
would expect the two plans to be about even, is that nearly all the data
is cached. In that case the non-sequential access pattern of the nested
loop has little effect.

You might get away with lowering random_page_cost, but since the planner
thinks it's going to read 25% of the table I suspect you'll have to get
very close to 1 before it switches over, if it switches even then. Be
careful about tuning settings like this based on a single query,
especially to unrealistically low values.

You might also want to try raising the statistics target on pdb_entry,
and see if that makes the estimate go down from 6 to closer to 2.

--
greg
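P.S. If you want to try the statistics route, something along these
lines -- 200 is an arbitrary target, and since the join estimate depends
on the statistics of both join columns you may want to raise
chain.pdb_id as well as the pdb_entry side:

  ALTER TABLE pdb_entry ALTER COLUMN id SET STATISTICS 200;
  ALTER TABLE chain ALTER COLUMN pdb_id SET STATISTICS 200;
  ANALYZE pdb_entry;
  ANALYZE chain;
  -- then rerun the EXPLAIN ANALYZE and see if rows=6 moves toward 2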