Thread: Query planner refuses to use index
Hi there,

I know this subject has come up many times before, but I've been struggling
with the following problem for hours and none of the posts seem to have a
solution. I have a table with a good 13 million entries:

station_data=# \d speed
         Table "public.speed"
   Column   |   Type   | Modifiers
------------+----------+-----------
 set_id     | smallint | not null
 rec_time   | abstime  | not null
 wind_speed | smallint |
Indexes:
    "speed_pkey" primary key, btree (set_id, rec_time)

I use lots of queries of the form

  SELECT * FROM speed WHERE set_id=xxx AND rec_time >= yyy

where xxx is an integer and yyy is an abstime. At first, I battled to get
the query planner to use an index at all, even when forcing, but

http://archives.postgresql.org/pgsql-general/2001-09/msg01120.php

suggested I need to use set_id=5::smallint. It works, but why is pg not
intelligent enough to figure out that the literal 5 and smallint are
compatible?

So I thought I had solved my problem, but then it still refused to use the
index, even though sequential scans are prohibitively expensive:

station_data=# EXPLAIN ANALYZE SELECT * FROM speed WHERE set_id=5::smallint
AND rec_time >= '1999/01/01'::abstime;
                                                    QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Seq Scan on speed  (cost=0.00..276640.28 rows=677372 width=8) (actual
time=14024.081..78236.525 rows=652389 loops=1)
   Filter: ((set_id = 5::smallint) AND (rec_time >= '1999-01-01
00:00:00+02'::abstime))
 Total runtime: 80156.330 ms

When I explicitly turn off sequential scans by issuing "SET enable_seqscan
TO OFF;", I get what I want:

station_data=# EXPLAIN ANALYZE SELECT * FROM speed WHERE set_id=5::smallint
AND rec_time >= '1999/01/01'::abstime;
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using speed_pkey on speed  (cost=0.00..2009924.87 rows=677372
width=8) (actual time=50.070..5775.698 rows=652389 loops=1)
   Index Cond: ((set_id = 5::smallint) AND (rec_time >= '1999-01-01
00:00:00+02'::abstime))
 Total runtime: 8819.371 ms

which is 10 times faster. We're down to the last recommendation of section
11.8 in the documentation, so I increased the statistics gathered with
"SET default_statistics_target TO 50;", but that makes no difference either.

Am I left to disable seqscans for eternity (in which case may I file a bug?)
or is there something else I might be missing?

--
Kilian Hagemann

Climate Systems Analysis Group
University of Cape Town
Republic of South Africa
Tel(w): ++27 21 650 2748
On Thu, Jul 21, 2005 at 05:45:55PM +0200, Kilian Hagemann wrote:
> Hi there,
>
> I know this subject has come up many times before, but I've been struggling
> with the following problem for hours and none of the posts seem to have a
> solution. I have a table with a good 13 million entries:

Firstly, you haven't indicated what version of postgresql you're using.

<snip>

> http://archives.postgresql.org/pgsql-general/2001-09/msg01120.php
>
> suggested I need to use set_id=5::smallint. It works, but why is pg not
> intelligent enough to figure out that the literal 5 and smallint are
> compatible?

Later versions do, which is why the version you are using is relevant.

<snip>

The EXPLAIN ANALYZE below indicates that the system thinks the index scan
will take 10 times longer than the seqscan, whereas it doesn't actually
take that long. Please tell us if you've adjusted random_page_cost at all,
and to what.

Are the values matching your criteria clustered on disk? Old versions of
Postgres didn't take this into account, although that would be very old
then...

Hope this helps,

> station_data=# EXPLAIN ANALYZE SELECT * FROM speed WHERE set_id=5::smallint
> AND rec_time >= '1999/01/01'::abstime;
>  Seq Scan on speed  (cost=0.00..276640.28 rows=677372 width=8) (actual
> time=14024.081..78236.525 rows=652389 loops=1)
>    Filter: ((set_id = 5::smallint) AND (rec_time >= '1999-01-01
> 00:00:00+02'::abstime))
>  Total runtime: 80156.330 ms
>
> When I explicitly turn off sequential scans by issuing "SET enable_seqscan
> TO OFF;", I get what I want:
>
> station_data=# EXPLAIN ANALYZE SELECT * FROM speed WHERE set_id=5::smallint
> AND rec_time >= '1999/01/01'::abstime;
>  Index Scan using speed_pkey on speed  (cost=0.00..2009924.87 rows=677372
> width=8) (actual time=50.070..5775.698 rows=652389 loops=1)
>    Index Cond: ((set_id = 5::smallint) AND (rec_time >= '1999-01-01
> 00:00:00+02'::abstime))
>  Total runtime: 8819.371 ms

--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
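If the matching rows turn out not to be clustered, one way to order the
table physically by the primary key is CLUSTER -- a sketch only, not
something tried in this thread; it takes an exclusive lock and rewrites
the whole table, so test it first:

    CLUSTER speed_pkey ON speed;   -- reorder heap rows in index order
    ANALYZE speed;                 -- refresh the planner's statistics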
On Thursday, 21 July 2005 17:45, Kilian Hagemann wrote:
> Hi there,
>
> I know this subject has come up many times before, but I've been struggling
> with the following problem for hours and none of the posts seem to have a
> solution. I have a table with a good 13 million entries:
>
> station_data=# \d speed
>          Table "public.speed"
>    Column   |   Type   | Modifiers
> ------------+----------+-----------
>  set_id     | smallint | not null
>  rec_time   | abstime  | not null
>  wind_speed | smallint |
> Indexes:
>     "speed_pkey" primary key, btree (set_id, rec_time)
>
> I use lots of queries of the form SELECT * FROM speed WHERE set_id=xxx AND
> rec_time >= yyy, where xxx is an integer and yyy is an abstime. At first, I
> battled to get the query planner to use an index at all, even when forcing,
> but
>
> http://archives.postgresql.org/pgsql-general/2001-09/msg01120.php
>
> suggested I need to use set_id=5::smallint. It works, but why is pg not
> intelligent enough to figure out that the literal 5 and smallint are
> compatible?
>
> So I thought I had solved my problem, but then it still refused to use the
> index, even though sequential scans are prohibitively expensive:
>
> station_data=# EXPLAIN ANALYZE SELECT * FROM speed WHERE set_id=5::smallint
> AND rec_time >= '1999/01/01'::abstime;
>  Seq Scan on speed  (cost=0.00..276640.28 rows=677372 width=8) (actual
> time=14024.081..78236.525 rows=652389 loops=1)
>    Filter: ((set_id = 5::smallint) AND (rec_time >= '1999-01-01
> 00:00:00+02'::abstime))
>  Total runtime: 80156.330 ms
>
> When I explicitly turn off sequential scans by issuing "SET enable_seqscan
> TO OFF;", I get what I want:
>
> station_data=# EXPLAIN ANALYZE SELECT * FROM speed WHERE set_id=5::smallint
> AND rec_time >= '1999/01/01'::abstime;
>  Index Scan using speed_pkey on speed  (cost=0.00..2009924.87 rows=677372
> width=8) (actual time=50.070..5775.698 rows=652389 loops=1)
>    Index Cond: ((set_id = 5::smallint) AND (rec_time >= '1999-01-01
> 00:00:00+02'::abstime))
>  Total runtime: 8819.371 ms
>
> which is 10 times faster. We're down to the last recommendation of section
> 11.8 in the documentation, so I increased the statistics gathered with
> "SET default_statistics_target TO 50;", but that makes no difference either.
>
> Am I left to disable seqscans for eternity (in which case may I file a bug?)
> or is there something else I might be missing?

some hints:

- You never want to set enable_seqscan off in a production database.

- Did you run "VACUUM ANALYZE speed" lately?

- What version are you running?

- If you look at the cost values, you can see what cost the planner
  expects; "actual time" is what it took to really execute the query. In
  example 1 the seq scan's estimated cost is 276,640, while the index
  scan's in example 2 is 2,009,924. So the planner chooses the seq scan.
  Maybe because you have not analyzed lately?

- Another parameter to look at is random_page_cost: "Sets the planner's
  estimate of the cost of a nonsequentially fetched disk page. This is
  measured as a multiple of the cost of a sequential page fetch. A higher
  value makes it more likely a sequential scan will be used, a lower value
  makes it more likely an index scan will be used. The default is four."

kind regards,
janning
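The last two hints can be tried together in a single session -- a sketch,
with 2 as an arbitrary trial value (the default being 4):

    VACUUM ANALYZE speed;
    SET random_page_cost = 2;   -- session-local; postgresql.conf untouched
    EXPLAIN ANALYZE SELECT * FROM speed
      WHERE set_id = 5::smallint AND rec_time >= '1999/01/01'::abstime;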
Hi there,

Thanks for your and Martijn's comments; I obviously forgot to put in some
vital details:

> - You never want to set enable_seqscan off in a production database.

That's what I thought...

> - Did you run "VACUUM ANALYZE speed" lately?

Yes, just before I ran all of the queries in my last email. Hence I
mentioned increasing default_statistics_target to 50 and reanalysing,
which didn't help either.

> - What version are you running?

7.4.8, not sure if I'm ready for 8 yet.

> - Another parameter to look at is random_page_cost: "Sets the planner's
>   estimate of the cost of a nonsequentially fetched disk page. This is
>   measured as a multiple of the cost of a sequential page fetch. A higher
>   value makes it more likely a sequential scan will be used, a lower value
>   makes it more likely an index scan will be used. The default is four."

Hmm, that's interesting. I need to set random_page_cost as low as 0.5 for
the index scan's cost to dip below that of the seq scan. Surely that's an
unrealistic setting and not what I want in the long run. Why on earth does
the planner in its default configuration so blatantly miss that the index
scan is vastly superior?

Maybe some more stats about my data will help; a summary is attached.
Also, note that set_id is strictly increasing (hence correlation of 1) and
rec_time is strictly increasing within records with the same set_id.

--
Kilian Hagemann

Climate Systems Analysis Group
University of Cape Town
Republic of South Africa
Tel(w): ++27 21 650 2748
On Fri, Jul 22, 2005 at 10:46:39AM +0200, Kilian Hagemann wrote:
> > - Did you run "VACUUM ANALYZE speed" lately?
>
> Yes, just before I ran all of the queries in my last email.

Did you run VACUUM ANALYZE or just ANALYZE? Could we see the output of
VACUUM ANALYZE VERBOSE speed?

> Hence I mentioned increasing default_statistics_target to 50 and
> reanalysing, which didn't help either.

It might be better to use ALTER TABLE to set the statistics target for
specific columns instead of changing the system-wide default -- no need to
spend time over-analyzing columns when it's not necessary. And if
EXPLAIN's row estimates are already reasonably accurate, then increasing
the statistics will have little effect on the planner's decisions.

> Hmm, that's interesting. I need to set random_page_cost as low as 0.5 for
> the index scan's cost to dip below that of the seq scan. Surely that's an
> unrealistic setting and not what I want in the long run.

What are the values of other relevant settings, like shared_buffers,
effective_cache_size, and cpu_index_tuple_cost? How much memory does this
system have?

> station_data=# SELECT relname, relkind, reltuples, relpages FROM pg_class
> WHERE relname LIKE 'speed%';
>    relname   | relkind |  reltuples  | relpages
> ------------+---------+-------------+----------
>  speed      | r       | 1.39002e+07 |    68138
>  speed_pkey | i       |        1000 |        1

That's odd -- why aren't there more tuples and pages in the speed_pkey
index? Those look like never-been-vacuumed defaults. Are you sure you've
been vacuuming this table, or have you just been analyzing it? How much
update/delete activity does this table undergo?

--
Michael Fuhr
http://www.fuhr.org/~mfuhr/
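The per-column statistics approach would look roughly like this -- a
sketch, with 100 as an arbitrary trial target:

    ALTER TABLE speed ALTER COLUMN set_id SET STATISTICS 100;
    ALTER TABLE speed ALTER COLUMN rec_time SET STATISTICS 100;
    ANALYZE speed;   -- the new targets take effect at the next analyze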
Kilian Hagemann <hagemann1@egs.uct.ac.za> writes:
>> - What version are you running?

> 7.4.8, not sure if I'm ready for 8 yet.

Pre-8.0 tends to underestimate the correlation of a multicolumn index.
(8.0 may too, but not as much.)

> Also, note that set_id is strictly increasing (hence correlation of 1) and
> rec_time is strictly increasing within records with the same set_id.

So the reason the indexscan is so good is that the ordering correlation is
perfect. This isn't the planner's default assumption, and unfortunately we
haven't got statistics available that would allow the correlation of a
multicolumn index to be estimated well.

			regards, tom lane
Hello.

I have some tables that are updated by several users at the same time and
are used in queries for reports. Those tables have rows that are actually
copied from original tables that are not to be altered. There is a
procedure that inserts rows for every user when he connects, along with
his username, so different users can't interfere with each other: every
user has his own copy of rows that he can update, and records are filtered
by current_user.

Well, it's my heritage from MS Access, before I moved to Postgres, because
there is no such thing as a temporary table in Access...

Now, I'm wondering: is there any true advantage to implementing temporary
tables for each user, instead of one table with inserted rows carrying a
username for every user? What would be the advantage of temporary tables?
Would the database grow less if I implement temporary tables (less need
for vacuum?), in comparison to my current solution with a true table
holding rows for every user?

Zlatko
On Jul 22, 2005, at 1:55 PM, Zlatko Matic wrote:
> Hello.
> I have some tables that are updated by several users at the same time and
> are used in queries for reports. Those tables have rows that are actually
> copied from original tables that are not to be altered. There is a
> procedure that inserts rows for every user when he connects, along with
> his username, so different users can't interfere with each other: every
> user has his own copy of rows that he can update, and records are
> filtered by current_user.
> Well, it's my heritage from MS Access, before I moved to Postgres,
> because there is no such thing as a temporary table in Access...
> Now, I'm wondering: is there any true advantage to implementing temporary
> tables for each user, instead of one table with inserted rows carrying a
> username for every user?

Temporary tables are not per-user, but per-connection. A user can be
connected twice, but a temporary table created on one connection is not
visible from the other connection. Also, temporary tables are
temporary--they disappear after the connection is closed.
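A quick way to see the per-connection behaviour -- a sketch, with my_rows
as an invented name:

    -- session 1
    CREATE TEMPORARY TABLE my_rows AS SELECT 1 AS id;
    SELECT * FROM my_rows;   -- returns one row

    -- session 2, even as the same user
    SELECT * FROM my_rows;   -- ERROR: relation "my_rows" does not exist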
They can also be set to drop with ON COMMIT DROP; this way they disappear
after the transaction is committed. I use them this way on my web server
to sort file listings, and it works well since each apache DSO connection
gets one connection to the database.

Tony
http://www.amsoftwaredesign.com
Home of PG Lightning Admin (PGLA)

Sean Davis wrote:
> Temporary tables are not per-user, but per-connection. A user can be
> connected twice, but a temporary table created on one connection is
> not visible from the other connection. Also, temporary tables are
> temporary--they disappear after the connection is closed.
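In sketch form, that pattern might look like this (file_listing and its
columns are invented for illustration):

    BEGIN;
    CREATE TEMPORARY TABLE file_listing (
        fname text,
        fsize bigint
    ) ON COMMIT DROP;
    INSERT INTO file_listing VALUES ('index.html', 1024);
    SELECT * FROM file_listing ORDER BY fname;
    COMMIT;   -- file_listing is dropped automatically here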
What is the influence on database growth, in comparison to a permanent
table with frequently inserted/deleted rows?

----- Original Message -----
From: "Sean Davis" <sdavis2@mail.nih.gov>
To: "Zlatko Matic" <zlatko.matic1@sb.t-com.hr>
Cc: "pgsql list" <pgsql-general@postgresql.org>
Sent: Friday, July 22, 2005 8:06 PM
Subject: Re: [GENERAL] temporary tables ?

> On Jul 22, 2005, at 1:55 PM, Zlatko Matic wrote:
>
>> Now, I'm wondering: is there any true advantage to implementing temporary
>> tables for each user, instead of one table with inserted rows carrying a
>> username for every user?
>
> Temporary tables are not per-user, but per-connection. A user can be
> connected twice, but a temporary table created on one connection is not
> visible from the other connection. Also, temporary tables are
> temporary--they disappear after the connection is closed.
> What is the influence on database growth, in comparison to a permanent
> table with frequently inserted/deleted rows?

The tables are dropped automatically after the connection is closed, so
the database doesn't grow because of temporary tables. As for the
comparison to a frequently inserted/deleted table, that depends on the
time between vacuums: the rows aren't "removed" from a table until a
vacuum is performed.
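So with the one-shared-table design, a periodic cleanup would be something
along these lines -- a sketch, assuming the table and username column
Zlatko describes:

    DELETE FROM report_rows WHERE username = current_user;
    VACUUM report_rows;   -- marks the dead row versions as reusable space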
Thanks for the explanation.

Zlatko

----- Original Message -----
From: "Sean Davis" <sdavis2@mail.nih.gov>
To: "Zlatko Matic" <zlatko.matic1@sb.t-com.hr>
Cc: "pgsql list" <pgsql-general@postgresql.org>
Sent: Sunday, July 24, 2005 1:27 AM
Subject: Re: [GENERAL] temporary tables ?

>> What is the influence on database growth, in comparison to a permanent
>> table with frequently inserted/deleted rows?
>
> The tables are dropped automatically after the connection is closed, so
> the database doesn't grow because of temporary tables. As for the
> comparison to a frequently inserted/deleted table, that depends on the
> time between vacuums: the rows aren't "removed" from a table until a
> vacuum is performed.
(v 7.4.8 on linux)

pg_resetxlog question...

I have an idle database (idle = no writes for about 12 hours, everything
committed, full dump in between) that was shut down cleanly, then imaged
to another machine. Forgetting a certain flag in the rsync command, my
symlinked pg_xlog didn't make the trip, and I didn't notice until the
originals were gone. (Oh, the embarrassment...)

I ran pg_resetxlog and reindexed the system tables. Everything seems fine.

Both documentation and chatter on the lists imply that I should be suspect
of the database. Given the particulars mentioned above (idleness, clean
shutdown, etc.), would that be true in this case?

--------------------
Andrew Rawnsley
Chief Technology Officer
Investor Analytics, LLC
(740) 587-0114
http://www.investoranalytics.com
Andrew Rawnsley <ronz@ravensfield.com> writes:
> I have an idle database (idle = no writes for about 12 hours, everything
> committed, full dump in between) that was shut down cleanly, then imaged
> to another machine. Forgetting a certain flag in the rsync command, my
> symlinked pg_xlog didn't make the trip, and I didn't notice until the
> originals were gone. (Oh, the embarrassment...)
> I ran pg_resetxlog and reindexed the system tables. Everything seems fine.
> Both documentation and chatter on the lists imply that I should be suspect
> of the database. Given the particulars mentioned above (idleness, clean
> shutdown, etc.), would that be true in this case?

If the database was in fact cleanly shut down before copying, you
shouldn't have any problem. pg_xlog is only needed to reapply changes that
hadn't made it to disk yet.

			regards, tom lane
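For the archives, the reset itself is a single command -- an example
invocation only, run as the cluster owner with the server stopped, and
with the path standing in for the real data directory:

    pg_resetxlog /usr/local/pgsql/data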
On Friday 22 July 2005 15:23, Michael Fuhr pondered:
> Did you run VACUUM ANALYZE or just ANALYZE? Could we see the output
> of VACUUM ANALYZE VERBOSE speed?

I just ran a plain ANALYZE then. When I VACUUM ANALYZE the table, the
(inferior) sequential scan strategy is still chosen over the index scan.
Here is the output of VACUUM ANALYZE VERBOSE speed (second VACUUM ANALYZE):

INFO:  vacuuming "public.speed"
INFO:  index "speed_pkey" now contains 13959950 row versions in 53283 pages
DETAIL:  0 index pages have been deleted, 0 are currently reusable.
CPU 3.73s/0.57u sec elapsed 258.89 sec.
INFO:  "speed": found 0 removable, 13959950 nonremovable row versions in
68138 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 5.50s/1.32u sec elapsed 365.29 sec.
INFO:  analyzing "public.speed"
INFO:  "speed": 68138 pages, 3000 rows sampled, 13900152 estimated total rows
VACUUM

> What are the values of other relevant settings, like shared_buffers,
> effective_cache_size, and cpu_index_tuple_cost? How much memory
> does this system have?

shared_buffers, effective_cache_size and cpu_index_tuple_cost all have
their default values of 1000, 1000 and 0.001 respectively. From their
descriptions I gather that's reasonable, and I don't know how I would
optimise these for my system (I cannot find any hints in the
documentation). It has 512MB of RAM, Pentium M 1.5GHz, but is used as a
desktop system (KDE) with the database sitting on an external 7200rpm
USB 2.0 hard disk for space reasons.

>    relname   | relkind |  reltuples  | relpages
> ------------+---------+-------------+----------
>  speed      | r       | 1.39002e+07 |    68138
>  speed_pkey | i       |        1000 |        1
>
> That's odd -- why aren't there more tuples and pages in the speed_pkey
> index? Those look like never-been-vacuumed defaults. Are you sure
> you've been vacuuming this table, or have you just been analyzing it?

I also noticed that. I didn't realise that a plain ANALYZE wouldn't update
the stats of the index; only a VACUUM ANALYZE seems to take care of that
(why isn't this documented?). As I said above, this seems to make no
difference.

> How much update/delete activity does this table undergo?

It's a low-usage table/database. I'm its only user (not really production)
and insertion happens infrequently (only in the beginning of its
lifetime), while queries of the type I posted constitute the bulk of the
resource-intensive database usage.

--
Kilian Hagemann

Climate Systems Analysis Group
University of Cape Town
Republic of South Africa
Tel(w): ++27 21 650 2748
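Those settings can also be pulled in one query -- a sketch using the
pg_settings system view:

    SELECT name, setting
      FROM pg_settings
     WHERE name IN ('shared_buffers', 'effective_cache_size',
                    'random_page_cost', 'cpu_index_tuple_cost');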
On Mon, Jul 25, 2005 at 11:23:01AM +0200, Kilian Hagemann wrote:
> shared_buffers, effective_cache_size and cpu_index_tuple_cost all have
> their default values of 1000, 1000 and 0.001 respectively. From their
> descriptions I gather that's reasonable, and I don't know how I would
> optimise these for my system (I cannot find any hints in the
> documentation). It has 512MB of RAM, Pentium M 1.5GHz, but is used as a
> desktop system (KDE) with the database sitting on an external 7200rpm
> USB 2.0 hard disk for space reasons.

Various third-party tuning guides exist -- here's one (although it says
it's for 8.0, much of the advice applies to earlier versions as well):

http://www.powerpostgresql.com/PerfList/

With 512MB of RAM you should benefit from raising effective_cache_size,
and you could probably raise shared_buffers as well. You'll have to
experiment to find the best values, especially on a mixed-use machine
where the database is competing with other processes for resources. The
above tuning guide suggests setting effective_cache_size to 2/3 of RAM on
a dedicated server, which for you would be a setting of about 40000, so
you should probably stay below that.

As for random_page_cost, on my systems and with my usage patterns I've
found that a setting of 2 results in more realistic plans than the default
of 4. Your mileage (kilometerage?) may vary.

Whatever the results of your experiments, could you post the settings you
tried and the corresponding EXPLAIN ANALYZE outputs?

--
Michael Fuhr
http://www.fuhr.org/~mfuhr/
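Following that rule of thumb on a 512MB machine, a session-level
experiment might look like this -- a sketch; 40000 pages * 8kB is roughly
2/3 of RAM:

    SET effective_cache_size = 40000;
    SET random_page_cost = 2;
    EXPLAIN SELECT * FROM speed
      WHERE set_id = 5::smallint AND rec_time >= '1999/01/01'::abstime;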
On Monday 25 July 2005 15:43, Michael Fuhr pondered:
> Whatever the results of your experiments, could you post the settings
> you tried and the corresponding EXPLAIN ANALYZE outputs?

I did lots of tests now that you pointed me to a useful guide, also taking
what's in the documentation into account. In the attached file I have
documented my results. There are three sections to the file, each
separated by '====' markers. The first section deals in detail with the
EXPLAIN ANALYZE info relating to the troublesome queries. The second is
probably of least interest, just showing that I could implement my problem
differently to improve performance. But the last section is the most
important, where I varied effective_cache_size, random_page_cost,
shared_buffers and cpu_tuple_cost, each on its own with the other ones
assuming default values (unless indicated).

To summarise: increasing effective_cache_size and decreasing
random_page_cost both yield lower index scan cost estimates while not
changing the seqscan ones. As expected, increasing shared_buffers makes no
difference whatsoever in the query cost estimates or the actual query
times. A higher cpu_tuple_cost penalises the seqscans significantly while
only slightly increasing the index scan estimates. Also note that these
all relate to the query planner only; they do NOT change the actual query
time, which explains why I did not include EXPLAIN ANALYZE outputs, only
plain EXPLAIN ones.

In order to make PostgreSQL choose the index scans when I need them (other
than by setting enable_seqscan to off), I ended up choosing

  effective_cache_size  40000
  random_page_cost      2.5
  cpu_tuple_cost        0.08

as only this combination yielded the desired results. Hardly optimal, but
the real problem seems to lie with the correlation of the indexed columns
(see other post in this thread). If I encounter trouble with these
somewhere down the line, I'll post again. Hope this helps someone out
there.

--
Kilian Hagemann

Climate Systems Analysis Group
University of Cape Town
Republic of South Africa
Tel(w): ++27 21 650 2748
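In postgresql.conf form, the equivalent entries would read as follows
(a sketch; values copied from the post above, defaults noted for
reference):

    effective_cache_size = 40000   # default 1000; ~320MB at 8kB/page
    random_page_cost = 2.5         # default 4
    cpu_tuple_cost = 0.08          # default 0.01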
On Friday 22 July 2005 16:17, Tom Lane pondered:
> Pre-8.0 tends to underestimate the correlation of a multicolumn index.
> (8.0 may too, but not as much.)

I actually upgraded to 8.0.3 now and two things have changed. Firstly, I
don't need to do the annoying casts anymore, as the query planner now
recognises which literals are compatible with which indexes. Secondly, and
you're right here, 8.0 has decreased the gap between the index and
sequential scan cost estimates significantly, but not nearly sufficiently
to detect that the index scan is indeed superior.

> > Also, note that set_id is strictly increasing (hence correlation of 1)
> > and rec_time is strictly increasing within records with the same set_id.
>
> So the reason the indexscan is so good is that the ordering correlation
> is perfect. This isn't the planner's default assumption, and
> unfortunately we haven't got statistics available that would allow
> the correlation of a multicolumn index to be estimated well.

Hmm, what's wrong with using the 'correlation' column of pg_stats? It told
us straight away that the correlation on set_id was perfect. Even when
leaving out the condition on the second index column (rec_time), the query
planner thinks a sequential scan is more appropriate (please refer to the
text file in my other most recent post for more details).

May I file a bug report for this? I really think this points to a
deficiency in the query planner.

--
Kilian Hagemann

Climate Systems Analysis Group
University of Cape Town
Republic of South Africa
Tel(w): ++27 21 650 2748
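For reference, the statistic in question can be inspected directly -- a
sketch:

    SELECT attname, n_distinct, correlation
      FROM pg_stats
     WHERE tablename = 'speed';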