Thread: Index lookup on > and < criteria
Apologies if this questions is asked often. I'm doing some select statements based on a timestamp field. I have an index on the field, and when I use the '=' operator the index is used. However, if I use the '>' or '<' operators, then it does a full table scan. I've got around 6 million rows, so I would think that an index scan would be more appropriate. Here are the statements I'm looking at: select * from myTable where myTimeStamp = '10/1/2005'; uses an index. select max(myTimeStamp) from myTable; select * from myTable where myTimeStamp < '10/2/2005'; select * from myTablewhere myTimeStamp < '10/2/2005' and myTimeStamp >= '10/1/2005'; do not use indexes. Can anyone point me to some info about what's going on? I've started reading through the manual (chapter 13) which I think explains query optimizing, index usage etc. It seems like this would be a common enough problem that it would have a relatively simple solution. Thanks. -Dave
am 01.11.2005, um 12:18:19 -0600 mailte David Durham folgendes: > Apologies if this questions is asked often. I'm doing some select > statements based on a timestamp field. I have an index on the field, and > when I use the '=' operator the index is used. However, if I use the '>' > or '<' operators, then it does a full table scan. I've got around 6 > million rows, so I would think that an index scan would be more > appropriate. > > > Here are the statements I'm looking at: > > select * from myTable where myTimeStamp = '10/1/2005'; > > uses an index. > > select max(myTimeStamp) from myTable; > select * from myTable where myTimeStamp < '10/2/2005'; > select * from myTable where myTimeStamp < '10/2/2005' and myTimeStamp > >= '10/1/2005'; > > do not use indexes. Can anyone point me to some info about what's going > on? I've started reading through the manual (chapter 13) which I think Nice question. My guess: The planner fels its better to use seq-scan. My test: ,----[ Test ] | Produktionsreport=# explain select * from bde_meldungen where zeitpunkt > '2005-08-30'::timestamp; | QUERY PLAN | ---------------------------------------------------------------------------- | Seq Scan on bde_meldungen (cost=0.00..33862.46 rows=55106 width=139) | Filter: (zeitpunkt > '2005-08-30 00:00:00'::timestamp without time zone) | (2 Zeilen) | | Produktionsreport=# explain select * from bde_meldungen where zeitpunkt > '2005-09-30'::timestamp; | QUERY PLAN | ------------------------------------------------------------------------------------------- | Index Scan using bde_zeitpunkt on bde_meldungen (cost=0.00..8255.23 rows=9521 width=139) | Index Cond: (zeitpunkt > '2005-09-30 00:00:00'::timestamp without time zone) | (2 Zeilen) `---- Sorry about the german column names, 'zeitpunkt' is a timestamp. On the first query the result set is estimeted 55.000 rows long -> seq-scan. The second test: estimated to rows=9521 -> index scan. Btw.: min/max cant use index, this is coming with 8.1. I'm using for examples above 7.4.6. PS.: you can use set ..., example: ,----[ Test with set enable_seqscan=... ] | Produktionsreport=# set enable_seqscan=on; | SET | Produktionsreport=# explain analyse select * from bde_meldungen where zeitpunkt > '2005-08-30'::timestamp; | QUERY PLAN | --------------------------------------------------------------------------------------------------------------------------- | Seq Scan on bde_meldungen (cost=0.00..33862.46 rows=55106 width=139) (actual time=2574.004..4892.563 rows=99915 loops=1) | Filter: (zeitpunkt > '2005-08-30 00:00:00'::timestamp without time zone) | Total runtime: 4971.179 ms | (3 Zeilen) | | Produktionsreport=# set enable_seqscan=off; | SET | Produktionsreport=# explain analyse select * from bde_meldungen where zeitpunkt > '2005-08-30'::timestamp; | QUERY PLAN | ----------------------------------------------------------------------------------------------------------------------------------------------- | Index Scan using bde_zeitpunkt on bde_meldungen (cost=0.00..47679.39 rows=55106 width=139) (actual time=57.387..1649.591rows=99915 loops=1) | Index Cond: (zeitpunkt > '2005-08-30 00:00:00'::timestamp without time zone) | Total runtime: 1729.420 ms | (3 Zeilen) `---- Now it using the index _and_ it is faster! HTH, Andreas -- Andreas Kretschmer (Kontakt: siehe Header) Heynitz: 035242/47212, D1: 0160/7141639 GnuPG-ID 0x3FFF606C http://wwwkeys.de.pgp.net=== Schollglas Unternehmensgruppe ===
On Tue, Nov 01, 2005 at 12:18:19PM -0600, David Durham wrote: > Apologies if this questions is asked often. I'm doing some select > statements based on a timestamp field. I have an index on the field, > and when I use the '=' operator the index is used. However, if I use > the '>' or '<' operators, then it does a full table scan. I've got > around 6 million rows, so I would think that an index scan would be more > appropriate. No need to guess: run the queries with enable_seqscan disabled and see if an index scan is indeed faster. > select max(myTimeStamp) from myTable; In current releases min() and max() can't use indexes; search the archives for numerous discussions of the reasons. The workarounds are, respectively: SELECT myTimeStamp FROM myTable ORDER BY myTimeStamp LIMIT 1; SELECT myTimeStamp FROM myTable ORDER BY myTimeStamp DESC LIMIT 1; In 8.1 min() and max() are optimized to do the above. > select * from myTable where myTimeStamp < '10/2/2005'; > select * from myTable where myTimeStamp < '10/2/2005' and myTimeStamp > >= '10/1/2005'; How many rows do these queries return? If they return a significant portion of the table then the planner might think that a sequential scan would be faster than an index scan. It would be useful to see the EXPLAIN ANALYZE output of these queries so we can see how accurate the planner's row count estimates are. Has the table been vacuumed and analyzed? If so, and if the planner's row count estimates aren't close to the actual row counts, then you might benefit from increasing the statistics target for the myTimeStamp column. How much memory do you have and what's your effective_cache_size setting? That's one of the settings that influences the planner's decision. Also, what version of PostgreSQL are you running? BTW, pgsql-performance would be a more appropriate list to discuss performance issues. -- Michael Fuhr
Michael Fuhr wrote: > No need to guess: run the queries with enable_seqscan disabled and > see if an index scan is indeed faster. > > >> select max(myTimeStamp) from myTable; > > > In current releases min() and max() can't use indexes; search the > archives for numerous discussions of the reasons. The workarounds > are, respectively: Good to know. Thanks all. > How many rows do these queries return? If they return a significant > portion of the table then the planner might think that a sequential > scan would be faster than an index scan. It would be useful to see > the EXPLAIN ANALYZE output of these queries so we can see how > accurate the planner's row count estimates are. Ok. Looks like you guys caught me in a(n unintentional) lie. Here goes: sipcdr=# explain analyze select * from october_cdr_call where begin_time >= '10/1/2005' and begin_time < '10/4/2005'; QUERYPLAN -------------------------------------------------------------------------------- -------------------------------------------------------------------------- Index Scan using october_begin_time on october_cdr_call (cost=0.00..98383.82 r ows=24594 width=568) (actual time=0.280..79274.579 rows=538592 loops=1) Index Cond: ((begin_time >= '2005-10-01 00:00:00'::timestampwithout time zon e) AND (begin_time < '2005-10-04 00:00:00'::timestamp without time zone)) Total runtime: 81457.938 ms (3 rows) sipcdr=# explain analyze select * from october_cdr_call where begin_time < '10/15/2005'; QUERY PLAN -------------------------------------------------------------------------------- -------------------------------------------------- Seq Scan on october_cdr_call (cost=0.00..273437.39 rows=1639584 width=568) (ac tual time=11.623..43681.396 rows=2609215 loops=1) Filter: (begin_time < '2005-10-15 00:00:00'::timestamp without time zone) Total runtime: 54366.944 ms (3 rows) > Has the table been vacuumed and analyzed? Brand new table that I haven't deleted anything from yet. > If so, and if the planner's > row count estimates aren't close to the actual row counts, then you > might benefit from increasing the statistics target for the myTimeStamp > column. Ok, this is something that balances what might lead to overuse of the vacuum command? I can just look that one up. > How much memory do you have and what's your effective_cache_size > setting? 1.5 gig RAM, effective_cache_size is the default, so 1000. > That's one of the settings that influences the planner's > decision. Also, what version of PostgreSQL are you running? 8.0.3 > BTW, pgsql-performance would be a more appropriate list to discuss > performance issues. Ok, I won't cross post this one, but I'll send the next one there. Here's the final word on this, I think: sipcdr=# set enable_seqscan=off; SET sipcdr=# explain analyze select * from october_cdr_call where begin_time < '10/15/2005'; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------------------------- IndexScan using october_begin_time on october_cdr_call (cost=0.00..6338044.65 rows=1639584 width=568) (actual time=51.454..355782.687 rows=2609215 loops=1) Index Cond: (begin_time < '2005-10-15 00:00:00'::timestamp without time zone) Total runtime: 366289.918 ms Thanks again, -Dave
On Tue, Nov 01, 2005 at 03:21:21PM -0600, David Durham wrote: > sipcdr=# explain analyze select * from october_cdr_call where begin_time > >= '10/1/2005' and begin_time < '10/4/2005'; > > QUERY PLAN > > -------------------------------------------------------------------------------- > -------------------------------------------------------------------------- > Index Scan using october_begin_time on october_cdr_call > (cost=0.00..98383.82 r > ows=24594 width=568) (actual time=0.280..79274.579 rows=538592 loops=1) > Index Cond: ((begin_time >= '2005-10-01 00:00:00'::timestamp without > time zon > e) AND (begin_time < '2005-10-04 00:00:00'::timestamp without time zone)) > Total runtime: 81457.938 ms > (3 rows) The estimated row count (24594) is much different than the actual row count (538592), which makes me wonder if the statistics are up to date. Try running ANALYZE on the table and then see if the estimate is more accurate. With a more accurate estimate the planner might choose a sequential scan, but the other queries you posted suggest that a sequential scan is indeed faster when you're fetching this much data. > >Has the table been vacuumed and analyzed? > > Brand new table that I haven't deleted anything from yet. The table should still be analyzed to update the planner's statistics. The planner uses statistics to estimate how many rows a query will return, and that influences the choice of plan. > >How much memory do you have and what's your effective_cache_size > >setting? > > 1.5 gig RAM, effective_cache_size is the default, so 1000. You'd probably benefit from raising effective_cache_size to reflect the amount of memory being used for disk cache, both by PostgreSQL and by the operating system; you might also benefit from adjusting other settings like shared_buffers. See a tuning guide like the following for advice: http://www.powerpostgresql.com/PerfList -- Michael Fuhr
Yeah, analyze did make a difference. See below. >>-------------------------------------------------------------------------- >> Index Scan using october_begin_time on october_cdr_call >>(cost=0.00..98383.82 r >>ows=24594 width=568) (actual time=0.280..79274.579 rows=538592 loops=1) >> Index Cond: ((begin_time >= '2005-10-01 00:00:00'::timestamp without >>time zon >>e) AND (begin_time < '2005-10-04 00:00:00'::timestamp without time zone)) >> Total runtime: 81457.938 ms >>(3 rows) > > > The estimated row count (24594) is much different than the actual > row count (538592), which makes me wonder if the statistics are up > to date. Try running ANALYZE on the table and then see if the > estimate is more accurate. With a more accurate estimate the planner > might choose a sequential scan, but the other queries you posted > suggest that a sequential scan is indeed faster when you're fetching > this much data. sipcdr=# analyze october_cdr_call; ANALYZE sipcdr=# explain analyze select * from october_cdr_call where begin_time >= '10/1/2005' and begin_time < '10/4/2005'; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------ SeqScan on october_cdr_call (cost=0.00..285695.68 rows=500922 width=371) (actual time=54.510..50004.458 rows=538592 loops=1) Filter: ((begin_time >= '2005-10-01 00:00:00'::timestampwithout time zone) AND (begin_time < '2005-10-04 00:00:00'::timestamp without time zone)) Total runtime: 52335.126 ms -Dave