Thread: Index lookup on > and < criteria

Index lookup on > and < criteria

From

David Durham

Date:

01 November 2005, 17:17:50

Apologies if this questions is asked often.  I'm doing some select 
statements based on a timestamp field.  I have an index on the field, 
and when I use the '=' operator the index is used.  However, if I use 
the '>' or '<' operators, then it does a full table scan.  I've got 
around 6 million rows, so I would think that an index scan would be more 
appropriate.


Here are the statements I'm looking at:
   select * from myTable where myTimeStamp = '10/1/2005';

uses an index.
   select max(myTimeStamp) from myTable;   select * from myTable where myTimeStamp < '10/2/2005';   select * from
myTablewhere myTimeStamp < '10/2/2005' and myTimeStamp       >= '10/1/2005';
 

do not use indexes.  Can anyone point me to some info about what's going 
on?  I've started reading through the manual (chapter 13) which I think 
explains query optimizing, index usage etc.  It seems like this would be 
a common enough problem that it would have a relatively simple solution.  Thanks.


-Dave

Re: Index lookup on > and < criteria

From

"A. Kretschmer"

Date:

01 November 2005, 18:11:46

am  01.11.2005, um 12:18:19 -0600 mailte David Durham folgendes:
> Apologies if this questions is asked often.  I'm doing some select 
> statements based on a timestamp field.  I have an index on the field, and 
> when I use the '=' operator the index is used.  However, if I use the '>' 
> or '<' operators, then it does a full table scan.  I've got around 6 
> million rows, so I would think that an index scan would be more 
> appropriate.
> 
> 
> Here are the statements I'm looking at:
> 
>     select * from myTable where myTimeStamp = '10/1/2005';
> 
> uses an index.
> 
>     select max(myTimeStamp) from myTable;
>     select * from myTable where myTimeStamp < '10/2/2005';
>     select * from myTable where myTimeStamp < '10/2/2005' and myTimeStamp
>         >= '10/1/2005';
> 
> do not use indexes.  Can anyone point me to some info about what's going 
> on?  I've started reading through the manual (chapter 13) which I think 

Nice question. My guess:

The planner fels its better to use seq-scan. My test:

,----[  Test  ]
| Produktionsreport=# explain select * from bde_meldungen where zeitpunkt > '2005-08-30'::timestamp;
|                                  QUERY PLAN
| ----------------------------------------------------------------------------
|  Seq Scan on bde_meldungen  (cost=0.00..33862.46 rows=55106 width=139)
|    Filter: (zeitpunkt > '2005-08-30 00:00:00'::timestamp without time zone)
| (2 Zeilen)
|
| Produktionsreport=# explain select * from bde_meldungen where zeitpunkt > '2005-09-30'::timestamp;
|                                         QUERY PLAN
| -------------------------------------------------------------------------------------------
|  Index Scan using bde_zeitpunkt on bde_meldungen  (cost=0.00..8255.23 rows=9521 width=139)
|    Index Cond: (zeitpunkt > '2005-09-30 00:00:00'::timestamp without time zone)
| (2 Zeilen)
`----

Sorry about the german column names, 'zeitpunkt' is a timestamp. On the
first query the result set is estimeted 55.000 rows long -> seq-scan.
The second test: estimated to rows=9521 -> index scan.


Btw.: min/max cant use index, this is coming with 8.1. I'm using for
examples above 7.4.6.


PS.: you can use set ..., example:

,----[  Test with set enable_seqscan=... ]
| Produktionsreport=# set enable_seqscan=on;
| SET
| Produktionsreport=# explain analyse select * from bde_meldungen where zeitpunkt > '2005-08-30'::timestamp;
|                                                         QUERY PLAN
|
---------------------------------------------------------------------------------------------------------------------------
|  Seq Scan on bde_meldungen  (cost=0.00..33862.46 rows=55106 width=139) (actual time=2574.004..4892.563 rows=99915
loops=1)
|    Filter: (zeitpunkt > '2005-08-30 00:00:00'::timestamp without time zone)
|  Total runtime: 4971.179 ms
| (3 Zeilen)
|
| Produktionsreport=# set enable_seqscan=off;
| SET
| Produktionsreport=# explain analyse select * from bde_meldungen where zeitpunkt > '2005-08-30'::timestamp;
|                                                                   QUERY PLAN
|
-----------------------------------------------------------------------------------------------------------------------------------------------
|  Index Scan using bde_zeitpunkt on bde_meldungen  (cost=0.00..47679.39 rows=55106 width=139) (actual
time=57.387..1649.591rows=99915 loops=1)
 
|    Index Cond: (zeitpunkt > '2005-08-30 00:00:00'::timestamp without time zone)
|  Total runtime: 1729.420 ms
| (3 Zeilen)
`----


Now it using the index _and_ it is faster!


HTH, Andreas
-- 
Andreas Kretschmer    (Kontakt: siehe Header)
Heynitz:  035242/47212,      D1: 0160/7141639
GnuPG-ID 0x3FFF606C http://wwwkeys.de.pgp.net===    Schollglas Unternehmensgruppe    ===

Re: Index lookup on > and < criteria

From

Michael Fuhr

Date:

01 November 2005, 18:21:41

On Tue, Nov 01, 2005 at 12:18:19PM -0600, David Durham wrote:
> Apologies if this questions is asked often.  I'm doing some select 
> statements based on a timestamp field.  I have an index on the field, 
> and when I use the '=' operator the index is used.  However, if I use 
> the '>' or '<' operators, then it does a full table scan.  I've got 
> around 6 million rows, so I would think that an index scan would be more 
> appropriate.

No need to guess: run the queries with enable_seqscan disabled and
see if an index scan is indeed faster.

>    select max(myTimeStamp) from myTable;

In current releases min() and max() can't use indexes; search the
archives for numerous discussions of the reasons.  The workarounds
are, respectively:

SELECT myTimeStamp FROM myTable ORDER BY myTimeStamp LIMIT 1;
SELECT myTimeStamp FROM myTable ORDER BY myTimeStamp DESC LIMIT 1;

In 8.1 min() and max() are optimized to do the above.

>    select * from myTable where myTimeStamp < '10/2/2005';
>    select * from myTable where myTimeStamp < '10/2/2005' and myTimeStamp
>        >= '10/1/2005';

How many rows do these queries return?  If they return a significant
portion of the table then the planner might think that a sequential
scan would be faster than an index scan.  It would be useful to see
the EXPLAIN ANALYZE output of these queries so we can see how
accurate the planner's row count estimates are.

Has the table been vacuumed and analyzed?  If so, and if the planner's
row count estimates aren't close to the actual row counts, then you
might benefit from increasing the statistics target for the myTimeStamp
column.

How much memory do you have and what's your effective_cache_size
setting?  That's one of the settings that influences the planner's
decision.  Also, what version of PostgreSQL are you running?

BTW, pgsql-performance would be a more appropriate list to discuss
performance issues.

-- 
Michael Fuhr

Re: Index lookup on > and < criteria

From

David Durham

Date:

01 November 2005, 20:20:50

Michael Fuhr wrote:
> No need to guess: run the queries with enable_seqscan disabled and
> see if an index scan is indeed faster.
> 
> 
>>   select max(myTimeStamp) from myTable;
> 
> 
> In current releases min() and max() can't use indexes; search the
> archives for numerous discussions of the reasons.  The workarounds
> are, respectively:

Good to know.  Thanks all.

> How many rows do these queries return?  If they return a significant
> portion of the table then the planner might think that a sequential
> scan would be faster than an index scan.  It would be useful to see
> the EXPLAIN ANALYZE output of these queries so we can see how
> accurate the planner's row count estimates are.

Ok.  Looks like you guys caught me in a(n unintentional) lie.  Here goes:

sipcdr=# explain analyze select * from october_cdr_call where begin_time >= '10/1/2005' and begin_time < '10/4/2005';
QUERYPLAN 
 

--------------------------------------------------------------------------------
-------------------------------------------------------------------------- Index Scan using october_begin_time on
october_cdr_call
 
(cost=0.00..98383.82 r
ows=24594 width=568) (actual time=0.280..79274.579 rows=538592 loops=1)   Index Cond: ((begin_time >= '2005-10-01
00:00:00'::timestampwithout 
 
time zon
e) AND (begin_time < '2005-10-04 00:00:00'::timestamp without time zone)) Total runtime: 81457.938 ms
(3 rows)

sipcdr=# explain analyze select * from october_cdr_call where begin_time 
< '10/15/2005';               QUERY PLAN
--------------------------------------------------------------------------------
-------------------------------------------------- Seq Scan on october_cdr_call  (cost=0.00..273437.39 rows=1639584 
width=568) (ac
tual time=11.623..43681.396 rows=2609215 loops=1)   Filter: (begin_time < '2005-10-15 00:00:00'::timestamp without time

zone) Total runtime: 54366.944 ms
(3 rows)




> Has the table been vacuumed and analyzed?  

Brand new table that I haven't deleted anything from yet.


> If so, and if the planner's
> row count estimates aren't close to the actual row counts, then you
> might benefit from increasing the statistics target for the myTimeStamp
> column.

Ok, this is something that balances what might lead to overuse of the 
vacuum command?  I can just look that one up.


> How much memory do you have and what's your effective_cache_size
> setting?  

1.5 gig RAM, effective_cache_size is the default, so 1000.


> That's one of the settings that influences the planner's
> decision.  Also, what version of PostgreSQL are you running?

8.0.3


> BTW, pgsql-performance would be a more appropriate list to discuss
> performance issues.

Ok, I won't cross post this one, but I'll send the next one there.

Here's the final word on this, I think:

sipcdr=# set enable_seqscan=off;
SET
sipcdr=# explain analyze select * from october_cdr_call where begin_time 
< '10/15/2005';   QUERY PLAN 


---------------------------------------------------------------------------------------------------------------------------------------------------------------
IndexScan using october_begin_time on october_cdr_call 
 
(cost=0.00..6338044.65 rows=1639584 width=568) (actual 
time=51.454..355782.687 rows=2609215 loops=1)   Index Cond: (begin_time < '2005-10-15 00:00:00'::timestamp without 
time zone) Total runtime: 366289.918 ms


Thanks again,

-Dave

Re: Index lookup on > and < criteria

From

Michael Fuhr

Date:

02 November 2005, 00:08:08

On Tue, Nov 01, 2005 at 03:21:21PM -0600, David Durham wrote:
> sipcdr=# explain analyze select * from october_cdr_call where begin_time 
> >= '10/1/2005' and begin_time < '10/4/2005';
> 
>  QUERY PLAN 
> 
> --------------------------------------------------------------------------------
> --------------------------------------------------------------------------
>  Index Scan using october_begin_time on october_cdr_call 
> (cost=0.00..98383.82 r
> ows=24594 width=568) (actual time=0.280..79274.579 rows=538592 loops=1)
>    Index Cond: ((begin_time >= '2005-10-01 00:00:00'::timestamp without 
> time zon
> e) AND (begin_time < '2005-10-04 00:00:00'::timestamp without time zone))
>  Total runtime: 81457.938 ms
> (3 rows)

The estimated row count (24594) is much different than the actual
row count (538592), which makes me wonder if the statistics are up
to date.  Try running ANALYZE on the table and then see if the
estimate is more accurate.  With a more accurate estimate the planner
might choose a sequential scan, but the other queries you posted
suggest that a sequential scan is indeed faster when you're fetching
this much data.

> >Has the table been vacuumed and analyzed?  
> 
> Brand new table that I haven't deleted anything from yet.

The table should still be analyzed to update the planner's statistics.
The planner uses statistics to estimate how many rows a query will
return, and that influences the choice of plan.

> >How much memory do you have and what's your effective_cache_size
> >setting?  
> 
> 1.5 gig RAM, effective_cache_size is the default, so 1000.

You'd probably benefit from raising effective_cache_size to reflect
the amount of memory being used for disk cache, both by PostgreSQL
and by the operating system; you might also benefit from adjusting
other settings like shared_buffers.  See a tuning guide like the
following for advice:

http://www.powerpostgresql.com/PerfList

-- 
Michael Fuhr

Re: Index lookup on > and < criteria

From

David Durham

Date:

02 November 2005, 18:41:29

Yeah, analyze did make a difference.  See below.

>>--------------------------------------------------------------------------
>> Index Scan using october_begin_time on october_cdr_call 
>>(cost=0.00..98383.82 r
>>ows=24594 width=568) (actual time=0.280..79274.579 rows=538592 loops=1)
>>   Index Cond: ((begin_time >= '2005-10-01 00:00:00'::timestamp without 
>>time zon
>>e) AND (begin_time < '2005-10-04 00:00:00'::timestamp without time zone))
>> Total runtime: 81457.938 ms
>>(3 rows)
> 
> 
> The estimated row count (24594) is much different than the actual
> row count (538592), which makes me wonder if the statistics are up
> to date.  Try running ANALYZE on the table and then see if the
> estimate is more accurate.  With a more accurate estimate the planner
> might choose a sequential scan, but the other queries you posted
> suggest that a sequential scan is indeed faster when you're fetching
> this much data.

sipcdr=# analyze october_cdr_call;
ANALYZE

sipcdr=# explain analyze select * from october_cdr_call where begin_time >= '10/1/2005' and begin_time < '10/4/2005';
QUERY PLAN 


------------------------------------------------------------------------------------------------------------------------------------------------------
SeqScan on october_cdr_call  (cost=0.00..285695.68 rows=500922 
 
width=371) (actual time=54.510..50004.458 rows=538592 loops=1)   Filter: ((begin_time >= '2005-10-01
00:00:00'::timestampwithout 
 
time zone) AND (begin_time < '2005-10-04 00:00:00'::timestamp without 
time zone)) Total runtime: 52335.126 ms


-Dave