Thread: Queries not using Index

Queries not using Index

From
Daryl Herzmann
Date:
Greetings,

I suppose I should have sent this to pgsql-bugs maybe?  I would appreciate 
it if anybody could help me out.  I can't figure out what is going on 
here...

snet=# select version();                          version                           
-------------------------------------------------------------PostgreSQL 7.2.1 on i686-pc-linux-gnu, compiled by GCC
2.96

snet=# \d t2002_06               Table "t2002_06"Column  |           Type           | Modifiers 
---------+--------------------------+-----------station | character varying(5)     | valid   | timestamp with time zone
|tmpf    | smallint                 | dwpf    | smallint                 | drct    | smallint                 | sknt
|real                     | pday    | real                     | pmonth  | real                     | srad    | real
                | relh    | real                     | alti    | real                     | 
 
Indexes: t2002_06_station_idx,        t2002_06_tmpf_idx

snet=# \d t2002_06_station_idx; Index "t2002_06_station_idx"Column  |         Type         
---------+----------------------station | character varying(5)

snet=# select count(valid) from t2002_06; count  
---------1513895

snet=# vacuum analyze t2002_06;
VACUUM

snet=# vacuum t2002_06;
VACUUM

snet=# explain SELECT * from t2002_06 WHERE station = 'SGLI4';
NOTICE:  QUERY PLAN:

Seq Scan on t2002_06  (cost=0.00..35379.69 rows=35564 width=47)

EXPLAIN


Shouldn't this be an index scan?  I hope that I am not doing something 
stupid, although I probably am :(  

I even just tried this

drop index t2002_06_station_idx;
vacuum analyze t2002_06;
create index t2002_06_station_idx on t2002_06(station);
vacuum analyze t2002_06;

And I still get a Seq Scan.  Augh....

Thanks, Daryl




Re: Queries not using Index

From
"Gaetano Mendola"
Date:
"Daryl Herzmann" <akrherz@iastate.edu> wrote:
> snet=# select count(valid) from t2002_06;
>   count  
> ---------
>  1513895

> snet=# explain SELECT * from t2002_06 WHERE station = 'SGLI4';
> NOTICE:  QUERY PLAN:
> 
> Seq Scan on t2002_06  (cost=0.00..35379.69 rows=35564 width=47)


Can you do the following query for better understand your situation ?

select count(*) from t2002_06 where station = 'SGLI4';

select count(*) from t2002_06;




Ciao 
Gaetano.



Re: Queries not using Index

From
Daryl Herzmann
Date:
>On Tue, 23 Jul 2002, Daryl Herzmann wrote:
>
>> snet=# explain analyze select * from t2002_06 WHERE station = 'SAMI4';
>> NOTICE:  QUERY PLAN:
>>
>> Seq Scan on t2002_06  (cost=0.00..35379.69 rows=34979 width=47) (actual
>> time=67.89..3734.93 rows=38146 loops=1)
>> Total runtime: 3748.33 msec
>>
>> EXPLAIN
>>
>> snet=# set enable_seqscan=off;
>> SET VARIABLE
>> snet=# explain analyze select * from t2002_06 WHERE station = 'SAMI4';
>> NOTICE:  QUERY PLAN:
>>
>> Index Scan using t2002_06_station_idx on t2002_06  (cost=0.00..132124.96
>> rows=34979 width=47) (actual time=72.03..298.85 rows=38146 loops=1)
>> Total runtime: 317.76 msec
>
>Looks like the estimated cost is way divorced from reality.  Is the
>34979 row estimate even realistic and how well ordered is the table
>(actually output from pg_statistic would be good as well :) ).

Thanks for the help! I am not sure if I can answer your questions.  I will 
try :)

I believe the row estimate is realistic based on this value.

snet=# select count(*) from t2002_06 WHERE station = 'SAMI4';count 
-------38146

I am really sorry, but I don't know what to output from pg_statistic. I 
searched around on the Internet and was not sure what to send you from 
this table.  Sorry :(

Thanks! Daryl




Re: Queries not using Index

From
Stephan Szabo
Date:
On Tue, 23 Jul 2002, Daryl Herzmann wrote:

>
> >On Tue, 23 Jul 2002, Daryl Herzmann wrote:
> >
> >> snet=# explain analyze select * from t2002_06 WHERE station = 'SAMI4';
> >> NOTICE:  QUERY PLAN:
> >>
> >> Seq Scan on t2002_06  (cost=0.00..35379.69 rows=34979 width=47) (actual
> >> time=67.89..3734.93 rows=38146 loops=1)
> >> Total runtime: 3748.33 msec
> >>
> >> EXPLAIN
> >>
> >> snet=# set enable_seqscan=off;
> >> SET VARIABLE
> >> snet=# explain analyze select * from t2002_06 WHERE station = 'SAMI4';
> >> NOTICE:  QUERY PLAN:
> >>
> >> Index Scan using t2002_06_station_idx on t2002_06  (cost=0.00..132124.96
> >> rows=34979 width=47) (actual time=72.03..298.85 rows=38146 loops=1)
> >> Total runtime: 317.76 msec
> >
> >Looks like the estimated cost is way divorced from reality.  Is the
> >34979 row estimate even realistic and how well ordered is the table
> >(actually output from pg_statistic would be good as well :) ).
>
> Thanks for the help! I am not sure if I can answer your questions.  I will
> try :)
>
> I believe the row estimate is realistic based on this value.
>
> snet=# select count(*) from t2002_06 WHERE station = 'SAMI4';
>  count
> -------
>  38146

> I am really sorry, but I don't know what to output from pg_statistic. I
> searched around on the Internet and was not sure what to send you from
> this table.  Sorry :(

Right... sorry about that...
select * from pg_statistic where starelid=(select oid from pg_classwhere relname='t2002_06';




Re: Queries not using Index

From
Daryl Herzmann
Date:
Hi!
Thanks for the continued help.

I have attached the results of your request.  Thank you!

Daryl

>Right... sorry about that...
>select * from pg_statistic where starelid=(select oid from pg_class
> where relname='t2002_06';

Re: Queries not using Index

From
Stephan Szabo
Date:
On Tue, 23 Jul 2002, Daryl Herzmann wrote:

> Hi!
>
> Thanks for the continued help.
>
> I have attached the results of your request.  Thank you!

Hmm, when the data was put in, was it put in where the same value
would be bunched up?  IIRC that's a case the optimizer won't realize
if the data isn't ordered but merely bunched together that'd cause
it to over-estimate the cost of an index scan. Clustering on the index
might help, but cluster drops alot of info about the table, so you
have to be careful.




Re: Queries not using Index

From
Daryl Herzmann
Date:
Good evening.

On Tue, 23 Jul 2002, Stephan Szabo wrote:

>Hmm, when the data was put in, was it put in where the same value
>would be bunched up? 

I inserted the data via 30 "COPY t2002_06 from stdin" (one per day)  So it 
was grouped by station and then day for each insert.  (My script dumped 
the data from each station for the day and then repeated for each station 
and then finally dumped the entire day into the DB.  Are you saying that 
this process has tricked pgsql into not believing it needs to use an 
INDEX?  Sorry for my ignorance here.  I have done similar processes with 
PG7.1.2 and it seemed to use the INDEX.  

In fact, I just repeated the dumping scripts on a machine with 7.1.2 and 
the "explain select" reports to be using the Index Scan.  Hmmmm


>IIRC that's a case the optimizer won't realize if the data isn't ordered
>but merely bunched together that'd cause it to over-estimate the cost of
>an index scan. Clustering on the index might help, but cluster drops alot
>of info about the table, so you have to be careful.

Thanks for the info.  I am off to read about how to do clustering!

Thanks! Daryl




Re: Queries not using Index

From
"Christopher Kings-Lynne"
Date:
> I inserted the data via 30 "COPY t2002_06 from stdin" (one per
> day)  So it
> was grouped by station and then day for each insert.  (My script dumped
> the data from each station for the day and then repeated for each station
> and then finally dumped the entire day into the DB.  Are you saying that
> this process has tricked pgsql into not believing it needs to use an
> INDEX?  Sorry for my ignorance here.  I have done similar processes with
> PG7.1.2 and it seemed to use the INDEX.
>
> In fact, I just repeated the dumping scripts on a machine with 7.1.2 and
> the "explain select" reports to be using the Index Scan.  Hmmmm

You _have_ actually run ANALYZE on the table, right?

Chris



Re: Queries not using Index

From
Daryl Herzmann
Date:
Hi,

>You _have_ actually run ANALYZE on the table, right?

snet=# vacuum analyze t2002_06;
VACUUM
snet=# vacuum analyze;
VACUUM
snet=# explain analyze select * from t2002_06 WHERE station = 'SAMI4';
NOTICE:  QUERY PLAN:

Seq Scan on t2002_06  (cost=0.00..35379.69 rows=35169 width=47) (actual 
time=20.51..1717.78 rows=38146 loops=1)
Total runtime: 1730.63 msec

EXPLAIN
snet=# set enable_seqscan=off;
SET VARIABLE
snet=# explain analyze select * from t2002_06 WHERE station = 'SAMI4';
NOTICE:  QUERY PLAN:

Index Scan using t2002_06_station_idx on t2002_06  (cost=0.00..132773.85 
rows=35169 width=47) (actual time=74.86..299.53 rows=38146 loops=1)
Total runtime: 313.42 msec

EXPLAIN

Any thoughts?  I am sorry to be causing all this trouble.  I just want my 
queries to voom-voom!!  Interestingly enough, I see that the SEQ SCAN is 
now estimated at 1730.63, when I first posted to this list, it was 3900.00 
or so. Errrr

Thanks, Daryl




Re: Queries not using Index

From
"Christopher Kings-Lynne"
Date:
Have you tried playing with the statistics gatherer?

>From the ANALYZE docs:

"The extent of analysis can be controlled by adjusting the per-column
statistics target with ALTER TABLE ALTER COLUMN SET STATISTICS (see ALTER
TABLE). The target value sets the maximum number of entries in the
most-common-value list and the maximum number of bins in the histogram. The
default target value is 10, but this can be adjusted up or down to trade off
accuracy of planner estimates against the time taken for ANALYZE and the
amount of space occupied in pg_statistic. In particular, setting the
statistics target to zero disables collection of statistics for that column.
It may be useful to do that for columns that are never used as part of the
WHERE, GROUP BY, or ORDER BY clauses of queries, since the planner will have
no use for statistics on such columns. "

Just a thought...

Also, what is the result of:

select indexdef from pg_indexes where indexname='t2002_06_station_idx';

> Any thoughts?  I am sorry to be causing all this trouble.  I just want my
> queries to voom-voom!!  Interestingly enough, I see that the SEQ SCAN is
> now estimated at 1730.63, when I first posted to this list, it
> was 3900.00
> or so. Errrr

It's no trouble.  Cases where the planner fails are essential to improving
the planner.  Ideally this query should use your index automatically...

Chris



Re: Queries not using Index

From
Daryl Herzmann
Date:
Hi! :)

On Wed, 24 Jul 2002, Christopher Kings-Lynne wrote:

>Have you tried playing with the statistics gatherer?

Nope.  I will look at the docs some and play around.  This machine is not 
fully production yet. :)

>Also, what is the result of:
>select indexdef from pg_indexes where indexname='t2002_06_station_idx';

snet=# select indexdef from pg_indexes where 
indexname='t2002_06_station_idx';                             indexdef                               
---------------------------------------------------------------------CREATE INDEX t2002_06_station_idx ON t2002_06
USINGbtree (station)
 


>It's no trouble.  Cases where the planner fails are essential to improving
>the planner.  Ideally this query should use your index automatically...

Thanks!  I have the ~exact~ same database on another machine with PG 7.1.2 
and it uses the Index Scan without tweaking.  I have never had troubles 
with indexes up until this table/database.

I have got another database on the same 7.2.1 machine.  It has very 
similar data and the same index.  It has no trouble defaulting to use the 
Index.

awos=# explain analyze select * from t1999_06 WHERE station = 'SAMI4';
NOTICE:  QUERY PLAN:

Index Scan using t1999_06_stn_idx on t1999_06  (cost=0.00..25859.88 
rows=36544 width=53) (actual time=152.94..152.94 rows=0 loops=1)
Total runtime: 153.03 msec

EXPLAIN

Augh.  Puzzling.  Thanks everyone for the help!  You all rock!

Daryl




Re: Queries not using Index

From
Daryl Herzmann
Date:
Hi!

Thanks for your help!

On Tue, 23 Jul 2002, Gaetano Mendola wrote:

>"Daryl Herzmann" <akrherz@iastate.edu> wrote:
>> snet=# select count(valid) from t2002_06;
>>   count  
>> ---------
>>  1513895
>
>> snet=# explain SELECT * from t2002_06 WHERE station = 'SGLI4';
>> NOTICE:  QUERY PLAN:
>> 
>> Seq Scan on t2002_06  (cost=0.00..35379.69 rows=35564 width=47)
>
>
>Can you do the following query for better understand your situation ?
>
>select count(*) from t2002_06 where station = 'SGLI4';

snet=# select count(*) from t2002_06 where station = 'SGLI4';count 
-------39319


>select count(*) from t2002_06;

snet=# select count(*) from t2002_06; count  
---------1513895

In another email, it was suggested that I do this...

snet=# explain analyze select * from t2002_06 WHERE station = 'SAMI4';
NOTICE:  QUERY PLAN:

Seq Scan on t2002_06  (cost=0.00..35379.69 rows=34979 width=47) (actual 
time=67.89..3734.93 rows=38146 loops=1)
Total runtime: 3748.33 msec

EXPLAIN


snet=# set enable_seqscan=off;
SET VARIABLE
snet=# explain analyze select * from t2002_06 WHERE station = 'SAMI4';
NOTICE:  QUERY PLAN:

Index Scan using t2002_06_station_idx on t2002_06  (cost=0.00..132124.96 
rows=34979 width=47) (actual time=72.03..298.85 rows=38146 loops=1)
Total runtime: 317.76 msec

EXPLAIN

Thanks so much! Daryl