Thread: Index of a table is not used (in any case)

Index of a table is not used (in any case)

From
Reiner Dassing
Date:
Hello PostgreSQL Users!

PostgreSQL V 7.1.1:

I have defined a table and the necessary indices.
But the index is not used in every SELECT. (Therefore, the selects are
*very* slow, due to a seq scan on
20 million entries; this is only a test setup so far.)

The definitions can be seen in the annex.

Does somebody know the reason, and how to circumvent the seq scan?

Is the order of index creation relevant? I.e., should I create the
indices before inserting the entries, or the other way around?

Should a hashing index be used? (I tried this, but I got the known error
"Out of overflow pages")
(The docu on "create index" says:

   "Notes

    The Postgres query optimizer will consider using a btree index whenever
    an indexed attribute is involved in a comparison using one of:
    <, <=, =, >=, >

    The Postgres query optimizer will consider using an rtree index whenever
    an indexed attribute is involved in a comparison using one of:
    <<, &<, &>, >>, @, ~=, &&

    The Postgres query optimizer will consider using a hash index whenever
    an indexed attribute is involved in a comparison using the = operator.")


The table column 'epoche' is used in two different indices. Should that
be avoided?

Any suggestions are welcome.

Thank you in advance.
Reiner
------------------------------
Annex:
======

Table:
------
\d wetter
             Table "wetter"
 Attribute |           Type           | Modifier
-----------+--------------------------+----------
 sensor_id | integer                  | not null
 epoche    | timestamp with time zone | not null
 wert      | real                     | not null
Indices: wetter_epoche_idx,
         wetter_pkey

\d wetter_epoche_idx
   Index "wetter_epoche_idx"
 Attribute |           Type
-----------+--------------------------
 epoche    | timestamp with time zone
btree

\d wetter_pkey
       Index "wetter_pkey"
 Attribute |           Type
-----------+--------------------------
 sensor_id | integer
 epoche    | timestamp with time zone
unique btree (primary key)


Select where index is used:
============================
explain select * from wetter order by epoche desc;    
NOTICE:  QUERY PLAN:

Index Scan Backward using wetter_epoche_idx on wetter 
(cost=0.00..3216018.59 rows=20340000 width=16)

EXPLAIN



Select where the index is NOT used:
===================================
explain select * from wetter where epoche between '1970-01-01' and
'1980-01-01' order by epoche asc;
NOTICE:  QUERY PLAN:

Sort  (cost=480705.74..480705.74 rows=203400 width=16)
  ->  Seq Scan on wetter  (cost=0.00..454852.00 rows=203400 width=16)

EXPLAIN

--
Mit freundlichen Gruessen / With best regards  Reiner Dassing


Re: Index of a table is not used (in any case)

From
"Christopher Kings-Lynne"
Date:
> Hello PostgreSQL Users!
>
> PostgreSQL V 7.1.1:

You should upgrade to 7.1.3 at some point...

> I have defined a table and the necessary indices.
> But the index is not used in every SELECT. (Therefore, the selects are
> *very* slow, due to a seq scan on
> 20 million entries; this is only a test setup so far.)
>
> The definitions can be seen in the annex.
>
> Does somebody know the reason, and how to circumvent the seq scan?

Yes. You probably have not run 'VACUUM ANALYZE' on your large table.
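
For example (using your table from the annex; just a sketch):

    vacuum analyze wetter;    -- gathers the planner statistics
    explain select * from wetter where epoche between '1970-01-01'
        and '1980-01-01' order by epoche asc;    -- then re-check the plan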

> Is the order of index creation relevant? I.e., should I create the
> indices before inserting the entries, or the other way around?

If you are inserting a great many entries, insert the data first and then
create the indices - it will be much faster this way.
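
For a bulk load, the rough order would be (sketch only; the COPY path
is just a placeholder):

    copy wetter from '/path/to/data';                   -- load data first
    create index wetter_epoche_idx on wetter (epoche);  -- then build indices
    vacuum analyze wetter;                              -- then analyze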

> Should a hashing index be used? (I tried this, but I got the known error
> "Out of overflow pages")

Just do the default CREATE INDEX - btree should be fine... (probably)
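
That is, simply:

    create index wetter_epoche_idx on wetter (epoche);  -- btree is the default

rather than the "using hash" variant that produced the overflow error.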

> The table column 'epoche' is used in two different indices. Should that
> be avoided?

It's not a problem, but just check your EXPLAIN output after the VACUUM
to make sure you have them right.

Chris



Re: Index of a table is not used (in any case)

From
Doug McNaught
Date:
Reiner Dassing <dassing@wettzell.ifag.de> writes:

> Hello PostgreSQL Users!
> 
> PostgreSQL V 7.1.1:
> 
> I have defined a table and the necessary indices.
> But the index is not used in every SELECT. (Therefore, the selects are
> *very* slow, due to a seq scan on
> 20 million entries; this is only a test setup so far.)

Perennial first question: did you VACUUM ANALYZE?

-Doug
-- 
Let us cross over the river, and rest under the shade of the trees.  --T. J. Jackson, 1863


Re: Index of a table is not used (in any case)

From
mlw
Date:
Doug McNaught wrote:

> Reiner Dassing <dassing@wettzell.ifag.de> writes:
>
> > Hello PostgreSQL Users!
> >
> > PostgreSQL V 7.1.1:
> >
> > I have defined a table and the necessary indices.
> > But the index is not used in every SELECT. (Therefore, the selects are
> > *very* slow, due to a seq scan on
> > 20 million entries; this is only a test setup so far.)
>
> Perennial first question: did you VACUUM ANALYZE?

Can there, or could there, be a notion of "rule based" optimization of
queries in PostgreSQL? The "not using index" problem is probably the most
common and most misunderstood problem.




Re: [SQL] Index of a table is not used (in any case)

From
Tom Lane
Date:
Reiner Dassing <dassing@wettzell.ifag.de> writes:
> explain select * from wetter order by epoche desc;    
> NOTICE:  QUERY PLAN:

> Index Scan Backward using wetter_epoche_idx on wetter 
> (cost=0.00..3216018.59 rows=20340000 width=16)

> explain select * from wetter where epoche between '1970-01-01' and
> '1980-01-01' order by epoche asc;
> NOTICE:  QUERY PLAN:

> Sort  (cost=480705.74..480705.74 rows=203400 width=16)
>   ->  Seq Scan on wetter  (cost=0.00..454852.00 rows=203400 width=16)

It's hard to believe that you've done a VACUUM ANALYZE on this table,
since you are getting a selectivity estimate of exactly 0.01, which
just happens to be the default selectivity estimate for range queries.
How many rows are there really in this date range?
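You could check with, say:

    select count(*) from wetter
        where epoche between '1970-01-01' and '1980-01-01';

(The rows=203400 estimate out of 20340000 total is exactly that
default 0.01.)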

Anyway, the reason the planner is picking a seqscan+sort is that it
thinks that will be faster than an indexscan.  It's not necessarily
wrong.  Have you compared the explain output and actual timings both
ways?  (Use "set enable_seqscan to off" to force it to pick an indexscan
for testing purposes.)
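
For instance:

    set enable_seqscan to off;
    explain select * from wetter where epoche between '1970-01-01'
        and '1980-01-01' order by epoche asc;
    set enable_seqscan to on;    -- restore the default afterwards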
        regards, tom lane


Re: Index of a table is not used (in any case)

From
"Zeugswetter Andreas SB SD"
Date:
> > > *very* slow, due to a seq scan on
> > > 20 million entries; this is only a test setup so far.)
> >
> > Perennial first question: did you VACUUM ANALYZE?
> 
> Can there, or could there, be a notion of "rule based" optimization of
> queries in PostgreSQL? The "not using index" problem is probably the
> most common and most misunderstood problem.

There is a (sort of) rule based behavior in PostgreSQL; the downside
of the current implementation is that certain commands other than
ANALYZE (e.g. "create index") partly update optimizer statistics.
This is bad behavior, since then only part of the statistics are
accurate. Statistics always have to be seen in the context of other
tables' and other indexes' statistics.

Thus, currently the rule based optimizer only works if you create the
indexes on empty tables (before loading data), which obviously has
downsides. Otherwise you have no choice but to ANALYZE frequently.
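
(I.e. after something like

    create index wetter_epoche_idx on wetter (epoche);

the table's reltuples/relpages counts in pg_class get refreshed, while
the pg_statistic data stays stale until the next ANALYZE.)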

I have tried hard to fight for this pseudo rule based behavior,
but was only partly successful in convincing core. My opinion is
that (unless runtime statistics are kept) no command other than
ANALYZE should be allowed to touch optimizer-relevant statistics
(maybe unless explicitly told to).

Andreas


Re: Index of a table is not used (in any case)

From
mlw
Date:
Zeugswetter Andreas SB SD wrote:
> 
> > > > *very* slow, due to a seq scan on
> > > > 20 million entries; this is only a test setup so far.)
> > >
> > > Perennial first question: did you VACUUM ANALYZE?
> >
> > Can there, or could there, be a notion of "rule based" optimization of
> > queries in PostgreSQL? The "not using index" problem is probably the
> > most common and most misunderstood problem.
> 
> There is a (sort of) rule based behavior in PostgreSQL; the downside
> of the current implementation is that certain commands other than
> ANALYZE (e.g. "create index") partly update optimizer statistics.
> This is bad behavior, since then only part of the statistics are
> accurate. Statistics always have to be seen in the context of other
> tables' and other indexes' statistics.
> 
> Thus, currently the rule based optimizer only works if you create the
> indexes on empty tables (before loading data), which obviously has
> downsides. Otherwise you have no choice but to ANALYZE frequently.
> 
> I have tried hard to fight for this pseudo rule based behavior,
> but was only partly successful in convincing core. My opinion is
> that (unless runtime statistics are kept) no command other than
> ANALYZE should be allowed to touch optimizer-relevant statistics
> (maybe unless explicitly told to).

Perhaps there could be an extension to ANALYZE, i.e. ANALYZE RULEBASED
tablename, that would restore or recalculate the state a table would be
in if all its indexes had been created from scratch?

The "not using index" was very frustrating to understand. The stock answer,
"did you vacuum?" just isn't enough. There has to be some explanation (in the
FAQ or something) about the indexed key distribution in your data. Postgres'
statistics are pretty poor too, a relative few very populous entries in a table
will make it virtually impossible for the cost based optimizer (CBO) to use an
index.

At my site we have lots of tables that have many duplicate items in an
index. It is a music-based site and has a huge number of "Various
Artists" entries. No matter what we do, there is NO way to get Postgres
to use the index from the query alone. We have over 20 thousand
artists, but 5 "Various Artists" or "Soundtrack" entries skew the
statistics so much that they exclude an index scan. We have to run the
system with sequential scans disabled. Running with seq scans disabled
eliminates the usefulness of the CBO, because even when a table scan is
justified, it does an index scan.

I have tilted at this windmill before and am a bit regretful at
bringing it up again, but it is important, very important. There needs
to be a way to direct the optimizer in how to optimize the query.

Using "set foo=bar" prior to a query is not acceptable. Web sites use
persistent connections to the databases and since "set" can not be restored,
you override global settings for the session, or have to code, in the web page,
the proper default setting. The result is either that different web processes
will behave differently depending on the order in which they execute queries,
or you have to have your DBA write web pages.

A syntax like:

select * from table where /* enable_seqscan = false */ key = 'value';

Would be great in that you could tune the optimizer as long as the settings
were for the clause directly following the directive, without affecting the
state of the session or transaction. For instance:

select id from t1, t2 where /* enable_seqscan = false */ t1.key = 'value' and
t2.key = 'test' and t1.id = t2.id;

The where "t1.key = 'value'" condition would be prohibited from using a
sequntial scan, while the "t2.key = 'test'" would use it if it made sense.

Is this possible?


Re: Index of a table is not used (in any case)

From
Tom Lane
Date:
mlw <markw@mohawksoft.com> writes:
> ... Postgres' statistics are pretty poor too, a relative few very
> populous entries in a table will make it virtually impossible for the
> cost based optimizer (CBO) to use an index.

Have you looked at development sources lately?
        regards, tom lane


Re: Index of a table is not used (in any case)

From
Peter Eisentraut
Date:
mlw writes:

> The "not using index" was very frustrating to understand. The stock answer,
> "did you vacuum?" just isn't enough. There has to be some explanation (in the
> FAQ or something) about the indexed key distribution in your data.

Most "not using index" questions seem to be related to a misunderstanding
of users to the effect that "if there is an index it must be used, not
matter what the query", which is of course far from reality.  Add to that
the (related) category of inquiries from people that think the index ought
to be used but don't have any actual timings to show, you have a lot of
people that just need to be educated.

Of course the question "did you vacuum" (better: did you analyze) is
annoying, just as the requirement to analyze is annoying in the first
place, but unless someone designs a better query planner it will have to
do.  The reason why we always ask that question first is that people
invariably have not analyzed.  A seasoned developer can often tell from
the EXPLAIN output whether ANALYZE has been done, but users cannot.
Perhaps something can be done in this area, but I'm not exactly sure what.

-- 
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter



Re: Index of a table is not used (in any case)

From
"Zeugswetter Andreas SB SD"
Date:
> Of course the question "did you vacuum" (better: did you analyze) is
> annoying, just as the requirement to analyze is annoying in the first
> place, but unless someone designs a better query planner it
> will have to do.  The reason why we always ask that question first is
> that people invariably have not analyzed.

I think it is also not always useful to ANALYZE. There are applications
that choose optimal plans with only the rudimentary statistics VACUUM
creates, and even some that use optimal plans with only the default
statistics in place.

Imho one of the biggest sources for problems is people creating new
indexes on populated tables when the rest of the db/table has badly
outdated statistics or even only default statistics in place.
In this situation the optimizer is badly misguided, because it now
sees completely inconsistent statistics to work on.
(e.g. old indexes on that table may seem way too cheap compared 
to table scan) 

I would thus propose a more distinguished approach of writing 
the statistics gathered during "create index" to the system tables.

Something like:
if (default stats in place)
    write defaults
else if (this is the only index)
    write gathered statistics
else
    write only normalized statistics for the index
    (e.g. index.reltuples = table.reltuples;
          index.relpages  = index.gathered.relpages *
                            table.relpages / table.gathered.relpages)

Andreas


Re: [SQL] Index of a table is not used (in any case)

From
Reiner Dassing
Date:
Hello Tom!

Tom Lane wrote:
> 
> Reiner Dassing <dassing@wettzell.ifag.de> writes:
> > explain select * from wetter order by epoche desc;
> > NOTICE:  QUERY PLAN:
> 
> > Index Scan Backward using wetter_epoche_idx on wetter
> > (cost=0.00..3216018.59 rows=20340000 width=16)
> 
> > explain select * from wetter where epoche between '1970-01-01' and
> > '1980-01-01' order by epoche asc;
> > NOTICE:  QUERY PLAN:
> 
> > Sort  (cost=480705.74..480705.74 rows=203400 width=16)
> >   ->  Seq Scan on wetter  (cost=0.00..454852.00 rows=203400 width=16)
> 
> It's hard to believe that you've done a VACUUM ANALYZE on this table,
> since you are getting a selectivity estimate of exactly 0.01, which
> just happens to be the default selectivity estimate for range queries.
> How many rows are there really in this date range?
> 
Well, I did not claim that I had done a VACUUM ANALYZE; I just set up a
new table for testing purposes and did only INSERTs.

After VACUUM ANALYZE the results look like:
explain select * from wetter where epoche between '1970-01-01' and
test_wetter-# '1980-01-01' order by epoche asc;
NOTICE:  QUERY PLAN:

Index Scan using wetter_epoche_idx on wetter  (cost=0.00..3313780.74
rows=20319660 width=16)

EXPLAIN

Now the index scan is used and, therefore, the query is very fast, as
expected.

For me, as a user not involved in all the intrinsics of PostgreSQL, the
question was:

"Why is this SELECT so slow?" (this question gets asked many times on
these mailing lists)

Now I would like to say thank you! You have explained to me, and
hopefully to many more users, what is going on behind the scenes.

> Anyway, the reason the planner is picking a seqscan+sort is that it
> thinks that will be faster than an indexscan.  It's not necessarily
> wrong.  Have you compared the explain output and actual timings both
> ways?  (Use "set enable_seqscan to off" to force it to pick an indexscan
> for testing purposes.)
> 
>                         regards, tom lane

--
Mit freundlichen Gruessen / With best regards  Reiner Dassing


Re: Index of a table is not used (in any case)

From
Tom Lane
Date:
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
> Imho one of the biggest sources for problems is people creating new
> indexes on populated tables when the rest of the db/table has badly
> outdated statistics or even only default statistics in place.
> In this situation the optimizer is badly misguided, because it now
> sees completely inconsistent statistics to work on.
> (e.g. old indexes on that table may seem way too cheap compared 
> to table scan) 

I don't think any of this is correct.  We don't have per-index
statistics.  The only stats updated by CREATE INDEX are the same ones
updated by plain VACUUM, viz the number-of-tuples and number-of-pages
counts in pg_class.  I believe it's reasonable to update those stats
more often than the pg_statistic stats (in fact, if we could keep them
constantly up-to-date at a reasonable cost, we'd do so).  The
pg_statistic stats are designed as much as possible to be independent
of the absolute number of rows in the table, so that it's okay if they
are out of sync with the pg_class stats.

The major reason why "you vacuumed but you never analyzed" is such a
killer is that in the absence of any pg_statistic data, the default
selectivity estimates are such that you may get either an index or seq
scan depending on how big the table is.  The cost estimates are
nonlinear (correctly so, IMHO, though I wouldn't necessarily defend the
exact shape of the curve) and ye olde default 0.01 will give you an
indexscan for a small table but not for a big one.  In 7.2 I have
reduced the default selectivity estimate to 0.005, for a number of
reasons but mostly to get it out of the range where the decision will
flip-flop.  Observe:

test71=# create table foo (f1 int);
CREATE
test71=# create index fooi on foo(f1);
CREATE
test71=# explain select * from foo where f1 = 42;
NOTICE:  QUERY PLAN:

Index Scan using fooi on foo  (cost=0.00..8.14 rows=10 width=4)

EXPLAIN
test71=# select reltuples,relpages from pg_class where relname = 'foo';
 reltuples | relpages
-----------+----------
      1000 |       10
(1 row)

test71=# update pg_class set reltuples = 100000, relpages = 1000 where relname = 'foo';
UPDATE 1
test71=# explain select * from foo where f1 = 42;
NOTICE:  QUERY PLAN:

Index Scan using fooi on foo  (cost=0.00..1399.04 rows=1000 width=4)

EXPLAIN
test71=# update pg_class set reltuples = 1000000, relpages = 10000 where relname = 'foo';
UPDATE 1
test71=# explain select * from foo where f1 = 42;
NOTICE:  QUERY PLAN:

Seq Scan on foo  (cost=0.00..22500.00 rows=10000 width=4)

EXPLAIN
test71=#

In current sources you keep getting an indexscan as you increase the
number of tuples...
        regards, tom lane


Re: Index of a table is not used (in any case)

From
"Zeugswetter Andreas SB SD"
Date:
Tom Lane writes:
> "Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
> > Imho one of the biggest sources for problems is people creating new
> > indexes on populated tables when the rest of the db/table has badly
> > outdated statistics or even only default statistics in place.
> > In this situation the optimizer is badly misguided, because it now
> > sees completely inconsistent statistics to work on.
> > (e.g. old indexes on that table may seem way too cheap compared 
> > to table scan) 
> 
> I don't think any of this is correct.  We don't have per-index
> statistics.  The only stats updated by CREATE INDEX are the same ones
> updated by plain VACUUM, viz the number-of-tuples and number-of-pages
> counts in pg_class.

1. Have I said anything about stats other than relpages and reltuples?

2. There is only limited use in the most accurate pg_statistic data if
reltuples and relpages are completely off. With the current behavior
you e.g. get:

rel1:  pages = 100000    -- updated from "create index"
index1 pages = 2         -- outdated
index2 pages = 2000      -- current

rel2:  pages = 1         -- outdated

--> Optimizer will prefer join order: rel2, rel1

> I believe it's reasonable to update those stats
> more often than the pg_statistic stats (in fact, if we could keep them
> constantly up-to-date at a reasonable cost, we'd do so).

There is a whole lot of difference between keeping them constantly up
to date and modifying (part of) them in the "create index" command, so
I do not counter your above sentence, but imho the conclusion is wrong.

> The
> pg_statistic stats are designed as much as possible to be independent
> of the absolute number of rows in the table, so that it's okay if they
> are out of sync with the pg_class stats.

Independently, they can only be good for choosing whether to use an 
index or seq scan. They are not sufficient to choose a good join order.

> The major reason why "you vacuumed but you never analyzed" is such a
> killer is that in the absence of any pg_statistic data, the default
> selectivity estimates are such that you may get either an index or seq
> scan depending on how big the table is.  The cost estimates are
> nonlinear (correctly so, IMHO, though I wouldn't necessarily 
> defend the
> exact shape of the curve) and ye olde default 0.01 will give you an
> indexscan for a small table but not for a big one.  In 7.2 I have
> reduced the default selectivity estimate to 0.005, for a number of
> reasons but mostly to get it out of the range where the decision will
> flip-flop.

Yes, the new selectivity is better, imho even still too high.
The strategy should be to assume a good selectivity of values in the
absence of pg_statistics evidence. If the index were not selective
enough for an average query, the dba should not have created the index
in the first place.

> test71=# create table foo (f1 int);
> test71=# create index fooi on foo(f1);
> test71=# explain select * from foo where f1 = 42;

> Index Scan using fooi on foo  (cost=0.00..8.14 rows=10 width=4)

> test71=# update pg_class set reltuples = 100000, relpages = 1000
> where relname = 'foo';
> Index Scan using fooi on foo  (cost=0.00..1399.04 rows=1000 width=4)

> test71=# update pg_class set reltuples = 1000000, relpages = 10000
> where relname = 'foo';

> Seq Scan on foo  (cost=0.00..22500.00 rows=10000 width=4)

> In current sources you keep getting an indexscan as you increase the
> number of tuples...

As you can see, it toppled at 1 million rows :-(

Andreas