Thread: Query does not use index
Hi,

I hope this message won't be sent to the list several times. If so,
please accept my apologies. It would be very nice if someone could have
a look at the query I'm trying to optimize. At the moment, I don't
understand PostgreSQL's behaviour and am stuck. Thanks a lot in advance.

I am using PostgreSQL 7.4.1. My database contains the following tables:

          Table "public.token"
  Column  |          Type          | Modifiers
----------+------------------------+-----------
 text_id  | integer                | not null
 position | integer                | not null
 word     | character varying(255) |
Indexes:
    "token_pkey" primary key, btree (text_id, "position")
    "word_idx" btree (word)

and

        Table "public.s"
 Column  |  Type   | Modifiers
---------+---------+-----------
 text_id | integer | not null
 s       | integer | not null
 start   | integer |
 stop    | integer |
Indexes:
    "s_pkey" primary key, btree (text_id, s)
    "s_begin_idx" btree (text_id, "start")
    "s_end_idx" btree (text_id, stop)
Foreign-key constraints:
    "$1" FOREIGN KEY (text_id, "start") REFERENCES token(text_id, "position")
    "$2" FOREIGN KEY (text_id, stop) REFERENCES token(text_id, "position")

(I hope it's readable.)

I have the following query:

    select *
    from token, s
    where
        token.word = 'FACTSHEET' and
        s.text_id = token.text_id and
        s.start = token.position;

PostgreSQL generates the following query plan:

                                    QUERY PLAN
--------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..39120.95 rows=1 width=32)
   ->  Index Scan using s_pkey on s  (cost=0.00..24698.44 rows=3367 width=16)
   ->  Index Scan using token_pkey on token  (cost=0.00..4.27 rows=1 width=16)
         Index Cond: (("outer".text_id = token.text_id) AND ("outer"."start" = token."position"))
         Filter: ((word)::text = 'FACTSHEET'::text)

and the execution takes ages. Now, 'FACTSHEET' is *very* rare (5 rows
out of 23 million), and I intended PostgreSQL to first search for
'FACTSHEET' and then use index s_begin_idx (and the values of
token.text_id and token.position) to find the corresponding records in
s. I tried to force PostgreSQL to do it in this order with the
following:

    select *
    from (select text_id, position from token where word = 'FACTSHEET') t
    left join s on (s.text_id = t.text_id and s.start = t.position);

                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Hash Left Join  (cost=24715.28..56630.78 rows=7892 width=24)
   Hash Cond: (("outer".text_id = "inner".text_id) AND ("outer"."position" = "inner"."start"))
   ->  Index Scan using word_idx on token  (cost=0.00..31402.51 rows=7892 width=8)
         Index Cond: ((word)::text = 'FACTSHEET'::text)
   ->  Hash  (cost=24698.44..24698.44 rows=3367 width=16)
         ->  Index Scan using s_pkey on s  (cost=0.00..24698.44 rows=3367 width=16)

Now it *does* first search for 'FACTSHEET', but it still does not use
s_begin_idx, and I have no idea why. Any ideas what I could do?

Thanks,
Martin.
Martin Hampl <Martin.Hampl@gmx.de> writes:
> It would be very nice if someone could have a look at the query I'm
> trying to optimize. At the moment, I don't understand PostgreSQL's
> behaviour and am stuck. Thanks a lot in advance.

Did you ANALYZE these tables?  Also, please post EXPLAIN ANALYZE, not
just EXPLAIN, when complaining about bad plans.  Since the essence of
your complaint is that the planner's estimates are wrong, showing us
only estimates and not reality makes it hard to give constructive
suggestions ...

			regards, tom lane
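For anyone following along, a minimal sketch of what Tom is asking for,
using the table and column names from the first message:

    -- Refresh the planner's statistics for both tables:
    ANALYZE token;
    ANALYZE s;

    -- EXPLAIN ANALYZE actually runs the query and reports real row
    -- counts and timings next to the planner's estimates:
    EXPLAIN ANALYZE
    SELECT *
    FROM token, s
    WHERE token.word = 'FACTSHEET'
      AND s.text_id = token.text_id
      AND s.start = token.position;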
Hi,

On 30.04.2004, at 01:32, Tom Lane wrote:
> Martin Hampl <Martin.Hampl@gmx.de> writes:
>> It would be very nice if someone could have a look at the query I'm
>> trying to optimize. At the moment, I don't understand PostgreSQL's
>> behaviour and am stuck. Thanks a lot in advance.
>
> Did you ANALYZE these tables?

I did.

> Also, please post EXPLAIN ANALYZE, not just EXPLAIN, when complaining
> about bad plans.  Since the essence of your complaint is that the
> planner's estimates are wrong, showing us only estimates and not
> reality makes it hard to give constructive suggestions ...

OK.

bnc23Mio=# EXPLAIN ANALYZE select * from
bnc23Mio-#     token,
bnc23Mio-#     s
bnc23Mio-# where
bnc23Mio-#     token.word = 'FACTSHEET' and
bnc23Mio-#     s.text_id = token.text_id and
bnc23Mio-#     s.start = token.position;

                                                           QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..39120.95 rows=1 width=32) (actual time=102.263..692248.553 rows=3 loops=1)
   ->  Index Scan using s_pkey on s  (cost=0.00..24698.44 rows=3367 width=16) (actual time=75.933..589743.642 rows=1111220 loops=1)
   ->  Index Scan using token_pkey on token  (cost=0.00..4.27 rows=1 width=16) (actual time=0.086..0.086 rows=0 loops=1111220)
         Index Cond: (("outer".text_id = token.text_id) AND ("outer"."start" = token."position"))
         Filter: ((word)::text = 'FACTSHEET'::text)
 Total runtime: 692249.314 ms

bnc23Mio=# EXPLAIN ANALYZE select * from
    (select text_id, position from token where word = 'FACTSHEET') t
    left join s on (s.text_id = t.text_id and s.start = t.position);

                                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=24715.28..56630.78 rows=7892 width=24) (actual time=255329.976..255355.967 rows=5 loops=1)
   Hash Cond: (("outer".text_id = "inner".text_id) AND ("outer"."position" = "inner"."start"))
   ->  Index Scan using word_idx on token  (cost=0.00..31402.51 rows=7892 width=8) (actual time=91.010..109.394 rows=5 loops=1)
         Index Cond: ((word)::text = 'FACTSHEET'::text)
   ->  Hash  (cost=24698.44..24698.44 rows=3367 width=16) (actual time=255236.914..255236.914 rows=0 loops=1)
         ->  Index Scan using s_pkey on s  (cost=0.00..24698.44 rows=3367 width=16) (actual time=105.100..247798.661 rows=1111220 loops=1)
 Total runtime: 255502.736 ms

Maybe that *is* what I wanted it to do? However, searching just for
'FACTSHEET' is very quick (I rebooted before this query to clear any
cache---is there a better way to do this?):

bnc23Mio=# EXPLAIN ANALYZE select * from token where word = 'FACTSHEET';

                                                           QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Index Scan using word_idx on token  (cost=0.00..31402.51 rows=7892 width=16) (actual time=102.350..125.032 rows=5 loops=1)
   Index Cond: ((word)::text = 'FACTSHEET'::text)
 Total runtime: 125.289 ms

and I would have thought that the results of this query could have been
used to search for the respective records in s (using one of the
indexes)?

Regards,
Martin.
Martin Hampl <Martin.Hampl@gmx.de> writes:
> On 30.04.2004, at 01:32, Tom Lane wrote:
>> Did you ANALYZE these tables?

> I did.

Hm.  I'm wondering why the row estimates for 's' are off by several
orders of magnitude:

>    ->  Index Scan using s_pkey on s  (cost=0.00..24698.44 rows=3367
> width=16) (actual time=75.933..589743.642 rows=1111220 loops=1)

It could be that this table has a lot of empty pages near the front,
which is a condition that's known to lead to underestimated row counts
from ANALYZE.  (Manfred is working on a better ANALYZE sampling method
that should avoid such errors in future.)  Try doing a straight VACUUM
and see if the row count estimate gets better.  If so, it might be
worth the trouble to do a VACUUM FULL to get rid of the empty space.
(And you should also think about doing routine vacuums more often, and
perhaps increasing the FSM settings, to ensure you don't get back into
this state.)

The other thing I'm wondering about is why the devil it's choosing an
indexscan at all, when it has no indexscan conditions to use.  Are you
perhaps forcing that choice via "enable_seqscan = false"?  If so, don't.

> Maybe that *is* what I wanted it to do? However, searching just for
> 'FACTSHEET' is very quick (I rebooted before this query to clear any
> cache---is there a better way to do this?):

> and I would have thought that the results of this query could have
> been used to search for the respective records in s (using one of the
> indexes)?

Undoubtedly it did consider that plan, but rejected it because it
looked more expensive than the alternatives.  This is not too
surprising given the overestimate of the number of rows matching
'FACTSHEET' (7892 vs reality of 5).  You might need to increase the
statistics target for token.word (see ALTER TABLE SET STATISTICS) to
give the planner more data to work with about the distribution of
words.

			regards, tom lane
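Taken together, Tom's suggestions boil down to something like the
following sketch (the statistics target of 100 is an example value, not
a recommendation from the thread; syntax as in 7.4):

    -- Check whether sequential scans have been disabled by hand:
    SHOW enable_seqscan;    -- should be 'on'

    -- A plain VACUUM updates the row-count estimate in pg_class;
    -- VACUUM FULL additionally compacts away the empty pages:
    VACUUM ANALYZE s;
    -- VACUUM FULL s;       -- only if plain VACUUM reports lots of dead space

    -- Give the planner a larger sample for token.word, then re-ANALYZE
    -- (the 7.4 default statistics target was 10):
    ALTER TABLE token ALTER COLUMN word SET STATISTICS 100;
    ANALYZE token;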
On 30.04.2004, at 17:52, Tom Lane wrote:
> It could be that this table has a lot of empty pages near the front,
> which is a condition that's known to lead to underestimated row counts
> from ANALYZE.  (Manfred is working on a better ANALYZE sampling method
> that should avoid such errors in future.)  Try doing a straight VACUUM
> and see if the row count estimate gets better.

It did! Thanks! EXPLAIN ANALYZE now yields the following result:

                                                          QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..70650.10 rows=12 width=32) (actual time=170.449..181.663 rows=3 loops=1)
   ->  Index Scan using word_idx on token  (cost=0.00..30403.79 rows=7621 width=16) (actual time=158.237..179.105 rows=5 loops=1)
         Index Cond: ((word)::text = 'FACTSHEET'::text)
   ->  Index Scan using s_begin_idx on s  (cost=0.00..5.27 rows=1 width=16) (actual time=0.429..0.434 rows=1 loops=5)
         Index Cond: ((s.text_id = "outer".text_id) AND (s."start" = "outer"."position"))
 Total runtime: 182.207 ms

I never deleted anything from this database, but I might have dropped
some of the tables a couple of times while populating the database.
Also, 'copy from' failed a couple of times. Can that have the same
effect?

Thanks again,
Martin
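One rough way to see how much dead space is still left is to compare
the page count the planner sees with the live row count (a sketch;
pg_class columns as in 7.4, refreshed by VACUUM/ANALYZE):

    -- A page count far out of proportion to the row count
    -- suggests leftover bloat worth a VACUUM FULL:
    SELECT relname, relpages, reltuples
    FROM pg_class
    WHERE relname IN ('token', 's');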
Martin Hampl <Martin.Hampl@gmx.de> writes:
> On 30.04.2004, at 17:52, Tom Lane wrote:
>> It could be that this table has a lot of empty pages near the front,

> I never deleted anything from this database, but I might have dropped
> some of the tables a couple of times while populating the database.
> Also, 'copy from' failed a couple of times. Can that have the same
> effect?

A failed copy-in would leave a dead row for each input line that it was
able to process before hitting the error.  So potentially that could
account for a lot of dead rows.  I think there is a hint on the COPY
reference page suggesting that you VACUUM in such a situation ...

			regards, tom lane
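In other words, a bulk load that dies partway through still writes row
versions that are invisible but occupy space, roughly like this sketch
(the file name is hypothetical):

    -- Suppose this aborts on a malformed line halfway through the file.
    -- Every line processed before the error becomes a dead row version:
    COPY token FROM '/tmp/tokens.dat';

    -- Reclaim that space and fix the statistics afterwards:
    VACUUM ANALYZE token;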
On 01.05.2004, at 17:48, Tom Lane wrote:
> Martin Hampl <Martin.Hampl@gmx.de> writes:
>> On 30.04.2004, at 17:52, Tom Lane wrote:
>>> It could be that this table has a lot of empty pages near the front,
>
>> I never deleted anything from this database, but I might have dropped
>> some of the tables a couple of times while populating the database.
>> Also, 'copy from' failed a couple of times. Can that have the same
>> effect?
>
> A failed copy-in would leave a dead row for each input line that it was
> able to process before hitting the error.  So potentially that could
> account for a lot of dead rows.  I think there is a hint on the COPY
> reference page suggesting that you VACUUM in such a situation ...

There is. Well...

Regards,
Martin
Thread: SSL test environment

Hi,

I would like to set up an SSL test environment to further develop my
frontend/backend protocol implementation and add SSL support to it.
Trouble is, I don't know much about SSL (or command shells)--I'm on
OS X and come from a traditional Mac culture. I've been searching the
docs and archives, and it's all very cryptic to me. I guess this is the
best hint to what I'm looking for:

http://archives.postgresql.org/pgsql-docs/2002-09/msg00038.php

But I'm not sure how to interpret it, and it looks like a step
somewhere in the middle rather than the whole procedure. Can anyone
direct me to some good entry-level documentation, or go to the trouble
of explaining the whole thing to me if it's not too much--I bet it is,
though?

Thanks,
Marc
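For what it's worth, the usual recipe for a self-signed test
certificate looks roughly like the sketch below. The file names follow
the convention in the PostgreSQL documentation of the time; treat this
as an outline, not gospel:

    # Generate a key pair and a certificate request (answer the prompts;
    # the Common Name should be the server's host name). This writes the
    # key to privkey.pem by default:
    openssl req -new -text -out server.req

    # Remove the passphrase so the server can start unattended:
    openssl rsa -in privkey.pem -out server.key

    # Turn the request into a self-signed certificate:
    openssl req -x509 -in server.req -text -key server.key -out server.crt

    # The key must not be readable by other users:
    chmod og-rwx server.key

    # Copy server.key and server.crt into the data directory ($PGDATA),
    # set "ssl = true" in postgresql.conf, and restart the postmaster.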
I found a not-too-high-level explanation here:

http://www.gtlib.cc.gatech.edu/pub/linux/docs/HOWTO/other-formats/html_single/SSL-Certificates-HOWTO.html

Now what I'd really like to find next is an OpenSSL GUI for OS X. Does
anyone know of one? I checked VersionTracker, Google, Apple, etc.
without luck so far. I hear there's one that comes with OS X Server.

Marc