index being ignored for "limit n" queries - Mailing list pgsql-general

From Piotr Sulecki
Subject index being ignored for "limit n" queries
Date
Msg-id 42F89DC1.9040303@sybilla.traxelektronik.pl
Whole thread Raw
Responses Re: index being ignored for "limit n" queries  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
Ave!

Yesterday I noticed a problem with my PostgreSQL installation. I have
three database clusters, one using version 7.4.6 and the rest using
version 8.0.1. The problem manifests itself in all three installations.
(The three databases are copies of -- more or less -- the same data; now
I'm writing a program to really synchronize these databases, but that's
besides the point. I'll only describe one of them; the same can be said
of the others, too.)

I have two really big tables, the problem is with one of them. The table
looks as follows:

                                  Table "public.pakiety"
   Column    |            Type             |                  Modifiers
-------------+-----------------------------+---------------------------------------------
pktid        | integer                     | not null default
                                             nextval('pktid_seq')
stid         | smallint                    | not null
received     | timestamp(6) with time zone | not null
measured     | timestamp(0) with time zone | not null
station_time | timestamp(0) with time zone |
bezwzgl      | smallint                    |
full_cycle   | boolean                     |

Indexes:
    "pakiety_pkey" PRIMARY KEY, btree (pktid)
    "pakiety_stid_received_idx" UNIQUE, btree (stid, received)
    "pakiety_measured_idx" btree (measured)
    "pakiety_received_idx" btree (received)
    "pakiety_stid_measured_idx" btree (stid, measured)

This table holds info about data packets sent by our automated meteo
stations.

The table contains about 15 million rows. Any type of a sequential scan
will be extremely slow on this table. Even an index scan will take
minutes if the index is not fit for the search.

What's unusual is that:
1) there is only a very limited set of used stid (station ID) values -
   about 500 or so, and
2) there can be some very long periods of time (measured in months) when
   a specific station doesn't send any packets at all.

That's why I created indices pakiety_stid_received_idx and
pakiety_stid_measured_idx. Unfortunately, when I'm looking for the last
data packet from a specific station, I get an index scan using only the
"pakiety_received_idx" index and not the "pakiety_stid_received_idx"
one. While it takes a split second for a recently active station, it
really takes several minutes for a station which has been quiet for
several months.

trax=# explain select * from pakiety where stid = 234::smallint order by
received desc limit 1;
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..6.26 rows=1 width=33)
   ->  Index Scan Backward using pakiety_received_idx on pakiety
(cost=0.00..193599.37 rows=30915 width=33)
         Filter: (stid = 234::smallint)
(3 rows)

What's funny, I get the same plan when using "limit 10000":

trax=# explain select * from pakiety where stid = 234::smallint order by
received desc limit 10000;
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..62621.32 rows=10000 width=33)
   ->  Index Scan Backward using pakiety_received_idx on pakiety
(cost=0.00..193756.63 rows=30941 width=33)
         Filter: (stid = 234::smallint)
(3 rows)

I have just noticed another problem: when the limit goes even higher,
the planner decides to use another unfit index.

trax=# explain select * from pakiety where stid = 234::smallint order by
received desc limit 100000;
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Limit  (cost=125101.46..125178.81 rows=30940 width=33)
   ->  Sort  (cost=125101.46..125178.81 rows=30940 width=33)
         Sort Key: received
         ->  Index Scan using pakiety_stid_measured_idx on pakiety
(cost=0.00..122288.52 rows=30940 width=33)
               Index Cond: (stid = 234::smallint)
(5 rows)

The same index is also used if I remove the "limit n" clause entirely:

trax=# explain select * from pakiety where stid = 234::smallint order by
received desc;
                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Sort  (cost=125293.11..125370.58 rows=30986 width=33)
   Sort Key: received
   ->  Index Scan using pakiety_stid_measured_idx on pakiety
(cost=0.00..122474.14 rows=30986 width=33)
         Index Cond: (stid = 234::smallint)
(4 rows)

The "pakiety_stid_received_idx" index never gets used. Any hints about
possible reasons of such behavior?

Regards,

Piotr Sulecki.

pgsql-general by date:

Previous
From: Ben-Nes Yonatan
Date:
Subject: Re: Weird lock or bug maybe?
Next
From: Ana Mandiola
Date:
Subject: Solicitud de informacion de psql con php