strange query plan with LIMIT - Mailing list pgsql-performance

From anthony.shipman@symstream.com
Subject strange query plan with LIMIT
Date
Msg-id 201106071826.17282.anthony.shipman@symstream.com
Whole thread Raw
Responses Re: strange query plan with LIMIT
List pgsql-performance
Version: PostgreSQL 8.3.5 (mammoth replicator)

Schema:

CREATE TABLE tdiag (
    diag_id             integer DEFAULT nextval('diag_id_seq'::text),
    create_time        timestamp with time zone default now(),    /* time this record
was created */
    diag_time           timestamp with time zone not null,
    device_id           integer,                /* optional */
    fleet_id            integer,                /* optional */
    customer_id         integer,                /* optional */
    module              character varying,
    node_kind           smallint,
    diag_level          smallint,
    tag                 character varying not null default '',
    message             character varying not null default '',
    options             text,

    PRIMARY KEY (diag_id)
);

create index tdiag_create_time   ON tdiag(create_time);

The number of rows is around 33 million with time stamps over the past two
weeks.
A VACUUM ANALYZE has been done recently on the table.

The create_time order is almost identical to the id order.  What I want
to find is the first or last entry by id in a given time range. The
query I am having a problem with is:

symstream2=> explain analyze select * from tdiag where (create_time
>= '2011-06-03
09:49:04.000000+0' and create_time < '2011-06-06 09:59:04.000000+0') order by
diag_id limit 1;


QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..16.75 rows=1 width=114) (actual time=69425.356..69425.358
rows=1 loops=1)
   ->  Index Scan using tdiag_pkey on tdiag  (cost=0.00..19114765.76
rows=1141019 width=114)
(actual time=69425.352..69425.352 rows=1 loops=1)
         Filter: ((create_time >= '2011-06-03 19:49:04+10'::timestamp with
time zone) AND
(create_time < '2011-06-06 19:59:04+10'::timestamp with time zone))
 Total runtime: 69425.400 ms

PG seems to decide it must scan the diag_id column and filter each row by the
create_time.



If I leave out the limit I get

symstream2=> explain analyze select * from tdiag where (create_time
>= '2011-06-03
09:49:04.000000+0' and create_time < '2011-06-06 09:59:04.000000+0') order by
diag_id;


QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=957632.43..960484.98 rows=1141019 width=114) (actual
time=552.795..656.319 rows=86530
loops=1)
   Sort Key: diag_id
   Sort Method:  external merge  Disk: 9872kB
   ->  Bitmap Heap Scan on tdiag  (cost=25763.48..638085.13 rows=1141019
width=114) (actual
time=43.232..322.441 rows=86530 loops=1)
         Recheck Cond: ((create_time >= '2011-06-03 19:49:04+10'::timestamp
with time zone) AND
(create_time < '2011-06-06 19:59:04+10'::timestamp with time zone))
         ->  Bitmap Index Scan on tdiag_create_time  (cost=0.00..25478.23
rows=1141019 width=0)
(actual time=42.574..42.574 rows=86530 loops=1)
               Index Cond: ((create_time >= '2011-06-03
19:49:04+10'::timestamp with time zone) AND
(create_time < '2011-06-06 19:59:04+10'::timestamp with time zone))
 Total runtime: 736.440 ms
(8 rows)




I can be explicit about the query order:

select * into tt from tdiag where (create_time >= '2011-06-03
09:49:04.000000+0' and create_time <
'2011-06-06 09:59:04.000000+0');

symstream2=> explain analyze select * from tt order by diag_id limit 1;
                                                       QUERY PLAN

------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=2731.95..2731.95 rows=1 width=101) (actual time=440.165..440.166
rows=1 loops=1)
   ->  Sort  (cost=2731.95..2948.28 rows=86530 width=101) (actual
time=440.161..440.161 rows=1
loops=1)
         Sort Key: diag_id
         Sort Method:  top-N heapsort  Memory: 17kB
         ->  Seq Scan on tt  (cost=0.00..2299.30 rows=86530 width=101) (actual
time=19.602..330.873
rows=86530 loops=1)
 Total runtime: 440.209 ms
(6 rows)



But if I try using a subquery I get

symstream2=> explain analyze select * from (select * from tdiag where
(create_time >= '2011-06-03
09:49:04.000000+0' and create_time < '2011-06-06 09:59:04.000000+0')) as sub
order by diag_id limit
1;


QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..16.75 rows=1 width=114) (actual time=90344.384..90344.385
rows=1 loops=1)
   ->  Index Scan using tdiag_pkey on tdiag  (cost=0.00..19114765.76
rows=1141019 width=114)
(actual time=90344.380..90344.380 rows=1 loops=1)
         Filter: ((create_time >= '2011-06-03 19:49:04+10'::timestamp with
time zone) AND
(create_time < '2011-06-06 19:59:04+10'::timestamp with time zone))
 Total runtime: 90344.431 ms


How do I make this both fast and simple?
--
Anthony Shipman                 | flailover systems: When one goes down it
Anthony.Shipman@symstream.com   | flails about until the other goes down too.

pgsql-performance by date:

Previous
From: Vitalii Tymchyshyn
Date:
Subject: Re: 8.4/9.0 simple query performance regression
Next
From: Craig Ringer
Date:
Subject: Re: i want to ask monitory peformance memory postgresql with automatically