Thread: strange query plan with LIMIT
Version: PostgreSQL 8.3.5 (mammoth replicator)

Schema:

CREATE TABLE tdiag (
    diag_id     integer DEFAULT nextval('diag_id_seq'::text),
    create_time timestamp with time zone default now(), /* time this record was created */
    diag_time   timestamp with time zone not null,
    device_id   integer,   /* optional */
    fleet_id    integer,   /* optional */
    customer_id integer,   /* optional */
    module      character varying,
    node_kind   smallint,
    diag_level  smallint,
    tag         character varying not null default '',
    message     character varying not null default '',
    options     text,

    PRIMARY KEY (diag_id)
);

create index tdiag_create_time ON tdiag(create_time);

The number of rows is over 33 million, with time stamps over the past two
weeks. The create_time order is almost identical to the id order. What I
want to find is the first or last entry by id in a given time range. The
query I am having a problem with is:

symstream2=> explain analyze select * from tdiag where
(create_time >= '2011-06-03 09:49:04.000000+0' and
 create_time <  '2011-06-06 09:59:04.000000+0')
order by diag_id limit 1;

                                  QUERY PLAN
--------------------------------------------------------------------------------
 Limit  (cost=0.00..16.75 rows=1 width=114) (actual time=69425.356..69425.358 rows=1 loops=1)
   ->  Index Scan using tdiag_pkey on tdiag  (cost=0.00..19114765.76 rows=1141019 width=114) (actual time=69425.352..69425.352 rows=1 loops=1)
         Filter: ((create_time >= '2011-06-03 19:49:04+10'::timestamp with time zone) AND (create_time < '2011-06-06 19:59:04+10'::timestamp with time zone))
 Total runtime: 69425.400 ms

PG seems to decide it must scan the diag_id column and filter each row by
the create_time.

If I leave out the limit I get

symstream2=> explain analyze select * from tdiag where
(create_time >= '2011-06-03 09:49:04.000000+0' and
 create_time <  '2011-06-06 09:59:04.000000+0')
order by diag_id;

                                  QUERY PLAN
--------------------------------------------------------------------------------
 Sort  (cost=957632.43..960484.98 rows=1141019 width=114) (actual time=552.795..656.319 rows=86530 loops=1)
   Sort Key: diag_id
   Sort Method:  external merge  Disk: 9872kB
   ->  Bitmap Heap Scan on tdiag  (cost=25763.48..638085.13 rows=1141019 width=114) (actual time=43.232..322.441 rows=86530 loops=1)
         Recheck Cond: ((create_time >= '2011-06-03 19:49:04+10'::timestamp with time zone) AND (create_time < '2011-06-06 19:59:04+10'::timestamp with time zone))
         ->  Bitmap Index Scan on tdiag_create_time  (cost=0.00..25478.23 rows=1141019 width=0) (actual time=42.574..42.574 rows=86530 loops=1)
               Index Cond: ((create_time >= '2011-06-03 19:49:04+10'::timestamp with time zone) AND (create_time < '2011-06-06 19:59:04+10'::timestamp with time zone))
 Total runtime: 736.440 ms
(8 rows)

I can be explicit about the query order:

select * into tt from tdiag where
(create_time >= '2011-06-03 09:49:04.000000+0' and
 create_time <  '2011-06-06 09:59:04.000000+0');

symstream2=> explain analyze select * from tt order by diag_id limit 1;

                                  QUERY PLAN
--------------------------------------------------------------------------------
 Limit  (cost=2731.95..2731.95 rows=1 width=101) (actual time=440.165..440.166 rows=1 loops=1)
   ->  Sort  (cost=2731.95..2948.28 rows=86530 width=101) (actual time=440.161..440.161 rows=1 loops=1)
         Sort Key: diag_id
         Sort Method:  top-N heapsort  Memory: 17kB
         ->  Seq Scan on tt  (cost=0.00..2299.30 rows=86530 width=101) (actual time=19.602..330.873 rows=86530 loops=1)
 Total runtime: 440.209 ms
(6 rows)

But if I try using a subquery I get

symstream2=> explain analyze select * from
(select * from tdiag where
 (create_time >= '2011-06-03 09:49:04.000000+0' and
  create_time <  '2011-06-06 09:59:04.000000+0')) as sub
order by diag_id limit 1;

                                  QUERY PLAN
--------------------------------------------------------------------------------
 Limit  (cost=0.00..16.75 rows=1 width=114) (actual time=90344.384..90344.385 rows=1 loops=1)
   ->  Index Scan using tdiag_pkey on tdiag  (cost=0.00..19114765.76 rows=1141019 width=114) (actual time=90344.380..90344.380 rows=1 loops=1)
         Filter: ((create_time >= '2011-06-03 19:49:04+10'::timestamp with time zone) AND (create_time < '2011-06-06 19:59:04+10'::timestamp with time zone))
 Total runtime: 90344.431 ms

How do I make this both fast and simple?

-- 
Anthony Shipman               | Tech Support: The guys who follow the
Anthony.Shipman@symstream.com | 'Parade of New Products' with a shovel.
> Version: PostgreSQL 8.3.5 (mammoth replicator)
>
> Schema:
>
> CREATE TABLE tdiag (
>     diag_id     integer DEFAULT nextval('diag_id_seq'::text),
>     create_time timestamp with time zone default now(), /* time this record was created */
>     diag_time   timestamp with time zone not null,
>     device_id   integer,   /* optional */
>     fleet_id    integer,   /* optional */
>     customer_id integer,   /* optional */
>     module      character varying,
>     node_kind   smallint,
>     diag_level  smallint,
>     tag         character varying not null default '',
>     message     character varying not null default '',
>     options     text,
>
>     PRIMARY KEY (diag_id)
> );
>
> create index tdiag_create_time ON tdiag(create_time);
>
> The number of rows is over 33 million, with time stamps over the past two
> weeks.
>
> The create_time order is almost identical to the id order. What I want
> to find is the first or last entry by id in a given time range. The
> query I am having a problem with is:

Hi,

why are you reposting this? Pavel Stehule already recommended you to run
ANALYZE on the tdiag table - have you done that? What was the effect?

The stats are off - e.g. the bitmap scan says

   ->  Bitmap Heap Scan on tdiag  (cost=25763.48..638085.13 rows=1141019
       width=114) (actual time=43.232..322.441 rows=86530 loops=1)

so it expects to get 1141019 rows but it gets 86530, i.e. about 7% of the
expected number. That might be enough to cause a bad plan choice and thus
performance issues.

And yet another recommendation - the sort is performed on disk, so give it
more work_mem and it should be much faster (the Sort Method should change
from an external merge on disk to a quicksort in memory). Try something
like work_mem=20MB and see if it does the trick.
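For example, both suggestions can be tried together in one session - a
sketch only; SET here is session-local and 20MB is just a starting value
to experiment with:

    ANALYZE tdiag;          -- refresh the planner statistics for the table
    SET work_mem = '20MB';  -- per-sort memory budget for this session

    EXPLAIN ANALYZE
    SELECT * FROM tdiag
    WHERE create_time >= '2011-06-03 09:49:04+0'
      AND create_time <  '2011-06-06 09:59:04+0'
    ORDER BY diag_id;
    -- with enough work_mem the plan line should read
    -- "Sort Method: quicksort Memory: ..." instead of
    -- "Sort Method: external merge Disk: ..."

regards
Tomas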
On Wednesday 08 June 2011 02:40, tv@fuzzy.cz wrote:
> Hi,
>
> why are you reposting this? Pavel Stehule already recommended you to run
> ANALYZE on the tdiag table - have you done that? What was the effect?

The mailing list system hiccupped and I ended up with two posts.

VACUUM ANALYZE was done, more than once. Setting the statistics value on
the diag_id column to 1000 seemed to only make the query a bit slower.

> The stats are off - e.g. the bitmap scan says
>
>    ->  Bitmap Heap Scan on tdiag  (cost=25763.48..638085.13 rows=1141019
>        width=114) (actual time=43.232..322.441 rows=86530 loops=1)
>
> so it expects to get 1141019 rows but it gets 86530, i.e. about 7% of the
> expected number. That might be enough to cause a bad plan choice and thus
> performance issues.

What seems odd to me is that the only difference between the two is the
limit clause:

select * from tdiag where
(create_time >= '2011-06-03 09:49:04.000000+0' and
 create_time <  '2011-06-06 09:59:04.000000+0')
order by diag_id limit 1;

select * from tdiag where
(create_time >= '2011-06-03 09:49:04.000000+0' and
 create_time <  '2011-06-06 09:59:04.000000+0')
order by diag_id;

and yet the plan completely changes.

I think that I have to force the evaluation order to get a reliably fast
result:

begin;
create temporary table tt on commit drop as
    select diag_id from tdiag
    where create_time >= '2011-06-03 09:49:04.000000+0'
      and create_time <  '2011-06-06 09:59:04.000000+0';
select * from tdiag where diag_id in (select * from tt)
order by diag_id limit 10;
commit;

                                  QUERY PLAN
--------------------------------------------------------------------------------
 Limit  (cost=3566.24..3566.27 rows=10 width=112) (actual time=1800.699..1800.736 rows=10 loops=1)
   ->  Sort  (cost=3566.24..3566.74 rows=200 width=112) (actual time=1800.694..1800.708 rows=10 loops=1)
         Sort Key: tdiag.diag_id
         Sort Method:  top-N heapsort  Memory: 18kB
         ->  Nested Loop  (cost=1360.00..3561.92 rows=200 width=112) (actual time=269.087..1608.324 rows=86530 loops=1)
               ->  HashAggregate  (cost=1360.00..1362.00 rows=200 width=4) (actual time=269.052..416.898 rows=86530 loops=1)
                     ->  Seq Scan on tt  (cost=0.00..1156.00 rows=81600 width=4) (actual time=0.020..120.323 rows=86530 loops=1)
               ->  Index Scan using tdiag_pkey on tdiag  (cost=0.00..10.99 rows=1 width=112) (actual time=0.006..0.008 rows=1 loops=86530)
                     Index Cond: (tdiag.diag_id = tt.diag_id)
 Total runtime: 1801.290 ms

> And yet another recommendation - the sort is performed on disk, so give it
> more work_mem and it should be much faster.

This certainly speeds up the sorting.

-- 
Anthony Shipman               | What most people think about
Anthony.Shipman@symstream.com | most things is mostly wrong.
On Wed, Jun 8, 2011 at 7:08 AM, <anthony.shipman@symstream.com> wrote:
> What seems odd to me is that the only difference between the two is the
> limit clause

Why would that seem odd? Of course optimally executing a plan with limit
is a lot different than one without.

Just... why are you sorting by diag_id? I believe you would be better off
sorting by timestamp than diag_id, but I don't know what the query is
supposed to do.

In any case, that's a weakness I've seen in many database systems, and
postgres is no exception: order + limit strongly suggests index usage,
and when the ordered column is "anti"-correlated with the where clause
(that is, too many of the first rows in the ordered output are filtered
out by the where clause), the plan with an index is insufferably slow
compared to a sequential scan + sort. Postgres has no way to know that;
it depends on the correlation between the where clause and the ordering
expressions.

If you cannot change the query, I think your only option is to either add
a specific index for that query (i.e., if the where clause is always the
same, you could add a partial index), or just disable nested loops with
"set enable_nestloop = false;" just prior to running that query (and
remember to re-enable it afterwards).
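Both workarounds in SQL, as a sketch - the index name and the fixed date
range in the predicate are only illustrative, not from the thread:

    -- Partial index: only helps if queries always use this exact predicate
    create index tdiag_slice_by_id on tdiag (diag_id)
    where create_time >= '2011-06-03 09:49:04+0'
      and create_time <  '2011-06-06 09:59:04+0';

    -- Or steer the planner for just this query, then restore the default
    set enable_nestloop = false;
    select * from tdiag
    where create_time >= '2011-06-03 09:49:04+0'
      and create_time <  '2011-06-06 09:59:04+0'
    order by diag_id limit 1;
    set enable_nestloop = true;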
> What seems odd to me is that the only difference between the two is the
> limit clause:
>
> select * from tdiag where
> (create_time >= '2011-06-03 09:49:04.000000+0' and
>  create_time <  '2011-06-06 09:59:04.000000+0')
> order by diag_id limit 1;
>
> select * from tdiag where
> (create_time >= '2011-06-03 09:49:04.000000+0' and
>  create_time <  '2011-06-06 09:59:04.000000+0')
> order by diag_id;
>
> and yet the plan completely changes.

As Claudio Freire already pointed out, this is expected behavior. With
LIMIT the planner prefers plans with a low starting cost: it expects to
finish early, so building an index bitmap or a hash table would be a
waste. So actually it would be very odd if the plan did not change in
this case ...

Anyway, I have no idea how to fix this cleanly - without messing with
enable_* or cost variables or other such dirty tricks.

regards
Tomas
On Wednesday 08 June 2011 17:39, Claudio Freire wrote:
> Of course optimally executing a plan with limit is a lot different
> than one without.

I imagined that limit just cuts out a slice of the query results. If it
can find 80000 rows in 0.5 seconds then I would have thought that
returning just the first 100 of them should be just as easy.

> Just... why are you sorting by diag_id?
>
> I believe you would be better off sorting by timestamp than diag_id,
> but I don't know what the query is supposed to do.

The timestamp is only almost monotonic. I need to scan the table in
slices, and I use limit and offset to select each slice.

I've forced the query order with some plpgsql like:

declare
    query character varying;
    rec   record;
begin
    -- PG 8.3 doesn't have the 'using' syntax nor 'return query execute'

    execute 'create temporary table tt on commit drop as ' ||
            'select diag_id from tdiag ' || v_where;

    query := 'select * from tdiag where diag_id in (select * from tt) ' ||
             'order by diag_id ' || v_limit || ' ' || v_offset;

    for rec in execute query loop
        return next rec;
    end loop;
end;

-- 
Anthony Shipman               | Life is the interval
Anthony.Shipman@symstream.com | between pay days.
2011/6/8 <anthony.shipman@symstream.com>:
> The timestamp is only almost monotonic. I need to scan the table in
> slices, and I use limit and offset to select each slice.
>
> I've forced the query order with some plpgsql like:
>
> declare
>     query character varying;
>     rec   record;
> begin
>     -- PG 8.3 doesn't have the 'using' syntax nor 'return query execute'
>
>     execute 'create temporary table tt on commit drop as ' ||
>             'select diag_id from tdiag ' || v_where;
>
>     query := 'select * from tdiag where diag_id in (select * from tt) ' ||
>              'order by diag_id ' || v_limit || ' ' || v_offset;
>
>     for rec in execute query loop
>         return next rec;
>     end loop;
> end;

If you use a FOR statement, the problem may be the implicit cursor it
uses - try setting the GUC cursor_tuple_fraction to 1.0.
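On releases that expose that setting it is just a session-local
assignment, sketched here for reference:

    SET cursor_tuple_fraction = 1.0;  -- plan cursors as if all rows will be
                                      -- fetched, not just the first fraction

Regards
Pavel Stehule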
On Wednesday 08 June 2011 18:39, Pavel Stehule wrote:
> If you use a FOR statement, the problem may be the implicit cursor it
> uses - try setting the GUC cursor_tuple_fraction to 1.0.

Alas, this is mammoth replicator, equivalent to PG 8.3, and it doesn't
have that parameter.

-- 
Anthony Shipman               | It's caches all the way
Anthony.Shipman@symstream.com | down.
> On Wednesday 08 June 2011 17:39, Claudio Freire wrote:
>> Of course optimally executing a plan with limit is a lot different
>> than one without.
>
> I imagined that limit just cuts out a slice of the query results.
> If it can find 80000 rows in 0.5 seconds then I would have thought that
> returning just the first 100 of them should be just as easy.

But that's exactly the problem with the LIMIT clause. The planner
considers two choices - an index scan with this estimate

   Index Scan using tdiag_pkey on tdiag
       (cost=0.00..19114765.76 rows=1141019 width=114)

and a bitmap index scan with this estimate

   Bitmap Heap Scan on tdiag
       (cost=25763.48..638085.13 rows=1141019 width=114)

and says - hey, the index scan has a much lower starting cost, and I'm
using a limit, so it's much better! Let's use the index scan. But then it
finds out it needs to scan most of the table, and that ruins the
performance.

Have you tried creating a composite index on those two columns? Not sure
if that helps, but I'd try it.

Tomas
Hello

2011/6/8 <anthony.shipman@symstream.com>:
> On Wednesday 08 June 2011 18:39, Pavel Stehule wrote:
>> If you use a FOR statement, the problem may be the implicit cursor it
>> uses - try setting the GUC cursor_tuple_fraction to 1.0.
>
> Alas, this is mammoth replicator, equivalent to PG 8.3, and it doesn't
> have that parameter.

That could be part of the problem, in combination with the bad
statistics. Maybe you should rewrite your code to something like:

declare
    i int := 0;
begin
    for x in execute '....' loop
        return next ...;
        i := i + 1;
        exit when i > limitvar;
    end loop;
end;
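Filled out, that pattern might look like the sketch below - the function
name, the parameters, and the idea of passing in a WHERE fragment are
illustrative only, not something from this thread:

    create or replace function tdiag_slice(v_where text,
                                           v_limit int, v_offset int)
    returns setof tdiag as $$
    declare
        i   int := 0;
        rec tdiag%rowtype;
    begin
        -- v_where is a caller-supplied WHERE clause fragment; trusted
        -- input assumed, since PG 8.3 lacks EXECUTE ... USING to
        -- parameterize it safely
        for rec in execute
            'select * from tdiag ' || v_where || ' order by diag_id'
        loop
            i := i + 1;
            if i > v_offset then
                return next rec;      -- emit rows past the offset
            end if;
            exit when i >= v_offset + v_limit;  -- stop instead of LIMIT
        end loop;
        return;
    end;
    $$ language plpgsql;

It would then be called as, e.g.,

    select * from tdiag_slice(
        'where create_time >= ''2011-06-07'' and create_time < ''2011-06-10''',
        100, 800);

Regards
Pavel Stehule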
On Wednesday 08 June 2011 19:47, tv@fuzzy.cz wrote:
> Have you tried creating a composite index on those two columns? Not sure
> if that helps, but I'd try it.
>
> Tomas

Do you mean

create index tdiag_index2 ON tdiag(diag_id, create_time);

Should this be in addition to, or instead of, the single index on
create_time?

I must be doing something really wrong to get this to happen:

symstream2=> select count(*) from tdiag where create_time <= '2011-05-23 03:51:00.131597+0';
 count
-------
     0
(1 row)

symstream2=> explain analyze select count(*) from tdiag where create_time <= '2011-05-23 03:51:00.131597+0';

                                  QUERY PLAN
--------------------------------------------------------------------------------
 Aggregate  (cost=863867.21..863867.22 rows=1 width=0) (actual time=58994.078..58994.080 rows=1 loops=1)
   ->  Seq Scan on tdiag  (cost=0.00..844188.68 rows=7871413 width=0) (actual time=58994.063..58994.063 rows=0 loops=1)
         Filter: (create_time <= '2011-05-23 13:51:00.131597+10'::timestamp with time zone)
 Total runtime: 58994.172 ms
(4 rows)

symstream2=> \d tdiag
                               Table "public.tdiag"
   Column    |           Type           |                        Modifiers
-------------+--------------------------+-----------------------------------------------------------
 diag_id     | integer                  | not null default nextval(('diag_id_seq'::text)::regclass)
 create_time | timestamp with time zone | default now()
 diag_time   | timestamp with time zone | not null
 device_id   | integer                  |
 fleet_id    | integer                  |
 customer_id | integer                  |
 module      | character varying        |
 node_kind   | smallint                 |
 diag_level  | smallint                 |
 message     | character varying        | not null default ''::character varying
 options     | text                     |
 tag         | character varying        | not null default ''::character varying
Indexes:
    "tdiag_pkey" PRIMARY KEY, btree (diag_id)
    "tdiag_create_time" btree (create_time)

-- 
Anthony Shipman               | Programming is like sex: One mistake and
Anthony.Shipman@symstream.com | you're providing support for a lifetime.
On Thursday 09 June 2011 16:04, anthony.shipman@symstream.com wrote:
> I must be doing something really wrong to get this to happen:

Yes I did. Ignore that.

-- 
Anthony Shipman               | flailover systems: When one goes down it
Anthony.Shipman@symstream.com | flails about until the other goes down too.
On Wednesday 08 June 2011 19:47, tv@fuzzy.cz wrote:
> Have you tried creating a composite index on those two columns? Not sure
> if that helps, but I'd try it.
>
> Tomas

This finally works well enough:

CREATE TABLE tdiag (
    diag_id     integer DEFAULT nextval('diag_id_seq'::text),
    create_time timestamp with time zone default now(),
    ....
    PRIMARY KEY (diag_id)
);

-- ************ COMPOSITE INDEX
create index tdiag_id_create on tdiag(diag_id, create_time);

alter table tdiag alter column diag_id     set statistics 1000;
alter table tdiag alter column create_time set statistics 1000;

and then just do the original query:

symstream2=> explain analyze select * from tdiag where
symstream2-> (create_time >= '2011-06-07 02:00:00.000000+0' and
symstream2->  create_time <  '2011-06-10 07:58:03.000000+0') and diag_level <= 1
symstream2-> order by diag_id LIMIT 100 OFFSET 800;

                                  QUERY PLAN
--------------------------------------------------------------------------------
 Limit  (cost=6064.19..6822.21 rows=100 width=112) (actual time=1496.644..1497.094 rows=100 loops=1)
   ->  Index Scan using tdiag_id_create on tdiag  (cost=0.00..1320219.58 rows=174166 width=112) (actual time=1409.285..1495.831 rows=900 loops=1)
         Index Cond: ((create_time >= '2011-06-07 12:00:00+10'::timestamp with time zone) AND (create_time < '2011-06-10 17:58:03+10'::timestamp with time zone))
         Filter: (diag_level <= 1)
 Total runtime: 1497.297 ms

If I had set the primary key to (diag_id, create_time), would simple
queries on diag_id still work well, i.e.

select * from tdiag where diag_id = 1234;
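For reference, the variant being asked about could be set up on the
existing table like this - a sketch only, untested here; note that a
primary key forces NOT NULL on create_time, and that swapping the
constraint takes an exclusive lock:

    alter table tdiag drop constraint tdiag_pkey;
    alter table tdiag add primary key (diag_id, create_time);

-- 
Anthony Shipman               | -module(erlang).
Anthony.Shipman@symstream.com | ''(_)->0.  %-)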
> If I had set the primary key to (diag_id, create_time), would simple
> queries on diag_id still work well, i.e.
>
> select * from tdiag where diag_id = 1234;

Yes. IIRC the performance penalty for using a non-leading column of an
index is negligible. But why don't you try that on your own - just run
an EXPLAIN and you'll get an immediate answer as to whether it works.
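The check itself is one line - the plan shown will of course depend on
the actual table:

    EXPLAIN SELECT * FROM tdiag WHERE diag_id = 1234;
    -- an "Index Scan" line naming tdiag_pkey (or the composite index)
    -- confirms the single-column lookup is still index-driven

regards
Tomas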
On Fri, Jun 10, 2011 at 1:22 PM, <tv@fuzzy.cz> wrote:
>> If I had set the primary key to (diag_id, create_time), would simple
>> queries on diag_id still work well, i.e.
>> select * from tdiag where diag_id = 1234;
>
> Yes. IIRC the performance penalty for using a non-leading column of an
> index is negligible. But why don't you try that on your own - just run
> an EXPLAIN and you'll get an immediate answer as to whether it works.

The effective penalty, which you don't see in your explain, is the size
of the index. Depending on the data stored there, the index can grow up
to double the size (usually less than that), and a bigger index is slower
for all operations.

But, in general, if you need both a single-column and a multi-column
index, just go for a multipurpose multicolumn one.
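One quick way to see that size cost directly - a sketch using the
standard size functions, with the index names from earlier in the thread:

    SELECT relname AS index_name,
           pg_size_pretty(pg_relation_size(oid)) AS on_disk_size
    FROM pg_class
    WHERE relname IN ('tdiag_pkey', 'tdiag_id_create');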