Thread: strange query plan with LIMIT
Version: PostgreSQL 8.3.5 (mammoth replicator)

Schema:

CREATE TABLE tdiag (
    diag_id     integer DEFAULT nextval('diag_id_seq'::text),
    create_time timestamp with time zone default now(), /* time this record was created */
    diag_time   timestamp with time zone not null,
    device_id   integer,   /* optional */
    fleet_id    integer,   /* optional */
    customer_id integer,   /* optional */
    module      character varying,
    node_kind   smallint,
    diag_level  smallint,
    tag         character varying not null default '',
    message     character varying not null default '',
    options     text,

    PRIMARY KEY (diag_id)
);

create index tdiag_create_time ON tdiag(create_time);

The number of rows is over 33 million, with time stamps over the past two
weeks. The create_time order is almost identical to the id order. What I
want to find is the first or last entry by id in a given time range. The
query I am having a problem with is:

symstream2=> explain analyze select * from tdiag where
(create_time >= '2011-06-03 09:49:04.000000+0' and
 create_time <  '2011-06-06 09:59:04.000000+0')
order by diag_id limit 1;

                                  QUERY PLAN
--------------------------------------------------------------------------------
 Limit  (cost=0.00..16.75 rows=1 width=114) (actual time=69425.356..69425.358 rows=1 loops=1)
   ->  Index Scan using tdiag_pkey on tdiag  (cost=0.00..19114765.76 rows=1141019 width=114) (actual time=69425.352..69425.352 rows=1 loops=1)
         Filter: ((create_time >= '2011-06-03 19:49:04+10'::timestamp with time zone) AND (create_time < '2011-06-06 19:59:04+10'::timestamp with time zone))
 Total runtime: 69425.400 ms

PG seems to decide it must scan the diag_id column and filter each row by
the create_time.

If I leave out the limit I get

symstream2=> explain analyze select * from tdiag where
(create_time >= '2011-06-03 09:49:04.000000+0' and
 create_time <  '2011-06-06 09:59:04.000000+0')
order by diag_id;

                                  QUERY PLAN
--------------------------------------------------------------------------------
 Sort  (cost=957632.43..960484.98 rows=1141019 width=114) (actual time=552.795..656.319 rows=86530 loops=1)
   Sort Key: diag_id
   Sort Method:  external merge  Disk: 9872kB
   ->  Bitmap Heap Scan on tdiag  (cost=25763.48..638085.13 rows=1141019 width=114) (actual time=43.232..322.441 rows=86530 loops=1)
         Recheck Cond: ((create_time >= '2011-06-03 19:49:04+10'::timestamp with time zone) AND (create_time < '2011-06-06 19:59:04+10'::timestamp with time zone))
         ->  Bitmap Index Scan on tdiag_create_time  (cost=0.00..25478.23 rows=1141019 width=0) (actual time=42.574..42.574 rows=86530 loops=1)
               Index Cond: ((create_time >= '2011-06-03 19:49:04+10'::timestamp with time zone) AND (create_time < '2011-06-06 19:59:04+10'::timestamp with time zone))
 Total runtime: 736.440 ms
(8 rows)

I can be explicit about the query order:

select * into tt from tdiag where
(create_time >= '2011-06-03 09:49:04.000000+0' and
 create_time <  '2011-06-06 09:59:04.000000+0');

symstream2=> explain analyze select * from tt order by diag_id limit 1;

                                  QUERY PLAN
--------------------------------------------------------------------------------
 Limit  (cost=2731.95..2731.95 rows=1 width=101) (actual time=440.165..440.166 rows=1 loops=1)
   ->  Sort  (cost=2731.95..2948.28 rows=86530 width=101) (actual time=440.161..440.161 rows=1 loops=1)
         Sort Key: diag_id
         Sort Method:  top-N heapsort  Memory: 17kB
         ->  Seq Scan on tt  (cost=0.00..2299.30 rows=86530 width=101) (actual time=19.602..330.873 rows=86530 loops=1)
 Total runtime: 440.209 ms
(6 rows)

But if I try using a subquery I get

symstream2=> explain analyze select * from
(select * from tdiag where
 (create_time >= '2011-06-03 09:49:04.000000+0' and
  create_time <  '2011-06-06 09:59:04.000000+0')) as sub
order by diag_id limit 1;

                                  QUERY PLAN
--------------------------------------------------------------------------------
 Limit  (cost=0.00..16.75 rows=1 width=114) (actual time=90344.384..90344.385 rows=1 loops=1)
   ->  Index Scan using tdiag_pkey on tdiag  (cost=0.00..19114765.76 rows=1141019 width=114) (actual time=90344.380..90344.380 rows=1 loops=1)
         Filter: ((create_time >= '2011-06-03 19:49:04+10'::timestamp with time zone) AND (create_time < '2011-06-06 19:59:04+10'::timestamp with time zone))
 Total runtime: 90344.431 ms

How do I make this both fast and simple?

-- 
Anthony Shipman               | Tech Support: The guys who follow the
Anthony.Shipman@symstream.com | 'Parade of New Products' with a shovel.
> Version: PostgreSQL 8.3.5 (mammoth replicator)
>
> Schema:
>
> CREATE TABLE tdiag (
>     diag_id     integer DEFAULT nextval('diag_id_seq'::text),
>     create_time timestamp with time zone default now(), /* time this record was created */
>     diag_time   timestamp with time zone not null,
>     device_id   integer,   /* optional */
>     fleet_id    integer,   /* optional */
>     customer_id integer,   /* optional */
>     module      character varying,
>     node_kind   smallint,
>     diag_level  smallint,
>     tag         character varying not null default '',
>     message     character varying not null default '',
>     options     text,
>
>     PRIMARY KEY (diag_id)
> );
>
> create index tdiag_create_time ON tdiag(create_time);
>
> The number of rows is over 33 million, with time stamps over the past two
> weeks.
>
> The create_time order is almost identical to the id order. What I want
> to find is the first or last entry by id in a given time range. The
> query I am having a problem with is:

Hi,

why are you reposting this? Pavel Stehule already recommended you to run
ANALYZE on the tdiag table - have you done that? What was the effect?

The stats are off - e.g. the bitmap scan says

   ->  Bitmap Heap Scan on tdiag  (cost=25763.48..638085.13 rows=1141019
       width=114) (actual time=43.232..322.441 rows=86530 loops=1)

so it expects to get 1141019 rows but it gets 86530, i.e. about 7% of the
expected number. That might be enough to cause a bad plan choice and thus
performance issues.

And yet another recommendation - the sort is performed on disk, so give it
more work_mem and it should be much faster (the Sort Method should change
from an external merge on disk to a quicksort in memory). Try something
like work_mem=20MB and see if it does the trick.
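For example, both suggestions can be tried together in one session - a
sketch only; SET here is session-local and 20MB is just a starting value
to experiment with:

    ANALYZE tdiag;          -- refresh the planner statistics for the table
    SET work_mem = '20MB';  -- per-sort memory budget for this session

    EXPLAIN ANALYZE
    SELECT * FROM tdiag
    WHERE create_time >= '2011-06-03 09:49:04+0'
      AND create_time <  '2011-06-06 09:59:04+0'
    ORDER BY diag_id;
    -- with enough work_mem the plan line should read
    -- "Sort Method: quicksort Memory: ..." instead of
    -- "Sort Method: external merge Disk: ..."

regards
Tomas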
On Wednesday 08 June 2011 02:40, tv@fuzzy.cz wrote:
> Hi,
>
> why are you reposting this? Pavel Stehule already recommended you to run
> ANALYZE on the tdiag table - have you done that? What was the effect?

The mailing list system hiccupped and I ended up with two posts.

VACUUM ANALYZE was done, more than once. Setting the statistics value on
the diag_id column to 1000 seemed to only make the query a bit slower.

> The stats are off - e.g. the bitmap scan says
>
>    ->  Bitmap Heap Scan on tdiag  (cost=25763.48..638085.13 rows=1141019
>        width=114) (actual time=43.232..322.441 rows=86530 loops=1)
>
> so it expects to get 1141019 rows but it gets 86530, i.e. about 7% of the
> expected number. That might be enough to cause a bad plan choice and thus
> performance issues.

What seems odd to me is that the only difference between the two is the
limit clause:

select * from tdiag where
(create_time >= '2011-06-03 09:49:04.000000+0' and
 create_time <  '2011-06-06 09:59:04.000000+0')
order by diag_id limit 1;

select * from tdiag where
(create_time >= '2011-06-03 09:49:04.000000+0' and
 create_time <  '2011-06-06 09:59:04.000000+0')
order by diag_id;

and yet the plan completely changes.

I think that I have to force the evaluation order to get a reliably fast
result:

begin;
create temporary table tt on commit drop as
    select diag_id from tdiag
    where create_time >= '2011-06-03 09:49:04.000000+0'
      and create_time <  '2011-06-06 09:59:04.000000+0';
select * from tdiag where diag_id in (select * from tt)
order by diag_id limit 10;
commit;

                                  QUERY PLAN
--------------------------------------------------------------------------------
 Limit  (cost=3566.24..3566.27 rows=10 width=112) (actual time=1800.699..1800.736 rows=10 loops=1)
   ->  Sort  (cost=3566.24..3566.74 rows=200 width=112) (actual time=1800.694..1800.708 rows=10 loops=1)
         Sort Key: tdiag.diag_id
         Sort Method:  top-N heapsort  Memory: 18kB
         ->  Nested Loop  (cost=1360.00..3561.92 rows=200 width=112) (actual time=269.087..1608.324 rows=86530 loops=1)
               ->  HashAggregate  (cost=1360.00..1362.00 rows=200 width=4) (actual time=269.052..416.898 rows=86530 loops=1)
                     ->  Seq Scan on tt  (cost=0.00..1156.00 rows=81600 width=4) (actual time=0.020..120.323 rows=86530 loops=1)
               ->  Index Scan using tdiag_pkey on tdiag  (cost=0.00..10.99 rows=1 width=112) (actual time=0.006..0.008 rows=1 loops=86530)
                     Index Cond: (tdiag.diag_id = tt.diag_id)
 Total runtime: 1801.290 ms

> And yet another recommendation - the sort is performed on disk, so give it
> more work_mem and it should be much faster.

This certainly speeds up the sorting.

-- 
Anthony Shipman               | What most people think about
Anthony.Shipman@symstream.com | most things is mostly wrong.
On Wed, Jun 8, 2011 at 7:08 AM, <anthony.shipman@symstream.com> wrote:
> What seems odd to me is that the only difference between the two is the
> limit clause

Why would that seem odd? Of course optimally executing a plan with limit
is a lot different than one without.

Just... why are you sorting by diag_id? I believe you would be better off
sorting by timestamp than diag_id, but I don't know what the query is
supposed to do.

In any case, that's a weakness I've seen in many database systems, and
postgres is no exception: order + limit strongly suggests index usage,
and when the ordered column is "anti"-correlated with the where clause
(that is, too many of the first rows in the ordered output are filtered
out by the where clause), the plan with an index is insufferably slow
compared to a sequential scan + sort. Postgres has no way to know that;
it depends on the correlation between the where clause and the ordering
expressions.

If you cannot change the query, I think your only option is to either add
a specific index for that query (i.e., if the where clause is always the
same, you could add a partial index), or just disable nested loops with
"set enable_nestloop = false;" just prior to running that query (and
remember to re-enable it afterwards).
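Both workarounds in SQL, as a sketch - the index name and the fixed date
range in the predicate are only illustrative, not from the thread:

    -- Partial index: only helps if queries always use this exact predicate
    create index tdiag_slice_by_id on tdiag (diag_id)
    where create_time >= '2011-06-03 09:49:04+0'
      and create_time <  '2011-06-06 09:59:04+0';

    -- Or steer the planner for just this query, then restore the default
    set enable_nestloop = false;
    select * from tdiag
    where create_time >= '2011-06-03 09:49:04+0'
      and create_time <  '2011-06-06 09:59:04+0'
    order by diag_id limit 1;
    set enable_nestloop = true;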
> What seems odd to me is that the only difference between the two is the
> limit clause:
>
> select * from tdiag where
> (create_time >= '2011-06-03 09:49:04.000000+0' and
>  create_time <  '2011-06-06 09:59:04.000000+0')
> order by diag_id limit 1;
>
> select * from tdiag where
> (create_time >= '2011-06-03 09:49:04.000000+0' and
>  create_time <  '2011-06-06 09:59:04.000000+0')
> order by diag_id;
>
> and yet the plan completely changes.

As Claudio Freire already pointed out, this is expected behavior. With
LIMIT the planner prefers plans with a low starting cost: it expects to
finish early, so building an index bitmap or a hash table would be a
waste. So actually it would be very odd if the plan did not change in
this case ...

Anyway, I have no idea how to fix this cleanly - without messing with
enable_* or cost variables or other such dirty tricks.

regards
Tomas
On Wednesday 08 June 2011 17:39, Claudio Freire wrote:
> Of course optimally executing a plan with limit is a lot different
> than one without.

I imagined that limit just cuts out a slice of the query results. If it
can find 80000 rows in 0.5 seconds then I would have thought that
returning just the first 100 of them should be just as easy.

> Just... why are you sorting by diag_id?
>
> I believe you would be better off sorting by timestamp than diag_id,
> but I don't know what the query is supposed to do.

The timestamp is only almost monotonic. I need to scan the table in
slices, and I use limit and offset to select each slice.

I've forced the query order with some plpgsql like:

declare
    query character varying;
    rec   record;
begin
    -- PG 8.3 doesn't have the 'using' syntax nor 'return query execute'

    execute 'create temporary table tt on commit drop as ' ||
            'select diag_id from tdiag ' || v_where;

    query := 'select * from tdiag where diag_id in (select * from tt) ' ||
             'order by diag_id ' || v_limit || ' ' || v_offset;

    for rec in execute query loop
        return next rec;
    end loop;
end;

-- 
Anthony Shipman               | Life is the interval
Anthony.Shipman@symstream.com | between pay days.
2011/6/8 <anthony.shipman@symstream.com>:
> The timestamp is only almost monotonic. I need to scan the table in
> slices, and I use limit and offset to select each slice.
>
> I've forced the query order with some plpgsql like:
>
> declare
>     query character varying;
>     rec   record;
> begin
>     -- PG 8.3 doesn't have the 'using' syntax nor 'return query execute'
>
>     execute 'create temporary table tt on commit drop as ' ||
>             'select diag_id from tdiag ' || v_where;
>
>     query := 'select * from tdiag where diag_id in (select * from tt) ' ||
>              'order by diag_id ' || v_limit || ' ' || v_offset;
>
>     for rec in execute query loop
>         return next rec;
>     end loop;
> end;

If you use a FOR statement, the problem may be the implicit cursor it
uses - try setting the GUC cursor_tuple_fraction to 1.0.
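On releases that expose that setting it is just a session-local
assignment, sketched here for reference:

    SET cursor_tuple_fraction = 1.0;  -- plan cursors as if all rows will be
                                      -- fetched, not just the first fraction

Regards
Pavel Stehule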
On Wednesday 08 June 2011 18:39, Pavel Stehule wrote:
> If you use a FOR statement, the problem may be the implicit cursor it
> uses - try setting the GUC cursor_tuple_fraction to 1.0.

Alas, this is mammoth replicator, equivalent to PG 8.3, and it doesn't
have that parameter.

-- 
Anthony Shipman               | It's caches all the way
Anthony.Shipman@symstream.com | down.
> On Wednesday 08 June 2011 17:39, Claudio Freire wrote:
>> Of course optimally executing a plan with limit is a lot different
>> than one without.
>
> I imagined that limit just cuts out a slice of the query results.
> If it can find 80000 rows in 0.5 seconds then I would have thought that
> returning just the first 100 of them should be just as easy.

But that's exactly the problem with the LIMIT clause. The planner
considers two choices - an index scan with this estimate

   Index Scan using tdiag_pkey on tdiag
       (cost=0.00..19114765.76 rows=1141019 width=114)

and a bitmap index scan with this estimate

   Bitmap Heap Scan on tdiag
       (cost=25763.48..638085.13 rows=1141019 width=114)

and says - hey, the index scan has a much lower starting cost, and I'm
using a limit, so it's much better! Let's use the index scan. But then it
finds out it needs to scan most of the table, and that ruins the
performance.

Have you tried creating a composite index on those two columns? Not sure
if that helps, but I'd try it.

Tomas
Hello

2011/6/8 <anthony.shipman@symstream.com>:
> On Wednesday 08 June 2011 18:39, Pavel Stehule wrote:
>> If you use a FOR statement, the problem may be the implicit cursor it
>> uses - try setting the GUC cursor_tuple_fraction to 1.0.
>
> Alas, this is mammoth replicator, equivalent to PG 8.3, and it doesn't
> have that parameter.

That could be part of the problem, in combination with the bad
statistics. Maybe you should rewrite your code to something like:

declare
    i int := 0;
begin
    for x in execute '....' loop
        return next ...;
        i := i + 1;
        exit when i > limitvar;
    end loop;
end;
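Filled out, that pattern might look like the sketch below - the function
name, the parameters, and the idea of passing in a WHERE fragment are
illustrative only, not something from this thread:

    create or replace function tdiag_slice(v_where text,
                                           v_limit int, v_offset int)
    returns setof tdiag as $$
    declare
        i   int := 0;
        rec tdiag%rowtype;
    begin
        -- v_where is a caller-supplied WHERE clause fragment; trusted
        -- input assumed, since PG 8.3 lacks EXECUTE ... USING to
        -- parameterize it safely
        for rec in execute
            'select * from tdiag ' || v_where || ' order by diag_id'
        loop
            i := i + 1;
            if i > v_offset then
                return next rec;      -- emit rows past the offset
            end if;
            exit when i >= v_offset + v_limit;  -- stop instead of LIMIT
        end loop;
        return;
    end;
    $$ language plpgsql;

It would then be called as, e.g.,

    select * from tdiag_slice(
        'where create_time >= ''2011-06-07'' and create_time < ''2011-06-10''',
        100, 800);

Regards
Pavel Stehule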
On Wednesday 08 June 2011 19:47, tv@fuzzy.cz wrote:
> Have you tried creating a composite index on those two columns? Not sure
> if that helps, but I'd try it.
>
> Tomas

Do you mean

create index tdiag_index2 ON tdiag(diag_id, create_time);

Should this be in addition to, or instead of, the single index on
create_time?

I must be doing something really wrong to get this to happen:

symstream2=> select count(*) from tdiag where create_time <= '2011-05-23 03:51:00.131597+0';
 count
-------
     0
(1 row)

symstream2=> explain analyze select count(*) from tdiag where create_time <= '2011-05-23 03:51:00.131597+0';

                                  QUERY PLAN
--------------------------------------------------------------------------------
 Aggregate  (cost=863867.21..863867.22 rows=1 width=0) (actual time=58994.078..58994.080 rows=1 loops=1)
   ->  Seq Scan on tdiag  (cost=0.00..844188.68 rows=7871413 width=0) (actual time=58994.063..58994.063 rows=0 loops=1)
         Filter: (create_time <= '2011-05-23 13:51:00.131597+10'::timestamp with time zone)
 Total runtime: 58994.172 ms
(4 rows)

symstream2=> \d tdiag
                               Table "public.tdiag"
   Column    |           Type           |                        Modifiers
-------------+--------------------------+-----------------------------------------------------------
 diag_id     | integer                  | not null default nextval(('diag_id_seq'::text)::regclass)
 create_time | timestamp with time zone | default now()
 diag_time   | timestamp with time zone | not null
 device_id   | integer                  |
 fleet_id    | integer                  |
 customer_id | integer                  |
 module      | character varying        |
 node_kind   | smallint                 |
 diag_level  | smallint                 |
 message     | character varying        | not null default ''::character varying
 options     | text                     |
 tag         | character varying        | not null default ''::character varying
Indexes:
    "tdiag_pkey" PRIMARY KEY, btree (diag_id)
    "tdiag_create_time" btree (create_time)

-- 
Anthony Shipman               | Programming is like sex: One mistake and
Anthony.Shipman@symstream.com | you're providing support for a lifetime.
On Thursday 09 June 2011 16:04, anthony.shipman@symstream.com wrote:
> I must be doing something really wrong to get this to happen:

Yes I did. Ignore that.

-- 
Anthony Shipman               | flailover systems: When one goes down it
Anthony.Shipman@symstream.com | flails about until the other goes down too.
On Wednesday 08 June 2011 19:47, tv@fuzzy.cz wrote:
> Have you tried creating a composite index on those two columns? Not sure
> if that helps, but I'd try it.
>
> Tomas

This finally works well enough:

CREATE TABLE tdiag (
    diag_id     integer DEFAULT nextval('diag_id_seq'::text),
    create_time timestamp with time zone default now(),
    ....
    PRIMARY KEY (diag_id)
);

-- ************ COMPOSITE INDEX
create index tdiag_id_create on tdiag(diag_id, create_time);

alter table tdiag alter column diag_id     set statistics 1000;
alter table tdiag alter column create_time set statistics 1000;

and then just do the original query:

symstream2=> explain analyze select * from tdiag where
symstream2-> (create_time >= '2011-06-07 02:00:00.000000+0' and
symstream2->  create_time <  '2011-06-10 07:58:03.000000+0') and diag_level <= 1
symstream2-> order by diag_id LIMIT 100 OFFSET 800;

                                  QUERY PLAN
--------------------------------------------------------------------------------
 Limit  (cost=6064.19..6822.21 rows=100 width=112) (actual time=1496.644..1497.094 rows=100 loops=1)
   ->  Index Scan using tdiag_id_create on tdiag  (cost=0.00..1320219.58 rows=174166 width=112) (actual time=1409.285..1495.831 rows=900 loops=1)
         Index Cond: ((create_time >= '2011-06-07 12:00:00+10'::timestamp with time zone) AND (create_time < '2011-06-10 17:58:03+10'::timestamp with time zone))
         Filter: (diag_level <= 1)
 Total runtime: 1497.297 ms

If I had set the primary key to (diag_id, create_time), would simple
queries on diag_id still work well, i.e.

select * from tdiag where diag_id = 1234;
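For reference, the variant being asked about could be set up on the
existing table like this - a sketch only, untested here; note that a
primary key forces NOT NULL on create_time, and that swapping the
constraint takes an exclusive lock:

    alter table tdiag drop constraint tdiag_pkey;
    alter table tdiag add primary key (diag_id, create_time);

-- 
Anthony Shipman               | -module(erlang).
Anthony.Shipman@symstream.com | ''(_)->0.  %-)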
> If I had set the primary key to (diag_id, create_time), would simple
> queries on diag_id still work well, i.e.
>
> select * from tdiag where diag_id = 1234;

Yes. IIRC the performance penalty for using a non-leading column of an
index is negligible. But why don't you try that on your own - just run
an EXPLAIN and you'll get an immediate answer as to whether it works.
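The check itself is one line - the plan shown will of course depend on
the actual table:

    EXPLAIN SELECT * FROM tdiag WHERE diag_id = 1234;
    -- an "Index Scan" line naming tdiag_pkey (or the composite index)
    -- confirms the single-column lookup is still index-driven

regards
Tomas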
On Fri, Jun 10, 2011 at 1:22 PM, <tv@fuzzy.cz> wrote:
>> If I had set the primary key to (diag_id, create_time), would simple
>> queries on diag_id still work well, i.e.
>> select * from tdiag where diag_id = 1234;
>
> Yes. IIRC the performance penalty for using a non-leading column of an
> index is negligible. But why don't you try that on your own - just run
> an EXPLAIN and you'll get an immediate answer as to whether it works.

The effective penalty, which you don't see in your explain, is the size
of the index. Depending on the data stored there, the index can grow up
to double the size (usually less than that), and a bigger index is slower
for all operations.

But, in general, if you need both a single-column and a multi-column
index, just go for a multipurpose multicolumn one.
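One quick way to see that size cost directly - a sketch using the
standard size functions, with the index names from earlier in the thread:

    SELECT relname AS index_name,
           pg_size_pretty(pg_relation_size(oid)) AS on_disk_size
    FROM pg_class
    WHERE relname IN ('tdiag_pkey', 'tdiag_id_create');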