Thread: *very* inefficient choice made by the planner (regarding IN(...))
L.S. Could anybody explain why the planner is doing what it is doing? What could I do to make it easier to choose a better plan? ********* Summary ********* On a freshly vacuum/analysed pair of tables with 7389 and 64333 records, this: select id from location where id not in (select location_id from location_carrier); takes 581546,497 ms While a variant like: select id from location where not exists (select 1 from location_carrier where location_id = location.id); takes only 124,625 ms ********* Details ********* =# select version(); version --------------------------------------------------------------------- PostgreSQL 7.4.2 on i686-pc-linux-gnu, compiled by GCC egcs-2.91.66 (1 row) =# \d location Table "public.location" Column | Type | Modifiers ------------+-----------------------------+----------- id | integer | not null Indexes: "location_pkey" primary key, btree (id) =# select count(*) from location; count ------- 7389 (1 row) =# \d location_carrier Table "public.location_carrier" Column | Type | Modifiers ---------------------+-----------------------------+----------- location_id | integer | not null carrier_id | integer | not null Indexes: "location_carrier_pkey" primary key, btree (location_id, carrier_id) =# select count(*) from location_carrier; count ------- 64333 (1 row) =# explain select id from location where id not in (select location_id from location_carrier); QUERY PLAN ------------------------------------------------------------------------------- Seq Scan on "location" (cost=0.00..5077093.72 rows=3695 width=4) Filter: (NOT (subplan)) SubPlan -> Seq Scan on location_carrier (cost=0.00..1213.33 rows=64333 width=4) (4 rows) =# explain analyse select id from location where id not in (select location_id from location_carrier); QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------- Seq Scan on "location" (cost=0.00..5077093.72 rows=3695 width=4) (actual time=248.310..581541.483 rows=240 loops=1) Filter: (NOT (subplan)) SubPlan -> Seq Scan on location_carrier (cost=0.00..1213.33 rows=64333 width=4) (actual time=0.007..48.517 rows=19364 loops=7389) Total runtime: 581542.560 ms (5 rows) Time: 581546,497 ms =# explain analyse select id from location l left outer join location_carrier lc on l.id = lc.location_id where lc.location_id is null; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------- Merge Left Join (cost=0.00..3022.51 rows=7389 width=4) (actual time=0.083..435.841 rows=240 loops=1) Merge Cond: ("outer".id = "inner".location_id) Filter: ("inner".location_id IS NULL) -> Index Scan using location_pkey on "location" l (cost=0.00..258.85 rows=7389 width=4) (actual time=0.041..26.211 rows=7389 loops=1) -> Index Scan using location_carrier_pkey on location_carrier lc (cost=0.00..1941.22 rows=64333 width=4) (actual time=0.015..238.305 rows=64333 loops=1) Total runtime: 436.213 ms (6 rows) Time: 440,787 ms megafox=# explain analyse select id from location where not exists (select 1 from location_carrier where location_id = location.id); QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------- Seq Scan on "location" (cost=0.00..13242.14 rows=3695 width=4) (actual time=0.078..120.785 rows=240 loops=1) Filter: (NOT (subplan)) SubPlan -> Index Scan using location_carrier_pkey on location_carrier (cost=0.00..17.61 rows=10 width=0) (actual time=0.011..0.011 rows=1 loops=7389) Index Cond: (location_id = $0) Total runtime: 121.165 ms (6 rows) Time: 124,625 ms -- Best, Frank.
On Thu, 10 Jun 2004, Frank van Vugt wrote: > Could anybody explain why the planner is doing what it is doing? > > What could I do to make it easier to choose a better plan? You might try raising sort_mem to see if it chooses a better plan. I think it may be guessing that the hash won't fit and falling back to the plan you were getting.
Frank van Vugt <ftm.van.vugt@foxi.nl> writes: > What could I do to make it easier to choose a better plan? Increase sort_mem. You want it to pick a "hashed subplan", but it's not doing so because 64000 rows won't fit in the default sort_mem. regards, tom lane
Wow, The effectiveness of the pgsql mailinglists never ceases to amaze me. Default sort mem it was, I guess I'd simply been to cautious with this per-client setting. Stephan & Tom : thanks! -- Best, Frank.
The real question is: If the two statments are functionally equivalent, why can't PG rewrite the "NOT IN" version into the more efficient "NOT EXISTS"? Frank van Vugt wrote: > L.S. > > Could anybody explain why the planner is doing what it is doing? > > What could I do to make it easier to choose a better plan? > > > > ********* > Summary > ********* > On a freshly vacuum/analysed pair of tables with 7389 and 64333 records, this: > > select id from location where id not in (select location_id from > location_carrier); > > takes 581546,497 ms > > > While a variant like: > > select id from location where not exists (select 1 from location_carrier where > location_id = location.id); > > takes only 124,625 ms > > > ********* > Details > ********* > =# select version(); > version > --------------------------------------------------------------------- > PostgreSQL 7.4.2 on i686-pc-linux-gnu, compiled by GCC egcs-2.91.66 > (1 row) > > > =# \d location > Table "public.location" > Column | Type | Modifiers > ------------+-----------------------------+----------- > id | integer | not null > Indexes: > "location_pkey" primary key, btree (id) > > > =# select count(*) from location; > count > ------- > 7389 > (1 row) > > > =# \d location_carrier > Table "public.location_carrier" > Column | Type | Modifiers > ---------------------+-----------------------------+----------- > location_id | integer | not null > carrier_id | integer | not null > Indexes: > "location_carrier_pkey" primary key, btree (location_id, carrier_id) > > > =# select count(*) from location_carrier; > count > ------- > 64333 > (1 row) > > > =# explain select id from location where id not in (select location_id from > location_carrier); > QUERY PLAN > ------------------------------------------------------------------------------- > Seq Scan on "location" (cost=0.00..5077093.72 rows=3695 width=4) > Filter: (NOT (subplan)) > SubPlan > -> Seq Scan on location_carrier (cost=0.00..1213.33 rows=64333 width=4) > (4 rows) > > > =# explain analyse select id from location where id not in (select location_id > from location_carrier); > QUERY PLAN > --------------------------------------------------------------------------------------------------------------------------------- > Seq Scan on "location" (cost=0.00..5077093.72 rows=3695 width=4) (actual > time=248.310..581541.483 rows=240 loops=1) > Filter: (NOT (subplan)) > SubPlan > -> Seq Scan on location_carrier (cost=0.00..1213.33 rows=64333 width=4) > (actual time=0.007..48.517 rows=19364 loops=7389) > Total runtime: 581542.560 ms > (5 rows) > > Time: 581546,497 ms > > > =# explain analyse select id from location l left outer join location_carrier > lc on l.id = lc.location_id where lc.location_id is null; > QUERY > PLAN > -------------------------------------------------------------------------------------------------------------------------------------------------------------- > Merge Left Join (cost=0.00..3022.51 rows=7389 width=4) (actual > time=0.083..435.841 rows=240 loops=1) > Merge Cond: ("outer".id = "inner".location_id) > Filter: ("inner".location_id IS NULL) > -> Index Scan using location_pkey on "location" l (cost=0.00..258.85 > rows=7389 width=4) (actual time=0.041..26.211 rows=7389 loops=1) > -> Index Scan using location_carrier_pkey on location_carrier lc > (cost=0.00..1941.22 rows=64333 width=4) (actual time=0.015..238.305 > rows=64333 loops=1) > Total runtime: 436.213 ms > (6 rows) > > Time: 440,787 ms > > > megafox=# explain analyse select id from location where not exists (select 1 > from location_carrier where location_id = location.id); > QUERY > PLAN > ----------------------------------------------------------------------------------------------------------------------------------------------------- > Seq Scan on "location" (cost=0.00..13242.14 rows=3695 width=4) (actual > time=0.078..120.785 rows=240 loops=1) > Filter: (NOT (subplan)) > SubPlan > -> Index Scan using location_carrier_pkey on location_carrier > (cost=0.00..17.61 rows=10 width=0) (actual time=0.011..0.011 rows=1 > loops=7389) > Index Cond: (location_id = $0) > Total runtime: 121.165 ms > (6 rows) > > Time: 124,625 ms > > > > > > >
Jean-Luc Lachance <jllachan@sympatico.ca> writes: > If the two statments are functionally equivalent, why can't PG rewrite > the "NOT IN" version into the more efficient "NOT EXISTS"? They're not equivalent. In particular, the behavior in the presence of NULLs is quite different. regards, tom lane
I agree, but it should be a simple rewrite. No? x IS NULL/IS NOT NULL AND/OR NOT EXISTS Tom Lane wrote: > Jean-Luc Lachance <jllachan@sympatico.ca> writes: > >>If the two statments are functionally equivalent, why can't PG rewrite >>the "NOT IN" version into the more efficient "NOT EXISTS"? > > > They're not equivalent. In particular, the behavior in the presence of > NULLs is quite different. > > regards, tom lane >
On Thu, 10 Jun 2004, Jean-Luc Lachance wrote: > I agree, but it should be a simple rewrite. No? It's NULLs inside the subselect that are the issue. select 1 in (select a from foo) select exists ( select 1 from foo where a=1) If foo.a contains a row with NULL but no rows containing a 1, the above give different results (unknown and exists) and IIRC, exists cannot return unknown, so there's no simple rewrite of the subselect that gives equivalent behavior.
On Thu, 10 Jun 2004, Stephan Szabo wrote: > > On Thu, 10 Jun 2004, Jean-Luc Lachance wrote: > > > I agree, but it should be a simple rewrite. No? > > It's NULLs inside the subselect that are the issue. > > select 1 in (select a from foo) > select exists ( select 1 from foo where a=1) > > If foo.a contains a row with NULL but no rows containing a 1, the above > give different results (unknown and exists) and IIRC, exists cannot Erm that exists should have been false
Dear Gurus, ----- Original Message ----- From: "Stephan Szabo" <sszabo@megazone.bigpanda.com> Sent: Thursday, June 10, 2004 7:14 PM > > On Thu, 10 Jun 2004, Stephan Szabo wrote: > > > > > On Thu, 10 Jun 2004, Jean-Luc Lachance wrote: > > > > > I agree, but it should be a simple rewrite. No? > > > > It's NULLs inside the subselect that are the issue. > > > > select 1 in (select a from foo) > > select exists ( select 1 from foo where a=1) Just a dumb try :) SELECT (exists(select 1 from foo where a isnull) AND NULL) OR exists(select 1 from foo where a=1) AFAIK this returns * NULL if (NULL in foo.a) and (1 not in foo.a) * (1 in foo.a) otherwise. The weakness is the doubled exists clause. I'm sure it makes most cases at least doubtful... G. %----------------------- cut here -----------------------% \end
=?iso-8859-1?Q?SZUCS_G=E1bor?= <surrano@mailbox.hu> writes: >> It's NULLs inside the subselect that are the issue. > > select 1 in (select a from foo) > select exists ( select 1 from foo where a=1) > Just a dumb try :) > SELECT (exists(select 1 from foo where a isnull) AND NULL) > OR exists(select 1 from foo where a=1) The more general case is where you have a variable (or expression) on the left of IN. That could be NULL too, and this still doesn't give the right result in that case :-(. With NULL on the left the correct answer would be FALSE if the subselect has zero rows, NULL otherwise. regards, tom lane
On Fri, 18 Jun 2004, [iso-8859-1] SZUCS Gábor wrote: > Dear Gurus, > > ----- Original Message ----- > From: "Stephan Szabo" <sszabo@megazone.bigpanda.com> > Sent: Thursday, June 10, 2004 7:14 PM > > > > > > On Thu, 10 Jun 2004, Stephan Szabo wrote: > > > > > > > > On Thu, 10 Jun 2004, Jean-Luc Lachance wrote: > > > > > > > I agree, but it should be a simple rewrite. No? > > > > > > It's NULLs inside the subselect that are the issue. > > > > > > select 1 in (select a from foo) > > > select exists ( select 1 from foo where a=1) > > Just a dumb try :) > > SELECT (exists(select 1 from foo where a isnull) AND NULL) > OR exists(select 1 from foo where a=1) > > AFAIK this returns > * NULL if (NULL in foo.a) and (1 not in foo.a) > * (1 in foo.a) otherwise. > > The weakness is the doubled exists clause. I'm sure it makes most cases at > least doubtful... Well, once you take into account the lhs being potentially null lhe in (select rhe from foo) is something like: case when lhe is null then not exists(select 1 from foo limit 1) or null else (exists(select 1 from foo where rhe is null) and null) or exists(select 1 from foo where rhe=lhe) end I think the real win occurs for where clause cases if it can pull up the exists that references lhe so that it doesn't try to evaluate it on every row and that's unlikely to occur in something like the above.