Thread: *very* inefficient choice made by the planner (regarding IN(...))

*very* inefficient choice made by the planner (regarding IN(...))

From
Frank van Vugt
Date:
L.S.

Could anybody explain why the planner is doing what it is doing?

What could I do to make it easier to choose a better plan?



*********
Summary
*********
On a freshly vacuum/analysed pair of tables with 7389 and 64333 records, this:

select id from location where id not in (select location_id from
location_carrier);

takes 581546,497 ms


While a variant like:

select id from location where not exists (select 1 from location_carrier where
location_id = location.id);

takes only 124,625 ms


*********
Details
*********
=# select version();
                               version
---------------------------------------------------------------------
 PostgreSQL 7.4.2 on i686-pc-linux-gnu, compiled by GCC egcs-2.91.66
(1 row)


=# \d location
               Table "public.location"
   Column   |            Type             | Modifiers
------------+-----------------------------+-----------
 id         | integer                     | not null
Indexes:
    "location_pkey" primary key, btree (id)


=# select count(*) from location;
 count
-------
  7389
(1 row)


=# \d location_carrier
                Table "public.location_carrier"
       Column        |            Type             | Modifiers
---------------------+-----------------------------+-----------
 location_id         | integer                     | not null
 carrier_id          | integer                     | not null
Indexes:
    "location_carrier_pkey" primary key, btree (location_id, carrier_id)


=# select count(*) from location_carrier;
 count
-------
 64333
(1 row)


=# explain select id from location where id not in (select location_id from
location_carrier);
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Seq Scan on "location"  (cost=0.00..5077093.72 rows=3695 width=4)
   Filter: (NOT (subplan))
   SubPlan
     ->  Seq Scan on location_carrier  (cost=0.00..1213.33 rows=64333 width=4)
(4 rows)


=# explain analyse select id from location where id not in (select location_id
from location_carrier);
                                                           QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on "location"  (cost=0.00..5077093.72 rows=3695 width=4) (actual
time=248.310..581541.483 rows=240 loops=1)
   Filter: (NOT (subplan))
   SubPlan
     ->  Seq Scan on location_carrier  (cost=0.00..1213.33 rows=64333 width=4)
(actual time=0.007..48.517 rows=19364 loops=7389)
 Total runtime: 581542.560 ms
(5 rows)

Time: 581546,497 ms


=# explain analyse select id from location l left outer join location_carrier
lc on l.id = lc.location_id where lc.location_id is null;
                                                                          QUERY
PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Left Join  (cost=0.00..3022.51 rows=7389 width=4) (actual
time=0.083..435.841 rows=240 loops=1)
   Merge Cond: ("outer".id = "inner".location_id)
   Filter: ("inner".location_id IS NULL)
   ->  Index Scan using location_pkey on "location" l  (cost=0.00..258.85
rows=7389 width=4) (actual time=0.041..26.211 rows=7389 loops=1)
   ->  Index Scan using location_carrier_pkey on location_carrier lc
(cost=0.00..1941.22 rows=64333 width=4) (actual time=0.015..238.305
rows=64333 loops=1)
 Total runtime: 436.213 ms
(6 rows)

Time: 440,787 ms


megafox=# explain analyse select id from location where not exists (select 1
from location_carrier where location_id = location.id);
                                                                     QUERY
PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on "location"  (cost=0.00..13242.14 rows=3695 width=4) (actual
time=0.078..120.785 rows=240 loops=1)
   Filter: (NOT (subplan))
   SubPlan
     ->  Index Scan using location_carrier_pkey on location_carrier
(cost=0.00..17.61 rows=10 width=0) (actual time=0.011..0.011 rows=1
loops=7389)
           Index Cond: (location_id = $0)
 Total runtime: 121.165 ms
(6 rows)

Time: 124,625 ms







--
Best,




Frank.


Re: *very* inefficient choice made by the planner (regarding

From
Stephan Szabo
Date:
On Thu, 10 Jun 2004, Frank van Vugt wrote:

> Could anybody explain why the planner is doing what it is doing?
>
> What could I do to make it easier to choose a better plan?

You might try raising sort_mem to see if it chooses a better plan.  I
think it may be guessing that the hash won't fit and falling back to the
plan you were getting.


Re: *very* inefficient choice made by the planner (regarding IN(...))

From
Tom Lane
Date:
Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
> What could I do to make it easier to choose a better plan?

Increase sort_mem.  You want it to pick a "hashed subplan", but
it's not doing so because 64000 rows won't fit in the default
sort_mem.

            regards, tom lane

Re: *very* inefficient choice made by the planner (regarding IN(...))

From
Frank van Vugt
Date:
Wow,

The effectiveness of the pgsql mailinglists never ceases to amaze me.

Default sort mem it was, I guess I'd simply been to cautious with this
per-client setting.




Stephan & Tom : thanks!




--
Best,




Frank.


Re: *very* inefficient choice made by the planner (regarding

From
Jean-Luc Lachance
Date:
The real question is:

If the two statments are functionally equivalent, why can't PG rewrite
the "NOT IN" version into the more efficient "NOT EXISTS"?



Frank van Vugt wrote:

> L.S.
>
> Could anybody explain why the planner is doing what it is doing?
>
> What could I do to make it easier to choose a better plan?
>
>
>
> *********
> Summary
> *********
> On a freshly vacuum/analysed pair of tables with 7389 and 64333 records, this:
>
> select id from location where id not in (select location_id from
> location_carrier);
>
> takes 581546,497 ms
>
>
> While a variant like:
>
> select id from location where not exists (select 1 from location_carrier where
> location_id = location.id);
>
> takes only 124,625 ms
>
>
> *********
> Details
> *********
> =# select version();
>                                version
> ---------------------------------------------------------------------
>  PostgreSQL 7.4.2 on i686-pc-linux-gnu, compiled by GCC egcs-2.91.66
> (1 row)
>
>
> =# \d location
>                Table "public.location"
>    Column   |            Type             | Modifiers
> ------------+-----------------------------+-----------
>  id         | integer                     | not null
> Indexes:
>     "location_pkey" primary key, btree (id)
>
>
> =# select count(*) from location;
>  count
> -------
>   7389
> (1 row)
>
>
> =# \d location_carrier
>                 Table "public.location_carrier"
>        Column        |            Type             | Modifiers
> ---------------------+-----------------------------+-----------
>  location_id         | integer                     | not null
>  carrier_id          | integer                     | not null
> Indexes:
>     "location_carrier_pkey" primary key, btree (location_id, carrier_id)
>
>
> =# select count(*) from location_carrier;
>  count
> -------
>  64333
> (1 row)
>
>
> =# explain select id from location where id not in (select location_id from
> location_carrier);
>                                   QUERY PLAN
> -------------------------------------------------------------------------------
>  Seq Scan on "location"  (cost=0.00..5077093.72 rows=3695 width=4)
>    Filter: (NOT (subplan))
>    SubPlan
>      ->  Seq Scan on location_carrier  (cost=0.00..1213.33 rows=64333 width=4)
> (4 rows)
>
>
> =# explain analyse select id from location where id not in (select location_id
> from location_carrier);
>                                                            QUERY PLAN
>
---------------------------------------------------------------------------------------------------------------------------------
>  Seq Scan on "location"  (cost=0.00..5077093.72 rows=3695 width=4) (actual
> time=248.310..581541.483 rows=240 loops=1)
>    Filter: (NOT (subplan))
>    SubPlan
>      ->  Seq Scan on location_carrier  (cost=0.00..1213.33 rows=64333 width=4)
> (actual time=0.007..48.517 rows=19364 loops=7389)
>  Total runtime: 581542.560 ms
> (5 rows)
>
> Time: 581546,497 ms
>
>
> =# explain analyse select id from location l left outer join location_carrier
> lc on l.id = lc.location_id where lc.location_id is null;
>                                                                           QUERY
> PLAN
>
--------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Merge Left Join  (cost=0.00..3022.51 rows=7389 width=4) (actual
> time=0.083..435.841 rows=240 loops=1)
>    Merge Cond: ("outer".id = "inner".location_id)
>    Filter: ("inner".location_id IS NULL)
>    ->  Index Scan using location_pkey on "location" l  (cost=0.00..258.85
> rows=7389 width=4) (actual time=0.041..26.211 rows=7389 loops=1)
>    ->  Index Scan using location_carrier_pkey on location_carrier lc
> (cost=0.00..1941.22 rows=64333 width=4) (actual time=0.015..238.305
> rows=64333 loops=1)
>  Total runtime: 436.213 ms
> (6 rows)
>
> Time: 440,787 ms
>
>
> megafox=# explain analyse select id from location where not exists (select 1
> from location_carrier where location_id = location.id);
>                                                                      QUERY
> PLAN
>
-----------------------------------------------------------------------------------------------------------------------------------------------------
>  Seq Scan on "location"  (cost=0.00..13242.14 rows=3695 width=4) (actual
> time=0.078..120.785 rows=240 loops=1)
>    Filter: (NOT (subplan))
>    SubPlan
>      ->  Index Scan using location_carrier_pkey on location_carrier
> (cost=0.00..17.61 rows=10 width=0) (actual time=0.011..0.011 rows=1
> loops=7389)
>            Index Cond: (location_id = $0)
>  Total runtime: 121.165 ms
> (6 rows)
>
> Time: 124,625 ms
>
>
>
>
>
>
>


Re: *very* inefficient choice made by the planner (regarding

From
Tom Lane
Date:
Jean-Luc Lachance <jllachan@sympatico.ca> writes:
> If the two statments are functionally equivalent, why can't PG rewrite
> the "NOT IN" version into the more efficient "NOT EXISTS"?

They're not equivalent.  In particular, the behavior in the presence of
NULLs is quite different.

            regards, tom lane

Re: *very* inefficient choice made by the planner (regarding

From
Jean-Luc Lachance
Date:
I agree, but it should be a simple rewrite. No?

x IS NULL/IS NOT NULL AND/OR NOT EXISTS


Tom Lane wrote:

> Jean-Luc Lachance <jllachan@sympatico.ca> writes:
>
>>If the two statments are functionally equivalent, why can't PG rewrite
>>the "NOT IN" version into the more efficient "NOT EXISTS"?
>
>
> They're not equivalent.  In particular, the behavior in the presence of
> NULLs is quite different.
>
>             regards, tom lane
>


Re: *very* inefficient choice made by the planner (regarding

From
Stephan Szabo
Date:
On Thu, 10 Jun 2004, Jean-Luc Lachance wrote:

> I agree, but it should be a simple rewrite. No?

It's NULLs inside the subselect that are the issue.

select 1 in (select a from foo)
select exists ( select 1 from foo where a=1)

If foo.a contains a row with NULL but no rows containing a 1, the above
give different results (unknown and exists) and IIRC, exists cannot
return unknown, so there's no simple rewrite of the subselect that gives
equivalent behavior.


Re: *very* inefficient choice made by the planner (regarding

From
Stephan Szabo
Date:
On Thu, 10 Jun 2004, Stephan Szabo wrote:

>
> On Thu, 10 Jun 2004, Jean-Luc Lachance wrote:
>
> > I agree, but it should be a simple rewrite. No?
>
> It's NULLs inside the subselect that are the issue.
>
> select 1 in (select a from foo)
> select exists ( select 1 from foo where a=1)
>
> If foo.a contains a row with NULL but no rows containing a 1, the above
> give different results (unknown and exists) and IIRC, exists cannot

Erm that exists should have been false

Re: *very* inefficient choice made by the planner (regarding

From
SZUCS Gábor
Date:
Dear Gurus,

----- Original Message -----
From: "Stephan Szabo" <sszabo@megazone.bigpanda.com>
Sent: Thursday, June 10, 2004 7:14 PM


>
> On Thu, 10 Jun 2004, Stephan Szabo wrote:
>
> >
> > On Thu, 10 Jun 2004, Jean-Luc Lachance wrote:
> >
> > > I agree, but it should be a simple rewrite. No?
> >
> > It's NULLs inside the subselect that are the issue.
> >
> > select 1 in (select a from foo)
> > select exists ( select 1 from foo where a=1)

Just a dumb try :)

  SELECT (exists(select 1 from foo where a isnull) AND NULL)
       OR exists(select 1 from foo where a=1)

AFAIK this returns
* NULL if (NULL in foo.a) and (1 not in foo.a)
* (1 in foo.a) otherwise.

The weakness is the doubled exists clause. I'm sure it makes most cases at
least doubtful...

G.
%----------------------- cut here -----------------------%
\end


Re: *very* inefficient choice made by the planner (regarding

From
Tom Lane
Date:
=?iso-8859-1?Q?SZUCS_G=E1bor?= <surrano@mailbox.hu> writes:
>> It's NULLs inside the subselect that are the issue.
>
> select 1 in (select a from foo)
> select exists ( select 1 from foo where a=1)

> Just a dumb try :)

>   SELECT (exists(select 1 from foo where a isnull) AND NULL)
>        OR exists(select 1 from foo where a=1)

The more general case is where you have a variable (or expression)
on the left of IN.  That could be NULL too, and this still doesn't
give the right result in that case :-(.  With NULL on the left the
correct answer would be FALSE if the subselect has zero rows,
NULL otherwise.

            regards, tom lane

Re: *very* inefficient choice made by the planner (regarding

From
Stephan Szabo
Date:
On Fri, 18 Jun 2004, [iso-8859-1] SZUCS Gábor wrote:

> Dear Gurus,
>
> ----- Original Message -----
> From: "Stephan Szabo" <sszabo@megazone.bigpanda.com>
> Sent: Thursday, June 10, 2004 7:14 PM
>
>
> >
> > On Thu, 10 Jun 2004, Stephan Szabo wrote:
> >
> > >
> > > On Thu, 10 Jun 2004, Jean-Luc Lachance wrote:
> > >
> > > > I agree, but it should be a simple rewrite. No?
> > >
> > > It's NULLs inside the subselect that are the issue.
> > >
> > > select 1 in (select a from foo)
> > > select exists ( select 1 from foo where a=1)
>
> Just a dumb try :)
>
>   SELECT (exists(select 1 from foo where a isnull) AND NULL)
>        OR exists(select 1 from foo where a=1)
>
> AFAIK this returns
> * NULL if (NULL in foo.a) and (1 not in foo.a)
> * (1 in foo.a) otherwise.
>
> The weakness is the doubled exists clause. I'm sure it makes most cases at
> least doubtful...

Well, once you take into account the lhs being potentially null
 lhe in (select rhe from foo) is something like:

case when lhe is null then
 not exists(select 1 from foo limit 1) or null
else
 (exists(select 1 from foo where rhe is null) and null)
 or exists(select 1 from foo where rhe=lhe)
end

I think the real win occurs for where clause cases if it can pull up the
exists that references lhe so that it doesn't try to evaluate it on every
row and that's unlikely to occur in something like the above.