Thread: Curious why planner can't handle NOT IN
I have an `account` table with 5.3M rows, with primary key `id` of type `text` (and 600+ columns if that matters). I'm trying to create a `newaccount` table with the same schema but 600k newly imported rows, then insert all the old rows for which `id` isn't already in the set of newly updated rows. => create table newaccount as select * from account limit 0; => \copy newaccount from stdin with (format csv) [...copy 600k rows in...] => analyze newaccount; The most obvious query doesn't work so hot due to repeated execution of the subplan: => explain insert into newaccount select * from account where id not in (select id from account); QUERY PLAN ----------------------------------------------------------------------------------------- Insert on newaccount (cost=0.00..458800400224.89 rows=2985976 width=5366) -> Seq Scan on account (cost=0.00..458800400224.89 rows=2985976 width=5366) Filter: (NOT (SubPlan 1)) SubPlan 1 -> Materialize (cost=0.00..153276.00 rows=150000 width=32) -> Seq Scan on newaccount (cost=0.00..151500.00 rows=150000 width=32) (6 rows) This works fine, though: => explain insert into newaccount select * from account where id in (select id from account except select id from newaccount); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------ Insert on newaccount (cost=3749026.24..14034547.77 rows=2985976 width=5366) -> Hash Join (cost=3749026.24..14034547.77 rows=2985976 width=5366) Hash Cond: (account.id = "ANY_subquery".id) -> Seq Scan on account (cost=0.00..2263744.51 rows=5971951 width=5366) -> Hash (cost=3633552.85..3633552.85 rows=5971951 width=32) -> Subquery Scan on "ANY_subquery" (cost=3543223.59..3633552.85 rows=5971951 width=32) -> SetOp Except (cost=3543223.59..3573833.34 rows=5971951 width=19) -> Sort (cost=3543223.59..3558528.47 rows=6121951 width=19) Sort Key: "*SELECT* 1".id -> Append (cost=0.00..2476464.02 rows=6121951 width=19) -> Subquery Scan on "*SELECT* 1" (cost=0.00..2323464.02 rows=5971951 width=19) -> Seq Scan on account (cost=0.00..2263744.51 rows=5971951 width=19) -> Subquery Scan on "*SELECT* 2" (cost=0.00..153000.00 rows=150000 width=32) -> Seq Scan on newaccount (cost=0.00..151500.00 rows=150000 width=32) (14 rows) This is all in PG 9.1. This isn't a big deal as there's a straightforward workaround, but I am curious what happened here. Googling turns up various mentions of "NOT IN" with poor plans that involve subplans. Then again I have read mention of hash anti-join which seems appropriate here(?), but that wasn't used here (nor was the latter join used, though for whatever reason it looks like a more complex/deeper plan tree than I had expected, so maybe it was out of the plan generator's reach?). E.g. the following mentions cranking up work_mem, but I probably can't crank up work_mem to meet the requirements of this example, and even if it doesn't fit in memory, it'd be nice for the planner to not degenerate to a pathological plan and still execute this join efficiently while spilling to and from disk. http://stackoverflow.com/questions/7125291/postgresql-not-in-versus-except-performance-difference-edited-2 Thanks!
Yang Zhang <yanghatespam@gmail.com> writes: > The most obvious query doesn't work so hot due to repeated execution > of the subplan: > => explain insert into newaccount select * from account where id not > in (select id from account); Yeah. Try using a NOT EXISTS instead. > This isn't a big deal as there's a straightforward workaround, but I > am curious what happened here. Googling turns up various mentions of > "NOT IN" with poor plans that involve subplans. Then again I have > read mention of hash anti-join which seems appropriate here(?), but > that wasn't used here The trouble with NOT IN is that it's not exactly the same as an antijoin, because of the spec-mandated bizarre behavior for NULLs. It's very difficult to optimize it to any extent without producing wrong answers. NOT EXISTS avoids that problem. regards, tom lane
On May 3, 2013 12:15 PM, "Yang Zhang" <yanghatespam@gmail.com> wrote:
I have an `account` table with 5.3M rows, with primary key `id` of
type `text` (and 600+ columns if that matters).
I'm trying to create a `newaccount` table with the same schema but
600k newly imported rows, then insert all the old rows for which `id`
isn't already in the set of newly updated rows.
=> create table newaccount as select * from account limit 0;
=> \copy newaccount from stdin with (format csv)
[...copy 600k rows in...]
=> analyze newaccount;
The most obvious query doesn't work so hot due to repeated execution
of the subplan:
=> explain insert into newaccount select * from account where id not
in (select id from account);
QUERY PLAN
-----------------------------------------------------------------------------------------
Insert on newaccount (cost=0.00..458800400224.89 rows=2985976 width=5366)
-> Seq Scan on account (cost=0.00..458800400224.89 rows=2985976 width=5366)
Filter: (NOT (SubPlan 1))
SubPlan 1
-> Materialize (cost=0.00..153276.00 rows=150000 width=32)
-> Seq Scan on newaccount (cost=0.00..151500.00
rows=150000 width=32)
(6 rows)
This works fine, though:
=> explain insert into newaccount select * from account where id in
(select id from account except select id from newaccount);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Insert on newaccount (cost=3749026.24..14034547.77 rows=2985976 width=5366)
-> Hash Join (cost=3749026.24..14034547.77 rows=2985976 width=5366)
Hash Cond: (account.id = "ANY_subquery".id)
-> Seq Scan on account (cost=0.00..2263744.51 rows=5971951
width=5366)
-> Hash (cost=3633552.85..3633552.85 rows=5971951 width=32)
-> Subquery Scan on "ANY_subquery"
(cost=3543223.59..3633552.85 rows=5971951 width=32)
-> SetOp Except (cost=3543223.59..3573833.34
rows=5971951 width=19)
-> Sort (cost=3543223.59..3558528.47
rows=6121951 width=19)
Sort Key: "*SELECT* 1".id
-> Append (cost=0.00..2476464.02
rows=6121951 width=19)
-> Subquery Scan on "*SELECT*
1" (cost=0.00..2323464.02 rows=5971951 width=19)
-> Seq Scan on account
(cost=0.00..2263744.51 rows=5971951 width=19)
-> Subquery Scan on "*SELECT*
2" (cost=0.00..153000.00 rows=150000 width=32)
-> Seq Scan on
newaccount (cost=0.00..151500.00 rows=150000 width=32)
(14 rows)
This is all in PG 9.1.
This isn't a big deal as there's a straightforward workaround, but I
am curious what happened here. Googling turns up various mentions of
"NOT IN" with poor plans that involve subplans. Then again I have
read mention of hash anti-join which seems appropriate here(?), but
that wasn't used here (nor was the latter join used, though for
whatever reason it looks like a more complex/deeper plan tree than I
had expected, so maybe it was out of the plan generator's reach?).
E.g. the following mentions cranking up work_mem, but I probably can't
crank up work_mem to meet the requirements of this example, and even
if it doesn't fit in memory, it'd be nice for the planner to not
degenerate to a pathological plan and still execute this join
efficiently while spilling to and from disk.
http://stackoverflow.com/questions/7125291/postgresql-not-in-versus-except-performance-difference-edited-2
Thanks!
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general