Optimizing NOT IN plans / verify rewrite - Mailing list pgsql-performance

From Maciek Sakrejda
Subject Optimizing NOT IN plans / verify rewrite
Date
Msg-id AANLkTim0entqazUpWhwotC+gcDGrmeM=a=WPJUv=MgdO@mail.gmail.com
Whole thread Raw
Responses Re: Optimizing NOT IN plans / verify rewrite  ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
Re: Optimizing NOT IN plans / verify rewrite  (Andres Freund <andres@anarazel.de>)
List pgsql-performance
Hi,

I'm running PostgreSQL 8.3 and I have a query with a couple of NOT IN
subqueries:

DELETE FROM foo WHERE type = 'o' AND b NOT IN (SELECT cqc.b FROM bar
cqc) AND b NOT IN (SELECT car.b FROM foo car WHERE car.type != 'o');

The plan produced for this is:

                                                         QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------
 Index Scan using foo_type_index on foo  (cost=17851.93..1271633830.75
rows=66410 width=6)
   Index Cond: (type = 'o'::bpchar)
   Filter: ((NOT (subplan)) AND (NOT (subplan)))
   SubPlan
     ->  Materialize  (cost=6077.87..10238.57 rows=299170 width=8)
           ->  Seq Scan on bar cqc  (cost=0.00..4609.70 rows=299170 width=8)
     ->  Materialize  (cost=11774.06..15728.45 rows=284339 width=8)
           ->  Seq Scan on foo car  (cost=0.00..10378.73 rows=284339 width=8)
                 Filter: (type <> 'o'::bpchar)
(9 rows)


Unfortunately, when these tables get large-ish, the materilzations get
really expensive to re-scan for every tuple (see cost above). At the
moment, I have ~500k rows in foo and ~300k rows in bar. The
selectivity of type = 'o' is ~50%. I've tried to re-write the query as
follows:

DELETE FROM foo WHERE b IN (SELECT candidate_run.type_o_run as b FROM
(SELECT cqar1.b AS type_o_run, cqar2.b AS non_type_o_run FROM foo
cqar1 LEFT OUTER JOIN foo cqar2 ON (cqar1.b = cqar2.b AND cqar2.type
!= 'o') WHERE cqar1.type = 'o') candidate_run LEFT OUTER JOIN bar ON
(candidate_run.type_o_run = bar.b) WHERE non_type_o_run IS NULL AND
bar.b IS NULL);

This gives the more sensible plan:

                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Hash IN Join  (cost=48999.81..71174.41 rows=66410 width=6)
   Hash Cond: (foo.b = cqar1.b)
   ->  Seq Scan on foo  (cost=0.00..9003.78 rows=549978 width=14)
   ->  Hash  (cost=47909.68..47909.68 rows=66410 width=8)
         ->  Hash Left Join  (cost=24562.29..47909.68 rows=66410 width=8)
               Hash Cond: (cqar1.b = bar.b)
               Filter: (bar.b IS NULL)
               ->  Hash Left Join  (cost=15043.96..33635.58 rows=132820 width=8)
                     Hash Cond: (cqar1.b = cqar2.b)
                     Filter: (cqar2.b IS NULL)
                     ->  Seq Scan on foo cqar1  (cost=0.00..10378.73
rows=265639 width=8)
                           Filter: (type = 'o'::bpchar)
                     ->  Hash  (cost=10378.73..10378.73 rows=284339 width=8)
                           ->  Seq Scan on foo cqar2
(cost=0.00..10378.73 rows=284339 width=8)
                                 Filter: (type <> 'o'::bpchar)
               ->  Hash  (cost=4609.70..4609.70 rows=299170 width=8)
                     ->  Seq Scan on bar  (cost=0.00..4609.70
rows=299170 width=8)
(17 rows)


As far as I can tell, the results are identical.

My questions

1. Is my rewrite valid?
2. Any way to reliably achieve the second plan (or really, any plan
that doesn't rescan ~~500k tuples per each of ~250k tuples) by
tweaking (per-session) planner constants? I've played with this a
little, but without much success. As with any rewrite situation, I'd
prefer to stick with the simpler, more explicit original query.

Here is a SQL script to reproduce the problem:

\set ON_ERROR_STOP

drop schema if exists not_in_test cascade;
create schema not_in_test;

set search_path to not_in_test;

create table foo (
       a oid not null,
       b bigint not null,
       type char not null,
       ts timestamp without time zone not null
);
create index "foo_b_a_type_index" on foo (b, a, type);
create index "foo_a_index" on foo (a);
create index "foo_type_index" on foo(type);

create table bar (
       b bigint unique not null,
       c timestamp with time zone not null
);
create index "bar_b_index" on bar(b);

insert into foo select (random()*10)::integer,
generate_series(1,550000), case when random() > 0.5 then 'o' else 'x'
end, now();
insert into bar select val, now() from generate_series(1,1200000) as
vals(val) where random() > 0.75;

analyze foo;
analyze bar;

EXPLAIN DELETE FROM foo WHERE type = 'o' AND b NOT IN (SELECT cqc.b
FROM bar cqc) AND b NOT IN (SELECT car.b FROM foo car WHERE car.type
!= 'o');
EXPLAIN DELETE FROM foo WHERE b IN (SELECT candidate_run.type_o_run as
b FROM (SELECT cqar1.b AS type_o_run, cqar2.b AS non_type_o_run FROM
foo cqar1 LEFT OUTER JOIN foo cqar2 ON (cqar1.b = cqar2.b AND
cqar2.type != 'o') WHERE cqar1.type = 'o') candidate_run LEFT OUTER
JOIN bar ON (candidate_run.type_o_run = bar.b) WHERE non_type_o_run IS
NULL AND bar.b IS NULL);


Thanks,
---
Maciek Sakrejda | System Architect | Truviso

1065 E. Hillsdale Blvd., Suite 215
Foster City, CA 94404
(650) 242-3500 Main
www.truviso.com

pgsql-performance by date:

Previous
From: Tom Lane
Date:
Subject: Re: what does "initplan" operation in explain output mean?
Next
From: "Kevin Grittner"
Date:
Subject: Re: Optimizing NOT IN plans / verify rewrite