Performance of subselects - Mailing list pgsql-general

From Christian Schröder
Subject Performance of subselects
Date
Msg-id 49B0C1D1.8020207@deriva.de
Whole thread Raw
Responses Re: Performance of subselects  (Scott Marlowe <scott.marlowe@gmail.com>)
Re: Performance of subselects  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Performance of subselects  (Thom Brown <thombrown@gmail.com>)
List pgsql-general
Hi list,
if I want to find all records from a table that don't have a matching
record in another table there are at least two ways to do it: Using a
left outer join or using a subselect. I always thought that the planner
would create identical plans for both approaches, but actually they are
quite different which leads to a bad performance in one case.
I tried the following test case:

chschroe=# create table a (id integer primary key);
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "a_pkey"
for table "a"
CREATE TABLE
chschroe=# create table b (id serial not null, fk integer not null,
primary key (id, fk));
NOTICE:  CREATE TABLE will create implicit sequence "b_id_seq" for
serial column "b.id"
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "b_pkey"
for table "b"
CREATE TABLE
chschroe=# insert into a select generate_series(1, 500000);
INSERT 0 500000
chschroe=# insert into b(fk) select generate_series(1, 450000);
INSERT 0 450000
chschroe=# analyze a;
ANALYZE
chschroe=# analyze b;
ANALYZE
chschroe=# explain analyze select * from b where fk not in (select id
from a);
                                                        QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------
 Seq Scan on b  (cost=10645.00..1955718703.00 rows=225000 width=8)
(actual time=65378590.167..65378590.167 rows=0 loops=1)
   Filter: (NOT (subplan))
   SubPlan
     ->  Materialize  (cost=10645.00..18087.00 rows=500000 width=4)
(actual time=0.008..72.326 rows=225000 loops=450000)
           ->  Seq Scan on a  (cost=0.00..7703.00 rows=500000 width=4)
(actual time=0.008..894.163 rows=450000 loops=1)
 Total runtime: 65378595.489 ms
(6 rows)
chschroe=# explain analyze select b.* from b left outer join a on b.fk =
a.id where a.id is null;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=16395.00..38041.00 rows=225000 width=8) (actual
time=1040.840..1040.840 rows=0 loops=1)
   Hash Cond: (b.fk = a.id)
   Filter: (a.id IS NULL)
   ->  Seq Scan on b  (cost=0.00..6933.00 rows=450000 width=8) (actual
time=0.010..149.508 rows=450000 loops=1)
   ->  Hash  (cost=7703.00..7703.00 rows=500000 width=4) (actual
time=408.126..408.126 rows=500000 loops=1)
         ->  Seq Scan on a  (cost=0.00..7703.00 rows=500000 width=4)
(actual time=0.007..166.168 rows=500000 loops=1)
 Total runtime: 1041.945 ms
(7 rows)

Is there any difference between the two approaches that explain why the
plans are so different? There would be a difference if the subselect
could generate null values, but since the id field is a primary key
field, it should be implicitly declared not null.

Another interesting thing: If table "a" contains only 400,000 rows
(instead of 500,000) the query planner decides to use a hashed subplan
and performance is fine again:

chschroe=# explain analyze select * from b where fk not in (select id
from a);
                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Seq Scan on b  (cost=7163.00..15221.00 rows=225000 width=8) (actual
time=472.969..497.096 rows=50000 loops=1)
   Filter: (NOT (hashed subplan))
   SubPlan
     ->  Seq Scan on a  (cost=0.00..6163.00 rows=400000 width=4) (actual
time=0.010..124.503 rows=400000 loops=1)
 Total runtime: 509.632 ms
(5 rows)

Why this different plan?

All tests have been performed on a PostgreSQL 8.2.9 server:
chschroe=# select version();
                                                     version
------------------------------------------------------------------------------------------------------------------
 PostgreSQL 8.2.9 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC)
4.1.2 20061115 (prerelease) (SUSE Linux)
(1 row)

Regards,
    Christian

--
Deriva GmbH                         Tel.: +49 551 489500-42
Financial IT and Consulting         Fax:  +49 551 489500-91
Hans-Böckler-Straße 2                  http://www.deriva.de
D-37079 Göttingen

Deriva CA Certificate: http://www.deriva.de/deriva-ca.cer



pgsql-general by date:

Previous
From: Scott Marlowe
Date:
Subject: Re: converting older databases
Next
From: Ivan Sergio Borgonovo
Date:
Subject: merging 2 databases