Performance of subselects - Mailing list pgsql-general
From | Christian Schröder |
---|---|
Subject | Performance of subselects |
Date | |
Msg-id | 49B0C1D1.8020207@deriva.de Whole thread Raw |
Responses |
Re: Performance of subselects
Re: Performance of subselects Re: Performance of subselects |
List | pgsql-general |
Hi list, if I want to find all records from a table that don't have a matching record in another table there are at least two ways to do it: Using a left outer join or using a subselect. I always thought that the planner would create identical plans for both approaches, but actually they are quite different which leads to a bad performance in one case. I tried the following test case: chschroe=# create table a (id integer primary key); NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "a_pkey" for table "a" CREATE TABLE chschroe=# create table b (id serial not null, fk integer not null, primary key (id, fk)); NOTICE: CREATE TABLE will create implicit sequence "b_id_seq" for serial column "b.id" NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "b_pkey" for table "b" CREATE TABLE chschroe=# insert into a select generate_series(1, 500000); INSERT 0 500000 chschroe=# insert into b(fk) select generate_series(1, 450000); INSERT 0 450000 chschroe=# analyze a; ANALYZE chschroe=# analyze b; ANALYZE chschroe=# explain analyze select * from b where fk not in (select id from a); QUERY PLAN --------------------------------------------------------------------------------------------------------------------------- Seq Scan on b (cost=10645.00..1955718703.00 rows=225000 width=8) (actual time=65378590.167..65378590.167 rows=0 loops=1) Filter: (NOT (subplan)) SubPlan -> Materialize (cost=10645.00..18087.00 rows=500000 width=4) (actual time=0.008..72.326 rows=225000 loops=450000) -> Seq Scan on a (cost=0.00..7703.00 rows=500000 width=4) (actual time=0.008..894.163 rows=450000 loops=1) Total runtime: 65378595.489 ms (6 rows) chschroe=# explain analyze select b.* from b left outer join a on b.fk = a.id where a.id is null; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------- Hash Left Join (cost=16395.00..38041.00 rows=225000 width=8) (actual time=1040.840..1040.840 rows=0 loops=1) Hash Cond: (b.fk = a.id) Filter: (a.id IS NULL) -> Seq Scan on b (cost=0.00..6933.00 rows=450000 width=8) (actual time=0.010..149.508 rows=450000 loops=1) -> Hash (cost=7703.00..7703.00 rows=500000 width=4) (actual time=408.126..408.126 rows=500000 loops=1) -> Seq Scan on a (cost=0.00..7703.00 rows=500000 width=4) (actual time=0.007..166.168 rows=500000 loops=1) Total runtime: 1041.945 ms (7 rows) Is there any difference between the two approaches that explain why the plans are so different? There would be a difference if the subselect could generate null values, but since the id field is a primary key field, it should be implicitly declared not null. Another interesting thing: If table "a" contains only 400,000 rows (instead of 500,000) the query planner decides to use a hashed subplan and performance is fine again: chschroe=# explain analyze select * from b where fk not in (select id from a); QUERY PLAN ------------------------------------------------------------------------------------------------------------------ Seq Scan on b (cost=7163.00..15221.00 rows=225000 width=8) (actual time=472.969..497.096 rows=50000 loops=1) Filter: (NOT (hashed subplan)) SubPlan -> Seq Scan on a (cost=0.00..6163.00 rows=400000 width=4) (actual time=0.010..124.503 rows=400000 loops=1) Total runtime: 509.632 ms (5 rows) Why this different plan? All tests have been performed on a PostgreSQL 8.2.9 server: chschroe=# select version(); version ------------------------------------------------------------------------------------------------------------------ PostgreSQL 8.2.9 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20061115 (prerelease) (SUSE Linux) (1 row) Regards, Christian -- Deriva GmbH Tel.: +49 551 489500-42 Financial IT and Consulting Fax: +49 551 489500-91 Hans-Böckler-Straße 2 http://www.deriva.de D-37079 Göttingen Deriva CA Certificate: http://www.deriva.de/deriva-ca.cer
pgsql-general by date: