Terrible perfomance during nested "... where x in (select ...)" operator - Mailing list pgsql-bugs

From Черепанов Леонид
Subject Terrible perfomance during nested "... where x in (select ...)" operator
Date
Msg-id D078291436B0D411B08800805F6D1315073F13@fantom.tcnet.ru
Whole thread Raw
Responses Re: Terrible perfomance during nested "... where x in (select ...)" operator
List pgsql-bugs
Leonid (leo@rusmedia.net) reports a bug with a severity of 2(?)

Short Description
Terrible perfomance during nested "... where x in (select ...)" operator

Long Description
PostgreSQL 7.1, FreeBSD 4.2-STABLE

Analyzing the reasons for terrible perfomance of my query I've found
a very strange thing. Here is sublimation.

Queries likeselect distinct i from t1 where i in (select distinct i from t2 where j in  (select distinct j from t3));
is much-much slower thanselect distinct t1.i from t1,t2,t3 where t1.i=t2.i and t2.j=t3.j;

EXPLAIN results.
The 1st one:
NOTICE:  QUERY PLAN:
Unique  (cost=0.00..62059410444.36 rows=15000 width=4) ->  Index Scan using t1_i on t1  (cost=0.00..62059410069.36
rows=150000
width=4)       SubPlan         ->  Materialize  (cost=413729.34..413729.34 rows=400 width=4)               ->  Unique
(cost=0.00..413729.34rows=400 width=4)                     ->  Index Scan using t2_i on t2  (cost=0.00..413719.34 
rows=4000 width=4)                           SubPlan                             ->  Materialize  (cost=103.38..103.38
rows=225
width=4)                                   ->  Unique  (cost=0.00..103.38 rows=225
width=4)                                         ->  Index Scan using t3_j on t3
(cost=0.00..97.76 rows=2250 wid
th=4)
EXPLAIN

And the second:
NOTICE:  QUERY PLAN:
Unique  (cost=362.62..10472.26 rows=2768 width=16) ->  Merge Join  (cost=362.62..10403.06 rows=27681 width=16)       ->
Index Scan using t1_i on t1  (cost=0.00..8145.36 rows=150000 
width=4)       ->  Sort  (cost=362.62..362.62 rows=1606 width=12)             ->  Hash Join  (cost=102.63..277.13
rows=1606width=12)                   ->  Seq Scan on t3  (cost=0.00..34.50 rows=2250 width=4)                   ->
Hash (cost=62.00..62.00 rows=4000 width=8)                         ->  Seq Scan on t2  (cost=0.00..62.00 rows=4000 
width=8)
EXPLAIN

*** IMPORTANT !!! ***
VACUUM and VACUUM ANALIZY make things even worse;
NOTICE:  QUERY PLAN:
Unique  (cost=0.00..186176903970.36 rows=15000 width=4) ->  Index Scan using t1_i on t1  (cost=0.00..186176903595.36
rows=150000
width=4)       SubPlan         ->  Materialize  (cost=1241179.30..1241179.30 rows=1200 width=4)               ->
Unique (cost=0.00..1241179.30 rows=1200 width=4)                     ->  Index Scan using t2_i on t2 
(cost=0.00..1241149.30 rows=12000 width=4)                           SubPlan                             ->
Materialize (cost=103.38..103.38 rows=225 
width=4)                                   ->  Unique  (cost=0.00..103.38 rows=225
width=4)                                         ->  Index Scan using t3_j on t3
(cost=0.00..97.76 rows=2250 wid
th=4)
EXPLAIN

NOTICE:  QUERY PLAN:
Unique  (cost=1027.91..11289.49 rows=7240 width=16) ->  Merge Join  (cost=1027.91..11108.49 rows=72401 width=16)
-> Index Scan using t1_i on t1  (cost=0.00..8145.36 rows=150000 
width=4)       ->  Sort  (cost=1027.91..1027.91 rows=4817 width=12)             ->  Hash Join  (cost=40.12..733.25
rows=4817width=12)                   ->  Seq Scan on t2  (cost=0.00..185.00 rows=12000 
width=8)                   ->  Hash  (cost=34.50..34.50 rows=2250 width=4)                         ->  Seq Scan on t3
(cost=0.00..34.50rows=2250 
width=4)
EXPLAIN

Real working times differ 8 TIMES !

Here is a perl script, which generates a PgSQL files with sample tables and
datadump.
The "scale" parameter controls the amount of data, generated by the script.
Recommended "scale" value is 0.1 for typical computer.

--------------------------8<--------------------------
open (F, "> create");
print F (<<CREATE);
drop table t1;
drop table t2;
drop table t3;

create table t1 (i int);
create table t2 (i int, j int);
create table t3 (j int);

create index t1_i on t1 (i);
create index t2_i on t2 (i);
create index t2_j on t2 (j);
create index t3_j on t3 (j);
CREATE

open (F, "> compare");
print F (<<COMPARE);
explainselect count(distinct i) from t1 where i in (select distinct i from t2 where j in  (select distinct j from t3));
select now();select count(distinct i) from t1 where i in (select distinct i
from t2 where j in  (select distinct j from t3));select now();

explain select count(distinct t1.i) from t1,t2,t3 where t1.i=t2.i and
t2.j=t3.j;
select now();select count(distinct t1.i) from t1,t2,t3 where t1.i=t2.i and
t2.j=t3.j;select now();

explainselect count(distinct t1.i) from t1 inner join (select distinct t2.i from t2 inner join  (select distinct j from
t3)as tt3 on tt3.j=t2.j ) as tt2 on tt2.i=t1.i; 
select now();select count(distinct t1.i) from t1 inner join (select distinct
t2.i from t2 inner join  (select distinct j from t3) as tt3 on tt3.j=t2.j )
as tt2 on tt2.i=t1.i;select now();
COMPARE

print ("Scale: ");
$m=<>;
chomp $m;
open (F, "> data." . $m);
open (F, ">> data." . $m);
print F ("copy t1 from stdin;\n");
for($k=0;$k<300000*$m;$k++)
{$r = int ( rand(4000*$m) );print F ("$r\n");
}
print F (<<T1);
\\.

select count(*) as t1_rows from t1;
select count(distinct i) as t1_dist_rows from t1;
T1

print F ("\ncopy t2 from stdin;\n");
for($k=0;$k<8000*$m;$k++)
{$r = int ( rand(4000*$m) );print F ("$r\t$k\n");
}
print F (<<T2);
\\.

select count(*) as t2_rows from t2;
select count(distinct i) as t2_dist_rows from t2;
T2

print F ("\ncopy t3 from stdin;\n");
for($k=0;$k<4500*$m;$k++)
{$r = int ( rand(8000*$m) );print F ("$r\n");
}
print F (<<T3);
\\.

select count(*) as t3_rows from t3;
select count(distinct j) as t3_dist_rows from t3;
T3
--------------------------8<--------------------------


pgsql-bugs by date:

Previous
From: "Jerry Davis"
Date:
Subject: Problem with restoring database from a pg_dump generated script.
Next
From: "McCoy Mun"
Date:
Subject: createdb test shown as number in data directory