Convert IN sublink to join - Mailing list pgsql-performance
From | |
---|---|
Subject | Convert IN sublink to join |
Date | |
Msg-id | 000001c600b2$1ccb44e0$017ba8c0@flintstone Whole thread Raw |
Responses |
Re: Convert IN sublink to join
|
List | pgsql-performance |
Dear Sir or Madam:
The function "convert_IN_to_join(Query *parse, SubLink *sublink)", from file: <postgres-8.0.4 root>/src/backend/optimizer/plan/subselect.c, is responsible for converting IN type sublinks to joins, whenever appropriate.
The following lines of code, extracted from convert_IN_to_join, verify if the subquery is correlated:
/*
* The sub-select must not refer to any Vars of the parent query.
* (Vars of higher levels should be okay, though.)
*/
if (contain_vars_of_level((Node *) subselect, 1))
return NULL;
By commenting this code region I was able to optimize several correlated subqueries. Apparently, the rest of the PostgreSQL code is ready to handle the "convert subquery to join" optimization. Is this test really necessary?
Please analyze the following example:
DDL:
CREATE TABLE "students" ( sid char(5) primary key, name char(20) not null, age integer, email char(20) not null, unique( email ), avgrade float not null );
CREATE TABLE "enrolled" ( sid char(5), cid char(5), grade real not null, primary key(sid,cid), foreign key(sid) references students (sid) on delete restrict );
DML:
1) correlated IN subquery:
"Find all students who's grade in 'TFC' class is higher than their average grade."
select a.sid, a.name from students a where a.sid IN ( select i.sid from enrolled i where i.grade > a.avgrade AND i.cid = 'TFC');
QUERY PLAN:
Seq Scan on students a (cost=0.00..5763804.50 rows=5000 width=33)
Filter: (subplan)
SubPlan
-> Seq Scan on enrolled i (cost=0.00..1144.00 rows=3473 width=9)
Filter: ((grade > $0) AND (cid = 'TFC'::bpchar))
2) the same query after commenting out the above code region in convert_IN_to_join:
QUERY PLAN:
Hash Join (cost=1050.24..1518.21 rows=693 width=33)
Hash Cond: ("outer".sid = "inner".sid)
Join Filter: ("inner".grade > "outer".avgrade)
-> Seq Scan on students a (cost=0.00..367.00 rows=10000 width=41)
-> Hash (cost=1045.04..1045.04 rows=2078 width=13)
-> HashAggregate (cost=1045.04..1045.04 rows=2078 width=13)
-> Seq Scan on enrolled i (cost=0.00..1019.00 rows=10417 width=13)
Filter: (cid = 'TFC'::bpchar)
3) Clearly, it is possible to extract the IN subquery from query 1 since the outer attribute a.sid matches, at most once, with the inner tuple i.sid. Although s.sid is not a primary key by itself, together with "i.cid = 'TFC'" conjunct, it forms a unique tuple. Here is an efficient alternative to query 1:
select a.sid, a.name from students a, enrolled i where a.sid = i.sid AND i.cid = 'TFC' AND i.grade > a.avgrade;
QUERY PLAN:
Hash Join (cost=480.00..2366.86 rows=3473 width=33)
Hash Cond: ("outer".sid = "inner".sid)
Join Filter: ("outer".grade > "inner".avgrade)
-> Seq Scan on enrolled i (cost=0.00..1019.00 rows=10417 width=13)
Filter: (cid = 'TFC'::bpchar)
-> Hash (cost=367.00..367.00 rows=10000 width=41)
-> Seq Scan on students a (cost=0.00..367.00 rows=10000 width=41)
I have verified that both 2) and 3) return the exact same tuples, query 1) never completed due to the highly inefficient execution plan.
Please help me with this issue.
Kind regards,
Francisco Santos
pgsql-performance by date: