Home > mailing lists

Convert IN sublink to join - Mailing list pgsql-performance

From
Subject	Convert IN sublink to join
Date	December 14, 2005 09:34:02
Msg-id	000001c600b2$1ccb44e0$017ba8c0@flintstone Whole thread
Responses	Re: Convert IN sublink to join
List	pgsql-performance

Tree view

Dear Sir or Madam:

The function "convert_IN_to_join(Query *parse, SubLink *sublink)", from file: <postgres-8.0.4 root>/src/backend/optimizer/plan/subselect.c, is responsible for converting IN type sublinks to joins, whenever appropriate.

The following lines of code, extracted from convert_IN_to_join, verify if the subquery is correlated:

      /*
      * The sub-select must not refer to any Vars of the parent query.
      * (Vars of higher levels should be okay, though.)
      */
      if (contain_vars_of_level((Node *) subselect, 1))
              return NULL;

By commenting this code region I was able to optimize several correlated subqueries. Apparently, the rest of the PostgreSQL code is ready to handle the "convert subquery to join" optimization. Is this test really necessary?

Please analyze the following example:

DDL:
CREATE TABLE "students" ( sid char(5) primary key, name char(20) not null, age integer, email char(20) not null, unique( email ), avgrade float not null );
CREATE TABLE "enrolled" ( sid char(5), cid char(5), grade real not null, primary key(sid,cid), foreign key(sid) references students (sid) on delete restrict );

DML:

1) correlated IN subquery:
"Find all students who's grade in 'TFC' class is higher than their average grade."
select a.sid, a.name from students a where a.sid IN ( select i.sid from enrolled i where i.grade > a.avgrade AND i.cid = 'TFC');

QUERY PLAN:

Seq Scan on students a (cost=0.00..5763804.50 rows=5000 width=33)
   Filter: (subplan)
   SubPlan
     -> Seq Scan on enrolled i (cost=0.00..1144.00 rows=3473 width=9)
           Filter: ((grade > $0) AND (cid = 'TFC'::bpchar))

2) the same query after commenting out the above code region in convert_IN_to_join:

QUERY PLAN:

Hash Join (cost=1050.24..1518.21 rows=693 width=33)
   Hash Cond: ("outer".sid = "inner".sid)
   Join Filter: ("inner".grade > "outer".avgrade)
   -> Seq Scan on students a (cost=0.00..367.00 rows=10000 width=41)
   -> Hash (cost=1045.04..1045.04 rows=2078 width=13)
         -> HashAggregate (cost=1045.04..1045.04 rows=2078 width=13)
               -> Seq Scan on enrolled i (cost=0.00..1019.00 rows=10417 width=13)
                     Filter: (cid = 'TFC'::bpchar)

3) Clearly, it is possible to extract the IN subquery from query 1 since the outer attribute a.sid matches, at most once, with the inner tuple i.sid. Although s.sid is not a primary key by itself, together with "i.cid = 'TFC'" conjunct, it forms a unique tuple. Here is an efficient alternative to query 1:

select a.sid, a.name from students a, enrolled i where a.sid = i.sid AND i.cid = 'TFC' AND i.grade > a.avgrade;

QUERY PLAN:

Hash Join (cost=480.00..2366.86 rows=3473 width=33)
   Hash Cond: ("outer".sid = "inner".sid)
   Join Filter: ("outer".grade > "inner".avgrade)
   -> Seq Scan on enrolled i (cost=0.00..1019.00 rows=10417 width=13)
         Filter: (cid = 'TFC'::bpchar)
   -> Hash (cost=367.00..367.00 rows=10000 width=41)
         -> Seq Scan on students a (cost=0.00..367.00 rows=10000 width=41)

I have verified that both 2) and 3) return the exact same tuples, query 1) never completed due to the highly inefficient execution plan.
Please help me with this issue.

Kind regards,
Francisco Santos

pgsql-performance by date:

From: Mark Kirkwood
Date: 14 December 2005, 03:29:04
Subject: Re: SAN/NAS options

From: Tom Lane
Date: 14 December 2005, 11:26:58
Subject: Re: Convert IN sublink to join

Convert IN sublink to join - Mailing list pgsql-performance

Previous

Next