Re: NOT IN subquery optimization - Mailing list pgsql-hackers

From Li, Zheng
Subject Re: NOT IN subquery optimization
Date
Msg-id DBD4724F-5466-40FB-AF28-4CEC9D4B6D1F@amazon.com
Whole thread Raw
In response to Re: NOT IN subquery optimization  (David Rowley <david.rowley@2ndquadrant.com>)
Responses Re: NOT IN subquery optimization
List pgsql-hackers
The current transformation would not add "or s.a is NULL" in the example provided since it is non-nullable. You will be
comparingthese two cases in terms of the transformation:
 

select count(*) from big b where not exists(select 1 from small s
where s.a = b.a);
Time: 51.416 ms

select count(*) from big b where a not in (select a from s);
Time: 890.088 ms

 But if s.a is nullable, yes, you have proved my previous statement is false... I should have used almost always.
However, if s.a is nullable, we would do this transformation:
    select count(*) from big b where not exists(select 1 from small s
    where s.a = b.a or s.a is null);

It's possible to stop the nested loop join early during execution once we find an inner Null entry because every outer
tupleis
 
going to evaluate to true on the join condition.

Zheng

On 3/1/19, 6:17 PM, "David Rowley" <david.rowley@2ndquadrant.com> wrote:

    On Sat, 2 Mar 2019 at 12:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    >
    > "Li, Zheng" <zhelli@amazon.com> writes:
    > > Although adding "or var is NULL" to the anti join condition forces the planner to choose nested loop anti join,
itis always faster compared to the original plan.
 
    >
    > TBH, I am *really* skeptical of sweeping claims like that.  The existing
    > code will typically produce a hashed-subplan plan, which ought not be
    > that awful as long as the subquery result doesn't blow out memory.
    > It certainly is going to beat a naive nested loop.
    
    It's pretty easy to show the claim is false using master and NOT EXISTS.
    
    create table small(a int not null);
    create table big (a int not null);
    insert into small select generate_Series(1,1000);
    insert into big select x%1000+1 from generate_Series(1,1000000) x;
    
    select count(*) from big b where not exists(select 1 from small s
    where s.a = b.a);
    Time: 178.575 ms
    
    select count(*) from big b where not exists(select 1 from small s
    where s.a = b.a or s.a is null);
    Time: 38049.969 ms (00:38.050)
    
    
    -- 
     David Rowley                   http://www.2ndQuadrant.com/
     PostgreSQL Development, 24x7 Support, Training & Services
    


pgsql-hackers by date:

Previous
From: Chapman Flack
Date:
Subject: Re: Infinity vs Error for division by zero
Next
From: Tom Lane
Date:
Subject: Re: Infinity vs Error for division by zero