Re: NOT IN subquery optimization - Mailing list pgsql-hackers

From Li, Zheng
Subject Re: NOT IN subquery optimization
Date
Msg-id 993EBC1E-906F-49D0-ABFA-7C24DE8398AE@amazon.com
Whole thread Raw
In response to Re: NOT IN subquery optimization  ("Li, Zheng" <zhelli@amazon.com>)
Responses Re: NOT IN subquery optimization
Re: NOT IN subquery optimization
List pgsql-hackers
Resend the patch with a whitespace removed so that "git apply patch" works directly.

---
Zheng Li, AWS, Amazon Aurora PostgreSQL

On 2/25/19, 12:39 PM, "Li, Zheng" <zhelli@amazon.com> wrote:

    I'm attaching a working patch following the discussion.
    
    You can also find the following patch description is the commit message:
    NOT IN to ANTI JOIN transformation
    
        The semantics of ANTI JOIN were created to match the semantics of NOT
        EXISTS, which enables NOT EXISTS subqueries to be efficiently executed
        as a kind of join.  NOT IN subqueries have different NULL semantics than
        NOT EXISTS, but since there is no special join operator for NOT IN it is
        generally executed as a nested sub-plan.  It is possible, however, to
        transform NOT IN to a correlated NOT EXISTS so that it can be executed
        it as an ANTI JOIN with additional correlated predicates.
    
        A general transformation from NOT IN to NOT EXISTS for the
        single-expression case (the multi-expression case is just ANDs of the
        single-expressions) is:
            t1.x NOT IN (SELECT t2.y from t2 where p) <=> NOT EXISTS (select 1
                    from t2 where (y=x or y is NULL or x is NULL) and p).
    
        If x or y is non-nullable, we can safely remove the predicate "x
        is NULL" or "y is NULL", and if there is no predicate p,
        then "x is NULL" may be factored out of the subquery.
        Experiments show that if we can remove one or the other ORed
        predicates, or if we can factor out the "x is NULL", then
        execution is typically much faster.
    
        Basically, for the single expression case (we also handle the multi
        expression case), we try to do the following transformation:
        When p does not exist:
            t1.x not in (t2.y) => ANTI JOIN t1(Filter: x is not null), t2 on
            join condition: t1.x=t2.y or t2.y is null.
        When x is non-nullable:
            t1.x not in (t2.y where p) => ANTI JOIN t1, t2 on join condition:
            (t1.x=t2.y or t2.y is null) and p.
    
        We implemented a nullability test routine is_node_nonnullable().
        Currently it handles Var, TargetEntry, CoalesceExpr and Const. Outer
        joins are taken into consideration in the nullability test.
    
        We adjust and apply reduce_outer_joins() before the transformation so 
        that the outer joins have an opportunity to be converted to inner joins
        prior to the transformation.
    
        Using this transformation, we measured performance improvements of
        two to three orders of magnitude on most queries in a development
        environment. In our performance experiments, table s (small) has 11
        rows, table l (large) has 1 million rows. s.n and l.n have NULL
        values. s.nn and l.nn are NOT NULL.
    
            s.n not in (l.n)          1150 ms -> 0.49 ms
            s.nn not in (l.nn)      1120 ms -> 0.45 ms
            l.n not in (l.n)   over   20 min -> 1700 ms
            l.nn not in (l.nn) over 20 min -> 220 ms
            l.n not in (s.n)               63 ms -> 750 ms
            l.nn not in (s.nn)          58 ms -> 46 ms
    
        For the only case that performance drops - l.n not in (s.n). It is
        likely to be resolved by ending the nested loop anti join early as
        soon as we find NULL inner tuple entry/entries that satisfies the
        join condition during execution. This is still under investigation.
    
    Comments are welcome.
    
    With Regards,
    ---
    Zheng Li, AWS, Amazon Aurora PostgreSQL
    
    On 2/25/19, 7:32 AM, "David Rowley" <david.rowley@2ndquadrant.com> wrote:
    
        On Fri, 22 Feb 2019 at 09:44, David Rowley <david.rowley@2ndquadrant.com> wrote:
        > I've attached the rebased and still broken version.
        
        I set about trying to make a less broken version of this.
        
        A quick reminder of the semantics of NOT IN:
        
        1. WHERE <nullable_column> NOT IN(SELECT <not null column> FROM table);
        
        If table is non-empty:
        will filter out rows where <nullable_column> is NULL
        and only show values that are not in <not null column>
        
        If table is empty:
        Filters nothing.
        
        2. WHERE <nonnullable_column> NOT IN(SELECT <null column> FROM table);
        
        If table contains NULLs in the <null column> no records will match.
        
        The previous patch handled #2 correctly but neglected to do anything
        about #1.  For #1 the only way we can implement this as a planner only
        change is to insist that the outer side expressions also are not null.
        If we could have somehow tested if "table" was non-empty then we could
        have added a IS NOT NULL clause to the outer query and converted to an
        anti-join, but ... can't know that during planning and can't add the
        IS NOT NULL regardless as, if "table" is empty we will filter NULLs
        when we shouldn't.
        
        In the attached, I set about fixing #1 by determining if the outer
        expressions could be NULL by checking
        
        1. If expression is a Var from an inner joined relation it can't be
        NULL if there's a NOT NULL constraint on the column; or
        2. If expression is a Var from an inner joined relation and there is a
        strict WHERE/ON clause, the expression can't be NULL; or
        3. If expression is a Var from an outer joined relation check for
        quals that were specified in the same syntactical level as the NOT IN
        for proofs that NULL will be filtered.
        
        An example of #3 is:
        
        SELECT * FROM t1 LEFT JOIN t2 on t1.a = t2.a WHERE t2.a IS NOT NULL
        AND t2.a NOT IN(SELECT a FROM t3); -- t2 becomes INNER JOINed later in
        planning, but...
        or;
        SELECT * FROM t1 LEFT JOIN t2 on t1.a = t2.a AND t2.a NOT IN(SELECT a FROM t3);
        
        In the latter of the two, the t1.a = t2.a join conditions ensures that
        NULLs can't exist where the NOT IN is evaluated.
        
        I implemented #3 by passing the quals down to
        pull_up_sublinks_qual_recurse().  At the top level call 'node' and
        'notnull_proofs' are the same, but that changes in recursive calls
        like the one we make inside the is_andclause() condition.
        
        Comments welcome.
        
        -- 
         David Rowley                   http://www.2ndQuadrant.com/
         PostgreSQL Development, 24x7 Support, Training & Services
        
    
    


Attachment

pgsql-hackers by date:

Previous
From: Stephen Frost
Date:
Subject: Re: Remove Deprecated Exclusive Backup Mode
Next
From: Andres Freund
Date:
Subject: Re: POC: converting Lists into arrays