Re: NOT IN subquery optimization - Mailing list pgsql-hackers

From Li, Zheng
Subject Re: NOT IN subquery optimization
Date
Msg-id 8654DAEE-4310-4E7E-8362-AF6F63B30888@amazon.com
Whole thread Raw
In response to Re: NOT IN subquery optimization  (Andrey Lepikhov <a.lepikhov@postgrespro.ru>)
Responses Re: NOT IN subquery optimization  (Andrey Lepikhov <a.lepikhov@postgrespro.ru>)
List pgsql-hackers
Hi Andrey,

Thanks for the comment!

The unimproved cases you mentioned all fall into the category “correlated subquery”. This category is explicitly
disallowedby existing code to convert to join in convert_ANY_sublink_to_join:
 
    /*
     * The sub-select must not refer to any Vars of the parent query. (Vars of
     * higher levels should be okay, though.)
     */
    if (contain_vars_of_level((Node *) subselect, 1))
        return NULL;

I think this is also the reason why hashed subplan is not used for such subqueries.

It's probably not always safe to convert a correlated subquery to join. We need to find out/prove when it’s safe/unsafe
toconvert such ANY subquery if we were to do so.
 

Regards,
-----------
Zheng Li
AWS, Amazon Aurora PostgreSQL
 

On 1/5/20, 1:12 AM, "Andrey Lepikhov" <a.lepikhov@postgrespro.ru> wrote:

    At the top of the thread your co-author argued the beginning of this 
    work with "findings about the performance of PostgreSQL, MySQL, and 
    Oracle on various subqueries:"
    
    https://antognini.ch/2017/12/how-well-a-query-optimizer-handles-subqueries/
    
    I launched this test set with your "not_in ..." patch. Your optimization 
    improves only results of D10-D13 queries. Nothing has changed for bad 
    plans of the E20-E27 and F20-F27 queries.
    
    For example, we can replace E20 query:
    SELECT * FROM large WHERE n IN (SELECT n FROM small WHERE small.u = 
    large.u); - execution time: 1370 ms, by
    SELECT * FROM large WHERE EXISTS (SELECT n,u FROM small WHERE (small.u = 
    large.u) AND (large.n = small.n
    )) AND n IS NOT NULL; - execution time: 0.112 ms
    
    E21 query:
    SELECT * FROM large WHERE n IN (SELECT nn FROM small WHERE small.u = 
    large.u); - 1553 ms, by
    SELECT * FROM large WHERE EXISTS (SELECT nn FROM small WHERE (small.u = 
    large.u) AND (small.nn = large.n)); - 0.194 ms
    
    F27 query:
    SELECT * FROM large WHERE nn NOT IN (SELECT nn FROM small WHERE small.nu 
    = large.u); - 1653.048 ms, by
    SELECT * FROM large WHERE NOT EXISTS (SELECT nn,nu FROM small WHERE 
    (small.nu = large.u) AND (small.nn = large.nn)); - 274.019 ms
    
    Are you planning to make another patch for these cases?
    
    Also i tried to increase work_mem up to 2GB: may be hashed subqueries 
    can improve situation? But hashing is not improved execution time of the 
    queries significantly.
    
    On your test cases (from the comments of the patch) the subquery hashing 
    has the same execution time with queries No.13-17. At the queries 
    No.1-12 it is not so slow as without hashing, but works more slowly (up 
    to 3 orders) than NOT IN optimization.
    
    On 12/2/19 9:25 PM, Li, Zheng wrote:
    > Here is the latest rebased patch.
    
    -- 
    Andrey Lepikhov
    Postgres Professional
    https://postgrespro.com
    The Russian Postgres Company
    


pgsql-hackers by date:

Previous
From: Marc Cousin
Date:
Subject: Re: [PATCH] fix a performance issue with multiple logical-decodingwalsenders
Next
From: Justin Pryzby
Date:
Subject: Re: bitmaps and correlation