Thread: Postgres NOT IN vs NOT EXISTS optimization

Postgres NOT IN vs NOT EXISTS optimization

From

"Dirschel, Steve"

Date:

14 June 2022, 15:58:39

We are in the process of migrating from Oracle to Postgres and the following query does much less work with Oracle vs Postgres.

explain (analyze, buffers)

select favoritegr0_.FAVORITE_GROUP_SID as favorite1_2_, favoritegr0_.CHANGED as changed2_2_, favoritegr0_.TYPE_DISCRIMINATOR as type_dis3_2_,

favoritegr0_.GROUP_NAME as group_na4_2_, favoritegr0_.IS_DELETED as is_delet5_2_, favoritegr0_.LAST_USED as last_use6_2_, favoritegr0_.POSITION as position7_2_,

favoritegr0_.PRISM_GUID as prism_gu8_2_, favoritegr0_.PRODUCT_SID as product_9_2_,

favoritegr0_.PRODUCT_VIEW as product10_2_, favoritegr0_.USAGE_TYPE as usage_t11_2_, favoritegr0_.ROW_VERSION as row_ver12_2_

from cf0.FAVORITE_GROUP favoritegr0_

where 'FORMS.WESTLAW' = favoritegr0_.PRODUCT_SID

and favoritegr0_.PRODUCT_VIEW in ('DefaultProductView')

and (favoritegr0_.FAVORITE_GROUP_SID not in

(select favoriteen1_.FAVORITE_GROUP_SID

from cf0.FAVORITE_GROUP_MEMBER favoriteen1_

cross join cf0.CATEGORY_PAGE categorypa2_

where favoriteen1_.CATEGORY_PAGE_SID=categorypa2_.CATEGORY_PAGE_SID

and categorypa2_.UNIQUE_NAME='Florida'

and categorypa2_.IS_DELETED=0

and favoriteen1_.IS_DELETED=0))

and favoritegr0_.IS_DELETED=0

and (favoritegr0_.USAGE_TYPE=0 or favoritegr0_.USAGE_TYPE is null)

and favoritegr0_.PRISM_GUID='ia74483420000012ca23eacf87bb0ed56'

order by favoritegr0_.POSITION desc;

Here is the plan in Postgres. It did 1426 shared block hits. If you look at this plan it is not pushing filtering into the NOT IN subquery- it is fully resolving that part of the query driving off where UNIQUE_NAME = 'Florida'.

QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Sort (cost=5198.22..5198.22 rows=1 width=144) (actual time=6.559..6.560 rows=1 loops=1)

Sort Key: favoritegr0_."position" DESC

Sort Method: quicksort Memory: 25kB

Buffers: shared hit=1426

-> Index Scan using favorite_group_idx01 on favorite_group favoritegr0_ (cost=5190.18..5198.21 rows=1 width=144) (actual time=6.514..6.515 rows=1 loops=1)

Index Cond: (((prism_guid)::text = 'ia74483420000012ca23eacf87bb0ed56'::text) AND (is_deleted = 0))

Filter: ((NOT (hashed SubPlan 1)) AND ((usage_type = 0) OR (usage_type IS NULL)) AND ('FORMS.WESTLAW'::text = (product_sid)::text) AND ((product_view)::text = 'DefaultProductView'::text))

Buffers: shared hit=1423

SubPlan 1

-> Nested Loop (cost=0.70..5189.90 rows=1 width=33) (actual time=6.459..6.459 rows=0 loops=1)

Buffers: shared hit=1417

-> Index Scan using category_page_idx04 on category_page categorypa2_ (cost=0.42..5131.71 rows=7 width=33) (actual time=0.035..6.138 rows=92 loops=1)

Index Cond: ((unique_name)::text = 'Florida'::text)

Filter: (is_deleted = 0)

Buffers: shared hit=1233

-> Index Scan using favorite_group_member_idx03 on favorite_group_member favoriteen1_ (cost=0.28..8.30 rows=1 width=66) (actual time=0.003..0.003 rows=0 loops=92)

Index Cond: ((category_page_sid)::text = (categorypa2_.category_page_sid)::text)

Filter: (is_deleted = 0)

Buffers: shared hit=184

Planning Time: 1.624 ms

Execution Time: 6.697 ms

If I compare that to the plan Oracle uses it pushes the favoritegr0_.FAVORITE_GROUP_SID predicate into the NOT IN. I'm able to get a similar plan with Postgres if I change the NOT IN to a NOT EXISTS:

explain (analyze, buffers)

select favoritegr0_.FAVORITE_GROUP_SID as favorite1_2_, favoritegr0_.CHANGED as changed2_2_, favoritegr0_.TYPE_DISCRIMINATOR as type_dis3_2_,

favoritegr0_.GROUP_NAME as group_na4_2_, favoritegr0_.IS_DELETED as is_delet5_2_, favoritegr0_.LAST_USED as last_use6_2_, favoritegr0_.POSITION as position7_2_,

favoritegr0_.PRISM_GUID as prism_gu8_2_, favoritegr0_.PRODUCT_SID as product_9_2_,

favoritegr0_.PRODUCT_VIEW as product10_2_, favoritegr0_.USAGE_TYPE as usage_t11_2_, favoritegr0_.ROW_VERSION as row_ver12_2_

from cf0.FAVORITE_GROUP favoritegr0_

where 'FORMS.WESTLAW' = favoritegr0_.PRODUCT_SID

and favoritegr0_.PRODUCT_VIEW in ('DefaultProductView')

and not exists (

select 'x'

from cf0.FAVORITE_GROUP_MEMBER favoriteen1_

cross join cf0.CATEGORY_PAGE categorypa2_

where favoriteen1_.CATEGORY_PAGE_SID=categorypa2_.CATEGORY_PAGE_SID

and categorypa2_.UNIQUE_NAME='Florida'

and categorypa2_.IS_DELETED=0

and favoriteen1_.IS_DELETED=0

and favoritegr0_.FAVORITE_GROUP_SID = favoriteen1_.FAVORITE_GROUP_SID)

and favoritegr0_.IS_DELETED=0

and (favoritegr0_.USAGE_TYPE=0 or favoritegr0_.USAGE_TYPE is null)

and favoritegr0_.PRISM_GUID='ia74483420000012ca23eacf87bb0ed56'

order by favoritegr0_.POSITION desc;

Here you can see the query did 5 shared block hits- much better than the plan above. It's pushing the predicate into the NOT EXISTS with a Nested Loop Anti Join.

QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Sort (cost=121.50..121.51 rows=1 width=144) (actual time=0.027..0.028 rows=1 loops=1)

Sort Key: favoritegr0_."position" DESC

Sort Method: quicksort Memory: 25kB

Buffers: shared hit=5

-> Nested Loop Anti Join (cost=5.11..121.49 rows=1 width=144) (actual time=0.021..0.022 rows=1 loops=1)

Buffers: shared hit=5

-> Index Scan using favorite_group_idx01 on favorite_group favoritegr0_ (cost=0.28..8.30 rows=1 width=144) (actual time=0.012..0.012 rows=1 loops=1)

Index Cond: (((prism_guid)::text = 'ia74483420000012ca23eacf87bb0ed56'::text) AND (is_deleted = 0))

Filter: (((usage_type = 0) OR (usage_type IS NULL)) AND ('FORMS.WESTLAW'::text = (product_sid)::text) AND ((product_view)::text = 'DefaultProductView'::text))

Buffers: shared hit=3

-> Nested Loop (cost=4.83..113.18 rows=1 width=33) (actual time=0.008..0.009 rows=0 loops=1)

Buffers: shared hit=2

-> Bitmap Heap Scan on favorite_group_member favoriteen1_ (cost=4.41..56.40 rows=17 width=66) (actual time=0.007..0.008 rows=0 loops=1)

Recheck Cond: ((favoritegr0_.favorite_group_sid)::text = (favorite_group_sid)::text)

Filter: (is_deleted = 0)

Buffers: shared hit=2

-> Bitmap Index Scan on favorite_group_member_idx02 (cost=0.00..4.41 rows=17 width=0) (actual time=0.003..0.003 rows=0 loops=1)

Index Cond: ((favorite_group_sid)::text = (favoritegr0_.favorite_group_sid)::text)

Buffers: shared hit=2

-> Index Scan using category_page_pkey on category_page categorypa2_ (cost=0.42..3.30 rows=1 width=33) (never executed)

Index Cond: ((category_page_sid)::text = (favoriteen1_.category_page_sid)::text)

Filter: (((unique_name)::text = 'Florida'::text) AND (is_deleted = 0))

Planning Time: 0.554 ms

Execution Time: 0.071 ms

Is Postgres able to drive the query the same way with the NOT IN as the NOT EXISTS is doing or is that only available if the query has a NOT EXISTS? I don't see an option to push predicate or something like that using pg_hint_plan. I'm not sure if there are any optimizer settings that may tell Postgres to treat the NOT IN like a NOT EXISTS when optimizing this type of query.

Thanks in advance

Steve

This e-mail is for the sole use of the intended recipient and contains information that may be privileged and/or confidential. If you are not an intended recipient, please notify the sender by return e-mail and delete this e-mail and any attachments. Certain required legal entity disclosures can be accessed on our website: https://www.thomsonreuters.com/en/resources/disclosures.html

Re: Postgres NOT IN vs NOT EXISTS optimization

From

Jeremy Smith

Date:

14 June 2022, 16:06:52

I think this explains the situation well:

https://wiki.postgresql.org/wiki/Don't_Do_This#Don.27t_use_NOT_IN

On Tue, Jun 14, 2022 at 11:59 AM Dirschel, Steve <steve.dirschel@thomsonreuters.com> wrote:

We are in the process of migrating from Oracle to Postgres and the following query does much less work with Oracle vs Postgres.

explain (analyze, buffers)
select favoritegr0_.FAVORITE_GROUP_SID as favorite1_2_, favoritegr0_.CHANGED as changed2_2_, favoritegr0_.TYPE_DISCRIMINATOR as type_dis3_2_,
       favoritegr0_.GROUP_NAME as group_na4_2_, favoritegr0_.IS_DELETED as is_delet5_2_, favoritegr0_.LAST_USED as last_use6_2_, favoritegr0_.POSITION as position7_2_,
       favoritegr0_.PRISM_GUID as prism_gu8_2_, favoritegr0_.PRODUCT_SID as product_9_2_,
       favoritegr0_.PRODUCT_VIEW as product10_2_, favoritegr0_.USAGE_TYPE as usage_t11_2_, favoritegr0_.ROW_VERSION as row_ver12_2_
  from cf0.FAVORITE_GROUP favoritegr0_
where 'FORMS.WESTLAW' = favoritegr0_.PRODUCT_SID
   and favoritegr0_.PRODUCT_VIEW in ('DefaultProductView')
   and (favoritegr0_.FAVORITE_GROUP_SID not in
               (select favoriteen1_.FAVORITE_GROUP_SID
                  from cf0.FAVORITE_GROUP_MEMBER favoriteen1_
                 cross join cf0.CATEGORY_PAGE categorypa2_
                 where favoriteen1_.CATEGORY_PAGE_SID=categorypa2_.CATEGORY_PAGE_SID
                   and categorypa2_.UNIQUE_NAME='Florida'
                   and categorypa2_.IS_DELETED=0
                   and favoriteen1_.IS_DELETED=0))
   and favoritegr0_.IS_DELETED=0
   and (favoritegr0_.USAGE_TYPE=0 or favoritegr0_.USAGE_TYPE is null)
   and favoritegr0_.PRISM_GUID='ia74483420000012ca23eacf87bb0ed56'
order by favoritegr0_.POSITION desc;

Here is the plan in Postgres. It did 1426 shared block hits. If you look at this plan it is not pushing filtering into the NOT IN subquery- it is fully resolving that part of the query driving off where UNIQUE_NAME = 'Florida'.

                                                                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=5198.22..5198.22 rows=1 width=144) (actual time=6.559..6.560 rows=1 loops=1)
   Sort Key: favoritegr0_."position" DESC
   Sort Method: quicksort Memory: 25kB
   Buffers: shared hit=1426
   -> Index Scan using favorite_group_idx01 on favorite_group favoritegr0_ (cost=5190.18..5198.21 rows=1 width=144) (actual time=6.514..6.515 rows=1 loops=1)
         Index Cond: (((prism_guid)::text = 'ia74483420000012ca23eacf87bb0ed56'::text) AND (is_deleted = 0))
         Filter: ((NOT (hashed SubPlan 1)) AND ((usage_type = 0) OR (usage_type IS NULL)) AND ('FORMS.WESTLAW'::text = (product_sid)::text) AND ((product_view)::text = 'DefaultProductView'::text))
         Buffers: shared hit=1423
         SubPlan 1
           -> Nested Loop (cost=0.70..5189.90 rows=1 width=33) (actual time=6.459..6.459 rows=0 loops=1)
                 Buffers: shared hit=1417
                 -> Index Scan using category_page_idx04 on category_page categorypa2_ (cost=0.42..5131.71 rows=7 width=33) (actual time=0.035..6.138 rows=92 loops=1)
                       Index Cond: ((unique_name)::text = 'Florida'::text)
                       Filter: (is_deleted = 0)
                       Buffers: shared hit=1233
                -> Index Scan using favorite_group_member_idx03 on favorite_group_member favoriteen1_ (cost=0.28..8.30 rows=1 width=66) (actual time=0.003..0.003 rows=0 loops=92)
                       Index Cond: ((category_page_sid)::text = (categorypa2_.category_page_sid)::text)
                       Filter: (is_deleted = 0)
                       Buffers: shared hit=184
Planning Time: 1.624 ms
Execution Time: 6.697 ms

If I compare that to the plan Oracle uses it pushes the favoritegr0_.FAVORITE_GROUP_SID predicate into the NOT IN. I'm able to get a similar plan with Postgres if I change the NOT IN to a NOT EXISTS:

explain (analyze, buffers)
select favoritegr0_.FAVORITE_GROUP_SID as favorite1_2_, favoritegr0_.CHANGED as changed2_2_, favoritegr0_.TYPE_DISCRIMINATOR as type_dis3_2_,
       favoritegr0_.GROUP_NAME as group_na4_2_, favoritegr0_.IS_DELETED as is_delet5_2_, favoritegr0_.LAST_USED as last_use6_2_, favoritegr0_.POSITION as position7_2_,
       favoritegr0_.PRISM_GUID as prism_gu8_2_, favoritegr0_.PRODUCT_SID as product_9_2_,
       favoritegr0_.PRODUCT_VIEW as product10_2_, favoritegr0_.USAGE_TYPE as usage_t11_2_, favoritegr0_.ROW_VERSION as row_ver12_2_
  from cf0.FAVORITE_GROUP favoritegr0_
where 'FORMS.WESTLAW' = favoritegr0_.PRODUCT_SID
   and favoritegr0_.PRODUCT_VIEW in ('DefaultProductView')
   and not exists (
     select 'x'
          from cf0.FAVORITE_GROUP_MEMBER favoriteen1_
         cross join cf0.CATEGORY_PAGE categorypa2_
         where favoriteen1_.CATEGORY_PAGE_SID=categorypa2_.CATEGORY_PAGE_SID
           and categorypa2_.UNIQUE_NAME='Florida'
           and categorypa2_.IS_DELETED=0
           and favoriteen1_.IS_DELETED=0
           and favoritegr0_.FAVORITE_GROUP_SID = favoriteen1_.FAVORITE_GROUP_SID)
   and favoritegr0_.IS_DELETED=0
   and (favoritegr0_.USAGE_TYPE=0 or favoritegr0_.USAGE_TYPE is null)
   and favoritegr0_.PRISM_GUID='ia74483420000012ca23eacf87bb0ed56'
order by favoritegr0_.POSITION desc;

Here you can see the query did 5 shared block hits- much better than the plan above. It's pushing the predicate into the NOT EXISTS with a Nested Loop Anti Join.

                                                                                  QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=121.50..121.51 rows=1 width=144) (actual time=0.027..0.028 rows=1 loops=1)
   Sort Key: favoritegr0_."position" DESC
   Sort Method: quicksort Memory: 25kB
   Buffers: shared hit=5
   -> Nested Loop Anti Join (cost=5.11..121.49 rows=1 width=144) (actual time=0.021..0.022 rows=1 loops=1)
         Buffers: shared hit=5
         -> Index Scan using favorite_group_idx01 on favorite_group favoritegr0_ (cost=0.28..8.30 rows=1 width=144) (actual time=0.012..0.012 rows=1 loops=1)
               Index Cond: (((prism_guid)::text = 'ia74483420000012ca23eacf87bb0ed56'::text) AND (is_deleted = 0))
               Filter: (((usage_type = 0) OR (usage_type IS NULL)) AND ('FORMS.WESTLAW'::text = (product_sid)::text) AND ((product_view)::text = 'DefaultProductView'::text))
               Buffers: shared hit=3
         -> Nested Loop (cost=4.83..113.18 rows=1 width=33) (actual time=0.008..0.009 rows=0 loops=1)
               Buffers: shared hit=2
               -> Bitmap Heap Scan on favorite_group_member favoriteen1_ (cost=4.41..56.40 rows=17 width=66) (actual time=0.007..0.008 rows=0 loops=1)
                     Recheck Cond: ((favoritegr0_.favorite_group_sid)::text = (favorite_group_sid)::text)
                     Filter: (is_deleted = 0)
                     Buffers: shared hit=2
                     -> Bitmap Index Scan on favorite_group_member_idx02 (cost=0.00..4.41 rows=17 width=0) (actual time=0.003..0.003 rows=0 loops=1)
                           Index Cond: ((favorite_group_sid)::text = (favoritegr0_.favorite_group_sid)::text)
                           Buffers: shared hit=2
               -> Index Scan using category_page_pkey on category_page categorypa2_ (cost=0.42..3.30 rows=1 width=33) (never executed)
                     Index Cond: ((category_page_sid)::text = (favoriteen1_.category_page_sid)::text)
                     Filter: (((unique_name)::text = 'Florida'::text) AND (is_deleted = 0))
Planning Time: 0.554 ms
Execution Time: 0.071 ms

Is Postgres able to drive the query the same way with the NOT IN as the NOT EXISTS is doing or is that only available if the query has a NOT EXISTS? I don't see an option to push predicate or something like that using pg_hint_plan. I'm not sure if there are any optimizer settings that may tell Postgres to treat the NOT IN like a NOT EXISTS when optimizing this type of query.

Thanks in advance
Steve
This e-mail is for the sole use of the intended recipient and contains information that may be privileged and/or confidential. If you are not an intended recipient, please notify the sender by return e-mail and delete this e-mail and any attachments. Certain required legal entity disclosures can be accessed on our website: https://www.thomsonreuters.com/en/resources/disclosures.html

Re: Postgres NOT IN vs NOT EXISTS optimization

From

Tom Lane

Date:

14 June 2022, 16:09:16

"Dirschel, Steve" <steve.dirschel@thomsonreuters.com> writes:
> Is Postgres able to drive the query the same way with the NOT IN as the
> NOT EXISTS is doing or is that only available if the query has a NOT
> EXISTS?

NOT IN is not optimized very well in PG, because of the strange
semantics that the SQL spec demands when the sub-query produces any
null values.  There's been some interest in detecting cases where
we can prove that the subquery produces no nulls and then optimizing
it into NOT EXISTS, but it seems like a lot of work for not-great
return, so nothing's happened (yet).  Perhaps Oracle does something
like that already, or perhaps they're just ignoring the semantics
problem; they do not have a reputation for hewing closely to the
spec on behavior regarding nulls.

            regards, tom lane

Re: Postgres NOT IN vs NOT EXISTS optimization

From

Bruce Momjian

Date:

11 August 2022, 20:12:33

On Tue, Jun 14, 2022 at 12:09:16PM -0400, Tom Lane wrote:
> "Dirschel, Steve" <steve.dirschel@thomsonreuters.com> writes:
> > Is Postgres able to drive the query the same way with the NOT IN as the
> > NOT EXISTS is doing or is that only available if the query has a NOT
> > EXISTS?
> 
> NOT IN is not optimized very well in PG, because of the strange
> semantics that the SQL spec demands when the sub-query produces any
> null values.  There's been some interest in detecting cases where
> we can prove that the subquery produces no nulls and then optimizing
> it into NOT EXISTS, but it seems like a lot of work for not-great
> return, so nothing's happened (yet).  Perhaps Oracle does something
> like that already, or perhaps they're just ignoring the semantics
> problem; they do not have a reputation for hewing closely to the
> spec on behavior regarding nulls.

I was just now researching NOT IN behavior and remembered this thread,
so wanted to give a simplified example.  If you set up tables like this:

    CREATE TABLE small AS
        SELECT * FROM generate_series(1, 10) AS t(x);

    CREATE TABLE large AS SELECT small.x
        FROM small CROSS JOIN generate_series(1, 1000) AS t(x);

    INSERT INTO small VALUES (11), (12);

    ANALYZE small, large;

These IN and EXISTS/NOT EXISTS queries look fine. using hash joins:

    EXPLAIN SELECT small.x
    FROM small
    WHERE small.x IN (SELECT large.x FROM large);
                                     QUERY PLAN
    -----------------------------------------------------------------------------
     Hash Join  (cost=170.22..171.49 rows=10 width=4)
       Hash Cond: (small.x = large.x)
       ->  Seq Scan on small  (cost=0.00..1.12 rows=12 width=4)
       ->  Hash  (cost=170.10..170.10 rows=10 width=4)
             ->  HashAggregate  (cost=170.00..170.10 rows=10 width=4)
                   Group Key: large.x
                   ->  Seq Scan on large  (cost=0.00..145.00 rows=10000 width=4)
    
    EXPLAIN SELECT small.x
    FROM small
    WHERE EXISTS (SELECT large.x FROM large WHERE large.x = small.x);
                                     QUERY PLAN
    -----------------------------------------------------------------------------
     Hash Join  (cost=170.22..171.49 rows=10 width=4)
       Hash Cond: (small.x = large.x)
       ->  Seq Scan on small  (cost=0.00..1.12 rows=12 width=4)
       ->  Hash  (cost=170.10..170.10 rows=10 width=4)
             ->  HashAggregate  (cost=170.00..170.10 rows=10 width=4)
                   Group Key: large.x
                   ->  Seq Scan on large  (cost=0.00..145.00 rows=10000 width=4)
    
    EXPLAIN SELECT small.x
    FROM small
    WHERE NOT EXISTS (SELECT large.x FROM large WHERE large.x = small.x);
                                  QUERY PLAN
    -----------------------------------------------------------------------
     Hash Anti Join  (cost=270.00..271.20 rows=2 width=4)
       Hash Cond: (small.x = large.x)
       ->  Seq Scan on small  (cost=0.00..1.12 rows=12 width=4)
       ->  Hash  (cost=145.00..145.00 rows=10000 width=4)
             ->  Seq Scan on large  (cost=0.00..145.00 rows=10000 width=4)

These NOT IN queries all use sequential scans, and IS NOT NULL does not help:

    EXPLAIN SELECT small.x
    FROM small
    WHERE small.x NOT IN (SELECT large.x FROM large);
                                QUERY PLAN
    -------------------------------------------------------------------
     Seq Scan on small  (cost=170.00..171.15 rows=6 width=4)
       Filter: (NOT (hashed SubPlan 1))
       SubPlan 1
         ->  Seq Scan on large  (cost=0.00..145.00 rows=10000 width=4)
    
    EXPLAIN SELECT small.x
    FROM small
    WHERE small.x NOT IN (SELECT large.x FROM large WHERE large.x IS NOT NULL);
                                QUERY PLAN
    -------------------------------------------------------------------
     Seq Scan on small  (cost=170.00..171.15 rows=6 width=4)
       Filter: (NOT (hashed SubPlan 1))
       SubPlan 1
         ->  Seq Scan on large  (cost=0.00..145.00 rows=10000 width=4)
               Filter: (x IS NOT NULL)
    
Is converting NOT IN to NOT EXISTS our only option?  Couldn't we start
to create the hash and just switch to always returning NULL if we see
any NULLs while we are creating the hash?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson