Thread: Anti join confusion

Anti join confusion

From
wenhui qiu
Date:
Hi Richard Guo
   I found this path https://commitfest.postgresql.org/patch/3235/ already supports anti join , But I've found that in many cases it doesn't work.It always uses SubPlan Here's my testing process.

###########
create table join1 (id integer,name varchar(300),k1 integer);    
create table join2 (id integer,name varchar(300),score integer);
insert into join1 values ( generate_series(1,20000),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join1 values ( generate_series(1,20000),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join1 values ( generate_series(1,20000),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join1 values ( generate_series(50201,50300),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAASSSSSAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join1 values ( generate_series(50201,50300),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAASSSSSAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join1 values ( generate_series(150201,1350300),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAASSSSSAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join2 values ( generate_series(1,40000),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',1);
insert into join2 values ( generate_series(1,40000),'aaaaaaaaaaaaaaaaAAAAAAABBAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',2);
insert into join2 values ( generate_series(20001,22000),'aaaaaaaaaaaaaaaaAACCCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',3);
insert into join2 values ( generate_series(150201,950300),'aaaaaaaaaaaaaaaaAACCCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',3);
create index idx_j1 on join1(id);
create index idx_j2 on join2(id);
VACUUM ANALYZE JOIN1;
VACUUM ANALYZE JOIN2;    

test=#  explain  SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 WHERE T2.ID>10000)
test-# ;
                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.42..9016319078.86 rows=630150 width=8)
   Workers Planned: 2
   ->  Parallel Seq Scan on join1 t1  (cost=0.42..9016255063.86 rows=262562 width=8)
         Filter: (NOT (ANY (id = (SubPlan 1).col1)))
         SubPlan 1
           ->  Materialize  (cost=0.42..32181.54 rows=863294 width=4)
                 ->  Index Only Scan using idx_j2 on join2 t2  (cost=0.42..24492.07 rows=863294 width=4)
                       Index Cond: (id > 10000)
(8 rows)

test=#  explain  SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 );
                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.42..8633476697.61 rows=630150 width=8)
   Workers Planned: 2
   ->  Parallel Seq Scan on join1 t1  (cost=0.42..8633412682.61 rows=262562 width=8)
         Filter: (NOT (ANY (id = (SubPlan 1).col1)))
         SubPlan 1
           ->  Materialize  (cost=0.42..30676.42 rows=882100 width=4)
                 ->  Index Only Scan using idx_j2 on join2 t2  (cost=0.42..22819.92 rows=882100 width=4)
(7 rows)

test=#  explain  SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 where T2.ID < 1000);
                                       QUERY PLAN
----------------------------------------------------------------------------------------
 Seq Scan on join1 t1  (cost=61.73..38730.47 rows=630150 width=8)
   Filter: (NOT (ANY (id = (hashed SubPlan 1).col1)))
   SubPlan 1
     ->  Index Only Scan using idx_j2 on join2 t2  (cost=0.42..57.06 rows=1865 width=4)
           Index Cond: (id < 1000)
(5 rows)

test=#  explain  SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 where T2.ID = 1000);
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Seq Scan on join1 t1  (cost=4.45..38673.19 rows=630150 width=8)
   Filter: (NOT (ANY (id = (hashed SubPlan 1).col1)))
   SubPlan 1
     ->  Index Only Scan using idx_j2 on join2 t2  (cost=0.42..4.44 rows=1 width=4)
           Index Cond: (id = 1000)
(5 rows)

Thanks 

Re: Anti join confusion

From
Tender Wang
Date:


wenhui qiu <qiuwenhuifx@gmail.com> 于2025年2月24日周一 09:48写道:
Hi Richard Guo
   I found this path https://commitfest.postgresql.org/patch/3235/ already supports anti join , But I've found that in many cases it doesn't work.It always uses SubPlan Here's my testing process.

###########
create table join1 (id integer,name varchar(300),k1 integer);    
create table join2 (id integer,name varchar(300),score integer);
insert into join1 values ( generate_series(1,20000),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join1 values ( generate_series(1,20000),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join1 values ( generate_series(1,20000),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join1 values ( generate_series(50201,50300),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAASSSSSAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join1 values ( generate_series(50201,50300),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAASSSSSAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join1 values ( generate_series(150201,1350300),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAASSSSSAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join2 values ( generate_series(1,40000),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',1);
insert into join2 values ( generate_series(1,40000),'aaaaaaaaaaaaaaaaAAAAAAABBAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',2);
insert into join2 values ( generate_series(20001,22000),'aaaaaaaaaaaaaaaaAACCCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',3);
insert into join2 values ( generate_series(150201,950300),'aaaaaaaaaaaaaaaaAACCCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',3);
create index idx_j1 on join1(id);
create index idx_j2 on join2(id);
VACUUM ANALYZE JOIN1;
VACUUM ANALYZE JOIN2;    

test=#  explain  SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 WHERE T2.ID>10000)
test-# ;
                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.42..9016319078.86 rows=630150 width=8)
   Workers Planned: 2
   ->  Parallel Seq Scan on join1 t1  (cost=0.42..9016255063.86 rows=262562 width=8)
         Filter: (NOT (ANY (id = (SubPlan 1).col1)))
         SubPlan 1
           ->  Materialize  (cost=0.42..32181.54 rows=863294 width=4)
                 ->  Index Only Scan using idx_j2 on join2 t2  (cost=0.42..24492.07 rows=863294 width=4)
                       Index Cond: (id > 10000)
(8 rows)

test=#  explain  SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 );
                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.42..8633476697.61 rows=630150 width=8)
   Workers Planned: 2
   ->  Parallel Seq Scan on join1 t1  (cost=0.42..8633412682.61 rows=262562 width=8)
         Filter: (NOT (ANY (id = (SubPlan 1).col1)))
         SubPlan 1
           ->  Materialize  (cost=0.42..30676.42 rows=882100 width=4)
                 ->  Index Only Scan using idx_j2 on join2 t2  (cost=0.42..22819.92 rows=882100 width=4)
(7 rows)

test=#  explain  SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 where T2.ID < 1000);
                                       QUERY PLAN
----------------------------------------------------------------------------------------
 Seq Scan on join1 t1  (cost=61.73..38730.47 rows=630150 width=8)
   Filter: (NOT (ANY (id = (hashed SubPlan 1).col1)))
   SubPlan 1
     ->  Index Only Scan using idx_j2 on join2 t2  (cost=0.42..57.06 rows=1865 width=4)
           Index Cond: (id < 1000)
(5 rows)

test=#  explain  SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 where T2.ID = 1000);
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Seq Scan on join1 t1  (cost=4.45..38673.19 rows=630150 width=8)
   Filter: (NOT (ANY (id = (hashed SubPlan 1).col1)))
   SubPlan 1
     ->  Index Only Scan using idx_j2 on join2 t2  (cost=0.42..4.44 rows=1 width=4)
           Index Cond: (id = 1000)
(5 rows)


Planner now doesn't support pulling up the "NOT IN" sublink. The "NOT IN" sublink will be transformed into SubPlan. 


--
Thanks,
Tender Wang

Re: Anti join confusion

From
Tom Lane
Date:
wenhui qiu <qiuwenhuifx@gmail.com> writes:
>    I found this path https://commitfest.postgresql.org/patch/3235/ already
> supports anti join , But I've found that in many cases it doesn't work.It
> always uses SubPlan Here's my testing process.

NOT IN is not convertible to an anti-join: the semantics are wrong
because of the way it treats nulls in the sub-select's output.

In principle you could do it if you could prove that the sub-select's
output is never null, but we need to do this transformation long
before we have enough information to make such a deduction.

            regards, tom lane



Re: Anti join confusion

From
wenhui qiu
Date:
HI Tom , Tender Wang 
     But I saw the path test case are support , 
image.png
and my test case set the id is primary key also SubPlan
test=#  explain analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 where T2.ID = 1000 );
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on join1 t1  (cost=8.45..37446.94 rows=610100 width=8) (actual time=0.071..209.875 rows=1220199 loops=1)
   Filter: (NOT (ANY (id = (hashed SubPlan 1).col1)))
   Rows Removed by Filter: 1
   Buffers: shared hit=22190
   SubPlan 1
     ->  Index Only Scan using join2_pkey on join2 t2  (cost=0.42..8.44 rows=1 width=4) (actual time=0.027..0.029 rows=1 loops=1)
           Index Cond: (id = 1000)
           Heap Fetches: 0
           Buffers: shared hit=4
 Planning Time: 0.245 ms
 Execution Time: 247.906 ms
(11 rows)

test=#  explain  SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 where T2.ID > 10000 );
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.42..8414945643.30 rows=610100 width=8)
   Workers Planned: 2
   ->  Parallel Seq Scan on join1 t1  (cost=0.42..8414883633.30 rows=254208 width=8)
         Filter: (NOT (ANY (id = (SubPlan 1).col1)))
         SubPlan 1
           ->  Materialize  (cost=0.42..31028.21 rows=829768 width=4)
                 ->  Index Only Scan using join2_pkey on join2 t2  (cost=0.42..23637.37 rows=829768 width=4)
                       Index Cond: (id > 10000)
(8 rows)

test=#  explain  SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 where T2.ID < 10000 );
                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Seq Scan on join1 t1  (cost=327.05..37765.54 rows=610100 width=8)
   Filter: (NOT (ANY (id = (hashed SubPlan 1).col1)))
   SubPlan 1
     ->  Index Only Scan using join2_pkey on join2 t2  (cost=0.42..301.22 rows=10331 width=4)
           Index Cond: (id < 10000)
(5 rows)

test=#

On Mon, Feb 24, 2025 at 11:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
wenhui qiu <qiuwenhuifx@gmail.com> writes:
>    I found this path https://commitfest.postgresql.org/patch/3235/ already
> supports anti join , But I've found that in many cases it doesn't work.It
> always uses SubPlan Here's my testing process.

NOT IN is not convertible to an anti-join: the semantics are wrong
because of the way it treats nulls in the sub-select's output.

In principle you could do it if you could prove that the sub-select's
output is never null, but we need to do this transformation long
before we have enough information to make such a deduction.

                        regards, tom lane
Attachment

Re: Anti join confusion

From
Tender Wang
Date:


wenhui qiu <qiuwenhuifx@gmail.com> 于2025年2月24日周一 15:10写道:
HI Tom , Tender Wang 
     But I saw the path test case are support , 
image.png

Hi wenhui,

In the above picture,  it is NOT EXISTS sub-query,  which can be pulled up, but NOT IN sub-query, which can not now. 
As Tom said:
"NOT IN is not convertible to an anti-join: the semantics are wrong
because of the way it treats nulls in the sub-select's output."


and my test case set the id is primary key also SubPlan
test=#  explain analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 where T2.ID = 1000 );
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on join1 t1  (cost=8.45..37446.94 rows=610100 width=8) (actual time=0.071..209.875 rows=1220199 loops=1)
   Filter: (NOT (ANY (id = (hashed SubPlan 1).col1)))
   Rows Removed by Filter: 1
   Buffers: shared hit=22190
   SubPlan 1
     ->  Index Only Scan using join2_pkey on join2 t2  (cost=0.42..8.44 rows=1 width=4) (actual time=0.027..0.029 rows=1 loops=1)
           Index Cond: (id = 1000)
           Heap Fetches: 0
           Buffers: shared hit=4
 Planning Time: 0.245 ms
 Execution Time: 247.906 ms
(11 rows)

test=#  explain  SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 where T2.ID > 10000 );
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.42..8414945643.30 rows=610100 width=8)
   Workers Planned: 2
   ->  Parallel Seq Scan on join1 t1  (cost=0.42..8414883633.30 rows=254208 width=8)
         Filter: (NOT (ANY (id = (SubPlan 1).col1)))
         SubPlan 1
           ->  Materialize  (cost=0.42..31028.21 rows=829768 width=4)
                 ->  Index Only Scan using join2_pkey on join2 t2  (cost=0.42..23637.37 rows=829768 width=4)
                       Index Cond: (id > 10000)
(8 rows)

test=#  explain  SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 where T2.ID < 10000 );
                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Seq Scan on join1 t1  (cost=327.05..37765.54 rows=610100 width=8)
   Filter: (NOT (ANY (id = (hashed SubPlan 1).col1)))
   SubPlan 1
     ->  Index Only Scan using join2_pkey on join2 t2  (cost=0.42..301.22 rows=10331 width=4)
           Index Cond: (id < 10000)
(5 rows)

test=#


--
Thanks,
Tender Wang
Attachment

Re: Anti join confusion

From
wenhui qiu
Date:
Hi Tender Wang 
     Thank you very much for the analysis.Do we have plans for  NOT IN subquery  pull up?


Thanks


On Mon, Feb 24, 2025 at 4:00 PM Tender Wang <tndrwang@gmail.com> wrote:


wenhui qiu <qiuwenhuifx@gmail.com> 于2025年2月24日周一 15:10写道:
HI Tom , Tender Wang 
     But I saw the path test case are support , 
image.png

Hi wenhui,

In the above picture,  it is NOT EXISTS sub-query,  which can be pulled up, but NOT IN sub-query, which can not now. 
As Tom said:
"NOT IN is not convertible to an anti-join: the semantics are wrong
because of the way it treats nulls in the sub-select's output."


and my test case set the id is primary key also SubPlan
test=#  explain analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 where T2.ID = 1000 );
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on join1 t1  (cost=8.45..37446.94 rows=610100 width=8) (actual time=0.071..209.875 rows=1220199 loops=1)
   Filter: (NOT (ANY (id = (hashed SubPlan 1).col1)))
   Rows Removed by Filter: 1
   Buffers: shared hit=22190
   SubPlan 1
     ->  Index Only Scan using join2_pkey on join2 t2  (cost=0.42..8.44 rows=1 width=4) (actual time=0.027..0.029 rows=1 loops=1)
           Index Cond: (id = 1000)
           Heap Fetches: 0
           Buffers: shared hit=4
 Planning Time: 0.245 ms
 Execution Time: 247.906 ms
(11 rows)

test=#  explain  SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 where T2.ID > 10000 );
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.42..8414945643.30 rows=610100 width=8)
   Workers Planned: 2
   ->  Parallel Seq Scan on join1 t1  (cost=0.42..8414883633.30 rows=254208 width=8)
         Filter: (NOT (ANY (id = (SubPlan 1).col1)))
         SubPlan 1
           ->  Materialize  (cost=0.42..31028.21 rows=829768 width=4)
                 ->  Index Only Scan using join2_pkey on join2 t2  (cost=0.42..23637.37 rows=829768 width=4)
                       Index Cond: (id > 10000)
(8 rows)

test=#  explain  SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 where T2.ID < 10000 );
                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Seq Scan on join1 t1  (cost=327.05..37765.54 rows=610100 width=8)
   Filter: (NOT (ANY (id = (hashed SubPlan 1).col1)))
   SubPlan 1
     ->  Index Only Scan using join2_pkey on join2 t2  (cost=0.42..301.22 rows=10331 width=4)
           Index Cond: (id < 10000)
(5 rows)

test=#


--
Thanks,
Tender Wang
Attachment

Re: Anti join confusion

From
Laurenz Albe
Date:
On Mon, 2025-02-24 at 17:12 +0800, wenhui qiu wrote:
> Do we have plans for  NOT IN subquery  pull up?

As mentioned before, that is not possible.
Best practice is to avoid NOT IN with subqueries altogether.
The result is usually not what people expect it to be.

Yours,
Laurenz Albe

--

*E-Mail Disclaimer*
Der Inhalt dieser E-Mail ist ausschliesslich fuer den
bezeichneten Adressaten bestimmt. Wenn Sie nicht der vorgesehene Adressat
dieser E-Mail oder dessen Vertreter sein sollten, so beachten Sie bitte,
dass jede Form der Kenntnisnahme, Veroeffentlichung, Vervielfaeltigung oder
Weitergabe des Inhalts dieser E-Mail unzulaessig ist. Wir bitten Sie, sich
in diesem Fall mit dem Absender der E-Mail in Verbindung zu setzen.

*CONFIDENTIALITY NOTICE & DISCLAIMER
*This message and any attachment are
confidential and may be privileged or otherwise protected from disclosure
and solely for the use of the person(s) or entity to whom it is intended.
If you have received this message in error and are not the intended
recipient, please notify the sender immediately and delete this message and
any attachment from your system. If you are not the intended recipient, be
advised that any use of this message is prohibited and may be unlawful, and
you must not copy this message or attachment or disclose the contents to
any other person.



Re: Anti join confusion

From
wenhui qiu
Date:
Hi Laurenz 
    Actually ,Many fork postgresql databases have already implemented ,For example, if the relevant field has a non-null constraint ,Many databases can do the same thing as not exist ( MySQL ,SQL Server,Oracle)


Thanks

On Mon, Feb 24, 2025 at 7:55 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
On Mon, 2025-02-24 at 17:12 +0800, wenhui qiu wrote:
> Do we have plans for  NOT IN subquery  pull up?

As mentioned before, that is not possible.
Best practice is to avoid NOT IN with subqueries altogether.
The result is usually not what people expect it to be.

Yours,
Laurenz Albe

--

*E-Mail Disclaimer*
Der Inhalt dieser E-Mail ist ausschliesslich fuer den
bezeichneten Adressaten bestimmt. Wenn Sie nicht der vorgesehene Adressat
dieser E-Mail oder dessen Vertreter sein sollten, so beachten Sie bitte,
dass jede Form der Kenntnisnahme, Veroeffentlichung, Vervielfaeltigung oder
Weitergabe des Inhalts dieser E-Mail unzulaessig ist. Wir bitten Sie, sich
in diesem Fall mit dem Absender der E-Mail in Verbindung zu setzen.

*CONFIDENTIALITY NOTICE & DISCLAIMER
*This message and any attachment are
confidential and may be privileged or otherwise protected from disclosure
and solely for the use of the person(s) or entity to whom it is intended.
If you have received this message in error and are not the intended
recipient, please notify the sender immediately and delete this message and
any attachment from your system. If you are not the intended recipient, be
advised that any use of this message is prohibited and may be unlawful, and
you must not copy this message or attachment or disclose the contents to
any other person.

Re: Anti join confusion

From
Robert Haas
Date:
On Mon, Feb 24, 2025 at 8:08 AM wenhui qiu <qiuwenhuifx@gmail.com> wrote:
>     Actually ,Many fork postgresql databases have already implemented ,For example, if the relevant field has a
non-nullconstraint ,Many databases can do the same thing as not exist ( MySQL ,SQL Server,Oracle) 

I'm not surprised to hear it. Long-time PostgreSQL users just don't
use NOT IN, so it's fine, but anyone coming from another database gets
hosed. I think it would be good to put some effort into improving this
area, but I do not have time to work on it myself.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Anti join confusion

From
Richard Guo
Date:
On Tue, Feb 25, 2025 at 1:30 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Feb 24, 2025 at 8:08 AM wenhui qiu <qiuwenhuifx@gmail.com> wrote:
> >     Actually ,Many fork postgresql databases have already implemented ,For example, if the relevant field has a
non-nullconstraint ,Many databases can do the same thing as not exist ( MySQL ,SQL Server,Oracle) 

> I'm not surprised to hear it. Long-time PostgreSQL users just don't
> use NOT IN, so it's fine, but anyone coming from another database gets
> hosed. I think it would be good to put some effort into improving this
> area, but I do not have time to work on it myself.

I agree that it'd be beneficial to make some improvements to NOT IN
subqueries.  From what I can see, we may have two potential options:

* As Tom mentioned, we can prove that the subquery's output never
contains NULL values and then convert the NOT IN into an anti-join.
(It seems to me that we would also need to prove that the outer side
never contains NULL values either, because whether the NULL values
from the outer side should be included in the output depends on
whether the inner side is empty.)

* We can add support in the executor to handle the NULL semantics of
NOT IN.  This may require inventing a new join type.

I'm not quite sure which option is more promising at the moment, or if
there are other options to consider.

Thanks
Richard



Re: Anti join confusion

From
Tender Wang
Date:


Richard Guo <guofenglinux@gmail.com> 于2025年2月26日周三 17:46写道:
On Tue, Feb 25, 2025 at 1:30 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Feb 24, 2025 at 8:08 AM wenhui qiu <qiuwenhuifx@gmail.com> wrote:
> >     Actually ,Many fork postgresql databases have already implemented ,For example, if the relevant field has a non-null constraint ,Many databases can do the same thing as not exist ( MySQL ,SQL Server,Oracle)

> I'm not surprised to hear it. Long-time PostgreSQL users just don't
> use NOT IN, so it's fine, but anyone coming from another database gets
> hosed. I think it would be good to put some effort into improving this
> area, but I do not have time to work on it myself.

I agree that it'd be beneficial to make some improvements to NOT IN
subqueries.  From what I can see, we may have two potential options:

* As Tom mentioned, we can prove that the subquery's output never
contains NULL values and then convert the NOT IN into an anti-join.
(It seems to me that we would also need to prove that the outer side
never contains NULL values either, because whether the NULL values
from the outer side should be included in the output depends on
whether the inner side is empty.)

* We can add support in the executor to handle the NULL semantics of
NOT IN.  This may require inventing a new join type.

I'm not quite sure which option is more promising at the moment, or if
there are other options to consider.


Recently, I found Greenplum implement pull-up NOT IN subquery. They have the below comments in their codes:

We normalize NOT subqueries using the following axioms:
*
* val NOT IN (subq) =>  val <> ALL (subq)

Richard, do you have an impression about this?


--
Thanks,
Tender Wang

Re: Anti join confusion

From
Laurenz Albe
Date:
On Wed, 2025-02-26 at 18:08 +0800, Tender Wang wrote:
> Recently,  I found Greenplum implement pull-up NOT IN subquery.
> They have the below comments in their codes:
>
> We normalize NOT subqueries using the following axioms:
>  *
>  * val NOT IN (subq) =>  val <> ALL (subq)

That is true, but I don't see how it leads to an anti-join
without further consideration.  Perhaps you mean something
else by "pull up"...

Yours,
Laurenz Albe



Re: Anti join confusion

From
Robert Haas
Date:
On Wed, Feb 26, 2025 at 4:46 AM Richard Guo <guofenglinux@gmail.com> wrote:
> I agree that it'd be beneficial to make some improvements to NOT IN
> subqueries.  From what I can see, we may have two potential options:
>
> * As Tom mentioned, we can prove that the subquery's output never
> contains NULL values and then convert the NOT IN into an anti-join.
> (It seems to me that we would also need to prove that the outer side
> never contains NULL values either, because whether the NULL values
> from the outer side should be included in the output depends on
> whether the inner side is empty.)
>
> * We can add support in the executor to handle the NULL semantics of
> NOT IN.  This may require inventing a new join type.
>
> I'm not quite sure which option is more promising at the moment, or if
> there are other options to consider.

I'm not quite sure, either. I think that the first option (proving
that there can be no NULL values) is probably more similar to
thingswe've done elsewhere in the planner, so I think I have been
assuming that if we did something about this, it would be that.
However, I have also had the idea of extending the executor to handle
this situation, and it's possible that option is more promising. I
don't really know. I think it's pretty common to have NOT IN clauses
where one can be certain that there definitely isn't a NULL present,
but unfortunately I think it's also pretty common to have cases where
a NULL could be present, or at least, where one cannot provide that no
NULL can be present.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Anti join confusion

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I'm not quite sure, either. I think that the first option (proving
> that there can be no NULL values) is probably more similar to
> thingswe've done elsewhere in the planner, so I think I have been
> assuming that if we did something about this, it would be that.

Yeah.  The key problem blocking doing something about it in the
planner is that at the time we want to do join tree restructuring,
we haven't yet collected the per-relation data that would allow
us to know about NOT NULL constraints, nor run expression
simplification that would be important for detecting all but the
most basic cases.  Doing either of those things twice seems like
a nonstarter from a planner-speed viewpoint.  It's possible that
we could rearrange things to collect catalog data sooner.
(I'm vaguely imagining that we'd collect that stuff by relation
OID to begin with, and then when we build RelOptInfos later on
we could link to the per-rel-OID catalog data.)  Moving expression
simplification up seems considerably more fraught, but maybe we could
handle the cases of common interest without having done that.

            regards, tom lane



Re: Anti join confusion

From
Robert Haas
Date:
On Wed, Feb 26, 2025 at 4:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah.  The key problem blocking doing something about it in the
> planner is that at the time we want to do join tree restructuring,
> we haven't yet collected the per-relation data that would allow
> us to know about NOT NULL constraints, nor run expression
> simplification that would be important for detecting all but the
> most basic cases.  Doing either of those things twice seems like
> a nonstarter from a planner-speed viewpoint.  It's possible that
> we could rearrange things to collect catalog data sooner.
> (I'm vaguely imagining that we'd collect that stuff by relation
> OID to begin with, and then when we build RelOptInfos later on
> we could link to the per-rel-OID catalog data.)  Moving expression
> simplification up seems considerably more fraught, but maybe we could
> handle the cases of common interest without having done that.

Oh, this is very helpful! I didn't realize that there were specific
technical obstacles standing in the way of making this happen; I was
assuming it was just a case of nobody having been interested enough to
write the code. That's good to know.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Anti join confusion

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Feb 26, 2025 at 4:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yeah.  The key problem blocking doing something about it in the
>> planner is that at the time we want to do join tree restructuring,
>> we haven't yet collected the per-relation data that would allow
>> us to know about NOT NULL constraints, nor run expression
>> simplification that would be important for detecting all but the
>> most basic cases.  Doing either of those things twice seems like
>> a nonstarter from a planner-speed viewpoint.  It's possible that
>> we could rearrange things to collect catalog data sooner.
>> (I'm vaguely imagining that we'd collect that stuff by relation
>> OID to begin with, and then when we build RelOptInfos later on
>> we could link to the per-rel-OID catalog data.)  Moving expression
>> simplification up seems considerably more fraught, but maybe we could
>> handle the cases of common interest without having done that.

> Oh, this is very helpful! I didn't realize that there were specific
> technical obstacles standing in the way of making this happen; I was
> assuming it was just a case of nobody having been interested enough to
> write the code. That's good to know.

Well, there's that too ;-).  I would not be suggesting something
requiring significant planner restructuring if the only benefit
were to be able to turn some instances of NOT IN into antijoins.
But I think having NOT NULL info available earlier might allow
removal of some nasty kluges (I'm looking at you,
restriction_is_always_true).

            regards, tom lane



Re: Anti join confusion

From
David Rowley
Date:
On Thu, 27 Feb 2025 at 14:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> But I think having NOT NULL info available earlier might allow
> removal of some nasty kluges (I'm looking at you,
> restriction_is_always_true).

Can you elaborate on the gripe here, or point to the thread where you have?

David



Re: Anti join confusion

From
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> On Thu, 27 Feb 2025 at 14:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> But I think having NOT NULL info available earlier might allow
>> removal of some nasty kluges (I'm looking at you,
>> restriction_is_always_true).

> Can you elaborate on the gripe here, or point to the thread where you have?

The Right Way (TM) to do reduction of IS[NOT]NULL to const-true
or const-false is to do it during constant folding
(eval_const_expressions).  Doing it later is a hack, because
(a) we can't do any further folding using the constant, and
(b) it introduces bugs, as per [1] (and IIRC that wasn't the
first problem we found with that code).

Right now we have to do it later because we haven't run any of the
plancat.c code at the point where we do eval_const_expressions.
But if we could decouple that and have the info available during
eval_const_expressions, it'd be way cleaner.

            regards, tom lane

[1] https://www.postgresql.org/message-id/flat/f5320d3d-77af-4ce8-b9c3-4715ff33f213%40rentec.com



Re: Anti join confusion

From
Richard Guo
Date:
On Wed, Feb 26, 2025 at 7:09 PM Tender Wang <tndrwang@gmail.com> wrote:
> Recently, I found Greenplum implement pull-up NOT IN subquery. They have the below comments in their codes:
>
> We normalize NOT subqueries using the following axioms:
> *
> * val NOT IN (subq) =>  val <> ALL (subq)
>
> Richard, do you have an impression about this?

I vaguely recall that Greenplum converts NOT IN to some form of join
in certain cases, but I can't remember the details.

Thanks
Richard



Re: Anti join confusion

From
Tender Wang
Date:


Richard Guo <guofenglinux@gmail.com> 于2025年3月3日周一 15:34写道:
On Wed, Feb 26, 2025 at 7:09 PM Tender Wang <tndrwang@gmail.com> wrote:
> Recently, I found Greenplum implement pull-up NOT IN subquery. They have the below comments in their codes:
>
> We normalize NOT subqueries using the following axioms:
> *
> * val NOT IN (subq) =>  val <> ALL (subq)
>
> Richard, do you have an impression about this?

I vaguely recall that Greenplum converts NOT IN to some form of join
in certain cases, but I can't remember the details.

I do some research about Greenplum planner work for the NOT IN, I think the way they do is just like the second option you said:
 
* We can add support in the executor to handle the NULL semantics of
NOT IN.  This may require inventing a new join type.

1. 
They add a new join type left anti semi join
2.
The executor code can handle the case when the inner side returns NULL tuple.

I‘m not sure this transformation is only available for certain cases.  I will continue to research. 
I may provide a POC patch based on the Greenplum way.

--
Thanks,
Tender Wang

Re: Anti join confusion

From
Richard Guo
Date:
On Thu, Feb 27, 2025 at 6:58 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah.  The key problem blocking doing something about it in the
> planner is that at the time we want to do join tree restructuring,
> we haven't yet collected the per-relation data that would allow
> us to know about NOT NULL constraints, nor run expression
> simplification that would be important for detecting all but the
> most basic cases.  Doing either of those things twice seems like
> a nonstarter from a planner-speed viewpoint.  It's possible that
> we could rearrange things to collect catalog data sooner.
> (I'm vaguely imagining that we'd collect that stuff by relation
> OID to begin with, and then when we build RelOptInfos later on
> we could link to the per-rel-OID catalog data.)  Moving expression
> simplification up seems considerably more fraught, but maybe we could
> handle the cases of common interest without having done that.

Yeah, this is a problem we need to resolve if we want to go with this
option.

Another challenge, as far as I can see, is that determining whether
the outer side contains NULL values can be quite tricky due to join
reordering.  As an example, consider

select * from t t1
         join t t2 on t1.a = t2.a
         join t t3 on t2.a NOT IN (select a from t t4);

Due to the strict qual "t1.a = t2.a", we might assume that "t2.a"
cannot be NULL in the NOT IN sublink and proceed to convert it into an
anti-join.  However, the join order could end up as

    ((t1 JOIN (t2 ANTI JOIN t4)) JOIN t3)

... in which case "t2.a" can be NULL in the anti-join.

I haven't fully thought this through and am not sure if there are
other cases of join reordering that could lead us to make the wrong
decision about whether the outer side of a NOT IN contains NULL
values.  But this is something we need to get right.

I haven't seen much discussion about the second option (adding support
in the executor to handle the NULL semantics of NOT IN).  Any thoughts
on that?

Thanks
Richard