Home > mailing lists

Re: NOT IN subquery optimization - Mailing list pgsql-hackers

From	Richard Guo
Subject	Re: NOT IN subquery optimization
Date	February 26, 2019 13:24:36
Msg-id	CAN_9JTwZUsWLymsGA+BAJiGyqL5-gpEhu5zRF3gG9grCgmoCWA@mail.gmail.com Whole thread Raw
In response to	Re: NOT IN subquery optimization (David Rowley <david.rowley@2ndquadrant.com>)
List	pgsql-hackers

Tree view

Greenplum Database does this optimization. The idea is to use a new join

type, let's call it JOIN_LASJ_NOTIN, and its semantic regarding NULL is

defined as below:

1. If there is a NULL in the outer side, and the inner side is empty, the

NULL should be part of the outputs.

2. If there is a NULL in the outer side, and the inner side is not empty,

the NULL should not be part of the outputs.

3. If there is a NULL in the inner side, no outputs should be produced.

An example plan looks like:

gpadmin=# explain (costs off) select * from t1 where a not in(select a from t2);

QUERY PLAN

-----------------------------------

Hash Left Anti Semi (Not-In) Join

Hash Cond: (t1.a = t2.a)

-> Seq Scan on t1

-> Hash

-> Seq Scan on t2

(5 rows)

Thanks

Richard

On Tue, Feb 26, 2019 at 7:20 AM David Rowley <david.rowley@2ndquadrant.com> wrote:

On Tue, 26 Feb 2019 at 11:51, Li, Zheng <zhelli@amazon.com> wrote:
> Resend the patch with a whitespace removed so that "git apply patch" works directly.

I had a quick look at this and it seems to be broken for the empty
table case I mentioned up thread.

Quick example:

Setup:

create table t1 (a int);
create table t2 (a int not null);
insert into t1 values(NULL),(1),(2);

select * from t1 where a not in(select a from t2);

Patched:
a
---
1
2
(2 rows)

Master:
a
---

1
2
(3 rows)

This will be due to the fact you're adding an a IS NOT NULL qual to
the scan of a:

postgres=# explain select * from t1 where a not in(select a from t2);
QUERY PLAN
------------------------------------------------------------------
Hash Anti Join (cost=67.38..152.18 rows=1268 width=4)
Hash Cond: (t1.a = t2.a)
-> Seq Scan on t1 (cost=0.00..35.50 rows=2537 width=4)
Filter: (a IS NOT NULL)
-> Hash (cost=35.50..35.50 rows=2550 width=4)
-> Seq Scan on t2 (cost=0.00..35.50 rows=2550 width=4)
(6 rows)

but as I mentioned, you can't do that as t2 might be empty and there's
no way to know that during planning.

--
David Rowley https://urldefense.proofpoint.com/v2/url?u=http-3A__www.2ndQuadrant.com_&d=DwIBaQ&c=lnl9vOaLMzsy2niBC8-h_K-7QJuNJEsFrzdndhuJ3Sw&r=5r3cnfZPUDOHrMiXq8Mq2g&m=dE1nglE17x3nD-oH_BrF0r4SLaFnQKzwwJBJGpDoaaA&s=dshupMomMvkDAd92918cU21AJ1E1s7QwbrxIGSRxZA8&e=
PostgreSQL Development, 24x7 Support, Training & Services

pgsql-hackers by date:

From: Masahiko Sawada
Date: 26 February 2019, 13:20:00
Subject: Re: [HACKERS] Block level parallel vacuum

From: Dmitry Dolgov
Date: 26 February 2019, 13:47:13
Subject: Re: Segfault when restoring -Fd dump on current HEAD

Re: NOT IN subquery optimization - Mailing list pgsql-hackers

Previous

Next