That's a very interesting thread. I was confident this was a subject that had been discussed--just wasn't sure where--so thank you for forwarding.
I guess the big-picture summary is that NOT IN's definition introduces complexity (the nature of which I now understand better) that is usually unwarranted by the question the querier is asking. So NOT EXISTS will almost always be preferable when a subquery is involved, unless the behavior around NULL values is specifically desired.
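To make that concrete (the table and column names here are placeholders, not anything from my actual schema), the rewrite I have in mind looks roughly like:

-- a and b are illustrative tables; id is the join column.
-- NOT IN form: if b.id ever contains a NULL, the predicate is never true,
-- so the query returns zero rows.
select count(*)
from a
where a.id not in (select b.id from b);

-- NOT EXISTS form: rows of b with a NULL id simply never match, so the
-- count comes out the way the question is usually meant.
select count(*)
from a
where not exists (select 1 from b where b.id = a.id);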
On Fri, Nov 9, 2018 at 8:45 AM Merlin Moncure <mmoncure@gmail.com> wrote:
On Thu, Nov 8, 2018 at 3:12 PM David Rowley <david.rowley@2ndquadrant.com> wrote:
>
> On 9 November 2018 at 08:35, Lincoln Swaine-Moore
> <lswainemoore@gmail.com> wrote:
> > My primary question is: why is this approach only possible (for data too
> > large for memory) when using NOT EXISTS, and not when using NOT IN?
> >
> > I understand that there is a slight difference in the meaning of the two
> > expressions, in that NOT IN will produce NULL if there are any NULL values
> > in the right hand side (in this case there are none, and the queries should
> > return the same COUNT). But if anything, I would expect that to improve
> > performance of the NOT IN operation, since a single pass through that data
> > should reveal if there are any NULL values, at which point that information
> > could be used to short-circuit. So I am a bit baffled.
>
> The problem is that the planner makes the plan and would have to know
> beforehand that no NULLs could exist on either side of the join.
Yeah, the core issue is how the SQL rules define NOT IN to behave:

postgres=# select 1 not in (select 2);
 ?column?
──────────
 t
(1 row)
postgres=# select 1 not in (select 2 union all select null);
 ?column?
──────────

(1 row)
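Spelled out, that second case reduces to a chain of comparisons, roughly:

-- 1 NOT IN (2, NULL)
-- => (1 <> 2) AND (1 <> NULL)
-- => true AND null
-- => null   -- shown as the blank result above; in a WHERE clause the row
--              simply wouldn't qualify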
There's a certain logic to it, but it's a death sentence for performance.
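The practical fallout is easy to see with EXPLAIN. With illustrative tables a and b again (placeholder names; the exact plan shape will depend on your data, stats, and settings), something like:

explain select count(*) from a where a.id not in (select b.id from b);
-- typically planned as a SubPlan: hashed if the subquery result fits in
-- work_mem, otherwise re-executed for every row of a

explain select count(*) from a
 where not exists (select 1 from b where b.id = a.id);
-- typically planned as an anti-join (e.g. Hash Anti Join), which keeps
-- working even when b is too large to hash in memory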