Home > mailing lists

Re: BUG #4949: NOT IN is prohibitive slower than the rewrite for medium to large sets - Mailing list pgsql-bugs

From	Ole Tange
Subject	Re: BUG #4949: NOT IN is prohibitive slower than the rewrite for medium to large sets
Date	July 29, 2009 09:17:05
Msg-id	ce534faa0907290334h5bb07eabn5527d75ca01e304e@mail.gmail.com Whole thread Raw
In response to	Re: BUG #4949: NOT IN is prohibitive slower than the rewrite for medium to large sets (Tom Molesworth <tom@audioboundary.com>)
Responses	Re: BUG #4949: NOT IN is prohibitive slower than the rewrite for medium to large sets
List	pgsql-bugs

Tree view

On Wed, Jul 29, 2009 at 12:16 AM, Tom Molesworth<tom@audioboundary.com> wrote:
> Ole Tange wrote:
>> On Tue, Jul 28, 2009 at 3:47 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
>>> "Ole Tange" <postgresql.org@tange.dk> writes:
>>>
>>>> (modulo NULLs which seem to always cause problems in NOT INs).
>>>
>>> Let's see, you understand that the rewrite violates the SQL standard
>>> semantics of NOT IN, but you think we should do it anyway?
>
> Just an observation - it seems that you're using NOT IN() and expecting it
> to do the same as this:
>
> SELECT foo FROM a LEFT JOIN b ON a.key = b.key WHERE b.key IS NULL

For my purposes this query is fine, too. It will, however, still cause
the code to be less readable than 'a.key NOT IN (...)'.

As I found IN actually performs fine, I naively thought I could just
move the NOT in my own code or use '= False':

SELECT foo FROM a WHERE NOT (key IN (SELECT key FROM b));  -- This is slow
SELECT foo FROM a WHERE (key IN (SELECT key FROM b)) = False; -- This
is slow, too

But EXPLAIN tells me these are just as expensive as NOT IN. So though
this is fairly readable to me, they still suffer from the performance.

> I find it's comparatively rare to actually want the NOT IN() null-handling
> semantics, especially given your comment about nulls 'causing problems'

I agree with that. But if the standard says 'Do something that will
surprise the average user' then I understand that you will have to
make a choice between following the standard or pleasing the user.
Postgresql chose to follow the standard, which I believe is a valid
decision; it did, however, cause me a few hours of debugging.

I would have loved if 'NOT IN (...NULL...)' had given me some output
that would have alerted me to the NULL issue. But I do not see a
simple way of doing that.

> (although I guess you could get the same behaviour in that query as NOT IN,
> by adding 'AND a.key IS NOT NULL' to the where clause - I have no idea
> whether this is something the optimiser could or should be able to do).

The problem is not NULLs in table a, but NULLs in table b. If b
contains a NULL the query should return 0 records.

> but I find the left join approach works well enough
> up to several million rows in the two tables. An unindexed subquery like
> 'select key from a' just seems like a guaranteed way to invoking a full
> (slow) table scan.

EXPLAIN tells me the LEFT JOIN does table scan as well. It seems the
primary reason why your LEFT JOIN is faster than my 3 line alternative
is because I do DELETEs, which is slow. So thank you for the LEFT JOIN
idea.

/Ole

pgsql-bugs by date:

From: Robert Haas
Date: 29 July 2009, 08:54:34
Subject: Re: BUG #4950: Problem in Job scheduling in Advanced postgre sql

From: Greg Stark
Date: 29 July 2009, 09:28:55
Subject: Re: BUG #4949: NOT IN is prohibitive slower than the rewrite for medium to large sets

Re: BUG #4949: NOT IN is prohibitive slower than the rewrite for medium to large sets - Mailing list pgsql-bugs

Previous

Next