Re: BUG #4949: NOT IN is prohibitive slower than the rewrite for medium to large sets - Mailing list pgsql-bugs

From Tom Molesworth
Subject Re: BUG #4949: NOT IN is prohibitive slower than the rewrite for medium to large sets
Date
Msg-id 4A6F78C5.9070607@audioboundary.com
Whole thread Raw
In response to Re: BUG #4949: NOT IN is prohibitive slower than the rewrite for medium to large sets  (Ole Tange <ole@tange.dk>)
Responses Re: BUG #4949: NOT IN is prohibitive slower than the rewrite for medium to large sets  (Ole Tange <ole@tange.dk>)
List pgsql-bugs
Ole Tange wrote:
> On Tue, Jul 28, 2009 at 3:47 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
>
>> "Ole Tange" <postgresql.org@tange.dk> writes:
>>
>>> (modulo NULLs which seem to always cause problems in NOT INs).
>>>
>>> Because it can be rewritten, NOT IN should never be much slower than the
>>> rewritten solution, as PostgreSQL should simply rewrite NOT IN to the
>>> above.
>>>
>> Let's see, you understand that the rewrite violates the SQL standard
>> semantics of NOT IN, but you think we should do it anyway?
>>
>
> Thanks for your kind reply.
>
> Apparently my bug report was not quite clear. My rewrite example was
> simply to show a way that could do a marvelous speedup on medium to
> large sets. The correct dealing with NULL I am sure can be handled
> just as efficiently.
>

Just an observation - it seems that you're using NOT IN() and expecting
it to do the same as this:

SELECT foo FROM a LEFT JOIN b ON a.key = b.key WHERE b.key IS NULL

I find it's comparatively rare to actually want the NOT IN()
null-handling semantics, especially given your comment about nulls
'causing problems' (although I guess you could get the same behaviour in
that query as NOT IN, by adding 'AND a.key IS NOT NULL' to the where
clause - I have no idea whether this is something the optimiser could or
should be able to do).

Although it refers to TSQL, this page might help explain more about NOT IN:

http://stackoverflow.com/questions/129077/sql-not-in-constraint-and-null-values

Not sure if constructions such as 'not exists (select 1 from b where
b.key = a.key)' also have the same performance issues you describe -
should be faster due to the key? - but I find the left join approach
works well enough up to several million rows in the two tables. An
unindexed subquery like 'select key from a' just seems like a guaranteed
way to invoking a full (slow) table scan.

Tom

pgsql-bugs by date:

Previous
From: "Jim Michaels"
Date:
Subject: BUG #4951: installation dir wrong for libpq compilation
Next
From: Robert Haas
Date:
Subject: Re: BUG #4945: Parallel update(s) gone wild