Thread: BUG #4949: NOT IN is prohibitive slower than the rewrite for medium to large sets

The following bug has been logged online:

Bug reference:      4949
Logged by:          Ole Tange
Email address:      postgresql.org@tange.dk
PostgreSQL version: 8.3.7
Operating system:   Debian GNU/Linux
Description:        NOT IN is prohibitive slower than the rewrite for medium
to large sets
Details:

A NOT IN like this:

SELECT foo FROM a WHERE key NOT IN (SELECT key FROM b);

can always be rewritten like as:

CREATE TEMPORARY TABLE c AS SELECT key FROM a;
DELETE FROM c USING b WHERE c.key = b.key;
SELECT foo FROM a,c WHERE a.key = c.key;

(modulo NULLs which seem to always cause problems in NOT INs).

Because it can be rewritten, NOT IN should never be much slower than the
rewritten solution, as PostgreSQL should simply rewrite NOT IN to the
above.

However, the perl script below generates examples that clearly show this is
not the case. These are my timings from a single system:

$factor | Time for rewritten | Time for NOT IN
1000    | 00:00:00.03        | 00:00:00.01
3000    | 00:00:00.15        | 00:00:00.02
10000   | 00:00:00.51        | 00:10:26.20
30000   | 00:00:01.36        | 01:28:02.38
100000  | 00:00:06.27        | more than 8 hours

So it seems NOT IN works fine for sets less than 10000. Where as NOT IN
performs extremely poorly at sets > 10000.

I suggest that PostgreSQL rewrites all NOT IN's when the set size reaches a
certain limit (which on my system is around 10000).



#!/usr/bin/perl

one_run(1000);
one_run(3000);
one_run(10000);
one_run(30000);
one_run(100000);
one_run(300000);

sub one_run {
    my $factor = shift;
    print "select '======',$factor,'======';\n";
    print "drop table a;\n";
    print "begin;";
    print "create table a (key text);\n";
    for $x (0..6*$factor) {
    print "insert into a values ('foobarfoobar$x');\n"
    }
    print "commit;";

    print "drop table b;\n";
    print "begin;";
    print "create table b (key text);\n";
    for $x (4*$factor..7*$factor) {
    print "insert into b values ('foobarfoobar$x');\n"
    }
    print "commit;";

print "
VACUUM FULL ANALYZE;
SELECT 1,timeofday();
CREATE TEMPORARY TABLE c AS SELECT key FROM a;
DELETE FROM c USING b WHERE c.key = b.key;
SELECT COUNT(*) FROM a,c WHERE a.key = c.key;
DROP TABLE c;
SELECT 2,timeofday();


SELECT 3,timeofday();
SELECT COUNT(*) FROM a WHERE key NOT IN (SELECT key FROM b);
SELECT 4,timeofday();

CREATE INDEX a_key ON a(key);
SELECT 5,timeofday();
SELECT COUNT(*) FROM a WHERE key NOT IN (SELECT key FROM b);
SELECT 6,timeofday();

CREATE INDEX b_key ON b(key);
SELECT 7,timeofday();
SELECT COUNT(*) FROM a WHERE key NOT IN (SELECT key FROM b);
SELECT 8,timeofday();
";
}
"Ole Tange" <postgresql.org@tange.dk> writes:
> (modulo NULLs which seem to always cause problems in NOT INs).

> Because it can be rewritten, NOT IN should never be much slower than the
> rewritten solution, as PostgreSQL should simply rewrite NOT IN to the
> above.

Let's see, you understand that the rewrite violates the SQL standard
semantics of NOT IN, but you think we should do it anyway?

            regards, tom lane
On Tue, Jul 28, 2009 at 3:47 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
> "Ole Tange" <postgresql.org@tange.dk> writes:
>> (modulo NULLs which seem to always cause problems in NOT INs).
>
>> Because it can be rewritten, NOT IN should never be much slower than the
>> rewritten solution, as PostgreSQL should simply rewrite NOT IN to the
>> above.
>
> Let's see, you understand that the rewrite violates the SQL standard
> semantics of NOT IN, but you think we should do it anyway?

Thanks for your kind reply.

Apparently my bug report was not quite clear. My rewrite example was
simply to show a way that could do a marvelous speedup on medium to
large sets. The correct dealing with NULL I am sure can be handled
just as efficiently.

As the performance of NOT IN is crippling for medium to large sets, I
suggest the way NOT IN is done should be similar to this:

SELECT foo FROM a WHERE a.key NOT IN (SELECT key FROM b);

is executed similar to this pseudo code (which deals with NULL):

CREATE TEMPORARY TABLE c AS SELECT key FROM a;
DELETE FROM c USING b WHERE c.key = b.key;
IF (SELECT count(*) FROM b WHERE b.key IS NULL LIMIT 1) = 0:
  -- there were no NULLs in b
  SELECT foo FROM a,c WHERE a.key = c.key;
ELSE
  -- there were NULLs in b, so just give an empty set back
  SELECT foo FROM a WHERE 1=2;
END IF

I know I can just rewrite my own code to do just that - and that is a
workaround for me. But the code would be more readable if I can simply
write NOT IN and expect it to perform just as well.


/Ole
Ole Tange wrote:
> On Tue, Jul 28, 2009 at 3:47 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
>
>> "Ole Tange" <postgresql.org@tange.dk> writes:
>>
>>> (modulo NULLs which seem to always cause problems in NOT INs).
>>>
>>> Because it can be rewritten, NOT IN should never be much slower than the
>>> rewritten solution, as PostgreSQL should simply rewrite NOT IN to the
>>> above.
>>>
>> Let's see, you understand that the rewrite violates the SQL standard
>> semantics of NOT IN, but you think we should do it anyway?
>>
>
> Thanks for your kind reply.
>
> Apparently my bug report was not quite clear. My rewrite example was
> simply to show a way that could do a marvelous speedup on medium to
> large sets. The correct dealing with NULL I am sure can be handled
> just as efficiently.
>

Just an observation - it seems that you're using NOT IN() and expecting
it to do the same as this:

SELECT foo FROM a LEFT JOIN b ON a.key = b.key WHERE b.key IS NULL

I find it's comparatively rare to actually want the NOT IN()
null-handling semantics, especially given your comment about nulls
'causing problems' (although I guess you could get the same behaviour in
that query as NOT IN, by adding 'AND a.key IS NOT NULL' to the where
clause - I have no idea whether this is something the optimiser could or
should be able to do).

Although it refers to TSQL, this page might help explain more about NOT IN:

http://stackoverflow.com/questions/129077/sql-not-in-constraint-and-null-values

Not sure if constructions such as 'not exists (select 1 from b where
b.key = a.key)' also have the same performance issues you describe -
should be faster due to the key? - but I find the left join approach
works well enough up to several million rows in the two tables. An
unindexed subquery like 'select key from a' just seems like a guaranteed
way to invoking a full (slow) table scan.

Tom
On Jul 28, 2009, at 9:47 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> "Ole Tange" <postgresql.org@tange.dk> writes:
>> (modulo NULLs which seem to always cause problems in NOT INs).
>
>> Because it can be rewritten, NOT IN should never be much slower
>> than the
>> rewritten solution, as PostgreSQL should simply rewrite NOT IN to the
>> above.
>
> Let's see, you understand that the rewrite violates the SQL standard
> semantics of NOT IN, but you think we should do it anyway?

If the subquery can't return NULLs, the rewrite is valid. I know
you've rejected the idea of checking for this in the past, but perhaps
you should consider this a user vote in favor of doing so.

I learned the hard way not to do this >5 years ago but it seemed
strange to me, too.

...Robert
On Wed, Jul 29, 2009 at 12:16 AM, Tom Molesworth<tom@audioboundary.com> wrote:
> Ole Tange wrote:
>> On Tue, Jul 28, 2009 at 3:47 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
>>> "Ole Tange" <postgresql.org@tange.dk> writes:
>>>
>>>> (modulo NULLs which seem to always cause problems in NOT INs).
>>>
>>> Let's see, you understand that the rewrite violates the SQL standard
>>> semantics of NOT IN, but you think we should do it anyway?
>
> Just an observation - it seems that you're using NOT IN() and expecting it
> to do the same as this:
>
> SELECT foo FROM a LEFT JOIN b ON a.key = b.key WHERE b.key IS NULL

For my purposes this query is fine, too. It will, however, still cause
the code to be less readable than 'a.key NOT IN (...)'.

As I found IN actually performs fine, I naively thought I could just
move the NOT in my own code or use '= False':

SELECT foo FROM a WHERE NOT (key IN (SELECT key FROM b));  -- This is slow
SELECT foo FROM a WHERE (key IN (SELECT key FROM b)) = False; -- This
is slow, too

But EXPLAIN tells me these are just as expensive as NOT IN. So though
this is fairly readable to me, they still suffer from the performance.

> I find it's comparatively rare to actually want the NOT IN() null-handling
> semantics, especially given your comment about nulls 'causing problems'

I agree with that. But if the standard says 'Do something that will
surprise the average user' then I understand that you will have to
make a choice between following the standard or pleasing the user.
Postgresql chose to follow the standard, which I believe is a valid
decision; it did, however, cause me a few hours of debugging.

I would have loved if 'NOT IN (...NULL...)' had given me some output
that would have alerted me to the NULL issue. But I do not see a
simple way of doing that.

> (although I guess you could get the same behaviour in that query as NOT IN,
> by adding 'AND a.key IS NOT NULL' to the where clause - I have no idea
> whether this is something the optimiser could or should be able to do).

The problem is not NULLs in table a, but NULLs in table b. If b
contains a NULL the query should return 0 records.

> but I find the left join approach works well enough
> up to several million rows in the two tables. An unindexed subquery like
> 'select key from a' just seems like a guaranteed way to invoking a full
> (slow) table scan.

EXPLAIN tells me the LEFT JOIN does table scan as well. It seems the
primary reason why your LEFT JOIN is faster than my 3 line alternative
is because I do DELETEs, which is slow. So thank you for the LEFT JOIN
idea.


/Ole
On Wed, Jul 29, 2009 at 11:34 AM, Ole Tange<ole@tange.dk> wrote:
> EXPLAIN tells me the LEFT JOIN does table scan as well. It seems the
> primary reason why your LEFT JOIN is faster than my 3 line alternative
> is because I do DELETEs, which is slow. So thank you for the LEFT JOIN
> idea.


If you don't want the NULL behaviour then you should also be able to
translate "WHERE foo NOT IN (select f from bar)" into  "WHERE NOT
EXISTS (select 1 from bar where foo=f)"

You should also test 8.4 to see if the performance of all three of
these variants haven't changed. 8.4 understands how to plan these
kinds of joins much better now. It still doesn't know how to prove a
column cannot be NULL though so I don't know that it would actually
help in your particular case.

--
greg
http://mit.edu/~gsstark/resume.pdf