Home > mailing lists

BUG #4949: NOT IN is prohibitive slower than the rewrite for medium to large sets - Mailing list pgsql-bugs

From	Ole Tange
Subject	BUG #4949: NOT IN is prohibitive slower than the rewrite for medium to large sets
Date	July 28, 2009 07:55:20
Msg-id	200907281055.n6SAtIHu065279@wwwmaster.postgresql.org Whole thread Raw
Responses	Re: BUG #4949: NOT IN is prohibitive slower than the rewrite for medium to large sets
List	pgsql-bugs

Tree view

The following bug has been logged online:

Bug reference:      4949
Logged by:          Ole Tange
Email address:      postgresql.org@tange.dk
PostgreSQL version: 8.3.7
Operating system:   Debian GNU/Linux
Description:        NOT IN is prohibitive slower than the rewrite for medium
to large sets
Details:

A NOT IN like this:

SELECT foo FROM a WHERE key NOT IN (SELECT key FROM b);

can always be rewritten like as:

CREATE TEMPORARY TABLE c AS SELECT key FROM a;
DELETE FROM c USING b WHERE c.key = b.key;
SELECT foo FROM a,c WHERE a.key = c.key;

(modulo NULLs which seem to always cause problems in NOT INs).

Because it can be rewritten, NOT IN should never be much slower than the
rewritten solution, as PostgreSQL should simply rewrite NOT IN to the
above.

However, the perl script below generates examples that clearly show this is
not the case. These are my timings from a single system:

$factor | Time for rewritten | Time for NOT IN
1000    | 00:00:00.03        | 00:00:00.01
3000    | 00:00:00.15        | 00:00:00.02
10000   | 00:00:00.51        | 00:10:26.20
30000   | 00:00:01.36        | 01:28:02.38
100000  | 00:00:06.27        | more than 8 hours

So it seems NOT IN works fine for sets less than 10000. Where as NOT IN
performs extremely poorly at sets > 10000.

I suggest that PostgreSQL rewrites all NOT IN's when the set size reaches a
certain limit (which on my system is around 10000).



#!/usr/bin/perl

one_run(1000);
one_run(3000);
one_run(10000);
one_run(30000);
one_run(100000);
one_run(300000);

sub one_run {
    my $factor = shift;
    print "select '======',$factor,'======';\n";
    print "drop table a;\n";
    print "begin;";
    print "create table a (key text);\n";
    for $x (0..6*$factor) {
    print "insert into a values ('foobarfoobar$x');\n"
    }
    print "commit;";

    print "drop table b;\n";
    print "begin;";
    print "create table b (key text);\n";
    for $x (4*$factor..7*$factor) {
    print "insert into b values ('foobarfoobar$x');\n"
    }
    print "commit;";

print "
VACUUM FULL ANALYZE;
SELECT 1,timeofday();
CREATE TEMPORARY TABLE c AS SELECT key FROM a;
DELETE FROM c USING b WHERE c.key = b.key;
SELECT COUNT(*) FROM a,c WHERE a.key = c.key;
DROP TABLE c;
SELECT 2,timeofday();


SELECT 3,timeofday();
SELECT COUNT(*) FROM a WHERE key NOT IN (SELECT key FROM b);
SELECT 4,timeofday();

CREATE INDEX a_key ON a(key);
SELECT 5,timeofday();
SELECT COUNT(*) FROM a WHERE key NOT IN (SELECT key FROM b);
SELECT 6,timeofday();

CREATE INDEX b_key ON b(key);
SELECT 7,timeofday();
SELECT COUNT(*) FROM a WHERE key NOT IN (SELECT key FROM b);
SELECT 8,timeofday();
";
}

pgsql-bugs by date:

From: Teodor Sigaev
Date: 28 July 2009, 06:35:05
Subject: Re: BUG #4933: ts_rewrite() causes segfault when query with more than one node becomes empty

From: nandakumar@obsitech.com
Date: 28 July 2009, 10:28:56
Subject: issue:pgsql trigger with script file

BUG #4949: NOT IN is prohibitive slower than the rewrite for medium to large sets - Mailing list pgsql-bugs

Previous

Next