Re: PostgreSQL NOT IN performance - Mailing list pgsql-performance

From Віталій Тимчишин
Subject Re: PostgreSQL NOT IN performance
Date
Msg-id 331e40660811190712p416c371t9c9237ac98ab06fb@mail.gmail.com
Whole thread Raw
In response to Re: PostgreSQL NOT IN performance  (DANIEL CRISTIAN CRUZ <daniel.cruz@sc.senai.br>)
List pgsql-performance


2008/11/19 DANIEL CRISTIAN CRUZ <daniel.cruz@sc.senai.br>

Something weird with your example which doesn't have the same result, see row count with explain analyze:

My fault. EXCEPT ALL would not work here, so this method with EXCEPT can be used only when either operation is done on unique key on t1 or result is going to be made unique.

cruz=# SELECT version();                                         version                                          
--------------------------------------------------------------------------------------------PostgreSQL 8.3.5 on i486-pc-linux-gnu, compiled by GCC gcc-4.3.real (Debian 4.3.2-1) 4.3.2
(1 registro)

cruz=# EXPLAIN ANALYZE select * from t1 where id not in (select id from t2);                                                   QUERY PLAN                                                   
------------------------------------------------------------------------------------------------------------------Seq Scan on t1  (cost=1643.00..4928.00 rows=100000 width=4) (actual time=256.687..585.774 rows=73653 loops=1)  Filter: (NOT (hashed subplan))  SubPlan    ->  Seq Scan on t2  (cost=0.00..1393.00 rows=100000 width=4) (actual time=0.052..86.867 rows=100000 loops=1)Total runtime: 625.471 ms
(5 registros)

cruz=# EXPLAIN ANALYZE select * from t1 except all (select id from t2);                                                              QUERY PLAN                                                               
-----------------------------------------------------------------------------------------------------------------------------------------SetOp Except All  (cost=34469.90..35969.90 rows=30000 width=4) (actual time=2598.574..3663.712 rows=126733 loops=1)  ->  Sort  (cost=34469.90..35219.90 rows=300000 width=4) (actual time=2598.550..3178.387 rows=300000 loops=1)
         Sort Key: "*SELECT* 1".id
         Sort Method:  external merge  Disk: 5864kB
         ->  Append  (cost=0.00..7178.00 rows=300000 width=4) (actual time=0.037..1026.367 rows=300000 loops=1)
               ->  Subquery Scan "*SELECT* 1"  (cost=0.00..4785.00 rows=200000 width=4) (actual time=0.035..439.507 rows=200000 loops=1)
                     ->  Seq Scan on t1  (cost=0.00..2785.00 rows=200000 width=4) (actual time=0.029..161.355 rows=200000 loops=1)
               ->  Subquery Scan "*SELECT* 2"  (cost=0.00..2393.00 rows=100000 width=4) (actual time=0.107..255.160 rows=100000 loops=1)
                     ->  Seq Scan on t2  (cost=0.00..1393.00 rows=100000 width=4) (actual time=0.097..110.639 rows=100000 loops=1)
 Total runtime: 3790.831 ms
(10 registros)

Sometimes I got a better result (on older versions) with this kind of query, but in this case it doesn't:

cruz=# EXPLAIN ANALYZE SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.id WHERE t2.id IS NULL;
                                                      QUERY PLAN                                                      
-----------------------------------------------------------------------------------------------------------------------
 Merge Right Join  (cost=30092.86..35251.53 rows=155304 width=8) (actual time=850.232..1671.091 rows=73653 loops=1)
   Merge Cond: (t2.id = t1.id)
   Filter: (t2.id IS NULL)
   ->  Sort  (cost=9697.82..9947.82 rows=100000 width=4) (actual time=266.501..372.560 rows=100000 loops=1)
         Sort Key: t2.id
         Sort Method:  quicksort  Memory: 4392kB
         ->  Seq Scan on t2  (cost=0.00..1393.00 rows=100000 width=4) (actual time=0.029..78.087 rows=100000 loops=1)
   ->  Sort  (cost=20394.64..20894.64 rows=200000 width=4) (actual time=583.699..855.427 rows=273364 loops=1)
         Sort Key: t1.id
         Sort Method:  quicksort  Memory: 8784kB
         ->  Seq Scan on t1  (cost=0.00..2785.00 rows=200000 width=4) (actual time=0.087..155.665 rows=200000 loops=1)
 Total runtime: 1717.062 ms
(12 registros)

Yes, your method is even better on 8.3.3 I have. I will try to update to 8.3.5 to see if there was optimizer improvements. You could try increasing values, say, by 10 in table filling to see if NOT IT will switch to "slow" version (for me it starts being slow from some magic row count in t2). I suppose it is the moment it switches from "hashed subplan" to "subplan". For me for 10000 values it is "hashed subplan" (and it is momentary fast), for 100000 - it is "subplan" and it is sloow.
BTW: Which (memory?) configuration variable can affect such a switch?

pgsql-performance by date:

Previous
From: Stephan Szabo
Date:
Subject: Re: PostgreSQL NOT IN performance
Next
From: "Kranti™ K K Parisa"
Date:
Subject: Very Urgent : Sequences Problem