Re: PostgreSQL NOT IN performance - Mailing list pgsql-performance

From DANIEL CRISTIAN CRUZ
Subject Re: PostgreSQL NOT IN performance
Date
Msg-id 91db404146cba5910fcafdba91db5b9b@intranet.sc.senai.br
Whole thread Raw
In response to PostgreSQL NOT IN performance  ("Віталій Тимчишин" <tivv00@gmail.com>)
Responses Re: PostgreSQL NOT IN performance
List pgsql-performance

Something weird with your example which doesn't have the same result, see row count with explain analyze:

cruz=# SELECT version();                                         version                                          
--------------------------------------------------------------------------------------------PostgreSQL 8.3.5 on i486-pc-linux-gnu, compiled by GCC gcc-4.3.real (Debian 4.3.2-1) 4.3.2
(1 registro)

cruz=# EXPLAIN ANALYZE select * from t1 where id not in (select id from t2);                                                   QUERY PLAN                                                   
------------------------------------------------------------------------------------------------------------------Seq Scan on t1  (cost=1643.00..4928.00 rows=100000 width=4) (actual time=256.687..585.774 rows=73653 loops=1)  Filter: (NOT (hashed subplan))  SubPlan    ->  Seq Scan on t2  (cost=0.00..1393.00 rows=100000 width=4) (actual time=0.052..86.867 rows=100000 loops=1)Total runtime: 625.471 ms
(5 registros)

cruz=# EXPLAIN ANALYZE select * from t1 except all (select id from t2);                                                              QUERY PLAN                                                               
-----------------------------------------------------------------------------------------------------------------------------------------SetOp Except All  (cost=34469.90..35969.90 rows=30000 width=4) (actual time=2598.574..3663.712 rows=126733 loops=1)  ->  Sort  (cost=34469.90..35219.90 rows=300000 width=4) (actual time=2598.550..3178.387 rows=300000 loops=1)
         Sort Key: "*SELECT* 1".id
         Sort Method:  external merge  Disk: 5864kB
         ->  Append  (cost=0.00..7178.00 rows=300000 width=4) (actual time=0.037..1026.367 rows=300000 loops=1)
               ->  Subquery Scan "*SELECT* 1"  (cost=0.00..4785.00 rows=200000 width=4) (actual time=0.035..439.507 rows=200000 loops=1)
                     ->  Seq Scan on t1  (cost=0.00..2785.00 rows=200000 width=4) (actual time=0.029..161.355 rows=200000 loops=1)
               ->  Subquery Scan "*SELECT* 2"  (cost=0.00..2393.00 rows=100000 width=4) (actual time=0.107..255.160 rows=100000 loops=1)
                     ->  Seq Scan on t2  (cost=0.00..1393.00 rows=100000 width=4) (actual time=0.097..110.639 rows=100000 loops=1)
 Total runtime: 3790.831 ms
(10 registros)

Sometimes I got a better result (on older versions) with this kind of query, but in this case it doesn't:

cruz=# EXPLAIN ANALYZE SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.id WHERE t2.id IS NULL;
                                                      QUERY PLAN                                                      
-----------------------------------------------------------------------------------------------------------------------
 Merge Right Join  (cost=30092.86..35251.53 rows=155304 width=8) (actual time=850.232..1671.091 rows=73653 loops=1)
   Merge Cond: (t2.id = t1.id)
   Filter: (t2.id IS NULL)
   ->  Sort  (cost=9697.82..9947.82 rows=100000 width=4) (actual time=266.501..372.560 rows=100000 loops=1)
         Sort Key: t2.id
         Sort Method:  quicksort  Memory: 4392kB
         ->  Seq Scan on t2  (cost=0.00..1393.00 rows=100000 width=4) (actual time=0.029..78.087 rows=100000 loops=1)
   ->  Sort  (cost=20394.64..20894.64 rows=200000 width=4) (actual time=583.699..855.427 rows=273364 loops=1)
         Sort Key: t1.id
         Sort Method:  quicksort  Memory: 8784kB
         ->  Seq Scan on t1  (cost=0.00..2785.00 rows=200000 width=4) (actual time=0.087..155.665 rows=200000 loops=1)
 Total runtime: 1717.062 ms
(12 registros)

Regards,

"??????? ????????" <tivv00@gmail.com> escreveu:

Hello.

It's second query rewrite postgresql seems not to handle - making EXCEPT from NOT IT.
Here is an example:
Preparation:

drop table if exists t1;
drop table if exists t2;
create temporary table t1(id) as
select
(random()*100000)::int from generate_series(1,200000) a(id);

create temporary table t2(id) as
select
(random()*100000)::int from generate_series(1,100000) a(id);
analyze t1;
analyze t2;

Query 1:
select * from t1 where id not in (select id from t2);
Plan:
"Seq Scan on t1  (cost=1934.00..164105319.00 rows=100000 width=4)"
"  Filter: (NOT (subplan))"
"  SubPlan"
"    ->  Materialize  (cost=1934.00..3325.00 rows=100000 width=4)"
"          ->  Seq Scan on t2  (cost=0.00..1443.00 rows=100000 width=4)"

Query 2 (gives same result as Q1):
select * from t1 except all (select id from t2);
Plan:
"SetOp Except All  (cost=38721.90..40221.90 rows=30000 width=4)"
"  ->  Sort  (cost=38721.90..39471.90 rows=300000 width=4)"
"        Sort Key: "*SELECT* 1".id"
"        ->  Append  (cost=0.00..7328.00 rows=300000 width=4)"
"              ->  Subquery Scan "*SELECT* 1"  (cost=0.00..4885.00 rows=200000 width=4)"
"                    ->  Seq Scan on t1  (cost=0.00..2885.00 rows=200000 width=4)"
"              ->  Subquery Scan "*SELECT* 2"  (cost=0.00..2443.00 rows=100000 width=4)"
"                    ->  Seq Scan on t2  (cost=0.00..1443.00 rows=100000 width=4)"

If I am correct, planner simply do not know that he can rewrite NOT IN as "EXCEPT ALL" operator, so all NOT INs when list of values to remove is long takes very much time.
So the question is: I am willing to participate in postgresql development because it may be easier to fix planner then to rewrite all my queries :). How can I? (I mean to work on query planner enhancements by providing new options of query rewrite, not to work on other thing nor on enhancing planner in other ways, like better estimations of known plans).



 


Daniel Cristian Cruz
Administrador de Banco de Dados
Direção Regional - Núcleo de Tecnologia da Informação
SENAI - SC
Telefone: 48-3239-1422 (ramal 1422)

pgsql-performance by date:

Previous
From: "Віталій Тимчишин"
Date:
Subject: PostgreSQL NOT IN performance
Next
From: Alvaro Herrera
Date:
Subject: Re: PostgreSQL NOT IN performance