Thread: PostgreSQL NOT IN performance

PostgreSQL NOT IN performance

From

"Віталій Тимчишин"

Date:

19 November 2008, 07:51:56

Hello.

It's second query rewrite postgresql seems not to handle - making EXCEPT from NOT IT.
Here is an example:
Preparation:

drop table if exists t1;
drop table if exists t2;
create temporary table t1(id) as
select
(random()*100000)::int from generate_series(1,200000) a(id);

create temporary table t2(id) as
select
(random()*100000)::int from generate_series(1,100000) a(id);
analyze t1;
analyze t2;

Query 1:
select * from t1 where id not in (select id from t2);
Plan:
"Seq Scan on t1 (cost=1934.00..164105319.00 rows=100000 width=4)"
" Filter: (NOT (subplan))"
" SubPlan"
"    -> Materialize (cost=1934.00..3325.00 rows=100000 width=4)"
"          -> Seq Scan on t2 (cost=0.00..1443.00 rows=100000 width=4)"

Query 2 (gives same result as Q1):
select * from t1 except all (select id from t2);
Plan:
"SetOp Except All (cost=38721.90..40221.90 rows=30000 width=4)"
" -> Sort (cost=38721.90..39471.90 rows=300000 width=4)"
"        Sort Key: "*SELECT* 1".id"
"        -> Append (cost=0.00..7328.00 rows=300000 width=4)"
"              -> Subquery Scan "*SELECT* 1" (cost=0.00..4885.00 rows=200000 width=4)"
"                    -> Seq Scan on t1 (cost=0.00..2885.00 rows=200000 width=4)"
"              -> Subquery Scan "*SELECT* 2" (cost=0.00..2443.00 rows=100000 width=4)"
"                    -> Seq Scan on t2 (cost=0.00..1443.00 rows=100000 width=4)"

If I am correct, planner simply do not know that he can rewrite NOT IN as "EXCEPT ALL" operator, so all NOT INs when list of values to remove is long takes very much time.
So the question is: I am willing to participate in postgresql development because it may be easier to fix planner then to rewrite all my queries :). How can I? (I mean to work on query planner enhancements by providing new options of query rewrite, not to work on other thing nor on enhancing planner in other ways, like better estimations of known plans).

Re: PostgreSQL NOT IN performance

From

DANIEL CRISTIAN CRUZ

Date:

19 November 2008, 08:50:12

Something weird with your example which doesn't have the same result, see row count with explain analyze:

cruz=# SELECT version();                                         version                                          
--------------------------------------------------------------------------------------------PostgreSQL 8.3.5 on i486-pc-linux-gnu, compiled by GCC gcc-4.3.real (Debian 4.3.2-1) 4.3.2
(1 registro)

cruz=# EXPLAIN ANALYZE select * from t1 where id not in (select id from t2);                                                   QUERY PLAN                                                   
------------------------------------------------------------------------------------------------------------------Seq Scan on t1  (cost=1643.00..4928.00 rows=100000 width=4) (actual time=256.687..585.774 rows=73653 loops=1)  Filter: (NOT (hashed subplan))  SubPlan    ->  Seq Scan on t2  (cost=0.00..1393.00 rows=100000 width=4) (actual time=0.052..86.867 rows=100000 loops=1)Total runtime: 625.471 ms
(5 registros)

cruz=# EXPLAIN ANALYZE select * from t1 except all (select id from t2);                                                              QUERY PLAN                                                               
-----------------------------------------------------------------------------------------------------------------------------------------SetOp Except All  (cost=34469.90..35969.90 rows=30000 width=4) (actual time=2598.574..3663.712 rows=126733 loops=1)  ->  Sort  (cost=34469.90..35219.90 rows=300000 width=4) (actual time=2598.550..3178.387 rows=300000 loops=1)
         Sort Key: "*SELECT* 1".id
         Sort Method:  external merge  Disk: 5864kB
         ->  Append  (cost=0.00..7178.00 rows=300000 width=4) (actual time=0.037..1026.367 rows=300000 loops=1)
               ->  Subquery Scan "*SELECT* 1"  (cost=0.00..4785.00 rows=200000 width=4) (actual time=0.035..439.507 rows=200000 loops=1)
                     ->  Seq Scan on t1  (cost=0.00..2785.00 rows=200000 width=4) (actual time=0.029..161.355 rows=200000 loops=1)
               ->  Subquery Scan "*SELECT* 2"  (cost=0.00..2393.00 rows=100000 width=4) (actual time=0.107..255.160 rows=100000 loops=1)
                     ->  Seq Scan on t2  (cost=0.00..1393.00 rows=100000 width=4) (actual time=0.097..110.639 rows=100000 loops=1)
 Total runtime: 3790.831 ms
(10 registros)

Sometimes I got a better result (on older versions) with this kind of query, but in this case it doesn't:

cruz=# EXPLAIN ANALYZE SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.id WHERE t2.id IS NULL;
                                                      QUERY PLAN                                                      
-----------------------------------------------------------------------------------------------------------------------
 Merge Right Join  (cost=30092.86..35251.53 rows=155304 width=8) (actual time=850.232..1671.091 rows=73653 loops=1)
   Merge Cond: (t2.id = t1.id)
   Filter: (t2.id IS NULL)
   ->  Sort  (cost=9697.82..9947.82 rows=100000 width=4) (actual time=266.501..372.560 rows=100000 loops=1)
         Sort Key: t2.id
         Sort Method:  quicksort  Memory: 4392kB
         ->  Seq Scan on t2  (cost=0.00..1393.00 rows=100000 width=4) (actual time=0.029..78.087 rows=100000 loops=1)
   ->  Sort  (cost=20394.64..20894.64 rows=200000 width=4) (actual time=583.699..855.427 rows=273364 loops=1)
         Sort Key: t1.id
         Sort Method:  quicksort  Memory: 8784kB
         ->  Seq Scan on t1  (cost=0.00..2785.00 rows=200000 width=4) (actual time=0.087..155.665 rows=200000 loops=1)
 Total runtime: 1717.062 ms
(12 registros)

Regards,

"??????? ????????" <tivv00@gmail.com> escreveu:

Hello.

It's second query rewrite postgresql seems not to handle - making EXCEPT from NOT IT.
Here is an example:
Preparation:

drop table if exists t1;
drop table if exists t2;
create temporary table t1(id) as
select
(random()*100000)::int from generate_series(1,200000) a(id);

create temporary table t2(id) as
select
(random()*100000)::int from generate_series(1,100000) a(id);
analyze t1;
analyze t2;

Query 1:
select * from t1 where id not in (select id from t2);
Plan:
"Seq Scan on t1 (cost=1934.00..164105319.00 rows=100000 width=4)"
" Filter: (NOT (subplan))"
" SubPlan"
"    -> Materialize (cost=1934.00..3325.00 rows=100000 width=4)"
"          -> Seq Scan on t2 (cost=0.00..1443.00 rows=100000 width=4)"

Query 2 (gives same result as Q1):
select * from t1 except all (select id from t2);
Plan:
"SetOp Except All (cost=38721.90..40221.90 rows=30000 width=4)"
" -> Sort (cost=38721.90..39471.90 rows=300000 width=4)"
"        Sort Key: "*SELECT* 1".id"
"        -> Append (cost=0.00..7328.00 rows=300000 width=4)"
"              -> Subquery Scan "*SELECT* 1" (cost=0.00..4885.00 rows=200000 width=4)"
"                    -> Seq Scan on t1 (cost=0.00..2885.00 rows=200000 width=4)"
"              -> Subquery Scan "*SELECT* 2" (cost=0.00..2443.00 rows=100000 width=4)"
"                    -> Seq Scan on t2 (cost=0.00..1443.00 rows=100000 width=4)"

If I am correct, planner simply do not know that he can rewrite NOT IN as "EXCEPT ALL" operator, so all NOT INs when list of values to remove is long takes very much time.
So the question is: I am willing to participate in postgresql development because it may be easier to fix planner then to rewrite all my queries :). How can I? (I mean to work on query planner enhancements by providing new options of query rewrite, not to work on other thing nor on enhancing planner in other ways, like better estimations of known plans).

Daniel Cristian Cruz
Administrador de Banco de Dados
Direção Regional - Núcleo de Tecnologia da Informação
SENAI - SC
Telefone: 48-3239-1422 (ramal 1422)

Re: PostgreSQL NOT IN performance

From

Alvaro Herrera

Date:

19 November 2008, 09:11:55

Віталій Тимчишин escribió:

> So the question is: I am willing to participate in postgresql development
> because it may be easier to fix planner then to rewrite all my queries :).
> How can I? (I mean to work on query planner enhancements by providing new
> options of query rewrite, not to work on other thing nor on enhancing
> planner in other ways, like better estimations of known plans).

http://wiki.postgresql.org/wiki/Submitting_a_Patch
http://wiki.postgresql.org/wiki/Developer_FAQ

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: PostgreSQL NOT IN performance

From

"Віталій Тимчишин"

Date:

19 November 2008, 10:55:29

2008/11/19 Stephan Szabo <sszabo@megazone.bigpanda.com>

On Wed, 19 Nov 2008, [ISO-8859-5] Віталій Тимчишин wrote:

> Query 1:
> select * from t1 where id not in (select id from t2);
>
> Query 2 (gives same result as Q1):
> select * from t1 except all (select id from t2);

It gives the same result as long as no nulls are in either table. If
either table can have a null, the conversion changes the results.

In addition, a conversion like the above only happens to work because t1
only has an id column. If t1 had two columns you'd get an error because
the two sides of except all must have the same number of columns.

Actually It can be done even for multi-column mode if the selection is done on unique key. It would look like:

select * from t1 inner join (
select id from t1 except select id from t2) talias on t1.id = talias.id

And it would produce better results then "not in" for large counts in t1 and t2.

Re: PostgreSQL NOT IN performance

From

Stephan Szabo

Date:

19 November 2008, 11:08:17

On Wed, 19 Nov 2008, [ISO-8859-5] ������� �������� wrote:

> Query 1:
> select * from t1 where id not in (select id from t2);
>
> Query 2 (gives same result as Q1):
> select * from t1 except all (select id from t2);

It gives the same result as long as no nulls are in either table. If
either table can have a null, the conversion changes the results.

In addition, a conversion like the above only happens to work because t1
only has an id column. If t1 had two columns you'd get an error because
the two sides of except all must have the same number of columns.

Re: PostgreSQL NOT IN performance

From

"Віталій Тимчишин"

Date:

19 November 2008, 11:12:50

2008/11/19 DANIEL CRISTIAN CRUZ <daniel.cruz@sc.senai.br>

Something weird with your example which doesn't have the same result, see row count with explain analyze:

My fault. EXCEPT ALL would not work here, so this method with EXCEPT can be used only when either operation is done on unique key on t1 or result is going to be made unique.

cruz=# SELECT version();                                         version                                          
--------------------------------------------------------------------------------------------PostgreSQL 8.3.5 on i486-pc-linux-gnu, compiled by GCC gcc-4.3.real (Debian 4.3.2-1) 4.3.2
(1 registro)

cruz=# EXPLAIN ANALYZE select * from t1 where id not in (select id from t2);                                                   QUERY PLAN                                                   
------------------------------------------------------------------------------------------------------------------Seq Scan on t1  (cost=1643.00..4928.00 rows=100000 width=4) (actual time=256.687..585.774 rows=73653 loops=1)  Filter: (NOT (hashed subplan))  SubPlan    ->  Seq Scan on t2  (cost=0.00..1393.00 rows=100000 width=4) (actual time=0.052..86.867 rows=100000 loops=1)Total runtime: 625.471 ms
(5 registros)

cruz=# EXPLAIN ANALYZE select * from t1 except all (select id from t2);                                                              QUERY PLAN                                                               
-----------------------------------------------------------------------------------------------------------------------------------------SetOp Except All  (cost=34469.90..35969.90 rows=30000 width=4) (actual time=2598.574..3663.712 rows=126733 loops=1)  ->  Sort  (cost=34469.90..35219.90 rows=300000 width=4) (actual time=2598.550..3178.387 rows=300000 loops=1)
         Sort Key: "*SELECT* 1".id
         Sort Method:  external merge  Disk: 5864kB
         ->  Append  (cost=0.00..7178.00 rows=300000 width=4) (actual time=0.037..1026.367 rows=300000 loops=1)
               ->  Subquery Scan "*SELECT* 1"  (cost=0.00..4785.00 rows=200000 width=4) (actual time=0.035..439.507 rows=200000 loops=1)
                     ->  Seq Scan on t1  (cost=0.00..2785.00 rows=200000 width=4) (actual time=0.029..161.355 rows=200000 loops=1)
               ->  Subquery Scan "*SELECT* 2"  (cost=0.00..2393.00 rows=100000 width=4) (actual time=0.107..255.160 rows=100000 loops=1)
                     ->  Seq Scan on t2  (cost=0.00..1393.00 rows=100000 width=4) (actual time=0.097..110.639 rows=100000 loops=1)
 Total runtime: 3790.831 ms
(10 registros)

Sometimes I got a better result (on older versions) with this kind of query, but in this case it doesn't:

cruz=# EXPLAIN ANALYZE SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.id WHERE t2.id IS NULL;
                                                      QUERY PLAN                                                      
-----------------------------------------------------------------------------------------------------------------------
 Merge Right Join  (cost=30092.86..35251.53 rows=155304 width=8) (actual time=850.232..1671.091 rows=73653 loops=1)
   Merge Cond: (t2.id = t1.id)
   Filter: (t2.id IS NULL)
   ->  Sort  (cost=9697.82..9947.82 rows=100000 width=4) (actual time=266.501..372.560 rows=100000 loops=1)
         Sort Key: t2.id
         Sort Method:  quicksort  Memory: 4392kB
         ->  Seq Scan on t2  (cost=0.00..1393.00 rows=100000 width=4) (actual time=0.029..78.087 rows=100000 loops=1)
   ->  Sort  (cost=20394.64..20894.64 rows=200000 width=4) (actual time=583.699..855.427 rows=273364 loops=1)
         Sort Key: t1.id
         Sort Method:  quicksort  Memory: 8784kB
         ->  Seq Scan on t1  (cost=0.00..2785.00 rows=200000 width=4) (actual time=0.087..155.665 rows=200000 loops=1)
 Total runtime: 1717.062 ms
(12 registros)

Yes, your method is even better on 8.3.3 I have. I will try to update to 8.3.5 to see if there was optimizer improvements. You could try increasing values, say, by 10 in table filling to see if NOT IT will switch to "slow" version (for me it starts being slow from some magic row count in t2). I suppose it is the moment it switches from "hashed subplan" to "subplan". For me for 10000 values it is "hashed subplan" (and it is momentary fast), for 100000 - it is "subplan" and it is sloow.
BTW: Which (memory?) configuration variable can affect such a switch?