Thread: PostgreSQL NOT IN performance
Hello.
It's second query rewrite postgresql seems not to handle - making EXCEPT from NOT IT.
Here is an example:
Preparation:
drop table if exists t1;
drop table if exists t2;
create temporary table t1(id) as
select
(random()*100000)::int from generate_series(1,200000) a(id);
create temporary table t2(id) as
select
(random()*100000)::int from generate_series(1,100000) a(id);
analyze t1;
analyze t2;
Query 1:
select * from t1 where id not in (select id from t2);
Plan:
"Seq Scan on t1 (cost=1934.00..164105319.00 rows=100000 width=4)"
" Filter: (NOT (subplan))"
" SubPlan"
" -> Materialize (cost=1934.00..3325.00 rows=100000 width=4)"
" -> Seq Scan on t2 (cost=0.00..1443.00 rows=100000 width=4)"
Query 2 (gives same result as Q1):
select * from t1 except all (select id from t2);
Plan:
"SetOp Except All (cost=38721.90..40221.90 rows=30000 width=4)"
" -> Sort (cost=38721.90..39471.90 rows=300000 width=4)"
" Sort Key: "*SELECT* 1".id"
" -> Append (cost=0.00..7328.00 rows=300000 width=4)"
" -> Subquery Scan "*SELECT* 1" (cost=0.00..4885.00 rows=200000 width=4)"
" -> Seq Scan on t1 (cost=0.00..2885.00 rows=200000 width=4)"
" -> Subquery Scan "*SELECT* 2" (cost=0.00..2443.00 rows=100000 width=4)"
" -> Seq Scan on t2 (cost=0.00..1443.00 rows=100000 width=4)"
If I am correct, planner simply do not know that he can rewrite NOT IN as "EXCEPT ALL" operator, so all NOT INs when list of values to remove is long takes very much time.
So the question is: I am willing to participate in postgresql development because it may be easier to fix planner then to rewrite all my queries :). How can I? (I mean to work on query planner enhancements by providing new options of query rewrite, not to work on other thing nor on enhancing planner in other ways, like better estimations of known plans).
It's second query rewrite postgresql seems not to handle - making EXCEPT from NOT IT.
Here is an example:
Preparation:
drop table if exists t1;
drop table if exists t2;
create temporary table t1(id) as
select
(random()*100000)::int from generate_series(1,200000) a(id);
create temporary table t2(id) as
select
(random()*100000)::int from generate_series(1,100000) a(id);
analyze t1;
analyze t2;
Query 1:
select * from t1 where id not in (select id from t2);
Plan:
"Seq Scan on t1 (cost=1934.00..164105319.00 rows=100000 width=4)"
" Filter: (NOT (subplan))"
" SubPlan"
" -> Materialize (cost=1934.00..3325.00 rows=100000 width=4)"
" -> Seq Scan on t2 (cost=0.00..1443.00 rows=100000 width=4)"
Query 2 (gives same result as Q1):
select * from t1 except all (select id from t2);
Plan:
"SetOp Except All (cost=38721.90..40221.90 rows=30000 width=4)"
" -> Sort (cost=38721.90..39471.90 rows=300000 width=4)"
" Sort Key: "*SELECT* 1".id"
" -> Append (cost=0.00..7328.00 rows=300000 width=4)"
" -> Subquery Scan "*SELECT* 1" (cost=0.00..4885.00 rows=200000 width=4)"
" -> Seq Scan on t1 (cost=0.00..2885.00 rows=200000 width=4)"
" -> Subquery Scan "*SELECT* 2" (cost=0.00..2443.00 rows=100000 width=4)"
" -> Seq Scan on t2 (cost=0.00..1443.00 rows=100000 width=4)"
If I am correct, planner simply do not know that he can rewrite NOT IN as "EXCEPT ALL" operator, so all NOT INs when list of values to remove is long takes very much time.
So the question is: I am willing to participate in postgresql development because it may be easier to fix planner then to rewrite all my queries :). How can I? (I mean to work on query planner enhancements by providing new options of query rewrite, not to work on other thing nor on enhancing planner in other ways, like better estimations of known plans).
Something weird with your example which doesn't have the same result, see row count with explain analyze:
cruz=# SELECT version(); version --------------------------------------------------------------------------------------------PostgreSQL 8.3.5 on i486-pc-linux-gnu, compiled by GCC gcc-4.3.real (Debian 4.3.2-1) 4.3.2 (1 registro) cruz=# EXPLAIN ANALYZE select * from t1 where id not in (select id from t2); QUERY PLAN ------------------------------------------------------------------------------------------------------------------Seq Scan on t1 (cost=1643.00..4928.00 rows=100000 width=4) (actual time=256.687..585.774 rows=73653 loops=1) Filter: (NOT (hashed subplan)) SubPlan -> Seq Scan on t2 (cost=0.00..1393.00 rows=100000 width=4) (actual time=0.052..86.867 rows=100000 loops=1)Total runtime: 625.471 ms (5 registros) cruz=# EXPLAIN ANALYZE select * from t1 except all (select id from t2); QUERY PLAN -----------------------------------------------------------------------------------------------------------------------------------------SetOp Except All (cost=34469.90..35969.90 rows=30000 width=4) (actual time=2598.574..3663.712 rows=126733 loops=1) -> Sort (cost=34469.90..35219.90 rows=300000 width=4) (actual time=2598.550..3178.387 rows=300000 loops=1) Sort Key: "*SELECT* 1".id Sort Method: external merge Disk: 5864kB -> Append (cost=0.00..7178.00 rows=300000 width=4) (actual time=0.037..1026.367 rows=300000 loops=1) -> Subquery Scan "*SELECT* 1" (cost=0.00..4785.00 rows=200000 width=4) (actual time=0.035..439.507 rows=200000 loops=1) -> Seq Scan on t1 (cost=0.00..2785.00 rows=200000 width=4) (actual time=0.029..161.355 rows=200000 loops=1) -> Subquery Scan "*SELECT* 2" (cost=0.00..2393.00 rows=100000 width=4) (actual time=0.107..255.160 rows=100000 loops=1) -> Seq Scan on t2 (cost=0.00..1393.00 rows=100000 width=4) (actual time=0.097..110.639 rows=100000 loops=1) Total runtime: 3790.831 ms (10 registros)
Sometimes I got a better result (on older versions) with this kind of query, but in this case it doesn't:
cruz=# EXPLAIN ANALYZE SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.id WHERE t2.id IS NULL; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------- Merge Right Join (cost=30092.86..35251.53 rows=155304 width=8) (actual time=850.232..1671.091 rows=73653 loops=1) Merge Cond: (t2.id = t1.id) Filter: (t2.id IS NULL) -> Sort (cost=9697.82..9947.82 rows=100000 width=4) (actual time=266.501..372.560 rows=100000 loops=1) Sort Key: t2.id Sort Method: quicksort Memory: 4392kB -> Seq Scan on t2 (cost=0.00..1393.00 rows=100000 width=4) (actual time=0.029..78.087 rows=100000 loops=1) -> Sort (cost=20394.64..20894.64 rows=200000 width=4) (actual time=583.699..855.427 rows=273364 loops=1) Sort Key: t1.id Sort Method: quicksort Memory: 8784kB -> Seq Scan on t1 (cost=0.00..2785.00 rows=200000 width=4) (actual time=0.087..155.665 rows=200000 loops=1) Total runtime: 1717.062 ms (12 registros)
Regards,
"??????? ????????" <tivv00@gmail.com> escreveu:
Hello.
It's second query rewrite postgresql seems not to handle - making EXCEPT from NOT IT.
Here is an example:
Preparation:
drop table if exists t1;
drop table if exists t2;
create temporary table t1(id) as
select
(random()*100000)::int from generate_series(1,200000) a(id);
create temporary table t2(id) as
select
(random()*100000)::int from generate_series(1,100000) a(id);
analyze t1;
analyze t2;
Query 1:
select * from t1 where id not in (select id from t2);
Plan:
"Seq Scan on t1 (cost=1934.00..164105319.00 rows=100000 width=4)"
" Filter: (NOT (subplan))"
" SubPlan"
" -> Materialize (cost=1934.00..3325.00 rows=100000 width=4)"
" -> Seq Scan on t2 (cost=0.00..1443.00 rows=100000 width=4)"
Query 2 (gives same result as Q1):
select * from t1 except all (select id from t2);
Plan:
"SetOp Except All (cost=38721.90..40221.90 rows=30000 width=4)"
" -> Sort (cost=38721.90..39471.90 rows=300000 width=4)"
" Sort Key: "*SELECT* 1".id"
" -> Append (cost=0.00..7328.00 rows=300000 width=4)"
" -> Subquery Scan "*SELECT* 1" (cost=0.00..4885.00 rows=200000 width=4)"
" -> Seq Scan on t1 (cost=0.00..2885.00 rows=200000 width=4)"
" -> Subquery Scan "*SELECT* 2" (cost=0.00..2443.00 rows=100000 width=4)"
" -> Seq Scan on t2 (cost=0.00..1443.00 rows=100000 width=4)"
If I am correct, planner simply do not know that he can rewrite NOT IN as "EXCEPT ALL" operator, so all NOT INs when list of values to remove is long takes very much time.
So the question is: I am willing to participate in postgresql development because it may be easier to fix planner then to rewrite all my queries :). How can I? (I mean to work on query planner enhancements by providing new options of query rewrite, not to work on other thing nor on enhancing planner in other ways, like better estimations of known plans).
Daniel Cristian Cruz
Administrador de Banco de Dados
Direção Regional - Núcleo de Tecnologia da Informação
SENAI - SC
Telefone: 48-3239-1422 (ramal 1422)
Віталій Тимчишин escribió: > So the question is: I am willing to participate in postgresql development > because it may be easier to fix planner then to rewrite all my queries :). > How can I? (I mean to work on query planner enhancements by providing new > options of query rewrite, not to work on other thing nor on enhancing > planner in other ways, like better estimations of known plans). http://wiki.postgresql.org/wiki/Submitting_a_Patch http://wiki.postgresql.org/wiki/Developer_FAQ -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
2008/11/19 Stephan Szabo <sszabo@megazone.bigpanda.com>
On Wed, 19 Nov 2008, [ISO-8859-5] Віталій Тимчишин wrote:
> Query 1:
> select * from t1 where id not in (select id from t2);
>> Query 2 (gives same result as Q1):It gives the same result as long as no nulls are in either table. If
> select * from t1 except all (select id from t2);
either table can have a null, the conversion changes the results.
In addition, a conversion like the above only happens to work because t1
only has an id column. If t1 had two columns you'd get an error because
the two sides of except all must have the same number of columns.
On Wed, 19 Nov 2008, [ISO-8859-5] ������� �������� wrote: > Query 1: > select * from t1 where id not in (select id from t2); > > Query 2 (gives same result as Q1): > select * from t1 except all (select id from t2); It gives the same result as long as no nulls are in either table. If either table can have a null, the conversion changes the results. In addition, a conversion like the above only happens to work because t1 only has an id column. If t1 had two columns you'd get an error because the two sides of except all must have the same number of columns.
2008/11/19 DANIEL CRISTIAN CRUZ <daniel.cruz@sc.senai.br>
Something weird with your example which doesn't have the same result, see row count with explain analyze:
My fault. EXCEPT ALL would not work here, so this method with EXCEPT can be used only when either operation is done on unique key on t1 or result is going to be made unique.
cruz=# SELECT version(); version --------------------------------------------------------------------------------------------PostgreSQL 8.3.5 on i486-pc-linux-gnu, compiled by GCC gcc-4.3.real (Debian 4.3.2-1) 4.3.2 (1 registro) cruz=# EXPLAIN ANALYZE select * from t1 where id not in (select id from t2); QUERY PLAN ------------------------------------------------------------------------------------------------------------------Seq Scan on t1 (cost=1643.00..4928.00 rows=100000 width=4) (actual time=256.687..585.774 rows=73653 loops=1) Filter: (NOT (hashed subplan)) SubPlan -> Seq Scan on t2 (cost=0.00..1393.00 rows=100000 width=4) (actual time=0.052..86.867 rows=100000 loops=1)Total runtime: 625.471 ms (5 registros) cruz=# EXPLAIN ANALYZE select * from t1 except all (select id from t2); QUERY PLAN -----------------------------------------------------------------------------------------------------------------------------------------SetOp Except All (cost=34469.90..35969.90 rows=30000 width=4) (actual time=2598.574..3663.712 rows=126733 loops=1) -> Sort (cost=34469.90..35219.90 rows=300000 width=4) (actual time=2598.550..3178.387 rows=300000 loops=1) Sort Key: "*SELECT* 1".id Sort Method: external merge Disk: 5864kB -> Append (cost=0.00..7178.00 rows=300000 width=4) (actual time=0.037..1026.367 rows=300000 loops=1) -> Subquery Scan "*SELECT* 1" (cost=0.00..4785.00 rows=200000 width=4) (actual time=0.035..439.507 rows=200000 loops=1) -> Seq Scan on t1 (cost=0.00..2785.00 rows=200000 width=4) (actual time=0.029..161.355 rows=200000 loops=1) -> Subquery Scan "*SELECT* 2" (cost=0.00..2393.00 rows=100000 width=4) (actual time=0.107..255.160 rows=100000 loops=1) -> Seq Scan on t2 (cost=0.00..1393.00 rows=100000 width=4) (actual time=0.097..110.639 rows=100000 loops=1) Total runtime: 3790.831 ms (10 registros)Sometimes I got a better result (on older versions) with this kind of query, but in this case it doesn't:
cruz=# EXPLAIN ANALYZE SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.id WHERE t2.id IS NULL; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------- Merge Right Join (cost=30092.86..35251.53 rows=155304 width=8) (actual time=850.232..1671.091 rows=73653 loops=1) Merge Cond: (t2.id = t1.id) Filter: (t2.id IS NULL) -> Sort (cost=9697.82..9947.82 rows=100000 width=4) (actual time=266.501..372.560 rows=100000 loops=1) Sort Key: t2.id Sort Method: quicksort Memory: 4392kB -> Seq Scan on t2 (cost=0.00..1393.00 rows=100000 width=4) (actual time=0.029..78.087 rows=100000 loops=1) -> Sort (cost=20394.64..20894.64 rows=200000 width=4) (actual time=583.699..855.427 rows=273364 loops=1) Sort Key: t1.id Sort Method: quicksort Memory: 8784kB -> Seq Scan on t1 (cost=0.00..2785.00 rows=200000 width=4) (actual time=0.087..155.665 rows=200000 loops=1) Total runtime: 1717.062 ms (12 registros)
Yes, your method is even better on 8.3.3 I have. I will try to update to 8.3.5 to see if there was optimizer improvements. You could try increasing values, say, by 10 in table filling to see if NOT IT will switch to "slow" version (for me it starts being slow from some magic row count in t2). I suppose it is the moment it switches from "hashed subplan" to "subplan". For me for 10000 values it is "hashed subplan" (and it is momentary fast), for 100000 - it is "subplan" and it is sloow.
BTW: Which (memory?) configuration variable can affect such a switch?