Thread: Select in subselect vs select = any array
I have noticed that SELECT ... = ANY(ARRAY(...)) is about twice as fast as SELECT IN ( ... ). Can anyone explain a reason for this? Results are the bottom and are reproducible. I can test with other versions if thatis necessary. ./configure --prefix=/usr/local/pgsql84 --with-openssl --with-perl CentOS release 5.4 (Final) psql (PostgreSQL) 8.4.1 prompt2=# select count(*) from nodes; count -------- 754734 (1 row) prompt2=# \d nodes Table "public.nodes" Column | Type | Modifiers --------------+--------------------------+----------------------------------------------------------- node_id | integer | not null default nextval(('node_id_seq'::text)::regclass) node_type_id | integer | not null template_id | integer | not null timestamp | timestamp with time zone | default ('now'::text)::timestamp(6) with time zone Indexes: "nodes_pkey" PRIMARY KEY, btree (node_id) "n_node_id_index" btree (node_id) "n_node_type_id_index" btree (node_type_id) "n_template_id_index" btree (template_id) prompt2=# select count(*) from nodes where node_id = any( Array(select node_id from nodes limit 100000) ); count -------- 100000 (1 row) Time: 404.530 ms prompt2=# select count(*) from nodes where node_id = any( Array(select node_id from nodes limit 100000) ); count -------- 100000 (1 row) Time: 407.316 ms prompt2=# select count(*) from nodes where node_id = any( Array(select node_id from nodes limit 100000) ); count -------- 100000 (1 row) Time: 408.728 ms prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 ); count -------- 100000 (1 row) Time: 793.840 ms prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 ); count -------- 100000 (1 row) Time: 779.137 ms prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 ); count -------- 100000 (1 row) Time: 781.820 ms
Hello 2011/3/20 Adam Tistler <atistler@gmail.com>: > I have noticed that SELECT ... = ANY(ARRAY(...)) is about twice as fast as SELECT IN ( ... ). > Can anyone explain a reason for this? Results are the bottom and are reproducible. I can test with other versions ifthat is necessary. > send a result of EXPLAIN ANALYZE SELECT ..., please The reasons can be different - less seq scans, indexes Regards Pavel Stehule > ./configure --prefix=/usr/local/pgsql84 --with-openssl --with-perl > CentOS release 5.4 (Final) > psql (PostgreSQL) 8.4.1 > > prompt2=# select count(*) from nodes; > count > -------- > 754734 > (1 row) > > > prompt2=# \d nodes > Table "public.nodes" > Column | Type | Modifiers > --------------+--------------------------+----------------------------------------------------------- > node_id | integer | not null default nextval(('node_id_seq'::text)::regclass) > node_type_id | integer | not null > template_id | integer | not null > timestamp | timestamp with time zone | default ('now'::text)::timestamp(6) with time zone > Indexes: > "nodes_pkey" PRIMARY KEY, btree (node_id) > "n_node_id_index" btree (node_id) > "n_node_type_id_index" btree (node_type_id) > "n_template_id_index" btree (template_id) > > prompt2=# select count(*) from nodes where node_id = any( Array(select node_id from nodes limit 100000) ); > count > -------- > 100000 > (1 row) > > Time: 404.530 ms > prompt2=# select count(*) from nodes where node_id = any( Array(select node_id from nodes limit 100000) ); > count > -------- > 100000 > (1 row) > > Time: 407.316 ms > prompt2=# select count(*) from nodes where node_id = any( Array(select node_id from nodes limit 100000) ); > count > -------- > 100000 > (1 row) > > Time: 408.728 ms > prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 ); > count > -------- > 100000 > (1 row) > > Time: 793.840 ms > prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 ); > count > -------- > 100000 > (1 row) > > Time: 779.137 ms > prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 ); > count > -------- > 100000 > (1 row) > > Time: 781.820 ms > > > -- > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance >
logicops2=# explain analyze select count(*) from nodes where node_id = any( Array(select node_id from nodes limit 100000)); QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=1718.59..1718.60 rows=1 width=0) (actual time=509.126..509.127 rows=1 loops=1) InitPlan 1 (returns $0) -> Limit (cost=0.00..1637.04 rows=100000 width=4) (actual time=0.010..76.604 rows=100000 loops=1) -> Seq Scan on nodes (cost=0.00..12355.41 rows=754741 width=4) (actual time=0.008..38.105 rows=100000 loops=1) -> Bitmap Heap Scan on nodes (cost=42.67..81.53 rows=10 width=0) (actual time=447.274..484.283 rows=100000 loops=1) Recheck Cond: (node_id = ANY ($0)) -> Bitmap Index Scan on n_node_id_index (cost=0.00..42.67 rows=10 width=0) (actual time=447.074..447.074 rows=100000loops=1) Index Cond: (node_id = ANY ($0)) Total runtime: 509.209 ms (9 rows) Time: 510.009 ms logicops2=# explain analyze select count(*) from nodes where node_id in (select node_id from nodes limit 100000); QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=3017.17..3017.18 rows=1 width=0) (actual time=1052.866..1052.866 rows=1 loops=1) -> Nested Loop (cost=2887.04..3016.67 rows=200 width=0) (actual time=167.310..1021.540 rows=100000 loops=1) -> HashAggregate (cost=2887.04..2889.04 rows=200 width=4) (actual time=167.198..251.205 rows=100000 loops=1) -> Limit (cost=0.00..1637.04 rows=100000 width=4) (actual time=0.008..80.090 rows=100000 loops=1) -> Seq Scan on nodes (cost=0.00..12355.41 rows=754741 width=4) (actual time=0.007..41.566 rows=100000loops=1) -> Index Scan using n_node_id_index on nodes (cost=0.00..0.63 rows=1 width=4) (actual time=0.006..0.007 rows=1loops=100000) Index Cond: (public.nodes.node_id = public.nodes.node_id) Total runtime: 1053.523 ms (8 rows) Time: 1054.864 ms On Mar 20, 2011, at 2:51 AM, Pavel Stehule wrote: > Hello > > 2011/3/20 Adam Tistler <atistler@gmail.com>: >> I have noticed that SELECT ... = ANY(ARRAY(...)) is about twice as fast as SELECT IN ( ... ). >> Can anyone explain a reason for this? Results are the bottom and are reproducible. I can test with other versions ifthat is necessary. >> > > send a result of EXPLAIN ANALYZE SELECT ..., please > > The reasons can be different - less seq scans, indexes > > Regards > > Pavel Stehule > > > >> ./configure --prefix=/usr/local/pgsql84 --with-openssl --with-perl >> CentOS release 5.4 (Final) >> psql (PostgreSQL) 8.4.1 >> >> prompt2=# select count(*) from nodes; >> count >> -------- >> 754734 >> (1 row) >> >> >> prompt2=# \d nodes >> Table "public.nodes" >> Column | Type | Modifiers >> --------------+--------------------------+----------------------------------------------------------- >> node_id | integer | not null default nextval(('node_id_seq'::text)::regclass) >> node_type_id | integer | not null >> template_id | integer | not null >> timestamp | timestamp with time zone | default ('now'::text)::timestamp(6) with time zone >> Indexes: >> "nodes_pkey" PRIMARY KEY, btree (node_id) >> "n_node_id_index" btree (node_id) >> "n_node_type_id_index" btree (node_type_id) >> "n_template_id_index" btree (template_id) >> >> prompt2=# select count(*) from nodes where node_id = any( Array(select node_id from nodes limit 100000) ); >> count >> -------- >> 100000 >> (1 row) >> >> Time: 404.530 ms >> prompt2=# select count(*) from nodes where node_id = any( Array(select node_id from nodes limit 100000) ); >> count >> -------- >> 100000 >> (1 row) >> >> Time: 407.316 ms >> prompt2=# select count(*) from nodes where node_id = any( Array(select node_id from nodes limit 100000) ); >> count >> -------- >> 100000 >> (1 row) >> >> Time: 408.728 ms >> prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 ); >> count >> -------- >> 100000 >> (1 row) >> >> Time: 793.840 ms >> prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 ); >> count >> -------- >> 100000 >> (1 row) >> >> Time: 779.137 ms >> prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 ); >> count >> -------- >> 100000 >> (1 row) >> >> Time: 781.820 ms >> >> >> -- >> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) >> To make changes to your subscription: >> http://www.postgresql.org/mailpref/pgsql-performance >>
Hello I think so HashAggregate goes out of memory - you can try to increase a work_mem. There are better queries for counting duplicit then cross join Regards Pavel Stehule 2011/3/21 Adam Tistler <atistler@gmail.com>: > logicops2=# explain analyze select count(*) from nodes where node_id = any( Array(select node_id from nodes limit 100000)); > QUERY PLAN > ----------------------------------------------------------------------------------------------------------------------------------------- > Aggregate (cost=1718.59..1718.60 rows=1 width=0) (actual time=509.126..509.127 rows=1 loops=1) > InitPlan 1 (returns $0) > -> Limit (cost=0.00..1637.04 rows=100000 width=4) (actual time=0.010..76.604 rows=100000 loops=1) > -> Seq Scan on nodes (cost=0.00..12355.41 rows=754741 width=4) (actual time=0.008..38.105 rows=100000 loops=1) > -> Bitmap Heap Scan on nodes (cost=42.67..81.53 rows=10 width=0) (actual time=447.274..484.283 rows=100000 loops=1) > Recheck Cond: (node_id = ANY ($0)) > -> Bitmap Index Scan on n_node_id_index (cost=0.00..42.67 rows=10 width=0) (actual time=447.074..447.074 rows=100000loops=1) > Index Cond: (node_id = ANY ($0)) > Total runtime: 509.209 ms > (9 rows) > > Time: 510.009 ms > > > logicops2=# explain analyze select count(*) from nodes where node_id in (select node_id from nodes limit 100000); > QUERY PLAN > ---------------------------------------------------------------------------------------------------------------------------------------- > Aggregate (cost=3017.17..3017.18 rows=1 width=0) (actual time=1052.866..1052.866 rows=1 loops=1) > -> Nested Loop (cost=2887.04..3016.67 rows=200 width=0) (actual time=167.310..1021.540 rows=100000 loops=1) > -> HashAggregate (cost=2887.04..2889.04 rows=200 width=4) (actual time=167.198..251.205 rows=100000 loops=1) > -> Limit (cost=0.00..1637.04 rows=100000 width=4) (actual time=0.008..80.090 rows=100000 loops=1) > -> Seq Scan on nodes (cost=0.00..12355.41 rows=754741 width=4) (actual time=0.007..41.566 rows=100000loops=1) > -> Index Scan using n_node_id_index on nodes (cost=0.00..0.63 rows=1 width=4) (actual time=0.006..0.007 rows=1loops=100000) > Index Cond: (public.nodes.node_id = public.nodes.node_id) > Total runtime: 1053.523 ms > (8 rows) > > Time: 1054.864 ms > > > > On Mar 20, 2011, at 2:51 AM, Pavel Stehule wrote: > >> Hello >> >> 2011/3/20 Adam Tistler <atistler@gmail.com>: >>> I have noticed that SELECT ... = ANY(ARRAY(...)) is about twice as fast as SELECT IN ( ... ). >>> Can anyone explain a reason for this? Results are the bottom and are reproducible. I can test with other versions ifthat is necessary. >>> >> >> send a result of EXPLAIN ANALYZE SELECT ..., please >> >> The reasons can be different - less seq scans, indexes >> >> Regards >> >> Pavel Stehule >> >> >> >>> ./configure --prefix=/usr/local/pgsql84 --with-openssl --with-perl >>> CentOS release 5.4 (Final) >>> psql (PostgreSQL) 8.4.1 >>> >>> prompt2=# select count(*) from nodes; >>> count >>> -------- >>> 754734 >>> (1 row) >>> >>> >>> prompt2=# \d nodes >>> Table "public.nodes" >>> Column | Type | Modifiers >>> --------------+--------------------------+----------------------------------------------------------- >>> node_id | integer | not null default nextval(('node_id_seq'::text)::regclass) >>> node_type_id | integer | not null >>> template_id | integer | not null >>> timestamp | timestamp with time zone | default ('now'::text)::timestamp(6) with time zone >>> Indexes: >>> "nodes_pkey" PRIMARY KEY, btree (node_id) >>> "n_node_id_index" btree (node_id) >>> "n_node_type_id_index" btree (node_type_id) >>> "n_template_id_index" btree (template_id) >>> >>> prompt2=# select count(*) from nodes where node_id = any( Array(select node_id from nodes limit 100000) ); >>> count >>> -------- >>> 100000 >>> (1 row) >>> >>> Time: 404.530 ms >>> prompt2=# select count(*) from nodes where node_id = any( Array(select node_id from nodes limit 100000) ); >>> count >>> -------- >>> 100000 >>> (1 row) >>> >>> Time: 407.316 ms >>> prompt2=# select count(*) from nodes where node_id = any( Array(select node_id from nodes limit 100000) ); >>> count >>> -------- >>> 100000 >>> (1 row) >>> >>> Time: 408.728 ms >>> prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 ); >>> count >>> -------- >>> 100000 >>> (1 row) >>> >>> Time: 793.840 ms >>> prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 ); >>> count >>> -------- >>> 100000 >>> (1 row) >>> >>> Time: 779.137 ms >>> prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 ); >>> count >>> -------- >>> 100000 >>> (1 row) >>> >>> Time: 781.820 ms >>> >>> >>> -- >>> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) >>> To make changes to your subscription: >>> http://www.postgresql.org/mailpref/pgsql-performance >>> > >
Pavel, thanks for the help. I increased work_mem from 16MB to 64MB, no difference. The queries are really just a test case. My actual queries are actualjust large number of primary keys that I am selecting from the db: For example: select * from nodes where node_id in ( 1, 2, 3 ..... ) I found that even for small queries, the following is faster: select * from nodes where node_in = any (array[1,2,3 .... ]) Its not really a big deal to me, I was just wondering if others could reproduce it on other systems/versions and if perhapsthis is an issue that I should point out to postgres-dev. Results below: logicops2=# explain analyze select count(*) from nodes where node_id in ( select node_id from nodes limit 100000 ); QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=3017.18..3017.19 rows=1 width=0) (actual time=1017.051..1017.051 rows=1 loops=1) -> Nested Loop (cost=2887.05..3016.68 rows=200 width=0) (actual time=157.290..986.329 rows=100000 loops=1) -> HashAggregate (cost=2887.05..2889.05 rows=200 width=4) (actual time=157.252..241.995 rows=100000 loops=1) -> Limit (cost=0.00..1637.05 rows=100000 width=4) (actual time=0.009..73.942 rows=100000 loops=1) -> Seq Scan on nodes (cost=0.00..12355.34 rows=754734 width=4) (actual time=0.008..35.428 rows=100000loops=1) -> Index Scan using n_node_id_index on nodes (cost=0.00..0.63 rows=1 width=4) (actual time=0.006..0.006 rows=1loops=100000) Index Cond: (public.nodes.node_id = public.nodes.node_id) Total runtime: 1017.794 ms (8 rows) logicops2=# explain analyze select count(*) from nodes where node_id = any(array ( select node_id from nodes limit 100000)); QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=1718.60..1718.61 rows=1 width=0) (actual time=485.554..485.555 rows=1 loops=1) InitPlan 1 (returns $0) -> Limit (cost=0.00..1637.05 rows=100000 width=4) (actual time=0.011..73.037 rows=100000 loops=1) -> Seq Scan on nodes (cost=0.00..12355.34 rows=754734 width=4) (actual time=0.010..34.462 rows=100000 loops=1) -> Bitmap Heap Scan on nodes (cost=42.67..81.53 rows=10 width=0) (actual time=433.003..461.108 rows=100000 loops=1) Recheck Cond: (node_id = ANY ($0)) -> Bitmap Index Scan on n_node_id_index (cost=0.00..42.67 rows=10 width=0) (actual time=432.810..432.810 rows=100000loops=1) Index Cond: (node_id = ANY ($0)) Total runtime: 485.638 ms (9 rows) On Mar 21, 2011, at 1:54 AM, Pavel Stehule wrote: > Hello > > I think so HashAggregate goes out of memory - you can try to increase > a work_mem. > > There are better queries for counting duplicit then cross join > > Regards > > Pavel Stehule > > 2011/3/21 Adam Tistler <atistler@gmail.com>: >> logicops2=# explain analyze select count(*) from nodes where node_id = any( Array(select node_id from nodes limit 100000)); >> QUERY PLAN >> ----------------------------------------------------------------------------------------------------------------------------------------- >> Aggregate (cost=1718.59..1718.60 rows=1 width=0) (actual time=509.126..509.127 rows=1 loops=1) >> InitPlan 1 (returns $0) >> -> Limit (cost=0.00..1637.04 rows=100000 width=4) (actual time=0.010..76.604 rows=100000 loops=1) >> -> Seq Scan on nodes (cost=0.00..12355.41 rows=754741 width=4) (actual time=0.008..38.105 rows=100000 loops=1) >> -> Bitmap Heap Scan on nodes (cost=42.67..81.53 rows=10 width=0) (actual time=447.274..484.283 rows=100000 loops=1) >> Recheck Cond: (node_id = ANY ($0)) >> -> Bitmap Index Scan on n_node_id_index (cost=0.00..42.67 rows=10 width=0) (actual time=447.074..447.074 rows=100000loops=1) >> Index Cond: (node_id = ANY ($0)) >> Total runtime: 509.209 ms >> (9 rows) >> >> Time: 510.009 ms >> >> >> logicops2=# explain analyze select count(*) from nodes where node_id in (select node_id from nodes limit 100000); >> QUERY PLAN >> ---------------------------------------------------------------------------------------------------------------------------------------- >> Aggregate (cost=3017.17..3017.18 rows=1 width=0) (actual time=1052.866..1052.866 rows=1 loops=1) >> -> Nested Loop (cost=2887.04..3016.67 rows=200 width=0) (actual time=167.310..1021.540 rows=100000 loops=1) >> -> HashAggregate (cost=2887.04..2889.04 rows=200 width=4) (actual time=167.198..251.205 rows=100000 loops=1) >> -> Limit (cost=0.00..1637.04 rows=100000 width=4) (actual time=0.008..80.090 rows=100000 loops=1) >> -> Seq Scan on nodes (cost=0.00..12355.41 rows=754741 width=4) (actual time=0.007..41.566 rows=100000loops=1) >> -> Index Scan using n_node_id_index on nodes (cost=0.00..0.63 rows=1 width=4) (actual time=0.006..0.007 rows=1loops=100000) >> Index Cond: (public.nodes.node_id = public.nodes.node_id) >> Total runtime: 1053.523 ms >> (8 rows) >> >> Time: 1054.864 ms >> >> >> >> On Mar 20, 2011, at 2:51 AM, Pavel Stehule wrote: >> >>> Hello >>> >>> 2011/3/20 Adam Tistler <atistler@gmail.com>: >>>> I have noticed that SELECT ... = ANY(ARRAY(...)) is about twice as fast as SELECT IN ( ... ). >>>> Can anyone explain a reason for this? Results are the bottom and are reproducible. I can test with other versionsif that is necessary. >>>> >>> >>> send a result of EXPLAIN ANALYZE SELECT ..., please >>> >>> The reasons can be different - less seq scans, indexes >>> >>> Regards >>> >>> Pavel Stehule >>> >>> >>> >>>> ./configure --prefix=/usr/local/pgsql84 --with-openssl --with-perl >>>> CentOS release 5.4 (Final) >>>> psql (PostgreSQL) 8.4.1 >>>> >>>> prompt2=# select count(*) from nodes; >>>> count >>>> -------- >>>> 754734 >>>> (1 row) >>>> >>>> >>>> prompt2=# \d nodes >>>> Table "public.nodes" >>>> Column | Type | Modifiers >>>> --------------+--------------------------+----------------------------------------------------------- >>>> node_id | integer | not null default nextval(('node_id_seq'::text)::regclass) >>>> node_type_id | integer | not null >>>> template_id | integer | not null >>>> timestamp | timestamp with time zone | default ('now'::text)::timestamp(6) with time zone >>>> Indexes: >>>> "nodes_pkey" PRIMARY KEY, btree (node_id) >>>> "n_node_id_index" btree (node_id) >>>> "n_node_type_id_index" btree (node_type_id) >>>> "n_template_id_index" btree (template_id) >>>> >>>> prompt2=# select count(*) from nodes where node_id = any( Array(select node_id from nodes limit 100000) ); >>>> count >>>> -------- >>>> 100000 >>>> (1 row) >>>> >>>> Time: 404.530 ms >>>> prompt2=# select count(*) from nodes where node_id = any( Array(select node_id from nodes limit 100000) ); >>>> count >>>> -------- >>>> 100000 >>>> (1 row) >>>> >>>> Time: 407.316 ms >>>> prompt2=# select count(*) from nodes where node_id = any( Array(select node_id from nodes limit 100000) ); >>>> count >>>> -------- >>>> 100000 >>>> (1 row) >>>> >>>> Time: 408.728 ms >>>> prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 ); >>>> count >>>> -------- >>>> 100000 >>>> (1 row) >>>> >>>> Time: 793.840 ms >>>> prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 ); >>>> count >>>> -------- >>>> 100000 >>>> (1 row) >>>> >>>> Time: 779.137 ms >>>> prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 ); >>>> count >>>> -------- >>>> 100000 >>>> (1 row) >>>> >>>> Time: 781.820 ms >>>> >>>> >>>> -- >>>> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) >>>> To make changes to your subscription: >>>> http://www.postgresql.org/mailpref/pgsql-performance >>>> >> >>
2011/3/21 Adam Tistler <atistler@gmail.com>: > Pavel, thanks for the help. > > I increased work_mem from 16MB to 64MB, no difference. The queries are really just a test case. My actual queries areactual just large number of primary keys that I am selecting from the db: > > For example: > select * from nodes where node_id in ( 1, 2, 3 ..... ) > > I found that even for small queries, the following is faster: > select * from nodes where node_in = any (array[1,2,3 .... ]) it depends on version. I think so on last postgres, these queries are same, not sure. Regards Pavel > > > Its not really a big deal to me, I was just wondering if others could reproduce it on other systems/versions and if perhapsthis is an issue that I should point out to postgres-dev. > > > Results below: > > logicops2=# explain analyze select count(*) from nodes where node_id in ( select node_id from nodes limit 100000 ); > QUERY PLAN > ---------------------------------------------------------------------------------------------------------------------------------------- > Aggregate (cost=3017.18..3017.19 rows=1 width=0) (actual time=1017.051..1017.051 rows=1 loops=1) > -> Nested Loop (cost=2887.05..3016.68 rows=200 width=0) (actual time=157.290..986.329 rows=100000 loops=1) > -> HashAggregate (cost=2887.05..2889.05 rows=200 width=4) (actual time=157.252..241.995 rows=100000 loops=1) > -> Limit (cost=0.00..1637.05 rows=100000 width=4) (actual time=0.009..73.942 rows=100000 loops=1) > -> Seq Scan on nodes (cost=0.00..12355.34 rows=754734 width=4) (actual time=0.008..35.428 rows=100000loops=1) > -> Index Scan using n_node_id_index on nodes (cost=0.00..0.63 rows=1 width=4) (actual time=0.006..0.006 rows=1loops=100000) > Index Cond: (public.nodes.node_id = public.nodes.node_id) > Total runtime: 1017.794 ms > (8 rows) > > logicops2=# explain analyze select count(*) from nodes where node_id = any(array ( select node_id from nodes limit 100000)); > QUERY PLAN > ----------------------------------------------------------------------------------------------------------------------------------------- > Aggregate (cost=1718.60..1718.61 rows=1 width=0) (actual time=485.554..485.555 rows=1 loops=1) > InitPlan 1 (returns $0) > -> Limit (cost=0.00..1637.05 rows=100000 width=4) (actual time=0.011..73.037 rows=100000 loops=1) > -> Seq Scan on nodes (cost=0.00..12355.34 rows=754734 width=4) (actual time=0.010..34.462 rows=100000 loops=1) > -> Bitmap Heap Scan on nodes (cost=42.67..81.53 rows=10 width=0) (actual time=433.003..461.108 rows=100000 loops=1) > Recheck Cond: (node_id = ANY ($0)) > -> Bitmap Index Scan on n_node_id_index (cost=0.00..42.67 rows=10 width=0) (actual time=432.810..432.810 rows=100000loops=1) > Index Cond: (node_id = ANY ($0)) > Total runtime: 485.638 ms > (9 rows) > > On Mar 21, 2011, at 1:54 AM, Pavel Stehule wrote: > >> Hello >> >> I think so HashAggregate goes out of memory - you can try to increase >> a work_mem. >> >> There are better queries for counting duplicit then cross join >> >> Regards >> >> Pavel Stehule >> >> 2011/3/21 Adam Tistler <atistler@gmail.com>: >>> logicops2=# explain analyze select count(*) from nodes where node_id = any( Array(select node_id from nodes limit 100000)); >>> QUERY PLAN >>> ----------------------------------------------------------------------------------------------------------------------------------------- >>> Aggregate (cost=1718.59..1718.60 rows=1 width=0) (actual time=509.126..509.127 rows=1 loops=1) >>> InitPlan 1 (returns $0) >>> -> Limit (cost=0.00..1637.04 rows=100000 width=4) (actual time=0.010..76.604 rows=100000 loops=1) >>> -> Seq Scan on nodes (cost=0.00..12355.41 rows=754741 width=4) (actual time=0.008..38.105 rows=100000 loops=1) >>> -> Bitmap Heap Scan on nodes (cost=42.67..81.53 rows=10 width=0) (actual time=447.274..484.283 rows=100000 loops=1) >>> Recheck Cond: (node_id = ANY ($0)) >>> -> Bitmap Index Scan on n_node_id_index (cost=0.00..42.67 rows=10 width=0) (actual time=447.074..447.074 rows=100000loops=1) >>> Index Cond: (node_id = ANY ($0)) >>> Total runtime: 509.209 ms >>> (9 rows) >>> >>> Time: 510.009 ms >>> >>> >>> logicops2=# explain analyze select count(*) from nodes where node_id in (select node_id from nodes limit 100000); >>> QUERY PLAN >>> ---------------------------------------------------------------------------------------------------------------------------------------- >>> Aggregate (cost=3017.17..3017.18 rows=1 width=0) (actual time=1052.866..1052.866 rows=1 loops=1) >>> -> Nested Loop (cost=2887.04..3016.67 rows=200 width=0) (actual time=167.310..1021.540 rows=100000 loops=1) >>> -> HashAggregate (cost=2887.04..2889.04 rows=200 width=4) (actual time=167.198..251.205 rows=100000 loops=1) >>> -> Limit (cost=0.00..1637.04 rows=100000 width=4) (actual time=0.008..80.090 rows=100000 loops=1) >>> -> Seq Scan on nodes (cost=0.00..12355.41 rows=754741 width=4) (actual time=0.007..41.566 rows=100000loops=1) >>> -> Index Scan using n_node_id_index on nodes (cost=0.00..0.63 rows=1 width=4) (actual time=0.006..0.007 rows=1loops=100000) >>> Index Cond: (public.nodes.node_id = public.nodes.node_id) >>> Total runtime: 1053.523 ms >>> (8 rows) >>> >>> Time: 1054.864 ms >>> >>> >>> >>> On Mar 20, 2011, at 2:51 AM, Pavel Stehule wrote: >>> >>>> Hello >>>> >>>> 2011/3/20 Adam Tistler <atistler@gmail.com>: >>>>> I have noticed that SELECT ... = ANY(ARRAY(...)) is about twice as fast as SELECT IN ( ... ). >>>>> Can anyone explain a reason for this? Results are the bottom and are reproducible. I can test with other versionsif that is necessary. >>>>> >>>> >>>> send a result of EXPLAIN ANALYZE SELECT ..., please >>>> >>>> The reasons can be different - less seq scans, indexes >>>> >>>> Regards >>>> >>>> Pavel Stehule >>>> >>>> >>>> >>>>> ./configure --prefix=/usr/local/pgsql84 --with-openssl --with-perl >>>>> CentOS release 5.4 (Final) >>>>> psql (PostgreSQL) 8.4.1 >>>>> >>>>> prompt2=# select count(*) from nodes; >>>>> count >>>>> -------- >>>>> 754734 >>>>> (1 row) >>>>> >>>>> >>>>> prompt2=# \d nodes >>>>> Table "public.nodes" >>>>> Column | Type | Modifiers >>>>> --------------+--------------------------+----------------------------------------------------------- >>>>> node_id | integer | not null default nextval(('node_id_seq'::text)::regclass) >>>>> node_type_id | integer | not null >>>>> template_id | integer | not null >>>>> timestamp | timestamp with time zone | default ('now'::text)::timestamp(6) with time zone >>>>> Indexes: >>>>> "nodes_pkey" PRIMARY KEY, btree (node_id) >>>>> "n_node_id_index" btree (node_id) >>>>> "n_node_type_id_index" btree (node_type_id) >>>>> "n_template_id_index" btree (template_id) >>>>> >>>>> prompt2=# select count(*) from nodes where node_id = any( Array(select node_id from nodes limit 100000) ); >>>>> count >>>>> -------- >>>>> 100000 >>>>> (1 row) >>>>> >>>>> Time: 404.530 ms >>>>> prompt2=# select count(*) from nodes where node_id = any( Array(select node_id from nodes limit 100000) ); >>>>> count >>>>> -------- >>>>> 100000 >>>>> (1 row) >>>>> >>>>> Time: 407.316 ms >>>>> prompt2=# select count(*) from nodes where node_id = any( Array(select node_id from nodes limit 100000) ); >>>>> count >>>>> -------- >>>>> 100000 >>>>> (1 row) >>>>> >>>>> Time: 408.728 ms >>>>> prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 ); >>>>> count >>>>> -------- >>>>> 100000 >>>>> (1 row) >>>>> >>>>> Time: 793.840 ms >>>>> prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 ); >>>>> count >>>>> -------- >>>>> 100000 >>>>> (1 row) >>>>> >>>>> Time: 779.137 ms >>>>> prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 ); >>>>> count >>>>> -------- >>>>> 100000 >>>>> (1 row) >>>>> >>>>> Time: 781.820 ms >>>>> >>>>> >>>>> -- >>>>> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) >>>>> To make changes to your subscription: >>>>> http://www.postgresql.org/mailpref/pgsql-performance >>>>> >>> >>> > >
> -----Original Message----- > From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance- > owner@postgresql.org] On Behalf Of Adam Tistler > Sent: Monday, March 21, 2011 12:17 AM > To: Pavel Stehule > Cc: pgsql-performance@postgresql.org > Subject: Re: [PERFORM] Select in subselect vs select = any array > > Pavel, thanks for the help. > > I increased work_mem from 16MB to 64MB, no difference. The queries are > really just a test case. My actual queries are actual just large > number of primary keys that I am selecting from the db: > > For example: > select * from nodes where node_id in ( 1, 2, 3 ..... ) > What does "large" number of primary keys mean ? I have seen some "odd" things happen when I passed, carelessly, tens of thousands of items to an in list for a generated query, but I don't get the feeling that isn't the case here. ..: Mark
On Sun, Mar 20, 2011 at 11:20 PM, Adam Tistler <atistler@gmail.com> wrote: > logicops2=# explain analyze select count(*) from nodes where node_id = any( Array(select node_id from nodes limit 100000)); > QUERY PLAN > ----------------------------------------------------------------------------------------------------------------------------------------- > Aggregate (cost=1718.59..1718.60 rows=1 width=0) (actual time=509.126..509.127 rows=1 loops=1) > InitPlan 1 (returns $0) > -> Limit (cost=0.00..1637.04 rows=100000 width=4) (actual time=0.010..76.604 rows=100000 loops=1) > -> Seq Scan on nodes (cost=0.00..12355.41 rows=754741 width=4) (actual time=0.008..38.105 rows=100000 loops=1) > -> Bitmap Heap Scan on nodes (cost=42.67..81.53 rows=10 width=0) (actual time=447.274..484.283 rows=100000 loops=1) > Recheck Cond: (node_id = ANY ($0)) > -> Bitmap Index Scan on n_node_id_index (cost=0.00..42.67 rows=10 width=0) (actual time=447.074..447.074 rows=100000loops=1) > Index Cond: (node_id = ANY ($0)) > Total runtime: 509.209 ms > (9 rows) > > Time: 510.009 ms > > > logicops2=# explain analyze select count(*) from nodes where node_id in (select node_id from nodes limit 100000); > QUERY PLAN > ---------------------------------------------------------------------------------------------------------------------------------------- > Aggregate (cost=3017.17..3017.18 rows=1 width=0) (actual time=1052.866..1052.866 rows=1 loops=1) > -> Nested Loop (cost=2887.04..3016.67 rows=200 width=0) (actual time=167.310..1021.540 rows=100000 loops=1) > -> HashAggregate (cost=2887.04..2889.04 rows=200 width=4) (actual time=167.198..251.205 rows=100000 loops=1) > -> Limit (cost=0.00..1637.04 rows=100000 width=4) (actual time=0.008..80.090 rows=100000 loops=1) > -> Seq Scan on nodes (cost=0.00..12355.41 rows=754741 width=4) (actual time=0.007..41.566 rows=100000loops=1) > -> Index Scan using n_node_id_index on nodes (cost=0.00..0.63 rows=1 width=4) (actual time=0.006..0.007 rows=1loops=100000) > Index Cond: (public.nodes.node_id = public.nodes.node_id) > Total runtime: 1053.523 ms > (8 rows) > > Time: 1054.864 ms This is a pretty interesting example. I think this is just an optimizer limitation. When trying to build a join tree (in this case, between the copy of nodes inside the subselect and the copy outside the subselect), the planner considers three main join strategies: hash join, nested loop, merge join. A merge or hash join will have to read the outside-the-subselect copy of nodes in its entirety (I think); the only way to avoid that is to compute the subselect first and then use the index probes to pull out just the matching rows. That's what the planner did in both cases, but in the second case it's not smart enough to see that it can gather up all the values from the inner side of the join and shove them into a bitmap index scan all at once, so it just uses a regular index scan to pull 'em out one at a time. I think this would be pretty tricky to support, since the join node would need to understand all the parameter passing that needs to happen between the inner and outer sides of the loop; it's almost like a whole new join type. You might also want to make the opposite transformation, turning the first plan into the second one, if (for example) the subselect is going to return a gigabyte of data. But we're just not that smart. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company