Thread: Select in subselect vs select = any array

Select in subselect vs select = any array

From
Adam Tistler
Date:
I have noticed that SELECT ... = ANY(ARRAY(...))  is about twice as fast as SELECT IN ( ... ).
Can anyone explain a reason for this?  Results are the bottom and are reproducible.  I can test with other versions if
thatis necessary. 

./configure --prefix=/usr/local/pgsql84 --with-openssl --with-perl
CentOS release 5.4 (Final)
psql (PostgreSQL) 8.4.1

prompt2=# select count(*) from nodes;
 count
--------
 754734
(1 row)


prompt2=# \d nodes
                                        Table "public.nodes"
    Column    |           Type           |                         Modifiers
--------------+--------------------------+-----------------------------------------------------------
 node_id      | integer                  | not null default nextval(('node_id_seq'::text)::regclass)
 node_type_id | integer                  | not null
 template_id  | integer                  | not null
 timestamp    | timestamp with time zone | default ('now'::text)::timestamp(6) with time zone
Indexes:
    "nodes_pkey" PRIMARY KEY, btree (node_id)
    "n_node_id_index" btree (node_id)
    "n_node_type_id_index" btree (node_type_id)
    "n_template_id_index" btree (template_id)

prompt2=# select count(*) from nodes where node_id = any(  Array(select node_id from nodes limit 100000) );
 count
--------
 100000
(1 row)

Time: 404.530 ms
prompt2=# select count(*) from nodes where node_id = any(  Array(select node_id from nodes limit 100000) );
 count
--------
 100000
(1 row)

Time: 407.316 ms
prompt2=# select count(*) from nodes where node_id = any(  Array(select node_id from nodes limit 100000) );
 count
--------
 100000
(1 row)

Time: 408.728 ms
prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 );
 count
--------
 100000
(1 row)

Time: 793.840 ms
prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 );
 count
--------
 100000
(1 row)

Time: 779.137 ms
prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 );
 count
--------
 100000
(1 row)

Time: 781.820 ms


Re: Select in subselect vs select = any array

From
Pavel Stehule
Date:
Hello

2011/3/20 Adam Tistler <atistler@gmail.com>:
> I have noticed that SELECT ... = ANY(ARRAY(...))  is about twice as fast as SELECT IN ( ... ).
> Can anyone explain a reason for this?  Results are the bottom and are reproducible.  I can test with other versions
ifthat is necessary. 
>

send a result of EXPLAIN ANALYZE SELECT ..., please

The reasons can be different - less seq scans, indexes

Regards

Pavel Stehule



> ./configure --prefix=/usr/local/pgsql84 --with-openssl --with-perl
> CentOS release 5.4 (Final)
> psql (PostgreSQL) 8.4.1
>
> prompt2=# select count(*) from nodes;
>  count
> --------
>  754734
> (1 row)
>
>
> prompt2=# \d nodes
>                                        Table "public.nodes"
>    Column    |           Type           |                         Modifiers
> --------------+--------------------------+-----------------------------------------------------------
>  node_id      | integer                  | not null default nextval(('node_id_seq'::text)::regclass)
>  node_type_id | integer                  | not null
>  template_id  | integer                  | not null
>  timestamp    | timestamp with time zone | default ('now'::text)::timestamp(6) with time zone
> Indexes:
>    "nodes_pkey" PRIMARY KEY, btree (node_id)
>    "n_node_id_index" btree (node_id)
>    "n_node_type_id_index" btree (node_type_id)
>    "n_template_id_index" btree (template_id)
>
> prompt2=# select count(*) from nodes where node_id = any(  Array(select node_id from nodes limit 100000) );
>  count
> --------
>  100000
> (1 row)
>
> Time: 404.530 ms
> prompt2=# select count(*) from nodes where node_id = any(  Array(select node_id from nodes limit 100000) );
>  count
> --------
>  100000
> (1 row)
>
> Time: 407.316 ms
> prompt2=# select count(*) from nodes where node_id = any(  Array(select node_id from nodes limit 100000) );
>  count
> --------
>  100000
> (1 row)
>
> Time: 408.728 ms
> prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 );
>  count
> --------
>  100000
> (1 row)
>
> Time: 793.840 ms
> prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 );
>  count
> --------
>  100000
> (1 row)
>
> Time: 779.137 ms
> prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 );
>  count
> --------
>  100000
> (1 row)
>
> Time: 781.820 ms
>
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>

Re: Select in subselect vs select = any array

From
Adam Tistler
Date:
logicops2=# explain analyze select count(*) from nodes where node_id = any(  Array(select node_id from nodes limit
100000)); 
                                                               QUERY PLAN
                 

-----------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1718.59..1718.60 rows=1 width=0) (actual time=509.126..509.127 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.00..1637.04 rows=100000 width=4) (actual time=0.010..76.604 rows=100000 loops=1)
           ->  Seq Scan on nodes  (cost=0.00..12355.41 rows=754741 width=4) (actual time=0.008..38.105 rows=100000
loops=1)
   ->  Bitmap Heap Scan on nodes  (cost=42.67..81.53 rows=10 width=0) (actual time=447.274..484.283 rows=100000
loops=1)
         Recheck Cond: (node_id = ANY ($0))
         ->  Bitmap Index Scan on n_node_id_index  (cost=0.00..42.67 rows=10 width=0) (actual time=447.074..447.074
rows=100000loops=1) 
               Index Cond: (node_id = ANY ($0))
 Total runtime: 509.209 ms
(9 rows)

Time: 510.009 ms


logicops2=# explain analyze select count(*) from nodes where node_id in (select node_id from nodes limit 100000);
                                                               QUERY PLAN
                

----------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3017.17..3017.18 rows=1 width=0) (actual time=1052.866..1052.866 rows=1 loops=1)
   ->  Nested Loop  (cost=2887.04..3016.67 rows=200 width=0) (actual time=167.310..1021.540 rows=100000 loops=1)
         ->  HashAggregate  (cost=2887.04..2889.04 rows=200 width=4) (actual time=167.198..251.205 rows=100000 loops=1)
               ->  Limit  (cost=0.00..1637.04 rows=100000 width=4) (actual time=0.008..80.090 rows=100000 loops=1)
                     ->  Seq Scan on nodes  (cost=0.00..12355.41 rows=754741 width=4) (actual time=0.007..41.566
rows=100000loops=1) 
         ->  Index Scan using n_node_id_index on nodes  (cost=0.00..0.63 rows=1 width=4) (actual time=0.006..0.007
rows=1loops=100000) 
               Index Cond: (public.nodes.node_id = public.nodes.node_id)
 Total runtime: 1053.523 ms
(8 rows)

Time: 1054.864 ms



On Mar 20, 2011, at 2:51 AM, Pavel Stehule wrote:

> Hello
>
> 2011/3/20 Adam Tistler <atistler@gmail.com>:
>> I have noticed that SELECT ... = ANY(ARRAY(...))  is about twice as fast as SELECT IN ( ... ).
>> Can anyone explain a reason for this?  Results are the bottom and are reproducible.  I can test with other versions
ifthat is necessary. 
>>
>
> send a result of EXPLAIN ANALYZE SELECT ..., please
>
> The reasons can be different - less seq scans, indexes
>
> Regards
>
> Pavel Stehule
>
>
>
>> ./configure --prefix=/usr/local/pgsql84 --with-openssl --with-perl
>> CentOS release 5.4 (Final)
>> psql (PostgreSQL) 8.4.1
>>
>> prompt2=# select count(*) from nodes;
>>  count
>> --------
>>  754734
>> (1 row)
>>
>>
>> prompt2=# \d nodes
>>                                        Table "public.nodes"
>>    Column    |           Type           |                         Modifiers
>> --------------+--------------------------+-----------------------------------------------------------
>>  node_id      | integer                  | not null default nextval(('node_id_seq'::text)::regclass)
>>  node_type_id | integer                  | not null
>>  template_id  | integer                  | not null
>>  timestamp    | timestamp with time zone | default ('now'::text)::timestamp(6) with time zone
>> Indexes:
>>    "nodes_pkey" PRIMARY KEY, btree (node_id)
>>    "n_node_id_index" btree (node_id)
>>    "n_node_type_id_index" btree (node_type_id)
>>    "n_template_id_index" btree (template_id)
>>
>> prompt2=# select count(*) from nodes where node_id = any(  Array(select node_id from nodes limit 100000) );
>>  count
>> --------
>>  100000
>> (1 row)
>>
>> Time: 404.530 ms
>> prompt2=# select count(*) from nodes where node_id = any(  Array(select node_id from nodes limit 100000) );
>>  count
>> --------
>>  100000
>> (1 row)
>>
>> Time: 407.316 ms
>> prompt2=# select count(*) from nodes where node_id = any(  Array(select node_id from nodes limit 100000) );
>>  count
>> --------
>>  100000
>> (1 row)
>>
>> Time: 408.728 ms
>> prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 );
>>  count
>> --------
>>  100000
>> (1 row)
>>
>> Time: 793.840 ms
>> prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 );
>>  count
>> --------
>>  100000
>> (1 row)
>>
>> Time: 779.137 ms
>> prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 );
>>  count
>> --------
>>  100000
>> (1 row)
>>
>> Time: 781.820 ms
>>
>>
>> --
>> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-performance
>>


Re: Select in subselect vs select = any array

From
Pavel Stehule
Date:
Hello

I think so HashAggregate goes out of memory - you can try to increase
a work_mem.

There are better queries for counting duplicit then cross join

Regards

Pavel Stehule

2011/3/21 Adam Tistler <atistler@gmail.com>:
> logicops2=# explain analyze select count(*) from nodes where node_id = any(  Array(select node_id from nodes limit
100000)); 
>                                                               QUERY PLAN
>
-----------------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=1718.59..1718.60 rows=1 width=0) (actual time=509.126..509.127 rows=1 loops=1)
>   InitPlan 1 (returns $0)
>     ->  Limit  (cost=0.00..1637.04 rows=100000 width=4) (actual time=0.010..76.604 rows=100000 loops=1)
>           ->  Seq Scan on nodes  (cost=0.00..12355.41 rows=754741 width=4) (actual time=0.008..38.105 rows=100000
loops=1)
>   ->  Bitmap Heap Scan on nodes  (cost=42.67..81.53 rows=10 width=0) (actual time=447.274..484.283 rows=100000
loops=1)
>         Recheck Cond: (node_id = ANY ($0))
>         ->  Bitmap Index Scan on n_node_id_index  (cost=0.00..42.67 rows=10 width=0) (actual time=447.074..447.074
rows=100000loops=1) 
>               Index Cond: (node_id = ANY ($0))
>  Total runtime: 509.209 ms
> (9 rows)
>
> Time: 510.009 ms
>
>
> logicops2=# explain analyze select count(*) from nodes where node_id in (select node_id from nodes limit 100000);
>                                                               QUERY PLAN
>
----------------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=3017.17..3017.18 rows=1 width=0) (actual time=1052.866..1052.866 rows=1 loops=1)
>   ->  Nested Loop  (cost=2887.04..3016.67 rows=200 width=0) (actual time=167.310..1021.540 rows=100000 loops=1)
>         ->  HashAggregate  (cost=2887.04..2889.04 rows=200 width=4) (actual time=167.198..251.205 rows=100000
loops=1)
>               ->  Limit  (cost=0.00..1637.04 rows=100000 width=4) (actual time=0.008..80.090 rows=100000 loops=1)
>                     ->  Seq Scan on nodes  (cost=0.00..12355.41 rows=754741 width=4) (actual time=0.007..41.566
rows=100000loops=1) 
>         ->  Index Scan using n_node_id_index on nodes  (cost=0.00..0.63 rows=1 width=4) (actual time=0.006..0.007
rows=1loops=100000) 
>               Index Cond: (public.nodes.node_id = public.nodes.node_id)
>  Total runtime: 1053.523 ms
> (8 rows)
>
> Time: 1054.864 ms
>
>
>
> On Mar 20, 2011, at 2:51 AM, Pavel Stehule wrote:
>
>> Hello
>>
>> 2011/3/20 Adam Tistler <atistler@gmail.com>:
>>> I have noticed that SELECT ... = ANY(ARRAY(...))  is about twice as fast as SELECT IN ( ... ).
>>> Can anyone explain a reason for this?  Results are the bottom and are reproducible.  I can test with other versions
ifthat is necessary. 
>>>
>>
>> send a result of EXPLAIN ANALYZE SELECT ..., please
>>
>> The reasons can be different - less seq scans, indexes
>>
>> Regards
>>
>> Pavel Stehule
>>
>>
>>
>>> ./configure --prefix=/usr/local/pgsql84 --with-openssl --with-perl
>>> CentOS release 5.4 (Final)
>>> psql (PostgreSQL) 8.4.1
>>>
>>> prompt2=# select count(*) from nodes;
>>>  count
>>> --------
>>>  754734
>>> (1 row)
>>>
>>>
>>> prompt2=# \d nodes
>>>                                        Table "public.nodes"
>>>    Column    |           Type           |                         Modifiers
>>> --------------+--------------------------+-----------------------------------------------------------
>>>  node_id      | integer                  | not null default nextval(('node_id_seq'::text)::regclass)
>>>  node_type_id | integer                  | not null
>>>  template_id  | integer                  | not null
>>>  timestamp    | timestamp with time zone | default ('now'::text)::timestamp(6) with time zone
>>> Indexes:
>>>    "nodes_pkey" PRIMARY KEY, btree (node_id)
>>>    "n_node_id_index" btree (node_id)
>>>    "n_node_type_id_index" btree (node_type_id)
>>>    "n_template_id_index" btree (template_id)
>>>
>>> prompt2=# select count(*) from nodes where node_id = any(  Array(select node_id from nodes limit 100000) );
>>>  count
>>> --------
>>>  100000
>>> (1 row)
>>>
>>> Time: 404.530 ms
>>> prompt2=# select count(*) from nodes where node_id = any(  Array(select node_id from nodes limit 100000) );
>>>  count
>>> --------
>>>  100000
>>> (1 row)
>>>
>>> Time: 407.316 ms
>>> prompt2=# select count(*) from nodes where node_id = any(  Array(select node_id from nodes limit 100000) );
>>>  count
>>> --------
>>>  100000
>>> (1 row)
>>>
>>> Time: 408.728 ms
>>> prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 );
>>>  count
>>> --------
>>>  100000
>>> (1 row)
>>>
>>> Time: 793.840 ms
>>> prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 );
>>>  count
>>> --------
>>>  100000
>>> (1 row)
>>>
>>> Time: 779.137 ms
>>> prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 );
>>>  count
>>> --------
>>>  100000
>>> (1 row)
>>>
>>> Time: 781.820 ms
>>>
>>>
>>> --
>>> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
>>> To make changes to your subscription:
>>> http://www.postgresql.org/mailpref/pgsql-performance
>>>
>
>

Re: Select in subselect vs select = any array

From
Adam Tistler
Date:
Pavel, thanks for the help.

I increased work_mem from 16MB to 64MB, no difference.  The queries are really just a test case.  My actual queries are
actualjust large number of primary keys that I am selecting from the db: 

For example:
   select * from nodes where node_id in ( 1, 2, 3 ..... )

I found that even for small queries, the following is faster:
   select * from nodes where node_in = any (array[1,2,3 .... ])


Its not really a big deal to me, I was just wondering if others could reproduce it on other systems/versions and if
perhapsthis is an issue that I should point out to postgres-dev. 


Results below:

logicops2=# explain analyze select count(*) from nodes where node_id in ( select node_id from nodes limit 100000 );
                                                               QUERY PLAN
                

----------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3017.18..3017.19 rows=1 width=0) (actual time=1017.051..1017.051 rows=1 loops=1)
   ->  Nested Loop  (cost=2887.05..3016.68 rows=200 width=0) (actual time=157.290..986.329 rows=100000 loops=1)
         ->  HashAggregate  (cost=2887.05..2889.05 rows=200 width=4) (actual time=157.252..241.995 rows=100000 loops=1)
               ->  Limit  (cost=0.00..1637.05 rows=100000 width=4) (actual time=0.009..73.942 rows=100000 loops=1)
                     ->  Seq Scan on nodes  (cost=0.00..12355.34 rows=754734 width=4) (actual time=0.008..35.428
rows=100000loops=1) 
         ->  Index Scan using n_node_id_index on nodes  (cost=0.00..0.63 rows=1 width=4) (actual time=0.006..0.006
rows=1loops=100000) 
               Index Cond: (public.nodes.node_id = public.nodes.node_id)
 Total runtime: 1017.794 ms
(8 rows)

logicops2=# explain analyze select count(*) from nodes where node_id = any(array ( select node_id from nodes limit
100000)); 
                                                               QUERY PLAN
                 

-----------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1718.60..1718.61 rows=1 width=0) (actual time=485.554..485.555 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.00..1637.05 rows=100000 width=4) (actual time=0.011..73.037 rows=100000 loops=1)
           ->  Seq Scan on nodes  (cost=0.00..12355.34 rows=754734 width=4) (actual time=0.010..34.462 rows=100000
loops=1)
   ->  Bitmap Heap Scan on nodes  (cost=42.67..81.53 rows=10 width=0) (actual time=433.003..461.108 rows=100000
loops=1)
         Recheck Cond: (node_id = ANY ($0))
         ->  Bitmap Index Scan on n_node_id_index  (cost=0.00..42.67 rows=10 width=0) (actual time=432.810..432.810
rows=100000loops=1) 
               Index Cond: (node_id = ANY ($0))
 Total runtime: 485.638 ms
(9 rows)

On Mar 21, 2011, at 1:54 AM, Pavel Stehule wrote:

> Hello
>
> I think so HashAggregate goes out of memory - you can try to increase
> a work_mem.
>
> There are better queries for counting duplicit then cross join
>
> Regards
>
> Pavel Stehule
>
> 2011/3/21 Adam Tistler <atistler@gmail.com>:
>> logicops2=# explain analyze select count(*) from nodes where node_id = any(  Array(select node_id from nodes limit
100000)); 
>>                                                               QUERY PLAN
>>
-----------------------------------------------------------------------------------------------------------------------------------------
>>  Aggregate  (cost=1718.59..1718.60 rows=1 width=0) (actual time=509.126..509.127 rows=1 loops=1)
>>   InitPlan 1 (returns $0)
>>     ->  Limit  (cost=0.00..1637.04 rows=100000 width=4) (actual time=0.010..76.604 rows=100000 loops=1)
>>           ->  Seq Scan on nodes  (cost=0.00..12355.41 rows=754741 width=4) (actual time=0.008..38.105 rows=100000
loops=1)
>>   ->  Bitmap Heap Scan on nodes  (cost=42.67..81.53 rows=10 width=0) (actual time=447.274..484.283 rows=100000
loops=1)
>>         Recheck Cond: (node_id = ANY ($0))
>>         ->  Bitmap Index Scan on n_node_id_index  (cost=0.00..42.67 rows=10 width=0) (actual time=447.074..447.074
rows=100000loops=1) 
>>               Index Cond: (node_id = ANY ($0))
>>  Total runtime: 509.209 ms
>> (9 rows)
>>
>> Time: 510.009 ms
>>
>>
>> logicops2=# explain analyze select count(*) from nodes where node_id in (select node_id from nodes limit 100000);
>>                                                               QUERY PLAN
>>
----------------------------------------------------------------------------------------------------------------------------------------
>>  Aggregate  (cost=3017.17..3017.18 rows=1 width=0) (actual time=1052.866..1052.866 rows=1 loops=1)
>>   ->  Nested Loop  (cost=2887.04..3016.67 rows=200 width=0) (actual time=167.310..1021.540 rows=100000 loops=1)
>>         ->  HashAggregate  (cost=2887.04..2889.04 rows=200 width=4) (actual time=167.198..251.205 rows=100000
loops=1)
>>               ->  Limit  (cost=0.00..1637.04 rows=100000 width=4) (actual time=0.008..80.090 rows=100000 loops=1)
>>                     ->  Seq Scan on nodes  (cost=0.00..12355.41 rows=754741 width=4) (actual time=0.007..41.566
rows=100000loops=1) 
>>         ->  Index Scan using n_node_id_index on nodes  (cost=0.00..0.63 rows=1 width=4) (actual time=0.006..0.007
rows=1loops=100000) 
>>               Index Cond: (public.nodes.node_id = public.nodes.node_id)
>>  Total runtime: 1053.523 ms
>> (8 rows)
>>
>> Time: 1054.864 ms
>>
>>
>>
>> On Mar 20, 2011, at 2:51 AM, Pavel Stehule wrote:
>>
>>> Hello
>>>
>>> 2011/3/20 Adam Tistler <atistler@gmail.com>:
>>>> I have noticed that SELECT ... = ANY(ARRAY(...))  is about twice as fast as SELECT IN ( ... ).
>>>> Can anyone explain a reason for this?  Results are the bottom and are reproducible.  I can test with other
versionsif that is necessary. 
>>>>
>>>
>>> send a result of EXPLAIN ANALYZE SELECT ..., please
>>>
>>> The reasons can be different - less seq scans, indexes
>>>
>>> Regards
>>>
>>> Pavel Stehule
>>>
>>>
>>>
>>>> ./configure --prefix=/usr/local/pgsql84 --with-openssl --with-perl
>>>> CentOS release 5.4 (Final)
>>>> psql (PostgreSQL) 8.4.1
>>>>
>>>> prompt2=# select count(*) from nodes;
>>>>  count
>>>> --------
>>>>  754734
>>>> (1 row)
>>>>
>>>>
>>>> prompt2=# \d nodes
>>>>                                        Table "public.nodes"
>>>>    Column    |           Type           |                         Modifiers
>>>> --------------+--------------------------+-----------------------------------------------------------
>>>>  node_id      | integer                  | not null default nextval(('node_id_seq'::text)::regclass)
>>>>  node_type_id | integer                  | not null
>>>>  template_id  | integer                  | not null
>>>>  timestamp    | timestamp with time zone | default ('now'::text)::timestamp(6) with time zone
>>>> Indexes:
>>>>    "nodes_pkey" PRIMARY KEY, btree (node_id)
>>>>    "n_node_id_index" btree (node_id)
>>>>    "n_node_type_id_index" btree (node_type_id)
>>>>    "n_template_id_index" btree (template_id)
>>>>
>>>> prompt2=# select count(*) from nodes where node_id = any(  Array(select node_id from nodes limit 100000) );
>>>>  count
>>>> --------
>>>>  100000
>>>> (1 row)
>>>>
>>>> Time: 404.530 ms
>>>> prompt2=# select count(*) from nodes where node_id = any(  Array(select node_id from nodes limit 100000) );
>>>>  count
>>>> --------
>>>>  100000
>>>> (1 row)
>>>>
>>>> Time: 407.316 ms
>>>> prompt2=# select count(*) from nodes where node_id = any(  Array(select node_id from nodes limit 100000) );
>>>>  count
>>>> --------
>>>>  100000
>>>> (1 row)
>>>>
>>>> Time: 408.728 ms
>>>> prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 );
>>>>  count
>>>> --------
>>>>  100000
>>>> (1 row)
>>>>
>>>> Time: 793.840 ms
>>>> prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 );
>>>>  count
>>>> --------
>>>>  100000
>>>> (1 row)
>>>>
>>>> Time: 779.137 ms
>>>> prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 );
>>>>  count
>>>> --------
>>>>  100000
>>>> (1 row)
>>>>
>>>> Time: 781.820 ms
>>>>
>>>>
>>>> --
>>>> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
>>>> To make changes to your subscription:
>>>> http://www.postgresql.org/mailpref/pgsql-performance
>>>>
>>
>>


Re: Select in subselect vs select = any array

From
Pavel Stehule
Date:
2011/3/21 Adam Tistler <atistler@gmail.com>:
> Pavel, thanks for the help.
>
> I increased work_mem from 16MB to 64MB, no difference.  The queries are really just a test case.  My actual queries
areactual just large number of primary keys that I am selecting from the db: 
>
> For example:
>   select * from nodes where node_id in ( 1, 2, 3 ..... )
>
> I found that even for small queries, the following is faster:
>   select * from nodes where node_in = any (array[1,2,3 .... ])

it depends on version. I think so on last postgres, these queries are
same, not sure.

Regards

Pavel

>
>
> Its not really a big deal to me, I was just wondering if others could reproduce it on other systems/versions and if
perhapsthis is an issue that I should point out to postgres-dev. 
>
>
> Results below:
>
> logicops2=# explain analyze select count(*) from nodes where node_id in ( select node_id from nodes limit 100000 );
>                                                               QUERY PLAN
>
----------------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=3017.18..3017.19 rows=1 width=0) (actual time=1017.051..1017.051 rows=1 loops=1)
>   ->  Nested Loop  (cost=2887.05..3016.68 rows=200 width=0) (actual time=157.290..986.329 rows=100000 loops=1)
>         ->  HashAggregate  (cost=2887.05..2889.05 rows=200 width=4) (actual time=157.252..241.995 rows=100000
loops=1)
>               ->  Limit  (cost=0.00..1637.05 rows=100000 width=4) (actual time=0.009..73.942 rows=100000 loops=1)
>                     ->  Seq Scan on nodes  (cost=0.00..12355.34 rows=754734 width=4) (actual time=0.008..35.428
rows=100000loops=1) 
>         ->  Index Scan using n_node_id_index on nodes  (cost=0.00..0.63 rows=1 width=4) (actual time=0.006..0.006
rows=1loops=100000) 
>               Index Cond: (public.nodes.node_id = public.nodes.node_id)
>  Total runtime: 1017.794 ms
> (8 rows)
>
> logicops2=# explain analyze select count(*) from nodes where node_id = any(array ( select node_id from nodes limit
100000)); 
>                                                               QUERY PLAN
>
-----------------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=1718.60..1718.61 rows=1 width=0) (actual time=485.554..485.555 rows=1 loops=1)
>   InitPlan 1 (returns $0)
>     ->  Limit  (cost=0.00..1637.05 rows=100000 width=4) (actual time=0.011..73.037 rows=100000 loops=1)
>           ->  Seq Scan on nodes  (cost=0.00..12355.34 rows=754734 width=4) (actual time=0.010..34.462 rows=100000
loops=1)
>   ->  Bitmap Heap Scan on nodes  (cost=42.67..81.53 rows=10 width=0) (actual time=433.003..461.108 rows=100000
loops=1)
>         Recheck Cond: (node_id = ANY ($0))
>         ->  Bitmap Index Scan on n_node_id_index  (cost=0.00..42.67 rows=10 width=0) (actual time=432.810..432.810
rows=100000loops=1) 
>               Index Cond: (node_id = ANY ($0))
>  Total runtime: 485.638 ms
> (9 rows)
>
> On Mar 21, 2011, at 1:54 AM, Pavel Stehule wrote:
>
>> Hello
>>
>> I think so HashAggregate goes out of memory - you can try to increase
>> a work_mem.
>>
>> There are better queries for counting duplicit then cross join
>>
>> Regards
>>
>> Pavel Stehule
>>
>> 2011/3/21 Adam Tistler <atistler@gmail.com>:
>>> logicops2=# explain analyze select count(*) from nodes where node_id = any(  Array(select node_id from nodes limit
100000)); 
>>>                                                               QUERY PLAN
>>>
-----------------------------------------------------------------------------------------------------------------------------------------
>>>  Aggregate  (cost=1718.59..1718.60 rows=1 width=0) (actual time=509.126..509.127 rows=1 loops=1)
>>>   InitPlan 1 (returns $0)
>>>     ->  Limit  (cost=0.00..1637.04 rows=100000 width=4) (actual time=0.010..76.604 rows=100000 loops=1)
>>>           ->  Seq Scan on nodes  (cost=0.00..12355.41 rows=754741 width=4) (actual time=0.008..38.105 rows=100000
loops=1)
>>>   ->  Bitmap Heap Scan on nodes  (cost=42.67..81.53 rows=10 width=0) (actual time=447.274..484.283 rows=100000
loops=1)
>>>         Recheck Cond: (node_id = ANY ($0))
>>>         ->  Bitmap Index Scan on n_node_id_index  (cost=0.00..42.67 rows=10 width=0) (actual time=447.074..447.074
rows=100000loops=1) 
>>>               Index Cond: (node_id = ANY ($0))
>>>  Total runtime: 509.209 ms
>>> (9 rows)
>>>
>>> Time: 510.009 ms
>>>
>>>
>>> logicops2=# explain analyze select count(*) from nodes where node_id in (select node_id from nodes limit 100000);
>>>                                                               QUERY PLAN
>>>
----------------------------------------------------------------------------------------------------------------------------------------
>>>  Aggregate  (cost=3017.17..3017.18 rows=1 width=0) (actual time=1052.866..1052.866 rows=1 loops=1)
>>>   ->  Nested Loop  (cost=2887.04..3016.67 rows=200 width=0) (actual time=167.310..1021.540 rows=100000 loops=1)
>>>         ->  HashAggregate  (cost=2887.04..2889.04 rows=200 width=4) (actual time=167.198..251.205 rows=100000
loops=1)
>>>               ->  Limit  (cost=0.00..1637.04 rows=100000 width=4) (actual time=0.008..80.090 rows=100000 loops=1)
>>>                     ->  Seq Scan on nodes  (cost=0.00..12355.41 rows=754741 width=4) (actual time=0.007..41.566
rows=100000loops=1) 
>>>         ->  Index Scan using n_node_id_index on nodes  (cost=0.00..0.63 rows=1 width=4) (actual time=0.006..0.007
rows=1loops=100000) 
>>>               Index Cond: (public.nodes.node_id = public.nodes.node_id)
>>>  Total runtime: 1053.523 ms
>>> (8 rows)
>>>
>>> Time: 1054.864 ms
>>>
>>>
>>>
>>> On Mar 20, 2011, at 2:51 AM, Pavel Stehule wrote:
>>>
>>>> Hello
>>>>
>>>> 2011/3/20 Adam Tistler <atistler@gmail.com>:
>>>>> I have noticed that SELECT ... = ANY(ARRAY(...))  is about twice as fast as SELECT IN ( ... ).
>>>>> Can anyone explain a reason for this?  Results are the bottom and are reproducible.  I can test with other
versionsif that is necessary. 
>>>>>
>>>>
>>>> send a result of EXPLAIN ANALYZE SELECT ..., please
>>>>
>>>> The reasons can be different - less seq scans, indexes
>>>>
>>>> Regards
>>>>
>>>> Pavel Stehule
>>>>
>>>>
>>>>
>>>>> ./configure --prefix=/usr/local/pgsql84 --with-openssl --with-perl
>>>>> CentOS release 5.4 (Final)
>>>>> psql (PostgreSQL) 8.4.1
>>>>>
>>>>> prompt2=# select count(*) from nodes;
>>>>>  count
>>>>> --------
>>>>>  754734
>>>>> (1 row)
>>>>>
>>>>>
>>>>> prompt2=# \d nodes
>>>>>                                        Table "public.nodes"
>>>>>    Column    |           Type           |                         Modifiers
>>>>> --------------+--------------------------+-----------------------------------------------------------
>>>>>  node_id      | integer                  | not null default nextval(('node_id_seq'::text)::regclass)
>>>>>  node_type_id | integer                  | not null
>>>>>  template_id  | integer                  | not null
>>>>>  timestamp    | timestamp with time zone | default ('now'::text)::timestamp(6) with time zone
>>>>> Indexes:
>>>>>    "nodes_pkey" PRIMARY KEY, btree (node_id)
>>>>>    "n_node_id_index" btree (node_id)
>>>>>    "n_node_type_id_index" btree (node_type_id)
>>>>>    "n_template_id_index" btree (template_id)
>>>>>
>>>>> prompt2=# select count(*) from nodes where node_id = any(  Array(select node_id from nodes limit 100000) );
>>>>>  count
>>>>> --------
>>>>>  100000
>>>>> (1 row)
>>>>>
>>>>> Time: 404.530 ms
>>>>> prompt2=# select count(*) from nodes where node_id = any(  Array(select node_id from nodes limit 100000) );
>>>>>  count
>>>>> --------
>>>>>  100000
>>>>> (1 row)
>>>>>
>>>>> Time: 407.316 ms
>>>>> prompt2=# select count(*) from nodes where node_id = any(  Array(select node_id from nodes limit 100000) );
>>>>>  count
>>>>> --------
>>>>>  100000
>>>>> (1 row)
>>>>>
>>>>> Time: 408.728 ms
>>>>> prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 );
>>>>>  count
>>>>> --------
>>>>>  100000
>>>>> (1 row)
>>>>>
>>>>> Time: 793.840 ms
>>>>> prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 );
>>>>>  count
>>>>> --------
>>>>>  100000
>>>>> (1 row)
>>>>>
>>>>> Time: 779.137 ms
>>>>> prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 );
>>>>>  count
>>>>> --------
>>>>>  100000
>>>>> (1 row)
>>>>>
>>>>> Time: 781.820 ms
>>>>>
>>>>>
>>>>> --
>>>>> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
>>>>> To make changes to your subscription:
>>>>> http://www.postgresql.org/mailpref/pgsql-performance
>>>>>
>>>
>>>
>
>

Re: Select in subselect vs select = any array

From
"mark"
Date:
> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-
> owner@postgresql.org] On Behalf Of Adam Tistler
> Sent: Monday, March 21, 2011 12:17 AM
> To: Pavel Stehule
> Cc: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Select in subselect vs select = any array
>
> Pavel, thanks for the help.
>
> I increased work_mem from 16MB to 64MB, no difference.  The queries are
> really just a test case.  My actual queries are actual just large
> number of primary keys that I am selecting from the db:
>
> For example:
>    select * from nodes where node_id in ( 1, 2, 3 ..... )
>

What does "large" number of primary keys mean ?

I have seen some "odd" things happen when I passed, carelessly, tens of
thousands of items to an in list for a generated query, but I don't get the
feeling that isn't the case here.




..: Mark


Re: Select in subselect vs select = any array

From
Robert Haas
Date:
On Sun, Mar 20, 2011 at 11:20 PM, Adam Tistler <atistler@gmail.com> wrote:
> logicops2=# explain analyze select count(*) from nodes where node_id = any(  Array(select node_id from nodes limit
100000)); 
>                                                               QUERY PLAN
>
-----------------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=1718.59..1718.60 rows=1 width=0) (actual time=509.126..509.127 rows=1 loops=1)
>   InitPlan 1 (returns $0)
>     ->  Limit  (cost=0.00..1637.04 rows=100000 width=4) (actual time=0.010..76.604 rows=100000 loops=1)
>           ->  Seq Scan on nodes  (cost=0.00..12355.41 rows=754741 width=4) (actual time=0.008..38.105 rows=100000
loops=1)
>   ->  Bitmap Heap Scan on nodes  (cost=42.67..81.53 rows=10 width=0) (actual time=447.274..484.283 rows=100000
loops=1)
>         Recheck Cond: (node_id = ANY ($0))
>         ->  Bitmap Index Scan on n_node_id_index  (cost=0.00..42.67 rows=10 width=0) (actual time=447.074..447.074
rows=100000loops=1) 
>               Index Cond: (node_id = ANY ($0))
>  Total runtime: 509.209 ms
> (9 rows)
>
> Time: 510.009 ms
>
>
> logicops2=# explain analyze select count(*) from nodes where node_id in (select node_id from nodes limit 100000);
>                                                               QUERY PLAN
>
----------------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=3017.17..3017.18 rows=1 width=0) (actual time=1052.866..1052.866 rows=1 loops=1)
>   ->  Nested Loop  (cost=2887.04..3016.67 rows=200 width=0) (actual time=167.310..1021.540 rows=100000 loops=1)
>         ->  HashAggregate  (cost=2887.04..2889.04 rows=200 width=4) (actual time=167.198..251.205 rows=100000
loops=1)
>               ->  Limit  (cost=0.00..1637.04 rows=100000 width=4) (actual time=0.008..80.090 rows=100000 loops=1)
>                     ->  Seq Scan on nodes  (cost=0.00..12355.41 rows=754741 width=4) (actual time=0.007..41.566
rows=100000loops=1) 
>         ->  Index Scan using n_node_id_index on nodes  (cost=0.00..0.63 rows=1 width=4) (actual time=0.006..0.007
rows=1loops=100000) 
>               Index Cond: (public.nodes.node_id = public.nodes.node_id)
>  Total runtime: 1053.523 ms
> (8 rows)
>
> Time: 1054.864 ms

This is a pretty interesting example.  I think this is just an
optimizer limitation.

When trying to build a join tree (in this case, between the copy of
nodes inside the subselect and the copy outside the subselect), the
planner considers three main join strategies: hash join, nested loop,
merge join.  A merge or hash join will have to read the
outside-the-subselect copy of nodes in its entirety (I think); the
only way to avoid that is to compute the subselect first and then use
the index probes to pull out just the matching rows.  That's what the
planner did in both cases, but in the second case it's not smart
enough to see that it can gather up all the values from the inner side
of the join and shove them into a bitmap index scan all at once, so it
just uses a regular index scan to pull 'em out one at a time.

I think this would be pretty tricky to support, since the join node
would need to understand all the parameter passing that needs to
happen between the inner and outer sides of the loop; it's almost like
a whole new join type.

You might also want to make the opposite transformation, turning the
first plan into the second one, if (for example) the subselect is
going to return a gigabyte of data.  But we're just not that smart.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company