Thread: how do functions affect query plan?

how do functions affect query plan?

From
常超
Date:
Hi,all
I have a table to save received measure data.


CREATE TABLE measure_data
(
  id serial NOT NULL,
  telegram_id integer NOT NULL,
  measure_time timestamp without time zone NOT NULL,
  item_id integer NOT NULL,
  val double precision,
  CONSTRAINT measure_data_pkey PRIMARY KEY (id)
);

CREATE INDEX index_measure_data_telegram_id ON measure_data USING btree (telegram_id);


in my scenario,a telegram contains measure data for multiple data items and timestamps,
BTW,another table is for telegram.

The SQL I used in my application is
  select * from measure_data where telegram_id in(1,2,...,n)
and this query used the index_measure_data_telegram_id index,as expected.

In order to see the performance of my query ,
I used the following query to search the measure data for randomly 30 telegrams.


explain analyze
SELECT md.*
  FROM measure_data md
  where telegram_id in
    (
         SELECT distinct
         trunc((132363-66484) * random() + 66484)
         FROM generate_series(1,30) as s(telegram_id)
    )
  ;

the 132363 and 66484 are the max and min of the telegram id,separately.

What surprised me is that index is not used,instead,a seq scan is performed on measure_data.
Although,intuitively,in this case,it is much wiser to use the index.
Would you please give some clue to why this happened?

"Hash Semi Join  (cost=65.00..539169.32 rows=10277280 width=28) (actual time=76.454..17177.054 rows=9360 loops=1)"
"  Hash Cond: ((md.telegram_id)::double precision = (trunc(((65879::double precision * random()) + 66484::double
precision))))"
"  ->  Seq Scan on measure_data md  (cost=0.00..356682.60 rows=20554560 width=28) (actual time=0.012..13874.809
rows=20554560loops=1)" 
"  ->  Hash  (cost=52.50..52.50 rows=1000 width=8) (actual time=0.062..0.062 rows=30 loops=1)"
"        Buckets: 1024  Batches: 1  Memory Usage: 2kB"
"        ->  HashAggregate  (cost=22.50..42.50 rows=1000 width=0) (actual time=0.048..0.053 rows=30 loops=1)"
"              ->  Function Scan on generate_series s  (cost=0.00..20.00 rows=1000 width=0) (actual time=0.020..0.034
rows=30loops=1)" 
"Total runtime: 17177.527 ms"



Re: how do functions affect query plan?

From
David G Johnston
Date:
常超 wrote
> Hi,all
> I have a table to save received measure data.
>
>
> CREATE TABLE measure_data
> (
>   id serial NOT NULL,
>   telegram_id integer NOT NULL,
>   measure_time timestamp without time zone NOT NULL,
>   item_id integer NOT NULL,
>   val double precision,
>   CONSTRAINT measure_data_pkey PRIMARY KEY (id)
> );
>
> CREATE INDEX index_measure_data_telegram_id ON measure_data USING btree
> (telegram_id);
>
>
> in my scenario,a telegram contains measure data for multiple data items
> and timestamps,
> BTW,another table is for telegram.
>
> The SQL I used in my application is
>   select * from measure_data where telegram_id in(1,2,...,n)
> and this query used the index_measure_data_telegram_id index,as expected.
>
> In order to see the performance of my query ,
> I used the following query to search the measure data for randomly 30
> telegrams.
>
>
> explain analyze
> SELECT md.*
>   FROM measure_data md
>   where telegram_id in
>     (
>          SELECT distinct
>          trunc((132363-66484) * random() + 66484)
>          FROM generate_series(1,30) as s(telegram_id)
>     )
>   ;
>
> the 132363 and 66484 are the max and min of the telegram id,separately.
>
> What surprised me is that index is not used,instead,a seq scan is
> performed on measure_data.
> Although,intuitively,in this case,it is much wiser to use the index.
> Would you please give some clue to why this happened?
>
> "Hash Semi Join  (cost=65.00..539169.32 rows=10277280 width=28) (actual
> time=76.454..17177.054 rows=9360 loops=1)"
> "  Hash Cond: ((md.telegram_id)::double precision = (trunc(((65879::double
> precision * random()) + 66484::double precision))))"
> "  ->  Seq Scan on measure_data md  (cost=0.00..356682.60 rows=20554560
> width=28) (actual time=0.012..13874.809 rows=20554560 loops=1)"
> "  ->  Hash  (cost=52.50..52.50 rows=1000 width=8) (actual
> time=0.062..0.062 rows=30 loops=1)"
> "        Buckets: 1024  Batches: 1  Memory Usage: 2kB"
> "        ->  HashAggregate  (cost=22.50..42.50 rows=1000 width=0) (actual
> time=0.048..0.053 rows=30 loops=1)"
> "              ->  Function Scan on generate_series s  (cost=0.00..20.00
> rows=1000 width=0) (actual time=0.020..0.034 rows=30 loops=1)"
> "Total runtime: 17177.527 ms"

The planner expects to need to return half the table when you provide 1,000
distinct telegram_ids, which is best handled by scanning the whole table
sequentially and tossing out invalid data.

I am curious if the plan will be different if you added a LIMIT 30 to the
sub-query.

The root of the problem is the planner has no way of knowing whether
generate_series is going to return 1 or 1,000,000 rows so by default it (and
all functions) are assumed (by the planner) to return 1,000 rows.  By adding
an explicit limit you can better inform the planner as to how many rows you
are going to be passing up to the parent query and it will hopefully, with
knowledge of only 30 distinct values, use the index.




--
View this message in context:
http://postgresql.1045698.n5.nabble.com/how-do-functions-affect-query-plan-tp5803993p5803996.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.


Re: how do functions affect query plan?

From
changchao
Date:
Hi,David

Seems that the root of evil is in the function(random,trunc),
although I don't know why.

Here is the comparison.

1.w/o function : index is wisely used.(Even without the limit 30 clause)

explain analyze
SELECT md.*
  FROM measure_data md
  where telegram_id in
    (
         SELECT  66484 + (132363-66484)/30 * i
         FROM generate_series(1,30) as s(i)
         limit 30
    )
  ;

"Nested Loop  (cost=10.01..39290.79 rows=10392 width=28) (actual time=0.079..3.490 rows=9360 loops=1)"
"  ->  HashAggregate  (cost=0.83..1.13 rows=30 width=4) (actual time=0.027..0.032 rows=30 loops=1)"
"        ->  Limit  (cost=0.00..0.45 rows=30 width=4) (actual time=0.013..0.020 rows=30 loops=1)"
"              ->  Function Scan on generate_series s  (cost=0.00..15.00 rows=1000 width=4) (actual time=0.011..0.016
rows=30loops=1)" 
"  ->  Bitmap Heap Scan on measure_data md  (cost=9.19..1306.20 rows=346 width=28) (actual time=0.030..0.075 rows=312
loops=30)"
"        Recheck Cond: (telegram_id = ((66484 + (2195 * s.i))))"
"        ->  Bitmap Index Scan on index_measure_data_telegram_id  (cost=0.00..9.10 rows=346 width=0) (actual
time=0.025..0.025rows=312 loops=30)" 
"              Index Cond: (telegram_id = ((66484 + (2195 * s.i))))"
"Total runtime: 3.714 ms"


2.when function is there: seq scan

explain analyze
SELECT md.*
  FROM measure_data md
  where telegram_id in
    (
         SELECT  trunc((132363-66484) * random()) +66484
         FROM generate_series(1,30) as s(i)
         limit 30
    )
 ;


"Hash Join  (cost=1.65..490288.89 rows=10277280 width=28) (actual time=0.169..4894.847 rows=9360 loops=1)"
"  Hash Cond: ((md.telegram_id)::double precision = ((trunc((65879::double precision * random())) + 66484::double
precision)))"
"  ->  Seq Scan on measure_data md  (cost=0.00..356682.60 rows=20554560 width=28) (actual time=0.010..2076.932
rows=20554560loops=1)" 
"  ->  Hash  (cost=1.28..1.28 rows=30 width=8) (actual time=0.041..0.041 rows=30 loops=1)"
"        Buckets: 1024  Batches: 1  Memory Usage: 2kB"
"        ->  HashAggregate  (cost=0.98..1.28 rows=30 width=8) (actual time=0.034..0.036 rows=30 loops=1)"
"              ->  Limit  (cost=0.00..0.60 rows=30 width=0) (actual time=0.016..0.026 rows=30 loops=1)"
"                    ->  Function Scan on generate_series s  (cost=0.00..20.00 rows=1000 width=0) (actual
time=0.015..0.023rows=30 loops=1)" 
"Total runtime: 4895.239 ms"


----------------------------------------
> Date: Wed, 14 May 2014 22:43:24 -0700
> From: david.g.johnston@gmail.com
> To: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] how do functions affect query plan?
>
> 常超 wrote
>> Hi,all
>> I have a table to save received measure data.
>>
>>
>> CREATE TABLE measure_data
>> (
>>   id serial NOT NULL,
>>   telegram_id integer NOT NULL,
>>   measure_time timestamp without time zone NOT NULL,
>>   item_id integer NOT NULL,
>>   val double precision,
>>   CONSTRAINT measure_data_pkey PRIMARY KEY (id)
>> );
>>
>> CREATE INDEX index_measure_data_telegram_id ON measure_data USING btree
>> (telegram_id);
>>
>>
>> in my scenario,a telegram contains measure data for multiple data items
>> and timestamps,
>> BTW,another table is for telegram.
>>
>> The SQL I used in my application is
>>   select * from measure_data where telegram_id in(1,2,...,n)
>> and this query used the index_measure_data_telegram_id index,as expected.
>>
>> In order to see the performance of my query ,
>> I used the following query to search the measure data for randomly 30
>> telegrams.
>>
>>
>> explain analyze
>> SELECT md.*
>>   FROM measure_data md
>>   where telegram_id in
>>     (
>>          SELECT distinct
>>          trunc((132363-66484) * random() + 66484)
>>          FROM generate_series(1,30) as s(telegram_id)
>>     )
>>   ;
>>
>> the 132363 and 66484 are the max and min of the telegram id,separately.
>>
>> What surprised me is that index is not used,instead,a seq scan is
>> performed on measure_data.
>> Although,intuitively,in this case,it is much wiser to use the index.
>> Would you please give some clue to why this happened?
>>
>> "Hash Semi Join  (cost=65.00..539169.32 rows=10277280 width=28) (actual
>> time=76.454..17177.054 rows=9360 loops=1)"
>> "  Hash Cond: ((md.telegram_id)::double precision = (trunc(((65879::double
>> precision * random()) + 66484::double precision))))"
>> "  ->  Seq Scan on measure_data md  (cost=0.00..356682.60 rows=20554560
>> width=28) (actual time=0.012..13874.809 rows=20554560 loops=1)"
>> "  ->  Hash  (cost=52.50..52.50 rows=1000 width=8) (actual
>> time=0.062..0.062 rows=30 loops=1)"
>> "        Buckets: 1024  Batches: 1  Memory Usage: 2kB"
>> "        ->  HashAggregate  (cost=22.50..42.50 rows=1000 width=0) (actual
>> time=0.048..0.053 rows=30 loops=1)"
>> "              ->  Function Scan on generate_series s  (cost=0.00..20.00
>> rows=1000 width=0) (actual time=0.020..0.034 rows=30 loops=1)"
>> "Total runtime: 17177.527 ms"
>
> The planner expects to need to return half the table when you provide 1,000
> distinct telegram_ids, which is best handled by scanning the whole table
> sequentially and tossing out invalid data.
>
> I am curious if the plan will be different if you added a LIMIT 30 to the
> sub-query.
>
> The root of the problem is the planner has no way of knowing whether
> generate_series is going to return 1 or 1,000,000 rows so by default it (and
> all functions) are assumed (by the planner) to return 1,000 rows. By adding
> an explicit limit you can better inform the planner as to how many rows you
> are going to be passing up to the parent query and it will hopefully, with
> knowledge of only 30 distinct values, use the index.
>
>
>
>
> --
> View this message in context:
http://postgresql.1045698.n5.nabble.com/how-do-functions-affect-query-plan-tp5803993p5803996.html
> Sent from the PostgreSQL - performance mailing list archive at Nabble.com.
>
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance

Re: how do functions affect query plan?

From
changchao
Date:

Interestingly,adding type cast made postgresql wiser.
Anyone knows the reason?

1.no type cast
SELECT md.*
  FROM measure_data md
  where telegram_id in (trunc(66484.2),trunc(132362.1 ))


"Seq Scan on measure_data md  (cost=0.00..459455.40 rows=205546 width=28) (actual time=77.144..6458.870 rows=624
loops=1)"
"  Filter: ((telegram_id)::numeric = ANY ('{66484,132362}'::numeric[]))"
"  Rows Removed by Filter: 20553936"
"Total runtime: 6458.921 ms"


2.type cast

SELECT md.*
  FROM measure_data md
  where telegram_id in (trunc(66484.2)::int,trunc(132362.1 )::int)

"Bitmap Heap Scan on measure_data md  (cost=16.06..2618.86 rows=684 width=28) (actual time=0.076..0.154 rows=624
loops=1)"
"  Recheck Cond: (telegram_id = ANY ('{66484,132362}'::integer[]))"
"  ->  Bitmap Index Scan on index_measure_data_telegram_id  (cost=0.00..15.88 rows=684 width=0) (actual
time=0.065..0.065rows=624 loops=1)" 
"        Index Cond: (telegram_id = ANY ('{66484,132362}'::integer[]))"
"Total runtime: 0.187 ms"


----------------------------------------
> From: chang-chao@hotmail.com
> To: david.g.johnston@gmail.com; pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] how do functions affect query plan?
> Date: Thu, 15 May 2014 15:19:13 +0900
>
> Hi,David
>
> Seems that the root of evil is in the function(random,trunc),
> although I don't know why.
>
> Here is the comparison.
>
> 1.w/o function : index is wisely used.(Even without the limit 30 clause)
>
> explain analyze
> SELECT md.*
>   FROM measure_data md
>   where telegram_id in
>     (
>          SELECT  66484 + (132363-66484)/30 * i
>          FROM generate_series(1,30) as s(i)
>          limit 30
>     )
>   ;
>
> "Nested Loop  (cost=10.01..39290.79 rows=10392 width=28) (actual time=0.079..3.490 rows=9360 loops=1)"
> "  ->  HashAggregate  (cost=0.83..1.13 rows=30 width=4) (actual time=0.027..0.032 rows=30 loops=1)"
> "        ->  Limit  (cost=0.00..0.45 rows=30 width=4) (actual time=0.013..0.020 rows=30 loops=1)"
> "              ->  Function Scan on generate_series s  (cost=0.00..15.00 rows=1000 width=4) (actual time=0.011..0.016
rows=30loops=1)" 
> "  ->  Bitmap Heap Scan on measure_data md  (cost=9.19..1306.20 rows=346 width=28) (actual time=0.030..0.075 rows=312
loops=30)"
> "        Recheck Cond: (telegram_id = ((66484 + (2195 * s.i))))"
> "        ->  Bitmap Index Scan on index_measure_data_telegram_id  (cost=0.00..9.10 rows=346 width=0) (actual
time=0.025..0.025rows=312 loops=30)" 
> "              Index Cond: (telegram_id = ((66484 + (2195 * s.i))))"
> "Total runtime: 3.714 ms"
>
>
> 2.when function is there: seq scan
>
> explain analyze
> SELECT md.*
>   FROM measure_data md
>   where telegram_id in
>     (
>          SELECT  trunc((132363-66484) * random()) +66484
>          FROM generate_series(1,30) as s(i)
>          limit 30
>     )
>  ;
>
>
> "Hash Join  (cost=1.65..490288.89 rows=10277280 width=28) (actual time=0.169..4894.847 rows=9360 loops=1)"
> "  Hash Cond: ((md.telegram_id)::double precision = ((trunc((65879::double precision * random())) + 66484::double
precision)))"
> "  ->  Seq Scan on measure_data md  (cost=0.00..356682.60 rows=20554560 width=28) (actual time=0.010..2076.932
rows=20554560loops=1)" 
> "  ->  Hash  (cost=1.28..1.28 rows=30 width=8) (actual time=0.041..0.041 rows=30 loops=1)"
> "        Buckets: 1024  Batches: 1  Memory Usage: 2kB"
> "        ->  HashAggregate  (cost=0.98..1.28 rows=30 width=8) (actual time=0.034..0.036 rows=30 loops=1)"
> "              ->  Limit  (cost=0.00..0.60 rows=30 width=0) (actual time=0.016..0.026 rows=30 loops=1)"
> "                    ->  Function Scan on generate_series s  (cost=0.00..20.00 rows=1000 width=0) (actual
time=0.015..0.023rows=30 loops=1)" 
> "Total runtime: 4895.239 ms"
>
>
> ----------------------------------------
>> Date: Wed, 14 May 2014 22:43:24 -0700
>> From: david.g.johnston@gmail.com
>> To: pgsql-performance@postgresql.org
>> Subject: Re: [PERFORM] how do functions affect query plan?
>>
>> 常超 wrote
>>> Hi,all
>>> I have a table to save received measure data.
>>>
>>>
>>> CREATE TABLE measure_data
>>> (
>>> id serial NOT NULL,
>>> telegram_id integer NOT NULL,
>>> measure_time timestamp without time zone NOT NULL,
>>> item_id integer NOT NULL,
>>> val double precision,
>>> CONSTRAINT measure_data_pkey PRIMARY KEY (id)
>>> );
>>>
>>> CREATE INDEX index_measure_data_telegram_id ON measure_data USING btree
>>> (telegram_id);
>>>
>>>
>>> in my scenario,a telegram contains measure data for multiple data items
>>> and timestamps,
>>> BTW,another table is for telegram.
>>>
>>> The SQL I used in my application is
>>> select * from measure_data where telegram_id in(1,2,...,n)
>>> and this query used the index_measure_data_telegram_id index,as expected.
>>>
>>> In order to see the performance of my query ,
>>> I used the following query to search the measure data for randomly 30
>>> telegrams.
>>>
>>>
>>> explain analyze
>>> SELECT md.*
>>> FROM measure_data md
>>> where telegram_id in
>>> (
>>> SELECT distinct
>>> trunc((132363-66484) * random() + 66484)
>>> FROM generate_series(1,30) as s(telegram_id)
>>> )
>>> ;
>>>
>>> the 132363 and 66484 are the max and min of the telegram id,separately.
>>>
>>> What surprised me is that index is not used,instead,a seq scan is
>>> performed on measure_data.
>>> Although,intuitively,in this case,it is much wiser to use the index.
>>> Would you please give some clue to why this happened?
>>>
>>> "Hash Semi Join (cost=65.00..539169.32 rows=10277280 width=28) (actual
>>> time=76.454..17177.054 rows=9360 loops=1)"
>>> " Hash Cond: ((md.telegram_id)::double precision = (trunc(((65879::double
>>> precision * random()) + 66484::double precision))))"
>>> " -> Seq Scan on measure_data md (cost=0.00..356682.60 rows=20554560
>>> width=28) (actual time=0.012..13874.809 rows=20554560 loops=1)"
>>> " -> Hash (cost=52.50..52.50 rows=1000 width=8) (actual
>>> time=0.062..0.062 rows=30 loops=1)"
>>> " Buckets: 1024 Batches: 1 Memory Usage: 2kB"
>>> " -> HashAggregate (cost=22.50..42.50 rows=1000 width=0) (actual
>>> time=0.048..0.053 rows=30 loops=1)"
>>> " -> Function Scan on generate_series s (cost=0.00..20.00
>>> rows=1000 width=0) (actual time=0.020..0.034 rows=30 loops=1)"
>>> "Total runtime: 17177.527 ms"
>>
>> The planner expects to need to return half the table when you provide 1,000
>> distinct telegram_ids, which is best handled by scanning the whole table
>> sequentially and tossing out invalid data.
>>
>> I am curious if the plan will be different if you added a LIMIT 30 to the
>> sub-query.
>>
>> The root of the problem is the planner has no way of knowing whether
>> generate_series is going to return 1 or 1,000,000 rows so by default it (and
>> all functions) are assumed (by the planner) to return 1,000 rows. By adding
>> an explicit limit you can better inform the planner as to how many rows you
>> are going to be passing up to the parent query and it will hopefully, with
>> knowledge of only 30 distinct values, use the index.
>>
>>
>>
>>
>> --
>> View this message in context:
http://postgresql.1045698.n5.nabble.com/how-do-functions-affect-query-plan-tp5803993p5803996.html
>> Sent from the PostgreSQL - performance mailing list archive at Nabble.com.
>>
>>
>> --
>> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-performance
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance

Re: FW: how do functions affect query plan?

From
楊新波
Date:
hi

i think the telegram_id's type should be integer. 

please change telegram_id to numeric and try to run the the following sql. the index should be used. 

explain SELECT md.* 
  FROM measure_data md 
  where telegram_id in (trunc(66484.2),trunc(132362.1 )) 


2014-05-15 17:28 GMT+09:00 changchao <chang-chao@hotmail.com>:


----------------------------------------
> From: chang-chao@hotmail.com
> To: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] how do functions affect query plan?
> Date: Thu, 15 May 2014 16:59:30 +0900
>
>
>
> Interestingly,adding type cast made postgresql wiser.
> Anyone knows the reason?
>
> 1.no type cast
> SELECT md.*
>   FROM measure_data md
>   where telegram_id in (trunc(66484.2),trunc(132362.1 ))
>
>
> "Seq Scan on measure_data md  (cost=0.00..459455.40 rows=205546 width=28) (actual time=77.144..6458.870 rows=624 loops=1)"
> "  Filter: ((telegram_id)::numeric = ANY ('{66484,132362}'::numeric[]))"
> "  Rows Removed by Filter: 20553936"
> "Total runtime: 6458.921 ms"
>
>
> 2.type cast
>
> SELECT md.*
>   FROM measure_data md
>   where telegram_id in (trunc(66484.2)::int,trunc(132362.1 )::int)
>
> "Bitmap Heap Scan on measure_data md  (cost=16.06..2618.86 rows=684 width=28) (actual time=0.076..0.154 rows=624 loops=1)"
> "  Recheck Cond: (telegram_id = ANY ('{66484,132362}'::integer[]))"
> "  ->  Bitmap Index Scan on index_measure_data_telegram_id  (cost=0.00..15.88 rows=684 width=0) (actual time=0.065..0.065 rows=624 loops=1)"
> "        Index Cond: (telegram_id = ANY ('{66484,132362}'::integer[]))"
> "Total runtime: 0.187 ms"
>
>
> ----------------------------------------
>> From: chang-chao@hotmail.com
>> To: david.g.johnston@gmail.com; pgsql-performance@postgresql.org
>> Subject: Re: [PERFORM] how do functions affect query plan?
>> Date: Thu, 15 May 2014 15:19:13 +0900
>>
>> Hi,David
>>
>> Seems that the root of evil is in the function(random,trunc),
>> although I don't know why.
>>
>> Here is the comparison.
>>
>> 1.w/o function : index is wisely used.(Even without the limit 30 clause)
>>
>> explain analyze
>> SELECT md.*
>> FROM measure_data md
>> where telegram_id in
>> (
>> SELECT 66484 + (132363-66484)/30 * i
>> FROM generate_series(1,30) as s(i)
>> limit 30
>> )
>> ;
>>
>> "Nested Loop (cost=10.01..39290.79 rows=10392 width=28) (actual time=0.079..3.490 rows=9360 loops=1)"
>> " -> HashAggregate (cost=0.83..1.13 rows=30 width=4) (actual time=0.027..0.032 rows=30 loops=1)"
>> " -> Limit (cost=0.00..0.45 rows=30 width=4) (actual time=0.013..0.020 rows=30 loops=1)"
>> " -> Function Scan on generate_series s (cost=0.00..15.00 rows=1000 width=4) (actual time=0.011..0.016 rows=30 loops=1)"
>> " -> Bitmap Heap Scan on measure_data md (cost=9.19..1306.20 rows=346 width=28) (actual time=0.030..0.075 rows=312 loops=30)"
>> " Recheck Cond: (telegram_id = ((66484 + (2195 * s.i))))"
>> " -> Bitmap Index Scan on index_measure_data_telegram_id (cost=0.00..9.10 rows=346 width=0) (actual time=0.025..0.025 rows=312 loops=30)"
>> " Index Cond: (telegram_id = ((66484 + (2195 * s.i))))"
>> "Total runtime: 3.714 ms"
>>
>>
>> 2.when function is there: seq scan
>>
>> explain analyze
>> SELECT md.*
>> FROM measure_data md
>> where telegram_id in
>> (
>> SELECT trunc((132363-66484) * random()) +66484
>> FROM generate_series(1,30) as s(i)
>> limit 30
>> )
>> ;
>>
>>
>> "Hash Join (cost=1.65..490288.89 rows=10277280 width=28) (actual time=0.169..4894.847 rows=9360 loops=1)"
>> " Hash Cond: ((md.telegram_id)::double precision = ((trunc((65879::double precision * random())) + 66484::double precision)))"
>> " -> Seq Scan on measure_data md (cost=0.00..356682.60 rows=20554560 width=28) (actual time=0.010..2076.932 rows=20554560 loops=1)"
>> " -> Hash (cost=1.28..1.28 rows=30 width=8) (actual time=0.041..0.041 rows=30 loops=1)"
>> " Buckets: 1024 Batches: 1 Memory Usage: 2kB"
>> " -> HashAggregate (cost=0.98..1.28 rows=30 width=8) (actual time=0.034..0.036 rows=30 loops=1)"
>> " -> Limit (cost=0.00..0.60 rows=30 width=0) (actual time=0.016..0.026 rows=30 loops=1)"
>> " -> Function Scan on generate_series s (cost=0.00..20.00 rows=1000 width=0) (actual time=0.015..0.023 rows=30 loops=1)"
>> "Total runtime: 4895.239 ms"
>>
>>
>> ----------------------------------------
>>> Date: Wed, 14 May 2014 22:43:24 -0700
>>> From: david.g.johnston@gmail.com
>>> To: pgsql-performance@postgresql.org
>>> Subject: Re: [PERFORM] how do functions affect query plan?
>>>
>>> 常超 wrote
>>>> Hi,all
>>>> I have a table to save received measure data.
>>>>
>>>>
>>>> CREATE TABLE measure_data
>>>> (
>>>> id serial NOT NULL,
>>>> telegram_id integer NOT NULL,
>>>> measure_time timestamp without time zone NOT NULL,
>>>> item_id integer NOT NULL,
>>>> val double precision,
>>>> CONSTRAINT measure_data_pkey PRIMARY KEY (id)
>>>> );
>>>>
>>>> CREATE INDEX index_measure_data_telegram_id ON measure_data USING btree
>>>> (telegram_id);
>>>>
>>>>
>>>> in my scenario,a telegram contains measure data for multiple data items
>>>> and timestamps,
>>>> BTW,another table is for telegram.
>>>>
>>>> The SQL I used in my application is
>>>> select * from measure_data where telegram_id in(1,2,...,n)
>>>> and this query used the index_measure_data_telegram_id index,as expected.
>>>>
>>>> In order to see the performance of my query ,
>>>> I used the following query to search the measure data for randomly 30
>>>> telegrams.
>>>>
>>>>
>>>> explain analyze
>>>> SELECT md.*
>>>> FROM measure_data md
>>>> where telegram_id in
>>>> (
>>>> SELECT distinct
>>>> trunc((132363-66484) * random() + 66484)
>>>> FROM generate_series(1,30) as s(telegram_id)
>>>> )
>>>> ;
>>>>
>>>> the 132363 and 66484 are the max and min of the telegram id,separately.
>>>>
>>>> What surprised me is that index is not used,instead,a seq scan is
>>>> performed on measure_data.
>>>> Although,intuitively,in this case,it is much wiser to use the index.
>>>> Would you please give some clue to why this happened?
>>>>
>>>> "Hash Semi Join (cost=65.00..539169.32 rows=10277280 width=28) (actual
>>>> time=76.454..17177.054 rows=9360 loops=1)"
>>>> " Hash Cond: ((md.telegram_id)::double precision = (trunc(((65879::double
>>>> precision * random()) + 66484::double precision))))"
>>>> " -> Seq Scan on measure_data md (cost=0.00..356682.60 rows=20554560
>>>> width=28) (actual time=0.012..13874.809 rows=20554560 loops=1)"
>>>> " -> Hash (cost=52.50..52.50 rows=1000 width=8) (actual
>>>> time=0.062..0.062 rows=30 loops=1)"
>>>> " Buckets: 1024 Batches: 1 Memory Usage: 2kB"
>>>> " -> HashAggregate (cost=22.50..42.50 rows=1000 width=0) (actual
>>>> time=0.048..0.053 rows=30 loops=1)"
>>>> " -> Function Scan on generate_series s (cost=0.00..20.00
>>>> rows=1000 width=0) (actual time=0.020..0.034 rows=30 loops=1)"
>>>> "Total runtime: 17177.527 ms"
>>>
>>> The planner expects to need to return half the table when you provide 1,000
>>> distinct telegram_ids, which is best handled by scanning the whole table
>>> sequentially and tossing out invalid data.
>>>
>>> I am curious if the plan will be different if you added a LIMIT 30 to the
>>> sub-query.
>>>
>>> The root of the problem is the planner has no way of knowing whether
>>> generate_series is going to return 1 or 1,000,000 rows so by default it (and
>>> all functions) are assumed (by the planner) to return 1,000 rows. By adding
>>> an explicit limit you can better inform the planner as to how many rows you
>>> are going to be passing up to the parent query and it will hopefully, with
>>> knowledge of only 30 distinct values, use the index.
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context: http://postgresql.1045698.n5.nabble.com/how-do-functions-affect-query-plan-tp5803993p5803996.html
>>> Sent from the PostgreSQL - performance mailing list archive at Nabble.com.
>>>
>>>
>>> --
>>> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
>>> To make changes to your subscription:
>>> http://www.postgresql.org/mailpref/pgsql-performance
>>
>> --
>> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-performance
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
                                         

Re: how do functions affect query plan?

From
changchao
Date:
Your answer seemed to get the point.

index on telegram_id(type=integer) column can't be used for the filter condition below
because type mismatches.

  ((telegram_id)::numeric = ANY ('{66484,132362}'::numeric[]))"

________________________________
> Date: Thu, 15 May 2014 17:31:10 +0900
> Subject: Re: FW: [PERFORM] how do functions affect query plan?
> From: silent0608@gmail.com
> To: pgsql-performance@postgresql.org
>
> hi
>
> i think the telegram_id's type should be integer.
>
> please change telegram_id to numeric and try to run the the following
> sql. the index should be used.
>
> explain SELECT md.*
>    FROM measure_data md
>    where telegram_id in (trunc(66484.2),trunc(132362.1 ))
>
>
> 2014-05-15 17:28 GMT+09:00 changchao
> <chang-chao@hotmail.com<mailto:chang-chao@hotmail.com>>:
>
>
> ----------------------------------------
> > From: chang-chao@hotmail.com<mailto:chang-chao@hotmail.com>
> > To:
> pgsql-performance@postgresql.org<mailto:pgsql-performance@postgresql.org>
> > Subject: Re: [PERFORM] how do functions affect query plan?
> > Date: Thu, 15 May 2014 16:59:30 +0900
> >
> >
> >
> > Interestingly,adding type cast made postgresql wiser.
> > Anyone knows the reason?
> >
> > 1.no<http://1.no> type cast
> > SELECT md.*
> >   FROM measure_data md
> >   where telegram_id in (trunc(66484.2),trunc(132362.1 ))
> >
> >
> > "Seq Scan on measure_data md  (cost=0.00..459455.40 rows=205546
> width=28) (actual time=77.144..6458.870 rows=624 loops=1)"
> > "  Filter: ((telegram_id)::numeric = ANY ('{66484,132362}'::numeric[]))"
> > "  Rows Removed by Filter: 20553936"
> > "Total runtime: 6458.921 ms"
> >
> >
> > 2.type cast
> >
> > SELECT md.*
> >   FROM measure_data md
> >   where telegram_id in (trunc(66484.2)::int,trunc(132362.1 )::int)
> >
> > "Bitmap Heap Scan on measure_data md  (cost=16.06..2618.86 rows=684
> width=28) (actual time=0.076..0.154 rows=624 loops=1)"
> > "  Recheck Cond: (telegram_id = ANY ('{66484,132362}'::integer[]))"
> > "  ->  Bitmap Index Scan on index_measure_data_telegram_id
>   (cost=0.00..15.88 rows=684 width=0) (actual time=0.065..0.065 rows=624
> loops=1)"
> > "        Index Cond: (telegram_id = ANY ('{66484,132362}'::integer[]))"
> > "Total runtime: 0.187 ms"
> >
> >
> > ----------------------------------------
> >> From: chang-chao@hotmail.com<mailto:chang-chao@hotmail.com>
> >> To: david.g.johnston@gmail.com<mailto:david.g.johnston@gmail.com>;
> pgsql-performance@postgresql.org<mailto:pgsql-performance@postgresql.org>
> >> Subject: Re: [PERFORM] how do functions affect query plan?
> >> Date: Thu, 15 May 2014 15:19:13 +0900
> >>
> >> Hi,David
> >>
> >> Seems that the root of evil is in the function(random,trunc),
> >> although I don't know why.
> >>
> >> Here is the comparison.
> >>
> >> 1.w/o function : index is wisely used.(Even without the limit 30 clause)
> >>
> >> explain analyze
> >> SELECT md.*
> >> FROM measure_data md
> >> where telegram_id in
> >> (
> >> SELECT 66484 + (132363-66484)/30 * i
> >> FROM generate_series(1,30) as s(i)
> >> limit 30
> >> )
> >> ;
> >>
> >> "Nested Loop (cost=10.01..39290.79 rows=10392 width=28) (actual
> time=0.079..3.490 rows=9360 loops=1)"
> >> " -> HashAggregate (cost=0.83..1.13 rows=30 width=4) (actual
> time=0.027..0.032 rows=30 loops=1)"
> >> " -> Limit (cost=0.00..0.45 rows=30 width=4) (actual
> time=0.013..0.020 rows=30 loops=1)"
> >> " -> Function Scan on generate_series s (cost=0.00..15.00 rows=1000
> width=4) (actual time=0.011..0.016 rows=30 loops=1)"
> >> " -> Bitmap Heap Scan on measure_data md (cost=9.19..1306.20
> rows=346 width=28) (actual time=0.030..0.075 rows=312 loops=30)"
> >> " Recheck Cond: (telegram_id = ((66484 + (2195 * s.i))))"
> >> " -> Bitmap Index Scan on index_measure_data_telegram_id
> (cost=0.00..9.10 rows=346 width=0) (actual time=0.025..0.025 rows=312
> loops=30)"
> >> " Index Cond: (telegram_id = ((66484 + (2195 * s.i))))"
> >> "Total runtime: 3.714 ms"
> >>
> >>
> >> 2.when function is there: seq scan
> >>
> >> explain analyze
> >> SELECT md.*
> >> FROM measure_data md
> >> where telegram_id in
> >> (
> >> SELECT trunc((132363-66484) * random()) +66484
> >> FROM generate_series(1,30) as s(i)
> >> limit 30
> >> )
> >> ;
> >>
> >>
> >> "Hash Join (cost=1.65..490288.89 rows=10277280 width=28) (actual
> time=0.169..4894.847 rows=9360 loops=1)"
> >> " Hash Cond: ((md.telegram_id)::double precision =
> ((trunc((65879::double precision * random())) + 66484::double
> precision)))"
> >> " -> Seq Scan on measure_data md (cost=0.00..356682.60 rows=20554560
> width=28) (actual time=0.010..2076.932 rows=20554560 loops=1)"
> >> " -> Hash (cost=1.28..1.28 rows=30 width=8) (actual
> time=0.041..0.041 rows=30 loops=1)"
> >> " Buckets: 1024 Batches: 1 Memory Usage: 2kB"
> >> " -> HashAggregate (cost=0.98..1.28 rows=30 width=8) (actual
> time=0.034..0.036 rows=30 loops=1)"
> >> " -> Limit (cost=0.00..0.60 rows=30 width=0) (actual
> time=0.016..0.026 rows=30 loops=1)"
> >> " -> Function Scan on generate_series s (cost=0.00..20.00 rows=1000
> width=0) (actual time=0.015..0.023 rows=30 loops=1)"
> >> "Total runtime: 4895.239 ms"
> >>
> >>
> >> ----------------------------------------
> >>> Date: Wed, 14 May 2014 22:43:24 -0700
> >>> From: david.g.johnston@gmail.com<mailto:david.g.johnston@gmail.com>
> >>> To:
> pgsql-performance@postgresql.org<mailto:pgsql-performance@postgresql.org>
> >>> Subject: Re: [PERFORM] how do functions affect query plan?
> >>>
> >>> 常超 wrote
> >>>> Hi,all
> >>>> I have a table to save received measure data.
> >>>>
> >>>>
> >>>> CREATE TABLE measure_data
> >>>> (
> >>>> id serial NOT NULL,
> >>>> telegram_id integer NOT NULL,
> >>>> measure_time timestamp without time zone NOT NULL,
> >>>> item_id integer NOT NULL,
> >>>> val double precision,
> >>>> CONSTRAINT measure_data_pkey PRIMARY KEY (id)
> >>>> );
> >>>>
> >>>> CREATE INDEX index_measure_data_telegram_id ON measure_data USING btree
> >>>> (telegram_id);
> >>>>
> >>>>
> >>>> in my scenario,a telegram contains measure data for multiple data items
> >>>> and timestamps,
> >>>> BTW,another table is for telegram.
> >>>>
> >>>> The SQL I used in my application is
> >>>> select * from measure_data where telegram_id in(1,2,...,n)
> >>>> and this query used the index_measure_data_telegram_id index,as
> expected.
> >>>>
> >>>> In order to see the performance of my query ,
> >>>> I used the following query to search the measure data for randomly 30
> >>>> telegrams.
> >>>>
> >>>>
> >>>> explain analyze
> >>>> SELECT md.*
> >>>> FROM measure_data md
> >>>> where telegram_id in
> >>>> (
> >>>> SELECT distinct
> >>>> trunc((132363-66484) * random() + 66484)
> >>>> FROM generate_series(1,30) as s(telegram_id)
> >>>> )
> >>>> ;
> >>>>
> >>>> the 132363 and 66484 are the max and min of the telegram id,separately.
> >>>>
> >>>> What surprised me is that index is not used,instead,a seq scan is
> >>>> performed on measure_data.
> >>>> Although,intuitively,in this case,it is much wiser to use the index.
> >>>> Would you please give some clue to why this happened?
> >>>>
> >>>> "Hash Semi Join (cost=65.00..539169.32 rows=10277280 width=28) (actual
> >>>> time=76.454..17177.054 rows=9360 loops=1)"
> >>>> " Hash Cond: ((md.telegram_id)::double precision =
> (trunc(((65879::double
> >>>> precision * random()) + 66484::double precision))))"
> >>>> " -> Seq Scan on measure_data md (cost=0.00..356682.60 rows=20554560
> >>>> width=28) (actual time=0.012..13874.809 rows=20554560 loops=1)"
> >>>> " -> Hash (cost=52.50..52.50 rows=1000 width=8) (actual
> >>>> time=0.062..0.062 rows=30 loops=1)"
> >>>> " Buckets: 1024 Batches: 1 Memory Usage: 2kB"
> >>>> " -> HashAggregate (cost=22.50..42.50 rows=1000 width=0) (actual
> >>>> time=0.048..0.053 rows=30 loops=1)"
> >>>> " -> Function Scan on generate_series s (cost=0.00..20.00
> >>>> rows=1000 width=0) (actual time=0.020..0.034 rows=30 loops=1)"
> >>>> "Total runtime: 17177.527 ms"
> >>>
> >>> The planner expects to need to return half the table when you
> provide 1,000
> >>> distinct telegram_ids, which is best handled by scanning the whole table
> >>> sequentially and tossing out invalid data.
> >>>
> >>> I am curious if the plan will be different if you added a LIMIT 30 to the
> >>> sub-query.
> >>>
> >>> The root of the problem is the planner has no way of knowing whether
> >>> generate_series is going to return 1 or 1,000,000 rows so by
> default it (and
> >>> all functions) are assumed (by the planner) to return 1,000 rows.
> By adding
> >>> an explicit limit you can better inform the planner as to how many
> rows you
> >>> are going to be passing up to the parent query and it will
> hopefully, with
> >>> knowledge of only 30 distinct values, use the index.
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>> View this message in context:
> http://postgresql.1045698.n5.nabble.com/how-do-functions-affect-query-plan-tp5803993p5803996.html
> >>> Sent from the PostgreSQL - performance mailing list archive at
> Nabble.com.
> >>>
> >>>
> >>> --
> >>> Sent via pgsql-performance mailing list
> (pgsql-performance@postgresql.org<mailto:pgsql-performance@postgresql.org>)
> >>> To make changes to your subscription:
> >>> http://www.postgresql.org/mailpref/pgsql-performance
> >>
> >> --
> >> Sent via pgsql-performance mailing list
> (pgsql-performance@postgresql.org<mailto:pgsql-performance@postgresql.org>)
> >> To make changes to your subscription:
> >> http://www.postgresql.org/mailpref/pgsql-performance
> >
> > --
> > Sent via pgsql-performance mailing list
> (pgsql-performance@postgresql.org<mailto:pgsql-performance@postgresql.org>)
> > To make changes to your subscription:
> > http://www.postgresql.org/mailpref/pgsql-performance
>
>