Re: FW: how do functions affect query plan? - Mailing list pgsql-performance

From 楊新波
Subject Re: FW: how do functions affect query plan?
Date
Msg-id CAK=TrEFfot8yz-906EcDqqgyFPiMkGT77+Dg5MTjTqVOreXQ=g@mail.gmail.com
Whole thread Raw
In response to how do functions affect query plan?  (常超 <chang-chao@hotmail.com>)
Responses Re: how do functions affect query plan?  (changchao <chang-chao@hotmail.com>)
List pgsql-performance
hi

i think the telegram_id's type should be integer. 

please change telegram_id to numeric and try to run the the following sql. the index should be used. 

explain SELECT md.* 
  FROM measure_data md 
  where telegram_id in (trunc(66484.2),trunc(132362.1 )) 


2014-05-15 17:28 GMT+09:00 changchao <chang-chao@hotmail.com>:


----------------------------------------
> From: chang-chao@hotmail.com
> To: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] how do functions affect query plan?
> Date: Thu, 15 May 2014 16:59:30 +0900
>
>
>
> Interestingly,adding type cast made postgresql wiser.
> Anyone knows the reason?
>
> 1.no type cast
> SELECT md.*
>   FROM measure_data md
>   where telegram_id in (trunc(66484.2),trunc(132362.1 ))
>
>
> "Seq Scan on measure_data md  (cost=0.00..459455.40 rows=205546 width=28) (actual time=77.144..6458.870 rows=624 loops=1)"
> "  Filter: ((telegram_id)::numeric = ANY ('{66484,132362}'::numeric[]))"
> "  Rows Removed by Filter: 20553936"
> "Total runtime: 6458.921 ms"
>
>
> 2.type cast
>
> SELECT md.*
>   FROM measure_data md
>   where telegram_id in (trunc(66484.2)::int,trunc(132362.1 )::int)
>
> "Bitmap Heap Scan on measure_data md  (cost=16.06..2618.86 rows=684 width=28) (actual time=0.076..0.154 rows=624 loops=1)"
> "  Recheck Cond: (telegram_id = ANY ('{66484,132362}'::integer[]))"
> "  ->  Bitmap Index Scan on index_measure_data_telegram_id  (cost=0.00..15.88 rows=684 width=0) (actual time=0.065..0.065 rows=624 loops=1)"
> "        Index Cond: (telegram_id = ANY ('{66484,132362}'::integer[]))"
> "Total runtime: 0.187 ms"
>
>
> ----------------------------------------
>> From: chang-chao@hotmail.com
>> To: david.g.johnston@gmail.com; pgsql-performance@postgresql.org
>> Subject: Re: [PERFORM] how do functions affect query plan?
>> Date: Thu, 15 May 2014 15:19:13 +0900
>>
>> Hi,David
>>
>> Seems that the root of evil is in the function(random,trunc),
>> although I don't know why.
>>
>> Here is the comparison.
>>
>> 1.w/o function : index is wisely used.(Even without the limit 30 clause)
>>
>> explain analyze
>> SELECT md.*
>> FROM measure_data md
>> where telegram_id in
>> (
>> SELECT 66484 + (132363-66484)/30 * i
>> FROM generate_series(1,30) as s(i)
>> limit 30
>> )
>> ;
>>
>> "Nested Loop (cost=10.01..39290.79 rows=10392 width=28) (actual time=0.079..3.490 rows=9360 loops=1)"
>> " -> HashAggregate (cost=0.83..1.13 rows=30 width=4) (actual time=0.027..0.032 rows=30 loops=1)"
>> " -> Limit (cost=0.00..0.45 rows=30 width=4) (actual time=0.013..0.020 rows=30 loops=1)"
>> " -> Function Scan on generate_series s (cost=0.00..15.00 rows=1000 width=4) (actual time=0.011..0.016 rows=30 loops=1)"
>> " -> Bitmap Heap Scan on measure_data md (cost=9.19..1306.20 rows=346 width=28) (actual time=0.030..0.075 rows=312 loops=30)"
>> " Recheck Cond: (telegram_id = ((66484 + (2195 * s.i))))"
>> " -> Bitmap Index Scan on index_measure_data_telegram_id (cost=0.00..9.10 rows=346 width=0) (actual time=0.025..0.025 rows=312 loops=30)"
>> " Index Cond: (telegram_id = ((66484 + (2195 * s.i))))"
>> "Total runtime: 3.714 ms"
>>
>>
>> 2.when function is there: seq scan
>>
>> explain analyze
>> SELECT md.*
>> FROM measure_data md
>> where telegram_id in
>> (
>> SELECT trunc((132363-66484) * random()) +66484
>> FROM generate_series(1,30) as s(i)
>> limit 30
>> )
>> ;
>>
>>
>> "Hash Join (cost=1.65..490288.89 rows=10277280 width=28) (actual time=0.169..4894.847 rows=9360 loops=1)"
>> " Hash Cond: ((md.telegram_id)::double precision = ((trunc((65879::double precision * random())) + 66484::double precision)))"
>> " -> Seq Scan on measure_data md (cost=0.00..356682.60 rows=20554560 width=28) (actual time=0.010..2076.932 rows=20554560 loops=1)"
>> " -> Hash (cost=1.28..1.28 rows=30 width=8) (actual time=0.041..0.041 rows=30 loops=1)"
>> " Buckets: 1024 Batches: 1 Memory Usage: 2kB"
>> " -> HashAggregate (cost=0.98..1.28 rows=30 width=8) (actual time=0.034..0.036 rows=30 loops=1)"
>> " -> Limit (cost=0.00..0.60 rows=30 width=0) (actual time=0.016..0.026 rows=30 loops=1)"
>> " -> Function Scan on generate_series s (cost=0.00..20.00 rows=1000 width=0) (actual time=0.015..0.023 rows=30 loops=1)"
>> "Total runtime: 4895.239 ms"
>>
>>
>> ----------------------------------------
>>> Date: Wed, 14 May 2014 22:43:24 -0700
>>> From: david.g.johnston@gmail.com
>>> To: pgsql-performance@postgresql.org
>>> Subject: Re: [PERFORM] how do functions affect query plan?
>>>
>>> 常超 wrote
>>>> Hi,all
>>>> I have a table to save received measure data.
>>>>
>>>>
>>>> CREATE TABLE measure_data
>>>> (
>>>> id serial NOT NULL,
>>>> telegram_id integer NOT NULL,
>>>> measure_time timestamp without time zone NOT NULL,
>>>> item_id integer NOT NULL,
>>>> val double precision,
>>>> CONSTRAINT measure_data_pkey PRIMARY KEY (id)
>>>> );
>>>>
>>>> CREATE INDEX index_measure_data_telegram_id ON measure_data USING btree
>>>> (telegram_id);
>>>>
>>>>
>>>> in my scenario,a telegram contains measure data for multiple data items
>>>> and timestamps,
>>>> BTW,another table is for telegram.
>>>>
>>>> The SQL I used in my application is
>>>> select * from measure_data where telegram_id in(1,2,...,n)
>>>> and this query used the index_measure_data_telegram_id index,as expected.
>>>>
>>>> In order to see the performance of my query ,
>>>> I used the following query to search the measure data for randomly 30
>>>> telegrams.
>>>>
>>>>
>>>> explain analyze
>>>> SELECT md.*
>>>> FROM measure_data md
>>>> where telegram_id in
>>>> (
>>>> SELECT distinct
>>>> trunc((132363-66484) * random() + 66484)
>>>> FROM generate_series(1,30) as s(telegram_id)
>>>> )
>>>> ;
>>>>
>>>> the 132363 and 66484 are the max and min of the telegram id,separately.
>>>>
>>>> What surprised me is that index is not used,instead,a seq scan is
>>>> performed on measure_data.
>>>> Although,intuitively,in this case,it is much wiser to use the index.
>>>> Would you please give some clue to why this happened?
>>>>
>>>> "Hash Semi Join (cost=65.00..539169.32 rows=10277280 width=28) (actual
>>>> time=76.454..17177.054 rows=9360 loops=1)"
>>>> " Hash Cond: ((md.telegram_id)::double precision = (trunc(((65879::double
>>>> precision * random()) + 66484::double precision))))"
>>>> " -> Seq Scan on measure_data md (cost=0.00..356682.60 rows=20554560
>>>> width=28) (actual time=0.012..13874.809 rows=20554560 loops=1)"
>>>> " -> Hash (cost=52.50..52.50 rows=1000 width=8) (actual
>>>> time=0.062..0.062 rows=30 loops=1)"
>>>> " Buckets: 1024 Batches: 1 Memory Usage: 2kB"
>>>> " -> HashAggregate (cost=22.50..42.50 rows=1000 width=0) (actual
>>>> time=0.048..0.053 rows=30 loops=1)"
>>>> " -> Function Scan on generate_series s (cost=0.00..20.00
>>>> rows=1000 width=0) (actual time=0.020..0.034 rows=30 loops=1)"
>>>> "Total runtime: 17177.527 ms"
>>>
>>> The planner expects to need to return half the table when you provide 1,000
>>> distinct telegram_ids, which is best handled by scanning the whole table
>>> sequentially and tossing out invalid data.
>>>
>>> I am curious if the plan will be different if you added a LIMIT 30 to the
>>> sub-query.
>>>
>>> The root of the problem is the planner has no way of knowing whether
>>> generate_series is going to return 1 or 1,000,000 rows so by default it (and
>>> all functions) are assumed (by the planner) to return 1,000 rows. By adding
>>> an explicit limit you can better inform the planner as to how many rows you
>>> are going to be passing up to the parent query and it will hopefully, with
>>> knowledge of only 30 distinct values, use the index.
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context: http://postgresql.1045698.n5.nabble.com/how-do-functions-affect-query-plan-tp5803993p5803996.html
>>> Sent from the PostgreSQL - performance mailing list archive at Nabble.com.
>>>
>>>
>>> --
>>> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
>>> To make changes to your subscription:
>>> http://www.postgresql.org/mailpref/pgsql-performance
>>
>> --
>> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-performance
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
                                         

pgsql-performance by date:

Previous
From: Pavel Stehule
Date:
Subject: Re: Stats collector constant I/O
Next
From: changchao
Date:
Subject: Re: how do functions affect query plan?