Re: how do functions affect query plan? - Mailing list pgsql-performance

From changchao
Subject Re: how do functions affect query plan?
Date
Msg-id BAY174-W2444445CB18631AE1C3FAA84360@phx.gbl
Whole thread Raw
In response to Re: FW: how do functions affect query plan?  (楊新波 <silent0608@gmail.com>)
List pgsql-performance
Your answer seemed to get the point.

index on telegram_id(type=integer) column can't be used for the filter condition below
because type mismatches.

  ((telegram_id)::numeric = ANY ('{66484,132362}'::numeric[]))"

________________________________
> Date: Thu, 15 May 2014 17:31:10 +0900
> Subject: Re: FW: [PERFORM] how do functions affect query plan?
> From: silent0608@gmail.com
> To: pgsql-performance@postgresql.org
>
> hi
>
> i think the telegram_id's type should be integer.
>
> please change telegram_id to numeric and try to run the the following
> sql. the index should be used.
>
> explain SELECT md.*
>    FROM measure_data md
>    where telegram_id in (trunc(66484.2),trunc(132362.1 ))
>
>
> 2014-05-15 17:28 GMT+09:00 changchao
> <chang-chao@hotmail.com<mailto:chang-chao@hotmail.com>>:
>
>
> ----------------------------------------
> > From: chang-chao@hotmail.com<mailto:chang-chao@hotmail.com>
> > To:
> pgsql-performance@postgresql.org<mailto:pgsql-performance@postgresql.org>
> > Subject: Re: [PERFORM] how do functions affect query plan?
> > Date: Thu, 15 May 2014 16:59:30 +0900
> >
> >
> >
> > Interestingly,adding type cast made postgresql wiser.
> > Anyone knows the reason?
> >
> > 1.no<http://1.no> type cast
> > SELECT md.*
> >   FROM measure_data md
> >   where telegram_id in (trunc(66484.2),trunc(132362.1 ))
> >
> >
> > "Seq Scan on measure_data md  (cost=0.00..459455.40 rows=205546
> width=28) (actual time=77.144..6458.870 rows=624 loops=1)"
> > "  Filter: ((telegram_id)::numeric = ANY ('{66484,132362}'::numeric[]))"
> > "  Rows Removed by Filter: 20553936"
> > "Total runtime: 6458.921 ms"
> >
> >
> > 2.type cast
> >
> > SELECT md.*
> >   FROM measure_data md
> >   where telegram_id in (trunc(66484.2)::int,trunc(132362.1 )::int)
> >
> > "Bitmap Heap Scan on measure_data md  (cost=16.06..2618.86 rows=684
> width=28) (actual time=0.076..0.154 rows=624 loops=1)"
> > "  Recheck Cond: (telegram_id = ANY ('{66484,132362}'::integer[]))"
> > "  ->  Bitmap Index Scan on index_measure_data_telegram_id
>   (cost=0.00..15.88 rows=684 width=0) (actual time=0.065..0.065 rows=624
> loops=1)"
> > "        Index Cond: (telegram_id = ANY ('{66484,132362}'::integer[]))"
> > "Total runtime: 0.187 ms"
> >
> >
> > ----------------------------------------
> >> From: chang-chao@hotmail.com<mailto:chang-chao@hotmail.com>
> >> To: david.g.johnston@gmail.com<mailto:david.g.johnston@gmail.com>;
> pgsql-performance@postgresql.org<mailto:pgsql-performance@postgresql.org>
> >> Subject: Re: [PERFORM] how do functions affect query plan?
> >> Date: Thu, 15 May 2014 15:19:13 +0900
> >>
> >> Hi,David
> >>
> >> Seems that the root of evil is in the function(random,trunc),
> >> although I don't know why.
> >>
> >> Here is the comparison.
> >>
> >> 1.w/o function : index is wisely used.(Even without the limit 30 clause)
> >>
> >> explain analyze
> >> SELECT md.*
> >> FROM measure_data md
> >> where telegram_id in
> >> (
> >> SELECT 66484 + (132363-66484)/30 * i
> >> FROM generate_series(1,30) as s(i)
> >> limit 30
> >> )
> >> ;
> >>
> >> "Nested Loop (cost=10.01..39290.79 rows=10392 width=28) (actual
> time=0.079..3.490 rows=9360 loops=1)"
> >> " -> HashAggregate (cost=0.83..1.13 rows=30 width=4) (actual
> time=0.027..0.032 rows=30 loops=1)"
> >> " -> Limit (cost=0.00..0.45 rows=30 width=4) (actual
> time=0.013..0.020 rows=30 loops=1)"
> >> " -> Function Scan on generate_series s (cost=0.00..15.00 rows=1000
> width=4) (actual time=0.011..0.016 rows=30 loops=1)"
> >> " -> Bitmap Heap Scan on measure_data md (cost=9.19..1306.20
> rows=346 width=28) (actual time=0.030..0.075 rows=312 loops=30)"
> >> " Recheck Cond: (telegram_id = ((66484 + (2195 * s.i))))"
> >> " -> Bitmap Index Scan on index_measure_data_telegram_id
> (cost=0.00..9.10 rows=346 width=0) (actual time=0.025..0.025 rows=312
> loops=30)"
> >> " Index Cond: (telegram_id = ((66484 + (2195 * s.i))))"
> >> "Total runtime: 3.714 ms"
> >>
> >>
> >> 2.when function is there: seq scan
> >>
> >> explain analyze
> >> SELECT md.*
> >> FROM measure_data md
> >> where telegram_id in
> >> (
> >> SELECT trunc((132363-66484) * random()) +66484
> >> FROM generate_series(1,30) as s(i)
> >> limit 30
> >> )
> >> ;
> >>
> >>
> >> "Hash Join (cost=1.65..490288.89 rows=10277280 width=28) (actual
> time=0.169..4894.847 rows=9360 loops=1)"
> >> " Hash Cond: ((md.telegram_id)::double precision =
> ((trunc((65879::double precision * random())) + 66484::double
> precision)))"
> >> " -> Seq Scan on measure_data md (cost=0.00..356682.60 rows=20554560
> width=28) (actual time=0.010..2076.932 rows=20554560 loops=1)"
> >> " -> Hash (cost=1.28..1.28 rows=30 width=8) (actual
> time=0.041..0.041 rows=30 loops=1)"
> >> " Buckets: 1024 Batches: 1 Memory Usage: 2kB"
> >> " -> HashAggregate (cost=0.98..1.28 rows=30 width=8) (actual
> time=0.034..0.036 rows=30 loops=1)"
> >> " -> Limit (cost=0.00..0.60 rows=30 width=0) (actual
> time=0.016..0.026 rows=30 loops=1)"
> >> " -> Function Scan on generate_series s (cost=0.00..20.00 rows=1000
> width=0) (actual time=0.015..0.023 rows=30 loops=1)"
> >> "Total runtime: 4895.239 ms"
> >>
> >>
> >> ----------------------------------------
> >>> Date: Wed, 14 May 2014 22:43:24 -0700
> >>> From: david.g.johnston@gmail.com<mailto:david.g.johnston@gmail.com>
> >>> To:
> pgsql-performance@postgresql.org<mailto:pgsql-performance@postgresql.org>
> >>> Subject: Re: [PERFORM] how do functions affect query plan?
> >>>
> >>> 常超 wrote
> >>>> Hi,all
> >>>> I have a table to save received measure data.
> >>>>
> >>>>
> >>>> CREATE TABLE measure_data
> >>>> (
> >>>> id serial NOT NULL,
> >>>> telegram_id integer NOT NULL,
> >>>> measure_time timestamp without time zone NOT NULL,
> >>>> item_id integer NOT NULL,
> >>>> val double precision,
> >>>> CONSTRAINT measure_data_pkey PRIMARY KEY (id)
> >>>> );
> >>>>
> >>>> CREATE INDEX index_measure_data_telegram_id ON measure_data USING btree
> >>>> (telegram_id);
> >>>>
> >>>>
> >>>> in my scenario,a telegram contains measure data for multiple data items
> >>>> and timestamps,
> >>>> BTW,another table is for telegram.
> >>>>
> >>>> The SQL I used in my application is
> >>>> select * from measure_data where telegram_id in(1,2,...,n)
> >>>> and this query used the index_measure_data_telegram_id index,as
> expected.
> >>>>
> >>>> In order to see the performance of my query ,
> >>>> I used the following query to search the measure data for randomly 30
> >>>> telegrams.
> >>>>
> >>>>
> >>>> explain analyze
> >>>> SELECT md.*
> >>>> FROM measure_data md
> >>>> where telegram_id in
> >>>> (
> >>>> SELECT distinct
> >>>> trunc((132363-66484) * random() + 66484)
> >>>> FROM generate_series(1,30) as s(telegram_id)
> >>>> )
> >>>> ;
> >>>>
> >>>> the 132363 and 66484 are the max and min of the telegram id,separately.
> >>>>
> >>>> What surprised me is that index is not used,instead,a seq scan is
> >>>> performed on measure_data.
> >>>> Although,intuitively,in this case,it is much wiser to use the index.
> >>>> Would you please give some clue to why this happened?
> >>>>
> >>>> "Hash Semi Join (cost=65.00..539169.32 rows=10277280 width=28) (actual
> >>>> time=76.454..17177.054 rows=9360 loops=1)"
> >>>> " Hash Cond: ((md.telegram_id)::double precision =
> (trunc(((65879::double
> >>>> precision * random()) + 66484::double precision))))"
> >>>> " -> Seq Scan on measure_data md (cost=0.00..356682.60 rows=20554560
> >>>> width=28) (actual time=0.012..13874.809 rows=20554560 loops=1)"
> >>>> " -> Hash (cost=52.50..52.50 rows=1000 width=8) (actual
> >>>> time=0.062..0.062 rows=30 loops=1)"
> >>>> " Buckets: 1024 Batches: 1 Memory Usage: 2kB"
> >>>> " -> HashAggregate (cost=22.50..42.50 rows=1000 width=0) (actual
> >>>> time=0.048..0.053 rows=30 loops=1)"
> >>>> " -> Function Scan on generate_series s (cost=0.00..20.00
> >>>> rows=1000 width=0) (actual time=0.020..0.034 rows=30 loops=1)"
> >>>> "Total runtime: 17177.527 ms"
> >>>
> >>> The planner expects to need to return half the table when you
> provide 1,000
> >>> distinct telegram_ids, which is best handled by scanning the whole table
> >>> sequentially and tossing out invalid data.
> >>>
> >>> I am curious if the plan will be different if you added a LIMIT 30 to the
> >>> sub-query.
> >>>
> >>> The root of the problem is the planner has no way of knowing whether
> >>> generate_series is going to return 1 or 1,000,000 rows so by
> default it (and
> >>> all functions) are assumed (by the planner) to return 1,000 rows.
> By adding
> >>> an explicit limit you can better inform the planner as to how many
> rows you
> >>> are going to be passing up to the parent query and it will
> hopefully, with
> >>> knowledge of only 30 distinct values, use the index.
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>> View this message in context:
> http://postgresql.1045698.n5.nabble.com/how-do-functions-affect-query-plan-tp5803993p5803996.html
> >>> Sent from the PostgreSQL - performance mailing list archive at
> Nabble.com.
> >>>
> >>>
> >>> --
> >>> Sent via pgsql-performance mailing list
> (pgsql-performance@postgresql.org<mailto:pgsql-performance@postgresql.org>)
> >>> To make changes to your subscription:
> >>> http://www.postgresql.org/mailpref/pgsql-performance
> >>
> >> --
> >> Sent via pgsql-performance mailing list
> (pgsql-performance@postgresql.org<mailto:pgsql-performance@postgresql.org>)
> >> To make changes to your subscription:
> >> http://www.postgresql.org/mailpref/pgsql-performance
> >
> > --
> > Sent via pgsql-performance mailing list
> (pgsql-performance@postgresql.org<mailto:pgsql-performance@postgresql.org>)
> > To make changes to your subscription:
> > http://www.postgresql.org/mailpref/pgsql-performance
>
>

pgsql-performance by date:

Previous
From: 楊新波
Date:
Subject: Re: FW: how do functions affect query plan?
Next
From: Scott Marlowe
Date:
Subject: Query plan good in 8.4, bad in 9.2 and better in 9.3