Thread: Performance problems with multiple layers of functions
Hi there. I have hit an edge case in the planner and I hope you can help. The system uses a lot of stored procedures to move as much of the intelligence into the database layer as possible. My (development) query looks like this and runs reasonably fast:

explain analyze select dataset_id, entity, sum(amount)
from entrydata_current
where flow_direction in (select * from outflow_direction(dataset_id))
and dataset_id in (122,112,125,89,111)
group by dataset_id, entity;

                                                                           QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=918171.00..918171.30 rows=24 width=19) (actual time=11533.297..11533.340 rows=50 loops=1)
   ->  Bitmap Heap Scan on entrydata_current  (cost=676.72..917736.04 rows=57994 width=19) (actual time=23.921..11425.373 rows=37870 loops=1)
         Recheck Cond: ((dataset_id = 122) OR (dataset_id = 112) OR (dataset_id = 125) OR (dataset_id = 89) OR (dataset_id = 111))
         Filter: (subplan)
         ->  BitmapOr  (cost=676.72..676.72 rows=117633 width=0) (actual time=15.765..15.765 rows=0 loops=1)
               ->  Bitmap Index Scan on entrydata_current_dataset_idx  (cost=0.00..83.97 rows=14563 width=0) (actual time=1.881..1.881 rows=13728 loops=1)
                     Index Cond: (dataset_id = 122)
               ->  Bitmap Index Scan on entrydata_current_dataset_idx  (cost=0.00..156.12 rows=27176 width=0) (actual time=3.508..3.508 rows=25748 loops=1)
                     Index Cond: (dataset_id = 112)
               ->  Bitmap Index Scan on entrydata_current_dataset_idx  (cost=0.00..124.24 rows=21498 width=0) (actual time=2.729..2.729 rows=20114 loops=1)
                     Index Cond: (dataset_id = 125)
               ->  Bitmap Index Scan on entrydata_current_dataset_idx  (cost=0.00..102.20 rows=17771 width=0) (actual time=2.351..2.351 rows=17344 loops=1)
                     Index Cond: (dataset_id = 89)
               ->  Bitmap Index Scan on entrydata_current_dataset_idx  (cost=0.00..210.19 rows=36625 width=0) (actual time=5.292..5.292 rows=37118 loops=1)
                     Index Cond: (dataset_id = 111)
         SubPlan
           ->  Function Scan on outflow_direction  (cost=0.00..12.50 rows=1000 width=4) (actual time=0.093..0.095 rows=4 loops=114052)
 Total runtime: 11540.506 ms
(18 rows)

The problem is that the application should not need to know the five dataset_ids (it will always know one - its own). So I made a function to return the five ids, and then the query looks like:

explain select dataset_id, entity, sum(amount)
from entrydata_current
where flow_direction in (select * from outflow_direction(dataset_id))
and dataset_id in (select * from get_dataset_ids(122))
group by dataset_id, entity;

                                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=24672195.68..24672203.88 rows=656 width=19)
   ->  Hash IN Join  (cost=15.00..24660005.45 rows=1625364 width=19)
         Hash Cond: ("outer".dataset_id = "inner".get_dataset_ids)
         ->  Index Scan using entrydata_current_dataset_idx on entrydata_current  (cost=0.00..24558405.20 rows=1625364 width=19)
               Filter: (subplan)
               SubPlan
                 ->  Function Scan on outflow_direction  (cost=0.00..12.50 rows=1000 width=4)
         ->  Hash  (cost=12.50..12.50 rows=1000 width=4)
               ->  Function Scan on get_dataset_ids  (cost=0.00..12.50 rows=1000 width=4)
(9 rows)

which does not return within 10 minutes - which is unacceptable.

Is there any way to get a better plan for the second query? The planner should really see the two queries as equal, as there are no dependencies between the outer query and get_dataset_ids (isn't it called constant folding?).

Thanks in advance

Svenne
On Fri, Mar 24, 2006 at 01:49:17PM +0100, Svenne Krap wrote:
> explain select dataset_id, entity, sum(amount) from entrydata_current
> where flow_direction in (select * from outflow_direction(dataset_id))
> and dataset_id in (select * from get_dataset_ids(122)) group by
> dataset_id, entity;
<snip>
> which does not return within 10 minutes - which is unacceptable.

The issue is that the planner has no way to know what's coming back from get_dataset_ids. I think your best bet will be to wrap that select into its own function and have that function prepare the query statement, going back to hard-coded values. So you could do something like:

SQL := 'SELECT ... AND dataset_id IN (''' || get_dataset_ids(122) || ''');'

(yeah, I know that won't work as written, but you get the idea).
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
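Jim's idea - expand the ids into literal constants so the planner sees them at plan time - can be sketched as a plpgsql set-returning function that builds the statement dynamically. This is only a sketch under assumptions: the function name sum_by_entity and its return columns are made up, and it uses modern syntax (string_agg, RETURN QUERY EXECUTE); on an 8.1-era server you would concatenate the list in a FOR loop and emit rows with RETURN NEXT instead.

```sql
-- Hypothetical wrapper: expand get_dataset_ids() into a literal IN list
-- so the planner works with constants instead of an opaque function.
CREATE OR REPLACE FUNCTION sum_by_entity(p_dataset_id int)
RETURNS TABLE (dataset_id int, entity text, total numeric)
LANGUAGE plpgsql AS $$
DECLARE
    id_list text;
BEGIN
    -- Flatten the function result into a comma-separated literal list.
    SELECT string_agg(id::text, ',') INTO id_list
    FROM get_dataset_ids(p_dataset_id) AS id;

    -- Plan the query with the ids hard-coded, as in the original fast form.
    RETURN QUERY EXECUTE
        'SELECT dataset_id, entity, sum(amount)
           FROM entrydata_current
          WHERE flow_direction IN (SELECT * FROM outflow_direction(dataset_id))
            AND dataset_id IN (' || id_list || ')
          GROUP BY dataset_id, entity';
END;
$$;
```

Note that EXECUTE plans the statement at run time, so each call re-plans with the current literal ids - which is exactly the point here.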
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> On Fri, Mar 24, 2006 at 01:49:17PM +0100, Svenne Krap wrote:
>> explain select dataset_id, entity, sum(amount) from entrydata_current
>> where flow_direction in (select * from outflow_direction(dataset_id))
>> and dataset_id in (select * from get_dataset_ids(122)) group by
>> dataset_id, entity;

> The issue is that the planner has no way to know what's coming back
> from get_dataset_ids.

More specifically, the first IN is not optimizable into a join because the results of the sub-SELECT depend on the current row of the outer query. The second IN is being optimized fine, but the first one is what's killing you.

I'd suggest refactoring the functions into something that returns a set of outflow_direction/dataset_id pairs, and then phrasing the query as

    where (flow_direction, dataset_id) in (select * from new_func(122))

You could do it without refactoring:

    where (flow_direction, dataset_id) in
          (select outflow_direction(id), id from get_dataset_ids(122) id)

however this won't work if outflow_direction() is a plpgsql function, because of limitations in plpgsql's set-function support. (It will work if outflow_direction() is a SQL function, or you could kluge it as a SQL function wrapper around a plpgsql function.)

			regards, tom lane
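Tom's refactoring can be sketched as a single set-returning SQL function that emits the pairs. The body below is an assumption about how outflow_direction and get_dataset_ids compose (the thread never shows their definitions), and the type and function names flow_pair/new_func are illustrative:

```sql
-- Hypothetical pair-returning function, following Tom's suggestion.
-- Assumes get_dataset_ids returns SETOF int and outflow_direction
-- returns SETOF int for a given dataset id.
CREATE TYPE flow_pair AS (flow_direction int, dataset_id int);

CREATE OR REPLACE FUNCTION new_func(p_dataset_id int)
RETURNS SETOF flow_pair
LANGUAGE sql STABLE AS $$
    -- Set-returning function in the select list, per Tom's unrefactored form.
    SELECT outflow_direction(id) AS flow_direction, id AS dataset_id
    FROM get_dataset_ids(p_dataset_id) AS id;
$$;

-- The query then becomes a single optimizable multi-column IN:
SELECT dataset_id, entity, sum(amount)
FROM entrydata_current
WHERE (flow_direction, dataset_id) IN (SELECT * FROM new_func(122))
GROUP BY dataset_id, entity;
```

Because the subquery no longer references the outer row, the planner can turn the IN into a join, as it did for get_dataset_ids alone.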
Tom Lane wrote:
> where (flow_direction, dataset_id) in (select * from new_func(122))

Is this form of multi-column IN mentioned anywhere in the docs? I can't find it.

Svenne
Svenne Krap <svenne@krap.dk> writes:
> Tom Lane wrote:
>> where (flow_direction, dataset_id) in (select * from new_func(122))

> Is this form of multi-column IN mentioned anywhere in the docs? I can't
> find it.

Sure, look under "Subquery Expressions". 8.0 and later refer to it as a row_constructor, but it's documented at least as far back as 7.3.

			regards, tom lane
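For readers who, like Svenne, haven't met the row-constructor form before, a minimal generic illustration (table and column names here are invented, not from the thread):

```sql
-- Multi-column IN: the row constructor (o.customer_id, o.region) is
-- compared as a whole against each row produced by the subquery.
SELECT *
FROM orders o
WHERE (o.customer_id, o.region) IN
      (SELECT customer_id, region FROM preferred_customers);
```

This is equivalent to a semi-join on both columns at once, which is what lets the planner optimize it.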
Whoa!

bookmark_delta contains very few rows but is inserted/deleted very often... the effect is spectacular! I guess I'll have to vacuum analyze this table every minute...

annonces=# EXPLAIN ANALYZE SELECT id, priority FROM annonces WHERE id IN (SELECT annonce_id FROM bookmark_delta);
                                                        QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Hash IN Join  (cost=32.12..8607.08 rows=1770 width=6) (actual time=387.011..387.569 rows=1 loops=1)
   Hash Cond: ("outer".id = "inner".annonce_id)
   ->  Seq Scan on annonces  (cost=0.00..7796.00 rows=101500 width=6) (actual time=0.022..164.369 rows=101470 loops=1)
   ->  Hash  (cost=27.70..27.70 rows=1770 width=4) (actual time=0.013..0.013 rows=5 loops=1)
         ->  Seq Scan on bookmark_delta  (cost=0.00..27.70 rows=1770 width=4) (actual time=0.004..0.010 rows=5 loops=1)
 Total runtime: 387.627 ms
(6 lignes)

annonces=# EXPLAIN ANALYZE SELECT id, priority FROM annonces a, (SELECT annonce_id FROM bookmark_delta GROUP BY annonce_id) foo WHERE a.id=foo.annonce_id;
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=32.12..10409.31 rows=1770 width=6) (actual time=0.081..0.084 rows=1 loops=1)
   ->  HashAggregate  (cost=32.12..49.83 rows=1770 width=4) (actual time=0.038..0.040 rows=1 loops=1)
         ->  Seq Scan on bookmark_delta  (cost=0.00..27.70 rows=1770 width=4) (actual time=0.024..0.027 rows=5 loops=1)
   ->  Index Scan using annonces_pkey on annonces a  (cost=0.00..5.83 rows=1 width=6) (actual time=0.039..0.040 rows=1 loops=1)
         Index Cond: (a.id = "outer".annonce_id)
 Total runtime: 0.163 ms
(6 lignes)

annonces=# vacuum bookmark_delta;
VACUUM
annonces=# EXPLAIN ANALYZE SELECT id, priority FROM annonces WHERE id IN (SELECT annonce_id FROM bookmark_delta);
                                                        QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Hash IN Join  (cost=32.12..8607.08 rows=1770 width=6) (actual time=195.284..196.063 rows=1 loops=1)
   Hash Cond: ("outer".id = "inner".annonce_id)
   ->  Seq Scan on annonces  (cost=0.00..7796.00 rows=101500 width=6) (actual time=0.014..165.626 rows=101470 loops=1)
   ->  Hash  (cost=27.70..27.70 rows=1770 width=4) (actual time=0.008..0.008 rows=2 loops=1)
         ->  Seq Scan on bookmark_delta  (cost=0.00..27.70 rows=1770 width=4) (actual time=0.003..0.004 rows=2 loops=1)
 Total runtime: 196.122 ms
(6 lignes)

annonces=# vacuum analyze bookmark_delta;
VACUUM
annonces=# EXPLAIN ANALYZE SELECT id, priority FROM annonces WHERE id IN (SELECT annonce_id FROM bookmark_delta);
                                                           QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=1.02..6.88 rows=1 width=6) (actual time=0.025..0.027 rows=1 loops=1)
   ->  HashAggregate  (cost=1.02..1.03 rows=1 width=4) (actual time=0.011..0.012 rows=1 loops=1)
         ->  Seq Scan on bookmark_delta  (cost=0.00..1.02 rows=2 width=4) (actual time=0.004..0.006 rows=2 loops=1)
   ->  Index Scan using annonces_pkey on annonces  (cost=0.00..5.83 rows=1 width=6) (actual time=0.009..0.010 rows=1 loops=1)
         Index Cond: (annonces.id = "outer".annonce_id)
 Total runtime: 0.104 ms
(6 lignes)
On 24.03.2006, at 23:54, PFC wrote:

> bookmark_delta contains very few rows but is inserted/deleted very
> often... the effect is spectacular!
> I guess I'll have to vacuum analyze this table every minute...

What about using autovacuum?

cug
--
PharmaLine, Essen, GERMANY
Software and Database Development
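For context: on the 8.1-era servers in this thread, autovacuum is built in but disabled by default, and it needs row-level statistics collection. A minimal postgresql.conf sketch (the numeric values below are illustrative, not recommendations from the thread):

```
# postgresql.conf - enabling integrated autovacuum (8.1-era syntax)
stats_start_collector = on    # statistics collector must be running
stats_row_level      = on     # row-level stats are required by autovacuum
autovacuum           = on
autovacuum_naptime   = 60     # seconds between checks of each database
```

With per-table thresholds, a small hot table like bookmark_delta gets vacuumed and analyzed as soon as enough rows have changed, without a cron-driven "vacuum analyze every minute".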