Thread: Slow set-returning functions
Hi, I have been having difficulty with some functions which return sets of rows. The functions seem to run very slowly, even though the queries they run execute very quicky if I run them directly from psgl. Typically these queries are only returning a few hundred rows with my real data. I have had difficulty coming up with a simple test case, but the code below usually shows the same problem. Sometimes I have to run the setup code a few times before it happens - not sure why (I would expect this to be deterministic), but perhaps there is some randomness introduced by the sampling done by the analyse. The function foo() which has a hard-coded LIMIT always executes quickly (comparable to running the query directly). However, the function foo(int) which is passed the same LIMIT as a parameter executes around 30 times slower. The only difference is that the LIMIT is a parameter to the function, although the LIMIT isn't reached anyway in this case. Sometimes running this same script generates data for which this function executes as fast as the other one (which is always fast). I am finding this all very puzzling. Am I just doing something silly? Is there any way that I can see what execution plan is being used internally by the functions? Thanks, Dean. DROP FUNCTION IF EXISTS setup(); CREATE FUNCTION setup() RETURNS void AS $$ DECLARE i int; c1 int; c2 int; c3 int; BEGIN DROP TABLE IF EXISTS foo CASCADE; CREATE TABLE foo (id int PRIMARY KEY, name text); i := 1; c1 := ascii('A'); WHILE c1 <= ascii('Z') LOOP c2 := ascii('a'); WHILE c2 <= ascii('z') LOOP c3 := ascii('a'); WHILE c3 <= ascii('z') LOOP INSERT INTO foo VALUES(i, chr(c1)||chr(c2)||'o'); i := i+1; c3 := c3+1; END LOOP; c2 := c2+1; END LOOP; c1 := c1+1; END LOOP; CREATE INDEX foo_name_idx ON foo(lower(name) text_pattern_ops); END; $$ LANGUAGE plpgsql; SELECT setup(); ANALYZE foo; DROP FUNCTION IF EXISTS foo(); CREATE FUNCTION foo() RETURNS SETOF int AS $$ SELECT id FROM foo WHERE lower(name) LIKE 'foo' ORDER BY id OFFSET 0 LIMIT 100; $$ LANGUAGE SQL STABLE; DROP FUNCTION IF EXISTS foo(int); CREATE FUNCTION foo(int) RETURNS SETOF int AS $$ SELECT id FROM foo WHERE lower(name) LIKE 'foo' ORDER BY id OFFSET 0 LIMIT $1; $$ LANGUAGE SQL STABLE; EXPLAIN ANALYZE SELECT id FROM foo WHERE name ILIKE 'foo' ORDER BY id OFFSET 0 LIMIT 100; EXPLAIN ANALYZE SELECT id FROM foo WHERE lower(name) LIKE 'foo' ORDER BY id OFFSET 0 LIMIT 100; EXPLAIN ANALYZE SELECT * FROM foo(); EXPLAIN ANALYZE SELECT * FROM foo(100); QUERY PLAN --------------------------------------------------------------------------------------------------------------- Limit (cost=336.52..336.77 rows=100 width=4) (actual time=13.377..13.507 rows=26 loops=1) -> Sort (cost=336.52..336.86 rows=136 width=4) (actual time=13.369..13.415 rows=26 loops=1) Sort Key: id -> Seq Scan on foo (cost=0.00..331.70 rows=136 width=4) (actual time=2.462..13.267 rows=26 loops=1) Filter: (name ~~* 'foo'::text) Total runtime: 13.627 ms (6 rows) QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------- Limit (cost=68.13..68.19 rows=26 width=4) (actual time=0.345..0.475 rows=26 loops=1) -> Sort (cost=68.13..68.19 rows=26 width=4) (actual time=0.338..0.381 rows=26 loops=1) Sort Key: id -> Bitmap Heap Scan on foo (cost=4.46..67.52 rows=26 width=4) (actual time=0.164..0.257 rows=26 loops=1) Filter: (lower(name) ~~ 'foo'::text) -> Bitmap Index Scan on foo_name_idx (cost=0.00..4.45 rows=26 width=0) (actual time=0.109..0.109 rows=26loops=1) Index Cond: (lower(name) ~=~ 'foo'::text) Total runtime: 0.597 ms (8 rows) QUERY PLAN ------------------------------------------------------------------------------------------------------- Function Scan on foo (cost=0.00..12.50 rows=1000 width=4) (actual time=1.524..1.570 rows=26 loops=1) Total runtime: 1.647 ms (2 rows) QUERY PLAN --------------------------------------------------------------------------------------------------------- Function Scan on foo (cost=0.00..12.50 rows=1000 width=4) (actual time=21.496..21.545 rows=26 loops=1) Total runtime: 21.636 ms (2 rows) [PostgreSQL 8.2.5, tested on SUSE Linux 10.3, 64-bit and Windows 2000] _________________________________________________________________ Who's friends with who and co-starred in what? http://www.searchgamesbox.com/celebrityseparation.shtml
Dean Rasheed wrote: > I have been having difficulty with some functions which return sets of > rows. The functions seem to run very slowly, even though the queries > they run execute very quicky if I run them directly from psgl. > Typically these queries are only returning a few hundred rows with my > real data. > > I have had difficulty coming up with a simple test case, but the code > below usually shows the same problem. Sometimes I have to run the > setup code a few times before it happens - not sure why (I would > expect this to be deterministic), but perhaps there is some randomness > introduced by the sampling done by the analyse. > > The function foo() which has a hard-coded LIMIT always executes > quickly (comparable to running the query directly). > > However, the function foo(int) which is passed the same LIMIT as a > parameter executes around 30 times slower. The only difference is that > the LIMIT is a parameter to the function, although the LIMIT isn't > reached anyway in this case. Sometimes running this same script > generates data for which this function executes as fast as the other > one (which is always fast). This is clearly because the planner doesn't know what the value for the parameter will be at run time, so it chooses a plan that's not optimal for LIMIT 100. > Is there any way that I can see what execution plan is being used > internally by the functions? Not directly, but you can do this: postgres=# PREPARE p (int4) AS SELECT id FROM foo WHERE lower(name) LIKE 'foo' ORDER BY id OFFSET 0 LIMIT $1; PREPARE postgres=# EXPLAIN EXECUTE p(100); QUERY PLAN ----------------------------------------------------------------------------- Limit (cost=0.00..49.18 rows=2 width=4) -> Index Scan using foo_pkey on foo (cost=0.00..614.77 rows=25 width=4) Filter: (lower(name) ~~ 'foo'::text) (3 rows) You could work around that by using EXECUTE in the plpgsql function, to force the query to be planned on every execution with the actual value of the LIMIT. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Jan 20, 2008 9:34 AM, Heikki Linnakangas <heikki@enterprisedb.com> wrote: > Dean Rasheed wrote: > > I have been having difficulty with some functions which return sets of > > rows. The functions seem to run very slowly, even though the queries > > they run execute very quicky if I run them directly from psgl. > > Typically these queries are only returning a few hundred rows with my > > real data. > > > > I have had difficulty coming up with a simple test case, but the code > > below usually shows the same problem. Sometimes I have to run the > > setup code a few times before it happens - not sure why (I would > > expect this to be deterministic), but perhaps there is some randomness > > introduced by the sampling done by the analyse. > > > > The function foo() which has a hard-coded LIMIT always executes > > quickly (comparable to running the query directly). > > > > However, the function foo(int) which is passed the same LIMIT as a > > parameter executes around 30 times slower. The only difference is that > > the LIMIT is a parameter to the function, although the LIMIT isn't > > reached anyway in this case. Sometimes running this same script > > generates data for which this function executes as fast as the other > > one (which is always fast). > > This is clearly because the planner doesn't know what the value for the > parameter will be at run time, so it chooses a plan that's not optimal > for LIMIT 100. > > > Is there any way that I can see what execution plan is being used > > internally by the functions? prepared statements have the same problem. IIRC the planner assumes 10%, which will often drop to a seqscan or a bitmap index scan. Some years back I argued (unsuccessfully) to have the planner guess 100 rows or something like that. Ideally, I think it would generate the plan from the value passed into the first invocation of the function. merlin
Thanks for the replies. Converting the functions to plpgsql and using EXECUTE works a treat. On the real data, one of my functions is now over 50x faster :-) Dean > Date: Sun, 20 Jan 2008 10:25:48 -0500 > From: mmoncure@gmail.com > To: heikki@enterprisedb.com > Subject: Re: [PERFORM] Slow set-returning functions > CC: dean_rasheed@hotmail.com; pgsql-performance@postgresql.org > > On Jan 20, 2008 9:34 AM, Heikki Linnakangas wrote: >> Dean Rasheed wrote: >>> I have been having difficulty with some functions which return sets of >>> rows. The functions seem to run very slowly, even though the queries >>> they run execute very quicky if I run them directly from psgl. >>> Typically these queries are only returning a few hundred rows with my >>> real data. >>> >>> I have had difficulty coming up with a simple test case, but the code >>> below usually shows the same problem. Sometimes I have to run the >>> setup code a few times before it happens - not sure why (I would >>> expect this to be deterministic), but perhaps there is some randomness >>> introduced by the sampling done by the analyse. >>> >>> The function foo() which has a hard-coded LIMIT always executes >>> quickly (comparable to running the query directly). >>> >>> However, the function foo(int) which is passed the same LIMIT as a >>> parameter executes around 30 times slower. The only difference is that >>> the LIMIT is a parameter to the function, although the LIMIT isn't >>> reached anyway in this case. Sometimes running this same script >>> generates data for which this function executes as fast as the other >>> one (which is always fast). >> >> This is clearly because the planner doesn't know what the value for the >> parameter will be at run time, so it chooses a plan that's not optimal >> for LIMIT 100. >> >>> Is there any way that I can see what execution plan is being used >>> internally by the functions? > > prepared statements have the same problem. IIRC the planner assumes > 10%, which will often drop to a seqscan or a bitmap index scan. Some > years back I argued (unsuccessfully) to have the planner guess 100 > rows or something like that. Ideally, I think it would generate the > plan from the value passed into the first invocation of the function. > > merlin _________________________________________________________________ Get Hotmail on your mobile, text MSN to 63463! http://mobile.uk.msn.com/pc/mail.aspx
"Merlin Moncure" <mmoncure@gmail.com> writes: > prepared statements have the same problem. IIRC the planner assumes > 10%, which will often drop to a seqscan or a bitmap index scan. Some > years back I argued (unsuccessfully) to have the planner guess 100 > rows or something like that. Ideally, I think it would generate the > plan from the value passed into the first invocation of the function. I believe it's the case that that will happen now in the same contexts where the planner uses the first value of any other parameter (ie, unnamed statements in the extended-query protocol). regards, tom lane
>> Is there any way that I can see what execution plan is being used >> internally by the functions? >> > > Not directly, but you can do this: > > > postgres=# PREPARE p (int4) AS SELECT id FROM foo WHERE lower(name) LIKE > 'foo' ORDER BY id OFFSET 0 LIMIT $1; > PREPARE > > postgres=# EXPLAIN EXECUTE p(100); QUERY PLAN > ----------------------------------------------------------------------------- > Limit (cost=0.00..49.18 rows=2 width=4) > > -> Index Scan using foo_pkey on foo (cost=0.00..614.77 rows=25 width=4) > Filter: (lower(name) ~~ 'foo'::text) > (3 rows) I think that having the ability to see the execution plans being used by queries inside functions would be quite useful. More generally, I would like to be able to log the execution plans of all queries issued by an application (in my case I am working on a web application, where some of the queries are auto-generated by Hibernate). I've tried setting debug_print_plan, but the results are a little hard to interpret. As an experiment, I have tried hacking around a little with the code. This is my first foray into the source code, so I might well be missing something, but basically I added a new configuration parameter debug_explain_plan which causes all queries to be instrumented and ExecutorRun() to call explain_outNode() at the end, logging the results at level DEBUG1. It seems to work quite well as a development aid for my web application. It is also useful from a psql session (similar to AUTOTRACE in Oracle): test=# create table foo(a int primary key); NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "foo_pkey" for table "foo" CREATE TABLE test=# insert into foo values (1), (2), (3), (4), (5); INSERT 0 5 test=# set debug_explain_plan=true; SET test=# set client_min_messages=debug1; SET test=# select count(*) from foo where a>3; DEBUG: ------------------- query plan ------------------- DETAIL: Aggregate (cost=32.45..32.46 rows=1 width=0) (actual time=0.066..0.068 rows=1 loops=1) -> Bitmap Heap Scan on foo (cost=10.45..30.45 rows=800 width=0) (actual time=0.039..0.043 rows=2 loops=1) Recheck Cond: (a> 3) -> Bitmap Index Scan on foo_pkey (cost=0.00..10.25 rows=800 width=0) (actual time=0.025..0.025 rows=2 loops=1) Index Cond: (a> 3) Query runtime: 0.089 ms count ------- 2 (1 row) test=# create function foo() returns int as 'select max(a) from foo;' language sql stable; CREATE FUNCTION test=# select * from foo where a=foo(); DEBUG: ------------------- query plan ------------------- DETAIL: Result (cost=0.04..0.05 rows=1 width=0) (actual time=0.044..0.044 rows=1 loops=1) InitPlan -> Limit (cost=0.00..0.04 rows=1 width=4) (actual time=0.032..0.034 rows=1 loops=1) -> Index Scan Backward using foo_pkey on foo (cost=0.00..84.25 rows=2400 width=4) (actual time=0.025..0.025 rows=1loops=1) Filter: (a IS NOT NULL) Query runtime: 0.050 ms CONTEXT: SQL function "foo" statement 1 DEBUG: ------------------- query plan ------------------- DETAIL: Result (cost=0.04..0.05 rows=1 width=0) (actual time=0.037..0.037 rows=1 loops=1) InitPlan -> Limit (cost=0.00..0.04 rows=1 width=4) (actual time=0.027..0.029 rows=1 loops=1) -> Index Scan Backward using foo_pkey on foo (cost=0.00..84.25 rows=2400 width=4) (actual time=0.021..0.021 rows=1loops=1) Filter: (a IS NOT NULL) Query runtime: 0.044 ms CONTEXT: SQL function "foo" statement 1 DEBUG: ------------------- query plan ------------------- DETAIL: Index Scan using foo_pkey on foo (cost=0.25..8.52 rows=1 width=4) (actual time=1.638..1.642 rows=1 loops=1) Index Cond: (a = foo()) Query runtime: 1.686 ms a --- 5 (1 row) (Curious that foo() is being executed twice in this case). The CONTEXT is very useful, particularly when functions call other functions, since it gives the call stack (presumably only for SQL and PL/pgSQL functions). For top-level queries I would ideally like the CONTEXT to log the SQL being executed, but I couldn't figure out how to access that information. Anyway, I'd be interested to know if anyone has thought about doing anything like this before and if anyone else might find this useful. Dean _________________________________________________________________ Free games, great prizes - get gaming at Gamesbox. http://www.searchgamesbox.com
Dnia 27-01-2008, N o godzinie 17:29 +0000, Dean Rasheed pisze: > The CONTEXT is very useful, particularly when functions call other > functions, since it gives the call stack (presumably only for SQL and > PL/pgSQL functions). For top-level queries I would ideally like the > CONTEXT to log the SQL being executed, but I couldn't figure out how > to access that information. > > Anyway, I'd be interested to know if anyone has thought about doing > anything like this before and if anyone else might find this useful. > > Dean I'd love to see that. The mentioned PREPARE workaround doesn't seem to work when executed function calls for example three another (or I don't know how to use it in such situation) - and is generally painful to use. I'm afraid I can't help you much (besides testing) but I'd be more than interested in such improvement. Regards, Marcin
On Jan 27, 2008 12:29 PM, Dean Rasheed <dean_rasheed@hotmail.com> wrote: > >> Is there any way that I can see what execution plan is being used > >> internally by the functions? > >> > > > > Not directly, but you can do this: > > > > > > postgres=# PREPARE p (int4) AS SELECT id FROM foo WHERE lower(name) LIKE > > 'foo' ORDER BY id OFFSET 0 LIMIT $1; > > PREPARE > > > > postgres=# EXPLAIN EXECUTE p(100); QUERY PLAN > > ----------------------------------------------------------------------------- > > Limit (cost=0.00..49.18 rows=2 width=4) > > > > -> Index Scan using foo_pkey on foo (cost=0.00..614.77 rows=25 width=4) > > Filter: (lower(name) ~~ 'foo'::text) > > (3 rows) > > > I think that having the ability to see the execution plans being used > by queries inside functions would be quite useful. > > More generally, I would like to be able to log the execution plans of > all queries issued by an application (in my case I am working on a web > application, where some of the queries are auto-generated by > Hibernate). I've tried setting debug_print_plan, but the results are a > little hard to interpret. > > As an experiment, I have tried hacking around a little with the code. > This is my first foray into the source code, so I might well be > missing something, but basically I added a new configuration parameter > debug_explain_plan which causes all queries to be instrumented and > ExecutorRun() to call explain_outNode() at the end, logging the > results at level DEBUG1. I read your email, blinked twice, and thought: where have you been all my life! :-) (IOW, +1) merlin
I've posted the patch here: http://archives.postgresql.org/pgsql-patches/2008-01/msg00123.php Dean. _________________________________________________________________ Get Hotmail on your mobile, text MSN to 63463! http://mobile.uk.msn.com/pc/mail.aspx