Thread: Slow set-returning functions

Slow set-returning functions

From
Dean Rasheed
Date:
Hi,

I have been having difficulty with some functions which return sets of
rows. The functions seem to run very slowly, even though the queries
they run execute very quicky if I run them directly from psgl.
Typically these queries are only returning a few hundred rows with my
real data.

I have had difficulty coming up with a simple test case, but the code
below usually shows the same problem. Sometimes I have to run the
setup code a few times before it happens - not sure why (I would
expect this to be deterministic), but perhaps there is some randomness
introduced by the sampling done by the analyse.

The function foo() which has a hard-coded LIMIT always executes
quickly (comparable to running the query directly).

However, the function foo(int) which is passed the same LIMIT as a
parameter executes around 30 times slower. The only difference is that
the LIMIT is a parameter to the function, although the LIMIT isn't
reached anyway in this case. Sometimes running this same script
generates data for which this function executes as fast as the other
one (which is always fast).

I am finding this all very puzzling. Am I just doing something silly?

Is there any way that I can see what execution plan is being used
internally by the functions?

Thanks,

Dean.



DROP FUNCTION IF EXISTS setup();
CREATE FUNCTION setup()
RETURNS void AS
$$
DECLARE
  i int;
  c1 int;
  c2 int;
  c3 int;
BEGIN
  DROP TABLE IF EXISTS foo CASCADE;
  CREATE TABLE foo (id int PRIMARY KEY, name text);

  i := 1;
  c1 := ascii('A');
  WHILE c1 <= ascii('Z') LOOP
    c2 := ascii('a');
    WHILE c2 <= ascii('z') LOOP
      c3 := ascii('a');
      WHILE c3 <= ascii('z') LOOP
        INSERT INTO foo VALUES(i, chr(c1)||chr(c2)||'o');
        i := i+1;
        c3 := c3+1;
      END LOOP;
      c2 := c2+1;
    END LOOP;
    c1 := c1+1;
  END LOOP;

  CREATE INDEX foo_name_idx ON foo(lower(name) text_pattern_ops);
END;
$$ LANGUAGE plpgsql;

SELECT setup();
ANALYZE foo;

DROP FUNCTION IF EXISTS foo();
CREATE FUNCTION foo() RETURNS SETOF int AS
$$
  SELECT id FROM foo WHERE lower(name) LIKE 'foo' ORDER BY id OFFSET 0 LIMIT 100;
$$ LANGUAGE SQL STABLE;

DROP FUNCTION IF EXISTS foo(int);
CREATE FUNCTION foo(int) RETURNS SETOF int AS
$$
  SELECT id FROM foo WHERE lower(name) LIKE 'foo' ORDER BY id OFFSET 0 LIMIT $1;
$$ LANGUAGE SQL STABLE;

EXPLAIN ANALYZE SELECT id FROM foo WHERE name ILIKE 'foo' ORDER BY id OFFSET 0 LIMIT 100;
EXPLAIN ANALYZE SELECT id FROM foo WHERE lower(name) LIKE 'foo' ORDER BY id OFFSET 0 LIMIT 100;
EXPLAIN ANALYZE SELECT * FROM foo();
EXPLAIN ANALYZE SELECT * FROM foo(100);



                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Limit  (cost=336.52..336.77 rows=100 width=4) (actual time=13.377..13.507 rows=26 loops=1)
   ->  Sort  (cost=336.52..336.86 rows=136 width=4) (actual time=13.369..13.415 rows=26 loops=1)
         Sort Key: id
         ->  Seq Scan on foo  (cost=0.00..331.70 rows=136 width=4) (actual time=2.462..13.267 rows=26 loops=1)
               Filter: (name ~~* 'foo'::text)
 Total runtime: 13.627 ms
(6 rows)

                                                            QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=68.13..68.19 rows=26 width=4) (actual time=0.345..0.475 rows=26 loops=1)
   ->  Sort  (cost=68.13..68.19 rows=26 width=4) (actual time=0.338..0.381 rows=26 loops=1)
         Sort Key: id
         ->  Bitmap Heap Scan on foo  (cost=4.46..67.52 rows=26 width=4) (actual time=0.164..0.257 rows=26 loops=1)
               Filter: (lower(name) ~~ 'foo'::text)
               ->  Bitmap Index Scan on foo_name_idx  (cost=0.00..4.45 rows=26 width=0) (actual time=0.109..0.109
rows=26loops=1) 
                     Index Cond: (lower(name) ~=~ 'foo'::text)
 Total runtime: 0.597 ms
(8 rows)

                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Function Scan on foo  (cost=0.00..12.50 rows=1000 width=4) (actual time=1.524..1.570 rows=26 loops=1)
 Total runtime: 1.647 ms
(2 rows)

                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Function Scan on foo  (cost=0.00..12.50 rows=1000 width=4) (actual time=21.496..21.545 rows=26 loops=1)
 Total runtime: 21.636 ms
(2 rows)


[PostgreSQL 8.2.5, tested on SUSE Linux 10.3, 64-bit and Windows 2000]

_________________________________________________________________
Who's friends with who and co-starred in what?
http://www.searchgamesbox.com/celebrityseparation.shtml

Re: Slow set-returning functions

From
"Heikki Linnakangas"
Date:
Dean Rasheed wrote:
> I have been having difficulty with some functions which return sets of
> rows. The functions seem to run very slowly, even though the queries
> they run execute very quicky if I run them directly from psgl.
> Typically these queries are only returning a few hundred rows with my
> real data.
>
> I have had difficulty coming up with a simple test case, but the code
> below usually shows the same problem. Sometimes I have to run the
> setup code a few times before it happens - not sure why (I would
> expect this to be deterministic), but perhaps there is some randomness
> introduced by the sampling done by the analyse.
>
> The function foo() which has a hard-coded LIMIT always executes
> quickly (comparable to running the query directly).
>
> However, the function foo(int) which is passed the same LIMIT as a
> parameter executes around 30 times slower. The only difference is that
> the LIMIT is a parameter to the function, although the LIMIT isn't
> reached anyway in this case. Sometimes running this same script
> generates data for which this function executes as fast as the other
> one (which is always fast).

This is clearly because the planner doesn't know what the value for the
parameter will be at run time, so it chooses a plan that's not optimal
for LIMIT 100.

> Is there any way that I can see what execution plan is being used
> internally by the functions?

Not directly, but you can do this:

postgres=# PREPARE p (int4) AS SELECT id FROM foo WHERE lower(name) LIKE
'foo' ORDER BY id OFFSET 0 LIMIT $1;
PREPARE
postgres=# EXPLAIN EXECUTE p(100);                                 QUERY
PLAN
-----------------------------------------------------------------------------
  Limit  (cost=0.00..49.18 rows=2 width=4)
    ->  Index Scan using foo_pkey on foo  (cost=0.00..614.77 rows=25
width=4)
          Filter: (lower(name) ~~ 'foo'::text)
(3 rows)

You could work around that by using EXECUTE in the plpgsql function, to
force the query to be planned on every execution with the actual value
of the LIMIT.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Slow set-returning functions

From
"Merlin Moncure"
Date:
On Jan 20, 2008 9:34 AM, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
> Dean Rasheed wrote:
> > I have been having difficulty with some functions which return sets of
> > rows. The functions seem to run very slowly, even though the queries
> > they run execute very quicky if I run them directly from psgl.
> > Typically these queries are only returning a few hundred rows with my
> > real data.
> >
> > I have had difficulty coming up with a simple test case, but the code
> > below usually shows the same problem. Sometimes I have to run the
> > setup code a few times before it happens - not sure why (I would
> > expect this to be deterministic), but perhaps there is some randomness
> > introduced by the sampling done by the analyse.
> >
> > The function foo() which has a hard-coded LIMIT always executes
> > quickly (comparable to running the query directly).
> >
> > However, the function foo(int) which is passed the same LIMIT as a
> > parameter executes around 30 times slower. The only difference is that
> > the LIMIT is a parameter to the function, although the LIMIT isn't
> > reached anyway in this case. Sometimes running this same script
> > generates data for which this function executes as fast as the other
> > one (which is always fast).
>
> This is clearly because the planner doesn't know what the value for the
> parameter will be at run time, so it chooses a plan that's not optimal
> for LIMIT 100.
>
> > Is there any way that I can see what execution plan is being used
> > internally by the functions?

prepared statements have the same problem.  IIRC the planner assumes
10%, which will often drop to a seqscan or a bitmap index scan.  Some
years back I argued (unsuccessfully) to have the planner guess 100
rows or something like that.  Ideally, I think it would generate the
plan from the value passed into the first invocation of the function.

merlin

Re: Slow set-returning functions

From
Dean Rasheed
Date:
Thanks for the replies.

Converting the functions to plpgsql and using EXECUTE works a treat.
On the real data, one of my functions is now over 50x faster :-)

Dean


> Date: Sun, 20 Jan 2008 10:25:48 -0500
> From: mmoncure@gmail.com
> To: heikki@enterprisedb.com
> Subject: Re: [PERFORM] Slow set-returning functions
> CC: dean_rasheed@hotmail.com; pgsql-performance@postgresql.org
>
> On Jan 20, 2008 9:34 AM, Heikki Linnakangas  wrote:
>> Dean Rasheed wrote:
>>> I have been having difficulty with some functions which return sets of
>>> rows. The functions seem to run very slowly, even though the queries
>>> they run execute very quicky if I run them directly from psgl.
>>> Typically these queries are only returning a few hundred rows with my
>>> real data.
>>>
>>> I have had difficulty coming up with a simple test case, but the code
>>> below usually shows the same problem. Sometimes I have to run the
>>> setup code a few times before it happens - not sure why (I would
>>> expect this to be deterministic), but perhaps there is some randomness
>>> introduced by the sampling done by the analyse.
>>>
>>> The function foo() which has a hard-coded LIMIT always executes
>>> quickly (comparable to running the query directly).
>>>
>>> However, the function foo(int) which is passed the same LIMIT as a
>>> parameter executes around 30 times slower. The only difference is that
>>> the LIMIT is a parameter to the function, although the LIMIT isn't
>>> reached anyway in this case. Sometimes running this same script
>>> generates data for which this function executes as fast as the other
>>> one (which is always fast).
>>
>> This is clearly because the planner doesn't know what the value for the
>> parameter will be at run time, so it chooses a plan that's not optimal
>> for LIMIT 100.
>>
>>> Is there any way that I can see what execution plan is being used
>>> internally by the functions?
>
> prepared statements have the same problem. IIRC the planner assumes
> 10%, which will often drop to a seqscan or a bitmap index scan. Some
> years back I argued (unsuccessfully) to have the planner guess 100
> rows or something like that. Ideally, I think it would generate the
> plan from the value passed into the first invocation of the function.
>
> merlin

_________________________________________________________________
Get Hotmail on your mobile, text MSN to 63463!
http://mobile.uk.msn.com/pc/mail.aspx

Re: Slow set-returning functions

From
Tom Lane
Date:
"Merlin Moncure" <mmoncure@gmail.com> writes:
> prepared statements have the same problem.  IIRC the planner assumes
> 10%, which will often drop to a seqscan or a bitmap index scan.  Some
> years back I argued (unsuccessfully) to have the planner guess 100
> rows or something like that.  Ideally, I think it would generate the
> plan from the value passed into the first invocation of the function.

I believe it's the case that that will happen now in the same contexts
where the planner uses the first value of any other parameter (ie,
unnamed statements in the extended-query protocol).

            regards, tom lane

Re: Slow set-returning functions

From
Dean Rasheed
Date:
>> Is there any way that I can see what execution plan is being used
>> internally by the functions?
>>
>
> Not directly, but you can do this:
>
>
> postgres=# PREPARE p (int4) AS SELECT id FROM foo WHERE lower(name) LIKE
> 'foo' ORDER BY id OFFSET 0 LIMIT $1;
> PREPARE
>
> postgres=# EXPLAIN EXECUTE p(100); QUERY PLAN
> -----------------------------------------------------------------------------
>  Limit  (cost=0.00..49.18 rows=2 width=4)
>
> -> Index Scan using foo_pkey on foo (cost=0.00..614.77 rows=25 width=4)
>          Filter: (lower(name) ~~ 'foo'::text)
> (3 rows)


I think that having the ability to see the execution plans being used
by queries inside functions would be quite useful.

More generally, I would like to be able to log the execution plans of
all queries issued by an application (in my case I am working on a web
application, where some of the queries are auto-generated by
Hibernate). I've tried setting debug_print_plan, but the results are a
little hard to interpret.

As an experiment, I have tried hacking around a little with the code.
This is my first foray into the source code, so I might well be
missing something, but basically I added a new configuration parameter
debug_explain_plan which causes all queries to be instrumented and
ExecutorRun() to call explain_outNode() at the end, logging the
results at level DEBUG1.

It seems to work quite well as a development aid for my web
application. It is also useful from a psql session (similar to
AUTOTRACE in Oracle):

test=# create table foo(a int primary key);
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "foo_pkey" for table "foo"
CREATE TABLE
test=# insert into foo values (1), (2), (3), (4), (5);
INSERT 0 5
test=# set debug_explain_plan=true;
SET
test=# set client_min_messages=debug1;
SET
test=# select count(*) from foo where a>3;
DEBUG:  ------------------- query plan -------------------
DETAIL:  Aggregate  (cost=32.45..32.46 rows=1 width=0) (actual time=0.066..0.068 rows=1 loops=1)
  ->  Bitmap Heap Scan on foo  (cost=10.45..30.45 rows=800 width=0) (actual time=0.039..0.043 rows=2 loops=1)
        Recheck Cond: (a> 3)
        ->  Bitmap Index Scan on foo_pkey  (cost=0.00..10.25 rows=800 width=0) (actual time=0.025..0.025 rows=2
loops=1)
              Index Cond: (a> 3)
Query runtime: 0.089 ms
 count
-------
     2
(1 row)

test=# create function foo() returns int as 'select max(a) from foo;' language sql stable;
CREATE FUNCTION
test=# select * from foo where a=foo();
DEBUG:  ------------------- query plan -------------------
DETAIL:  Result  (cost=0.04..0.05 rows=1 width=0) (actual time=0.044..0.044 rows=1 loops=1)
  InitPlan
    ->  Limit  (cost=0.00..0.04 rows=1 width=4) (actual time=0.032..0.034 rows=1 loops=1)
          ->  Index Scan Backward using foo_pkey on foo  (cost=0.00..84.25 rows=2400 width=4) (actual time=0.025..0.025
rows=1loops=1) 
                Filter: (a IS NOT NULL)
Query runtime: 0.050 ms
CONTEXT:  SQL function "foo" statement 1
DEBUG:  ------------------- query plan -------------------
DETAIL:  Result  (cost=0.04..0.05 rows=1 width=0) (actual time=0.037..0.037 rows=1 loops=1)
  InitPlan
    ->  Limit  (cost=0.00..0.04 rows=1 width=4) (actual time=0.027..0.029 rows=1 loops=1)
          ->  Index Scan Backward using foo_pkey on foo  (cost=0.00..84.25 rows=2400 width=4) (actual time=0.021..0.021
rows=1loops=1) 
                Filter: (a IS NOT NULL)
Query runtime: 0.044 ms
CONTEXT:  SQL function "foo" statement 1
DEBUG:  ------------------- query plan -------------------
DETAIL:  Index Scan using foo_pkey on foo  (cost=0.25..8.52 rows=1 width=4) (actual time=1.638..1.642 rows=1 loops=1)
  Index Cond: (a = foo())
Query runtime: 1.686 ms
 a
---
 5
(1 row)


(Curious that foo() is being executed twice in this case).

The CONTEXT is very useful, particularly when functions call other
functions, since it gives the call stack (presumably only for SQL and
PL/pgSQL functions). For top-level queries I would ideally like the
CONTEXT to log the SQL being executed, but I couldn't figure out how
to access that information.

Anyway, I'd be interested to know if anyone has thought about doing
anything like this before and if anyone else might find this useful.

Dean

_________________________________________________________________
Free games, great prizes - get gaming at Gamesbox.
http://www.searchgamesbox.com

Re: Slow set-returning functions

From
Marcin Stępnicki
Date:
Dnia 27-01-2008, N o godzinie 17:29 +0000, Dean Rasheed pisze:
> The CONTEXT is very useful, particularly when functions call other
> functions, since it gives the call stack (presumably only for SQL and
> PL/pgSQL functions). For top-level queries I would ideally like the
> CONTEXT to log the SQL being executed, but I couldn't figure out how
> to access that information.
>
> Anyway, I'd be interested to know if anyone has thought about doing
> anything like this before and if anyone else might find this useful.
>
> Dean

I'd love to see that. The mentioned PREPARE workaround doesn't seem to
work when executed function calls for example three another (or I don't
know how to use it in such situation) - and is generally painful to
use.

I'm afraid I can't help you much (besides testing) but I'd be more than
interested in such improvement.

Regards,
Marcin


Re: Slow set-returning functions

From
"Merlin Moncure"
Date:
On Jan 27, 2008 12:29 PM, Dean Rasheed <dean_rasheed@hotmail.com> wrote:
> >> Is there any way that I can see what execution plan is being used
> >> internally by the functions?
> >>
> >
> > Not directly, but you can do this:
> >
> >
> > postgres=# PREPARE p (int4) AS SELECT id FROM foo WHERE lower(name) LIKE
> > 'foo' ORDER BY id OFFSET 0 LIMIT $1;
> > PREPARE
> >
> > postgres=# EXPLAIN EXECUTE p(100); QUERY PLAN
> > -----------------------------------------------------------------------------
> >  Limit  (cost=0.00..49.18 rows=2 width=4)
> >
> > -> Index Scan using foo_pkey on foo (cost=0.00..614.77 rows=25 width=4)
> >          Filter: (lower(name) ~~ 'foo'::text)
> > (3 rows)
>
>
> I think that having the ability to see the execution plans being used
> by queries inside functions would be quite useful.
>
> More generally, I would like to be able to log the execution plans of
> all queries issued by an application (in my case I am working on a web
> application, where some of the queries are auto-generated by
> Hibernate). I've tried setting debug_print_plan, but the results are a
> little hard to interpret.
>
> As an experiment, I have tried hacking around a little with the code.
> This is my first foray into the source code, so I might well be
> missing something, but basically I added a new configuration parameter
> debug_explain_plan which causes all queries to be instrumented and
> ExecutorRun() to call explain_outNode() at the end, logging the
> results at level DEBUG1.

I read your email, blinked twice, and thought: where have you been all
my life! :-)

(IOW, +1)

merlin

Re: Slow set-returning functions

From
Dean Rasheed
Date:
I've posted the patch here:

http://archives.postgresql.org/pgsql-patches/2008-01/msg00123.php

Dean.

_________________________________________________________________
Get Hotmail on your mobile, text MSN to 63463!
http://mobile.uk.msn.com/pc/mail.aspx