Thread: Replacing Cursors with Temporary Tables

Replacing Cursors with Temporary Tables

From: Eliot Gable
I have previously discussed my very long PL/PGSQL stored procedure on this list. However, without getting into too many details, I have another performance-related question.

The procedure currently uses cursors to return multiple result sets to the program executing the procedure. Basically, I do this:

BEGIN;
SELECT * FROM stored_proc();
FETCH ALL FROM cursor1;
FETCH ALL FROM cursor2;
FETCH ALL FROM cursor3;
etc.
COMMIT;
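
For context, stored_proc() opens several named cursors and returns their names as refcursors. A stripped-down sketch of that pattern (table names are made up for illustration):

CREATE OR REPLACE FUNCTION stored_proc() RETURNS SETOF refcursor AS
$BODY$
DECLARE
    c1 refcursor := 'cursor1';
    c2 refcursor := 'cursor2';
BEGIN
    OPEN c1 FOR SELECT * FROM some_table;      -- placeholder queries
    RETURN NEXT c1;
    OPEN c2 FOR SELECT * FROM other_table;
    RETURN NEXT c2;
END;
$BODY$
LANGUAGE plpgsql;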

However, there are some cases in the stored procedure where some of the result sets returned by these cursors are also needed as inputs to additional queries. To use them, I am currently doing:

FOR temp IN cursorX LOOP
  -- Some code that pushes the current temp record onto the end of an array
END LOOP;
OPEN cursorX;
MOVE FIRST FROM cursorX;
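
Concretely, the loop looks something like this (a simplified sketch; build_array, some_table, and result_row_type are stand-ins for my real names, and result_row_type is a composite type matching the cursor's columns):

CREATE OR REPLACE FUNCTION build_array() RETURNS result_row_type[] AS
$BODY$
DECLARE
    curs SCROLL CURSOR IS SELECT * FROM some_table;   -- placeholder query
    temp_row     result_row_type;
    result_array result_row_type[];
BEGIN
    FOR temp IN curs LOOP              -- "temp" is implicitly declared by the loop
        temp_row := temp;              -- field-by-field copy into the composite type
        result_array := array_append(result_array, temp_row);
    END LOOP;
    OPEN curs;                         -- re-open so the caller can still FETCH from it
    MOVE FIRST FROM curs;
    RETURN result_array;
END;
$BODY$
LANGUAGE plpgsql;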

Then, when I need to use the results in a query, I do something like:

SELECT * FROM table1 INNER JOIN (SELECT * FROM unnest(result_array)) AS table2 ON ( blah blah ) WHERE blah

This seems extremely inefficient to me. First, I'm not sure of the penalty for unnesting an array into a SET OF object. Second, the set of records returned from unnesting would not be indexed for the join which means a sequential scan. Third, building the array in the first place using array_append seems extremely inefficient. Fourth, opening the cursor twice seems like it would execute the query twice, though given the placement and context, it's probably got it cached somewhere (hopefully). I'm sure there are probably other things I am overlooking.

Instead of doing things this way, I think using temporary tables is really what I want. I am thinking that instead of doing this cursor BS, I can do something like:

CREATE TEMPORARY TABLE table2 WITH (OIDS=FALSE) ON COMMIT DROP AS (
  SELECT * FROM blah blah blah -- whatever the cursor is defined as doing
);
ALTER TABLE table2 ADD PRIMARY KEY (id);
CREATE INDEX table2_blah_idx ON table2 USING btree (blah);
ANALYZE table2;

Then, when I need to use the results in another query, I could do:

SELECT * FROM table1 INNER JOIN table2 ON ( blah blah ) WHERE blah

This would use the indexes and the primary key appropriately. I could also ensure that the order of the information in the temporary table is such that it facilitates any joining, where clauses, or order by clauses on the additional queries. Finally, to get results into my application, I would then do:

BEGIN;
SELECT * FROM stored_procedure();
SELECT * FROM temp_table1;
SELECT * FROM temp_table2;
SELECT * FROM temp_table3;
etc
COMMIT;

However, this is a fairly major re-write of how things are done. Before I spend the time to do all that re-writing, can anyone share some insight on where / how I might expect to gain performance from this conversion and also speak to some of the overhead (if any) in using temporary tables like this (building them, creating indexes on them, analyzing them, then dropping them on commit)? It is worth mentioning that the data being stored in these temporary tables is probably <1,000 records for all tables involved. Most will probably be <100 records. Some of these temporary tables will be joined to other tables up to 4 more times throughout the rest of the stored procedure. Most will be generated and then retrieved only from outside the stored procedure. Obviously, I would not create indexes on or analyze the temporary tables being retrieved only from outside the stored procedure. Indexes and primary keys will only be created on the tables that are joined to other tables and have WHERE conditions applied to them.

I have done a lot of Googling on temporary tables and cursors in PostgreSQL, but I have found only very limited discussion as to performance differences with respect to how I'm planning on using them, and I am unsure about the quality of the information given that most of it is 4+ years out of date and posted on various expert exchanges and not on this pgsql-performance list.

One final question:

In this conversion to temporary table use, there are a couple of cases where I would prefer to do something like:

prepare blah(blah blah) as select blah;

Then, I want to call this prepared statement multiple times, passing a different argument value each time. The only reason to do this would be to save writing code and to ensure that updating the select statement in one place covers all places where it is used. However, I am concerned it might incur a performance hit by re-preparing the query, since I assume that having this inside the PL/PGSQL procedure means it is already prepared once. Can anyone speak to this? I know that I could put it in a separate stored procedure, but then the question becomes: does that add extra overhead? Or, in different words, is it similar to the difference between an inlined function and a non-inlined function in C?
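
For reference, at the SQL level I mean something like this (names invented); inside PL/pgSQL the equivalent would be a helper function or a dynamic EXECUTE:

PREPARE get_items(integer) AS
    SELECT * FROM items WHERE group_id = $1;   -- placeholder query

EXECUTE get_items(1);
EXECUTE get_items(2);

DEALLOCATE get_items;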

I would greatly appreciate any insights to these questions/issues.

Thanks in advance for any assistance anyone can provide.


--
Eliot Gable

"We do not inherit the Earth from our ancestors: we borrow it from our children." ~David Brower

"I decided the words were too conservative for me. We're not borrowing from our children, we're stealing from them--and it's not even considered to be a crime." ~David Brower

"Esse oportet ut vivas, non vivere ut edas." (Thou shouldst eat to live; not live to eat.) ~Marcus Tullius Cicero

Re: Replacing Cursors with Temporary Tables

From: Robert Haas
I think it's really tough to say how this is going to perform.  I'd
recommend constructing a couple of simplified test cases and
benchmarking the heck out of it.  One of the problems with temporary
tables is that every time you create a temporary table, it creates a
(temporary) record in pg_class; that can get to be a problem if you do
it a lot.  Another is that for non-trivial queries you may need to do
a manual ANALYZE on the table to get good stats for the rest of the
query to perform well.  But on the flip side, as you say, nesting and
unnesting of arrays and function calls are not free either.  I am
going to hazard a SWAG that the array implementation is faster UNLESS
the lack of good stats on the contents of the arrays is hosing the
performance somewhere down the road.  But that is really just a total
shot in the dark.

Another possible implementation might be to have a couple of permanent
tables where you store the results.  Give each such table a "batch id"
column, and return the batch id from your stored procedure.  This
would probably avoid a lot of the overhead associated with temp tables
while retaining many of the benefits.
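
A rough sketch of that idea (all names invented):

CREATE TABLE result_batch (
    batch_id  bigint NOT NULL,
    id        integer,
    payload   text
);
CREATE INDEX result_batch_batch_id_idx ON result_batch (batch_id);
CREATE SEQUENCE result_batch_seq;

-- Inside the procedure:
--     _batch := nextval('result_batch_seq');
--     INSERT INTO result_batch SELECT _batch, id, payload FROM ...;
--     RETURN _batch;
-- The caller then reads (and later deletes) its rows with:
--     SELECT * FROM result_batch WHERE batch_id = $1;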

...Robert

Re: Replacing Cursors with Temporary Tables

From: Scott Carey
On Apr 21, 2010, at 1:16 PM, Eliot Gable wrote:

> I have previously discussed my very long PL/PGSQL stored procedure on this list. However, without getting into too many details, I have another performance-related question.
>
> The procedure currently uses cursors to return multiple result sets to the program executing the procedure. Basically, I do this:
>
> CREATE TEMPORARY TABLE table2 WITH (OIDS=FALSE) ON COMMIT DROP AS (
>   SELECT * FROM blah blah blah -- whatever the cursor is defined as doing
> );
> ALTER TABLE table2 ADD PRIMARY KEY (id);
> CREATE INDEX table2_blah_idx ON table2 USING btree (blah);
> ANALYZE table2;
>
> Then, when I need to use the results in another query, I could do:
>
> SELECT * FROM table1 INNER JOIN table2 ON ( blah blah ) WHERE blah
>
> This would use the indexes and the primary key appropriately. I could also ensure that the order of the information in the temporary table is such that it facilitates any joining, where clauses, or order by clauses on the additional queries. Finally, to get results into my application, I would then do:

I have had good luck with temp tables, but beware -- there isn't anything special performance-wise about them: they do as much I/O as a real table, with no optimizations that exploit the fact that a table dropped on commit doesn't have to be as fail-safe as an ordinary one.  Even so, a quick

CREATE TEMP TABLE foo ON COMMIT DROP AS (SELECT ...);
ANALYZE foo;
SELECT * FROM foo JOIN bar ... ;

can be very effective for performance.

However, creating the indexes above is going to slow it down a lot.  Most likely, the join with a seqscan will be faster than an index build followed by the join.  After all, in order to build the index it has to seqscan!  If you are consuming these tables for many later select queries rather than just one or two, building the index might help.  Otherwise it's just a lot of extra work.

I suggest you experiment with the performance differences using psql on a specific use case on real data.


> One final question:
>
> In this conversion to temporary table use, there are a couple of cases where I would prefer to do something like:
>
> prepare blah(blah blah) as select blah;
>
> Then, I want to call this prepared statement multiple times, passing a different argument value each time. The only reason to do this would be to save writing code and to ensure that updating the select statement in one place covers all places where it is used. However, I am concerned it might incur a performance hit by re-preparing the query, since I assume that having this inside the PL/PGSQL procedure means it is already prepared once. Can anyone speak to this? I know that I could put it in a separate stored procedure, but then the question becomes: does that add extra overhead? Or, in different words, is it similar to the difference between an inlined function and a non-inlined function in C?

I can't speak for the details in your question, but it brings up a different issue I can speak to:
Prepared statements usually cause the planner to create a generic query plan for all possible inputs.  For some queries where the parameters can significantly influence the query plan, this can be a big performance drop.  For other queries (particularly inserts or simple selects on PKs) the cached plan saves time.
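
You can inspect the plan a prepared statement will reuse with EXPLAIN EXECUTE, e.g. (made-up query):

PREPARE q(integer) AS SELECT * FROM some_table WHERE some_col = $1;
EXPLAIN EXECUTE q(42);   -- shows the generic plan the statement will run with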

> I would greatly appreciate any insights to these questions/issues.
>
> Thanks in advance for any assistance anyone can provide.


Re: Replacing Cursors with Temporary Tables

From: Merlin Moncure
On Wed, Apr 21, 2010 at 4:16 PM, Eliot Gable
<egable+pgsql-performance@gmail.com> wrote:
> I have previously discussed my very long PL/PGSQL stored procedure on this
> list. However, without getting into too many details, I have another
> performance-related question.
>
> The procedure currently uses cursors to return multiple result sets to the
> program executing the procedure. Basically, I do this:
>
> BEGIN;
> SELECT * FROM stored_proc();
> FETCH ALL FROM cursor1;
> FETCH ALL FROM cursor2;
> FETCH ALL FROM cursor3;
> etc.
> COMMIT;
>
> However, there are some cases in the stored procedure where some of the
> result sets returned by these cursors are also needed as inputs to
> additional queries. To use them, I am currently doing:
>
> FOR temp IN cursorX LOOP
>   -- Some code that pushes the current temp record onto the end of an array
> END LOOP;
> OPEN cursorX;
> MOVE FIRST FROM cursorX;
>
> Then, when I need to use the results in a query, I do something like:
>
> SELECT * FROM table1 INNER JOIN (SELECT * FROM unnest(result_array)) AS
> table2 ON ( blah blah ) WHERE blah
>
> This seems extremely inefficient to me. First, I'm not sure of the penalty
> for unnesting an array into a SET OF object. Second, the set of records
> returned from unnesting would not be indexed for the join which means a
> sequential scan. Third, building the array in the first place using
> array_append seems extremely inefficient. Fourth, opening the cursor twice
> seems like it would execute the query twice, though given the placement and
> context, it's probably got it cached somewhere (hopefully). I'm sure there
> are probably other things I am overlooking.

*) don't use temp tables unless there is no other way (for example, if
the set is quite large)
*) unnest is cheap unless the array is large
*) Don't build arrays that way:

declare a_cursor for a_query

becomes
CREATE FUNCTION get_foos(OUT foos foo[]) AS  -- foo is a table or composite type
$$
BEGIN
  select array(a_query) into foos;
  [...]
END;
$$ language plpgsql;

In 8.4, we would typically manipulate the results like this:

WITH f AS (select unnest(foos) as foo)
SELECT * from f join bar on (f).foo.bar_id= bar.bar_id [...]

or this:
WITH f AS (select (foo).* from (select unnest(foos) as foo) q)
SELECT * from f join bar on f.bar_id= bar.bar_id [...]


This will use an index on bar.bar_id if it exists.  Obviously, any
indexes on foo are not used after creating the array but doesn't
matter much as long as the right side is indexed.  Your cursor method
does do any better in this regard.  You can create an index on a temp
table but the cost of building the index will probably be more than
any savings you get unless this is some type of special case, for
example if the left (temp table) side is big and you need to have it
sorted from that side.

merlin

Re: Replacing Cursors with Temporary Tables

From: Merlin Moncure
On Thu, Apr 22, 2010 at 8:14 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> This will use an index on bar.bar_id if it exists.  Obviously, any
> indexes on foo are not used after creating the array but doesn't
> matter much as long as the right side is indexed.  Your cursor method
> does do any better in this regard.  You can create an index on a temp

er, meant to say: 'doesn't do any better'.

merlin

Re: Replacing Cursors with Temporary Tables

From: Merlin Moncure
On Wed, Apr 21, 2010 at 4:16 PM, Eliot Gable
<egable+pgsql-performance@gmail.com> wrote:
> I have previously discussed my very long PL/PGSQL stored procedure on this
> list. However, without getting into too many details, I have another
> performance-related question.

ok, here's a practical comparison:
-- test data
create table foo(foo_id int primary key);
insert into foo select generate_series(1, 1000) v;
create table bar(bar_id int, foo_id int references foo);
create index bar_foo_id_idx on bar(foo_id);
insert into bar select v, (v % 1000) + 1 from generate_series(1, 1000000) v;

-- arrays
create or replace function get_foobars(_foo_id int, _foo out foo,
_bars out bar[]) returns record as
$$
  begin
    select * from foo where foo_id = _foo_id into _foo;

    select array(select bar from bar where foo_id = _foo_id) into _bars;
  end;
$$ language plpgsql;

select (unnest(_bars)).* from get_foobars(6);  -- ~ 4ms on my box

-- temp table

create or replace function get_foobars(_foo_id int) returns void as
$$
  begin
    create temp table bars on commit drop as select * from bar where
foo_id = _foo_id;
  end;
$$ language plpgsql;

begin;
select get_foobars(6); -- ~ 3ms
select * from bars;  -- 1.6ms
commit; -- 1ms

The timings are similar, but the array returning case:
*)  runs in a single statement.  If this is executed from the client
that means less round trips
*) can be passed around as a variable between functions.  temp table
requires re-query
*) make some things easier/cheap such as counting the array -- you get
to call the basically free array_upper()
*) makes some things harder.  specifically dealing with arrays on the
client is a pain UNLESS you expand the array w/unnest() or use
libpqtypes
*) can nest. you can trivially nest complicated sets w/arrays
*) does not require explicit transaction mgmt
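
For example, on the "passed around as a variable" point above, the array from get_foobars() can be fed straight into another function without touching the database again (a sketch; the filter function is invented):

create or replace function big_bars(_bars bar[]) returns bar[] as
$$
begin
    -- filter the passed-in array without re-querying the bar table
    return array(select b from unnest(_bars) b where b.bar_id > 500000);
end;
$$ language plpgsql;

-- select big_bars(_bars) from get_foobars(6);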

merlin

Re: Replacing Cursors with Temporary Tables

From: Merlin Moncure
On Thu, Apr 22, 2010 at 10:11 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> The timings are similar, but the array returning case:
> *)  runs in a single statement.  If this is executed from the client
> that means less round trips
> *) can be passed around as a variable between functions.  temp table
> requires re-query
> *) make some things easier/cheap such as counting the array -- you get
> to call the basically free array_upper()
> *) makes some things harder.  specifically dealing with arrays on the
> client is a pain UNLESS you expand the array w/unnest() or use
> libpqtypes
> *) can nest. you can trivially nest complicated sets w/arrays
> *) does not require explicit transaction mgmt


I neglected to mention perhaps the most important point about the array method:
*) does not rely on any temporary resources.

If you write a lot of plpgsql, you will start to appreciate the
difference in execution time between planned and unplanned functions.
The first time you run a function in a database session, it has to be
parsed and planned.  The planning time in particular for large-ish
functions that touch a lot of objects can exceed the execution time of
the function.  Depending on _any_ temporary resources causes plan mgmt
issues because the database detects that a table in the old plan is
gone ('on commit drop') and has to re-plan.   If your functions are
complex/long and you are counting milliseconds, then that alone should
be enough to dump any approach that depends on temp tables.
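
If you are stuck with temp tables inside plpgsql, one common mitigation -- just a sketch, and it carries its own parse/plan cost -- is to touch them only through dynamic SQL, so the function's cached plan never references the temporary object:

create or replace function count_bars() returns bigint as
$$
declare
    n bigint;
begin
    execute 'select count(*) from bars' into n;   -- planned at execution time, not cached
    return n;
end;
$$ language plpgsql;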

merlin

Re: Replacing Cursors with Temporary Tables

From: Eliot Gable
I appreciate all the comments.

I will perform some benchmarking before doing the rewrite to be certain of how it will impact performance. At the very least, I think I can say for near-certain now that the indexes are not going to help me, given the particular queries I am dealing with and the limited number of records the temp tables will have, combined with the limited number of times I will re-use them.



Re: Replacing Cursors with Temporary Tables

From: Eliot Gable
To answer the question of whether calling a stored procedure adds any significant overhead, I built a test case and the short answer is that it seems that it does:

CREATE OR REPLACE FUNCTION Test1() RETURNS INTEGER AS
$BODY$
DECLARE
    temp INTEGER;
BEGIN
    FOR i IN 1..1000 LOOP
        SELECT 1 AS id INTO temp;
    END LOOP;
    RETURN 1;
END;
$BODY$
LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION Test2A() RETURNS SETOF INTEGER AS
$BODY$
DECLARE
BEGIN
    RETURN QUERY SELECT 1 AS id;
END;
$BODY$
LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION Test2B() RETURNS INTEGER AS
$BODY$
DECLARE
    temp INTEGER;
BEGIN
    FOR i IN 1..1000 LOOP
        temp := Test2A();
    END LOOP;
    RETURN 1;
END;
$BODY$
LANGUAGE plpgsql;


EXPLAIN ANALYZE SELECT * FROM Test1();
"Function Scan on test1  (cost=0.00..0.26 rows=1 width=4) (actual time=6.568..6.569 rows=1 loops=1)"
"Total runtime: 6.585 ms"


EXPLAIN ANALYZE SELECT * FROM Test2B();
"Function Scan on test2b  (cost=0.00..0.26 rows=1 width=4) (actual time=29.006..29.007 rows=1 loops=1)"
"Total runtime: 29.020 ms"


So, when chasing milliseconds, don't call sub functions if it can realistically and easily be avoided. I only have one operation/algorithm broken out into another stored procedure because I call it in about 8 different places and it is 900+ lines long. While everything else could be broken out into different stored procedures to make it easier to analyze the whole set of code and probably make it easier to maintain, it does not make sense from a performance perspective. Each different logical group of actions that would be in its own stored procedure is only ever used once in the whole algorithm, so there is no good code re-use going on. Further, since the containing stored procedure gets called by itself hundreds or even thousands of times per second on a production system, the nested calls to individual sub-stored procedures would just add extra overhead for no real gain. And, from these tests, it would be significant overhead.




Re: Replacing Cursors with Temporary Tables

From: Merlin Moncure
On Fri, Apr 23, 2010 at 4:42 PM, Eliot Gable
<egable+pgsql-performance@gmail.com> wrote:
> To answer the question of whether calling a stored procedure adds any
> significant overhead, I built a test case and the short answer is that it
> seems that it does:
>
> CREATE OR REPLACE FUNCTION Test1() RETURNS INTEGER AS
> $BODY$
> DECLARE
>     temp INTEGER;
> BEGIN
>     FOR i IN 1..1000 LOOP
>         SELECT 1 AS id INTO temp;
>     END LOOP;
>     RETURN 1;
> END;
> $BODY$
> LANGUAGE plpgsql;
>
> CREATE OR REPLACE FUNCTION Test2A() RETURNS SETOF INTEGER AS
> $BODY$
> DECLARE
> BEGIN
>     RETURN QUERY SELECT 1 AS id;
> END;
> $BODY$
> LANGUAGE plpgsql;
>
> CREATE OR REPLACE FUNCTION Test2B() RETURNS INTEGER AS
> $BODY$
> DECLARE
>     temp INTEGER;
> BEGIN
>     FOR i IN 1..1000 LOOP
>         temp := Test2A();
>     END LOOP;
>     RETURN 1;
> END;
> $BODY$
> LANGUAGE plpgsql;
>
>
> EXPLAIN ANALYZE SELECT * FROM Test1();
> "Function Scan on test1  (cost=0.00..0.26 rows=1 width=4) (actual
> time=6.568..6.569 rows=1 loops=1)"
> "Total runtime: 6.585 ms"
>
>
> EXPLAIN ANALYZE SELECT * FROM Test2B();
> "Function Scan on test2b  (cost=0.00..0.26 rows=1 width=4) (actual
> time=29.006..29.007 rows=1 loops=1)"
> "Total runtime: 29.020 ms"

That's not a fair test.  test2a() is a SRF, which has higher overhead
than a regular function.  Try it this way and the timings will level
out:

CREATE OR REPLACE FUNCTION Test2A() RETURNS INTEGER AS
$BODY$
BEGIN
    RETURN 1;
END;
$BODY$
LANGUAGE plpgsql;

merlin

Re: Replacing Cursors with Temporary Tables

From: Robert Haas
On Fri, Apr 23, 2010 at 4:42 PM, Eliot Gable
<egable+pgsql-performance@gmail.com> wrote:
> And, from these tests, it would be significant overhead.

Yeah, I've been very disappointed by the size of the function-call
overhead on many occasions.  It might be worth putting some effort
into seeing if there's anything that can be done about this, but I
haven't.  :-)

...Robert

Re: Replacing Cursors with Temporary Tables

From: Eliot Gable
That's a good point. However, even after changing it, it is still 12ms with the function call versus 6ms without the extra function call. It is worth noting, though, that if you can guarantee the function returns the same results for the same input parameters, it ends up being faster (roughly 3ms in my test case) due to caching -- at least when executing it multiple times in a row like this. Unfortunately, I cannot take advantage of that, because in my particular use case the chances of it being called again with the same input values within the cache lifetime of the results are close to zero. Add to that the fact that the function queries tables that could change between transactions (meaning the function is volatile) and it's a moot point. However, for those people using a non-volatile function multiple times in the same transaction with the same input values, there is no need to inline the function call.
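
That guarantee is what the volatility categories declare. For instance, marking the test function IMMUTABLE (a sketch of what I mean) lets the planner fold the constant call:

CREATE OR REPLACE FUNCTION Test2A() RETURNS INTEGER AS
$BODY$
BEGIN
    RETURN 1;
END;
$BODY$
LANGUAGE plpgsql IMMUTABLE;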



Re: Replacing Cursors with Temporary Tables

From: Eliot Gable
More benchmarking results are in with a comparison between cursors, arrays, and temporary tables for storing, using, and accessing data outside the stored procedure:

CREATE OR REPLACE FUNCTION Test_Init() RETURNS INTEGER AS
$BODY$
DECLARE
   temp INTEGER;
BEGIN
   DROP TABLE IF EXISTS test_table1 CASCADE;
   CREATE TABLE test_table1 (
      id SERIAL NOT NULL PRIMARY KEY,
      junk_field1 INTEGER,
      junk_field2 INTEGER,
      junk_field3 INTEGER
   ) WITH (OIDS=FALSE);
   DROP INDEX IF EXISTS test_table1_junk_field1_idx CASCADE;
   DROP INDEX IF EXISTS test_table1_junk_field2_idx CASCADE;
   DROP INDEX IF EXISTS test_table1_junk_field3_idx CASCADE; 
   FOR i IN 1..10000 LOOP
      INSERT INTO test_table1 (junk_field1, junk_field2, junk_field3) VALUES
        (i%10, i%20, i%30);
   END LOOP;
   CREATE INDEX test_table1_junk_field1_idx ON test_table1 USING btree (junk_field1);
   CREATE INDEX test_table1_junk_field2_idx ON test_table1 USING btree (junk_field2);
   CREATE INDEX test_table1_junk_field3_idx ON test_table1 USING btree (junk_field3);
   DROP TABLE IF EXISTS test_table2 CASCADE;
   CREATE TABLE test_table2 (
      id SERIAL NOT NULL PRIMARY KEY,
      junk_field1 INTEGER,
      junk_field2 INTEGER,
      junk_field3 INTEGER
   ) WITH (OIDS=FALSE);
   DROP INDEX IF EXISTS test_table2_junk_field1_idx CASCADE;
   DROP INDEX IF EXISTS test_table2_junk_field2_idx CASCADE;
   DROP INDEX IF EXISTS test_table2_junk_field3_idx CASCADE; 
   FOR i IN 1..10000 LOOP
      INSERT INTO test_table2 (junk_field1, junk_field2, junk_field3) VALUES
          (i%15, i%25, i%35);
   END LOOP;
   CREATE INDEX test_table2_junk_field1_idx ON test_table2 USING btree (junk_field1);
   CREATE INDEX test_table2_junk_field2_idx ON test_table2 USING btree (junk_field2);
   CREATE INDEX test_table2_junk_field3_idx ON test_table2 USING btree (junk_field3);
   RETURN 1;
END;
$BODY$
LANGUAGE plpgsql;

SELECT * FROM Test_Init();

DROP TYPE IF EXISTS test_row_type CASCADE;
CREATE TYPE test_row_type AS (
   junk_field1 INTEGER,
   junk_field2 INTEGER,
   junk_field3 INTEGER
);

CREATE OR REPLACE FUNCTION Test1() RETURNS INTEGER AS
$BODY$
DECLARE
   temp_row test_row_type;
   cursresults test_row_type[];
   curs SCROLL CURSOR IS
      -- select only the columns in test_row_type so the assignment below matches
      SELECT junk_field1, junk_field2, junk_field3 FROM test_table1 WHERE junk_field1=8;
BEGIN
    FOR temp IN curs LOOP
       temp_row := temp;
       cursresults := array_append(cursresults, temp_row);
    END LOOP;
    OPEN curs;
    RETURN 1;
END;
$BODY$
LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION Test2() RETURNS INTEGER AS
$BODY$
DECLARE
   cursresults test_row_type[];
   cur SCROLL CURSOR IS
     SELECT * FROM unnest(cursresults);
BEGIN
   cursresults := array(SELECT (junk_field1, junk_field2, junk_field3)::test_row_type AS rec FROM test_table1 WHERE junk_field1=8);
   OPEN cur;
   RETURN 1;
END;
$BODY$
LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION Test3() RETURNS INTEGER AS
$BODY$
DECLARE
BEGIN
   CREATE TEMPORARY TABLE results WITH (OIDS=FALSE) ON COMMIT DROP AS (
      SELECT junk_field1, junk_field2, junk_field3 FROM test_table1 WHERE junk_field1=8
   );
   RETURN 1;
END;
$BODY$
LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION Test4() RETURNS INTEGER AS
$BODY$
DECLARE
   cur SCROLL CURSOR IS
      SELECT * FROM results;
BEGIN
   CREATE TEMPORARY TABLE results WITH (OIDS=FALSE) ON COMMIT DROP AS (
      SELECT junk_field1, junk_field2, junk_field3 FROM test_table1 WHERE junk_field1=8
   );
   OPEN cur;
   RETURN 1;
END;
$BODY$
LANGUAGE plpgsql;

EXPLAIN ANALYZE SELECT * FROM Test1();
"Function Scan on test1 (cost=0.00..0.26 rows=1 width=4) (actual time=17.701..17.701 rows=1 loops=1)"
"Total runtime: 17.714 ms" -- Ouch


EXPLAIN ANALYZE SELECT * FROM Test2();
"Function Scan on test2 (cost=0.00..0.26 rows=1 width=4) (actual time=1.137..1.137 rows=1 loops=1)"
"Total runtime: 1.153 ms" -- Wow


EXPLAIN ANALYZE SELECT * FROM Test3();
"Function Scan on test3 (cost=0.00..0.26 rows=1 width=4) (actual time=2.033..2.034 rows=1 loops=1)"
"Total runtime: 2.050 ms"


EXPLAIN ANALYZE SELECT * FROM Test4();
"Function Scan on test4 (cost=0.00..0.26 rows=1 width=4) (actual time=2.001..2.001 rows=1 loops=1)"
"Total runtime: 2.012 ms"


In each case, the results are available outside the stored procedure by either fetching from the cursor or selecting from the temporary table. Clearly, the temporary table takes a performance hit compared to using arrays. Building an array with array_append is horrendously inefficient. Unnesting an array is surprisingly efficient. As can be seen from Test3 and Test4, cursors have no detectable overhead for opening the cursor (at least in this example with 1,000 result rows). It is unclear whether there is any difference at all between Test3 and Test4 for retrieving the data, as I have no easy way right now to measure that accurately. However, since arrays+cursors are more efficient than anything having to do with temp tables, that is the way I will go. With the number of rows I am dealing with (which should always be less than 1,000 in the final returned result set), unnesting an array is much faster than building a temp table and selecting from it.

If anyone thinks I may have missed some important item in this testing, please let me know.



Re: Replacing Cursors with Temporary Tables

From: "Pierre C"
FYI, I had a query like this :

(complex search query ORDER BY foo LIMIT X)
LEFT JOIN objects_categories oc
LEFT JOIN categories c
GROUP BY ...
(more joins)
ORDER BY foo LIMIT X

Here, we do a search on "objects" (I'm not gonna give all the details,
they're not interesting for the problem at hand).
Point is that these objects can belong to several categories, so I need to
perform a GROUP BY with array_agg() somewhere unless I want the JOIN to
return several rows per object, which is not what I want. This makes the
query quite complicated...

I ended up rewriting it like this :

(complex search query ORDER BY foo LIMIT X)
LEFT JOIN
(SELECT .. FROM objects_categories oc
  LEFT JOIN categories c
  GROUP BY ...
) ON ...
(more joins)
ORDER BY foo LIMIT X

Basically moving the aggregates into a separate query. It is easier to
handle.

I tried to process it like this, in a stored proc :

- do the (complex search query ORDER BY foo LIMIT X) alone and stuff it in
a cursor
- extract the elements needed into arrays (mostly object_id)
- get the other information as separate queries like :

SELECT object_id, category_id, category_name
 FROM objects_categories JOIN categories ON ...
WHERE object_id =ANY( my_array );

and return the results into cursors, too.

Or like this (using 2 cursors) :

SELECT object_id, array_agg(category_id) FROM objects_categories WHERE
object_id =ANY( my_array ) GROUP BY object_id;

SELECT category_id, category_name, ...
 FROM categories WHERE category_id IN (
  SELECT category_id FROM objects_categories WHERE object_id =ANY( my_array
));

I found it to be quite faster, and it also simplifies my PHP code. From
PHP's point of view, it is simpler to get a cursor that returns the
objects, and separate cursors that can be used to build an in-memory PHP
hashtable of only the categories we're going to display. Also, it avoids
retrieving lots of data multiple times, since many objects will belong to
the same categories. With the second example, I can use my ORM to
instantiate only one copy of each.

It would be quite useful if we could SELECT from a cursor, or JOIN a
cursor to an existing table...

Re: Replacing Cursors with Temporary Tables

From: Merlin Moncure
On Fri, Apr 23, 2010 at 10:31 PM, Eliot Gable
<egable+pgsql-performance@gmail.com> wrote:
> In each case, the results are available outside the stored procedure by
> either fetching from the cursor or selecting from the temporary table.
> Clearly, the temporary table takes a performance hit compared using arrays.
> Building an array with array append is horrendously inefficient. Unnesting
> an array is surprisingly efficient. As can be seen from Test3 and Test4,
> cursors have no detectable overhead for opening the cursor (at least in this
> example with 1000 result rows). It is unclear whether there is any
> difference at all from Test3 and Test4 for retrieving the data as I have no
> easy way right now to measure that accurately. However, since arrays+cursors
> are more efficient than anything having to do with temp tables, that is the
> way I will go. With the number of rows I am dealing with (which should
> always be less than 1,000 in the final returned result set), unnesting an
> array is much faster than building a temp table and selecting from it.
> If anyone thinks I may have missed some important item in this testing,
> please let me know.

Well, you missed the most important part: not using cursors at all.
Instead of declaring a cursor and looping it to build the array, build
it with array().  That's what I've been saying: arrays can completely
displace both temp tables _and_ cursors when passing small sets around
functions.

merlin

Re: Replacing Cursors with Temporary Tables

From: Grzegorz Jaśkiewicz


On Sat, Apr 24, 2010 at 2:23 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

> Well, you missed the most important part: not using cursors at all.
> Instead of declaring a cursor and looping it to build the array, build
> it with array().  That's what I've been saying: arrays can completely
> displace both temp tables _and_ cursors when passing small sets around
> functions.

with huge emphasis on the word small.


--
GJ

Re: Replacing Cursors with Temporary Tables

From: Merlin Moncure
2010/4/24 Grzegorz Jaśkiewicz <gryzman@gmail.com>:
>
>
> On Sat, Apr 24, 2010 at 2:23 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>
>> Well, you missed the most important part: not using cursors at all.
>> Instead of declaring a cursor and looping it to build the array, build
>> it with array().  That's what I've been saying: arrays can completely
>> displace both temp tables _and_ cursors when passing small sets around
>> functions.
>>
> with huge emphasis on the word small.

The rule of thumb I use is 10000 narrow records (scalars, or very
small composites) or 1000 wide/complex records.  I routinely pass
extremely complex (3-4 levels nesting) nested composite arrays to the
client for processing -- it is extremely efficient and clean.  This of
course is going to depend on hardware and other factors.
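
For illustration, a nested composite sketch built on the foo/bar test tables from earlier in the thread (type names invented):

create type bar_level as (bar_id int, foo_id int);
create type foo_level as (foo_id int, bars bar_level[]);

-- one array of foos, each foo carrying its own array of bars
select array(
    select row(f.foo_id,
               array(select row(b.bar_id, b.foo_id)::bar_level
                     from bar b where b.foo_id = f.foo_id)
          )::foo_level
    from foo f where f.foo_id < 3
);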

merlin