Thread: dubious optimization of the function in SELECT INTO target list

dubious optimization of the function in SELECT INTO target list

From
Oleksii Kliukin
Date:
Hello,

I have an issue with a function that is called as a part of the SELECT
INTO target list in pl/pgSQL. I'd like to illustrate it with a simple
example:

DO $$
DECLARE l_id integer;
    BEGIN
       SELECT test(id) INTO l_id
        FROM generate_series(1,10) t(id);
    END;
$$ LANGUAGE plpgsql;

It looks like the test function in this example is executed only once.
In order to check this, one can define the test function as following:

CREATE TABLE foo(id integer);

CREATE OR REPLACE FUNCTION public.test(id integer)
RETURNS integer
LANGUAGE plpgsql
AS
$fn$
    BEGIN
        INSERT INTO foo VALUES($1);
        RETURN $1;
    END;
$fn$

Basically, if we invoke the first example, the foo table with have only
1 row and not 10, as supplied by the generate_series.
However, when ORDER BY is attached to the query, or aggregate (such as
max, min or array_agg) is wrapped around the test(id) call, the test
function is called exactly 10 times. If I replace the SELECT INTO with
PERFORM, it would also be called 10 times. Unfortunately, it is not
possible to use PERFORM directly in the CTE expression.

The documentation on the SELECT INTO suggests that the rows returned by
the test function may be discarded after the first one:

"If STRICT is not specified in the INTO clause, then target will be set
to the first row returned by the query, or to nulls if the query
returned no rows. (Note that "the first row" is not well-defined unless
you've used ORDER BY.) Any result rows after the first row are
discarded."
http://www.postgresql.org/docs/current/static/plpgsql-statements.html#PLPGSQL-STATEMENTS-SQL-ONEROW

However, it does not say anything about the number of rows the query
target list will be evaluated, meaning one may expect it to be evaluated
more than once. It seems that in the case of the example above
optimizing out calls to the 'test' function would only produce an
expected result if the function itself does not have any side-effects,
e.g.. qualifies as 'stable' or 'immutable'.

Is there some (undocumented) restriction on the functions allowed in the
SELECT target list, and isn't the optimization to limit the number of
calls to 'test' to 1 wrong in this case?

Kind regards,
--
Oleksii


Re: dubious optimization of the function in SELECT INTO target list

From
Adrian Klaver
Date:
On 10/06/2015 01:13 PM, Oleksii Kliukin wrote:
> Hello,
>
> I have an issue with a function that is called as a part of the SELECT
> INTO target list in pl/pgSQL. I'd like to illustrate it with a simple
> example:
>
> DO $$
> DECLARE l_id integer;
>      BEGIN
>         SELECT test(id) INTO l_id
>          FROM generate_series(1,10) t(id);
>      END;
> $$ LANGUAGE plpgsql;
>
> It looks like the test function in this example is executed only once.
> In order to check this, one can define the test function as following:
>
> CREATE TABLE foo(id integer);
>
> CREATE OR REPLACE FUNCTION public.test(id integer)
> RETURNS integer
> LANGUAGE plpgsql
> AS
> $fn$
>      BEGIN
>          INSERT INTO foo VALUES($1);
>          RETURN $1;
>      END;
> $fn$
>
> Basically, if we invoke the first example, the foo table with have only
> 1 row and not 10, as supplied by the generate_series.
> However, when ORDER BY is attached to the query, or aggregate (such as
> max, min or array_agg) is wrapped around the test(id) call, the test
> function is called exactly 10 times. If I replace the SELECT INTO with
> PERFORM, it would also be called 10 times. Unfortunately, it is not
> possible to use PERFORM directly in the CTE expression.

What CTE expression?

>
> The documentation on the SELECT INTO suggests that the rows returned by
> the test function may be discarded after the first one:
>
> "If STRICT is not specified in the INTO clause, then target will be set
> to the first row returned by the query, or to nulls if the query
> returned no rows. (Note that "the first row" is not well-defined unless
> you've used ORDER BY.) Any result rows after the first row are
> discarded."
> http://www.postgresql.org/docs/current/static/plpgsql-statements.html#PLPGSQL-STATEMENTS-SQL-ONEROW
>
> However, it does not say anything about the number of rows the query
> target list will be evaluated, meaning one may expect it to be evaluated
> more than once. It seems that in the case of the example above
> optimizing out calls to the 'test' function would only produce an
> expected result if the function itself does not have any side-effects,
> e.g.. qualifies as 'stable' or 'immutable'.

How about:

DO $$
DECLARE l_id integer;
     BEGIN
        FOR l_id IN SELECT id
         FROM generate_series(1,10) as id LOOP
             SELECT INTO l_id test(l_id);
        END LOOP;
     END;
$$ LANGUAGE plpgsql;

>
> Is there some (undocumented) restriction on the functions allowed in the
> SELECT target list, and isn't the optimization to limit the number of
> calls to 'test' to 1 wrong in this case?
>
> Kind regards,
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: dubious optimization of the function in SELECT INTO target list

From
Oleksii Kliukin
Date:

On 06 Oct 2015, at 22:40, Adrian Klaver <adrian.klaver@aklaver.com> wrote:

On 10/06/2015 01:13 PM, Oleksii Kliukin wrote:

Basically, if we invoke the first example, the foo table with have only
1 row and not 10, as supplied by the generate_series.
However, when ORDER BY is attached to the query, or aggregate (such as
max, min or array_agg) is wrapped around the test(id) call, the test
function is called exactly 10 times. If I replace the SELECT INTO with
PERFORM, it would also be called 10 times. Unfortunately, it is not
possible to use PERFORM directly in the CTE expression.

What CTE expression?

Any CTE expression :-). The example here is just an illustration to expose the issue. The real-world query I came across used a complex CTE expression and called a function at the end of it inside the SELECT INTO statement.



How about:

DO $$
DECLARE l_id integer;
   BEGIN
      FOR l_id IN SELECT id
       FROM generate_series(1,10) as id LOOP
           SELECT INTO l_id test(l_id);
      END LOOP;
   END;
$$ LANGUAGE plpgsql;

This should work, but I'm interested in finding out why the original statement behaves the way I’ve described. 

Kind regards,
--
Oleksii

Re: dubious optimization of the function in SELECT INTO target list

From
Adrian Klaver
Date:
On 10/06/2015 01:48 PM, Oleksii Kliukin wrote:
>
>> On 06 Oct 2015, at 22:40, Adrian Klaver <adrian.klaver@aklaver.com
>> <mailto:adrian.klaver@aklaver.com>> wrote:
>>
>> On 10/06/2015 01:13 PM, Oleksii Kliukin wrote:
>>>
>>> Basically, if we invoke the first example, the foo table with have only
>>> 1 row and not 10, as supplied by the generate_series.
>>> However, when ORDER BY is attached to the query, or aggregate (such as
>>> max, min or array_agg) is wrapped around the test(id) call, the test
>>> function is called exactly 10 times. If I replace the SELECT INTO with
>>> PERFORM, it would also be called 10 times. Unfortunately, it is not
>>> possible to use PERFORM directly in the CTE expression.
>>
>> What CTE expression?
>
> Any CTE expression :-). The example here is just an illustration to
> expose the issue. The real-world query I came across used a complex CTE
> expression and called a function at the end of it inside the SELECT INTO
> statement.

Remember SELECT INTO inside plpgsql is different from SELECT INTO outside:

http://www.postgresql.org/docs/9.4/interactive/plpgsql-statements.html#PLPGSQL-STATEMENTS-SQL-ONEROW

"Tip: Note that this interpretation of SELECT with INTO is quite
different from PostgreSQL's regular SELECT INTO command, wherein the
INTO target is a newly created table. If you want to create a table from
a SELECT result inside a PL/pgSQL function, use the syntax CREATE TABLE
... AS SELECT.

So a simple CTE example might help clear things up.

>
>
>>
>> How about:
>>
>> DO $$
>> DECLARE l_id integer;
>>    BEGIN
>>       FOR l_id IN SELECT id
>>        FROM generate_series(1,10) as id LOOP
>>            SELECT INTO l_id test(l_id);
>>       END LOOP;
>>    END;
>> $$ LANGUAGE plpgsql;
>
> This should work, but I'm interested in finding out why the original
> statement behaves the way I’ve described.
>
> Kind regards,
> --
> Oleksii
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: dubious optimization of the function in SELECT INTO target list

From
Oleksii Kliukin
Date:

On 06 Oct 2015, at 22:50, Adrian Klaver <adrian.klaver@aklaver.com> wrote:

On 10/06/2015 01:48 PM, Oleksii Kliukin wrote:

On 06 Oct 2015, at 22:40, Adrian Klaver <adrian.klaver@aklaver.com
<mailto:adrian.klaver@aklaver.com>> wrote:

On 10/06/2015 01:13 PM, Oleksii Kliukin wrote:

Basically, if we invoke the first example, the foo table with have only
1 row and not 10, as supplied by the generate_series.
However, when ORDER BY is attached to the query, or aggregate (such as
max, min or array_agg) is wrapped around the test(id) call, the test
function is called exactly 10 times. If I replace the SELECT INTO with
PERFORM, it would also be called 10 times. Unfortunately, it is not
possible to use PERFORM directly in the CTE expression.

What CTE expression?

Any CTE expression :-). The example here is just an illustration to
expose the issue. The real-world query I came across used a complex CTE
expression and called a function at the end of it inside the SELECT INTO
statement.

Remember SELECT INTO inside plpgsql is different from SELECT INTO outside:

http://www.postgresql.org/docs/9.4/interactive/plpgsql-statements.html#PLPGSQL-STATEMENTS-SQL-ONEROW

"Tip: Note that this interpretation of SELECT with INTO is quite different from PostgreSQL's regular SELECT INTO command, wherein the INTO target is a newly created table. If you want to create a table from a SELECT result inside a PL/pgSQL function, use the syntax CREATE TABLE ... AS SELECT.

Thank you. In this case SELECT INTO was consciously  called inside the pl/pgSQL function, as one cannot do SELECT function() from pl/pgSQL without storing the result of the function somewhere (with the INTO clause).

The problem itself has nothing to do with CTEs, the only reason why I’ve mentioned it is to justify why I didn’t use PERFORM instead of SELECT INTO (the following thread http://www.postgresql.org/message-id/91873FFA-838D-4A16-ABED-A0255ED5168F@justatheory.com gives more details, although it is irrelevant to the problem being described).

Kind regards
--
Oleksii

Re: dubious optimization of the function in SELECT INTO target list

From
Adrian Klaver
Date:
On 10/06/2015 02:00 PM, Oleksii Kliukin wrote:
>
>> On 06 Oct 2015, at 22:50, Adrian Klaver <adrian.klaver@aklaver.com
>> <mailto:adrian.klaver@aklaver.com>> wrote:
>>
>> On 10/06/2015 01:48 PM, Oleksii Kliukin wrote:
>>>
>>>> On 06 Oct 2015, at 22:40, Adrian Klaver <adrian.klaver@aklaver.com
>>>> <mailto:adrian.klaver@aklaver.com>
>>>> <mailto:adrian.klaver@aklaver.com>> wrote:
>>>>
>>>> On 10/06/2015 01:13 PM, Oleksii Kliukin wrote:
>>>>>
>>>>> Basically, if we invoke the first example, the foo table with have only
>>>>> 1 row and not 10, as supplied by the generate_series.
>>>>> However, when ORDER BY is attached to the query, or aggregate (such as
>>>>> max, min or array_agg) is wrapped around the test(id) call, the test
>>>>> function is called exactly 10 times. If I replace the SELECT INTO with
>>>>> PERFORM, it would also be called 10 times. Unfortunately, it is not
>>>>> possible to use PERFORM directly in the CTE expression.
>>>>
>>>> What CTE expression?
>>>
>>> Any CTE expression :-). The example here is just an illustration to
>>> expose the issue. The real-world query I came across used a complex CTE
>>> expression and called a function at the end of it inside the SELECT INTO
>>> statement.
>>
>> Remember SELECT INTO inside plpgsql is different from SELECT INTO outside:
>>
>> http://www.postgresql.org/docs/9.4/interactive/plpgsql-statements.html#PLPGSQL-STATEMENTS-SQL-ONEROW
>>
>> "Tip: Note that this interpretation of SELECT with INTO is quite
>> different from PostgreSQL's regular SELECT INTO command, wherein the
>> INTO target is a newly created table. If you want to create a table
>> from a SELECT result inside a PL/pgSQL function, use the syntax CREATE
>> TABLE ... AS SELECT.
>
> Thank you. In this case SELECT INTO was consciously  called inside the
> pl/pgSQL function, as one cannot do SELECT function() from pl/pgSQL
> without storing the result of the function somewhere (with the INTO clause).

So what you asking is why to replicate this:

DECLARE l_id integer;
     BEGIN
        PERFORM test(id)
         FROM generate_series(1,10) as id ;
     END;
$$ LANGUAGE plpgsql;

you have to do something like this?:

DO $$
DECLARE l_id integer;
     BEGIN
        SELECT test(id) INTO l_id
         FROM generate_series(1,10) AS id  order by id;
     END;
$$ LANGUAGE plpgsql;
DO


>
> The problem itself has nothing to do with CTEs, the only reason why I’ve
> mentioned it is to justify why I didn’t use PERFORM instead of SELECT
> INTO (the following thread
> http://www.postgresql.org/message-id/91873FFA-838D-4A16-ABED-A0255ED5168F@justatheory.com gives
> more details, although it is irrelevant to the problem being described).
>
> Kind regards
> --
> Oleksii
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: dubious optimization of the function in SELECT INTO target list

From
Tom Lane
Date:
Oleksii Kliukin <alexk@hintbits.com> writes:
> This should work, but I'm interested in finding out why the original statement behaves the way I�ve described.

plpgsql's SELECT INTO is only capable of storing a single result row,
so it only executes the statement far enough to obtain one row, and
then stops (as though a LIMIT were present).  There is no guarantee
about how much useless computation will get done underneath.

If this is not the behavior you want, you shouldn't be using SELECT INTO
(which, I'll note, is very clearly documented as meant only for single-row
results).  A plausible alternative is a FOR IN SELECT loop, which would
have the benefit that you could actually do something with the row values.

            regards, tom lane


Re: dubious optimization of the function in SELECT INTO target list

From
Oleksii Kliukin
Date:

On 06 Oct 2015, at 23:11, Adrian Klaver <adrian.klaver@aklaver.com> wrote:

On 10/06/2015 02:00 PM, Oleksii Kliukin wrote:

On 06 Oct 2015, at 22:50, Adrian Klaver <adrian.klaver@aklaver.com
<mailto:adrian.klaver@aklaver.com>> wrote:

On 10/06/2015 01:48 PM, Oleksii Kliukin wrote:

On 06 Oct 2015, at 22:40, Adrian Klaver <adrian.klaver@aklaver.com
<mailto:adrian.klaver@aklaver.com>
<mailto:adrian.klaver@aklaver.com>> wrote:

On 10/06/2015 01:13 PM, Oleksii Kliukin wrote:

Basically, if we invoke the first example, the foo table with have only
1 row and not 10, as supplied by the generate_series.
However, when ORDER BY is attached to the query, or aggregate (such as
max, min or array_agg) is wrapped around the test(id) call, the test
function is called exactly 10 times. If I replace the SELECT INTO with
PERFORM, it would also be called 10 times. Unfortunately, it is not
possible to use PERFORM directly in the CTE expression.

What CTE expression?

Any CTE expression :-). The example here is just an illustration to
expose the issue. The real-world query I came across used a complex CTE
expression and called a function at the end of it inside the SELECT INTO
statement.

Remember SELECT INTO inside plpgsql is different from SELECT INTO outside:

http://www.postgresql.org/docs/9.4/interactive/plpgsql-statements.html#PLPGSQL-STATEMENTS-SQL-ONEROW

"Tip: Note that this interpretation of SELECT with INTO is quite
different from PostgreSQL's regular SELECT INTO command, wherein the
INTO target is a newly created table. If you want to create a table
from a SELECT result inside a PL/pgSQL function, use the syntax CREATE
TABLE ... AS SELECT.

Thank you. In this case SELECT INTO was consciously  called inside the
pl/pgSQL function, as one cannot do SELECT function() from pl/pgSQL
without storing the result of the function somewhere (with the INTO clause).

So what you asking is why to replicate this:

DECLARE l_id integer;
   BEGIN
      PERFORM test(id)
       FROM generate_series(1,10) as id ;
   END;
$$ LANGUAGE plpgsql;

you have to do something like this?:

DO $$
DECLARE l_id integer;
   BEGIN
      SELECT test(id) INTO l_id
       FROM generate_series(1,10) AS id  order by id;
   END;
$$ LANGUAGE plpgsql;
DO

My question was, essentially, if SELECT INTO in pl/pgSQL is supposed to stop after emitting the first row, ignoring the fact that the expression it calls may have side effects. I think I’ve got the answer from Tom that yes, it is supposed to be so, but I still think the docs are quite ambiguous about it (i.e. I read "Any result rows after the first row are discarded.” in the SELECT INTO description as a possible sign that they are still evaluated).

Kind regards,
--
Oleksii

Re: dubious optimization of the function in SELECT INTO target list

From
Oleksii Kliukin
Date:

On 06 Oct 2015, at 23:31, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Oleksii Kliukin <alexk@hintbits.com> writes:
This should work, but I'm interested in finding out why the original statement behaves the way I’ve described.

plpgsql's SELECT INTO is only capable of storing a single result row,
so it only executes the statement far enough to obtain one row, and
then stops (as though a LIMIT were present).  There is no guarantee
about how much useless computation will get done underneath.

Thank you, now it’s clear. I have to say there is no guarantee that the computation would be useless. Someone might be calling a function that updates/deletes rows in the SELECT INTO block, being forced to use SELECT INTO by inability of pl/pgSQL to just discard the result of a normal SELECT. I know one can use a loop or call PERFORM, but in some cases (a complex CTE computing the data for the function being called at the end, which updates the tables with this data) actually using SELECT INTO looks like the easiest path to achieve the desired result.

This is essentially the same catch as with LIMIT, but LIMIT is better documented :-)



If this is not the behavior you want, you shouldn't be using SELECT INTO
(which, I'll note, is very clearly documented as meant only for single-row
results).

This is true, but what if I don’t care about the result and cannot use PERFORM?

I admit it is a rather corner case, but to me it’s not clear from the documentation that SELECT INTO will not try to compute more rows than necessary. The docs say "Any result rows after the first row are discarded”, it’s not clear from it whether those rows are supposed to be evaluated before they are discarded, hence, the question that started this thread.


 A plausible alternative is a FOR IN SELECT loop, which would
have the benefit that you could actually do something with the row values.

Agree on that.

Kind regards,
--
Oleksii

Re: dubious optimization of the function in SELECT INTO target list

From
Adrian Klaver
Date:
On 10/08/2015 01:57 AM, Oleksii Kliukin wrote:
>
>> On 06 Oct 2015, at 23:31, Tom Lane <tgl@sss.pgh.pa.us
>> <mailto:tgl@sss.pgh.pa.us>> wrote:
>>
>> Oleksii Kliukin <alexk@hintbits.com <mailto:alexk@hintbits.com>> writes:
>>> This should work, but I'm interested in finding out why the original
>>> statement behaves the way I’ve described.
>>
>> plpgsql's SELECT INTO is only capable of storing a single result row,
>> so it only executes the statement far enough to obtain one row, and
>> then stops (as though a LIMIT were present).  There is no guarantee
>> about how much useless computation will get done underneath.
>
> Thank you, now it’s clear. I have to say there is no guarantee that the
> computation would be useless. Someone might be calling a function that
> updates/deletes rows in the SELECT INTO block, being forced to use
> SELECT INTO by inability of pl/pgSQL to just discard the result of a
> normal SELECT. I know one can use a loop or call PERFORM, but in some
> cases (a complex CTE computing the data for the function being called at
> the end, which updates the tables with this data) actually using SELECT
> INTO looks like the easiest path to achieve the desired result.

Well the best I can come up with at the moment is:

DO $$
DECLARE l_id integer;
     BEGIN
        WITH gs AS (select generate_series(1,10) as id)
         SELECT test(id) FROM gs ORDER BY id INTO l_id;
     END;
$$ LANGUAGE plpgsql;


>
> This is essentially the same catch as with LIMIT, but LIMIT is better
> documented :-)
>
>
>>
>> If this is not the behavior you want, you shouldn't be using SELECT INTO
>> (which, I'll note, is very clearly documented as meant only for single-row
>> results).
>
> This is true, but what if I don’t care about the result and cannot use
> PERFORM?
>
> I admit it is a rather corner case, but to me it’s not clear from the
> documentation that SELECT INTO will not try to compute more rows than
> necessary. The docs say "Any result rows after the first row are
> discarded”, it’s not clear from it whether those rows are supposed to be
> evaluated before they are discarded, hence, the question that started
> this thread.
>
>
>>  A plausible alternative is a FOR IN SELECT loop, which would
>> have the benefit that you could actually do something with the row values.
>
> Agree on that.
>
> Kind regards,
> --
> Oleksii
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: dubious optimization of the function in SELECT INTO target list

From
Alvaro Herrera
Date:
Oleksii Kliukin wrote:

> Thank you, now it’s clear. I have to say there is no guarantee that
> the computation would be useless. Someone might be calling a function
> that updates/deletes rows in the SELECT INTO block, being forced to
> use SELECT INTO by inability of pl/pgSQL to just discard the result of
> a normal SELECT. I know one can use a loop or call PERFORM, but in
> some cases (a complex CTE computing the data for the function being
> called at the end, which updates the tables with this data) actually
> using SELECT INTO looks like the easiest path to achieve the desired
> result.

So this whole issue is just because it is not possible to use PERFORM
alongside WITH?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: dubious optimization of the function in SELECT INTO target list

From
Oleksii Kliukin
Date:

On 08 Oct 2015, at 16:00, Adrian Klaver <adrian.klaver@aklaver.com> wrote:

On 10/08/2015 01:57 AM, Oleksii Kliukin wrote:

On 06 Oct 2015, at 23:31, Tom Lane <tgl@sss.pgh.pa.us
<mailto:tgl@sss.pgh.pa.us>> wrote:

Oleksii Kliukin <alexk@hintbits.com <mailto:alexk@hintbits.com>> writes:
This should work, but I'm interested in finding out why the original
statement behaves the way I’ve described.

plpgsql's SELECT INTO is only capable of storing a single result row,
so it only executes the statement far enough to obtain one row, and
then stops (as though a LIMIT were present).  There is no guarantee
about how much useless computation will get done underneath.

Thank you, now it’s clear. I have to say there is no guarantee that the
computation would be useless. Someone might be calling a function that
updates/deletes rows in the SELECT INTO block, being forced to use
SELECT INTO by inability of pl/pgSQL to just discard the result of a
normal SELECT. I know one can use a loop or call PERFORM, but in some
cases (a complex CTE computing the data for the function being called at
the end, which updates the tables with this data) actually using SELECT
INTO looks like the easiest path to achieve the desired result.

Well the best I can come up with at the moment is:

DO $$
DECLARE l_id integer;
   BEGIN
      WITH gs AS (select generate_series(1,10) as id)
       SELECT test(id) FROM gs ORDER BY id INTO l_id;
   END;
$$ LANGUAGE plpgsql;

Yeah, or use max/min/some other aggregate instead of ORDER BY.

Kind regards,
--
Oleksii

Re: dubious optimization of the function in SELECT INTO target list

From
Oleksii Kliukin
Date:

On 08 Oct 2015, at 19:54, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

Oleksii Kliukin wrote:

Thank you, now it’s clear. I have to say there is no guarantee that
the computation would be useless. Someone might be calling a function
that updates/deletes rows in the SELECT INTO block, being forced to
use SELECT INTO by inability of pl/pgSQL to just discard the result of
a normal SELECT. I know one can use a loop or call PERFORM, but in
some cases (a complex CTE computing the data for the function being
called at the end, which updates the tables with this data) actually
using SELECT INTO looks like the easiest path to achieve the desired
result.

So this whole issue is just because it is not possible to use PERFORM
alongside WITH?

The issue is in the SELECT INTO behaviour, but the root cause is exactly the lack of support for perform in CTEs in pl/pgSQL.

Kind regards,
--
Oleksii