Thread: Some notes about Param handling with "Oracle style" plpgsql variables

Some notes about Param handling with "Oracle style" plpgsql variables

From
Tom Lane
Date:
One of the interesting properties of Oracle-compatible variable
references in plpgsql is that the set of variables referenced by a
given query could change during a forced replan.  For example,
consider
declare x int;               r record;...for r in select x,y from tab loop ...

If tab contains a column "x" then the "x" reference in the SELECT
refers to tab.x; if not, it refers to the plpgsql variable x.
So when first executing the SELECT we might find that it requires
a Param reference to the plpgsql variable, and then after a replan
is forced by ALTER TABLE tab ADD COLUMN x, there is no need for
the Param anymore.  Or vice versa.

This kinda calls into question whether the Oracle way is actually
a good idea or not; but my purpose here is not to debate that,
just to look at what it takes to implement it.

Currently, plpgsql generates a list of the variables referenced by
any SQL statement or expression immediately upon seeing the text,
before it's ever even fed to the core parser.  I had been envisioning
having the parser callback hook construct the list on-the-fly during
parsing, but the possibility that the list will change from time to
time means that other changes are needed too.  Notably:

1. plancache.c does not have any provision for letting the Param type
array associated with a stored statement change when the statement is
replanned due to SI invalidation.

2. The control flow for a replan is that plpgsql calls SPI_execute_plan,
which calls RevalidateCachedPlan, which does the replan if the cached
plan is discovered to be stale.  However, plpgsql already had to set up
the list of actual parameter values before it called SPI_execute_plan,
which means it is *way* too late to change the list of required Params
even if plancache let us do it.

After chewing on these facts for awhile, I am thinking that the best
solution is for plpgsql to abandon the notion of a predetermined list
of parameters for a SQL query altogether.  What that list basically
provides is a mapping from Param numbers ($n) to plpgsql "datum numbers"
(indexes in the list of a plpgsql function's variables).  We could make
that mapping always be one-to-one, since there's no real reason that the
Params available to a query have to be consecutively numbered.  So the
transformColumnRef hook would just pass back a Param using the
referenced variable's datum number as paramid; it wouldn't bother at all
with building a data structure listing the specific variables actually
used in the query.

As far as plancache goes, it would therefore always see a null array
of Param type OIDs associated with a plpgsql-generated query, and we'd
not have to provide a way to update that.  (We'd still keep the ability
to store such an array, because most other callers of plancache will
still want a fixed list of Params.)  What we'd have to add to plancache
instead is the ability to install caller-determined parser callback
hooks when it is calling the parser for a replan.  This seems fairly
easy to do --- I'm envisioning a sort of meta-hook function that gets
called with the new ParseState and can insert hook function pointers
in it.

The other issue with this is what to do at runtime.  We could do it
with no other changes if we had plpgsql always set up Values/Nulls
arrays listing *every* datum's current value.  This seems a bit
brute-force though --- it could be slow in a function with a lot of
variables, and in most cases any specific query or expression would
not need most of those values.  What I think we should do instead
is extend the ParamListInfo structure to add a callback hook function
that populates individual ParamExternData array entries on-demand.
The core executor would call the hook when it tried to fetch the
value of a Param that was currently invalid (ptype == 0).  So the
hook would be invoked only once per query per referenced parameter,
which shouldn't be much overhead.  Another interesting property
of this approach is that it'd fix the longstanding user complaint
that constructions likeif (TG_OP = 'INSERT' and NEW.foo = 'bar') ...
fail prematurely.  The executor would never demand the value
of NEW.foo, and thus not fail, if TG_OP isn't INSERT.

Comments?
        regards, tom lane


Re: Some notes about Param handling with "Oracle style" plpgsql variables

From
Pavel Stehule
Date:
2009/11/2 Tom Lane <tgl@sss.pgh.pa.us>:
> One of the interesting properties of Oracle-compatible variable
> references in plpgsql is that the set of variables referenced by a
> given query could change during a forced replan.  For example,
> consider
>
>        declare x int;
>                r record;
>        ...
>        for r in select x,y from tab loop ...
>
> If tab contains a column "x" then the "x" reference in the SELECT
> refers to tab.x; if not, it refers to the plpgsql variable x.
> So when first executing the SELECT we might find that it requires
> a Param reference to the plpgsql variable, and then after a replan
> is forced by ALTER TABLE tab ADD COLUMN x, there is no need for
> the Param anymore.  Or vice versa.
>
> This kinda calls into question whether the Oracle way is actually
> a good idea or not; but my purpose here is not to debate that,
> just to look at what it takes to implement it.
>
> Currently, plpgsql generates a list of the variables referenced by
> any SQL statement or expression immediately upon seeing the text,
> before it's ever even fed to the core parser.  I had been envisioning
> having the parser callback hook construct the list on-the-fly during
> parsing, but the possibility that the list will change from time to
> time means that other changes are needed too.  Notably:
>
> 1. plancache.c does not have any provision for letting the Param type
> array associated with a stored statement change when the statement is
> replanned due to SI invalidation.
>
> 2. The control flow for a replan is that plpgsql calls SPI_execute_plan,
> which calls RevalidateCachedPlan, which does the replan if the cached
> plan is discovered to be stale.  However, plpgsql already had to set up
> the list of actual parameter values before it called SPI_execute_plan,
> which means it is *way* too late to change the list of required Params
> even if plancache let us do it.
>
> After chewing on these facts for awhile, I am thinking that the best
> solution is for plpgsql to abandon the notion of a predetermined list
> of parameters for a SQL query altogether.  What that list basically
> provides is a mapping from Param numbers ($n) to plpgsql "datum numbers"
> (indexes in the list of a plpgsql function's variables).  We could make
> that mapping always be one-to-one, since there's no real reason that the
> Params available to a query have to be consecutively numbered.  So the
> transformColumnRef hook would just pass back a Param using the
> referenced variable's datum number as paramid; it wouldn't bother at all
> with building a data structure listing the specific variables actually
> used in the query.
>
> As far as plancache goes, it would therefore always see a null array
> of Param type OIDs associated with a plpgsql-generated query, and we'd
> not have to provide a way to update that.  (We'd still keep the ability
> to store such an array, because most other callers of plancache will
> still want a fixed list of Params.)  What we'd have to add to plancache
> instead is the ability to install caller-determined parser callback
> hooks when it is calling the parser for a replan.  This seems fairly
> easy to do --- I'm envisioning a sort of meta-hook function that gets
> called with the new ParseState and can insert hook function pointers
> in it.
>
> The other issue with this is what to do at runtime.  We could do it
> with no other changes if we had plpgsql always set up Values/Nulls
> arrays listing *every* datum's current value.  This seems a bit
> brute-force though --- it could be slow in a function with a lot of
> variables, and in most cases any specific query or expression would
> not need most of those values.  What I think we should do instead
> is extend the ParamListInfo structure to add a callback hook function
> that populates individual ParamExternData array entries on-demand.
> The core executor would call the hook when it tried to fetch the
> value of a Param that was currently invalid (ptype == 0).  So the
> hook would be invoked only once per query per referenced parameter,
> which shouldn't be much overhead.  Another interesting property
> of this approach is that it'd fix the longstanding user complaint
> that constructions like
>        if (TG_OP = 'INSERT' and NEW.foo = 'bar') ...
> fail prematurely.  The executor would never demand the value
> of NEW.foo, and thus not fail, if TG_OP isn't INSERT.
>

good idea.

regards
Pavel

> Comments?
>
>                        regards, tom lane
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


Re: Some notes about Param handling with "Oracle style" plpgsql variables

From
Robert Haas
Date:
On Mon, Nov 2, 2009 at 11:31 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>  Another interesting property
> of this approach is that it'd fix the longstanding user complaint
> that constructions like
>        if (TG_OP = 'INSERT' and NEW.foo = 'bar') ...
> fail prematurely.  The executor would never demand the value
> of NEW.foo, and thus not fail, if TG_OP isn't INSERT.

I don't really know enough to comment on the best way to go about this
project overall, but fixing this would be incredibly nice, if it can
be done without too much damage.

...Robert


Re: Some notes about Param handling with "Oracle style" plpgsql variables

From
Pavel Stehule
Date:
2009/11/2 Tom Lane <tgl@sss.pgh.pa.us>:
> One of the interesting properties of Oracle-compatible variable
> references in plpgsql is that the set of variables referenced by a
> given query could change during a forced replan.  For example,
> consider
>
>        declare x int;
>                r record;
>        ...
>        for r in select x,y from tab loop ...
>
> If tab contains a column "x" then the "x" reference in the SELECT
> refers to tab.x; if not, it refers to the plpgsql variable x.
> So when first executing the SELECT we might find that it requires
> a Param reference to the plpgsql variable, and then after a replan
> is forced by ALTER TABLE tab ADD COLUMN x, there is no need for
> the Param anymore.  Or vice versa.
>
> This kinda calls into question whether the Oracle way is actually
> a good idea or not; but my purpose here is not to debate that,
> just to look at what it takes to implement it.

This is reason, why I would to see third mode (incompatible with
Oracle or pg), that raise error, when it detects any intersecting
identifiers. I understand so this mode should not be default, but
personally I'll use it everywhere.

Pavel

>
> Currently, plpgsql generates a list of the variables referenced by
> any SQL statement or expression immediately upon seeing the text,
> before it's ever even fed to the core parser.  I had been envisioning
> having the parser callback hook construct the list on-the-fly during
> parsing, but the possibility that the list will change from time to
> time means that other changes are needed too.  Notably:
>
> 1. plancache.c does not have any provision for letting the Param type
> array associated with a stored statement change when the statement is
> replanned due to SI invalidation.
>
> 2. The control flow for a replan is that plpgsql calls SPI_execute_plan,
> which calls RevalidateCachedPlan, which does the replan if the cached
> plan is discovered to be stale.  However, plpgsql already had to set up
> the list of actual parameter values before it called SPI_execute_plan,
> which means it is *way* too late to change the list of required Params
> even if plancache let us do it.
>
> After chewing on these facts for awhile, I am thinking that the best
> solution is for plpgsql to abandon the notion of a predetermined list
> of parameters for a SQL query altogether.  What that list basically
> provides is a mapping from Param numbers ($n) to plpgsql "datum numbers"
> (indexes in the list of a plpgsql function's variables).  We could make
> that mapping always be one-to-one, since there's no real reason that the
> Params available to a query have to be consecutively numbered.  So the
> transformColumnRef hook would just pass back a Param using the
> referenced variable's datum number as paramid; it wouldn't bother at all
> with building a data structure listing the specific variables actually
> used in the query.
>
> As far as plancache goes, it would therefore always see a null array
> of Param type OIDs associated with a plpgsql-generated query, and we'd
> not have to provide a way to update that.  (We'd still keep the ability
> to store such an array, because most other callers of plancache will
> still want a fixed list of Params.)  What we'd have to add to plancache
> instead is the ability to install caller-determined parser callback
> hooks when it is calling the parser for a replan.  This seems fairly
> easy to do --- I'm envisioning a sort of meta-hook function that gets
> called with the new ParseState and can insert hook function pointers
> in it.
>
> The other issue with this is what to do at runtime.  We could do it
> with no other changes if we had plpgsql always set up Values/Nulls
> arrays listing *every* datum's current value.  This seems a bit
> brute-force though --- it could be slow in a function with a lot of
> variables, and in most cases any specific query or expression would
> not need most of those values.  What I think we should do instead
> is extend the ParamListInfo structure to add a callback hook function
> that populates individual ParamExternData array entries on-demand.
> The core executor would call the hook when it tried to fetch the
> value of a Param that was currently invalid (ptype == 0).  So the
> hook would be invoked only once per query per referenced parameter,
> which shouldn't be much overhead.  Another interesting property
> of this approach is that it'd fix the longstanding user complaint
> that constructions like
>        if (TG_OP = 'INSERT' and NEW.foo = 'bar') ...
> fail prematurely.  The executor would never demand the value
> of NEW.foo, and thus not fail, if TG_OP isn't INSERT.
>
> Comments?
>
>                        regards, tom lane
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


Re: Some notes about Param handling with "Oracle style" plpgsql variables

From
Tom Lane
Date:
Pavel Stehule <pavel.stehule@gmail.com> writes:
> 2009/11/2 Tom Lane <tgl@sss.pgh.pa.us>:
>> This kinda calls into question whether the Oracle way is actually
>> a good idea or not; but my purpose here is not to debate that,
>> just to look at what it takes to implement it.

> This is reason, why I would to see third mode (incompatible with
> Oracle or pg), that raise error, when it detects any intersecting
> identifiers. I understand so this mode should not be default, but
> personally I'll use it everywhere.

Actually, I thought we'd decided that "throw error on conflict"
*would* be the default behavior.
        regards, tom lane


Re: Some notes about Param handling with "Oracle style" plpgsql variables

From
Pavel Stehule
Date:
2009/11/2 Tom Lane <tgl@sss.pgh.pa.us>:
> Pavel Stehule <pavel.stehule@gmail.com> writes:
>> 2009/11/2 Tom Lane <tgl@sss.pgh.pa.us>:
>>> This kinda calls into question whether the Oracle way is actually
>>> a good idea or not; but my purpose here is not to debate that,
>>> just to look at what it takes to implement it.
>
>> This is reason, why I would to see third mode (incompatible with
>> Oracle or pg), that raise error, when it detects any intersecting
>> identifiers. I understand so this mode should not be default, but
>> personally I'll use it everywhere.
>
> Actually, I thought we'd decided that "throw error on conflict"
> *would* be the default behavior.

good

regards
Pavel
>
>                        regards, tom lane
>


Re: Some notes about Param handling with "Oracle style" plpgsql variables

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Nov 2, 2009 at 11:31 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>  Another interesting property
>> of this approach is that it'd fix the longstanding user complaint
>> that constructions like
>>        if (TG_OP = 'INSERT' and NEW.foo = 'bar') ...
>> fail prematurely.  The executor would never demand the value
>> of NEW.foo, and thus not fail, if TG_OP isn't INSERT.

> I don't really know enough to comment on the best way to go about this
> project overall, but fixing this would be incredibly nice, if it can
> be done without too much damage.

After further reflection, there's a little more here than meets the eye.
We can make it work as above for constructs that execute indivisibly
from the point of view of a plpgsql function, like simple expressions.
But there are also time-extended executions, like cursors and FOR-loop
queries.  What happens if you do something like
declare x int;...for r in select * from tab where id = x loop ...

and change x inside the loop?

Currently the code guarantees that these queries are run using the
values that plpgsql variables had at the opening of the cursor or start
of the for-loop.  I think it would be a really bad idea to let it behave
any differently --- even if it were rational to do something different,
can you imagine trying to find bugs caused by such a change in functions
that used to work?  But that means we have to evaluate and copy the
values of all variables that such a query *could* reference, even if it
then fails to touch them at runtime.

This doesn't seem like a fatal objection to me, but it's worth
mentioning that the improvement will only apply in some contexts.
        regards, tom lane