Thread: Pass-by-reference UDTs and volatility

Pass-by-reference UDTs and volatility

From: Stephen Scheck
Hello,

I am working on an extension that defines a number of user-defined functions operating on a common, custom data type to perform a pipeline of transformations (the data type is the IN/OUT parameter for all of the functions). A source function will produce the initial instance of the data type and feed it to the head of the pipeline, and the result is eventually supplied to a sink function which takes the data type as input and produces tuple(s) as output. As such, the data type only ever exists in memory and is never intended to be used as a column of a table or persisted to disk.

What I would really like is something like the "internal" pseudo-type that can be used to define the functions, but is not allowed to be used as a column type in a table. However, since these functions must be callable from a user context, "internal" will not work. I'm not that concerned with users trying to use the data type in table DDL as they can simply be instructed not to in documentation (and suffer the consequences of their actions if they don't read the fine manual). However, in chapter 35.9 of the Postgres docs, there is this warning:

"Never modify the contents of a pass-by-reference input value. If you do so you are likely to corrupt on-disk data, since the pointer you are given might point directly into a disk buffer. The sole exception to this rule is explained in Section 35.10."

If the functions the extension defines are the sole producers and consumers of the data type and are consistent in the way they manipulate the in-memory data structure for the type, can the above rule be safely ignored? Or could the backend do something like persist intermediate return values from functions to temporary disk storage as it proceeds with execution of a query plan?

Thanks.

Re: Pass-by-reference UDTs and volatility

From: Tom Lane
Stephen Scheck <singularsyntax@gmail.com> writes:
> "Never modify the contents of a pass-by-reference input value. If you do so
> you are likely to corrupt on-disk data, since the pointer you are given
> might point directly into a disk buffer. The sole exception to this rule is
> explained in Section 35.10."

> If the functions the extension defines are the sole producers/consumers of
> the data type and are consistent in the way they manipulate the in-memory
> data structure for the type, can the above rule be safely ignored?

No.

> Or could the
> backend do something like try to persist intermediate return values from
> functions to temporary hard storage as it proceeds with execution of a
> query plan?

It might well do that; you really do not have the option to create Datum
values that can't be copied by datumCopy().  Even more directly, if you
do something like

select foo('...'::pass_by_ref_type)

and foo elects to scribble on its input, it will be corrupting a Const
node in the query plan.  You'd probably not notice any bad effects from
that in the case of a one-shot plan, but it would definitely break
cached plans.
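
To make the rule concrete, here is a minimal standalone sketch of a pipeline
stage that copies its input before changing anything. Everything in it is an
assumption for illustration: the PipeValue layout and the pipe_double name are
hypothetical, and plain malloc stands in for palloc so the sketch compiles
without the PostgreSQL headers.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define PIPE_MAX_ITEMS 8

/* Hypothetical in-memory value; the thread does not give the real layout. */
typedef struct PipeValue
{
    size_t      nitems;
    int         items[PIPE_MAX_ITEMS];
} PipeValue;

/*
 * One pipeline stage: doubles every item.  The input is treated as strictly
 * read-only, because it might be a Const node or point into a disk buffer.
 * Returns a freshly allocated result, or NULL if allocation fails.
 */
PipeValue *
pipe_double(const PipeValue *in)
{
    PipeValue  *out;

    assert(in != NULL && in->nitems <= PIPE_MAX_ITEMS);

    out = malloc(sizeof(PipeValue));    /* palloc in a real extension */
    if (out == NULL)
        return NULL;

    memcpy(out, in, sizeof(PipeValue)); /* copy first ... */
    for (size_t i = 0; i < out->nitems; i++)
        out->items[i] *= 2;             /* ... then modify only the copy */

    return out;
}
```

The cost is one copy per stage, which is what the token approach discussed
below tries to avoid.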

Just brainstorming here, but: you might consider keeping the actual
value(s) in private storage, perhaps a hashtable, and making the Datums
that Postgres passes around be just tokens referencing hashtable
entries.  This would for one thing give you greatly more security
against user query-structure errors than what you're sketching.
The main thing that might be hard to deal with is figuring out when it's
safe to reclaim a no-longer-referenced value.  You could certainly do so
at top-level transaction end, but depending on what your app is doing,
that might not be enough.

            regards, tom lane
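
The token idea above could be sketched roughly as follows. This is a
standalone model, not backend code: the Registry type and the registry_*
functions are hypothetical names, malloc/free stand in for the backend's
memory contexts, and a real extension would more likely build on the backend's
own hash table facilities. The point is only that the value passed around is a
small integer token, while the mutable state lives in private storage.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define REG_BUCKETS 64

typedef struct RegEntry
{
    uint64_t    token;          /* what would travel inside the Datum */
    void       *value;          /* privately owned, freely mutable state */
    struct RegEntry *next;      /* chain for entries sharing a bucket */
} RegEntry;

typedef struct Registry
{
    RegEntry   *buckets[REG_BUCKETS];
    uint64_t    next_token;     /* token 0 is reserved as "invalid" */
} Registry;

void
registry_init(Registry *reg)
{
    assert(reg != NULL);
    for (int i = 0; i < REG_BUCKETS; i++)
        reg->buckets[i] = NULL;
    reg->next_token = 1;
}

/* Takes ownership of a malloc'd value; returns its token, or 0 on failure. */
uint64_t
registry_put(Registry *reg, void *value)
{
    RegEntry   *e;
    size_t      b;

    assert(reg != NULL);
    e = malloc(sizeof(RegEntry));
    if (e == NULL)
        return 0;

    e->token = reg->next_token++;
    e->value = value;
    b = (size_t) (e->token % REG_BUCKETS);
    e->next = reg->buckets[b];
    reg->buckets[b] = e;
    return e->token;
}

/* Returns the stored value, or NULL for an unknown or reclaimed token. */
void *
registry_get(const Registry *reg, uint64_t token)
{
    const RegEntry *e;

    assert(reg != NULL);
    for (e = reg->buckets[token % REG_BUCKETS]; e != NULL; e = e->next)
    {
        if (e->token == token)
            return e->value;
    }
    return NULL;
}

/* Reclaims every entry, e.g. at top-level transaction end. */
void
registry_clear(Registry *reg)
{
    assert(reg != NULL);
    for (int i = 0; i < REG_BUCKETS; i++)
    {
        RegEntry   *e = reg->buckets[i];

        while (e != NULL)
        {
            RegEntry   *next = e->next;

            free(e->value);
            free(e);
            e = next;
        }
        reg->buckets[i] = NULL;
    }
}
```

Note that registry_clear does not reset next_token, so a stale token left over
from an earlier transaction looks up as NULL instead of aliasing a newer
value. Copying a token with datumCopy() is harmless in this scheme, since the
copy still refers to the same private entry.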


Re: Pass-by-reference UDTs and volatility

From: Stephen Scheck
Hmm, that might work - so allocate the values in a transaction-scoped memory context?
But how would the hash table keys themselves be deleted? Is there some callback API to
hook transaction completion?


On Wed, Jun 12, 2013 at 1:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Just brainstorming here, but: you might consider keeping the actual
value(s) in private storage, perhaps a hashtable, and making the Datums
that Postgres passes around be just tokens referencing hashtable
entries.  This would for one thing give you greatly more security
against user query-structure errors than what you're sketching.
The main thing that might be hard to deal with is figuring out when it's
safe to reclaim a no-longer-referenced value.  You could certainly do so
at top-level transaction end, but depending on what your app is doing,
that might not be enough.

                        regards, tom lane

Re: Pass-by-reference UDTs and volatility

From: Tom Lane
Stephen Scheck <singularsyntax@gmail.com> writes:
> But how would the hash table keys themselves be deleted? Is there some
> callback API to hook transaction completion?

See RegisterXactCallback and RegisterSubXactCallback.

            regards, tom lane
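
The shape of such a cleanup hook might look like the sketch below. It is a
standalone model and does not include the PostgreSQL headers: DemoXactEvent,
demo_register_xact_callback and demo_end_transaction are hypothetical
stand-ins that only mirror the general shape of the real facility (a callback
taking an event code and a void pointer argument, registered once and invoked
at transaction end). In an actual extension the callback would be registered
with RegisterXactCallback, and subtransaction events would need
RegisterSubXactCallback as well.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the backend's transaction event and callback types. */
typedef enum DemoXactEvent
{
    DEMO_XACT_COMMIT,
    DEMO_XACT_ABORT
} DemoXactEvent;

typedef void (*DemoXactCallback) (DemoXactEvent event, void *arg);

static DemoXactCallback registered_cb = NULL;
static void *registered_arg = NULL;

/* Stand-in for registering a callback once, e.g. at library load time. */
void
demo_register_xact_callback(DemoXactCallback cb, void *arg)
{
    assert(cb != NULL);
    registered_cb = cb;
    registered_arg = arg;
}

/* Stand-in for the backend finishing a top-level transaction. */
void
demo_end_transaction(DemoXactEvent event)
{
    if (registered_cb != NULL)
        registered_cb(event, registered_arg);
}

/* Hypothetical private store; a counter stands in for the hash table. */
typedef struct TokenStore
{
    int         live_values;
} TokenStore;

/* Extension-side callback: drop all private values on commit or abort. */
void
token_store_xact_callback(DemoXactEvent event, void *arg)
{
    TokenStore *store = arg;

    switch (event)
    {
        case DEMO_XACT_COMMIT:
        case DEMO_XACT_ABORT:
            store->live_values = 0;     /* stands in for freeing entries */
            break;
    }
}
```

Handling commit and abort identically keeps the store from leaking entries
when a query fails partway through the pipeline.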