Thread: Detection of nested function calls

Detection of nested function calls

From
Hugo Mercier
Date:
Hi all,

The Oslandia team has been involved in the PostGIS project for years,
with a current focus on PostGIS 3D support.
In PostGIS queries, nested function calls that manipulate geometries
are quite common, e.g.: SELECT ST_Union(ST_Intersection(a.geom,
ST_Buffer(b.geom, 50)))

PostGIS functions that manipulate geometries have to unserialize their
input geometries from the 'flat' varlena representation to their own,
and serialize the processed geometries back when returning.
But in such nested call queries, this serialization-unserialization
process is just an overhead.

Avoiding it could then lead to a real performance gain [1], especially
when the internal type takes time to serialize (with new PostGIS types
like rasters or 3D geometries, this really matters).

So we thought having a way for user functions to know if they are part
of a nested call could allow them to avoid this serialization phase.

The idea would be to have a boolean flag reachable from a user function
(within FunctionCallInfoData) that says if the current function is
nested or not.

We have already investigated such a modification, and here is where we
are so far:
  - we modified the parser by adding a new boolean member 'nested' to the
FuncExpr struct. Within the parser, we know whether a function call is
nested inside another one, so we can mark the FuncExpr accordingly
  - the executor has been modified so that it takes this nested member
into account and passes it to the FunctionCallInfoData structure before
evaluating the function
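
The two steps above can be sketched roughly as follows. This is only an illustration under simplifying assumptions: SketchExpr and SketchCallInfo are hypothetical stand-ins for FuncExpr and FunctionCallInfoData, and none of the names below come from the attached patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified parse-tree node; not PostgreSQL's FuncExpr. */
typedef enum { SK_COLUMN, SK_FUNC } SketchKind;

typedef struct SketchExpr {
    SketchKind kind;
    const char *name;
    bool nested;               /* proposed flag: call is an argument of another call */
    size_t nargs;
    struct SketchExpr **args;
} SketchExpr;

/* Hypothetical stand-in for the per-call structure seen by a user function. */
typedef struct SketchCallInfo {
    bool nested;
} SketchCallInfo;

/* "Parser" step: mark each function call whose direct parent is a function call. */
static void sketch_mark_nested(SketchExpr *expr, bool parent_is_func)
{
    if (expr == NULL)
        return;
    if (expr->kind == SK_FUNC)
        expr->nested = parent_is_func;
    for (size_t i = 0; i < expr->nargs; i++)
        sketch_mark_nested(expr->args[i], expr->kind == SK_FUNC);
}

/* "Executor" step: copy the flag into the call info before evaluation. */
static SketchCallInfo sketch_init_call(const SketchExpr *expr)
{
    SketchCallInfo ci = { expr->nested };
    return ci;
}

/* What a user function would decide: serialize only when not nested. */
static bool sketch_must_serialize(const SketchCallInfo *ci)
{
    return !ci->nested;
}
```

For the example query above, ST_Buffer and ST_Intersection would be marked as nested, while the outermost ST_Union would not and would therefore still serialize its result.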

We are working on a PostGIS branch that takes advantage of this
functionality [2]

A first draft of the patch is attached.

Obviously, even though this is about a PostGIS use case, the approach
could be helpful for any other query that combines nested functions and
serialization.

I am quite new to postgresql hacking, so I'm sure there is room for
improvements. But, what about this first proposal ?

I'll be at the PGDay conf in Dublin next week, so we could discuss this
topic.

[1] Regarding performance, we have already investigated such a
"pass-by-reference" mechanism with PostGIS. A dummy function
"st_copy" that only copies its input geometry to its output, with 4
levels of nesting, gives encouraging results (passing geometries by
reference is more than 2x faster than (un)serializing):
https://github.com/Oslandia/sfcgal-tests/blob/master/bench/report_serialization_referenced_vs_native.pdf

[2] https://github.com/Oslandia/postgis/tree/nested_ref_passing
--
Hugo Mercier
Oslandia

Attachment

Re: Detection of nested function calls

From
Pavel Stehule
Date:
Hello


2013/10/25 Hugo Mercier <hugo.mercier@oslandia.com>
> I am quite new to postgresql hacking, so I'm sure there is room for
> improvements. But, what about this first proposal ?

I am not sure, if this solution is enough - what will be done if I store some values in PL/pgSQL variables?

Regards

Pavel
 


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Detection of nested function calls

From
Hugo Mercier
Date:
On 25/10/2013 14:29, Pavel Stehule wrote:
> Hello
>
>
> 2013/10/25 Hugo Mercier <hugo.mercier@oslandia.com>
>
>     I am quite new to postgresql hacking, so I'm sure there is room for
>     improvements. But, what about this first proposal ?
>
>
> I am not sure, if this solution is enough - what will be done if I store
> some values in PL/pgSQL variables?
>

You mean if you store the result of a (nested) function evaluation in a
PL/pgSQL variable ?
Then no nesting will be detected by the parser and in this case the user
function must ensure its result is serialized, since it could be stored
(in a variable or a table) at any time.

--
Hugo Mercier
Oslandia



Re: Detection of nested function calls

From
Pavel Stehule
Date:



2013/10/25 Hugo Mercier <hugo.mercier@oslandia.com>
> You mean if you store the result of a (nested) function evaluation in a
> PL/pgSQL variable ?
> Then no nesting will be detected by the parser and in this case the user
> function must ensure its result is serialized, since it could be stored
> (in a variable or a table) at any time.

ok

I remember thinking about a similar optimization when I worked on the SQL/XML implementation, so the same optimization could be used there.

Regards

Pavel

 


Re: Detection of nested function calls

From
Tom Lane
Date:
Hugo Mercier <hugo.mercier@oslandia.com> writes:
> PostGIS functions that manipulate geometries have to unserialize their
> input geometries from the 'flat' varlena representation to their own,
> and serialize the processed geometries back when returning.
> But in such nested call queries, this serialization-unserialization
> process is just an overhead.

This is a reasonable thing to worry about, not just for PostGIS types but
for many container types such as arrays --- it'd be nice to be able to
work with an in-memory representation that wasn't just a contiguous blob
of data.  For instance, assignment to an array element might become a
constant-time operation even when working with variable-length datatypes.

> So we thought having a way for user functions to know if they are part
> of a nested call could allow them to avoid this serialization phase.

However, this seems like a completely wrong way to go at it.  In the first
place, it wouldn't help for situations like a complex value stored in a
plpgsql variable.  In the second, I don't think that what you are
describing scales to any more than the most trivial situations.  What
about functions with more than one complex-type input, for example?  And
you'd need to be certain that every single function taking or returning
the datatype gets updated at exactly the same time, else it'll break.

I think the right way to attack it is to create some way for a Datum
value to indicate, at runtime, whether it's a flat value or an in-memory
representation.  Any given function returning the type could choose to
return either representation.  The datatype would have to provide a way
to serialize the in-memory representation, when and if it came time to
store it in a table.  To avoid breaking functions that hadn't yet been
taught about the new representation, we'd probably want to redefine the
existing DETOAST macros as also invoking this datatype flattening
function, and then you'd need to use some new access macro if you wanted
visibility of the non-flat representation.  (This assumes that the whole
thing is only applicable to toastable datatypes, but that seems like a
reasonable restriction.)
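
A rough sketch of that idea, under heavy simplifying assumptions: the value is an integer list rather than a real varlena, and every name below (SketchDatum, sketch_get_flat, sketch_get_expanded, and so on) is hypothetical and not an existing PostgreSQL API. The "old" access path always yields the flat form, so functions that have not been updated keep working, while the "new" access path exposes the in-memory form. Memory management is ignored here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical tag carried by every value; not PostgreSQL's varlena header. */
typedef enum { SK_FLAT, SK_EXPANDED } SketchTag;
typedef struct SketchHeader { SketchTag tag; } SketchHeader;

/* Flat form: one contiguous blob, the only form that may be stored in a table. */
typedef struct SketchFlat {
    SketchHeader hdr;
    size_t n;
    int items[];
} SketchFlat;

/* In-memory form: separately allocated, growable storage. */
typedef struct SketchExpanded {
    SketchHeader hdr;
    size_t n, cap;
    int *items;
} SketchExpanded;

typedef void *SketchDatum;      /* points at either form */

static void *sketch_alloc(size_t size)
{
    void *p = malloc(size);
    if (p == NULL)
        abort();
    return p;
}

static SketchFlat *sketch_make_flat(const int *items, size_t n)
{
    SketchFlat *f = sketch_alloc(sizeof(SketchFlat) + n * sizeof(int));
    f->hdr.tag = SK_FLAT;
    f->n = n;
    if (n > 0)
        memcpy(f->items, items, n * sizeof(int));
    return f;
}

/* "Old" access path: always returns a flat value, flattening if needed. */
static SketchFlat *sketch_get_flat(SketchDatum d)
{
    if (((SketchHeader *) d)->tag == SK_FLAT)
        return d;
    SketchExpanded *e = d;
    return sketch_make_flat(e->items, e->n);
}

/* "New" access path: returns the in-memory form, expanding if needed. */
static SketchExpanded *sketch_get_expanded(SketchDatum d)
{
    if (((SketchHeader *) d)->tag == SK_EXPANDED)
        return d;
    SketchFlat *f = d;
    SketchExpanded *e = sketch_alloc(sizeof *e);
    e->hdr.tag = SK_EXPANDED;
    e->n = f->n;
    e->cap = f->n > 0 ? f->n : 1;
    e->items = sketch_alloc(e->cap * sizeof(int));
    if (f->n > 0)
        memcpy(e->items, f->items, f->n * sizeof(int));
    return e;
}

/* A function taught about the new form: works in place, returns it unflattened. */
static SketchDatum sketch_append(SketchDatum d, int v)
{
    SketchExpanded *e = sketch_get_expanded(d);
    if (e->n == e->cap) {
        e->cap *= 2;
        e->items = realloc(e->items, e->cap * sizeof(int));
        if (e->items == NULL)
            abort();
    }
    e->items[e->n++] = v;
    return e;
}

/* A function not yet taught about the new form: only ever sees flat values. */
static int sketch_sum(SketchDatum d)
{
    SketchFlat *f = sketch_get_flat(d);
    int sum = 0;
    for (size_t i = 0; i < f->n; i++)
        sum += f->items[i];
    if ((void *) f != d)
        free(f);                /* discard the temporary flattened copy */
    return sum;
}
```

In this sketch, sketch_sum(sketch_append(sketch_append(d, 4), 5)) flattens only once, inside sketch_sum, and neither function needs to know how the other was written.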

Another thing that would have to be attacked in order to make the
plpgsql-variable case work is that you'd need some design for copying such
Datums in-memory, and perhaps a reference count mechanism to optimize away
unnecessary copies.  Your idea of tying the optimization to the nested
function call scenario would avoid the need to solve this problem, but
I think it's too narrow a scope to justify all the other work that'd be
involved.
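
One possible shape for such a reference count mechanism is sketched below, purely as an illustration. SketchShared and its helper functions are hypothetical names, plain malloc stands in for memory contexts, and the sketch says nothing about what should happen on errors or context resets.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical reference-counted in-memory value. */
typedef struct SketchShared {
    int refcount;
    size_t n;
    int *items;
} SketchShared;

/* Create a zero-filled value owned by a single reference. */
static SketchShared *sketch_new(size_t n)
{
    SketchShared *v = malloc(sizeof *v);
    if (v == NULL)
        abort();
    v->refcount = 1;
    v->n = n;
    v->items = calloc(n > 0 ? n : 1, sizeof(int));
    if (v->items == NULL)
        abort();
    return v;
}

/* "Copying" a value only bumps the count; no data is duplicated. */
static SketchShared *sketch_share(SketchShared *v)
{
    assert(v->refcount > 0);
    v->refcount++;
    return v;
}

/* Drop one reference and free the value when the last one goes away. */
static void sketch_release(SketchShared *v)
{
    assert(v->refcount > 0);
    if (--v->refcount == 0) {
        free(v->items);
        free(v);
    }
}

/* Copy-on-write: a real copy is made only if someone else still holds a reference. */
static SketchShared *sketch_make_writable(SketchShared *v)
{
    if (v->refcount == 1)
        return v;
    SketchShared *copy = sketch_new(v->n);
    if (v->n > 0)
        memcpy(copy->items, v->items, v->n * sizeof(int));
    v->refcount--;              /* the caller gives up its reference to the original */
    return copy;
}
```

Assigning a value to a second variable would call sketch_share, and the actual copy would be deferred until one of the holders wants to modify it.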

Some colleagues of mine at Salesforce have been playing with ideas like
this, though last I heard they were nowhere near having a submittable
patch.
        regards, tom lane



Re: Detection of nested function calls

From
Andres Freund
Date:
On 2013-10-25 10:18:27 -0400, Tom Lane wrote:
> I think the right way to attack it is to create some way for a Datum
> value to indicate, at runtime, whether it's a flat value or an in-memory
> representation.  Any given function returning the type could choose to
> return either representation.  The datatype would have to provide a way
> to serialize the in-memory representation, when and if it came time to
> store it in a table.  To avoid breaking functions that hadn't yet been
> taught about the new representation, we'd probably want to redefine the
> existing DETOAST macros as also invoking this datatype flattening
> function, and then you'd need to use some new access macro if you wanted
> visibility of the non-flat representation.  (This assumes that the whole
> thing is only applicable to toastable datatypes, but that seems like a
> reasonable restriction.)

That sounds reasonable, and we have most of the infrastructure for it
since the "indirect toast" thing got in.

> Another thing that would have to be attacked in order to make the
> plpgsql-variable case work is that you'd need some design for copying such
> Datums in-memory, and perhaps a reference count mechanism to optimize away
> unnecessary copies.  Your idea of tying the optimization to the nested
> function call scenario would avoid the need to solve this problem, but
> I think it's too narrow a scope to justify all the other work that'd be
> involved.

I've thought about refcounting Datums several times, but I always got
stuck when thinking about how to deal memory context resets and errors.
Any ideas about that?

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Detection of nested function calls

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-10-25 10:18:27 -0400, Tom Lane wrote:
>> I think the right way to attack it is to create some way for a Datum
>> value to indicate, at runtime, whether it's a flat value or an in-memory
>> representation.

> That sounds reasonable, and we have most of the infrastructure for it
> since the "indirect toast" thing got in.

Oh really?  I hadn't been paying much attention to that, but obviously
I better go back and study it.

> I've thought about refcounting Datums several times, but I always got
> stuck when thinking about how to deal memory context resets and errors.
> Any ideas about that?

Not yet.  But it makes no sense to claim that a Datum could have a
reference that's longer-lived than the memory context it's in, so
I'm not sure the context reset case is really a problem.
        regards, tom lane



Re: Detection of nested function calls

From
Hugo Mercier
Date:
On 25/10/2013 16:18, Tom Lane wrote:
> Hugo Mercier <hugo.mercier@oslandia.com> writes:
>> PostGIS functions that manipulate geometries have to unserialize their
>> input geometries from the 'flat' varlena representation to their own,
>> and serialize the processed geometries back when returning.
>> But in such nested call queries, this serialization-unserialization
>> process is just an overhead.
>
> This is a reasonable thing to worry about, not just for PostGIS types but
> for many container types such as arrays --- it'd be nice to be able to
> work with an in-memory representation that wasn't just a contiguous blob
> of data.  For instance, assignment to an array element might become a
> constant-time operation even when working with variable-length datatypes.
>
>> So we thought having a way for user functions to know if they are part
>> of a nested call could allow them to avoid this serialization phase.
>
> However, this seems like a completely wrong way to go at it.  In the first
> place, it wouldn't help for situations like a complex value stored in a
> plpgsql variable.  In the second, I don't think that what you are
> describing scales to any more than the most trivial situations.  What
> about functions with more than one complex-type input, for example?  And
> you'd need to be certain that every single function taking or returning
> the datatype gets updated at exactly the same time, else it'll break.

About plpgsql variables: no, there won't be any optimization in that
case. At the time the function result has to be stored in a variable, it
must be serialized.

About functions with more than one complex-type input, as long as all
parameters are of the same type, there is no problem with that.
But if your function deals with more than one complex type AND you want
to avoid serialization on each parameter, then yes, each type must be
aware of this possible optimization (choose whether to serialize or not).

I don't understand what you mean by "be certain that every single
function ... gets updated at exactly the same time". Could you develop ?

>
> I think the right way to attack it is to create some way for a Datum
> value to indicate, at runtime, whether it's a flat value or an in-memory
> representation.  Any given function returning the type could choose to
> return either representation.  The datatype would have to provide a way
> to serialize the in-memory representation, when and if it came time to
> store it in a table.  To avoid breaking functions that hadn't yet been
> taught about the new representation, we'd probably want to redefine the
> existing DETOAST macros as also invoking this datatype flattening
> function, and then you'd need to use some new access macro if you wanted
> visibility of the non-flat representation.  (This assumes that the whole
> thing is only applicable to toastable datatypes, but that seems like a
> reasonable restriction.)

You're totally right. That is very close to what I am working on with
PostGIS.
This is still early work, but here are some details:

https://github.com/Oslandia/postgis/blob/nested_ref_passing/postgis/lwgeom_ref.h

Basically, the 'geometry' type of PostGIS is here extended with a flag
saying if the data is actual 'flat' data or a plain pointer. And if this
is a pointer, a type identifier is stored.

And there is a new DETOAST macro (here POSTGIS_DETOAST_DATUM) that
tests whether the Datum is a pointer and, if so, calls the
corresponding unserializing functions. So you can avoid copies if your
function is aware of that, and the changes to existing functions will be
minimal.

https://github.com/Oslandia/postgis/blob/nested_ref_passing/postgis/lwgeom_ref.c

You said "when and if it came time to store it in a table". That is
exactly the point of this 'nested' boolean: otherwise, how would a
function know that it is time to store its result in a table?

>
> Another thing that would have to be attacked in order to make the
> plpgsql-variable case work is that you'd need some design for copying such
> Datums in-memory, and perhaps a reference count mechanism to optimize away
> unnecessary copies.  Your idea of tying the optimization to the nested
> function call scenario would avoid the need to solve this problem, but
> I think it's too narrow a scope to justify all the other work that'd be
> involved.

Do you think it must necessarily cover the plpgsql variable case to be
acceptable?


--
Hugo Mercier
Oslandia



Re: Detection of nested function calls

From
Andres Freund
Date:
On 2013-10-25 11:01:28 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2013-10-25 10:18:27 -0400, Tom Lane wrote:
> >> I think the right way to attack it is to create some way for a Datum
> >> value to indicate, at runtime, whether it's a flat value or an in-memory
> >> representation.
> 
> > That sounds reasonable, and we have most of the infrastructure for it
> > since the "indirect toast" thing got in.
> 
> Oh really?  I hadn't been paying much attention to that, but obviously
> I better go back and study it.

Well, it has the infrastructure for adding further varattrib_1b_e
types and for computing the size independently. So you
can easily add a new type of toast datum. There still needs to be
handling for it in tuptoaster.c et al, but that's not surprising ;)

> > I've thought about refcounting Datums several times, but I always got
> > stuck when thinking about how to deal memory context resets and errors.
> > Any ideas about that?
> 
> Not yet.  But it makes no sense to claim that a Datum could have a
> reference that's longer-lived than the memory context it's in, so
> I'm not sure the context reset case is really a problem.

Given how short-lived many of the contexts used for expression
evaluation are, that might restrict the usefulness quite a bit. I think
at the very least it has to be allowed that a Datum also gets used in
child contexts.
But that already opens the door to refcount leakage when the child
context gets destroyed.

I wonder if this needs mcxt.c/aset.c support to be useful.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Detection of nested function calls

From
Tom Lane
Date:
Hugo Mercier <hugo.mercier@oslandia.com> writes:
> On 25/10/2013 16:18, Tom Lane wrote:
>> However, this seems like a completely wrong way to go at it.  In the first
>> place, it wouldn't help for situations like a complex value stored in a
>> plpgsql variable.  In the second, I don't think that what you are
>> describing scales to any more than the most trivial situations.  What
>> about functions with more than one complex-type input, for example?  And
>> you'd need to be certain that every single function taking or returning
>> the datatype gets updated at exactly the same time, else it'll break.

> About functions with more than one complex-type input, as long as all
> parameters are of the same type, there is no problem with that.

How do you tell the difference between

      foo(col1, bar(col2))
      foo(bar(col1), col2)

> I don't understand what you mean by "be certain that every single
> function ... gets updated at exactly the same time". Could you develop ?

If you're tying this to the syntax of the expression, then bar() *must*
return a non-serialized value when and only when foo() is expecting that,
therefore their implementations must change at the same time.  Perhaps
that's workable for PostGIS, but it's a complete nonstarter for
widely-known datatypes like arrays, where affected functions might be
spread through any number of extensions.  We need a design that permits
incremental fixing of functions that work with a deserializable datatype.

Another point worth worrying about is that not all expressions are
function calls, nor do all function calls arise from expressions.
Chasing down all the corner cases and making sure they work properly
in a syntax-driven approach is going to be a headache.

> Basically, the 'geometry' type of PostGIS is here extended with a flag
> saying if the data is actual 'flat' data or a plain pointer. And if this
> is a pointer, a type identifier is stored.

If you're doing that, why do you need the decoration on the FuncExpr
expressions?  Can't you just look at your input datums and see if they're
flat or not?
        regards, tom lane



Re: Detection of nested function calls

From
Hugo Mercier
Date:
On 25/10/2013 17:20, Tom Lane wrote:
> Hugo Mercier <hugo.mercier@oslandia.com> writes:
>> On 25/10/2013 16:18, Tom Lane wrote:

> How do you tell the difference between
>
>        foo(col1, bar(col2))
>        foo(bar(col1), col2)
>

Still not sure to understand ...
I assume foo() takes two argument of type A.
bar() can take one argument of A or another type B.

In bar(), you would have the choice to return either a plain A
or a pointer to A. Because bar() knows its call is nested (by foo()),
than it can decide to return a pointer to A.

foo() is then evaluated and we assume it knows A can be a pointer.
foo() then knows its nesting level of 0 and must return something
serialized in that case.

>> I don't understand what you mean by "be certain that every single
>> function ... gets updated at exactly the same time". Could you develop ?
>
> If you're tying this to the syntax of the expression, then bar() *must*
> return a non-serialized value when and only when foo() is expecting that,
> therefore their implementations must change at the same time.  Perhaps
> that's workable for PostGIS, but it's a complete nonstarter for
> widely-known datatypes like arrays, where affected functions might be
> spread through any number of extensions.  We need a design that permits
> incremental fixing of functions that work with a deserializable datatype.

Yes.
It could work for any user type, assuming every function working with
this type is aware of its pointer/serialized nature, including functions
in extensions.
So you have to, at least, recompile every extension depending on that
type. Which ... limits the interest for very general types, I have to
admit.

>
> Another point worth worrying about is that not all expressions are
> function calls, nor do all function calls arise from expressions.
> Chasing down all the corner cases and making sure they work properly
> in a syntax-driven approach is going to be a headache.

We could add this 'nesting' detection to operators (and probably other
constructs that I don't know about) little by little.
Isn't optimizing only function calls enough as a first step?

>
>> Basically, the 'geometry' type of PostGIS is here extended with a flag
>> saying if the data is actual 'flat' data or a plain pointer. And if this
>> is a pointer, a type identifier is stored.
>
> If you're doing that, why do you need the decoration on the FuncExpr
> expressions?  Can't you just look at your input datums and see if they're
> flat or not?
>

If a function returns a pointer whatever the nesting level is, you could
end up with something storing a raw pointer, which is bad. You could
possibly add a way to detect, when reading back, that what you stored
was a pointer and that your data no longer exists (be NULL?), but you
basically end up with users manipulating pointers, which is bad.

If you want to make it transparent to the user, you need to know the
nesting level to decide whether you can just pass it to something that
is aware of this pointer (nesting level >= 1) or must serialize it back
(nesting level == 0).

--
Hugo Mercier
Oslandia



Re: Detection of nested function calls

From
Tom Lane
Date:
Hugo Mercier <hugo.mercier@oslandia.com> writes:
> On 25/10/2013 17:20, Tom Lane wrote:
>> How do you tell the difference between
>> 
>> foo(col1, bar(col2))
>> foo(bar(col1), col2)

> Still not sure to understand ...
> I assume foo() takes two argument of type A.
> bar() can take one argument of A or another type B.

I was assuming everything was the same datatype in this example, ie
col1, col2, and the result of bar() are all type A.

The point I'm trying to make is that in the first case, foo would be
receiving a first argument that was flat and a second that was not flat;
while in the second case, it would be receiving a first argument that was
not flat and a second that was flat.  The expression labeling you're
proposing does not help it tell the difference.  What's more, you're
proposing that the labeling be made by generic code that can't possibly
know what bar() is really going to do.
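
To make the two cases concrete, here is a small sketch in which each value carries its own form. The names (SketchValue, sketch_bar, sketch_foo_flat_mask) are hypothetical and only stand in for foo() and bar() above. A single per-call flag on foo() would be identical in both expressions, whereas a per-value tag lets foo() see which argument is flat.

```c
#include <assert.h>

/* Hypothetical per-value tag: each value says which form it is in. */
typedef enum { SK_FLAT, SK_IN_MEMORY } SketchForm;

typedef struct SketchValue {
    SketchForm form;
    double payload;
} SketchValue;

/* A column read from a table is always flat. */
static SketchValue sketch_column(double v)
{
    SketchValue r = { SK_FLAT, v };
    return r;
}

/* bar() chooses to return the in-memory form. */
static SketchValue sketch_bar(SketchValue in)
{
    SketchValue r = { SK_IN_MEMORY, in.payload + 1.0 };
    return r;
}

/* foo() inspects each argument at run time; bit 0 = first argument is flat,
 * bit 1 = second argument is flat. */
static unsigned sketch_foo_flat_mask(SketchValue a, SketchValue b)
{
    unsigned mask = 0;
    if (a.form == SK_FLAT)
        mask |= 1u;
    if (b.form == SK_FLAT)
        mask |= 2u;
    return mask;
}
```

With these definitions, foo(col1, bar(col2)) reports that only its first argument is flat, and foo(bar(col1), col2) reports that only its second argument is flat.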

> In bar(), you would have the choice to return either a plain A
> or a pointer to A. Because bar() knows its call is nested (by foo()),
> than it can decide to return a pointer to A.

> foo() is then evaluated and we assume it knows A can be a pointer.
> foo() then knows its nesting level of 0 and must return something
> serialized in that case.

Whoa.  That's the most fragile, assumption-filled way you could possibly
go about this.  In general, bar() cannot be expected to know whether the
outer function is able to take a non-flat parameter value.  And you've
glossed over how foo() would know whether its input was flat or not.

Another point here is that there's no good reason to suppose that a
function should return a flattened value just because it's at the outer
level of its syntactic expression.  For example, if we're doing a plain
SELECT foo(...) FROM ..., the next thing that will happen with that value
is it'll be fed to the output function for the datatype.  Maybe that
output function would like to have a non-flat input value, too, to save
the time of transforming back to that representation.  On the other hand,
if it's a SELECT ... ORDER BY ... and the planner chooses to do the ORDER
BY with a final sort step, we'll probably have to flatten the value to
pass it through sorting.  (Or possibly not --- perhaps we could just pass
the toast token through sorting?)  There are a lot of considerations here
and it's really unreasonable to expect that static expression labeling
will be able to do the right thing every time.

Basically the only way to make this work reliably is for Datums to be
self-identifying as to whether they're flat or structured values; then
make code do the right thing on-the-fly at runtime depending on what kind
of Datum it gets.  Once you've done that, I don't see that parse-time
labeling of expression nesting adds anything useful.  As Andres said,
the provisions for toasted datums are a good precedent, and none of that
depends on parse-time decisions.
        regards, tom lane



Re: Detection of nested function calls

From
Robert Haas
Date:
On Fri, Oct 25, 2013 at 10:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hugo Mercier <hugo.mercier@oslandia.com> writes:
>> PostGIS functions that manipulate geometries have to unserialize their
>> input geometries from the 'flat' varlena representation to their own,
>> and serialize the processed geometries back when returning.
>> But in such nested call queries, this serialization-unserialization
>> process is just an overhead.
>
> This is a reasonable thing to worry about, not just for PostGIS types but
> for many container types such as arrays --- it'd be nice to be able to
> work with an in-memory representation that wasn't just a contiguous blob
> of data.  For instance, assignment to an array element might become a
> constant-time operation even when working with variable-length datatypes.

I bet numeric could benefit as well.  Essentially all of the
operations on numeric start by transforming the on-disk representation
into an internal form used only for the duration of a single call, and
end by transforming the internal form of the result back to the
on-disk representation.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Detection of nested function calls

From
Hugo Mercier
Date:
On 25/10/2013 18:44, Tom Lane wrote:
> Hugo Mercier <hugo.mercier@oslandia.com> writes:
>> On 25/10/2013 17:20, Tom Lane wrote:
>>> How do you tell the difference between
>>>
>>> foo(col1, bar(col2))
>>> foo(bar(col1), col2)
>
>> Still not sure to understand ...
>> I assume foo() takes two argument of type A.
>> bar() can take one argument of A or another type B.
>
> I was assuming everything was the same datatype in this example, ie
> col1, col2, and the result of bar() are all type A.
>
> The point I'm trying to make is that in the first case, foo would be
> receiving a first argument that was flat and a second that was not flat;
> while in the second case, it would be receiving a first argument that was
> not flat and a second that was flat.  The expression labeling you're
> proposing does not help it tell the difference.

No, it does not. It's then up to the data type to store whether it is
flat or not. And every function manipulating this type is assumed to be
aware of this flat/non-flat flagging.


>
> Another point here is that there's no good reason to suppose that a
> function should return a flattened value just because it's at the outer
> level of its syntactic expression.  For example, if we're doing a plain
> SELECT foo(...) FROM ..., the next thing that will happen with that value
> is it'll be fed to the output function for the datatype.  Maybe that
> output function would like to have a non-flat input value, too, to save
> the time of transforming back to that representation.  On the other hand,
> if it's a SELECT ... ORDER BY ... and the planner chooses to do the ORDER
> BY with a final sort step, we'll probably have to flatten the value to
> pass it through sorting.  (Or possibly not --- perhaps we could just pass
> the toast token through sorting?)  There are a lot of considerations here
> and it's really unreasonable to expect that static expression labeling
> will be able to do the right thing every time.

Again, my proposal is very conservative here. It does not expect to
optimize all spots where copies are unnecessary. Only at some level
of function evaluation, with ... some assumptions.

>
> Basically the only way to make this work reliably is for Datums to be
> self-identifying as to whether they're flat or structured values; then
> make code do the right thing on-the-fly at runtime depending on what kind
> of Datum it gets.  Once you've done that, I don't see that parse-time
> labeling of expression nesting adds anything useful.  As Andres said,
> the provisions for toasted datums are a good precedent, and none of that
> depends on parse-time decisions.
>

This is something I have to investigate, thanks for pointing it out.
What I've understood so far is that there is room for new flags in the
TOAST mechanism, so the idea would be to add a new strategy where opaque
pointers could be stored. And it would then require a way for extensions
to register their own "(de)toasting" functions, right?


--
Hugo Mercier
Oslandia



Re: Detection of nested function calls

From
Andres Freund
Date:
On 2013-10-28 09:13:06 +0100, Hugo Mercier wrote:
> On 25/10/2013 18:44, Tom Lane wrote:
> > Hugo Mercier <hugo.mercier@oslandia.com> writes:
> >> On 25/10/2013 17:20, Tom Lane wrote:
> >>> How do you tell the difference between
> >>>
> >>> foo(col1, bar(col2))
> >>> foo(bar(col1), col2)
> >
> >> Still not sure to understand ...
> >> I assume foo() takes two argument of type A.
> >> bar() can take one argument of A or another type B.
> >
> > I was assuming everything was the same datatype in this example, ie
> > col1, col2, and the result of bar() are all type A.
> >
> > The point I'm trying to make is that in the first case, foo would be
> > receiving a first argument that was flat and a second that was not flat;
> > while in the second case, it would be receiving a first argument that was
> > not flat and a second that was flat.  The expression labeling you're
> > proposing does not help it tell the difference.
>
> No it does not. It's then up to the data type to store whether it is
> flat or not. And every functions manipulating this type is assumed to be
> aware of this flat/non-flat flagging.

But what if the in-memory type contains pointers and is copied or
spilled to disk? There needs to be a mechanism handling that case.
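
To make the concern concrete, here is a minimal standalone C sketch. It is not PostgreSQL code, and `ExpandedPoly`, `flatten_poly`, `flat_poly_coord` and `demo_copy_hazard` are hypothetical names chosen for illustration. A structured value that holds a pointer cannot safely be copied byte for byte or written to disk, because only the pointer is copied and not the data behind it. A flattening step produces one contiguous, pointer-free buffer that can.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory ("non-flat") form: the header points at
 * coordinate data that lives somewhere else. */
typedef struct ExpandedPoly
{
    int32_t npoints;
    double *coords;             /* npoints entries, stored separately */
} ExpandedPoly;

/* Size in bytes of the flat form: a count followed by the coordinates. */
size_t
flat_poly_size(const ExpandedPoly *p)
{
    return sizeof(int32_t) + (size_t) p->npoints * sizeof(double);
}

/* Serialize into one contiguous, pointer-free buffer (caller frees). */
char *
flatten_poly(const ExpandedPoly *p)
{
    char *buf;

    assert(p->npoints >= 0);
    buf = malloc(flat_poly_size(p));
    if (buf == NULL)
        return NULL;
    memcpy(buf, &p->npoints, sizeof(int32_t));
    memcpy(buf + sizeof(int32_t), p->coords,
           (size_t) p->npoints * sizeof(double));
    return buf;
}

/* Read coordinate i back out of a flat buffer. */
double
flat_poly_coord(const char *buf, int32_t i)
{
    double v;

    memcpy(&v, buf + sizeof(int32_t) + (size_t) i * sizeof(double),
           sizeof(double));
    return v;
}

/* A byte copy of the struct still aliases the original data, while the
 * flat buffer is independent of it.  Returns 1 if both observations hold. */
int
demo_copy_hazard(void)
{
    double       data[2] = {1.5, 2.5};
    ExpandedPoly orig = {2, data};
    ExpandedPoly shallow;
    char        *flat;
    int          aliased;
    int          independent;

    memcpy(&shallow, &orig, sizeof(orig));  /* what a naive copy would do */
    flat = flatten_poly(&orig);
    if (flat == NULL)
        return 0;

    data[0] = 99.0;                         /* the original changes later */
    aliased = (shallow.coords[0] == 99.0);  /* the shallow copy sees it */
    independent = (flat_poly_coord(flat, 0) == 1.5);

    free(flat);
    return aliased && independent;
}
```

In this sketch `demo_copy_hazard()` returns 1: the shallow copy observes the later change to the original data, while the flattened buffer keeps the value it was built from.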

> > Basically the only way to make this work reliably is for Datums to be
> > self-identifying as to whether they're flat or structured values; then
> > make code do the right thing on-the-fly at runtime depending on what kind
> > of Datum it gets.  Once you've done that, I don't see that parse-time
> > labeling of expression nesting adds anything useful.  As Andres said,
> > the provisions for toasted datums are a good precedent, and none of that
> > depends on parse-time decisions.
> >
>
> This is something I have to investigate, thanks for pointing it out.
> What I've understood so far is that there is room for new flags in the
> TOAST mechanism, so the idea would be to add a new strategy where opaque
> pointers could be stored. And it would then require a way for extensions
> to register their own "(de)toasting" functions, right ?

I think we'd need another argument to CREATE FUNCTION like SERIALIZE
pointing to a function that has to return data that can be stored
on disk. Deserialization would be up to individual functions.

Depending on the specification this might turn out to be slightly
invasive, tuplestore/sort et al probably have to care...

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Detection of nested function calls

From
Pavel Stehule
Date:



2013/10/28 Andres Freund <andres@2ndquadrant.com>
> On 2013-10-28 09:13:06 +0100, Hugo Mercier wrote:
> > On 25/10/2013 18:44, Tom Lane wrote:
> > > Hugo Mercier <hugo.mercier@oslandia.com> writes:
> > >> On 25/10/2013 17:20, Tom Lane wrote:
> > >>> How do you tell the difference between
> > >>>
> > >>> foo(col1, bar(col2))
> > >>> foo(bar(col1), col2)
> > >
> > >> Still not sure to understand ...
> > >> I assume foo() takes two argument of type A.
> > >> bar() can take one argument of A or another type B.
> > >
> > > I was assuming everything was the same datatype in this example, ie
> > > col1, col2, and the result of bar() are all type A.
> > >
> > > The point I'm trying to make is that in the first case, foo would be
> > > receiving a first argument that was flat and a second that was not flat;
> > > while in the second case, it would be receiving a first argument that was
> > > not flat and a second that was flat.  The expression labeling you're
> > > proposing does not help it tell the difference.
> >
> > No it does not. It's then up to the data type to store whether it is
> > flat or not. And every functions manipulating this type is assumed to be
> > aware of this flat/non-flat flagging.
>
> But what if the in-memory type contains pointers and is copied or
> spilled to disk? There needs to be a mechanism handling that case.
>
> > > Basically the only way to make this work reliably is for Datums to be
> > > self-identifying as to whether they're flat or structured values; then
> > > make code do the right thing on-the-fly at runtime depending on what kind
> > > of Datum it gets.  Once you've done that, I don't see that parse-time
> > > labeling of expression nesting adds anything useful.  As Andres said,
> > > the provisions for toasted datums are a good precedent, and none of that
> > > depends on parse-time decisions.
> > >
> >
> > This is something I have to investigate, thanks for pointing it out.
> > What I've understood so far is that there is room for new flags in the
> > TOAST mechanism, so the idea would be to add a new strategy where opaque
> > pointers could be stored. And it would then require a way for extensions
> > to register their own "(de)toasting" functions, right ?
>
> I think we'd need another argument to CREATE FUNCTION like SERIALIZE
> pointing to a function that that has to return data that can be stored
> on disk. Deserialization would be up to individual functions.
>
> Depending on the specification this might turn out to be slightly
> invasive, tuplestore/sort et al probably have to care...

Then you need a function that prepares a clone of the unpacked data too.

Regards

Pavel

> Greetings,
>
> Andres Freund
>
> --
>  Andres Freund                     http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services



Re: Detection of nested function calls

From
Andres Freund
Date:
On 2013-10-28 10:12:41 +0100, Pavel Stehule wrote:
> > I think we'd need another argument to CREATE FUNCTION like SERIALIZE
> > pointing to a function that that has to return data that can be stored
> > on disk. Deserialization would be up to individual functions.
> >
> > Depending on the specification this might turn out to be slightly
> > invasive, tuplestore/sort et al probably have to care...

> Then you need a functions than prepare a clone of unpacked data too.

Why? In those cases we can (and should) just store the on-disk
representation.

Greetings,

Andres Freund

PS: Could you please try to trim the quoted emails a bit?

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Detection of nested function calls

From
Hugo Mercier
Date:
On 28/10/2013 09:39, Andres Freund wrote:
> On 2013-10-28 09:13:06 +0100, Hugo Mercier wrote:
>> On 25/10/2013 18:44, Tom Lane wrote:
>>> Hugo Mercier <hugo.mercier@oslandia.com> writes:
>>>> On 25/10/2013 17:20, Tom Lane wrote:
>>>>> How do you tell the difference between

>>> The point I'm trying to make is that in the first case, foo would be
>>> receiving a first argument that was flat and a second that was not flat;
>>> while in the second case, it would be receiving a first argument that was
>>> not flat and a second that was flat.  The expression labeling you're
>>> proposing does not help it tell the difference.
>>
>> No it does not. It's then up to the data type to store whether it is
>> flat or not. And every functions manipulating this type is assumed to be
>> aware of this flat/non-flat flagging.
>
> But what if the in-memory type contains pointers and is copied or
> spilled to disk? There needs to be a mechanism handling that case.

It must not happen. With nested==0, the 'nested' boolean may be read as
"everything returned from this function may be stored on disk at any
time, so serialize it".

If there is another mechanism to tell, inside a function, if the result
will be "stored" (stored on disk, copied to another context, ...) or
not, then I'll be happy with that.



> I think we'd need another argument to CREATE FUNCTION like SERIALIZE
> pointing to a function that that has to return data that can be stored
> on disk. Deserialization would be up to individual functions.

Either as an argument to CREATE FUNCTION or to CREATE TYPE, right?

OK, so a user function calls PG_DETOAST to get its input. The innermost
function will get it straight from where it is stored.
Then the function can decide to deserialize it into its own format,
process it, and return it as is, probably with a call to
PG_RETURN(pointer). The other functions in the chain will still get their
inputs from PG_DETOAST and can use them directly.
But for the last function in the nesting chain, how will the pointer be
serialized back to something storable? I.e. who will call the
serialize function declared in CREATE (FUNCTION|TYPE)?

--
Hugo Mercier
Oslandia



Re: Detection of nested function calls

From
Andres Freund
Date:
On 2013-10-28 10:29:59 +0100, Hugo Mercier wrote:
> On 28/10/2013 09:39, Andres Freund wrote:
> > On 2013-10-28 09:13:06 +0100, Hugo Mercier wrote:
> >> On 25/10/2013 18:44, Tom Lane wrote:
> >>> Hugo Mercier <hugo.mercier@oslandia.com> writes:
> >>>> On 25/10/2013 17:20, Tom Lane wrote:
> >>>>> How do you tell the difference between
>
> >>> The point I'm trying to make is that in the first case, foo would be
> >>> receiving a first argument that was flat and a second that was not flat;
> >>> while in the second case, it would be receiving a first argument that was
> >>> not flat and a second that was flat.  The expression labeling you're
> >>> proposing does not help it tell the difference.
> >>
> >> No it does not. It's then up to the data type to store whether it is
> >> flat or not. And every functions manipulating this type is assumed to be
> >> aware of this flat/non-flat flagging.
> >
> > But what if the in-memory type contains pointers and is copied or
> > spilled to disk? There needs to be a mechanism handling that case.
>
> It must not happen. The 'nested' boolean may be seen as "everything
> returning from this function may be stored on disk at any time, so
> serialize it" for nested==0.

I don't think that's sufficient. There'll be lots of places where you'd
need to special-case hack this logic.
Think of SELECT aggregate(somefunc(foo)) FROM ... GROUP BY something_else;


> If there is another mechanism to tell, inside a function, if the result
> will be "stored" (stored on disk, copied to another context, ...) or
> not, then I'll be happy with that.

I don't think telling the function that is the right approach at all.

> > I think we'd need another argument to CREATE FUNCTION like SERIALIZE
> > pointing to a function that that has to return data that can be stored
> > on disk. Deserialization would be up to individual functions.
>
> Either as argument to CREATE FUNCTION or to CREATE TYPE, right ?

Err, CREATE TYPE, yes.

> Ok, so a user function calls PG_DETOAST to get its input. The most
> nested will get it straight from where it is stored.
> Then the function can decide to deserialize it in its own format,
> process it, and return it as is, with probably a call to
> PG_RETURN(pointer). Nested function will get their inputs still from
> PG_DETOAST and can use them directly.
> But for the last function in the nesting chain, how the pointer will be
> serialized back to something storeable ? i.e. who will call the
> serialize function declared in CREATE (FUNCTION|TYPE) ?

Something around toast_insert_or_update(). We'd need to set
HeapTupleHasExternal() for that kind of tuple or similar so it gets
called, but that shouldn't be the biggest problem.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Detection of nested function calls

From
Pavel Stehule
Date:



2013/10/28 Andres Freund <andres@2ndquadrant.com>
> On 2013-10-28 10:12:41 +0100, Pavel Stehule wrote:
> > > I think we'd need another argument to CREATE FUNCTION like SERIALIZE
> > > pointing to a function that that has to return data that can be stored
> > > on disk. Deserialization would be up to individual functions.
> > >
> > > Depending on the specification this might turn out to be slightly
> > > invasive, tuplestore/sort et al probably have to care...
>
> > Then you need a functions than prepare a clone of unpacked data too.
>
> Why? In those case we can (and should) just store the ondisk
> representation.

ok,

Pavel

> Greetings,
>
> Andres Freund
>
> PS: Could you please try to trim the quoted emails a bit?
>
> --
>  Andres Freund                     http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services

Re: Detection of nested function calls

From
Tom Lane
Date:
Hugo Mercier <hugo.mercier@oslandia.com> writes:
> On 25/10/2013 18:44, Tom Lane wrote:
>> The point I'm trying to make is that in the first case, foo would be
>> receiving a first argument that was flat and a second that was not flat;
>> while in the second case, it would be receiving a first argument that was
>> not flat and a second that was flat.  The expression labeling you're
>> proposing does not help it tell the difference.

> No it does not. It's then up to the data type to store whether it is
> flat or not. And every functions manipulating this type is assumed to be
> aware of this flat/non-flat flagging.

If the data must contain such markers, then what do you need the proposed
expression labeling for?  It's awkward and ultimately insufficient.

What you do need, as Andres is saying, is a datatype-specific function
that will re-flatten a non-flat Datum; and the core code has to be aware
to call this when preparing data to be stored on disk.  Once you have
that, every C function returning this datatype can make its own choice
of whether to return a flat or non-flat value.  Probably the bias would be
towards the latter choice, but there might be cases where it's easier to
return a flattened value.  The important point here is that it's a whole
lot easier all around if the choice is made on-the-fly at runtime, rather
than trying to determine what will happen at parse time.
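
A minimal standalone sketch of that runtime choice, in plain C rather than PostgreSQL code. All names here (`VecHeader`, `FlatVec`, `ExpandedVec`, `vec_sum`, `foo`, `bar`) are hypothetical. Each value carries a tag naming its own representation, so `foo` gives the same answer for `foo(col1, bar(col2))` and `foo(bar(col1), col2)` without any parse-time labeling.

```c
#include <assert.h>

typedef enum VecRepr { VEC_FLAT, VEC_EXPANDED } VecRepr;

/* Common header: every value identifies its own representation. */
typedef struct VecHeader
{
    VecRepr repr;
} VecHeader;

/* Flat form: contiguous, the kind of thing that could be stored. */
typedef struct FlatVec
{
    VecHeader hdr;
    int       n;
    int       items[4];
} FlatVec;

/* Expanded form: a working representation with a pointer and a cache. */
typedef struct ExpandedVec
{
    VecHeader  hdr;
    int        n;
    const int *items;
    long       cached_sum;
} ExpandedVec;

/* Works on either form by inspecting the tag at run time. */
long
vec_sum(const VecHeader *v)
{
    if (v->repr == VEC_EXPANDED)
        return ((const ExpandedVec *) v)->cached_sum;
    else
    {
        const FlatVec *f = (const FlatVec *) v;
        long           sum = 0;

        assert(f->n >= 0 && f->n <= 4);
        for (int i = 0; i < f->n; i++)
            sum += f->items[i];
        return sum;
    }
}

/* bar(): chooses to return its result in expanded form. */
ExpandedVec
bar(const FlatVec *in)
{
    ExpandedVec out;

    out.hdr.repr = VEC_EXPANDED;
    out.n = in->n;
    out.items = in->items;
    out.cached_sum = vec_sum(&in->hdr);
    return out;
}

/* foo(): never needs to know where its arguments came from. */
long
foo(const VecHeader *a, const VecHeader *b)
{
    return vec_sum(a) + vec_sum(b);
}

/* foo(col1, bar(col2)): flat first argument, expanded second. */
long
demo_first_form(void)
{
    FlatVec     col1 = {{VEC_FLAT}, 3, {1, 2, 3, 0}};
    FlatVec     col2 = {{VEC_FLAT}, 2, {10, 20, 0, 0}};
    ExpandedVec b = bar(&col2);

    return foo(&col1.hdr, &b.hdr);
}

/* foo(bar(col1), col2): expanded first argument, flat second. */
long
demo_second_form(void)
{
    FlatVec     col1 = {{VEC_FLAT}, 3, {1, 2, 3, 0}};
    FlatVec     col2 = {{VEC_FLAT}, 2, {10, 20, 0, 0}};
    ExpandedVec a = bar(&col1);

    return foo(&a.hdr, &col2.hdr);
}
```

Both demo functions return 36 in this sketch. The tag in the value itself carries all the information the callee needs.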

> What I've understood so far is that there is room for new flags in the
> TOAST mechanism, so the idea would be to add a new strategy where opaque
> pointers could be stored. And it would then require a way for extensions
> to register their own "(de)toasting" functions, right ?

The idea I'm thinking about at the moment is that toast tokens of this
sort might each contain a function pointer to the required flattening
function.  This avoids an expensive catalog lookup when flattening is
needed.  We'd never accept such a thing for data destined for disk;
but since the whole point here is that such data lives only in memory,
I can't see anything wrong with including a function pointer in it.
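
As a rough illustration of the function-pointer idea, here is a standalone C sketch. It is not the actual TOAST pointer layout, and `ExpandedToken`, `FlattenFn`, `WordList`, `wordlist_flatten` and `prepare_for_storage` are hypothetical names. The in-memory token carries a pointer to its own flattening function, so generic code can flatten it without knowing the datatype and without any lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct ExpandedToken ExpandedToken;

/* Writes the flat form of tok into dst.  Returns the number of bytes
 * written, or 0 if dst is too small. */
typedef size_t (*FlattenFn) (const ExpandedToken *tok, char *dst,
                             size_t dstlen);

/* In-memory-only reference to a non-flat value.  The function pointer is
 * only meaningful inside the current process, never on disk. */
struct ExpandedToken
{
    FlattenFn   flatten;
    const void *object;
};

/* One hypothetical datatype: a list of words held as separate strings. */
typedef struct WordList
{
    int                nwords;
    const char *const *words;
} WordList;

/* Type-specific flattening: NUL-terminated words laid out back to back. */
size_t
wordlist_flatten(const ExpandedToken *tok, char *dst, size_t dstlen)
{
    const WordList *wl = tok->object;
    size_t          used = 0;

    for (int i = 0; i < wl->nwords; i++)
    {
        size_t len = strlen(wl->words[i]) + 1;  /* keep the NUL */

        if (used + len > dstlen)
            return 0;
        memcpy(dst + used, wl->words[i], len);
        used += len;
    }
    return used;
}

/* Generic code: knows nothing about WordList, just follows the pointer. */
size_t
prepare_for_storage(const ExpandedToken *tok, char *dst, size_t dstlen)
{
    assert(tok->flatten != NULL);
    return tok->flatten(tok, dst, dstlen);
}

/* Build a token for {"ab", "cde"} and flatten it through the generic path. */
size_t
demo_flatten(char *dst, size_t dstlen)
{
    static const char *const words[] = {"ab", "cde"};
    WordList                 wl = {2, words};
    ExpandedToken            tok = {wordlist_flatten, &wl};

    return prepare_for_storage(&tok, dst, dstlen);
}
```

With a 16-byte buffer, `demo_flatten` writes 7 bytes (`ab`, NUL, `cde`, NUL). With a buffer that is too small it returns 0.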
        regards, tom lane



Re: Detection of nested function calls

From
Robert Haas
Date:
On Mon, Oct 28, 2013 at 11:12 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The idea I'm thinking about at the moment is that toast tokens of this
> sort might each contain a function pointer to the required flattening
> function.  This avoids an expensive catalog lookup when flattening is
> needed.  We'd never accept such a thing for data destined for disk;
> but since the whole point here is that such data lives only in memory,
> I can't see anything wrong with including a function pointer in it.

This might be OK, but it bloats the in-memory representation.  For
small data types like numeric that might well be significant.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Detection of nested function calls

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Oct 28, 2013 at 11:12 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The idea I'm thinking about at the moment is that toast tokens of this
>> sort might each contain a function pointer to the required flattening
>> function.

> This might be OK, but it bloats the in-memory representation.  For
> small data types like numeric that might well be significant.

Meh.  If you don't include a function pointer you will still need the OID
of the datatype or the decompression function, so it's not like omitting
it is free.  In any case, the design target here is for data values that
are going to be quite large, so an extra 4 bytes or whatever in the
reference object doesn't really seem to me to be something to stress over.
        regards, tom lane



Re: Detection of nested function calls

From
Andres Freund
Date:
On 2013-10-28 12:42:28 -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Mon, Oct 28, 2013 at 11:12 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> The idea I'm thinking about at the moment is that toast tokens of this
> >> sort might each contain a function pointer to the required flattening
> >> function.
> 
> > This might be OK, but it bloats the in-memory representation.  For
> > small data types like numeric that might well be significant.
> 
> Meh.  If you don't include a function pointer you will still need the OID
> of the datatype or the decompression function, so it's not like omitting
> it is free.

That's what I thought at first too - but I am not sure it's actually
true. The reason we need to include the toastrelid in varatt_externals
(which I guess you are thinking of, like me) is that we need to be able
to resolve "naked" Datums to their original value without any context.
But at the locations where we'd need to call the memory
representation->disk conversion function we should have a TupleDesc with
type information, so we could look up the needed information there.

> In any case, the design target here is for data values that
> are going to be quite large, so an extra 4 bytes or whatever in the
> reference object doesn't really seem to me to be something to stress
> over.

I'd actually be happy if we can get this to work for numeric as well - I
have seen several workloads where that's a bottleneck. Not that I am
sure that the 8 bytes for a pointer would be the problem there (in
contrast to additional typecache lookups).

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Detection of nested function calls

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-10-28 12:42:28 -0400, Tom Lane wrote:
>> Meh.  If you don't include a function pointer you will still need the OID
>> of the datatype or the decompression function, so it's not like omitting
>> it is free.

> That's what I thought at first too - but I am not sure it's actually
> true. The reason we need to include the toastrelid in varatt_externals
> (which I guess you are thinking of, like me) is that we need to be able
> to resolve "naked" Datums to their original value without any context.
> But at the locations where we'd need to call the memory
> representation->disk conversion function we should have a TupleDesc with
> type information, so we could lookup the needed information there.

I don't think that's a safe assumption at all.  We need to be able to do
flattening anywhere PG_DETOAST_DATUM() can be called.

In any case, my point here is largely that I don't want to add a catalog
lookup to the operation.  This whole proposal is basically about trading
greater short-term memory usage to gain speed, so griping about an extra 4
or so bytes per value seems to me to be missing the point completely.
Or to put it even more bluntly: if you've not realized that the extra
palloc overhead of an out-of-line instance of the datum will swamp what
we're talking about here, you need to realize that.

>> In any case, the design target here is for data values that
>> are going to be quite large, so an extra 4 bytes or whatever in the
>> reference object doesn't really seem to me to be something to stress
>> over.

> I'd actually be happy if we can get this to work for numeric as well - I
> have seen several workloads where that's a bottleneck. Not that I am
> sure that the 8bytes for a pointer would be the problem there (in
> contrast to additional typecache lookups).

Yeah.  The other point worth considering is that there will not likely be
all that many such values floating around at once, so even if there does
end up being a significant percentage bloat in the size of a non-flat
numeric, I doubt anyone will notice.
        regards, tom lane



Re: Detection of nested function calls

From
"ktm@rice.edu"
Date:
On Mon, Oct 28, 2013 at 05:48:55PM +0100, Andres Freund wrote:
> On 2013-10-28 12:42:28 -0400, Tom Lane wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> > > On Mon, Oct 28, 2013 at 11:12 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >> The idea I'm thinking about at the moment is that toast tokens of this
> > >> sort might each contain a function pointer to the required flattening
> > >> function.
> > 
> > > This might be OK, but it bloats the in-memory representation.  For
> > > small data types like numeric that might well be significant.
> > 
> > Meh.  If you don't include a function pointer you will still need the OID
> > of the datatype or the decompression function, so it's not like omitting
> > it is free.
> 
> That's what I thought at first too - but I am not sure it's actually
> true. The reason we need to include the toastrelid in varatt_externals
> (which I guess you are thinking of, like me) is that we need to be able
> to resolve "naked" Datums to their original value without any context.
> But at the locations where we'd need to call the memory
> representation->disk conversion function we should have a TupleDesc with
> type information, so we could lookup the needed information there.
> 
> > In any case, the design target here is for data values that
> > are going to be quite large, so an extra 4 bytes or whatever in the
> > reference object doesn't really seem to me to be something to stress
> > over.
> 
> I'd actually be happy if we can get this to work for numeric as well - I
> have seen several workloads where that's a bottleneck. Not that I am
> sure that the 8bytes for a pointer would be the problem there (in
> contrast to additional typecache lookups).
> 
> Greetings,
> 
> Andres Freund
> 
With the type information available, you could have a single lookup table
per backend with the function pointer so the space would be negligible
amortized over all of the datums of each type.
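
A standalone C sketch of such a table, under the assumption that the type is known wherever flattening happens. The names `TypeId`, `TypeEntry`, `flatten_table`, `lookup_flat_size` and `token_flat_size` are hypothetical, and the type identifiers are made-up values rather than real OIDs. Each token carries only a small type identifier, and the function pointer is stored once per type.

```c
#include <assert.h>
#include <stddef.h>

/* Made-up type identifiers standing in for type OIDs. */
typedef enum TypeId
{
    TYPE_DIGITS = 1,
    TYPE_POINTS = 2
} TypeId;

/* Per-type callback: how many bytes would the flat form need? */
typedef size_t (*FlatSizeFn) (const void *object);

typedef struct DigitString { int ndigits; } DigitString;
typedef struct PointList   { int npoints; } PointList;

size_t
digits_flat_size(const void *object)
{
    return (size_t) ((const DigitString *) object)->ndigits * sizeof(short);
}

size_t
points_flat_size(const void *object)
{
    return (size_t) ((const PointList *) object)->npoints * 2 * sizeof(double);
}

/* One entry per type; shared by every value of that type. */
typedef struct TypeEntry
{
    TypeId     type_id;
    FlatSizeFn flat_size;
} TypeEntry;

static const TypeEntry flatten_table[] = {
    {TYPE_DIGITS, digits_flat_size},
    {TYPE_POINTS, points_flat_size},
};

/* Linear search is enough for a sketch; returns NULL for unknown types. */
FlatSizeFn
lookup_flat_size(TypeId type_id)
{
    size_t n = sizeof(flatten_table) / sizeof(flatten_table[0]);

    for (size_t i = 0; i < n; i++)
    {
        if (flatten_table[i].type_id == type_id)
            return flatten_table[i].flat_size;
    }
    return NULL;
}

/* The token stores the type identifier instead of a function pointer. */
typedef struct Token
{
    TypeId      type_id;
    const void *object;
} Token;

size_t
token_flat_size(const Token *tok)
{
    FlatSizeFn fn = lookup_flat_size(tok->type_id);

    assert(fn != NULL);
    return fn(tok->object);
}
```

The trade-off in this sketch is one table lookup per flattening operation in exchange for a smaller token.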

Regards,
Ken



Re: Detection of nested function calls

From
Andres Freund
Date:
On 2013-10-28 13:41:46 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2013-10-28 12:42:28 -0400, Tom Lane wrote:
> >> Meh.  If you don't include a function pointer you will still need the OID
> >> of the datatype or the decompression function, so it's not like omitting
> >> it is free.
> 
> > That's what I thought at first too - but I am not sure it's actually
> > true. The reason we need to include the toastrelid in varatt_externals
> > (which I guess you are thinking of, like me) is that we need to be able
> > to resolve "naked" Datums to their original value without any context.
> > But at the locations where we'd need to call the memory
> > representation->disk conversion function we should have a TupleDesc with
> > type information, so we could lookup the needed information there.
> 
> I don't think that's a safe assumption at all.  We need to be able to do
> flattening anywhere PG_DETOAST_DATUM() can be called.

I am not sure we want things to work along those lines. I'd rather make
PG_DETOAST_DATUM pass along such in-memory Datums unchanged and require
any function that wants to poke into the Datum in detail to know
about the different representations. That will require a bit more
widespread changes in functions using those types natively, but it will
make it more realistic to use the optimization across much of the code
that detoasts Datums generally.

> In any case, my point here is largely that I don't want to add a catalog
> lookup to the operation.  This whole proposal is basically about trading
> greater short-term memory usage to gain speed, so griping about an extra 4
> or so bytes per value seems to me to be missing the point completely.
> Or to put it even more bluntly: if you've not realized that the extra
> palloc overhead of an out-of-line instance of the datum will swamp what
> we're talking about here, you need to realize that.

I am not arguing against this at all though.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Detection of nested function calls

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-10-28 13:41:46 -0400, Tom Lane wrote:
>> I don't think that's a safe assumption at all.  We need to be able to do
>> flattening anywhere PG_DETOAST_DATUM() can be called.

> I am not sure we want things to work along those lines. I'd rather make
> PG_DETOAST_DATUM pass along such in-memory Datums unchanged and require
> any funtion that wants to poke into into the Datum in detail to know
> about the different representations.

No; see my upthread comments.  I think what we want to do is to have
PG_DETOAST_DATUM automatically flatten non-flat datums, and to require
functions that can cope with non-flat inputs to use a new argument
fetching macro, exactly along the lines of what we did with non-aligned
toasted values awhile ago (see PG_GETARG_TEXT_PP and suchlike).  That way,
code that hasn't yet been updated to deal with non-flat datums will still
work, if a bit inefficiently compared to what we'd like.
Non-performance-critical functions might never get updated at all.
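
The incremental scheme can be sketched in standalone C as follows. This is a toy model, not the real fmgr macros: `Val`, `fetch_flat`, `fetch_any`, `legacy_double` and `updated_double` are hypothetical names. The existing fetch path always hands back a flat value, and a new opt-in fetch path passes non-flat values through to functions that have been taught to handle them.

```c
#include <assert.h>
#include <stddef.h>

typedef enum Form { FORM_FLAT, FORM_EXPANDED } Form;

/* Toy value: either holds its payload inline (flat) or points at it. */
typedef struct Val
{
    Form       form;
    int        inline_value;    /* valid when form == FORM_FLAT */
    const int *expanded_ptr;    /* valid when form == FORM_EXPANDED */
} Val;

/* Counts how often a value had to be flattened, to show the cost. */
int flatten_calls = 0;

/* Analogue of the existing fetch path: always returns a flat value. */
Val
fetch_flat(Val v)
{
    if (v.form == FORM_EXPANDED)
    {
        assert(v.expanded_ptr != NULL);
        flatten_calls++;
        v.inline_value = *v.expanded_ptr;
        v.expanded_ptr = NULL;
        v.form = FORM_FLAT;
    }
    return v;
}

/* Analogue of a new opt-in fetch path: passes the value through as is. */
Val
fetch_any(Val v)
{
    return v;
}

/* Not yet updated: only understands flat input, still gets it. */
int
legacy_double(Val arg)
{
    Val v = fetch_flat(arg);

    return 2 * v.inline_value;
}

/* Updated: copes with both forms, so no flattening is needed. */
int
updated_double(Val arg)
{
    Val v = fetch_any(arg);
    int x = (v.form == FORM_EXPANDED) ? *v.expanded_ptr : v.inline_value;

    return 2 * x;
}
```

Given an expanded input, both functions return the same result. Only the legacy one pays for a flatten, which is the "works, if a bit inefficiently" case described above.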

If we do what you're suggesting here, any attempt to de-flatten a datatype
will be a flag day on which *every* *single* *function* that deals with
that datatype has to change simultaneously.  That will basically destroy
our chance of ever doing anything about widely-used types like arrays.
This feature *must* be something that we can install support for
incrementally (ie one function at a time), the same way we did with
non-aligned toasted values, or for that matter with several previous
global changes like the adoption of V1 call convention.

> That will require a bit more
> widespread changes in functions using those types natively, but it will
> make it more realistic to use the optimization across much of the code
> that detoasts Datums generally.

You've got that exactly backward; if it's a source code flag day it will
never happen at all.  We need to change code when it gets updated to
handle the case, not run around and try to find every function we're not
updating.
        regards, tom lane



Re: Detection of nested function calls

From
Andres Freund
Date:
On 2013-10-28 14:26:20 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2013-10-28 13:41:46 -0400, Tom Lane wrote:
> >> I don't think that's a safe assumption at all.  We need to be able to do
> >> flattening anywhere PG_DETOAST_DATUM() can be called.
> 
> > I am not sure we want things to work along those lines. I'd rather make
> > PG_DETOAST_DATUM pass along such in-memory Datums unchanged and require
> > any funtion that wants to poke into into the Datum in detail to know
> > about the different representations.
> 
> No; see my upthread comments.  I think what we want to do is to have
> PG_DETOAST_DATUM automatically flatten non-flat datums, and to require
> functions that can cope with non-flat inputs to use a new argument
> fetching macro, exactly along the lines of what we did with non-aligned
> toasted values awhile ago (see PG_GETARG_TEXT_PP and suchlike).  That way,
> code that hasn't yet been updated to deal with non-flat datums will still
> work, if a bit inefficiently compared to what we'd like.
> Non-performance-critical functions might never get updated at all.

My problem isn't datatype-specific functions doing superfluous
detoasting. If it were just them, I'd be perfectly happy going your way.
My concern is type-independent code detoasting everything without giving
the type-specific code any say over it. Like printtup.c, spi.c,
rowtype.c...
I guess we'll have to spread knowledge of the new toast type to
those places then.

> If we do what you're suggesting here, any attempt to de-flatten a datatype
> will be a flag day on which *every* *single* *function* that deals with
> that datatype has to change simultaneously.  That will basically destroy
> our chance of ever doing anything about widely-used types like arrays.
> This feature *must* be something that we can install support for
> incrementally (ie one function at a time), the same way we did with
> non-aligned toasted values, or for that matter with several previous
> global changes like the adoption of V1 call convention.

I don't think this is a change on the same scale as the V1 call convention
or short varlenas, which are type-independent, because here a type
explicitly has to sign up for it.
E.g. the numeric storage is private to numeric.c, so it'd be perfectly
fine to change the numeric representation in a flag day manner as long
as we still can read the old representation.

I grant you that arrays are *the* counter example to this.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Detection of nested function calls

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-10-28 14:26:20 -0400, Tom Lane wrote:
>> No; see my upthread comments.  I think what we want to do is to have
>> PG_DETOAST_DATUM automatically flatten non-flat datums, and to require
>> functions that can cope with non-flat inputs to use a new argument
>> fetching macro, exactly along the lines of what we did with non-aligned
>> toasted values awhile ago (see PG_GETARG_TEXT_PP and suchlike).

> My problem isn't datatype specific functions doing superflous
> detoasting. If it were just them, I'd be perfectly happy going your way.
> My concern is type-independent code detoasting everything without giving
> the type specific code any say over it. Like, printtup.c, spi.c,
> rowtype.c...

In all those cases, if we're detoasting at all then there is probably good
reason to flatten as well.  Or if not, we'll teach the code the
difference.  Just as with function arguments, it can never be *wrong* to
flatten, it might only be inefficient --- so we'll improve the
inefficiencies.  One at a time.

>> If we do what you're suggesting here, any attempt to de-flatten a datatype
>> will be a flag day on which *every* *single* *function* that deals with
>> that datatype has to change simultaneously.

> I don't think this is a change on the same scale as V1 call conventions
> or short varlenas which are type independent because a type explicitly
> has to sign up for it.

I think it's going to be more widely adopted than you think, unless we
make it so impractical to adopt that only one or two types ever do it.

> E.g. the numeric storage is private to numeric.c, so it'd be perfectly
> fine to change the numeric representation in a flag day manner as long
> as we still can read the old representation.

So in other words, you believe there is no extension anywhere that deals
with numeric values?  Sorry, I don't believe that.  And even if I believed
it for numeric, the assumption certainly falls to the ground once you
extend it to every datatype that might have use for this facility.
        regards, tom lane