Thread: Lambda expressions (was Re: BUG #15471)

Lambda expressions (was Re: BUG #15471)

From
Tom Lane
Date:
[ splitting this off to a new thread ]

Andres Freund <andres@anarazel.de> writes:
> On 2018-10-30 09:30:54 -0400, Tom Lane wrote:
>> This is all some more fuel for the idea that we need a less messy
>> substitute for CaseTestExpr.  As it happens, I was just fooling with
>> that yesterday, and hope to have something to post soon.  But it'll
>> be too invasive to back-patch.  I'll try to fix this particular
>> problem more locally.

> Yea, I agree we need something better there.  I happened to also play
> around with that (more from the angle that we'd want similar
> infrastructure for the SQL function inlining case we talked about) on
> the long flights back from .eu yesterday, but didn't get that far.

Yeah, that conversation was what spurred me to do something.  The patch
as I have it right now only fixes (most of) the inlining-failure cases.
I've been thinking about how to reuse the mechanism to get rid of
CaseTestExpr, but am running into some issues.

The core idea that I'm working on is to invent a new node type
LambdaExpr that evaluates an expression and substitutes it as a
Param into another expression, notationally sort of like

    LET($n := expression1 IN ... result expression using $n ...)

and then the different values can be distinguished by having
different Param numbers.  These Params have a different ParamKind
from PARAM_EXTERNAL or PARAM_EXEC, because there's too many
assumptions about the behavior of those ParamKinds, but otherwise
they're not that much different.

The stumbling block that I'm seeing in terms of making parser output
use this is that if, say, each CaseExpr has a distinct Param number
for what had been its CaseTestExpr, then textually and logically
equivalent CASE constructs might not compare equal() because of
different Param numbers.  And that's really bad news for a bunch
of reasons, see e.g. recent bugs with DDL on partition tables.
But we can't have equal() ignore the Param numbers without
re-introducing all the same issues we have with CaseTestExpr.

So what I'm thinking about at the moment is to keep using CaseTestExpr
in parser output trees, but have the planner replace them with
(uniquely numbered) Params during eval_const_expressions.  As long
as we integrate that correctly with function inlining, it should be
safe and ensure that CASEs coming in from functions can't be confused
with ones present in the original query (because we'd renumber anything
in the inlined tree to be distinct from what we had already).  But that
seems a bit ugly, and it might interfere with late-stage optimizations.

Have you got a better idea?

            regards, tom lane


Re: Lambda expressions (was Re: BUG #15471)

From
Andres Freund
Date:
Hi,

On 2018-10-30 15:04:55 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2018-10-30 09:30:54 -0400, Tom Lane wrote:
> >> This is all some more fuel for the idea that we need a less messy
> >> substitute for CaseTestExpr.  As it happens, I was just fooling with
> >> that yesterday, and hope to have something to post soon.  But it'll
> >> be too invasive to back-patch.  I'll try to fix this particular
> >> problem more locally.
> 
> > Yea, I agree we need something better there.  I happened to also play
> > around with that (more from the angle that we'd want similar
> > infrastructure for the SQL function inlining case we talked about) on
> > the long flights back from .eu yesterday, but didn't get that far.
> 
> Yeah, that conversation was what spurred me to do something.  The patch
> as I have it right now only fixes (most of) the inlining-failure cases.
> I've been thinking about how to reuse the mechanism to get rid of
> CaseTestExpr, but am running into some issues.
> 
> The core idea that I'm working on is to invent a new node type
> LambdaExpr that evaluates an expression and substitutes it as a
> Param into another expression, notationally sort of like
> 
>     LET($n := expression1 IN ... result expression using $n ...)

Right, that's what I was thinking about as well (except I called it a
WithValues node).


> and then the different values can be distinguished by having
> different Param numbers.  These Params have a different ParamKind
> from PARAM_EXTERNAL or PARAM_EXEC, because there's too many
> assumptions about the behavior of those ParamKinds, but otherwise
> they're not that much different.

Right, that makes sense.

How did you deal with the fact that we might extract subset of the LET()
into e.g. a RestrictionInfo (and then e.g. an IndexPath), but another
part would be e.g. evaluated as part of a qual?  I got to the point
where I allocated a bunch of memory for a set of Param nodes (e.g. from
a single inlined function) and abandoned having something like the
WithValues node - and that memory was directly referenced by the new
Param nodes.  But I didn't really make that work outside of the simplest
cases.


> The stumbling block that I'm seeing in terms of making parser output
> use this is that if, say, each CaseExpr has a distinct Param number
> for what had been its CaseTestExpr, then textually and logically
> equivalent CASE constructs might not compare equal() because of
> different Param numbers.  And that's really bad news for a bunch
> of reasons, see e.g. recent bugs with DDL on partition tables.
> But we can't have equal() ignore the Param numbers without
> re-introducing all the same issues we have with CaseTestExpr.

I never got this far.  I think it'd be worthwhile to try to have a
better parse time representation too, we do a number of somewhat ugly
contortions to make that work.

I wonder a bit if we should have Param nodes of the LET kind refer to
their "parent" LET node, and have something paramlevelsup or letlevelsup
(similar to varlevelsup) type logic, where each LET node has a link to
its parent. The numbering for each level would start at 0, solving the
issue of breaking equal() in most cases. But that seems awfully
complicated.

Alternatively I wonder if there's really a need to have a Param number
for this new kind of Param. All it really needs is a way to 1) refer to
the expression that determines it value 2) a way to avoid repeatedly
evaluate the expression 3) a way to have the cached values reset between
expression evaluations.  Neither of these really require a LET() node,
or Param numbers, right?  It makes sense to have Param numbers out of
symmetry, but perhaps that symmetry isn't actually that strong?


> So what I'm thinking about at the moment is to keep using CaseTestExpr
> in parser output trees, but have the planner replace them with
> (uniquely numbered) Params during eval_const_expressions.  As long
> as we integrate that correctly with function inlining, it should be
> safe and ensure that CASEs coming in from functions can't be confused
> with ones present in the original query (because we'd renumber anything
> in the inlined tree to be distinct from what we had already).  But that
> seems a bit ugly, and it might interfere with late-stage optimizations.

Better than nothing, but I'm not a huge fan...  Could be a large step
towards a larger improvement tho.

Greetings,

Andres Freund


Re: Lambda expressions (was Re: BUG #15471)

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2018-10-30 15:04:55 -0400, Tom Lane wrote:
>> The core idea that I'm working on is to invent a new node type
>> LambdaExpr that evaluates an expression and substitutes it as a
>> Param into another expression, notationally sort of like
>> LET($n := expression1 IN ... result expression using $n ...)

> Right, that's what I was thinking about as well (except I called it a
> WithValues node).

> How did you deal with the fact that we might extract subset of the LET()
> into e.g. a RestrictionInfo (and then e.g. an IndexPath), but another
> part would be e.g. evaluated as part of a qual?

Well, a Lambda expression is not something that can be optimized away
(unless perhaps you can get rid of the need for any of its output Params)
so I don't see how any of its subexpressions would ever wind up split out
from inside it.

> I wonder a bit if we should have Param nodes of the LET kind refer to
> their "parent" LET node, and have something paramlevelsup or letlevelsup
> (similar to varlevelsup) type logic, where each LET node has a link to
> its parent. The numbering for each level would start at 0, solving the
> issue of breaking equal() in most cases. But that seems awfully
> complicated.

We can't have up-links in an expression tree without creating all sorts of
fun circularity problems.  However, maybe we do not need to.  It might be
sufficient if each Lambda has a nesting depth number that's guaranteed
greater than the nesting depth number of any parent Lambda --- not
necessarily exactly parent + 1, just greater.  Then if its referencing
Param(s) also have that nesting depth number, we can tell what's what.
If there are other Lambdas elsewhere with the same nesting depth, it's
not a problem, as long as they aren't parent or child of one with a
conflicting depth.

The trick here would be to preserve that relationship when doing tree
manipulations --- subquery flattening, in particular, would likely find
that to be a nightmare when pulling up Lambdas.  Maybe it could be made
to work, but I'm not sure.

Also, this doesn't seem like it fixes the equal() problem.  We could fix
equal() if we numbered from the bottom up (that is, a Lambda has depth
exactly 0 if it contains no Lambdas, exactly 1 if it contains only Lambdas
with depth 0, etc) but I'm afraid that definition makes the tree
manipulation problem even worse.

The approach I was using was just to assign unique output Param IDs to
each Lambda, generated from a query-wide counter.  That seems to make
the tree manipulations a lot more robust than I think these other ways
could hope to be, and I'm not sure that losing equal() matching is a huge
problem as long as we manage not to have it happen on parser output trees
(which is what DDL works on).

> Alternatively I wonder if there's really a need to have a Param number
> for this new kind of Param. All it really needs is a way to 1) refer to
> the expression that determines it value 2) a way to avoid repeatedly
> evaluate the expression 3) a way to have the cached values reset between
> expression evaluations.

Hm, isn't (1) going to end up being isomorphic to a Param number?
Also, I don't think we need (3) as long as we use a Lambda node, because
it's descending through the Lambda node that triggers evaluation of the
associated Param(s).

I'd been envisioning that a Lambda could set more than one Param, so you'd
need both a depth number and an index number within the referenced Lambda.
That's not utterly necessary to allow, but stacking several Lambdas for a
single inlined function seems kinda inefficient.

            regards, tom lane


Re: Lambda expressions (was Re: BUG #15471)

From
Andres Freund
Date:
On 2018-10-30 16:23:37 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > How did you deal with the fact that we might extract subset of the LET()
> > into e.g. a RestrictionInfo (and then e.g. an IndexPath), but another
> > part would be e.g. evaluated as part of a qual?
> 
> Well, a Lambda expression is not something that can be optimized away
> (unless perhaps you can get rid of the need for any of its output Params)
> so I don't see how any of its subexpressions would ever wind up split out
> from inside it.

Isn't that a pretty fundamental requirement for the postgis (and I
presume lots of other) usecases?  What they have is wrapper function
functions like ST_DWithin(geometry, geometry, float) that roughly expand
to something (simplified) like

SELECT $1 && ST_Expand($2, $3) AND _ST_DWithin($1, $2, $3);

where && is the overlaps operator, and then _ST_DWithin is the accurate
computation. That allows a quick bounding-box (or similar) search via
index, and then an accurate re-check.   But $2 might be the result of a
function (with constant input), and that'll currently prevent the SQL
function from being inlined when the function is expensive. And the
postgis folks *want* its functions be marked expensive, because they
really are - but getting index searches is also important.

Am I missing something, or how would what you propose not break that?

(delaying answering to the rest until we cleared up the above)

Greetings,

Andres Freund


Re: Lambda expressions (was Re: BUG #15471)

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2018-10-30 16:23:37 -0400, Tom Lane wrote:
>> Well, a Lambda expression is not something that can be optimized away
>> (unless perhaps you can get rid of the need for any of its output Params)
>> so I don't see how any of its subexpressions would ever wind up split out
>> from inside it.

> Isn't that a pretty fundamental requirement for the postgis (and I
> presume lots of other) usecases?  What they have is wrapper function
> functions like ST_DWithin(geometry, geometry, float) that roughly expand
> to something (simplified) like

> SELECT $1 && ST_Expand($2, $3) AND _ST_DWithin($1, $2, $3);

> where && is the overlaps operator, and then _ST_DWithin is the accurate
> computation. That allows a quick bounding-box (or similar) search via
> index, and then an accurate re-check.   But $2 might be the result of a
> function (with constant input), and that'll currently prevent the SQL
> function from being inlined when the function is expensive. And the
> postgis folks *want* its functions be marked expensive, because they
> really are - but getting index searches is also important.

Hm.  But if we're trying to avoid evaluating $2 twice, how would you
expect to allow the ST_Expand and _ST_DWithin calls to get separated?
They can't be allowed to get very far apart in the tree, ISTM.

The patch as I had it also dealt with another limitation on function
inlining, which was the restrictions on what you can do with strict
functions, by means of allowing a LambdaExpr to be "strict"; that
is short-circuit to a NULL result if any of the Param values turn
out null.  It doesn't seem possible to do what you're talking about
in that case either.  Maybe the PostGIS guys don't care, since they're
probably OK with not marking their wrapper functions strict, but I'm
not sure that that's the whole universe we should be worried about.

            regards, tom lane


Re: Lambda expressions (was Re: BUG #15471)

From
Andres Freund
Date:
Hi,

On 2018-10-30 16:54:45 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2018-10-30 16:23:37 -0400, Tom Lane wrote:
> >> Well, a Lambda expression is not something that can be optimized away
> >> (unless perhaps you can get rid of the need for any of its output Params)
> >> so I don't see how any of its subexpressions would ever wind up split out
> >> from inside it.
> 
> > Isn't that a pretty fundamental requirement for the postgis (and I
> > presume lots of other) usecases?  What they have is wrapper function
> > functions like ST_DWithin(geometry, geometry, float) that roughly expand
> > to something (simplified) like
> 
> > SELECT $1 && ST_Expand($2, $3) AND _ST_DWithin($1, $2, $3);
> 
> > where && is the overlaps operator, and then _ST_DWithin is the accurate
> > computation. That allows a quick bounding-box (or similar) search via
> > index, and then an accurate re-check.   But $2 might be the result of a
> > function (with constant input), and that'll currently prevent the SQL
> > function from being inlined when the function is expensive. And the
> > postgis folks *want* its functions be marked expensive, because they
> > really are - but getting index searches is also important.
> 
> Hm.  But if we're trying to avoid evaluating $2 twice, how would you
> expect to allow the ST_Expand and _ST_DWithin calls to get separated?
> They can't be allowed to get very far apart in the tree, ISTM.

I think postgis would be OK just tacking a 'force-inline this' on
functions like ST_DWithin, and live with the redundant expansion. Right
now they can either have accurate costs (and thus trigger
e.g. parallelism / avoid plans with more costs) but no index scans, or
the reverse.  While not optimal, it'd be better than what's there now -
but it's still awfully crummy.

And yes, they can't get allowed to get too far apart. That's the problem
I was trying - and obviously failing - to describe in the lounge, btw
:).  If we allowed the LET() to be split at the baserel level (around
the match_restriction_clauses_to_index() in create_index_paths()), it
would probably be ok, but that'd be architecturally somewhat
invasive.


> The patch as I had it also dealt with another limitation on function
> inlining, which was the restrictions on what you can do with strict
> functions, by means of allowing a LambdaExpr to be "strict"; that
> is short-circuit to a NULL result if any of the Param values turn
> out null.

Right, and similar with volatile.


> It doesn't seem possible to do what you're talking about in that case
> either.  Maybe the PostGIS guys don't care, since they're probably OK
> with not marking their wrapper functions strict, but I'm not sure that
> that's the whole universe we should be worried about.

I agree, but we also should take care not to regress cases of inlining
like postgis', which currently be separated.  If we'd only create the
new LET node in cases we previously couldn't inline that's obviously not
a problem.

Greetings,

Andres Freund


Re: Lambda expressions (was Re: BUG #15471)

From
Stefan Keller
Date:
Dear Tom, Andres, Andrew, all

Any update on this interesting topic?

--Stefan

P.S. TIL Databricks SQL lambda functions: "A parameterized expression
that can be passed to a function to control its behavior. For example,
array_sort function (Databricks SQL) accepts a lambda function as an
argument to define a custom sort order."
https://docs.databricks.com/sql/language-manual/sql-ref-lambda-functions.html

Am Di., 30. Okt. 2018 um 23:20 Uhr schrieb Andres Freund <andres@anarazel.de>:
>
> Hi,
>
> On 2018-10-30 16:54:45 -0400, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > On 2018-10-30 16:23:37 -0400, Tom Lane wrote:
> > >> Well, a Lambda expression is not something that can be optimized away
> > >> (unless perhaps you can get rid of the need for any of its output Params)
> > >> so I don't see how any of its subexpressions would ever wind up split out
> > >> from inside it.
> >
> > > Isn't that a pretty fundamental requirement for the postgis (and I
> > > presume lots of other) usecases?  What they have is wrapper function
> > > functions like ST_DWithin(geometry, geometry, float) that roughly expand
> > > to something (simplified) like
> >
> > > SELECT $1 && ST_Expand($2, $3) AND _ST_DWithin($1, $2, $3);
> >
> > > where && is the overlaps operator, and then _ST_DWithin is the accurate
> > > computation. That allows a quick bounding-box (or similar) search via
> > > index, and then an accurate re-check.   But $2 might be the result of a
> > > function (with constant input), and that'll currently prevent the SQL
> > > function from being inlined when the function is expensive. And the
> > > postgis folks *want* its functions be marked expensive, because they
> > > really are - but getting index searches is also important.
> >
> > Hm.  But if we're trying to avoid evaluating $2 twice, how would you
> > expect to allow the ST_Expand and _ST_DWithin calls to get separated?
> > They can't be allowed to get very far apart in the tree, ISTM.
>
> I think postgis would be OK just tacking a 'force-inline this' on
> functions like ST_DWithin, and live with the redundant expansion. Right
> now they can either have accurate costs (and thus trigger
> e.g. parallelism / avoid plans with more costs) but no index scans, or
> the reverse.  While not optimal, it'd be better than what's there now -
> but it's still awfully crummy.
>
> And yes, they can't get allowed to get too far apart. That's the problem
> I was trying - and obviously failing - to describe in the lounge, btw
> :).  If we allowed the LET() to be split at the baserel level (around
> the match_restriction_clauses_to_index() in create_index_paths()), it
> would probably be ok, but that'd be architecturally somewhat
> invasive.
>
>
> > The patch as I had it also dealt with another limitation on function
> > inlining, which was the restrictions on what you can do with strict
> > functions, by means of allowing a LambdaExpr to be "strict"; that
> > is short-circuit to a NULL result if any of the Param values turn
> > out null.
>
> Right, and similar with volatile.
>
>
> > It doesn't seem possible to do what you're talking about in that case
> > either.  Maybe the PostGIS guys don't care, since they're probably OK
> > with not marking their wrapper functions strict, but I'm not sure that
> > that's the whole universe we should be worried about.
>
> I agree, but we also should take care not to regress cases of inlining
> like postgis', which currently be separated.  If we'd only create the
> new LET node in cases we previously couldn't inline that's obviously not
> a problem.
>
> Greetings,
>
> Andres Freund
>



Re: Lambda expressions (was Re: BUG #15471)

From
Paul Ramsey
Date:
On Tue, Oct 30, 2018 at 3:20 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2018-10-30 16:54:45 -0400, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > On 2018-10-30 16:23:37 -0400, Tom Lane wrote:
> > >> Well, a Lambda expression is not something that can be optimized away
> > >> (unless perhaps you can get rid of the need for any of its output Params)
> > >> so I don't see how any of its subexpressions would ever wind up split out
> > >> from inside it.
> >
> > > Isn't that a pretty fundamental requirement for the postgis (and I
> > > presume lots of other) usecases?  What they have is wrapper function
> > > functions like ST_DWithin(geometry, geometry, float) that roughly expand
> > > to something (simplified) like
> >
> > > SELECT $1 && ST_Expand($2, $3) AND _ST_DWithin($1, $2, $3);

Just an FYI on our current state of play. Since Pg12 / PostGIS 3 we
have been using Tom's support function API to slide an index operator
into function calls that we want to have index support (ST_DWithin,
ST_Intersects, ST_Contains, etc, etc). So we are a lot less sensitive
to the vagaries of inlining than we used to be (there's still some of
this in our raster sub-module).

Thanks always!

P.


> >
> > > where && is the overlaps operator, and then _ST_DWithin is the accurate
> > > computation. That allows a quick bounding-box (or similar) search via
> > > index, and then an accurate re-check.   But $2 might be the result of a
> > > function (with constant input), and that'll currently prevent the SQL
> > > function from being inlined when the function is expensive. And the
> > > postgis folks *want* its functions be marked expensive, because they
> > > really are - but getting index searches is also important.
> >
> > Hm.  But if we're trying to avoid evaluating $2 twice, how would you
> > expect to allow the ST_Expand and _ST_DWithin calls to get separated?
> > They can't be allowed to get very far apart in the tree, ISTM.
>
> I think postgis would be OK just tacking a 'force-inline this' on
> functions like ST_DWithin, and live with the redundant expansion. Right
> now they can either have accurate costs (and thus trigger
> e.g. parallelism / avoid plans with more costs) but no index scans, or
> the reverse.  While not optimal, it'd be better than what's there now -
> but it's still awfully crummy.
>
> And yes, they can't get allowed to get too far apart. That's the problem
> I was trying - and obviously failing - to describe in the lounge, btw
> :).  If we allowed the LET() to be split at the baserel level (around
> the match_restriction_clauses_to_index() in create_index_paths()), it
> would probably be ok, but that'd be architecturally somewhat
> invasive.
>
>
> > The patch as I had it also dealt with another limitation on function
> > inlining, which was the restrictions on what you can do with strict
> > functions, by means of allowing a LambdaExpr to be "strict"; that
> > is short-circuit to a NULL result if any of the Param values turn
> > out null.
>
> Right, and similar with volatile.
>
>
> > It doesn't seem possible to do what you're talking about in that case
> > either.  Maybe the PostGIS guys don't care, since they're probably OK
> > with not marking their wrapper functions strict, but I'm not sure that
> > that's the whole universe we should be worried about.
>
> I agree, but we also should take care not to regress cases of inlining
> like postgis', which currently be separated.  If we'd only create the
> new LET node in cases we previously couldn't inline that's obviously not
> a problem.
>
> Greetings,
>
> Andres Freund
>