Thread: Lambda expressions (was Re: BUG #15471)
[ splitting this off to a new thread ]

Andres Freund <andres@anarazel.de> writes:
> On 2018-10-30 09:30:54 -0400, Tom Lane wrote:
>> This is all some more fuel for the idea that we need a less messy substitute for CaseTestExpr. As it happens, I was just fooling with that yesterday, and hope to have something to post soon. But it'll be too invasive to back-patch. I'll try to fix this particular problem more locally.

> Yea, I agree we need something better there. I happened to also play around with that (more from the angle that we'd want similar infrastructure for the SQL function inlining case we talked about) on the long flights back from .eu yesterday, but didn't get that far.

Yeah, that conversation was what spurred me to do something. The patch as I have it right now only fixes (most of) the inlining-failure cases. I've been thinking about how to reuse the mechanism to get rid of CaseTestExpr, but am running into some issues.

The core idea that I'm working on is to invent a new node type LambdaExpr that evaluates an expression and substitutes it as a Param into another expression, notationally sort of like

    LET($n := expression1 IN ... result expression using $n ...)

and then the different values can be distinguished by having different Param numbers. These Params have a different ParamKind from PARAM_EXTERNAL or PARAM_EXEC, because there are too many assumptions about the behavior of those ParamKinds, but otherwise they're not that much different.

The stumbling block that I'm seeing in terms of making parser output use this is that if, say, each CaseExpr has a distinct Param number for what had been its CaseTestExpr, then textually and logically equivalent CASE constructs might not compare equal() because of different Param numbers. And that's really bad news for a bunch of reasons; see e.g. recent bugs with DDL on partitioned tables.

But we can't have equal() ignore the Param numbers without re-introducing all the same issues we have with CaseTestExpr.

So what I'm thinking about at the moment is to keep using CaseTestExpr in parser output trees, but have the planner replace them with (uniquely numbered) Params during eval_const_expressions. As long as we integrate that correctly with function inlining, it should be safe and ensure that CASEs coming in from functions can't be confused with ones present in the original query (because we'd renumber anything in the inlined tree to be distinct from what we had already). But that seems a bit ugly, and it might interfere with late-stage optimizations.

Have you got a better idea?

			regards, tom lane
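[ Editor's note: a conceptual sketch of the evaluate-once semantics the proposed LambdaExpr/LET node would provide. Python is used purely as illustrative pseudocode; the class and names here are hypothetical and are not PostgreSQL code or API. ]

```python
# Models LET($n := arg IN body): evaluate arg exactly once, bind the
# result to Param slot n, then evaluate body, which may reference $n
# any number of times without re-evaluating arg.

class LetExpr:
    def __init__(self, param_id, arg, body):
        self.param_id = param_id  # which $n this node sets
        self.arg = arg            # callable producing the bound value
        self.body = body          # callable receiving the param environment

    def eval(self, env=None):
        env = dict(env or {})
        env[self.param_id] = self.arg()   # evaluated exactly once
        return self.body(env)

calls = []
def expensive():
    calls.append(1)
    return 7

# LET($1 := expensive() IN $1 + $1)
node = LetExpr(1, expensive, lambda env: env[1] + env[1])
result = node.eval()
# $1 is referenced twice in the body, yet expensive() ran only once
```

Distinguishing multiple bound values then reduces to using distinct param_id values, which is exactly where the equal()-comparison problem described above comes from.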
Hi,

On 2018-10-30 15:04:55 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2018-10-30 09:30:54 -0400, Tom Lane wrote:
> >> This is all some more fuel for the idea that we need a less messy substitute for CaseTestExpr. As it happens, I was just fooling with that yesterday, and hope to have something to post soon. But it'll be too invasive to back-patch. I'll try to fix this particular problem more locally.
>
> > Yea, I agree we need something better there. I happened to also play around with that (more from the angle that we'd want similar infrastructure for the SQL function inlining case we talked about) on the long flights back from .eu yesterday, but didn't get that far.
>
> Yeah, that conversation was what spurred me to do something. The patch as I have it right now only fixes (most of) the inlining-failure cases. I've been thinking about how to reuse the mechanism to get rid of CaseTestExpr, but am running into some issues.
>
> The core idea that I'm working on is to invent a new node type LambdaExpr that evaluates an expression and substitutes it as a Param into another expression, notationally sort of like
>
> LET($n := expression1 IN ... result expression using $n ...)

Right, that's what I was thinking about as well (except I called it a WithValues node).

> and then the different values can be distinguished by having different Param numbers. These Params have a different ParamKind from PARAM_EXTERNAL or PARAM_EXEC, because there are too many assumptions about the behavior of those ParamKinds, but otherwise they're not that much different.

Right, that makes sense.

How did you deal with the fact that we might extract a subset of the LET() into e.g. a RestrictInfo (and then e.g. an IndexPath), while another part would be e.g. evaluated as part of a qual? I got to the point where I allocated a chunk of memory for a set of Param values (e.g. from a single inlined function), abandoned having something like the WithValues node, and had that memory directly referenced by the new Param nodes. But I didn't really make that work outside of the simplest cases.

> The stumbling block that I'm seeing in terms of making parser output use this is that if, say, each CaseExpr has a distinct Param number for what had been its CaseTestExpr, then textually and logically equivalent CASE constructs might not compare equal() because of different Param numbers. And that's really bad news for a bunch of reasons; see e.g. recent bugs with DDL on partitioned tables.
>
> But we can't have equal() ignore the Param numbers without re-introducing all the same issues we have with CaseTestExpr.

I never got this far. I think it'd be worthwhile to try to have a better parse-time representation too; we do a number of somewhat ugly contortions to make that work today.

I wonder a bit if we should have Param nodes of the LET kind refer to their "parent" LET node, with some paramlevelsup or letlevelsup logic (similar to varlevelsup), where each LET node has a link to its parent. The numbering for each level would start at 0, solving the issue of breaking equal() in most cases. But that seems awfully complicated.

Alternatively, I wonder if there's really a need to have a Param number for this new kind of Param at all. All it really needs is 1) a way to refer to the expression that determines its value, 2) a way to avoid repeatedly evaluating that expression, and 3) a way to have the cached value reset between expression evaluations. None of these really requires a LET() node, or Param numbers, right? It makes sense to have Param numbers for symmetry, but perhaps that symmetry isn't actually that strong?

> So what I'm thinking about at the moment is to keep using CaseTestExpr in parser output trees, but have the planner replace them with (uniquely numbered) Params during eval_const_expressions. As long as we integrate that correctly with function inlining, it should be safe and ensure that CASEs coming in from functions can't be confused with ones present in the original query (because we'd renumber anything in the inlined tree to be distinct from what we had already). But that seems a bit ugly, and it might interfere with late-stage optimizations.

Better than nothing, but I'm not a huge fan... It could be a large step towards a larger improvement, though.

Greetings,

Andres Freund
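[ Editor's note: the three requirements in the numberless-Param alternative above can be modeled very compactly. Python again as illustrative pseudocode; CachedParam is a hypothetical name, not a PostgreSQL type. ]

```python
class CachedParam:
    """Models the three requirements: 1) knows the expression determining
    its value, 2) evaluates it at most once, 3) can be reset between
    expression evaluations (e.g. between rows)."""

    def __init__(self, expr):
        self.expr = expr        # (1) the value-determining expression
        self._valid = False
        self._value = None

    def value(self):
        if not self._valid:     # (2) evaluate at most once while cached
            self._value = self.expr()
            self._valid = True
        return self._value

    def reset(self):            # (3) invalidate between evaluations
        self._valid = False

evals = []
p = CachedParam(lambda: (evals.append(1), 42)[1])

first_row = (p.value(), p.value())   # two references, one evaluation
p.reset()                            # new "row": cache invalidated
second_row = p.value()               # expression runs a second time
```

Note that no Param number appears anywhere: identity is carried by the object reference itself, which is the symmetry question raised above.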
Andres Freund <andres@anarazel.de> writes:
> On 2018-10-30 15:04:55 -0400, Tom Lane wrote:
>> The core idea that I'm working on is to invent a new node type LambdaExpr that evaluates an expression and substitutes it as a Param into another expression, notationally sort of like
>> LET($n := expression1 IN ... result expression using $n ...)

> Right, that's what I was thinking about as well (except I called it a WithValues node).

> How did you deal with the fact that we might extract a subset of the LET() into e.g. a RestrictInfo (and then e.g. an IndexPath), while another part would be e.g. evaluated as part of a qual?

Well, a Lambda expression is not something that can be optimized away (unless perhaps you can get rid of the need for any of its output Params), so I don't see how any of its subexpressions would ever wind up split out from inside it.

> I wonder a bit if we should have Param nodes of the LET kind refer to their "parent" LET node, with some paramlevelsup or letlevelsup logic (similar to varlevelsup), where each LET node has a link to its parent. The numbering for each level would start at 0, solving the issue of breaking equal() in most cases. But that seems awfully complicated.

We can't have up-links in an expression tree without creating all sorts of fun circularity problems. However, maybe we do not need to. It might be sufficient if each Lambda has a nesting depth number that's guaranteed greater than the nesting depth number of any parent Lambda --- not necessarily exactly parent + 1, just greater. Then if its referencing Param(s) also have that nesting depth number, we can tell what's what. If there are other Lambdas elsewhere with the same nesting depth, it's not a problem, as long as they aren't parent or child of one with a conflicting depth.

The trick here would be to preserve that relationship when doing tree manipulations --- subquery flattening, in particular, would likely find that to be a nightmare when pulling up Lambdas. Maybe it could be made to work, but I'm not sure. Also, this doesn't seem like it fixes the equal() problem. We could fix equal() if we numbered from the bottom up (that is, a Lambda has depth exactly 0 if it contains no Lambdas, exactly 1 if it contains only Lambdas with depth 0, etc.), but I'm afraid that definition makes the tree-manipulation problem even worse.

The approach I was using was just to assign unique output Param IDs to each Lambda, generated from a query-wide counter. That seems to make the tree manipulations a lot more robust than I think these other ways could hope to be, and I'm not sure that losing equal() matching is a huge problem as long as we manage not to have it happen on parser output trees (which is what DDL works on).

> Alternatively, I wonder if there's really a need to have a Param number for this new kind of Param at all. All it really needs is 1) a way to refer to the expression that determines its value, 2) a way to avoid repeatedly evaluating that expression, and 3) a way to have the cached value reset between expression evaluations.

Hm, isn't (1) going to end up being isomorphic to a Param number? Also, I don't think we need (3) as long as we use a Lambda node, because it's descending through the Lambda node that triggers evaluation of the associated Param(s).

I'd been envisioning that a Lambda could set more than one Param, so you'd need both a depth number and an index number within the referenced Lambda. That's not utterly necessary to allow, but stacking several Lambdas for a single inlined function seems kinda inefficient.

			regards, tom lane
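[ Editor's note: the equal() trade-off between the two numbering schemes discussed above can be shown in a few lines. Python as a conceptual model only; the tree encoding and function names are invented for illustration. ]

```python
# A Lambda tree is modeled as a list of child Lambdas (leaves have no
# children).  Bottom-up depth numbers are derived purely from structure,
# so two textually identical trees get identical numbers; IDs drawn from
# a query-wide counter necessarily differ between identical trees.

def bottom_up_depth(children):
    """Depth 0 if a Lambda contains no Lambdas, else 1 + max child depth."""
    return 0 if not children else 1 + max(bottom_up_depth(c) for c in children)

_next_id = [0]
def counter_id(_node):
    """Query-wide unique counter, as in the patch described above."""
    _next_id[0] += 1
    return _next_id[0]

tree = [[]]                       # a Lambda containing one child Lambda

depth_a = bottom_up_depth(tree)   # first occurrence
depth_b = bottom_up_depth(tree)   # a textually identical copy: same depth

id_a = counter_id(tree)           # first occurrence: unique ID
id_b = counter_id(tree)           # identical copy: a *different* ID
```

Structure-derived numbers keep equal() working but must survive tree manipulations; counter IDs survive manipulations trivially but break equal() matching, which is why keeping them out of parser output trees matters.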
On 2018-10-30 16:23:37 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > How did you deal with the fact that we might extract a subset of the LET() into e.g. a RestrictInfo (and then e.g. an IndexPath), while another part would be e.g. evaluated as part of a qual?
>
> Well, a Lambda expression is not something that can be optimized away (unless perhaps you can get rid of the need for any of its output Params), so I don't see how any of its subexpressions would ever wind up split out from inside it.

Isn't that a pretty fundamental requirement for the postgis (and I presume lots of other) use cases? What they have is wrapper functions like ST_DWithin(geometry, geometry, float) that roughly expand to something (simplified) like

    SELECT $1 && ST_Expand($2, $3) AND _ST_DWithin($1, $2, $3);

where && is the overlaps operator, and _ST_DWithin is the accurate computation. That allows a quick bounding-box (or similar) search via an index, followed by an accurate recheck. But $2 might be the result of a function (with constant input), and that'll currently prevent the SQL function from being inlined when that function is expensive. And the postgis folks *want* these functions to be marked expensive, because they really are - but getting index searches is also important.

Am I missing something, or how would what you propose not break that?

(delaying answering the rest until we've cleared up the above)

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2018-10-30 16:23:37 -0400, Tom Lane wrote:
>> Well, a Lambda expression is not something that can be optimized away (unless perhaps you can get rid of the need for any of its output Params), so I don't see how any of its subexpressions would ever wind up split out from inside it.

> Isn't that a pretty fundamental requirement for the postgis (and I presume lots of other) use cases? What they have is wrapper functions like ST_DWithin(geometry, geometry, float) that roughly expand to something (simplified) like
> SELECT $1 && ST_Expand($2, $3) AND _ST_DWithin($1, $2, $3);
> where && is the overlaps operator, and _ST_DWithin is the accurate computation. That allows a quick bounding-box (or similar) search via an index, followed by an accurate recheck. But $2 might be the result of a function (with constant input), and that'll currently prevent the SQL function from being inlined when that function is expensive. And the postgis folks *want* these functions to be marked expensive, because they really are - but getting index searches is also important.

Hm. But if we're trying to avoid evaluating $2 twice, how would you expect to allow the ST_Expand and _ST_DWithin calls to get separated? They can't be allowed to get very far apart in the tree, ISTM.

The patch as I had it also dealt with another limitation on function inlining, namely the restrictions on what you can do with strict functions, by allowing a LambdaExpr to be "strict"; that is, it short-circuits to a NULL result if any of the Param values turn out null. It doesn't seem possible to do what you're talking about in that case either. Maybe the PostGIS guys don't care, since they're probably OK with not marking their wrapper functions strict, but I'm not sure that that's the whole universe we should be worried about.

			regards, tom lane
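[ Editor's note: the "strict" LambdaExpr behavior described above - short-circuit to NULL before the body runs if any bound value is null - sketched as a model. Python as illustrative pseudocode, with None standing in for SQL NULL; strict_let_eval is an invented name. ]

```python
def strict_let_eval(arg_exprs, body):
    """Evaluate each argument once; if any yields NULL (None), return
    NULL without evaluating the body, mirroring strict-function
    semantics.  This is why the body's subexpressions cannot be moved
    out from under a strict Lambda: they must not run on NULL inputs."""
    values = []
    for arg in arg_exprs:
        v = arg()
        if v is None:
            return None          # strict short-circuit: body never runs
        values.append(v)
    return body(values)

body_calls = []
def body(vals):
    body_calls.append(vals)
    return sum(vals)

ok = strict_let_eval([lambda: 2, lambda: 3], body)      # body runs
nil = strict_let_eval([lambda: 2, lambda: None], body)  # body skipped
```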
Hi,

On 2018-10-30 16:54:45 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2018-10-30 16:23:37 -0400, Tom Lane wrote:
> >> Well, a Lambda expression is not something that can be optimized away (unless perhaps you can get rid of the need for any of its output Params), so I don't see how any of its subexpressions would ever wind up split out from inside it.
>
> > Isn't that a pretty fundamental requirement for the postgis (and I presume lots of other) use cases? What they have is wrapper functions like ST_DWithin(geometry, geometry, float) that roughly expand to something (simplified) like
>
> > SELECT $1 && ST_Expand($2, $3) AND _ST_DWithin($1, $2, $3);
>
> > where && is the overlaps operator, and _ST_DWithin is the accurate computation. That allows a quick bounding-box (or similar) search via an index, followed by an accurate recheck. But $2 might be the result of a function (with constant input), and that'll currently prevent the SQL function from being inlined when that function is expensive. And the postgis folks *want* these functions to be marked expensive, because they really are - but getting index searches is also important.
>
> Hm. But if we're trying to avoid evaluating $2 twice, how would you expect to allow the ST_Expand and _ST_DWithin calls to get separated? They can't be allowed to get very far apart in the tree, ISTM.

I think postgis would be OK with just tacking a 'force-inline this' marker onto functions like ST_DWithin, and living with the redundant evaluation. Right now they can either have accurate costs (and thus trigger e.g. parallelism / avoid higher-cost plans) but no index scans, or the reverse. While not optimal, it'd be better than what's there now - but it's still awfully crummy.

And yes, they can't be allowed to get too far apart. That's the problem I was trying - and obviously failing - to describe in the lounge, btw :). If we allowed the LET() to be split at the baserel level (around the match_restriction_clauses_to_index() call in create_index_paths()), it would probably be OK, but that'd be architecturally somewhat invasive.

> The patch as I had it also dealt with another limitation on function inlining, namely the restrictions on what you can do with strict functions, by allowing a LambdaExpr to be "strict"; that is, it short-circuits to a NULL result if any of the Param values turn out null.

Right, and similarly with volatile.

> It doesn't seem possible to do what you're talking about in that case either. Maybe the PostGIS guys don't care, since they're probably OK with not marking their wrapper functions strict, but I'm not sure that that's the whole universe we should be worried about.

I agree, but we also should take care not to regress cases of inlining like postgis', which can currently be handled. If we only create the new LET node in cases we previously couldn't inline, that's obviously not a problem.

Greetings,

Andres Freund
Dear Tom, Andres, Andrew, all,

Any update on this interesting topic?

--Stefan

P.S. TIL Databricks SQL lambda functions: "A parameterized expression that can be passed to a function to control its behavior. For example, array_sort function (Databricks SQL) accepts a lambda function as an argument to define a custom sort order." https://docs.databricks.com/sql/language-manual/sql-ref-lambda-functions.html

On Tue, Oct 30, 2018 at 23:20, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2018-10-30 16:54:45 -0400, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > On 2018-10-30 16:23:37 -0400, Tom Lane wrote:
> > >> Well, a Lambda expression is not something that can be optimized away (unless perhaps you can get rid of the need for any of its output Params), so I don't see how any of its subexpressions would ever wind up split out from inside it.
>
> > > Isn't that a pretty fundamental requirement for the postgis (and I presume lots of other) use cases? What they have is wrapper functions like ST_DWithin(geometry, geometry, float) that roughly expand to something (simplified) like
>
> > > SELECT $1 && ST_Expand($2, $3) AND _ST_DWithin($1, $2, $3);
>
> > > where && is the overlaps operator, and _ST_DWithin is the accurate computation. That allows a quick bounding-box (or similar) search via an index, followed by an accurate recheck. But $2 might be the result of a function (with constant input), and that'll currently prevent the SQL function from being inlined when that function is expensive. And the postgis folks *want* these functions to be marked expensive, because they really are - but getting index searches is also important.
>
> > Hm. But if we're trying to avoid evaluating $2 twice, how would you expect to allow the ST_Expand and _ST_DWithin calls to get separated? They can't be allowed to get very far apart in the tree, ISTM.
>
> I think postgis would be OK with just tacking a 'force-inline this' marker onto functions like ST_DWithin, and living with the redundant evaluation. Right now they can either have accurate costs (and thus trigger e.g. parallelism / avoid higher-cost plans) but no index scans, or the reverse. While not optimal, it'd be better than what's there now - but it's still awfully crummy.
>
> And yes, they can't be allowed to get too far apart. That's the problem I was trying - and obviously failing - to describe in the lounge, btw :). If we allowed the LET() to be split at the baserel level (around the match_restriction_clauses_to_index() call in create_index_paths()), it would probably be OK, but that'd be architecturally somewhat invasive.
>
> > The patch as I had it also dealt with another limitation on function inlining, namely the restrictions on what you can do with strict functions, by allowing a LambdaExpr to be "strict"; that is, it short-circuits to a NULL result if any of the Param values turn out null.
>
> Right, and similarly with volatile.
>
> > It doesn't seem possible to do what you're talking about in that case either. Maybe the PostGIS guys don't care, since they're probably OK with not marking their wrapper functions strict, but I'm not sure that that's the whole universe we should be worried about.
>
> I agree, but we also should take care not to regress cases of inlining like postgis', which can currently be handled. If we only create the new LET node in cases we previously couldn't inline, that's obviously not a problem.
>
> Greetings,
>
> Andres Freund
On Tue, Oct 30, 2018 at 3:20 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2018-10-30 16:54:45 -0400, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > On 2018-10-30 16:23:37 -0400, Tom Lane wrote:
> > >> Well, a Lambda expression is not something that can be optimized away (unless perhaps you can get rid of the need for any of its output Params), so I don't see how any of its subexpressions would ever wind up split out from inside it.
>
> > > Isn't that a pretty fundamental requirement for the postgis (and I presume lots of other) use cases? What they have is wrapper functions like ST_DWithin(geometry, geometry, float) that roughly expand to something (simplified) like
>
> > > SELECT $1 && ST_Expand($2, $3) AND _ST_DWithin($1, $2, $3);

Just an FYI on our current state of play. Since Pg12 / PostGIS 3 we have been using Tom's support-function API to slide an index operator into function calls that we want to have index support (ST_DWithin, ST_Intersects, ST_Contains, etc.), so we are a lot less sensitive to the vagaries of inlining than we used to be (there's still some of this in our raster sub-module).

Thanks as always!

P.

> > > where && is the overlaps operator, and _ST_DWithin is the accurate computation. That allows a quick bounding-box (or similar) search via an index, followed by an accurate recheck. But $2 might be the result of a function (with constant input), and that'll currently prevent the SQL function from being inlined when that function is expensive. And the postgis folks *want* these functions to be marked expensive, because they really are - but getting index searches is also important.
>
> > Hm. But if we're trying to avoid evaluating $2 twice, how would you expect to allow the ST_Expand and _ST_DWithin calls to get separated? They can't be allowed to get very far apart in the tree, ISTM.
>
> I think postgis would be OK with just tacking a 'force-inline this' marker onto functions like ST_DWithin, and living with the redundant evaluation. Right now they can either have accurate costs (and thus trigger e.g. parallelism / avoid higher-cost plans) but no index scans, or the reverse. While not optimal, it'd be better than what's there now - but it's still awfully crummy.
>
> And yes, they can't be allowed to get too far apart. That's the problem I was trying - and obviously failing - to describe in the lounge, btw :). If we allowed the LET() to be split at the baserel level (around the match_restriction_clauses_to_index() call in create_index_paths()), it would probably be OK, but that'd be architecturally somewhat invasive.
>
> > The patch as I had it also dealt with another limitation on function inlining, namely the restrictions on what you can do with strict functions, by allowing a LambdaExpr to be "strict"; that is, it short-circuits to a NULL result if any of the Param values turn out null.
>
> Right, and similarly with volatile.
>
> > It doesn't seem possible to do what you're talking about in that case either. Maybe the PostGIS guys don't care, since they're probably OK with not marking their wrapper functions strict, but I'm not sure that that's the whole universe we should be worried about.
>
> I agree, but we also should take care not to regress cases of inlining like postgis', which can currently be handled. If we only create the new LET node in cases we previously couldn't inline, that's obviously not a problem.
>
> Greetings,
>
> Andres Freund