Thread: Odd Shortcut behaviour in PG14
Hello all
Got a really weird problem with shortcut processing on one server.
We have just upgraded to PG14 from PG11. The following code works as expected on our primary Dev server, and we recently upgraded our QA server to the same level. However in this case the shortcut processing seems broken.
Here is the code in question:
LEFT JOIN lateral (
SELECT
CASE WHEN (ekey > 0) THEN convert_from(crypto_secretbox_open, 'utf8')::JSON ELSE NULL END AS edata
FROM crypto_secretbox_open(
sc.data,
sc.nonce,
boxkey)
) enc ON true and (ekey > 0)
This code has worked up till now and works on the dev server. The expectation being that if ekey=0 the lateral join will be ignored. However on the new QA server this is hit and miss. It fails many times as it seems to ignore the shortcut even though ekey is 0. We can make it work by doing this:
LEFT JOIN lateral (
SELECT
CASE WHEN (ekey > 0) THEN convert_from(crypto_secretbox_open, 'utf8')::JSON ELSE NULL END AS edata
FROM crypto_secretbox_open(
sc.data,
sc.nonce,
boxkey)
) enc ON (ekey > 0) and true
This should theoretically be no different – but it solves the issue 100%. Any guidance on why this would be the case?
Zahir Lalani
Director of Enterprise Systems
151 Rosebery Avenue
London EC1R 4AB
t: +44 (0)7956 455 168
Thank you for your email. For a faster response please take note of the below:
- If your query is regarding a hosting issue/new request, please direct your request to Devops.support@oliver.agency
- If your request is regarding the data warehouse or zoho, please direct your request to mi_reporting@oliver.agency
- If your request is regarding OMG issues/configuration, please direct your request to omgsupport@oliver.agency
For all other requests I will try and respond as soon as I can. Please note our response SLA’s
- Critical (P1) – systems down or unresponsive = within 1 hour
- High (P2) – Issues with data accuracy/access/functional = within 4 business hours
- Medium (P3) – new requests / amendments to current setups etc = within 2 business days
- Low – general requests regarding any system = best endeavour
Attachment
On 11/23/23 08:55, Zahir Lalani wrote: > Hello all > > Got a really weird problem with shortcut processing on one server. > > We have just upgraded to PG14 from PG11. The following code works as > expected on our primary Dev server, and we recently upgraded our QA > server to the same level. However in this case the shortcut processing > seems broken. To be clear both the Dev and QA servers are at 14.x now? If so are they at same minor version? Same OS and version? Using same data set? Also from here: https://www.postgresql.org/docs/14/sql-expressions.html#SYNTAX-EXPRESS-EVAL " 4.2.14. Expression Evaluation Rules The order of evaluation of subexpressions is not defined. In particular, the inputs of an operator or function are not necessarily evaluated left-to-right or in any other fixed order. Furthermore, if the result of an expression can be determined by evaluating only some parts of it, then other subexpressions might not be evaluated at all. For instance, if one wrote: SELECT true OR somefunc(); then somefunc() would (probably) not be called at all. The same would be the case if one wrote: SELECT somefunc() OR true; Note that this is not the same as the left-to-right “short-circuiting” of Boolean operators that is found in some programming languages. " This also held in version 11. > > Here is the code in question: > > LEFT JOIN lateral ( > > SELECT > > CASE WHEN (ekey > 0) THEN convert_from(crypto_secretbox_open, > 'utf8')::JSON ELSE NULL END AS edata > > FROM crypto_secretbox_open( > > sc.data, > > sc.nonce, > > boxkey) > > ) *enc ON true and (ekey > 0)* > > This code has worked up till now and works on the dev server. The > expectation being that if ekey=0 the lateral join will be ignored. > However on the new QA server this is hit and miss. It fails many times > as it seems to ignore the shortcut even though ekey is 0. We can make it > work by doing this: > > LEFT JOIN lateral ( > > SELECT > > CASE WHEN (ekey > 0) THEN convert_from(crypto_secretbox_open, > 'utf8')::JSON ELSE NULL END AS edata > > FROM crypto_secretbox_open( > > sc.data, > > sc.nonce, > > boxkey) > > *) enc ON (ekey > 0) and true* > > This should theoretically be no different – but it solves the issue > 100%. Any guidance on why this would be the case? > > *Zahir Lalani** > *Director of Enterprise Systems > -- Adrian Klaver adrian.klaver@aklaver.com
Zahir Lalani <ZahirLalani@oliver.agency> writes: > Got a really weird problem with shortcut processing on one server. This question is unanswerable as given. You have not even defined what you mean by "fail" (error? wrong query result?), let alone provided enough detail for someone else to reproduce the problem. regards, tom lane
Zahir Lalani <ZahirLalani@oliver.agency> writes:
> Got a really weird problem with shortcut processing on one server.
This question is unanswerable as given. You have not even defined
what you mean by "fail" (error? wrong query result?), let alone
provided enough detail for someone else to reproduce the problem.
"David G. Johnston" <david.g.johnston@gmail.com> writes: > On Thursday, November 23, 2023, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> This question is unanswerable as given. You have not even defined >> what you mean by "fail" (error? wrong query result?), let alone >> provided enough detail for someone else to reproduce the problem. > The OP complains about the apparent inconsistent optimizing away of the SRF > function call of crypto_secretbox_open in the lateral when the join > predicate is known to be false. Hmm, your crystal ball apparently works better than mine today, because I sure do not see where that information was presented. regards, tom lane
On Thursday, November 23, 2023, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"David G. Johnston" <david.g.johnston@gmail.com> writes:
> On Thursday, November 23, 2023, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> This question is unanswerable as given. You have not even defined
>> what you mean by "fail" (error? wrong query result?), let alone
>> provided enough detail for someone else to reproduce the problem.
> The OP complains about the apparent inconsistent optimizing away of the SRF
> function call of crypto_secretbox_open in the lateral when the join
> predicate is known to be false.
Hmm, your crystal ball apparently works better than mine today,
because I sure do not see where that information was presented.
Hello all
Got a really weird problem with shortcut processing on one server.
We have just upgraded to PG14 from PG11. The following code works as expected on our primary Dev server, and we recently upgraded our QA server to the same level. However in this case the shortcut processing seems broken.
Here is the code in question:
LEFT JOIN lateral (
SELECT
CASE WHEN (ekey > 0) THEN convert_from(crypto_secretbox_open, 'utf8')::JSON ELSE NULL END AS edata
FROM crypto_secretbox_open(
sc.data,
sc.nonce,
boxkey)
) enc ON true and (ekey > 0)
[snip]
LEFT JOIN lateral (
SELECT
CASE WHEN (ekey > 0) THEN convert_from(crypto_secretbox_open, 'utf8')::JSON ELSE NULL END AS edata
FROM crypto_secretbox_open(
sc.data,
sc.nonce,
boxkey)
) enc ON (ekey > 0) and true
This should theoretically be no different – but it solves the issue 100%. Any guidance on why this would be the case?
Ron Johnson <ronljohnsonjr@gmail.com> writes: > Out of curiosity, what is the point of adding the "true" predicate no > matter the position? Maybe I've created an incorrect truth table, but > "true AND" (and "AND true") don't make any logical difference when added > to (ekey > 0)*.* Not only does it not make any logical difference, but it shouldn't make any practical difference either, because the useless "true" subclause will be thrown away very early in planning, before any decisions would be taken on the strength of what is in the ON clause. So I was skeptical that the details presented were even correct. I think whatever is causing the behavioral change is something else that the OP hasn't identified/controlled for. Hard to tell with such fragmentary details. regards, tom lane
> -----Original Message----- > From: Tom Lane <tgl@sss.pgh.pa.us> > Sent: Thursday, November 23, 2023 7:45 PM > To: Ron Johnson <ronljohnsonjr@gmail.com> > Cc: pgsql-generallists.postgresql.org <pgsql-general@lists.postgresql.org> > Subject: Re: Odd Shortcut behaviour in PG14 > > Ron Johnson <ronljohnsonjr@gmail.com> writes: > > Out of curiosity, what is the point of adding the "true" predicate no > > matter the position? Maybe I've created an incorrect truth table, but > > "true AND" (and "AND true") don't make any logical difference when > > added to (ekey > 0)*.* > > Not only does it not make any logical difference, but it shouldn't make any > practical difference either, because the useless "true" > subclause will be thrown away very early in planning, before any decisions > would be taken on the strength of what is in the ON clause. > So I was skeptical that the details presented were even correct. > I think whatever is causing the behavioral change is something else that the OP > hasn't identified/controlled for. Hard to tell with such fragmentary details. > > regards, tom lane > Apologies but I am struggling to make this happen in isolation. A few things: The true was an error - we previously did not have the AND so needed the true. The devs added the AND of ekey but shouldhave removed the true. We have done that now. There is a difference between the PG versions. The Dev server which works is 14.9 and the QA server which fails is 14.10 Looking at the application logs this function is being called once per display row - it is running successfully around 10times with the same input params. When it fails, it is with the same params! All I can say for sure is that the first 10it ignores the lateral join as the (ekey > 0) evaluates to (0>0). However on the next invocation, for whatever reason,even though ekey is still 0, it decides to evaluate the lateral join and fails. If we change the code from (ekey > 0) to (0 > 0) - it always works!! So the planner seems to be making different choicesbased on other functions being called higher up in the chain. We need a way of telling the planner not to run thelateral join if ekey=0. Sorry I can't provide a testable case - have not been able to reproduce just running this one function multiple times ina loop. Thx Z
> -----Original Message-----
> From: Tom Lane <tgl@sss.pgh.pa.us>
> Sent: Thursday, November 23, 2023 7:45 PM
> To: Ron Johnson <ronljohnsonjr@gmail.com>
> Cc: pgsql-generallists.postgresql.org <pgsql-general@lists.postgresql.org>
> Subject: Re: Odd Shortcut behaviour in PG14
>
> Ron Johnson <ronljohnsonjr@gmail.com> writes:
> > Out of curiosity, what is the point of adding the "true" predicate no
> > matter the position? Maybe I've created an incorrect truth table, but
> > "true AND" (and "AND true") don't make any logical difference when
> > added to (ekey > 0)*.*
>
> Not only does it not make any logical difference, but it shouldn't make any
> practical difference either, because the useless "true"
> subclause will be thrown away very early in planning, before any decisions
> would be taken on the strength of what is in the ON clause.
> So I was skeptical that the details presented were even correct.
> I think whatever is causing the behavioral change is something else that the OP
> hasn't identified/controlled for. Hard to tell with such fragmentary details.
>
> regards, tom lane
>
Apologies but I am struggling to make this happen in isolation. A few things:
The true was an error - we previously did not have the AND so needed the true. The devs added the AND of ekey but should have removed the true. We have done that now.
There is a difference between the PG versions. The Dev server which works is 14.9 and the QA server which fails is 14.10
Looking at the application logs this function is being called once per display row - it is running successfully around 10 times with the same input params. When it fails, it is with the same params! All I can say for sure is that the first 10 it ignores the lateral join as the (ekey > 0) evaluates to (0>0). However on the next invocation, for whatever reason, even though ekey is still 0, it decides to evaluate the lateral join and fails.
If we change the code from (ekey > 0) to (0 > 0) - it always works!! So the planner seems to be making different choices based on other functions being called higher up in the chain. We need a way of telling the planner not to run the lateral join if ekey=0.
Sorry I can't provide a testable case - have not been able to reproduce just running this one function multiple times in a loop.
Zahir Lalani <ZahirLalani@oliver.agency> writes: > Looking at the application logs this function is being called once per > display row - it is running successfully around 10 times with the same > input params. When it fails, it is with the same params! You *still* haven't defined what you mean by "fails". We can't help you effectively with such tiny dribs and drabs of information. At the very least I'd like to see the whole query, because the fragment you've shown us does not reveal what ekey is or why you think the system should believe that it is or is not zero. But it's also unclear why that should matter. Having said that ... if the statement is being executed with a cached plan (via a named statement, or PREPARE, or inside plpgsql) then maybe the problem occurs if the plan switches from custom to generic? If so, messing with the plan_cache_mode setting might provide a workaround. regards, tom lane
> -----Original Message----- > From: Tom Lane <tgl@sss.pgh.pa.us> > Sent: Friday, November 24, 2023 3:35 PM > To: Zahir Lalani <ZahirLalani@oliver.agency> > Cc: Ron Johnson <ronljohnsonjr@gmail.com>; pgsql- > generallists.postgresql.org <pgsql-general@lists.postgresql.org> > Subject: Re: Odd Shortcut behaviour in PG14 > You *still* haven't defined what you mean by "fails". We can't help you > effectively with such tiny dribs and drabs of information. > At the very least I'd like to see the whole query, because the fragment you've > shown us does not reveal what ekey is or why you think the system should > believe that it is or is not zero. > But it's also unclear why that should matter. > > Having said that ... if the statement is being executed with a cached plan (via a > named statement, or PREPARE, or inside plpgsql) then maybe the problem > occurs if the plan switches from custom to generic? > If so, messing with the plan_cache_mode setting might provide a workaround. > > regards, tom lane Sorry Tom - let me try and clarify: (ekey is a variable passed into the function) RETURN QUERY SELECT CASE WHEN (ekey = 0) THEN person.first_name ELSE (enc.edata->>'firstname')::text END AS first_name, CASE WHEN (ekey = 0) THEN person.last_name ELSE (enc.edata->>'lastname')::text END AS last_name, CASE WHEN (ekey = 0) THEN person.email ELSE (enc.edata->>'emailaddress')::text END AS email, CASE WHEN (ekey = 0) THEN person.first_name || ' ' || person.last_name ELSE (enc.edata->>'firstname')::text ||' ' || (enc.edata->>'lastname')::text END AS full_name FROM "public".person AS person -- join to decrypt person data LEFT OUTER JOIN "public".secure AS sc ON CASE WHEN (ekey > 0) THEN sc.fk_org_id = org_id and sc.fk_entity_id = 6 and sc.fk_entity_obj_id = person.id ELSEfalse END LEFT JOIN lateral ( SELECT CASE WHEN (ekey > 0) THEN convert_from(crypto_secretbox_open, 'utf8')::JSON ELSE NULL END AS edata FROM crypto_secretbox_open( sc.data, sc.nonce, boxkey) ) enc ON (ekey > 0) WHERE person.id = person_id ----------------------- We have certain clients with encrypted data and this query needs to apply the decryption only if the record is deemed tobe encrypted - this is determined by the variable ekey being > 0. The "failure" is the invocation of the lateral join whenekey is 0 - i.e. its not encrypted data. The crypto function will fail if sc.data is null - as will be the case for anon-encrypted record. Our assumption - probably incorrectly it seems, is that if the expression (ekey > 0) is false, PG will not invoke the lateraljoin and hence not execute crypto. This has worked in PG11 and worked in 14.9. What is very odd is that in 14.10 thisstill seems to work in many cases, but not all. This particular function works in psql, and sometimes in the applicationbut not always. The parameters are identical for when it works as expected and for when it does not - i.e, whenit invokes the lateral even though ekey = 0 To prove that the issue lies in this area alone, we have split the above function and run specific code for the two casesof ekey which guarantees that it will never do the unexpected - this fixes the issue, but I would still like to knowif our fundamental assumption about when the lateral runs was misguided Hope that gives more context Z
Zahir Lalani <ZahirLalani@oliver.agency> writes: > Sorry Tom - let me try and clarify: > (ekey is a variable passed into the function) OK, so if this is a plpgsql function and ekey is a function variable, the planner will definitely perceive this as a query parameterized by the value of "ekey". We will consider a "custom" plan where the value is directly substituted into the query (allowing plan-time folding based on whether ekey is zero or not), but we will also consider a "generic" plan where the value of ekey is not known at plan time so no such folding occurs, and that's probably where your failure is happening. Replanning for every query execution is expensive so there's a preference for using generic plans if we can. I don't really understand why you wrote > SELECT > CASE WHEN (ekey > 0) THEN convert_from(crypto_secretbox_open, 'utf8')::JSON ELSE NULL END AS edata > FROM crypto_secretbox_open( > sc.data, > sc.nonce, > boxkey) rather than just SELECT CASE WHEN (ekey > 0) THEN convert_from( crypto_secretbox_open(sc.data, sc.nonce, boxkey), 'utf8')::JSON ELSE NULL END AS edata I see no reason why you should feel entitled to assume that crypto_secretbox_open won't get called in the first formulation. The normal understanding of such a SELECT is that we evaluate FROM and then apply the SELECT expressions to its result, so the existence of a CASE in the SELECT expression doesn't cause the function call in FROM to get bypassed. Likewise, the fact that the JOIN ON condition is false seems like a poor reason to assume that the join's input relation won't get evaluated. Another approach could be to force matters in the plpgsql logic: IF ekey > 0 THEN RETURN QUERY query-with-decryption; ELSE RETURN QUERY query-without-decryption; END IF; which seems a good deal safer than relying on undocumented details of planner optimization behavior. I also wonder why you don't make crypto_secretbox_open a bit more robust --- at the very least mark it strict (RETURNS NULL ON NULL INPUT). regards, tom lane
> -----Original Message----- > From: Tom Lane <tgl@sss.pgh.pa.us> > Sent: Friday, November 24, 2023 6:44 PM > To: Zahir Lalani <ZahirLalani@oliver.agency> > Cc: Ron Johnson <ronljohnsonjr@gmail.com>; pgsql- > generallists.postgresql.org <pgsql-general@lists.postgresql.org> > Subject: Re: Odd Shortcut behaviour in PG14 > > OK, so if this is a plpgsql function and ekey is a function variable, the planner > will definitely perceive this as a query parameterized by the value of "ekey". > We will consider a "custom" plan where the value is directly substituted into > the query (allowing plan-time folding based on whether ekey is zero or not), > but we will also consider a "generic" plan where the value of ekey is not known > at plan time so no such folding occurs, and that's probably where your failure > is happening. Replanning for every query execution is expensive so there's a > preference for using generic plans if we can. > > I don't really understand why you wrote > > > SELECT > > CASE WHEN (ekey > 0) THEN convert_from(crypto_secretbox_open, > 'utf8')::JSON ELSE NULL END AS edata > > FROM crypto_secretbox_open( > > sc.data, > > sc.nonce, > > boxkey) > > rather than just > > SELECT > CASE WHEN (ekey > 0) THEN convert_from( > crypto_secretbox_open(sc.data, > sc.nonce, > boxkey), > 'utf8')::JSON ELSE NULL END AS edata > > I see no reason why you should feel entitled to assume that > crypto_secretbox_open won't get called in the first formulation. > The normal understanding of such a SELECT is that we evaluate FROM and > then apply the SELECT expressions to its result, so the existence of a CASE in > the SELECT expression doesn't cause the function call in FROM to get > bypassed. > > Likewise, the fact that the JOIN ON condition is false seems like a poor reason > to assume that the join's input relation won't get evaluated. > > Another approach could be to force matters in the plpgsql logic: > > IF ekey > 0 THEN > RETURN QUERY query-with-decryption; > ELSE > RETURN QUERY query-without-decryption; > END IF; > > which seems a good deal safer than relying on undocumented details of > planner optimization behavior. > > I also wonder why you don't make crypto_secretbox_open a bit more robust - > -- at the very least mark it strict (RETURNS NULL ON NULL INPUT). > > regards, tom lane Thank you for the detailed explanation Tom - much appreciated. So our assumption was made as it worked for so long - ok - so we have learnt not to do that again.... > IF ekey > 0 THEN > RETURN QUERY query-with-decryption; > ELSE > RETURN QUERY query-without-decryption; > END IF; This is indeed what we have done - we were trying to keep code maintenance down as the above requires updates in two placesrather than 1 - but better that than a broken system! > I also wonder why you don't make crypto_secretbox_open a bit more robust - > -- at the very least mark it strict (RETURNS NULL ON NULL INPUT). This is a third party plugin (pgsodium). Thank you to this group for your expertise! Z