Thread: BUG #17777: An assert failed in nodeWindowAgg.c
The following bug has been logged on the website: Bug reference: 17777 Logged by: Anban Company Email address: xinwen@stu.scu.edu.cn PostgreSQL version: 15.1 Operating system: Ubuntu 20.04 Description: When executing the following query,assert failed may be triggered, which may be related to RANDOM(): CREATE TABLE table0 ( column0 INT ) ; INSERT INTO table0 VALUES ( 1 ) , ( 2 ) , ( 3 ) , ( 4 ) , ( 5 ) , ( 6 ) , ( 7 ) , ( 8 ) , ( 9 ) , ( 10 ) ; SELECT WIDTH_BUCKET ( SUM ( 1 ) FILTER ( WHERE ( SELECT LAG ( TRUE , CAST ( RANDOM ( ) AS INT ) , column0 <= 1 ) OVER ( ) ) ) OVER ( PARTITION BY 1 ORDER BY column0 RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING ) , 1 , 2 , 3 ) FROM table0 ; I get a failed assertion with the following stacktrace: Core was generated by `postgres: postgres postgres [local] SELECT '. Program terminated with signal SIGABRT, Aborted. #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 #1 0x00007fab725fa859 in __GI_abort () at abort.c:79 #2 0x0000564bcbd68a88 in ExceptionalCondition (conditionName=conditionName@entry=0x564bcbec5788 "peraggstate->transValueCount > 0", errorType=errorType@entry=0x564bcbdc64a0 "FailedAssertion", fileName=fileName@entry=0x564bcbec56b8 "/home/postgres/postgresql-15.1/original_bin-15.1/../src/backend/executor/nodeWindowAgg.c", lineNumber=lineNumber@entry=475) at /home/postgres/postgresql-15.1/original_bin-15.1/../src/backend/utils/error/assert.c:69 #3 0x0000564bcbae93ad in advance_windowaggregate_base (winstate=0x564bcc969c20, perfuncstate=0x564bcc9a5610, peraggstate=0x564bcc9a56a8) at /home/postgres/postgresql-15.1/original_bin-15.1/../src/backend/executor/nodeWindowAgg.c:475 #4 0x0000564bcbaec69c in eval_windowaggregates (winstate=0x564bcc969c20) at /home/postgres/postgresql-15.1/original_bin-15.1/../src/backend/executor/nodeWindowAgg.c:833 #5 ExecWindowAgg (pstate=0x564bcc969c20) at /home/postgres/postgresql-15.1/original_bin-15.1/../src/backend/executor/nodeWindowAgg.c:2240 #6 0x0000564bcbaacc93 in ExecProcNode (node=0x564bcc969c20) at /home/postgres/postgresql-15.1/original_bin-15.1/../src/include/executor/executor.h:259 #7 ExecutePlan (execute_once=<optimized out>, dest=0x564bcc989240, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x564bcc969c20, estate=0x564bcc97ce10) at /home/postgres/postgresql-15.1/original_bin-15.1/../src/backend/executor/execMain.c:1636 #8 standard_ExecutorRun (queryDesc=0x564bcc8c35c0, direction=<optimized out>, count=0, execute_once=<optimized out>) at /home/postgres/postgresql-15.1/original_bin-15.1/../src/backend/executor/execMain.c:363 #9 0x0000564bcbc3fe5f in PortalRunSelect (portal=0x564bcc90f560, forward=<optimized out>, count=0, dest=<optimized out>) at /home/postgres/postgresql-15.1/original_bin-15.1/../src/backend/tcop/pquery.c:924 #10 0x0000564bcbc41431 in PortalRun (portal=portal@entry=0x564bcc90f560, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x564bcc989240, altdest=altdest@entry=0x564bcc989240, qc=0x7ffebb433880) at /home/postgres/postgresql-15.1/original_bin-15.1/../src/backend/tcop/pquery.c:768 #11 0x0000564bcbc3d202 in exec_simple_query ( query_string=0x564bcc8a2010 "SELECT WIDTH_BUCKET ( SUM ( 1 ) FILTER ( WHERE ( SELECT LAG ( TRUE , CAST ( RANDOM ( ) AS INT ) , column0 <= 1 ) OVER ( ) ) ) OVER ( PARTITION BY 1 ORDER BY column0 RANGE BETWEEN CURRENT ROW AND 1 FOL"...) at /home/postgres/postgresql-15.1/original_bin-15.1/../src/backend/tcop/postgres.c:1250 #12 0x0000564bcbc3ef8c in PostgresMain (dbname=<optimized out>, username=<optimized out>) at /home/postgres/postgresql-15.1/original_bin-15.1/../src/backend/tcop/postgres.c:4581 #13 0x0000564bcbbabe8a in BackendRun (port=<optimized out>, port=<optimized out>) at /home/postgres/postgresql-15.1/original_bin-15.1/../src/backend/postmaster/postmaster.c:4504 #14 BackendStartup (port=<optimized out>) at /home/postgres/postgresql-15.1/original_bin-15.1/../src/backend/postmaster/postmaster.c:4232 #15 ServerLoop () at /home/postgres/postgresql-15.1/original_bin-15.1/../src/backend/postmaster/postmaster.c:1806 #16 0x0000564bcbbacffb in PostmasterMain (argc=<optimized out>, argv=0x564bcc89c310) at /home/postgres/postgresql-15.1/original_bin-15.1/../src/backend/postmaster/postmaster.c:1478 #17 0x0000564bcb8d7630 in main (argc=3, argv=0x564bcc89c310) at /home/postgres/postgresql-15.1/original_bin-15.1/../src/backend/main/main.c:202 I also find this assert failed in 14.6, 13.9, 12.13 and 11.18 using the same statement.
Hi, On 2023-02-06 14:48:44 +0000, PG Bug reporting form wrote: > When executing the following query,assert failed may be triggered, which > may be related to RANDOM(): > > > CREATE TABLE table0 ( column0 INT ) ; > INSERT INTO table0 VALUES ( 1 ) , ( 2 ) , ( 3 ) , ( 4 ) , ( 5 ) , ( 6 ) , ( > 7 ) , ( 8 ) , ( 9 ) , ( 10 ) ; > SELECT WIDTH_BUCKET ( SUM ( 1 ) FILTER ( WHERE ( SELECT LAG ( TRUE , CAST ( > RANDOM ( ) AS INT ) , column0 <= 1 ) OVER ( ) ) ) OVER ( PARTITION BY 1 > ORDER BY column0 RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING ) , 1 , 2 , 3 ) > FROM table0 ; Yes, you're right that this is caused by the random(), partially. A simplified trigger of the problem is: SELECT SUM (1) -- triggers the problem, volatile function in subplan, not removed due to correlated subquery FILTER (WHERE ( SELECT COALESCE(random() < 0.5, i = 0))) -- doesn't trigger the problem --FILTER (WHERE random() < 0.5 ) OVER ( ORDER BY i RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING) FROM generate_series(0, 1000) g1(i); If you replace 1 FOLLOWING with UNBOUNDED FOLLOWING, this crashes on 9.4-10 too, just a bit less reliably. The problem is that normally we supress the moving aggregate optimization if a volatile function is contained in the filter. But unfortunately, contain_volatile_functions() doesn't descend into subplans. So we don't see the volatile expression. This is documented: * We will recursively look into Query nodes (i.e., SubLink sub-selects) * but not into SubPlans. This is a bit odd, but intentional. If we are * looking at a SubLink, we are probably deciding whether a query tree * transformation is safe, and a contained sub-select should affect that; * for example, duplicating a sub-select containing a volatile function * would be bad. However, once we've got to the stage of having SubPlans, * subsequent planning need not consider volatility within those, since * the executor won't change its evaluation rules for a SubPlan based on * volatility. Which seems, uh, a big assumption for something as generally named as contain_volatile_functions(). It makes sense for pushdown / pullup transformation, but in many, if not most, other cases, it seems pretty broken. Why is that safe for equivclass.c, indxpath.c, partprune.c, etc? Completely separately, but IMO the decision whether to use the moving aggregate optimization ought to be a plan time decision, rather than an execution time one... Greetings, Andres Freund
Hi, Sorry for sending this again, accidentally used an old email of David's. On 2023-02-06 14:48:44 +0000, PG Bug reporting form wrote: > When executing the following query,assert failed may be triggered, which > may be related to RANDOM(): > > > CREATE TABLE table0 ( column0 INT ) ; > INSERT INTO table0 VALUES ( 1 ) , ( 2 ) , ( 3 ) , ( 4 ) , ( 5 ) , ( 6 ) , ( > 7 ) , ( 8 ) , ( 9 ) , ( 10 ) ; > SELECT WIDTH_BUCKET ( SUM ( 1 ) FILTER ( WHERE ( SELECT LAG ( TRUE , CAST ( > RANDOM ( ) AS INT ) , column0 <= 1 ) OVER ( ) ) ) OVER ( PARTITION BY 1 > ORDER BY column0 RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING ) , 1 , 2 , 3 ) > FROM table0 ; Yes, you're right that this is caused by the random(), partially. A simplified trigger of the problem is: SELECT SUM (1) -- triggers the problem, volatile function in subplan, not removed due to correlated subquery FILTER (WHERE ( SELECT COALESCE(random() < 0.5, i = 0))) -- doesn't trigger the problem --FILTER (WHERE random() < 0.5 ) OVER ( ORDER BY i RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING) FROM generate_series(0, 1000) g1(i); If you replace 1 FOLLOWING with UNBOUNDED FOLLOWING, this crashes on 9.4-10 too, just a bit less reliably. The problem is that normally we supress the moving aggregate optimization if a volatile function is contained in the filter. But unfortunately, contain_volatile_functions() doesn't descend into subplans. So we don't see the volatile expression. This is documented: * We will recursively look into Query nodes (i.e., SubLink sub-selects) * but not into SubPlans. This is a bit odd, but intentional. If we are * looking at a SubLink, we are probably deciding whether a query tree * transformation is safe, and a contained sub-select should affect that; * for example, duplicating a sub-select containing a volatile function * would be bad. However, once we've got to the stage of having SubPlans, * subsequent planning need not consider volatility within those, since * the executor won't change its evaluation rules for a SubPlan based on * volatility. Which seems, uh, a big assumption for something as generally named as contain_volatile_functions(). It makes sense for pushdown / pullup transformation, but in many, if not most, other cases, it seems pretty broken. Why is that safe for equivclass.c, indxpath.c, partprune.c, etc? Completely separately, but IMO the decision whether to use the moving aggregate optimization ought to be a plan time decision, rather than an execution time one... Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > The problem is that normally we supress the moving aggregate optimization if a > volatile function is contained in the filter. But unfortunately, > contain_volatile_functions() doesn't descend into subplans. So we don't see > the volatile expression. I would say that if a volatile function in the argument crashes things, that's an executor bug. You won't get any sympathy from me for complaints about whether contain_volatile_functions noticed that, because *immutability markings can be lies*. It is not acceptable to crash if they're wrong. It looks to me like maybe we could just remove the Assert and do - if (peraggstate->transValueCount == 1) + if (peraggstate->transValueCount < 2) a few lines further down? I've not dug into the details though. regards, tom lane
Hi, On 2023-02-10 18:12:32 -0500, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > The problem is that normally we supress the moving aggregate optimization if a > > volatile function is contained in the filter. But unfortunately, > > contain_volatile_functions() doesn't descend into subplans. So we don't see > > the volatile expression. > > I would say that if a volatile function in the argument crashes things, > that's an executor bug. You won't get any sympathy from me for > complaints about whether contain_volatile_functions noticed that, > because *immutability markings can be lies*. It is not acceptable > to crash if they're wrong. This leads to wrong query results, with correctly labeled functions. I don't have a problem with bogus results if they're not correctly labeled, but they should be correct, if correctly labeled. Clearly we can't crash in production builds, due to a mislabeled function. I'm a bit more on the fence about whether assertion failures are ok. > It looks to me like maybe we could just remove the Assert and do > > - if (peraggstate->transValueCount == 1) > + if (peraggstate->transValueCount < 2) > > a few lines further down? I've not dug into the details though. I'd make it an ERROR, similar to this case: if (peraggstate->transValueIsNull) elog(ERROR, "aggregate transition value is NULL before inverse transition"); afaics once we got to this point, the query results are always bogus. It's not a reliable detection of mislabeled volatility, but still seems vastly better than knowingly returning wrong results. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2023-02-10 18:12:32 -0500, Tom Lane wrote: >> It looks to me like maybe we could just remove the Assert and do >> >> - if (peraggstate->transValueCount == 1) >> + if (peraggstate->transValueCount < 2) >> >> a few lines further down? I've not dug into the details though. > I'd make it an ERROR, similar to this case: > if (peraggstate->transValueIsNull) > elog(ERROR, "aggregate transition value is NULL before inverse transition"); > afaics once we got to this point, the query results are always bogus. It's not > a reliable detection of mislabeled volatility, but still seems vastly better > than knowingly returning wrong results. Yeah, after looking a little closer, the problem is that we may get inconsistent results about whether a row passes the filter between when we add the row to the trans state and when we remove from it. If transValueCount would go negative then we know that that happened, but it might happen without us being able to detect it. We could force a restart, but that would only fix the results going forward, not fix any bad values we emitted before detecting the problem (and there almost certainly were some). I'm content with replacing the Assert with something like /* * There should still be an added but not yet removed value; if there * isn't, we presumably got inconsistent results from the aggfilter * expression, probably due to an allegedly-immutable expression * delivering changing results. */ if (peraggstate->transValueCount <= 0) elog(ERROR, "aggregate inverse transition failed, probably due to volatile FILTER expression"); I might be more excited about it if there were a visible use-case for volatile filter expressions, but I can't see one. The presented test case is obviously just a fuzzer, not a useful query. regards, tom lane
Hi, On 2023-02-10 18:57:10 -0500, Tom Lane wrote: > I might be more excited about it if there were a visible use-case > for volatile filter expressions, but I can't see one. The presented > test case is obviously just a fuzzer, not a useful query. I don't care about the performance of such a query, but it doesn't seem great that the defense that we do have, doesn't work. It's not like we don't have a fallback execution path, it's just that we don't know that we have to use it. Do you think all other uses of contain_volatile_functions() (and it looks like contain_mutable_functions()) are fine with not detecting volatility in subplans? I don't think it's common, but I don't think it's crazy to have a volatile function somewhere within an aggregate. pg_current_wal_lsn(), clock_timestamp(), pg_relation_size() or such. I think we could just add a !contain_subplans() to the code deciding whether it's safe to use the movable window optimization? Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > I think we could just add a !contain_subplans() to the code deciding whether > it's safe to use the movable window optimization? Yeah, perhaps. That doesn't seem like a mainstream use-case either. Another idea, tying into your previous point, is to try to check contain_volatile_functions in the planner before we've reduced sublinks to subplans. I'm not sure that would be convenient to do though; subplan-conversion happens pretty early. (I'm quite hesitant to move the goalposts on what contain_volatile_functions detects. As that comment indicates, some thought has gone into its current behavior, and I think we might hit some unwanted side-effects if we change it.) regards, tom lane
Hi, On 2023-02-10 20:08:09 -0500, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > I think we could just add a !contain_subplans() to the code deciding whether > > it's safe to use the movable window optimization? > > Yeah, perhaps. That doesn't seem like a mainstream use-case either. I suspect we ought to backpatch a fix and compared to some other ideas, that seems not terribly invasive. > Another idea, tying into your previous point, is to try to check > contain_volatile_functions in the planner before we've reduced > sublinks to subplans. I'm not sure that would be convenient to do > though; subplan-conversion happens pretty early. Yea, that doesn't seem too promising. What I was referencing is that we already moved most aggregate processing to the planner, c.f. preprocess_aggref(), but we didn't do the same for window functions. Even though there's a lot of similar code there. To fix the bug, we could just do a minimal version of that, I think, and add a new field to WindowFunc, that we populate somewhere around optimize_window_clauses(). Perhaps we ought to add something similar to parallel_safe to SubPlan? > (I'm quite hesitant to move the goalposts on what > contain_volatile_functions detects. As that comment indicates, > some thought has gone into its current behavior, and I think > we might hit some unwanted side-effects if we change it.) Agreed. We could have a second "interface" function, using the same caller though. But afaics, without adding information to the SubPlan nodes, we can't really do better anyway? Greetings, Andres Freund
On Sat, 11 Feb 2023 at 13:49, Andres Freund <andres@anarazel.de> wrote: > I think we could just add a !contain_subplans() to the code deciding whether > it's safe to use the movable window optimization? I think this is a fair way to fix the bug. I think we're pretty unlikely to disappoint too many people by just disabling the inverse transitions when the filter has a subplan. I think it'll be rare that a WindowFunc would even have a filter, let alone one with a subplan. I think there are other reasons that a subplan might not return the same thing when it's executed again. Maybe a synchronous seq scan looking for some tuple with a subquery containing a LIMIT 1. If the sync scan started in a different place each time then some other tuple could be returned. So I think checking if the filter contains subplans is a good fix. Here's a patch for that. David
Attachment
David Rowley <dgrowleyml@gmail.com> writes: > On Sat, 11 Feb 2023 at 13:49, Andres Freund <andres@anarazel.de> wrote: >> I think we could just add a !contain_subplans() to the code deciding whether >> it's safe to use the movable window optimization? > I think this is a fair way to fix the bug. Agreed. > Here's a patch for that. Why is it okay to check only the filter, and not the rest of the WindowFunc's subexpressions? The arguments we've just run through seem to apply to a subplan in the direct or aggregated arguments as well. regards, tom lane
On Mon, 13 Feb 2023 at 05:19, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Why is it okay to check only the filter, and not the rest of the > WindowFunc's subexpressions? The arguments we've just run through > seem to apply to a subplan in the direct or aggregated arguments > as well. Good point. I had just been thinking in terms of the reported bug to make sure we inverse transition the same rows we transition. We also need to make sure the transition value matches in both transition directions. I've adjusted the patch accordingly. David
Attachment
Hi, On 2023-02-13 13:31:54 +1300, David Rowley wrote: > On Mon, 13 Feb 2023 at 05:19, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Why is it okay to check only the filter, and not the rest of the > > WindowFunc's subexpressions? The arguments we've just run through > > seem to apply to a subplan in the direct or aggregated arguments > > as well. > > Good point. I had just been thinking in terms of the reported bug to > make sure we inverse transition the same rows we transition. We also > need to make sure the transition value matches in both transition > directions. What about find_compatible_agg()? I don't think there's as severe consequences, but it also doesn't seem right as-is. Greetings, Andres Freund
David Rowley <dgrowleyml@gmail.com> writes: > On Mon, 13 Feb 2023 at 05:19, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Why is it okay to check only the filter, and not the rest of the >> WindowFunc's subexpressions? The arguments we've just run through >> seem to apply to a subplan in the direct or aggregated arguments >> as well. > Good point. I had just been thinking in terms of the reported bug to > make sure we inverse transition the same rows we transition. We also > need to make sure the transition value matches in both transition > directions. > I've adjusted the patch accordingly. Code looks good now, but the comment still claims this is only important in the FILTER clause. I'd rewrite the whole thing perhaps: * We also don't risk using moving aggregates when there are subplans * in the arguments or FILTER clause. This is partly because * contain_volatile_functions() doesn't look inside subplans; but * there are other reasons why a subplan's output might be volatile. * For example, syncscan mode can render the results nonrepeatable. regards, tom lane
On Mon, 13 Feb 2023 at 13:46, Andres Freund <andres@anarazel.de> wrote: > What about find_compatible_agg()? I don't think there's as severe > consequences, but it also doesn't seem right as-is. This does not seem related to the issue being reported here. It seems to me this would only affect the aggregate deduplication logic and nothing else. I agree it would be nice if we were more consistent with when volatile functions are evaluated and when they are not. There are also a bunch of weird issues with SRFs as I highlighted in [1]. The following shows another case where we change the volatile function evaluation. This one we do on purpose, which might or might not be good, depending on if you wanted the random() in the targetlist for the same purpose as the random() in the ORDER BY. # select random(), random(), random(); random | random | random ---------------------+---------------------+-------------------- 0.15295030404849452 | 0.34841025373624745 | 0.7195551260981239 # select random(), random(), random() order by random(); random | random | random --------------------+--------------------+-------------------- 0.8569677358482639 | 0.8569677358482639 | 0.8569677358482639 It was just last week I wrote a query to fetch a random row from a table by using ORDER BY random() LIMIT 1; I also needed to call random() somewhere else in the query for something unrelated but this deduplication behaviour caused that not to work the way I wanted it to. This caused my unrelated random() value to be consistently very low due to it being shared with the random() for the sort. I had to add a subquery for the 2nd random() to fix the problem. For the case you're complaining about, I don't think shifting the goalposts randomly in back-branches is a good idea. For the one you reported we might upset someone who has come to rely on the aggstate deduplication either for performance or for the number of volatile function evaluations. I'm not really on board with changing that unless someone highlights that it's causing an actual problem. David [1] https://postgr.es/m/CAApHDvqKO00nNQtchBs65VTfjEMUsYiB5r2P4VUreTdXE9RY1g@mail.gmail.com
On Mon, 13 Feb 2023 at 13:55, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Code looks good now, but the comment still claims this is only > important in the FILTER clause. I'd rewrite the whole thing > perhaps: > > * We also don't risk using moving aggregates when there are subplans > * in the arguments or FILTER clause. This is partly because > * contain_volatile_functions() doesn't look inside subplans; but > * there are other reasons why a subplan's output might be volatile. > * For example, syncscan mode can render the results nonrepeatable. That seems better, so I pushed it with that. Thanks. David
David Rowley <dgrowleyml@gmail.com> writes: > On Mon, 13 Feb 2023 at 13:46, Andres Freund <andres@anarazel.de> wrote: >> What about find_compatible_agg()? I don't think there's as severe >> consequences, but it also doesn't seem right as-is. > This does not seem related to the issue being reported here. It seems > to me this would only affect the aggregate deduplication logic and > nothing else. > For the case you're complaining about, I don't think shifting the > goalposts randomly in back-branches is a good idea. For the one you > reported we might upset someone who has come to rely on the aggstate > deduplication either for performance or for the number of volatile > function evaluations. I'm not really on board with changing that > unless someone highlights that it's causing an actual problem. Yeah. This behavior is at least semi-documented, IIRC. We need to fix moving-aggregate mode so it's not used when its assumptions don't hold, because there's no way that that doesn't lead to insane behavior. But redefining the deduplication rules seems way more likely to break existing queries than make anybody happy. regards, tom lane
Hi, On 2023-02-13 16:21:11 +1300, David Rowley wrote: > On Mon, 13 Feb 2023 at 13:46, Andres Freund <andres@anarazel.de> wrote: > > What about find_compatible_agg()? I don't think there's as severe > > consequences, but it also doesn't seem right as-is. > > This does not seem related to the issue being reported here. It seems > to me this would only affect the aggregate deduplication logic and > nothing else. Fair enough. I guess I was influenced by my thinking that we ought to combine more of the relevant logic between "plain" aggregates and window functions. Greetings, Andres Freund