Thread: CLOBBER_CACHE_ALWAYS regression instability
Over in the thread at [1] we were wondering how come buildfarm member hyrax has suddenly started to fail like this: diff -U3 /home/buildfarm/buildroot/HEAD/pgsql.build/src/test/regress/expected/errors.out /home/buildfarm/buildroot/HEAD/pgsql.build/src/test/regress/results/errors.out --- /home/buildfarm/buildroot/HEAD/pgsql.build/src/test/regress/expected/errors.out 2018-11-21 13:48:48.340000000 -0500 +++ /home/buildfarm/buildroot/HEAD/pgsql.build/src/test/regress/results/errors.out 2020-04-04 04:48:16.704699045 -0400 @@ -446,4 +446,4 @@ 'select infinite_recurse()' language sql; \set VERBOSITY terse select infinite_recurse(); -ERROR: stack depth limit exceeded +ERROR: stack depth limit exceeded at character 8 I've now looked into this, and found that it's not at all hard to duplicate; compile HEAD with -DCLOBBER_CACHE_ALWAYS, and run "select infinite_recurse()", and you'll likely get the changed error message. (The lack of other buildfarm failures is probably just because we have so few animals doing CLOBBER_CACHE_ALWAYS builds frequently.) The issue seems indeed to have been triggered by 8f59f6b9c0, because that inserted an equal() call into recomputeNamespacePath(). equal() includes a check_stack_depth() call. We get the observed message if this call is the one where the stack limit is hit, and it is invoked inside ParseFuncOrColumn(), which has set up a parser error callback to point at the infinite_recurse() call that it's trying to resolve. That callback's been there a long time of course, so we may conclude that no other code path reached from ParseFuncOrColumn contains a stack depth check, or we'd likely have seen this before. It's a bit surprising perhaps that we run out of stack here and not somewhere else; but ParseFuncOrColumn and its subroutines consume quite a lot of stack, because of FUNC_MAX_ARGS-sized local arrays, so it's not *that* surprising. So, what to do to re-stabilize the regression tests? Even if we wanted to revert 8f59f6b9c0, that'd be kind of a band-aid fix, because there are lots of other possible ways that a parser error callback could be active at the point of the error. A few other possibilities are: 1. Change the test to do "\set VERBOSITY sqlstate" so that all that is printed is ERROR: 54001 ERRCODE_STATEMENT_TOO_COMPLEX is used in few enough places that this wouldn't be too much of a loss of specificity. (Or we could give stack overflow its very own ERRCODE.) 2. Hack pcb_error_callback so that it suppresses the error position report for ERRCODE_STATEMENT_TOO_COMPLEX, as it already does for ERRCODE_QUERY_CANCELED. That seems pretty unpleasant though. 3. Create a separate expected-file to match the variant output. This would be a maintenance problem, but we could ameliorate that by moving the test to its own regression script, which was something that'd already been proposed to get around the PPC64 Linux kernel signal-handling bug that's been causing intermittent failures on most of the PPC64 buildfarm animals [2]. On the whole I find #1 the least unpleasant, as well as the most likely to forestall future variants of this issue. It won't dodge the PPC64 problem, but I'm willing to keep living with that one for now. Thoughts? regards, tom lane [1] https://www.postgresql.org/message-id/CA%2BTgmoaUOS5X64nKgFxNV7JHN4sRkNAJYW2gHz-LMb0Ej4xHig%40mail.gmail.com [2] https://www.postgresql.org/message-id/27924.1571068231%40sss.pgh.pa.us
Hi, On 2020-04-05 14:33:19 -0400, Tom Lane wrote: > It's a bit surprising perhaps that we run out of stack here and not > somewhere else; but ParseFuncOrColumn and its subroutines consume > quite a lot of stack, because of FUNC_MAX_ARGS-sized local arrays, > so it's not *that* surprising. > > So, what to do to re-stabilize the regression tests? Even if > we wanted to revert 8f59f6b9c0, that'd be kind of a band-aid fix, > because there are lots of other possible ways that a parser error > callback could be active at the point of the error. A few other > possibilities are: > > 1. Change the test to do "\set VERBOSITY sqlstate" so that all that > is printed is > ERROR: 54001 > ERRCODE_STATEMENT_TOO_COMPLEX is used in few enough places that > this wouldn't be too much of a loss of specificity. (Or we could > give stack overflow its very own ERRCODE.) We could print the error using :LAST_ERROR_MESSAGE after removing a potential trailing "at character ..." if we're worried about the loss of specificity. > On the whole I find #1 the least unpleasant, as well as the most > likely to forestall future variants of this issue. It won't dodge > the PPC64 problem, but I'm willing to keep living with that one > for now. Another avenue could be to make ParseFuncOrColumn et al use less stack, and hope that it avoids the problem. It's a bit insane that we use this much. We don't have to go there in this case, but I've before wondered about adding helpers that use an on-stack var for small allocations, and falls back to palloc otherwise. Something boiling down to: Oid actual_arg_types_stack[3]; Oid *actual_arg_types; if (list_length(fargs) <= lengthof(actual_arg_types_stack)) actual_arg_types = actual_arg_types_stack; else actual_arg_types = palloc(sizeof(*actual_arg_types) * list_length(fargs)) Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > Another avenue could be to make ParseFuncOrColumn et al use less stack, > and hope that it avoids the problem. It's a bit insane that we use this > much. That would only reduce the chance of getting a stack overflow there, and not by that much, especially not for a CLOBBER_CACHE_ALWAYS animal which is going to be doing catalog accesses inside there too. > We don't have to go there in this case, but I've before wondered about > adding helpers that use an on-stack var for small allocations, and falls > back to palloc otherwise. Something boiling down to: Seems like that adds a lot of potential for memory leakage? regards, tom lane
Hi, On 2020-04-05 15:04:30 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > Another avenue could be to make ParseFuncOrColumn et al use less stack, > > and hope that it avoids the problem. It's a bit insane that we use this > > much. > > That would only reduce the chance of getting a stack overflow there, > and not by that much, especially not for a CLOBBER_CACHE_ALWAYS animal > which is going to be doing catalog accesses inside there too. It'd certainly not be bullet proof. But I don't think we ever were? If I understood you correctly we were just not noticing the stack overflow danger before? We did catalog accesses from within there before too, that's not changed by the addition of equal(), no? Reminds me: I'll try to dust up my patch to make cache invalidation processing non-recursive for 14 (I wrote an initial version as part of a bugfix that we ended up fixing differently). Besides making CLOBBER_CACHE_ALWAYS vastly less expensive, it also reduces the cost of logical decoding substantially. > > We don't have to go there in this case, but I've before wondered about > > adding helpers that use an on-stack var for small allocations, and falls > > back to palloc otherwise. Something boiling down to: > > Seems like that adds a lot of potential for memory leakage? Depends on the case, I'd say. Certainly might be useful to add a helper for a corresponding conditional free. For parsing cases like this it could be better to bulk free at the end. Compared to the memory needed for all the transformed arguments etc it'd probably not matter in the short term (especially if only done for 4+ args). Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2020-04-05 14:33:19 -0400, Tom Lane wrote: >> 1. Change the test to do "\set VERBOSITY sqlstate" so that all that >> is printed is >> ERROR: 54001 >> ERRCODE_STATEMENT_TOO_COMPLEX is used in few enough places that >> this wouldn't be too much of a loss of specificity. (Or we could >> give stack overflow its very own ERRCODE.) > We could print the error using :LAST_ERROR_MESSAGE after removing a > potential trailing "at character ..." if we're worried about the loss of > specificity. Oh, actually it seems that :LAST_ERROR_MESSAGE is already just the primary message, without any "at character N" addon, so this would be a very easy way to ameliorate that complaint. ("at character N" is added by libpq's pqBuildErrorMessage3 in TERSE mode, but psql does not use that when filling LAST_ERROR_MESSAGE.) regards, tom lane
Andres Freund <andres@anarazel.de> writes: > On 2020-04-05 15:04:30 -0400, Tom Lane wrote: >> That would only reduce the chance of getting a stack overflow there, >> and not by that much, especially not for a CLOBBER_CACHE_ALWAYS animal >> which is going to be doing catalog accesses inside there too. > It'd certainly not be bullet proof. But I don't think we ever were? If I > understood you correctly we were just not noticing the stack overflow > danger before? We did catalog accesses from within there before too, > that's not changed by the addition of equal(), no? Ah, you're right, the CCA aspect is not such a problem as long as there are not check_stack_depth() calls inside the code that's run to load a syscache or relcache entry. Which there probably aren't, at least not for system catalogs. The reason we're seeing this on a CCA animal is simply that a cache flush has occurred to force recomputeNamespacePath to do some work. (In theory it could happen on a non-CCA animal, given unlucky timing of an sinval overrun.) My point here though is that it's basically been blind luck that we've not seen this before. There's certainly no good reason to assume that a check_stack_depth() call shouldn't happen while parsing a function call, or within some other chunk of the parser that happens to set up a transient error-position callback. And it's only going to get more likely in future, seeing for example my ambitions to extend the executor so that run-time expression failures can also report error cursors. So I think that we should be looking for a permanent fix, not a reduce-the-odds band-aid. >> Seems like that adds a lot of potential for memory leakage? > Depends on the case, I'd say. Certainly might be useful to add a helper > for a corresponding conditional free. > For parsing cases like this it could be better to bulk free at the > end. Compared to the memory needed for all the transformed arguments etc > it'd probably not matter in the short term (especially if only done for > 4+ args). What I wish we had was alloca(), so you don't need a FUNC_MAX_ARGS-sized array to parse a two-argument function call. Too bad C99 didn't add that. (But some sniffing around suggests that an awful lot of systems have it anyway ... even MSVC. Hmmm.) regards, tom lane
On 2020-04-05 15:38:29 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > We could print the error using :LAST_ERROR_MESSAGE after removing a > > potential trailing "at character ..." if we're worried about the loss of > > specificity. > > Oh, actually it seems that :LAST_ERROR_MESSAGE is already just the > primary message, without any "at character N" addon, so this would be > a very easy way to ameliorate that complaint. ("at character N" is > added by libpq's pqBuildErrorMessage3 in TERSE mode, but psql does > not use that when filling LAST_ERROR_MESSAGE.) Heh. I though it worked differently because I just had typed some gibberish and got an error in :LAST_ERROR_MESSAGE that ended in "at or near ...". But that's scan.l adding that explicitly...
On 2020-Apr-05, Tom Lane wrote: > What I wish we had was alloca(), so you don't need a FUNC_MAX_ARGS-sized > array to parse a two-argument function call. Too bad C99 didn't add > that. (But some sniffing around suggests that an awful lot of systems > have it anyway ... even MSVC. Hmmm.) Isn't it the case that you can create an inner block with a constant whose size is determined by a containing block's variable? I mean as in the attached, which refuses to compile because of our -Werror=vla -- but if I remove it, it compiles fine and works in my system. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Hi, On 2020-04-05 19:54:19 -0400, Alvaro Herrera wrote: > Isn't it the case that you can create an inner block with a constant > whose size is determined by a containing block's variable? I mean as in > the attached, which refuses to compile because of our -Werror=vla -- but > if I remove it, it compiles fine and works in my system. IIRC msvc doesn't support VLAs. And there's generally a slow push towards deprecating them (they've e.g. been moved to optional in C11). They don't tend to make a lot of sense for sizes that aren't tightly bound. In contrast to palloc etc, there's no good way to catch allocation errors. Most of the time you'll just get a SIGBUS or such, but sometimes you'll just end up overwriting data (if the allocation is large enough to not touch the guard pages). Both alloca/vlas also add some per-call overhead. Allocating the common size on-stack, and the uncommon ones on heap should be cheaper, and handles the cases of large allocations much better. Greetings, Andres Freund