Thread: CLOBBER_CACHE_ALWAYS regression instability

CLOBBER_CACHE_ALWAYS regression instability

From

Tom Lane

Date:

05 April 2020, 18:33:19

Over in the thread at [1] we were wondering how come buildfarm member
hyrax has suddenly started to fail like this:

diff -U3 /home/buildfarm/buildroot/HEAD/pgsql.build/src/test/regress/expected/errors.out
/home/buildfarm/buildroot/HEAD/pgsql.build/src/test/regress/results/errors.out
--- /home/buildfarm/buildroot/HEAD/pgsql.build/src/test/regress/expected/errors.out    2018-11-21 13:48:48.340000000
-0500
+++ /home/buildfarm/buildroot/HEAD/pgsql.build/src/test/regress/results/errors.out    2020-04-04 04:48:16.704699045
-0400
@@ -446,4 +446,4 @@
 'select infinite_recurse()' language sql;
 \set VERBOSITY terse
 select infinite_recurse();
-ERROR:  stack depth limit exceeded
+ERROR:  stack depth limit exceeded at character 8

I've now looked into this, and found that it's not at all hard to
duplicate; compile HEAD with -DCLOBBER_CACHE_ALWAYS, and run "select
infinite_recurse()", and you'll likely get the changed error message.
(The lack of other buildfarm failures is probably just because we
have so few animals doing CLOBBER_CACHE_ALWAYS builds frequently.)

The issue seems indeed to have been triggered by 8f59f6b9c0, because
that inserted an equal() call into recomputeNamespacePath().  equal()
includes a check_stack_depth() call.  We get the observed message if
this call is the one where the stack limit is hit, and it is invoked
inside ParseFuncOrColumn(), which has set up a parser error callback
to point at the infinite_recurse() call that it's trying to resolve.
That callback's been there a long time of course, so we may conclude
that no other code path reached from ParseFuncOrColumn contains a
stack depth check, or we'd likely have seen this before.

It's a bit surprising perhaps that we run out of stack here and not
somewhere else; but ParseFuncOrColumn and its subroutines consume
quite a lot of stack, because of FUNC_MAX_ARGS-sized local arrays,
so it's not *that* surprising.

So, what to do to re-stabilize the regression tests?  Even if
we wanted to revert 8f59f6b9c0, that'd be kind of a band-aid fix,
because there are lots of other possible ways that a parser error
callback could be active at the point of the error.  A few other
possibilities are:

1. Change the test to do "\set VERBOSITY sqlstate" so that all that
is printed is
    ERROR:  54001
ERRCODE_STATEMENT_TOO_COMPLEX is used in few enough places that
this wouldn't be too much of a loss of specificity.  (Or we could
give stack overflow its very own ERRCODE.)

2. Hack pcb_error_callback so that it suppresses the error position
report for ERRCODE_STATEMENT_TOO_COMPLEX, as it already does
for ERRCODE_QUERY_CANCELED.  That seems pretty unpleasant though.

3. Create a separate expected-file to match the variant output.
This would be a maintenance problem, but we could ameliorate that
by moving the test to its own regression script, which was something
that'd already been proposed to get around the PPC64 Linux kernel
signal-handling bug that's been causing intermittent failures on
most of the PPC64 buildfarm animals [2].

On the whole I find #1 the least unpleasant, as well as the most
likely to forestall future variants of this issue.  It won't dodge
the PPC64 problem, but I'm willing to keep living with that one
for now.

Thoughts?

            regards, tom lane

[1] https://www.postgresql.org/message-id/CA%2BTgmoaUOS5X64nKgFxNV7JHN4sRkNAJYW2gHz-LMb0Ej4xHig%40mail.gmail.com
[2] https://www.postgresql.org/message-id/27924.1571068231%40sss.pgh.pa.us

Re: CLOBBER_CACHE_ALWAYS regression instability

From

Andres Freund

Date:

05 April 2020, 18:57:49

Hi,

On 2020-04-05 14:33:19 -0400, Tom Lane wrote:
> It's a bit surprising perhaps that we run out of stack here and not
> somewhere else; but ParseFuncOrColumn and its subroutines consume
> quite a lot of stack, because of FUNC_MAX_ARGS-sized local arrays,
> so it's not *that* surprising.
> 
> So, what to do to re-stabilize the regression tests?  Even if
> we wanted to revert 8f59f6b9c0, that'd be kind of a band-aid fix,
> because there are lots of other possible ways that a parser error
> callback could be active at the point of the error.  A few other
> possibilities are:

> 
> 1. Change the test to do "\set VERBOSITY sqlstate" so that all that
> is printed is
>     ERROR:  54001
> ERRCODE_STATEMENT_TOO_COMPLEX is used in few enough places that
> this wouldn't be too much of a loss of specificity.  (Or we could
> give stack overflow its very own ERRCODE.)

We could print the error using :LAST_ERROR_MESSAGE after removing a
potential trailing "at character ..." if we're worried about the loss of
specificity.


> On the whole I find #1 the least unpleasant, as well as the most
> likely to forestall future variants of this issue.  It won't dodge
> the PPC64 problem, but I'm willing to keep living with that one
> for now.

Another avenue could be to make ParseFuncOrColumn et al use less stack,
and hope that it avoids the problem. It's a bit insane that we use this
much.


We don't have to go there in this case, but I've before wondered about
adding helpers that use an on-stack var for small allocations, and falls
back to palloc otherwise. Something boiling down to:

    Oid actual_arg_types_stack[3];
    Oid *actual_arg_types;

    if (list_length(fargs) <= lengthof(actual_arg_types_stack))
        actual_arg_types = actual_arg_types_stack;
    else
        actual_arg_types = palloc(sizeof(*actual_arg_types) * list_length(fargs))

Greetings,

Andres Freund

Re: CLOBBER_CACHE_ALWAYS regression instability

From

Tom Lane

Date:

05 April 2020, 19:04:30

Andres Freund <andres@anarazel.de> writes:
> Another avenue could be to make ParseFuncOrColumn et al use less stack,
> and hope that it avoids the problem. It's a bit insane that we use this
> much.

That would only reduce the chance of getting a stack overflow there,
and not by that much, especially not for a CLOBBER_CACHE_ALWAYS animal
which is going to be doing catalog accesses inside there too.

> We don't have to go there in this case, but I've before wondered about
> adding helpers that use an on-stack var for small allocations, and falls
> back to palloc otherwise. Something boiling down to:

Seems like that adds a lot of potential for memory leakage?

            regards, tom lane

Re: CLOBBER_CACHE_ALWAYS regression instability

From

Andres Freund

Date:

05 April 2020, 19:21:30

Hi,

On 2020-04-05 15:04:30 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Another avenue could be to make ParseFuncOrColumn et al use less stack,
> > and hope that it avoids the problem. It's a bit insane that we use this
> > much.
> 
> That would only reduce the chance of getting a stack overflow there,
> and not by that much, especially not for a CLOBBER_CACHE_ALWAYS animal
> which is going to be doing catalog accesses inside there too.

It'd certainly not be bullet proof. But I don't think we ever were? If I
understood you correctly we were just not noticing the stack overflow
danger before? We did catalog accesses from within there before too,
that's not changed by the addition of equal(), no?

Reminds me: I'll try to dust up my patch to make cache invalidation
processing non-recursive for 14 (I wrote an initial version as part of a
bugfix that we ended up fixing differently). Besides making
CLOBBER_CACHE_ALWAYS vastly less expensive, it also reduces the cost of
logical decoding substantially.

> > We don't have to go there in this case, but I've before wondered about
> > adding helpers that use an on-stack var for small allocations, and falls
> > back to palloc otherwise. Something boiling down to:
> 
> Seems like that adds a lot of potential for memory leakage?

Depends on the case, I'd say. Certainly might be useful to add a helper
for a corresponding conditional free.

For parsing cases like this it could be better to bulk free at the
end. Compared to the memory needed for all the transformed arguments etc
it'd probably not matter in the short term (especially if only done for
4+ args).

Greetings,

Andres Freund

Re: CLOBBER_CACHE_ALWAYS regression instability

From

Tom Lane

Date:

05 April 2020, 19:38:29

Andres Freund <andres@anarazel.de> writes:
> On 2020-04-05 14:33:19 -0400, Tom Lane wrote:
>> 1. Change the test to do "\set VERBOSITY sqlstate" so that all that
>> is printed is
>> ERROR:  54001
>> ERRCODE_STATEMENT_TOO_COMPLEX is used in few enough places that
>> this wouldn't be too much of a loss of specificity.  (Or we could
>> give stack overflow its very own ERRCODE.)

> We could print the error using :LAST_ERROR_MESSAGE after removing a
> potential trailing "at character ..." if we're worried about the loss of
> specificity.

Oh, actually it seems that :LAST_ERROR_MESSAGE is already just the
primary message, without any "at character N" addon, so this would be
a very easy way to ameliorate that complaint.  ("at character N" is
added by libpq's pqBuildErrorMessage3 in TERSE mode, but psql does
not use that when filling LAST_ERROR_MESSAGE.)

            regards, tom lane

Re: CLOBBER_CACHE_ALWAYS regression instability

From

Tom Lane

Date:

05 April 2020, 19:52:35

Andres Freund <andres@anarazel.de> writes:
> On 2020-04-05 15:04:30 -0400, Tom Lane wrote:
>> That would only reduce the chance of getting a stack overflow there,
>> and not by that much, especially not for a CLOBBER_CACHE_ALWAYS animal
>> which is going to be doing catalog accesses inside there too.

> It'd certainly not be bullet proof. But I don't think we ever were? If I
> understood you correctly we were just not noticing the stack overflow
> danger before? We did catalog accesses from within there before too,
> that's not changed by the addition of equal(), no?

Ah, you're right, the CCA aspect is not such a problem as long as there
are not check_stack_depth() calls inside the code that's run to load a
syscache or relcache entry.  Which there probably aren't, at least not
for system catalogs.  The reason we're seeing this on a CCA animal is
simply that a cache flush has occurred to force recomputeNamespacePath
to do some work.  (In theory it could happen on a non-CCA animal, given
unlucky timing of an sinval overrun.)

My point here though is that it's basically been blind luck that we've
not seen this before.  There's certainly no good reason to assume that
a check_stack_depth() call shouldn't happen while parsing a function
call, or within some other chunk of the parser that happens to set up
a transient error-position callback.  And it's only going to get more
likely in future, seeing for example my ambitions to extend the
executor so that run-time expression failures can also report error
cursors.  So I think that we should be looking for a permanent fix,
not a reduce-the-odds band-aid.

>> Seems like that adds a lot of potential for memory leakage?

> Depends on the case, I'd say. Certainly might be useful to add a helper
> for a corresponding conditional free.
> For parsing cases like this it could be better to bulk free at the
> end. Compared to the memory needed for all the transformed arguments etc
> it'd probably not matter in the short term (especially if only done for
> 4+ args).

What I wish we had was alloca(), so you don't need a FUNC_MAX_ARGS-sized
array to parse a two-argument function call.  Too bad C99 didn't add
that.  (But some sniffing around suggests that an awful lot of systems
have it anyway ... even MSVC.  Hmmm.)

            regards, tom lane

Re: CLOBBER_CACHE_ALWAYS regression instability

From

Andres Freund

Date:

05 April 2020, 20:44:04

On 2020-04-05 15:38:29 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > We could print the error using :LAST_ERROR_MESSAGE after removing a
> > potential trailing "at character ..." if we're worried about the loss of
> > specificity.
> 
> Oh, actually it seems that :LAST_ERROR_MESSAGE is already just the
> primary message, without any "at character N" addon, so this would be
> a very easy way to ameliorate that complaint.  ("at character N" is
> added by libpq's pqBuildErrorMessage3 in TERSE mode, but psql does
> not use that when filling LAST_ERROR_MESSAGE.)

Heh. I though it worked differently because I just had typed some
gibberish and got an error in :LAST_ERROR_MESSAGE that ended in "at or
near ...". But that's scan.l adding that explicitly...

Re: CLOBBER_CACHE_ALWAYS regression instability

From

Alvaro Herrera

Date:

05 April 2020, 23:54:19

On 2020-Apr-05, Tom Lane wrote:

> What I wish we had was alloca(), so you don't need a FUNC_MAX_ARGS-sized
> array to parse a two-argument function call.  Too bad C99 didn't add
> that.  (But some sniffing around suggests that an awful lot of systems
> have it anyway ... even MSVC.  Hmmm.)

Isn't it the case that you can create an inner block with a constant
whose size is determined by a containing block's variable?  I mean as in
the attached, which refuses to compile because of our -Werror=vla -- but
if I remove it, it compiles fine and works in my system.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

not-alloca.patch

Re: CLOBBER_CACHE_ALWAYS regression instability

From

Andres Freund

Date:

06 April 2020, 00:07:15

Hi,

On 2020-04-05 19:54:19 -0400, Alvaro Herrera wrote:
> Isn't it the case that you can create an inner block with a constant
> whose size is determined by a containing block's variable?  I mean as in
> the attached, which refuses to compile because of our -Werror=vla -- but
> if I remove it, it compiles fine and works in my system.

IIRC msvc doesn't support VLAs. And there's generally a slow push
towards deprecating them (they've e.g. been moved to optional in C11).

They don't tend to make a lot of sense for sizes that aren't tightly
bound. In contrast to palloc etc, there's no good way to catch
allocation errors. Most of the time you'll just get a SIGBUS or such,
but sometimes you'll just end up overwriting data (if the allocation is
large enough to not touch the guard pages).

Both alloca/vlas also add some per-call overhead.

Allocating the common size on-stack, and the uncommon ones on heap
should be cheaper, and handles the cases of large allocations much
better.

Greetings,

Andres Freund