Thread: [PATCH] Detect escape of ErrorContextCallback stack pointers (and from PG_TRY() )

[PATCH] Detect escape of ErrorContextCallback stack pointers (and from PG_TRY() )

From
Craig Ringer
Date:
Hi folks

The attached patch series adds support for detecting coding errors where a stack-allocated ErrorContextCallback is not popped from the error_context_stack before the variable leaves scope.

With the attached patches this code will emit a backtrace and abort() on cassert builds at the "return" statement where it leaks the stack pointer:

    ErrorContextCallback cb pg_errcontext_check();
    ...
    cb.previous = error_context_stack;
    error_context_stack = &cb;
    ...
    if (something)
        return; /* whoops! */
    ...
    error_context_stack = error_context_stack->previous;


Note the pgl_errcontext_check() in the above, which enables the check.

The infrastructure added by the patch could well be useful for other similar validation steps. In fact, I've added a similar validator that detects escapes from PG_TRY() blocks, so you get a neat assertion failure at the point of escape instead of a crash at a __longjmp() with a nonsensical stack. Example using a deliberately introduced mistake in postgres_fdw as a test:

Without the PG_TRY() patch:

#0  __longjmp () at ../sysdeps/x86_64/__longjmp.S:111
#1  0xc35a5d9e1a902cfc in ?? ()
Backtrace stopped: Cannot access memory at address 0xc0ea5d9e1a902cfc

With the PG_TRY() patch (abbreviated):

#3  0x0000000000abbd4b in pg_try_check_return_guard
#4  0x00007f8e2e20e41a in fetch_more_data
#5  0x00007f8e2e20a80a in postgresIterateForeignScan
...

Note that valgrind's memcheck, gcc's -fstack-protector, etc are all quite bad at detecting this class of error, so being able to do so automatically should be nice. I've had an interesting time chasing down error context stack mistakes a few times in the past

We unfortunately cannot use it for RAII-style coding because the underlying __attribute__((callback(..))) is not available on MSVC. As usual. But it remains useful for checking and assertions.

The patch is supplied as a small series:

(1)      Add some documentation on error context callback usage
(2)      Add pg_errcontext_check() to detect ErrorContextCallback escapes
(3)      Enable pg_errcontext_check() for all in-tree users
(4)      Document that generally PG_CATCH() should PG_RE_THROW()
(5)      Assert() if returning from inside a PG_TRY() / PG_CATCH() block
(6)      [RFC] Add a more succinct API for error context stack callbacks
(7)      [RFC] Use __has_attribute for compiler attribute detection where possible

I also added a patch (6) on top as a RFC of sorts, which adds a more succinct API for error context setup and teardown. It's not required for these changes, but it might make sense to adopt it at the same time if we're changing usage across the tree anyway.

A further patch (7) switches over to using __has_attribute instead of explicit compiler version checks. Again, optional, more of a RFC.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise
Attachment
Craig Ringer <craig@2ndquadrant.com> writes:
> The attached patch series adds support for detecting coding errors where a
> stack-allocated ErrorContextCallback is not popped from the
> error_context_stack before the variable leaves scope.

So my immediate thoughts about this are

(1) It's mighty invasive for what it accomplishes.  AFAIK we have
had few of this class of bug, and so I'm not excited about touching
every callback use in the tree to add defenses.  It's also not great
that future code additions won't be protected unless they remember
to add these magic annotations.  The PG_TRY application is better since
it's wrapped into the existing macro.

(2) I don't like that this is adding runtime overhead to try to detect
such bugs.

(3) I think it's a complete failure that such a bug will only be
detected if the faulty code path is actually taken.  I think it's
quite likely that any such bugs that are lurking are in "can't
happen", or at least hard-to-hit, corner cases --- if they were in
routinely tested paths, we'd have noticed them by now.  So it'd be far
more helpful if we had a static-analysis way to detect such issues.

Thinking about (3), I wonder whether there's a way to instruct Coverity
to detect such errors.

            regards, tom lane



On Thu, 3 Sep 2020 at 22:28, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Craig Ringer <craig@2ndquadrant.com> writes:
> The attached patch series adds support for detecting coding errors where a
> stack-allocated ErrorContextCallback is not popped from the
> error_context_stack before the variable leaves scope.

So my immediate thoughts about this are

(1) It's mighty invasive for what it accomplishes.  AFAIK we have
had few of this class of bug, and so I'm not excited about touching
every callback use in the tree to add defenses.  It's also not great
that future code additions won't be protected unless they remember
to add these magic annotations.  The PG_TRY application is better since
it's wrapped into the existing macro.

I agree with you there. I'd actually like to change how we set up errcontext callbacks anyway, as I think they're ugly, verbose and error-prone.

See patch 6 for a RFC on that.

With regards to PG_TRY() you may note that patch 5 adds a similar check to detect escapes from PG_TRY() / PG_CATCH() bodies.

(2) I don't like that this is adding runtime overhead to try to detect
such bugs.

Right. Only in cassert builds, though.

Frankly I'd be happy enough even having it as something I can turn on when I wanted it. I've had a heck of a time tracking down errcontext escapes in the past. I wrote it for an extension, then figured I'd submit it to core and see if anyone wanted it.

Even if we don't adopt it in PostgreSQL, now it's out there to help out others debugging similar issues.

(3) I think it's a complete failure that such a bug will only be
detected if the faulty code path is actually taken.  I think it's
quite likely that any such bugs that are lurking are in "can't
happen", or at least hard-to-hit, corner cases --- if they were in
routinely tested paths, we'd have noticed them by now.  So it'd be far
more helpful if we had a static-analysis way to detect such issues.

Thinking about (3), I wonder whether there's a way to instruct Coverity
to detect such errors.

I think calling it a "complete failure" is ridiculous. By the same rationale, we shouldn't bother with Assert(...) at all. But I agree that it's definitely far from as good as having a statically verifiable check would be, and this *should* be something we can statically verify.

I actually had a pretty good look around for static analysis options to see if I could find anything that might help us out before I landed up with this approach.

The sticking point is the API we use. By using auto stack variables (and doing so in pure C where they cannot have destructors) and using direct assignments to globals, we cannot use the majority of static analysis annotations since they tend to be aimed at function-based APIs. And for some reason most static analysis tools appear to be poor at discovering escapes of pointers to stack variables leaving scope, at least unless you use annotated functions that record ownership transfers. Which we can't because ... direct assignment.

It's one of the reasons I want to wrap the errcontext API per patch 6 in the above series. The current API is extremely verbose, hard to validate and check, and difficult to apply static analysis to.

If we had error context setup/teardown macros we could implement them using static inline functions and annotate them as appropriate for the target toolchain and available static analysis tools.

So what do you think of patch 6?

I'll try tweaking the updated API to add annotateable functions and see if that helps static analysis tools detect issues, then report back.

I personally think it'd be well worth adopting a wrapped API and simultaneously renaming ErrorContextCallback to ErrorContextCallbackData to ensure that code must be updated to compile with the changes. It'd be simple to provide a backwards compatibility header that extension authors can copy into their trees so they can use the new API when building against old postgres, greatly reducing the impact on extension authors. (We do that sort of thing constantly in pglogical and BDR).

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise
On Fri, 4 Sep 2020 at 14:13, Craig Ringer <craig@2ndquadrant.com> wrote:

I actually had a pretty good look around for static analysis options to see if I could find anything that might help us out before I landed up with this approach.



Using a test program:

so ... that's interesting. I'll need to do some checking and verify that it's effective on the actual problem I originally had, but if so, I shall proceed with kicking myself now.

Handily, the same thing can be used to detect PG_TRY() escapes.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise
On Fri, 4 Sep 2020 at 14:55, Craig Ringer <craig@2ndquadrant.com> wrote:
 

Example here:


So I find that actually, the __attribute__((callback(fn)) approach is unnecessary for the purpose proposed.



I still think we should be looking at tidying up the error contexts API, but it's easy enough for extensions to provide a wrapper over it, so that's not a big deal really.

I will submit my docs patches separately to ensure they get some attention, and I'll experiment with llvm threadsafety annotations / capabilities.




--
 Craig Ringer                   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise
Craig Ringer <craig@2ndquadrant.com> writes:
> Example here:
> https://github.com/ringerc/scrapcode/tree/master/c/clang_return_stack_checks
> So I find that actually, the __attribute__((callback(fn)) approach is
> unnecessary for the purpose proposed.

I tested this by injecting some faults of the described sort.

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index db7d24a511..eaf7716816 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -3025,6 +3025,8 @@ CopyFrom(CopyState cstate)

                        myslot = CopyMultiInsertInfoNextFreeSlot(&multiInsertInfo,

resultRelInfo);
+                       if (!myslot)
+                         return 0;
                }

                /*

leads to

/home/tgl/pgsql/src/backend/commands/copy.c:3029:6: warning: Address of stack memory associated with local variable
'errcallback'is still referred to by the global variable 'error_context_stack' upon returning to the caller.  This will
bea dangling reference 
                          return 0;
                          ^~~~~~~~

So that's good.  However, a similar experiment with returning from inside
a PG_TRY does *not* produce a warning.  I suspect maybe the compiler
throws up its hands when it sees sigsetjmp?

(These results from clang 10.0.0 on Fedora 32.)

            regards, tom lane



On Tue, 8 Sep 2020 at 10:56, Tom Lane <tgl@sss.pgh.pa.us> wrote:

So that's good.  However, a similar experiment with returning from inside
a PG_TRY does *not* produce a warning.  I suspect maybe the compiler
throws up its hands when it sees sigsetjmp?

I find that odd myself, given that in PG_TRY() we:

    PG_exception_stack = &_local_sigjmp_buf;

and a return within that block should count as a pointer to _local_sigjmp_buf escaping. Yet I confirm that I don't get a warning on clang-analyzer 10.0.0 (fedora 32) in my builds either. Even with -O0.

TL;DR:
-------

It's not due to sigsetjmp() at all though; see https://github.com/ringerc/scrapcode/blob/master/c/clang_return_stack_checks/mask_local_escape.c for part of what's going on.

I haven't worked out why sometimes an unrelated stack escape gets hidden though.

Attached patches demonstrate tests and are DEFINITELY NOT proposals for core!

SUMMARY
-----

I found per test details below (links and demo patches supplied, patches NOT proposed for inclusion in pg) that scan-build seems to be able to track a local escape via simple indirection such as copying the address to a local first, but cannot track it when it is assigned to a struct member in a global pointer if that global pointer is then replaced, even if it's restored immediately.


This produces no warnings from scan-build but leaks a pointer to an auto-variable on the stack, then dereferences it to read garbage. I deliberately clobber the stack to ensure that it's predictable garbage, so we get:

$ make run_mask_local_escape
./mask_local_escape
&g = 0x7fff1b31e6bc
&g has escaped: 0x7fff1b31e6bc
value is: 0x7f7f7f7f

I think we're using a similar pattern for PG_error_stack, and that's why we don't get any warnings.

In this case valgrind detects the issue when the bogus data is actually read. Run "valgrind ./mask_local_escape". But I'm pretty sure I've seen other cases where it does not.

So I'm wondering if the explicit __attribute__((callback(cbfn))) approach has some merits after all, for cassert builds. Assuming we care about this error, since it's a bit of an academic exercise at this point. I just needed to understand what was going on and why there was no warning.


DETAILS OF TESTS
-----

If I wrap the sigjmp_buf in a struct defined in Pg itself and allow that to escape I see the same behaviour, so it's not specific to sigjmp_buf.

I tried adding a dummy guard variable to PG_TRY() and PG_END_TRY() solely for the purpose of tracking this sort of escape. But I notice that a warning from scan-build is only emitted in PG_CATCH() or PG_FINALLY(), never from inside the PG_TRY() body. I'd expect to see a warning from either branch, as it's escaping either way.

scan-build will raise other warnings from within the PG_CATCH() block, such as if I introduce a

+       int foo = 0;
+       fprintf(stdout, "%d", 10/foo);

so it's not that scan-build as a whole is failing to take the sigsetjmp() == 0  branch. It raises *other* warnings. This can also be verified by replacing sigsetjmp() with a dummy static inline function that always takes one branch or the other.

I landed up writing the experiment code at https://github.com/ringerc/scrapcode/blob/master/c/clang_return_stack_checks/ to explore the behaviour some more.

The file sigjmp_escape.c explores various combinations of guard variables and early returns. The file sigjmp_escape_hdr.c and its companions sigjmp_escape_hdr_try.{c,h} check whether indirection via separate header and compilation unit makes any difference (it didn't).

Initially I found that in my PG_TRY()-alike test code I got a warning like:

sigjmp_escape_hdr.c:36:4: warning: Address of stack memory associated with local variable '_local_sigjmp_buf' is still referred to by the global variable 'TEST_exception_stack' upon returning to the caller.  This will be a dangling reference
                        return;
                        ^~~~~~

just like I want to see from inside a PG_TRY() block.

But with some further testing I noticed that it seems the checker may not notice escapes where there's even the simplest level of indirection, and also that one candidate escape may mask another. I'm not totally sure yet.

In the sigjmp_escape.c test, if I build it with both inner and outer guards (-DUSE_INNER_GUARDS), scan-build will only complain about the inner guard pointer g_branch_1 escaping. It will not complain about the stack pointer to the outer guard escaping, nor about the possibility of should_return_early causing the sigjmp_buf to escape into the global sigjmp_buf * jbuf. Yet running

    ./sigjmp_escape 1 1

shows that both can occur.

This still doesn't explain how it *also* misses the escape of &b into jbuf. Is it losing track of the unrelated escape when it stops tracking the first one?

The test case sigjmp_escape_hdr doesn't attempt to use the guards, and it reports the sigjmp_buf escape fine as shown above.

If sigjmp_escape.c is built with -DUSE_INNER_GUARDS -DPOP_ONLY_INNER_GUARDS then it raises no warnings at all, BUT has both the sigjmp_buf leak AND a guard variable stack escape as seen by:

   ./sigjmp_escape_inner_pop 1 1

We've seen above that it loses the outer guard pointer escape because it doesn't track any sort of indirection of assignments.

What's a bit surprising is that it *also* fails to complain about the escape of the sigjmp_buf pointer &b into the global variable jbuf in this case, even though it successfully detects the escape with other combinations of guard uses. It's like it can't track multiple things at once properly.

RELATED: detecting return from PG_FINALLY()
--------

Note that even if scan-build did detect the escape of the sigjmp_buf, returning from PG_FINALLY() is a coding error that we cannot detect this way since we've restored PG_errror_stack so there's no local auto ptr to escape. To do that we'd need our own stack that we pushed in PG_TRY() and didn't pop until PG_END_TRY(), like in the demo patch attached. That works, but adds another local to each PG_TRY() that's otherwise useless. If we really felt like it we could use it to produce a poor-man's backtrace through each PG_TRY() at the cost of one sizeof(void) per PG_TRY() + some constdata storage, but that seems like a solution looking for a problem.

Also, AFAICS it's actually harmless, if ugly, to return from PG_CATCH(). It should still be discouraged.


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise
Attachment