Thread: Re: Improve pg_re_throw: check if sigjmp_buf is valid and report error

On Mon, Aug 19, 2024 at 2:17 AM Xiaoran Wang <fanfuxiaoran@gmail.com> wrote:
> If the code in PG_TRY contains any non local control flow other than
> ereport(ERROR) like goto, break etc., the PG_CATCH or PG_END_TRY cannot
> be called, then the PG_exception_stack will point to the memory whose
> stack frame has been released. So after that, when the pg_re_throw
> called, __longjmp() will crash and report Segmentation fault error.
>
>  Addition to sigjmp_buf, add another field 'int magic' which is next to
>  the sigjum_buf in the local stack frame memory. The magic's value is always
>  'PG_exception_magic 0x12345678'. And in 'pg_re_throw' routine, check if
>  the magic's value is still '0x12345678', if not, that means the memory
>  where the 'PG_exception_stack' points to has been released, and the 'sigbuf'
>  must be invalid.
>
>   The related code is in patch 0001

This is an interesting idea. I suspect if we do this we want a
different magic number and a different error message than what you
chose here, but those are minor details.

I'm not sure how reliable this would be at actually finding problems,
though. It doesn't seem guaranteed that the magic number will get
overwritten if you do something wrong; it's just a possibility. Maybe
that's still useful enough that we should adopt this, but I'm not
sure.

Personally, I don't think I've ever made this particular mistake. I
think a much more common usage error is exiting the catch-block
without either throwing an error or rolling back a subtransaction. But
YMMV, of course.

--
Robert Haas
EDB: http://www.enterprisedb.com



Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Aug 19, 2024 at 2:17 AM Xiaoran Wang <fanfuxiaoran@gmail.com> wrote:
>> If the code in PG_TRY contains any non local control flow other than
>> ereport(ERROR) like goto, break etc., the PG_CATCH or PG_END_TRY cannot
>> be called, then the PG_exception_stack will point to the memory whose
>> stack frame has been released. So after that, when the pg_re_throw
>> called, __longjmp() will crash and report Segmentation fault error.
>>
>> Addition to sigjmp_buf, add another field 'int magic' which is next to
>> the sigjum_buf in the local stack frame memory. The magic's value is always
>> 'PG_exception_magic 0x12345678'. And in 'pg_re_throw' routine, check if
>> the magic's value is still '0x12345678', if not, that means the memory
>> where the 'PG_exception_stack' points to has been released, and the 'sigbuf'
>> must be invalid.

> This is an interesting idea. I suspect if we do this we want a
> different magic number and a different error message than what you
> chose here, but those are minor details.

I would suggest just adding an Assert; I doubt this check would be
useful in production.

> I'm not sure how reliable this would be at actually finding problems,
> though.

Yeah, that's the big problem.  I don't have any confidence at all
that this would detect misuse.  It'd require that the old stack
frame gets overwritten, which might not happen for a long time,
and it'd require that somebody eventually do a longjmp, which again
might not happen for a long time --- and when those did happen, the
error would be detected in someplace far away from the actual bug,
with little evidence remaining to help you localize it.

Also, as soon as some outer level of PG_TRY is exited in the proper
way, the dangling stack pointer will be fixed up.  That means there's
a fairly narrow time frame in which the overwrite and longjmp must
happen for this to catch a bug.

So on the whole I doubt this'd be terribly useful in this form;
and I don't like the amount of code churn required.

> Personally, I don't think I've ever made this particular mistake. I
> think a much more common usage error is exiting the catch-block
> without either throwing an error or rolling back a subtransaction. But
> YMMV, of course.

We have had multiple instances of code "return"ing out of a PG_TRY,
so I fully agree that some better way to detect that would be good.
But maybe we ought to think about static analysis for that.

            regards, tom lane





Tom Lane <tgl@sss.pgh.pa.us> 于2024年8月19日周一 22:12写道:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Aug 19, 2024 at 2:17 AM Xiaoran Wang <fanfuxiaoran@gmail.com> wrote:
>> If the code in PG_TRY contains any non local control flow other than
>> ereport(ERROR) like goto, break etc., the PG_CATCH or PG_END_TRY cannot
>> be called, then the PG_exception_stack will point to the memory whose
>> stack frame has been released. So after that, when the pg_re_throw
>> called, __longjmp() will crash and report Segmentation fault error.
>>
>> Addition to sigjmp_buf, add another field 'int magic' which is next to
>> the sigjum_buf in the local stack frame memory. The magic's value is always
>> 'PG_exception_magic 0x12345678'. And in 'pg_re_throw' routine, check if
>> the magic's value is still '0x12345678', if not, that means the memory
>> where the 'PG_exception_stack' points to has been released, and the 'sigbuf'
>> must be invalid.

> This is an interesting idea. I suspect if we do this we want a
> different magic number and a different error message than what you
> chose here, but those are minor details.

I would suggest just adding an Assert; I doubt this check would be
useful in production.

Agree, an Assert is enough. 

> I'm not sure how reliable this would be at actually finding problems,
> though.

Yeah, that's the big problem.  I don't have any confidence at all
that this would detect misuse.  It'd require that the old stack
frame gets overwritten, which might not happen for a long time,
and it'd require that somebody eventually do a longjmp, which again
might not happen for a long time --- and when those did happen, the
error would be detected in someplace far away from the actual bug,
with little evidence remaining to help you localize it.

Exactly, it cannot tell you which PG_TRY left the invalid sigjmp_buf,
but to implement that is easy I think, recording the line num maybe.

I think this is still useful, at least  it tell you that the error is due to the
PG_TRY.

Also, as soon as some outer level of PG_TRY is exited in the proper
way, the dangling stack pointer will be fixed up.  That means there's
a fairly narrow time frame in which the overwrite and longjmp must
happen for this to catch a bug.


Yes, if the outer level PG_TRY call pg_re_throw instead of ereport, 
the dangling stack pointer will be fixed up.  It's would be great to fix that
up in any case. But I have no idea how to implement that now.

In pg_re_throw, if we could use '_local_sigjmp_buf' instead of the
global var PG_exception_stack, that would be great. We don't
need to worry about the invalid sigjum_buf.

So on the whole I doubt this'd be terribly useful in this form;
and I don't like the amount of code churn required.

> Personally, I don't think I've ever made this particular mistake. I
> think a much more common usage error is exiting the catch-block
> without either throwing an error or rolling back a subtransaction. But
> YMMV, of course.

We have had multiple instances of code "return"ing out of a PG_TRY,
so I fully agree that some better way to detect that would be good.
But maybe we ought to think about static analysis for that.

                        regards, tom lane


--
Best regards !
Xiaoran Wang


Xiaoran Wang <fanfuxiaoran@gmail.com> 于2024年8月20日周二 11:32写道:


Tom Lane <tgl@sss.pgh.pa.us> 于2024年8月19日周一 22:12写道:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Aug 19, 2024 at 2:17 AM Xiaoran Wang <fanfuxiaoran@gmail.com> wrote:
>> If the code in PG_TRY contains any non local control flow other than
>> ereport(ERROR) like goto, break etc., the PG_CATCH or PG_END_TRY cannot
>> be called, then the PG_exception_stack will point to the memory whose
>> stack frame has been released. So after that, when the pg_re_throw
>> called, __longjmp() will crash and report Segmentation fault error.
>>
>> Addition to sigjmp_buf, add another field 'int magic' which is next to
>> the sigjum_buf in the local stack frame memory. The magic's value is always
>> 'PG_exception_magic 0x12345678'. And in 'pg_re_throw' routine, check if
>> the magic's value is still '0x12345678', if not, that means the memory
>> where the 'PG_exception_stack' points to has been released, and the 'sigbuf'
>> must be invalid.

> This is an interesting idea. I suspect if we do this we want a
> different magic number and a different error message than what you
> chose here, but those are minor details.

I would suggest just adding an Assert; I doubt this check would be
useful in production.

Agree, an Assert is enough. 

> I'm not sure how reliable this would be at actually finding problems,
> though.

Yeah, that's the big problem.  I don't have any confidence at all
that this would detect misuse.  It'd require that the old stack
frame gets overwritten, which might not happen for a long time,
and it'd require that somebody eventually do a longjmp, which again
might not happen for a long time --- and when those did happen, the
error would be detected in someplace far away from the actual bug,
with little evidence remaining to help you localize it.

Exactly, it cannot tell you which PG_TRY left the invalid sigjmp_buf,
but to implement that is easy I think, recording the line num maybe.

I think this is still useful, at least  it tell you that the error is due to the
PG_TRY.

Also, as soon as some outer level of PG_TRY is exited in the proper
way, the dangling stack pointer will be fixed up.  That means there's
a fairly narrow time frame in which the overwrite and longjmp must
happen for this to catch a bug.


Yes, if the outer level PG_TRY call pg_re_throw instead of ereport, 
the dangling stack pointer will be fixed up.  It's would be great to fix that
up in any case. But I have no idea how to implement that now.


Sorry,  "Yes, if the outer level PG_TRY call pg_re_throw instead of ereport, " should
be "Yes, if the outer level PG_TRY  reset the PG_exception_stack."

In pg_re_throw, if we could use '_local_sigjmp_buf' instead of the
global var PG_exception_stack, that would be great. We don't
need to worry about the invalid sigjum_buf.

So on the whole I doubt this'd be terribly useful in this form;
and I don't like the amount of code churn required.

> Personally, I don't think I've ever made this particular mistake. I
> think a much more common usage error is exiting the catch-block
> without either throwing an error or rolling back a subtransaction. But
> YMMV, of course.

We have had multiple instances of code "return"ing out of a PG_TRY,
so I fully agree that some better way to detect that would be good.
But maybe we ought to think about static analysis for that.

                        regards, tom lane


--
Best regards !
Xiaoran Wang


--
Best regards !
Xiaoran Wang
Xiaoran Wang <fanfuxiaoran@gmail.com> writes:
>> Yeah, that's the big problem.  I don't have any confidence at all
>> that this would detect misuse.  It'd require that the old stack
>> frame gets overwritten, which might not happen for a long time,
>> and it'd require that somebody eventually do a longjmp, which again
>> might not happen for a long time --- and when those did happen, the
>> error would be detected in someplace far away from the actual bug,
>> with little evidence remaining to help you localize it.

> Exactly, it cannot tell you which PG_TRY left the invalid sigjmp_buf,
> but to implement that is easy I think, recording the line num maybe.

I don't think you get to assume that the canary word gets overwritten
but debug data a few bytes away survives.

            regards, tom lane





Tom Lane <tgl@sss.pgh.pa.us> 于2024年8月20日周二 11:44写道:
Xiaoran Wang <fanfuxiaoran@gmail.com> writes:
>> Yeah, that's the big problem.  I don't have any confidence at all
>> that this would detect misuse.  It'd require that the old stack
>> frame gets overwritten, which might not happen for a long time,
>> and it'd require that somebody eventually do a longjmp, which again
>> might not happen for a long time --- and when those did happen, the
>> error would be detected in someplace far away from the actual bug,
>> with little evidence remaining to help you localize it.

> Exactly, it cannot tell you which PG_TRY left the invalid sigjmp_buf,
> but to implement that is easy I think, recording the line num maybe.

I don't think you get to assume that the canary word gets overwritten
but debug data a few bytes away survives.

We can have a global var like 'PG_exception_debug'  to save the debug info,
not saved in the  local stack frame. 

1. Before PG_TRY, we can save the debug info as 'save_debug_info',  just like 
'_save_exception_stack', but not a pointer, memory copy the info into the
'_save_debug_info'.  
2. In PG_TRY, set the new debug info into the global var 'PG_exception_debug'
3. And in PG_CATCH and PG_END_TRY, we can restore it.
So that the debug info will not be overwritten.

> It doesn't seem guaranteed that the magic number will get
> overwritten if you do something wrong;

That's my concern too.  How about besides checking  if the 'PG_exception_stack->magic'
is overwritten, also compare the address of  'PG_exception_stack->buf' and current
stack top address? if the address of 'PG_exception_stack->buf' is lower than current
stack top address, it must be invalid, otherwise , it must be overwritten. Just like below

int  stack_top;
if (PG_exception_stack->magic <= &stack_top || PG_exception_stack->magic != PG_exception_magic)
    ereport(FATAL,
                 (errmsg("Invalid sigjum_buf, code in PG_TRY cannot contain"
                                " any non local control flow other than ereport")));

I'm not sure if this can work, any thoughts?




                        regards, tom lane


--
Best regards !
Xiaoran Wang
Hi

On Mon, Aug 19, 2024 at 10:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> We have had multiple instances of code "return"ing out of a PG_TRY,
> so I fully agree that some better way to detect that would be good.
> But maybe we ought to think about static analysis for that.

I have some static analysis scripts for detecting this kind of problem
(of mis-using PG_TRY). Not sure if my scripts are helpful here but I
would like to share them.

- A clang plugin for detecting unsafe control flow statements in
PG_TRY. https://github.com/higuoxing/clang-plugins/blob/main/lib/ReturnInPgTryBlockChecker.cpp
- Same as above, but in CodeQL[^1] script.
https://github.com/higuoxing/postgres.ql/blob/main/return-in-PG_TRY.ql
- A CodeQL script for detecting the missing of volatile qualifiers
(objects have been changed between the setjmp invocation and longjmp
call should be qualified with volatile).
https://github.com/higuoxing/postgres.ql/blob/main/volatile-in-PG_TRY.ql

Andres also has some compiler hacking to detect return statements in PG_TRY[^2].

[^1]: https://codeql.github.com/
[^2]: https://www.postgresql.org/message-id/20230113054900.b7onkvwtkrykeu3z%40awork3.anarazel.de