Thread: ubsan

ubsan

From
Andres Freund
Date:
Hi,

I tried to run postgres with ubsan to debug something. I ran into two main
issues:


1) Despite Tom's recent efforts in [1], I see two ubsan failures in
   HEAD.

   These are easy enough to fix as per the attached, although the fix for the
   GetConfigOptionByNum() isn't great - we should probably pass a nulls array
   to GetConfigOptionByNum() as well, but that'd have a bunch of followup
   changes. So I'm inclined to go with the minimal for now.


2) When debugging issues I got very confused by the fact that *sometimes*
   UBSAN_OPTIONS takes effect and sometimes not. I was trying to have ubsan
   generate backtraces as well as a coredump with
   UBSAN_OPTIONS="print_stacktrace=1:disable_coredump=0:abort_on_error=1:verbosity=2"

   After a lot of debugging I figured out that the options took effect in
   postmaster, but not in any children. Which in turn is because
   set_ps_display() breaks /proc/$pid/environ - it's empty in all postmaster
   children for me.

   The sanitizer library use /proc/$pid/environ (from [2] to [3]), because
   they don't want to use getenv() because it wants to work without libc
   (whether that's the right way, i have my doubts).  When just using
   undefined and alignment sanitizers, the sanitizer library is only
   initialized when the first error occurs, by which time we've often already
   called set_ps_display().

   Note that ps_status.c breaks /proc/$pid/environ even if the
   update_process_title GUC is set to false, because init_ps_display() ignores
   that. Yes, that confused me for a while last night.


   The reason that /proc/$pid/environ is borken is fairly obvious: We
   overwrite it in set_ps_display() in the PS_USE_CLOBBER_ARGV case. The
   kernel just looks at the range set up originally, which we'll overwrite
   with zeroes.

   We could try telling the kernel about the new location of environ using
   prctl() PR_SET_MM_ENV_START/END but the restrictions around that sound
   problematic.


   I've included a hacky workaround: Define __ubsan_default_options, a weak
   symbol libsanitizer uses to get defaults from the application, and return
   getenv("UBSAN_OPTIONS"). But only if main already was reached, so that we
   don't end up relying on a not-yet-working getenv().


   This also lead me to finally figure out why /proc/$pid/cmdline of postgres
   children includes a lot of NULL bytes. I'd noticed this because tools like
   pidstat -l started to print long long status strings at some point, before
   it got fixed on the pidstat side.
   The way we overwrite doesn't trigger this special case in the kernel:
   https://github.com/torvalds/linux/blob/master/fs/proc/base.c#L296
   therefore the original size of the commandline arguments are printed, with
   a lot of boring zeroes.


A test run of this in ci, with the guc.c intentionally reintroduced, shows the
failure both via core dump
https://api.cirrus-ci.com/v1/task/6543164315009024/logs/cores.log
and
in postmaster's log:
https://api.cirrus-ci.com/v1/artifact/task/6543164315009024/log/src/test/regress/log/postmaster.log

although that part is a bit annoying to read, because the error is
interspersed with other log messages:

guc.c:9801:15: runtime error: null pointer passed as argument 2, which is declared to never be null
==19253==Using libbacktrace symbolizer.
2022-03-23 17:20:35.412 UTC [19258][client backend] [pg_regress/alter_operator][14/429:0] ERROR:  must be owner of
operator===
 
...
2022-03-23 17:20:35.601 UTC [19254][client backend] [pg_regress/alter_generic][10/1569:0] STATEMENT:  ALTER STATISTICS
alt_stat2OWNER TO regress_alter_generic_user2;
 
    #0 0x5626562b2ab4 in GetConfigOptionByNum /tmp/cirrus-ci-build/src/backend/utils/misc/guc.c:9801
    #1 0x5626562b3fd5 in show_all_settings /tmp/cirrus-ci-build/src/backend/utils/misc/guc.c:10137
2022-03-23 17:20:35.604 UTC [19254][client backend] [pg_regress/alter_generic][10/1576:0] ERROR:  must be owner of
statisticsobject alt_stat3
 
...
2022-03-23 17:20:35.601 UTC [19254][client backend] [pg_regress/alter_generic][10/1569:0] STATEMENT:  ALTER STATISTICS
alt_stat2OWNER TO regress_alter_generic_user2;
 
    #0 0x5626562b2ab4 in GetConfigOptionByNum /tmp/cirrus-ci-build/src/backend/utils/misc/guc.c:9801
    #1 0x5626562b3fd5 in show_all_settings /tmp/cirrus-ci-build/src/backend/utils/misc/guc.c:10137
2022-03-23 17:20:35.604 UTC [19254][client backend] [pg_regress/alter_generic][10/1576:0] ERROR:  must be owner of
statisticsobject alt_stat3
 
2022-03-23 17:20:35.604 UTC [19254][client backend] [pg_regress/alter_generic][10/1576:0] STATEMENT:  ALTER STATISTICS
alt_stat3RENAME TO alt_stat4;
 
    #2 0x562655c0ea86 in ExecMakeTableFunctionResult /tmp/cirrus-ci-build/src/backend/executor/execSRF.c:234
    #3 0x562655c3f8be in FunctionNext /tmp/cirrus-ci-build/src/backend/executor/nodeFunctionscan.c:95
2022-03-23 17:20:35.605 UTC [19254][client backend] [pg_regress/alter_generic][10/1578:0] ERROR:  must be owner of
statisticsobject alt_stat3
 
2022-03-23 17:20:35.605 UTC [19254][client backend] [pg_regress/alter_generic][10/1578:0] STATEMENT:  ALTER STATISTICS
alt_stat3OWNER TO regress_alter_generic_user2;
 
2022-03-23 17:20:35.606 UTC [19254][client backend] [pg_regress/alter_generic][10/1579:0] ERROR:  must be member of
role"regress_alter_generic_user3"
 
2022-03-23 17:20:35.606 UTC [19254][client backend] [pg_regress/alter_generic][10/1579:0] STATEMENT:  ALTER STATISTICS
alt_stat2OWNER TO regress_alter_generic_user3;
 
    #4 0x562655c10175 in ExecScanFetch /tmp/cirrus-ci-build/src/backend/executor/execScan.c:133
    #5 0x562655c10653 in ExecScan /tmp/cirrus-ci-build/src/backend/executor/execScan.c:199
    #6 0x562655c3f643 in ExecFunctionScan /tmp/cirrus-ci-build/src/backend/executor/nodeFunctionscan.c:270
2022-03-23 17:20:35.606 UTC [19254][client backend] [pg_regress/alter_generic][10/1580:0] ERROR:  must be owner of
statisticsobject alt_stat3
 
2022-03-23 17:20:35.606 UTC [19254][client backend] [pg_regress/alter_generic][10/1580:0] STATEMENT:  ALTER STATISTICS
alt_stat3SET SCHEMA alt_nsp2;
 
2022-03-23 17:20:35.606 UTC [19254][client backend] [pg_regress/alter_generic][10/1581:0] ERROR:  statistics object
"alt_stat2"already exists in schema "alt_nsp2"
 
2022-03-23 17:20:35.606 UTC [19254][client backend] [pg_regress/alter_generic][10/1581:0] STATEMENT:  ALTER STATISTICS
alt_stat2SET SCHEMA alt_nsp2;
 
    #7 0x562655c09bc9 in ExecProcNodeFirst /tmp/cirrus-ci-build/src/backend/executor/execProcnode.c:463
    #8 0x562655bf7580 in ExecProcNode ../../../src/include/executor/executor.h:259
    #9 0x562655bf7580 in ExecutePlan /tmp/cirrus-ci-build/src/backend/executor/execMain.c:1633
    #10 0x562655bf78b9 in standard_ExecutorRun /tmp/cirrus-ci-build/src/backend/executor/execMain.c:362


Greetings,

Andres Freund

[1] https://postgr.es/m/CALNJ-vT9r0DSsAOw9OXVJFxLENoVS_68kJ5x0p44atoYH+H4dg@mail.gmail.com
[2]
https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libsanitizer/ubsan/ubsan_flags.cpp;h=9a66bd37518b3a0606049b761ffdd7ddf3c3c714;hb=refs/heads/master#l68
[2]
https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libsanitizer/sanitizer_common/sanitizer_linux.cpp;h=aa59d9718ca89cc554bdf677df3e64ddd233ca59;hb=refs/heads/master#l559

Attachment

Re: ubsan

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> I tried to run postgres with ubsan to debug something.

For 0001, could we just replace configure's dlopen check with the
dlsym check?  Or are you afraid of reverse-case failures?

0002: ugh, but my only real complaint is that __ubsan_default_options
needs more than zero comment.  Also, it's not "our" getenv is it?

0003: OK.  Interesting though that we haven't seen these before.

0004: no opinion

            regards, tom lane



Re: ubsan

From
Andres Freund
Date:
Hi,

On 2022-03-23 13:54:50 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I tried to run postgres with ubsan to debug something.
> 
> For 0001, could we just replace configure's dlopen check with the
> dlsym check?  Or are you afraid of reverse-case failures?

Yea, I was worried about that. But now that I think more about it, it's hard
to believe something could provide / intercept dlsym but not dlopen. I guess
we can try and see?


> 0002: ugh, but my only real complaint is that __ubsan_default_options
> needs more than zero comment.

Yea, definitely. I am still hoping that somebody could see a better approach
than that ugly hack.

Haven't yet checked, but probably should also verify asan either doesn't have
the same problem or provide the same hack for ASAN_OPTIONS.


> Also, it's not "our" getenv is it?

Not really. "libc's getenv()"?


> 0003: OK.  Interesting though that we haven't seen these before.

I assume it's a question of library version and configure flags.

Looks like the fwrite nonnull case isn't actually due to the nonnull
attribute, but just fwrite() getting intercepted by the sanitizer
library. Looks like that was added starting in gcc 9 [1]

And the guc.c case presumably requires --enable-nls and a version of gettext
using the nonnull attribute?


Wonder if there's a few functions we should add nonnull to ourselves. Probably
would help "everyday compiler warnings", static analyzers, and ubsan.

Greetings,

Andres Freund

[1]
5d3805fca3e9 (Jakub Jelinek     2017-10-19 13:23:59 +0200  1151) #if SANITIZER_INTERCEPT_FWRITE
5d3805fca3e9 (Jakub Jelinek     2017-10-19 13:23:59 +0200  1152) INTERCEPTOR(SIZE_T, fwrite, const void *p, uptr size,
uptrnmemb, void *file) {
 
5d3805fca3e9 (Jakub Jelinek     2017-10-19 13:23:59 +0200  1153)   // libc file streams can call user-supplied
functions,see fopencookie.
 
5d3805fca3e9 (Jakub Jelinek     2017-10-19 13:23:59 +0200  1154)   void *ctx;
5d3805fca3e9 (Jakub Jelinek     2017-10-19 13:23:59 +0200  1155)   COMMON_INTERCEPTOR_ENTER(ctx, fwrite, p, size,
nmemb,file);
 
5d3805fca3e9 (Jakub Jelinek     2017-10-19 13:23:59 +0200  1156)   SIZE_T res = REAL(fwrite)(p, size, nmemb, file);
5d3805fca3e9 (Jakub Jelinek     2017-10-19 13:23:59 +0200  1157)   if (res > 0) COMMON_INTERCEPTOR_READ_RANGE(ctx, p,
res* size);
 
5d3805fca3e9 (Jakub Jelinek     2017-10-19 13:23:59 +0200  1158)   return res;
5d3805fca3e9 (Jakub Jelinek     2017-10-19 13:23:59 +0200  1159) }

$ git describe --tags 5d3805fca3e9
basepoints/gcc-8-3961-g5d3805fca3e



Re: ubsan

From
Andres Freund
Date:
Hi,

On 2022-03-23 11:21:37 -0700, Andres Freund wrote:
> On 2022-03-23 13:54:50 -0400, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > I tried to run postgres with ubsan to debug something.
> > 
> > For 0001, could we just replace configure's dlopen check with the
> > dlsym check?  Or are you afraid of reverse-case failures?
> 
> Yea, I was worried about that. But now that I think more about it, it's hard
> to believe something could provide / intercept dlsym but not dlopen. I guess
> we can try and see?
>
> > 0003: OK.  Interesting though that we haven't seen these before.

I think we should backpatch both, based on the reasoning in
46ab07ffda9d6c8e63360ded2d4568aa160a7700 ?

Greetings,

Andres Freund



Re: ubsan

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> I think we should backpatch both, based on the reasoning in
> 46ab07ffda9d6c8e63360ded2d4568aa160a7700 ?

Yeah, I suppose.  Is anyone going to step up and run a buildfarm
member with ubsan enabled?  (I'm already checking -fsanitize=alignment
on longfin, but it seems advisable to keep that separate from
-fsanitize=undefined.)

            regards, tom lane



Re: ubsan

From
Andres Freund
Date:
Hi,

On 2022-03-23 15:58:09 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I think we should backpatch both, based on the reasoning in
> > 46ab07ffda9d6c8e63360ded2d4568aa160a7700 ?
> 
> Yeah, I suppose.  Is anyone going to step up and run a buildfarm
> member with ubsan enabled?  (I'm already checking -fsanitize=alignment
> on longfin, but it seems advisable to keep that separate from
> -fsanitize=undefined.)

I'm planning to enable it on two of mine. Looks like gcc and clang find
slightly different things, so I was intending to enable it on one of each.

Greetings,

Andres Freund



Re: ubsan

From
Andres Freund
Date:
Hi,

On 2022-03-23 13:12:34 -0700, Andres Freund wrote:
> I'm planning to enable it on two of mine. Looks like gcc and clang find
> slightly different things, so I was intending to enable it on one of each.

Originally I'd planned to mix them into existing members, but I think it'd be
better to have dedicated ones. Applied for a few new buildfarm names for:
{gcc,clang}-{-fsanitize=undefined,-fsanitize=address}.


Running with asan found an existing use-after-free bug in pg_waldump (*), a bug in
dshash_seq_next() next that probably can't be hit in HEAD and a bug in my
shared memory stats patch. I count that as a success.

It's particularly impressive that the cost of running with ASAN is *so* much
lower than valgrind. On my workstation a check-world with
-fsanitize=alignment,undefined,address takes 3min17s, vs 1min10s or so without
-fsanitize.  Not something to always use, but certainly better than valgrind.

Greetings,

Andres Freund

(*) search_directory() uses fname = xlde->d_name after closedir(). Found in
pg_verifybackup.c's tests. Probably worth adding a few simple tests to
pg_waldump itself.



Re: ubsan

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> Running with asan found an existing use-after-free bug in pg_waldump (*), a bug in
> dshash_seq_next() next that probably can't be hit in HEAD and a bug in my
> shared memory stats patch. I count that as a success.

Nice!

            regards, tom lane



Re: ubsan

From
Noah Misch
Date:
On Wed, Mar 23, 2022 at 03:58:09PM -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I think we should backpatch both, based on the reasoning in
> > 46ab07ffda9d6c8e63360ded2d4568aa160a7700 ?
> 
> Yeah, I suppose.  Is anyone going to step up and run a buildfarm
> member with ubsan enabled?

thorntail has been running with UBSan since 2019.  I've removed flag
-fno-sanitize=nonnull-attribute, which your changes rendered superfluous.



Re: ubsan

From
David Steele
Date:
On 3/23/22 16:55, Andres Freund wrote:
> 
> It's particularly impressive that the cost of running with ASAN is *so* much
> lower than valgrind. On my workstation a check-world with
> -fsanitize=alignment,undefined,address takes 3min17s, vs 1min10s or so without
> -fsanitize.  Not something to always use, but certainly better than valgrind.

It also catches things that valgrind does not so that's a bonus.

One thing to note, though. I have noticed that when enabling 
-fsanitize=undefined and/or -fsanitize=address in combination with 
-fprofile-arcs -ftest-coverage there is a loss in reported coverage, at 
least on gcc 9.3. This may not be very obvious unless coverage is 
normally at 100%.

Regards,
-David



Re: ubsan

From
Andres Freund
Date:
Hi,

On 2022-03-23 15:55:28 -0700, Andres Freund wrote:
> Originally I'd planned to mix them into existing members, but I think it'd be
> better to have dedicated ones. Applied for a few new buildfarm names for:
> {gcc,clang}-{-fsanitize=undefined,-fsanitize=address}.

They're now enabled...

tamandua: gcc, -fsanitize=undefined,alignment
kestrel: clang, -fsanitize=undefined,alignment
grassquit: gcc, -fsanitize=address
olingo: clang, -fsanitize=address

The first three have started reporting in, the last is starting its first run
now.

Greetings,

Andres Freund



Re: ubsan

From
Andres Freund
Date:
Hi,

On 2022-03-23 13:54:50 -0400, Tom Lane wrote:
> 0002: ugh, but my only real complaint is that __ubsan_default_options
> needs more than zero comment.  Also, it's not "our" getenv is it?
> 
> 0004: no opinion

Attached is a rebased version of this patch. Hopefully with a reasonable
amount of comments?  I kind of wanted to add a comment to reached_main, but it
just seems to end up restating the variable name...

Greetings,

Andres Freund

Attachment

Re: ubsan

From
Andres Freund
Date:
Hi,

On 2022-09-29 18:17:55 -0700, Andres Freund wrote:
> Attached is a rebased version of this patch. Hopefully with a reasonable
> amount of comments?  I kind of wanted to add a comment to reached_main, but it
> just seems to end up restating the variable name...

I've now pushed a version of this with a few cleanups, mostly in
.cirrus.yml.  It'll be interesting to see how many additional problems
it finds via cfbot.

Greetings,

Andres Freund



Re: ubsan

From
Justin Pryzby
Date:
On Mon, Nov 21, 2022 at 03:15:03PM -0800, Andres Freund wrote:
> Hi,
> 
> On 2022-09-29 18:17:55 -0700, Andres Freund wrote:
> > Attached is a rebased version of this patch. Hopefully with a reasonable
> > amount of comments?  I kind of wanted to add a comment to reached_main, but it
> > just seems to end up restating the variable name...
> 
> I've now pushed a version of this with a few cleanups, mostly in

Thanks.  I'd meant to ask if there's a reason why you didn't use
meson -D sanitize ?

I recall seeing a bug which affected linking ..  maybe that's why ...

-- 
Justin



Re: ubsan

From
Andres Freund
Date:
Hi,

On November 21, 2022 3:42:38 PM PST, Justin Pryzby <pryzby@telsasoft.com> wrote:
>On Mon, Nov 21, 2022 at 03:15:03PM -0800, Andres Freund wrote:
>> Hi,
>>
>> On 2022-09-29 18:17:55 -0700, Andres Freund wrote:
>> > Attached is a rebased version of this patch. Hopefully with a reasonable
>> > amount of comments?  I kind of wanted to add a comment to reached_main, but it
>> > just seems to end up restating the variable name...
>>
>> I've now pushed a version of this with a few cleanups, mostly in
>
>Thanks.  I'd meant to ask if there's a reason why you didn't use
>meson -D sanitize ?
>
>I recall seeing a bug which affected linking ..  maybe that's why ...

Doesn't allow enabling multiple sanitizers (the PR for that might recently have been merged, but that doesn't help us
yet).We also need to add the no-recover flag anyway. So there doesn't seem to be much point. 

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: ubsan

From
Alexander Lakhin
Date:
Hello Andres,

22.11.2022 02:15, Andres Freund wrote:
> Hi,
>
> On 2022-09-29 18:17:55 -0700, Andres Freund wrote:
>> Attached is a rebased version of this patch. Hopefully with a reasonable
>> amount of comments?  I kind of wanted to add a comment to reached_main, but it
>> just seems to end up restating the variable name...
> I've now pushed a version of this with a few cleanups, mostly in
> .cirrus.yml.  It'll be interesting to see how many additional problems
> it finds via cfbot.

I've just discovered that that function __ubsan_default_options() is
incompatible with -fsanitize=hwaddress:
$ tmp_install/usr/local/pgsql/bin/postgres
Segmentation fault

Program received signal SIGSEGV, Segmentation fault.
0x000000555639e3ec in __hwasan_check_x0_0 ()
(gdb) bt
#0  0x000000555639e3ec in __hwasan_check_x0_0 ()
#1  0x000000555697b5a8 in __ubsan_default_options () at main.c:446
#2  0x0000005556367e48 in InitializeFlags ()
     at /home/builder/.termux-build/libllvm/src/compiler-rt/lib/hwasan/hwasan.cpp:133
#3  __hwasan_init () at /home/builder/.termux-build/libllvm/src/compiler-rt/lib/hwasan/hwasan.cpp:351
#4  0x0000007ff7f4929c in __dl__ZL13call_functionPKcPFviPPcS2_ES0_ () from /system/bin/linker64
#5  0x0000007ff7f4900c in __dl__ZL10call_arrayIPFviPPcS1_EEvPKcPT_mbS5_ () from /system/bin/linker64
#6  0x0000007ff7f45670 in __dl__ZL29__linker_init_post_relocationR19KernelArgumentBlockR6soinfo ()
    from /system/bin/linker64
#7  0x0000007ff7f449c8 in __dl___linker_init () from /system/bin/linker64
#8  0x0000007ff7f4b208 in __dl__start () from /system/bin/linker64

I use clang version 16.0.6, Target: aarch64-unknown-linux-android24.

With just 'return ""' in __ubsan_default_options(), I've managed to run
`make check` (there is also an issue with check_stack_depth(),
but that's another story)...

Best regards,
Alexander