Thread: ubsan
Hi, I tried to run postgres with ubsan to debug something. I ran into two main issues: 1) Despite Tom's recent efforts in [1], I see two ubsan failures in HEAD. These are easy enough to fix as per the attached, although the fix for the GetConfigOptionByNum() isn't great - we should probably pass a nulls array to GetConfigOptionByNum() as well, but that'd have a bunch of followup changes. So I'm inclined to go with the minimal for now. 2) When debugging issues I got very confused by the fact that *sometimes* UBSAN_OPTIONS takes effect and sometimes not. I was trying to have ubsan generate backtraces as well as a coredump with UBSAN_OPTIONS="print_stacktrace=1:disable_coredump=0:abort_on_error=1:verbosity=2" After a lot of debugging I figured out that the options took effect in postmaster, but not in any children. Which in turn is because set_ps_display() breaks /proc/$pid/environ - it's empty in all postmaster children for me. The sanitizer library use /proc/$pid/environ (from [2] to [3]), because they don't want to use getenv() because it wants to work without libc (whether that's the right way, i have my doubts). When just using undefined and alignment sanitizers, the sanitizer library is only initialized when the first error occurs, by which time we've often already called set_ps_display(). Note that ps_status.c breaks /proc/$pid/environ even if the update_process_title GUC is set to false, because init_ps_display() ignores that. Yes, that confused me for a while last night. The reason that /proc/$pid/environ is borken is fairly obvious: We overwrite it in set_ps_display() in the PS_USE_CLOBBER_ARGV case. The kernel just looks at the range set up originally, which we'll overwrite with zeroes. We could try telling the kernel about the new location of environ using prctl() PR_SET_MM_ENV_START/END but the restrictions around that sound problematic. I've included a hacky workaround: Define __ubsan_default_options, a weak symbol libsanitizer uses to get defaults from the application, and return getenv("UBSAN_OPTIONS"). But only if main already was reached, so that we don't end up relying on a not-yet-working getenv(). This also lead me to finally figure out why /proc/$pid/cmdline of postgres children includes a lot of NULL bytes. I'd noticed this because tools like pidstat -l started to print long long status strings at some point, before it got fixed on the pidstat side. The way we overwrite doesn't trigger this special case in the kernel: https://github.com/torvalds/linux/blob/master/fs/proc/base.c#L296 therefore the original size of the commandline arguments are printed, with a lot of boring zeroes. A test run of this in ci, with the guc.c intentionally reintroduced, shows the failure both via core dump https://api.cirrus-ci.com/v1/task/6543164315009024/logs/cores.log and in postmaster's log: https://api.cirrus-ci.com/v1/artifact/task/6543164315009024/log/src/test/regress/log/postmaster.log although that part is a bit annoying to read, because the error is interspersed with other log messages: guc.c:9801:15: runtime error: null pointer passed as argument 2, which is declared to never be null ==19253==Using libbacktrace symbolizer. 2022-03-23 17:20:35.412 UTC [19258][client backend] [pg_regress/alter_operator][14/429:0] ERROR: must be owner of operator=== ... 2022-03-23 17:20:35.601 UTC [19254][client backend] [pg_regress/alter_generic][10/1569:0] STATEMENT: ALTER STATISTICS alt_stat2OWNER TO regress_alter_generic_user2; #0 0x5626562b2ab4 in GetConfigOptionByNum /tmp/cirrus-ci-build/src/backend/utils/misc/guc.c:9801 #1 0x5626562b3fd5 in show_all_settings /tmp/cirrus-ci-build/src/backend/utils/misc/guc.c:10137 2022-03-23 17:20:35.604 UTC [19254][client backend] [pg_regress/alter_generic][10/1576:0] ERROR: must be owner of statisticsobject alt_stat3 ... 2022-03-23 17:20:35.601 UTC [19254][client backend] [pg_regress/alter_generic][10/1569:0] STATEMENT: ALTER STATISTICS alt_stat2OWNER TO regress_alter_generic_user2; #0 0x5626562b2ab4 in GetConfigOptionByNum /tmp/cirrus-ci-build/src/backend/utils/misc/guc.c:9801 #1 0x5626562b3fd5 in show_all_settings /tmp/cirrus-ci-build/src/backend/utils/misc/guc.c:10137 2022-03-23 17:20:35.604 UTC [19254][client backend] [pg_regress/alter_generic][10/1576:0] ERROR: must be owner of statisticsobject alt_stat3 2022-03-23 17:20:35.604 UTC [19254][client backend] [pg_regress/alter_generic][10/1576:0] STATEMENT: ALTER STATISTICS alt_stat3RENAME TO alt_stat4; #2 0x562655c0ea86 in ExecMakeTableFunctionResult /tmp/cirrus-ci-build/src/backend/executor/execSRF.c:234 #3 0x562655c3f8be in FunctionNext /tmp/cirrus-ci-build/src/backend/executor/nodeFunctionscan.c:95 2022-03-23 17:20:35.605 UTC [19254][client backend] [pg_regress/alter_generic][10/1578:0] ERROR: must be owner of statisticsobject alt_stat3 2022-03-23 17:20:35.605 UTC [19254][client backend] [pg_regress/alter_generic][10/1578:0] STATEMENT: ALTER STATISTICS alt_stat3OWNER TO regress_alter_generic_user2; 2022-03-23 17:20:35.606 UTC [19254][client backend] [pg_regress/alter_generic][10/1579:0] ERROR: must be member of role"regress_alter_generic_user3" 2022-03-23 17:20:35.606 UTC [19254][client backend] [pg_regress/alter_generic][10/1579:0] STATEMENT: ALTER STATISTICS alt_stat2OWNER TO regress_alter_generic_user3; #4 0x562655c10175 in ExecScanFetch /tmp/cirrus-ci-build/src/backend/executor/execScan.c:133 #5 0x562655c10653 in ExecScan /tmp/cirrus-ci-build/src/backend/executor/execScan.c:199 #6 0x562655c3f643 in ExecFunctionScan /tmp/cirrus-ci-build/src/backend/executor/nodeFunctionscan.c:270 2022-03-23 17:20:35.606 UTC [19254][client backend] [pg_regress/alter_generic][10/1580:0] ERROR: must be owner of statisticsobject alt_stat3 2022-03-23 17:20:35.606 UTC [19254][client backend] [pg_regress/alter_generic][10/1580:0] STATEMENT: ALTER STATISTICS alt_stat3SET SCHEMA alt_nsp2; 2022-03-23 17:20:35.606 UTC [19254][client backend] [pg_regress/alter_generic][10/1581:0] ERROR: statistics object "alt_stat2"already exists in schema "alt_nsp2" 2022-03-23 17:20:35.606 UTC [19254][client backend] [pg_regress/alter_generic][10/1581:0] STATEMENT: ALTER STATISTICS alt_stat2SET SCHEMA alt_nsp2; #7 0x562655c09bc9 in ExecProcNodeFirst /tmp/cirrus-ci-build/src/backend/executor/execProcnode.c:463 #8 0x562655bf7580 in ExecProcNode ../../../src/include/executor/executor.h:259 #9 0x562655bf7580 in ExecutePlan /tmp/cirrus-ci-build/src/backend/executor/execMain.c:1633 #10 0x562655bf78b9 in standard_ExecutorRun /tmp/cirrus-ci-build/src/backend/executor/execMain.c:362 Greetings, Andres Freund [1] https://postgr.es/m/CALNJ-vT9r0DSsAOw9OXVJFxLENoVS_68kJ5x0p44atoYH+H4dg@mail.gmail.com [2] https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libsanitizer/ubsan/ubsan_flags.cpp;h=9a66bd37518b3a0606049b761ffdd7ddf3c3c714;hb=refs/heads/master#l68 [2] https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libsanitizer/sanitizer_common/sanitizer_linux.cpp;h=aa59d9718ca89cc554bdf677df3e64ddd233ca59;hb=refs/heads/master#l559
Attachment
Andres Freund <andres@anarazel.de> writes: > I tried to run postgres with ubsan to debug something. For 0001, could we just replace configure's dlopen check with the dlsym check? Or are you afraid of reverse-case failures? 0002: ugh, but my only real complaint is that __ubsan_default_options needs more than zero comment. Also, it's not "our" getenv is it? 0003: OK. Interesting though that we haven't seen these before. 0004: no opinion regards, tom lane
Hi, On 2022-03-23 13:54:50 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > I tried to run postgres with ubsan to debug something. > > For 0001, could we just replace configure's dlopen check with the > dlsym check? Or are you afraid of reverse-case failures? Yea, I was worried about that. But now that I think more about it, it's hard to believe something could provide / intercept dlsym but not dlopen. I guess we can try and see? > 0002: ugh, but my only real complaint is that __ubsan_default_options > needs more than zero comment. Yea, definitely. I am still hoping that somebody could see a better approach than that ugly hack. Haven't yet checked, but probably should also verify asan either doesn't have the same problem or provide the same hack for ASAN_OPTIONS. > Also, it's not "our" getenv is it? Not really. "libc's getenv()"? > 0003: OK. Interesting though that we haven't seen these before. I assume it's a question of library version and configure flags. Looks like the fwrite nonnull case isn't actually due to the nonnull attribute, but just fwrite() getting intercepted by the sanitizer library. Looks like that was added starting in gcc 9 [1] And the guc.c case presumably requires --enable-nls and a version of gettext using the nonnull attribute? Wonder if there's a few functions we should add nonnull to ourselves. Probably would help "everyday compiler warnings", static analyzers, and ubsan. Greetings, Andres Freund [1] 5d3805fca3e9 (Jakub Jelinek 2017-10-19 13:23:59 +0200 1151) #if SANITIZER_INTERCEPT_FWRITE 5d3805fca3e9 (Jakub Jelinek 2017-10-19 13:23:59 +0200 1152) INTERCEPTOR(SIZE_T, fwrite, const void *p, uptr size, uptrnmemb, void *file) { 5d3805fca3e9 (Jakub Jelinek 2017-10-19 13:23:59 +0200 1153) // libc file streams can call user-supplied functions,see fopencookie. 5d3805fca3e9 (Jakub Jelinek 2017-10-19 13:23:59 +0200 1154) void *ctx; 5d3805fca3e9 (Jakub Jelinek 2017-10-19 13:23:59 +0200 1155) COMMON_INTERCEPTOR_ENTER(ctx, fwrite, p, size, nmemb,file); 5d3805fca3e9 (Jakub Jelinek 2017-10-19 13:23:59 +0200 1156) SIZE_T res = REAL(fwrite)(p, size, nmemb, file); 5d3805fca3e9 (Jakub Jelinek 2017-10-19 13:23:59 +0200 1157) if (res > 0) COMMON_INTERCEPTOR_READ_RANGE(ctx, p, res* size); 5d3805fca3e9 (Jakub Jelinek 2017-10-19 13:23:59 +0200 1158) return res; 5d3805fca3e9 (Jakub Jelinek 2017-10-19 13:23:59 +0200 1159) } $ git describe --tags 5d3805fca3e9 basepoints/gcc-8-3961-g5d3805fca3e
Hi, On 2022-03-23 11:21:37 -0700, Andres Freund wrote: > On 2022-03-23 13:54:50 -0400, Tom Lane wrote: > > Andres Freund <andres@anarazel.de> writes: > > > I tried to run postgres with ubsan to debug something. > > > > For 0001, could we just replace configure's dlopen check with the > > dlsym check? Or are you afraid of reverse-case failures? > > Yea, I was worried about that. But now that I think more about it, it's hard > to believe something could provide / intercept dlsym but not dlopen. I guess > we can try and see? > > > 0003: OK. Interesting though that we haven't seen these before. I think we should backpatch both, based on the reasoning in 46ab07ffda9d6c8e63360ded2d4568aa160a7700 ? Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > I think we should backpatch both, based on the reasoning in > 46ab07ffda9d6c8e63360ded2d4568aa160a7700 ? Yeah, I suppose. Is anyone going to step up and run a buildfarm member with ubsan enabled? (I'm already checking -fsanitize=alignment on longfin, but it seems advisable to keep that separate from -fsanitize=undefined.) regards, tom lane
Hi, On 2022-03-23 15:58:09 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > I think we should backpatch both, based on the reasoning in > > 46ab07ffda9d6c8e63360ded2d4568aa160a7700 ? > > Yeah, I suppose. Is anyone going to step up and run a buildfarm > member with ubsan enabled? (I'm already checking -fsanitize=alignment > on longfin, but it seems advisable to keep that separate from > -fsanitize=undefined.) I'm planning to enable it on two of mine. Looks like gcc and clang find slightly different things, so I was intending to enable it on one of each. Greetings, Andres Freund
Hi, On 2022-03-23 13:12:34 -0700, Andres Freund wrote: > I'm planning to enable it on two of mine. Looks like gcc and clang find > slightly different things, so I was intending to enable it on one of each. Originally I'd planned to mix them into existing members, but I think it'd be better to have dedicated ones. Applied for a few new buildfarm names for: {gcc,clang}-{-fsanitize=undefined,-fsanitize=address}. Running with asan found an existing use-after-free bug in pg_waldump (*), a bug in dshash_seq_next() next that probably can't be hit in HEAD and a bug in my shared memory stats patch. I count that as a success. It's particularly impressive that the cost of running with ASAN is *so* much lower than valgrind. On my workstation a check-world with -fsanitize=alignment,undefined,address takes 3min17s, vs 1min10s or so without -fsanitize. Not something to always use, but certainly better than valgrind. Greetings, Andres Freund (*) search_directory() uses fname = xlde->d_name after closedir(). Found in pg_verifybackup.c's tests. Probably worth adding a few simple tests to pg_waldump itself.
Andres Freund <andres@anarazel.de> writes: > Running with asan found an existing use-after-free bug in pg_waldump (*), a bug in > dshash_seq_next() next that probably can't be hit in HEAD and a bug in my > shared memory stats patch. I count that as a success. Nice! regards, tom lane
On Wed, Mar 23, 2022 at 03:58:09PM -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > I think we should backpatch both, based on the reasoning in > > 46ab07ffda9d6c8e63360ded2d4568aa160a7700 ? > > Yeah, I suppose. Is anyone going to step up and run a buildfarm > member with ubsan enabled? thorntail has been running with UBSan since 2019. I've removed flag -fno-sanitize=nonnull-attribute, which your changes rendered superfluous.
On 3/23/22 16:55, Andres Freund wrote: > > It's particularly impressive that the cost of running with ASAN is *so* much > lower than valgrind. On my workstation a check-world with > -fsanitize=alignment,undefined,address takes 3min17s, vs 1min10s or so without > -fsanitize. Not something to always use, but certainly better than valgrind. It also catches things that valgrind does not so that's a bonus. One thing to note, though. I have noticed that when enabling -fsanitize=undefined and/or -fsanitize=address in combination with -fprofile-arcs -ftest-coverage there is a loss in reported coverage, at least on gcc 9.3. This may not be very obvious unless coverage is normally at 100%. Regards, -David
Hi, On 2022-03-23 15:55:28 -0700, Andres Freund wrote: > Originally I'd planned to mix them into existing members, but I think it'd be > better to have dedicated ones. Applied for a few new buildfarm names for: > {gcc,clang}-{-fsanitize=undefined,-fsanitize=address}. They're now enabled... tamandua: gcc, -fsanitize=undefined,alignment kestrel: clang, -fsanitize=undefined,alignment grassquit: gcc, -fsanitize=address olingo: clang, -fsanitize=address The first three have started reporting in, the last is starting its first run now. Greetings, Andres Freund
Hi, On 2022-03-23 13:54:50 -0400, Tom Lane wrote: > 0002: ugh, but my only real complaint is that __ubsan_default_options > needs more than zero comment. Also, it's not "our" getenv is it? > > 0004: no opinion Attached is a rebased version of this patch. Hopefully with a reasonable amount of comments? I kind of wanted to add a comment to reached_main, but it just seems to end up restating the variable name... Greetings, Andres Freund
Attachment
Hi, On 2022-09-29 18:17:55 -0700, Andres Freund wrote: > Attached is a rebased version of this patch. Hopefully with a reasonable > amount of comments? I kind of wanted to add a comment to reached_main, but it > just seems to end up restating the variable name... I've now pushed a version of this with a few cleanups, mostly in .cirrus.yml. It'll be interesting to see how many additional problems it finds via cfbot. Greetings, Andres Freund
On Mon, Nov 21, 2022 at 03:15:03PM -0800, Andres Freund wrote: > Hi, > > On 2022-09-29 18:17:55 -0700, Andres Freund wrote: > > Attached is a rebased version of this patch. Hopefully with a reasonable > > amount of comments? I kind of wanted to add a comment to reached_main, but it > > just seems to end up restating the variable name... > > I've now pushed a version of this with a few cleanups, mostly in Thanks. I'd meant to ask if there's a reason why you didn't use meson -D sanitize ? I recall seeing a bug which affected linking .. maybe that's why ... -- Justin
Hi, On November 21, 2022 3:42:38 PM PST, Justin Pryzby <pryzby@telsasoft.com> wrote: >On Mon, Nov 21, 2022 at 03:15:03PM -0800, Andres Freund wrote: >> Hi, >> >> On 2022-09-29 18:17:55 -0700, Andres Freund wrote: >> > Attached is a rebased version of this patch. Hopefully with a reasonable >> > amount of comments? I kind of wanted to add a comment to reached_main, but it >> > just seems to end up restating the variable name... >> >> I've now pushed a version of this with a few cleanups, mostly in > >Thanks. I'd meant to ask if there's a reason why you didn't use >meson -D sanitize ? > >I recall seeing a bug which affected linking .. maybe that's why ... Doesn't allow enabling multiple sanitizers (the PR for that might recently have been merged, but that doesn't help us yet).We also need to add the no-recover flag anyway. So there doesn't seem to be much point. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Hello Andres, 22.11.2022 02:15, Andres Freund wrote: > Hi, > > On 2022-09-29 18:17:55 -0700, Andres Freund wrote: >> Attached is a rebased version of this patch. Hopefully with a reasonable >> amount of comments? I kind of wanted to add a comment to reached_main, but it >> just seems to end up restating the variable name... > I've now pushed a version of this with a few cleanups, mostly in > .cirrus.yml. It'll be interesting to see how many additional problems > it finds via cfbot. I've just discovered that that function __ubsan_default_options() is incompatible with -fsanitize=hwaddress: $ tmp_install/usr/local/pgsql/bin/postgres Segmentation fault Program received signal SIGSEGV, Segmentation fault. 0x000000555639e3ec in __hwasan_check_x0_0 () (gdb) bt #0 0x000000555639e3ec in __hwasan_check_x0_0 () #1 0x000000555697b5a8 in __ubsan_default_options () at main.c:446 #2 0x0000005556367e48 in InitializeFlags () at /home/builder/.termux-build/libllvm/src/compiler-rt/lib/hwasan/hwasan.cpp:133 #3 __hwasan_init () at /home/builder/.termux-build/libllvm/src/compiler-rt/lib/hwasan/hwasan.cpp:351 #4 0x0000007ff7f4929c in __dl__ZL13call_functionPKcPFviPPcS2_ES0_ () from /system/bin/linker64 #5 0x0000007ff7f4900c in __dl__ZL10call_arrayIPFviPPcS1_EEvPKcPT_mbS5_ () from /system/bin/linker64 #6 0x0000007ff7f45670 in __dl__ZL29__linker_init_post_relocationR19KernelArgumentBlockR6soinfo () from /system/bin/linker64 #7 0x0000007ff7f449c8 in __dl___linker_init () from /system/bin/linker64 #8 0x0000007ff7f4b208 in __dl__start () from /system/bin/linker64 I use clang version 16.0.6, Target: aarch64-unknown-linux-android24. With just 'return ""' in __ubsan_default_options(), I've managed to run `make check` (there is also an issue with check_stack_depth(), but that's another story)... Best regards, Alexander