Thread: strange valgrind reports about wrapper_handler on 64-bit arm
Hi, while running check-world on 64-bit arm (rpi5 with Debian 12.9), I got a couple reports like this: ==64550== Use of uninitialised value of size 8 ==64550== at 0xA62FE0: wrapper_handler (pqsignal.c:107) ==64550== by 0x580BB9E7: ??? (in /usr/libexec/valgrind/memcheck-arm64-linux) ==64550== Uninitialised value was created by a stack allocation ==64550== at 0x4F94660: strcoll_l (strcoll_l.c:258) ==64550== { <insert_a_suppression_name_here> Memcheck:Value8 fun:wrapper_handler obj:/usr/libexec/valgrind/memcheck-arm64-linux } **64550** Valgrind detected 1 error(s) during execution of "ANALYZE mcv_lists;" The exact command varies, I don't think it's necessarily about analyze or extended stats. The line the report refers to is this: (*pqsignal_handlers[postgres_signal_arg]) (postgres_signal_arg); so I guess it can't be about postgres_signal_arg (as that's an int). But that leaves just pqsignal_handlers, and why would that be uninitialized? The closest thing I found in archives is [1] from about a year ago, but we haven't found any clear explanation there either :-( [1] https://www.postgresql.org/message-id/f1a022e5-9bec-42c5-badd-cfc00b60515c@enterprisedb.com regards -- Tomas Vondra
On Fri, Mar 07, 2025 at 12:03:47AM +0100, Tomas Vondra wrote: > while running check-world on 64-bit arm (rpi5 with Debian 12.9), I got a > couple reports like this: > > ==64550== Use of uninitialised value of size 8 > ==64550== at 0xA62FE0: wrapper_handler (pqsignal.c:107) > ==64550== by 0x580BB9E7: ??? (in > /usr/libexec/valgrind/memcheck-arm64-linux) > ==64550== Uninitialised value was created by a stack allocation > ==64550== at 0x4F94660: strcoll_l (strcoll_l.c:258) > ==64550== > { > <insert_a_suppression_name_here> > Memcheck:Value8 > fun:wrapper_handler > obj:/usr/libexec/valgrind/memcheck-arm64-linux > } > **64550** Valgrind detected 1 error(s) during execution of "ANALYZE > mcv_lists;" > > The exact command varies, I don't think it's necessarily about analyze > or extended stats. > > The line the report refers to is this: > > (*pqsignal_handlers[postgres_signal_arg]) (postgres_signal_arg); > > so I guess it can't be about postgres_signal_arg (as that's an int). But > that leaves just pqsignal_handlers, and why would that be uninitialized? > > The closest thing I found in archives is [1] from about a year ago, but > we haven't found any clear explanation there either :-( Hm. The pointer to strcoll_l makes me wonder if there might be an issue in one of the handler functions that wrapper_handler calls, and wrapper_handler is getting the blame. I don't see how pqsignal_handlers could be uninitialized. It's static, and we are careful to set it appropriately before we call sigaction(). -- nathan
Hi, On 2025-03-07 00:03:47 +0100, Tomas Vondra wrote: > while running check-world on 64-bit arm (rpi5 with Debian 12.9), I got a > couple reports like this: > > ==64550== Use of uninitialised value of size 8 > ==64550== at 0xA62FE0: wrapper_handler (pqsignal.c:107) > ==64550== by 0x580BB9E7: ??? (in > /usr/libexec/valgrind/memcheck-arm64-linux) > ==64550== Uninitialised value was created by a stack allocation > ==64550== at 0x4F94660: strcoll_l (strcoll_l.c:258) > ==64550== > { > <insert_a_suppression_name_here> > Memcheck:Value8 > fun:wrapper_handler > obj:/usr/libexec/valgrind/memcheck-arm64-linux > } > **64550** Valgrind detected 1 error(s) during execution of "ANALYZE > mcv_lists;" > The exact command varies, I don't think it's necessarily about analyze > or extended stats. Do you have a few other examples from where it was triggered? Is the source of the uninitialized value always strcoll_l? Can you reliably reproduce it in certain scenarios or is it probabilistic in some form? Do you know what signal was delivered (I think that could be detected using valgrinds --vgdb)? > The line the report refers to is this: > > (*pqsignal_handlers[postgres_signal_arg]) (postgres_signal_arg); > > so I guess it can't be about postgres_signal_arg (as that's an int). But > that leaves just pqsignal_handlers, and why would that be uninitialized? Is it possible that the signal number we're getting called for is above PG_NSIG? That'd explain why the source value is something fairly random? ISTM that we should add an Assert() to wrapper_handler() that ensures that the signal arg is below PG_NSIG. Might also be worth trying to run without valgrind but with address and undefined behaviour sanitizers enabled. I don't currently have access to an armv8 machine that's not busy doing other stuff... Greetings, Andres Freund
On Fri, Mar 07, 2025 at 11:32:28AM -0500, Andres Freund wrote: > Is it possible that the signal number we're getting called for is above > PG_NSIG? That'd explain why the source value is something fairly random? > > ISTM that we should add an Assert() to wrapper_handler() that ensures that the > signal arg is below PG_NSIG. We have such an assertion in pqsignal() before we install wrapper_handler for anything. Is there another way it could be getting called with a different signo? -- nathan
Hi, On 2025-03-07 10:36:35 -0600, Nathan Bossart wrote: > On Fri, Mar 07, 2025 at 11:32:28AM -0500, Andres Freund wrote: > > Is it possible that the signal number we're getting called for is above > > PG_NSIG? That'd explain why the source value is something fairly random? > > > > ISTM that we should add an Assert() to wrapper_handler() that ensures that the > > signal arg is below PG_NSIG. > > We have such an assertion in pqsignal() before we install wrapper_handler > for anything. Is there another way it could be getting called with a > different signo? Who the hell knows :). One potential way would be that we got SIGNAL_ARGS wrong for the platform and are interpreting some random thing as the signal number. Or something went wrong in the windows signal emulation code. Or ... It seems cheap insurance to add it both places. Greetings, Andres Freund
On Fri, Mar 07, 2025 at 11:41:38AM -0500, Andres Freund wrote: > On 2025-03-07 10:36:35 -0600, Nathan Bossart wrote: >> On Fri, Mar 07, 2025 at 11:32:28AM -0500, Andres Freund wrote: >> > Is it possible that the signal number we're getting called for is above >> > PG_NSIG? That'd explain why the source value is something fairly random? >> > >> > ISTM that we should add an Assert() to wrapper_handler() that ensures that the >> > signal arg is below PG_NSIG. >> >> We have such an assertion in pqsignal() before we install wrapper_handler >> for anything. Is there another way it could be getting called with a >> different signo? > > Who the hell knows :). > > One potential way would be that we got SIGNAL_ARGS wrong for the platform and > are interpreting some random thing as the signal number. Or something went > wrong in the windows signal emulation code. Or ... > > It seems cheap insurance to add it both places. Good enough for me. I'll commit/back-patch to v17 the attached soon. -- nathan
Attachment
On Fri, Mar 07, 2025 at 10:52:10AM -0600, Nathan Bossart wrote: > Good enough for me. I'll commit/back-patch to v17 the attached soon. On second thought, since the signal number is a signed integer, I think we also ought to check that it's > 0. I'm running the attached patch through the CI tests to make sure that's correct for the common platforms. If that looks good, I'm planning to commit it. -- nathan
Attachment
On Fri, Mar 07, 2025 at 02:38:47PM -0600, Nathan Bossart wrote: > If that looks good, I'm planning to commit it. And committed. -- nathan
On 3/7/25 17:32, Andres Freund wrote: > Hi, > > On 2025-03-07 00:03:47 +0100, Tomas Vondra wrote: >> while running check-world on 64-bit arm (rpi5 with Debian 12.9), I got a >> couple reports like this: >> >> ==64550== Use of uninitialised value of size 8 >> ==64550== at 0xA62FE0: wrapper_handler (pqsignal.c:107) >> ==64550== by 0x580BB9E7: ??? (in >> /usr/libexec/valgrind/memcheck-arm64-linux) >> ==64550== Uninitialised value was created by a stack allocation >> ==64550== at 0x4F94660: strcoll_l (strcoll_l.c:258) >> ==64550== >> { >> <insert_a_suppression_name_here> >> Memcheck:Value8 >> fun:wrapper_handler >> obj:/usr/libexec/valgrind/memcheck-arm64-linux >> } >> **64550** Valgrind detected 1 error(s) during execution of "ANALYZE >> mcv_lists;" > >> The exact command varies, I don't think it's necessarily about analyze >> or extended stats. > > Do you have a few other examples from where it was triggered? > > Is the source of the uninitialized value always strcoll_l? > I've seen a couple reports, but only a single one had info about source of the allocation (and that was strcoll). > Can you reliably reproduce it in certain scenarios or is it probabilistic in > some form? > I believe it's probabilistic, I certainly don't know how to trigger or reproduce it. > Do you know what signal was delivered (I think that could be detected using > valgrinds --vgdb)? > No idea. > >> The line the report refers to is this: >> >> (*pqsignal_handlers[postgres_signal_arg]) (postgres_signal_arg); >> >> so I guess it can't be about postgres_signal_arg (as that's an int). But >> that leaves just pqsignal_handlers, and why would that be uninitialized? > > Is it possible that the signal number we're getting called for is above > PG_NSIG? That'd explain why the source value is something fairly random? > No idea. > ISTM that we should add an Assert() to wrapper_handler() that ensures that the > signal arg is below PG_NSIG. > No idea. > > Might also be worth trying to run without valgrind but with address and > undefined behaviour sanitizers enabled. I don't currently have access to an > armv8 machine that's not busy doing other stuff... > I've restarted check-world with valgrind on my rpi5 machines, with current master. I can try running other stuff once that finishes in a couple hours. regards -- Tomas Vondra
On 3/8/25 21:38, Tomas Vondra wrote: > > I've restarted check-world with valgrind on my rpi5 machines, with > current master. I can try running other stuff once that finishes in a > couple hours. > Shortly after restarting this I got three more reports - all of them are related to strcoll_l. This is on c472a18296e4, i.e. with the asserts added in this thread etc. But none of those seem to fail. regards -- Tomas Vondra
Attachment
On Sat, Mar 08, 2025 at 11:48:22PM +0100, Tomas Vondra wrote: > Shortly after restarting this I got three more reports - all of them are > related to strcoll_l. This is on c472a18296e4, i.e. with the asserts > added in this thread etc. But none of those seem to fail. > ==189168== at 0xA683CC: wrapper_handler (pqsignal.c:90) > ==189168== at 0xA683F0: wrapper_handler (pqsignal.c:91) > ==189168== at 0xA684D4: wrapper_handler (pqsignal.c:110) This appears to refer to the following lines: Assert(postgres_signal_arg > 0); Assert(postgres_signal_arg < PG_NSIG); (*pqsignal_handlers[postgres_signal_arg]) (postgres_signal_arg); The common ingredient seems to be postgres_signal_arg. I haven't found anything else that seems helpful. -- nathan
On 3/9/25 03:16, Nathan Bossart wrote: > On Sat, Mar 08, 2025 at 11:48:22PM +0100, Tomas Vondra wrote: >> Shortly after restarting this I got three more reports - all of them are >> related to strcoll_l. This is on c472a18296e4, i.e. with the asserts >> added in this thread etc. But none of those seem to fail. > >> ==189168== at 0xA683CC: wrapper_handler (pqsignal.c:90) >> ==189168== at 0xA683F0: wrapper_handler (pqsignal.c:91) >> ==189168== at 0xA684D4: wrapper_handler (pqsignal.c:110) > > This appears to refer to the following lines: > > Assert(postgres_signal_arg > 0); > Assert(postgres_signal_arg < PG_NSIG); > (*pqsignal_handlers[postgres_signal_arg]) (postgres_signal_arg); > > The common ingredient seems to be postgres_signal_arg. I haven't found > anything else that seems helpful. > Yeah, it's a bit weird. I got another report on the 64-bit rpi5 machine (except for the command it's exactly the same), and then also this report on the 32-bit machine: --------------------------------------------------------------- ==3482== Use of uninitialised value of size 4 ==3482== at 0x991498: wrapper_handler (pqsignal.c:113) ==3482== by 0xFFFFFFFF: ??? ==3482== Uninitialised value was created by a heap allocation ==3482== at 0x940EB8: palloc (mcxt.c:1341) ==3482== by 0x980037: initStringInfoInternal (stringinfo.c:45) ==3482== by 0x980103: initStringInfo (stringinfo.c:99) ==3482== by 0x7194B7: ReadArrayStr (arrayfuncs.c:613) ==3482== by 0x718803: array_in (arrayfuncs.c:289) ==3482== by 0x90024F: InputFunctionCallSafe (fmgr.c:1607) ==3482== by 0x2E346B: CopyFromTextLikeOneRow (copyfromparse.c:1029) ==3482== by 0x2E304B: CopyFromTextOneRow (copyfromparse.c:919) ==3482== by 0x2E2EEF: NextCopyFrom (copyfromparse.c:890) ==3482== by 0x2DF5C7: CopyFrom (copyfrom.c:1149) ==3482== by 0x2DAE33: DoCopy (copy.c:306) ==3482== by 0x6CC38F: standard_ProcessUtility (utility.c:738) ==3482== by 0x4B654C3: pgss_ProcessUtility (pg_stat_statements.c:1179) ==3482== by 0x6CBBF3: ProcessUtility (utility.c:519) ==3482== by 0x6CA24B: PortalRunUtility (pquery.c:1184) ==3482== by 0x6CA537: PortalRunMulti (pquery.c:1348) ==3482== by 0x6C9837: PortalRun (pquery.c:819) ==3482== by 0x6C1BEB: exec_simple_query (postgres.c:1272) ==3482== by 0x6C74FF: PostgresMain (postgres.c:4693) ==3482== by 0x6BD297: BackendMain (backend_startup.c:107) ==3482== { <insert_a_suppression_name_here> Memcheck:Value4 fun:wrapper_handler obj:* } **3482** Valgrind detected 1 error(s) during execution of "COPY array_op_test FROM '/home/debian/postgres/src/test/regress/data/array.data';" --------------------------------------------------------------- This all seems very strange ... I'm starting to wonder if maybe this is a valgrind issue. Both machines have valgrind 3.19, I'll try with a custom build of 3.24 (the latest release). regards -- Tomas Vondra