Thread: valgrind error
I was just trying to revive lousyjack, my valgrind buildfarm animal which has been offline for 12 days, after having upgraded the machine (fedora 31, gcc 9.3.1, valgrind 3.15) and noticed lots of errors like this: 2020-04-17 19:26:03.483 EDT [63741:3] pg_regress LOG: statement: CREATE DATABASE "regression" TEMPLATE=template0 ==63717== VALGRINDERROR-BEGIN ==63717== Use of uninitialised value of size 8 ==63717== at 0xAC5BB5: pg_comp_crc32c_sb8 (pg_crc32c_sb8.c:82) ==63717== by 0x55A98B: XLogRecordAssemble (xloginsert.c:785) ==63717== by 0x55A268: XLogInsert (xloginsert.c:461) ==63717== by 0x8BC9E0: LogCurrentRunningXacts (standby.c:1005) ==63717== by 0x8BC8F9: LogStandbySnapshot (standby.c:961) ==63717== by 0x550CB3: CreateCheckPoint (xlog.c:8937) ==63717== by 0x82A3B2: CheckpointerMain (checkpointer.c:441) ==63717== by 0x56347D: AuxiliaryProcessMain (bootstrap.c:453) ==63717== by 0x83CA18: StartChildProcess (postmaster.c:5474) ==63717== by 0x83A120: reaper (postmaster.c:3045) ==63717== by 0x4874B1F: ??? (in /usr/lib64/libpthread-2.30.so) ==63717== by 0x5056F29: select (in /usr/lib64/libc-2.30.so) ==63717== by 0x8380A0: ServerLoop (postmaster.c:1691) ==63717== by 0x837A1F: PostmasterMain (postmaster.c:1400) ==63717== by 0x74A71D: main (main.c:210) ==63717== Uninitialised value was created by a stack allocation ==63717== at 0x8BC942: LogCurrentRunningXacts (standby.c:984) ==63717== ==63717== VALGRINDERROR-END { <insert_a_suppression_name_here> Memcheck:Value8 fun:pg_comp_crc32c_sb8 fun:XLogRecordAssemble fun:XLogInsert fun:LogCurrentRunningXacts fun:LogStandbySnapshot fun:CreateCheckPoint fun:CheckpointerMain fun:AuxiliaryProcessMain fun:StartChildProcess fun:reaper obj:/usr/lib64/libpthread-2.30.so fun:select fun:ServerLoop fun:PostmasterMain fun:main } I can't see what the problem is immediately. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 4/18/20 9:15 AM, Andrew Dunstan wrote: > I was just trying to revive lousyjack, my valgrind buildfarm animal > which has been offline for 12 days, after having upgraded the machine > (fedora 31, gcc 9.3.1, valgrind 3.15) and noticed lots of errors like this: > > > 2020-04-17 19:26:03.483 EDT [63741:3] pg_regress LOG: statement: CREATE > DATABASE "regression" TEMPLATE=template0 > ==63717== VALGRINDERROR-BEGIN > ==63717== Use of uninitialised value of size 8 > ==63717== at 0xAC5BB5: pg_comp_crc32c_sb8 (pg_crc32c_sb8.c:82) > ==63717== by 0x55A98B: XLogRecordAssemble (xloginsert.c:785) > ==63717== by 0x55A268: XLogInsert (xloginsert.c:461) > ==63717== by 0x8BC9E0: LogCurrentRunningXacts (standby.c:1005) > ==63717== by 0x8BC8F9: LogStandbySnapshot (standby.c:961) > ==63717== by 0x550CB3: CreateCheckPoint (xlog.c:8937) > ==63717== by 0x82A3B2: CheckpointerMain (checkpointer.c:441) > ==63717== by 0x56347D: AuxiliaryProcessMain (bootstrap.c:453) > ==63717== by 0x83CA18: StartChildProcess (postmaster.c:5474) > ==63717== by 0x83A120: reaper (postmaster.c:3045) > ==63717== by 0x4874B1F: ??? (in /usr/lib64/libpthread-2.30.so) > ==63717== by 0x5056F29: select (in /usr/lib64/libc-2.30.so) > ==63717== by 0x8380A0: ServerLoop (postmaster.c:1691) > ==63717== by 0x837A1F: PostmasterMain (postmaster.c:1400) > ==63717== by 0x74A71D: main (main.c:210) > ==63717== Uninitialised value was created by a stack allocation > ==63717== at 0x8BC942: LogCurrentRunningXacts (standby.c:984) > ==63717== > ==63717== VALGRINDERROR-END > { > <insert_a_suppression_name_here> > Memcheck:Value8 > fun:pg_comp_crc32c_sb8 > fun:XLogRecordAssemble > fun:XLogInsert > fun:LogCurrentRunningXacts > fun:LogStandbySnapshot > fun:CreateCheckPoint > fun:CheckpointerMain > fun:AuxiliaryProcessMain > fun:StartChildProcess > fun:reaper > obj:/usr/lib64/libpthread-2.30.so > fun:select > fun:ServerLoop > fun:PostmasterMain > fun:main > } > > After many hours of testing I have a culprit for this. The error appears with valgrind 3.15.0 with everything else held constant. 3.14.0 does not produce the problem. So lousyjack will be back on the air before long. Here are the build flags it's using: CFLAGS=-Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncatio n -g -fno-omit-frame-pointer -O0 -fPIC CPPFLAGS=-DUSE_VALGRIND -DRELCACHE_FORCE_RELEASE -D_GNU_SOURCE -I/usr/include/libxml2 and valgrind is invoked like this: valgrind --quiet --trace-children=yes --track-origins=yes --read-var-info=yes --num-callers=20 --leak-check=no --gen-suppressions=all --error-limit=no --suppressions=../pgsql/src/tools/valgrind.supp --error-markers=VALGRINDERROR-BEGIN,VALGRINDERROR-END bin/postgres -D data-C Does anyone see anything here that needs tweaking? Note that this is quite an old machine: andrew@freddo:bf (master)*$ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 2 On-line CPU(s) list: 0,1 Thread(s) per core: 1 Core(s) per socket: 2 Socket(s): 1 NUMA node(s): 1 Vendor ID: AuthenticAMD CPU family: 16 Model: 6 Model name: AMD Athlon(tm) II X2 215 Processor Stepping: 2 CPU MHz: 2700.000 CPU max MHz: 2700.0000 CPU min MHz: 800.0000 BogoMIPS: 5425.13 Virtualization: AMD-V L1d cache: 64K L1i cache: 64K L2 cache: 512K NUMA node0 CPU(s): 0,1 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt hw_pstate vmmcall npt lbrv svm_lock nrip_save I did not manage to reproduce this anywhere else, tried on various physical, Virtualbox and Docker instances. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, May 10, 2020 at 09:29:05AM -0400, Andrew Dunstan wrote: > On 4/18/20 9:15 AM, Andrew Dunstan wrote: > > I was just trying to revive lousyjack, my valgrind buildfarm animal > > which has been offline for 12 days, after having upgraded the machine > > (fedora 31, gcc 9.3.1, valgrind 3.15) and noticed lots of errors like this: > > { > > <insert_a_suppression_name_here> > > Memcheck:Value8 > > fun:pg_comp_crc32c_sb8 > > fun:XLogRecordAssemble > > fun:XLogInsert > > fun:LogCurrentRunningXacts > > fun:LogStandbySnapshot > > fun:CreateCheckPoint > > fun:CheckpointerMain > > fun:AuxiliaryProcessMain > > fun:StartChildProcess > > fun:reaper > > obj:/usr/lib64/libpthread-2.30.so > > fun:select > > fun:ServerLoop > > fun:PostmasterMain > > fun:main > > } > After many hours of testing I have a culprit for this. The error appears > with valgrind 3.15.0 with everything else held constant. 3.14.0 does > not produce the problem. I suspect 3.15.0 is just better at tracking the uninitialized data. A more-remote possibility is valgrind-3.14.0 emulating sse42. That would make pg_crc32c_sse42_available() return true, avoiding the pg_comp_crc32c_sb8(). > andrew@freddo:bf (master)*$ lscpu ... > Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr > pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext > fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl > nonstop_tsc cpuid extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy > svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs > skinit wdt hw_pstate vmmcall npt lbrv svm_lock nrip_save > > > I did not manage to reproduce this anywhere else, tried on various > physical, Virtualbox and Docker instances. I can reproduce this on a 2017-vintage CPU with ./configure ... USE_SLICING_BY_8_CRC32C=1 and then running "make installcheck-parallel" under valgrind-3.15.0 (as packaged by RHEL 7.8). valgrind.supp has a suppression for CRC calculations, but it didn't get the memo when commit 4f700bc renamed the function. The attached patch fixes the suppression.
Attachment
Noah Misch <noah@leadboat.com> writes: > I can reproduce this on a 2017-vintage CPU with ./configure > ... USE_SLICING_BY_8_CRC32C=1 and then running "make installcheck-parallel" > under valgrind-3.15.0 (as packaged by RHEL 7.8). valgrind.supp has a > suppression for CRC calculations, but it didn't get the memo when commit > 4f700bc renamed the function. The attached patch fixes the suppression. I can also reproduce this, on RHEL 8.2 which likewise has valgrind-3.15.0, using the same configuration to force use of that CRC function. I concur with your diagnosis that this is just a missed update of the pre-existing suppression rule. However, rather than - fun:pg_comp_crc32c + fun:pg_comp_crc32c* as you have it, I'd prefer to use - fun:pg_comp_crc32c + fun:pg_comp_crc32c_sb8 which precisely matches what 4f700bc did. The other way seems like it's giving a free pass to problems that could lurk in unrelated CRC implementations. regards, tom lane
On Fri, Jun 05, 2020 at 12:17:54PM -0400, Tom Lane wrote: > Noah Misch <noah@leadboat.com> writes: > > I can reproduce this on a 2017-vintage CPU with ./configure > > ... USE_SLICING_BY_8_CRC32C=1 and then running "make installcheck-parallel" > > under valgrind-3.15.0 (as packaged by RHEL 7.8). valgrind.supp has a > > suppression for CRC calculations, but it didn't get the memo when commit > > 4f700bc renamed the function. The attached patch fixes the suppression. > > I can also reproduce this, on RHEL 8.2 which likewise has valgrind-3.15.0, > using the same configuration to force use of that CRC function. I concur > with your diagnosis that this is just a missed update of the pre-existing > suppression rule. However, rather than > > - fun:pg_comp_crc32c > + fun:pg_comp_crc32c* > > as you have it, I'd prefer to use > > - fun:pg_comp_crc32c > + fun:pg_comp_crc32c_sb8 > > which precisely matches what 4f700bc did. The other way seems like > it's giving a free pass to problems that could lurk in unrelated CRC > implementations. The undefined data is in the CRC input, namely the padding bytes in xl_* structs. Apparently, valgrind-3.15.0 doesn't complain about undefined input to _mm_crc32_u* functions. We should not be surprised if Valgrind gains the features necessary to complain about the other implementations. Most COMP_CRC32C callers don't have a suppression, so Valgrind still studies each CRC implementation via those other callers.
Noah Misch <noah@leadboat.com> writes: > On Fri, Jun 05, 2020 at 12:17:54PM -0400, Tom Lane wrote: >> as you have it, I'd prefer to use >> - fun:pg_comp_crc32c >> + fun:pg_comp_crc32c_sb8 >> which precisely matches what 4f700bc did. The other way seems like >> it's giving a free pass to problems that could lurk in unrelated CRC >> implementations. > The undefined data is in the CRC input, namely the padding bytes in xl_* > structs. Oh, I see. Objection withdrawn. > Apparently, valgrind-3.15.0 doesn't complain about undefined input > to _mm_crc32_u* functions. We should not be surprised if Valgrind gains the > features necessary to complain about the other implementations. Perhaps it already has ... I wonder if anyone's tried this on ARMv8 lately. regards, tom lane
I wrote: > Noah Misch <noah@leadboat.com> writes: >> Apparently, valgrind-3.15.0 doesn't complain about undefined input >> to _mm_crc32_u* functions. We should not be surprised if Valgrind gains the >> features necessary to complain about the other implementations. > Perhaps it already has ... I wonder if anyone's tried this on ARMv8 > lately. I installed Fedora 32/aarch64 on a Raspberry Pi 3B+, and can report that valgrind 3.16.0 is just as blind to this problem in pg_comp_crc32c_armv8 as it is in pg_comp_crc32c_sse42. Seems odd, but there you have it. (There are some other issues, but they seem fit for separate threads.) regards, tom lane